i wanted to answer a WhatsApp call from code
not just see that a call is ringing or reject it, but actually accept it, let the other phone enter the connected state, get relay/media traffic moving, and eventually play audio into the call
this sounds like something a WhatsApp Web library should not be able to do, and that is mostly true; libraries like Baileys and whatsmeow are great for messages, receipts, groups, media, contacts, app state, but calls are another layer completely
the funny thing is WhatsApp Web already ships the call engine to your browser; it is not a public API, and voice calling was not globally enabled on WhatsApp Web while i was doing this, but the code is there, and the call stack is a compiled WASM module with JavaScript around it acting as the browser adapter
this post is about how i made that thing run outside the browser, glued it to Baileys, fixed the signaling errors, got relay packets moving, and used prerecorded audio because testing with a file was easier than testing with a microphone i did not have connected
educational research only
this is not for abusing WhatsApp, bypassing user consent, or building spammy call automation
wa_call_signa handle_incoming_xmpp_offer failed to parse offer
wa_call_signa send Preaccept
core/call_ change_call_state: [ReceivedCall -> AcceptSent]
wa_tp.cc Bind request sent for UDP relay
eventually the call connected and stayed alive until i cut it off
the starting point
i already had a project that fetched WhatsApp Web source files; the project downloaded the current web.whatsapp.com assets and saved interesting files into a files folder so they could be searched, diffed, and read without opening DevTools every time
the first working setup had three folders:
- files - source files fetched from WhatsApp Web
- wasm-loader - the first attempt at loading the WhatsApp Web VoIP WASM
- baileys - glue code between Baileys and the WASM loader
the initial goal was intentionally small: receive one incoming call, not make outgoing calls, not build a phone system, not support every edge case, just make an incoming WhatsApp call reach the accepted state
that small goal still required almost everything: raw call signaling, WAP binary node encoding, the WebAssembly loader, callbacks for native-to-JS signaling, relay packet I/O, audio capture and playback drivers, and enough debugging to understand which layer was broken each time
the call UI was hidden behind ABProps
before reverse engineering the call stack, i had to make WhatsApp Web load it in the browser
for my account, WhatsApp Web did not show calling; the code was present in the bundle, but the feature was hidden behind internal experiment flags, and feature code asks WAWebABProps whether a specific experiment is enabled
so i used a Tampermonkey script i had written earlier:
the script runs at document-start, waits for the WhatsApp module system, finds WAWebABProps, and wraps getABPropConfigValue; the idea is to return enabled values for calling-related props while leaving everything else alone
roughly:
function wrap(original) {
return function (...args) {
const key = args[0]
switch (key) {
case "enable_web_calling":
case "enable_web_group_calling":
case "web_voip_call_tab_new_call":
return true
case "calling_lid_version":
return 1
default:
return original(...args)
}
}
}
this does not implement calls; it just convinces WhatsApp Web that the current account belongs to a calling-enabled experiment bucket
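to see the override in isolation, the same wrap pattern can be exercised against a stand-in abProps object; fakeABProps below is purely illustrative, the real WAWebABProps only exists inside the WhatsApp Web module system:

```javascript
// exercising the wrap pattern against a stand-in abProps object;
// fakeABProps is illustrative only, the real WAWebABProps comes
// from the WhatsApp Web module system
function wrap(original) {
  return function (...args) {
    const key = args[0]
    switch (key) {
      case "enable_web_calling":
      case "enable_web_group_calling":
        return true
      default:
        return original(...args)
    }
  }
}

const fakeABProps = {
  getABPropConfigValue(key) {
    // pretend every prop is disabled by default
    return false
  },
}

fakeABProps.getABPropConfigValue = wrap(
  fakeABProps.getABPropConfigValue.bind(fakeABProps)
)
```

calling-related keys now return enabled values while every other prop falls through to the original lookup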
after that, open DevTools, go to Network -> Wasm, reload web.whatsapp.com, and the browser starts fetching a WASM file that looks like this:
KwJJIha0f3H.wasm
one URL i saw was:
https://static.whatsapp.net/rsrc.php/ys/r/KwJJIha0f3H.wasm
that file is the important part; it is compiled C++ code, and the surrounding JavaScript is mostly a loader plus adapters for browser APIs, so without enabling those AB props, the browser may never fetch the WASM
calls are not messages
the next problem was understanding what actually arrives over the WhatsApp socket
when a call comes in, Baileys emits a friendly call event:
{
"from": "186775343484983:28@lid",
"id": "00EA4DF98484BE6EBC45862BB9E3E1F6",
"status": "offer",
"isVideo": false
}
this is good for logging, but it is not enough for the VoIP engine; the WASM does not want "status: offer", it wants the actual WhatsApp binary node payload, the same thing WhatsApp Web would have passed into the native call stack
the real incoming stanza looks conceptually like this:
{
tag: "call",
attrs: {
from: "186775343484983:28@lid",
id: "35913.11715-64",
platform: "web",
version: "0"
},
content: [
{
tag: "offer",
attrs: {
"call-id": "00D898D54A3A6E16A0B7A51FFE7F9EBA",
"call-creator": "186775343484983:28@lid"
},
content: [...]
}
]
}
and after the offer, more call nodes arrive:
relaylatency
transport
terminate
the first mistake was trying to use only Baileys' parsed call event; the WASM saw an offer with missing content and failed with:
parse_xmpp_offer: empty call-id
handle_incoming_xmpp_offer failed to parse offer
the fix was to listen to the raw call stanza:
sock.ws.on("CB:call", async (node) => {
await voipBridge.handleRawNode(node)
})
that one line changed the whole direction of the project; from that point, the bridge could feed the same kind of node WhatsApp Web itself receives into the WASM
the WAP binary problem
the raw node object is still not what the WASM wants
WhatsApp's native call API expects a base64 string containing a WAP binary node, so the bridge had to take the raw child node, encode it back into WhatsApp's binary XML format, then base64 it
Baileys already had the useful functions:
import { encodeBinaryNode, decodeBinaryNode } from "baileys"
function encodeStanzaBytes(node) {
return Buffer.from(encodeBinaryNode(node))
}
function encodeB64(bytes) {
return Buffer.from(bytes).toString("base64")
}
async function decodeStanza(bytes) {
return decodeBinaryNode(Buffer.from(bytes))
}
the offer path became:
const signalNode = getFirstChildNode(callNode)
const wapBytes = encodeStanzaBytes(signalNode)
const b64Stanza = encodeB64(wapBytes)
voip.handleIncomingSignalingOffer(
b64Stanza,
platform,
version,
String(e || 0),
String(t || 0),
offline ? 1 : 0,
isNotContact ? 1 : 0,
peerJid,
null
)
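getFirstChildNode and findChild are not Baileys exports; they are small helpers this project defines over the { tag, attrs, content } node shape, roughly:

```javascript
// small helpers over Baileys' { tag, attrs, content } node shape;
// these names are this project's own, not Baileys exports
function getFirstChildNode(node) {
  if (!Array.isArray(node.content)) return undefined
  // content can also be a Buffer or string, so only return tagged children
  return node.content.find(
    (child) => child && typeof child === "object" && child.tag
  )
}

function findChild(node, tag) {
  if (!Array.isArray(node.content)) return undefined
  return node.content.find((child) => child && child.tag === tag)
}
```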
the important detail is that we pass the signaling child (offer, transport, relaylatency) to the WASM, not a simplified JSON object
once that was fixed, the logs changed from "empty call-id" to the stack actually parsing the offer
Offer from:4983:28@lid call_id:00E525341CB0E2B50259CE72187924AA
Handle MESSAGE Offer
init_local_state begins
create_p2p_transport start
send Preaccept
that was the first real win
the encrypted call key issue
after raw nodes started flowing, there was another parsing problem:
encryption 00D bad enc size 178
handle_incoming_xmpp_offer failed to parse offer, status=70004
this happened because some call offers contain an enc node. that encrypted payload is not directly the key material the WASM expects. Baileys can decrypt normal message payloads, but i had to use its Signal repository to decrypt the call payload and extract the call key
the simplified flow was:
async function decryptIncomingCallEnc(signalNode, info) {
const encNode = findChild(signalNode, "enc")
if (!encNode) return signalNode
const decrypted = await sock.signalRepository.decryptMessage({
jid: info.peerJid,
type: encNode.attrs.type,
ciphertext: encNode.content,
})
const message = proto.Message.decode(unpadRandomMax16(decrypted))
const callKey = message?.call?.callKey
if (callKey) {
encNode.content = Buffer.from(callKey)
}
return signalNode
}
this is the kind of annoying bug that looks like a WASM parser issue at first, but it is actually a missing protocol step before the parser
after replacing the encrypted payload with the extracted call key, the offer stopped failing with bad enc size
loading the wasm outside the browser
finding WAWebVoipWebWasmLoader.js was the map. it showed how WhatsApp Web creates the Emscripten module, where the .wasm file is located, and which callbacks the native side expects
the first Node loader looked roughly like this:
import { createRequire } from "module"
const require = createRequire(import.meta.url)
const createVoipModule = require("./WAWebVoipWebWasmLoader.js")
export async function loadVoipWasm({ wasmPath, persistentDir, callbacks }) {
const Module = {
locateFile(file) {
if (file.endsWith(".wasm")) return wasmPath
return file
},
print(text) {
console.log("[voip:wasm]", text)
},
printErr(text) {
console.error("[voip:wasm]", text)
},
noInitialRun: true,
thisProgram: "wa-voip",
preRun: [
function () {
Module.FS.mkdirTree(persistentDir)
},
],
...callbacks,
}
return createVoipModule(Module)
}
the real loader had more details, especially around filesystem paths, pthread pool size, and native bridge callbacks. but the principle was simple: make the WhatsApp loader think it is still in the environment it expects, then replace browser pieces with Node implementations one by one
after loading the Emscripten module, i wrapped the exports in a small class because calling embind exports directly everywhere gets messy:
class VoipStack {
constructor(module) {
this.module = module
}
initVoipStack(selfJid, phoneJid, lidJid) {
return this.module.initVoipStack(selfJid, phoneJid, lidJid, false)
}
handleIncomingSignalingOffer(stanza, platform, version, e, t, offline, notContact, peerJid) {
return this.module.handleIncomingSignalingOffer(
stanza,
platform,
version,
String(e || 0),
String(t || 0),
offline ? 1 : 0,
notContact ? 1 : 0,
peerJid,
null
)
}
}
once the wrapper had names like initVoipStack, handleIncomingSignalingOffer, handleIncomingSignalingMessage, and acceptCall, the rest of the code became much easier to debug
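with those names in place, non-offer children can be routed through the generic path; a hedged sketch, where encodeStanza stands in for the encode-plus-base64 helpers from earlier and the handleIncomingSignalingMessage argument list is an assumption, not a documented signature:

```javascript
// hedged sketch: route decoded call children into the wrapper;
// encodeStanza stands in for encodeBinaryNode + base64, and the
// handleIncomingSignalingMessage signature is an assumption
function routeCallChild(voip, encodeStanza, childNode, ctx) {
  const b64 = encodeStanza(childNode)
  if (childNode.tag === "offer") {
    return voip.handleIncomingSignalingOffer(
      b64, ctx.platform, ctx.version, "0", "0", 0, 0, ctx.peerJid
    )
  }
  // relaylatency / transport / terminate take the generic message path
  return voip.handleIncomingSignalingMessage(b64, ctx.peerJid)
}
```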
initializing the call stack
the first clean milestone was loading the module and initializing the VoIP stack
when this worked, the logs looked like this:
[VoipStack] WA Web embind module loaded
[VoipBridge] WASM loaded
VoipInit.cpp:518 initVoipStack called
WasmVoipAVDriverManager.cpp:189 [AV][register_audio_capture_driver_factory] SUCCESS
WasmVoipAVDriverManager.cpp:212 [AV][register_audio_playback_driver_factory] SUCCESS
WasmVoipAVDriverManager.cpp:141 [AV][initialize] SUCCESS
pjlib 2.13 for POSIX initialized
wa_media_api. init_media_endpt_and_codecs Enter
wa_opus.c pjmedia_codec_opus_init success
this told us a few things: WhatsApp's call stack is built on pjlib / pjmedia, Opus is initialized inside the WASM, and the JavaScript side must register audio capture/playback drivers
it also printed:
External audio not supported for this platform - Registering Virtual Audio
that was good news: if the WASM can use a virtual audio device, i can feed it samples myself without worrying about physical audio hardware, drivers, permissions, or sample rates
one issue here was Emscripten threads: when the call state became more active, it printed:
Tried to spawn a new thread, but the thread pool is exhausted.
If you want to increase the pool size, use setting -sPTHREAD_POOL_SIZE=...
the loader could provide a larger pthread pool at startup; increasing the pool size made these warnings stop being a constant worry during call setup
sending preaccept and accept
once the offer parsed, the WASM did what WhatsApp Web normally does: it sent preaccept
change_call_state call id ...: [None -> ReceivedCall]
send Preaccept to 4983:28@lid
EVENT: Call offer received
the WASM does not send this over the socket by itself; it calls back into JavaScript with a binary stanza that JS is supposed to send, so the bridge needed the reverse direction too:
async onSignalingXmpp({ peerJid, callId, xmlPayload }) {
const node = await decodeStanza(xmlPayload)
const callNode = wrapOutgoingSignalingNode(node, peerJid)
await sock.sendNode(callNode)
}
one bug here was sending the wrong shape back to Baileys: sometimes the WASM gives a child node, not a full call wrapper. if you send the child directly, WhatsApp does not treat it as a call stanza. the fix was to wrap it:
function wrapOutgoingSignalingNode(node, peerJid) {
if (node.tag === "call") {
node.attrs ??= {}
node.attrs.to = peerJid
node.attrs.id ??= makeStanzaId()
return node
}
return {
tag: "call",
attrs: {
to: peerJid,
id: makeStanzaId(),
},
content: [node],
}
}
another bug was decoding the outbound payload with the wrong type: Baileys' binary decoder wants a Buffer, and passing the wrong object caused errors like:
TypeError: buffer.readUInt8 is not a function
that was fixed by normalizing every WASM payload to Buffer.from(...) before decoding
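the normalization was a tiny helper; a sketch of the shape-juggling it has to cover:

```javascript
// normalize whatever the WASM hands back (Uint8Array, ArrayBuffer,
// plain byte array, or already a Buffer) before the binary decoder
// sees it
function toBuffer(payload) {
  if (Buffer.isBuffer(payload)) return payload
  if (payload instanceof ArrayBuffer) return Buffer.from(payload)
  if (ArrayBuffer.isView(payload)) {
    // respect the view's offset/length instead of copying the whole buffer
    return Buffer.from(payload.buffer, payload.byteOffset, payload.byteLength)
  }
  return Buffer.from(payload)
}
```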
after preaccept, accepting the call was much smaller:
await voip.acceptCall(true, false)
for testing, auto accept was just:
if (options.autoAccept) {
setTimeout(() => {
acceptCall(options.autoAcceptMic, false)
}, 250)
}
the log finally became:
wa_call_accept_asymmetric_2 begin
ACTION accept call offer
configure_audio_pipeline_sampling_rates
change_call_state: [ReceivedCall -> AcceptSent]
send Accept to peer
wa_call_accept_asymmetric() status 0
at this point the other phone could see the call as accepted, but accepted signaling is not the same thing as media
signaling is not the call
this was the point where the problem split into two parts: signaling was mostly working, but the actual call still needed relay allocation, bind requests, packet routing, audio capture, and playback
the logs made this very clear:
Relay List Update
setting relay info, num_relays: 2
Bind request sent for UDP relay
Ping request sent to UDP relay
Data Tx to peer: Relay:(bytes:0,pkts:0), P2P:(bytes:0,pkts:0)
the call stack was trying to talk to Meta relay servers; if relay packets were not actually being sent and returned to the WASM, the call would sit in a half-connected state and eventually terminate
the SCTP dead end
one thing i tried too early was a WebRTC/SCTP data channel bridge
the assumption was reasonable: WhatsApp Web is a browser app, browsers use WebRTC, so maybe relay traffic needs a data channel path; i tried creating peer connections to relay addresses and negotiating an SCTP data channel
the logs were not encouraging:
[SCTP] 57.144.41.57:3480 ICE state=checking
[SCTP] SDP negotiation done
[SCTP] ICE state=failed
[SCTP] PC state=failed
every relay failed. this told us the immediate missing piece was not "connect a data channel to relay port 3480"; the WASM was already sending UDP bind requests to port 3478, so i focused on the UDP relay callbacks instead
this was a useful wrong turn; it reduced the search space
relay packets and IPv6 weirdness
the WASM calls JavaScript with something like:
sendDataToRelay({ data, len, ip, port })
so the bridge created UDP sockets:
const udp4 = dgram.createSocket("udp4")
const udp6 = dgram.createSocket("udp6")
udp4.on("message", (packet, rinfo) => {
voip.handleIncomingRelayPacket(packet, rinfo.address, rinfo.port)
})
outgoing packets were sent to the relay:
socket.send(packet, port, ip)
once this was wired, debug logs started showing real traffic:
Relay tx 344 bytes to 163.70.145.133:3478
Relay rx 20 bytes from 2a03:2880:f28a:1db:face:b00c:0:6749:3478
that rx line mattered because before this the WASM was just sending bind requests forever and never hearing back
then IPv6 caused another issue: the stack would start on IPv4, ping the alternate address family, then decide IPv6 looked better:
detected relay connection on alt af
Using IPv6 and resetting relay
Bind request sent for UDP relay [ipv6]:3478
the environment did not always have usable IPv6, so the workaround was to keep relay aliases: if the WASM selected an IPv6 relay but i knew the matching IPv4 relay, send the packet through IPv4 and report the address back in the form the WASM expected
it is not elegant, but it moved the state machine forward; the native transport code cares about relay identity and observed packet source, and the bridge can map those if it is careful
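the alias trick itself is just a lookup before send; a hedged sketch, where relayAliases is this project's own bookkeeping filled from the relay list, not something the WASM provides:

```javascript
// hedged sketch of the relay alias trick: if the stack picks an
// IPv6 relay but the host has no usable IPv6 route, send via the
// paired IPv4 address while still reporting the IPv6 identity,
// so the WASM keeps seeing the relay it selected
const relayAliases = new Map() // ipv6 -> ipv4, filled from the relay list

function resolveRelayTarget(ip) {
  const v4 = relayAliases.get(ip)
  return v4
    ? { sendIp: v4, reportIp: ip } // transmit over v4, report v6
    : { sendIp: ip, reportIp: ip }
}
```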
audio devices
after signaling and relay I/O, audio was the next wall. the WASM asks JS to register audio capture and playback drivers; in the browser those map to Web Audio and MediaStream, so in Node i had to provide my own driver layer
the first driver was silence:
capture.readFrame = () => silenceFrame
that sounds useless, but it is a good test: it lets you prove that the call can connect without worrying about microphone permissions, speaker devices, sample rate conversion, or feedback
the negotiated audio format showed up in logs:
audio_device_sampling_rate 16000
audio_device_samples_per_frame 320
conf_bridge_sampling_rate 16000
frame_length_ms 60
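those numbers pin down the capture driver's contract: 320-sample device frames of mono s16le, which works out like this:

```javascript
// frame math implied by the negotiated format (mono s16le)
const sampleRate = 16000      // audio_device_sampling_rate
const samplesPerFrame = 320   // audio_device_samples_per_frame
const bytesPerSample = 2      // s16le = 16-bit signed samples

const frameBytes = samplesPerFrame * bytesPerSample       // bytes per device frame
const frameMs = (samplesPerFrame * 1000) / sampleRate     // duration of one frame
const framesPerPacket = 60 / frameMs                      // frame_length_ms 60
```

so each device frame is 640 bytes covering 20 ms, and the 60 ms packet length means three device frames per media packet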
for microphone input, the quick version was using ffmpeg to output raw PCM:
const mic = spawn("ffmpeg", [
"-f",
"alsa",
"-i",
"default",
"-ac",
"1",
"-ar",
"16000",
"-f",
"s16le",
"pipe:1",
])
mic.stdout.on("data", (chunk) => {
captureBuffer.push(chunk)
})
playback is the reverse: write raw PCM frames into something that can play them
const speaker = spawn("ffplay", [
"-nodisp",
"-autoexit",
"-f",
"s16le",
"-ac",
"1",
"-ar",
"16000",
"pipe:0",
])
speaker.stdin.write(frame)
this is not the only way to do audio, but ffmpeg/ffplay made it easy to prove the media path before making the audio layer nicer
why prerecorded audio was easier
the call did support microphone audio, but prerecorded audio was easier to test because i did not have a good microphone connected to that machine lol
with a microphone, silence can mean many things: the mic is muted, the format is wrong, the capture process failed, the wrong device was selected, or the call media path is broken. with prerecorded audio, the test is deterministic. call the account and listen for the same file every time
the capture driver just reads frames from a file and pretends they came from a microphone:
class PrerecordedAudioCapture {
readFrame(samples) {
return nextPcmChunkFromFile(samples)
}
}
but the file cannot just be an mp3 or any other compressed format; the capture driver wants raw PCM in the format the virtual microphone is pretending to provide
so the test file was converted with ffmpeg:
ffmpeg -i input.mp3 \
-ac 1 \
-ar 16000 \
-f s16le \
greeting.pcm
that gives a mono, 16 kHz, signed 16-bit little-endian PCM file. then the driver can read exact frame sizes:
class PrerecordedAudioCapture {
constructor(filePath, { loop = false } = {}) {
this.audio = fs.readFileSync(filePath)
this.offset = 0
this.loop = loop
}
readFrame(byteLength) {
if (this.offset >= this.audio.length) {
if (!this.loop) return Buffer.alloc(byteLength)
this.offset = 0
}
const end = Math.min(this.offset + byteLength, this.audio.length)
const frame = Buffer.alloc(byteLength)
this.audio.copy(frame, 0, this.offset, end)
this.offset = end
return frame
}
}
this was the most practical way to test audio if the remote phone hears the file, the media path is alive
the final connected call
there was no single beautiful success log; it was more like the absence of failure
before the fixes calls ended with stats like:
call_accept_sent: 0
rx_total_bytes: 0
call_result: setup error
call_term_reason: timeout
after the fixes the call accepted and stayed connected until i ended it; the important behavior was:
- incoming raw offer parsed
- encrypted call key handled
- preaccept sent
- accept sent
- relay packets sent and received
- audio driver initialized
- prerecorded audio could be fed as capture
- call stayed alive
that was the target
what the bridge looked like
by the end the dev bridge had a few main pieces:
- VoipStack - wrapped the Emscripten module and exports
- VoipBridge - connected WhatsApp signaling to the WASM
- AudioDevice - implemented silence, microphone, playback, and prerecorded capture
- WebRtcBridge - held the P2P/SCTP experiments
the data flow looked like this:
Baileys raw call nodes
|
v
WAP binary codec
|
v
WhatsApp Web VoIP WASM
|
+--> signaling callback --> send call node
+--> relay callback --> UDP socket
+--> capture callback --> mic / silence / prerecorded audio
+--> playback callback --> speaker
Baileys was useful, but it was not the call engine; it gave access to raw call stanzas and WhatsApp binary node helpers, while the actual call engine was WhatsApp's own WASM
that distinction matters because the hard part was not inventing a call protocol; the hard part was feeding WhatsApp's own call engine the same environment it expects in the browser
what made it hard
the hardest part was keeping the layers separate; there were at least five protocols stacked on each other:
- WhatsApp websocket session
- WhatsApp binary node format
- call signaling stanzas
- WhatsApp VoIP native API
- relay/media transport
when something failed, the logs rarely told you which layer was wrong
- bad enc size meant the encrypted call key was not transformed correctly
- empty call-id meant the wrong stanza shape was passed
- no active call meant the offer never created call state
- no tx relays available meant relay bind/ping responses were not reaching the transport
- rx_total_bytes: 0 meant media never arrived, but that could be signaling, relay, candidates, or audio
the way through was making one layer boring before moving to the next: raw call node, WAP encoding, offer parse, encrypted call key, preaccept, accept, relay tx, relay rx, audio capture, playback
ending
this started as "can we receive a call?"
the answer was yes. it took enabling the hidden AB props, fetching the WASM, loading the WhatsApp Web VoIP stack outside the browser, passing raw call stanzas into it, sending its signaling output back through Baileys, implementing relay I/O, and feeding audio through virtual devices
after that, the call worked like a real WhatsApp Web call not because i built a new call engine, but because WhatsApp had already shipped one
i just made it run somewhere else