Receive Real-Time Audio
Receive real-time audio from a bot.
To start receiving real-time audio streams, you need to include your websocket URL in create_bot.real_time_media.websocket_audio_destination_url.
This URL should have a ws:// or wss:// prefix depending on your server's requirements. We highly recommend using the websocket protocol over SSL/TLS (wss://), since the connection is encrypted and much more secure.
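For example, a Create Bot request body carrying this field might look like the following sketch (the meeting URL and websocket URL are placeholders; the endpoint and authentication details are omitted):

```python
import json

# Sketch of a Create Bot request body; the URLs below are placeholders.
payload = {
    "meeting_url": "https://zoom.us/j/1234567890",
    "real_time_media": {
        # Your publicly reachable websocket server; wss:// is strongly recommended.
        "websocket_audio_destination_url": "wss://example.com/audio",
    },
}

# The JSON body you would POST to the Create Bot endpoint:
print(json.dumps(payload, indent=2))
```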
Real Time Audio Protocol (Combined Streams)
Combined audio streams are available on the Zoom Web Bot, Microsoft Teams Web Bot, Google Meet Bot, and Webex Bot.
The first message on the websocket connection will be:
{
  protocol_version: 1,
  bot_id: '...',
  recording_id: '...',
  separate_streams: false,
  offset: 0.0
}
The offset is the offset (in seconds) relative to the in_call_recording event on the bot.
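As a sketch, the initial JSON message can be parsed to translate positions in the audio stream onto the recording timeline (the message below is a hard-coded example in the documented shape):

```python
import json

# Hard-coded example of the initial metadata message, in the documented shape.
first_message = '{"protocol_version": 1, "bot_id": "abc", "recording_id": "rec", "separate_streams": false, "offset": 2.5}'

meta = json.loads(first_message)
offset = meta["offset"]  # seconds relative to the in_call_recording event

def recording_time(seconds_into_stream: float) -> float:
    """Map a position in the audio stream to a time on the recording timeline."""
    return offset + seconds_into_stream

print(recording_time(10.0))  # 12.5
```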
Subsequent websocket messages will be binary, in the following format:
- All data in the websocket packet is S16LE format audio, sampled at 16000Hz, mono.

The following is sample code to receive these messages:
import asyncio
import websockets

async def echo(websocket):
    async for message in websocket:
        if isinstance(message, str):
            print(message)
        else:
            with open('output/output.raw', 'ab') as f:
                f.write(message)
            print("wrote message")

async def main():
    async with websockets.serve(echo, "0.0.0.0", 8765):
        await asyncio.Future()

asyncio.run(main())
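Since the stream is raw S16LE PCM at 16000Hz mono, the captured bytes can be wrapped in a WAV header with Python's standard wave module to get a playable file. A minimal sketch (file names are illustrative; the demo input is one second of silence):

```python
import wave

def raw_to_wav(raw_path: str, wav_path: str) -> None:
    """Wrap raw S16LE 16000Hz mono PCM bytes in a WAV container."""
    with open(raw_path, 'rb') as f:
        pcm = f.read()
    with wave.open(wav_path, 'wb') as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples (S16LE)
        wav.setframerate(16000)  # 16000Hz sample rate
        wav.writeframes(pcm)

# Demo with one second of silence (16000 samples, 2 bytes each):
with open('demo.raw', 'wb') as f:
    f.write(b'\x00\x00' * 16000)
raw_to_wav('demo.raw', 'demo.wav')

with wave.open('demo.wav', 'rb') as wav:
    print(wav.getframerate(), wav.getnframes())  # 16000 16000
```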
Real Time Audio Protocol (Separate Streams)
Separate audio streams per participant are only available on the Zoom Native Bot and the Teams Web Bot under a feature flag. Reach out to the Recall team over Slack if you'd like this enabled for your workspace.
When using separate-stream audio, participant audio streams will be separated via their own websocket connection.
The first message on each connection will be:
{
  protocol_version: 1,
  bot_id: '...',
  separate_streams: true,
  offset: 0.0 // Offset (in seconds) relative to the `in_call_recording` event on the bot
}
Subsequent websocket messages will be binary, in the following format:
- The first 32 bits are a little-endian unsigned integer representing the participant_id.
- The remaining data in the websocket packet is S16LE format audio, sampled at 16000Hz, mono.
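For reference, this frame layout can be exercised with Python's struct module. The sketch below builds a frame the way the bot would send it and decodes it back (the participant ID and samples are made up):

```python
import struct

def encode_frame(participant_id: int, pcm: bytes) -> bytes:
    """Prefix S16LE audio bytes with a 32-bit little-endian participant ID."""
    return struct.pack('<I', participant_id) + pcm

def decode_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a binary websocket message into (participant_id, audio bytes)."""
    (participant_id,) = struct.unpack('<I', frame[:4])
    return participant_id, frame[4:]

frame = encode_frame(16778240, b'\x01\x00\x02\x00')  # two fake S16LE samples
pid, audio = decode_frame(frame)
print(pid, len(audio))  # 16778240 4
```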
The following is sample code to decode these messages:
import asyncio
import websockets

async def echo(websocket):
    async for message in websocket:
        if isinstance(message, str):
            print(message)
        else:
            stream_id = int.from_bytes(message[0:4], byteorder='little')
            with open(f'output/{stream_id}-output.raw', 'ab') as f:
                f.write(message[4:])
            print("wrote message")

async def main():
    async with websockets.serve(echo, "0.0.0.0", 8765):
        await asyncio.Future()

asyncio.run(main())
Upon muting/unmuting, a participant's corresponding websocket connection will disconnect/reconnect accordingly.
Diarization Using Call Events
When receiving audio streams, you can utilize Call Event Webhooks to receive real-time speaker changes. You can also receive speaker timeline change events through a websocket connection by specifying real_time_media.websocket_speaker_timeline_destination_url when calling Create Bot.
Websocket example:
{ user_id: 16778240, name: 'John Doe', timestamp: 18.76719 }
timestamp is the offset (in seconds) relative to the in_call_recording event for the bot.
Webhook example:
{
  "event": "bot.active_speaker_notify",
  "data": {
    "participant_id": 16778240,
    "created_at": "2024-04-08T20:29:44.001399994Z",
    "relative_ts": 5.865013889,
    "bot_id": "2a06cd2f-b126-4eee-9d48-eebdb3195187"
  }
}
relative_ts is the offset (in seconds) relative to the in_call_recording event for the bot.
Regardless of which method you use to receive call events, these can be used to determine the participant ID for a stream of audio packets until the next speaker change event.
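For example, keeping speaker-change events sorted by timestamp lets you attribute any audio offset to the most recent change at or before it. A sketch (the event data below is made up):

```python
import bisect

# Speaker-change events as (relative_ts, participant_id), sorted by time.
events = [(0.0, 100), (5.865013889, 16778240), (18.76719, 100)]
timestamps = [ts for ts, _ in events]

def speaker_at(offset: float) -> int:
    """Return the participant ID speaking at a given recording offset."""
    i = bisect.bisect_right(timestamps, offset) - 1
    return events[max(i, 0)][1]

print(speaker_at(10.0))  # 16778240
```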
You can then use the meeting_participants field returned by the Retrieve Bot endpoint to map the ID to a participant name and attribute the words to the speaker:
// GET https://us-east-1.recall.ai/api/v1/bot/9e77800d-ead9-4615-85fb-b71a045c7850/
{
  "meeting_participants": [
    {
      "id": 100,
      "name": "John Doe",
      "events": [],
      "is_host": true,
      "platform": "unknown",
      "extra_data": null
    }
  ],
  ...
}
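Putting it together, a small sketch that builds an ID-to-name lookup from a bot response shaped like the example above (the response dict here is hard-coded for illustration):

```python
# A bot response shaped like the example above, hard-coded for illustration.
bot = {
    "meeting_participants": [
        {"id": 100, "name": "John Doe", "is_host": True},
    ]
}

# Map participant IDs to display names for speaker attribution.
names = {p["id"]: p["name"] for p in bot["meeting_participants"]}

def attribute(participant_id: int) -> str:
    return names.get(participant_id, "Unknown participant")

print(attribute(100))  # John Doe
```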
FAQ
Do muted participants produce audio?
No, muted participants do not produce any audio.
If a participant is unmuted but silent, you will receive empty audio packets.
Will bots receive audio from other bots?
Since bots are participants, if there are other bots in a call, your bot will receive audio from them like any other participant.
Since bots are muted by default, unless another bot is outputting audio, the bot will not receive audio packets from other bots.
What is the retry behavior?
If we are unable to connect to your endpoint, or the connection drops, we will retry the connection every 3 seconds for as long as the bot is alive.