Receive Real-Time Audio

Receive real-time audio from a bot.

📘

To start receiving real-time audio streams, you need to include your websocket URL in create_bot.real_time_media.websocket_audio_destination_url.

This URL should have a ws:// or wss:// prefix, depending on your server's requirements. We highly recommend using websockets over SSL/TLS (wss://), since the connection is encrypted and therefore much more secure.
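For example, a Create Bot request enabling real-time audio might look like the sketch below. The meeting URL, websocket endpoint, and API token are placeholders; the endpoint base URL follows the us-east-1 host shown later in this document.

```python
import json
import urllib.request

# Placeholder values -- substitute your own meeting URL and websocket endpoint.
payload = {
    "meeting_url": "https://zoom.us/j/1234567890",
    "real_time_media": {
        # wss:// is strongly recommended so the audio stream is encrypted in transit
        "websocket_audio_destination_url": "wss://example.com/recall/audio",
    },
}


def create_bot(api_token: str) -> dict:
    """POST the payload to the Create Bot endpoint and return the parsed response."""
    req = urllib.request.Request(
        "https://us-east-1.recall.ai/api/v1/bot/",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```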

Real Time Audio Protocol (Combined Streams)

📘

Combined audio streams are available on the Zoom Web Bot, Microsoft Teams Web Bot, Google Meet Bot, and Webex Bot.

The first message on the websocket connection will be:

{
  protocol_version: 1,
  bot_id: '...',
  recording_id: '...',
  separate_streams: false,
  offset: 0.0 
}

The offset field is measured in seconds, relative to the in_call_recording event on the bot.

The following websocket messages will be in binary format as follows:

  • All data in the websocket packet is S16LE format audio, sampled at 16000Hz, mono.

The following is sample code to receive these messages:
import asyncio
import websockets


async def echo(websocket):
    async for message in websocket:
        if isinstance(message, str):
            # The first (text) message is the JSON metadata described above
            print(message)
        else:
            # Binary messages are raw S16LE / 16 kHz / mono audio
            with open('output/output.raw', 'ab') as f:
                f.write(message)
                print("wrote message")


async def main():
    async with websockets.serve(echo, "0.0.0.0", 8765):
        await asyncio.Future()

asyncio.run(main())
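Because the stream is raw PCM (S16LE, 16 kHz, mono), the captured file is not directly playable. A minimal sketch using Python's standard wave module can wrap it in a WAV header so ordinary audio players can open it (file paths here are placeholders):

```python
import wave


def raw_to_wav(raw_path: str, wav_path: str) -> None:
    """Wrap raw S16LE / 16 kHz / mono audio in a WAV container."""
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples (S16LE)
        wav.setframerate(16000)  # 16 kHz sample rate
        wav.writeframes(pcm)
```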

Real Time Audio Protocol (Separate Streams)

📘

Separate audio streams per participant are only available on the Zoom Native Bot and the Teams Web Bot under a feature flag. Reach out to the Recall team over Slack if you'd like this enabled for your workspace.

When using separate-stream audio, participant audio streams will be separated via their own websocket connection.

The first message on each connection will be:

{
  protocol_version: 1,
  bot_id: '...',
  separate_streams: true,
  offset: 0.0 // Offset (in seconds) relative to the `in_call_recording` event on the bot
}

The following websocket messages will be in binary format as follows:

  • First 32 bits are a little-endian unsigned integer representing the participant_id.
  • The remaining data in the websocket packet is S16LE format audio, sampled at 16000Hz, mono

The following is sample code to decode these messages:

import asyncio
import websockets


async def echo(websocket):
    async for message in websocket:
        if isinstance(message, str):
            # The first (text) message is the JSON metadata described above
            print(message)
        else:
            # First 4 bytes: little-endian unsigned integer participant_id
            participant_id = int.from_bytes(message[0:4], byteorder='little')
            # Remaining bytes: raw S16LE / 16 kHz / mono audio
            with open(f'output/{participant_id}-output.raw', 'ab') as f:
                f.write(message[4:])
                print("wrote message")


async def main():
    async with websockets.serve(echo, "0.0.0.0", 8765):
        await asyncio.Future()

asyncio.run(main())

Upon muting/unmuting, a participant's corresponding websocket connection will disconnect/reconnect accordingly.

Diarization using call events

When receiving audio streams, you can use Call Event Webhooks to receive real-time speaker changes. You can also receive these events through a websocket connection by specifying real_time_media.websocket_speaker_timeline_destination_url when calling Create Bot.

Websocket example:

{ user_id: 16778240, name: 'John Doe', timestamp: 18.76719 }

timestamp is the offset (in seconds) relative to the in_call_recording event for the bot.
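A minimal sketch of parsing these speaker-timeline messages, assuming each one arrives as a JSON-encoded text frame (the function name is our own):

```python
import json


def parse_speaker_event(message: str):
    """Extract (user_id, name, timestamp) from a speaker-timeline text frame."""
    event = json.loads(message)
    # timestamp is in seconds, relative to the in_call_recording event
    return event["user_id"], event["name"], event["timestamp"]
```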

Webhook example:

{
    "event": "bot.active_speaker_notify",
    "data": {
        "participant_id": 16778240,
        "created_at": "2024-04-08T20:29:44.001399994Z",
        "relative_ts": 5.865013889,
        "bot_id": "2a06cd2f-b126-4eee-9d48-eebdb3195187"
    }
}

relative_ts is the offset (in seconds) relative to the in_call_recording event for the bot.

Regardless of which method you use to receive call events, you can use them to attribute a stream of audio packets to a participant ID until the next speaker-change event.
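One way to implement this (a sketch; the class and method names are our own): record each speaker-change event, then look up which participant was active at a given audio offset. Since both the audio offset and the event timestamps are relative to the in_call_recording event, they can be compared directly.

```python
import bisect


class SpeakerTimeline:
    """Attribute audio offsets to participants using speaker-change events."""

    def __init__(self):
        self._events = []  # sorted list of (timestamp, participant_id)

    def add_event(self, timestamp: float, participant_id: int) -> None:
        # Events may arrive out of order; keep the list sorted by timestamp.
        bisect.insort(self._events, (timestamp, participant_id))

    def speaker_at(self, offset: float):
        """participant_id speaking at `offset` seconds, or None before the first event."""
        idx = bisect.bisect_right(self._events, (offset, float("inf"))) - 1
        return self._events[idx][1] if idx >= 0 else None
```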

You can then use the meeting_participants field on the bot (here, bot 9e77800d-ead9-4615-85fb-b71a045c7850) to map the ID to a participant name and attribute the words to the speaker:

// GET https://us-east-1.recall.ai/api/v1/bot/9e77800d-ead9-4615-85fb-b71a045c7850/
{
  "meeting_participants": [
    {
      "id": 100,
      "name": "John Doe",
      "events": [],
      "is_host": true,
      "platform": "unknown",
      "extra_data": null
    }
  ],
  ...
}
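For instance, a small helper (the function name is our own) can build an ID-to-name lookup from the Retrieve Bot response:

```python
def participant_names(bot: dict) -> dict:
    """Map each participant's id to their display name from a Retrieve Bot response."""
    return {p["id"]: p["name"] for p in bot.get("meeting_participants", [])}
```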

FAQ


Do muted participants produce audio?

No, muted participants do not produce any audio.

If a participant is unmuted but silent, you will receive empty audio packets.
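If you want to skip these, you can check whether a packet contains only zero samples before writing it (a sketch, assuming "empty" packets are zero-filled PCM):

```python
def is_silence(packet: bytes) -> bool:
    """True if an S16LE audio packet contains only zero-valued bytes."""
    return not any(packet)
```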

Will bots receive audio from other bots?

Since bots are participants, if there are other bots in a call, the bot will receive audio from those bots like any other participant.

However, since bots are muted by default, the bot will not receive audio packets from other bots unless another bot is actively outputting audio.

What is the retry behavior?

If we are unable to connect to your endpoint, or the connection drops, we will retry the connection every 3 seconds for as long as the bot is alive.