To receive raw audio from a bot in real time, use Real-Time Websocket Endpoints.

Quickstart

Set up a websocket endpoint

For demonstration purposes, the following simple websocket server receives audio and appends it to a file:

import WebSocket from 'ws';
import fs from 'fs';

type AudioDataEvent = {
  event: 'audio_mixed_raw.data';
  data: {
    data: {
      buffer: string; // Base64 encoded audio data
      timestamp: {
        relative: number;
        absolute: string;
      }
    },
    realtime_endpoint: {
      id: string;
      metadata: Record<string, string>;
    },
    recording: {
      id: string;
      metadata: Record<string, string>;
    },
    bot: {
      id: string;
      metadata: Record<string, string>;
    },
    audio_mixed_raw: {
      id: string;
      metadata: Record<string, string>;
    }
  };
};

const wss = new WebSocket.Server({ port: 3456 });

wss.on('connection', (ws) => {
  ws.on('message', (message: WebSocket.Data) => {
    console.log(message);

    // You can listen to the audio using this command:
    // ffmpeg -f s16le -ar 16000 -ac 1 -i /tmp/{RECORDING_ID}.bin -c:a libmp3lame -q:a 2 /tmp/{RECORDING_ID}.mp3
    try {
      const wsMessage = JSON.parse(message.toString()) as AudioDataEvent;

      if (wsMessage.event === 'audio_mixed_raw.data') {
        console.log(wsMessage);

        // Use the recording ID for the file name
        const recordingId = wsMessage.data.recording.id;
        const filePath = `/tmp/${recordingId}.bin`;

        // Decode the base64 payload into raw PCM bytes and append them to the file
        const audioBytes = Buffer.from(wsMessage.data.data.buffer, 'base64');
        fs.appendFileSync(filePath, audioBytes);
      } else {
        console.log("unhandled message", wsMessage.event);
      }
    } catch (e) {
      console.error('Error parsing JSON:', e);
    }
  });

  ws.on('error', (error) => {
    console.error('WebSocket Error:', error);
  });

  ws.on('close', () => {
    console.log('WebSocket Closed');
  });
});

console.log('WebSocket server started on port 3456');

For details on how to verify connections, see Verifying Real-Time Websocket Endpoints.
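As an alternative to the ffmpeg command mentioned in the code comment, the raw PCM file can be made directly playable by prepending a WAV header. The following is a minimal sketch, not part of the original example: `pcmToWav` is a hypothetical helper name, and its defaults assume the 16 kHz mono 16-bit format used by `audio_mixed_raw.data`.

```typescript
// Sketch: wrap raw S16LE PCM in a minimal 44-byte WAV (RIFF) header.
// Defaults assume 16 kHz, mono, 16-bit audio.
function pcmToWav(
  pcm: Buffer,
  sampleRate = 16000,
  channels = 1,
  bitsPerSample = 16,
): Buffer {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const header = Buffer.alloc(44);

  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);             // size of the fmt chunk
  header.writeUInt16LE(1, 20);              // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40);     // size of the audio payload

  return Buffer.concat([header, pcm]);
}
```

Once the call has ended, you could read `/tmp/{RECORDING_ID}.bin` with `fs.readFileSync`, pass the result to `pcmToWav`, and write the output to a `.wav` file. The header is written once for the complete file because it records the total payload size.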

Once you have a basic server running locally, you'll want to expose it publicly through a tunneling tool such as ngrok. For a full setup guide, see Local Webhook Development.

Start a meeting

Now that we have our websocket server running locally and exposed through our ngrok tunnel, it's time to start a meeting and send a bot to it.

For simplicity, go to meet.new in a new tab to start an instant Google Meet call. Save this URL for the next step.

Configure the bot

Now it's time to send a bot to a meeting while configuring a real-time websocket endpoint.

To do this, call the Create Bot endpoint while providing a real-time endpoint object where:

type: websocket
url: Your publicly exposed ngrok tunnel URL
events: An array including the audio_mixed_raw.data event

And of course, don't forget to set meeting_url to your newly-created Google Meet call.

Example curl:

curl --request POST \
     --url https://us-west-2.recall.ai/api/v1/bot/ \
     --header "Authorization: $RECALLAI_API_KEY" \
     --header "accept: application/json" \
     --header "content-type: application/json" \
     --data '
{
  "meeting_url": "https://meet.google.com/sde-zixx-iry",
  "recording_config": {
    "audio_mixed_raw": {},
    "realtime_endpoints": [
      {
        "type": "websocket",
        "url": "wss://my-tunnel-domain.ngrok-free.app",
        "events": ["audio_mixed_raw.data"]
      }
    ]
  }
}
'

📘
Make sure the endpoint's url uses the ws:// or wss:// scheme.
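If you prefer to create the bot from code, the curl request above can be sketched in TypeScript roughly as follows. The helper names `buildCreateBotBody` and `createBot` are hypothetical, the URL and headers are copied from the curl example, and the global `fetch` is assumed to be available (Node 18 or later).

```typescript
type RealtimeEndpoint = { type: 'websocket'; url: string; events: string[] };

type CreateBotBody = {
  meeting_url: string;
  recording_config: {
    audio_mixed_raw: Record<string, never>;
    realtime_endpoints: RealtimeEndpoint[];
  };
};

// Hypothetical helper: builds the request body and rejects non-websocket URLs.
function buildCreateBotBody(meetingUrl: string, websocketUrl: string): CreateBotBody {
  const protocol = new URL(websocketUrl).protocol;
  if (protocol !== 'ws:' && protocol !== 'wss:') {
    throw new Error(`Expected a ws:// or wss:// URL, got ${protocol}//`);
  }
  return {
    meeting_url: meetingUrl,
    recording_config: {
      audio_mixed_raw: {},
      realtime_endpoints: [
        { type: 'websocket', url: websocketUrl, events: ['audio_mixed_raw.data'] },
      ],
    },
  };
}

// Hypothetical helper: sends the request with the same headers as the curl example.
async function createBot(apiKey: string, body: CreateBotBody): Promise<unknown> {
  const response = await fetch('https://us-west-2.recall.ai/api/v1/bot/', {
    method: 'POST',
    headers: {
      Authorization: apiKey,
      accept: 'application/json',
      'content-type': 'application/json',
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Create Bot failed with status ${response.status}`);
  }
  return response.json();
}
```

Validating the scheme before sending the request catches the common mistake of pasting the tunnel's `https://` URL instead of its `wss://` form.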

Receive the raw audio

Once the bot is on the call and connected to audio, it will begin producing audio_mixed_raw.data events containing packets of mixed audio from the call.

These events have the following shape:

{
  "event": "audio_mixed_raw.data",
  "data": {
    "data": {
      "buffer": string, // base64-encoded raw audio: 16 kHz mono, S16LE (16-bit PCM, little-endian)
      "timestamp": {
        "relative": float,
        "absolute": string
      }
    },
    "realtime_endpoint": {
      "id": string,
      "metadata": object
    },
    "audio_mixed": {
      "id": string,
      "metadata": object
    },
    "recording": {
      "id": string,
      "metadata": object
    },
    "bot": {
      "id": string,
      "metadata": object
    }
  }
}

Here, data.buffer is the base64-encoded mixed audio: mono, 16-bit signed little-endian PCM sampled at 16 kHz.
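To work with the audio as numbers rather than bytes, the payload can be decoded into 16-bit samples. The sketch below is illustrative only: `decodeSamples` and `packetDurationSeconds` are hypothetical helper names, and the constants restate the format described above.

```typescript
const SAMPLE_RATE_HZ = 16000; // 16 kHz, per the event description
const BYTES_PER_SAMPLE = 2;   // 16-bit samples

// Decode a base64 data.buffer value into signed 16-bit samples.
function decodeSamples(base64Audio: string): Int16Array {
  const bytes = Buffer.from(base64Audio, 'base64');
  const sampleCount = Math.floor(bytes.length / BYTES_PER_SAMPLE);
  const samples = new Int16Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Read explicitly as little-endian so the result does not depend on the host CPU.
    samples[i] = bytes.readInt16LE(i * BYTES_PER_SAMPLE);
  }
  return samples;
}

// Duration of a mono packet in seconds: one sample per 1/16000 s.
function packetDurationSeconds(samples: Int16Array): number {
  return samples.length / SAMPLE_RATE_HZ;
}
```

For example, a payload that decodes to 640 bytes holds 320 samples, which corresponds to 20 ms of audio at this sample rate.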

🎉
And that's it! You're now streaming audio in real-time to a websocket server.

FAQ

Do muted participants produce audio?

No, muted participants do not produce any audio.

If a participant is unmuted but silent, you will receive empty audio packets.
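If you want to skip such packets on the receiving side, one option is to check the decoded samples. This is only a sketch under assumptions: the text above does not say whether an empty packet is zero-length or filled with zero-valued samples, so the hypothetical `isSilentPacket` helper treats both as silent, and `threshold` is an illustrative tuning parameter on the 16-bit amplitude scale.

```typescript
// Returns true when a packet carries no signal above `threshold`.
// `threshold` is a hypothetical tuning knob (0 means "exactly zero samples only").
function isSilentPacket(base64Audio: string, threshold = 0): boolean {
  const bytes = Buffer.from(base64Audio, 'base64');
  for (let offset = 0; offset + 1 < bytes.length; offset += 2) {
    if (Math.abs(bytes.readInt16LE(offset)) > threshold) {
      return false; // found a sample louder than the threshold
    }
  }
  return true; // zero-length packets and all-quiet packets both end up here
}
```

A small non-zero threshold may be useful if you also want to ignore low-level background noise, at the cost of occasionally dropping very quiet speech.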

Will bots receive audio from other bots?

Since bots are participants, a bot receives audio from any other bots in the call just as it would from any other participant.

However, bots are muted by default, so a bot only receives audio packets from another bot when that bot is outputting audio.

What is the retry behavior?

If we are unable to connect to your endpoint, or the connection drops, we retry the connection every 3 seconds for as long as the bot is alive.
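Because a connection can drop and be re-established mid-call, a receiver may want to track state per recording rather than per connection, and notice when audio is missing. The sketch below rests on assumptions that this page does not confirm: that `timestamp.relative` is expressed in seconds and marks the start of each packet, and that audio produced while disconnected may not be delivered. `GapTracker` and its fields are hypothetical names.

```typescript
type PacketInfo = {
  recordingId: string;
  relativeSeconds: number; // assumed to come from data.data.timestamp.relative
  durationSeconds: number; // decoded byte length / 2 / 16000 for this format
};

class GapTracker {
  // Expected start time of the next packet, keyed by recording ID so that
  // state survives a websocket reconnect.
  private expectedNextStart = new Map<string, number>();
  private toleranceSeconds: number;

  constructor(toleranceSeconds = 0.1) {
    this.toleranceSeconds = toleranceSeconds;
  }

  // Returns the length of the gap (in seconds) before this packet, or 0 if none.
  observe(packet: PacketInfo): number {
    const expected = this.expectedNextStart.get(packet.recordingId);
    this.expectedNextStart.set(
      packet.recordingId,
      packet.relativeSeconds + packet.durationSeconds,
    );
    if (expected === undefined) {
      return 0; // first packet seen for this recording
    }
    const gap = packet.relativeSeconds - expected;
    return gap > this.toleranceSeconds ? gap : 0;
  }
}
```

When a gap is reported, you could, for instance, append the equivalent amount of silence to the output file so that later audio stays aligned with the meeting timeline.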

Will the bot's audio/video/transcript be included in the final recording?

You can include the bot's audio in the final recording by setting recording_config.include_bot_in_recording.audio to true in the Create Bot request.

You cannot include the bot's video or transcript in the final recording at this time. To work around this, we recommend:

If you only need the transcript:
- Real-time - you will need to fetch the bot's transcript from your TTS provider (if applicable). You can then merge it with the generated transcript of the other participants by aligning the timestamps of the two transcripts
- Async - you will need to:
  - Include the bot's audio in the recording and enable Perfect Diarization in the Create Bot request config
```
{
  "recording_config": {
    "include_bot_in_recording": {
      "audio": true
    },
    "transcript": {
      "diarization": {
        "use_separate_streams_when_available": true
      }
    }
  }
}
```
  - Use Async Transcription to transcribe the call. Note that the bot's name will be returned as null, but you can fetch it by querying the Retrieve Bot API instead
If you need the video: you will need to send a second bot to the call to record the meeting. This bot will capture all participants, including the bot agent that is outputting video. This method can also produce a transcript, because the bot agent is registered as its own participant
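
For the real-time transcript workaround, merging by timestamp can be sketched as a two-way merge of sorted lists. Everything here is illustrative: the `Utterance` shape is hypothetical (real TTS and transcript payloads will differ), and `botOffsetSeconds` is an assumed parameter for shifting the TTS provider's clock onto the meeting timeline.

```typescript
// Hypothetical utterance shape; map your provider's payloads onto it.
type Utterance = { speaker: string; text: string; startSeconds: number };

// Merge two transcripts, each already sorted by start time, into one timeline.
function mergeTranscripts(
  meeting: Utterance[],
  bot: Utterance[],
  botOffsetSeconds = 0,
): Utterance[] {
  // Shift the bot's utterances onto the meeting's clock before comparing.
  const shiftedBot = bot.map((u) => ({
    ...u,
    startSeconds: u.startSeconds + botOffsetSeconds,
  }));

  const merged: Utterance[] = [];
  let i = 0;
  let j = 0;
  while (i < meeting.length && j < shiftedBot.length) {
    if (meeting[i].startSeconds <= shiftedBot[j].startSeconds) {
      merged.push(meeting[i++]);
    } else {
      merged.push(shiftedBot[j++]);
    }
  }
  // Append whatever remains from either list.
  return merged.concat(meeting.slice(i), shiftedBot.slice(j));
}
```

On ties, the meeting participant's utterance is placed first. That choice is arbitrary and can be flipped if the bot should take precedence.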