Bot Real-time Transcription

Real-time transcription allows you to consume transcript utterances (generated by the bot) via Real-Time Endpoints. The following example uses a webhook real-time endpoint (the same applies to websocket endpoints).

❗️

Important: Concurrency considerations

When going to production, make sure that your account with your third-party transcription provider is configured with a high enough concurrency limit to support your anticipated load.

Some transcription providers require that you reach out to them to increase your concurrency limit, and we highly recommend checking this before running production workloads.

Configuration

To configure a bot for real-time transcription, your Create Bot request must include:

  • A webhook Real-Time Endpoint configured with the transcript.data event
  • A transcript artifact configured with the provider of your choice
curl --request POST \
     --url https://us-west-2.recall.ai/api/v1/bot/ \
     --header "Authorization: $RECALLAI_API_KEY" \
     --header "accept: application/json" \
     --header "content-type: application/json" \
     --data '
{
  "meeting_url": "https://meet.google.com/hzj-adhd-inu",
  "recording_config": {
    "transcript": {
      "provider": {
        "recallai_streaming": {}
      }
    },
    "realtime_endpoints": [
      {
        "type": "webhook",
        "url": "https://my-app.com/api/webhook/recall/transcript",
        "events": ["transcript.data", "transcript.partial_data"]
      }
    ]
  }
}
'

In the above example, we configure real-time transcription using Recall's own transcription provider (recallai_streaming). For a full list of supported transcription providers, see Create Bot.
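The same request can be made from code. Below is a minimal Python sketch using only the standard library; the endpoint, headers, and payload mirror the curl example above, while the function names and the split into a payload builder are illustrative choices, not part of the API:

```python
import json
import os
import urllib.request


def build_bot_payload(meeting_url: str, webhook_url: str) -> dict:
    """Build the Create Bot payload for real-time transcription via webhook."""
    return {
        "meeting_url": meeting_url,
        "recording_config": {
            "transcript": {"provider": {"recallai_streaming": {}}},
            "realtime_endpoints": [
                {
                    "type": "webhook",
                    "url": webhook_url,
                    "events": ["transcript.data", "transcript.partial_data"],
                }
            ],
        },
    }


def create_bot(meeting_url: str, webhook_url: str) -> dict:
    """POST the payload to the Create Bot endpoint and return the response JSON."""
    request = urllib.request.Request(
        "https://us-west-2.recall.ai/api/v1/bot/",
        data=json.dumps(build_bot_payload(meeting_url, webhook_url)).encode(),
        headers={
            "Authorization": os.environ["RECALLAI_API_KEY"],
            "accept": "application/json",
            "content-type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```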

Diarization

By default, transcription uses the mixed audio: a single stream for the entire recording. Alternatively, on supported platforms, each participant's stream can be transcribed separately, allowing perfect diarization. To use this, add the diarization object with use_separate_streams_when_available set to true:

curl --request POST \
     --url https://us-west-2.recall.ai/api/v1/bot/ \
     --header "Authorization: $RECALLAI_API_KEY" \
     --header "accept: application/json" \
     --header "content-type: application/json" \
     --data '
{
  "meeting_url": "https://meet.google.com/hzj-adhd-inu",
  "recording_config": {
    "transcript": {
      "provider": {
        "recallai_streaming": {}
      },
      "diarization": {
        "use_separate_streams_when_available": true
      }
    },
    "realtime_endpoints": [
      {
        "type": "webhook",
        "url": "https://my-app.com/api/webhook/recall/transcript",
        "events": ["transcript.data", "transcript.partial_data"]
      }
    ]
  }
}
'
❗️

Increased Transcription Usage

Transcribing multiple streams may result in higher costs from your transcription provider. On average, usage is ~1.8x higher than single-stream transcription.

Webhooks

Once the bot is in the call and is processing audio, it will start generating a transcript in real-time and sending utterances to your real-time webhook endpoint.

For verifying incoming webhooks, please see Real-time Webhook Endpoint Verification.

Payload

A transcription event will be sent to your real-time webhook endpoint whenever a new utterance is returned from the transcription provider.

  • event: transcript.data
  • data.data.words: includes the transcribed text
  • data.data.participant: includes information about the participant (speaker) of the words

The shape of the payload is as follows:

{
  "event": "transcript.data",
  "data": {
    "data": {
      "words": [{
        "text": string,
        "start_timestamp": { "relative": float },
        "end_timestamp": { "relative": float } | null
      }],
      "participant": {
        "id": number,
        "name": string | null,
        "is_host": boolean,
        "platform": string | null,
        "extra_data": object,
        "email": string | null
      }
    },
    "realtime_endpoint": {
      "id": string,
      "metadata": object
    },
    "transcript": {
      "id": string,
      "metadata": object
    },
    "recording": {
      "id": string,
      "metadata": object
    },
    "bot": {
      "id": string,
      "metadata": object
    }
  }
}
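A handler for this payload might flatten the words into a single utterance line attributed to the speaker. A minimal Python sketch (the field names come from the schema above; the function name and output format are illustrative):

```python
def format_utterance(payload: dict) -> str:
    """Turn a transcript.data payload into a '<speaker>: <text>' line."""
    data = payload["data"]["data"]
    participant = data["participant"]
    # Fall back to the numeric id when the platform doesn't expose a name.
    speaker = participant["name"] or f"participant-{participant['id']}"
    text = " ".join(word["text"] for word in data["words"])
    return f"{speaker}: {text}"


# Example payload following the schema above
payload = {
    "event": "transcript.data",
    "data": {
        "data": {
            "words": [
                {"text": "hello", "start_timestamp": {"relative": 0.0}, "end_timestamp": {"relative": 0.4}},
                {"text": "world", "start_timestamp": {"relative": 0.5}, "end_timestamp": None},
            ],
            "participant": {"id": 7, "name": "Alice", "is_host": True, "platform": None, "extra_data": {}, "email": None},
        }
    },
}
```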

Partial results

When using real-time transcription, the time to receive a transcription webhook can vary according to how long the utterance is. In cases of longer monologues, this delay can be quite significant and may hinder the real-time experience.

To alleviate this, you can leverage partial transcription results to decrease the latency of transcription events, even with longer utterances. Partial results are low-latency, intermediate transcriptions of an utterance that arrive before the finalized transcript.

When enabled, you will receive multiple transcript utterance parts before getting the finalized transcript utterance. This will manifest as:

  • Partial words - You could receive the following partial transcript utterances in sequence: fur, further, furthermore. When the caption is finalized, you will receive the whole word furthermore.
  • Partial sentences - You could receive the following partial transcript utterances in sequence: hel, hello, hello how, hello how are, hello how are you. When the caption is finalized, you will receive the whole sentence hello how are you?

Enable partial results

To receive partial results in real time, add transcript.partial_data to the realtime endpoint's events.

{
  ...,
  "recording_config": {
    ...,
    "realtime_endpoints": [
      {
        "type": "webhook",
        "url": "https://my-app.com/api/webhook/recall/transcript",
        "events": ["transcript.data", "transcript.partial_data"]
      }
    ]
  }
}

A transcription event will be sent to your real-time webhook endpoint whenever a new partial result is available:

  • event: transcript.partial_data
  • data.data.words: includes the partial transcribed text
  • data.data.participant: includes information about the participant (speaker) of the words

The shape of the payload is as follows:

{
  "event": "transcript.partial_data",
  "data": {
    "data": {
      "words": [{
        "text": string,
        "start_timestamp": { "relative": float },
        "end_timestamp": { "relative": float } | null
      }],
      "participant": {
        "id": number,
        "name": string | null,
        "is_host": boolean,
        "platform": string | null,
        "extra_data": object,
        "email": string | null
      }
    },
    "realtime_endpoint": {
      "id": string,
      "metadata": object
    },
    "transcript": {
      "id": string,
      "metadata": object
    },
    "recording": {
      "id": string,
      "metadata": object
    },
    "bot": {
      "id": string,
      "metadata": object
    }
  }
}

Using partial results

The event field on the payload indicates whether the utterance is a partial or a finalized result.

For example, if I say "Hey, my name is John. It's really nice to meet you.", I might receive:

A series of partial results (event = transcript.partial_data):

Ay my name

Hey my name is John.

Hey my name is John, it's really nice

Then a short time after this, a more accurate final result (event = transcript.data):

Hey, my name is John. It's really nice to meet you.

This is useful for cases where you want to use partial results immediately, then update them with the more accurate, finalized result once it arrives. One common pattern is to display partial results in your UI, and then replace them with the finalized version once received.
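This replace-on-finalize pattern can be sketched as a small amount of state keyed by participant: each partial event overwrites that speaker's in-progress line, and the final event commits it. A Python sketch (the class and helper names are illustrative, not part of the API):

```python
class LiveTranscript:
    """Tracks one in-progress line per participant, committing it on finalize."""

    def __init__(self):
        self.committed = []    # finalized (participant_id, text) pairs, in order
        self.in_progress = {}  # participant_id -> latest partial text

    def handle(self, payload: dict) -> None:
        data = payload["data"]["data"]
        pid = data["participant"]["id"]
        text = " ".join(word["text"] for word in data["words"])
        if payload["event"] == "transcript.partial_data":
            # Overwrite the previous partial for this speaker.
            self.in_progress[pid] = text
        elif payload["event"] == "transcript.data":
            # Final result: replace the partial with the accurate text.
            self.in_progress.pop(pid, None)
            self.committed.append((pid, text))


def make_event(event: str, participant_id: int, text: str) -> dict:
    """Illustrative helper: build a minimal payload of the shape shown above."""
    return {
        "event": event,
        "data": {
            "data": {
                "participant": {"id": participant_id},
                "words": [{"text": t} for t in text.split()],
            }
        },
    }
```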

Full Transcript

In addition to receiving individual utterances via the configured realtime endpoints, the transcript will also be available on the bot's recording through media_shortcuts. Recall will trigger a transcript.done webhook once this is available. See Transcript Status Webhooks.

To access this, call the Retrieve Bot endpoint. Within the bot's recordings, there will be a media_shortcuts object with a transcript.data.download_url containing a URL to download the full transcript:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "recordings": [
    {
      "media_shortcuts": {
        "transcript": {
          "data": {
            "download_url": "string"
          },
          ...
        }
      }
    }
  ],
  ...
}
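Digging the download URL out of a Retrieve Bot response can be sketched as follows (the function name is illustrative; fetching and parsing the downloaded transcript itself is left out):

```python
def transcript_download_urls(bot: dict) -> list:
    """Collect transcript download URLs from a Retrieve Bot response."""
    urls = []
    for recording in bot.get("recordings", []):
        # media_shortcuts.transcript.data.download_url, guarding each level
        transcript = recording.get("media_shortcuts", {}).get("transcript") or {}
        url = (transcript.get("data") or {}).get("download_url")
        if url:
            urls.append(url)
    return urls
```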

Transcription Errors

If real-time transcription fails, Recall triggers a transcript.failed webhook. The reason for the failure is included as a sub_code in the event payload (see Transcript Status Webhooks), and you can also check the bot logs in the dashboard to see why it failed.

If real-time transcription fails, we recommend falling back to using Async Transcription.

Languages


Automatic language detection

If you don’t know ahead of time which language the conversation will be in, you can set up automatic language detection. It comes in two forms:

  • Language detection - Detecting the primary spoken language within a recording, without needing to explicitly set it
  • Code switching - Handling conversations that alternate between two or more languages or language varieties within a single conversation or speech

Most of the third-party transcription providers that we integrate with support language detection.

❗️

Automatic language detection is not available when using meeting captions.

The table below covers each supported provider and its corresponding parameters in the Create Bot recording_config.transcript.provider configuration; each "Docs" entry links to the provider's supported languages.

recallai_streaming
  • Language detection: set language_code to auto
  • Code switching: not supported
  • Supported languages: Docs

assembly_ai_async_chunked
  • Language detection: set language_detection to true (docs)
  • Code switching: set language_detection_options.code_switching to true (docs)
  • Supported languages: Docs

aws_transcribe_streaming
  • Language detection: set language_identification to true and specify a list of language_options (docs)
  • Code switching: same as language detection (docs)
  • Supported languages: Docs

deepgram_streaming
  • Language detection: set model to nova-2 or nova-3 and set language to multi (docs)
  • Code switching: same as language detection (docs)
  • Supported languages: Docs
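For example, to enable automatic language detection with recallai_streaming, set language_code to auto in the provider config. A sketch based on the table above (check the provider docs for the full set of options):

```json
{
  "recording_config": {
    "transcript": {
      "provider": {
        "recallai_streaming": {
          "language_code": "auto"
        }
      }
    }
  }
}
```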

Why are transcription webhooks so delayed?

Recall will POST any results from the configured transcription provider as they're received. When using partial results, the frequency is typically in the hundreds of ms to low seconds range but varies slightly by provider.

If you're seeing large delays in results, such as many seconds or even minutes, this is likely due to the "single-threaded" nature of the transcription feed. Transcript utterances are sequential and must be delivered in order, so blocking one webhook request delays all subsequent requests.

If you're running in a single-threaded environment, you should make sure that any processing of the transcription webhook happens asynchronously to prevent delaying future webhooks.
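One way to keep the handler non-blocking is to enqueue each payload and acknowledge immediately, letting a worker thread do the slow processing in order. A framework-agnostic sketch with Python's standard library (in a real handler you would return the HTTP 200 right after the put(); the names here are illustrative):

```python
import queue
import threading

events = queue.Queue()


def webhook_handler(payload: dict) -> None:
    """Called per incoming webhook: enqueue and return immediately."""
    events.put(payload)  # O(1); the HTTP response is not blocked on processing


def worker(process) -> None:
    """Drain the queue in order, applying the (potentially slow) processing."""
    while True:
        payload = events.get()
        if payload is None:  # sentinel to stop the worker
            break
        process(payload)
        events.task_done()
```

A single worker preserves utterance order; if ordering doesn't matter for your use case, you can run several workers against the same queue.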