Diarization: Mapping speakers in the transcript

Choose the right diarization mode so your transcripts use accurate speaker labels and names across remote, shared-mic, and hybrid meetings.

Overview

Diarization is the process of assigning each part of a transcript to a speaker, so the transcript can show who said what.

Speaker labels

To diarize a transcript, a speaker label is assigned to each part of the conversation. A speaker label is the identifier shown in the transcript for a speaker.

There are two types of speaker labels:

  • Participant speaker labels: labels tied to a specific participant in the meeting platform, so you can identify which meeting participant the label refers to.
  • Generic speaker labels: labels that distinguish speakers but are not tied to a participant from the meeting platform, such as numbers (1, 2, 3) or letters (A, B, C).

Diarization methods

A diarization method defines how transcript text is assigned to speakers. It determines how speakers are identified in the transcript and whether the transcript uses participant speaker labels or generic speaker labels.

There are four diarization methods you can use:

Perfect diarization

"Perfect diarization" transcribes each audio stream separately. Because each stream is transcribed independently, Recall can keep speech from each stream separate and return participant speaker labels.

Hybrid diarization

"Hybrid diarization" is the most accurate diarization method when some audio streams may contain more than one speaker. It uses perfect diarization to transcribe each audio stream separately, then applies machine diarization within each stream to distinguish between multiple speakers sharing that same stream.

Speaker-timeline diarization

"Speaker-timeline diarization" uses active speaker events from the meeting platform to map transcript text to the participant identified as speaking during each time range. The transcript is returned with participant speaker labels.

Machine diarization

"Machine diarization" uses a third-party transcription provider to distinguish speakers based on voice characteristics. The transcript is returned with generic speaker labels (e.g., A, B, C or 0, 1, 2), rather than participant speaker labels.

📘

For more information on which diarization method is right for your use case, check out our guide on speaker diarization.

Comparing diarization methods

The table below compares the available diarization methods and their tradeoffs. The methods are listed roughly from most to least accurate in typical use cases:

| Method | Supported for | Transcribes | Speaker labels | Useful when | Caveats |
|---|---|---|---|---|---|
| Hybrid diarization | Bot async transcription | Separate audio streams | Participant and generic speaker labels | Most participants join from their own device, but some calls may include multiple people sharing the same device (e.g., people joining together from a conference room) | Requires additional code to map speaker labels to participants; still uses generic speaker labels for speakers sharing a stream; does not include raw data from the third-party transcription provider |
| Perfect diarization | Bot async and bot real-time transcription | Separate audio streams | Participant speaker labels | Each participant joins from their own device, so each person has their own audio stream | Does not distinguish between multiple people speaking in the same audio stream; does not include raw provider data |
| Speaker-timeline diarization | Bot/DSDK async and real-time transcription | Single mixed audio stream | Participant speaker labels | You want participant speaker labels and need the raw data from the third-party transcription provider | Depends on meeting-platform active speaker events, which can be inaccurate or incomplete (e.g., noise, missed speaker changes, overlapping speech) |
| Machine diarization | Bot/DSDK async and real-time transcription | Single mixed audio stream | Generic speaker labels | Multiple people may be speaking on the same mixed audio, such as several participants joining together from a conference room or shared device | Can be less accurate when different speakers have similar-sounding voices |
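As an illustrative (non-normative) sketch, the tradeoffs in the table above can be encoded as a small chooser function. The scenario parameters below are hypothetical names, not part of the Recall API:

```python
def recommended_method(separate_streams_available: bool,
                       shared_device_possible: bool,
                       need_raw_provider_data: bool) -> str:
    """Pick a diarization method per the tradeoffs in the table above.

    Illustrative only; the real decision may involve more factors.
    """
    if need_raw_provider_data:
        # Perfect and hybrid diarization do not return raw provider data.
        return "machine" if shared_device_possible else "speaker-timeline"
    if separate_streams_available:
        return "hybrid" if shared_device_possible else "perfect"
    return "machine" if shared_device_possible else "speaker-timeline"

print(recommended_method(True, False, False))  # prints "perfect"
```

For example, a call where everyone joins from their own device on a platform with separate-stream support resolves to perfect diarization, while a conference-room call on a mixed-audio-only platform resolves to machine diarization.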
📘

Why separate audio streams improve diarization accuracy

Diarization accuracy depends in part on how meeting audio is provided for transcription. Meeting audio is typically provided in one of two ways:

  • Single mixed audio stream: all participant audio is combined into one stream before transcription. This makes speaker attribution harder, especially when multiple people are speaking at the same time or when multiple people are sharing the same device or microphone.
  • Separate audio streams: audio is provided in separate streams instead of one mixed stream. This makes speaker attribution more accurate because each stream can be transcribed independently.

In general, diarization is more accurate when using separate audio streams.

Separate streams platform support

| Platform | Real-time separate audio streams | Async separate audio streams |
|---|---|---|
| Zoom | ✅ Supported | ✅ Supported |
| Microsoft Teams | ✅ Supported | ✅ Supported |
| Google Meet | ✅ Supported | ✅ Supported |
| Webex | ❌ Not supported | ❌ Not supported |

Hybrid diarization

Hybrid diarization uses a combination of participant information from the meeting platform and machine diarization to handle the meeting's join pattern:

  • If participants are joining from their own devices, the anonymous speaker label is replaced with the real participant name from the participants list.
  • If Recall detects that multiple participants are effectively coming from the same device/microphone, the anonymous speaker labels are kept (so voices can still be separated).

Requirements to use hybrid diarization

To use hybrid diarization:

  • The meeting platform must support separate audio streams.
  • You must use bots to record meetings (not supported for DSDK).
  • You must use async transcription (not supported for real-time transcription).
  • Your selected third-party transcription provider must support machine diarization.
🚧

Caveats when using hybrid diarization

Hybrid diarization can provide the most accurate results, but it comes with a few tradeoffs:

  • Your application will need additional logic/code to map the generic speaker labels to the participants.
  • You will still get generic speaker labels for participants sharing a stream.
  • When enabled, you will not be able to get raw data from the third-party transcription provider for either real-time or async transcription.

There are also some cost considerations when using hybrid diarization:

  • For async transcription, we typically see 0.6x to 1.2x the transcription credit usage, with the average cost difference being around 1x. This is because we optimize the audio output by trimming out silence.

Enabling hybrid diarization

To enable hybrid diarization, combine the following two settings:

  • Enable perfect diarization by setting diarization.use_separate_streams_when_available: true, as shown in the Create Async Transcript request.
  • Enable machine diarization through one of the options listed in the machine diarization section of this guide.

You will then receive transcript parts where participant.id is null and participant.name contains a temporary composite key instead of a real participant name. For example:

[
  {
    "participant": {
      "id": null,
      "name": "200-0",
      "is_host": false,
      "platform": "mobile_app",
      "extra_data": {...},
      "email": null
    },
    "words": [...]
  },
  // ... other transcript utterances
]

This temporary participant name key follows the format {participant_id}-{anonymous_label}.

With this, you can then:

  • Fetch the list of participants via the recording.media_shortcuts.participant_events.data.participants_download_url field
  • Build a mapping of each participant_id to its set of anonymous labels across all transcript parts that looks like this:
{
  // participant_id: anonymous_labels[]
  100: [0],
  200: [0, 1]
  // other participants
}
  • Get the list of participant ids with only one anonymous label
  • Iterate over the transcript and:
    • For participants with exactly one anonymous label, replace the anonymous speaker with the real participant name and metadata
    • For participants with multiple anonymous labels (multiple people sharing a device), leave the anonymous labels unchanged
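The mapping steps above can be sketched in Python. This is a minimal illustration, assuming transcript parts shaped like the example earlier (participant names in the {participant_id}-{anonymous_label} format) and a participants list you have already downloaded and keyed by id; the participant names used below are hypothetical:

```python
from collections import defaultdict

def resolve_hybrid_labels(transcript_parts, participants_by_id):
    """Replace composite keys like "200-0" with real participant info
    when a participant maps to exactly one anonymous label."""
    # 1. Build participant_id -> set of anonymous labels across all parts.
    labels = defaultdict(set)
    for part in transcript_parts:
        pid, anon = part["participant"]["name"].rsplit("-", 1)
        labels[pid].add(anon)

    # 2. Participants with exactly one label joined from their own device.
    solo = {pid for pid, anons in labels.items() if len(anons) == 1}

    # 3. Rewrite the transcript in place.
    for part in transcript_parts:
        pid, _ = part["participant"]["name"].rsplit("-", 1)
        if pid in solo and pid in participants_by_id:
            real = participants_by_id[pid]
            part["participant"]["id"] = real["id"]
            part["participant"]["name"] = real["name"]
        # Otherwise keep the anonymous composite label unchanged.
    return transcript_parts

parts = [
    {"participant": {"id": None, "name": "100-0"}},
    {"participant": {"id": None, "name": "200-0"}},
    {"participant": {"id": None, "name": "200-1"}},
]
people = {"100": {"id": 100, "name": "Alice"},
          "200": {"id": 200, "name": "Conference Room"}}
resolve_hybrid_labels(parts, people)
```

After this runs, the "100-0" part carries the real name, while the two "200-*" parts keep their anonymous labels because two voices shared that stream.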
📘

You can use this sample app to see how to get the transcript using hybrid diarization.

Perfect diarization

Perfect diarization is our default and recommended diarization method. It transcribes each participant's audio stream separately instead of using a single mixed audio stream for the entire meeting. This improves speaker attribution accuracy in the transcript, especially when multiple people are speaking at the same time.

When perfect diarization is enabled, the transcript is returned with participant speaker labels out of the box. Perfect diarization is also supported for both real-time and async transcription.

Requirements to use perfect diarization

To use perfect diarization:

  • The meeting platform must support separate audio streams.
  • You must use bots to record meetings (not supported for DSDK).

🚧

Caveats when using perfect diarization

There are a few tradeoffs to be aware of when using perfect diarization:

  • It does not distinguish between multiple people speaking from the same audio stream, such as multiple participants joining together from a conference room or shared device.
  • It does not return raw provider data for real-time or async transcription. You will not receive transcript.provider_data real-time transcription events, and the async raw provider data in provider_data_download_url will always be [].

There are also some cost considerations when using perfect diarization:

  • For real-time transcription, we typically see around 1.8x the transcription credit usage compared to standard transcription. This is because overlapping speech across separate streams must be transcribed independently.
  • For async transcription, we typically see 0.6x to 1.2x the transcription credit usage, with the average cost difference being around 1x. This is because we optimize the audio output by trimming out silence.

Enabling perfect diarization

To enable perfect diarization, set diarization.use_separate_streams_when_available to true in the transcription configs as seen below.

Enabling perfect diarization for real-time transcription

To configure perfect diarization in a Create Bot request, set recording_config.transcript.diarization.use_separate_streams_when_available to true:

{
  // other create bot request configs
  "recording_config": {
    // other recording_config configs
    "transcript": {
      // other transcript configs
      "diarization": {
        "use_separate_streams_when_available": true
      }
    }
  }
}
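As a sketch, the Create Bot request body above can be assembled in Python. The commented-out request at the end is an assumption about a typical REST setup (the API_BASE URL and Token auth header are placeholders); check the Create Bot API reference for the exact endpoint and authentication:

```python
import json

def create_bot_payload(meeting_url: str) -> dict:
    """Build a Create Bot request body with perfect diarization enabled."""
    return {
        "meeting_url": meeting_url,  # the meeting the bot should join
        "recording_config": {
            "transcript": {
                "diarization": {
                    # Transcribe each participant's audio stream separately.
                    "use_separate_streams_when_available": True,
                },
            },
        },
    }

payload = create_bot_payload("https://zoom.us/j/123456789")
print(json.dumps(payload, indent=2))

# To send it (endpoint and auth are assumptions -- see the API reference):
# requests.post(f"{API_BASE}/bot/", json=payload,
#               headers={"Authorization": f"Token {API_KEY}"})
```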

For details on how to implement and access the transcript using real-time transcription, see the real-time transcription documentation.

Enabling perfect diarization for async transcription

To configure perfect diarization in a Create Async Transcript request, set diarization.use_separate_streams_when_available to true:

{
  // other create async transcript request configs
  "diarization": {
    "use_separate_streams_when_available": true
  }
}

For details on how to implement/access the transcript using async transcription, see Async Transcription.

Speaker-timeline diarization

Speaker-timeline diarization uses active speaker events from the meeting platform to assign transcript text to participants. Instead of transcribing separate audio streams, it relies on the meeting platform to indicate who the active speaker is over time.

When speaker-timeline diarization is used, the transcript is returned with participant speaker labels. This works best when each participant is joining from their own device, because the meeting platform can more reliably associate active speaker events with a specific participant.

Requirements to use speaker-timeline diarization

To use speaker-timeline diarization:

  • The meeting platform must support active speaker events (Google Meet, Microsoft Teams, and Zoom support them out of the box; Webex meeting hosts must have a paid account and closed captions enabled).
🚧

Caveats when using speaker-timeline diarization

There are a few tradeoffs to be aware of when using speaker-timeline diarization:

  • It depends on meeting-platform active speaker events, which can sometimes be inaccurate or incomplete (e.g. noise, missed speaker changes, or overlapping speech).
  • It is not a good fit when multiple people are speaking from the same device or microphone, such as a conference room setup.

It is typically most useful when you need both participant speaker labels and raw data from the third-party transcription provider.

Enabling speaker-timeline diarization

Speaker-timeline diarization is used whenever machine diarization is not enabled and diarization.use_separate_streams_when_available is not set to true.
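The selection rule above, together with the hybrid and perfect sections earlier, can be expressed as a small function. This is illustrative only, with hypothetical parameter names standing in for the two configuration switches:

```python
def effective_method(use_separate_streams: bool, provider_diarize: bool) -> str:
    """Which diarization method a configuration resolves to, per this guide.

    use_separate_streams -> diarization.use_separate_streams_when_available
    provider_diarize     -> a provider machine-diarization flag is enabled
    """
    if use_separate_streams and provider_diarize:
        return "hybrid"        # separate streams + machine diarization
    if use_separate_streams:
        return "perfect"
    if provider_diarize:
        return "machine"
    return "speaker-timeline"  # the fallback when neither is enabled
```

So speaker-timeline diarization is the default path whenever both switches are off.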

Enabling speaker-timeline diarization for real-time transcription

To configure speaker-timeline diarization in a Create Bot request, ensure that machine diarization isn't enabled and set recording_config.transcript.diarization.use_separate_streams_when_available to false:

{
  // other create bot request configs
  "recording_config": {
    // other recording_config configs
    "transcript": {
      // other transcript configs
      "diarization": {
        "use_separate_streams_when_available": false
      }
    }
  }
}

For details on how to implement and access the transcript using real-time transcription, see the real-time transcription documentation.

Enabling speaker-timeline diarization for async transcription

To configure speaker-timeline diarization in a Create Async Transcript request, ensure that machine diarization isn't enabled and set diarization.use_separate_streams_when_available to false:

{
  // other create async transcript request configs
  "diarization": {
    "use_separate_streams_when_available": false
  }
}

For details on how to implement/access the transcript using async transcription, see Async Transcription.

Machine diarization

Machine diarization is produced by your third-party transcription provider, not Recall. Instead of using meeting-platform participant information, the transcription provider separates speakers based on voice characteristics and returns a diarized transcript to Recall.

When machine diarization is enabled, the transcript is returned with generic speaker labels such as A, B, C or 0, 1, 2 rather than participant speaker labels. This works best when multiple people may be speaking from the same device or microphone, such as a conference room setup.

Requirements to use machine diarization

To use machine diarization:

  • Your selected third-party transcription provider must support machine diarization (see the provider table below).
🚧

Caveats when using machine diarization

There are a few tradeoffs to be aware of when using machine diarization:

  • The transcript uses generic speaker labels, not participant speaker labels, because the transcription provider does not know which meeting participant each voice belongs to.
  • Accuracy can vary by provider.
  • It can be less accurate when different speakers have similar-sounding voices.

Enabling machine diarization

To enable machine diarization, set the provider-specific diarization field in your transcription provider configuration.

| Provider | Real-time | Async |
|---|---|---|
| Deepgram | deepgram_streaming.diarize: true | deepgram_async.diarize: true |
| ElevenLabs | - | elevenlabs_async.diarize |
| Assembly | assembly_ai_async_chunked.speaker_labels: true | assembly_ai_async.speaker_labels: true |
| Rev | rev_streaming.enable_speaker_switch: true | - |
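The table above can be captured as a small lookup helper. This is a sketch: the config section and flag names come from the table, but the helper function itself is hypothetical, and unsupported combinations return None:

```python
# (provider, mode) -> (config section, flag name), from the table above
MACHINE_DIARIZATION_FLAGS = {
    ("deepgram", "realtime"): ("deepgram_streaming", "diarize"),
    ("deepgram", "async"):    ("deepgram_async", "diarize"),
    ("elevenlabs", "async"):  ("elevenlabs_async", "diarize"),
    ("assembly", "realtime"): ("assembly_ai_async_chunked", "speaker_labels"),
    ("assembly", "async"):    ("assembly_ai_async", "speaker_labels"),
    ("rev", "realtime"):      ("rev_streaming", "enable_speaker_switch"),
}

def provider_config(provider: str, mode: str):
    """Return the provider config dict that enables machine diarization,
    or None when the provider/mode combination isn't supported."""
    entry = MACHINE_DIARIZATION_FLAGS.get((provider, mode))
    if entry is None:
        return None
    section, flag = entry
    return {section: {flag: True}}
```

For example, provider_config("deepgram", "async") yields {"deepgram_async": {"diarize": True}}, which slots into the provider key of the requests shown below.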

Enabling machine diarization for real-time transcription

To configure machine diarization in a Create Bot request, set the provider configs as seen above and recording_config.transcript.diarization.use_separate_streams_when_available to false. An example with Deepgram would look like:

{
  // other create bot request configs
  "recording_config": {
    // other recording_config configs
    "transcript": {
      // other transcript configs
      "diarization": {
        "use_separate_streams_when_available": false
      },
      "provider": {
        "deepgram_streaming": {
          "diarize": true
        }
      }
    }
  } 
}

For details on how to implement and access the transcript using real-time transcription, see the real-time transcription documentation.

Enabling machine diarization for async transcription

To configure machine diarization in a Create Async Transcript request, set the provider configs as seen above and diarization.use_separate_streams_when_available to false. An example with Deepgram would look like:

{
  // other create async transcript request configs
  "diarization": {
    "use_separate_streams_when_available": false
  },
  "provider": {
    "deepgram_async": {
      "diarize": true
    }
  }
}

For details on how to implement/access the transcript using async transcription, see Async Transcription.

FAQ

Why am I seeing Speaker A, Speaker B, or 0, 1, 2 instead of names?

That indicates machine diarization (via a third-party transcription provider) was used. Machine diarization can separate voices, but it can't attach real participant names.

To get participant names, remove provider diarization flags such as:

  • assembly_ai_async.speaker_labels
  • deepgram_async.diarize

Why do multiple speakers calling from the same device appear as the same participant in the transcript?

This usually happens when multiple people are sharing one device or microphone while speaker-timeline diarization is in use (that is, machine diarization is not enabled or not available).

For conference rooms or other shared-device setups, use machine diarization when you only have a mixed audio stream, or hybrid diarization when separate audio streams are available and some streams may contain multiple speakers.

Microsoft Teams: why are speaker names missing or diarization looks wrong?

Teams has a setting that affects whether speakers can be identified in captions/transcripts: Transcription Caption Identification. If this setting is turned off, transcripts will not get diarized properly with multiple speakers.

Where to find this setting in Teams: Accessibility -> Captions and Transcripts -> Transcription -> Automatically identify me in meeting captions and transcripts

Be aware that org-wide Teams policies can override individual user settings.

Are there any additional costs for diarization?

There is no separate diarization feature fee. However, some diarization methods, such as perfect diarization and hybrid diarization, can increase transcription credit usage depending on how audio is processed. See those sections for more details.