Diarization

Last updated 8 months ago

The Diarization method automatically detects and separates speakers within a single audio channel, so it can handle audio in which multiple speakers are mixed together. The trade-off is speed: processing takes 2-3 times longer than transcription without diarization.


Key Features

  • Use Case: Suitable for mixed audio inputs where speakers are not on separate channels.

  • Accuracy: Variable, dependent on the complexity of the audio and number of speakers.

  • Performance: Slow, taking 2-3 times longer than non-diarized processing.

  • Recommendation: Use only when multichannel separation is not feasible.


Transcribe with Diarization

Introduction

This section describes the various ways to use diarization with the Gowajee speech-to-text (STT) API. Diarization is the process of identifying and separating speakers within an audio input. Below are the options for configuring diarization to best suit your needs.

Methods to Enable Diarization


Enable Automatic Diarization

Set diarization to true to enable automatic detection of the number of speakers and speaker separation.

Request:

POST /v1/speech-to-text/${MODEL}/transcribe HTTP/1.1
Host: api.gowajee.ai
Content-Type: application/json
X-Api-Key: ${YOUR_API_KEY}

{
  "audioData": "base64_encoded_raw_audio_data",
  "diarization": true
}

Response:

{
    "type": "ASR_PULSE_WITH_DIARIZE",
    "amount": 6.248,
    "output": {
        "results": [
            {
                "transcript": "สวัสดีค่ะก่อนอื่นทางเราขอให้คุณยืนยันตัวตนผ่านระบบเสียง",
                "startTime": 0,
                "endTime": 5.117,
                "speaker": "SPEAKER_00"
            },
            {
                "transcript": "ได้เลยครับ",
                "startTime": 5.231,
                "endTime": 6.248,
                "speaker": "SPEAKER_01"
            }
        ],
        "duration": 6.248,
        "version": "2.2.0"
    }
}
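
As a rough illustration of the request above, the following Python sketch builds the same JSON body using only the standard library. The function names, the `https` scheme, and the timeout value are assumptions made for this example; the model name is a placeholder for `${MODEL}`, and the audio bytes are assumed to already be in a format the API accepts.

```python
import base64
import json
import urllib.request

# Host taken from the request example; the https scheme is an assumption.
API_HOST = "https://api.gowajee.ai"


def build_diarization_request(audio_bytes: bytes, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a transcription request with automatic diarization."""
    payload = {
        "audioData": base64.b64encode(audio_bytes).decode("ascii"),
        "diarization": True,
    }
    return urllib.request.Request(
        url=f"{API_HOST}/v1/speech-to-text/{model}/transcribe",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
        method="POST",
    )


def transcribe(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body (hypothetical helper)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because diarized requests take longer than plain transcription, a generous client-side timeout is probably sensible; the 120-second default here is arbitrary.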

Specify Number of Speakers

Set diarization to true and define numSpeakers (Integer) to specify the number of speakers in the audio.

Request:

POST /v1/speech-to-text/${MODEL}/transcribe HTTP/1.1
Host: api.gowajee.ai
Content-Type: application/json
X-Api-Key: ${YOUR_API_KEY}

{
  "audioData": "base64_encoded_raw_audio_data",
  "diarization": true,
  "numSpeakers": 2
}

Response:

{
    "type": "ASR_PULSE_WITH_DIARIZE",
    "amount": 6.248,
    "output": {
        "results": [
            {
                "transcript": "สวัสดีค่ะก่อนอื่นทางเราขอให้คุณยืนยันตัวตนผ่านระบบเสียง",
                "startTime": 0,
                "endTime": 5.117,
                "speaker": "SPEAKER_00"
            },
            {
                "transcript": "ได้เลยครับ",
                "startTime": 5.231,
                "endTime": 6.248,
                "speaker": "SPEAKER_01"
            }
        ],
        "duration": 6.248,
        "version": "2.2.0"
    }
}
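
The response shape shown above can be consumed directly once it is decoded from JSON. The sketch below sums the speaking time of each speaker from the `startTime` and `endTime` fields; the helper name `talk_time_by_speaker` is hypothetical, and the sample data is trimmed from the response example.

```python
from collections import defaultdict


def talk_time_by_speaker(response: dict) -> dict:
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in response["output"]["results"]:
        totals[segment["speaker"]] += segment["endTime"] - segment["startTime"]
    # Round to milliseconds to hide floating-point noise.
    return {speaker: round(seconds, 3) for speaker, seconds in totals.items()}


# Trimmed copy of the response example (transcripts omitted for brevity).
sample_response = {
    "output": {
        "results": [
            {"startTime": 0, "endTime": 5.117, "speaker": "SPEAKER_00"},
            {"startTime": 5.231, "endTime": 6.248, "speaker": "SPEAKER_01"},
        ],
        "duration": 6.248,
    }
}

totals = talk_time_by_speaker(sample_response)
```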

Define Range of Speakers

Set diarization to true and define minSpeakers (Integer) and maxSpeakers (Integer) to have the API detect the number of speakers automatically within the specified range and separate them.

Request:

POST /v1/speech-to-text/${MODEL}/transcribe HTTP/1.1
Host: api.gowajee.ai
Content-Type: application/json
X-Api-Key: ${YOUR_API_KEY}

{
  "audioData": "base64_encoded_raw_audio_data",
  "diarization": true,
  "minSpeakers": 1,
  "maxSpeakers": 2
}

Response:

{
    "type": "ASR_PULSE_WITH_DIARIZE",
    "amount": 6.248,
    "output": {
        "results": [
            {
                "transcript": "สวัสดีค่ะก่อนอื่นทางเราขอให้คุณยืนยันตัวตนผ่านระบบเสียง",
                "startTime": 0,
                "endTime": 5.117,
                "speaker": "SPEAKER_00"
            },
            {
                "transcript": "ได้เลยครับ",
                "startTime": 5.231,
                "endTime": 6.248,
                "speaker": "SPEAKER_01"
            }
        ],
        "duration": 6.248,
        "version": "2.2.0"
    }
}
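
The three speaker-count options (automatic, fixed `numSpeakers`, and a `minSpeakers`/`maxSpeakers` range) differ only in which optional fields are sent. A minimal sketch of a payload builder follows. The function name is hypothetical, and the two validation rules (not mixing a fixed count with a range, and `minSpeakers` not exceeding `maxSpeakers`) are client-side sanity checks assumed for this example, not documented API behavior.

```python
import base64
from typing import Optional


def build_diarization_payload(
    audio_bytes: bytes,
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> dict:
    """Build the JSON body for a diarized transcription request."""
    has_range = min_speakers is not None or max_speakers is not None
    # Assumed client-side checks; the API's own validation is not documented here.
    if num_speakers is not None and has_range:
        raise ValueError("use either num_speakers or a min/max range, not both")
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")

    payload = {
        "audioData": base64.b64encode(audio_bytes).decode("ascii"),
        "diarization": True,
    }
    optional_fields = {
        "numSpeakers": num_speakers,
        "minSpeakers": min_speakers,
        "maxSpeakers": max_speakers,
    }
    # Only include the options the caller actually set.
    payload.update({key: value for key, value in optional_fields.items() if value is not None})
    return payload
```

Calling `build_diarization_payload(audio, min_speakers=1, max_speakers=2)` yields the same fields as the range request above, while omitting all three options falls back to automatic diarization.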

Reference Speakers (Recommended)

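Set diarization to true and define refSpeakers, a list of reference voice samples paired with speaker names. This method improves accuracy by using known speaker samples, and each segment in the response is labeled with the supplied name.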

Request:

  • application/json

POST /v1/speech-to-text/${MODEL}/transcribe HTTP/1.1
Host: api.gowajee.ai
Content-Type: application/json
X-Api-Key: ${YOUR_API_KEY}

{
  "audioData": "base64_encoded_raw_audio_data",
  "diarization": true,
  "refSpeakers": [
    {
      "name": "Adam",
      "audioData": "base64_encoded_voice_of_adam"
    },
    {
      "name": "Bill",
      "audioData": "base64_encoded_voice_of_bill"
    }
  ]
}
  • multipart/form-data

For file uploads using multipart/form-data, the API uses the filename of each refSpeakers file as the speaker name.

POST /v1/speech-to-text/${MODEL}/transcribe HTTP/1.1
Host: api.gowajee.ai
x-api-key: ${YOUR_API_KEY}
Content-Length: ${AUTO}
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW

------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="audioData"; filename="audio.wav"
Content-Type: audio/wav

(data)
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="refSpeakers"; filename="Adam.wav"
Content-Type: audio/wav

(data)
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="refSpeakers"; filename="Bill.wav"
Content-Type: audio/wav

(data)
------WebKitFormBoundary7MA4YWxkTrZu0gW--
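
For readers assembling the multipart body by hand, the sketch below encodes one `audioData` part and any number of `refSpeakers` parts with the Python standard library. The function name `encode_multipart` is hypothetical, every part is assumed to be WAV audio as in the example, and the random boundary replaces the browser-generated one shown above.

```python
import uuid


def encode_multipart(audio: tuple, ref_speakers: list) -> tuple:
    """Return (content_type, body) for a multipart/form-data upload.

    `audio` is a (filename, bytes) pair; `ref_speakers` is a list of such pairs.
    """
    boundary = uuid.uuid4().hex
    parts = [("audioData", audio[0], audio[1])]
    parts += [("refSpeakers", filename, data) for filename, data in ref_speakers]

    chunks = []
    for field, filename, data in parts:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            "Content-Type: audio/wav\r\n"
            "\r\n"
        )
        chunks.append(header.encode("utf-8"))
        chunks.append(data)
        chunks.append(b"\r\n")
    # The closing delimiter carries a trailing "--".
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)
```

The returned content type goes into the `Content-Type` header and the body is sent as-is; most HTTP client libraries can also generate an equivalent body automatically.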

Response:

{
    "type": "ASR_PULSE_WITH_DIARIZE",
    "amount": 6.248,
    "output": {
        "results": [
            {
                "transcript": "สวัสดีครับก่อนอื่นทางเราขอให้คุณยืนยันตัวตนผ่านระบบเสียง",
                "startTime": 0,
                "endTime": 5.117,
                "speaker": "Adam"
            },
            {
                "transcript": "ได้เลยครับ",
                "startTime": 5.231,
                "endTime": 6.248,
                "speaker": "Bill"
            }
        ],
        "duration": 6.248,
        "version": "2.2.0"
    }
}
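
To tie the JSON request and the response together, the following sketch builds a `refSpeakers` payload from named voice samples and renders a response as speaker-prefixed lines. Both helper names are hypothetical, and the reference samples are assumed to be audio bytes the API accepts.

```python
import base64


def _b64(data: bytes) -> str:
    """Base64-encode bytes as an ASCII string."""
    return base64.b64encode(data).decode("ascii")


def build_ref_speaker_payload(audio_bytes: bytes, reference_voices: dict) -> dict:
    """Build the JSON body; `reference_voices` maps speaker name to sample bytes."""
    return {
        "audioData": _b64(audio_bytes),
        "diarization": True,
        "refSpeakers": [
            {"name": name, "audioData": _b64(voice)}
            for name, voice in reference_voices.items()
        ],
    }


def format_dialogue(response: dict) -> list:
    """Render each result segment as 'speaker: transcript'."""
    return [
        f'{segment["speaker"]}: {segment["transcript"]}'
        for segment in response["output"]["results"]
    ]
```

With the response above, `format_dialogue` would produce one line for Adam and one for Bill, in the order the segments appear.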
