
Transcription (STT)

Turn audio into text — batch for files and archives, streaming for voice agents and live contact-center monitoring.

Wordcab supports three transcription modes. Pick by how the audio arrives:

  • Batch — you already have a file or URL. Poll or webhook.
  • Streaming (chunked upload) — you have audio bytes arriving in real time but no need for partial results.
  • Real-time (WebSocket) — live audio with partial hypotheses and interim words for voice agents, captioning, and contact-center live QA.

Batch transcription

Create a job, then wait. The SDK's wait() helper polls for you; webhooks are covered in the webhooks guide.

python
from wordcab import Wordcab

client = Wordcab()

job = client.transcripts.create(
    audio_url="https://example.com/call.wav",
    model="cohere-transcribe-2b",
    language="en",
    diarize=True,
    word_timestamps=True,
    redact=["pii", "phi"],
)

transcript = client.transcripts.wait(job.id)

print(transcript.text)                 # full transcript
for u in transcript.utterances:        # diarized segments
    print(f"[speaker {u.speaker}] {u.text}")

javascript
const job = await client.transcripts.create({
  audioUrl: "https://example.com/call.wav",
  model: "cohere-transcribe-2b",
  language: "en",
  diarize: true,
  wordTimestamps: true,
  redact: ["pii", "phi"],
});

const transcript = await client.transcripts.wait(job.id);
console.log(transcript.text);

bash
curl -X POST https://api.wordcab.com/api/v1/transcripts \
  -H "Authorization: Bearer $WORDCAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.wav",
    "model": "cohere-transcribe-2b",
    "diarize": true,
    "word_timestamps": true
  }'
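If you are not using an SDK, the wait() helper's behavior is an ordinary polling loop over the job resource. A minimal sketch with only the standard library, assuming the status endpoint is GET /api/v1/transcripts/{id} and the response carries a "status" field that settles to "completed" or "failed" (the endpoint path and field values are assumptions; check the API reference):

```python
import json
import os
import time
import urllib.request

API = "https://api.wordcab.com/api/v1"

def fetch_job(job_id: str) -> dict:
    """GET the job resource (assumed endpoint shape)."""
    req = urllib.request.Request(
        f"{API}/transcripts/{job_id}",
        headers={"Authorization": f"Bearer {os.environ.get('WORDCAB_API_KEY', '')}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_transcript(job_id, fetch=fetch_job, interval=2.0, timeout=600.0):
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = fetch(job_id)
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

The `fetch` parameter is injectable so the loop can be exercised without the network; webhooks avoid polling entirely.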

Uploading a local file

For files not reachable by URL, POST the bytes directly with multipart/form-data. The SDK's audio_file parameter handles this.

python
with open("call.wav", "rb") as f:
    job = client.transcripts.create(
        audio_file=f,
        model="qwen3-asr",
    )
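Under the hood this is a plain multipart/form-data POST. A sketch of assembling the body with only the standard library (it has no multipart builder), assuming the file part is named audio_file to match the SDK parameter:

```python
import uuid

def build_multipart(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Assemble a multipart/form-data body and its Content-Type header by hand."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

POST the returned body to the transcripts endpoint with the returned Content-Type header; in practice the SDK or a library like requests does this for you.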

Real-time STT (WebSocket)

For live audio — voice agents, captioning, contact-center monitors — open a WebSocket, push μ-law or linear PCM frames, and read partial hypotheses as they settle.

python
import asyncio

from wordcab import Wordcab

client = Wordcab()

async def transcribe(audio_source):
    async with client.audio.transcriptions.stream(
        model="voxtral-realtime",
        sample_rate=16000,
        language="en",
        partial_hypotheses=True,
    ) as stream:
        # Push audio from your source while the loop below reads events;
        # sending and receiving must run concurrently.
        async def pump():
            async for pcm_chunk in audio_source:
                await stream.send_audio(pcm_chunk)
            await stream.close()            # end-of-audio; the stream emits "end"

        pump_task = asyncio.create_task(pump())

        async for event in stream:
            if event.type == "partial":
                print("~", event.text, flush=True)
            elif event.type == "final":
                print(">", event.text)

        await pump_task

Events:

  • partial — interim hypothesis; the text will change. Emitted every ~80 ms by default.
  • final — locked utterance boundary. Safe to pass downstream.
  • speaker_change — diarization boundary when diarize=True.
  • end — stream closed cleanly.
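The event sequence above folds naturally into speaker-labeled lines: track the current speaker across speaker_change events and keep only final text. A sketch, assuming dict-shaped events with type, text, and speaker keys (the real event objects use attributes, as in the loop above):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TranscriptAssembler:
    """Fold a stream's events into (speaker, text) lines."""
    speaker: int = 0
    lines: List[Tuple[int, str]] = field(default_factory=list)

    def handle(self, event: dict) -> None:
        if event["type"] == "speaker_change":
            self.speaker = event["speaker"]        # new diarization boundary
        elif event["type"] == "final":
            self.lines.append((self.speaker, event["text"]))
        # "partial" text is interim and dropped; "end" needs no action
```

Only final events are accumulated because partial text will be rewritten as the hypothesis settles.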

OpenAI-compatible endpoint

If you already use the OpenAI Whisper SDK, Wordcab's /v1/audio/transcriptions is a drop-in replacement. Only the model name changes.

python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.wordcab.com/v1",
    api_key=os.environ["WORDCAB_API_KEY"],
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="en",
    )
print(resp.text)

Choosing a model

  • qwen3-asr — real-time + offline; strong default. TTFT ~150 ms. ~2B params.
  • voxtral-realtime — low-latency streaming with tunable delay. 200–500 ms, configurable. 4B params.
  • cohere-transcribe-2b — batch at scale; >30 min of audio per GPU-second on H100. High throughput. 2B params.
  • whisper-large-v3 — legacy compatibility with OpenAI Whisper prompts. Offline only. 1.5B params.

All models run inside your deployment boundary. Audio never leaves your VPC or cluster in self-hosted mode.

Diarization and redaction

Pass diarize=True to get speaker-labeled utterances. Pyannote 3.3, tuned by Wordcab, delivers ~9% DER on telephony test sets. Pass redact=["pii","phi"] to mask named entities before the transcript is stored. Redaction runs inside the boundary; the unredacted transcript is never persisted.
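Diarized utterances make per-speaker analytics straightforward. A sketch computing talk time per speaker, assuming each utterance also carries start and end timestamps in seconds (speaker matches the batch example above; the timestamp field names are an assumption):

```python
from collections import defaultdict

def talk_time_by_speaker(utterances):
    """Sum seconds of speech per speaker from diarized utterances."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u["speaker"]] += u["end"] - u["start"]
    return dict(totals)
```

The same fold gives talk-ratio metrics for contact-center QA (e.g. agent vs. caller share of the call).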

Custom vocabulary

For domain terms (drug names, SKUs, account number formats), attach a vocabulary at request time or at the deployment level. See Adapt for the fine-tuning path.
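Vocabulary lists are worth normalizing (trimming, deduplicating) before attaching them. A sketch, with the request-time vocabulary parameter name shown only as an assumption; check the API reference for the exact field:

```python
def normalize_vocabulary(terms):
    """Trim whitespace and drop empty or case-duplicate domain terms."""
    seen, out = set(), []
    for t in terms:
        t = t.strip()
        if t and t.lower() not in seen:
            seen.add(t.lower())
            out.append(t)
    return out

# Attach at request time (the `vocabulary` parameter name is hypothetical):
# job = client.transcripts.create(
#     audio_url="https://example.com/call.wav",
#     model="qwen3-asr",
#     vocabulary=normalize_vocabulary(["Ozempic", " ozempic ", "SKU-44871"]),
# )
```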