Transcription (STT)

Turn audio into text — batch for files and archives, streaming for voice agents and live contact-center monitoring.

Package availability

Wordcab SDKs, CLI tools, Helm charts, model weights, and deployment packages are delivered directly to each customer for self-hosted installation. They are not publicly published package-manager artifacts, so install commands in these docs are placeholders until your Wordcab team provides your private package source or offline bundle.

Wordcab supports three transcription modes. Pick by how the audio arrives:

Batch: you already have a file or URL. Submit a job, then poll for the result or receive a webhook.
Streaming (chunked upload): audio bytes arrive in real time, but you do not need partial results.
Real-time (WebSocket): live audio with partial hypotheses and interim words, for voice agents, captioning, and contact-center live QA.

Batch transcription

Create a job, then wait. The SDK's wait() helper polls for you; webhooks are covered in the webhooks guide.

from wordcab import Wordcab

client = Wordcab()

job = client.transcripts.create(
    audio_url="https://example.com/call.wav",
    model="cohere-transcribe-2b",
    language="en",
    diarize=True,
    word_timestamps=True,
    redact=["pii", "phi"],
)

transcript = client.transcripts.wait(job.id)

print(transcript.text)                 # full transcript
for u in transcript.utterances:        # diarized segments
    print(f"[speaker {u.speaker}] {u.text}")

const job = await client.transcripts.create({
  audioUrl: "https://example.com/call.wav",
  model: "cohere-transcribe-2b",
  language: "en",
  diarize: true,
  wordTimestamps: true,
  redact: ["pii", "phi"],
});

const transcript = await client.transcripts.wait(job.id);
console.log(transcript.text);

curl -X POST https://api.wordcab.com/api/v1/transcripts \
  -H "Authorization: Bearer $WORDCAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.wav",
    "model": "cohere-transcribe-2b",
    "diarize": true,
    "word_timestamps": true
  }'
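
For a sense of what a polling helper such as wait() does, here is a minimal sketch. It is not the SDK's implementation: the `fetch_status` callable, the `"completed"` and `"failed"` status strings, and the backoff values are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],  # hypothetical: returns the job's current status
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, backing off exponentially."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

Taking `sleep` as a parameter lets you exercise the loop in tests without real delays.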

Uploading a local file

For files not reachable by URL, POST the bytes directly with multipart/form-data. The SDK's audio_file parameter handles this.

python

with open("call.wav", "rb") as f:
    job = client.transcripts.create(
        audio_file=f,
        model="qwen3-asr",
    )

Streaming STT (WebSocket)

For live audio — voice agents, captioning, contact-center monitors — open a WebSocket, push μ-law or linear PCM frames, and read partial hypotheses as they settle.

python

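import asyncio

from wordcab import Wordcab

client = Wordcab()


async def transcribe_live(audio_chunks):
    # audio_chunks is your own async iterator of PCM byte chunks (not part of the SDK)
    async with client.audio.transcriptions.stream(
        model="voxtral-realtime",
        sample_rate=16000,
        language="en",
        partial_hypotheses=True,
    ) as stream:

        async def send():
            # Push audio while events are being read, not after the read loop
            async for pcm_chunk in audio_chunks:
                await stream.send_audio(pcm_chunk)
            await stream.close()  # assumed to end the event loop once finals are flushed

        sender = asyncio.create_task(send())
        async for event in stream:
            if event.type == "partial":
                print("~", event.text, flush=True)
            elif event.type == "final":
                print(">", event.text)
        await sender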
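
The streaming call takes audio in small chunks. A minimal sketch of slicing a raw PCM buffer into fixed-duration frames follows. The 20 ms frame length and the helper name `pcm_frames` are assumptions for illustration, not documented Wordcab requirements.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,
    frame_ms: int = 20,      # assumed frame length; tune for your deployment
    sample_width: int = 2,   # bytes per sample for 16-bit linear PCM
    channels: int = 1,
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from a raw PCM buffer."""
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width * channels
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]  # the last frame may be shorter
```

For 8 kHz μ-law telephony audio, `sample_rate=8000` and `sample_width=1` give 160-byte frames with the same 20 ms duration.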

Events:

partial — interim hypothesis; the text will change. Emitted every ~80 ms by default.
final — locked utterance boundary. Safe to pass downstream.
speaker_change — diarization boundary when diarize=True.
end — stream closed cleanly.
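
One way to consume these events is to keep every final and let each partial overwrite the previous one. The sketch below uses a stand-in `Event` dataclass with only the `type` and `text` fields seen in the example above; real SDK event objects may carry more, and `TranscriptBuffer` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Stand-in for an SDK event; only `type` and `text` are assumed."""
    type: str
    text: str = ""


@dataclass
class TranscriptBuffer:
    finals: List[str] = field(default_factory=list)
    pending: str = ""     # latest partial; replaced, never appended
    closed: bool = False

    def handle(self, event: Event) -> None:
        if event.type == "partial":
            self.pending = event.text        # interim text may still change
        elif event.type == "final":
            self.finals.append(event.text)   # locked; safe to pass downstream
            self.pending = ""
        elif event.type == "end":
            self.closed = True
        # other event types (e.g. speaker_change) are ignored in this sketch

    def display(self) -> str:
        """Finalized text followed by the current partial, if any."""
        return " ".join(self.finals + ([self.pending] if self.pending else []))
```

Only the contents of `finals` should be stored or forwarded; `display()` is meant for live captions, where showing unsettled text is acceptable.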

OpenAI-compatible endpoint

If you already use the OpenAI SDK for Whisper transcription, Wordcab's /v1/audio/transcriptions endpoint is a drop-in replacement. Point the client at your Wordcab base URL and API key, and change the model name.

python

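import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.wordcab.com/v1",  # the SDK appends /audio/transcriptions
    api_key=os.environ["WORDCAB_API_KEY"],  # read the key from the environment
)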

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="en",
    )
print(resp.text)

Choosing a model

Model	Best for	Latency	Params
`qwen3-asr`	Real-time + offline. Strong default.	TTFT ~150 ms	~2B
`voxtral-realtime`	Low-latency streaming with tunable delay.	200–500 ms configurable	4B
`cohere-transcribe-2b`	Batch at scale. >30 min audio per GPU-second on H100.	High throughput	2B
`whisper-large-v3`	Legacy compat with OpenAI Whisper prompts.	Offline only	1.5B

All models run inside your deployment boundary. Audio never leaves your VPC or cluster in self-hosted mode.
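
If you route requests programmatically, the table can be encoded as a small helper. This is one reading of the table, not an official selection rule; the function name and its flags are hypothetical.

```python
def pick_model(
    live: bool,
    tunable_delay: bool = False,    # you want to trade latency against accuracy
    bulk: bool = False,             # large offline backlog, throughput matters most
    whisper_prompts: bool = False,  # you rely on OpenAI Whisper-style prompts
) -> str:
    """Map a workload description to one of the model names in the table."""
    if whisper_prompts:
        if live:
            raise ValueError("whisper-large-v3 is offline only")
        return "whisper-large-v3"
    if live and tunable_delay:
        return "voxtral-realtime"
    if not live and bulk:
        return "cohere-transcribe-2b"
    return "qwen3-asr"  # listed as the strong default for real-time and offline
```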

Diarization and redaction

Pass diarize=True to get speaker-labeled utterances. Pyannote 3.3, tuned by Wordcab, delivers a diarization error rate (DER) of ~9% on telephony test sets. Pass redact=["pii","phi"] to mask named entities before the transcript is stored. Redaction runs inside the boundary; the unredacted transcript is never persisted.
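
Downstream code often regroups diarized utterances by speaker. A minimal sketch, using a stand-in `Utterance` dataclass with the `speaker` and `text` fields from the batch example; the real SDK objects may differ, and the integer speaker label is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Utterance:
    """Stand-in for an SDK utterance; only `speaker` and `text` are assumed."""
    speaker: int
    text: str


def by_speaker(utterances: Iterable[Utterance]) -> Dict[int, List[str]]:
    """Collect each speaker's utterances in their original order."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for u in utterances:
        grouped[u.speaker].append(u.text)
    return dict(grouped)
```

For example, passing `transcript.utterances` from the batch example would return a dictionary keyed by speaker label, provided the objects expose the same two fields.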

Custom vocabulary

For domain terms (drug names, SKUs, account number formats), attach a vocabulary at request time or at the deployment level. See Adapt for the fine-tuning path.
