Transcription (STT)
Turn audio into text — batch for files and archives, streaming for voice agents and live contact-center monitoring.
Wordcab SDKs, CLI tools, Helm charts, model weights, and deployment packages are delivered directly to each customer for self-hosted installation. They are not publicly published package-manager artifacts, so install commands in these docs are placeholders until your Wordcab team provides your private package source or offline bundle.
Wordcab supports three transcription modes. Pick by how the audio arrives:
- Batch: you already have a file or URL. Submit a job, then poll for the result or receive a webhook.
- Streaming (chunked upload): audio bytes arrive in real time, but you do not need partial results.
- Real-time (WebSocket): live audio with partial hypotheses and interim words, for voice agents, captioning, and contact-center live QA.
Batch transcription
Create a job, then wait. The SDK's wait() helper polls for you; webhooks are covered in the webhooks guide.
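For intuition, a polling helper of this kind can be sketched as a loop with capped exponential backoff. This is a minimal sketch, not Wordcab's implementation: `poll_until_done`, the `fetch` callable, and the status strings "completed" and "failed" are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch: Callable[[], Dict[str, Any]],
    timeout_s: float = 600.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until the job reports a terminal status.

    fetch is a hypothetical callable returning a dict with a "status" key.
    The status values "completed" and "failed" are assumed, not documented.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"transcription failed: {job.get('error')}")
        if time.monotonic() + delay > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, capped at max_delay_s
```

In practice client.transcripts.wait(job.id) does this for you, as in the example below.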
Python

from wordcab import Wordcab

client = Wordcab()

job = client.transcripts.create(
    audio_url="https://example.com/call.wav",
    model="cohere-transcribe-2b",
    language="en",
    diarize=True,
    word_timestamps=True,
    redact=["pii", "phi"],
)
transcript = client.transcripts.wait(job.id)  # polls until the job finishes

print(transcript.text)  # full transcript
for u in transcript.utterances:  # diarized segments
    print(f"[speaker {u.speaker}] {u.text}")

TypeScript

const job = await client.transcripts.create({
  audioUrl: "https://example.com/call.wav",
  model: "cohere-transcribe-2b",
  language: "en",
  diarize: true,
  wordTimestamps: true,
  redact: ["pii", "phi"],
});
const transcript = await client.transcripts.wait(job.id);
console.log(transcript.text);

cURL

curl -X POST https://api.wordcab.com/api/v1/transcripts \
  -H "Authorization: Bearer $WORDCAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.wav",
    "model": "cohere-transcribe-2b",
    "diarize": true,
    "word_timestamps": true
  }'

Uploading a local file
For files not reachable by URL, POST the bytes directly with multipart/form-data. The SDK's audio_file parameter handles this.
with open("call.wav", "rb") as f:
    job = client.transcripts.create(
        audio_file=f,
        model="qwen3-asr",
    )

Streaming STT (WebSocket)
For live audio — voice agents, captioning, contact-center monitors — open a WebSocket, push μ-law or linear PCM frames, and read partial hypotheses as they settle.
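Audio pushed over the socket is typically cut into short fixed-duration frames first. The sketch below splits raw mono linear PCM into 20 ms frames. The helper name `pcm_frames` and the 20 ms frame length are assumptions for illustration, not Wordcab requirements.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,
    frame_ms: int = 20,          # assumed frame length, not a documented value
    bytes_per_sample: int = 2,   # 16-bit linear PCM; mu-law would use 1
) -> Iterator[bytes]:
    """Yield fixed-duration frames from a mono PCM byte string.

    The final frame may be shorter if the input is not a whole number of frames.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

At 16 kHz and 16 bits per sample, each 20 ms frame is 640 bytes. Each frame can then be passed to stream.send_audio, as in the streaming example below.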
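import asyncio
from wordcab import Wordcab

client = Wordcab()

async def pump_audio(stream, chunks):
    # chunks: your own async iterator of PCM frames (mic, telephony leg, file)
    async for pcm_chunk in chunks:
        await stream.send_audio(pcm_chunk)
    await stream.close()  # no more audio; assumed to let the server emit "end"

async def transcribe_live(chunks):
    async with client.audio.transcriptions.stream(
        model="voxtral-realtime",
        sample_rate=16000,
        language="en",
        partial_hypotheses=True,
    ) as stream:
        # Send concurrently so the read loop below is never starved of audio.
        sender = asyncio.create_task(pump_audio(stream, chunks))
        async for event in stream:
            if event.type == "partial":
                print("~", event.text, flush=True)
            elif event.type == "final":
                print(">", event.text)
        await sender

# Run with: asyncio.run(transcribe_live(your_chunks))

Events: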
- partial: interim hypothesis; the text will change. Emitted every ~80 ms by default.
- final: locked utterance boundary. Safe to pass downstream.
- speaker_change: diarization boundary when diarize=True.
- end: stream closed cleanly.
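Because partial text is replaced by later events and only final text is stable, a consumer usually keeps the two apart. A minimal sketch follows. It uses a stand-in `Event` dataclass rather than the SDK's real event class (only the `type` and `text` fields shown in this guide are assumed), and `LiveTranscript` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Stand-in for a stream event; mirrors only the type and text fields."""
    type: str
    text: str = ""


@dataclass
class LiveTranscript:
    """Keep locked utterances separate from the current interim hypothesis."""
    finals: List[str] = field(default_factory=list)
    pending: str = ""
    closed: bool = False

    def apply(self, event: Event) -> None:
        if event.type == "partial":
            self.pending = event.text        # overwrite: partials are provisional
        elif event.type == "final":
            self.finals.append(event.text)   # locked, safe to pass downstream
            self.pending = ""
        elif event.type == "end":
            self.closed = True
        # other event types (e.g. speaker_change) are ignored in this sketch

    def display(self) -> str:
        """Text for a live caption: locked utterances plus the interim tail."""
        return " ".join(self.finals + ([self.pending] if self.pending else []))
```

Downstream systems such as a CRM or QA scorer would read only `finals`, while a live caption view would render `display()`.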
OpenAI-compatible endpoint
If you already call Whisper through the OpenAI SDK, Wordcab's /v1/audio/transcriptions endpoint is a drop-in replacement. Point the client at your Wordcab base URL and API key, and change the model name.
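import os

from openai import OpenAI

# base_url must end in /v1: the SDK appends /audio/transcriptions to it.
client = OpenAI(
    base_url="https://api.wordcab.com/v1",
    api_key=os.environ["WORDCAB_API_KEY"],  # read the key from the environment
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="en",
    )
print(resp.text)

Choosing a model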
| Model | Best for | Latency | Params |
|---|---|---|---|
| qwen3-asr | Real-time + offline. Strong default. | TTFT ~150 ms | ~2B |
| voxtral-realtime | Low-latency streaming with tunable delay. | 200–500 ms configurable | 4B |
| cohere-transcribe-2b | Batch at scale. >30 min audio per GPU-second on H100. | High throughput | 2B |
| whisper-large-v3 | Legacy compat with OpenAI Whisper prompts. | Offline only | 1.5B |
All models run inside your deployment boundary. Audio never leaves your VPC or cluster in self-hosted mode.
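The table can be encoded as a small selection helper so that application code does not hard-code model names in many places. This is one reading of the table, not an official rule: `pick_model` and its flags are hypothetical names introduced here.

```python
def pick_model(
    live: bool = False,
    tunable_delay: bool = False,
    large_batch: bool = False,
    whisper_prompts: bool = False,
) -> str:
    """Map workload traits to a model name from the table (a simplification)."""
    if live:
        # whisper-large-v3 and cohere-transcribe-2b are not listed for live use
        return "voxtral-realtime" if tunable_delay else "qwen3-asr"
    if whisper_prompts:
        return "whisper-large-v3"      # legacy prompt compatibility, offline only
    if large_batch:
        return "cohere-transcribe-2b"  # throughput-oriented batch
    return "qwen3-asr"                 # the table's "strong default"
```

For example, `pick_model(live=True, tunable_delay=True)` selects voxtral-realtime, and a call with no flags falls back to qwen3-asr.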
Diarization and redaction
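Pass diarize=True to get speaker-labeled utterances. Diarization uses Pyannote 3.3, tuned by Wordcab, which reaches a diarization error rate (DER) of about 9% on telephony test sets. Pass redact=["pii","phi"] to mask named entities that fall under personally identifiable information (PII) or protected health information (PHI) before the transcript is stored. Redaction runs inside the deployment boundary; the unredacted transcript is never persisted.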
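For reference, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. A small sketch with made-up durations (not Wordcab benchmark data) shows what a figure near 9% means:

```python
def diarization_error_rate(
    missed_s: float,
    false_alarm_s: float,
    confusion_s: float,
    total_speech_s: float,
) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech."""
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (missed_s + false_alarm_s + confusion_s) / total_speech_s


# Illustrative numbers only: a 10-minute call with 600 s of reference speech.
der = diarization_error_rate(
    missed_s=18.0, false_alarm_s=12.0, confusion_s=24.0, total_speech_s=600.0
)
print(f"DER = {der:.1%}")
```

Here 54 seconds of the 600 seconds of speech are attributed incorrectly in one of the three ways.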
For domain terms (drug names, SKUs, account number formats), attach a vocabulary at request time or at the deployment level. See Adapt for the fine-tuning path.