Transcription (STT)
Turn audio into text — batch for files and archives, streaming for voice agents and live contact-center monitoring.
Wordcab supports three transcription modes. Pick by how the audio arrives:
- Batch — you already have a file or URL. Poll or webhook.
- Streaming (chunked upload) — you have audio bytes arriving in real time but no need for partial results.
- Real-time (WebSocket) — live audio with partial hypotheses and interim words for voice agents, captioning, and contact-center live QA.
Batch transcription
Create a job, then wait. The SDK's wait() helper polls for you; webhooks are covered in the webhooks guide.
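Under the hood, `wait()` amounts to a poll-until-terminal loop. A minimal sketch of that pattern (the `fetch_job` callable and the `"completed"`/`"failed"` status values are assumptions for illustration, not the SDK's actual internals):

```python
import time

def wait_for_transcript(fetch_job, job_id, interval_s=2.0, timeout_s=600.0):
    """Poll fetch_job(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch_job(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"transcription failed: {job.get('error')}")
        time.sleep(interval_s)  # back off between polls
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
```

For long files, prefer webhooks over tight polling so you are not holding a connection open for minutes.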
```python
from wordcab import Wordcab

client = Wordcab()

job = client.transcripts.create(
    audio_url="https://example.com/call.wav",
    model="cohere-transcribe-2b",
    language="en",
    diarize=True,
    word_timestamps=True,
    redact=["pii", "phi"],
)

transcript = client.transcripts.wait(job.id)
print(transcript.text)  # full transcript
for u in transcript.utterances:  # diarized segments
    print(f"[speaker {u.speaker}] {u.text}")
```

```typescript
const job = await client.transcripts.create({
  audioUrl: "https://example.com/call.wav",
  model: "cohere-transcribe-2b",
  language: "en",
  diarize: true,
  wordTimestamps: true,
  redact: ["pii", "phi"],
});

const transcript = await client.transcripts.wait(job.id);
console.log(transcript.text);
```

```bash
curl -X POST https://api.wordcab.com/api/v1/transcripts \
  -H "Authorization: Bearer $WORDCAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.wav",
    "model": "cohere-transcribe-2b",
    "diarize": true,
    "word_timestamps": true
  }'
```

Uploading a local file
For files not reachable by URL, POST the bytes directly with multipart/form-data. The SDK's audio_file parameter handles this.
```python
with open("call.wav", "rb") as f:
    job = client.transcripts.create(
        audio_file=f,
        model="qwen3-asr",
    )
```

Streaming STT (WebSocket)
For live audio — voice agents, captioning, contact-center monitors — open a WebSocket, push μ-law or linear PCM frames, and read partial hypotheses as they settle.
```python
from wordcab import Wordcab

client = Wordcab()

async with client.audio.transcriptions.stream(
    model="voxtral-realtime",
    sample_rate=16000,
    language="en",
    partial_hypotheses=True,
) as stream:
    # In practice, send frames from a concurrent task while reading events.
    await stream.send_audio(pcm_chunk)  # drive from your audio source
    async for event in stream:
        if event.type == "partial":
            print("~", event.text, flush=True)
        elif event.type == "final":
            print(">", event.text)
    await stream.close()
```

Events:
- `partial` — interim hypothesis; the text will change. Emitted every ~80 ms by default.
- `final` — locked utterance boundary. Safe to pass downstream.
- `speaker_change` — diarization boundary when `diarize=True`.
- `end` — stream closed cleanly.
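Frame sizing for `send_audio` follows from the audio format: 16 kHz, 16-bit mono linear PCM at 80 ms per frame is 16000 × 0.08 × 2 = 2560 bytes. A chunking helper for a raw PCM buffer (a sketch; the 80 ms frame size here mirrors the default partial cadence above and is not a documented requirement):

```python
def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 80,
               bytes_per_sample: int = 2):
    """Split a raw PCM buffer into fixed-duration frames (last may be short)."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for i in range(0, len(pcm), frame_bytes):
        yield pcm[i:i + frame_bytes]
```

Each yielded frame can be handed to `stream.send_audio()` as it is produced.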
OpenAI-compatible endpoint
If you already use the OpenAI Whisper SDK, Wordcab's /v1/audio/transcriptions is a drop-in replacement: point the client at Wordcab's base URL and swap the model name; everything else stays the same.
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.wordcab.com/v1",
    api_key=os.environ["WORDCAB_API_KEY"],
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="en",
    )

print(resp.text)
```

Choosing a model
| Model | Best for | Latency | Params |
|---|---|---|---|
| `qwen3-asr` | Real-time + offline. Strong default. | TTFT ~150 ms | ~2B |
| `voxtral-realtime` | Low-latency streaming with tunable delay. | 200–500 ms configurable | 4B |
| `cohere-transcribe-2b` | Batch at scale. >30 min audio per GPU-second on H100. | High throughput | 2B |
| `whisper-large-v3` | Legacy compatibility with OpenAI Whisper prompts. | Offline only | 1.5B |
All models run inside your deployment boundary. Audio never leaves your VPC or cluster in self-hosted mode.
Diarization and redaction
Pass diarize=True to get speaker-labeled utterances. Pyannote 3.3, tuned by Wordcab, delivers ~9% DER on telephony test sets. Pass redact=["pii","phi"] to mask named entities before the transcript is stored. Redaction runs inside the boundary; the unredacted transcript is never persisted.
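Diarized output is easy to reshape downstream. For example, collapsing utterances into one text blob per speaker (a sketch over plain dicts that mirror the utterance fields shown in the batch example):

```python
from collections import defaultdict

def text_by_speaker(utterances):
    """Merge diarized utterance text into a single string per speaker."""
    merged = defaultdict(list)
    for u in utterances:
        merged[u["speaker"]].append(u["text"])
    return {speaker: " ".join(parts) for speaker, parts in merged.items()}
```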
For domain terms (drug names, SKUs, account number formats), attach a vocabulary at request time or at the deployment level. See Adapt for the fine-tuning path.
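As a request-time sketch, a vocabulary might ride along in the job body like this (the `vocabulary` field name and the example terms are assumptions; check your deployment's API reference for the exact parameter):

```json
{
  "audio_url": "https://example.com/call.wav",
  "model": "qwen3-asr",
  "vocabulary": ["Ozempic", "SKU-4411", "routing number"]
}
```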