The private voice AI runtime your team won't have to build from scratch.
Wordcab Voice is a deployable runtime for transcription, speech generation, and voice workflows — one control surface, running on infrastructure your team controls.
Production numbers, not marketing ones.
The defaults Wordcab ships on Production-tier hardware. Pilot traffic usually validates them within a week.
Real-time latency
Qwen3-ASR or Voxtral Realtime on a single L40S, with VAD and custom endpointing tuned for telephony audio.
Throughput
Streaming STT with INT8 quantization and tensor parallelism. Scales horizontally under the same control plane.
Inside the boundary
Audio, transcripts, summaries, and artifacts stay in your VPC, data center, or airgapped environment. Always.
Drop-in API
/v1/audio/transcriptions, /v1/audio/speech
Point your existing OpenAI SDK at a Wordcab endpoint. Application code does not change when models do.
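As a sketch of what OpenAI-compatibility means at the wire level, the request below targets the /v1/audio/speech route using only the standard library. The host, API key, and model name are illustrative assumptions — in practice your existing OpenAI SDK needs only its base_url swapped to the same host.

```python
# Minimal sketch of the OpenAI-compatible speech route. Host, key, and
# model name are illustrative placeholders, not real Wordcab values.
import json
import urllib.request

BASE_URL = "https://voice.internal.example.com/v1"  # your Wordcab endpoint

def speech_request(text: str, model: str = "qwen3-tts", voice: str = "default"):
    """Build a POST to the OpenAI-compatible /v1/audio/speech route."""
    payload = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=payload,
        headers={
            "Authorization": "Bearer YOUR_WORDCAB_KEY",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = speech_request("Your call may be recorded.")
# urllib.request.urlopen(req) would return audio bytes from the runtime.
```

Because the route shape matches OpenAI's, swapping models server-side changes only the `model` string — application code stays put.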
Operable after launch
Prometheus, OpenTelemetry, Grafana dashboards, preflight checks, support bundles — all in the chart.
The 2026 open voice landscape, ranked by where it actually ships well.
Wordcab Voice tracks the open model landscape and re-baselines defaults quarterly. Every model below is Apache-2.0 or MIT, runs inside your boundary, and ships with a tuned vLLM or ONNX config.
| Model | Role | Params | Latency / throughput | When we default to it |
|---|---|---|---|---|
| Qwen3-ASR | Streaming + offline STT | ~2B | TTFT ~150 ms on vLLM; streams at >300 concurrent on L40S | Default STT for real-time voice agents and mixed batch/streaming |
| Voxtral Realtime | Low-latency streaming STT | 4B | Configurable delay 200–500 ms; competitive with Whisper-large-v3 on multilingual | Live contact-center streams, voice agents with strict interruption budgets |
| Cohere Transcribe 2B | Batch STT at scale | 2B | High-throughput offline: >30 minutes of audio per GPU-second on H100 | Archive backfills, overnight QA batches, compliance transcription |
| Qwen3-TTS | Streaming TTS | ~1B | End-to-end latency ~97 ms, VoiceDesign variant for custom voices | Default TTS for voice agents, IVR replacement, accessibility workflows |
| Kokoro (ONNX) | Local TTS, CPU-friendly | 82M | Runs on CPU at real-time factor <1.0; zero GPU required | Airgap, edge, or when the deployment is CPU-only |
| pyannote 3.3 diarization | Speaker diarization | ~25M | DER ~9% on standard telephony test sets after tuning | Every contact-center and meeting pipeline |
Latency numbers are defaults on Production-tier hardware with INT8 quantization where supported. Customer evals on representative audio are part of every Pilot. Wordcab will not ship a default that underperforms your real workload.
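To make the batch number concrete, here is a back-of-envelope using the Cohere Transcribe default from the table; the 10,000-hour archive is an illustrative workload, not a benchmark result.

```python
# Rough sizing for an archive backfill at the table's default rate of
# >30 minutes of audio per GPU-second. The archive size is illustrative.
ARCHIVE_HOURS = 10_000
MINUTES_PER_GPU_SECOND = 30  # Cohere Transcribe 2B default on H100

gpu_seconds = ARCHIVE_HOURS * 60 / MINUTES_PER_GPU_SECOND
gpu_hours = gpu_seconds / 3600
print(f"{gpu_seconds:.0f} GPU-seconds ≈ {gpu_hours:.1f} GPU-hours on one H100")
# → 20000 GPU-seconds ≈ 5.6 GPU-hours
```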
What it actually takes to run.
Hardware baselines for the deployment shapes we see most often. Pilots start on a single node. Production scales horizontally from there.
Voice-agent real-time
Small: up to ~100 concurrent low-latency streams. Voice + Think on one node.
- GPU: 1× NVIDIA L40S (48 GB) or 2× L4 (24 GB each)
- CPU: 16 vCPU, 64 GB RAM
- Storage: 500 GB NVMe
- Network: 1 Gbps, low-latency to clients
- Models: Qwen3-ASR + Qwen3-TTS + Qwen3.5-4B for Think
Contact-center production
Most common: full-call QA + redaction + summaries at ~1,000 concurrent streams. HA across two zones.
- GPU: 4× NVIDIA H100 (80 GB) or equivalent
- CPU: 64 vCPU, 256 GB RAM per node
- Storage: 2 TB NVMe + S3/object storage for transcripts
- Kubernetes: 1.28+, 3+ worker nodes, HPA configured
- Models: Voxtral Realtime + Qwen3-TTS + Gemma 4 E4B
Airgap / sovereign
Regulated: fully disconnected environment. Signed offline bundles, mirrored registry, custom CA chain.
- GPU: H100 / A100 / SambaNova RDU (via SCX.ai)
- Registry: Harbor, Artifactory, or ECR private mirror
- Auth: SAML/OIDC via internal IdP
- Monitoring: Prometheus + OTel shipped to your stack
- Updates: Offline bundle cadence coordinated with your change window
CPU-only / edge
Constrained: no GPU available. Lower concurrency, simpler operating model.
- CPU: 32 vCPU (AVX-512 preferred), 64 GB RAM
- Models: Kokoro (ONNX) + distilled Voxtral + Qwen3.5-0.8B
- Expected concurrency: ~10–20 streams per node
- Typical use: field service, branch-office deployments, dev/eval
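A rough sanity check on the concurrency estimate above. Both the per-stream real-time factor and the share of the node left for TTS are illustrative assumptions, not measured values.

```python
# Sanity check of the ~10–20 stream estimate: assume each Kokoro stream
# pins roughly one vCPU at an illustrative RTF of 0.8, with half the node
# reserved for STT, the Think model, and overhead. All values assumed.
VCPUS = 32
RTF = 0.8            # assumed real-time factor per dedicated vCPU
TTS_SHARE = 0.5      # fraction of the node available to TTS (assumption)

streams = int(VCPUS * TTS_SHARE / RTF)
print(f"≈{streams} concurrent TTS streams")
# → ≈20 concurrent TTS streams, the top of the ~10–20 range
```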
Why building in-house only sounds good on paper.
A demo runs in a sprint. A production voice stack — packaged, multi-tenant, observable, upgradable — is where months disappear. Four places the in-house path quietly turns into a platform team.
Pilots lie about timelines.
A prototype ships in a sprint. The production runtime behind it — packaging, tenancy, observability, upgrade path — takes quarters. Most teams underestimate by 3–4×.
Engineers hired for your product end up maintaining inference infrastructure.
Hosted APIs trade one problem for another.
They clear the model hurdle. In return: unpredictable cost curves, no audit trail, and transcripts leaving the perimeter compliance drew. The control problem arrives after launch.
Cost and control issues surface once the system is already load-bearing.
Open source is not a product.
Self-hosted components give you weights and Dockerfiles. They do not give you packaging, multi-tenant isolation, or an upgrade story customers will operate themselves.
Integration work scales with every model swap and every new environment.
Day-two operations is the real build.
Most stacks are designed for the first inference call. Drift, GPU utilization, incident response, per-tenant isolation, version cutovers — that is where years, not weeks, get spent.
Launch is the easy part. The five years after are not.
Private voice deployment, built for high-risk environments.
Wordcab Voice runs in customer-controlled environments — customer-managed Kubernetes, private cloud, on-prem, hybrid estates, restricted networks, and dedicated deployments. Same product story in each.
Frequently asked questions
We already have speech models working. Why would we still need Wordcab Voice?
What if we want to change models as the landscape moves?
Can Wordcab Voice support both batch and real-time workloads?
Can we start with Voice and add fine-tuning later?
Skip months of platform work.
If your team needs private voice AI without taking on the full platform build — Wordcab Voice is the right place to start.
Talk to an Engineer
We usually respond within one business day.