The private voice AI runtime your team won't have to build from scratch.

Wordcab Voice is a deployable runtime for transcription, speech generation, and voice workflows — one control surface, on infrastructure your team controls.

Production numbers, not marketing ones.

The defaults Wordcab ships on Production-tier hardware; pilot traffic usually validates them within a week.

<400 ms
p99 streaming TTFT (time to first token), 8 kHz telephony

Real-time latency

Qwen3-ASR or Voxtral Realtime on a single L40S, with VAD and custom endpointing tuned for telephony audio.
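
Endpointing is simple to reason about even if tuning it is not. A minimal sketch of the idea, not Wordcab's implementation: an energy-gated VAD over 20 ms telephony frames, where a run of silent frames closes the utterance. The threshold and silence budget below are illustrative and must be tuned on representative audio.

```python
import numpy as np

SAMPLE_RATE = 8_000      # narrowband telephony
FRAME_MS = 20            # 160 samples per frame at 8 kHz
ENERGY_THRESHOLD = 1e-4  # illustrative; tune on representative audio
ENDPOINT_FRAMES = 30     # 600 ms of silence closes the utterance

def endpoints(frames):
    """Yield (start, end) frame indices of detected utterances.

    `frames` is an iterable of float32 arrays in [-1, 1]. A frame is
    voiced when its mean energy clears the threshold; ENDPOINT_FRAMES
    consecutive silent frames end the current utterance.
    """
    start, silent, last = None, 0, -1
    for i, frame in enumerate(frames):
        last = i
        if float(np.mean(frame ** 2)) > ENERGY_THRESHOLD:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= ENDPOINT_FRAMES:
                yield start, i - silent + 1
                start, silent = None, 0
    if start is not None:
        yield start, last + 1
```

Production endpointing adds model-based VAD and interruption handling; the shape of the loop stays the same.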

1,000+
concurrent streams on 1× H100

Throughput

Streaming STT with INT8 quantization and tensor parallelism. Scales horizontally under the same control plane.

Zero egress
no call-home in the critical path

Inside the boundary

Audio, transcripts, summaries, and artifacts stay in your VPC, data center, or airgapped environment. Always.

OpenAI-compatible
/v1/audio/transcriptions, /v1/audio/speech

Drop-in API

Point your existing OpenAI SDK at a Wordcab endpoint. Application code does not change when models do.
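
Drop-in means exactly that. A minimal sketch with the official OpenAI Python SDK; the base URL, token, and model names are placeholders for your deployment's values:

```python
from openai import OpenAI

# Same SDK your app already uses; only the base URL changes.
client = OpenAI(
    base_url="https://voice.internal.example.com/v1",  # your Wordcab endpoint
    api_key="YOUR_DEPLOYMENT_TOKEN",
)

# POST /v1/audio/transcriptions
with open("call.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="qwen3-asr", file=audio)
print(transcript.text)

# POST /v1/audio/speech, streamed to disk
with client.audio.speech.with_streaming_response.create(
    model="qwen3-tts",
    voice="default",
    input="Thanks for calling. How can I help?",
) as speech:
    speech.stream_to_file("reply.wav")
```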

99.9% SLA
Production tier

Operable after launch

Prometheus, OpenTelemetry, Grafana dashboards, preflight checks, support bundles — all in the chart.
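
Application-side numbers can sit next to the runtime's own metrics. A minimal sketch with the prometheus_client library; the metric name, port, and buckets are illustrative, and run_transcription stands in for your call into the endpoint:

```python
from prometheus_client import Histogram, start_http_server

# Illustrative app-side metric; the runtime's own metrics ship with the chart.
E2E_LATENCY = Histogram(
    "voice_app_e2e_seconds",
    "End-to-end latency from audio submission to final transcript",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def run_transcription():
    ...  # your call into the OpenAI-compatible endpoint

start_http_server(9100)  # scraped by the same Prometheus as the runtime

with E2E_LATENCY.time():  # records one observation into the histogram
    run_transcription()
```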

The 2026 open voice landscape, ranked by where each model actually ships well.

Wordcab Voice tracks the open model landscape and re-baselines defaults quarterly. Every model below is Apache-2.0 or MIT, runs inside your boundary, and ships with a tuned vLLM or ONNX config.

Qwen3-ASR
Apache 2.0 · Jan 2026
  • Role: Streaming + offline STT · Params: ~2B
  • Latency / throughput: TTFT ~150 ms on vLLM; streams at >300 concurrent on an L40S
  • When we default to it: Default STT for real-time voice agents and mixed batch/streaming

Voxtral Realtime
Apache 2.0 · Feb 2026
  • Role: Low-latency streaming STT · Params: 4B
  • Latency / throughput: Configurable delay 200–500 ms; competitive with Whisper-large-v3 on multilingual audio
  • When we default to it: Live contact-center streams, voice agents with strict interruption budgets

Cohere Transcribe 2B
Apache 2.0 · Mar 2026
  • Role: Batch STT at scale · Params: 2B
  • Latency / throughput: High-throughput offline; >30 minutes of audio per GPU-second on H100
  • When we default to it: Archive backfills, overnight QA batches, compliance transcription

Qwen3-TTS
Apache 2.0 · Jan 2026
  • Role: Streaming TTS · Params: ~1B
  • Latency / throughput: End-to-end latency ~97 ms; VoiceDesign variant for custom voices
  • When we default to it: Default TTS for voice agents, IVR replacement, accessibility workflows

Kokoro (ONNX)
Apache 2.0 weights · MIT runtime
  • Role: Local TTS, CPU-friendly · Params: 82M
  • Latency / throughput: Runs on CPU at real-time factor <1.0; zero GPU required
  • When we default to it: Airgap, edge, or when the deployment is CPU-only

pyannote 3.3 diarization
MIT · Wordcab-tuned
  • Role: Speaker diarization · Params: ~25M
  • Latency / throughput: DER ~9% on standard telephony test sets after tuning
  • When we default to it: Every contact-center and meeting pipeline
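
That DER figure is checkable on your own calls before anything ships. A minimal scoring sketch with pyannote.metrics, using toy reference and hypothesis annotations in place of real diarization output:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground truth for a toy two-speaker call: who spoke when.
reference = Annotation()
reference[Segment(0.0, 12.0)] = "agent"
reference[Segment(12.0, 30.0)] = "caller"

# Diarization output to score against it.
hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk0"
hypothesis[Segment(11.0, 30.0)] = "spk1"

# DER = (missed speech + false alarm + speaker confusion) / total speech.
der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER: {der:.1%}")
```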

Latency numbers are defaults on Production-tier hardware with INT8 quantization where supported. Customer evals on representative audio are part of every Pilot. Wordcab will not ship a default that underperforms your real workload.
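
That validation can be a short script. A minimal sketch, assuming the OpenAI-compatible endpoint shown earlier: measure time to first streamed audio byte across repeated TTS calls and report the p99 (URL, token, and model name are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://voice.internal.example.com/v1",  # your Wordcab endpoint
    api_key="YOUR_DEPLOYMENT_TOKEN",
)

def first_byte_ms(text: str) -> float:
    """Milliseconds from request start to the first streamed audio chunk."""
    start = time.perf_counter()
    with client.audio.speech.with_streaming_response.create(
        model="qwen3-tts", voice="default", input=text,
    ) as speech:
        next(speech.iter_bytes())  # block until the first chunk arrives
        return (time.perf_counter() - start) * 1000.0

samples = sorted(first_byte_ms("Your call may be recorded.") for _ in range(200))
print(f"p99 time-to-first-byte: {samples[int(0.99 * (len(samples) - 1))]:.0f} ms")
```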

What it actually takes to run.

Hardware baselines for the deployment shapes we see most often. Pilots start on a single node. Production scales horizontally from there.

Voice-agent real-time

Small

Up to ~100 concurrent low-latency streams. Voice + Think on one node.

  • GPU: 1× NVIDIA L40S (48 GB) or 2× L4 (24 GB each)
  • CPU: 16 vCPU, 64 GB RAM
  • Storage: 500 GB NVMe
  • Network: 1 Gbps, low-latency to clients
  • Models: Qwen3-ASR + Qwen3-TTS + Qwen3.5-4B for Think

Contact-center production

Most common

Full-call QA + redaction + summaries at ~1,000 concurrent streams. HA across two zones.

  • GPU: 4× NVIDIA H100 (80 GB) or equivalent
  • CPU: 64 vCPU, 256 GB RAM per node
  • Storage: 2 TB NVMe + S3/object storage for transcripts
  • Kubernetes: 1.28+, 3+ worker nodes, HPA configured
  • Models: Voxtral Realtime + Qwen3-TTS + Gemma 4 E4B

Airgap / sovereign

Regulated

Fully disconnected environment. Signed offline bundles, mirrored registry, custom CA chain.

  • GPU: H100 / A100 / SambaNova RDU (via SCX.ai)
  • Registry: Harbor, Artifactory, or ECR private mirror
  • Auth: SAML/OIDC via internal IdP
  • Monitoring: Prometheus + OTel shipped to your stack
  • Updates: Offline bundle cadence coordinated with your change window

CPU-only / edge

Constrained

No GPU available. Lower concurrency, simpler operating model.

  • CPU: 32 vCPU (AVX-512 preferred), 64 GB RAM
  • Models: Kokoro (ONNX) + distilled Voxtral + Qwen3.5-0.8B
  • Expected concurrency: ~10–20 streams per node
  • Typical use: field service, branch-office deployments, dev/eval
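
Real-time factor is the number to watch on CPU-only nodes. A minimal, runtime-agnostic sketch: synthesis wall-clock time divided by the duration of the audio produced, where anything below 1.0 keeps up with playback. The synthesize callable is a stand-in for your TTS runtime, such as a Kokoro ONNX session:

```python
import time

import numpy as np

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """RTF = synthesis wall-clock time / duration of the audio produced.

    `synthesize` is any callable returning 1-D float samples. Below 1.0
    means the node generates speech faster than it plays back.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

# Toy stand-in so the sketch runs without a model: 1 s of silence per call.
rtf = real_time_factor(lambda _: np.zeros(24_000, dtype=np.float32), "hello")
print(f"RTF: {rtf:.3f}")
```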

Why building in-house only sounds good on paper.

A demo runs in a sprint. A production voice stack — packaged, multi-tenant, observable, upgradable — is where months disappear. Four places where the in-house path quietly turns into a platform-team commitment.

01

Pilots lie about timelines.

A prototype ships in a sprint. The production runtime behind it — packaging, tenancy, observability, upgrade path — takes quarters. Most teams underestimate by 3–4×.

Engineers hired for your product end up maintaining inference infrastructure.

02

Hosted APIs trade one problem for another.

They clear the model hurdle. In return: unpredictable cost curves, no audit trail, and transcripts leaving the perimeter that compliance drew. The control problem arrives after launch.

Cost and control issues surface once the system is already load-bearing.

03

Open source is not a product.

Self-hosted components give you weights and Dockerfiles. They do not give you packaging, multi-tenant isolation, or an upgrade story customers will operate themselves.

Integration work scales with every model swap and every new environment.

04

Day-two operations is the real build.

Most stacks are designed for the first inference call. Drift, GPU utilization, incident response, per-tenant isolation, version cutovers — that is where years, not weeks, get spent.

Launch is the easy part. The five years after are not.

Private voice deployment, built for high-risk environments.

Wordcab Voice runs in customer-controlled environments — customer-managed Kubernetes, private cloud, on-prem, hybrid estates, restricted networks, and dedicated deployments. Same product story in each.

Frequently asked questions

We already have speech models working. Why would we still need Wordcab Voice?
Because models are only part of the problem. Runtime, deployment packaging, observability, control surface, and upgrade path are the parts teams underestimate.
What if we want to change models as the landscape moves?
That's expected. Model choice should follow the workload, hardware, and control boundary — not lock the deployment to one vendor path.
Can Wordcab Voice support both batch and real-time workloads?
Yes. The exact stack depends on the use case. The product is built for teams that need more than one narrow demo path.
Can we start with Voice and add fine-tuning later?
Yes. Many teams start with the runtime and bring in Wordcab Adapt once model fit, domain language, or workflow quality becomes the blocker.

Skip months of platform work.

If your team needs private voice AI without taking on the full platform build — Wordcab Voice is the right place to start.

Talk to an Engineer

We usually respond within one business day.

What are you building?

Or email us directly.