The private voice AI runtime your team won't have to build from scratch.

Wordcab Voice is a deployable runtime for transcription, speech generation, and voice workflows — one control surface, on infrastructure your team controls.

Production numbers, not marketing ones.

The defaults Wordcab ships on Production-tier hardware; pilot traffic usually validates them within a week.

<400 ms
p99 streaming TTFT (time to first token), 8 kHz telephony

Real-time latency

Qwen3-ASR or Voxtral Realtime on a single L40S, with VAD and custom endpointing tuned for telephony audio.
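
Endpointing is simple to reason about even if tuning it is not. A minimal sketch of the idea, not Wordcab's implementation: an energy-gated VAD over 20 ms telephony frames, where a run of silent frames closes the utterance. The threshold and silence budget below are illustrative and must be tuned on representative audio.

```python
import numpy as np

SAMPLE_RATE = 8_000      # narrowband telephony
FRAME_MS = 20            # 160 samples per frame at 8 kHz
ENERGY_THRESHOLD = 1e-4  # illustrative; tune on representative audio
ENDPOINT_FRAMES = 30     # 600 ms of silence closes the utterance

def endpoints(frames):
    """Yield (start, end) frame indices of detected utterances.

    `frames` is an iterable of float32 arrays in [-1, 1]. A frame is
    voiced when its mean energy clears the threshold; ENDPOINT_FRAMES
    consecutive silent frames end the current utterance.
    """
    start, silent, last = None, 0, -1
    for i, frame in enumerate(frames):
        last = i
        if float(np.mean(frame ** 2)) > ENERGY_THRESHOLD:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= ENDPOINT_FRAMES:
                yield start, i - silent + 1
                start, silent = None, 0
    if start is not None:
        yield start, last + 1
```

Production endpointing adds model-based VAD and interruption handling; the shape of the loop stays the same.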

1,000+
concurrent streams on 1× H100

Throughput

Streaming STT with INT8 quantization and tensor parallelism. Scales horizontally under the same control plane.

Zero egress
no call-home in the critical path

Inside the boundary

Audio, transcripts, summaries, and artifacts stay in your VPC, data center, or airgapped environment. Always.

OpenAI-compatible
/v1/audio/transcriptions, /v1/audio/speech

Drop-in API

Point your existing OpenAI SDK at a Wordcab endpoint. Application code does not change when models do.
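
Drop-in means exactly that. A minimal sketch with the official OpenAI Python SDK; the base URL, token, and model names are placeholders for your deployment's values:

```python
from openai import OpenAI

# Same SDK your app already uses; only the base URL changes.
client = OpenAI(
    base_url="https://voice.internal.example.com/v1",  # your Wordcab endpoint
    api_key="YOUR_DEPLOYMENT_TOKEN",
)

# POST /v1/audio/transcriptions
with open("call.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="qwen3-asr", file=audio)
print(transcript.text)

# POST /v1/audio/speech, streamed to disk
with client.audio.speech.with_streaming_response.create(
    model="qwen3-tts",
    voice="default",
    input="Thanks for calling. How can I help?",
) as speech:
    speech.stream_to_file("reply.wav")
```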

99.9% SLA
Production tier

Operable after launch

Prometheus, OpenTelemetry, Grafana dashboards, preflight checks, support bundles — all in the chart.
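
Application-side numbers can sit next to the runtime's own metrics. A minimal sketch with the prometheus_client library; the metric name, port, and buckets are illustrative, and run_transcription stands in for your call into the endpoint:

```python
from prometheus_client import Histogram, start_http_server

# Illustrative app-side metric; the runtime's own metrics ship with the chart.
E2E_LATENCY = Histogram(
    "voice_app_e2e_seconds",
    "End-to-end latency from audio submission to final transcript",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def run_transcription():
    ...  # your call into the OpenAI-compatible endpoint

start_http_server(9100)  # scraped by the same Prometheus as the runtime

with E2E_LATENCY.time():  # records one observation into the histogram
    run_transcription()
```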

The 2026 open voice landscape, ranked by where each model actually ships well.

Wordcab Voice tracks the open model landscape and re-baselines defaults quarterly. Every model below is Apache-2.0 or MIT, runs inside your boundary, and ships with a tuned vLLM or ONNX config.

Qwen3-ASR
Apache 2.0 · Jan 2026
  • Role: Streaming + offline STT · Params: ~2B
  • Latency / throughput: TTFT ~150 ms on vLLM; streams at >300 concurrent on an L40S
  • When we default to it: Default STT for real-time voice agents and mixed batch/streaming

Voxtral Realtime
Apache 2.0 · Feb 2026
  • Role: Low-latency streaming STT · Params: 4B
  • Latency / throughput: Configurable delay 200–500 ms; competitive with Whisper-large-v3 on multilingual audio
  • When we default to it: Live contact-center streams, voice agents with strict interruption budgets

Cohere Transcribe 2B
Apache 2.0 · Mar 2026
  • Role: Batch STT at scale · Params: 2B
  • Latency / throughput: High-throughput offline; >30 minutes of audio per GPU-second on H100
  • When we default to it: Archive backfills, overnight QA batches, compliance transcription

Qwen3-TTS
Apache 2.0 · Jan 2026
  • Role: Streaming TTS · Params: ~1B
  • Latency / throughput: End-to-end latency ~97 ms; VoiceDesign variant for custom voices
  • When we default to it: Default TTS for voice agents, IVR replacement, accessibility workflows

Kokoro (ONNX)
Apache 2.0 weights · MIT runtime
  • Role: Local TTS, CPU-friendly · Params: 82M
  • Latency / throughput: Runs on CPU at real-time factor <1.0; zero GPU required
  • When we default to it: Airgap, edge, or when the deployment is CPU-only

pyannote 3.3 diarization
MIT · Wordcab-tuned
  • Role: Speaker diarization · Params: ~25M
  • Latency / throughput: DER ~9% on standard telephony test sets after tuning
  • When we default to it: Every contact-center and meeting pipeline
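
That DER figure is checkable on your own calls before anything ships. A minimal scoring sketch with pyannote.metrics, using toy reference and hypothesis annotations in place of real diarization output:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground truth for a toy two-speaker call: who spoke when.
reference = Annotation()
reference[Segment(0.0, 12.0)] = "agent"
reference[Segment(12.0, 30.0)] = "caller"

# Diarization output to score against it.
hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk0"
hypothesis[Segment(11.0, 30.0)] = "spk1"

# DER = (missed speech + false alarm + speaker confusion) / total speech.
der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER: {der:.1%}")
```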

Latency numbers are defaults on Production-tier hardware with INT8 quantization where supported. Customer evals on representative audio are part of every Pilot. Wordcab will not ship a default that underperforms your real workload.
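
That validation can be a short script. A minimal sketch, assuming the OpenAI-compatible endpoint shown earlier: measure time to first streamed audio byte across repeated TTS calls and report the p99 (URL, token, and model name are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://voice.internal.example.com/v1",  # your Wordcab endpoint
    api_key="YOUR_DEPLOYMENT_TOKEN",
)

def first_byte_ms(text: str) -> float:
    """Milliseconds from request start to the first streamed audio chunk."""
    start = time.perf_counter()
    with client.audio.speech.with_streaming_response.create(
        model="qwen3-tts", voice="default", input=text,
    ) as speech:
        next(speech.iter_bytes())  # block until the first chunk arrives
        return (time.perf_counter() - start) * 1000.0

samples = sorted(first_byte_ms("Your call may be recorded.") for _ in range(200))
print(f"p99 time-to-first-byte: {samples[int(0.99 * (len(samples) - 1))]:.0f} ms")
```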

What it actually takes to run.

Hardware baselines for the deployment shapes we see most often. Pilots start on a single node. Production scales horizontally from there.

Voice-agent real-time

Small

Up to ~100 concurrent low-latency streams. Voice + Think on one node.

  • GPU: 1× NVIDIA L40S (48 GB) or 2× L4 (24 GB each)
  • CPU: 16 vCPU, 64 GB RAM
  • Storage: 500 GB NVMe
  • Network: 1 Gbps, low-latency to clients
  • Models: Qwen3-ASR + Qwen3-TTS + Qwen3.5-4B for Think

Contact-center production

Most common

Full-call QA + redaction + summaries at ~1,000 concurrent streams. HA across two zones.

  • GPU: 4× NVIDIA H100 (80 GB) or equivalent
  • CPU: 64 vCPU, 256 GB RAM per node
  • Storage: 2 TB NVMe + S3/object storage for transcripts
  • Kubernetes: 1.28+, 3+ worker nodes, HPA configured
  • Models: Voxtral Realtime + Qwen3-TTS + Gemma 4 E4B

Airgap / sovereign

Regulated

Fully disconnected environment. Signed offline bundles, mirrored registry, custom CA chain.

  • GPU: H100 / A100 / SambaNova RDU (via SCX.ai)
  • Registry: Harbor, Artifactory, or ECR private mirror
  • Auth: SAML/OIDC via internal IdP
  • Monitoring: Prometheus + OTel shipped to your stack
  • Updates: Offline bundle cadence coordinated with your change window

CPU-only / edge

Constrained

No GPU available. Lower concurrency, simpler operating model.

  • CPU: 32 vCPU (AVX-512 preferred), 64 GB RAM
  • Models: Kokoro (ONNX) + distilled Voxtral + Qwen3.5-0.8B
  • Expected concurrency: ~10–20 streams per node
  • Typical use: field service, branch-office deployments, dev/eval
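
Real-time factor is the number to watch on CPU-only nodes. A minimal, runtime-agnostic sketch: synthesis wall-clock time divided by the duration of the audio produced, where anything below 1.0 keeps up with playback. The synthesize callable is a stand-in for your TTS runtime, such as a Kokoro ONNX session:

```python
import time

import numpy as np

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """RTF = synthesis wall-clock time / duration of the audio produced.

    `synthesize` is any callable returning 1-D float samples. Below 1.0
    means the node generates speech faster than it plays back.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

# Toy stand-in so the sketch runs without a model: 1 s of silence per call.
rtf = real_time_factor(lambda _: np.zeros(24_000, dtype=np.float32), "hello")
print(f"RTF: {rtf:.3f}")
```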

Why building in-house only sounds good on paper.

A demo runs in a sprint. A production voice stack — packaged, multi-tenant, observable, upgradable — is where months disappear. Four places where the in-house path quietly turns into a platform-team commitment.

01

Pilots lie about timelines.

A prototype ships in a sprint. The production runtime behind it — packaging, tenancy, observability, upgrade path — takes quarters. Most teams underestimate by 3–4×.

Engineers hired for your product end up maintaining inference infrastructure.

02

Hosted APIs trade one problem for another.

They clear the model hurdle. In return: unpredictable cost curves, no audit trail, and transcripts leaving the perimeter that compliance drew. The control problem arrives after launch.

Cost and control issues surface once the system is already load-bearing.

03

Open source is not a product.

Self-hosted components give you weights and Dockerfiles. They do not give you packaging, multi-tenant isolation, or an upgrade story customers will operate themselves.

Integration work scales with every model swap and every new environment.

04

Day-two operations is the real build.

Most stacks are designed for the first inference call. Drift, GPU utilization, incident response, per-tenant isolation, version cutovers — that is where years, not weeks, get spent.

Launch is the easy part. The five years after are not.

Private voice deployment, built for high-risk environments.

Wordcab Voice runs in customer-controlled environments — customer-managed Kubernetes, private cloud, on-prem, hybrid estates, restricted networks, and dedicated deployments. Same product story in each.

Frequently asked questions

We already have speech models working. Why would we still need Wordcab Voice?
Because models are only part of the problem. Runtime, deployment packaging, observability, control surface, and upgrade path are the parts teams underestimate.
What if we want to change models as the landscape moves?
That's expected. Model choice should follow the workload, hardware, and control boundary — not lock the deployment to one vendor path.
Can Wordcab Voice support both batch and real-time workloads?
Yes. The exact stack depends on the use case. The product is built for teams that need more than one narrow demo path.
Can we start with Voice and add fine-tuning later?
Yes. Many teams start with the runtime and bring in Wordcab Adapt once model fit, domain language, or workflow quality becomes the blocker.

Skip months of platform work.

If your team needs private voice AI without taking on the full platform build — Wordcab Voice is the right place to start.

Talk to an Engineer

We usually respond within one business day.

What are you building?

Or email us directly.