The private LLM layer behind your voice stack.
Wordcab Think is the deployable LLM layer for the STT-LLM-TTS stack — summaries, extraction, routing, agents, workflow logic. It also runs standalone for private reasoning workloads.
Built for the part of voice AI that has to reason.
Think is the LLM layer that turns transcripts into decisions — summaries, extraction, routing, supervised tool use, agentic workflows. Deploy it standalone for private inference, or alongside Voice for the full stack.
Low-latency reasoning
Gemma 4 E4B or Qwen3.5 4B via vLLM with prefix caching and FP8 quantization. Meets real-time voice-agent latency budgets.
Inside your boundary
Transcripts, summaries, tool-call arguments, and downstream artifacts stay in your VPC, data center, or airgap.
LLM-only deployments
Run Think by itself for internal chat, RAG, or document automation — same control plane, narrower scope.
Drop-in endpoint
Swap base_url, keep your OpenAI SDK. /v1/chat/completions, structured outputs, function calling, JSON mode, and streaming work unchanged.
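A minimal sketch of the swap, assuming an OpenAI-compatible Think deployment; the endpoint URL, key, and model ID below are placeholders, not published values:

```python
from openai import OpenAI

# Placeholders: the base URL, key, and model ID are illustrative,
# not Wordcab-published values.
client = OpenAI(
    base_url="https://think.internal.example.com/v1",
    api_key="YOUR_DEPLOYMENT_KEY",
)

transcript = "Agent: Thanks for calling... Caller: I'd like to close my account."

# JSON mode, function calling, and streaming behave as they do
# against the hosted OpenAI API.
resp = client.chat.completions.create(
    model="gemma-4-e4b",  # whatever model ID your deployment serves
    messages=[
        {"role": "system", "content": "Return a JSON summary with keys: intent, sentiment, next_step."},
        {"role": "user", "content": transcript},
    ],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```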
Hot-swap models
Route by workload — lightweight for extraction, mid-size for agents, larger for open-ended reasoning. No pipeline rewrite.
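One hedged way to wire that routing with the same SDK; the workload names and model IDs here are illustrative, not a shipped API:

```python
from openai import OpenAI

client = OpenAI(base_url="https://think.internal.example.com/v1", api_key="...")

# Hypothetical routing table: map each workload to the tier it needs.
# Model IDs are placeholders; Think serves whatever you deploy behind one endpoint.
MODEL_BY_WORKLOAD = {
    "extraction": "gemma-4-e2b",    # lightweight
    "agent": "qwen3.5-4b",          # mid-size
    "open_ended": "deepseek-v3.2",  # larger reasoning tier
}

def complete(workload: str, messages: list[dict]) -> str:
    """Route a request to the right model without touching the rest of the pipeline."""
    resp = client.chat.completions.create(
        model=MODEL_BY_WORKLOAD[workload],
        messages=messages,
    )
    return resp.choices[0].message.content
```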
Small open LLMs are finally good enough for private deployment.
These are the 2026 small-to-mid LLMs Wordcab Think ships defaults for. Re-baselined quarterly. Workloads routed by latency and quality bar — not by whichever model had the noisiest launch.
| Model | Params | License | Strength / benchmark | Role in Think |
|---|---|---|---|---|
| Gemma 4 E4B | ~4B effective | Apache 2.0 | GPQA ~56, MMLU-Pro ~65; audio input support on the small variant | Default reasoning + summarization model |
| Gemma 4 E2B | ~2B effective | Apache 2.0 | Competitive with larger models on instruction following and JSON mode | Extraction, classification, routing |
| Qwen3.5 4B | 5B | Apache 2.0 | Strong tool use, 128k context, native OpenAI-compatible server | Agent workflows, long-context document reasoning |
| Qwen3.5 0.8B | 0.9B | Apache 2.0 | TTFT <80 ms on L4; fits comfortably on CPU for edge | Redaction, PII detection, lightweight routing |
| DeepSeek V3.2 | 236B MoE (21B active) | MIT | Frontier reasoning on open weights; competitive with closed-model baselines | Heavy reasoning, complex agent chains, optional upgrade tier |
| Llama 3.3 70B | 70B | Llama Community | Broad domain strength; widely audited by enterprise security teams | When procurement already approved Llama for the estate |
| Liquid LFM2.5 Audio 1.5B | 1.5B | LFM Open v1.0 (<$10M rev) | Unified audio-in + text-out; novel for audio-first workflows | Audio-native workflows — gated by license review |
All models ship with tuned vLLM or SGLang configs, tensor-parallel presets, and INT8/FP8 quantization options. Benchmarks are public; defaults are Wordcab's, re-evaluated every quarter.
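Think's shipped presets aren't reproduced here, but a sketch of the kind of vLLM configuration they wrap might look like the following; the model ID and sizing values are assumptions for illustration:

```python
from vllm import LLM, SamplingParams

# Illustrative only: model ID and values stand in for Think's tuned presets.
llm = LLM(
    model="google/gemma-4-e4b",  # placeholder repo ID
    tensor_parallel_size=2,      # e.g. 2x L4 for the real-time tier
    quantization="fp8",          # FP8 weights where the hardware supports it
    enable_prefix_caching=True,  # reuse shared system-prompt KV across sessions
    max_model_len=32768,         # 32k context per session
)

out = llm.generate(["Summarize: ..."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```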
What Think actually costs to run.
Voice-agent reasoning
Real-time. Gemma 4 E4B or Qwen3.5 4B, used inline in voice agents and contact-center flows; a TTFT check is sketched after the specs below.
- GPU: 1× L40S (48 GB) or 2× L4 (24 GB)
- Concurrency: ~200 sessions with prefix caching
- TTFT: <180 ms at p99
- Context: up to 32k per session
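To verify a deployment against that TTFT budget, one approach is to time the first streamed content delta over the standard streaming API; the endpoint and model ID are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://think.internal.example.com/v1", api_key="...")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gemma-4-e4b",  # placeholder model ID
    messages=[{"role": "user", "content": "Caller asks about a refund. Next step?"}],
    stream=True,
)
for chunk in stream:
    # Time to first token: latency until the first content delta arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```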
Batch summarization
Throughput. Gemma 4 E4B or Qwen3.5 4B run at high batch throughput for post-call summaries, extraction, and redaction; a concurrency sketch follows the specs below.
- GPU: 2× A100 80 GB or 1× H100 80 GB
- Throughput: ~30k summaries/hour on H100
- Use: overnight QA batches, archive backfills
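A sketch of how a batch job might keep the server's continuous batching saturated, assuming the same OpenAI-compatible endpoint; the concurrency value is a starting point to tune, not a shipped preset:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://think.internal.example.com/v1", api_key="...")

async def summarize(transcript: str) -> str:
    resp = await client.chat.completions.create(
        model="gemma-4-e4b",  # placeholder model ID
        messages=[{"role": "user", "content": f"Summarize this call:\n{transcript}"}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def run_batch(transcripts: list[str], concurrency: int = 64) -> list[str]:
    # Keep enough requests in flight to saturate batched decode on the GPU.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(t: str) -> str:
        async with sem:
            return await summarize(t)

    return await asyncio.gather(*(bounded(t) for t in transcripts))
```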
Heavy reasoning
Upgrade tier. DeepSeek V3.2 or Llama 3.3 70B for complex agents, contract review, or policy reasoning.
- GPU: 4× H100 80 GB (tensor-parallel)
- Models: DeepSeek V3.2 (MoE), Llama 3.3 70B
- Use: escalation-path reasoning, long-context workflows
CPU / edge inference
Constrained. Qwen3.5 0.8B or Gemma 4 E2B in INT4 for dev, edge, or branch deployments.
- CPU: 16 vCPU (AVX-512), 32 GB RAM
- Concurrency: 5–15 sessions per node
- Use: eval, dev environments, branch-office reasoning
Transcripts are not the product. Decisions are.
Private LLMs should fit the same control boundary as the voice stack.
Wordcab Think runs in customer-controlled environments. Deploy it with Wordcab Voice as the reasoning layer in the STT-LLM-TTS stack — or on its own for private LLM inference, agent workflows, and structured-output pipelines.
Frequently asked questions
Is Wordcab Think only for voice workflows?
Why separate Think from Voice?
Can Think use different models for different workloads?
Where does Adapt fit with Think?
Put the reasoning layer inside your boundary.
If your team needs private LLM inference for voice workflows, agents, summaries, extraction, or structured outputs — Wordcab Think gives you a deployable path without another hosted data path.
Talk to an Engineer
We usually respond within one business day.