The private LLM layer behind your voice stack.

Wordcab Think is the deployable LLM layer for the STT-LLM-TTS stack — summaries, extraction, routing, agents, workflow logic. It also runs standalone for private reasoning workloads.

Built for the part of voice AI that has to reason.

Think is the LLM layer that turns transcripts into decisions — summaries, extraction, routing, supervised tool use, agentic workflows. Deploy it standalone for private inference, or alongside Voice for the full stack.

<180 ms
first-token latency, 4B model on L40S

Low-latency reasoning

Gemma 4 E4B or Qwen3.5 4B via vLLM with prefix caching and FP8 quantization. Meets real-time voice-agent budgets.

Zero egress
no transcripts cross the boundary

Inside your boundary

Transcripts, summaries, tool-call arguments, and downstream artifacts stay in your VPC, data center, or airgap.

Standalone
Voice layer optional

LLM-only deployments

Run Think by itself for internal chat, RAG, or document automation — same control plane, narrower scope.

OpenAI-compatible
/v1/chat/completions, structured outputs, tool use

Drop-in endpoint

Swap base_url, keep your OpenAI SDK. Function calling, JSON mode, and streaming work unchanged.
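
A minimal sketch of the swap, assuming a hypothetical in-VPC endpoint URL and served model id (both placeholders, not Think's actual values):

```python
from openai import OpenAI

# Same OpenAI SDK, pointed at a Think deployment.
# base_url and model below are placeholders for your environment.
client = OpenAI(
    base_url="https://think.internal.example.com/v1",
    api_key="YOUR-INTERNAL-KEY",
)

stream = client.chat.completions.create(
    model="gemma-4-e4b",  # whichever model id your deployment serves
    messages=[{"role": "user", "content": "Summarize this call transcript: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```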

6 model families
Gemma · Qwen · Llama · Mistral · Cohere · Liquid

Hot-swap models

Route by workload — lightweight for extraction, mid-size for agents, larger for open-ended reasoning. No pipeline rewrite.
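
A sketch of what that routing can look like in client code, with illustrative model ids; the OpenAI-compatible call itself is unchanged:

```python
# Illustrative routing table; model ids are placeholders for whatever you serve.
ROUTES = {
    "extraction": "gemma-4-e2b",    # lightweight: classification, JSON extraction
    "agent": "qwen3.5-4b",          # tool use, long context
    "reasoning": "deepseek-v3.2",   # optional upgrade tier
}

def model_for(workload: str) -> str:
    """Pick a served model id per workload; the endpoint call stays identical."""
    return ROUTES.get(workload, "gemma-4-e4b")  # default reasoning model
```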

Small open LLMs are finally good enough for private deployment.

The 2026 small-to-mid open LLMs Wordcab Think ships defaults for, re-baselined quarterly. Workloads are routed by latency and quality bar, not by whichever model had the noisiest launch.

| Model | Params | License | Strength / benchmark | Role in Think |
|---|---|---|---|---|
| Gemma 4 E4B (Apr 2026 · Multimodal) | ~4B effective | Apache 2.0 | GPQA ~56, MMLU-Pro ~65; audio input support on the small variant | Default reasoning + summarization model |
| Gemma 4 E2B (Apr 2026 · Multimodal) | ~2B effective | Apache 2.0 | Competitive with larger models on instruction following and JSON mode | Extraction, classification, routing |
| Qwen3.5 4B (Mar 2026) | 5B | Apache 2.0 | Strong tool use, 128k context, native OpenAI-compatible server | Agent workflows, long-context document reasoning |
| Qwen3.5 0.8B (Mar 2026) | 0.9B | Apache 2.0 | TTFT <80 ms on L4; fits comfortably on CPU for edge | Redaction, PII detection, lightweight routing |
| DeepSeek V3.2 (Mar 2026) | 236B MoE (21B active) | MIT | Frontier reasoning on open weights; competitive with closed-model baselines | Heavy reasoning, complex agent chains, optional upgrade tier |
| Llama 3.3 70B (Enterprise staple) | 70B | Llama Community | Broad domain strength; widely audited by enterprise security teams | When procurement has already approved Llama for the estate |
| Liquid LFM2.5 Audio 1.5B (Jan 2026 · License-gated) | 1.5B | LFM Open v1.0 (<$10M rev) | Unified audio-in + text-out; novel for audio-first workflows | Audio-native workflows, gated by license review |

All models ship with tuned vLLM or SGLang configs, tensor-parallel presets, and INT8/FP8 quantization options. Benchmarks are public; defaults are Wordcab's, re-evaluated every quarter.
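
As a rough sketch of that shape (not Think's shipped preset), a single-GPU vLLM config via the offline Python API, with a placeholder model id:

```python
from vllm import LLM, SamplingParams

# Illustrative single-GPU config; Think's tuned presets will differ per model.
llm = LLM(
    model="google/gemma-4-e4b",   # placeholder Hugging Face id
    quantization="fp8",           # FP8 weights where the hardware supports it
    tensor_parallel_size=1,       # raise for multi-GPU tensor-parallel presets
    enable_prefix_caching=True,   # reuse KV cache for shared system prompts
    max_model_len=32768,          # a 32k context window
)

outputs = llm.generate(
    ["Extract the caller's intent as JSON: ..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```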

What Think actually costs to run.

Voice-agent reasoning

Real-time

Gemma 4 E4B or Qwen3.5 4B. Used inline in voice agents and contact-center flows.

  • GPU: 1× L40S (48 GB) or 2× L4 (24 GB)
  • Concurrency: ~200 sessions with prefix caching
  • TTFT: <180 ms at p99
  • Context: up to 32k per session

Batch summarization

Throughput

Gemma 4 E4B or Qwen3.5 4B at high throughput for post-call summaries, extraction, redaction.

  • GPU: 2× A100 80 GB or 1× H100 80 GB
  • Throughput: ~30k summaries/hour on H100
  • Use: overnight QA batches, archive backfills
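
A sketch of driving such a batch over the OpenAI-compatible endpoint, assuming the same placeholder URL and model id as above; the concurrency bound is illustrative:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholders: point at your Think deployment and a served model id.
client = AsyncOpenAI(base_url="https://think.internal.example.com/v1", api_key="KEY")

async def summarize(transcript: str) -> str:
    resp = await client.chat.completions.create(
        model="gemma-4-e4b",
        messages=[
            {"role": "system", "content": "Summarize the call in five bullets."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=300,
    )
    return resp.choices[0].message.content

async def run_batch(transcripts: list[str], concurrency: int = 64) -> list[str]:
    sem = asyncio.Semaphore(concurrency)  # saturate the server without flooding it
    async def bounded(t: str) -> str:
        async with sem:
            return await summarize(t)
    return await asyncio.gather(*(bounded(t) for t in transcripts))

# summaries = asyncio.run(run_batch(transcripts))
```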

Heavy reasoning

Upgrade tier

DeepSeek V3.2 or Llama 3.3 70B for complex agents, contract review, or policy reasoning.

  • GPU: 4× H100 80 GB (tensor-parallel)
  • Models: DeepSeek V3.2 (MoE), Llama 3.3 70B
  • Use: escalation-path reasoning, long-context workflows

CPU / edge inference

Constrained

Qwen3.5 0.8B or Gemma 4 E2B in INT4 for dev, edge, or branch deployments.

  • CPU: 16 vCPU (AVX-512), 32 GB RAM
  • Concurrency: 5–15 sessions per node
  • Use: eval, dev environments, branch-office reasoning

Transcripts are not the product. Decisions are.

  • Hosted LLMs create another sensitive data path after transcription.
  • Generic summarization breaks when domain language and workflow context matter.
  • Private voice products need reasoning, routing, and extraction close to the runtime.
  • Open-source model choice moves fast. Production packaging still takes work.
  • Day-two operations are usually missing from open-source-first LLM stacks.

Private LLMs should fit the same control boundary as the voice stack.

Wordcab Think runs in customer-controlled environments. Deploy it with Wordcab Voice as the reasoning layer in the STT-LLM-TTS stack — or on its own for private LLM inference, agent workflows, and structured-output pipelines.

Frequently asked questions

Is Wordcab Think only for voice workflows?
No. Think is the LLM layer in the voice stack. It also runs as a standalone private inference product.
Why separate Think from Voice?
Speech models and reasoning models solve different problems. Voice handles transcription and speech generation. Think turns language into decisions, summaries, structured outputs, and workflow actions.
Can Think use different models for different workloads?
Yes. Model choice follows latency, cost, context, quality, hardware, and licensing.
Where does Adapt fit with Think?
Adapt evaluates and improves the full stack when real audio, transcript quality, prompts, model selection, or workflow-level output quality becomes the blocker.

Put the reasoning layer inside your boundary.

If your team needs private LLM inference for voice workflows, agents, summaries, extraction, or structured outputs, Wordcab Think gives you a deployable option without adding another hosted data path.

Talk to an Engineer

We usually respond within one business day.
