Model serving backends
Wordcab runs on the serving stack your platform team trusts. vLLM is the default; SGLang, Triton, and ONNX Runtime are supported for workloads where they win on latency, throughput, or deployment footprint. The chart picks sane defaults per model; override per pool in values.yaml.
vLLM (default)
Default for Think (LLMs) and most Voice (STT/TTS). Tensor-parallel configs, INT8 / FP8 quantization, and prefix caching ship tuned. Use backend: vllm.
```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b
        backend: vllm
        backendConfig:
          quantization: fp8
          tensorParallelSize: 1
          maxModelLen: 131072
          enablePrefixCaching: true
          gpuMemoryUtilization: 0.92
```

SGLang
Picked for constrained-decoding workloads: structured outputs, multi-turn agents with heavy tool use, JSON-schema-constrained completions. Radix cache wins on throughput for repeated prefixes.
```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b-agents
        backend: sglang
        backendConfig:
          radixCache: true
          attentionBackend: flashinfer
          mem_fraction_static: 0.88
```

NVIDIA Triton
For teams standardized on Triton or running TensorRT-LLM / TensorRT engines. Model repository layout and config files ship in the chart under triton/.
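A Triton model repository follows Triton's standard convention: one directory per model holding a config.pbtxt and numbered version subdirectories. A minimal sketch (the qwen3-asr name is illustrative):

```
triton-models/
  qwen3-asr/
    config.pbtxt     # backend, input/output tensors, batching settings
    1/
      model.plan     # TensorRT engine, version 1
```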
```yaml
voice:
  stt:
    pools:
      - name: qwen3-asr-triton
        backend: triton
        backendConfig:
          image: nvcr.io/nvidia/tritonserver:25.01-py3
          modelRepository: s3://internal/triton-models/
          engine: tensorrt
```

ONNX Runtime
For CPU and edge deployments. Kokoro TTS and the smaller Think models run here. Useful when GPUs are not an option (air-gapped nodes, branch-office edge, cost-sensitive batch).
```yaml
voice:
  tts:
    pools:
      - name: kokoro-cpu
        backend: onnxruntime
        backendConfig:
          executionProvider: cpu  # or: cuda, tensorrt, openvino
          intraOpNumThreads: 4
```

Switching backends
Backend choice is per pool. A deployment can (and often does) run vLLM for LLMs, Triton for the bulk STT pool, and ONNX for a CPU TTS fallback. The API surface is identical — callers reference models by id, not by backend.
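A mixed deployment might look like the following sketch, reusing the values.yaml shape from the examples above (pool names are illustrative):

```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b            # chat + agents on GPU
        backend: vllm
voice:
  stt:
    pools:
      - name: bulk-transcription    # high-throughput batch STT
        backend: triton
  tts:
    pools:
      - name: kokoro-cpu-fallback   # CPU fallback, no GPU required
        backend: onnxruntime
```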
vLLM defaults are within 5–10% of a hand-tuned Triton engine on most open models, and vLLM is much easier to operate. Prefer vLLM until a benchmark run on your real workload tells you otherwise.
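When you do benchmark, compare latency percentiles rather than means, since tail latency is where backends diverge most. A minimal sketch of the measurement side, with a stubbed send_request standing in for a real call to your deployment (the stub is hypothetical; wire it to your actual endpoint):

```python
import time
from typing import Callable, List


def percentile(sorted_samples: List[float], p: float) -> float:
    """Nearest-rank percentile over a pre-sorted list."""
    k = max(0, min(len(sorted_samples) - 1,
                   round(p / 100 * (len(sorted_samples) - 1))))
    return sorted_samples[k]


def benchmark(send_request: Callable[[], None], n: int = 200) -> dict:
    """Time n sequential requests; report p50/p95/p99 latency in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {p: percentile(latencies, p) for p in (50, 95, 99)}


# Stub standing in for a real request to the serving endpoint.
def fake_request() -> None:
    time.sleep(0.001)


if __name__ == "__main__":
    print(benchmark(fake_request, n=20))
```

Run the same harness against each candidate backend with your real prompts and concurrency, and compare the p95/p99 columns, not just p50.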