Model serving backends

Wordcab runs on the serving stack your platform team trusts. vLLM is the default; SGLang, Triton, and ONNX Runtime are supported for workloads where they win on latency, throughput, or deployment footprint. Backend choice is per pool: the chart picks sane defaults per model, and you can override any pool in values.yaml.

vLLM (default)

Default for Think (LLMs) and most Voice (STT/TTS) workloads. Tensor-parallel configs, INT8 / FP8 quantization, and prefix caching ship with tuned defaults. Use backend: vllm.

```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b
        backend: vllm
        backendConfig:
          quantization: fp8
          tensorParallelSize: 1
          maxModelLen: 131072
          enablePrefixCaching: true
          gpuMemoryUtilization: 0.92
```

SGLang

Picked for constrained-decoding workloads: structured outputs, multi-turn agents with heavy tool use, JSON-schema-constrained completions. Its radix cache improves throughput when many requests share long prefixes.

```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b-agents
        backend: sglang
        backendConfig:
          radixCache: true
          attentionBackend: flashinfer
          mem_fraction_static: 0.88
```

NVIDIA Triton

For teams standardized on Triton or running TensorRT-LLM / TensorRT engines. Model repository layout and config files ship in the chart under triton/.

```yaml
voice:
  stt:
    pools:
      - name: qwen3-asr-triton
        backend: triton
        backendConfig:
          image: nvcr.io/nvidia/tritonserver:25.01-py3
          modelRepository: s3://internal/triton-models/
          engine: tensorrt
```

ONNX Runtime

For CPU and edge deployments. Kokoro TTS and the smaller Think models run here. Useful when GPUs are not an option (airgap nodes, branch-office edge, cost-sensitive batch).

```yaml
voice:
  tts:
    pools:
      - name: kokoro-cpu
        backend: onnxruntime
        backendConfig:
          executionProvider: cpu        # or: cuda, tensorrt, openvino
          intraOpNumThreads: 4
```

Switching backends

Backend choice is per pool. A deployment can (and often does) run vLLM for LLMs, Triton for the bulk STT pool, and ONNX for a CPU TTS fallback. The API surface is identical — callers reference models by id, not by backend.
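A mixed deployment like the one described above can be sketched in a single values.yaml; pool names and model ids here are illustrative, and the keys mirror the per-backend examples earlier on this page:

```yaml
think:
  llm:
    pools:
      - name: qwen3.5-4b            # LLM pool on the default backend
        backend: vllm
voice:
  stt:
    pools:
      - name: qwen3-asr-bulk        # bulk transcription pool on Triton
        backend: triton
  tts:
    pools:
      - name: kokoro-cpu            # CPU fallback pool for TTS
        backend: onnxruntime
        backendConfig:
          executionProvider: cpu
```

Callers address each pool's model by id, so nothing in a request changes when a pool's backend does.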

Don't optimize backends before you evaluate them

vLLM defaults are within 5–10% of a hand-tuned Triton engine on most open models, and vLLM is much easier to operate. Prefer vLLM until a benchmark run on your real workload tells you otherwise.