Chat & reasoning

OpenAI-compatible chat completions, tool use, JSON mode, and structured outputs, running on open LLMs inside your boundary.

Wordcab's /v1/chat/completions endpoint is OpenAI-compatible. Use the OpenAI SDK by pointing base_url at Wordcab, or use the native Wordcab SDK; either works. Streaming, tool use, JSON mode, and structured outputs are all supported.

Basic completion

python
import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(
    base_url="https://api.wordcab.com",
    api_key=os.environ["WORDCAB_API_KEY"],
)

transcript = "..."  # the text to summarize

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": f"Summarize: {transcript}"},
    ],
    temperature=0.2,
    max_tokens=512,
)

print(resp.choices[0].message.content)

javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.wordcab.com",
  apiKey: process.env.WORDCAB_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "qwen3.5-4b",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user",   content: `Summarize: ${transcript}` },
  ],
  temperature: 0.2,
  max_tokens: 512,
});

console.log(resp.choices[0].message.content);

Streaming

python
stream = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Explain diarization."}],
    stream=True,
)

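for chunk in stream:
    # Each chunk carries an incremental delta; content is None on role-only and final chunks.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()  # end the line once the stream is exhausted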
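
When the complete reply is needed as well as the live output, the deltas can be accumulated as they arrive. This is a minimal sketch, assuming chunks follow the OpenAI streaming shape used above; `collect_stream` is a hypothetical helper name, not part of any SDK.

```python
from typing import Callable, Iterable, Optional


def collect_stream(stream: Iterable, on_delta: Optional[Callable[[str], None]] = None) -> str:
    """Concatenate streamed content deltas into the complete reply."""
    parts = []
    for chunk in stream:
        # Some chunks (for example a trailing usage chunk) may carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_delta is not None:
                on_delta(delta)  # e.g. print the piece as it arrives
    return "".join(parts)
```

A call such as `text = collect_stream(stream, on_delta=lambda d: print(d, end="", flush=True))` would print incrementally and still return the full string.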

Tool use

Pass a tools array; the model emits tool_calls when it decides a tool should run. Execute it, append the result as a tool message, and call again.

python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_account",
        "description": "Look up an account by phone number.",
        "parameters": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    tools=tools,
)

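msg = resp.choices[0].message
if msg.tool_calls:
    # Append the assistant turn once, before any tool results.
    history.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_account(**args)  # your own implementation of the tool
        history.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    # Second call: the model now sees the tool output and writes the final answer.
    final = client.chat.completions.create(model="qwen3.5-4b", messages=history)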
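
With more than one tool, a name-to-function registry keeps dispatch in one place. The sketch below assumes tool calls follow the OpenAI shape shown above; `TOOL_REGISTRY`, `run_tool_calls`, and the stub body of `get_account` are hypothetical and only for illustration.

```python
import json
from typing import Any, Callable, Dict, List


def get_account(phone: str) -> Dict[str, Any]:
    # Hypothetical stub; replace with a real account lookup.
    return {"phone": phone, "status": "active"}


# Maps the tool name advertised to the model onto the local function that runs it.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_account": get_account}


def run_tool_calls(tool_calls: List[Any]) -> List[Dict[str, str]]:
    """Execute each tool call and return one `tool` message per call."""
    messages = []
    for call in tool_calls:
        try:
            func = TOOL_REGISTRY[call.function.name]
            args = json.loads(call.function.arguments)
            result = func(**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            # Report the failure to the model instead of crashing the loop.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages
```

In the flow above this would be used as `history.append(msg)` followed by `history.extend(run_tool_calls(msg.tool_calls))`. Returning errors as tool messages gives the model a chance to correct a malformed call on the next turn.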

JSON mode & structured outputs

For workflows that parse the model's output, use strict JSON mode or supply a JSON schema generated from a Pydantic or Zod model. The output is constrained to the schema at decode time rather than validated after generation.

python
from pydantic import BaseModel

class CallSummary(BaseModel):
    reason: str
    sentiment: str
    action_items: list[str]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    response_format={
        "type": "json_schema",
        # OpenAI-style wrapper: a name plus the schema itself.
        "json_schema": {"name": "CallSummary", "schema": CallSummary.model_json_schema()},
    },
)

summary = CallSummary.model_validate_json(resp.choices[0].message.content)
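
Decode-time constraints keep the output on-schema, but a reply that hits max_tokens can still stop mid-object. The following sketch checks finish_reason before parsing, assuming the OpenAI response shape; `parse_json_reply` is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Dict


def parse_json_reply(choice: Any) -> Dict[str, Any]:
    """Return the parsed JSON object from a chat completion choice."""
    # "length" means generation stopped at the token limit, so the JSON
    # document is probably incomplete even under constrained decoding.
    if choice.finish_reason == "length":
        raise ValueError("reply truncated by max_tokens; raise the limit and retry")
    content = choice.message.content
    if not content:
        raise ValueError("reply has no content to parse")
    return json.loads(content)
```

With the Pydantic model above, `CallSummary.model_validate(parse_json_reply(resp.choices[0]))` would turn the parsed dict into a typed object.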

Models

| Model | Role | Context | Typical TTFT |
| --- | --- | --- | --- |
| qwen3.5-4b | Default agent/reasoning. | 128k | < 180 ms on L40S |
| qwen3.5-0.8b | Routing, redaction, edge. | 32k | < 80 ms on L4 |
| gemma-4-e4b | Multimodal, summarization. | 128k | ~200 ms |
| deepseek-v3.2 | Frontier reasoning (MoE). | 128k | Higher; reserve for hard jobs |
| llama-3.3-70b | Enterprise baseline. | 128k | Depends on pool |

All models live behind the same endpoint; swap by changing the model parameter. Full list: Models endpoint.
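
Because only the model string changes, routing by task can be a plain lookup. This is a small sketch using the roles from the table; the task labels and `pick_model` are hypothetical and not part of the API.

```python
# Task labels are illustrative; model names come from the table above.
MODEL_FOR_TASK = {
    "agent": "qwen3.5-4b",
    "routing": "qwen3.5-0.8b",
    "redaction": "qwen3.5-0.8b",
    "summarization": "gemma-4-e4b",
    "hard_reasoning": "deepseek-v3.2",
}
DEFAULT_MODEL = "qwen3.5-4b"  # listed as the default agent/reasoning model


def pick_model(task: str) -> str:
    """Return the model name for a task label, falling back to the default."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
```

A request would then pass `model=pick_model("routing")` to `client.chat.completions.create` and leave every other argument unchanged.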

Prefix caching

Long system prompts and RAG contexts are cached automatically across requests from the same key when they share an identical prefix. There is nothing to configure: from the second request onward, latency drops to roughly the marginal cost of the new suffix tokens.
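
To benefit from this, keep the stable part of the prompt first and unchanged between requests, and put per-request content last. Here is a sketch of that layout; `build_messages` and `shared_prefix_len` are hypothetical helpers, and the check compares whole messages as a rough proxy for the token prefix the server actually caches.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str, context: str, question: str) -> List[Message]:
    """Put stable content first so consecutive requests share a prefix."""
    return [
        # Stable across requests: instructions plus retrieved context.
        {"role": "system", "content": f"{system_prompt}\n\n{context}"},
        # Varies per request, so it goes last.
        {"role": "user", "content": question},
    ]


def shared_prefix_len(a: List[Message], b: List[Message]) -> int:
    """Count the leading messages that are identical in both requests."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count
```

Anything that changes per request, such as a timestamp or request ID placed in the system prompt, alters the prefix and would prevent a match, so such values belong in the final message.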