
Chat & reasoning

OpenAI-compatible chat completions, tool use, JSON mode, and structured outputs — running on open LLMs inside your boundary.

Wordcab's /v1/chat/completions endpoint is OpenAI-compatible. Use the OpenAI SDK by pointing base_url at Wordcab, or use the native Wordcab SDK — either works. Streaming, tool use, JSON mode, and structured outputs are all supported.

Basic completion

python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.wordcab.com", api_key=os.environ["WORDCAB_API_KEY"])

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": f"Summarize: {transcript}"},
    ],
    temperature=0.2,
    max_tokens=512,
)

print(resp.choices[0].message.content)

typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.wordcab.com",
  apiKey: process.env.WORDCAB_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "qwen3.5-4b",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user",   content: `Summarize: ${transcript}` },
  ],
  temperature: 0.2,
  max_tokens: 512,
});

console.log(resp.choices[0].message.content);

Streaming

python
stream = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Explain diarization."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Tool use

Pass a tools array; the model emits tool_calls when it decides a tool should run. Execute it, append the result as a tool message, and call again.

python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_account",
        "description": "Look up an account by phone number.",
        "parameters": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    history.append(msg)  # append the assistant turn once, before its tool results
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_account(**args)
        history.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(model="qwen3.5-4b", messages=history)

JSON mode & structured outputs

For workflows that parse the model's output, use strict JSON mode or a Pydantic/Zod schema. The model is constrained at decode time, not post-validated.

python
from pydantic import BaseModel

class CallSummary(BaseModel):
    reason: str
    sentiment: str
    action_items: list[str]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "call_summary",
            "schema": CallSummary.model_json_schema(),
            "strict": True,
        },
    },
)

summary = CallSummary.model_validate_json(resp.choices[0].message.content)
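If you only need well-formed JSON rather than an enforced schema, the lighter JSON mode constrains the decoder to valid JSON but not to specific field names, so check the keys you rely on yourself. A sketch assuming Wordcab mirrors OpenAI's `{"type": "json_object"}` response format; the required-keys check is our own, not an SDK feature:

```python
import json

# Keys our downstream pipeline expects; json_object mode does not enforce them.
REQUIRED_KEYS = {"reason", "sentiment", "action_items"}

def parse_summary(raw: str) -> dict:
    """Parse a json_object-mode response and verify the expected keys exist."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Pass `response_format={"type": "json_object"}` in the request, then feed `resp.choices[0].message.content` through `parse_summary`.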

Models

| Model | Role | Context | Typical TTFT |
| --- | --- | --- | --- |
| qwen3.5-4b | Default agent/reasoning | 128k | < 180 ms on L40S |
| qwen3.5-0.8b | Routing, redaction, edge | 32k | < 80 ms on L4 |
| gemma-4-e4b | Multimodal, summarization | 128k | ~200 ms |
| deepseek-v3.2 | Frontier reasoning (MoE) | 128k | Higher; reserve for hard jobs |
| llama-3.3-70b | Enterprise baseline | 128k | Depends on pool |

All models live behind the same endpoint; swap by changing the model parameter. Full list: Models endpoint.
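Because every model sits behind the same endpoint, routing is just a string choice. A hypothetical router keyed on the roles in the table above; the task labels and `pick_model` helper are our own, not a Wordcab concept:

```python
# Illustrative task -> model routing, following the roles in the table above.
MODEL_FOR_TASK = {
    "routing": "qwen3.5-0.8b",
    "redaction": "qwen3.5-0.8b",
    "summarization": "gemma-4-e4b",
    "agent": "qwen3.5-4b",
    "hard_reasoning": "deepseek-v3.2",
}

def pick_model(task: str) -> str:
    """Fall back to the default agent model for unknown tasks."""
    return MODEL_FOR_TASK.get(task, "qwen3.5-4b")
```

The chosen string goes straight into the `model` parameter of any request above.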

Prefix caching

Long system prompts and RAG contexts are cached automatically across requests from the same key when they are identical prefixes. Nothing to configure — latency on the second+ request drops to the marginal token cost of the suffix.
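To benefit from the cache, keep the shared prefix byte-identical across requests: same system prompt, same message ordering, per-request content last. A minimal sketch; `build_messages` and the prompt text are illustrative:

```python
# Keep this constant byte-identical across requests so the cached
# prefix is reused; only the suffix (user content) should vary.
SYSTEM_PROMPT = "You are a concise assistant for call-center transcripts."

def build_messages(transcript: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Summarize: {transcript}"},
    ]
```

Any edit to the system prompt, even whitespace, produces a different prefix and forfeits the cache hit.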