Chat & reasoning
OpenAI-compatible chat completions, tool use, JSON mode, and structured outputs — running on open LLMs inside your boundary.
Wordcab's /v1/chat/completions endpoint is OpenAI-compatible. Use the OpenAI SDK by pointing base_url at Wordcab, or use the native Wordcab SDK — either works. Streaming, tool use, JSON mode, and structured outputs are all supported.
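Because the endpoint speaks the OpenAI wire format, any plain HTTP client works as well. A minimal sketch of the raw request shape, where the `build_chat_request` helper and the Bearer-token auth header are our assumptions rather than details from the Wordcab reference:

```python
WORDCAB_BASE = "https://api.wordcab.com"

def build_chat_request(api_key: str, model: str, messages: list) -> tuple:
    """Assemble the URL, headers, and JSON body for a chat completion call.

    Hand the result to any HTTP client, e.g.
    requests.post(url, headers=headers, json=payload).
    """
    url = f"{WORDCAB_BASE}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    payload = {"model": model, "messages": messages}
    return url, headers, payload
```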
Basic completion
```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.wordcab.com", api_key=os.environ["WORDCAB_API_KEY"])

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": f"Summarize: {transcript}"},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.wordcab.com",
  apiKey: process.env.WORDCAB_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "qwen3.5-4b",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: `Summarize: ${transcript}` },
  ],
  temperature: 0.2,
  max_tokens: 512,
});
```

Streaming
```python
stream = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Explain diarization."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Tool use
Pass a tools array; the model emits tool_calls when it decides a tool should run. Execute it, append the result as a tool message, and call again.
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_account",
        "description": "Look up an account by phone number.",
        "parameters": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:
    history.append(msg)  # append the assistant turn once, before its tool results
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_account(**args)
        history.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
final = client.chat.completions.create(model="qwen3.5-4b", messages=history)
```

JSON mode & structured outputs
For workflows that parse the model's output, use strict JSON mode or a Pydantic/Zod schema. The model is constrained at decode time, not post-validated.
```python
from pydantic import BaseModel

class CallSummary(BaseModel):
    reason: str
    sentiment: str
    action_items: list[str]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "call_summary",
            "schema": CallSummary.model_json_schema(),
            "strict": True,
        },
    },
)
summary = CallSummary.model_validate_json(resp.choices[0].message.content)
```

Models
| Model | Role | Context | Typical TTFT |
|---|---|---|---|
| qwen3.5-4b | Default agent/reasoning. | 128k | < 180 ms on L40S |
| qwen3.5-0.8b | Routing, redaction, edge. | 32k | < 80 ms on L4 |
| gemma-4-e4b | Multimodal, summarization. | 128k | ~200 ms |
| deepseek-v3.2 | Frontier reasoning (MoE). | 128k | Higher; reserve for hard jobs |
| llama-3.3-70b | Enterprise baseline. | 128k | Depends on pool |
All models live behind the same endpoint; swap by changing the model parameter. Full list: Models endpoint.
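Since every model sits behind the same endpoint, switching is a one-line change. A hypothetical routing helper, where the task labels are our own illustration and the model IDs come from the table above:

```python
DEFAULT_MODEL = "qwen3.5-4b"  # default agent/reasoning model

# Task labels are illustrative; model IDs are from the table above.
ROUTES = {
    "redaction": "qwen3.5-0.8b",       # small and fast: routing, redaction, edge
    "summarization": "gemma-4-e4b",    # multimodal, summarization
    "hard_reasoning": "deepseek-v3.2", # frontier reasoning; reserve for hard jobs
}

def pick_model(task: str) -> str:
    # Fall back to the default agent model for anything unrecognized.
    return ROUTES.get(task, DEFAULT_MODEL)
```

Only the `model` string changes per request; messages, tools, and response_format are passed the same way for every model.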
Prefix caching
Long system prompts and RAG contexts are cached automatically across requests from the same key when they are identical prefixes. Nothing to configure — latency on the second+ request drops to the marginal token cost of the suffix.
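To get cache hits, keep the long prefix byte-identical across requests and vary only the tail. A sketch under that assumption, where the helper and the placeholder context are ours:

```python
# A long, stable block: system prompt plus retrieved context. Any change here
# invalidates the cached prefix, so build it deterministically (e.g. sort the
# retrieved documents before joining).
STABLE_PREFIX = (
    "You are a concise assistant.\n\n"
    "Reference material:\n" + "\n".join(sorted(["doc A ...", "doc B ..."]))
)

def make_messages(question: str) -> list:
    # Identical system message on every call -> cached prefix;
    # only the user turn (the suffix) is processed at full cost.
    return [
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": question},
    ]
```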