Chat & reasoning
OpenAI-compatible chat completions, tool use, JSON mode, and structured outputs, running on open LLMs inside your boundary.
Wordcab's /v1/chat/completions endpoint is OpenAI-compatible. Use the OpenAI SDK by pointing base_url at Wordcab, or use the native Wordcab SDK; either works. Streaming, tool use, JSON mode, and structured outputs are all supported.
Basic completion
# Python
import os
from openai import OpenAI

# The SDK appends /chat/completions to base_url, so include the /v1 prefix.
client = OpenAI(
    base_url="https://api.wordcab.com/v1",
    api_key=os.environ["WORDCAB_API_KEY"],  # read the key from the environment
)
transcript = "..."  # the text to summarize

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": f"Summarize: {transcript}"},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(resp.choices[0].message.content)

// JavaScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.wordcab.com/v1",
  apiKey: process.env.WORDCAB_API_KEY,
});
const transcript = "..."; // the text to summarize

const resp = await client.chat.completions.create({
  model: "qwen3.5-4b",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: `Summarize: ${transcript}` },
  ],
  temperature: 0.2,
  max_tokens: 512,
});
console.log(resp.choices[0].message.content);

Streaming
stream = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Explain diarization."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks without choices; content is None on role-only and final chunks.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Tool use
Pass a tools array; the model emits tool_calls when it decides a tool should run. Execute it, append the result as a tool message, and call again.
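The execute-and-append step described above can be isolated into a small dispatcher that looks up each requested tool by name. This is a minimal sketch, not Wordcab's implementation: `TOOL_REGISTRY`, `run_tool_calls`, and the stub `get_account` are hypothetical names, and the `SimpleNamespace` object only imitates the shape of an SDK tool call so the example runs without a network request.

```python
import json
from types import SimpleNamespace


# Hypothetical local tool; a real one would query your own account store.
def get_account(phone: str) -> dict:
    return {"phone": phone, "status": "active"}


# Maps tool names (as declared in the tools array) to Python callables.
TOOL_REGISTRY = {"get_account": get_account}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each tool call and return one tool message per call."""
    messages = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                args = json.loads(call.function.arguments)
                result = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed or mismatched arguments are reported back to the model.
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-in for one entry of resp.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_account", arguments='{"phone": "+15550100"}'),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages[0]["content"])
```

Returning errors as tool messages, rather than raising, lets the model see what went wrong and retry with corrected arguments on the next call.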
import json

# get_account is your own function; history is the running list of messages.
tools = [{
    "type": "function",
    "function": {
        "name": "get_account",
        "description": "Look up an account by phone number.",
        "parameters": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:
    # Append the assistant turn once, then one tool message per call.
    history.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_account(**args)
        history.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    # Call again so the model can use the tool results.
    final = client.chat.completions.create(model="qwen3.5-4b", messages=history)

JSON mode & structured outputs
For workflows that parse the model's output, use strict JSON mode or a Pydantic/Zod schema. The model is constrained at decode time, not post-validated.
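For plain JSON mode, a request might look like the sketch below. It assumes Wordcab accepts OpenAI's `{"type": "json_object"}` response format unchanged, which the source does not spell out; `json_mode_request` and `parse_json_reply` are hypothetical helpers, and the sample reply is a stand-in so the example runs offline.

```python
import json


def json_mode_request(model: str, messages: list[dict]) -> dict:
    """Build keyword arguments for client.chat.completions.create() in JSON mode."""
    return {
        "model": model,
        "messages": messages,
        # OpenAI's JSON mode flag; assumed to be accepted unchanged by Wordcab.
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(content: str) -> dict:
    """Parse the assistant message content and require a top-level JSON object."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


kwargs = json_mode_request(
    "qwen3.5-4b",
    [{"role": "user", "content": "Return the call reason and sentiment as JSON."}],
)
# Stand-in for resp.choices[0].message.content.
sample_content = '{"reason": "billing dispute", "sentiment": "negative"}'
print(parse_json_reply(sample_content)["reason"])
```

JSON mode on its own says nothing about which fields appear, so field names still need checking; a schema gives the stronger guarantee.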
from pydantic import BaseModel

class CallSummary(BaseModel):
    reason: str
    sentiment: str
    action_items: list[str]

resp = client.chat.completions.create(
    model="qwen3.5-4b",
    messages=history,
    # The OpenAI json_schema format wraps the schema in an object with a name.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "CallSummary", "schema": CallSummary.model_json_schema()},
    },
)
summary = CallSummary.model_validate_json(resp.choices[0].message.content)

Models
| Model | Role | Context | Typical TTFT |
|---|---|---|---|
| qwen3.5-4b | Default agent/reasoning. | 128k | < 180 ms on L40S |
| qwen3.5-0.8b | Routing, redaction, edge. | 32k | < 80 ms on L4 |
| gemma-4-e4b | Multimodal, summarization. | 128k | ~200 ms |
| deepseek-v3.2 | Frontier reasoning (MoE). | 128k | Higher; reserve for hard jobs |
| llama-3.3-70b | Enterprise baseline. | 128k | Depends on pool |
All models live behind the same endpoint; swap by changing the model parameter. Full list: Models endpoint.
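Because only the model parameter changes, model choice can live in a small lookup. The sketch below is illustrative: the model names come from the table above, while the task labels, `MODEL_BY_TASK`, `pick_model`, and `build_request` are assumptions made for the example.

```python
# Model names are taken from the table; the task labels are hypothetical.
MODEL_BY_TASK = {
    "agent": "qwen3.5-4b",
    "routing": "qwen3.5-0.8b",
    "redaction": "qwen3.5-0.8b",
    "summarization": "gemma-4-e4b",
    "hard_reasoning": "deepseek-v3.2",
}
DEFAULT_MODEL = "qwen3.5-4b"


def pick_model(task: str) -> str:
    """Return the model for a task, falling back to the default model."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


def build_request(task: str, messages: list[dict]) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {"model": pick_model(task), "messages": messages}


request = build_request(
    "routing",
    [{"role": "user", "content": "Is this call about billing or support?"}],
)
print(request["model"])
```

The resulting dictionary can be passed as `client.chat.completions.create(**request)`, so the calling code stays the same whichever model is selected.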
Prefix caching
Long system prompts and RAG contexts are cached automatically across requests from the same key when they share an identical prefix. There is nothing to configure: from the second request onward, latency drops to roughly the cost of processing the new suffix tokens.
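To benefit from the cache, keep the static part of the prompt first and the per-request part last. The sketch below shows one way to lay out messages under that assumption; `SYSTEM_PROMPT`, `build_messages`, and `shared_prefix_messages` are hypothetical names. The local comparison is only a proxy, since the actual caching happens on the server at the token level.

```python
# Static instructions: identical on every request, so they can be served from cache.
SYSTEM_PROMPT = "You are a support analyst. Answer only from the provided context."


def build_messages(context: str, question: str) -> list[dict]:
    """Place static prompt and shared context first, the varying question last."""
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nContext:\n{context}"},
        {"role": "user", "content": question},
    ]


def shared_prefix_messages(a: list[dict], b: list[dict]) -> int:
    """Count the leading messages that are identical in both requests."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


context = "Refund policy: ..."  # placeholder for a long shared RAG context
first = build_messages(context, "Can this caller get a refund?")
second = build_messages(context, "What is the escalation path?")
print(shared_prefix_messages(first, second))  # 1: the system message is shared
```

Anything that varies per request, such as a timestamp or request ID, belongs after the shared content; placing it at the top of the system prompt would make the prefixes differ from the first token.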