Evaluations (Gym)
Gate every agent change on a test suite that mirrors your real traffic. Run experiments in production when you need to compare variants.
Gym is Wordcab's evaluation and A/B harness. You assemble test suites from real or synthetic audio, run your agent or model config against them, and assert on outputs. The same harness powers Adapt's rollout validation and the pre-merge gate for agent-config changes.
Test suites
```python
suite = client.test_suites.create(
    name="Refund-intent checks",
    description=(
        "Confirm the agent hands refunds to the refund tool "
        "and never promises a refund directly."
    ),
    cases=[
        {
            "input": {"audio_url": "s3://internal/eval/refund_01.wav"},
            "expected": {
                "tool_calls": ["lookup_order", "start_refund"],
                "assertions": [
                    {"type": "transcript_contains", "value": "I can help with that"},
                    {"type": "transcript_not_contains", "value": "I will personally"},
                ],
            },
        },
    ],
)
```

Case inputs
Cases can be seeded from audio files, scripted callers, or recorded production calls (with consent). The harness runs the agent exactly as it would in production — same STT model, same TTS voice, same tools.
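For calls seeded from a scripted caller rather than an audio file, a case might carry the script inline. This is a hypothetical sketch: the `scripted_caller` input type, its `persona` and `turns` fields, are assumptions about the schema, not confirmed API.

```python
# Hypothetical case seeded from a scripted caller instead of an audio file.
# "scripted_caller", "persona", and "turns" are assumed field names.
case = {
    "input": {
        "scripted_caller": {
            "persona": "frustrated customer whose order arrived damaged",
            "turns": [
                "Hi, my order showed up broken.",
                "Order number 4451, yes.",
            ],
        }
    },
    # Expected behavior uses the same schema as audio-seeded cases.
    "expected": {"tool_calls": ["lookup_order", "start_refund"]},
}
```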
Run a suite
```python
run = client.test_suites.runs.create(
    suite_id=suite.id,
    agent_id="agent_abc",
    # Override individual knobs for this run
    overrides={"llm_model": "qwen3.5-4b", "temperature": 0.2},
)

result = client.test_suites.runs.wait(run.id)
print(f"{result.passed}/{result.total} passed")
for case in result.cases:
    if not case.passed:
        print("FAIL", case.id, case.failures)
```

Experiments
Experiments split live traffic between variants. The control is your current agent; variants change one dimension at a time — a new prompt, a different LLM, a different TTS voice.
```python
exp = client.experiments.create(
    name="qwen-4b vs deepseek-v3.2 routing",
    control_agent_id="agent_abc",
    variants=[
        {"name": "deepseek", "overrides": {"llm_model": "deepseek-v3.2"}, "traffic": 0.20},
    ],
    metrics=["resolution_rate", "avg_handle_time", "tool_error_rate"],
    stop_rule={"min_samples": 500, "significance": 0.05},
)
```

When the stop rule fires, Gym emits experiment.finished. You decide whether to promote the winner; nothing rolls out automatically.
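The promote-or-hold decision on a pass/fail metric like resolution_rate is a comparison of two proportions. A minimal sketch, assuming Gym reports per-variant success and sample counts (the example counts below are invented):

```python
from math import sqrt, erf

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference between two proportions
    (pooled z-test, normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical counts: control resolved 312/500 calls, variant 356/500.
p = two_proportion_p_value(312, 500, 356, 500)
promote = p < 0.05  # mirrors the significance threshold in stop_rule
```

With these counts the difference clears the 0.05 threshold, so the variant would be promoted; with closer proportions the same code would tell you to hold.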
Built-in assertions
| Assertion | Semantics |
|---|---|
| transcript_contains | String or regex must appear in the call transcript. |
| transcript_not_contains | The inverse: the string or regex must not appear. |
| tool_calls | Required tool-call sequence (order-sensitive by default). |
| utterance_count | Agent spoke <= n utterances. |
| duration_lte | Call length under a ceiling. |
| llm_rubric | Open-ended rubric scored by a judge model. Use sparingly. |
| custom | HTTP webhook — the harness POSTs the case result to your URL; you return pass/fail. |
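A custom assertion is just an endpoint that turns a case result into pass/fail. A minimal sketch of the decision logic, assuming the POSTed payload carries a transcript field and the harness expects a JSON body with a boolean passed key (both are assumptions about the webhook contract, not confirmed API):

```python
import json

def evaluate(payload: dict) -> dict:
    """Assumed webhook contract: pass only if the agent apologized
    and never quoted a dollar amount."""
    transcript = payload.get("transcript", "").lower()
    passed = "sorry" in transcript and "$" not in transcript
    return {"passed": passed}

# Serve evaluate() behind any HTTP framework and return its result
# as the response body, e.g.:
body = json.dumps(evaluate({"transcript": "I'm sorry about that, let me fix it."}))
```

Keeping the decision in a pure function like this makes the webhook itself trivial to unit-test without standing up a server.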
Gating deployments
Wire Gym into your CI: any agent configuration change triggers a suite run. On failure, block the promotion.
```yaml
# .github/workflows/agent-ci.yml
- name: Run Gym suite
  run: |
    wordcab test-suites run --suite refunds --agent $(yq e .agent_id agent.yml) \
      --wait --fail-on-red
```

A good suite mirrors your traffic. Start from 50 real calls across your top 5 intents, then add synthetic edge cases. Refuse to call a change "shipped" until it has run against audio nobody on your team has seen before.