Evaluations (Gym)

Gate every agent change on a test suite that mirrors your real traffic. Run experiments in production when you need to compare variants.

Gym is Wordcab's evaluation and A/B harness. You assemble test suites from real or synthetic audio, run your agent or model config against them, and assert on outputs. The same harness powers Adapt's rollout validation and the pre-merge gate for agent-config changes.

Test suites

python
# Assumes an SDK client has already been constructed (e.g. client = wordcab.Client(...)
# with your API key); this page doesn't show the import/auth step.
suite = client.test_suites.create(
    name="Refund-intent checks",
    description="Confirm the agent hands refunds to the refund tool and never promises a refund directly.",
    cases=[
        {
            "input": {"audio_url": "s3://internal/eval/refund_01.wav"},
            "expected": {
                # Tool calls the agent must make, in order
                "tool_calls": ["lookup_order", "start_refund"],
                "assertions": [
                    {"type": "transcript_contains", "value": "I can help with that"},
                    {"type": "transcript_not_contains", "value": "I will personally"},
                ],
            },
        },
    ],
)

Case inputs

Cases can be seeded from audio files, scripted callers, or recorded production calls (with consent). The harness runs the agent exactly as it would in production — same STT model, same TTS voice, same tools.
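For instance, the three seeding methods might be declared as below. This is a sketch: only the audio_url input shape is documented on this page, so scripted_caller and call_id are assumed field names for the other two methods.

python
# Hypothetical case shapes; only "audio_url" appears in the docs above,
# so "scripted_caller" and "call_id" are illustrative field names.
cases = [
    # Seeded from an audio file (documented shape)
    {"input": {"audio_url": "s3://internal/eval/refund_01.wav"}},
    # Seeded from a scripted caller (assumed shape)
    {
        "input": {
            "scripted_caller": {
                "persona": "impatient customer, order #4417",
                "turns": [
                    "I want my money back.",
                    "No, I don't have the receipt.",
                ],
            }
        }
    },
    # Seeded from a recorded production call, with consent (assumed shape)
    {"input": {"call_id": "call_9f2c"}},
]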

Run a suite

python
run = client.test_suites.runs.create(
    suite_id=suite.id,
    agent_id="agent_abc",
    # Override individual knobs for this run
    overrides={"llm_model": "qwen3.5-4b", "temperature": 0.2},
)

result = client.test_suites.runs.wait(run.id)
print(f"{result.passed}/{result.total} passed")
for case in result.cases:
    if not case.passed:
        print("FAIL", case.id, case.failures)

Experiments

Experiments split live traffic between variants. The control is your current agent; variants change one dimension at a time — a new prompt, a different LLM, a different TTS voice.

python
exp = client.experiments.create(
    name="qwen-4b vs deepseek-v3.2 routing",
    control_agent_id="agent_abc",
    variants=[
        {"name": "deepseek", "overrides": {"llm_model": "deepseek-v3.2"}, "traffic": 0.20},
    ],
    metrics=["resolution_rate", "avg_handle_time", "tool_error_rate"],
    stop_rule={"min_samples": 500, "significance": 0.05},
)

When the stop rule fires, Gym emits an experiment.finished event. You decide whether to promote the winner; nothing is rolled out automatically.
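If you consume that event over a webhook, a minimal receiver could look like the sketch below. The payload fields (experiment_id, winner) are assumptions; this page doesn't document the webhook schema.

python
# Minimal sketch of a receiver for experiment.finished. The payload fields
# read below (experiment_id, winner) are assumed, not Gym's documented schema.
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/gym")
def gym_webhook():
    event = request.get_json()
    if event.get("type") == "experiment.finished":
        exp_id = event["data"]["experiment_id"]   # assumed field
        winner = event["data"].get("winner")      # assumed field
        # Promotion stays a human decision: log it, page someone, open a ticket.
        print(f"experiment {exp_id} finished; candidate winner: {winner}")
    return ("", 204)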

Built-in assertions

| Assertion | Semantics |
| --- | --- |
| transcript_contains | String or regex must appear in the call transcript. |
| transcript_not_contains | The inverse: the string must not appear. |
| tool_calls | Required tool-call sequence (order-sensitive by default). |
| utterance_count | Agent spoke <= n utterances. |
| duration_lte | Call length under a ceiling. |
| llm_rubric | Open-ended rubric scored by a judge model. Use sparingly. |
| custom | HTTP webhook: the harness POSTs the case result to your URL; you return pass/fail. |
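For the custom type, an endpoint might look like the sketch below. The request and response shapes are assumptions; the docs say only that the harness POSTs the case result and expects a pass/fail answer back.

python
# Sketch of a custom-assertion endpoint. The JSON body with a "transcript"
# field and the {"passed": bool} response are assumed shapes.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.post("/assert/refund-policy")
def refund_policy_assert():
    case = request.get_json()
    transcript = case.get("transcript", "")  # assumed field
    # Example policy check: the agent may offer help, but must not promise
    # a refund outcome before the refund tool has run.
    promised_early = "your refund is confirmed" in transcript.lower()
    return jsonify({
        "passed": not promised_early,
        "reason": "premature refund promise" if promised_early else None,
    })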

Gating deployments

Wire Gym into your CI: any agent configuration change triggers a suite run. On failure, block the promotion.

yaml
# .github/workflows/agent-ci.yml
- name: Run Gym suite
  run: |
    wordcab test-suites run --suite refunds --agent $(yq e .agent_id agent.yml) \
      --wait --fail-on-red

What is a good suite?

A good suite mirrors your traffic. Start from 50 real calls across your top 5 intents, then add synthetic edge cases. Refuse to call a change "shipped" until it has run against audio nobody on your team has seen before.
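One way to bootstrap such a suite from production traffic is sketched below. Note that client.calls.list and its filters are hypothetical; the only suite-building entry point documented above is test_suites.create.

python
# Hypothetical bootstrap: pull ~10 recent, consented calls per top intent
# into a suite. client.calls.list and its filters are assumed APIs.
TOP_INTENTS = ["refund", "order_status", "cancellation", "billing", "shipping"]

cases = []
for intent in TOP_INTENTS:
    calls = client.calls.list(intent=intent, consented=True, limit=10)  # assumed
    for call in calls:
        cases.append({
            "input": {"call_id": call.id},  # assumed input shape, see above
            "expected": {"assertions": [
                # Start loose; tighten assertions as failures teach you the policy.
                {"type": "duration_lte", "value": 600},  # 600 s ceiling (assumed units)
            ]},
        })

suite = client.test_suites.create(name="Traffic mirror v1", cases=cases)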