Evaluations (Gym)

Gate every agent change on a test suite that mirrors your real traffic. Run experiments in production when you need to compare variants.

Gym is Wordcab's evaluation and A/B harness. You assemble test suites from real or synthetic audio, run your agent or model config against them, and assert on outputs. The same harness powers Adapt's rollout validation and the pre-merge gate for agent-config changes.

Test suites

python
# Assumes an SDK client has already been constructed (e.g. client = wordcab.Client(...)
# with your API key); this page doesn't show the import/auth step.
suite = client.test_suites.create(
    name="Refund-intent checks",
    description="Confirm the agent hands refunds to the refund tool and never promises a refund directly.",
    cases=[
        {
            "input": {"audio_url": "s3://internal/eval/refund_01.wav"},
            "expected": {
                # Tool calls the agent must make, in order
                "tool_calls": ["lookup_order", "start_refund"],
                "assertions": [
                    {"type": "transcript_contains", "value": "I can help with that"},
                    {"type": "transcript_not_contains", "value": "I will personally"},
                ],
            },
        },
    ],
)

Case inputs

Cases can be seeded from audio files, scripted callers, or recorded production calls (with consent). The harness runs the agent exactly as it would in production — same STT model, same TTS voice, same tools.
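For instance, the three seeding methods might be declared as below. This is a sketch: only the audio_url input shape is documented on this page, so scripted_caller and call_id are assumed field names for the other two methods.

python
# Hypothetical case shapes; only "audio_url" appears in the docs above,
# so "scripted_caller" and "call_id" are illustrative field names.
cases = [
    # Seeded from an audio file (documented shape)
    {"input": {"audio_url": "s3://internal/eval/refund_01.wav"}},
    # Seeded from a scripted caller (assumed shape)
    {
        "input": {
            "scripted_caller": {
                "persona": "impatient customer, order #4417",
                "turns": [
                    "I want my money back.",
                    "No, I don't have the receipt.",
                ],
            }
        }
    },
    # Seeded from a recorded production call, with consent (assumed shape)
    {"input": {"call_id": "call_9f2c"}},
]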

Run a suite

python
run = client.test_suites.runs.create(
    suite_id=suite.id,
    agent_id="agent_abc",
    # Override individual knobs for this run
    overrides={"llm_model": "qwen3.5-4b", "temperature": 0.2},
)

result = client.test_suites.runs.wait(run.id)
print(f"{result.passed}/{result.total} passed")
for case in result.cases:
    if not case.passed:
        print("FAIL", case.id, case.failures)

Experiments

Experiments split live traffic between variants. The control is your current agent; variants change one dimension at a time — a new prompt, a different LLM, a different TTS voice.

python
exp = client.experiments.create(
    name="qwen-4b vs deepseek-v3.2 routing",
    control_agent_id="agent_abc",
    variants=[
        {"name": "deepseek", "overrides": {"llm_model": "deepseek-v3.2"}, "traffic": 0.20},
    ],
    metrics=["resolution_rate", "avg_handle_time", "tool_error_rate"],
    stop_rule={"min_samples": 500, "significance": 0.05},
)

When the stop rule fires, Gym emits an experiment.finished event. You decide whether to promote the winner; nothing is rolled out automatically.
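If you consume that event over a webhook, a minimal receiver could look like the sketch below. The payload fields (experiment_id, winner) are assumptions; this page doesn't document the webhook schema.

python
# Minimal sketch of a receiver for experiment.finished. The payload fields
# read below (experiment_id, winner) are assumed, not Gym's documented schema.
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/gym")
def gym_webhook():
    event = request.get_json()
    if event.get("type") == "experiment.finished":
        exp_id = event["data"]["experiment_id"]   # assumed field
        winner = event["data"].get("winner")      # assumed field
        # Promotion stays a human decision: log it, page someone, open a ticket.
        print(f"experiment {exp_id} finished; candidate winner: {winner}")
    return ("", 204)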

Built-in assertions

| Assertion | Semantics |
| --- | --- |
| transcript_contains | String or regex must appear in the call transcript. |
| transcript_not_contains | The inverse: the string must not appear. |
| tool_calls | Required tool-call sequence (order-sensitive by default). |
| utterance_count | Agent spoke <= n utterances. |
| duration_lte | Call length under a ceiling. |
| llm_rubric | Open-ended rubric scored by a judge model. Use sparingly. |
| custom | HTTP webhook: the harness POSTs the case result to your URL; you return pass/fail. |
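For the custom type, an endpoint might look like the sketch below. The request and response shapes are assumptions; the docs say only that the harness POSTs the case result and expects a pass/fail answer back.

python
# Sketch of a custom-assertion endpoint. The JSON body with a "transcript"
# field and the {"passed": bool} response are assumed shapes.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.post("/assert/refund-policy")
def refund_policy_assert():
    case = request.get_json()
    transcript = case.get("transcript", "")  # assumed field
    # Example policy check: the agent may offer help, but must not promise
    # a refund outcome before the refund tool has run.
    promised_early = "your refund is confirmed" in transcript.lower()
    return jsonify({
        "passed": not promised_early,
        "reason": "premature refund promise" if promised_early else None,
    })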

Gating deployments

Wire Gym into your CI: any agent configuration change triggers a suite run. On failure, block the promotion.

yaml
# .github/workflows/agent-ci.yml
- name: Run Gym suite
  run: |
    wordcab test-suites run --suite refunds --agent $(yq e .agent_id agent.yml) \
      --wait --fail-on-red

What is a good suite?

A good suite mirrors your traffic. Start from 50 real calls across your top 5 intents, then add synthetic edge cases. Refuse to call a change "shipped" until it has run against audio nobody on your team has seen before.
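One way to bootstrap such a suite from production traffic is sketched below. Note that client.calls.list and its filters are hypothetical; the only suite-building entry point documented above is test_suites.create.

python
# Hypothetical bootstrap: pull ~10 recent, consented calls per top intent
# into a suite. client.calls.list and its filters are assumed APIs.
TOP_INTENTS = ["refund", "order_status", "cancellation", "billing", "shipping"]

cases = []
for intent in TOP_INTENTS:
    calls = client.calls.list(intent=intent, consented=True, limit=10)  # assumed
    for call in calls:
        cases.append({
            "input": {"call_id": call.id},  # assumed input shape, see above
            "expected": {"assertions": [
                # Start loose; tighten assertions as failures teach you the policy.
                {"type": "duration_lte", "value": 600},  # 600 s ceiling (assumed units)
            ]},
        })

suite = client.test_suites.create(name="Traffic mirror v1", cases=cases)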