Experiments
Run variants of an agent against live traffic; Gym reports a winner once the stopping rule is satisfied.
Experiments run variants of an agent against live traffic and report metrics once a stopping rule is satisfied. Results emit an experiment.finished webhook; promotion is always a human decision.
Create an experiment
The baseline agent. All non-allocated traffic stays on this agent.
Each variant has name, overrides, and traffic (0–1).
One or more of resolution_rate, avg_handle_time, tool_error_rate, csat_estimate, custom webhook-scored metrics.
{ min_samples, significance } — the harness stops when both are satisfied or a safety ceiling (max_samples, default 10,000) is hit.
Variant shape
{
"name": "deepseek-variant",
"overrides": {"llm_model": "deepseek-v3.2"},
"traffic": 0.20
}Lifecycle
/stop ends the experiment early. The response carries whatever metrics are available, even if the stop rule had not triggered.