Fine-tuning (Adapt)
Benchmark WER isn't production WER. Adapt closes the gap. 10–30% relative WER reduction is typical from 10–100 hours of your real audio.
Adapt is the layer between a promising pilot and a production rollout. Generic ASR error rates degrade 2.8–5.7× from benchmark to production, and targeted fine-tuning on your real audio is how Adapt narrows that gap. The full workflow runs inside your approved infrastructure; no audio leaves the boundary.
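To make "relative WER reduction" concrete, here is a minimal sketch of how WER and the relative reduction between two models are conventionally computed. It is not Adapt's scoring code: the function names are hypothetical, and it assumes whitespace tokenization with no text normalization, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, tuned_wer: float) -> float:
    """Fraction of the baseline WER removed by the tuned model."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - tuned_wer) / baseline_wer


# Illustrative numbers only, not measured results.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(f"{relative_wer_reduction(0.20, 0.15):.0%}")
```

With these made-up numbers, a drop from 20% to 15% WER is 5 points absolute but a 25% relative reduction. The relative figure is the one quoted above.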
The four stages
- Data: intake and cleanup. Import raw audio, normalize formats, diarize, align.
- Evaluation: pick baselines with Gym against your real workload.
- Fine-tuning: produce a tuned checkpoint.
- Validation: held-out eval, canary deployment, promote.
Prepare a dataset
dataset = client.datasets.create(
    name="contact-center-2026q2",
    sources=[
        {"type": "s3", "uri": "s3://internal/calls/2026-q1/"},
    ],
    pipeline=[
        {"step": "diarize", "model": "pyannote-3.3"},
        {"step": "align", "model": "qwen3-asr"},
        {"step": "redact", "entities": ["pii", "phi"]},
    ],
)
dataset = client.datasets.wait(dataset.id)  # keep the finished dataset so its stats are current
print(dataset.stats)  # hours, speakers, utterance count, OOV rate
Run an evaluation
Before you fine-tune, baseline candidate models on your data. This decides whether tuning is even needed.
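As a rough sketch of the "is tuning even needed" decision: take the per-model WER from a baseline run, pick the best candidate, and compare it with the WER your product can tolerate. The function name, the plain-dict input, and every number below are hypothetical. The leaderboard object returned by the SDK may be shaped differently.

```python
def needs_fine_tuning(baseline_wer: dict[str, float], target_wer: float) -> tuple[str, bool]:
    """Return the best baseline model and whether it still misses the target WER."""
    if not baseline_wer:
        raise ValueError("at least one candidate is required")
    best_model = min(baseline_wer, key=baseline_wer.get)
    return best_model, baseline_wer[best_model] > target_wer


# Made-up WER values for the three candidates; not measured results.
scores = {
    "qwen3-asr": 0.142,
    "voxtral-realtime": 0.171,
    "cohere-transcribe-2b": 0.158,
}
best, tune = needs_fine_tuning(scores, target_wer=0.10)
print(best, "needs tuning" if tune else "meets the target as-is")
```

If the best off-the-shelf candidate already clears the target, the fine-tuning stage can be skipped and that model routed directly.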
eval_run = client.evaluations.create(
    dataset_id=dataset.id,
    candidates=["qwen3-asr", "voxtral-realtime", "cohere-transcribe-2b"],
    metrics=["wer", "diarization_error_rate", "realtime_factor"],
)
print(client.evaluations.wait(eval_run.id).leaderboard)
Fine-tune
job = client.fine_tunes.create(
    base_model="qwen3-asr",
    dataset_id=dataset.id,
    hyperparameters={
        "learning_rate": 1e-5,
        "epochs": 3,
        "batch_size": 16,
    },
)
tuned = client.fine_tunes.wait(job.id)
print(tuned.model_id)  # use this as "model": "ft:qwen3-asr:abc123"
Fine-tune runs execute on hardware inside your deployment. On cloud tiers, this is a reserved GPU pool under your account. On self-hosted deployments, it is your own hardware.
Validate and promote
Run the tuned model against a held-out split and promote only when metrics clear your bar.
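One way to make "clear your bar" explicit is a small promotion gate: require a minimum relative WER gain over the base model and refuse any regression in diarization error rate. This is a sketch under assumptions, not Adapt's promotion logic. The class, the function, the default thresholds, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoldoutMetrics:
    wer: float
    diarization_error_rate: float


def should_promote(
    tuned: HoldoutMetrics,
    baseline: HoldoutMetrics,
    min_relative_wer_gain: float = 0.05,
    max_der_regression: float = 0.0,
) -> bool:
    """Promote only if WER improves enough and DER does not get worse."""
    if baseline.wer <= 0:
        return False  # nothing left to improve on
    relative_gain = (baseline.wer - tuned.wer) / baseline.wer
    der_regression = tuned.diarization_error_rate - baseline.diarization_error_rate
    return relative_gain >= min_relative_wer_gain and der_regression <= max_der_regression


# Made-up holdout numbers; not measured results.
base = HoldoutMetrics(wer=0.20, diarization_error_rate=0.08)
candidate = HoldoutMetrics(wer=0.15, diarization_error_rate=0.08)
print(should_promote(candidate, base))
```

A gate like this guards against promoting a model that wins on WER by a margin too small to matter, or that trades WER for worse speaker attribution.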
validation = client.evaluations.create(
    dataset_id=dataset.id,
    split="holdout",
    candidates=[tuned.model_id, "qwen3-asr"],
    metrics=["wer", "diarization_error_rate"],
)
# Wait the same way as the baseline evaluation, then promote only if the tuned model wins.
if client.evaluations.wait(validation.id).winner == tuned.model_id:
    client.deployments.update(
        deployment_id="prod-voice",
        routes={"stt": tuned.model_id},
    )
Custom TTS voices
The same workflow applies to text-to-speech. Submit a voice-cloning dataset with signed consent, and Adapt produces a voice ID that can be used in any agent or speech call.
Everything you feed Adapt stays inside your boundary, and so do the obligations that come with it. Confirm you have the recording consent needed for the jurisdictions the audio came from before you point Adapt at a bucket.