Text-to-speech
Generate speech in real time or in one shot. Six built-in voices, custom voice cloning through Adapt, and SSML support.
Two modes are available: single-shot, which returns the whole utterance at once, and streaming, which returns audio chunks as they are synthesized and suits voice agents and IVR replacement.
Single-shot
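```python
# Assumes `client` is an already-initialized SDK client.
audio = client.audio.speech.create(
    input="Thanks for calling. Can I have your account number?",
    model="qwen3-tts",
    voice="ember",
    format="mp3",
)
# Write the returned audio bytes to disk.
with open("prompt.mp3", "wb") as f:
    f.write(audio)
```

```javascript
import fs from "node:fs";

// Assumes `client` is an already-initialized SDK client.
const audio = await client.audio.speech.create({
  input: "Thanks for calling. Can I have your account number?",
  model: "qwen3-tts",
  voice: "ember",
  format: "mp3",
});
// Convert the response body to a Buffer and write it to disk.
await fs.promises.writeFile("prompt.mp3", Buffer.from(await audio.arrayBuffer()));
```

Streaming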
Streaming returns audio frames as they are produced. For voice agents, pipe these straight to the caller's audio path — first-audio latency is typically < 200 ms.
```python
with client.audio.speech.stream(
    input=next_utterance_text,
    model="qwen3-tts",
    voice="ember",
    format="pcm16",  # raw 16-bit PCM @ 24 kHz, good for media servers
    sample_rate=24000,
) as stream:
    for chunk in stream:
        audio_out.write(chunk)
```

Voices
List available voices, including any custom voices trained through Adapt.
```python
voices = client.voices.list(language="en")
for v in voices.data:
    print(f"{v.id}: {v.name} ({v.gender}, {v.style})")
```

Built-in voices:
| Voice | Style | Best for |
|---|---|---|
| `ember` | Warm, neutral | Default agent voice. |
| `slate` | Calm, professional | Enterprise IVR, support. |
| `marin` | Bright, friendly | Consumer apps, outbound reminders. |
| `onyx` | Deep, authoritative | Announcements, brand reads. |
| `brook` | Soft, measured | Healthcare, sensitive topics. |
| `ash` | British, articulate | Explainers, narration. |
Models
| Model | Notes | Typical latency |
|---|---|---|
| `qwen3-tts` | Default. Streaming TTS with VoiceDesign for custom voices. | < 100 ms first-audio |
| `kokoro` | CPU-only, 82M params. For airgap or edge. | RTF < 1.0 on modern CPU |
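
The real-time factor (RTF) quoted for `kokoro` is conventionally synthesis time divided by the duration of the audio produced, so a value below 1.0 means audio is generated faster than it plays back. A minimal sketch of that arithmetic follows. The function name and the sample timings are illustrative only, not measurements of either model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 1.5 s of compute for a 6 s utterance.
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=6.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```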
Formats
- `mp3`: default, compressed.
- `wav`: 16-bit PCM in a WAV container.
- `pcm16`: raw 16-bit PCM; specify `sample_rate`. For media servers.
- `ulaw`: 8 kHz μ-law, ready for telephony media streams.
- `opus`: Opus in an Ogg container, good for WebRTC.
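
When feeding `pcm16` or `ulaw` output to a media server, it often helps to know how many bytes make up one fixed-duration frame. The sketch below works through that arithmetic and re-slices arbitrarily sized chunks into fixed frames. It assumes mono audio and a 20 ms frame, neither of which is stated above. `bytes_per_frame` and `reframe` are hypothetical helper names, not part of the SDK.

```python
from typing import Iterable, Iterator


def bytes_per_frame(sample_rate: int, bytes_per_sample: int, frame_ms: int = 20) -> int:
    """Bytes in one frame of mono audio (mono is an assumption)."""
    return sample_rate * bytes_per_sample * frame_ms // 1000


def reframe(chunks: Iterable[bytes], frame_size: int) -> Iterator[bytes]:
    """Re-slice variable-sized chunks into fixed-size frames.

    Any trailing partial frame is yielded last, unpadded.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_size:
            yield bytes(buffer[:frame_size])
            del buffer[:frame_size]
    if buffer:
        yield bytes(buffer)


pcm16_frame = bytes_per_frame(24000, 2)  # 16-bit samples at 24 kHz -> 960 bytes
ulaw_frame = bytes_per_frame(8000, 1)    # 8-bit mu-law at 8 kHz -> 160 bytes

# Stand-in for chunks from a streaming response: 2500 bytes of silence.
fake_chunks = [bytes(1000), bytes(1500)]
frames = list(reframe(fake_chunks, pcm16_frame))
print(pcm16_frame, ulaw_frame, [len(f) for f in frames])
```

In a real pipeline, the `fake_chunks` list would be replaced by the chunk iterator from the streaming call, and each frame would be written to the media server's audio path.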
Controls
- `speed` (0.5 – 2.0): playback rate.
- `pitch` (±12 semitones): transpose.
- `style` (`neutral` | `calm` | `expressive`): emotional register.
- `ssml`: pass SSML instead of plain text for pauses, pronunciation, and emphasis.
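
For the `ssml` control, the markup has to be well-formed XML, so any user-supplied text should be escaped before it is embedded. The following is a minimal sketch using commonly supported SSML elements (`<speak>`, `<break>`, `<say-as>`). Which elements this service actually honors is an assumption, as is how the resulting string is passed in the request. `build_prompt_ssml` is a hypothetical helper, not part of the SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_prompt_ssml(greeting: str, digits: str, pause_ms: int = 400) -> str:
    """Build an SSML string with a pause and a digit-by-digit readout.

    Element support on the service side is assumed, not confirmed.
    """
    return (
        "<speak>"
        f"{escape(greeting)}"
        f'<break time="{int(pause_ms)}ms"/>'
        f'<say-as interpret-as="digits">{escape(digits)}</say-as>'
        "</speak>"
    )


ssml = build_prompt_ssml("Thanks for calling Smith & Sons.", "4021")

# Parsing confirms the string is well-formed XML before it is sent.
root = ET.fromstring(ssml)
print(root.tag, [child.tag for child in root])
```

Escaping matters here because a bare `&` or `<` in caller-facing text would otherwise make the document unparseable.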
Cloning
Voice cloning requires signed consent and a sample pack for each voice. Talk to your account team: cloning is only enabled on the Production and Sovereign tiers.