Text-to-speech

Generate speech in real time or in one shot. Six built-in voices, custom voice cloning through Adapt, and SSML support.

Two modes are available: single-shot, which returns the whole utterance at once, and streaming, which returns audio chunks as they are synthesized and suits voice agents and IVR replacement.

Single-shot

python

audio = client.audio.speech.create(
    input="Thanks for calling. Can I have your account number?",
    model="qwen3-tts",
    voice="ember",
    format="mp3",
)

with open("prompt.mp3", "wb") as f:
    f.write(audio)

javascript

import fs from "node:fs";

const audio = await client.audio.speech.create({
  input: "Thanks for calling. Can I have your account number?",
  model: "qwen3-tts",
  voice: "ember",
  format: "mp3",
});

await fs.promises.writeFile("prompt.mp3", Buffer.from(await audio.arrayBuffer()));

Streaming

Streaming returns audio frames as they are produced. For voice agents, pipe these straight to the caller's audio path; first-audio latency is typically < 200 ms.

python

with client.audio.speech.stream(
    input=next_utterance_text,
    model="qwen3-tts",
    voice="ember",
    format="pcm16",       # raw 16-bit PCM @ 24 kHz, good for media servers
    sample_rate=24000,
) as stream:
    for chunk in stream:
        audio_out.write(chunk)
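
To check the first-audio figure in your own environment, one option is to time how long the first chunk takes to arrive. The sketch below is illustrative only: `measure_first_audio` and `fake_stream` are hypothetical helpers, not part of the SDK, and `fake_stream` stands in for a real stream by emitting silent 20 ms pcm16 frames at 24 kHz. With a real client you would pass a callable that opens the stream and returns its chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_first_audio(make_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total bytes received)."""
    start = time.perf_counter()  # start the clock before the stream is opened
    first_audio_s = None
    total_bytes = 0
    for chunk in make_stream():
        if first_audio_s is None:
            first_audio_s = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_audio_s is None:
        raise ValueError("stream produced no audio")
    return first_audio_s, total_bytes


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS stream: silent 20 ms pcm16 frames at 24 kHz."""
    for _ in range(n_chunks):
        time.sleep(delay_s)        # simulate synthesis and network delay
        yield b"\x00\x00" * 480    # 480 samples * 2 bytes = 960 bytes per frame


latency_s, received = measure_first_audio(lambda: fake_stream())
print(f"first audio after {latency_s * 1000:.1f} ms, {received} bytes total")
```

Starting the clock before the stream is opened means the measurement includes request setup, which is what a caller actually experiences.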

Voices

List available voices, including any custom voices trained through Adapt.

python

voices = client.voices.list(language="en")
for v in voices.data:
    print(f"{v.id}: {v.name} ({v.gender}, {v.style})")

Built-in voices:

Voice	Style	Best for
`ember`	Warm, neutral	Default agent voice.
`slate`	Calm, professional	Enterprise IVR, support.
`marin`	Bright, friendly	Consumer apps, outbound reminders.
`onyx`	Deep, authoritative	Announcements, brand reads.
`brook`	Soft, measured	Healthcare, sensitive topics.
`ash`	British, articulate	Explainers, narration.

Models

Model	Notes	Typical latency
`qwen3-tts`	Default. Streaming TTS with VoiceDesign for custom voices.	< 100 ms first-audio
`kokoro`	CPU-only, 82M parameters. For air-gapped or edge deployments.	Real-time factor (RTF) < 1.0 on a modern CPU
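
RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so a value below 1.0 means synthesis runs faster than playback. A minimal sketch of that arithmetic follows. The helper names are hypothetical, the timings are made-up example values rather than measurements, and the 24 kHz mono pcm16 default is an assumption borrowed from the streaming example.

```python
def pcm16_duration_seconds(num_bytes: int, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration of raw 16-bit PCM audio: 2 bytes per sample per channel."""
    bytes_per_second = sample_rate * channels * 2
    return num_bytes / bytes_per_second


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Example values only: 192,000 bytes of 24 kHz mono pcm16 synthesized in 3.0 s.
audio_s = pcm16_duration_seconds(192_000)  # 192000 / 48000 = 4.0 s
rtf = real_time_factor(3.0, audio_s)       # 3.0 / 4.0 = 0.75
print(f"{audio_s:.1f} s of audio, RTF = {rtf:.2f}")
```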

Formats

mp3: default, compressed.
wav: 16-bit PCM in a WAV container.
pcm16: raw 16-bit PCM; specify sample_rate. For media servers.
ulaw: 8 kHz μ-law, ready for telephony media streams.
opus: Opus in an Ogg container, good for WebRTC.
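
The difference between wav and pcm16 is only the container: pcm16 is the bare sample data, while wav prefixes it with a header that records sample rate, channel count, and sample width. As a rough illustration, the sketch below wraps raw pcm16 chunks in a WAV container using Python's standard `wave` module. `pcm16_to_wav` is a hypothetical helper, and mono 24 kHz is an assumption taken from the streaming example.

```python
import io
import wave
from typing import Iterable


def pcm16_to_wav(chunks: Iterable[bytes], sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM chunks in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)     # header sizes are patched when the writer closes
    return buf.getvalue()


# Fifty silent 20 ms frames (960 bytes each) make one second of 24 kHz mono audio.
silence = [b"\x00\x00" * 480] * 50
wav_bytes = pcm16_to_wav(silence)

with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    seconds = check.getnframes() / check.getframerate()
print(f"{len(wav_bytes)} bytes, {seconds:.1f} s")
```

Because raw pcm16 carries no header, the receiver has to know the sample rate out of band, which is why that format asks for sample_rate explicitly.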

Controls

speed (0.5 – 2.0): playback rate.
pitch (±12 semitones): transposes the voice up or down.
style (neutral | calm | expressive): emotional register.
ssml: pass SSML instead of plain text to control pauses, pronunciation, and emphasis.
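
As a sketch of how these controls might be prepared client-side, the example below checks values against the ranges listed above and builds a small SSML document. Both helpers are hypothetical and not part of the SDK. The `<speak>`, `<break>`, and `<emphasis>` elements come from the W3C SSML specification; which SSML elements this service accepts, and how the SSML string is passed in a request, are assumptions to confirm against the API reference.

```python
import xml.etree.ElementTree as ET

ALLOWED_STYLES = {"neutral", "calm", "expressive"}


def validate_controls(*, speed: float, pitch: float, style: str) -> dict:
    """Check control values against the documented ranges before sending a request."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be between 0.5 and 2.0, got {speed}")
    if not -12 <= pitch <= 12:
        raise ValueError(f"pitch must be within ±12 semitones, got {pitch}")
    if style not in ALLOWED_STYLES:
        raise ValueError(f"style must be one of {sorted(ALLOWED_STYLES)}, got {style!r}")
    return {"speed": speed, "pitch": pitch, "style": style}


def build_ssml(greeting: str, pause_ms: int, question: str) -> str:
    """Build SSML: a greeting, a pause, then an emphasized question."""
    speak = ET.Element("speak")
    speak.text = greeting                          # ElementTree escapes &, <, > for us
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = question
    return ET.tostring(speak, encoding="unicode")


controls = validate_controls(speed=1.1, pitch=-2, style="calm")
ssml = build_ssml("Thanks for calling.", 300, "Can I have your account number?")
print(controls)
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps characters such as `&` in customer names or product names from producing invalid SSML.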

Cloning

Voice cloning requires signed consent and a sample pack for each voice. Talk to your account team; cloning is enabled only on the Production and Sovereign tiers.
