
Text-to-speech

Generate speech in real time or in one shot. Six built-in voices, custom voice cloning through Adapt, and SSML support.

There are two modes: single-shot, which returns the whole utterance at once, and streaming, which returns audio chunks as they are synthesized and suits voice agents and IVR replacement.

Single-shot

python
audio = client.audio.speech.create(
    input="Thanks for calling. Can I have your account number?",
    model="qwen3-tts",
    voice="ember",
    format="mp3",
)

# Write the returned audio bytes to disk.
with open("prompt.mp3", "wb") as f:
    f.write(audio)

javascript
import fs from "node:fs";

const audio = await client.audio.speech.create({
  input: "Thanks for calling. Can I have your account number?",
  model: "qwen3-tts",
  voice: "ember",
  format: "mp3",
});

// Convert the response body to a Buffer and write it to disk.
await fs.promises.writeFile("prompt.mp3", Buffer.from(await audio.arrayBuffer()));

Streaming

Streaming returns audio frames as they are produced. For voice agents, pipe these straight to the caller's audio path; first-audio latency is typically under 200 ms.

python
with client.audio.speech.stream(
    input=next_utterance_text,
    model="qwen3-tts",
    voice="ember",
    format="pcm16",       # raw 16-bit PCM @ 24 kHz, good for media servers
    sample_rate=24000,
) as stream:
    for chunk in stream:
        audio_out.write(chunk)
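
When there is no media server on the other end, the raw pcm16 chunks can be wrapped in a WAV container for playback or debugging. The following is a minimal sketch using only Python's standard `wave` module. The helper name `pcm16_chunks_to_wav` is hypothetical, mono output is an assumption, and a list of silent chunks stands in for the real stream.

```python
import io
import wave
from typing import Iterable


def pcm16_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 24000,
                        channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM chunks in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)     # assumption: the stream is mono
        wav.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)  # must match the sample_rate requested
        for chunk in chunks:
            wav.writeframes(chunk)     # append audio as each chunk arrives
    return buf.getvalue()


# Stand-in for the stream: three 20 ms chunks of silence at 24 kHz mono.
fake_chunks = [b"\x00\x00" * 480 for _ in range(3)]
wav_bytes = pcm16_chunks_to_wav(fake_chunks)
```

In real use, the iterable passed in would be the `stream` object from the example above, provided it yields `bytes` chunks as that example suggests.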

Voices

List available voices, including any custom voices trained through Adapt.

python
voices = client.voices.list(language="en")
for v in voices.data:
    print(f"{v.id}: {v.name} ({v.gender}, {v.style})")

Built-in voices:

| Voice | Style | Best for |
| --- | --- | --- |
| ember | Warm, neutral | Default agent voice. |
| slate | Calm, professional | Enterprise IVR, support. |
| marin | Bright, friendly | Consumer apps, outbound reminders. |
| onyx | Deep, authoritative | Announcements, brand reads. |
| brook | Soft, measured | Healthcare, sensitive topics. |
| ash | British, articulate | Explainers, narration. |

Models

| Model | Notes | Typical latency |
| --- | --- | --- |
| qwen3-tts | Default. Streaming TTS with VoiceDesign for custom voices. | < 100 ms first-audio |
| kokoro | CPU-only, 82M params. For airgap or edge. | RTF < 1.0 on modern CPU |
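
RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so a value below 1.0 means synthesis runs faster than playback. A small sketch of that arithmetic follows; the function name is hypothetical and the numbers are illustrative, not measurements of any model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 1.5 s of compute for 3.0 s of audio.
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=3.0)
keeps_up_with_playback = rtf < 1.0
```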

Formats

  • mp3 — default, compressed.
  • wav — 16-bit PCM in a WAV container.
  • pcm16 — raw 16-bit PCM, specify sample_rate. For media servers.
  • ulaw — 8 kHz μ-law, ready for telephony media streams.
  • opus — Opus in an Ogg container, good for WebRTC.
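
For the two uncompressed formats, buffer sizes follow directly from the sample rate and sample width. The sketch below works through that arithmetic, assuming mono audio and 20 ms frames (a common telephony frame length, not something this guide specifies). The helper names are hypothetical.

```python
# Bytes per sample for the uncompressed formats: 16-bit PCM and 8-bit mu-law.
BYTES_PER_SAMPLE = {"pcm16": 2, "ulaw": 1}


def frame_bytes(fmt: str, sample_rate: int, frame_ms: int = 20,
                channels: int = 1) -> int:
    """Size in bytes of one frame of audio lasting frame_ms milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[fmt] * channels


def duration_ms(num_bytes: int, fmt: str, sample_rate: int,
                channels: int = 1) -> float:
    """Playback duration in milliseconds of num_bytes of audio."""
    return 1000 * num_bytes / (BYTES_PER_SAMPLE[fmt] * sample_rate * channels)


pcm_frame = frame_bytes("pcm16", 24000)          # 480 samples * 2 bytes
ulaw_frame = frame_bytes("ulaw", 8000)           # 160 samples * 1 byte
one_second = duration_ms(48000, "pcm16", 24000)  # 48,000 bytes at 24 kHz
```

Under these assumptions a 20 ms frame is 960 bytes of pcm16 at 24 kHz and 160 bytes of ulaw at 8 kHz, which helps when sizing jitter buffers or pacing writes to a media stream.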

Controls

  • speed (0.5 – 2.0) — playback rate.
  • pitch (±12 semitones) — transpose.
  • style (neutral | calm | expressive) — emotional register.
  • ssml — pass SSML instead of plain text for pauses, pronunciation, and emphasis.
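
The ranges above can be checked client-side before a request is sent. The sketch below validates `speed`, `pitch`, and `style` against the documented limits and shows the equal-temperament relationship between semitones and frequency ratio. The helper names are hypothetical, and how the service applies pitch internally is not stated in this guide.

```python
ALLOWED_STYLES = {"neutral", "calm", "expressive"}


def build_controls(*, speed: float, pitch: float, style: str) -> dict:
    """Validate control values against the documented ranges."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be within 0.5 to 2.0, got {speed}")
    if not -12 <= pitch <= 12:
        raise ValueError(f"pitch must be within -12 to +12 semitones, got {pitch}")
    if style not in ALLOWED_STYLES:
        raise ValueError(f"style must be one of {sorted(ALLOWED_STYLES)}")
    return {"speed": speed, "pitch": pitch, "style": style}


def pitch_ratio(semitones: float) -> float:
    """Frequency ratio of a transpose in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


controls = build_controls(speed=1.25, pitch=-2, style="calm")
octave_up = pitch_ratio(12)     # one octave up doubles the frequency
octave_down = pitch_ratio(-12)  # one octave down halves it
```

Passing the result as keyword arguments, for example `client.audio.speech.create(..., **controls)`, is an assumption about the SDK's call signature rather than something this guide confirms.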

Cloning

Voice cloning requires a signed consent form and a sample pack for each voice. Cloning is enabled only on the Production and Sovereign tiers; talk to your account team to set it up.