
Speech

Generate audio from text. Streaming supported. OpenAI-compatible path.

Create speech

POST /v1/audio/speech

OpenAI-compatible text-to-speech endpoint. Returns an audio stream (or buffer, depending on your client) in the requested format.

Body

input (string, required)

Text or SSML to synthesize. Max 4,096 characters per request; longer inputs are split at sentence boundaries and streamed back seamlessly.

model (string, required)

Model ID. qwen3-tts is the default; use kokoro for CPU-only deployments.

voice (string, required)

Voice ID. Built-in voices: ember, slate, marin, onyx, brook, ash. Custom voices created with Adapt are also valid.

response_format (string, optional)

One of mp3 (default), wav, pcm16, ulaw, opus.

sample_rate (integer, optional)

Required when response_format is pcm16. One of 8000, 16000, 24000, or 48000 (Hz).

speed (number, optional)

Playback speed multiplier, from 0.5 to 2.0. Default 1.0.

pitch (number, optional)

Pitch shift in semitones, from -12 to +12. Default 0.

style (string, optional)

Speaking style: one of neutral, calm, or expressive.

stream (boolean, optional)

If true, the response uses Transfer-Encoding: chunked and audio is sent as it is produced.
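The constraints above can be checked client-side before a request is sent. A minimal sketch in Python; build_speech_payload is a hypothetical helper written from the documented ranges, not part of any SDK:

```python
# Client-side validation of the documented body parameters.
# build_speech_payload is illustrative, not part of the API.

FORMATS = {"mp3", "wav", "pcm16", "ulaw", "opus"}
SAMPLE_RATES = {8000, 16000, 24000, 48000}
STYLES = {"neutral", "calm", "expressive"}

def build_speech_payload(input, model, voice, response_format="mp3",
                         sample_rate=None, speed=1.0, pitch=0,
                         style=None, stream=False):
    if not input:
        raise ValueError("input is required")
    if response_format not in FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    # sample_rate is only meaningful (and required) for raw PCM output
    if response_format == "pcm16" and sample_rate not in SAMPLE_RATES:
        raise ValueError("pcm16 requires sample_rate of 8000, 16000, 24000, or 48000")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be between -12 and +12 semitones")
    if style is not None and style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    payload = {"input": input, "model": model, "voice": voice,
               "response_format": response_format, "speed": speed,
               "pitch": pitch, "stream": stream}
    if sample_rate is not None:
        payload["sample_rate"] = sample_rate
    if style is not None:
        payload["style"] = style
    return payload
```

Validating locally turns a rejected request into an immediate ValueError instead of a round trip to the server.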

Example

bash
curl -X POST https://api.wordcab.com/v1/audio/speech \
  -H "Authorization: Bearer $WORDCAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-tts",
    "voice": "ember",
    "input": "Hello from Wordcab.",
    "response_format": "mp3"
  }' --output hello.mp3
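The same call can be made from Python. A sketch using only the standard library and the endpoint, headers, and body shown in the curl example; reading the response in chunks writes streamed audio as it arrives:

```python
import json
import os
import urllib.request

API_URL = "https://api.wordcab.com/v1/audio/speech"

def speech_request(api_key, body):
    """Build a POST request for the speech endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = speech_request(
        os.environ["WORDCAB_API_KEY"],
        {"model": "qwen3-tts", "voice": "ember",
         "input": "Hello from Wordcab.", "response_format": "mp3"},
    )
    # Read in chunks so audio is written to disk as it is received.
    with urllib.request.urlopen(req) as resp, open("hello.mp3", "wb") as f:
        while chunk := resp.read(8192):
            f.write(chunk)
```

Any HTTP client works the same way; only the Authorization header and JSON body matter.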

SSML

Pass SSML instead of plain text for pauses, pronunciation, and emphasis. The input must start with <speak>.

xml
<speak>
  Your order <emphasis level="strong">has shipped</emphasis>.
  Tracking number: <say-as interpret-as="characters">1Z999AA10123456784</say-as>.
</speak>
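A small helper can make sure plain text is wrapped correctly before sending SSML. A hedged sketch; to_ssml is illustrative, and its escaping covers only the basic XML entities:

```python
from xml.sax.saxutils import escape

def to_ssml(text):
    """Wrap plain text in a <speak> root, escaping XML special characters.

    Input that already starts with <speak> is passed through unchanged,
    matching the endpoint's rule that SSML input must start with <speak>.
    """
    if text.lstrip().startswith("<speak"):
        return text
    return f"<speak>{escape(text)}</speak>"
```

Escaping matters because characters like & or < in plain text would otherwise make the SSML payload malformed.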