
Convert text into expressive spoken audio

MMaudio v2 produces synchronized audio from video or text inputs, ideal for adding soundtracks to videos when paired with video models. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

Google Veo 3 is Google's flagship text-to-video model with built-in audio, producing synchronized video and sound from text prompts.

Google Veo 3 Fast generates video with synchronized audio from text prompts, faster and at lower cost than standard Veo 3. Commercial use is allowed, and pricing starts at $0.25 per second.

Google Veo 3 Fast also provides faster, more cost-effective image-to-video generation than Veo 3. Commercial use is allowed, and pricing is $0.25 per second.

Google Veo 3 is Google's flagship image-to-video model that creates audio-enabled videos from images.

MiniMax Speech-02 Turbo is a high-definition text-to-speech model delivering natural voice output. Cost: $0.03 per 1,000 characters.

ElevenLabs Multilingual V2 is a multilingual text-to-speech model. Cost: $0.10 per 1,000 characters.

MiniMax Voice Clone creates high-quality voice clones from short reference clips, closely matching tone, accent, and speaking style.

MiniMax Speech 02 HD is MiniMax's high-definition text-to-speech model delivering clear HD voices. Cost: $0.05 per 1,000 characters.

MiniMax Speech 2.5 HD Preview offers HD text-to-speech with enhanced multilingual expressiveness, accurate voice cloning, and support for 40 languages.

MiniMax Speech 2.5 Turbo Preview offers HD text-to-speech with multilingual support and accurate voice replication across 40 languages. Cost: $0.04 per 1,000 characters.

ElevenLabs Turbo V2.5 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters.

MiniMax Voice Design generates natural voices from textual descriptions, without cloning, and lets you set tone, accent, and personality.

ElevenLabs Turbo V2 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters.

ElevenLabs Flash V2 is a text-to-speech model that converts text into spoken audio.

ElevenLabs Flash V2.5 is a text-to-speech model on WaveSpeedAI, billed at $0.05 per 1,000 characters.

ElevenLabs Multilingual V1 provides natural-sounding text-to-speech across many languages.

Lipsync-2-pro creates studio-grade lip synchronization for video-to-video editing in minutes, not weeks.

Sync Lipsync-2 synchronizes lip movements in any video to supplied audio, enabling realistic mouth alignment for films, podcasts, games, or animations.
Run any model in the Speech Generation collection through a single REST API. Pay per generation, with no subscriptions or minimums, and get industry-leading latency on infrastructure with 99.9% uptime.
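
As a rough illustration of what a single-REST-API call could look like, the sketch below builds a JSON POST request for one text-to-speech generation without sending it. The base URL, path layout, model identifier, payload field, and bearer-token header are all placeholders assumed for illustration; the real endpoint and schema are documented on each model page.

```python
import json
import urllib.request

# Hypothetical values: the base URL, path layout, and payload field below are
# placeholders for illustration, not the platform's documented API.
BASE_URL = "https://api.example.com/v1"


def build_tts_request(model_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one speech generation."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{model_id}",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # "minimax/speech-02-turbo" is an illustrative identifier, not a confirmed model ID.
    req = build_tts_request("minimax/speech-02-turbo", "Hello, world.", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes the call easy to inspect and test before any credits are spent.
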
Per-call pricing for every Speech Generation model. The price is listed on each model page — no platform fees on top.
Most Speech Generation image models complete in under 2 seconds. Video and 3D models run several times faster than self-hosted alternatives.
Multi-region failover and automatic retries keep your production traffic online — even during provider outages.
Each model has its own per-call price listed on the model page. We bill per successful generation, with no subscription fees or minimums.
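
For text-to-speech models priced per 1,000 characters, a request's cost can be estimated before sending it. The sketch below uses the per-1,000-character prices quoted in the model descriptions in this collection; the dictionary keys are illustrative labels rather than official model IDs, and it assumes billing scales linearly with character count (any rounding or minimum charge would be stated on the model page).

```python
from decimal import Decimal

# Prices in USD per 1,000 characters, copied from the model descriptions in this
# collection. The keys are illustrative labels, not official model IDs.
PRICE_PER_1K_CHARS = {
    "minimax-speech-02-turbo": Decimal("0.03"),
    "minimax-speech-2.5-turbo-preview": Decimal("0.04"),
    "minimax-speech-02-hd": Decimal("0.05"),
    "elevenlabs-turbo-v2.5": Decimal("0.05"),
    "elevenlabs-multilingual-v2": Decimal("0.10"),
}


def estimate_tts_cost(model: str, text: str) -> Decimal:
    """Estimate the USD cost of one request, assuming linear per-character billing."""
    return PRICE_PER_1K_CHARS[model] * len(text) / 1000


if __name__ == "__main__":
    script = "a" * 2500  # stand-in for a 2,500-character script
    for name in PRICE_PER_1K_CHARS:
        print(name, estimate_tts_cost(name, script))
```

For example, a 2,500-character script at $0.03 per 1,000 characters comes to $0.075, while the same script at $0.10 per 1,000 characters comes to $0.25.
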
Image models in this collection typically complete in under 2 seconds. Video and 3D models depend on duration and resolution but are usually several times faster than self-hosted runs.
Every account gets $1 in free credits on signup, enough to try most Speech Generation models without a credit card.
Standard accounts have generous concurrent-job limits. Enterprise plans offer custom RPM, higher concurrency, and dedicated capacity — contact sales for details.
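
One way a client might stay under a concurrent-job limit is to cap in-flight requests with a semaphore. This is a minimal sketch: the limit of 3 is an assumed value rather than the platform's actual limit, and `fake_generation` is a hypothetical stand-in for a real API call.

```python
import asyncio

# Assumed value for illustration; check your account's actual concurrent-job limit.
MAX_CONCURRENT_JOBS = 3


async def run_bounded(jobs, limit: int = MAX_CONCURRENT_JOBS):
    """Run coroutine factories with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot, then run the job while holding it.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_generation(index: int) -> str:
    """Hypothetical stand-in for an API call; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"clip-{index}"


if __name__ == "__main__":
    jobs = [lambda i=i: fake_generation(i) for i in range(10)]
    print(asyncio.run(run_bounded(jobs)))
```

Passing zero-argument factories rather than already-created coroutines means no job starts until a slot is free.
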