Speech Generation

Convert text into expressive spoken audio

Our selection

video-dubbing

wavespeed-ai/mmaudio-v2

MMAudio v2 generates synchronized audio from video or text inputs, which makes it well suited to adding soundtracks to the output of video models. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

All models

19 models

video-dubbing

wavespeed-ai/mmaudio-v2

text-to-video

google/veo3

Google Veo 3 is Google's flagship text-to-video model with built-in audio; it produces synchronized video and sound from text prompts.

text-to-video

google/veo3-fast

Google Veo 3 Fast generates video with synchronized audio from text prompts, faster and at lower cost than standard Veo 3. Commercial use is allowed, and pricing starts at $0.25 per second.

image-to-video

google/veo3-fast/image-to-video

Google Veo 3 Fast offers faster, lower-cost image-to-video generation than Veo 3. Commercial use is allowed, and pricing is $0.25 per second.

image-to-video

google/veo3/image-to-video

Google Veo 3 is Google's flagship image-to-video model; it creates videos with audio from input images.

text-to-audio

minimax/speech-02-turbo

MiniMax Speech-02 Turbo is a high-definition text-to-speech model that delivers natural voice output. Cost: $0.03 per 1,000 characters.

text-to-audio

elevenlabs/multilingual-v2

ElevenLabs Multilingual V2 is a multilingual text-to-speech model. Cost: $0.10 per 1,000 characters.

audio-to-audio

minimax/voice-clone

MiniMax Voice Clone creates high-quality voice clones from short reference clips, closely matching tone, accent, and speaking style.

text-to-audio

minimax/speech-02-hd

MiniMax Speech-02 HD is MiniMax's high-definition text-to-speech model, delivering clear HD voices. Cost: $0.05 per 1,000 characters.

text-to-audio

minimax/speech-2.5-hd-preview

MiniMax Speech 2.5 HD Preview offers HD text-to-speech with enhanced multilingual expressiveness, accurate voice cloning, and support for 40 languages.

text-to-audio

minimax/speech-2.5-turbo-preview

MiniMax Speech 2.5 Turbo Preview offers HD text-to-speech with multilingual support and accurate voice replication across 40 languages. Cost: $0.04 per 1,000 characters.

text-to-audio

elevenlabs/turbo-v2.5

ElevenLabs Turbo V2.5 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters.

text-to-audio

minimax/voice-design

MiniMax Voice Design generates natural voices from text descriptions, without cloning, and lets you set tone, accent, and personality.

text-to-audio

elevenlabs/turbo-v2

ElevenLabs Turbo V2 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters.

text-to-audio

elevenlabs/flash-v2

ElevenLabs Flash V2 is a text-to-speech model that converts text into spoken audio.

text-to-audio

elevenlabs/flash-v2.5

ElevenLabs Flash V2.5 is a text-to-speech model on WaveSpeedAI, billed at $0.05 per 1,000 characters of generated speech.

text-to-audio

elevenlabs/multilingual-v1

ElevenLabs Multilingual V1 provides natural-sounding text-to-speech across many languages.

digital-human

sync/lipsync-2-pro

Lipsync-2-pro produces studio-grade lip synchronization for video-to-video editing in minutes rather than weeks.

digital-human

sync/lipsync-2

Sync Lipsync-2 synchronizes lip movements in any video to supplied audio, enabling realistic mouth alignment for films, podcasts, games, or animations.
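
The per-1,000-character prices quoted in the listings above can be turned into a quick budget estimate. The sketch below is illustrative only: the price table is copied from this page, while the function name `estimate_tts_cost` and the assumption that billing scales linearly with character count (with no rounding or minimum charge) are hypothetical. Check each model page for the actual billing rules.

```python
from decimal import Decimal

# USD per 1,000 characters, copied from the model listings on this page.
PRICE_PER_1K_CHARS = {
    "minimax/speech-02-turbo": Decimal("0.03"),
    "minimax/speech-2.5-turbo-preview": Decimal("0.04"),
    "minimax/speech-02-hd": Decimal("0.05"),
    "elevenlabs/turbo-v2": Decimal("0.05"),
    "elevenlabs/turbo-v2.5": Decimal("0.05"),
    "elevenlabs/flash-v2.5": Decimal("0.05"),
    "elevenlabs/multilingual-v2": Decimal("0.10"),
}


def estimate_tts_cost(model: str, text: str) -> Decimal:
    """Estimate the cost in USD of synthesizing `text` with `model`.

    Assumption: cost is strictly proportional to the character count.
    Real billing may round up or apply other rules.
    """
    return PRICE_PER_1K_CHARS[model] * len(text) / 1000


# Compare the listed models for a 2,500-character script, cheapest first.
script = "x" * 2500
for model in sorted(PRICE_PER_1K_CHARS, key=PRICE_PER_1K_CHARS.get):
    print(f"{model}: ${estimate_tts_cost(model, script):.4f}")
```

Decimal is used instead of float so that small per-character amounts add up without binary rounding noise.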

Speech Generation API — pricing & performance

Run any model in the Speech Generation collection through a single REST API. Pay per generation — no subscriptions, no minimums — with industry-leading latency on a 99.9% uptime infrastructure.
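
As a rough sketch of what calling one model through the REST API might look like, the snippet below builds (but does not send) a POST request using only the Python standard library. The base URL, the path layout, the bearer-token header, and the `text` payload field are assumptions made for illustration and are not taken from this page; check wavespeed.ai/docs for the actual endpoint and parameters.

```python
import json
import urllib.request

# Assumption: illustrative base URL and path layout, not confirmed by this page.
API_BASE = "https://api.wavespeed.ai/api/v3"


def build_request(model: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (without sending) a JSON POST request for one model in the collection."""
    return urllib.request.Request(
        url=f"{API_BASE}/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical payload: the "text" field name is a placeholder.
req = build_request(
    "minimax/speech-02-turbo",
    {"text": "Hello from the Speech Generation collection."},
    api_key="YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
```

Because the model identifier is just part of the path, switching models in this sketch only means passing a different identifier from the listing; to actually send the request you would pass `req` to `urllib.request.urlopen`.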

Why run Speech Generation on WaveSpeedAI

Transparent pricing

Per-call pricing for every Speech Generation model. The price is listed on each model page — no platform fees on top.

Optimized for low latency

Most Speech Generation image models complete in under 2 seconds. Video and 3D models run several times faster than self-hosted alternatives.

99.9% uptime

Multi-region failover and automatic retries keep your production traffic online — even during provider outages.

Frequently asked questions

How much does the Speech Generation API cost?

Each model has its own per-call price listed on the model page. We bill per successful generation, with no subscription fees or minimums.

How fast are Speech Generation models on WaveSpeedAI?

Image models in this collection typically complete in under 2 seconds. Video and 3D models depend on duration and resolution but are usually several times faster than self-hosted runs.

Can I try the API without a credit card?

Yes — every account gets $1 in free credits on signup, enough to try most Speech Generation models without a credit card.

Are there rate limits?

Standard accounts have generous concurrent-job limits. Enterprise plans offer custom RPM, higher concurrency, and dedicated capacity — contact sales for details.
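
One way to stay within a concurrent-job limit on the client side is to cap the number of in-flight requests with a semaphore. The sketch below is a minimal illustration: the limit of 3 is a made-up placeholder (this page does not state the actual limit), and `fake_generation` is a hypothetical stand-in for a real API call.

```python
import asyncio

MAX_CONCURRENT_JOBS = 3  # hypothetical value; use your account's actual limit


async def run_with_limit(jobs, limit=MAX_CONCURRENT_JOBS):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting the job.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_generation(i):
    # Stand-in for a real speech-generation request.
    await asyncio.sleep(0.01)
    return f"audio-{i}"


results = asyncio.run(
    run_with_limit([lambda i=i: fake_generation(i) for i in range(10)])
)
print(len(results), "jobs finished")
```

Passing coroutine functions rather than already-created coroutines means each job starts only once it holds a semaphore slot.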

Build with the API

Integrate AI into your own apps. RESTful API with client libraries — no cold starts, pay per use.

wavespeed.ai/docs →