TanStack AI supports text-to-speech generation through dedicated TTS adapters. This guide covers how to convert text into spoken audio using the OpenAI, Gemini, and fal.ai providers.
Text-to-speech (TTS) is handled by TTS adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI. The TTS adapters support:
- OpenAI: TTS-1, TTS-1-HD, and audio-capable GPT-4o models
- Gemini: Gemini 2.5 Flash TTS (experimental)
- fal.ai: Kokoro, ElevenLabs, MiniMax, Chatterbox, Dia, Orpheus, F5-TTS, VibeVoice, and more
import { generateSpeech } from '@tanstack/ai'
import { openaiSpeech } from '@tanstack/ai-openai'
// Generate speech from text (uses OPENAI_API_KEY from environment)
const result = await generateSpeech({
adapter: openaiSpeech('tts-1'),
text: 'Hello, welcome to TanStack AI!',
voice: 'alloy',
})
// result.audio contains base64-encoded audio data
console.log(result.format) // 'mp3'
console.log(result.contentType) // 'audio/mpeg'
import { generateSpeech } from '@tanstack/ai'
import { geminiSpeech } from '@tanstack/ai-gemini'
// Generate speech from text (uses GOOGLE_API_KEY or GEMINI_API_KEY from environment)
const result = await generateSpeech({
adapter: geminiSpeech('gemini-3.1-flash-tts-preview'),
text: 'Hello from Gemini TTS!',
})
console.log(result.audio) // Base64 encoded audio
fal.ai offers a broad selection of TTS models, including Google's gemini-3.1-flash-tts, ElevenLabs v3, MiniMax 2.6 HD, and Kokoro's multilingual voices. Pass the model ID as a string literal for fully typed modelOptions.
import { generateSpeech } from '@tanstack/ai'
import { falSpeech } from '@tanstack/ai-fal'
// Google Gemini 3.1 Flash TTS — 80+ languages, expressive audio tags
const result = await generateSpeech({
adapter: falSpeech('fal-ai/gemini-3.1-flash-tts'),
text: '[warm, enthusiastic] Welcome to TanStack AI!',
voice: 'Kore',
})
// Kokoro multilingual
const result = await generateSpeech({
adapter: falSpeech('fal-ai/kokoro/american-english'),
text: 'Hello from fal!',
voice: 'af_heart',
speed: 1.0,
})
console.log(result.audio) // Base64 encoded audio
console.log(result.format) // e.g. "wav"
// ElevenLabs v3 with model-specific options
const result = await generateSpeech({
adapter: falSpeech('fal-ai/elevenlabs/tts/eleven-v3'),
text: 'Welcome to TanStack AI.',
// The fal adapter maps top-level `voice`/`speed` into the model input;
// `modelOptions` is reserved for model-specific keys.
voice: 'Rachel',
modelOptions: {
stability: 0.5,
},
})
All TTS adapters support these common options:
| Option | Type | Description |
|---|---|---|
| text | string | The text to convert to speech (required) |
| voice | string | The voice to use for generation |
| format | string | Output audio format (e.g., "mp3", "wav") |
OpenAI provides several distinct voices:
| Voice | Description |
|---|---|
| alloy | Neutral, balanced voice |
| echo | Warm, conversational voice |
| fable | Expressive, storytelling voice |
| onyx | Deep, authoritative voice |
| nova | Friendly, upbeat voice |
| shimmer | Clear, gentle voice |
| ash | Calm, measured voice |
| ballad | Melodic, flowing voice |
| coral | Bright, energetic voice |
| sage | Wise, thoughtful voice |
| verse | Poetic, rhythmic voice |
| Format | Description |
|---|---|
| mp3 | MP3 audio (default) |
| opus | Opus audio (good for streaming) |
| aac | AAC audio |
| flac | FLAC audio (lossless) |
| wav | WAV audio (uncompressed) |
| pcm | Raw PCM audio |
const result = await generateSpeech({
adapter: openaiSpeech('tts-1-hd'),
text: 'High quality speech synthesis',
voice: 'nova',
format: 'mp3',
speed: 1.0, // top-level option, 0.25 to 4.0
modelOptions: {
instructions: 'Speak in a calm, measured tone', // GPT-4o audio models only
},
})
Note: voice, format, and speed are top-level generateSpeech options, not modelOptions keys.
| Option | Type | Description |
|---|---|---|
| instructions | string | Voice style instructions (GPT-4o audio models only) |
Note: The instructions and stream_format options are only available with the gpt-4o-audio-preview model, not with tts-1 or tts-1-hd.
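Because instructions is honored only by gpt-4o-audio-preview, code that receives the model name at runtime may want to drop the option for other models. The following is a minimal sketch, assuming a hypothetical buildModelOptions helper (not part of TanStack AI) and the model restriction stated in the note above:

```typescript
// Hypothetical helper, not a TanStack AI export.
// The set of models below is taken from the note in this guide.
const INSTRUCTION_CAPABLE_MODELS = new Set<string>(['gpt-4o-audio-preview'])

interface OpenAISpeechModelOptions {
  instructions?: string
}

// Returns modelOptions that include `instructions` only when the chosen
// model is listed as supporting it; otherwise returns an empty object.
function buildModelOptions(
  model: string,
  instructions?: string,
): OpenAISpeechModelOptions {
  if (instructions !== undefined && INSTRUCTION_CAPABLE_MODELS.has(model)) {
    return { instructions }
  }
  return {}
}
```

The returned object can then be passed as modelOptions to generateSpeech, so a tts-1 or tts-1-hd request never carries an option it does not support.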
The TTS result includes:
interface TTSResult {
id: string // Unique identifier for this generation
model: string // The model used
audio: string // Base64-encoded audio data
format: string // Audio format (e.g., "mp3")
contentType: string // MIME type (e.g., "audio/mpeg")
duration?: number // Duration in seconds (if available)
}
// Convert base64 to audio and play
function playAudio(result: TTSResult) {
const audioData = atob(result.audio)
const bytes = new Uint8Array(audioData.length)
for (let i = 0; i < audioData.length; i++) {
bytes[i] = audioData.charCodeAt(i)
}
const blob = new Blob([bytes], { type: result.contentType })
const url = URL.createObjectURL(blob)
const audio = new Audio(url)
audio.play()
// Clean up when done
audio.onended = () => URL.revokeObjectURL(url)
}
import { writeFile } from 'fs/promises'
async function saveAudio(result: TTSResult, filename: string) {
const audioBuffer = Buffer.from(result.audio, 'base64')
await writeFile(filename, audioBuffer)
console.log(`Saved to ${filename}`)
}
// Usage
const result = await generateSpeech({
adapter: openaiSpeech('tts-1'),
text: 'Hello world!',
})
await saveAudio(result, 'output.mp3')
TanStack AI provides React hooks and server-side streaming helpers to build full-stack text-to-speech with minimal boilerplate.
Server: Create an API route that wraps generateSpeech as a streaming response:
// routes/api/generate/speech.ts
import { generateSpeech, toServerSentEventsResponse } from '@tanstack/ai'
import { openaiSpeech } from '@tanstack/ai-openai'
import { createFileRoute } from '@tanstack/react-router'
export const Route = createFileRoute('/api/generate/speech')({
server: {
handlers: {
POST: async ({ request }) => {
const body = await request.json()
const { text, voice, format, model } = body.data
const stream = generateSpeech({
adapter: openaiSpeech(model ?? 'tts-1'),
text,
voice,
format,
stream: true,
})
return toServerSentEventsResponse(stream)
},
},
},
})
Client: Use the useGenerateSpeech hook with a connection adapter:
import { useGenerateSpeech, fetchServerSentEvents } from '@tanstack/ai-react'
function SpeechGenerator() {
const { generate, result, isLoading, error } = useGenerateSpeech({
connection: fetchServerSentEvents('/api/generate/speech'),
})
const playAudio = () => {
if (!result) return
const audioData = atob(result.audio)
const bytes = new Uint8Array(audioData.length)
for (let i = 0; i < audioData.length; i++) {
bytes[i] = audioData.charCodeAt(i)
}
const blob = new Blob([bytes], { type: result.contentType })
const url = URL.createObjectURL(blob)
const audio = new Audio(url)
audio.play()
audio.onended = () => URL.revokeObjectURL(url)
}
return (
<div>
<button
onClick={() => generate({ text: 'Hello, welcome to TanStack AI!' })}
disabled={isLoading}
>
{isLoading ? 'Generating...' : 'Generate Speech'}
</button>
{error && <p>Error: {error.message}</p>}
{result && <button onClick={playAudio}>Play Audio</button>}
</div>
)
}
For non-streaming usage with TanStack Start server functions:
// lib/server-functions.ts
import { createServerFn } from '@tanstack/react-start'
import { generateSpeech } from '@tanstack/ai'
import { openaiSpeech } from '@tanstack/ai-openai'
export const generateSpeechFn = createServerFn({ method: 'POST' })
.inputValidator((data: { text: string; voice?: string }) => data)
.handler(async ({ data }) => {
return generateSpeech({
adapter: openaiSpeech('tts-1'),
text: data.text,
voice: data.voice,
})
})
import { useGenerateSpeech } from '@tanstack/ai-react'
import { generateSpeechFn } from '../lib/server-functions'
function SpeechGenerator() {
const { generate, result, isLoading } = useGenerateSpeech({
fetcher: (input) => generateSpeechFn({ data: input }),
})
// ... same UI as above
}
For TanStack Start server functions that stream results, the fetcher receives type-safe input and returns an SSE Response, which the client parses automatically:
// lib/server-functions.ts
import { createServerFn } from '@tanstack/react-start'
import { generateSpeech, toServerSentEventsResponse } from '@tanstack/ai'
import { openaiSpeech } from '@tanstack/ai-openai'
export const generateSpeechStreamFn = createServerFn({ method: 'POST' })
.inputValidator((data: { text: string; voice?: string }) => data)
.handler(({ data }) => {
return toServerSentEventsResponse(
generateSpeech({
adapter: openaiSpeech('tts-1'),
text: data.text,
voice: data.voice,
stream: true,
}),
)
})
import { useGenerateSpeech } from '@tanstack/ai-react'
import { generateSpeechStreamFn } from '../lib/server-functions'
function SpeechGenerator() {
const { generate, result, isLoading } = useGenerateSpeech({
fetcher: (input) => generateSpeechStreamFn({ data: input }),
})
// ... same UI as above
}
The useGenerateSpeech hook accepts:
| Option | Type | Description |
|---|---|---|
| connection | ConnectionAdapter | Streaming transport (SSE, HTTP stream, custom) |
| fetcher | (input) => Promise<TTSResult \| Response> | Direct async function, or server function returning an SSE Response |
| onResult | (result) => TOutput \| null \| void | Callback when audio is generated. Optionally return a transformed value (see Result Transform) |
| onError | (error) => void | Callback on error |
| onProgress | (progress, message?) => void | Progress updates (0-100) |
And returns:
| Property | Type | Description |
|---|---|---|
| generate | (input: SpeechGenerateInput) => Promise<void> | Trigger generation |
| result | TOutput \| null | The result (or transformed result), or null |
| isLoading | boolean | Whether generation is in progress |
| error | Error \| undefined | Current error, if any |
| status | GenerationClientState | 'idle' \| 'generating' \| 'success' \| 'error' |
| stop | () => void | Abort the current generation |
| reset | () => void | Clear result, error, and return to idle |
The onResult callback can optionally return a transformed value that replaces the stored result. This is useful for converting raw API responses into a more convenient format for your components.
Transform behavior:
- Return a non-null value to replace the stored result with the transformed value
- Return null to keep the previous result unchanged (useful for filtering)
- Return nothing (void) to store the raw result as-is (backward compatible)
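The three rules above can be modeled as a small pure function. This is an illustrative sketch of the described behavior, not the hook's actual implementation; resolveStoredResult and its parameter names are hypothetical.

```typescript
// Illustrative model of the onResult transform rules described above.
// Not the real hook internals; the function name is hypothetical.
function resolveStoredResult<TRaw, TOut>(
  previous: TOut | TRaw | null,
  raw: TRaw,
  onResult?: (raw: TRaw) => TOut | null | void,
): TOut | TRaw | null {
  // No callback: the raw result is stored unchanged.
  if (!onResult) return raw

  // A callback that returns nothing yields `undefined` at runtime.
  const transformed = onResult(raw) as TOut | null | undefined

  if (transformed === undefined) return raw // void: store the raw result
  if (transformed === null) return previous // null: keep the previous result
  return transformed // non-null: store the transformed value
}

// Example: keep only the text length as the stored result.
const stored = resolveStoredResult<string, number>(null, 'hello', (raw) => raw.length)
```

Returning null is what makes filtering possible: a callback can inspect each raw result and decline to overwrite the one already on screen.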
Example: Convert base64 audio to a playable Audio element
import { useGenerateSpeech, fetchServerSentEvents } from '@tanstack/ai-react'
function SpeechPlayer() {
const { generate, result, isLoading } = useGenerateSpeech({
connection: fetchServerSentEvents('/api/generate/speech'),
onResult: (raw) => {
const audioData = atob(raw.audio)
const bytes = new Uint8Array(audioData.length)
for (let i = 0; i < audioData.length; i++) {
bytes[i] = audioData.charCodeAt(i)
}
const blob = new Blob([bytes], { type: raw.contentType })
const url = URL.createObjectURL(blob)
return {
audio: new Audio(url),
duration: raw.duration,
}
},
})
return (
<div>
<button
onClick={() => generate({ text: 'Hello world!', voice: 'alloy' })}
disabled={isLoading}
>
Generate
</button>
{result && (
<button onClick={() => result.audio.play()}>
Play Audio
</button>
)}
</div>
)
}
TypeScript automatically infers the result type from your onResult return value, so no explicit generic parameter is needed. In this example, result is inferred as { audio: HTMLAudioElement; duration?: number } | null, so result.audio.play() is fully type-safe.
| Model | Quality | Speed | Use Case |
|---|---|---|---|
| tts-1 | Standard | Fast | Real-time applications |
| tts-1-hd | High | Slower | Production audio |
| gpt-4o-audio-preview | Highest | Variable | Advanced voice control |
| Model | Status | Notes |
|---|---|---|
| gemini-2.5-flash-preview-tts | Experimental | May require Live API for full features |
try {
const result = await generateSpeech({
adapter: openaiSpeech('tts-1'),
text: 'Hello!',
})
} catch (error) {
// `error` is typed `unknown` under strict TypeScript, so narrow it before reading `message`
const message = error instanceof Error ? error.message : String(error)
if (message.includes('exceeds maximum length')) {
console.error('Text is too long (max 4096 characters)')
} else if (message.includes('Speed must be between')) {
console.error('Invalid speed value')
} else {
console.error('TTS error:', message)
}
}
Tip: To trigger speech generation from your frontend with loading states, see Generation Hooks.
Debugging: When a TTS request fails or produces unexpected output, pass debug: true on generateSpeech({...}) to log the outgoing request, every raw provider chunk, and any caught error. See Debug Logging.
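Some of the failures matched in the error-handling example, such as over-long text or an out-of-range speed, can be caught before a request is sent. The sketch below is a hypothetical pre-flight check, not part of TanStack AI. It uses the OpenAI limits quoted in this guide (4096 characters, speed from 0.25 to 4.0) and assumes the provider counts characters the same way as JavaScript's string length.

```typescript
// Limits quoted in this guide for OpenAI TTS; the helper is hypothetical.
const MAX_TEXT_LENGTH = 4096
const MIN_SPEED = 0.25
const MAX_SPEED = 4.0

interface SpeechInputToCheck {
  text: string
  speed?: number
}

// Returns a list of human-readable problems; an empty list means the
// input passed every check performed here.
function validateSpeechInput(input: SpeechInputToCheck): string[] {
  const problems: string[] = []

  if (input.text.trim().length === 0) {
    problems.push('Text is empty')
  }
  if (input.text.length > MAX_TEXT_LENGTH) {
    problems.push(
      `Text is too long (${input.text.length} > ${MAX_TEXT_LENGTH} characters)`,
    )
  }
  if (input.speed !== undefined) {
    const outOfRange = input.speed < MIN_SPEED || input.speed > MAX_SPEED
    if (!Number.isFinite(input.speed) || outOfRange) {
      problems.push(`Speed must be between ${MIN_SPEED} and ${MAX_SPEED}`)
    }
  }
  return problems
}
```

A caller could show these messages in a form and skip the network round trip entirely. The try/catch around generateSpeech is still needed for failures this check cannot see.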
The TTS adapters use the same environment variables as other adapters:
- OpenAI: OPENAI_API_KEY
- Gemini: GOOGLE_API_KEY or GEMINI_API_KEY
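An application can check for these variables at startup and fail with a clear message instead of at the first request. The helper below is a hypothetical sketch, not a TanStack AI API. The order in which the Gemini adapter itself checks its two variables is not stated here, so the sketch simply takes the first name in the list that is set.

```typescript
// Hypothetical startup check; `requireApiKey` is not a TanStack AI export.
type Env = Record<string, string | undefined>

// Returns the first non-blank value among `names`, or throws an error
// that lists every variable name that would have been accepted.
function requireApiKey(
  env: Env,
  provider: string,
  names: readonly string[],
): string {
  for (const name of names) {
    const value = env[name]
    if (value !== undefined && value.trim() !== '') {
      return value
    }
  }
  throw new Error(`Missing API key for ${provider}: set ${names.join(' or ')}`)
}

// Example with an in-memory environment object.
const geminiKey = requireApiKey(
  { GEMINI_API_KEY: 'example-key' },
  'Gemini',
  ['GOOGLE_API_KEY', 'GEMINI_API_KEY'],
)
```

In a Node.js server, process.env can be passed as the first argument.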
For production use or when you need explicit control:
import { createOpenaiSpeech } from '@tanstack/ai-openai'
import { createGeminiSpeech } from '@tanstack/ai-gemini'
// OpenAI
const openaiAdapter = createOpenaiSpeech('tts-1', 'your-openai-api-key')
// Gemini
const geminiAdapter = createGeminiSpeech('gemini-3.1-flash-tts-preview', 'your-google-api-key')
- Text Length: OpenAI TTS supports up to 4096 characters per request. For longer content, split into chunks.
- Voice Selection: Choose voices appropriate for your content. For example, use onyx for authoritative content and nova for friendly interactions.
- Format Selection: Use mp3 for general use, opus for streaming, and wav for further processing.
- Caching: Cache generated audio to avoid regenerating the same content.
- Error Handling: Always handle errors gracefully, especially for user-facing applications.
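The advice to split long content into chunks can be sketched as follows. This is one possible approach, not a TanStack AI utility: the hypothetical splitTextForTts helper groups whole sentences into chunks of at most 4096 characters and hard-slices any single sentence that exceeds the limit on its own. Whitespace between sentences is normalized to a single space.

```typescript
// Hypothetical helper: splits text into chunks no longer than `maxLength`,
// preferring sentence boundaries (., ! or ? followed by whitespace).
function splitTextForTts(text: string, maxLength = 4096): string[] {
  if (!Number.isInteger(maxLength) || maxLength <= 0) {
    throw new RangeError('maxLength must be a positive integer')
  }

  // Collect sentences by scanning for terminator + whitespace boundaries.
  const sentences: string[] = []
  const boundary = /[.!?]+\s+/g
  let start = 0
  let match: RegExpExecArray | null
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length
    sentences.push(text.slice(start, end).trim())
    start = end
  }
  sentences.push(text.slice(start).trim())

  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    let piece = sentence
    if (piece === '') continue

    // A sentence longer than the limit is sliced at fixed offsets.
    // This may cut a word in half; it is a last resort.
    while (piece.length > maxLength) {
      if (current !== '') {
        chunks.push(current)
        current = ''
      }
      chunks.push(piece.slice(0, maxLength))
      piece = piece.slice(maxLength)
    }
    if (piece === '') continue

    const candidate = current === '' ? piece : `${current} ${piece}`
    if (candidate.length <= maxLength) {
      current = candidate
    } else {
      chunks.push(current)
      current = piece
    }
  }
  if (current !== '') chunks.push(current)
  return chunks
}
```

Each chunk can then be passed to generateSpeech in turn. Joining the resulting audio segments depends on the output format and is outside the scope of this sketch.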