matrix
Provider readiness
Gemini 3.1 Flash Live
gemini-live
Highest-priority realtime candidate; available through the new AI Studio key.
Run a real audio call smoke test and measure LiveKit 3.1 compatibility limits.
Gemini 3.1 Flash Lite
gemini-lite
Fast practical baseline when we need external STT/TTS control.
Use for voice-quality A/B when TTS is the variable.
Gemini 3.1 Flash TTS
gemini-tts-31
Controllable Gemini TTS path: 30 voices, prompt/director-note style tuning, Russian supported, streaming PCM chunks via Cloud TTS.
Keep Leda/fast in favorites and as the saved field-test default; retest Leda vs Kore after the classic per-call override fix.
xAI Grok Voice Think Fast
xai-voice
Native realtime voice model with tools; the interesting xAI brain candidate, not just REST TTS.
Run a shorter human xAI call focused on pauses/tool continuation before any baseline promotion.
OpenAI GPT Realtime 2
openai-realtime-2
OpenAI finally has a materially newer realtime voice model: reasoning voice agent, 128k realtime context, stronger tools, preambles, and native speed/voice controls.
Run room smoke with marin/cedar and compare tool continuation, TTFT, VAD and Russian conversational feel against Gemini Live.
xAI Grok TTS
xai-tts
Six voices surfaced in API; RU MP3 and expressive tags work. Bridge now uses 24 kHz PCM plus LiveKit resampling instead of unsafe direct 8 kHz phone codecs.
Keep as voice-color candidate; compare against Gemini Cloud TTS and Eleven before using broadly.
xAI Grok STT
xai-stt
MP3/WAV batch STT is fast and usable; observed transient 500s and bad raw μ-law Russian behavior.
Benchmark real phone corpus against Deepgram Nova 3 and Google Chirp 3; test streaming separately.
Deepgram Nova 3
deepgram
Strong multilingual fallback, useful when Russian mixes with product terms.
Re-run benchmark on phone-call corpus.
Google Chirp 3 STT
google-stt
Dedicated RU path via Speech-to-Text V2, good second opinion.
Keep as classic baseline and compare noisy PSTN calls.
ElevenLabs Flash
eleven
Production-safe voice quality baseline; VIA already knows this territory.
Investigate v3/Expressive Mode boundary separately.
Cartesia Sonic 3
cartesia
Fast classic-mode TTS with curated ES/EN voices.
Use as latency reference, not as the only naturalness candidate.
Hume Octave 2
hume
Most interesting emotional/naturalness candidate from the research pass.
Build sample pack before LiveKit integration work.
MiniMax Speech 2.8
minimax
Multilingual expressive TTS with tags; plausible VIA voice contender.
Generate RU/ES samples and compare phone-path artifacts.
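Several next steps in the matrix call for re-running STT benchmarks on a phone-call corpus. One common way to score such a run is word error rate (WER). The sketch below is only an illustration under that assumption: the matrix does not say which metric the benchmark uses, and the function names `word_error_rate` and `mean_wer` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: transcripts are compared case-insensitively and split on
    whitespace; a real benchmark would likely also normalize punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(references: list[str], hypotheses: list[str]) -> float:
    """Average per-utterance WER over a corpus of paired transcripts."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need equally sized, non-empty transcript lists")
    pairs = zip(references, hypotheses)
    return sum(word_error_rate(r, h) for r, h in pairs) / len(references)
```

Computing `mean_wer` once per provider id (for example `xai-stt`, `deepgram`, `google-stt`) over the same reference transcripts gives one comparable number per STT candidate.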
switch map
Runtime modes
Gemini Live baseline
google-live
google-realtime / gemini-3.1-flash-live-preview via AI Studio
Leda, ru-RU, STT/TTS = none
native tool_response
Fast live conversations, tool checks, phone/web smoke tests.
Classic composable stack
classic
google / gemini-3.1-flash-lite-preview
Google Chirp 3 STT + Gemini Cloud TTS Leda/ru-RU, 1.2x
AgentSession generateReply
Voice A/B tests, STT benchmark, fallback if realtime behaves oddly.
xAI Grok Voice Think Fast
xai-realtime
xai-realtime / grok-voice-think-fast-1.0
ara/eve/rex/sal/leo/una, STT/TTS = none
official LiveKit xAI plugin
Check whether xAI can become a competitor to Gemini Live for VIA.
OpenAI GPT Realtime 2
openai-realtime-2
openai-realtime / gpt-realtime-2
marin/cedar/etc, STT/TTS = none, speed via native audio output
official LiveKit OpenAI plugin
Check whether OpenAI has caught up with Gemini Live on tool use, long context, and speech naturalness.
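The switch map above can be read as a small lookup table keyed by mode id. The sketch below is one hypothetical way to encode it, not the project's actual configuration: the `RuntimeMode` class, its field names, and the `resolve_mode` and `is_realtime` helpers are assumptions, and only the ids, provider/model strings, and voice names come from the map. The OpenAI voice list is left partial because the map itself ends it with "etc".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeMode:
    """One row of the runtime switch map (field names are hypothetical)."""
    mode_id: str
    provider: str
    model: str
    voices: tuple[str, ...]
    stt: Optional[str]  # None mirrors "STT/TTS = none" in the map
    tts: Optional[str]
    tool_path: str


_ROWS = (
    RuntimeMode("google-live", "google-realtime",
                "gemini-3.1-flash-live-preview", ("Leda",),
                None, None, "native tool_response"),
    RuntimeMode("classic", "google",
                "gemini-3.1-flash-lite-preview", ("Leda",),
                "Google Chirp 3 STT", "Gemini Cloud TTS",
                "AgentSession generateReply"),
    RuntimeMode("xai-realtime", "xai-realtime",
                "grok-voice-think-fast-1.0",
                ("ara", "eve", "rex", "sal", "leo", "una"),
                None, None, "official LiveKit xAI plugin"),
    RuntimeMode("openai-realtime-2", "openai-realtime",
                "gpt-realtime-2", ("marin", "cedar"),  # map lists more ("etc")
                None, None, "official LiveKit OpenAI plugin"),
)

MODES = {row.mode_id: row for row in _ROWS}


def resolve_mode(mode_id: str, fallback: str = "classic") -> RuntimeMode:
    """Return the requested mode, or the fallback for an unknown id.

    Assumption: unknown ids fall back to the classic stack, following the
    map's note that classic is the fallback when realtime misbehaves.
    """
    return MODES.get(mode_id, MODES[fallback])


def is_realtime(mode: RuntimeMode) -> bool:
    """A mode counts as realtime when it uses no external STT or TTS."""
    return mode.stt is None and mode.tts is None
```

With this shape, a per-call override amounts to passing a different `mode_id` to `resolve_mode`, and `is_realtime` tells the caller whether separate STT and TTS components need to be wired up at all.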