xAI shipped Grok Voice the same week OpenAI rolled out GPT-Realtime-2, and developers picking a voice model in 2026 now have two credible flagship options. Both ship as speech-to-speech models with reasoning, both run over WebSocket, both support tool use, and both speak with humanlike inflection. The decision hinges on five concrete trade-offs: latency, price, voice catalog, reasoning depth, and whether you need SIP, image input, or voice cloning.
This post puts them side by side, with the numbers, the API surfaces, and the one-line recommendation for every common voice-agent shape.
For the standalone guides, see How to use GPT-Realtime-2 and How to use Grok Voice for free. To stress-test either model under load, Apidog handles WebSocket sessions natively.
TL;DR
- Grok Voice (grok-voice-think-fast-1.0) wins on latency (<1 second time-to-first-audio, ~5x faster than the closest competitor by xAI's own claim), free console access, voice catalog (80+ presets, 28 languages), and voice cloning (1-minute sample, ready in under 2 minutes).
- GPT-Realtime-2 wins on reasoning depth (GPT-5-class, 5 reasoning levels), image input (live screenshot understanding), and production maturity (native SIP, MCP, longer track record). Its context window is 128k tokens.
- Pricing for paid use: GPT-Realtime-2 is $32/$64 per 1M audio tokens (input/output). Grok Voice has no per-minute audio charge on console; you pay only for Grok 4.3 reasoning at $1.25/$2.50 per 1M tokens (input/output).
- Pick Grok Voice for high-volume, low-latency consumer apps and any voice cloning use case.
- Pick GPT-Realtime-2 for complex reasoning, multimodal voice agents, and locked-down call-center deployments.
- Build the integration once with Apidog, then swap models with one URL change.
The two models in one table
| Capability | Grok Voice (grok-voice-think-fast-1.0) | GPT-Realtime-2 |
|---|---|---|
| Time to first audio | < 1 second (xAI claim: ~5x faster than nearest) | sub-second on low reasoning, slower on high/xhigh |
| Reasoning levels | low / medium / high (Grok 4.3 underlying) | minimal / low / medium / high / xhigh |
| Underlying intelligence | Grok 4.3 (Intelligence Index 53) | GPT-5-class |
| Context window | 1,000,000 tokens (Grok 4.3) | 128,000 tokens |
| Preset voices | 80+ (5 named voice-agent personas: Eve, Ara, Rex, Sal, Leo) | 10 (2 new: Cedar, Marin; 8 retuned) |
| Languages (TTS) | 28 | not officially counted |
| Languages (STT) | 25 | inherited from GPT-Realtime |
| Voice cloning | Yes, Custom Voices, 1-min sample, <2-min training | No |
| Image input | No (text + audio only) | Yes (photo, screenshot) |
| Remote MCP servers | Tool use yes; native MCP not advertised | Yes (MCP tools executed by API) |
| Native SIP / phone calling | Bring your own SIP provider | Yes (?call_id={call_id} endpoint) |
| Audio formats | PCM16, MP3, μ-law | PCM16, G.711 μ-law, A-law |
| Pricing model | Free on console for voice; pay only for Grok 4.3 reasoning ($1.25/$2.50 per 1M) | $32/1M audio in, $64/1M audio out, $4/$24 per 1M text |
| Compliance | SOC 2 Type II, HIPAA-eligible (BAA), GDPR | SOC 2, GDPR (per OpenAI Enterprise) |
Latency: Grok wins, by a wide margin
xAI’s claim that grok-voice-think-fast-1.0 is “nearly 5 times faster than the closest competitor” comes with their own benchmarks, so treat the multiplier with caution. The directional finding holds in independent testing: Grok’s time-to-first-audio sits comfortably under one second, while GPT-Realtime-2 lands in the 800ms–1500ms band depending on reasoning level.
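Time-to-first-audio is easy to measure yourself from a client-side event log. The following is a minimal sketch, assuming you timestamp every event your client sends or receives; the two event names are hypothetical placeholders, so substitute whatever each API actually emits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    t_ms: float  # client-side timestamp in milliseconds
    type: str    # event name as logged by the client


# Hypothetical event names, not taken from either vendor's documentation.
USER_TURN_END = "input_audio.committed"
FIRST_AUDIO = "response.audio.delta"


def time_to_first_audio(events: Iterable[Event]) -> Optional[float]:
    """Milliseconds between the end of the user's turn and the first audio chunk."""
    turn_end: Optional[float] = None
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.type == USER_TURN_END:
            turn_end = ev.t_ms
        elif ev.type == FIRST_AUDIO and turn_end is not None:
            return ev.t_ms - turn_end
    return None  # no completed turn followed by audio in this log


log = [
    Event(0.0, "session.created"),
    Event(4200.0, USER_TURN_END),
    Event(4850.0, FIRST_AUDIO),
    Event(4900.0, FIRST_AUDIO),
]
print(time_to_first_audio(log))  # 650.0
```

Measuring on the client rather than trusting server timings keeps network round-trip in the number, which is what the caller actually hears.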
Why it matters: in a phone call, the difference between 600ms and 1200ms is the difference between “the agent feels alive” and “the agent feels like a bot.” Latency is the single dimension users feel most.
Recommendation: if your app is consumer-facing and the user has a phone in their hand, Grok Voice’s latency advantage is worth the trade against deeper reasoning.
Pricing: not the same shape
This is the one section where comparing apples to apples requires care.
GPT-Realtime-2 prices voice as a token meter. Audio input is $32 per 1M tokens, audio output is $64 per 1M tokens. One second of audio is roughly 50 tokens, so a 5-minute conversation with balanced turn-taking burns somewhere around 30,000 tokens, or roughly $1.50 in audio I/O. Cached input is billed about 80x lower, which helps when the system prompt is stable.
Grok Voice has no per-minute or per-token charge on the xAI Console for TTS, STT, voice agent, or Custom Voices. You only pay for Grok 4.3 reasoning at $1.25 per 1M input tokens and $2.50 per 1M output tokens. Reasoning tokens are roughly an order of magnitude fewer than audio tokens for the same conversation, so the same 5-minute call comes in under $0.10.
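The arithmetic behind those two figures can be sketched in a few lines. This takes the 30,000-token estimate above as given; the even input/output split and the one-tenth reasoning-token ratio are assumptions made for illustration, not vendor numbers.

```python
PER_MILLION = 1_000_000


def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Cost in USD given token counts and per-1M-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / PER_MILLION


# Estimate from the text for a 5-minute call (assumed to split evenly in/out).
audio_tokens = 30_000
gpt_cost = cost_usd(audio_tokens // 2, audio_tokens // 2, 32.0, 64.0)

# Assumption: reasoning tokens are one tenth of the audio tokens, split evenly.
reasoning_tokens = audio_tokens // 10
grok_cost = cost_usd(reasoning_tokens // 2, reasoning_tokens // 2, 1.25, 2.50)

print(f"GPT-Realtime-2 audio I/O: ${gpt_cost:.2f}")  # $1.44
print(f"Grok Voice reasoning:     ${grok_cost:.4f}")
```

Under these assumptions the GPT-Realtime-2 call lands at $1.44, in line with the rough $1.50 quoted, and the Grok Voice call stays well under $0.10. Swap in your own measured token counts before drawing conclusions.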
Recommendation: for high-volume consumer apps where unit economics matter (think 10,000+ minutes/day), Grok Voice is materially cheaper. For low-volume, high-stakes flows (sales calls, regulated support), the price gap is small enough that reasoning quality decides.
For the full Grok 4.3 pricing breakdown, see How to use the Grok 4.3 API. For OpenAI’s pricing line, see GPT-5.5 pricing.
Reasoning depth: OpenAI wins
GPT-Realtime-2 is the first speech-to-speech model OpenAI describes as “GPT-5-class.” On Big Bench Audio it scored 96.6% (up from 81.4% on the prior model), and on Audio MultiChallenge it scored 48.5% (up from 34.7%). Five reasoning levels (minimal through xhigh) let you scale latency against quality on a per-request basis.
Grok Voice runs Grok 4.3 underneath. Grok 4.3 scored 53 on the Artificial Analysis Intelligence Index, ranking 10th of 146 models globally. It is strong, particularly on agentic tasks (300 Elo points up vs Grok 4.20 on GDPval-AA), but the speech-to-speech reasoning tier is not yet at GPT-Realtime-2’s level on the published benchmarks.
Recommendation: if the agent has to disambiguate intent, dispatch across many tools, or reason over long context mid-conversation, GPT-Realtime-2 is the safer choice. For straightforward support and sales scripts, the gap is small enough that latency wins.
Voice catalog: Grok wins on count, OpenAI on consistency
Grok ships 80+ preset voices spanning 28 languages. The voice agent itself uses a curated set of five personas (Eve, Ara, Rex, Sal, Leo), but the broader TTS surface lets you pick from a much larger library. Grok also offers voice cloning, which has no equivalent on OpenAI’s side.
GPT-Realtime-2 ships 10 voices total: two new flagships (Cedar, Marin) exclusive to the Realtime API, plus eight retuned legacy voices (alloy, ash, ballad, coral, echo, sage, shimmer, verse). The library is smaller, but the consistency across voices is high; they all use the same audio stack, and intonation control behaves the same on each.
Recommendation: if you need a specific voice (a celebrity-adjacent timbre, a regional accent, a custom brand voice), Grok wins. If you need any high-quality voice and care about predictable behavior, GPT-Realtime-2 is fine.
Voice cloning: only Grok ships it
xAI’s Custom Voices clones a voice from about a minute of clean speech and returns a voice_id in under two minutes. The same voice_id works across the TTS endpoint and the voice agent. OpenAI does not currently expose voice cloning on the Realtime API.
This is a one-sided category. If you need cloning, the choice is made.
Image input: only OpenAI ships it
GPT-Realtime-2 accepts text, audio, and images as input. You can attach a screenshot or a photo to a user turn and ask the agent to describe it out loud, then keep talking. The use cases (field support, voice-driven QA, accessibility narration) are interesting and Grok cannot match them today.
This is also one-sided. If your agent needs to see what the user is looking at, OpenAI is the choice.
For a deeper look at OpenAI’s vision stack, see How to use the GPT-Image-2 API.
SIP and phone integration: OpenAI ships native, Grok needs a bridge
OpenAI’s Realtime API has native SIP support. Point a SIP trunk at OpenAI’s gateway and inbound calls open a WebSocket session at wss://api.openai.com/v1/realtime?call_id={call_id}. You skip the bridge layer entirely.
Grok Voice supports μ-law output for telephony, but you bring your own SIP provider (Twilio, Telnyx, Plivo) and run the bridge yourself. It works, but it costs more engineering.
Recommendation: if you are building a call-center agent and want the fastest path from key to call, GPT-Realtime-2 is the lighter integration.
MCP and tool use
Both models support function calling. The split:
- GPT-Realtime-2 supports remote MCP servers natively. Configure a server URL and an allow-list of tools, and the Realtime API itself executes the calls. Your code never round-trips through the function-call event loop.
- Grok Voice supports function calling and ships a built-in web_search tool. MCP is not advertised as a first-class primitive yet.
For voice agents that pull from a fifty-endpoint tool catalog (think a banking agent), the MCP integration matters; you want the API to dispatch tools without your server in the hot path. For agents with five or fewer tools, plain function calling on either model is fine.
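To make the difference concrete, here is a rough sketch of the two tool configurations as session.update payloads. The payload shapes are assumptions modeled on common Realtime-style conventions and should be checked against each vendor's current reference; the server label, URL, tool names, and stubbed balance are all hypothetical.

```python
import json

# Remote MCP: the API calls the server itself (assumed payload shape).
mcp_session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "mcp",
            "server_label": "banking",                    # hypothetical
            "server_url": "https://mcp.example.com/sse",  # hypothetical
            "allowed_tools": ["get_balance", "list_transactions"],
        }]
    },
}

# Plain function calling: your code executes the tool (assumed payload shape).
function_session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_balance",
            "description": "Return the balance for an account.",
            "parameters": {
                "type": "object",
                "properties": {"account_id": {"type": "string"}},
                "required": ["account_id"],
            },
        }]
    },
}


def handle_function_call(name: str, arguments_json: str) -> str:
    """The round trip MCP avoids: run the tool locally, return JSON output."""
    args = json.loads(arguments_json)
    if name == "get_balance":
        result = {"account_id": args["account_id"], "balance": 1250.00}  # stub
    else:
        result = {"error": f"unknown tool: {name}"}
    return json.dumps(result)


print(handle_function_call("get_balance", '{"account_id": "a1"}'))
```

With plain function calling, every tool in the catalog needs a branch like the one in handle_function_call and a hop through your server, which is the cost that grows with a fifty-endpoint catalog.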
If you are testing MCP servers separately, see MCP server testing in Apidog.
The one-line picks
- Consumer voice app, high volume, latency-critical: Grok Voice.
- Voice cloning required (custom brand voice, character voices): Grok Voice.
- Multilingual TTS at scale (>10 languages): Grok Voice.
- Voice agent that needs to see screenshots: GPT-Realtime-2.
- Call-center deployment with SIP: GPT-Realtime-2.
- Multi-step reasoning agent with 50+ tools: GPT-Realtime-2 (MCP).
- Long context conversations (50k+ tokens of history): GPT-Realtime-2 (128k context, but Grok 4.3’s 1M context is larger if you can hold the audio token cost).
- Cheapest production voice agent: Grok Voice on console.
- Most reliable for benchmark-heavy reasoning: GPT-Realtime-2 with xhigh reasoning.
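The picks above can be encoded as a small decision function, for example to drive a runtime router. This is one possible reading of the list, not a definitive rule set: the field names, the 50-tool threshold as a hard cutoff, and the priority order are all assumptions.

```python
from dataclasses import dataclass

GROK = "grok-voice-think-fast-1.0"
GPT = "gpt-realtime-2"


@dataclass(frozen=True)
class AgentNeeds:
    voice_cloning: bool = False
    image_input: bool = False
    native_sip: bool = False
    deep_reasoning: bool = False
    tool_count: int = 0


def pick_model(needs: AgentNeeds) -> str:
    """Map hard requirements to a model; default to the cheaper, faster one."""
    if needs.voice_cloning and needs.image_input:
        # Each capability is one-sided, so no single model covers both.
        raise ValueError("cloning and image input need different models")
    if needs.voice_cloning:
        return GROK
    if (needs.image_input or needs.native_sip
            or needs.deep_reasoning or needs.tool_count >= 50):
        return GPT
    return GROK  # latency- and cost-driven default


print(pick_model(AgentNeeds(native_sip=True)))  # gpt-realtime-2
print(pick_model(AgentNeeds()))                 # grok-voice-think-fast-1.0
```

Hard requirements are checked first because they leave no choice; latency and cost only decide when nothing else forces the answer.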
How to test both before you commit
The smart move is not to pick one, then port. The smart move is to build against both for a week and measure.
The pattern we run:
- Build a fixture conversation. A 10-turn dialogue with one tool call, one disambiguation, and one long answer. Record real user audio for the turns.
- Script it once in Apidog. WebSocket request, JSON message sequence, environment variables for both XAI_API_KEY and OPENAI_API_KEY.
- Swap the URL between runs. wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0 for one, wss://api.openai.com/v1/realtime?model=gpt-realtime-2 for the other.
- Capture the audio output and the token usage. Compare time-to-first-audio, total output duration, and total cost per run.
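If you prefer to script the swap in code as well, the idea reduces to a lookup table keyed by provider. A minimal sketch, using the two URLs and environment variable names from the steps above; the Bearer authorization header is an assumption for both providers, and no connection is opened here.

```python
import os
from typing import Mapping

TARGETS = {
    "grok": {
        "url": "wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0",
        "key_env": "XAI_API_KEY",
    },
    "openai": {
        "url": "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
        "key_env": "OPENAI_API_KEY",
    },
}


def connection_params(
    target: str, env: Mapping[str, str] = os.environ
) -> tuple[str, dict[str, str]]:
    """Return (websocket_url, headers) for one provider."""
    cfg = TARGETS[target]
    key = env.get(cfg["key_env"])
    if not key:
        raise RuntimeError(f"{cfg['key_env']} is not set")
    # Assumption: both providers accept a Bearer token in this header.
    return cfg["url"], {"Authorization": f"Bearer {key}"}


url, headers = connection_params("grok", {"XAI_API_KEY": "test-key"})
print(url)
```

Everything downstream of connection_params (the fixture audio, the message sequence, the metrics capture) can then stay identical between runs, which is what makes the comparison fair.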
Download Apidog to run the side-by-side. The collection format is portable, so the comparison artifact lives in version control.
FAQ
Can I use both models in the same app and route at runtime?
Yes. Both speak similar event shapes. You can route on user intent (cheap intent classifier picks Grok for casual, GPT-Realtime for complex) or on language (Grok for non-English at scale). The cost of the routing layer is small.
Which has better non-English voice quality?
Grok wins on language coverage (80+ voices, 28 languages on TTS). On the languages they both cover, real-world quality is close enough that you should test the specific languages you need.
Is GPT-Realtime-2 worth 10x the price for typical workloads?
Depends on what “typical” means. For a customer-support agent that answers FAQs, no. For a sales agent that has to read a CRM, dispatch tools, and recover from interruptions, the reasoning gap is worth it.
Does either model do real voice cloning of public figures?
No. Both vendors filter cloning to consented samples. Cloning a public figure without permission violates terms of service on both platforms.
How do I migrate from one to the other later?
The event names differ slightly, but the conversation shape is the same. Plan for a one-day port, mostly in the session.update payload and event handler names. If you build with Apidog for testing, the request collection ports cleanly.
Wrapping up
There is no universally correct answer between Grok Voice and GPT-Realtime-2. There is a correct answer per use case, and the five trade-offs (latency, price, voice catalog, reasoning depth, and integrations like SIP/MCP/image) make the call.
If you are building a fast consumer voice app and care about every millisecond, ship on Grok Voice and move on. If you are building a multimodal voice agent that needs to look at screens, dispatch fifty tools, and answer phone calls without a SIP bridge, ship on GPT-Realtime-2.
For everything else, build once on Apidog, test both for a week, and pick on data.



