We benchmarked twelve OpenAI models against the latency budget of a real voice agent — five runs each, warmed and cold, default and priority tier. The data shows a clear portfolio: which model belongs on the live turn, which belongs on bounded sub-tasks, and which belongs off the wire. Here is the methodology, the gotchas behind the headline numbers, and the per-turn routing pattern Cloudax uses in production. By Cloudax.
Live voice has an inflexible latency budget. A human-sounding turnaround, the gap between the caller finishing and the agent starting to speak, sits at roughly 700–900 milliseconds end-to-end. Once ASR, network round-trips, TTS and audio buffering are accounted for, the LLM call itself has roughly 400 milliseconds to produce its first token. That budget rules out most of the models in OpenAI's catalogue for the live turn.
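The budget arithmetic above can be sketched in a few lines. The per-component overheads below are illustrative placeholders chosen to land on the roughly 400 ms figure; they are assumptions, not measured Cloudax numbers, and the function names are hypothetical.

```python
# Illustrative latency budget for one voice turn.
# Every component value here is an assumed placeholder, not a measurement.
TURN_BUDGET_MS = 800  # midpoint of the 700-900 ms end-to-end target

OVERHEAD_MS = {
    "asr_finalisation": 150,     # assumed
    "network_round_trips": 100,  # assumed
    "tts_first_audio": 100,      # assumed
    "audio_buffering": 50,       # assumed
}


def llm_ttft_budget_ms(turn_budget_ms: int, overhead_ms: dict[str, int]) -> int:
    """Return the time left for the LLM's first token after fixed overheads."""
    return turn_budget_ms - sum(overhead_ms.values())


def fits_live_turn(ttft_ms: float, budget_ms: int) -> bool:
    """True if a measured TTFT fits inside the remaining LLM budget."""
    return ttft_ms <= budget_ms


budget = llm_ttft_budget_ms(TURN_BUDGET_MS, OVERHEAD_MS)
print(budget)  # 400
```

Swapping in your own measured overheads shows quickly how much room the model actually has on your stack.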
We benchmarked twelve OpenAI models across the nano, mini and full size classes, on both the default and priority service tiers, with five runs each, warmed and cold. Each run measured time-to-first-token (TTFT) rather than tokens-per-second throughput, because TTFT is what determines how soon the agent can start speaking. The agent prompt was a realistic Cloudax production system message, not a stripped-down hello-world.
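As a rough sketch of how TTFT can be measured, the helper below times the gap between starting a streamed request and receiving the first non-empty chunk. It is not the Cloudax harness: `fake_stream` is a hypothetical stand-in for a real streaming model call, so the example runs without network access, and `measure_ttft_ms` and `run_benchmark` are illustrative names.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Time from issuing the request to receiving the first non-empty chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty chunks that carry no content
            return (time.perf_counter() - started) * 1000.0
    raise RuntimeError("stream ended without producing any content")


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model call: waits, then yields text."""
    time.sleep(delay_s)
    yield "Hello"
    yield " there"


def run_benchmark(start_stream: Callable[[], Iterable[str]], runs: int = 5) -> list[float]:
    """Collect one TTFT sample per run."""
    return [measure_ttft_ms(start_stream) for _ in range(runs)]


samples = run_benchmark(fake_stream)
print([round(s) for s in samples])
```

In a real harness, `start_stream` would wrap the provider's streaming call and yield each text delta as it arrives. Cold and warmed runs would be recorded separately rather than pooled.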
The priority service tier delivers materially lower TTFT for the same model, but the delta is not uniform. For some smaller models the gap is small enough that the default tier is fine for the live turn. For the larger models, the priority tier is the difference between fitting the budget and missing it. The data shows where the line sits for each size class.
The practical pattern is not "pick one model and run everything on it" — it's per-turn routing. Cloudax uses the fastest nano-class model on the live turn (where the caller is waiting), a mini-class model on bounded sub-tasks where a slightly slower but more capable model can think for a moment, and the larger full-class models off the wire — for batch summaries, post-call analysis and decisioning that doesn't sit on the conversational critical path.
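A minimal sketch of per-turn routing, assuming each unit of work can be tagged with two flags: whether the caller is waiting on it, and whether it runs during the call. The model identifiers are placeholders for a size class, not real model names, and `TurnKind`, `classify` and `route` are hypothetical helpers rather than Cloudax's production code.

```python
from enum import Enum


class TurnKind(Enum):
    LIVE = "live"        # caller is waiting for the agent to speak
    SUBTASK = "subtask"  # bounded work during the call, off the critical path
    OFFLINE = "offline"  # batch summaries, post-call analysis, decisioning


# Placeholder identifiers standing in for a size class (assumed names).
ROUTES = {
    TurnKind.LIVE: "nano-class-model",
    TurnKind.SUBTASK: "mini-class-model",
    TurnKind.OFFLINE: "full-class-model",
}


def classify(caller_waiting: bool, during_call: bool) -> TurnKind:
    """Map a unit of work to a turn kind based on who is waiting for it."""
    if caller_waiting:
        return TurnKind.LIVE
    if during_call:
        return TurnKind.SUBTASK
    return TurnKind.OFFLINE


def route(caller_waiting: bool, during_call: bool) -> str:
    """Pick the model class for one unit of work."""
    return ROUTES[classify(caller_waiting, during_call)]


print(route(caller_waiting=True, during_call=True))    # nano-class-model
print(route(caller_waiting=False, during_call=True))   # mini-class-model
print(route(caller_waiting=False, during_call=False))  # full-class-model
```

Keeping the routing table separate from the classification logic means a model can be swapped for a size class without touching the call sites.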
Every voice agent stack should know four numbers: median TTFT under load, p95 TTFT under load, the network round-trip from your serverless platform to the model endpoint, and the time-to-first-audio out of TTS. Those four numbers — not benchmark accuracy — decide whether the conversation sounds human.
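The first two of those numbers can be computed from raw TTFT samples with the standard library. The sketch below uses a nearest-rank p95, and the sample values are made up for illustration, not benchmark results. `summarise` and `p95_nearest_rank` are hypothetical helper names.

```python
import math
import statistics


def p95_nearest_rank(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarise(ttft_ms: list[float], network_rtt_ms: float, tts_first_audio_ms: float) -> dict[str, float]:
    """Bundle the four numbers a voice stack should track."""
    return {
        "median_ttft_ms": statistics.median(ttft_ms),
        "p95_ttft_ms": p95_nearest_rank(ttft_ms),
        "network_rtt_ms": network_rtt_ms,
        "tts_first_audio_ms": tts_first_audio_ms,
    }


# Made-up TTFT samples in milliseconds, for illustration only.
ttft = [310, 295, 330, 420, 305, 300, 315, 298, 612, 301]
report = summarise(ttft, network_rtt_ms=40, tts_first_audio_ms=120)
print(report["median_ttft_ms"])  # 307.5
print(report["p95_ttft_ms"])     # 612
```

With very few samples, a nearest-rank p95 simply returns the slowest run, so a meaningful tail figure needs many more measurements taken under realistic load.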