From Clunky to Conversational: Zoom and Twilio CEOs See Voice AI Turning a Corner Despite Drive-Thru Stumbles

Voice AI's awkward phase is ending: latency is down, voices sound natural, and pilots are moving to production. Start with structured use cases and bake in security.

Published on: Oct 17, 2025
Voice AI's awkward phase is ending: What IT and product teams should do next

Voice AI has been available for years, but clunky delivery, awkward pauses, and accuracy gaps have held adoption back. At the Goldman Sachs Communacopia + Technology conference, the CEOs of Zoom and Twilio said those roadblocks are finally being addressed.

The takeaway: latency is dropping, voices sound natural, and multilingual agents are moving from demo to production. The next two to three years will bring a surge of voice-first solutions, especially where speed and scale matter.

Why customers increasingly prefer voice AI

Twilio CEO Khozema Shipchandler said internal data shows customers often prefer voice AI over humans, especially in healthcare. The reasons he gave: people feel there's an "asymmetry in knowledge between the two sides," and the awkward pauses disappear when a voice AI agent is on the other end of the call.

"You don't have these awkward pauses when you have these interactions take place between a human on one side and then a voice AI agent on the other side," Shipchandler said. That alone reduces friction and makes conversations feel more consistent.

Latency and naturalness are catching up

Historically, voice AI lagged in reaction time. Shipchandler said latency is now close to solved for many use cases, which unlocks realistic back-and-forth and barge-in support.

Zoom CEO Eric Yuan said Zoom has invested heavily in multilingual, natural-sounding agents. The goal is simple: remove awkward pauses and keep conversations flowing.

Reality check: accuracy still bites in noisy settings

Drive-through pilots at major chains struggled. Some restaurants, including McDonald's, have paused voice AI deployments after systems misheard spoken orders and stumbled on accents in real-world conditions. Reports indicate the tech isn't consistent enough yet for high-noise, high-variance environments.

Jack Gold, principal analyst at J. Gold Associates, put it plainly: voice is harder than text. Accents, regional variations, and context make English alone a moving target. Still, voice is a natural way to handle inquiries, and many customers won't type. In food delivery, around 35% of orders still happen by phone, which makes it prime territory for voice AI to speed things up.

Scale advantage: unlimited capacity

"The voice AI's capacity is unlimited," Shipchandler said. That matters for peaks, after-hours coverage, and servicing long-tail interactions without hiring sprees.

More people are now talking to ChatGPT instead of typing prompts, Yuan said. The behavior shift signals how fast voice-native experiences could spread.

Security: stop spoofing before you scale

Voice spoofing is a real risk. Shipchandler pointed to a practical path: identify a voice signature up front, then apply light verification in the background so customers can start fast and stay secure.

Zoom is engaging CISOs and publishing guidance on deploying AI safely. Treat verification, consent, and audit logs as table stakes.
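
One way to picture "start fast, verify in the background" is a small policy function that only adds friction when signals look off. The sketch below is illustrative, not Twilio's or Zoom's method: the score names (voice_match, call_risk), the thresholds, and the three actions are all assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"            # let the caller continue without friction
    STEP_UP = "step_up"            # ask for an extra factor, e.g. a one-time code
    HUMAN_REVIEW = "human_review"  # route the call to a person


def decide(voice_match: float, call_risk: float, sensitive: bool,
           match_threshold: float = 0.85, risk_threshold: float = 0.7) -> Action:
    """Pick a verification action from two hypothetical scores in [0, 1].

    voice_match: similarity between the caller and an enrolled voice signature.
    call_risk:   risk score for the call (spoofing signals, odd routing, etc.).
    sensitive:   True when the request touches money, credentials, or health data.
    All thresholds are placeholder values, not recommendations.
    """
    if call_risk >= risk_threshold:
        return Action.HUMAN_REVIEW
    if voice_match < match_threshold:
        return Action.STEP_UP
    # Sensitive requests get a stricter bar: moderate risk triggers a step-up.
    if sensitive and call_risk >= risk_threshold / 2:
        return Action.STEP_UP
    return Action.PROCEED


if __name__ == "__main__":
    print(decide(voice_match=0.95, call_risk=0.1, sensitive=False))
    print(decide(voice_match=0.60, call_risk=0.1, sensitive=False))
    print(decide(voice_match=0.95, call_risk=0.9, sensitive=False))
```

In a real deployment each decision would also be written to an audit log along with the consent status for voice enrollment; that part is left out here to keep the sketch short.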

Implementation playbook for IT and product teams

  • Start with structured use cases: appointment scheduling, order status, password resets, eligibility checks.
  • Design for interruption and handoff: support barge-in, fast escalation to humans, and clear error recovery flows.
  • Set a latency budget: aim for sub-300ms round-trip for natural turn-taking; cache prompts and use streaming TTS/ASR.
  • Choose must-have capabilities: accent coverage, multilingual support, DTMF fallback, redaction of PII, and analytics.
  • Measure what matters: containment rate, average handle time, deflection from agents, first-contact resolution, and CSAT.
  • Data flywheel: capture transcripts (with consent), label outcomes, improve prompts and call flows, and retrain regularly.
  • Security by default: voice signature checks, call-risk scoring, consent prompts, encryption, and audit trails.
  • Pilot in low-noise channels first: inbound support lines, callbacks, and scheduled outreach before tackling drive-through noise.
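
To make the latency budget concrete, here is a minimal sketch of a per-turn budget check. The stage names and the example timings are assumptions for illustration, not measurements from any vendor, and the stages are summed as if they ran strictly one after another. Streaming pipelines overlap stages, so treat the total as a pessimistic estimate.

```python
from dataclasses import dataclass, asdict

BUDGET_MS = 300.0  # the sub-300ms round-trip target from the playbook above


@dataclass
class TurnTiming:
    """Hypothetical per-stage latencies for one conversational turn, in ms."""
    asr_final_ms: float          # speech recognition settles on a transcript
    model_first_token_ms: float  # response model emits its first token
    tts_first_audio_ms: float    # speech synthesis emits its first audio chunk
    network_ms: float            # transport overhead for the round trip


def check_budget(timing: TurnTiming, budget_ms: float = BUDGET_MS) -> dict:
    """Sum the stages, compare against the budget, and name the slowest stage."""
    stages = asdict(timing)
    total = sum(stages.values())
    return {
        "total_ms": total,
        "within_budget": total < budget_ms,
        "over_by_ms": max(0.0, total - budget_ms),
        "slowest_stage": max(stages, key=stages.get),
    }


if __name__ == "__main__":
    report = check_budget(TurnTiming(80, 150, 60, 40))
    print(report)
```

With these example numbers the turn totals 330 ms, 30 ms over budget, and the model's first-token time is the stage to attack first. Tracking the slowest stage per turn, rather than only the total, shows where caching or streaming would pay off.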

What this means for your roadmap

Short term (0-6 months): pick one workflow with clear business value and measurable KPIs. Ship a pilot with human-in-the-loop fallback and weekly iteration on prompts, grammars, and policies.

Mid term (6-18 months): expand to multilingual scenarios, add barge-in and dynamic knowledge retrieval, and integrate trust checks. Build a central analytics layer to compare human vs. AI performance.
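
A central analytics layer can start small. The sketch below computes containment rate, average handle time, first-contact resolution, and CSAT per channel so AI and human results sit side by side. The Call record, its field names, and the sample data are hypothetical; real call logs will look different.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Call:
    """One completed call; the fields are illustrative, not a real schema."""
    handled_by: str               # "ai" or "human" (who answered first)
    escalated: bool               # AI call that was handed off to a human
    handle_seconds: float
    resolved_first_contact: bool
    csat: Optional[int] = None    # 1-5 survey score, if the caller answered


def summarize(calls: List[Call], handled_by: str) -> dict:
    """Aggregate the playbook KPIs for one channel ("ai" or "human")."""
    group = [c for c in calls if c.handled_by == handled_by]
    if not group:
        return {}
    ratings = [c.csat for c in group if c.csat is not None]
    # Containment only applies to AI calls: the share that never needed a human.
    containment = None
    if handled_by == "ai":
        containment = sum(not c.escalated for c in group) / len(group)
    return {
        "calls": len(group),
        "containment_rate": containment,
        "avg_handle_seconds": mean(c.handle_seconds for c in group),
        "fcr_rate": sum(c.resolved_first_contact for c in group) / len(group),
        "avg_csat": mean(ratings) if ratings else None,
    }


if __name__ == "__main__":
    sample = [
        Call("ai", False, 120, True, 5),
        Call("ai", True, 200, False, 3),
        Call("ai", False, 100, True),
        Call("ai", False, 180, True, 4),
        Call("human", False, 300, True, 4),
        Call("human", False, 420, False, 2),
    ]
    print(summarize(sample, "ai"))
    print(summarize(sample, "human"))
```

Keeping both channels in one function makes it harder to accidentally define a metric one way for agents and another way for the bot, which is a common source of misleading comparisons.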

Longer term (18-36 months): expect better accuracy as training data improves, plus stronger spoofing defenses. Keep humans available for edge cases, but let voice AI take the volume.

Upskill your team

If you're standing up voice-led experiences across product, ops, or support, give your team structured training so you don't reinvent the wheel. Explore role-based learning paths and certifications at Complete AI Training: Courses by Job or validate skills with the AI Automation Certification.

The bottom line

Voice AI is moving past its awkward years. Latency is down, voices sound natural, and use cases with clear structure are ready for production. Accuracy and spoofing still need attention, but the path is clear for teams that ship, measure, and iterate.

