AI Audio Models Will Become Commodities. Here's How Product Teams Win Anyway
Mati Staniszewski, co-founder and CEO of ElevenLabs, put it plainly at Bitcoin World Disrupt 2025: AI audio models will commoditize. The advantage sits with proprietary models for now, but over time, baseline quality will level out and access will widen.
If you build products, this isn't bad news. It's a signal. The value shifts from the model itself to the experience, the data, and the workflow you own.
Why commoditization is coming
- More players and open-source progress reduce gaps in quality
- Model training costs continue to fall and tooling improves
- Benchmarks stabilize, making "good enough" easy to meet
- Differentiation narrows to edge cases like rare languages or niche voice styles
Staniszewski acknowledged there will be subtle differences for specific voices or languages. But those gaps shrink with each release cycle.
ElevenLabs' current bet: build now, compound later
Why keep building models if they'll commoditize? Because right now they're the fastest path to quality. ElevenLabs wants control over latency, prosody, and reliability so they can solve the hard problems end to end.
Their stance is practical: win the next 12-24 months with proprietary performance, then use that expertise to create applications and fused systems that live beyond the model layer.
Where product value moves next
- Applications over algorithms: Ship workflows that matter to specific jobs, not just demos
- Data moats: Rights-cleared, consented datasets and feedback loops that improve outcomes
- User experience: Real-time responsiveness, controls for style and emotion, and clear fail states
- Ecosystem integration: CRM, contact center, CMS, design tools, analytics, compliance
- Trust and rights management: Consent, watermarking, detection, and enterprise-grade voice usage controls
What to build now: a 24-month product plan
0-12 months
- Use the best proprietary model you can for mission-critical quality and uptime
- Instrument latency, jitter, and cost per audio minute; set hard SLOs by use case
- Ship an audio-first UX: real-time streaming, interrupt and barge-in, style controls, safe fallback voices
- Collect consented, role-based feedback data to fine-tune prompts, voices, and guardrails
- Add evaluation gates: Word Error Rate, Mean Opinion Score, style transfer accuracy, multilingual parity (see the WER sketch after this list)
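To make that last gate concrete, here is a minimal Python sketch of a WER check you could run before promoting a new model or voice. The `word_error_rate` and `passes_gate` helpers, the test-set format, and the 10% budget are illustrative assumptions, not anyone's production pipeline; MOS and style metrics would need human or model-based raters on top.

```python
# Minimal sketch of a WER-based evaluation gate. Threshold and test-set
# format are assumptions for illustration only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

def passes_gate(pairs: list[tuple[str, str]], max_wer: float = 0.10) -> bool:
    """Block a release if average WER across the test set exceeds the budget."""
    avg = sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)
    return avg <= max_wer
```

Run the same gate per language so multilingual parity regressions surface before launch, not after.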
12-24 months
- Introduce a model abstraction layer so you can swap engines without a rewrite (sketched after this list)
- Blend audio with LLMs and video for richer experiences (voice + reasoning + visuals)
- Localize at scale: pronunciation dictionaries, domain lexicons, and region-specific compliance
- Optimize unit economics with smart routing: pick models by task, language, and cost targets
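As a sketch of the abstraction-layer and routing items above, the Python below defines a minimal engine interface and a router that picks by language and cost budget. `TTSEngine`, `SynthesisRequest`, and the pricing fields are hypothetical names for illustration; a real router would also weigh quality scores, latency SLOs, and failover.

```python
# Illustrative model abstraction layer with cost/language routing.
# Engine adapters would wrap whichever vendor SDKs you actually use.
from dataclasses import dataclass
from typing import Protocol

class TTSEngine(Protocol):
    name: str
    cost_per_minute: float   # USD, assumed pricing
    languages: set[str]
    def synthesize(self, text: str, voice: str) -> bytes: ...

@dataclass
class SynthesisRequest:
    text: str
    voice: str
    language: str
    max_cost_per_minute: float  # budget set by the calling workflow

def route(request: SynthesisRequest, engines: list[TTSEngine]) -> TTSEngine:
    """Pick the cheapest engine that supports the language and fits the budget."""
    candidates = [
        e for e in engines
        if request.language in e.languages
        and e.cost_per_minute <= request.max_cost_per_minute
    ]
    if not candidates:
        raise RuntimeError("No engine satisfies language and cost constraints")
    return min(candidates, key=lambda e: e.cost_per_minute)
```

The point is the seam, not the heuristic: once every call goes through an interface like this, swapping or mixing engines becomes a config change rather than a rewrite.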
Build, buy, or blend
- Buy when you need reliability now and your edge is UX or workflow
- Blend when you serve multiple use cases and need cost/quality routing
- Build only if you have a durable data advantage or must meet strict constraints (on-device, privacy, or unique languages)
Moats that outlast model parity
- Unique, rights-cleared data and continuous human feedback loops
- Deep integration into daily workflows and enterprise systems
- Clear governance: consent, watermarking, misuse prevention, audit trails
- Distribution: channels, partnerships, and community that compound
Multi-modal is the next step
Staniszewski expects fused systems (audio plus LLMs plus video) within one to two years. Think synchronized voice with generated visuals and reasoning in one flow.
Google's Veo shows how fast video generation is moving and points to what combined systems can do next. See Google AI's overview of Veo.
Metrics that matter
- Latency (P95/P99) and interruption handling for live experiences (see the sketch after this list)
- Cost per minute and per conversation
- Word Error Rate and Character Error Rate across languages
- Mean Opinion Score and style transfer consistency
- Content safety hit rate, voice-rights compliance, and detection success
- Product KPIs: activation, task completion, time-to-first-value, retention
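For teams new to tail metrics, here is a small Python sketch of computing P95/P99 from raw per-request latency samples. The sample values and the 300 ms SLO are placeholders, not recommended targets.

```python
# Nearest-rank percentile over per-request latency samples; placeholder data.

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for dashboarding."""
    ordered = sorted(samples_ms)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [212.0, 187.5, 240.3, 198.1, 310.7, 205.9]  # placeholder samples
p95, p99 = percentile(latencies_ms, 95), percentile(latencies_ms, 99)
print(f"P95={p95:.1f} ms, P99={p99:.1f} ms, SLO breach={p95 > 300}")
```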
Risk, consent, and safety
- Voice cloning consent and contracts for talent and end users
- Watermarking, detection models, and traceability for generated audio
- Clear misuse policies, throttling, and abuse monitoring
- Regional compliance: data residency, retention, and audit logs
Quick playbook for product teams
- Pick one high-value workflow and instrument it end to end
- Start proprietary, add model abstraction by quarter two
- Ship real-time UX with controls users actually need
- Collect consented feedback; close the loop weekly
- Add safety, watermarking, and rights checks before scale
- Plan multi-modal paths: voice + LLM now, video next
- Localize with lexicons and pronunciation tools early
- Report on cost, latency, and quality as if they were core SLAs
Conclusion
AI audio models will get cheaper, better, and common. The edge moves to product: experiences users trust, data they consent to, and workflows that save time every single day.
ElevenLabs is building models today to win quality now and compound into multi-modal systems later. That's the cue for product teams: ship outcomes, not models.
FAQs
Q1: What does "commoditization" of AI audio models mean?
It means the core technology becomes widely available and less differentiated. As more vendors and open-source options reach similar quality, the unique edge of any single model declines.
Q2: Why is ElevenLabs still building its own models?
Quality and reliability win deals right now. Owning the stack lets them solve hard problems in natural speech and control performance, while building expertise that feeds future applications.
Q3: What is multi-modal AI and how will ElevenLabs engage?
Multi-modal systems handle audio, text, and video together. ElevenLabs plans to pair its audio strengths with LLMs and video models through partnerships and open-source integrations to deliver unified experiences.
Q4: Where did Staniszewski share these insights?
On stage at the Bitcoin World Disrupt 2025 conference.
Q5: What's the long-term strategy for creating value?
Build strong models to lead now, then compound that into products where AI and design work as one. Think software-hardware synergy, but applied to AI features and real customer workflows.