Microsoft’s VibeVoice AI Instantly Creates Realistic Multi-Speaker Podcasts From Text

Microsoft’s VibeVoice converts text into natural-sounding podcasts with up to four distinct voices and 90 minutes of audio. It supports English and Chinese and runs locally or online.

Published on: Sep 01, 2025

Microsoft’s New AI Turns Text into Full Podcasts with Impressive Quality

Microsoft’s latest AI project, VibeVoice, offers something different from the usual Copilot integrations. This open-source text-to-speech (TTS) system converts plain text into natural-sounding audio, including multi-speaker conversations that can stretch up to 90 minutes.

What Is VibeVoice?

VibeVoice is a framework built to generate expressive, long-form audio like podcasts from text. It tackles common issues in TTS systems such as scalability, speaker consistency, and natural turn-taking in conversations. Unlike many models limited to one or two speakers, VibeVoice handles up to four distinct voices within a single audio file.

You can try it out either by installing the software locally or by using the hosted online demo, though the latter involves waiting in a queue for processing.
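
If you go the local route, the first step is typically pulling a checkpoint from Hugging Face. The snippet below is a minimal sketch using the huggingface_hub library; it assumes the smaller checkpoint described in the list that follows is published under the repo id "microsoft/VibeVoice-1.5B", so verify the current ids on the project's Hugging Face page before running it.

    # Minimal download sketch.
    # Assumption: the 1.5B checkpoint is published on Hugging Face as
    # "microsoft/VibeVoice-1.5B"; confirm the repo id before running.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="microsoft/VibeVoice-1.5B",  # assumed repo id
        local_dir="./VibeVoice-1.5B",
    )
    print(f"Checkpoint downloaded to: {local_dir}")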

  • There are currently two versions available: a 1.5 billion parameter model and a 7 billion parameter model.
  • The smaller model supports up to 90 minutes of audio with a 64k context window.
  • The larger model produces up to 45 minutes with a 32k context window, but likely offers higher quality.
  • A lighter 0.5 billion parameter version is planned for real-time audio generation.

Running the models locally requires about 7GB of VRAM for the smaller one and up to 18GB for the larger. This means many GPUs can handle the smaller model without needing a high-end AI rig.
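As a quick pre-flight check, you can ask PyTorch how much VRAM your GPU exposes before deciding which checkpoint to download. This is a generic sketch rather than anything VibeVoice-specific; the 7GB and 18GB thresholds are simply the rough requirements quoted above.

    # Rough VRAM check with PyTorch to pick a VibeVoice checkpoint to try.
    # The ~7 GB / ~18 GB thresholds are the approximate figures cited above.
    import torch

    if not torch.cuda.is_available():
        print("No CUDA GPU detected; local generation will be slow or impossible.")
    else:
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 reports {total_gb:.1f} GB of VRAM")
        if total_gb >= 18:
            print("Should fit the 7B model (and the 1.5B model).")
        elif total_gb >= 7:
            print("Should fit the 1.5B model; the 7B model will likely not fit.")
        else:
            print("Probably too little VRAM for either model.")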

Currently, VibeVoice supports English and Chinese, with plans to add more languages later.

How Does VibeVoice Work?

At its core, you enter text, and VibeVoice generates speech. It can create multi-speaker audio files that simulate real conversations, making it suitable for podcasts or dialogue-driven content. While it can attempt singing, that feature is still rough around the edges.
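The input is essentially a script in which each line is attributed to a speaker. The sketch below prepares such a script as a plain text file; the "Speaker 1:" / "Speaker 2:" labeling follows the pattern shown in the project's demos, but treat it as an assumption and check the repository's example transcripts for the exact format your version expects.

    # Sketch of a two-speaker podcast script for VibeVoice.
    # Assumption: lines are attributed with "Speaker N:" prefixes, as in the
    # project's demo transcripts; confirm the expected format in the repo.
    script = """\
    Speaker 1: Welcome back to the show. Today we're talking about open-source text-to-speech.
    Speaker 2: Thanks for having me. The interesting part is how long these generated episodes can run.
    Speaker 1: Up to 90 minutes, apparently, which is far beyond what most TTS models handle.
    """

    with open("podcast_script.txt", "w", encoding="utf-8") as f:
        f.write(script)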

The voices sound fairly natural but still have an AI-generated tone. Future updates may include voice cloning and expanded emotional expression. It already supports multilingual output, covering English and Mandarin for now.

Besides podcast creation and voiceovers, text-to-speech technology like VibeVoice offers practical benefits such as improved accessibility for users who rely on audio content.

Testing the model with a single speaker reading a snippet of text shows promising results. You can find more advanced examples on the project’s page, demonstrating multiple speakers and bilingual capabilities.

Once streaming audio generation becomes available, VibeVoice could integrate with chat assistants, providing human-like audio responses without relying on external servers.

For those interested in exploring this tool, you can find setup instructions and demos on the VibeVoice GitHub repository and on Hugging Face.