Microsoft’s VibeVoice AI Instantly Creates Realistic Multi-Speaker Podcasts From Text

Microsoft’s VibeVoice converts text into natural-sounding podcasts with up to four distinct voices and 90 minutes of audio. It supports English and Chinese and runs locally or online.

Published on: Sep 01, 2025

Microsoft’s New AI Turns Text into Full Podcasts with Impressive Quality

Microsoft’s latest AI project, VibeVoice, offers something different from the usual Copilot integrations. This open-source text-to-speech (TTS) system converts plain text into natural-sounding audio, including multi-speaker conversations that can stretch up to 90 minutes.

What Is VibeVoice?

VibeVoice is a framework built to generate expressive, long-form audio like podcasts from text. It tackles common issues in TTS systems such as scalability, speaker consistency, and natural turn-taking in conversations. Unlike many models limited to one or two speakers, VibeVoice handles up to four distinct voices within a single audio file.

You can try it out either by installing the software locally or by using the hosted online demo, though the latter involves waiting in a queue for processing.
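
If you go the local route, the first step is typically pulling a checkpoint from Hugging Face. The snippet below is a minimal sketch using the huggingface_hub library; it assumes the smaller checkpoint described in the list that follows is published under the repo id "microsoft/VibeVoice-1.5B", so verify the current ids on the project's Hugging Face page before running it.

    # Minimal download sketch.
    # Assumption: the 1.5B checkpoint is published on Hugging Face as
    # "microsoft/VibeVoice-1.5B"; confirm the repo id before running.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="microsoft/VibeVoice-1.5B",  # assumed repo id
        local_dir="./VibeVoice-1.5B",
    )
    print(f"Checkpoint downloaded to: {local_dir}")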

  • There are currently two versions available: a 1.5 billion parameter model and a 7 billion parameter model.
  • The smaller model supports up to 90 minutes of audio with a 64k context window.
  • The larger model produces up to 45 minutes with a 32k context window, but likely offers higher quality.
  • A lighter 0.5 billion parameter version is planned for real-time audio generation.

Running the models locally requires about 7GB of VRAM for the smaller one and up to 18GB for the larger. This means many GPUs can handle the smaller model without needing a high-end AI rig.
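As a quick pre-flight check, you can ask PyTorch how much VRAM your GPU exposes before deciding which checkpoint to download. This is a generic sketch rather than anything VibeVoice-specific; the 7GB and 18GB thresholds are simply the rough requirements quoted above.

    # Rough VRAM check with PyTorch to pick a VibeVoice checkpoint to try.
    # The ~7 GB / ~18 GB thresholds are the approximate figures cited above.
    import torch

    if not torch.cuda.is_available():
        print("No CUDA GPU detected; local generation will be slow or impossible.")
    else:
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 reports {total_gb:.1f} GB of VRAM")
        if total_gb >= 18:
            print("Should fit the 7B model (and the 1.5B model).")
        elif total_gb >= 7:
            print("Should fit the 1.5B model; the 7B model will likely not fit.")
        else:
            print("Probably too little VRAM for either model.")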

Currently, VibeVoice supports English and Chinese, with plans to add more languages later.

How Does VibeVoice Work?

At its core, you enter text, and VibeVoice generates speech. It can create multi-speaker audio files that simulate real conversations, making it suitable for podcasts or dialogue-driven content. While it can attempt singing, that feature is still rough around the edges.
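The input is essentially a script in which each line is attributed to a speaker. The sketch below prepares such a script as a plain text file; the "Speaker 1:" / "Speaker 2:" labeling follows the pattern shown in the project's demos, but treat it as an assumption and check the repository's example transcripts for the exact format your version expects.

    # Sketch of a two-speaker podcast script for VibeVoice.
    # Assumption: lines are attributed with "Speaker N:" prefixes, as in the
    # project's demo transcripts; confirm the expected format in the repo.
    script = """\
    Speaker 1: Welcome back to the show. Today we're talking about open-source text-to-speech.
    Speaker 2: Thanks for having me. The interesting part is how long these generated episodes can run.
    Speaker 1: Up to 90 minutes, apparently, which is far beyond what most TTS models handle.
    """

    with open("podcast_script.txt", "w", encoding="utf-8") as f:
        f.write(script)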

The voices sound fairly natural but still have an AI-generated tone. Future updates may include voice cloning and expanded emotional expression. It already supports multilingual output, covering English and Mandarin for now.

Besides podcast creation and voiceovers, text-to-speech technology like VibeVoice offers practical benefits such as improved accessibility for users who rely on audio content.

Testing the model with a single speaker reading a snippet of text shows promising results. You can find more advanced examples on the project’s page, demonstrating multiple speakers and bilingual capabilities.

Once streaming audio generation becomes available, VibeVoice could integrate with chat assistants, providing human-like audio responses without relying on external servers.

For those interested in exploring this tool, you can find setup instructions and demos on the VibeVoice GitHub repository and on Hugging Face.