microsoft/VibeVoice
VibeVoice is an open-source family of frontier voice AI models for speech recognition (ASR) and text-to-speech (TTS). The models use continuous speech tokenizers running at 7.5 Hz and a next-token diffusion framework guided by an LLM. The ASR side handles 60-minute long-form audio with diarization and custom hotwords; the TTS side generates 90-minute multi-speaker audio with up to 4 speakers and multilingual support. The project ships Colab demos, Hugging Face releases, finetuning code, and a real-time streaming variant.
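
To give a sense of why the 7.5 Hz tokenizer rate matters for long-form audio, the short sketch below works out how many tokenizer frames the stated durations correspond to. It assumes one frame per tokenizer step and counts audio frames only (no text or prompt tokens); the helper name `frames_for` is hypothetical and is not part of the VibeVoice codebase.

```python
# Back-of-the-envelope frame counts at the 7.5 Hz rate stated above.
# Assumption: one frame per tokenizer step; text and prompt tokens are ignored.
# `frames_for` is a hypothetical helper, not a VibeVoice API.

def frames_for(minutes: float, frame_rate_hz: float = 7.5) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


asr_frames = frames_for(60)  # 60-minute ASR input
tts_frames = frames_for(90)  # 90-minute TTS output

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
```

Under these assumptions, an hour of audio is 27,000 frames and 90 minutes is 40,500 frames, so the frame count grows by 450 per minute of audio.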