
VibeVoice: Open-source frontier voice AI

github.com

Comments

Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.
“VibeVoice can only handle up to an hour of audio”

Why?

So we've really just settled on Vibe as the verb for AI then?
Why use precise technical language when you can just vibe with your AI system?
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse
Isn't this the project that Microsoft published but then pulled soon after for security/safety reasons? What has changed since then?
Look at the "News" section in the readme. The original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long-form TTS, and streaming TTS models are newer.
It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.
[off topic]

When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

People will also post their own interpretations in response to comments, and quickly find out they missed something.

… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

[on topic]

(OK I’m done making excuses, time to read the article… thanks for the encouragement!)

I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck

This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow in inference. It's also bad in multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too
To be fair, his Midas touch is a result of consistency and a lot of hard work.

It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.

I thought they rolled it as well?
As always with people: listen to what they say, not to what they do...

After all, they rarely do what they say themselves, so it's surely not entirely made up nonsense!

there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

It is not good for text-to-speech (TTS) either. I have been trying it for a few days. First of all, the documentation for the 1.5B model is not there. The 0.5B realtime model is shit. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".

I am really disappointed with this model, to say the least.

The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.
yep, it seems this was trained on a large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed when I ran the TTS last week.
You just saved me an afternoon.
you saved us a lot of time here.... i unstarred the repo

moving on....

I don't really pay attention to stars. Do people use them as bookmarks? Why would you star a repo if you knew so little about it?
I exclusively use stars as bookmarks which is why I always found it strange when people talked about lots of stars meaning high quality or trustworthy…I’ve learned since then that I’m probably in the minority (both in using stars as bookmarks and not caring about how many stars a repo has).
Stars for me are basically "this might be interesting but I don't have time to look at it now, hopefully I'll think about it later and give it a second look".
Judging by how many people apparently are paying bots to give their lazily vibe-coded repos thousands of stars, it seems like people both simultaneously take stars seriously while not taking them seriously at all. It breaks my brain.
I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.
Imagine the balls it took to willingly attach the Microsoft label to the front of the product that is Teams.
I mean the same can be said about most versions of Windows as well. People act like Windows 11 is where it all went sour, but I've personally kind of hated it since Windows XP.

I feel like a recurring pattern with Microsoft is to create something quickly, market it aggressively and push for everyone to use it immediately, and only once it is installed everywhere do people suddenly realize how terrible it is, but it's too late to change.

I'm surprised you picked XP as the falling point. I didn't enjoy the days of reinstalling 95/98/ME every 6 months to avoid driver weirdness and seemingly random failures. XP was built on the foundation of 2000, which tended to make it more robust vs. its predecessors.

Vista on the other hand...

I mean, part of it is that I really hated the Fisher Price look to it, but it was also the first time I ever felt like I had to "hack" things to make stuff work. I had to muck with registry keys. Oh, and it was the first time that I noticed that Windows repair tools do not work.

I suspect I might have hated 9x more but I was pretty young when they came out and I didn't really "get into" computers until XP, and I disliked it enough to dual-boot Linux as a twelve year old.

It has some perks, is a bit more expressive in some cases, but overall is trained on really noisy data, uses more memory, and isn't that fast - I'm talking about the (7b?) version that they released then removed quickly (vibevoice-community on github) - I still use chatterbox turbo and sometimes qwen TTS.
Saved a lot of my time thanks!
Yes, the SOTA is currently much more advanced.
You have selected Microsoft Sam as the computer's default voice.
My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.
In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):

- Cohere Transcribe (self hosted)

- Grok Speech To Text (they provide an API, only $0.10/hr!)

They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)
Have you tried qwen?
Any non-Musk alternatives that are comparable in quality and cost?
Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.
Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
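The prices quoted in this subthread use different units (per hour vs. per minute), so a quick normalization helps when comparing them. This is only a sketch: the figures are copied from the comments above and are not checked against any vendor's current price list, and `PRICES` and `price_per_hour` are illustrative names.

```python
from decimal import Decimal

# Prices as quoted in this thread; NOT verified against vendor price lists.
PRICES = {
    "Grok Speech To Text": (Decimal("0.10"), "hour"),
    "Voxtral": (Decimal("0.003"), "minute"),
    "Speechmatics": (Decimal("0.004"), "minute"),
}


def price_per_hour(amount, unit):
    """Normalize a quoted price to US dollars per hour of audio."""
    if unit == "hour":
        return amount
    if unit == "minute":
        return amount * 60
    raise ValueError(f"unknown unit: {unit}")


per_hour = {name: price_per_hour(*quote) for name, quote in PRICES.items()}

# Print the services from cheapest to most expensive per hour of audio.
for name, cost in sorted(per_hour.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.2f}/hr")
```

On the quoted numbers, that works out to $0.10/hr for Grok, $0.18/hr for Voxtral, and $0.24/hr for Speechmatics.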
Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?
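For what it's worth, the usual "magic" for a model with a short input window is to cut the recording into overlapping chunks, transcribe each one, and stitch the text back together. Below is a minimal sketch of the chunking step only. The 35-second limit is taken from the question above and is not verified; the 2-second overlap and the name `chunk_spans` are assumptions for illustration. Merging the overlapping transcripts is a separate problem and is not shown.

```python
def chunk_spans(total_seconds, max_chunk=35.0, overlap=2.0):
    """Return (start, end) times in seconds that cover the whole recording.

    Each span is at most max_chunk long, and consecutive spans share
    `overlap` seconds so words cut at a boundary appear whole in one span.
    """
    if overlap >= max_chunk:
        raise ValueError("overlap must be smaller than max_chunk")
    spans = []
    start = 0.0
    step = max_chunk - overlap
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last span reached the end of the recording
        start += step
    return spans


# Example: an 80-second recording split for a 35-second window.
for start, end in chunk_spans(80.0):
    print(f"{start:.0f}s to {end:.0f}s")
```

A larger overlap makes the stitching step more forgiving at the cost of transcribing more audio twice.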
> we should stop calling this type of model open source. They are indeed "open weight”

This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

I think you mean GIF.
It's the same as GIS, you wouldn't say jizz now would you?
I take it that you haven’t met the Arcgees people…
I absolutely do, every single time it comes up.
The developer of the format declared the pronunciation 30+ years ago. It has always been jif.
Yeah, but society overruled them.
How do you pronounce giraffe?
How do you pronounce gift?
gorge = george
Same way I pronounce my first name btw ;) but I think of "gif" as "gift" and this is probably the subconscious association people make without realizing it.
Which is why I find it fun to bring up that in Old English "gift" hadn't yet picked up the "t" and was spelled "gif", but in Old English "g" was most commonly "HY". I like the Old English pronunciation of "gif" as "HYEEF", which is a "compromise" position that often makes some of both soft-g and hard-g "gif" pronunciation fans angry.
I have never heard this third option before but I love it!
I sometimes just pick the opposite of whatever everyone agreed to just for fun. I do the same when people cry about vim or emacs since I have used both. ;)

Some men just want to watch the world burn. At least it's mostly harmless fun anyway. It's even funnier when they bring up how my name is pronounced in defense of "jiff" and I tell them, so you're calling me the expert in "Gi" pronunciation then? :)

I do too. The idea that any one pronunciation is more correct based on the letters is quite amusing, given there's examples that work all ways.
i am absolutely going to from now on
I hadn't thought about how to pronounce GIS, but do you have a problem with the pronunciation of the Japanese Industrial Standards: JIS?
I've been pronouncing both of them as /dʒis/ like hiss and not /dʒɪz/. However, I am not a native speaker of English. I wonder if native speakers gravitate towards the z more?
I think it depends on region. Related, many speakers pronounce chips and salza, Tezla, Wezley.
I would end both with the S sound, but I'm operating under the assumption that the person I was replying to either pronounces their Ss as Zs or can't tell the difference between the S and Z sounds.

Because the other assumption I could have gone with is the less charitable take that they know GIS with a soft G doesn't sound like jizz, but they were just looking for a crude way to mock the soft G.

And "hallucination" which should have been "delusion".

Way early on (spring 2023) people tried to stop it, but no luck.

Why would it be delusion? It’s making something up which isn’t there and describing it.
A hallucination is a false sensory experience.

A delusion is a false mental belief.

Basically, hallucinations are false external things, and delusions are false internal things. You hallucinate a pink elephant; you delude yourself into thinking Trump won 2020.

The inventor of GIF didn't begin with a document* clearly laying out what is and isn't to be called a "GIF."

I think it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."

*https://opensource.org/osd

> inventor of GIF didn't begin with a document clearly laying out what is and isn't to be called a "GIF”*

Neither did the inventors of AI. A third party published a document after corporations went with open weights = open source and a spoiler block in FOSS wanted all training data published.

> it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source

I think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly. Those who care can continue using the more-precise language they choose to.

Put another way, there is a difference between using terms like cracker and fully spelling out cryptocurrency, and telling people who use hacker and crypto more loosely that they’re wrong. They aren’t wrong and that isn’t meaningful feedback. At the same time, the person using the precise language isn’t wrong either.

There's a big difference between correcting some random commenter on an internet forum and correcting Microsoft.

> think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly.

Only to people that truly don't care whether something's open source. In which case, Microsoft using the term (correctly or incorrectly) won't change their perception.

But the people who do care won't like to be misled by Microsoft. There's a reason the term is right in the headline: people respond to it.

I wish I had time to come up with a better example, but it's like if a AAA game company says they've released a "native Linux build," but really they're just packaging the Windows build with Wine.

99% of people won't care, neither about the news nor the deception. But for that last 1%, any goodwill garnered with the headline would be gone, and the game company are the ones who look foolish, not the people calling them out.

To be fair, the initiators of the "Open Source" movement also co-opted a term that previously had a much more flexible meaning (and had been around for more than a decade at that point.) Just writing a document attributing specific criteria to a term does not grant one authority over the use of that term.

Ironically, the roots of the Open Source movement are a direct response to the Free Software movement largely because it was considered too ideological and unfriendly to corporate interests (i.e. monetization.)

I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up and build the training and inference infrastructure around open models.

Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.
it was replaced with abundancewashing
What is "abundancewashing"?
> “This means a future of abundance. A future where there is no poverty, where people can have whatever they want in terms of goods and services.” – Elon Musk

> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman

https://www.diamandis.com/blog/elon-sam-abundance

Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.
If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D
With free libre software, where freedom and liberty are about what the end user is empowered with actually, the software is mostly metonymic. Free software, free society, because there are free people in the middle of course.
Right, as I said elsewhere, maybe let's just let "open-source" have it.

"Open-source" can be "anything you can go out and grab a copy of and use" but doesn't give you much legal certainty about any of it, and reserve "free software" for the other, better thing.

But, free software lost its way around GPLv3. From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)

> From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

What in the world do you mean?

The anti-TiVo clause bans things like Apple pre-installing GPLv3 software on Macs, but allows them to let you use exactly the same software as long as they do not give users access to the binary. AGPLv3 blocks both use cases, GPLv2 blocks neither.

On the spectrum of "things that take away user freedom", withholding the source code is bad. Withholding the source code, the binaries and physical access to the computer is obviously much worse! This latter business model is heavily subsidized by GPLv3.

I don't understand this either. The GPL doesn't address end users and their use of software at all, to be technical. It only addresses what terms of copyright redistributors of GPLed software are allowed to apply in-turn to subsequent end users.
The point of the Free in free software was always to protect the users of the software, not the vendors or the redistributors. (This is why the license focuses on the redistributors -- the mechanisms of the license limit their rights in order to protect others' rights.)

The first sentence of the GNU manifesto says this, and a few sections later in the document elaborate on the point:

https://www.gnu.org/gnu/manifesto.html

Note, in particular, footnote [1] which explains that it's OK for distributors to ask for payment, but that it's never OK for users to have to ask for permission to use the software, and the section "Why I Must Write GNU".

Since then, software service monopolies became common, and all of the most end-user-hostile systems on earth rely heavily on the GNU system. At this point, we're paying for permission to use those services with our money, our data, our democracy, etc.

I certainly cannot give you permission to use any of the GPLed services that I have used, or that I've been paid to extend. Therefore, I say the free software movement has lost its way.

I see your point and I agree. It's just that when you say "GPLv3 says that you can only use the software if it's either a cloud service, hypothetical open firmware devices" that's a stretch and not really true. AIUI vendors can pre-install GPLv3 software as long as they let you actually then replace the software (i.e. no DRM or locked bootloader). The firmware can still be non-GPL and non-replaceable. You just can't use GPLv3 code in the non-replaceable bootloader or firmwares.