
4TB of voice samples just stolen from 40k AI contractors at Mercor

app.oravys.com

Comments

Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.

  Happy to discuss the forensic detection side. AudioSeal
  watermarks, AASIST anti-spoofing, and how the detection landscape changes
  once voice biometrics start leaking at scale.
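For intuition on how watermark detection works at all, here is a minimal toy sketch of correlation-based (spread-spectrum) watermarking in numpy. This is purely illustrative and is NOT AudioSeal's actual scheme (AudioSeal uses a learned, sample-level localized watermark); every name and parameter below is made up for the example.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.1) -> np.ndarray:
    # Toy spread-spectrum watermark: add a key-seeded pseudorandom
    # sequence at low amplitude. Illustrative only.
    rng = np.random.default_rng(key)
    mark = rng.standard_normal(audio.shape[0])
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 4.0):
    # Correlate against the same key-seeded sequence. An unmarked
    # signal scores near 0; a marked one scores near strength*sqrt(N).
    rng = np.random.default_rng(key)
    mark = rng.standard_normal(audio.shape[0])
    score = float(np.dot(audio, mark) / np.sqrt(audio.shape[0]))
    return score > threshold, score

rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)        # 1 second of noise at 16 kHz
marked = embed_watermark(clean, key=42)
print(detect_watermark(clean, key=42)[0])  # False
print(detect_watermark(marked, key=42)[0]) # True
```

The key property, shared with real schemes, is that detection needs no reference recording, only the key; without the key the mark is statistically indistinguishable from noise.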
Interesting - thanks for the rabbit hole today. ;)

Mercor hasn't released many public statements about the incident. Social media posts aren't necessarily public, but I did find this breach notification sample filed with the CA AG: https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.

Didn't this happen three weeks ago?

Mercor has definitely released statements with boilerplate "investigations are underway."

HSBC offered voice verification years ago and I just laughed and said nope.

I don’t even use biometrics on apple devices, I use a 6 digit pin.

It was always a stupid idea.

The thing about being willing to trade convenience for security is that you get called paranoid; and then, when the other shoe does drop and you're still making that trade, you get called paranoid all over again for whatever current thing you're not doing that “everyone does”.

Paraphrasing Franklin and Churchill, those who trade some security for some convenience may soon find themselves possessed of neither at all.
> I don’t even use biometrics on apple devices

Assuming Apple is truthful on this matter (so far it seems so), Apple devices store a mathematical representation of the data, not the data itself (i.e. not a picture of your finger) and keep it only on device on a special hardware section designed for extra security. When apps ask for authentication, they can never inspect the data, they can only ask “does this match?”.

Even if you were somehow able to exfiltrate the data and find some way to transform it for something nefarious, you’d still need to first attack and bypass a specific hardware feature of the target’s device.

So sure, not having any representation of the data anywhere is technically more secure (maybe, as typing your code could be intercepted by a shoulder surfer or a camera), but biometrics on Apple devices are fundamentally not the same as having your raw data available on a random server somewhere.
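The "match, never export" pattern described above can be sketched roughly as follows. All names here are hypothetical, and real biometric matching is a fuzzy template comparison inside dedicated hardware rather than an exact hash check; this is only a toy model of the interface shape, where callers get a yes/no and never the stored data.

```python
import hmac
import hashlib
import secrets

class SecureEnclaveSketch:
    """Toy model of the match-only pattern (hypothetical names;
    not Apple's real Secure Enclave interface)."""

    def __init__(self, biometric_template: bytes):
        # A salted hash stands in for the on-device mathematical
        # representation; the raw template is never retained.
        self._salt = secrets.token_bytes(16)
        self._digest = self._hash(biometric_template)

    def _hash(self, template: bytes) -> bytes:
        return hashlib.sha256(self._salt + template).digest()

    def matches(self, candidate: bytes) -> bool:
        # Apps only ever get a yes/no answer, never the stored data.
        return hmac.compare_digest(self._digest, self._hash(candidate))

enclave = SecureEnclaveSketch(b"alice-faceprint-v1")
print(enclave.matches(b"alice-faceprint-v1"))  # True
print(enclave.matches(b"mallory-faceprint"))   # False
```

Even with full read access to `_digest`, the salt-plus-hash construction gives an attacker nothing to replay elsewhere, which is the contrast with raw samples sitting on a server.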

Also, given how many times you enter a 6-digit number over the course of a day, it's absolutely trivial to steal. Not to mention the basic patterns people use, smudges, etc.

In the use case of a mobile phone, Apple's Face ID absolutely improves security severalfold.
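For a rough sense of the numbers in the PIN-versus-smudge trade-off mentioned above, a quick back-of-the-envelope calculation:

```python
import math

# Keyspace of a random 6-digit PIN.
pin_bits = math.log2(10**6)
print(round(pin_bits, 2))              # 19.93 bits

# Smudge attack: if screen smudges reveal which 6 (distinct) digits
# are used, only their ordering remains unknown.
orderings = math.factorial(6)          # 720
print(round(math.log2(orderings), 2))  # 9.49 bits
```

So a 6-digit PIN starts with under 20 bits of entropy, and a smudge pattern alone can cut that roughly in half before any shoulder-surfing.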

I broadly trust Apple's stance on all of that.

I can however revoke a 6 digit pin at any time, my fingerprints and/or face not so much.

So for me, the balance is: use a PIN. I understand that different people feel differently, and that is entirely their choice, as it should be.

> Self-audit your public audio footprint. Search YouTube, podcast directories, and old Zoom recording

This is suggestion #1 on your list of remediation steps for victims, but you didn't provide any information on how anyone would actually do that. How exactly would I search the internet for copies of my voice?

Please don't tell me the solution is giving an embedding of my voice to another third party.

Great question. There's no "reverse voice search" yet the way there is for images — that's genuinely a tool the world needs. In the meantime, the most useful thing is searching your name across YouTube and podcast platforms to map out what's already public. And for Mercor contractors specifically, the California AG breach notice gives you a solid legal basis to request full deletion. Worth doing today.
Note, this comment and your other one (https://news.ycombinator.com/item?id=47931838) were autokilled by HN, because it (rightly) detected that you're using AI to write your comments. I vouched this one to unkill it before I realized it was AI and supposed to be dead. I unvouched it, but your comment's still alive. So now I'm leaving a note saying mea culpa, and to suggest not using AI in your comments unless you want to be autokilled.
Thanks for saving me the tokens.
One more data point for why suing companies should lead to the CEO getting prison time as well. And ideally we should invent some kind of equivalent of prison for non-human persons like organisations.

Because right now the incentives to do what's right are so low. Taking risks with other people's lives is becoming the norm for companies.

The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.

I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
Were you not around for the Big Data heyday a decade ago?
Once thumb drives became large enough to fit most datasets, it stopped being Big Data. Just normal data.
We have thumb drives that can store petabytes of data?

Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took them seriously, neither in the 2010s nor now. That was always "small" data.

Most companies using the term "big data" had datasets in the TB range. One company I had a gig at had a full Hadoop cluster set up and their whole dataset was 40GB. Their marketing had all the big-data-adjacent keywords all over the brochures for clients.
That's a decent-quality 3-hour movie :D
> We have thumb drives that can store petabytes of data

We do?

Please provide a link.
Plus one more for your parity drive.
It was a question; you've just edited out the punctuation. You're asking the exact same thing as the person you replied to.
My rule of thumb was "can it fit in RAM on a server?" If it can, then it's not big data.

500GB is in the "fits" category.

You can quadruple that and it would still fit in server RAM.
Correct, and until then it won't be "big data"
To some degree IMO big data is still a mindset when it might take a day to process your data in a normal SQL query. Some tech doesn't scale to the data size for all use cases, and you need different solutions.
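The rule of thumb from this subthread is trivial to write down; the 2 TiB server-RAM figure below is just an assumed example, not a standard:

```python
def is_big_data(dataset_bytes: int, server_ram_bytes: int = 2 * 1024**4) -> bool:
    # Thread's heuristic: if the dataset fits in one server's RAM
    # (2 TiB assumed here), it isn't "big data".
    return dataset_bytes > server_ram_bytes

print(is_big_data(500 * 1024**3))  # 500 GiB fits in RAM -> False
print(is_big_data(40 * 1024**3))   # the 40 GB "Hadoop cluster" -> False
print(is_big_data(4 * 1024**4))    # 4 TiB exceeds it -> True
```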
Hell you mean a decade ago? I still see businesses running losses left right and center saying that they're gonna monetize user data, any day now.

Related "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.

Data hoarding predates LLMs. There were other machine learning methods that also needed data for training.
“Before LLM’s there was_____”

I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.

Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

Do LLMs require that much more data than the traditional ML approaches we've seen over the years?
Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But it was only with modern compute power that they became useful outside of academic toy-model applications.
That’s not the issue I’m hitting here primarily but yes.

My concern is that I can open up chatGPT and even with a free, “anonymous” account run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.

Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.

I’m not even getting into fake images either. That used to require some know how. There are basically no hurdles and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well for free.

I really hate this when it's something negative that humans also do. It's like, yeah, people do do that, but why are we automating {negativeTrait}?
Unfortunately the answer is usually people just want to hand wave away the critique for one reason or another. “People already do that” is an easy truism for stifling discussion.
> Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

I have the faintest possible hope that such things will be the death knell of social media. Yeah, a lot of credulous idiots are happily giving AI thirst traps their money in exchange for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding the users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them. That selects for the people most vulnerable to AI shit, which is far from ideal, but it also means those platforms are composed ever more of that vulnerable population and nobody else. And the problem for any business going through that is that without a diverse, growing audience, you just become InfoWars: slinging the same slop to the same people every day, every ounce of it great for what's left of your audience but absolute garbage for bringing anyone new in. And it just goes on that way until you sputter out and die (or harass the wrong group of parents, I guess).

I wish all social media sites a very haha die in a fire.

Mate, you're on a social media site right now that often has AI-generated content displayed at the top of what's "trending". Sure, the general user base does a better job here of flagging that sort of stuff, as AI seems to be a shared interest in much of the community, but it still sneaks its way by.
You’re technically right but I think we can all agree HN is significantly different from the major players. The vast majority of us see the same posts and comments, for starters. The churn of posts is also much slower. You log on 2-3 times spread out in a day and you see 90% of the main posts. Top posts linger for 24-48hrs regularly.

No media uploading, memes are few and far between (usually punished), etc.

10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensionality data. Mostly the results were garbage but occasionally you hit on a real, unexpected phenomenon.

Nowadays you just throw all the data into a black box and believe whatever it says blindly.

Data that is publicly available also can't be stolen or leaked. Nobody can steal Mozilla's common voice dataset.
Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased; sometimes both happen at the same time. Data can be lost, which is when its last existing copy is erased.
Pedantic and true. What was stolen was not data but future revenue based on exclusive access to that data.
Pedantic and relevant. If they had lost the voice samples, they wouldn't have them for training new models. If the samples were merely copied, then they have lost nothing in terms of training.
Money is not a physical thing.
> Germans (because of course)

I don't know if it's for the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reckoning (Vergangenheitsbewältigung) with WW2.

Love it, also love how Datenschatten can also imply that it disappears when someone shines light on it
If only our past 20 year old self data could be so ephemeral…

Who doesn’t want that old post, from when they were shit-faced outside a bar in Nashville, to go extinct forever now that they're in mid-life and a “respectable” member of society.

The Stasi would be the obvious cultural context.

In the US of course the government buys this sort of information legally from corporations.

> The Stasi would be the obvious cultural context.

There is also the rather famous example of how earlier census data was used in the 40’s.

Once the government has your data, they have it. The next generation of representatives may not follow all the same rules and norms

The Stasi could only dream of the kind of surveillance the NSA et al. have today.
Or Facebook or Equifax.
The West-German debate in the 70s came from the realization that the sheer size of the Holocaust/Shoah was in no small degree due to bureaucratic record keeping. Storing someone's ethnicity is potentially dangerous for that person.
I took the "because of course" to be about having a word for everything - a stereotypical idea about the German language.
There's also the other implication that the (East) Germans were Soviet just 35 years ago.

But yes. We Americans know Germans more for their silly big words. Statements like that can be misinterpreted, though, as Germans' perspective of themselves doesn't quite match the American stereotypes.

East Germany was not Soviet. Under influence/control of the Soviets, yes, but not part of the Soviet Union.
I was implying all 3 of the above:

- we learned the hard way that data will be used to kill people, during the Nazi regime

- we learned it again in the GDR with the Stasi being a little less obvious but still ruining people's livelihoods

- and German comes up with compound words for such things

That's like saying that English (because of course) is able to describe the concept by a combination of words.
My understanding was that it's more that words can be concatenated into new words in German, which is not so much a stereotype as a misunderstood fact. I.e., you wouldn't think much of something like enjoyable-comeuppance, but Schadenfreude looks more impressive without the hyphen.
I would argue it's not the exact same thing. Sure, when overdone you would get the same result. But as it is, commonly used concatenated words are words, not just hyphenated words. They are used as words, and without extra thought people don't parse them into separate parts, unlike they do with a list of words joined by hyphens.

E.g. you don't think of firefighter as fire-fighter in ordinary usage.

Germany resisted Google Street View until 2023, which was something I thought was very impressive.
Yeah, so Germany had a ton of secret police files and of course learned very well what happens when a bunch of people start collecting dossiers.

So yeah, of course they've developed that kind of distrust. Americans should have too, after the 50s-60s paranoia of the Red Scare, the surveillance of Black people, etc. Instead they just spent a few decades building an anti-social state.

> The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Except no company is learning this lesson.

The enterprise threat model includes "our own users", and the modus operandi is to maintain as much information on that threat as possible.

The only winning move is not to play.
Seems a bit like blaming the victim? Your voice (like DNA) is kind of ambient data that's hard to hide.
Or you could put it in a box with no connection to the internet.

Introducing… The Hooli Box!

Do Germans have lots of words or just a lack of spaces?
You could have seen this coming a mile away. So far I have gotten away with never uploading my ID or interacting with one of those companies (though one idiot working for some VC thought it was OK to sign a document on my behalf by uploading my signature!! Never mind a bit of fraud), but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?

I had to open a bank account for a company here a few years ago, right on the bubble of this happening, and they still had an option to come by in person with the proper documentation, which I did. Now it is all outsourced.

These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.

Tell us more about that fraud story! Was the person your attorney or accountant? Or just some "smart" person who decided to wisely save time by doing fraud?
It was a fund administrator. I still find it unbelievable that they would so casually do this. And yes, they thought they were very smart... and helpful too...
Why is the ID a hidden secret that can be used for anything regarding security in the first place?
Because historically that's how it worked: officials just looked at the document and verified that it was the real thing. Then photocopiers came along and it became normalized to take copies of the documents. Then digital copies, coupled with networking technology, changed things completely. What the officials in charge don't seem to understand is that by making digital copies in networked environments, the IDs themselves lost their value completely; after all, if the digital copy serves any purpose at all as a stand-in for the original, then it has become that original.
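The point about copies can be made concrete: a static scan is a replayable credential, while a challenge-response check only works for whoever holds the live secret. A minimal sketch, with a hypothetical shared key standing in for whatever secret a real smart-ID chip would hold:

```python
import hmac
import hashlib
import secrets

SHARED_KEY = b"issuer-provisioned-key"  # hypothetical per-credential secret

def respond(challenge: bytes, key: bytes = SHARED_KEY) -> bytes:
    # A live credential can sign a fresh nonce; a static copy cannot.
    return hmac.new(key, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, response: bytes) -> bool:
    return hmac.compare_digest(respond(challenge), response)

# The verifier issues a fresh nonce per session.
nonce = secrets.token_bytes(16)
captured = respond(nonce)          # attacker records this exchange
print(verify(nonce, captured))     # True: the live response passes

# Replaying the capture against a new nonce, the digital analogue
# of presenting a photocopied ID, fails.
new_nonce = secrets.token_bytes(16)
print(verify(new_nonce, captured)) # False
```

A scanned ID behaves like `captured`: once it leaks, anyone can present it, which is exactly why the copy "becomes the original".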
This kind of event is the best argument against needless data hoarding. But it would help if the law better provided for some kind of consequences for negligence.
Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
What happens now is that a lot of clueless CTOs who didn't know about this company now know its name. So the outcome of this mess is probably more business for Mercor.

I mean, just look at what happened to Crowdstrike....

Mercor has around 5 customers that make up 95% of its revenue. Anybody who needs to know about them already does.
At minimum, collecting voiceprints should come with much stricter consent, retention and security requirements than ordinary "training data"
I love how the check for whether you're affected involves giving a voice sample to whatever the fuck that website is.
It's like those "have I been pwned" knockoff websites, where you type in your name and email and they grab your IP, location, and anything else to sell off.
"My voice is my passport. Verify Me."

:)

HSBC did that. I could never understand that - the exact phrase was in the movie!
Someone probably did it for an internal demo, as a joke. Then people pushed it upwards, until someone clueless approved it.
Fidelity seemed to sign you up for this when you called them on the phone almost automatically. Ridiculous since it was defeated easily in a hacker movie from the 1990s using a tape recorder.
Much cleaner than keeping a finger on you to bypass the print reader.
I've been doing similar things on a different platform because, as a uni student, the pay is kinda nice, but I limit myself to tasks without voice/video, just mouse/keyboard input for reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work to.
I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
It's already there. And keeps moving.

Even have a nice UI on top.

https://voicebox.sh/