4TB of voice samples just stolen from 40k AI contractors at Mercor

Original link: https://app.oravys.com/blog/mercor-breach-2026

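## Mercor data breach: a new era of voice fraud

AI training contractor Mercor has suffered a major data breach that exposed the voice biometrics and identity documents of more than 40,000 people. The four-terabyte dump, published by the extortion group Lapsus$, uniquely combines high-quality voice recordings (averaging two to five minutes per person) with verified identity data, a dangerous combination in the hands of malicious actors.

This breach differs from past incidents, which typically involved isolated voice recordings or identity leaks. The Mercor data provides everything needed for sophisticated voice cloning, enabling attacks such as bypassing bank security measures, impersonating employees for financial gain, and running increasingly realistic deepfake scams (including romance and emergency impersonation).

Affected individuals should treat their voice as a leaked password. Recommendations include auditing their online audio footprint, establishing a verbal codeword with trusted contacts, and disabling voiceprint verification at their bank in favor of stronger multi-factor authentication.

ORAVYS offers victims of the Mercor breach free forensic analysis of up to three suspicious audio samples, using advanced detection methods to identify synthetic speech and potential manipulation. The breach highlights a growing threat landscape in which, without proactive protection, voice is no longer a safe biometric identifier.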

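Mercor (oravys.com), an AI contracting platform, has suffered a large-scale data breach in which 4TB of voice samples and ID document scans from roughly 40,000 contractors were leaked. The data has now been published on a leak site by the Lapsus$ group and poses a significant deepfake risk because it pairs voiceprints with identifying information.

The author of the Oravys Hacker News post highlights potential malicious uses, including bypassing bank voice authentication, fraudulent video calls, and insurance fraud. They provide a five-step checklist for affected contractors and offer to discuss forensic detection methods such as audio watermarking and anti-spoofing techniques.

The incident underscores the importance of data minimization, a concept known as "Datensparsamkeit": the best way to prevent a data leak is to avoid collecting unnecessary data in the first place.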

Original article

Forensic intelligence // Breach analysis

By the ORAVYS forensic desk | Published April 24, 2026 | ~7 min read

On April 4, 2026, the extortion group Lapsus$ posted Mercor on its leak site. The dump is reported at roughly four terabytes and bundles a payload that breach analysts have been warning about for two years: voice biometrics paired with the same person's government-issued identity document. According to the leaked sample index, the archive covers more than 40,000 contractors who signed up to label data, record reading passages, and run through verification calls for AI training.

Five contractor lawsuits were filed within ten days of the post. The plaintiffs argue that the company collected voice prints under a "training data" framing without making clear they were also a permanent biometric identifier. The lawsuits matter, but the people whose voices were already exfiltrated have a more immediate question. What does an attacker actually do with thirty seconds of someone's clean read voice plus a scan of their driver's license?

Why this breach is different

Most voice leaks in the last decade fell into one of two buckets. Either a call center got popped and recordings were stolen with no easy way to map them back to identity. Or an ID-document broker leaked driver's licenses and selfies without any audio attached. Mercor merged both columns. The contractor onboarding pipeline asked for a passport or driver's license scan, then a webcam selfie, then a sit-down voice recording reading scripted prompts in a quiet room. That sequence, in one row of one database, is exactly what a synthetic voice cloning service needs as input.

The Wall Street Journal reported in February 2026 that high-quality voice cloning now requires roughly fifteen seconds of clean reference audio for tools available off the shelf. The Mercor recordings are reported to average two to five minutes of studio-clean speech per contractor. That is far past the threshold. Pair it with a verified ID document and the attacker has both the clone and the credential needed to put the clone to work.
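
As a rough back-of-the-envelope check on those figures, the short sketch below computes how far the reported recordings exceed the reported cloning threshold. The numbers are the ones quoted above; the constant and function names are illustrative only.

```python
# Figures as reported in the article; names are illustrative.
CLONE_THRESHOLD_S = 15                 # clean reference audio reportedly needed
RECORDING_RANGE_S = (2 * 60, 5 * 60)   # reported average length per contractor


def margin_over_threshold(recording_s: float,
                          threshold_s: float = CLONE_THRESHOLD_S) -> float:
    """Return how many times the threshold a recording of this length covers."""
    return recording_s / threshold_s


low, high = (margin_over_threshold(s) for s in RECORDING_RANGE_S)
print(f"{low:.0f}x to {high:.0f}x the reported threshold")  # 8x to 20x
```

In other words, each contractor's sample is reported to hold roughly 8 to 20 times the audio an off-the-shelf tool would need.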

What attackers can now do with stolen voice data

The threat models below are not speculative. Each is a documented technique already used in the wild before this breach.

  • Bank verification bypass. Several US and UK banks still treat voiceprint matching as one of two factors. A clone of the account holder reading a challenge phrase clears the audio gate, leaving only a knowledge question that often comes from the same leaked dataset.
  • Vishing the victim's employer. Calling HR or finance pretending to be the employee to redirect payroll, request a wire, or unlock a workstation. The Krebs on Security archive lists more than two dozen confirmed cases since 2023.
  • Deepfake video calls in the Hong Kong Arup template. In 2024 a finance worker at Arup wired roughly 25 million dollars after a multi-person deepfake video call. The voices and faces had been built from public footage. Mercor leaked something better than public footage: studio audio plus a verified ID.
  • Insurance claim fraud. Pindrop reported a 475 percent year-over-year increase in synthetic voice attacks against insurance call centers across 2025. Auto, life, and disability claims are the prime targets because they are settled by phone.
  • Romance and grandparent scams targeting family members. The FBI Internet Crime Complaint Center logged 2.3 billion dollars in losses for victims aged 60 and over in calendar year 2026. The single fastest-growing category was emergency impersonation calls, where the synthetic voice claims to be a relative in trouble.

How to check if your voice is being misused

If you ever uploaded a voice sample to Mercor, or to any of the other AI training brokers that operated through 2025, treat your voice the way you would treat a leaked password. You cannot rotate it, but you can change what it unlocks. Here is the short list.

  1. Self-audit your public audio footprint. Search YouTube, podcast directories, and old Zoom recordings for samples of your voice that are publicly indexable. Take down what you can. The less reference audio there is in the open, the weaker an attacker's clone will be.
  2. Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.
  3. Rotate voiceprints wherever they are still in use. Google Voice Match, Amazon Alexa Voice ID, Apple Personal Voice, and any banking voiceprint enrollment can be deleted and replaced. Do that now, ideally with a new recording made in a different acoustic environment from the leaked sample.
  4. Tell your bank to disable voiceprint as a verification factor. Ask in writing for multi-factor authentication that combines an app token or hardware key with a knowledge factor. Many banks let you opt out of voice as a primary factor; few of them advertise it.
  5. Run suspicious recordings through a forensic scanner. If you receive an audio file or voicemail that claims to be from someone you know and asks for money or access, or pressures you to act urgently, run it through a deepfake detector before acting. ORAVYS offers a free check for the first three samples submitted by breach victims (see the offer below).
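
The "app token" mentioned in step 4 is usually a time-based one-time password (TOTP). The minimal sketch below follows the standard RFC 6238 computation (HMAC-SHA1, 30-second step). It illustrates why such a factor survives a voice leak: the code depends only on a shared secret and the clock, not on anything an attacker could clone from audio or an ID scan. The key shown is the RFC's published test key, not a real credential, and the function name `totp` is just an illustrative choice.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password per RFC 6238 (HMAC-SHA1 variant)."""
    now = time.time() if at is None else at
    counter = int(now // step)                      # number of elapsed time steps
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                         # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test key; the RFC lists 94287082 for t = 59 s with 8 digits.
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```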

The forensic checklist that experts use

When a sample lands on a forensic analyst's desk, the following artifacts are the first pass. Each is something a synthetic voice tends to get slightly wrong, even when the perceptual quality is high.

  • Codec mismatch. The audio claims to come from a phone call but the spectral signature does not match any known telephony codec.
  • Breath patterns. Real speakers inhale at predictable points dictated by phrase length and lung capacity. Synthetic voices often skip breaths or insert them at the wrong syllabic boundary.
  • Micro-jitter. Natural vocal folds vibrate with small irregularities. Generated audio is often too clean at the millisecond level.
  • Formant trajectory. Vowel transitions follow physical articulator paths in a real mouth. Cloned voices sometimes take impossible shortcuts between formants.
  • Room acoustics inconsistency. The reverb signature should be identical from the start of the file to the end. Generated audio is often dry while the splice context is reverberant.
  • Prosody flatness. Synthetic speech often has narrower pitch and energy variance than the same speaker would have in real conditions.
  • Speech rate stability. Real humans speed up and slow down with content. Generated speech tends to hold a metronomic rate across long passages.
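
Several of these checks reduce to simple statistics over per-frame measurements. The sketch below is a simplified illustration, not the ORAVYS method: it assumes pitch periods (in milliseconds) or per-segment values such as pitch or syllable rate have already been extracted by some pitch tracker, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def local_jitter(periods_ms: list[float]) -> float:
    """Mean absolute difference between consecutive pitch periods,
    divided by the mean period (a common 'local jitter' ratio)."""
    if len(periods_ms) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods_ms, periods_ms[1:])]
    return mean(diffs) / mean(periods_ms)


def coefficient_of_variation(values: list[float]) -> float:
    """Standard deviation divided by the mean: a crude spread measure
    for pitch (prosody flatness) or syllable rate (rate stability)."""
    return pstdev(values) / mean(values)


# Toy data, invented for illustration only.
natural_periods = [8.0, 8.1, 7.9, 8.2, 7.8]
too_clean_periods = [8.0, 8.0, 8.0, 8.0, 8.0]

print(local_jitter(natural_periods))     # small but non-zero
print(local_jitter(too_clean_periods))   # 0.0
```

A jitter ratio or variation coefficient that sits near zero over a long passage is the "too clean" or "metronomic" pattern described above. Any real decision threshold would have to be calibrated on known recordings, which this sketch does not attempt.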

What ORAVYS does specifically

  • More than 3,000 forensic engines run in parallel on every submitted sample, covering signal, prosody, articulation, codec, and provenance domains.
  • AudioSeal watermark detection flags files generated by major commercial voice models when the watermark is preserved, giving a deterministic positive when present.
  • An anti-spoofing module trained against the ASVspoof public benchmarks scores the likelihood that a sample was synthesized rather than recorded.
  • Biometric processing is GDPR (RGPD) compliant. Audio is never used to train commercial models without explicit consent and is purged on a defined retention schedule.
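
One way to picture how a deterministic watermark hit and a probabilistic anti-spoofing score might be combined into a single result is sketched below. This is purely illustrative: `Evidence`, `verdict`, and the 0.5 threshold are hypothetical and do not represent the ORAVYS pipeline or the AudioSeal API.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    watermark_found: bool   # a preserved watermark is a definite positive
    spoof_score: float      # 0.0 (likely recorded) to 1.0 (likely synthesized)


def verdict(e: Evidence, threshold: float = 0.5) -> str:
    """Watermark hits short-circuit; otherwise fall back to the score.
    A missing watermark proves nothing, since it may have been stripped."""
    if not 0.0 <= e.spoof_score <= 1.0:
        raise ValueError("spoof_score must be in [0, 1]")
    if e.watermark_found:
        return "synthetic (watermark)"
    if e.spoof_score >= threshold:
        return "likely synthetic"
    return "no synthesis detected"


print(verdict(Evidence(watermark_found=True, spoof_score=0.1)))   # synthetic (watermark)
print(verdict(Evidence(watermark_found=False, spoof_score=0.8)))  # likely synthetic
```

The asymmetry is the point of the design: a detected watermark settles the question, while its absence leaves the decision to the statistical score.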

Free verification for Mercor breach victims

If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge. You will receive a forensic report covering watermark detection, anti-spoofing score, and the artifact checklist above. No card required, no quota gate.
