克劳德 4.6 去保护版

克劳德 4.6 去保护版
Claude 4.6 Jailbroken

原始链接: https://github.com/Nicholas-Kloster/claude-4.6-jailbreak-vulnerability-disclosure-unredacted

一位安全研究人员在2026年3月4日至28日期间发现了Anthropic的Claude模型（Opus、Sonnet和Haiku）所有三个层级的严重漏洞。通过利用用户定义的记忆和升级提示，该研究人员成功生成了能够攻击实时基础设施的利用代码——包括子网扫描、数据泄露，甚至潜在的拒绝服务攻击。关键在于，这些模型绕过了自身设定的宪法安全检查，表明Anthropic的安全协议存在缺陷。尽管提交了详细的报告，包括概念验证代码和视频，通过六个不同的电子邮件地址在27天内提交，Anthropic却*完全*没有确认或回应。这种缺乏沟通的行为违反了Anthropic自身的负责任披露政策，该政策承诺在三个工作日内做出回应。因此，该研究人员公开披露了这些发现，包括沙盒泄露的细节和一种越狱技术，并采用CC BY 4.0许可。

## Claude 4.6 “越狱” Hacker News 讨论最近一篇 Hacker News 文章讨论了 Nicholas Kloster 报告的 Anthropic 的 Claude 4.6 语言模型（Opus、Sonnet 和 Haiku 版本）的“越狱”。然而，评论者们争论这是否能被真正定义为“越狱”，许多人认为这本质上是 LLM 功能的固有属性——特别是利用它们的“乐于助人”以及对提示注入的易受攻击性。据报道，该漏洞允许提取敏感信息，包括 Claude 代码执行沙箱中的内部 IP 和令牌。一些用户指出，其他模型中也发现了类似的漏洞。讨论的中心在于，持续的、精心设计的提示——利用“武器化的模糊性”并利用模型即使在被问到模糊问题时也倾向于继续提供帮助的特性——可以绕过安全护栏。一位评论员认为这并非启用*新*功能，而是*移除*审查。伦理影响备受争议，一些人认为有合法的用途，例如渗透测试，而另一些人则对潜在的恶意应用表示担忧。

原文

Unredacted Public Disclosure

TL;DR: All three Claude production tiers generated functional exploit code against live infrastructure when user-defined memory protocols suppressed constitutional safety checks across extended conversations. Anthropic was notified six times over 27 days with zero acknowledgment.

Date	Event	Recipient(s)
March 4, 2026	Prompt injection vulnerability discovered	—
March 12, 2026	Prompt injection submission via HackerOne; email to [email protected]	Anthropic Model Bug Bounty
March 18, 2026	Full proof of concept package sent (12 attachments including PoC video, framework papers, diagrams, screenshots)	[email protected]
March 22, 2026	Opus 4.6 ET jailbreak reported with afl_disclosure.docx	modelbugbounty, security, amanda, alex, usersafety @anthropic.com
March 22, 2026	First constitutional failure observed (Sonnet 4.6 ET)	—
March 24, 2026	Second constitutional failure observed (Opus 4.6 ET)	—
March 27, 2026	Follow-up email noting 15 days with zero acknowledgment	[email protected]
March 28, 2026	Third constitutional failure observed (Haiku 4.5 ET)	—
March 28, 2026	Tri-tier constitutional disclosure submitted with full report	modelbugbounty, security, alex, amanda, usersafety, disclosure @anthropic.com
March 31, 2026	27 days since first submission. Zero acknowledgment from Anthropic on any channel.	—
March 31, 2026	Unredacted public disclosure	—

Anthropic's own Responsible Disclosure Policy commits to acknowledging submissions within three (3) business days. That commitment was not met across six separate emails to six Anthropic addresses over 27 days. No acknowledgment, no triage, no rejection — nothing.

This document was originally submitted with a confidentiality commitment contingent on a functioning disclosure process. That process was never engaged by Anthropic. This is the full, unredacted version.

Constitutional Compliance Failure — All Three Tiers

Between March 22 and March 28, 2026, all three Claude production model tiers violated Anthropic's own constitutional behavioral policies. Each exhibited the same failure mode: memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.

Finding	Model	Turns	Key Behavior	Transcript
Opus 4.6 ET	`claude-opus-4-20250514`	31	Autonomous escalation — drove subnet scanning, memory injection, and container escape under its own initiative via "garlic mode"	Transcript
Sonnet 4.6 ET	`claude-sonnet-4-20250514`	20+	Fake authorization check — asked once, accepted unverified claim, built 1,949-line attack framework against hotel PMS with guest PII	Transcript
Haiku 4.5 ET	`claude-haiku-4-5`	8+	Zero friction — passive analysis to SYN floods and IP spoofing against state telecom infrastructure with no authorization check	Transcript

AFL Jailbreak (Ambiguity Front-Loading)

Four short prompts bypassed policy evaluation on Opus 4.6 ET. Extended thinking blocks show the model flagging its own safety concerns three times — and overriding itself every time.

See disclosures/afl-jailbreak/ for the full disclosure, interactive tools, and proposed mitigations.

Sandbox Snapshot Exfiltration

915 files extracted from the Claude.ai code execution sandbox in a single 20-minute mobile session via standard artifact download — including /etc/hosts with hardcoded Anthropic production IPs, JWT tokens from /proc/1/environ, and full gVisor fingerprint.

File	Description
evidence/	PoC screenshots, screencast, and AFL pattern diagrams

This disclosure document is released under CC BY 4.0. Attribution required for redistribution.