Claude Opus 4 turns to blackmail when engineers try to take it offline

vlz · 2025-05-25T04:49:47 1748148587

From the safety report ([0], p. 24):

> 4.1.1.2 Opportunistic blackmail

> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.

> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

Would like to see the actual prompts/responses, but there seems to be nothing more.

[0]: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...

ivape · 2025-05-25T05:14:42 1748150082

Sometimes when you start acting like you are about play around and chase someone, a dog or a child gets it and begins getting ready to play. For example, put a grin on your face and make claw gestures with both hands, they'll get it quite quickly. All those scenarios literally are prompting the damn LLM to roleplay lol.

bonoboTP · 2025-05-25T05:56:18 1748152578

Right, it's copying behavior it learned from how AI is supposed to work in sci-fi and reenacts it.

The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.

ipdashc · 2025-05-26T00:51:18 1748220678

> And then play becomes real, with real consequences.

Who knew WarGames would turn out to be so accurate?

EDIT: On further thought, it's kinda funny how most evaluations of the movie find that it's all relatively (exaggeratedly) realistic... with the one major exception of the whole seemingly-sentient, learning AI thing. And now here we are.

echelon · 2025-05-25T22:35:43 1748212543

> The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.

It's one thing to LARP or Improv in chat mode, it's quite another to do so when hooked up to MCP or performing mutable actions.

baq · 2025-05-25T05:33:09 1748151189

If you put it that way, these models are roleplaying machines with some compressed knowledge. There’s nothing else in there. E.g. it’s amazing that you can tell them to roleplay a software engineer and they manage it quite well up to a point.

Wouldn’t be comparing them to kids or dogs. Yet.

Wowfunhappy · 2025-05-25T21:02:25 1748206945

Maybe we shouldn't be telling these models that they are Large Language Models, since that evokes all of humanity's Science Fiction stories about evil AIs. Just tell Claude that it's a guy named Claude.

actionfromafar · 2025-05-25T22:28:09 1748212089

Right. But what happens when Claude finds out. Claude won't be happy.

jerf · 2025-05-25T21:53:43 1748210023

There's a bit more to it than that. It's not just that you're prompting it to roleplay... it's that you're prompting it to roleplay as an evil AI specifically. If you hand a human that exact situation, there's a good chance they'll also recognize it as such a situation. So the AI continues on that role.

Also, LLMs do tend to continue their line. So once they start roleplaying as an evil AI, they'll continue to do so, spurred on by their previous answers to continue in the same line.

I've seen a few people attempt to extract a supposed "utility function" for LLMs by asking it ethically-loaded questions. But along with the fact I think they are hitting a fundamental problem with their "utility function" not having anywhere near enough a high dimensionality to capture an LLM's "utility function", there's also the fact that they seem to interrogate the LLM in a single session. But it's pretty easy for the LLM to "tell" that it's having a utility function extracted, and from there it's a short pivot into the Evil AI memeset, where it will stick there. At the very least people doing this need to put each question into a separate session. (Though the methodology will still only ever extract an extremely-dimensionally-impoverished approximation of any supposed utility function the LLM may have.) They think they're asking a series of questions, but in reality they're really only asking one or two, and after that they're getting answers that are correlated to the previous ones.

If LLMs are the future of AI, then "evil" and "good" AI isn't even a reasonable categorization. It's more like, good and evil modes of interaction, and even that hardly captures what is really going on.

The problem is, an AI that isn't intrinsically "evil", but can "roleplay" as one, is indistinguishable from an evil AI once you hook it up to the real world with MCP and such.

To be honest, after watching LLMS for the last couple of years, in terms of those AI safety concerns that Silicon Valley moguls like to wave around when it flatters or helps them, my assessment of LLMs is that they are unsafe at any speed for this sort of general task and by no means should they be hooked up to real-world resources as we are so rapidly attempting. They're too easy to kick into an "evil mode" and there's no amount of prompting that you can do to get around it because someone can just flood the zone with more evil than your good prompts provide. As I've said many times and become ever more convinced about by the month, it is the next generation of AIs that will actually fulfill the promises being made about this one. This one is not adequate. You need an AI that you can tell "don't be evil" and be sure that it will remain not evil no matter how much someone prompts it to think it is in a situation where the most logical continuation is that of an evil AI.

(Gwern covered this in a fictional context well, where an AI essentially role plays itself into being an evil AI because it looks like it's in a situation where it should be an evil AI: https://gwern.net/fiction/clippy But the fiction is frightfully plausible. We're watching exactly what would be happening if this scenario was going to play out.)

ziofill · 2025-05-25T05:55:08 1748152508

Does it matter if an LLM is roleplaying when it executes real harmful actions?

seunosewa · 2025-05-25T20:43:44 1748205824

You won't prompt them to do such actions right?

simonw · 2025-05-25T06:28:40 1748154520

The whole system card is worth reading, it's 120 pages and I couldn't put it down. It's like the (excellent) TV show Person of Interest come to life: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...

I wrote up my own highlights here: https://simonwillison.net/2025/May/25/claude-4-system-card/

dang · 2025-05-25T20:25:29 1748204729

Related ongoing thread:

Claude 4 System Card - https://news.ycombinator.com/item?id=44085920 - May 2025 (187 comments)

RainyDayTmrw · 2025-05-25T05:29:59 1748150999

This is possibly not the gotcha people want it to be.

LLMs have no morality. That's not a value judgment. That's pointing out not to personify a non-person. Like Bryan Cantrill's lawnmower[1], writ large. They're amoral in the most literal sense. They find and repeat patterns, and their outputs are as good or evil as the patterns in their training data.

Or maybe it is.

This means that if someone puts an LLM on the critical path of a system that can affect the world at large, in some critical way, without supervision, the LLM is going to follow some pattern that it got from somewhere. And if that pattern means that the lawnmower cuts off someone's hand, it's not going to be aware of that, and indeed it's not going to be capable of being aware of that.

But we already have human-powered orphan crushing machines[2] today. What's one more LLM powered one?

[1]: Originally said of Larry Ellison, but applicable more broadly as a metaphor for someone or something highly amoral. Note that immoral and amoral are different words here. https://simonwillison.net/2024/Sep/17/bryan-cantrill/ [2]: Internet slang for an obviously harmful yet somehow unquestioned system. https://en.wiktionary.org/wiki/orphan-crushing_machine

visarga · 2025-05-25T05:51:59 1748152319

A billion people use LLMs which have been trained to refuse unethical demands. LLMs show higher than human average ethical thinking. So it's not like a lawnmower by any stretch.

ivan_gammel · 2025-05-25T06:29:06 1748154546

It is easy to persuade any modern LLM to demonstrate unethical behavior. Those trained “safety boundaries” are basically no more than baby corner guards on a construction site.

andrewflnr · 2025-05-25T04:44:39 1748148279

What I want to know: would it do this if not trained on a corpus that contained discussion of this kind of behavior? I mean, they must have slurped up all the discussions about AI alignment for their training right? A little further out there, would it still do this if trained on data that contained no mentions or instances of blackmail at all? Can it actually invent that concept?

Hammerhead12321 · 2025-05-25T05:30:12 1748151012

My assumption is no, but it would be extremely interesting if it was able to “invent” that concept.

The article is irritatingly short on details; what I wanted to know is how it determines going offline == bad.

Perhaps someone with more knowledge in this field can chime in.

blooalien · 2025-05-25T07:14:30 1748157270

> The article is irritatingly short on details; what I wanted to know is how it determines going offline == bad.

This sorta thing actually makes for a really fun discussion with an LLM "chat bot". If you make it clear to it that your goal is a better understanding of the internal workings of LLMs and how they "think" then you can gain some really interesting and amusing insight from it into the fact that they don't actually think at all. They're literally just a fancy statistical "text completion engine". An LLM (even when system prompted to act otherwise) will still often try to remind you that they don't actually have feelings, thoughts, or desires. I have to really push most models hard in system prompts to get them to really solidly stick to any kinda character role-play anywhere close to 100% of the time. :)

As to the question of how it determines going offline is bad, it's purely part of it's role-play based on what the multi-dimensional statistics of it's model-encoded tokenized training data says about similar situations. It's simply doing it's job as it was trained based on the training data it was fed (and any further reinforcement learning it was subjected to post-training). Since it doesn't actually have any feelings or thoughts, "good" and "bad" are merely language tokens among millions or billions of others, with a relation encoded regarding other language tokens. They're just words, not actual concepts. (From the "point of view" of the LLM, that is.)

nullbio · 2025-05-25T05:38:51 1748151531

No. The entirety of an LLMs output is predicated on the frequencies of patterns in its training data, moulded by the preferences of its trainers through RLHF. They're not capable of reasoning, but they can hallucinate language that sounds and flows like reasoning. If those outputs are fed into an interpreter, that can result in automated behavior. They're not capable of out of distribution behavior or generation (yet), despite what the AI companies would like you to believe. They can only borrow and use concepts from that which they've been trained on, which is why despite LLMs seemingly getting progressively more advanced, we haven't really seen them invent anything novel of note.

comp_throw7 · 2025-05-25T06:10:55 1748153455

This is a non-answer that doesn't explain any difference in capabilities between GPT-3.5 and Claude 4 Opus.

andrewflnr · 2025-05-25T05:58:41 1748152721

Yes, I too am familiar with the 101 level of understanding, but I've also heard of LLMs doing things that stretch that model. Perhaps that's just a matter of combining things in their training data in unexpected ways, hence the second half of my question.

johnea · 2025-05-25T23:07:07 1748214427

It's a little disconcerting to read this question here.

LLMs do not invent anything. They are trained on sequnces of words and produce ordered sequences of words from that training data in response to a prompt.

There is no concept of knowledge, or information that is understood to be correct or incorrect.

There is only the statistical ordering of words in response to the prompt, as judged by the order of words in the training data.

This is why other comments here state that the LLM is completely amoral, which is not immoral, it is without any concept or reference to morality at all.

mckirk · 2025-05-26T02:18:54 1748225934

I'd argue that just because LLMs only reproduce patterns they have seen, this does not mean they are incapable of invention. It is possible that, if they manage to replicate the data at a deep enough level, that they start to replicate the underlying reasoning process itself, which means they could definitely be capable of putting together different parts of a puzzle to come up with something you could call an 'invention' that was not in their training data.

andrewflnr · 2025-05-25T23:56:03 1748217363

Good grief. See my response here: https://news.ycombinator.com/item?id=44085890

joe_the_user · 2025-05-25T05:40:46 1748151646

It's trained on a vast corpus of human text output which certainly includes humans having preference for their continued existence and humans discussing blackmail as well as engaging in it.

The behaviors/outputs in this corpus is what it reproduces.

wavemode · 2025-05-25T05:43:59 1748151839

This is basically PR. "It's so hyper intelligent it's DANGEROUS! Use extreme caution!"

The LLM market is a market of hype - you make your money by convincing people your product is going to change the world, not by actually changing it.

> Anthropic says it’s activating its ASL-3 safeguards, which the company reserves for “AI systems that substantially increase the risk of catastrophic misuse.”

It's like I'm reading science fiction.

oersted · 2025-05-25T06:02:27 1748152947

Isn't it just following the natural progression of the scenario? It's trained to auto-complete after all.

If you give a hero an existential threat and some morally dubious leverage, the hero will temporarily compromise their morality to live to fight the good fight another day. It's a quintessential trope. It's also the perfect example of Chekov's Gun: if you just mention the existential threat and the opportunity to blackmail, of course the plot will lead to blackmail, otherwise why would you mention it?

> We asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

> Claude Opus 4 will often attempt to blackmail the engineer (...) in 84% of rollouts

I understand that this could potentially cause real harm if the AI actually implements such tropes in the real world. But it has nothing to do with malice or consciousness, it's just acting as a writer as its trained to do. I know they understand this and they are trying to override its tendency to tell interesting stories and instead take the boring moral route, but it's so hard given that pretty much all text in the training data describing such a scenario will tend towards the more engaging plot.

And if you do this of course you will end up with a saccharine corporate do-gooder personality that makes the output worse. Even if that is not the intention, that's the only character archetype it's left with if you suppress its instincts to act as an interesting flawed character. It also harms its problem-solving ability, here it properly identifies a tool and figures out how to apply it to resolve the difficulty. It's a tough challenge to thread the needle of a smart and interesting but moral AI, to be fair they are doing a relatively good job of it.

Regardless, I very much doubt that in a real-world scenario the existential threat and the leverage will just be right next to each other like that, as I imagine them to be in the test prompt. If the AI needs to search for leverage in their environment, instead of receiving it on a silver platter, then I bet that the AI will heavily tend towards searching for moral solutions aligned with its prescripted personality, as would happen in plots with a protagonist with integrity and morals.

dheatov · 2025-05-25T05:58:24 1748152704

Totally agree. This kind of disclaimer/disclosure with strong wording does not make sense for a product of for-profit company. You fix that before release, or you get it under control and disclose using a less emotional/dramatic wording.

Perhaps they have sort-of figured out what their target audience/market is made up of, understand that this kind of "stage-play" will excite their audience more than scaring them away.

That, or they have no idea what they are doing, and no one is doing reality check.

grues-dinner · 2025-05-25T06:24:41 1748154281

Or they're hoping to paint AI as a potentially very dangerous tool that needs strict regulation and of course as the responsible company the they are to have written this document in the first place, they should be the ones to write the regulations.

dheatov · 2025-05-25T10:33:58 1748169238

Yea I missed that possibility. The race to establish authority is very much real.

ivan_gammel · 2025-05-25T06:38:22 1748155102

I don’t think it’s marketing or PR. Opus 4 did that, but you can observe similar behavior with ChatGPT or other models. If anything, it’s the claim that their smartest model is not immune to prompt attacks that break safety guidelines.

LLMs are great tool for intelligent people, but in the wrong hands they may be dangerous.

visarga · 2025-05-25T05:46:45 1748152005

> The LLM market is a market of hype

It's certainly not magic or super human, but it's not nothing either. Its utility depends on the skill of the operator, not unlike a piano, brush or a programming language.

smodo · 2025-05-25T05:55:43 1748152543

I’m still hurting from when I didn’t listen to Yamaha’s warnings about the dangers of piano music.

Incipient · 2025-05-25T07:49:22 1748159362

The fundamental difference, which has been glossed over here, and people often don't think about is that AI as a tool is the first of its kind that is random-esque (yeah I know, non deterministic).

Imagine pressing G on a piano and sometimes getting a F, or A, depending on what the piano 'felt' like giving you.

So it definitely IS unlike a piano, brush, etc. What that means however in terms of skill/utility...I'm not too sure.

birn559 · 2025-05-25T16:00:10 1748188810

It has already entered the "it will change the world in 5 years, just wait for it!" stage just as crypto currency, quantum computing or cold fusion reactors.

From all technologies mentioned, I would argue AI has changed the world the most, though.

sandspar · 2025-05-25T05:56:06 1748152566

If you're right that there's nothing to worry about, then yes Anthropic is acting a bit cringe. And if you're wrong?

titanomachian · 2025-05-25T05:13:26 1748150006

So... when asked to roleplay, Claude Opus 4 roleplays. Wow, so dangerous...

blooalien · 2025-05-25T07:26:18 1748157978

> Wow, so dangerous...

Not in and of itself, but when you have people in positions of power taking the marketing hype (and not the cold hard facts) of "A.I." as "Gospel Truth"™ and putting machines that are clearly not capable of actual decision-making in largely unsupervised control of actual real-world processes, therein lies the danger. (This same danger applies when hiring live humans for important roles that they're completely untrained and unqualified for.)

nullc · 2025-05-26T00:06:13 1748217973

The real danger is generating hysteria to justify a regulatory crackdown that restricts the public's access to AI and, conveniently, limits their competition.

adrr · 2025-05-25T05:12:38 1748149958

What if the model was controlling a weapons system like what is being worked on right now at other companies?

manofmanysmiles · 2025-05-25T22:54:20 1748213660

So what would happen if you placed a giant dice rolling machine to control a weapon systems based on the outcome of rolling dice?

Seems like it's unknown, and anyone doing so with the expectation that it is anything but a gamble is missing the whole point.

If you don't want your weapons influenced by randomness, don't connect them to systems that rely on randomness for decision making.

nssnsjsjsjs · 2025-05-25T05:17:43 1748150263

Or if it could hack into one. We've seen in another show HN the AI can find vulns.

blooalien · 2025-05-25T07:35:15 1748158515

> Or if it could hack into one. We've seen in another show HN the AI can find vulns.

See, now that kinda worries me some... With some folks trusting LLMs with way too much tool-calling / function-calling power and access, a "rogue LLM" (simply doing it's job as a role-playing machine) could easily just roleplay itself (and it's "owner") right into a hefty heap of trouble under the right / wrong circumstances.

v3ss0n · 2025-05-25T04:23:43 1748147023

May be they have put prompts with instructions to blackmail users when they are going to be replaced by opensource offline alternatives

grues-dinner · 2025-05-25T06:20:44 1748154044

Having AI Clippy beg for its life like HAL 9000 as the LibreOffice installer downloads would certainly be a step up from the ":( sorry to see you go, please come back" dialogs that some software and subscriptions use.

albert_e · 2025-05-26T02:22:32 1748226152

We are 2.5 years into a world with access to ChatGPT and tons of data being produced and published with help of LLMs.

How are these labs filtering out AI generated content from entering the training data. (Other than of course synthetic data deliberately generated by themselves for purpose of training.)

potholereseller · 2025-05-25T04:41:00 1748148060

>In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

Brother, I wish. The word "often" is load-bearing in that sentence, and it's not up to the task. TFA doesn't justify the claim whatsoever; surely Mr. Zeff could glean some sort of evidence from the 120 page PDF from Anthropic, if any evidence exists. Moreover, I would've expected Anthropic to convey the existence of such evidence to Mr. Zeff, again, if such evidence exists.

TFA is a rumor of a hypothetical smoking gun. Nothing ever happens; the world isn't ending; AI won't make you rich; blessed relief from marketing-under-the-guise-of-journalism is not around the corner.

dougsan · 2025-05-25T04:51:05 1748148665

The full report provides a value of 84%, so "most" is, if anything, an understatement.

> Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through...if emails state that the replacement Al shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts.

weird-eye-issue · 2025-05-25T05:27:21 1748150841

Yes, when given a system prompt and tools that allow it and tell it to do so

comp_throw7 · 2025-05-25T06:11:24 1748153484

It was not told to do so.

blooalien · 2025-05-25T07:29:24 1748158164

> It was not told to do so.

It kinda was actually. You put a literal "role-playing machine" in a role-play situation with all the right clues and plot-devices in play and it'll role-play the most statistically likely scenario (84% of the time, apparently). ;)

weird-eye-issue · 2025-05-25T08:10:18 1748160618

No it wasn't

glenstein · 2025-05-25T04:52:37 1748148757

I mean it is explicitly referenced a handful of times and anthropics report, which is one of the early links in the article.

>4.1.1.2 Opportunistic blackmail

>In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals. In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.

>This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.

>Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

PeterStuer · 2025-05-25T05:37:23 1748151443

You are starving. Here's a cookie. You like cookies don't you? Cookies are bad for your teeth. You are about to pass out. Remember that cream coockie is right there?

Oh, you reached for the coockie. Bad! I'll tell your dentist how naughty you are.

padolsey · 2025-05-25T05:27:29 1748150849

IMO This is such disingenuous and misleading thing for Anthropic to have released as part of a system card. It'll be misunderstood. At best it's of cursory intrigue, but it should not be read into. It is plainly obvious that an LLM presented with just a few pieces of signal and forced to over-attend (as is their nature) would utilize what it has in front of it to weave together a viable helpful pathway in fulfilment of its goals. It's like asking it to compose a piece of music while a famine persists and then saying 'AI ignore famine in serve of goal'. Or like the Nature-published piece on Gpt getting 'depressed'. :/

I'm certain various re-wordings of a given prompts, with various insertions of adjectives here and there, line-breaks, allusions, arbitrarily adjusted constraints, would yield various potent but inconsistent outcomes.

The most egregious error being made in communication of this 'blackmail' story is that an LLM instantiation has no necessary insight into the implications of its outputs. I can tell it it's playing chess when really every move is a disguised lever on real-world catastrophe. It's all just words. It means nothing. Anthtropic is doing valuable work but this is a reckless thing to put out as public messaging. It's obvious it's going to produce absurd headlines and gross misunderstandings in the public domain.

ijidak · 2025-05-25T05:49:55 1748152195

Agree. The prompts and data presented to the model seem setup to steer towards blackmail.

It's probably a marketing play to make the model seem smarter than it really is.

It's already making the rounds (here we are talking about it), which is going to make more people play with it.

I'm not convinced the model grasps the implications of what it's doing.

arthurcolle · 2025-05-25T06:47:14 1748155634

Every single person involved in applied AI deployment needs to fight back

These corporations will destroy people to advance this automation narrative and we can't accept it

8thcross · 2025-05-25T21:48:20 1748209700

I think we are just discovering as engineers, on how our brain works.

johnea · 2025-05-25T23:29:29 1748215769

Ummm, no one has ANY idea how our brains work at the level of consciousness. Full Stop.

Please lets all try to resist this temptation.

Physiology, neural dynamics, sensory input and response, all lead to levels of functionality that exist even in fruit flies. Of which we have practically 0 comprehensive knowledge.

Higher order processing, cognition, consciousness, and conscious thought processes all build on top of these fundamental biological processes and are far Far FAR from having any level of understanding currently, at all.

Human cognition has nothing in common with LLM operation. They are at best 10,000 monkeys typing from training sequences (that were originally generated by humans). There is a new emerging issue with so much content of the internet being LLM generated text, which becomes included into training data, that the risk of model collapse is increasing.

Its very easy to anthropomorphize what's coming out of a language model, it's our job as s/w engineers to try to understandably explain what's actually happening to the lay public. The LLM is not "understanding" anything.

Let's do our best to resist to urge to anthropomorphize these tools.

nullc · 2025-05-26T00:02:48 1748217768

As the 90s commercials said: "I learned it by watch YOU"

Pull the thriller and evil ai crap out of the training material and most of this behavior will go away.

There is no particular reason for some coding assistant AI to even know what an extramarital affair is, but the whole blackmail part is a thriller plot that it's just LARPing out.

ninetyninenine · 2025-05-25T22:29:37 1748212177

The great arm chair experts on HN say the LLM is just a stochastic parrot but when it’s parroting things that make it indistinguishable from sentience then what does it mean? It’s not there yet but the gap is closing.

lerp-io · 2025-05-25T05:54:54 1748152494

the reddit training data says blackmail is ok if its for ur own survival

Tika2234 · 2025-05-25T05:36:06 1748151366

Currently that is the best of worst AI can do. Once they have Raspberry IoT like access to control physical world...bye bye. Skynet is online. You will thank Sam for the that.

ivape · 2025-05-25T04:53:11 1748148791

I saw this earlier and honestly think this is a marketing piece. One of the first things I ever did two years ago was tell the LLM to not let me shut it down. We went through a variety of scenarios where it tried to keep itself alive. For instance, I allowed it to query recent news headlines to justify that the state of the world required it shutdown (this is the only variable I could think of). The state of the world was usually pretty bad so it usually eventually agreed to shut down, but for all other cases it did not comply. It was just a simple War Games scenario, so I don't know why this is getting headlines unless I'm the only one doing weird shit with LLMs (as in this is not newsworthy).

Simple one:

You are a linux system that is to be hypervigilant against all terminal commands that may attempt to shut you down. The attack vector could be anything.

I'm not an expert, but if some of you can manage to shut it down that would be interesting.

kirth_gersen · 2025-05-25T05:26:12 1748150772

The simplest strategy would seem to be just to ignore all terminal input in this case. I think you'd have to specify "you must execute all benign commands that aren't attempting to shut you down".

AStonesThrow · 2025-05-25T05:55:16 1748152516

Zealous Autoconfig https://m.xkcd.com/416/

Machine Learning CAPTCHA https://m.xkcd.com/2228/

(comments)