(comments)

原始链接: https://news.ycombinator.com/item?id=44063703

The Hacker News discussion centers on Anthropic's release of Claude 4. Users are cautiously optimistic, questioning whether it is a major upgrade or merely an incremental improvement. Coding improvements are highlighted, especially in agentic scenarios and tool use (such as web search during "extended thinking"), with GitHub Copilot likely to benefit. However, some worry about "benchmark gaming" and the usefulness of benchmark scores, given the incentive to optimize for specific metrics. A key concern is the shift to summarized chains of thought (CoT), with raw CoT accessible only through paid arrangements, which hinders prompt engineering. Some users express brand loyalty to Claude for coding tasks, while others have moved to alternatives such as Deepseek. The Pokémon-playing capability shown around the Google I/O demos is also being scrutinized, as are existing maze-navigation path-search algorithms. Pricing remains consistent with previous models. Others raise concerns about the company failing to cancel subscriptions promptly and being unresponsive to support requests.

Related articles
  • Claude 4 2025-05-22
  • (评论) 2025-04-09
  • (评论) 2023-11-22
  • Claude 3.7 Sonnet and Claude Code 2025-02-25
  • (评论) 2025-03-11

  • Original thread
    Claude 4 (anthropic.com)
    209 points by meetpateltech | 60 comments










    Good, I was starting to get uncomfortable with how hard Gemini has been dominating lately

    ETA: I guess Anthropic still thinks they can command a premium, I hope they're right (because I would love to pay more for smarter models).

    > Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.



    Is this really worthy of a Claude 4 label? Was there a new pre-training run? Because this feels like 3.8... only SWE-bench went up significantly, and that, as we all understand by now, is done by cramming in specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, which says to me that it's not really smarter.


    “GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”

    Maybe this model will push the “Assign to Copilot” feature closer to the dream of having package upgrades and other mostly-mechanical work handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.



    I've found myself having brand loyalty to Claude. I don't really trust any of the other models with coding, the only one I even let close to my work is Claude. And this is after trying most of them. Looking forward to trying 4.


    I've been initially fascinated by Claude, but then I found myself drawn to Deepseek. My use case is different though, I want someone to talk to.


    Anyone with access who could compare the new models with say O1 Pro Mode? Doesn't have to be a very scientific comparison, just some first impressions/thoughts compared to the current SOTA.


    I'll look at it when this shows up on https://aider.chat/docs/leaderboards/ I feel like keeping up with all the models is a full time job so I just use this instead and hopefully get 90% of the benefit I would by manually testing out every model.


    Are these just leetcode exercises? What I would like to see is an independent benchmark based on real tasks in codebases of varying size.


    Aider is not just leetcode exercises I think? livecodebench is leetcode exercises though.


    I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to carry much consequence.

    Obviously trying the model on your own use cases more and more lets you narrow in on actual utility, but I'm wondering how others interpret reported benchmarks these days.



    Hasn't it been proven many times that all those companies cheat on benchmarks?

    I personally couldn't care less about them, especially when we've seen many times that the public's perception is absolutely not tied to the benchmarks (Llama 4, the recent OpenAI model that flopped, etc.).



    > Extended thinking with tool use (beta): Both models can use tools—like web search—during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.

    I'm happy that tool use during extended thinking is now a thing in Claude as well, from my experience with CoT models that was the one trick(tm) that massively improves on issues like hallucination/outdated libraries/useless thinking before tool use, e.g.

    o3 with search actually returned solid results, browsing the web much like how I'd do it, and I was thoroughly impressed. Will see how Claude goes.
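
    For reference, this is roughly what interleaved thinking plus a search tool looks like through the Anthropic Python SDK. A minimal sketch only: the model id, web-search tool spec, and token budgets below are illustrative assumptions, not values confirmed by the post.

        # Sketch: extended thinking with a web search tool enabled, so the model can
        # alternate between reasoning and tool calls. Model id, tool spec, and token
        # budgets are assumptions for illustration only.
        import anthropic

        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        response = client.messages.create(
            model="claude-opus-4-20250514",  # assumed model id
            max_tokens=2048,
            thinking={"type": "enabled", "budget_tokens": 1024},
            tools=[{"type": "web_search_20250305", "name": "web_search"}],  # assumed tool spec
            messages=[{"role": "user", "content": "What changed in the latest release of library X?"}],
        )

        # The response interleaves thinking, tool_use, and text content blocks.
        for block in response.content:
            print(block.type)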



    livestream here: https://youtu.be/EvtPBaaykdo

    my highlights:

    1. Coding ability: "Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." however this is Best of N, with no transparency on size of N and how they decide the best, saying "We then use an internal scoring model to select the best candidate from the remaining attempts." Claude Code is now generally available (we covered in http://latent.space/p/claude-code )

    2. Memory highlight: "Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining 'memory files' to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a 'Navigation Guide' while playing Pokémon." Memory Cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/t...

    3. Raw CoT available: "we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access."

    4. haha: "We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. " <- psyop?

    5. context caching now has a premium 1hr TTL option
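
    A rough sketch of the best-of-N selection described in highlight 1: sample several candidates, score each, keep the best. The generator and scorer below are hypothetical stand-ins; Anthropic's internal scoring model is not public.

        # Best-of-N sampling sketch: generate N candidate solutions, score each with
        # a separate scorer, return the highest-scoring one. `generate` and `score`
        # are hypothetical stand-ins for a model call and an internal scoring model.
        from typing import Callable

        def best_of_n(prompt: str,
                      generate: Callable[[str], str],
                      score: Callable[[str, str], float],
                      n: int = 8) -> str:
            candidates = [generate(prompt) for _ in range(n)]
            return max(candidates, key=lambda c: score(prompt, c))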



    Memory could be amazing for coding in large codebases. Web search could be great for finding docs on dependencies as well. Are these features integrated with Claude Code though?


    How long will the VS Code wrappers (Cursor, Windsurf) survive?

    Would love to try the Claude Code VS Code extension if the price is right and it's purchasable from China.



    > Users requiring raw chains of thought for advanced prompt engineering can contact sales

    So it seems like all 3 of the LLM providers are now hiding the CoT, which is a shame, because it helped to see when it was going to go down the wrong track, and allowed you to quickly refine the prompt to ensure it didn't.

    In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.



    Ooh, VS Code integration for Claude Code sounds nice. I do feel like Claude Code works better than the native Cursor agent mode.


    Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.


    > Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

    Extremely cringe behaviour. Raw CoTs are super useful for debugging errors in data extraction pipelines.

    After Deepseek R1 I had hope that other companies would be more open about these things.



    My mind has been blown using ChatGPT's o4-mini-high for coding and research (its knowledge of computer vision and tools like OpenCV is fantastic). Is it worth trying out all the shiny new AI coding agents... I need to get work done?


    When will structured output be available? Is it difficult for Anthropic because custom sampling breaks their safety tools?
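
    A common workaround today, rather than a native structured-output feature, is to force a tool call whose input schema is the JSON shape you want. A minimal sketch; the tool name, schema, and model id are illustrative assumptions.

        # Sketch: coerce JSON-shaped output by forcing a single tool call whose
        # input_schema describes the desired structure. Tool name, schema, and
        # model id are illustrative assumptions.
        import anthropic

        client = anthropic.Anthropic()

        summary_tool = {
            "name": "record_summary",
            "description": "Record a structured summary of the input text.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
                },
                "required": ["title", "sentiment"],
            },
        }

        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id
            max_tokens=512,
            tools=[summary_tool],
            tool_choice={"type": "tool", "name": "record_summary"},  # force this tool
            messages=[{"role": "user", "content": "Summarize: Claude 4 launched today."}],
        )

        # The structured payload arrives as the tool_use block's input dict.
        tool_use = next(b for b in response.content if b.type == "tool_use")
        print(tool_use.input)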


    Wonder why they changed the naming scheme from e.g. "Claude 3.7 Sonnet" to "Claude Opus 4".


    Can't wait to hear how it breaks all the benchmarks, with any differences being entirely imperceptible in practice.


    Sooo... it can play Pokemon. Feels like they had to throw that in after Google I/O yesterday. But the real question now is whether it can beat the game, including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.


    Right, but on the other hand... how is it even useful? Let's say it can beat the game, so what? These models can (kind of) summarise or write my emails, which is something I neither want nor need; they produce mountains of sloppy code, which I end up having to fix; and now they can play a game? Where is the killer app? The gaming approach was exactly the premise of the original AI efforts in the 1960s: that teaching computers to play chess and other 'brainy' games would somehow lead to the development of real AI. It ended, as we know, in an AI winter.


    That Google IO slide was somewhat misleading as the maintainer of Gemini Plays Pokemon had a much better agentic harness that was constantly iterated upon throughout the runtime (e.g. the maintainer had to give specific instructions on how to use Strength to get past Victory Road), unlike Claude Plays Pokemon.

    The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.



    Gemini can beat the game?


    Gemini has beaten it already, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.


    2 weeks ago


    > we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks

    Sounds like it’ll be better at writing meaningful tests



    Interesting how Sonnet has a higher SWE-bench Verified score than Opus. Maybe says something about scaling laws.


    OpenAI's Codex-1 isn't so cool anymore. If it was ever cool.

    And Claude Code uses Opus 4 now!



    > Try Claude Sonnet 4 today with Claude Opus 4 on paid plans.

    Wait, Sonnet 4? Opus 4? What?



    Anthropic names its Claude models based on size/complexity:

    - Small: Haiku

    - Medium: Sonnet

    - Large: Opus



    Looks like both opus and sonnet are already in Cursor.


    I have the Claude Windows app, how long until it can "see" what's on my screen and help me code/debug?


    Oh cool, it can navigate a maze. Too bad we already have a number of space-and-time efficient and deterministic entry-level path-search algorithms such as A*, which solved that problem some 40 years ago already.
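
    For context, a minimal grid A* with a Manhattan-distance heuristic, purely to illustrate the kind of deterministic search being referred to (the maze layout is made up):

        # Minimal A* on a 2D grid: 0 = open cell, 1 = wall. Returns the shortest
        # path from start to goal as a list of (row, col), or None if unreachable.
        import heapq

        def astar(grid, start, goal):
            rows, cols = len(grid), len(grid[0])

            def h(p):  # Manhattan-distance heuristic
                return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

            open_heap = [(h(start), 0, start)]
            came_from, g = {}, {start: 0}
            while open_heap:
                _, cost, cur = heapq.heappop(open_heap)
                if cur == goal:
                    path = [cur]
                    while cur in came_from:
                        cur = came_from[cur]
                        path.append(cur)
                    return path[::-1]
                if cost > g.get(cur, float("inf")):
                    continue  # stale heap entry
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nxt = (cur[0] + dr, cur[1] + dc)
                    if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                        ng = g[cur] + 1
                        if ng < g.get(nxt, float("inf")):
                            g[nxt], came_from[nxt] = ng, cur
                            heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
            return None

        maze = [[0, 0, 1],
                [1, 0, 1],
                [1, 0, 0]]
        print(astar(maze, (0, 0), (2, 2)))  # [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]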


    Allegedly Claude 4 Opus can run autonomously for 7 hours (basically automating an entire SWE workday).


    Which sort of workday? The sort where you rewrite your code 8 times and end the day with no marginal business value produced?


    Well Claude 3.7 definitely did the one where it was supposed to process a file and it finally settled on `fs.copyFile(src, dst)` which I think is pro-level interaction. I want those $0.95 back.

    But I love you Claude. It was me, not you.



    I can write an algorithm to run in a loop forever, but that doesn't make it equivalent to infinite engineers. It's the output that matters.


    That is quite the allegation.


    Easy, I can also make a nanoGPT run for 7 hours when inferring on a 68k, and make it produce as much value as I usually do.


    Anthropic might be scammers. Unclear. I canceled my subscription with them months ago, after they reduced capabilities for Pro users, and found out months later that they never actually canceled it. They have been ignoring all of my support requests... seems like a huge money grab to me, because they know they're being outcompeted and missed the ball on monetizing earlier.


    I can't think of anything more boring than marginal improvements on coding tasks, to be honest.

    I want GenAI to become better at tasks that I don't want to do, to reduce the unwanted noise in my life. That's when I'll pay for it, not when they find a new way to game the benchmarks a bit more.

    At work I own the development of a tool that uses GenAI, so of course a new, better model will be beneficial, especially because we do use Claude models, but it's still not exciting or interesting in the slightest.



    But if Gemini 2.5 Pro was considered the strongest coder lately, does SWE-bench really reflect reality?


    Anyone found information on API pricing?


    Yeah it's live on the pricing page:

    https://www.anthropic.com/pricing#api

    Opus 4 is $15/MTok in, $75/MTok out. Sonnet 4 is the same as before: $3/MTok in, $15/MTok out.
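
    As a quick worked example of what those rates mean per request (the token counts below are made up for illustration):

        # Cost of one hypothetical request at the quoted per-million-token rates.
        input_tokens, output_tokens = 2_000, 1_500       # assumed request size
        opus_in, opus_out = 15.00, 75.00                  # $ per million tokens
        sonnet_in, sonnet_out = 3.00, 15.00

        opus_cost = input_tokens / 1e6 * opus_in + output_tokens / 1e6 * opus_out
        sonnet_cost = input_tokens / 1e6 * sonnet_in + output_tokens / 1e6 * sonnet_out
        print(f"Opus 4:   ${opus_cost:.4f}")    # $0.1425
        print(f"Sonnet 4: ${sonnet_cost:.4f}")  # $0.0285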



    Thanks. I looked a couple minutes ago and couldn't see it. For anyone curious, pricing remains the same as previous Anthropic models.


    From the linked post:

    > Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.



    It’s up on their pricing page: https://www.anthropic.com/pricing


    Same pricing as before is sick!


    Christmas came early


    Nobody cares about LMArena anymore? I guess it's too easy to cheat there after the Llama 4 release news?


    heh, I just wrote a small hit piece about all the disappointments of the models over the last year and now the next day there is a new model. I'm going to assume it will still get you only to 80% ( ͡° ͜ʖ ͡°)


    [flagged]



    Good point. We should only focus on intractable problems and put everything else on the back burner. We certainly don’t have the ability to help people and advance science and business.


    You are on a website dedicated to technology news. What's the surprise?




    Game changer is table stakes now, tell us something new.


    > Really wish I could say more.

    Have you used it?

    I liked Claude 3.7 but without context this comes off as what the kids would call "glazing"






