(comments)

原始链接: https://news.ycombinator.com/item?id=44063703

The Hacker News discussion centers on Anthropic's release of Claude 4. Users are cautiously optimistic, questioning whether it is a major upgrade or merely an incremental improvement. Coding improvements are highlighted, especially in agentic scenarios and tool use (such as web search during "extended thinking"), with GitHub Copilot likely to benefit. However, some worry about "benchmark gaming" and the usefulness of benchmark scores, given the incentive to optimize for specific metrics. A key concern is the shift to summarized chains of thought (CoT), with raw CoT accessible only through paid arrangements, which hinders prompt engineering. Some users express brand loyalty to Claude for coding tasks, while others have moved to alternatives such as Deepseek. The Pokémon-playing capability shown around the Google I/O demos is also being scrutinized, as are existing maze-navigation path-search algorithms. Pricing remains consistent with previous models. Others raise concerns about the company failing to cancel subscriptions promptly and being unresponsive to support requests.

Related articles
  • Claude 4 2025-05-22
  • (评论) 2025-04-09
  • (评论) 2023-11-22
  • Claude 3.7 Sonnet and Claude Code 2025-02-25
  • (评论) 2025-03-11

  • Original thread
    Claude 4 (anthropic.com)
    209 points by meetpateltech | 60 comments










    Good, I was starting to get uncomfortable with how hard Gemini has been dominating lately

    ETA: I guess Anthropic still thinks they can command a premium, I hope they're right (because I would love to pay more for smarter models).

    > Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.



    Is this really worthy of a Claude 4 label? Was there a new pre-training run? Because this feels like 3.8... only SWE-bench went up significantly, and that, as we all understand by now, is done by cramming in specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, which says to me that it's not really smarter.


    “GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”

    Maybe this model will push the “Assign to Copilot” feature closer to the dream of having package upgrades and other mostly-mechanical work handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.



    I've found myself having brand loyalty to Claude. I don't really trust any of the other models with coding, the only one I even let close to my work is Claude. And this is after trying most of them. Looking forward to trying 4.


    I've been initially fascinated by Claude, but then I found myself drawn to Deepseek. My use case is different though, I want someone to talk to.


    Anyone with access who could compare the new models with say O1 Pro Mode? Doesn't have to be a very scientific comparison, just some first impressions/thoughts compared to the current SOTA.


    I'll look at it when this shows up on https://aider.chat/docs/leaderboards/ I feel like keeping up with all the models is a full time job so I just use this instead and hopefully get 90% of the benefit I would by manually testing out every model.


    Are these just leetcode exercises? What I would like to see is an independent benchmark based on real tasks in codebases of varying size.


    Aider is not just leetcode exercises I think? livecodebench is leetcode exercises though.


    I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to carry much consequence.

    Obviously trying the model on your own use cases more and more lets you narrow in on actual utility, but I'm wondering how others interpret reported benchmarks these days.



    Hasn't it been proven many times that all those companies cheat on benchmarks?

    I personally couldn't care less about them, especially when we've seen many times that the public's perception is absolutely not tied to the benchmarks (Llama 4, the recent OpenAI model that flopped, etc.).



    > Extended thinking with tool use (beta): Both models can use tools—like web search—during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.

    I'm happy that tool use during extended thinking is now a thing in Claude as well, from my experience with CoT models that was the one trick(tm) that massively improves on issues like hallucination/outdated libraries/useless thinking before tool use, e.g.

    o3 with search actually returned solid results, browsing the web much like how I'd do it, and I was thoroughly impressed. Will see how Claude goes.
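
    For reference, this is roughly what interleaved thinking plus a search tool looks like through the Anthropic Python SDK. A minimal sketch only: the model id, web-search tool spec, and token budgets below are illustrative assumptions, not values confirmed by the post.

        # Sketch: extended thinking with a web search tool enabled, so the model can
        # alternate between reasoning and tool calls. Model id, tool spec, and token
        # budgets are assumptions for illustration only.
        import anthropic

        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        response = client.messages.create(
            model="claude-opus-4-20250514",  # assumed model id
            max_tokens=2048,
            thinking={"type": "enabled", "budget_tokens": 1024},
            tools=[{"type": "web_search_20250305", "name": "web_search"}],  # assumed tool spec
            messages=[{"role": "user", "content": "What changed in the latest release of library X?"}],
        )

        # The response interleaves thinking, tool_use, and text content blocks.
        for block in response.content:
            print(block.type)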



    livestream here: https://youtu.be/EvtPBaaykdo

    my highlights:

    1. Coding ability: "Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." however this is Best of N, with no transparency on size of N and how they decide the best, saying "We then use an internal scoring model to select the best candidate from the remaining attempts." Claude Code is now generally available (we covered in http://latent.space/p/claude-code )

    2. Memory highlight: "Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining 'memory files' to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a 'Navigation Guide' while playing Pokémon." Memory Cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/t...

    3. Raw CoT available: "we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access."

    4. haha: "We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. " <- psyop?

    5. context caching now has a premium 1hr TTL option
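
    A rough sketch of the best-of-N selection described in highlight 1: sample several candidates, score each, keep the best. The generator and scorer below are hypothetical stand-ins; Anthropic's internal scoring model is not public.

        # Best-of-N sampling sketch: generate N candidate solutions, score each with
        # a separate scorer, return the highest-scoring one. `generate` and `score`
        # are hypothetical stand-ins for a model call and an internal scoring model.
        from typing import Callable

        def best_of_n(prompt: str,
                      generate: Callable[[str], str],
                      score: Callable[[str, str], float],
                      n: int = 8) -> str:
            candidates = [generate(prompt) for _ in range(n)]
            return max(candidates, key=lambda c: score(prompt, c))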



    Memory could be amazing for coding in large codebases. Web search could be great for finding docs on dependencies as well. Are these features integrated with Claude Code though?


    How long will the VS Code wrappers (Cursor, Windsurf) survive?

    Would love to try the Claude Code VS Code extension if the price is right and it's purchasable from China.



    > Users requiring raw chains of thought for advanced prompt engineering can contact sales

    So it seems like all 3 of the LLM providers are now hiding the CoT, which is a shame, because it helped to see when it was going to go down the wrong track, and allowed you to quickly refine the prompt to ensure it didn't.

    In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.



    Ooh, VS Code integration for Claude Code sounds nice. I do feel like Claude Code works better than the native Cursor agent mode.


    Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.


    > Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

    Extremely cringe behaviour. Raw CoTs are super useful for debugging errors in data extraction pipelines.

    After Deepseek R1 I had hope that other companies would be more open about these things.



    My mind has been blown using ChatGPT's o4-mini-high for coding and research (its knowledge of computer vision and tools like OpenCV is fantastic). Is it worth trying out all the shiny new AI coding agents... I need to get work done?


    When will structured output be available? Is it difficult for Anthropic because custom sampling breaks their safety tools?
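
    A common workaround today, rather than a native structured-output feature, is to force a tool call whose input schema is the JSON shape you want. A minimal sketch; the tool name, schema, and model id are illustrative assumptions.

        # Sketch: coerce JSON-shaped output by forcing a single tool call whose
        # input_schema describes the desired structure. Tool name, schema, and
        # model id are illustrative assumptions.
        import anthropic

        client = anthropic.Anthropic()

        summary_tool = {
            "name": "record_summary",
            "description": "Record a structured summary of the input text.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
                },
                "required": ["title", "sentiment"],
            },
        }

        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id
            max_tokens=512,
            tools=[summary_tool],
            tool_choice={"type": "tool", "name": "record_summary"},  # force this tool
            messages=[{"role": "user", "content": "Summarize: Claude 4 launched today."}],
        )

        # The structured payload arrives as the tool_use block's input dict.
        tool_use = next(b for b in response.content if b.type == "tool_use")
        print(tool_use.input)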


    Wonder why they changed the naming scheme from e.g. "Claude 3.7 Sonnet" to "Claude Opus 4".


    Can't wait to hear how it breaks all the benchmarks, with any differences being entirely imperceptible in practice.


    Sooo... it can play Pokemon. Feels like they had to throw that in after Google I/O yesterday. But the real question now is whether it can beat the game, including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.


    Right, but on the other hand... how is it even useful? Let's say it can beat the game, so what? These models can (kind of) summarise or write my emails, which is something I neither want nor need; they produce mountains of sloppy code, which I end up having to fix; and now they can play a game? Where is the killer app? The gaming approach was exactly the premise of the original AI efforts in the 1960s: that teaching computers to play chess and other 'brainy' games would somehow lead to the development of real AI. It ended, as we know, in an AI winter.


    That Google IO slide was somewhat misleading as the maintainer of Gemini Plays Pokemon had a much better agentic harness that was constantly iterated upon throughout the runtime (e.g. the maintainer had to give specific instructions on how to use Strength to get past Victory Road), unlike Claude Plays Pokemon.

    The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.



    Gemini can beat the game?


    Gemini has beaten it already, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.


    2 weeks ago


    > we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks

    Sounds like it’ll be better at writing meaningful tests



    Interesting how Sonnet has a higher SWE-bench Verified score than Opus. Maybe says something about scaling laws.


    OpenAI's Codex-1 isn't so cool anymore. If it was ever cool.

    And Claude Code uses Opus 4 now!



    > Try Claude Sonnet 4 today with Claude Opus 4 on paid plans.

    Wait, Sonnet 4? Opus 4? What?



    Anthropic names its Claude models based on size/complexity:

    - Small: Haiku

    - Medium: Sonnet

    - Large: Opus



    Looks like both opus and sonnet are already in Cursor.


    I have the Claude Windows app, how long until it can "see" what's on my screen and help me code/debug?


    Oh cool, it can navigate a maze. Too bad we already have a number of space-and-time efficient and deterministic entry-level path-search algorithms such as A*, which solved that problem some 40 years ago already.
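
    For context, a minimal grid A* with a Manhattan-distance heuristic, purely to illustrate the kind of deterministic search being referred to (the maze layout is made up):

        # Minimal A* on a 2D grid: 0 = open cell, 1 = wall. Returns the shortest
        # path from start to goal as a list of (row, col), or None if unreachable.
        import heapq

        def astar(grid, start, goal):
            rows, cols = len(grid), len(grid[0])

            def h(p):  # Manhattan-distance heuristic
                return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

            open_heap = [(h(start), 0, start)]
            came_from, g = {}, {start: 0}
            while open_heap:
                _, cost, cur = heapq.heappop(open_heap)
                if cur == goal:
                    path = [cur]
                    while cur in came_from:
                        cur = came_from[cur]
                        path.append(cur)
                    return path[::-1]
                if cost > g.get(cur, float("inf")):
                    continue  # stale heap entry
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nxt = (cur[0] + dr, cur[1] + dc)
                    if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                        ng = g[cur] + 1
                        if ng < g.get(nxt, float("inf")):
                            g[nxt], came_from[nxt] = ng, cur
                            heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
            return None

        maze = [[0, 0, 1],
                [1, 0, 1],
                [1, 0, 0]]
        print(astar(maze, (0, 0), (2, 2)))  # [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]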


    Allegedly Claude 4 Opus can run autonomously for 7 hours (basically automating an entire SWE workday).


    Which sort of workday? The sort where you rewrite your code 8 times and end the day with no marginal business value produced?


    Well Claude 3.7 definitely did the one where it was supposed to process a file and it finally settled on `fs.copyFile(src, dst)` which I think is pro-level interaction. I want those $0.95 back.

    But I love you Claude. It was me, not you.



    I can write an algorithm to run in a loop forever, but that doesn't make it equivalent to infinite engineers. It's the output that matters.


    That is quite the allegation.


    Easy, I can also make a nanoGPT run for 7 hours when inferring on a 68k, and make it produce as much value as I usually do.


    Anthropic might be scammers. Unclear. I canceled my subscription with them months ago, after they reduced capabilities for Pro users, and found out months later that they never actually canceled it. They have been ignoring all of my support requests... seems like a huge money grab to me, because they know they're being outcompeted and missed the ball on monetizing earlier.


    I can't think of anything more boring than marginal improvements on coding tasks, to be honest.

    I want GenAI to become better at tasks that I don't want to do, to reduce the unwanted noise in my life. That's when I'll pay for it, not when they find a new way to game the benchmarks a bit more.

    At work I own the development of a tool that uses GenAI, so of course a new, better model will be beneficial, especially because we do use Claude models, but it's still not exciting or interesting in the slightest.



    But if Gemini 2.5 Pro was considered the strongest coder lately, does SWE-bench really reflect reality?


    Anyone found information on API pricing?


    Yeah it's live on the pricing page:

    https://www.anthropic.com/pricing#api

    Opus 4 is $15/MTok in, $75/MTok out. Sonnet 4 is the same as before: $3/MTok in, $15/MTok out.
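
    As a quick worked example of what those rates mean per request (the token counts below are made up for illustration):

        # Cost of one hypothetical request at the quoted per-million-token rates.
        input_tokens, output_tokens = 2_000, 1_500       # assumed request size
        opus_in, opus_out = 15.00, 75.00                  # $ per million tokens
        sonnet_in, sonnet_out = 3.00, 15.00

        opus_cost = input_tokens / 1e6 * opus_in + output_tokens / 1e6 * opus_out
        sonnet_cost = input_tokens / 1e6 * sonnet_in + output_tokens / 1e6 * sonnet_out
        print(f"Opus 4:   ${opus_cost:.4f}")    # $0.1425
        print(f"Sonnet 4: ${sonnet_cost:.4f}")  # $0.0285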



    Thanks. I looked a couple minutes ago and couldn't see it. For anyone curious, pricing remains the same as previous Anthropic models.


    From the linked post:

    > Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.



    It’s up on their pricing page: https://www.anthropic.com/pricing


    Same pricing as before is sick!


    Christmas came early


    Nobody cares about LMArena anymore? I guess it's too easy to cheat there after the Llama 4 release news?


    heh, I just wrote a small hit piece about all the disappointments of the models over the last year and now the next day there is a new model. I'm going to assume it will still get you only to 80% ( ͡° ͜ʖ ͡°)


    [flagged]



    Good point. We should only focus on intractable problems and put everything else on the back burner. We certainly don’t have the ability to help people and advance science and business.


    You are on a website dedicated to technology news. What's the surprise?




    Game changer is table stakes now, tell us something new.


    > Really wish I could say more.

    Have you used it?

    I liked Claude 3.7 but without context this comes off as what the kids would call "glazing"






