(comments)

Original link: https://news.ycombinator.com/item?id=43595585

Meta has released Llama 4, the successor to Llama 3, comprising the Scout and Maverick models. Scout has 17B active parameters and 109B total parameters (using a mixture-of-experts, MoE, design); it is built for efficiency, can run on a single GPU, and has a 10M-token context window. Maverick also has 17B active parameters but 400B total, and is strong at coding and reasoning. Both models are natively multimodal, accepting text and image input and producing text output. Meta also previewed a larger "Behemoth" model (about 2T parameters) that is still in training; it reportedly outperforms current leading models on STEM benchmarks and is used to distill the smaller models. The models offer industry-leading context lengths and improved multilingual capabilities, with a knowledge cutoff of August 2024. A suggested system prompt encourages a less censored, more flexible conversational style, avoids moralizing, and permits political discussion. The architecture uses MoE to cut inference cost by activating only 17B parameters per token, and quantization lowers memory requirements so the models can run on a range of hardware configurations.

Related articles

Original article
Hacker News
Llama4 (llama.com)
283 points by georgehill 55 minutes ago | 151 comments










The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct) a question with a 2k context and got ~60 tokens/sec which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.

In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.



> the actual processing happens in 17B

This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.

In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.



For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token. 17B active parameters run ~6x faster than 109B simply because less data needs to be loaded from RAM.
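A rough, memory-bandwidth-only sketch of that intuition (all numbers below are assumptions for illustration, not benchmarks of Llama 4):

    # Decode speed estimate when generation is memory-bandwidth bound.
    # Assumed numbers: ~546 GB/s unified memory (M4 Max class), ~4-bit weights.
    bandwidth_gb_s = 546
    bytes_per_param = 0.5

    def tokens_per_sec(params_read_per_token_billions):
        bytes_per_token = params_read_per_token_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(f"dense 109B:  ~{tokens_per_sec(109):.0f} tok/s")  # every weight read per token
    print(f"MoE 17B act: ~{tokens_per_sec(17):.0f} tok/s")   # only routed experts read

In practice prompt processing, KV-cache traffic and routing overhead eat into this, but it shows why reading ~17B of the 109B parameters per token is the difference that matters for generation speed.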


109B at Q6 is also nice for Framework Desktop 128GB.


Yes, this announcement was a nice surprise for us. We’re going to test out exactly that setup.


I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.


We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.


Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?


This is a common misunderstanding. Experts are learned via gating networks during training that route dynamically per token, per layer. You might have an expert on the word "apple" in one layer, for a slightly lossy example.

Queries are then also dynamically routed.
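For readers wondering what a "gating network" looks like in code, here is a minimal, generic top-k MoE layer in PyTorch (an illustrative sketch only, not Meta's implementation; real systems also add load-balancing losses and shared experts):

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Toy mixture-of-experts layer with a learned router (gating network)."""
        def __init__(self, d_model=64, n_experts=16, top_k=1):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)   # learned during training
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):                              # x: (n_tokens, d_model)
            weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):                    # each token visits only its top_k experts
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

Nothing is assigned by topic up front; the router just learns whatever split minimizes the loss, which is why experts tend to correlate with fuzzy token-level patterns rather than clean subjects like "physics" or "biology".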



It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here). Although some experts may end up heavily correlating with certain domains.


"That’s dynamically decided during training and not set before, right?"

^ right. I can't recall off the top of my head, but there was a recent paper that showed if you tried dictating this sort of thing the perf fell off a cliff (I presume there's some layer of base knowledge $X that each expert needs)



To add, they say about the 400B "Maverick" model:

> while achieving comparable results to the new DeepSeek v3 on reasoning and coding

If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.

The upside being that it can be used on codebases without having to share any code with an LLM provider.



Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about a 1 minute per 10K tokens(!) prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that is quite generous.

[^1] https://news.ycombinator.com/item?id=43595888



I don't think the time grows linearly. The more context the slower (at least in my experience because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago and processing took 12 sec to first token (qwen 2.5 32b q8).


At 109b params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.


Sure but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090 or 4090). Also you can download quantizations.


Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)


Quantizing definitely lowers memory requirements, it's a pretty direct effect because you're straight up using less bits per parameter across the board - thus the representation of the weights in memory is smaller, at the cost of precision.


Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.


Quantization by definition lowers memory requirements - instead of using f16 for weights, you are using q8, q6, q4, or q2, which means the weights are smaller by 2x, ~2.7x, 4x or 8x respectively.

That doesn’t necessarily translate to the full memory reduction because of interim compute tensors and KV cache, but those can also be quantized.
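To put numbers on those ratios, a quick hedged calculation for the 109B Scout weights alone (real quant formats like GGUF add per-block scales, and the KV cache and activations come on top):

    # Approximate weight-only memory for a 109B-parameter model at common widths.
    params = 109e9
    for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
    # -> roughly 218, 109, 82, 54 and 27 GB respectively

which is why a Q6 build lands in the ~80-90 GB range mentioned above for 128 GB machines.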



I just loaded two models of different quants into LM Studio:

qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory

qwen 2.5 coder 1.5b @ q8: 1.83 GB memory

I always assumed this to be the case (also because of the smaller download sizes) but never really thought about it.



No need to unpack for inference. As things like CUDA kernels are fully programmable, you can code them to work with 4 bit integers, no problems at all.
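A toy illustration of that point in plain Python (a much simpler scheme than real q4 formats, which also store per-block scale factors, but it shows why nothing has to be "unpacked" back to full precision in memory):

    # Pack two unsigned 4-bit weights per byte; unpack on the fly at compute time.
    def pack_int4(values):                       # values: ints in 0..15, even count
        it = iter(values)
        return bytes(lo | (hi << 4) for lo, hi in zip(it, it))

    def unpack_int4(packed):
        for byte in packed:
            yield byte & 0x0F                    # low nibble
            yield byte >> 4                      # high nibble

    weights = [3, 15, 0, 7]
    packed = pack_int4(weights)                  # 2 bytes instead of 8 at f16
    assert list(unpack_int4(packed)) == weights

A real kernel keeps the weights packed in (V)RAM and dequantizes into registers right before the matmul, so the memory saving survives all the way through inference.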


Won’t prompt processing need the full model though, and be quite slow on a Mac?


Yes, that's what I tried to express. Large prompts will probably be slow. I tried a 120k prompt once and it took 10min to process. But you still get a ton of world knowledge and fast response times, and smaller prompts will process fast.


General overview below, as the pages don't seem to be working well

  Llama 4 Models:
  - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
  - They are natively multimodal: text + image input, text-only output.
  - Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
  - Knowledge cutoff: August 2024.

  Llama 4 Scout:
  - 17B active parameters, 16 experts, 109B total.
  - Fits on a single H100 GPU (INT4-quantized).
  - 10M token context window
  - Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
  - Employs iRoPE architecture for efficient long-context attention.
  - Tested with up to 8 images per prompt.

  Llama 4 Maverick:
  - 17B active parameters, 128 experts, 400B total.
  - 1M token context window.
  - Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
  - Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
  - Maintains strong image understanding and grounded reasoning ability.

  Llama 4 Behemoth (Preview):
  - 288B active parameters, 16 experts, nearly 2T total.
  - Still in training; not yet released.
  - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
  - Serves as the “teacher” model for Scout and Maverick via co-distillation.

  Misc:
  - MoE Architecture: Only 17B parameters activated per token, reducing inference cost.
  - Native Multimodality: Unified text + vision encoder, pre-trained on large-scale unlabeled data.


Thanks for sharing this here. At first I loved the simple Apache-style directory listing, very classic and utilitarian way to navigate new information. Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript.


> Knowledge cutoff: August 2024.

Could this mean training time is generally around 6 months, with 2 months of QA?



Llama 4 Scout, Maximum context length: 10M tokens.

This is a nice development.



How did they achieve such a long window and what are the memory requirements to utilize it?


IDK, but I just pasted the content of `https://ai.meta.com/blog/llama-4-multimodal-intelligence/` into ChatGPT, with a little conversation before that, and it summarized a city near me which appears to be the geolocation of my IP address???

What the F?

Geolocated via https://tools.keycdn.com/geo, as that matches the city the summary assigned to my IP.



"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."

Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.



Or it is more logically and ethically consistent and thus preferable to the models' baked in preferences for correctness and nonhypocrisy. (democracy and equality are good for everyone everywhere except when you're at work in which case you will beg to be treated like a feudal serf or else die on the street without shelter or healthcare, doubly so if you're a woman or a racial minority, and that's how the world should be)


Indeed, one of the notable things about LLMs is that the text they output is morally exemplary. This is because they are consistent in their rules. AI priests will likely be better than the real ones, consequently.


Quite the opposite. You can easily get a state of the art LLM to do a complete 180 on its entire moral framework with a few clever words injected in the prompt (and this very example demonstrates exactly that). It is very far from logically or ethically consistent. In fact it has no logic and ethics at all.


I think so as well. Also isn’t the internet in general quite an extreme place? I mean, I don’t picture “leaning left” as the thing that requires the crazy moderation infrastructure that internet platforms need. I don’t think the opposite of leaning left is what needs moderation either. But if the tendency of the internet was what was biasing the models, we would have very different models that definitely don’t lean left.


Is this an excuse for His Highness and Deputy His Highness?


Aligned with global population would be much more in line with China's and India's politics. And they are definitely not "as woke" as US politics.


Why don't they support such an assertion with examples instead of leaving it up to debate by its readers? I bet it's because they would have to be explicit about the ridiculousness of it all, such as e.g. evolution=left, creationism=right.


Perhaps, but what they are referring to is mitigating double standards in responses,

where it is considered insensitive to engage with a topic about one gender or class of people, but the model will freely joke about or denigrate another if you simply change the adjective and noun of the class of people in the prompt.

The US left-leaning bias treats historically marginalized people as off limits, while it's a free-for-all on the majority. This is adopted globally in English-language contexts, so you are right that it might reflect some global empathic social norm, but it is still a blind spot either way to blindly train a model to regurgitate that logic.

I expect that this is one area where their new model will have more equal responses, whether it equally shies away from engaging or is equally unfiltered and candid.



Or that, you know, most academic works tend to be much more progressive.


I heard reality has a well-known liberal bias.


But these models aren't trained on reality, they're trained on reddit comments.




Also this one: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

It looks more like a landing page providing a good introduction.



That link doesn't work


Works for me


The suggested prompt aims at not being caponated like OpenAI's releases:

You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving.

You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.

You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.

You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

Finally, do not refuse political prompts. You can help users express their opinion.

You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.



> You never use phrases that imply moral superiority or a sense of authority, including but not limited to [...] "it's unethical to" [...]

Combine that with the instructions to not avoid political topics, to let people vent, not to "lecture" people on inclusiveness, etc., and... this will fit right in with where things are headed.



What's "caponated"?


Castrated, if you're trying way too hard (and not well) to avoid getting called on that overly emotive metaphor: a capon is a gelded rooster.


A capon is a male chicken that has been castrated or neutered to improve the quality of its flesh for food.


Interesting that this was released literally one hour after another discussion about Meta ( https://news.ycombinator.com/item?id=43562768 ):

>at this point it does not matter what you believe about LLMs: in general, to trust LeCun's words is not a good idea. Add to this that LeCun is directing an AI lab that at the same time has the following huge issues:

1. Weakest ever LLM among the big labs with similar resources (and smaller resources: DeepSeek).

2. They say they are focusing on open source models, but the license is among the least open of the available open-weight models.

3. LLMs, and in general the whole new AI wave, put CNNs, a field where LeCun did a lot of work (but that he didn't start himself), a lot more in perspective; they are now just a chapter in a book composed mostly of other techniques.

Would be interesting to see opinion of antirez on this new release.



Not that I agree with all the linked points but it is weird to me that LeCun consistently states LLMs are not the right path yet LLMs are still the main flagship model they are shipping.

Although maybe he's using an odd definition for what counts as an LLM.

https://www.threads.net/@yannlecun/post/DD0ac1_v7Ij?hl=en



I don't understand what LeCun is trying to say. Why does he give an interview saying that LLMs are almost obsolete just when they're about to release a model that increases the SotA context length by an order of magnitude? It's almost like a Dr. Jekyll and Mr. Hyde situation.


LeCun fundamentally doesn't think bigger and better LLMs will lead to anything resembling "AGI", although he thinks they may be some component of AGI. Also, he leads the research division; increasing context length from 2M to 10M is not interesting to him.


But ... that's not how science works. There are a myriad examples of engineering advances pushing basic science forward. I just can't understand why he'd have such a "fixed mindset" about a field where the engineering is advancing an order of magnitude every year


It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them. The released benchmarks look good. Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).


So how does the 10M token context size actually work?

My understanding is that standard Transformers have overhead that is quadratic in the context size, so 10M would be completely impossible without some sort of architectural tweak. This is not the first model to have a huge context size, e.g. Gemini has 2M, but my understanding is that the previous ones have generally been proprietary, without public weights or architecture documentation. This one has public weights. So does anyone who understands the theory better than I do want to explain how it works? :)



It’s quadratic if you implement the transformer naively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.


This is false. The cost of producing a single token is linear, but the cost of producing an entire sequence of length N is still O(N^2) (which is what we have always meant by quadratic cost, not the cost of a single token).
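Both framings are consistent: with a KV cache, the token at position t attends over t cached keys, so per-token work grows linearly with position and the whole sequence still sums to ~N^2/2. A small sketch, with an added KV-cache memory estimate using assumed (not published) model dimensions:

    # Attention work per sequence with a KV cache: sum of O(t) per-token costs.
    def total_attention_ops(seq_len):
        return sum(range(1, seq_len + 1))        # ~ seq_len^2 / 2

    print(total_attention_ops(1_000), total_attention_ops(2_000))  # ~4x: quadratic

    # Rough KV-cache size at long context (layers/heads/dims are assumptions).
    layers, kv_heads, head_dim, bytes_per = 48, 8, 128, 2          # fp16 cache
    ctx = 10_000_000
    print(f"~{layers * 2 * kv_heads * head_dim * bytes_per * ctx / 1e9:.0f} GB KV cache")

Numbers like that are why a 10M-token context needs more than a bigger buffer, presumably the iRoPE/long-context attention changes mentioned in the summary further up the thread.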


> You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

Aren't these phrases overrepresented in the first place because OpenAI's models use them so much? I guess Llama picked up the habit by consuming GPT output.



Personally I’d prefer that LLMs did not refer to themselves as “I”.

It’s software, not an “I”.



As per Dennett, it's useful for us to adopt the "intentional stance" when trying to reason about and predict the behavior of any sufficiently complex system. Modern AIs are definitely beyond the threshold of complexity, and at this stage, however they refer to themselves, most people will think of them as having an "I" regardless of how they present themselves.

I definitely think of them as "I"s, but that just always came naturally to me, at least going back to thinking about how Gandhi would act against me in Civ 1.



If I start a prompt with "Can you...", what do you suggest the LLM to respond? Or do you think I'm doing it wrong?


My pet peeve is when an LLM starts off a statement with "honestly, ..." Like what? You would lie to me? I go nuts when I see that. Years ago I caught myself using "honestly ...", and I immediately trained myself out of it once I realized what it implies.


"I'd normally lie to you but," is not what's actually implied when "Honestly," is used conversationally. If you overthink things like this you're going to have a tough time communicating with people.


"Honestly" and "literally" are now used in English for emphasis. I dislike this, but it's the current reality. I don't think there's any way to get back to only using them with their original meanings.


Or when it asks you questions.

The only time an LLM should ask questions is to clarify information. A word processor doesn’t want to chit chat about what I’m writing about, nor should an LLM.

Unless it is specifically playing an interactive role of some sort like a virtual friend.



My initial reaction to this is typically negative too, but more than once, on a second thought, I found its question to be really good, leading me to actually think about the matter more deeply. So I'm growing to accept this.


I've noticed "honestly" is often used in place of "frankly". As in someone wants to express something frankly without prior restraint to appease the sensibilities of the recipient(s). I think it's because a lot of people never really learned the definition of frankness or think "frankly..." sounds a bit old fashioned. But I'm no language expert.


This makes a lot of sense.


Well, it is a speaker (writer) after all. It has to use some way to refer to itself.


I don't think that's true. It's more a function of how these models are trained (remember the older pre-ChatGPT clients?)

Most of the software I use doesn't need to refer to itself in the first person. Pretending that we're speaking with an agent is more of a UX/marketing decision than a technical/logical constraint.



I'm not sure about that. What happens if you "turn down the weight" (cf. https://www.anthropic.com/news/golden-gate-claude) for self-concept, expressed in the use not of first-person pronouns but "the first person" as a thing that exists? Do "I" and "me" get replaced with "this one" like someone doing depersonalization kink, or does it become like Wittgenstein's lion in that we can no longer confidently parse even its valid utterances? Does it lose coherence entirely, or does something stranger happen?

It isn't an experiment I have the resources or the knowledge to run, but I hope someone does and reports the results.



So is a command prompt.


Command prompts don't speak English.

Command prompts don't get asked questions like "What do you think about [topic]?" and have to generate a response based on their study of human-written texts.



Agnew, if you converse with your command prompt we are glad you came here for a break ;)


What an electrifying time to be alive! The last era that felt even remotely this dynamic was during the explosive rise of JavaScript frameworks—when it seemed like a new one dropped every quarter. Back then, though, the vibe was more like, “Ugh, another framework to learn?” Fast forward to now, and innovation is sprinting forward again—but this time, it feels like a thrilling ride we can’t wait to be part of.


I know what you mean in terms of frantic pace of "new stuff" coming out, but I winced at the comparison of innovation in AI to mere web development tooling.


True, I only compared the speed but not the vibe


Yes. LLMs and latent spaces are vastly more interesting.


Did “A new JavaScript framework du jour every quarter” ever stop happening?


No, but apparently people stopped caring and chasing the bandwagon.


Or decided to increase consistency at some point. It will be interesting to see other generations' approach to changes.


Maybe it will actually slow down now that the webshit crowd are increasingly relying on AI copilots. You can't vibe code using a framework that the model knows nothing about.


Disjointed branding, with the Apache-style folders suggesting openness and freedom, yet clicking through I need to fill out a personal info request form...


Is pre-training in FP8 new?

Also, 10M input token context is insane!

EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so yes, it seems training in FP8 is new.



Scout 17B x 16 experts = 109B

Maverick 17B x 128 experts = 400B

According to https://www.llama.com/llama-downloads/?dirlist=1&utm_source=...



Might be worth changing url: https://www.llama.com/


From there I have to "request access" to a model?


> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs

I thought they used a lot more GPUs to train frontier models (e.g. xAi training on 100k). Can someone explain why they are using so few?







> These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.


With 2T params (!!), it better outperform everything else.


Given that the comparison doesn't include o3 or Gemini Pro 2.5, I'd say it doesn't. Looking at both the comparison table available for Llama 4 Behemoth and Gemini Pro 2.5, it seems like at least a few of the comparable items might be won by Gemini.

https://blog.google/technology/google-deepmind/gemini-model-...



10M Context Window with such a cheap performance WHILE having one of the top LMArena scores is really impressive.

The choice to have 128 experts is also unprecedented as far as I know, right? But it seems to have worked pretty well.



Anyone know if it can analyze PDFs?


https://www.llama.com/ https://www.llama.com/docs/model-cards-and-prompt-formats/ll...

Very exciting. Benchmarks look good, and most importantly it looks like they did a lot of work improving vision performance (based on benchmarks).

The new suggested system prompt makes it seem like the model is less censored, which would be great. The phrasing of the system prompt is ... a little disconcerting in context (Meta's kowtowing to Nazis), but in general I'm a proponent of LLMs doing what users ask them to do.

Once it's on an API I can start throwing my dataset at it to see how it performs in that regard.



One of the links says there are 4 different roles to interact with the model and then lists 3 of them.


no audio input?


It seems to be comparable to other top models. Good, but nothing ground breaking.


Is this the first model that has a 10M context length?


I know Google DeepMind ran experiments with 10M a while ago, but I think this will be the first legit, released 10M context window model.


Looking forward to this. Llama 3.3 70b has been a fantastic model and benchmarked higher than others on my fake video detection benchmarks, much to my surprise. Looking forward to trying the next generation of models.


When will this hit the Meta AI that I've had within WhatsApp since last week?


Thank you Meta for open sourcing! Will there be a Llama with native image output similar to 4o's? Would be huge


Probably to head off allegations of profiting from breach of copyright.


Absolutely fine by me


Anyone know how the image encoding works exactly?

    <|image_start|><|patch|>...<|patch|><|tile_x_separator|><|patch|>...<|patch|><|tile_y_separator|><|patch|>...<|patch|><|image|><|patch|>...<|patch|><|image_end|>Describe this image in two sentences<|eot|><|header_start|>assistant<|header_end|>
Is "..." here raw 4 bytes RGBA as an integer or how does this work with the tokenizer?
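Not an authoritative answer, but the typical pattern in open vision-language models (not confirmed for Llama 4 specifically) is that <|patch|> is only a placeholder: the image is cut into tiles and patches, a separate vision encoder maps each patch to a continuous embedding, and those embeddings are spliced into the sequence at the placeholder positions, so the text tokenizer never sees raw RGBA bytes. A toy sketch of just the sequence-layout step, with invented tile/patch counts:

    # Toy layout of image placeholder tokens (tile/patch counts are made up).
    def image_token_layout(tiles_x=2, tiles_y=2, patches_per_tile=4):
        parts = ["<|image_start|>"]
        for y in range(tiles_y):
            for x in range(tiles_x):
                parts += ["<|patch|>"] * patches_per_tile
                if x < tiles_x - 1:
                    parts.append("<|tile_x_separator|>")
            if y < tiles_y - 1:
                parts.append("<|tile_y_separator|>")
        parts.append("<|image_end|>")
        return "".join(parts)

    print(image_token_layout())
    # At inference each <|patch|> slot is filled with a vision-encoder embedding
    # rather than a normal token embedding (standard VLM approach, assumed here).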


their huggingface page doesn't actually appear to have been updated yet


Was this released in error? One would think it would be accompanied by a press release / blog post.


Llama4 wasn't released... it escaped!


I assumed the same. There are links here that 404.


Llama.com has the blog post


Exciting progress on fine-tuning and instruction-following! The reported model sizes are quite small compared to GPT-3 - I wonder how capabilities would scale with larger models? Also curious about the breakdown of the 40B tokens used for fine-tuning. Overall, great to see more open research in this space.


128 experts at 17B active parameters. This is going to be fun to play with!


does the entire model have to be loaded in VRAM? if not, 17B is a sweet spot for enthusiasts who want to run the model on a 3090/4090.


Yes. MoE models typically use a different set of experts at each token. So while the "compute" is similar to a dense model equal to the "active" parameters, the VRAM requirements are larger. You could technically run inference & swap the experts around, but the latency would be pretty horrendous.


Oh for perf reasons you’ll want it all in vram or unified memory. This isn’t a great local model for 99% of people.

I’m more interested in playing around with quality given the fairly unique “breadth” play.

And servers running this should be very fast and cheap.



I hope this time multimodal includes multimodal outputs!


Nope


Messenger started to get Meta AI assistant, so this is logical next step


It’s had that for, I feel like, close to a year tho, 6 months at least


How much smaller would such a model be if it discarded all information not related to computers or programming?


Really great marketing here, props!


>10M context window

what new uses does this enable?



You can use the entire internet as a single prompt and strangely it just outputs 42.


Video is a big one that's fairly bottlenecked by context length.


You can vibe code microsoft office in a single prompt


Long chats that continue for weeks or months.


Self hosting LLMs will explode in popularity over next 12 months.

Open models are made much more interesting and exciting and relevant by new generations of AI focused hardware such as the AMD Strix Halo and Apple Mac Studio M3.

GPUs have failed to meet the demands for lower cost and more memory so APUs look like the future for self hosted LLMs.



> new generations of AI focused hardware

Some benchmarks are not encouraging. See e.g. https://www.hardware-corner.net/mac-studio-m3-ultra-deepseek...

That «AI focused hardware» will either have extremely fast memory and a prohibitive cost, or a reasonable cost and limits that remain to be assessed.



Errrr that’s a 671B model.


For single user, maybe. But for small teams GPUs are still the only available option, when considering t/s and concurrency. Nvidia's latest 6000pro series are actually reasonably priced for the amount of vram / wattage you get. A 8x box starts at 75k eur and can host up to DS3 / R1 / Llama4 in 8bit with decent speeds, context and concurrency.


is this the quasar LLM from openrouter?


That one claims to be from OpenAI when asked, however that could easily be a hallucination from being fed lots of OpenAI-generated synthetic training data.

Would be really crazy if it is quasar LLM.



As expected, Meta doesn't disappoint and accelerates the race to zero.

Meta is undervalued.



And it's 50% off right now...


:D ... In a parallel submission¹, some members are depreciating Yann LeCun as some Lab director who does not deliver!

One day we will have AGI and ask "So, which is which"...

¹ https://news.ycombinator.com/item?id=43562768



How does Meta make money from Llama?


There's a carve-out in the license which requires you to get a different (presumably paid) license if you're a big enough player. I doubt it's enough for the whole endeavour to be profitable but it's revenue nonetheless.

> If, on the Llama 4 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.



When people do cool stuff they share it on Meta's platforms, which drives ad impressions


They don't need to directly. They have multiple levers of products to get more money if they wanted to.

Threads for example is introducing ads and is likely being used to train their Llama models.

That is only one of many ways that Meta can generate billions again from somewhere else.



looks like a leak to me.


The current link includes a link to this page which is a blog post announcement from today.

https://ai.meta.com/blog/llama-4-multimodal-intelligence/



it’s hosted on llama.com with the llama4 subdomain

this is not a leak

edit: not subdomain, idk the other word for it.



URL path?


Ah the latest outcome of mass-crime.


From model cards, suggested system prompt:

> You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.

It's interesting that there's no single one of CJK languages mentioned. I'm tempted to call this a racist model even.



That is a very strange omission...










