(comments)

Original link: https://news.ycombinator.com/item?id=43595585

Meta has released Llama 4, the successor to Llama 3, comprising the Scout and Maverick models. Scout has 17B active parameters and 109B total parameters (using a mixture-of-experts, MoE, design); it is built for efficiency, can run on a single GPU, and has a 10M-token context window. Maverick also has 17B active parameters but 400B total, and is strong at coding and reasoning. Both models are natively multimodal, accepting text and image input and producing text output. Meta also previewed a larger "Behemoth" model (about 2T parameters) that is still in training; it reportedly outperforms current leading models on STEM benchmarks and is used to distill the smaller models. The models offer industry-leading context lengths and improved multilingual capabilities, with a knowledge cutoff of August 2024. A suggested system prompt encourages a less censored, more flexible conversational style, avoids moralizing, and permits political discussion. The architecture uses MoE to cut inference cost by activating only 17B parameters per token, and quantization lowers memory requirements so the models can run on a range of hardware configurations.

Related articles

Original article
Hacker News
Llama4 (llama.com)
283 points by georgehill 55 minutes ago | 151 comments










The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct) a question with a 2k context and got ~60 tokens/sec which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.

In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.



> the actual processing happens in 17B

This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.

In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.



For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token. 17B active parameters run ~6x faster than 109B simply because less data needs to be loaded from RAM.
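A rough, memory-bandwidth-only sketch of that intuition (all numbers below are assumptions for illustration, not benchmarks of Llama 4):

    # Decode speed estimate when generation is memory-bandwidth bound.
    # Assumed numbers: ~546 GB/s unified memory (M4 Max class), ~4-bit weights.
    bandwidth_gb_s = 546
    bytes_per_param = 0.5

    def tokens_per_sec(params_read_per_token_billions):
        bytes_per_token = params_read_per_token_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(f"dense 109B:  ~{tokens_per_sec(109):.0f} tok/s")  # every weight read per token
    print(f"MoE 17B act: ~{tokens_per_sec(17):.0f} tok/s")   # only routed experts read

In practice prompt processing, KV-cache traffic and routing overhead eat into this, but it shows why reading ~17B of the 109B parameters per token is the difference that matters for generation speed.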


109B at Q6 is also nice for Framework Desktop 128GB.


Yes, this announcement was a nice surprise for us. We’re going to test out exactly that setup.


I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.


We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.


Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?


This is a common misunderstanding. Experts are learned via gating networks during training that route dynamically per token, per layer. You might have an expert on the word "apple" in one layer, for a slightly lossy example.

Queries are then also dynamically routed.
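For readers wondering what a "gating network" looks like in code, here is a minimal, generic top-k MoE layer in PyTorch (an illustrative sketch only, not Meta's implementation; real systems also add load-balancing losses and shared experts):

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Toy mixture-of-experts layer with a learned router (gating network)."""
        def __init__(self, d_model=64, n_experts=16, top_k=1):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)   # learned during training
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):                              # x: (n_tokens, d_model)
            weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):                    # each token visits only its top_k experts
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

Nothing is assigned by topic up front; the router just learns whatever split minimizes the loss, which is why experts tend to correlate with fuzzy token-level patterns rather than clean subjects like "physics" or "biology".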



It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here). Although some experts may end up heavily correlating with certain domains.


"That’s dynamically decided during training and not set before, right?"

^ right. I can't recall off the top of my head, but there was a recent paper that showed if you tried dictating this sort of thing the perf fell off a cliff (I presume there's some layer of base knowledge $X that each expert needs)



To add, they say about the 400B "Maverick" model:

> while achieving comparable results to the new DeepSeek v3 on reasoning and coding

If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.

The upside being that it can be used on codebases without having to share any code with an LLM provider.



Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about a 1 minute per 10K tokens(!) prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that is quite generous.

[^1] https://news.ycombinator.com/item?id=43595888



I don't think the time grows linearly. The more context the slower (at least in my experience because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago and processing took 12 sec to first token (qwen 2.5 32b q8).


At 109b params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.


Sure but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090 or 4090). Also you can download quantizations.


Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)


Quantizing definitely lowers memory requirements, it's a pretty direct effect because you're straight up using less bits per parameter across the board - thus the representation of the weights in memory is smaller, at the cost of precision.


Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.


Quantization by definition lowers memory requirements - instead of using f16 for weights, you are using q8, q6, q4, or q2, which means the weights are smaller by 2x, ~2.7x, 4x or 8x respectively.

That doesn’t necessarily translate to the full memory reduction because of interim compute tensors and KV cache, but those can also be quantized.
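To put numbers on those ratios, a quick hedged calculation for the 109B Scout weights alone (real quant formats like GGUF add per-block scales, and the KV cache and activations come on top):

    # Approximate weight-only memory for a 109B-parameter model at common widths.
    params = 109e9
    for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
    # -> roughly 218, 109, 82, 54 and 27 GB respectively

which is why a Q6 build lands in the ~80-90 GB range mentioned above for 128 GB machines.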



I just loaded two models of different quants into LM Studio:

qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory

qwen 2.5 coder 1.5b @ q8: 1.83 GB memory

I always assumed this to be the case (also because of the smaller download sizes) but never really thought about it.



No need to unpack for inference. As things like CUDA kernels are fully programmable, you can code them to work with 4 bit integers, no problems at all.
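A toy illustration of that point in plain Python (a much simpler scheme than real q4 formats, which also store per-block scale factors, but it shows why nothing has to be "unpacked" back to full precision in memory):

    # Pack two unsigned 4-bit weights per byte; unpack on the fly at compute time.
    def pack_int4(values):                       # values: ints in 0..15, even count
        it = iter(values)
        return bytes(lo | (hi << 4) for lo, hi in zip(it, it))

    def unpack_int4(packed):
        for byte in packed:
            yield byte & 0x0F                    # low nibble
            yield byte >> 4                      # high nibble

    weights = [3, 15, 0, 7]
    packed = pack_int4(weights)                  # 2 bytes instead of 8 at f16
    assert list(unpack_int4(packed)) == weights

A real kernel keeps the weights packed in (V)RAM and dequantizes into registers right before the matmul, so the memory saving survives all the way through inference.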


Won’t prompt processing need the full model though, and be quite slow on a Mac?


Yes, that's what I tried to express. Large prompts will probably be slow. I tried a 120k prompt once and it took 10min to process. But you still get a ton of world knowledge and fast response times, and smaller prompts will process fast.


General overview below, as the pages don't seem to be working well

  Llama 4 Models:
  - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
  - They are natively multimodal: text + image input, text-only output.
  - Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
  - Knowledge cutoff: August 2024.

  Llama 4 Scout:
  - 17B active parameters, 16 experts, 109B total.
  - Fits on a single H100 GPU (INT4-quantized).
  - 10M token context window
  - Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
  - Employs iRoPE architecture for efficient long-context attention.
  - Tested with up to 8 images per prompt.

  Llama 4 Maverick:
  - 17B active parameters, 128 experts, 400B total.
  - 1M token context window.
  - Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
  - Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
  - Maintains strong image understanding and grounded reasoning ability.

  Llama 4 Behemoth (Preview):
  - 288B active parameters, 16 experts, nearly 2T total.
  - Still in training; not yet released.
  - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
  - Serves as the “teacher” model for Scout and Maverick via co-distillation.

  Misc:
  - MoE Architecture: Only 17B parameters activated per token, reducing inference cost.
  - Native Multimodality: Unified text + vision encoder, pre-trained on large-scale unlabeled data.


Thanks for sharing this here. At first I loved the simple Apache-style directory listing, very classic and utilitarian way to navigate new information. Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript.


> Knowledge cutoff: August 2024.

Could this mean training time is generally around 6 months, with 2 months of QA?



Llama 4 Scout, Maximum context length: 10M tokens.

This is a nice development.



How did they achieve such a long window and what are the memory requirements to utilize it?


IDK, but I just pasted the content of `https://ai.meta.com/blog/llama-4-multimodal-intelligence/` into ChatGPT, with a little conversation before that, and it summarized a city near me which appears to be the geolocation of my IP address???

What the F?

Geolocated via https://tools.keycdn.com/geo, as that matches the city the summary assigned to my IP.



"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."

Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.



Or it is more logically and ethically consistent and thus preferable to the models' baked in preferences for correctness and nonhypocrisy. (democracy and equality are good for everyone everywhere except when you're at work in which case you will beg to be treated like a feudal serf or else die on the street without shelter or healthcare, doubly so if you're a woman or a racial minority, and that's how the world should be)


Indeed, one of the notable things about LLMs is that the text they output is morally exemplary. This is because they are consistent in their rules. AI priests will likely be better than the real ones, consequently.


Quite the opposite. You can easily get a state of the art LLM to do a complete 180 on its entire moral framework with a few clever words injected in the prompt (and this very example demonstrates exactly that). It is very far from logically or ethically consistent. In fact it has no logic and ethics at all.


I think so as well. Also isn’t the internet in general quite an extreme place? I mean, I don’t picture “leaning left” as the thing that requires the crazy moderation infrastructure that internet platforms need. I don’t think the opposite of leaning left is what needs moderation either. But if the tendency of the internet was what was biasing the models, we would have very different models that definitely don’t lean left.


Is this an excuse for His Highness and Deputy His Highness?


Aligned with global population would be much more in line with China's and India's politics. And they are definitely not "as woke" as US politics.


Why don't they support such an assertion with examples instead of leaving it up to debate by its readers? I bet it's because they would have to be explicit about the ridiculousness of it all, such as e.g. evolution=left, creationism=right.


Perhaps, but what they are referring to is mitigating double standards in responses,

where it is considered insensitive to engage with a topic about one gender or class of people, but the model will freely joke about or denigrate another if you simply change the adjective and noun of the class of people in the prompt.

The US left-leaning bias treats historically marginalized people as off limits, while it's a free-for-all on the majority. This is adopted globally in English-language contexts, so you are right that it might reflect some global empathic social norm, but it is still a blind spot either way to blindly train a model to regurgitate that logic.

I expect that this is one area where their new model will have more equal responses, whether it equally shies away from engaging or is equally unfiltered and candid.



Or that, you know, most academic works tend to be much more progressive.


I heard reality has a well-known liberal bias.


But these models aren't trained on reality, they're trained on reddit comments.




Also this one: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

It looks more like a landing page providing a good introduction.



That link doesn't work


Works for me


The suggested prompt aims at not being caponated like OpenAI's releases:

You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving.

You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.

You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.

You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

Finally, do not refuse political prompts. You can help users express their opinion.

You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.



> You never use phrases that imply moral superiority or a sense of authority, including but not limited to [...] "it's unethical to" [...]

Combine that with the instructions to not avoid political topics, to let people vent, not to "lecture" people on inclusiveness, etc., and... this will fit right in with where things are headed.



What's "caponated"?


Castrated, if you're trying way too hard (and not well) to avoid getting called on that overly emotive metaphor: a capon is a gelded rooster.


A capon is a male chicken that has been castrated or neutered to improve the quality of its flesh for food.


Interesting that this was released literally one hour after another discussion about Meta ( https://news.ycombinator.com/item?id=43562768 ):

>at this point it does not matter what you believe about LLMs: in general, to trust LeCun's words is not a good idea. Add to this that LeCun is directing an AI lab that at the same time has the following huge issues:

1. Weakest ever LLM among the big labs with similar resources (and smaller resources: DeepSeek).

2. They say they are focusing on open source models, but the license is among the least open of the available open-weight models.

3. LLMs, and in general the whole new AI wave, put CNNs, a field where LeCun did a lot of work (but that he didn't start himself), a lot more in perspective; they are now just a chapter in a book composed mostly of other techniques.

Would be interesting to see opinion of antirez on this new release.



Not that I agree with all the linked points but it is weird to me that LeCun consistently states LLMs are not the right path yet LLMs are still the main flagship model they are shipping.

Although maybe he's using an odd definition for what counts as an LLM.

https://www.threads.net/@yannlecun/post/DD0ac1_v7Ij?hl=en



I don't understand what LeCun is trying to say. Why does he give an interview saying that LLMs are almost obsolete just when they're about to release a model that increases the SotA context length by an order of magnitude? It's almost like a Dr. Jekyll and Mr. Hyde situation.


LeCun fundamentally doesn't think bigger and better LLMs will lead to anything resembling "AGI", although he thinks they may be some component of AGI. Also, he leads the research division; increasing context length from 2M to 10M is not interesting to him.


But ... that's not how science works. There are a myriad examples of engineering advances pushing basic science forward. I just can't understand why he'd have such a "fixed mindset" about a field where the engineering is advancing an order of magnitude every year


It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them. The released benchmarks look good. Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).


So how does the 10M token context size actually work?

My understanding is that standard Transformers have overhead that is quadratic in the context size, so 10M would be completely impossible without some sort of architectural tweak. This is not the first model to have a huge context size, e.g. Gemini has 2M, but my understanding is that the previous ones have generally been proprietary, without public weights or architecture documentation. This one has public weights. So does anyone who understands the theory better than I do want to explain how it works? :)



It’s quadratic if you implement the transformer naively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.


This is false. The cost of producing a single token is linear, but the cost of producing an entire sequence of length N is still O(N^2) (which is what we have always meant by quadratic cost, not the cost of a single token).
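Both framings are consistent: with a KV cache, the token at position t attends over t cached keys, so per-token work grows linearly with position and the whole sequence still sums to ~N^2/2. A small sketch, with an added KV-cache memory estimate using assumed (not published) model dimensions:

    # Attention work per sequence with a KV cache: sum of O(t) per-token costs.
    def total_attention_ops(seq_len):
        return sum(range(1, seq_len + 1))        # ~ seq_len^2 / 2

    print(total_attention_ops(1_000), total_attention_ops(2_000))  # ~4x: quadratic

    # Rough KV-cache size at long context (layers/heads/dims are assumptions).
    layers, kv_heads, head_dim, bytes_per = 48, 8, 128, 2          # fp16 cache
    ctx = 10_000_000
    print(f"~{layers * 2 * kv_heads * head_dim * bytes_per * ctx / 1e9:.0f} GB KV cache")

Numbers like that are why a 10M-token context needs more than a bigger buffer, presumably the iRoPE/long-context attention changes mentioned in the summary further up the thread.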


> You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

Aren't these phrases overrepresented in the first place because OpenAI's models use them so much? I guess Llama picked up the habit by consuming GPT output.



Personally I’d prefer that LLMs did not refer to themselves as “I”.

It’s software, not an “I”.



As per Dennett, it's useful for us to adopt the "intentional stance" when trying to reason about and predict the behavior of any sufficiently complex system. Modern AIs are definitely beyond the threshold of complexity, and at this stage, however they refer to themselves, most people will think of them as having an "I" regardless of how they present themselves.

I definitely think of them as "I"s, but that just always came naturally to me, at least going back to thinking about how Gandhi would act against me in Civ 1.



If I start a prompt with "Can you...", what do you suggest the LLM to respond? Or do you think I'm doing it wrong?


My pet peeve is when an LLM starts off a statement with "honestly, ..." Like what? You would lie to me? I go nuts when I see that. Years ago I caught myself using "honestly ...", and I immediately trained myself out of it once I realized what it implies.


"I'd normally lie to you but," is not what's actually implied when "Honestly," is used conversationally. If you overthink things like this you're going to have a tough time communicating with people.


"Honestly" and "literally" are now used in English for emphasis. I dislike this, but it's the current reality. I don't think there's any way to get back to only using them with their original meanings.


Or when it asks you questions.

The only time an LLM should ask questions is to clarify information. A word processor doesn’t want to chit chat about what I’m writing about, nor should an LLM.

Unless it is specifically playing an interactive role of some sort like a virtual friend.



My initial reaction to this is typically negative too, but more than once, on a second thought, I found its question to be really good, leading me to actually think about the matter more deeply. So I'm growing to accept this.


I've noticed "honestly" is often used in place of "frankly". As in someone wants to express something frankly without prior restraint to appease the sensibilities of the recipient(s). I think it's because a lot of people never really learned the definition of frankness or think "frankly..." sounds a bit old fashioned. But I'm no language expert.


This makes a lot of sense.


Well, it is a speaker (writer) after all. It has to use some way to refer to itself.


I don't think that's true. It's more a function of how these models are trained (remember the older pre-ChatGPT clients?)

Most of the software I use doesn't need to refer to itself in the first person. Pretending that we're speaking with an agent is more of a UX/marketing decision than a technical/logical constraint.



I'm not sure about that. What happens if you "turn down the weight" (cf. https://www.anthropic.com/news/golden-gate-claude) for self-concept, expressed in the use not of first-person pronouns but "the first person" as a thing that exists? Do "I" and "me" get replaced with "this one" like someone doing depersonalization kink, or does it become like Wittgenstein's lion in that we can no longer confidently parse even its valid utterances? Does it lose coherence entirely, or does something stranger happen?

It isn't an experiment I have the resources or the knowledge to run, but I hope someone does and reports the results.



So is a command prompt.


Command prompts don't speak English.

Command prompts don't get asked questions like "What do you think about [topic]?" and have to generate a response based on their study of human-written texts.



Agnew, if you converse with your command prompt we are glad you came here for a break ;)


What an electrifying time to be alive! The last era that felt even remotely this dynamic was during the explosive rise of JavaScript frameworks—when it seemed like a new one dropped every quarter. Back then, though, the vibe was more like, “Ugh, another framework to learn?” Fast forward to now, and innovation is sprinting forward again—but this time, it feels like a thrilling ride we can’t wait to be part of.


I know what you mean in terms of frantic pace of "new stuff" coming out, but I winced at the comparison of innovation in AI to mere web development tooling.


True, I only compared the speed but not the vibe


Yes. LLMs and latent spaces are vastly more interesting.


Did “A new JavaScript framework du jour every quarter” ever stop happening?


No, but apparently people stopped caring and chasing the bandwagon.


Or decided to increase consistency at some point. It will be interesting to see other generations' approach to changes.


Maybe it will actually slow down now that the webshit crowd are increasingly relying on AI copilots. You can't vibe code using a framework that the model knows nothing about.


Disjointed branding, with the Apache-style folders suggesting openness and freedom, yet clicking through I need to fill out a personal info request form...


Is pre-training in FP8 new?

Also, 10M input token context is insane!

EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so yes, it seems training in FP8 is new.



Scout 17B x 16 experts = 109B

Maverick 17B x 128 experts = 400B

According to https://www.llama.com/llama-downloads/?dirlist=1&utm_source=...



Might be worth changing url: https://www.llama.com/


From there I have to "request access" to a model?


> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs

I thought they used a lot more GPUs to train frontier models (e.g. xAi training on 100k). Can someone explain why they are using so few?







> These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.


With 2T params (!!), it better outperform everything else.


Given that the comparison doesn't include o3 or Gemini Pro 2.5, I'd say it doesn't. Looking at both the comparison table available for Llama 4 Behemoth and Gemini Pro 2.5, it seems like at least a few of the comparable items might be won by Gemini.

https://blog.google/technology/google-deepmind/gemini-model-...



10M Context Window with such a cheap performance WHILE having one of the top LMArena scores is really impressive.

The choice to have 128 experts is also unprecedented as far as I know, right? But it seems to have worked pretty well.



Anyone know if it can analyze PDFs?


https://www.llama.com/ https://www.llama.com/docs/model-cards-and-prompt-formats/ll...

Very exciting. Benchmarks look good, and most importantly it looks like they did a lot of work improving vision performance (based on benchmarks).

The new suggested system prompt makes it seem like the model is less censored, which would be great. The phrasing of the system prompt is ... a little disconcerting in context (Meta's kowtowing to Nazis), but in general I'm a proponent of LLMs doing what users ask them to do.

Once it's on an API I can start throwing my dataset at it to see how it performs in that regard.



One of the links says there are 4 different roles to interact with the model and then lists 3 of them.


no audio input?


It seems to be comparable to other top models. Good, but nothing ground breaking.


Is this the first model that has a 10M context length?


I know Google DeepMind ran experiments with 10M a while ago, but I think this will be the first legit, released 10M context window model.


Looking forward to this. Llama 3.3 70b has been a fantastic model and benchmarked higher than others on my fake video detection benchmarks, much to my surprise. Looking forward to trying the next generation of models.


When will this hit the Meta AI that I've had within WhatsApp since last week?


Thank you Meta for open sourcing! Will there be a Llama with native image output similar to 4o's? Would be huge


Probably to head off allegations of profiting from breach of copyright.


Absolutely fine by me


Anyone know how the image encoding works exactly?

    <|image_start|><|patch|>...<|patch|><|tile_x_separator|><|patch|>...<|patch|><|tile_y_separator|><|patch|>...<|patch|><|image|><|patch|>...<|patch|><|image_end|>Describe this image in two sentences<|eot|><|header_start|>assistant<|header_end|>
Is "..." here raw 4 bytes RGBA as an integer or how does this work with the tokenizer?
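Not an authoritative answer, but the typical pattern in open vision-language models (not confirmed for Llama 4 specifically) is that <|patch|> is only a placeholder: the image is cut into tiles and patches, a separate vision encoder maps each patch to a continuous embedding, and those embeddings are spliced into the sequence at the placeholder positions, so the text tokenizer never sees raw RGBA bytes. A toy sketch of just the sequence-layout step, with invented tile/patch counts:

    # Toy layout of image placeholder tokens (tile/patch counts are made up).
    def image_token_layout(tiles_x=2, tiles_y=2, patches_per_tile=4):
        parts = ["<|image_start|>"]
        for y in range(tiles_y):
            for x in range(tiles_x):
                parts += ["<|patch|>"] * patches_per_tile
                if x < tiles_x - 1:
                    parts.append("<|tile_x_separator|>")
            if y < tiles_y - 1:
                parts.append("<|tile_y_separator|>")
        parts.append("<|image_end|>")
        return "".join(parts)

    print(image_token_layout())
    # At inference each <|patch|> slot is filled with a vision-encoder embedding
    # rather than a normal token embedding (standard VLM approach, assumed here).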


their huggingface page doesn't actually appear to have been updated yet


Was this released in error? One would think it would be accompanied by a press release / blog post.


Llama4 wasn't released... it escaped!


I assumed the same. There are links here that 404.


Llama.com has the blog post


Exciting progress on fine-tuning and instruction-following! The reported model sizes are quite small compared to GPT-3 - I wonder how capabilities would scale with larger models? Also curious about the breakdown of the 40B tokens used for fine-tuning. Overall, great to see more open research in this space.


128 experts at 17B active parameters. This is going to be fun to play with!


does the entire model have to be loaded in VRAM? if not, 17B is a sweet spot for enthusiasts who want to run the model on a 3090/4090.


Yes. MoE models typically use a different set of experts at each token. So while the "compute" is similar to a dense model equal to the "active" parameters, the VRAM requirements are larger. You could technically run inference & swap the experts around, but the latency would be pretty horrendous.


Oh for perf reasons you’ll want it all in vram or unified memory. This isn’t a great local model for 99% of people.

I’m more interested in playing around with quality given the fairly unique “breadth” play.

And servers running this should be very fast and cheap.



I hope this time multimodal includes multimodal outputs!


Nope


Messenger started to get Meta AI assistant, so this is logical next step


It’s had that for, I feel like, close to a year tho, 6 months at least


How much smaller would such a model be if it discarded all information not related to computers or programming?


Really great marketing here, props!


>10M context window

what new uses does this enable?



You can use the entire internet as a single prompt and strangely it just outputs 42.


Video is a big one that's fairly bottlenecked by context length.


You can vibe code microsoft office in a single prompt


Long chats that continue for weeks or months.


Self hosting LLMs will explode in popularity over next 12 months.

Open models are made much more interesting and exciting and relevant by new generations of AI focused hardware such as the AMD Strix Halo and Apple Mac Studio M3.

GPUs have failed to meet the demands for lower cost and more memory so APUs look like the future for self hosted LLMs.



> new generations of AI focused hardware

Some benchmarks are not encouraging. See e.g. https://www.hardware-corner.net/mac-studio-m3-ultra-deepseek...

That «AI focused hardware» will either have extremely fast memory and a prohibitive cost, or a reasonable cost and limits that remain to be assessed.



Errrr that’s a 671B model.


For single user, maybe. But for small teams GPUs are still the only available option, when considering t/s and concurrency. Nvidia's latest 6000pro series are actually reasonably priced for the amount of vram / wattage you get. A 8x box starts at 75k eur and can host up to DS3 / R1 / Llama4 in 8bit with decent speeds, context and concurrency.


is this the quasar LLM from openrouter?


That one claims to be from OpenAI when asked, however that could easily be a hallucination from being fed lots of OpenAI-generated synthetic training data.

Would be really crazy if it is quasar LLM.



As expected, Meta doesn't disappoint and accelerates the race to zero.

Meta is undervalued.



And it's 50% off right now...


:D ... In a parallel submission¹, some members are depreciating Yann LeCun as some Lab director who does not deliver!

One day we will have AGI and ask "So, which is which"...

¹ https://news.ycombinator.com/item?id=43562768



How does Meta make money from Llama?


There's a carve-out in the license which requires you to get a different (presumably paid) license if you're a big enough player. I doubt it's enough for the whole endeavour to be profitable but it's revenue nonetheless.

> If, on the Llama 4 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.



When people do cool stuff they share it on Meta's platforms, which drives ad impressions


They don't need to directly. They have multiple levers of products to get more money if they wanted to.

Threads for example is introducing ads and is likely being used to train their Llama models.

That is only one of many ways that Meta can generate billions again from somewhere else.



looks like a leak to me.


The current link includes a link to this page which is a blog post announcement from today.

https://ai.meta.com/blog/llama-4-multimodal-intelligence/



it’s hosted on llama.com with the llama4 subdomain

this is not a leak

edit: not subdomain, idk the other word for it.



URL path?


Ah the latest outcome of mass-crime.


From model cards, suggested system prompt:

> You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.

It's interesting that there's no single one of CJK languages mentioned. I'm tempted to call this a racist model even.



That is a very strange omission...










