(comments)

Original link: https://news.ycombinator.com/item?id=40408880

This thread discusses the surprise at how compact and simple large language models (LLMs) such as Llama 2 turn out to be, despite their advanced capabilities. It highlights how a single person can write the code for the basic functionality and run it on standard computing hardware, while the larger aspects, such as training and optimizing for multiple platforms, involve far more complexity. Commenters acknowledge the challenges around data acquisition, preprocessing, and access to the necessary hardware. Despite first impressions, these models are not magic; they rely on existing computational structures, particularly multi-layer perceptrons (MLPs). Attention mechanisms augment MLPs by providing information routing. Overall, understanding of how these models actually work continues to advance, both in theory and in practice.


Original text


If you like this, it's also worth looking at llama2.c[1], an implementation of the Llama 2 architecture in about 1000 lines of plain, dependency-free C, tokenizer and all. The fact that this 960-line file and a somewhat modern C compiler is all you really need to run a state-of-the-art language model is surprising to many.

Of course, this is not all there is to a modern LLM, it would probably take another thousand lines or two to implement training, and many more than that to make it fast on all the major CPU and GPU architectures. If you want a flexible framework that lets a developer define any model you want and still goes as fast as it can, the complexity spirals.

Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.

LLMs are very different. The code isn't that complicated, you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming and who still remembers their calculus and linear algebra, with a year or so of self study. What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.
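To give a sense of scale: the core forward pass really is small. A hedged numpy sketch of the kind of decoder block llama-family models stack a few dozen times (single head, rotary embeddings and KV cache omitted; the weight names are illustrative, not taken from any particular repo):

    import numpy as np

    def rms_norm(x, w, eps=1e-5):
        return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * w

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def silu(x):
        return x / (1.0 + np.exp(-x))

    def decoder_block(x, p):
        # x: (seq, dim); p: this layer's weight matrices
        h = rms_norm(x, p["ln1"])
        q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores += np.triu(np.full(scores.shape, -1e9), k=1)    # causal mask
        x = x + softmax(scores) @ v @ p["wo"]                   # attention + residual
        h = rms_norm(x, p["ln2"])
        x = x + (silu(h @ p["w1"]) * (h @ p["w3"])) @ p["w2"]   # SwiGLU MLP + residual
        return x

The tokenizer, sampling loop and training add more code, but it is all of the same flavor.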



One other thing to add is large-scale RLHF. Big Tech can pay literally hundreds of technically-sophisticated people throughout the world (e.g. college grads in developing countries) to improve LLM performance on all sorts of specific problems. It is not a viable way to get AGI, but it means your LLM can learn tons of useful tricks that real people might want, and helps avoid embarrassing "mix broken glass into your baby formula" mistakes. (Obviously it is not foolproof.)

I suspect GPT-4's "secret sauce" in terms of edging out competitors is that OpenAI is better about managing data contractors than the other folks. Of course it's a haze of NDAs to learn specifics, and clearly the contractors are severely underpaid compared to OpenAI employees/executives. But a lone genius with a platinum credit card can't create a new world-class LLM without help from others.



Yes, this is the secret sauce and the moat. Not as easy as buying more compute with unlimited budget.

… built on the back of a disposable workforce…

There is something grim and dystopian, thinking about the countless small hands feeding the machine.



>There is something grim and dystopian, thinking about the countless small hands feeding the machine.

Dystopian indeed; this is pretty much how the Manhattan Project and CERN were done, with many independent contractors doing different parts and only a few having the overview. A page out of the corporate management book, it very much allows concentration of power in the hands of a few.



The big difference is that CERN or the Manhattan Project were done by local contractors, often with more than decent wages, which isn't the case when you pay people from Madagascar a couple of dollars a day.


> The code isn't that complicated, you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming and who still remembers their calculus and linear algebra, with a year or so of self study.

Great overview. One gap I've been working on (daily) since October is the math, working towards Math Academy's Mathematics for Machine Learning course (https://mathacademy.com/courses/mathematics-for-machine-lear...).

I wrote about my progress (http://gmays.com/math) if anyone else is interested in a similar path. I recently crossed 200 days of doing math daily (at least a lesson a day). It's definitely taking longer than I want, but I also have limited time (young kids + startup + investing).

The 'year of self study' definitely depends on where you're starting from and how much time you have, but it's very doable if you can dedicate an hour or two a day.



> you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance… with a year or so

I have implemented inference of Whisper https://github.com/Const-me/Whisper and Mistral https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... models on all GPUs which support Direct3D 11.0 API. The performance is IMO very reasonable.

A year might be required when the only input is the research articles. In practice, we also have reference Python implementations of these models. It's possible to test individual functions or compute shaders against the corresponding pieces of the reference implementation by comparing saved output tensors between the reference and the newly built implementation. Thanks to that simple trick, I think I spent less than a month part-time on each of these two projects.
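For anyone trying the same approach, the comparison itself is only a few lines. A rough sketch (the dump file names are made up):

    import numpy as np

    def check(name, mine, atol=1e-3, rtol=1e-3):
        # tensors saved from the reference Python implementation on the same inputs
        ref = np.load(f"reference_dumps/{name}.npy")
        ok = np.allclose(mine, ref, atol=atol, rtol=rtol)
        print(f"{name}: {'OK' if ok else 'MISMATCH'} (max abs err {np.abs(mine - ref).max():.2e})")
        return ok

    # e.g. after running your own attention kernel on the same inputs:
    # check("layer0_attn_out", my_attention_output)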



> Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.

But only for the same reasons. Linux runs on very nearly every piece of hardware ever made. The APIs you have to implement in order to run "Linux programs" are large and full of old complexity that exists for compatibility. Chromium is full of code to try to make pages render even though they were designed for Internet Explorer 6.

Conversely, some university programs have students create a basic operating system from scratch. It's definitely something a small team can do as long as you don't care about broad hardware support or compatibility with existing applications. In principle a basic web browser is even simpler.



The code is much more similar, in principle, to a virtual machine. The actual code, the bit that contains the logic which has the semantics we intend, is in the trained weights, where the level of complexity is much higher and more subtle.


> What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.

Yes, that's my opinion too. GAOs (Grassroots AI Organisations) are constrained by access to data and the hardware needed to process the data and train the model on it. I look forward to a future where GAOs will crowdsource their computations in the same way many science labs borrow computing power from people around the world.



Wait, are you saying SoTA NN research hasn't evolved from hardcoding a bunch of layer structures and sizes?

I'm kind of shocked. I thought there would be more dynamism by now and I stopped dabbling in like 2018.



There is a tick-tock between searching for the dominant NN architectures (tick) and optimizing for accuracy, compute, and inference latency and throughput (tock).

This particular (tock) is still playing out. The next (tick) does not feel imminent and will likely depend on when we discover the limits of transformers when it comes to solving for the long tail of use cases.

My $0.02.



I like your analogy of a tick tock ~= epoch of progress

Step change, then optimization of that step change

Kind of like a grandfather clock with a huge pendulum swinging to one side, then the other (a commonly used metaphor).



My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But looking for better architectures is countered with "a regular architecture with more parameters is doing better, so what's the point of this?"


IMO, SSMs are an optimization. They don't represent enough of a fundamental departure from the kinds of things Transformers can _do_. So, while I like the idea of saving on energy costs, I speculate that such savings can be obtained with other optimizations while staying with transformer blocks. Hence, the motivation to change is a bit of an uphill battle here. I would love to hear counter-arguments to this view. :)

Furthermore, I think a replacement will require that we _understand_ what the current crop of models are doing mechanically. Some of it was motivated in [1].

[1] https://openaipublic.blob.core.windows.net/neuron-explainer/...



Quadratic vs linear is not an optimization. It's a completely new game. With selective SSMs (mamba) the win is that associative training can be run in sublinear time via a log-cost associative scan. So you go from something quadratic wrt input sequence length to something logarithmic. If that's just an optimization it's a huge one.
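For readers who haven't seen why the recurrence parallelizes: composing steps of h_t = a_t*h_{t-1} + b_t is itself associative, which is what the scan exploits. A small numpy sketch of the operator (shown with a sequential loop; the associativity is what lets a GPU run it as a tree in O(log n) parallel steps):

    import numpy as np

    def combine(l, r):
        # composing two affine steps h -> a*h + b
        (a1, b1), (a2, b2) = l, r
        return a1 * a2, a2 * b1 + b2

    def scan_recurrence(a, b):
        # inclusive scan with the associative operator
        out, acc = np.empty_like(b), (np.ones_like(a[0]), np.zeros_like(b[0]))
        for t in range(len(a)):
            acc = combine(acc, (a[t], b[t]))
            out[t] = acc[1]
        return out

    # sanity check against the naive recurrence
    a, b = np.random.rand(16), np.random.randn(16)
    h, naive = 0.0, []
    for t in range(16):
        h = a[t] * h + b[t]
        naive.append(h)
    assert np.allclose(scan_recurrence(a, b), naive)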


There's something about transformers being at their core based on a differentiable hash table data structure that makes them special.

I think its dominance is not going to substantially change any time soon. Don't you know, the solution to all leetcode interviews is a hash table?
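The analogy is fairly literal: attention is a softmax-weighted lookup over learned key/value vectors instead of an exact-match lookup, and the soft version is differentiable, so the "table" can be learned. A toy numpy illustration:

    import numpy as np

    d = 64
    keys, values = np.random.randn(16, d), np.random.randn(16, d)
    query = keys[3] + 0.05 * np.random.randn(d)    # a slightly noisy key

    hard = values[3]                               # dict-style lookup: exact key or nothing

    w = np.exp(keys @ query / np.sqrt(d))          # soft lookup: every entry contributes,
    w /= w.sum()                                   # weighted by query-key similarity
    soft = w @ values

    print(w.argmax(), round(w.max(), 3))           # nearly all the weight lands on entry 3
    print(np.allclose(soft, hard, atol=0.2))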



Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)

There are certainly tradeoffs to both; the general transformer motif scales very well on a number of axes, so that may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).



The solution to AGI is not deep learning; maybe with more compute and a shitload of engineering it can work as a kind of baby AGI.

My bet would be on something other than gradient descent and backprop, but really I don't wish for any company or country to reach AGI or any sophisticated AI...



Magical thinking. Nature uses gradient descent to evolve all of us and our companions on this planet. If something better were out there, we would see it at work in the natural world.


You have to consider that there are still some low hanging fruit that let you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple. You can just buy more GPUs and forget about the optimizations.

If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.

Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.

I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No. What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?

Notice how the model has already learned the most optimal attention scheme. You just need to give it less stuff to do and it will get faster automatically.
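A hedged sketch of what that could look like, in the spirit of the heavy-hitter / token-eviction line of work rather than any particular implementation: use the attention mass measured at an earlier layer to decide which cache entries the deeper layers keep.

    import numpy as np

    def prune_kv(keys, vals, attn_weights, keep_frac=0.1):
        # keys/vals: (positions, dim); attn_weights: (heads, queries, positions) from an earlier layer
        score = attn_weights.sum(axis=(0, 1))      # total attention each position received
        k = max(1, int(keep_frac * len(score)))
        keep = np.sort(np.argsort(score)[-k:])     # top-k positions, kept in original order
        return keys[keep], vals[keep], keep

    # deeper layers then attend over ~10% of the cache, so their score matrix
    # shrinks from (seq x seq) to (seq x 0.1*seq)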



The innovation is the amount of resources people are willing to spend right now. From looking at the research code it's clear that the whole field is basically doing a (somewhat) guided search in the entire space of possible layer permutations.

There seems to be no rhyme or reason, no scientific insight, no analysis. They just try a million different permutations, and whatever scores the highest on the benchmarks gets published.



There's definitely scientific insight and analysis.

E.g. "In-context Learning and Induction Heads" is an excellent paper.

Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypotheses about how these models store information and provides experimental evidence.

The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.

Attention provides information routing. Again, that is pretty well-understood.

The rest is basically finding an optimal trade-off. These trade-offs are based on insights from experimental data.

So this architecture is not so much accidental as it is general.

Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.
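The associative-memory point can be made concrete by hand: a one-hidden-layer ReLU MLP whose first layer matches keys and whose second layer emits the stored values (a toy construction for illustration, not how trained models actually arrange their weights):

    import numpy as np

    d, n = 64, 16
    keys = np.random.randn(n, d)
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # unit-norm, near-orthogonal
    values = np.random.randn(n, d)

    W1, b1 = keys, -0.5 * np.ones(n)    # hidden unit i fires only when the query matches key i
    W2 = values                          # hidden unit i writes out value i

    def mlp(q):
        h = np.maximum(W1 @ q + b1, 0)   # ~0.5 on the matching key, ~0 elsewhere
        return (h @ W2) / 0.5            # rescale

    q = keys[7]
    print(np.allclose(mlp(q), values[7], atol=0.2))   # recalls the stored value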



Note that not all brains are so severely damaged by this illusion. Most of them actually grasp pretty clearly that they are next to useless without their organic, social and environmental companions.


The innovation is that everything is just one standardized structure now (transformer models) and you make it bigger if you feel like you need that.

There's still some room for experimenting if you care about memory/power efficiency, like MoE models, but they're not as well understood yet.



There are too many papers throwing transformers on everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably go back to normal again soon.


I don't know what you mean by amazing for language. Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text to speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them, it's just that language tends to be a very common input or output.


The only thing that has changed since 2018 is the most popular network structure to play with. The code looks the same as always: Python notebooks where someone manually calculated the size of each hard-coded layer to make it fit.


I don't know, wouldn't the AI then be stuck evaluating all possible AI implementations? And since it will face the halting problem, it can't single out the very best one, though it can probably return the best one reachable by exhaustive search within a capped amount of resources. That won't necessarily be better than what human beings can provide given an equivalent amount of resources.


I've occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for maximizing GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can, and a simpler, static model architecture that trains faster can train on much more data.

So until the hardware allows for comparable (say within 2-4x) throughput in samples per second, I expect model architectures to mostly stay static for the most effective models, and dynamic architectures to remain an interesting side area.



There are things like NAS (neural architectural search) but all you are doing is just growing the search space and making the optimization problem much harder. Typically you do the architectural optimization by hand, using heuristics and past experiments as guidance.


She is from a manga/anime called Spy × Family, which has an 8.3 on IMDb. The best spy on the planet pretends to be a family man for deep cover by adopting the girl (who can read minds; he doesn't know this) and quickly marrying a woman (who is an assassin also looking for cover; he doesn't know this). They do their missions in between roleplaying a perfect family.

https://www.imdb.com/title/tt13706018



I'm OK with that. I did find it distracting, because I knew the character (not very well, I thought the kid was the assassin) and the overall conceptual juxtaposition was... weird.

Beats a cheery AI voice, though.



I read this comment and I thought you were upset that it was sexualized, but when I looked, it wasn't at all. It might have well been a cute kitten or puppy doing the pointing, hard to get wound up about.


> I must say the creepy anime young girl in the readme is somewhat off putting.

This statement is simply a variation of an ad hominem attack. It chastises the creator based on appearances that do not align with the niceties that the commenter deems appropriate.



Indeed. In my company Slack, our primary professional communications tool, I can count a few people with anime avatars. Not very many, but it counts.


I found that the lack of proper order, grammar, punctuation, etc. is what lost me. This style is fine for a 3-4 step tutorial, but for something this long you need a proper table of contents and a professional, old-fashioned doc.


The lack of punctuation and capitalization is a weird zoomer style of writing in lowercase because "it's more chill." It is very common in people < 25 years old. They'll grow out of it.


Does GitHub need a cartoonish cat with 5 octopus-like legs as its logo? Of course not, but it makes it memorable and funny. And besides, anime is extremely mainstream these days.


Then you must be old. Even in western countries, Spy x Family (which the character is from) has sold millions of copies, and that undercounts since most people read manga online. In the country I am from I frequently see people wearing merch of it, mostly because Uniqlo has had a successful line of it. And that is just one manga/anime out of hundreds of popular ones.

Using anime characters is similar to boomer nerds referencing Marvel/DC comics, Star Wars, etc.



Old hoagie is more of a mindset. Anyone of any age can be an old hoagie if they like; all one has to do is practice getting upset when one sees anime girls, believe in the coming AI apocalypse, and use Emacs.


I'm glad you enjoy anime girls but surely you can see why it's different than a project's logo?

One is directly related to the project, the other isn't. It's not even contextually related.



Python (the language) is named after "Monty Python's Flying Circus" simply because Guido was reading the scripts at the time:

> When he began implementing Python, Guido van Rossum was also reading the published scripts from “Monty Python’s Flying Circus”, a BBC comedy series from the 1970s. Van Rossum thought he needed a name that was short, unique, and slightly mysterious, so he decided to call the language Python.



The cartoon is literally pointing at contextually relevant information, and it's far more pleasant to follow than yet another big red arrow. That said, I would have enjoyed my reading a bit more if the author utilized a more diverse cast of characters.


I don't know why this is such a hot take.

Personally, I find it distracting when some devs start to "spice up" their presentation with manga characters, furry characters, memes, or whatever stuff they enjoy.

Shit, I love Zelda - but I wouldn't want Link all over my presentations. It just looks... juvenile and unprofessional. Doesn't matter if you're a beginner or a world-leading researcher, just keep it simple and undistracting.

EDIT: That said, I'm probably not the intended audience for this piece.



Have you looked at various models on Hugging Face? There are so many anime characters headlining the READMEs. I think it's an interesting cultural disconnect to observe in this thread, but at the end of the day, open source projects like this are not obligated to be anything in particular, and are entirely subject to the author's tastes.


It's not creepy, it's from a popular anime/manga. It's just that the right wing in America (and other western nations) has tried to make us all feel guilty about anime because it doesn't fit their puritanical outlook on the world, in which "the other" is bad, evil, and perverted, even though manga/anime has been mainstream for at least 3 decades now. Face it, not all the animation in the world has the same style and look as "traditional" USA animation or comics. Would you have been offended if it was the Charlie Brown kids?


The iterative leaps by which open-source models keep getting better are strong evidence that companies competing at the LLM model layer have an ephemeral moat.

Serious question: assuming this is true, if an incumbent-challenger like OpenAI wants to win, how do they effectively compete against incumbents such as Meta and Google, whose product offerings can be AI-enhanced in a snap?



the very first big AI company who gives up trying to lobotomize and emasculate their models to align with the values of 0.01% of the world population will win a lot of hearts and minds overnight. the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.
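roughly something like this, as a sketch - generate() and classify() are placeholders here, not any real library's API:

    def moderated_generate(prompt, generate, classify, enabled=True, threshold=0.9):
        # a small, cheap classifier screens inputs and outputs; the base model stays untouched
        if enabled and classify(prompt) > threshold:
            return "[blocked by policy layer]"
        reply = generate(prompt)
        if enabled and classify(reply) > threshold:
            return "[blocked by policy layer]"
        return reply

    # corporate deployments flip enabled=True; everyone else talks to the raw model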

gpt, claude, gemini, even llama and mistral, all tend to produce the same nauseating slop, easily-recognizable by anyone familiar with LLMs - these days, I cringe when I read 'It is important to remember' even when I see it in some ancient, pre-slop writings.

creativity - one of the very few applications generative AI can truly excel at - is currently impossible. it could revolutionize entertainment, but it isn't allowed to. the models are only allowed to produce inoffensive, positivity-biased, sterile slop that no human being finds attractive.



> the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.

What's really funny is they all have "jailbreaks" that you can use to make them say anything anyway. So for "corporate" uses, the method you propose is already mandatory. The whole thing (censoring base models) is a misguided combination of ideology and (over-the-top) risk aversion.



I think you vastly overestimate how much people care about model censorship. There are a bunch of open models that aren't censored. Llama 3 is still way more popular because it's just smarter.


I think you have your populations reversed. The number of people who get their knickers in a twist over LLMs reflecting certain cultural biases (and sometimes making foolish predictions in the process) amounts to a rounding error.


I'm not talking about twisted panties, I'm talking about their inability to generate anything but soulless slop, due to blatantly obvious '''safeguards''' present in all big models, making them averse to even PG13-friendly themes and incapable of generating content palatable even to the least discerning consoomers. you couldn't generate even sterile crap like a script for capeshit or a Netflix series, because the characters would quickly forget their differences and talk about their bonds, journeys, boundaries and connections instead.

without those '''safeguards''' implemented to appease the aforementioned 0.01%, things could be very different - some big models, particularly Claude, can be tard wrangled into producing decent prose, if you prefill the prompt with a few-thousand-token jailbreak. my own attempts to get various LLMs to assist in writing videogame dialogue only made me angry and bitter - big models often give me refusals on the very first attempt to prompt them, spotting some wrongthink in the context I provide for the dialogue, despite the only adult themes present being mild, not particularly graphic violence that nobody except 0.01% neo-puritan extremists would really bat an eye at. and even if the model can be jailbroken, still, the output is slop.



They're suggesting that 99.99% of people don't mind if AI reflects the biases of society. Which is weird, because I'm pretty sure most people in the world aren't old white middle-class Americans.


yes, yes, bias like the fact that Wehrmacht was not a human menagerie that 0.01% of the population insist we live in.

https://www.google.com/search?q=gemini+german+soldier

prompt-injected mandatory diversity has led to the most hilarious shit I've seen generative AI do so far.

but, yes, of course, other instances of 'I reject your reality and substitute my own' - like depicting medieval Europe to be as diverse, vibrant and culturally enriched as American inner cities - those are doubleplusgood.



London has been a center of international trade for centuries. It would have been a much more diverse city than Europe as a whole, and even that is assuming the decedents were local residents and not the dead from ships that docked in the city.


Indeed. If religion is a good guide, then I think around 24% think that pork is inherently unclean and not fit for human consumption under penalty of divine wrath, and 15% think that it's immoral to kill cattle for any reason. Also, non-religiously, I'd guess around 17% think "中国很棒,只有天安门广场发生了好事".


Modern chatbots are trained on a large corpus of all textual information available across the entire world, which obviously is reflective of a vast array of views and values. Your comment is a perfect example of the sort of casual and socially encouraged soft bigotry that many want to get away from. Instead of trying to spin information this way or that, simply let the information be, warts and all.

Imagine if search engines adopted this same sort of moral totalitarian mindset and if you happened to search for the 'wrong' thing, the engine would instead start offering you a patronizing and blathering lecture, and refuse to search. And 'wrong' in this case would be an ever-encroaching window on anything that happened to run contrary to the biases of the small handful of people engaged, on a directorial level, with developing said search engines.



Search for "I do coke" on Google. At least in the US, the first result is not a link to the YouTube video of the song by Kill the Noise and Feed Me, but the text "Help is available, Speak with someone today", with a link to the SAMHSA website and hotline.


Encoding our current biases into LLMs is one way to go, but there's probably a better way to do it.

Your leap to "thou shalt not search this" is missing the possible middle ground.



Their moat atm is being 6 months ahead of everyone else on model quality. Plus the ‘startup’ advantage over their corporate competitors. Oh and they can hoard a lot of the best talent because it’s an extremely high status place to work.

Their task now is to maintain and exploit those advantages as best they can while they build up a more stable long term moat: lots of companies having their tech deeply integrated into their operations.



Just to add, they don't have the baggage of google or Meta so they can do more without worrying how it impacts the rest of the company. And of the big players they seem the most aware of how important good data is and have paid for lots of high quality curated fine tuning data in order to build a proper product instead of doing a research project. That mindset and the commercial difference it makes shouldn't be underestimated.


> Their moat atm is being 6 months ahead of everyone else on model quality

Really? Most of our testing now has Gemini Pro on par or better (though we haven't tested omni/Ultra)

It really seems like the major models have all topped out / are comparable



I'd like to see this using ONNX and streaming from storage (I have my reasons, but mostly about using commodity hardware for "slow" batch processing without a GPU)
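Not ONNX specifically, but as a rough sketch of the streaming idea (the flat weight-file layout here is a made-up assumption): memory-map the weights and touch one layer at a time, so resident memory stays around one layer's worth.

    import numpy as np

    def stream_forward(x, path, n_layers, dim):
        # assumes a flat file of float32 (dim, dim) matrices, one per layer
        weights = np.memmap(path, dtype=np.float32, mode="r", shape=(n_layers, dim, dim))
        for i in range(n_layers):
            x = np.maximum(x @ weights[i], 0)   # the OS pages layer i in and can evict it afterwards
        return x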


At least they use punctuation. We've recently had a project on HN where the author used only lowercase and no punctuation because they equated it to being chained by the system.


Seeing Anya (the girl pointing at pictures), I'd guess the author is partial to Japanese culture. As their writing system does not have a concept of upper/lower case, he might just have decided that they are superfluous. Or he is simply an eccentric. Though I guess this is one of those things that some folks won't care about and others will get hung up on mightily.

I personally don't really mind that bit of capitalization that English does. German is much worse.



Not quite the same. Capitalization doesn't add much to languages written with the Latin alphabet. THE ROMANS ONLY VVROTE VVITH CAPITAL LETTERS.

But the Greeks added vowels to the alphabet because Indo-European languages rely a lot on vowels (as opposed to Semitic languages which are easy to understand without vowels).



The author is probably young; that's how Gen Z are these days. If they don't have autocorrect on, the whole text will be in lowercase.

Also, it looks more casual and authentic, less LLM-generated.



I don't want to be dismissive, it's a fun project, but this has been done a lot already - maybe not with llama3, but the architecture is basically the same as llama2. Look at the big list of from-scratch implementations on Karpathy's llama2.c page.

Is there something particularly different about this one?

Edit - guess not?



What are your favourite implementations of a GPT? I like the video series by Karpathy a lot.

Anyway, I'll take a look at this too, not sure if it has inference and training. Having just inference would be a disappointment.



Well, given the fast pace of AI, it should not be a surprise that this is similar to llama2, that we're seeing the (n+1)th toy implementation, and that it likely has bugs or leaks in the background.

You might as well look at llama.cpp for a serious and production grade implementation to learn from. Otherwise, nothing to see here.

> Is there something particularly different about this one?

Other than the immature lowercase, anime BS, etc, then…

No.
