Writing an LLM from scratch, part 13 – attention heads are dumb

crystal_revenge · 2025-05-11T19:52:09 1746993129

The most clarifying post I've read on attention is from Cosma Shalizi[0], who points out that "Attention" is quite literally just a re-discovery/re-invention of kernel smoothing. Probably less helpful if you don't come from a quantitative background, but if you do, it is shockingly clarifying.

Once you realize this, "Multi-headed Attention" is just kernel smoothing with more kernels, followed by some linear transformation on their results (in practice: average or add)!

0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...

FreakLegion · 2025-05-11T21:00:27 1746997227

It's a useful realization, too, since ways of approximating kernel functions are already well-studied. Google themselves have been publishing in this area for years, e.g. https://research.google/blog/rethinking-attention-with-perfo...

> To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax-attention). We obtain strong accuracy guarantees for this method while preserving linear space and time complexity, which can also be applied to standalone softmax operations.
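A rough sketch of the idea in that abstract, in plain Python: if the kernel can be written as a dot product of feature maps, K(q, k) ≈ φ(q)·φ(k), the sums over keys can be computed once and reused for every query, so the cost grows linearly with sequence length instead of quadratically. The feature map below is the basic positive random feature for the exp kernel; it is not the full FAVOR+ algorithm (no orthogonal projections, no numerical stabilization), and every function name, vector, and size here is made up for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def positive_random_features(x, projections):
    """phi(x)_r = exp(w_r . x - |x|^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k) when each w_r ~ N(0, I)."""
    half_sq_norm = dot(x, x) / 2.0
    scale = 1.0 / math.sqrt(len(projections))
    return [scale * math.exp(dot(w, x) - half_sq_norm) for w in projections]

def quadratic_attention(phi_queries, phi_keys, values):
    """One kernel evaluation per (query, key) pair: O(n^2) in sequence length."""
    outputs = []
    for fq in phi_queries:
        weights = [dot(fq, fk) for fk in phi_keys]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, values)) / total)
    return outputs

def linear_attention(phi_queries, phi_keys, values):
    """Same result, but the key/value sums are built once and shared by all queries."""
    m = len(phi_keys[0])
    kv_sum = [sum(fk[r] * v for fk, v in zip(phi_keys, values)) for r in range(m)]
    k_sum = [sum(fk[r] for fk in phi_keys) for r in range(m)]
    return [dot(fq, kv_sum) / dot(fq, k_sum) for fq in phi_queries]

# Toy data (hypothetical): 2-dimensional queries/keys, scalar values.
rng = random.Random(0)
dim, num_features = 2, 20000
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

queries = [(0.3, 0.1), (-0.2, 0.4)]
keys = [(0.2, -0.1), (0.1, 0.3), (-0.3, 0.2)]
values = [1.0, 2.0, 3.0]

phi_queries = [positive_random_features(q, projections) for q in queries]
phi_keys = [positive_random_features(k, projections) for k in keys]

slow = quadratic_attention(phi_queries, phi_keys, values)
fast = linear_attention(phi_queries, phi_keys, values)

# How well the random features approximate the exact exp kernel for one pair.
kernel_estimate = dot(phi_queries[0], phi_keys[0])
kernel_exact = math.exp(dot(queries[0], keys[0]))
```

`slow` and `fast` agree up to floating-point error because they are the same arithmetic in a different order; only the comparison of `kernel_estimate` with `kernel_exact` involves approximation, and it tightens as `num_features` grows.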

esafak · 2025-05-11T21:48:18 1747000098

In kernel methods the kernel is typically given, and things like positional embeddings, layer normalization, causal masking, and so on are missing. Kernel methods did not take off partly due to their computational complexity (quadratic in sample size), and transformers did precisely because they were parallelizable, and thus computationally efficient, compared with the RNNs and LSTMs that came before them.

Reductions of one architecture to another are usually more enlightening from a theoretical perspective than a practical one.

thomasahle · 2025-05-11T21:14:48 1746998088

For those who don't know the term "kernel smoothing", it just means

    ∑ᵢ yᵢ · K(xᵢ, xₒ) ⁄ (∑ⱼ K(xⱼ, xₒ))

In regular attention, we let K(xᵢ, xₒ) = exp(⟨xᵢ, xₒ⟩).

Note that in Attention we use K(qᵢ, kₒ) where the q (query) and k (key) vectors are not the same.

Unless you define K(xᵢ, xₒ) = exp(⟨W_q xᵢ, W_k xₒ⟩) as you do in self-attention.

There are also some attention mechanisms that don't use the normalization term, (∑ⱼ K(xⱼ, xₒ)), but most do.
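To make the correspondence concrete, here is a minimal sketch in plain Python. It assumes the kernel is exp of a plain dot product, uses scalar values yᵢ, and leaves out the 1/√d scaling and the learned query/key projections that real attention layers normally include. The function names and toy numbers are invented for the example.

```python
import math

def kernel_smooth(query, keys, values, kernel):
    """sum_i y_i * K(x_i, x_o) / sum_j K(x_j, x_o): a kernel-weighted average."""
    weights = [kernel(key, query) for key in keys]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def exp_dot_kernel(x, y):
    """K(x, y) = exp(<x, y>)."""
    return math.exp(sum(a * b for a, b in zip(x, y)))

def softmax_attention(query, keys, values):
    """The usual formulation: softmax over dot-product scores, then a weighted sum."""
    scores = [sum(a * b for a, b in zip(key, query)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum((e / total) * v for e, v in zip(exps, values))

keys = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [10.0, 20.0, 30.0]
query = (0.5, -0.5)

smoothed = kernel_smooth(query, keys, values, exp_dot_kernel)
attended = softmax_attention(query, keys, values)
```

The two results match up to floating-point error, since the softmax denominator is exactly the normalization term ∑ⱼ K(xⱼ, xₒ).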

throwup238 · 2025-05-12T01:08:25 1747012105

> ∑ᵢ yᵢ · K(xᵢ, xₒ) ⁄ (∑ⱼ K(xⱼ, xₒ))

That clarifies things...

3abiton · 2025-05-11T21:57:21 1747000641

Wow, thanks for referencing that. What a very detailed and long read!

andrehacker · 2025-05-11T16:40:08 1746981608

This looks very interesting. The easiest way to navigate to the start of this series of articles seems to be https://www.gilesthomas.com/til-deep-dives/page/2

Now if I only could find some time...

Tokumei-no-hito · 2025-05-11T17:07:38 1746983258

maybe it renders differently on mobile but this was the first entry for me. you can use the nav at the end to continue to the next part

https://www.gilesthomas.com/2024/12/llm-from-scratch-1

sitkack · 2025-05-11T17:06:42 1746983202

https://news.ycombinator.com/from?site=gilesthomas.com

badsectoracula · 2025-05-11T17:18:54 1746983934

Too bad the book seems to be using Python and some external library like tiktoken right from chapter 2, meaning that it'll basically stop working next week or so, like everything Python, making the whole thing much harder to follow in the future.

Meanwhile i learned the basics of machine learning and (mainly) neural networks from a book written in 1997[0] - which i read last year[1]. It barely had any code and that code was written in C, meaning it'd still more or less work (though i didn't have to try it since the book descriptions were fine on their own).

Now, Python was supposedly designed to look kinda like pseudocode, so using it for a book could be fine, but at least it should avoid relying on external libraries that do not come with the language itself - and preferably stick to stuff that has equivalents in other languages too.

[0] https://www.cs.cmu.edu/~tom/mlbook.html

[1] which is why i make this comment (and to address the apparent downvotes): even if i get the book now i might end up reading it in 3-4 years. Stuff not working will be a major obstacle. If the book is good, it might end up being recommended by people 2-3 years from now and some people may end up getting it and/or reading it even later in time. So it is important for the book to be self-contained, at least when it comes to books that try to teach the theory/ideas behind things.

y42 · 2025-05-11T18:35:38 1746988538

not sure if rage bait or serious, but: have you ever heard of conda or virtual environment?

tonyarkles · 2025-05-11T20:01:53 1746993713

Those are decent options but you can still run into really ugly issues if you try to go back too far in time. An example I ran into in the last year or two was a Python library that linked against the system OpenSSL. A chain of dependencies ultimately required a super old version of this library and it failed to compile against the current system OpenSSL. Had to use virtualenv inside a Docker container that was based on either Ubuntu 18.04 or 20.04 to get it all to work.

johnmaguire · 2025-05-11T22:20:55 1747002055

Wouldn't this be an issue with C too? Or anything that links against an external library?

andrehacker · 2025-05-11T17:38:15 1746985095

Myeah, C and C++ have the advantage that the compilers support compiling for old versions of the language. The languages are in much flux, partly because of security problems, partly because features are added from other languages. That means that linking to external libraries using the older language version will fail unless you keep the old version around, simply because the maintainer of the external library DID upgrade.

Python is not popular in ML because it is a great language but because of the ecosystem: numpy, pandas, pytorch and everything built on those allows you to do the higher level ML coding without having to reinvent efficient matrix operations for a given hardware infrastructure.

vlovich123 · 2025-05-11T19:42:53 1746992573

> That means that linking to external libraries using the older language version will fail unless you keep the old version around simply because the maintainer of the external library DID upgrade.

This just isn’t true. C ABIs have not seen any change with the updated standards, and while C++ doesn’t have a stable ABI boundary you shouldn’t have any problem calling older binary interfaces from new code (or new binary interfaces from old code provided you’re not using some new types that just aren’t available). That’s because the standard library authors themselves do strive to guarantee ABI compatibility (or at least libc++ and libstdc++ - I’m not as confident about MSVC but I have to believe this is generally true there too). Indeed the last ABI breakage in C++ was on Linux in C++11 15 years ago because of changes to std::string.

og_kalu · 2025-05-11T19:56:17 1746993377

>Python is not popular in ML because it is a great language but because of the ecosystem: numpy, pandas, pytorch and everything built on those allows you to do the higher level ML coding without having to reinvent efficient matrix operations for a given hardware infrastructure.

Ecosystems don't poof into existence. There are reasons people chose to write those libraries, sometimes partly or wholly in other languages for python in the first place.

It's not like python was older than or a more prominent language than say C when those libraries began.

badsectoracula · 2025-05-11T17:48:36 1746985716

(i assume with "The languages are in much flux" you meant python and not c/c++ because these aren't in flux)

Yeah i get why Python is currently used[0] and for a theory-focused book Python would still work to outline the algorithms - worst case you boot up an old version of Python in Docker or a VM, but it'd still require using only what is available out of the box in Python. And depending on how the book is written, it may not even be necessary.

That said there are other alternatives nowadays and when trying to learn the theory you may not need to use the most efficient stuff. Using C, C++, Go, Java, C# or whatever other language with a decent backwards compatibility track record (so that it can work in 5-10 years) should be possible and all of these should have some small (if not necessarily uberefficient) library for the calculations you may want to do that you can distribute alongside the book for those who want to try the code out.

[0] even if i wish people would stick to using it only for the testing/experimentation phase and move to something more robust and future proof for stuff meant to be used by others

andrehacker · 2025-05-11T18:30:56 1746988256

"The languages are in much flux" you meant python and not c/c++ because these aren't in flux

No I meant C++.

2011  ISO/IEC 14882:2011  C++11
2014  ISO/IEC 14882:2014  C++14
2017  ISO/IEC 14882:2017  C++17
2020  ISO/IEC 14882:2020  C++20
2024  ISO/IEC 14882:2024  C++23

That is 4 major language changes in 10 years.

As a S/W manager in an enterprise context having to coordinate upgrades of multi-million LOC codebases for mandated security compliance, I can say C++ is not the silver bullet in handling the version problem that exists in every ecosystem.

As said, the compilers/linkers allow you to run in compatibility mode so as long as you don't care about the new features (and the company you work for doesn't) then, yes, C/C++ is easier for managing legacy code.

YZF · 2025-05-11T19:04:19 1746990259

These are new features. Many of them are part of the library not the language. Generally speaking what you do is enable the new features in your compiler, you don't need to disable that to compile old code. It's not a problem to work on legacy code and use new features for new code either.

theyinwhy · 2025-05-11T16:35:18 1746981318

There are multiple books about this topic now. What are your takes on the alternatives? Why did you choose this one? Appreciate your thoughts!

andrehacker · 2025-05-11T16:50:27 1746982227

It is regarded as "the best" book on the topic by many. I found, just as Giles Thomas wrote, that the book focuses on the details and how to write the lower level code without providing the big picture.

I am personally not very interested in that as these details are likely to change rather quickly while the principles of LLMs and transformers will probably remain relevant for many years.

I have been looking for, but failed to find, a good resource that approaches it the way 3blue1brown [1] explains it but then goes deeper from there.

The blog series from Giles seems to take the book and add the background to the details.

[1] https://m.youtube.com/watch?v=wjZofJX0v4M

logicallee · 2025-05-11T17:46:41 1746985601

If you are interested in this sort of thing, you might want to take a look at a very simple neural network with two attention heads that runs right in the browser in pure Javascript, you can view source on this implementation:

https://taonexus.com/mini-transformer-in-js.html

Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.

westoque · 2025-05-11T20:09:16 1746994156

Must be my ignorance but every time I see explainers for LLMs similar to the post, it’s hard to believe that AGI is upon us. It just doesn’t feel that “intelligent” but again might just be my ignorance.

throwawaymaths · 2025-05-11T21:44:54 1746999894

eh, transformers are universal differentiable layered hash tables. that's incredibly powerful. most logic is just pulling symbols and matching structures with "hash"es.

if intelligence is just reasonable manipulation of logic it's unsurprising that an LLM could be intelligent, what maybe is surprising is that we have ~intelligence without going up a few more orders of magnitude in size, what's possibly more surprising is that training it on the internet got it doing the things it's doing
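One hedged way to read the "differentiable hash table" analogy, sketched in plain Python: an ordinary lookup returns exactly one stored value, while an attention-style lookup returns a blend of all stored values weighted by how well each key matches the query. The `sharpness` parameter is a made-up knob for this illustration (real attention has no such argument); turning it up makes the soft lookup behave more and more like an exact one.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    """Return a similarity-weighted blend of the stored value vectors."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(width)]

# A tiny "table" with three keys and their stored values (toy numbers).
keys = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
values = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

query = (0.1, 0.9, 0.0)  # closest to the second key, but not identical to it

blended = soft_lookup(query, keys, values, sharpness=1.0)
nearly_exact = soft_lookup(query, keys, values, sharpness=50.0)
```

With low sharpness the result mixes all three stored values; with high sharpness it is almost exactly the value stored under the best-matching key. Because every step is a smooth function, gradients can flow through the lookup, which is the "differentiable" part of the analogy.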

jlawson · 2025-05-11T20:14:26 1746994466

Neurons are pretty simple too.

Any arbitrarily complex system must be made of simpler components, recursively down to arbitrary levels of simplicity. If you zoom in enough everything is dumb.

voidspark · 2025-05-11T20:56:18 1746996978

Neurons are surprisingly not simple. Vastly more complex than the ultra simplified model in artificial neural networks.

Lerc · 2025-05-11T22:09:30 1747001370

I think there are two layers of the 'why' in machine learning.

When you look at a model architecture it is described as a series of operations that produces the result.

There is a lower level why, which, while being far from easy to show, describes why it is that these algorithms produce the required result. You can show why it's a good idea to use cosine similarity, why cross entropy was chosen to express the measurement. In Transformers you can show that the Q and K matrices transform the embeddings into spaces that allow different things to be closer, and using that control over the proportion of closeness allows you to make distinctions. This form of why is the explanation usually given in papers. It is possible to methodically show you will get the benefits described from techniques proposed.

The greater why is much, much harder: harder to identify and harder to prove. The first why can tell you that something works, but it can't really tell you why it works in a way that can inform other techniques.

In the Transformer, the intuition is that the 'Why' is something along the lines of The Q transforms embeddings into an encoding of what information is needed in the embedding to resolve confusion, and that the K transforms embeddings into information to impart. When there's a match between 'What I want to know about' and 'what I know about' the V can be used as 'the things I know' to accumulate the information where it needs to be.

It's easy to see why this is the hard form. Once you get into the higher semantic descriptions of what is happening, it is much harder to prove that this is actually what is happening, or that it gives the benefits you think it might. Maybe Transformers don't work like that. Sometimes semantic relationships appear to be in processes when there is an unobserved quirk of the mathematics that makes the result coincidentally the same.

In a way I think of the maths of it as picking up a many-dimensional object in each hand and magically rotating and (linearly) squishing them differently until they look aligned enough to see the relationship I'm looking at and poking those bits towards each other. I can't really think about that and the semantic "what things want to know about" at the same time, even though they are conceptualisations of the same operation.

The advantage of the lower why is that you can show that it works. The advantage of the upper why is that it can enable you to consider other mechanisms that might do the same function. They may be mathematically different but achieve the goal.

To take a much simpler example in computer graphics: there are many ways to draw a circle with simple loops processing mathematically provable descriptions of a circle. The Bresenham circle drawing algorithm does so with a why that shows why it makes a circle, but the "Why do it that way" was informed by a greater understanding of what the task being performed was.
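For readers who haven't seen it, here is a sketch of the integer-only midpoint variant of that circle algorithm in Python. It is one common textbook form, not necessarily the exact version the comment has in mind. The "lower why" is that the error term tracks which side of the true circle the next pixel falls on; the "greater why" is that the task is really "pick the nearest pixels on a grid", which needs no sin, cos, or square roots.

```python
def circle_points(cx, cy, radius):
    """Return the set of integer grid points approximating a circle.

    Only one octant is walked; the other seven come from symmetry.
    """
    points = set()
    x, y = radius, 0
    err = 1 - radius  # decision variable: sign says inside/outside the true circle
    while x >= y:
        for dx, dy in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            points.add((cx + dx, cy + dy))
        y += 1
        if err <= 0:
            err += 2 * y + 1          # stay in the same column
        else:
            x -= 1
            err += 2 * (y - x) + 1    # step one column inward
    return points

small = circle_points(0, 0, 3)
large = circle_points(0, 0, 10)
```

Every point returned lies within about a pixel of the true radius, and the whole computation uses only integer additions and comparisons.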

quantadev · 2025-05-11T18:44:32 1746989072

Regarding this statement about semantic space:

> so long as vectors are roughly the same length, the dot product is an indication of how similar they are.

This potential length difference is the reason "Cosine Similarity" is used instead of dot products for concept comparisons. Cosine similarity is like a 'scale independent dot product', which represents a concept of similarity, independent of "signal strength".

However, if two vectors point in the same direction, but one is 'longer' (higher magnitude) than the other, then what that indicates "semantically" is that the longer vector is indeed a "stronger signal" of the same concept. So if "happy" has a vector direction then "very happy" should be a longer vector but in the same direction.
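A small illustration of the difference, using made-up 2-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and nothing guarantees that "very happy" is actually an exact multiple of "happy" in a trained model):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both lengths, so magnitude cancels out."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy embeddings, chosen only for illustration.
happy = (0.6, 0.8)
very_happy = tuple(2.0 * x for x in happy)  # same direction, twice the length
unrelated = (0.8, -0.6)                     # perpendicular to "happy"

dot_same = dot(happy, happy)
dot_stronger = dot(happy, very_happy)       # grows with the "signal strength"
cos_stronger = cosine_similarity(happy, very_happy)
cos_unrelated = cosine_similarity(happy, unrelated)
```

The dot product doubles when one vector is twice as long, while the cosine similarity stays at 1 for any positive multiple, which is the "scale independent" behaviour described above.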

Makes me wonder if there's a way to impose a "corrective" force upon model weights evolution during training so that words like "more" prefixed in front of a string can be guaranteed to encode as a vector multiple of said string? Not sure how that would work with back-propagation, but applying certain common sense knowledge about how the semantic space structures "must be" shaped could potentially be the next frontier of LLM development beyond transformers (and by transformers I really mean the attention heads specialization)

bornfreddy · 2025-05-11T18:54:12 1746989652

Off topic rant: I hate blog posts which quote the author's earlier posts. They should just reiterate if it is important or use a link if not. Otherwise it feels like they want to fill some space without any extra work. The old posts are not that groundbreaking, I assure you. /rant
