(Comments)


Original link: https://news.ycombinator.com/item?id=41412256

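The author argues that, in discussions of machine learning models, specific technical details such as automatic differentiation and massively parallel computing are often overlooked because of differing perspectives. From a mathematician's point of view, these computational elements are a distraction. Instead, the focus should be on building the model from first mathematical principles, drawing on insights from natural language processing, examining how input data is represented and how loss is evaluated, addressing data-processing questions, and placing machine learning in a broader context. The author argues that unless you intend to work on hardware or on core library implementations, there is no need to dig into intricate programming issues such as manual memory management. Instead, the author suggests starting the educational process at an earlier stage, focusing on programming basics, text manipulation, generating basic statistics, improving on Markov chains, and exploring different architectures for building AI systems. Although Markov chains are not the same as large language models (LLMs), they provide a valuable foundation for understanding how algorithms analyze and generate text. Starting immediately from an existing framework skips important fundamentals. In addition, although the terms "software developer", "coder", and "engineer" may carry different connotations, the author believes that more substantial expectations should be placed on professional engineers and developers, especially when referring to solutions designed for commercial applications. Finally, the author sees no problem with using the term "transformer" for AI systems, including LLMs, because these techniques take raw input and process it through computation, producing output that differs from the initial input, analogous to a transformation.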

Andrej's series is excellent, Sebastian's book + this video are excellent. There's a lot of overlap but they go into more detail on different topics or focus on different things. Andrej's entire series is absolutely worth watching, his upcoming Eureka Labs stuff is looking extremely good too. Sebastian's blog and book are definitely worth the time and money IMO.

Using PyTorch is not "LLMs from the ground up".

It's a fine PyTorch tutorial but let's not pretend it's something low level.

I really like Sebastian's content but I do agree with you. I didn't get into deep learning until starting with Karpathy's series, which starts by creating an autograd engine from scratch. Before that I tried learning with fast.ai, which dives immediately into building networks with Pytorch, but I noped out of there quickly. It felt about as fun as learning Java in high school. I need to understand what I'm working with!
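For anyone wondering what "an autograd engine from scratch" amounts to, here is a minimal sketch in plain Python. It is not the code from either course; the `Value` class and its method names are made up for illustration, and it only supports addition and multiplication of scalars.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical name)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes self.grad onto the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


x = Value(3.0)
y = Value(4.0)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
z.backward()
```

After `z.backward()`, `x.grad` and `y.grad` hold the partial derivatives of `z`. Frameworks do the same bookkeeping on tensors instead of scalars.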

Maybe it's just different learning styles. Some people, me included, like to start getting some immediate real world results to keep it relevant and form some kind of intuition, then start peeling back the layers to understand the underlying principles. With fastAI you are already doing this by the 3rd lecture.

Like driving a car, you don't need to understand what's under the hood when you start driving, but eventually understanding it makes you a better driver.

For sure! In both cases I imagine it is a conscious choice where the teachers thought about the trade-offs of each option. Both have their merits. Whenever you write learning material you have to decide where to draw the line of how far you want to break down the subject matter. You have to think quite hard about exactly who you are writing for. It's really hard to do!

You seem to be implying that the top-down approach is a trade-off that involves not breaking down the subject matter into as low-level details. I think the opposite is true - when you go top down you can keep teaching lower and lower layers all the way down to physics if you like!

fast.ai also does autograd from scratch - and goes further than Karpathy since it even does matrix multiplication from scratch.

But it doesn’t start there. It uses top-down pedagogy, instead of bottom up.
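To give a sense of scale, "matrix multiplication from scratch" can be as small as the sketch below. It is a plain-Python illustration written for this thread, not fast.ai's code, and `matmul` is a hypothetical helper name.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Dot product of row i of a with column j of b.
                result[i][j] += a[i][k] * b[k][j]
    return result


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The triple loop is the whole algorithm; everything a library adds on top is about making it fast.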

Bach (Johann Sebastian .. there were many musical Bachs in the family) owned and wrote for harpsichords, lute-harpsichords, violin, viola, cellos, a viola da gamba, lute and spinet.

Never had a piano, not even a fortepiano .. though reportedly he played one once.

We’re digressing to get way off the whole point of the comment, but to address your point, actually piano design has been an area of great innovation over the centuries, with different companies doing it in considerably different ways.

Considering i seem to be the minority here based on all the other responses to the message you replied to, the answer i'd give is "by mine, i guess".

At least when i saw the "Building LLMs from the Ground Up" what i expected was someone to open vim, emacs or their favorite text editor and start writing some C code (or something around that level) to implement, well, everything from the "ground" (the operating system's user space which in most OSes is around the overall level of C) and "up".

The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

To a statistician or a practitioner approaching machine learning from a mathematical perspective, the computational details are a distraction.

Yes, these models would not be possible without automatic differentiation and massively parallel computing. But there is a lot of rich detail to consider in building up the model from first mathematical principles, motivating design choices with prior art from natural language processing, various topics related to how input data is represented and loss is evaluated, data processing considerations, putting things into the context of machine learning more broadly, etc. You could fill half a book chapter with that kind of content (and people do), without ever talking about computational details beyond a passing mention.

In my personal opinion, fussing over manual memory management is far afield from anything useful unless you want to actually work on hardware or core library implementations like Pytorch. Nobody else in industry is doing that.

> The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

But if all is relative and depends on your PoV that implies that there isn't actually a problem here, right? :-P

I don't think there is anything wrong with "building up the model from first mathematical principles" as you wrote, it just wasn't what i personally had in mind with the "from the ground up" part.

And FWIW i'm not that stuck up on the "vim and C" aspect, i used those as an example that i expected most would understand and leave little room for misinterpretation in what you'd have to work with (i.e. very very little) and have to implement yourself (pretty much everything) - personally i'd consider it from "the ground up" even if it was in C#, D, Java, JavaScript or even Python, as long as the implementation was done in a way that didn't rely on 3rd party libraries so that whatever is implemented in, say, Java could also be implementable in C#, D, JavaScript or Python with just whatever is available out of the box in those languages or even C, if one doesn't mind writing the extra bookkeeping functionality themselves.

Gluing together premade components is not “from the ground up” by most people’s definition.

People are looking at the ground up for a clear picture of what the thing is actually doing, so masking the important part of what is actually happening, then calling it “ground up” is disingenuous.

Yes, but "what the thing is actually doing" is different depending on what your perspective is on what "the thing" and what "actually" consists of.

If you are interested in how the model works conceptually, how training works, how it represents text semantically, etc., then I maintain that computational details are an irrelevant distraction, not an essential foundation.

How about another analogy? Is SICP not a good foundation for learning about language design because it uses Scheme and not assembly or C?

From scratch is relative. To a python programmer, from scratch may mean starting with dictionaries but a non-programmer will have to learn what python dicts are first.

To someone who already knows Excel, from scratch with Excel sheets instead of Python may work for them.

For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.

Although if your claim is then that most programmers do not care about being effective, that I would tend to agree with given the 64 gigs of ram my basic text editors need these days.

>For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.

While I agree it's good to know how your collections work. "Efficient key-value store" may be enough to use it effectively 80% of the time for somebody dabbling in Python.

Sadly I've met enough people that call themselves programmers that didn't even have such a surface level understanding of it.
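For readers who want a bit more than "efficient key-value store", here is a toy hash table in Python. It is only a sketch of the general idea: `TinyHashMap` is a made-up class, it uses chaining and never resizes, whereas CPython's real `dict` uses open addressing and grows as it fills.

```python
class TinyHashMap:
    """A toy chained hash table (hypothetical; not how CPython's dict is built)."""

    def __init__(self, bucket_count=8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # The hash picks a bucket, so lookups scan one short list, not every entry.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for index, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[index] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


counts = TinyHashMap()
for word in "to be or not to be".split():
    counts.put(word, counts.get(word, 0) + 1)
```

Knowing this much already explains why keys must be hashable and why lookups stay fast as long as the buckets stay short.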

No it is not. From scratch has a meaning. To me it means: in a way that lets you understand the important details, e.g. using a programming language without major dependencies.

Calling that from scratch is like saying "Just go to the store and tell them what you want" in a series called: "How to make sausage from scratch".

When I want to know how to do X from scratch I am not interested in "how to get X the fastest way possible", to be frank I am not even interested in "How to get X in the way others typically get it", what I am interested in is learning how to do all the stuff that is normally hidden away in dependencies or frameworks myself — or, you know, from scratch. And considering the comments here I am not alone in that reading.

Your definition doesn’t match mine. My definition is fuzzier. It is “building something using no more than the common tools of the trade”. The term “common” is very era dependent.

For example, building a web server from scratch - I’d probably assume the presence of a sockets library or at the very least networking card driver support. For logging and configuration I’d assume standard I/o support.

It probably comes down to what you think makes LLMs interesting as programs.

It is okay to differ on this. Language is not an exact science. It is however always good to factor in expectations when you describe things.

E.g. when a title says it shows you how to do a thing in vanilla JavaScript from scratch, bringing in jQuery in the first step makes that title a lie. If you bring in a hefty dependency in step 1 and run three imported functions, the vanilla JavaScript part might be fine, but the "from scratch" starts to do some heavy lifting.

You could always go deeper and from some points of view, it's not "from the ground up" enough unless you build your own autograd and tensors from plain numpy arrays.

Your comment is one of the most pompous that I've ever read.

NVIDIA's value lies only in PyTorch and CUDA optimizations compared with a pure C implementation, so saying that you need to go lower level than CUDA or PyTorch simply means reinventing NVIDIA. Good luck with that.

1. I only said the meaning of the title is wrong, and I praised the content

2. I didn't say CUDA wouldn't be ground up or low level (please re-read) (I say in another comment about a no-code guide with CUDA, but it's obviously a joke)

3. And finally, I think your comment comes out as holier than thou and finger pointing and making a huge deal out of a minor semantic observation.

Wanted to say the same thing. As an educator who once gave a course on a similar topic for non-programmers you need to start way, way earlier.

E.g.

1. Programming basics

2. How to manipulate text using programs (reading, writing, tokenization, counting words, randomization, case conversion, ...)

3. How to extract statistical properties from texts (ngrams, etc, ...)

4. How to generate crude text using markov chains

5. Improving on markov chains and thinking about/trying out different topologies

Etc.

Sure, Markov chains are not exactly LLMs, but they are a good starting point to build an intuition for how programs can extract statistical properties from text and generate new text based on that. It also gives you a feeling for how programs can work on text.

If you start directly with a framework there is some essential understanding missing.
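Steps 2 to 4 of the list above fit in a few lines of Python. This is a minimal sketch, assuming a whitespace tokenizer and a first-order (bigram) chain; the function names and the toy corpus are invented for illustration.

```python
import random
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_bigram_table(tokens):
    """Map each word to the list of words observed directly after it."""
    table = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        table[current].append(following)
    return table


def generate(table, start, length, rng):
    """Walk the chain: repeatedly sample a successor of the last word."""
    words = [start]
    for _ in range(length - 1):
        successors = table.get(words[-1])
        if not successors:
            break  # dead end: this word was never followed by anything
        words.append(rng.choice(successors))
    return " ".join(words)


corpus = "the cat sat on the mat and the cat ate the fish"
table = build_bigram_table(tokenize(corpus))
sample = generate(table, "the", 8, random.Random(0))
```

Because successors are stored with repetition, frequent continuations are sampled more often, which is the "statistical properties" part. Step 5 then becomes a matter of conditioning on more than one previous word.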

Nice write up Sebastian, looking forward to the book. There are lots of details on the LLM and how it’s composed, would be great if you can expand on how Llama and OpenAI could be cleaning and structuring their training data given it seems this is where the battle is heading in the long run.

But isn't it the beauty of llm's that they need comparably little preparation (unstructured text as input) and pick the features on their own so to say?

edit: grammar

Yes, if you want an LLM that doesn't listen to instructions and just endlessly babbles about anything and everything.

What turned GPT into chatGPT was a lot of structured training with human feedback.

Yes. I checked the Azure usage after training.

Beyond learning how it all works and demo, there is not much practical usage. You can train it on current events if you feed that corpus during training instead of just OpenWebText. Shouldn't be hard.

Quite a cry, in a submission page from one of the most language "obsessed" in this community.

Now: "code" is something you establish - as the content of the codex medium (see https://en.wikipedia.org/wiki/Codex for its history); from the field of law, a set of rules, exported in use to other domains since at least the mid XVI century in English.

"Program" is something you publish, with the implied content of a set of intentions ("first we play Bach then Mozart" - the use postdates "code"-as-"set of rules" by centuries).

"Develop" is something you unfold - good, but it does not imply "rules" or "[sequential] process" like the other two terms.

I am from Brazil and I find this funny because in my circle of friends/co-workers we mostly use "coding" when speaking English, or "codar" (code as a Portuguese verb) with other Brazilians. I am not sure why, but I think it is because "program" has a strong association with prostitution in Brazilian Portuguese.

I'm from Europe and my language doesn't have an equivalent to "coding" but i'm still using the English word "coder" and "coding" for decades - in my case i learned it from the demoscene where it was always used for programmers since the 80s. FWIW the Demoscene is (or was at least) largely a European thing (groups outside of Europe did exist but the majority of both groups and demoparties were -and i think still are- in Europe) so perhaps there is some truth about the "coding" word being a European thing (e.g. it sounded ok in some languages and spread from there).

Also in my ears coder always sounded cooler than programmer and it wasn't until a few years ago i first heard that to some people it has negative connotations. Too late to change though, it still sounds cooler to me :-P.

[0] https://en.wikipedia.org/wiki/Demoscene

I am from Europe and I am not completely sure about that to be honest. I also prefer programming.

I also dislike software development as it reminds me of developing a photographic negative – like "oh let's check out how the software we developed came out".

It should be software engineering and it should be held to a similar standard as other engineering fields unless it is done in a non-professional context.

The word "development" can mean several things. I don't think "software development" sounds bad when grouped with a phrase like "urban development". It describes growing and tuning software for, well, working better, solving more needs, and with fewer failure modes.

I do agree that a "coder" creates code, and a programmer creates programs. I expect more of a complete program than of a bunch of code. If a text says "coder", it does set an expectation about the professionalism of the text. And I expect even more from a software solution created by a software engineer. At least a specification!

Still, I, a professional software engineer and programmer, also write "code" for throwaway scripts, or just for myself, or that never gets completed. Or for fun. I will read articles by and for coders too.

The word is a signal. It's neither good nor bad, but if that's not the signal the author wants to send, they should work on their communication.

> If that's not the signal the author wants to send

You can't use a language that will be taken by everyone the same way. The public is heterogeneous - its subsets will use different "codes".

> software development

Wrong angle. There is a problem, your consideration of the problem, the refinement of your solution to the problem: the solution gradually unfolds - it is developed.

This is great. Just yesterday I was wondering how exactly transformers/attention and LLMs work. I'd worked through how back-propagation works in a deep RNN a long while ago and thought it would be interesting to see the rest.

This is great! Hope it works on a Windows 11 machine too (I often find that when Windows isn't explicitly mentioned, the code isn't tested on it and usually fails to work due to random issues).

Language is the language model that extends Transformer. Transformer is a base model for any kind of token (words, pixels, etc.).

However, currently there is some language-specific stuff in Transformer that should be moved to Language :) I'm focusing first on language models, and getting into image generation next.

No, I mean, a transformer is a very specific model architecture, and your simple language model has nothing to do with that architecture. Unless I’m missing something.
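For context, the architecture-specific sense of "transformer" centers on attention. Below is a minimal sketch of scaled dot-product attention in plain Python. It leaves out what a real transformer layer adds (learned query/key/value projections, multiple heads, residual connections, feed-forward blocks), and the function names and toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)  # self-attention, no learned projections
```

Every output vector mixes information from all input positions, with the mixing weights computed from the data itself. A bigram lookup has no equivalent of that step.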

I still call it a transformer because the inputs are tokenized and computed to produce completions, not from lookups or assembling based on rules.

> Unless I'm missing something.

Only that I said "without taking the LLM approach" meaning tokens aren't scored in high-dimensional vectors, just as far simpler JSON bigrams. I don't think that disqualifies using the term "transformer" - I didn't want to call it a "computer" or a "completer". Have a better word?

> JSON instead of vectors

I did experiment with a low-dimensional vector approach from scratch, you can paste this into your browser console: https://gist.github.com/bennyschmidt/ba79ba64faa5ba18334b4ae...

But the n-gram approach is better, I don't think vectors start to pull away on accuracy until they are capturing a lot more contextual information (where there is already a lot of context inferred from the structure of an n-gram).
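To make "JSON bigrams instead of vectors" concrete, here is a rough Python sketch of the general idea. It is not the code or the data format of the library discussed here; `train_bigrams`, `predict_next`, and the sample sentence are assumptions made for illustration.

```python
import json
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each token follows each other token."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    # Convert to plain dicts so the model serializes as ordinary JSON.
    return {token: dict(followers) for token, followers in counts.items()}


def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = model.get(token.lower())
    if not followers:
        return None
    return max(followers, key=followers.get)


model = train_bigrams("paris is the capital of france and paris is a city")
serialized = json.dumps(model, sort_keys=True)  # plain JSON, no vectors involved
restored = json.loads(serialized)
```

The whole "model" is a nested object of counts, so it round-trips through JSON unchanged. Longer n-grams would use a tuple or joined string of previous tokens as the key.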

And it fits the definition doesn't it since it tokenizes inputs to compute them against pre-trained ones, rather than being based on rules/lookups or arbitrary logic/algorithms?

Even in CSS a matrix "transform" is the same concept - the word "transform" is not unique to language models, more a reference to how 1 set of data becomes another by way of computation.

Same with tile engines / game dev. Say I wanted to rotate a map, this could be a simple 2D tic-tac-toe board or a 3D MMO tile map, anything in between:

Input

[
  [0, 0, 1],
  [0, 0, 0],
  [0, 0, 0]
]

Output

[
  [0, 0, 0],
  [0, 0, 0],
  [0, 0, 1]
]

The method that takes the input and gives that output is called a "transformer" because it is not looking up some rule that says where to put the new values, it's performing math on the data structure whose result determines the new values.
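The rotation in that example can be written as a small Python sketch. The function name `rotate_clockwise` is made up, and it assumes the map is a rectangular list of rows.

```python
def rotate_clockwise(grid):
    """Rotate a rectangular grid (list of rows) 90 degrees clockwise."""
    # Reversing the rows and then transposing moves the top row to the right column.
    return [list(column) for column in zip(*grid[::-1])]


board = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
rotated = rotate_clockwise(board)
```

No rule table says where each cell goes; the new positions fall out of the index arithmetic, which is the point being made about the word "transform".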

It's not unique to language models. If anything vector word embeddings are much later to this concept than math and game dev.

An example of use of word "Transformer" outside language models in JavaScript is Three.js' https://threejs.org/docs/#examples/en/controls/TransformCont...

I used Three.js to build https://www.playshadowvane.com/ - built the engine from scratch and recall working with vectors (e.g. THREE Vector3 for XYZ stuff) years before they were being popularized by LLMs.

I get this question only on Hacker News, and am baffled as to why (and also the question "isn't this just n-grams, nothing more?").

https://github.com/bennyschmidt/next-token-prediction

^ If you look at this GitHub repo, should be obvious it's a token prediction library - the video of the browser demo shown there clearly shows it being used to autocomplete text based on your domain-specific data. Is THAT a Markov chain, nothing more? What a strange question, the answer is an obvious "No" - it's a front-end library for predicting text and pixels (AKA tokens).

https://github.com/bennyschmidt/llimo

This project, which uses the aforementioned library is a chat bot. There's an added NLP layer that uses parts-of-speech analysis to transform your inputs into a cursor that is completed (AKA "answered"). See the video where I am chatting with the bot about Paris? Is that nothing more than a standard Markov chain? Nothing else going on? Again the answer is an obvious "No" it's a chat bot - what about the NLP work, or the chat interface, etc. makes you ask if it's nothing more than a standard [insert vague philosophical idea]?

To me, your question is like when people were asking if jQuery "is just a monad"? I don't understand the significance of the question - jQuery is a library for web development. Maybe there are some similarities to this philosophical concept "monad"? See: https://stackoverflow.com/questions/10496932/is-jquery-a-mon...

It's like saying "I looked at your website and have concluded it is nothing more than an Array."

This page is just a container for a youtube video. I suggest updating this HN link to point to the video directly, which contains the same links as the page in its description.

yeah really valuable stuff. so we know how the ginormous model that we can't train or host works (though in practice there are so many hacks and optimizations that none of them work like this). great.

Not true.

Your resource is really bad.

"We'll then load the trained GPT-2 model weights released by OpenAI into our implementation and generate some text."

Neither the author of the GPT from scratch post nor eclectic29, who recommended it above, ever promised that the post is about building LLMs from the ground up. That was the original post.

The GPT from scratch post explains, from the ground up, ground being numpy, what calculations take place inside a GPT model.

I’m not sure why you’d want to build an LLM these days - you won’t be able to train it anyway. It’d make a lot of sense to teach people how to build stuff with LLMs, not LLMs themselves.

This has been said about pretty much every subject: writing your own browsers, compilers, cryptography, etc. But at least for me, even if nothing comes of it, just knowing how it really works and what steps are involved is part of using things properly. Some people are perfectly happy using a black box, but without knowing how it's made, how do we know the limits? How will the next generation of LLMs happen if nobody can get excited about the internal workings?

You don’t need to write your own LLM to know how it works. And unlike, say, a browser it doesn’t really do anything even remotely impressive unless you have at least a few tens of thousands of dollars to spend on training. Source: my day job is to do precisely what I’m telling you not to bother doing, but I do have access to a large pool of GPUs. If I didn’t, I’d be doing what I suggest above.

But I mean people can always rent GPUs too. And they're getting pretty ubiquitous as we ramp up from the AI hype craze, I am just an IT monkey at the moment and even I have on-demand access to a server with something like 4x192GB GPUs at work.

It's possible to train useful LLMs on affordable hardware. It depends on what kind of LLM you want. Sure you won't build the next ChatGPT, but not every language task requires a universal general-purpose LLM with billions of parameters.

It's so fun! And for me at least, it sparks a lot of curiosity to learn the theory behind them, so I would imagine it is similar for others. And some of that theory will likely cross over to the next AI breakthrough. So I think this is a fun and interesting vehicle for a lot of useful knowledge. It's not like building compilers is still super relevant for most of us, but many people still learn to do it!
