(comments)

Original link: https://news.ycombinator.com/item?id=40502090

The author compares two language-model training setups: one using PyTorch or JAX with nanoGPT, the other using C/CUDA code called llm.c. The former is not yet fully tuned, while the latter currently runs slightly faster; the author intends to improve both, aiming for a minimal, self-contained, dependency-light repository for educational purposes, and is not especially focused on squeezing out performance. A commenter separately suggests applying the same spirit to protein structure prediction: ablating how much can be removed (for example, multiple sequence alignments and molecular dynamics) while still predicting novel structures from the existing Protein Data Bank, without worrying about drug discovery. The thread also covers the Llama 3 modifications relative to the GPT-2 architecture, including replacing the positional encoding, deleting certain biases, changing the non-linearity, increasing the context length, and adjusting architecture hyperparameters, along with a claim that sufficiently long training reduces the gap between these models.

Related articles

Original text


Thank you for the effort you put into your educational work; it has helped me and others a lot! In fact, I'm training my nanoGPT version right now. :)

> Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing.

Also, it's awesome that you spend your time on your passion.

Any plans on making a video series on llm.c? :D



Hi Andrej!

First, thank you for your teaching; it has helped me a lot. I didn't think I'd ever have the chance to say thank you, but here you are, and I hope this gets to you!

Question: what's a relevant (05-2024) baseline to compare the performance of the C code to? Back when you made nanoGPT you were seeing "the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training". So twice the memory on the C node, but I'm unsure of the data size/epochs and any other details I may be missing. I.e., what's the net uplift of running C vs. "legacy" torch code?

Thanks again for everything.



The baseline is definitely PyTorch (or JAX), and indeed something like nanoGPT. I just never got nanoGPT "past the finish line" of really crossing the t's and dotting the i's and reproducing the models with as much care as I did now and here in llm.c, and getting to the point where it's a single launch command that just does the thing.

I think I'll try to develop the `train_gpt2.py` inside llm.c to be that, so that we have the two implementations exactly side by side, and it's all nice and comparable.

The C/CUDA code is currently a little bit faster than PyTorch (last time I measured ~2 weeks ago it was about 6% faster), and I think we can push this further. This is done by manually hard-coding a bunch of fusions/optimizations that are non-trivial for torch.compile to find (e.g. our FusedClassifier). But PyTorch has some pending work/PRs that will also speed up their side a lot.
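For illustration only (this is not the actual llm.c FusedClassifier kernel, just a PyTorch sketch of the kind of fusion being described): the idea is to compute the classifier loss without materializing the full softmax tensor as a separate intermediate.

```python
import torch
import torch.nn.functional as F

def naive_classifier_loss(logits, targets):
    # Materializes a full (B*T, V) probability tensor before the loss.
    probs = F.softmax(logits, dim=-1)
    return F.nll_loss(torch.log(probs), targets)

def fused_classifier_loss(logits, targets):
    # cross_entropy folds log-softmax and NLL together (log-sum-exp trick),
    # so no explicit softmax tensor is ever stored.
    return F.cross_entropy(logits, targets)

B, T, V = 4, 64, 50257
logits = torch.randn(B * T, V)
targets = torch.randint(0, V, (B * T,))
assert torch.allclose(naive_classifier_loss(logits, targets),
                      fused_classifier_loss(logits, targets), atol=1e-4)
```

A hand-written CUDA kernel can go further than this, e.g. producing the logit gradients in the same pass, which is the sort of thing that is hard for a compiler to discover automatically.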

Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing. And on top of that, educational, i.e. using all of the above as an endpoint of an intro LLM course.



I agree - Twitter is still the primary source for a lot of original work and original thoughts. Unfortunately it's gotten more complicated because (1) the threads there have gotten less accessible and (2) some people have assigned the entire site to one side of the culture war.



First, love the videos and other work you've been doing. The micrograd videos are a great way to show people this is all math in the end, and I've linked to specific timestamps in that video and others more times than I can count.

As for why I think we have an anti-Twitter bias...

Twitter doesn't show replies or any further context without being logged in. Most people will have accounts but I know a lot here deleted theirs or refuse to use it for one reason or another.

Also IMO most here are going to want to read the full source so it just cuts out the middleman. This would usually fall under the "Please submit the original source. If a post reports on something found on another site, submit the latter." guideline which is a little different since the source is yourself, but still the Twitter post doesn't add anything new or novel.



fwiw I totally understand the sentiment! It's actually a bit sad to me that so much of our content is moving from the shared, open web to platforms like Twitter. Unfortunately, there seems to be too much value add around built-in discoverability, comments, ease of authoring, and (for many people) revenue sharing, etc.



Yes, definitely. I had to double check your age (apologies! feels rude somehow) and yep, we're basically the same age. The web was different back then. Maybe not better; maybe that's nostalgia. But never before have creators had as many tools and avenues to promote and monetize their work as they do now.



> Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).

Andrej, based on that, do you have a rough cost estimate for what it would take to train a GPT-3 Ada (350M)? Do you plan to get there with llm.c?



The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K would be the estimate. You'd have to wait 140 hours on one box though. Getting an H100 box instead of A100 will already cut the time latency down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support).

So TLDR: at this model scale, llm.c is already there functionally, I think; it's a matter of compute resources and patience. I currently have this one box from Lambda, and I have to look around for a few more boxes and merge the pending PR for multi-node training support. Getting all of this into a nice, stable state is probably a good chunk of the pending work right now.
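A back-of-the-envelope restatement of that estimate (assuming cost and time scale roughly linearly with token count, which ignores any hardware or pricing changes):

```python
# Numbers quoted above for the 350M run: 30B tokens, 14 hours, ~$200.
measured_tokens = 30e9
measured_hours = 14
measured_cost_usd = 200

target_tokens = 300e9                      # GPT-3-style token budget
scale = target_tokens / measured_tokens    # = 10

print(f"~{measured_hours * scale:.0f} hours on the same box")  # ~140 hours
print(f"~${measured_cost_usd * scale:,.0f}")                   # ~$2,000
```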



So, nanoGPT took 1.8 days on 8xA100 for 124M model training on 30.7B tokens using flash attention. That would translate to 14.4 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10X speedup!

Does this look ballpark correct? Is there any summary of where the majority of this improvement comes from?
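A rough sanity check of those numbers (a sketch only, assuming throughput stays roughly constant across token counts for a fixed model and batch size):

```python
# nanoGPT: 1.8 days on 8xA100 for 30.7B tokens, as quoted above.
nanogpt_hours_30_7b = 1.8 * 24
nanogpt_hours_10b = nanogpt_hours_30_7b * (10 / 30.7)
llmc_hours_10b = 1.5

print(f"nanoGPT, 10B tokens: ~{nanogpt_hours_10b:.1f} h")                 # ~14.1 h
print(f"speedup with llm.c:  ~{nanogpt_hours_10b / llmc_hours_10b:.1f}x") # ~9.4x
```

So the ~10X figure is in the right ballpark on the stated assumptions.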



Would you consider switching your interest to protein structure prediction? In particular, the current most advanced model is a closed-source, closed-weights system that was trained on proprietary hardware. It is intentionally kept that way for now to enable DeepMind to commercialize their product.

The goal here isn't to make the best-performing model: it's ablation. How much can we remove from protein structure prediction (such as multiple sequence alignments and molecular dynamics, which were two improvements in AF3) while still having a generalized model that can predict novel folds?

Then focus on teaching the minimal necessary math and code to reproduce the results to the larger biological community. All I can say about AF3 is that it literally taught me that everything I learned about protein structure prediction in the last 30 years was misguided, or outright wrong.

Don't worry about drug discovery or any of the hard stuff. Just continue to show that all that's required to predict novel structures is the existing PDB.



How large is the set of binaries needed to do this training job? The current pytorch + CUDA ecosystem is so incredibly gigantic and manipulating those container images is painful because they are so large. I was hopeful that this would be the beginnings of a much smaller training/fine-tuning stack?



That is 100% my intention and hope, and I think we are very close to deleting all of that. Right now on master, I am already only using Python for the tokenization preprocessing. In principle the requirements for llm.c should be extremely minimal. I think this is a few days of work that is high on my mind.

Biggest problem right now is finding a place that can host the 135GB of tokens for FineWeb100B. Will probably use S3 or something.

Related see: https://github.com/karpathy/llm.c/issues/482



My understanding and suspicion is that it's mostly less than you think. The Llama 3 architecture has the following changes on top of GPT-2 (a rough sketch of two of them follows the list):

1. delete the absolute positional encoding and replace with RoPE

2. delete all biases in all layers (in LayerNorms, they turn into RMSNorm)

3. GeLU -> SwiGLU non-linearity in the MLP

4. longer context length

5. architecture hyperparameter changes, e.g. slightly different aspect ratios
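A minimal PyTorch sketch of changes 1 and 3 above (shapes and sizes here are illustrative, not Llama 3's actual hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim). Instead of adding a learned absolute
    # positional embedding, rotate channel pairs by position-dependent angles.
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    # GPT-2's MLP is Linear -> GeLU -> Linear; SwiGLU gates one projection
    # with silu() of another, and all Linears drop their bias (change 2).
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

q = torch.randn(2, 16, 8, 64)                 # (batch, seq, heads, head_dim)
print(rope(q).shape)                          # torch.Size([2, 16, 8, 64])
print(SwiGLU(512, 1376)(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```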

And there was a paper that I can't find the reference to anymore that claimed that if you train long enough, the gap becomes even lower. Possibly because the absolute positional encoding has enough time to train more fully, whereas the RoPE layer benefits from the "inductive bias" it adds in the earlier stages of training.

But I don't have full confidence in the above claim; maybe someone has tried it or has a better/concrete reference.



You might have covered this topic before, but I'm curious about the main performance differences between nanoGPT and llm.c. I'm planning to take your "Zero to Hero" course, and I'd like to know how capable the nanoGPT chatbot you'll build is. Is its quality comparable to GPT-2 when used as a chatbot?



Zero To Hero doesn't make it all the way to a chatbot; it stops at pretraining, and even that at a fairly small scale, with a character-level transformer on TinyShakespeare. I think it's a good conceptual intro, but you don't get too far toward a competent chatbot. I think I should be able to improve on this soon.



Thanks! So, you are considering expanding the Zero to Hero series to include building a basic GPT-2 toy chatbot? I believe you mentioned in one of the early lectures that you planned to include building a toy version of DALL-E. Do you still have plans for that as well?



> Why write in CUDA and not just use PyTorch etc?

“LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. […] A few more words on what I want this repo to be: First, I want llm.c to be a place for education.”



Hi Andrej,

Huge fan of all the work you do. I wanted to understand something fundamental, and whom better to ask than you: what's so special about the transformer architecture that it's able to predict the next token so beautifully, understanding all the intricate previous-token relationships? I understand attention, but what's so special about this architecture that no other architectures are able to "attend" appropriately to previous tokens? Being a CS guy, it's really hard for me to fathom that we have not yet created another architecture that can perform similarly.



I just hope that in a couple of years we'll see a submission here titled "Reproduce GPT-4 on legacy RTX 4090."

Because currently, even with open-source (?) models, we are still consumers, and the training is still the domain of the rich.



More specifically, Cloudflare R2 doesn't charge for egress, and Cloudflare doesn't charge for egress to members of the Bandwidth Alliance, which includes Azure, Google Cloud, Oracle, Alibaba Cloud, and others, though critically not AWS.

They very much do charge egress fees elsewhere.



I think between Anna's Archive, FineWeb, and as many GitHub repos as you can scrape, you can get a pretty decent dataset.

I doubt Anna's Archive would produce a good model on its own though.



Depends on how the courts rule. If the copyright maximalists prevail, only the wealthiest entities will be able to afford to license a useful data set.

Paradoxically enough, this is the outcome that most "Hacker News" denizens seem to be rooting for.



It's almost as if people believe in fairness and compensating people for their work.

Also, it's worth noting that this is only true as long as we're stuck in the "must train on the entire sum total of human output ever created" local minimum for machine learning. Given that most biological entities learn with much less data, this might well be the thing that prods ML research toward an approach that isn't "IDK, buy a few containers of GPUs, and half a DC of storage, see if that makes things better".



We won’t ever get there, or need to, because GPT-4 wasn’t trained on one GPU; it was trained on thousands. The (most likely) biggest meaningful difference between -2 and -4 is the number of parameters and the training data/duration. I don’t think you’d really learn much more.



It’s not about learning. It’s about owning. Exactly the reason OpenAI stopped being open. Having GPT-4-quality LLMs created by anyone with a gaming PC would be pretty radical.



And you won’t get there. Those models are far too large for a 2024 GPU. Llama-3 70b is arguably close to GPT-4, but it is still too large for gaming GPUs (and probably will be for many years of GPU updates).



“You won’t get there” is a pretty vast statement for all of the future. Two fairly reasonable predictions: 1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.

At some point they cross, and you will be able to run a GPT4-quality LLM on a consumer GPU. At some point after that, you’ll be able to run a GPT4-quality LLM on a 2024 consumer GPU if you can find one.

Important to emphasize, I’m not saying “GPT-4”. Llama-3 was trained on 24k GPU clusters. “Able to do the exact same processing at 1/24k the compute” is different from “Able to get equivalent performance at 1/24k compute”. Even then, given a long enough time scale, the former is possible.



It converges similarly on smaller datasets.

About to kick off a training-from-scratch run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. And with my kWh cost, that is about $2.50 to train.

Will report back tomorrow when the training has finished.



Dude, I'm hoping we get rid of Nvidia completely. I can run llama.cpp inference on a 7B model on my 24-core Intel machine using just the CPU, and it only uses about 4 GB of RAM and is not that slow. If we could have massively parallel ARM-core or even RISC-V machines without the CUDA issues and proprietary driver hell, it would be much more open source. And much less wonkage for the normie user.



I'm not saying this to be rude, but I think you have a deep misunderstanding of how AI training works. You cannot just skip the matrix multiplications necessary to train the model, or get current hardware to do it faster.



No offence taken! As far as my (shallow!) understanding goes, the main challenge is the need for many GPUs with huge amounts of memory, and it still takes ages to train the model. So regarding the use of consumer GPUs, some work has been done already, and I've seen some setups where people combine several of these and are successful. As for the other aspects, maybe at some point we'll distill what is really needed into a smaller but excellent dataset that would give similar results in the final models.



Last time I tried GPT-2 on CPU (which I think was shortly before ChatGPT was launched), I was getting about 0.2 tokens/sec. CPU utilization was low though, so running inference in parallel gave better results. I was using 2 x E5-2660's.



I have a 24-core Intel CPU, and llama.cpp runs Llama 3 surprisingly fast in surprisingly little RAM. Yes, it becomes a space heater, but there's light at the end of the CUDA-free tunnel.



Is this the sort of thing that a person with curiosity and a 4090 could do? It says he used 8xA100s in the cloud to do this, but is it just a matter of the 4090 going 8x slower, or will memory constraints kill the whole endeavour?



Spend one year studying multiple languages: bash, C, C++, Go, Python ... and even Mojo or Rust, 10-20 hours a week. Being able to read the top programming languages is the best investment I ever made. You will become fearless and can see the matrix ;)



You'd have to be deep into ML infrastructure to use C, probably via CUDA. No one who develops or uses ML models touches C or even C++; tinygrad and llama.cpp are exceptions.
