(comments)

Original link: https://news.ycombinator.com/item?id=39737281

Your point about the availability of training and evaluation code is a fair one. In practice, though, most models (especially very large ones like Grok or the GPT-X series) are unlikely to make such details publicly accessible, due to confidentiality agreements and the cost of maintaining and hosting that kind of infrastructure. It is worth acknowledging that true openness is not always achievable or desirable, for both ethical and practical reasons. Nevertheless, transparency about data sources, model performance, and failure cases remains an essential part of building accountability, fostering innovation, and ensuring that advanced AI technology is used responsibly.

Grok (github.com/xai-org)
971 points by pierre 15 hours ago | 359 comments

Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?

I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.

Because I doubt it's as simple as just 'python run.py' to get it going.



If you're just looking to test it out, it's probably easiest to wait for llama.cpp to add support (https://github.com/ggerganov/llama.cpp/issues/6120), and then you can run it slowly if you have enough RAM, or wait for one of the inference API providers like together.ai to add it. I'd like to add it to my NYT Connections benchmarks, and that's my plan (though it will require changing the prompt since it's a base model, not a chat/instruct model).
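
For anyone planning to try that route, here's a rough sketch of what local inference could look like through the llama-cpp-python bindings, assuming llama.cpp merges Grok-1 support and someone publishes a GGUF conversion (the file name and quant level below are hypothetical):

  # Minimal local-inference sketch via llama-cpp-python.
  # Assumes llama.cpp gains Grok-1 support and a GGUF conversion exists;
  # "grok-1-q4_k_m.gguf" is a hypothetical file name.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./grok-1-q4_k_m.gguf",  # hypothetical quantized weights
      n_ctx=4096,        # context window to allocate
      n_gpu_layers=0,    # 0 = CPU only; raise this if you have VRAM to offload layers
  )

  # Grok-1 is a base model, not instruction-tuned, so prompt it
  # completion-style rather than chat-style.
  out = llm("The three most common mistakes when benchmarking LLMs are", max_tokens=128)
  print(out["choices"][0]["text"])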


>it's probably easiest

Cheapest maybe, but easiest is just to rent a p4de.24xlarge from AWS for a couple hours to test (at around $40/hour..).



I'd expect more configuration issues in getting it to run on them than from a tested llama.cpp version, since this doesn't seem like a polished release. But maybe.


Someone could run Grok-1 on a 192GB M2 Mac when a 4-bit quant is released; I'm guessing that TheBloke is already working on it.


Fairly sure TheBloke hasn't created any new quants in a month.


TheBloke disappeared around the day https://nvd.nist.gov/vuln/detail/CVE-2024-23496 was published.

Of course there has been much speculation about this. I have no more information than this that can be backed up by facts, but the timing was suspicious.



He's started a company in the UK: https://suite.endole.co.uk/insight/company/15361921-thebloke...

Interestingly registered just around the corner from where one of my relatives used to live.



And his grant funding supposedly ran out.


Was any .gguf file hosted on HuggingFace found to be crafted in a way to exploit this?


At 8x86B, looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. Especially important for higher param models in order to efficiently utilize all those parameters.


Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.


"it really emphasises how important fine tuning is"

Or rather the quality of the training data?



We don't know since no one is releasing their data.

Calling these models open source is like calling a binary open source because you can download it.

Which in this day and age isn't far from where we're at.



A big distinction is that you can build on top of (fine-tune) the released models just as well as if they had released the pre-training data.


You can fine tune without the pre training data too.

Mistral models are one example, they never released pre training data and there are many fine tunes.



You can also build on top of binaries if you use gotos and machine code.


This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.


One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM model. Fragments yes, but not all of it.


You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B of synthetic tokens generated with chatGPT and it showed a 5x bump in efficiency, punching well above its weight.

Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
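
To make that concrete, here's a minimal sketch of seeded synthetic-data generation; generate() is a hypothetical stand-in for whatever LLM API you use, and the concept/style lists are purely illustrative:

  # Sketch of seeded synthetic-data generation ("Machine Study" as described above).
  # generate() is a hypothetical stand-in for an LLM call; swap in your own client.
  import itertools
  import json
  import random

  def generate(prompt: str) -> str:
      raise NotImplementedError("call your LLM API of choice here")

  CONCEPTS = ["binary search", "photosynthesis", "supply and demand", "TCP handshake"]
  STYLES = ["a worked example", "a common misconception and its correction",
            "a short quiz question with an answer"]

  def make_samples(n: int, seed: int = 0):
      rng = random.Random(seed)
      pairs = list(itertools.product(CONCEPTS, STYLES))
      samples = []
      for concept, style in rng.sample(pairs, min(n, len(pairs))):
          prompt = f"Write {style} about {concept}. Be concrete and self-contained."
          samples.append({"prompt": prompt, "completion": generate(prompt)})
      return samples

  if __name__ == "__main__":
      print(json.dumps(make_samples(3), indent=2))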



Or shell scripts


We should just call it open weight models at this point.


FWIW the Grok repo uses the term "open weights".


How about "weights available" as similar to the "source available" moniker?


weights available or model available, but yes.


Their data is the twitter corpus which is public. Or do you want a dump of their database for free too?


Twitter tweet data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.


Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."


Or even "here's the Linux Kernel makefiles, no sources included, enjoy".


[flagged]



Requiring a company to publish their production database for free is delusional. I haven't mentioned musk anywhere in my comment, you must be obsessed with him.


It's fascinating you doubled down on your own straw man and still have the nerve to call others delusional.

You missed my point, which I wasn't very clear about, so my mistake. Although it doesn't seem like you're interested in understanding anyway.



That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. Which, we know, they're not going to be weighted evenly.


I'm sure just like in X's algorithms, @elon tweets are weighted heavily.


The X algorithm is also open source, so you can verify before commenting.


X algorithm Github project hasn't been updated in 8 months:

https://github.com/twitter/the-algorithm

So clearly they aren't running it in production.

Also they didn't open source the list of people who are being artificially boosted e.g. Elon.



just because they open sourced it doesn't mean that's actually what they're running on it though


It's not like he needs boosting, he was one of Twitter's top followed accounts long before he bought them. He's pretty good at getting attention.


And yet it’s not enough to curb the desire to tip the scales.

https://arstechnica.com/tech-policy/2023/02/report-musk-had-...



No idea about the current state, but the open sourcing did show they were favoring Elon:

https://mashable.com/article/twitter-releases-algorithm-show...

And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just a coincidence.



> they were favoring elon

No, and that's not what the article says either. They were just tracking how well his tweets were doing versus others. They were not favoring Elon.



"They were just tracking how well his tweets were doing versus others. "

Yeah, and adjusting it so he comes out best. That was Musk's demand after a Biden tweet performed better than his, as the other article (linked inside) shows:

https://mashable.com/article/elon-musk-super-bowl-joe-biden-...

They officially boost people who pay a little bit. Elon paid a lot.

And the source is clearly not the production source and never was in this shape - otherwise why sue someone who open sourced it?

"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."

Also, you probably missed that:

"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."

Which is consistent with quite a few other statements, including from Twitter itself, and the fact that the source has not been updated in 8 months.

See also this HN comment and discussion about it:

https://news.ycombinator.com/item?id=35391854

"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""



It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.


Did you not read the article linked in the comment you're replying to?


Sounds a bit far fetched.

So changes in power users' stats would also result in audience balancing?

Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account, and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty but it works.

Most likely the balancing code is somewhere else and it affects only Republicans / Democrats.



> I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

Are you sure or is it the literal opposite and you’re just speculating?



Or even how much it was trained on this dataset, the amount of FLOPs.


Show the proof? Does it include IFT?


I would say it emphasises that training a good model is more than throwing random data and compute at it.


No, it emphasizes the importance of training smaller models for longer, like the Mistral "overtrained" models.


It's actually not the largest. https://huggingface.co/google/switch-c-2048 is 1.6T parameters.


It’s not 8x86B. Total number of parameters is 314B.

Perhaps it’s 8x39B to fit on a single 8xA100 (40GB) server?



They all do this marketing bull.

Mixtral has an 8x7B model but it's actually 46.7B, not 56B params.

Kinda similar to how 4K displays are 3840 pixels wide, not true 4K which would be 4096. Marketing people called it 4K, not engineers.



I've always thought of 4K as "4x FullHD". In that way it makes sense.


So going by your logic, 8K is 8x FullHD?


TV and Digital Cinema have different standards, because of course they do


Active parameters is 86B, so wouldn't that be the size of the largest two experts (where they may all be the same) + the weights of the selector?


Most likely it's a MoE of Grok-0 which would be 8x33B + 50B for the router.


I'd be very curious to see how it performs, especially on inputs that are blocked by other models. Seems like Grok will differentiate itself from other OS models from a censorship and alignment perspective.


Model weights on huggingface: https://huggingface.co/xai-org/grok-1


"Base model trained on a large amount of text data, not fine-tuned for any particular task."

Presumably the version they've been previewing on Twitter is an instruction-tuned model which behaves quite differently from these raw weights.



Love the minimal repo, magnet link, and stating "open weights" instead of "open source". Refreshing!


For what reason would you want to use this instead of open source alternatives like Mistral


Mistral opened their weights only for a very small LLaMA-like model.


I'm pretty sure Mixtral outperforms Grok-1 and uses much less memory to do it


One of the interesting things when weights are open sourced is the community can often improve the results. See all the bugs fixed in Gemma for an example.


Doubtful, for purely information theoretic and memory capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail where metrics never go


I'm a little out of touch, is there a way to see how Grok measures up to other models?


Benchmarks here https://x.ai/blog/grok


And to compare, you can sort by MMLU on here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb....

Edit: to include my self-summary after review: There's a good 100 models better than it, a couple of 1x7B even. Mixtral stomps it; half-Mixtral variants are universally better, but one is close to the same.



This benchmark is mostly worthless, some of the top models there were trained on benchmark data, which is a known fact in the community.

The only reliable benchmark: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...



No, it's not "mostly worthless" and yes, some of the top models were removed a few months back from being trained on benchmark data.

I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.

I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species, all data has its issues, that doesn't mean we throw it all out.



Isn't this Apache licensed? Regardless, you can run multiple models concurrently on the same input using well-known ensemble techniques. (Not to be confused with mixture-of-experts, which is more like training a single model where only a few blocks are chosen to be active at any given time - a kind of sparsity.)
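
To illustrate what I mean by ensembling (as opposed to MoE), here's a toy majority-vote sketch in Python, where each entry in models is assumed to be a callable mapping a prompt to a short answer string:

  # Majority-vote ensemble over several independent models.
  # Each element of `models` is assumed to be a callable prompt -> answer string;
  # this is generic ensembling, unrelated to mixture-of-experts routing.
  from collections import Counter
  from typing import Callable, List

  def ensemble_vote(models: List[Callable[[str], str]], prompt: str) -> str:
      answers = [m(prompt).strip().lower() for m in models]
      winner, count = Counter(answers).most_common(1)[0]
      # Fall back to the first model's answer if there is no agreement at all.
      return winner if count > 1 else answers[0]

  # Example with toy stand-ins for real model calls:
  models = [lambda p: "42", lambda p: "42", lambda p: "41"]
  print(ensemble_vote(models, "What is 6 * 7?"))  # -> "42"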


Well if nothing else, this one might be significantly less nerfed. Very interesting to compare to the others.


[flagged]



I’ve been known to get snippy on HN from time to time myself :) So please know that I’m only offering a gentle nudge that I’d want from a fellow long-timer myself regarding a line of discussion that’s liable to age poorly.

Talking about sorting hats for those who do and don’t have the one-percenter AI badge isn’t a super hot look my guy (and I’ve veered dangerously close to that sort of thing myself, this is painful experience talking): while there is no shortage of uninformed editorializing about fairly cutting edge stuff, the image of a small cabal of robed insiders chucking in their cashews while swiping left and right on who gets to be part of the discussion serves neither experts nor their employers nor enthusiastic laypeople. This is especially true for “alignment” stuff, which is probably the single most electrified rail in the whole discussion.

And as a Google employee in the diffuser game by way of color theory, you guys have a “days since we over-aligned an image generation model right into a PR catastrophe” sign on the wall in the micro kitchen right? That looked “control vector” whacky, not DPO with pretty extreme negative prompt whacky, and substantially undermined the public’s trust in the secretive mega labs.

So as one long-time HN user and FAANG ML person to another, maybe ixnay with the atekeepinggay on the contentious AI #1 thread a bit?



Every discipline has its bellwether topics. They’re useful for filtering out people who want to chip in without picking up the tools.


regardless of whether they say it out loud, it is what many of us think - might be good for people to know why their opinions are getting immediately dismissed by insiders


Letting people know why their opinions are getting dismissed in a productive way is done by citing well-known sources in a low-effort way, or by explaining things thoughtfully in a high-effort way: Karpathy has chosen the highest-effort way of most anyone, it seems unlikely that anyone is at a higher rung of "insiderness" than he is, having been at Toronto with (IIRC) Hinton and Alex and those folks since this was called "deep learning", and has worked at this point at most of the best respected labs.

But even if folks don't find that argument persuasive, I'd remind everyone that the "insiders" have a tendency to get run over by the commons/maker/hacker/technical public in this business: Linux destroying basically the entire elite Unix vendor ecosystem and ending up on well over half of mobile came about (among many other reasons) because plenty of good hackers weren't part of the establishment, or were sick of the bullshit they were doing at work all day and went home and worked on the open stuff (bringing all their expertise with them) is a signal example. And what e.g. the Sun people were doing in the 90s was every bit as impressive given the hardware they had as anything coming out of a big lab today. I think LeCun did the original MNIST stuff on a Sun box.

The hard-core DRM stuff during the Napster Wars getting hacked, leaked, reverse engineered, and otherwise rendered irrelevant until a workable compromise was brokered would be another example of how that mentality destroyed the old guard.

I guess I sort of agree that it's good people are saying this out loud, because it's probably a conversation we should have, but yikes, someone is going to end up on the wrong side of history here and realizing how closely scrutinized all of this is going to be by that history has really motivated me to watch my snark on the topic and apologize pretty quickly when I land in that place.

When I was in Menlo Park, Mark and Sheryl had intentionally left a ton of Sun Microsystems iconography all over the place and the message was pretty clear: if you get complacent in this business, start thinking you're too smart to be challenged, someone else is going to be working in your office faster than you ever thought possible.



[flagged]



Then don't link to an "About Me" page [1] that says you do? How is confusion on that subject any reader or commenter's fault?

I don't care if you personally work at Google or not, Google got itself in quite a jam as concerns public perception of their product in particular and the AI topic in general by going overboard with over-alignment, everyone knows that so one assumes that insiders know it, which is one of a great many examples of how strongly-forced models are a real problem for arbitrarily prestigious insider-laden labs.

Framing the debate about whether large, proprietary models are over-aligned or mis-aligned as an acid test for whether or not someone is worth paying attention to is a really weird hill to stand on.

[1] https://www.jpohhhh.com/about



[flagged]



I've been around here a pretty long time, but I could still be off base here: as far as I understood people generally posted links to their own blog [1] in their HN profile because they want people to read them? I read your blog and particularly the posts about Gigadiffusion because I wanted to reply from a position of having put some effort into understanding where the poster I was replying to was coming from before popping off with what could be taken as a criticism. If that offends you or creeps you out I'm more than happy to steer clear of it with the parting remark that I really like Material and had hoped that any follow up would give me the opportunity to compliment you on some nice work.

If that's not your blog, you should probably take it off your profile?

[1] https://www.jpohhhh.com/



The 1% who actually work on AI don't use terms as generic as "AI". Way to reveal yourself as a college undergrad who read a couple of popular science books, downloaded MNIST data and thinks they're an "expert".


The safety crap makes the tools unusable. I used to have a test for it that I thought was decent, but Claude failed that test and it is way better than ChatGPT-4 for code, which means my test was bogus. The people actually working in AI are kind of irrelevant to me. It's whether or not the model will solve problems for me reliably.

People "actually working in AI" have all sorts of nonsense takes.



Another day, another fairly good comment going grey on an AI #1. The over-alignment is really starting to be the dominant term in model utility, Opus and even Sonnet are both subjectively and on certain coding metrics outperforming both the 1106-preview and 0125-preview on many coding tasks, and we are seeing an ever-escalating set of kinda ridiculous hot takes from people with the credentials to know better.

Please stop karma bombing comments saying reasonable things on important topics. The parent is maybe a little spicy, but the GP bought a ticket to that and plenty more.

edit: fixed typo.



> The safety crap makes the tools unusable

For you that may be the case.

But the widespread popularity of ChatGPT and similar models shows that it isn't a serious impediment to adoption. And erring on the side of safety comes with significant benefits e.g. less negative media coverage, investigations by regulators etc.



Seems like marketing and brand recognition might be some confounding variables when asserting ChatGPT's dominance is due to technical and performance superiority.


lol, okay


[flagged]



(not sure you're going to edit again, but in the current one, I'm not sure what Google's silly stock image warning has to do with anything, and I have generally chosen to avoid engaging people doing their politics hobby via AI discussion, since it became okay to across the ideological spectrum of my peers. So, mu is my answer.)

And you're right, I was really surprised to see the harder right people throwing up their hands after the Gemini stuff.



[flagged]



Wouldn't have even noticed had you not pointed it out.


Feel free to explain! You caught my attention now, I'm very curious why it's on topic. Preregistering MD5 of my guess: 7bfcce475114d7696cd1d6a67756761a


[flagged]



No I didn't, at least, I don't think it did but it does sound exactly like me. But then again, I don't know what it'd have to do with anything you said specifically.

https://pastebin.com/yfUWZMmc, idk if it's right because you kinda just went for more free association.



Curious why you're so dismissive of something that's pretty important?


Can someone explain why the weights are posted via a Bittorrent magnet link? I have no way to check the size at the moment, but isn't that a bit unusual? There's also only 21 seeders right now according to https://checker.openwebtorrent.com/




Because Bittorrent is an outstanding tech for delivering large files; the more I think about it, the more I'm surprised it isn't taken advantage of more.


It may become a tradition since weights are so large. Perhaps it started when the Llama torrent link leaked. Then, Mistral decided to release their weights using bittorrent.


I'm not sure why you wouldn't tbh. That's a lot of bandwidth.


Spreads the burden/cost of distributing a 300+GB file.


Distributing 300GB via torrent is cheaper than direct, assuming even a few other people seed


Mistral did it too when they released their first open model. They just posted a magnet link on Twitter.


> Can someone explain why the weights are posted via a Bittorrent magnet link?

I think the best way to get an answer to that question is to try to host it yourself and see what happens.



My optimistic explanation is we are going back to the 2000s internet, but probably we are not.


How else could/should it be done?


I would have assumed they could just upload it to Github. If it has restrictions on file size I'm sure they could make multiple part compressed files.

Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.



GitHub have a soft repository size limit of 5GB, documented here: https://docs.github.com/en/repositories/working-with-files/m...

Soft size limit means "If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action." - I know people who have received such emails.

Most model releases happen through Hugging Face which does not have such a size limit.



They'd probably just charge you for it. They sell "data packs" for LFS.

https://docs.github.com/billing/managing-billing-for-git-lar...



I'd bet Hugging Face would be happy to have hosted these canonically too, so not sure why that doesn't happen more.


The model is also at https://huggingface.co/xai-org


The great thing about torrents is that you (or anyone else who cares) can single-handedly solve the problem you're complaining about by seeding the torrent.


No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git. Git is version management software for code. I often see repos with images and even videos checked in. Please don't; there are so many far better and more performant solutions out there.

The other approach would be to use AWS S3 or other cloud providers, which would cost them money every time someone downloads their code, something they shouldn't have to pay for when they are releasing it for free. Torrents seem like the only good solution, unless someone hosts this on the cloud for free for everyone.



Huggingface will disagree with impossible as their models are available via git, sometimes broken up in pth files.

Still, as far as sentiment goes, yeah git for model weights is an impedance mismatch for sure!



> No git would be impossible. I’ve never seen a repo even a few GB in size, if you are uploading non code files you really should not be using git

It's not actually a limitation in git itself, especially if you use Git LFS. People use Git for Unreal projects and big ones can be half a terabyte or more in size.



Scott Chacon (github cofounder) mentioned in a recent talk that the Windows repo is 300GB https://youtu.be/aolI_Rz0ZqY?si=MOo2eS6dsKKAxmsP


Others have pointed out that GitHub doesn't allow that, but

> Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.

So too can web links, especially when they are 300 GB and egressing out of AWS at $0.09/GB or worse (in non-US regions). Each full download would cost $27 at that rate. 10,000 downloads would cost $270,000.

Sure you could go for something with a better cost model like R2, but you can't beat using one or two unmetered connections on a VPN to constantly seed on Bittorrent, your pricing would be effectively free and reliability would be higher than if you just exposed a HTTP server on the Internet in such a way.
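
Spelling out that arithmetic (using the $0.09/GB egress rate assumed above):

  # Back-of-the-envelope egress cost for serving the weights over plain HTTP,
  # using the numbers assumed in the comment above.
  size_gb = 300          # approximate size of the weights
  egress_per_gb = 0.09   # assumed AWS egress rate, USD/GB
  downloads = 10_000

  per_download = size_gb * egress_per_gb
  total = per_download * downloads
  print(f"${per_download:.0f} per download, ${total:,.0f} for {downloads:,} downloads")
  # -> $27 per download, $270,000 for 10,000 downloads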



> and egressing out of AWS at $0.09/GB

There's a lot of seeders on the torrent that are actually AWS ips too, all with similar configurations which makes me believe that it's probably xAI running them

> on a VPN

That's unnecessary, you don't need a VPN?



No you don't, but if you wanted to host it from your gigabit office IP, you probably would want to.


Why?


GitHub may choose to throttle downloads or remove the files simply because they're taking up too much bandwidth.

A torrent is less likely to go down in the short term.



This is not some crappy DVD rip on The Pirate Bay. It will be seeded as long as it's relevant.

Twitter/X has their own massive infrastructure and bandwidth to seed this indefinitely.



Yeah, they can just leave some server running somewhere and just let it seed forever


It's likely over 100GB of data, so I wouldn't say it's necessarily unusual to spread out the bandwidth across multiple hosts.


Thanks! I searched and searched for a tool that would show me info via the web about a magnet link but nada


Why not? Mistral was first to do it, it has become tradition.


BitTorrent is just an objectively superior method of delivering a lot of data to a lot of people.


I believe it was Llama 1 that notoriously got leaked with a torrent on 4chan.


I don't understand why you're being downvoted for asking a legitimate question. People not familiar with model weights might be surprised that they are often in tens of gigabytes and in this case even more.


Is this the first major model to be natively FP8? I was wondering why people hadn't done it yet. Seems like a big win when hardware supports it.


No, e.g. Yi-34B.


As far as I can tell Yi-34B is natively 16 bit float, the 8 bit version is quantized. https://huggingface.co/01-ai/Yi-34B#quantization


blog post: https://x.ai/blog/grok-os

  * 314B parameters (86B active at a time)
  * mixture of experts 8 (2 active at a time)
  * weights and architecture licensed under Apache 2.0
(edit:) announcement blog post from last year with benchmarks compared to Claude 2, GPT-3.5 and GPT-4: https://x.ai/blog/grok

(edit2:)TL;DR: somewhat comparable to GPT-3.5, Mixtral and Qwen-1.5-72B in capability but way larger than the open weight models



Is a model so huge that’s only at the level of GPT 3.5 actually good? That seems incredibly inefficient to me.


OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side-project by a cowboy CEO. Presuming that the benchmarks and the general "it's about the level of 3.5" sentiment are accurate, it's inefficient, but not incredibly inefficient imho.


xAI is a separate entity, and not a X/Twitter subsidiary.


> Twitter is valued at 40 billion

WAS valued at 44B.

Now?

Maybe 5 billion.



LOL @ $5 billion, but if that was the valuation, you'd be making the parent's point stronger.


Last I heard they lost 15% of their users, so let's call it 36 billion.


They weren't even worth 44B when Elon took the keys - he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth like 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.




twitter was valued around 30 billion when musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price)


Twitter didn't have direct competitors other than Mastodon when it was taken at 44B. Now there's Threads, Bluesky and bigger Mastodon.


Honestly, none of those look like meaningful competitors at the moment.


None of these matter


It’s designed to be actively searching real-time posts on X. Apples and oranges.


Why is that relevant to the size?

Post search on X is done as it is with any other data from any other source: you use RAG and function calling to insert the context.

Not related to the size, if I'm not mistaken.



Isn't that... the same thing as search?


The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant, https://twitter.com/qdrant_engine/status/1721097971830260030.

Nothing about using data in "real time" dictates that the model parameters need to be this large, and it is likely quite inefficient for their "non-woke" instructional use-case.
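
For anyone unfamiliar with the pattern, this is the generic shape of such a retrieve-then-prompt pipeline; the collection name and the embed()/ask_llm() helpers are hypothetical placeholders, and this is not xAI's actual code:

  # Generic retrieve-then-prompt (RAG) sketch with a qdrant vector store.
  # Collection name, embed() and ask_llm() are hypothetical placeholders;
  # this illustrates the pattern, not xAI's production pipeline.
  from qdrant_client import QdrantClient

  client = QdrantClient(url="http://localhost:6333")

  def embed(text: str) -> list[float]:
      raise NotImplementedError("use your embedding model of choice")

  def ask_llm(prompt: str) -> str:
      raise NotImplementedError("call the LLM here")

  def answer_with_recent_posts(question: str) -> str:
      hits = client.search(
          collection_name="tweets",          # hypothetical collection of post embeddings
          query_vector=embed(question),
          limit=5,
      )
      context = "\n".join(hit.payload["text"] for hit in hits)
      return ask_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")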



According to their benchmarks it is superior to GPT-3.5


Since it is MoE, quantized it could be able to run on cheaper hardware with just consumer networking in between, instead of needing epyc/xeon levels of PCI-e lanes, nvlink, or infiniband type networking. Or it could even run with people pooling smaller systems over slow internet links.


I love the citation for image in the article

> The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.



Mixtral is also comparable to gpt 3.5 and open.

At 8x7B it's also a fraction of the size. Are there any benchmarks comparing Mixtral to Grok?



Mixtral announcement is here: https://mistral.ai/news/mixtral-of-experts/

Mixtral looks more economical @ capability to size (similar also for Qwen 1.5 72b)



How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?


You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.


Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.

There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that lead to a poor quality result.

Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.



While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So for someone to see the success of GPT4, decide they want to invest that amount of money, raise the money to train the model, find someone competent to supervise the training, train the model for several months, then test and integrate it could easily take a year even if there was no secret sauce.


There's still no model of equivalent quality to GPT-4.


Claude 3 Opus is reporting superior metrics, particularly in its coding ability, and in the LLM Arena it is statistically tied with GPT-4.


Claude opus is better in my experience


Don’t overlook the training data (used for both training and instruction fine-tuning), it is one of the most crucial aspects, if not the most critical, given the significant differences observed in models with similar architectures.


That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.


LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.

Most of the competitors have lineage straight back to OpenAI, eg the lead of x.ai was previously at OpenAI and Deepmind. Likewise with Mistral and especially Anthropic.



OpenAI still seems to be at the top, except for Anthropic, who may be close, comparing the capabilities of gpt-4 and claude-opus.

This Grok-1 is a large model (~314B), which matches gpt-3.5 released 2 years ago, and is at about the same level as much smaller models like mixtral (~47B) and qwen-1.5 (~72B). Do you think it's competitive?



Egg of Columbus.

Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.



When will we reach an upper limit/diminishing returns in terms of number of parameters and mixture of experts?


We may have already - data is more important than anything else which is why nobody has beat GPT4 yet. Throwing more parameters or more compute at the problem only gets you so far. But Grok was never a contender so there is room to improve on it. It is one of the biggest models open sourced as mentioned, so will be interesting to take a look at for sure.


Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.


Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.

[0] https://chat.lmsys.org/



A number of AI companies have a naming/reproducibility issue.

GPT4 Turbo, released last November, is a separate version that is much better than GPT-4 (winning 70% of human preferences in blind tests), released in March 2023.

Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.

In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.



Chatbot Arena is not a blind ranking.

Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.

On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.

I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.



I don't know if Claude is "smarter" in any significant way. But it's harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.


It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.


I like some of Claude's answers better, but it doesn't seem to be a better coder imo.


I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949

How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.



> according to your personal prompting style though

I like the notion of someone’s personal prompting style (seems like a proxy for those that can prepare a question with context about the other’s knowledge) - that’s interesting for these systems in future job interviews



What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.


Almost impossible to describe prompting style, but here are some examples of how I've used Claude 3:

https://gist.github.com/simonw/4cecde4a729f4da0b5059b50c8e01... - writing a Python function

https://gist.github.com/simonw/408fcf28e9fc6bb2233aae694f8cd... - most sophisticated example, building a JavaScript command palette

https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83... - asking it to refactor some existing code

I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.



> I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.

It's very contextually dependent. You really have to test things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.



Super helpful! Thanks!


I didn't know people were still doing this "act as etc etc" instructional prompting.

I just tell it my coding problem. Or when making something from scratch, ask for small things and incrementally add.



I've found it significantly better than GPT4 for code and it's become my go-to for coding.

That's actually saying something, because there's also serious drawbacks.

- Feels a little slower. Might just be UI

- I have a lot of experience prompting GPT4

- I don't like using it for non-code because it gives me too much "safety" pushback

- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently

I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.



citation needed (other than 'vibes')


I think Groq is something else?


Indeed, Groq is a company building inference accelerators. Grok is completely unaffiliated.


Edited, I did mean the Grok in the article not the inference chip.


There is no reason to believe GPT-4 had more (or higher quality) data than Google etc. has now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could acquire similar quality by now, and yet no one has produced a model of similar quality in a year.


The more disregard a company has for intellectual property rights, the more data they can use.

Google had far more to lose from a "copyright? lol" approach than OpenAI did.



> Google had far more to lose from a "copyright? lol" approach than OpenAI did.

The company that scrapes trillions of web pages has an issue with copyright?



I was under the impression training was at best an undefined area of IP law. Is there any aspect of copyright that prohibits training models?


This is being tested by a number of lawsuits right now, most notably the NY Times one: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.



I don’t think the New York Times thing is that much about training, than it is about the fact that ChatGPT can use Bing and Bing has access to New York Times articles for search purposes.


If you read the lawsuit it's absolutely about training. The Bing RAG piece is one of the complaints in there but it's by no means the most important.

Take a look at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... - bullet points 2 and 4 on pages 2/3 are about training data. Bullet point 5 is the Bing RAG thing.



Ah, thanks!


Having used both Google's and OpenAI's models, the kind of issue they have are different. Google's models are superior or at least on par in knowledge. It's the instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason of this.


Claude > GPT4. Anyone using these models on a daily basis knows this


I use these models regularly, and Claude is dumb as a rock compared to GPT-4.


It is known


Is there a model card anywhere? I'd like to know what it was trained on.


Well, he delivered.


Partially. Open weights is not open source.


In machine learning models the term open source has been largely accepted to mean sharing weights and, if necessary, inference code. You can argue if this is an abuse of the term but everyone does it, and saying someone didn’t deliver if they used it and published weights would probably mean saying the same about mistral, meta, etc.


Yes. So say the same thing about them. Open source has a definition, and abusing it hurts all of us except the billionaires.


I get the "open source" argument, but what is the issue here?

If you are able to reproduce the thing in its entirety and you're given no restrictions on its use, it seems compatible with the spirit of open sourcing things.



The architecture of the model is open source. Not just the weights. You can run the entire thing locally.


How hard would it be for an open source group to fine tune this into a chatbot?


What are the languages supported by it?


Tweets.


How are people's experience with this model? Having the most weights is one thing but being a better model than the 70B models is another.


I use grok all the time to find tweets or ask about trends on Twitter. For that it's better than what used to exist. But it's not a great model outside that narrow use case.


tbh, I've never seen anyone share anything interesting produced by Grok. I see plenty of posts on X and reddit of people sharing amazing things that GPT-4 and now Claude 3 Opus can do. Grok can roast people. That's pretty much all I've seen.

I'd love to be proven wrong if someone cares to share something interesting produced by Grok.



The only other Repository is a fork of Qdrant.


Hey, asking any experts here, what are their first thoughts in the significance of this?

IE, is this comparable to any other model released, or are there significant metric differences that make it better for certain usecases?

The only thing I see, off the top of my head, is that it is a very large model, and I don't think any models of similar size have been released.



Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.

I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.

But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!



If I may ask, how do you load such big models? 300gb seems like a lot to play around with.


You're right, this model is going to be too big for most people to play around with. But to answer your question I have a 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (fwiw, I probably wont try this one)

In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.



Thanks a lot for the hint :)! It's awesome that it might run even on a MacBook; actually this is a reason to switch to Mac. It seems there is nothing similar for a PC laptop with Linux or Windows.


No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.

Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!



A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.


MoE doesn’t really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B activations per inference run.


Each token generated may only use a subset of the parameters (86billion instead of 314billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.


>In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.

How quickly are new models available through Ollama?



Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).


Few days max.


Tests are not out yet, but:

- It's very large, yes.

- It's a base model, so it's not really practical to use without further finetuning.

- Based on Grok-1 API performance (which itself is probably a finetune) its... not great at all.



seems like a large undertrained model, not that exciting imo compared to mixtral

It is also not the biggest OSS model; Switch Transformer was released years ago and is larger and similarly undertrained.



> Due to the large size of the model (314B parameters), a machine with enough GPU memory is required to test the model with the example code

What type of machine do you need to play around with this?



Probably a machine with about 628 GB of GPU memory. (2 bytes per parameter)

So 8xH100 (80Gb each) should do it.



'Chunky beast, needs 320 Gb VRAM likely 4 bit, likely is being run 8 bit on 8 x 80 Gb GPUs.'

-Emad



A single 192GB M2 Mac using a 4-bit quant would work.


One subtle thing: Musk said "open-source", we got "open-weights" instead (still better than nothing though, so it's greatly appreciated).


This is the weights and the model under Apache 2.0 license. What do you mean by open-source?

https://github.com/xai-org/grok/blob/main/model.py

https://github.com/xai-org/grok/blob/main/run.py#L25



Still better than most of the "open weights" models that have massively restrictive terms.


He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.


The “source” in “open source” refers to source code which they released. A dataset is not source code, if anyone is misusing the term it’s you.


I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.

I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.

We know from interpretability research that the weights encode algorithms, e.g. sin approximation etc. So they feel like binary programs to me.



If you can't rebuild it, then how can you be considered to have the "source code" ?

The training data isn't a dataset used at runtime - it's basically the source code to the weights.

Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".





Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?


It's not a dumb question, and the answer is "yes".


A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no-one has created a truly convincing LLM exclusively trained on public domain data. It might happen soon though - there are some convincing effort in progress.


Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.


Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.


You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, rather when the cat gets out of the bag and the legal battle starts. The problem is that all the copyright applies to the expression not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here but it depends on what judge gets it and how high it goes.



> …I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.



https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.


Agreed. It's ridiculous people have to resort to saying their question is dumb to avoid being attacked by toxic commenters.


If you release that instead of the binary weights you can be both more open and less useful for users. Fun


Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.


Maybe it should be called something else? "Openly-licensed"?

Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).



Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.


The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:

https://opensource.org/blog/open-source-ai-definition-weekly...



Yes, training and evaluation code, i.e., the code used to generate the weights.


Yeah musk said “all design and engineering for the original roadster is now open source” and actually what we got was a few PCB files and zero mechanical design files so I don’t ever trust what he says.

