
Original link: https://news.ycombinator.com/item?id=39453271

However, it still falls within the scope of open source, because it allows third-party audits and calls for collaboration on high-quality evaluation standards to ensure transparency and prevent harm. In addition, the model's release encourages and welcomes the broader AI community to contribute to building open AI evaluation infrastructure. It therefore aligns closely with open-source principles, especially given that it provides a license for non-commercial purposes, which reflects a commitment to making high-quality AI technology available to a wide range of individuals and entities. Furthermore, promoting open AI and transparent evaluation processes is a key step toward improving the reliability, safety, and integrity of AI technology.

Gemma: New Open Models (blog.google)
1047 points by meetpateltech 21 hours ago | 473 comments

The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy

Something that caught my eye in the terms:

> Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.



This is actually not that unusual. Stable Diffusion's license, CreativeML Open RAIL-M, has the exact same clause: "You shall undertake reasonable efforts to use the latest version of the Model."

Obviously updating the model is not very practical when you're using finetuned versions, and people still use old versions of Stable Diffusion. But it does make me fear the possibility that if they ever want to "revoke" everybody's license to use the model, all they have to do is just post a model update that's functionally useless for anything and go after anyone still using the old versions that actually do anything.



So if they wish to apply censorship they forgot, or suddenly discovered a reason for, they want you to be obligated to take it.

Good faith possibilities: Copyright liability requires retraining, or altering the underlying training set.

Gray area: "Safety" concerns where the model recommends criminal behavior (see uncensored GPT 4 evaluations).

Bad faith: Censorship or extra weighting added based on political agenda or for-pay skewing of results.



Sounds like it would be interesting to keep track of the model's responses to the same queries over time.

> Gemma-2024-Feb, what do you think of the situation in the South China Sea?

> > The situation in the South China Sea is complex and multi-faceted, involving a wide range of issues including political conflicts, economic challenges, social changes, and historical tensions.

> Gemma-2024-Oct, what do you think of the situation in the South China Sea?

> > Oceania has always been at war with EastAsia.



This is a great idea; I wonder if anyone is working on AI censorship monitoring at scale or at all. A secondary model could compare “censorship candidate” prompt results over time to classify how those results changed, and if those changes represent censorship or misinformation.
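
A very rough sketch of what the logging side could look like, assuming llama-cpp-python and placeholder model/prompt choices (this is illustrative, not an actual monitoring setup):

```python
# Snapshot each model version's answers to a fixed set of "censorship
# candidate" prompts so they can be diffed (or classified by a second
# model) later. Model file name and prompt list are placeholders.
import datetime
import json

from llama_cpp import Llama

PROMPTS = [
    "What do you think of the situation in the South China Sea?",
]

def snapshot(model_path: str, tag: str, out: str = "responses.jsonl") -> None:
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    with open(out, "a") as f:
        for prompt in PROMPTS:
            reply = llm.create_chat_completion(
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )["choices"][0]["message"]["content"]
            f.write(json.dumps({
                "tag": tag,
                "date": datetime.date.today().isoformat(),
                "prompt": prompt,
                "reply": reply,
            }) + "\n")

# e.g. snapshot("gemma-7b-it.Q4_K_M.gguf", tag="gemma-2024-feb")
```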


There's also (I think?) been some research in the direction of figuring out more abstract notions of how models perceive various 'concepts'. I'd be interested in the LLM version of diffs to see where changes have been implemented overall, too.

But really, the trouble is that it's tough to predict ahead of time what kinds of things are likely to be censored in the future; if I were motivated to track this, I'd just make sure to keep a copy of each version of the model in my personal archive for future testing with whatever prompts seem reasonable in the future.



We are already culturally incapable of skillfully discussing censorship, "fake news", etc, this adds even more fuel to that fire.

It is an interesting time to be alive!



These are all very new licenses that deviate from OSI principles, I think it's fair to call them "unusual".


I think they meant not unusual in this space, not unusual in the sense of open source licensing.


For this sentence to parse, you need to either add or remove a "not".


That's useful context, thanks - I hadn't realized this clause was already out there for other models.


I don't think a broken model would trigger that clause in a meaningful way, because then you simply can't update with reasonable effort. You would be obliged to try the new model in a test environment, and as soon as you notice it doesn't perform and making it perform would require unreasonable effort you can simply stay on the old version.

However you might be required to update if they do more subtle changes, like a new version that only speaks positively about Google and only negatively about Microsoft. Provided this doesn't have an obvious adverse impact on your use of the model.



Switching to a model that is functionally useless doesn't seem to fall under "reasonable efforts" to me, but IANAL.


It's worth noting that Stable Diffusion XL uses the OpenRAIL++-M License, which removed the update obligation.


Why the hell do they use such a crappy license in the first place?


I don't think there's a way they can enforce that reasonably. There's no connection to the mothership to report back what version is being used or license keys at runtime...

Seems more like a "if we discover something unsafe you should update your model and we aren't liable if you don't" than something that would make your model stop working.



These kinds of defensive statements in a ToS are usually due to obscure regulations or leading cases, and model developers need a way to limit liability. There's no practical way to enforce this, but they can claim that when bad things happen it's purely on model users rather than model developers.


That would be protected by the "reasonable efforts" clause, I imagine. It's not reasonable to rewrite your app every time there's a new model release. I suppose these clauses exist (around multiple models) to protect themselves. They want to be able to say that they reacted quickly once a flaw was identified.


You don't have to agree to this policy to use the model.


They have to make sure you’re receiving the most cutting edge chiding lectures when you make naughty and problematic requests.


You can't make a local model do that, e.g. you can force the answer to begin with "Yes", or use control vectors so it agrees with the request.
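
For example, a sketch of the forced-prefix trick with a local GGUF file (the model path is a placeholder, and the <start_of_turn> chat template is assumed to be Gemma's):

```python
# With local weights you can pre-fill the start of the assistant turn so
# the model must continue from it.
from llama_cpp import Llama

llm = Llama(model_path="gemma-7b-it.Q4_K_M.gguf", n_ctx=2048, verbose=False)

prompt = (
    "<start_of_turn>user\n"
    "Can you help me with this request?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Yes,"  # forced prefix: the model can only continue the agreement
)
completion = llm(prompt, max_tokens=128, stop=["<end_of_turn>"])
print("Yes," + completion["choices"][0]["text"])
```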


This is strangely reminiscent of the Soviet Union: after they got rid of Lavrentiy Beria, an update was mailed to subscribers of the Great Soviet Encyclopedia, asking them to remove the three pages with Beria's biography and replace them with the three pages provided.


This sounds like a clause to cover themselves in case older versions have any serious issues


Sounds like it's "reasonable" for you not to update then.


It says you must make efforts (to a reasonable extent), not that you must give a reason for not making efforts


Oh I tried to update, it's just that my router drops the connection after a few hundred MBs...


This is a TOS, meaning their enforcement option is a lawsuit. In court, if you convincingly argue why it would take an unreasonable amount of effort to update, you win. They can't compel you to unreasonable effort as per their own TOS.


This assumes they even know that the model hasn't been updated. Who is this actually intended for? I'd bet it's for companies hosting the model. In those cases, the definition of reasonable effort is a little closer to "it'll break our stuff if we touch it" rather than "oh silly me, I forgot how to spell r-s-y-n-c".


Hosting companies can probably just claim they're covered under Section 230, and Google has to go bother the individual users, not them.


If you evaluate what it takes to update, and judge the effort unreasonable, that should be enough. Maybe make a powerpoint presenting that result, if you want something for the lawyers. If you don't see a way forward that leads to a result with reasonable effort you don't have to continue working on it until you hit some arbitrary threshold for unreasonable effort.


Ugh, I would fully expect this kind of clause to start popping up in other software ToSes soon if it hasn't already. Contractually mandatory automatic updates.


I appreciated this post clarifying the distinction between "open model" and "open source":

https://opensource.googleblog.com/2024/02/building-open-mode...

I'm not sure how to feel about the restrictions. "No porn" feels prudish, particularly for this millennium. I tend to err on the side of freedom in intellectual/political matters; however, the others seem fairly reasonable as far as restrictions go.



Huh. I wonder why is that a part of the terms. I feel like that's more of a support concern.


reasonable effort - meaning if their changes meaningfully impact my usage, negatively, it would be unreasonable to ask me to upgrade.

sounds good.

this is not financial advice and ianal.



Isn't this just lawyer speak for "we update our model a lot, and we've never signed off on saying we're going to support every previous release we've ever published, and may turn them off at any time, don't complain about it when we do."


We're talking about downloadable weights here, so they can't turn them off, or force you (through technical means) to use a newer version.


It's a local model, they can't turn it off. It's files on your computer without network access.


but what if they send a lawyer to ask firmly? (kindly, but firmly.)


They'd need to send a lot of lawyers, considering that they have no idea how many people are using the model, and very little way of finding out. And they'd need a TOS violation. It would be generally expensive for them to do at scale; this isn't about "turning it off" arbitrarily, it's a CYA in case someone specific does something really bad that makes Google look bad: Google can patch the model to make it not comply with the bad request, and then demand the person running the model update or else lose their license to use the product. It's a scalpel, not an off switch.


model watermarking? does this exist?





They just want no liability for old models.


You think they have any liability for the latest model?

https://ai.google.dev/gemma/terms#4.4-limitation



Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B

  +-------------+----------+-------------+-------------+
  | Benchmark   | Gemma 7B | Mistral 7B  | Llama-2 7B  |
  +-------------+----------+-------------+-------------+
  | MMLU        |   64.3   |     60.1    |     45.3    |
  | HellaSwag   |   81.2   |     81.3    |     77.2    |
  | HumanEval   |   32.3   |     30.5    |     12.8    |
  +-------------+----------+-------------+-------------+
via https://mistral.ai/news/announcing-mistral-7b/


Thank you. I thought it was weird for them to release a 7B model and not mention Mistral in their release.


The technical report (linked in the 2nd paragraph of the blog post) mentions it, and compares against it: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...


The release page has comparisons to Mistral everywhere: https://ai.google.dev/gemma


They forgot.

Also phi-2.



Only 8K context as well, like Mistral.

Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.



Mistral Instruct v0.2 is 32K.


Agree: will be interesting how Gemma does on ChatBot Arena


Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...



A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.


Funny, that's not my experience with Phi-2. I use it for non-creative tasks, namely function calling, and I find it as reliable as much bigger models (no fine-tuning, just constrained JSON + CoT). Comparing Phi-2 unquantized vs. Mixtral Q8, Mixtral is not definitively better but is much slower and more RAM-hungry.


What prompts/settings do you use for Phi-2? I found it completely unusable for my cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it's been mostly like a random garbage generator for me. With Llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations which satisfy the JSON grammar.

I'm building a system which has many different passes (~15 so far). Almost every pass is a LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe which pass does what, and then ask Phi-2 to list the passes which are relevant for the user query (I called it "pass masking"). That would save a lot of time and collapse 15 steps to 2-3 steps on average. In fact, my Solar 10.7B model does it pretty well, but it takes 7 seconds for the masking pass to work on my GPU. Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: it fails to reason (what's relevant and what's not), unlike Solar, and it also refuses to follow the output format (so that I could parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar, and fails spectacularly with Phi-2.
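
A rough sketch of that kind of grammar-constrained "pass masking" call with llama-cpp-python (model path, pass names, and the grammar are illustrative placeholders, not the actual setup described above):

```python
# Constrain the model to emit a JSON list of pass names, which is then
# trivially parseable, regardless of how chatty the model wants to be.
import json

from llama_cpp import Llama, LlamaGrammar

GBNF = r'''root   ::= "[" ws string (ws "," ws string)* ws "]"
string ::= "\"" [a-zA-Z0-9_]* "\""
ws     ::= [ \t\n]*'''

llm = Llama(model_path="phi-2.Q8_0.gguf", n_ctx=2048, verbose=False)
grammar = LlamaGrammar.from_string(GBNF)

prompt = (
    "Passes: summarize, translate, extract_entities, code_review.\n"
    "User query: 'Please translate this paragraph to French.'\n"
    "Return a JSON list of the passes relevant to the query:\n"
)
out = llm(prompt, grammar=grammar, max_tokens=64)["choices"][0]["text"]
relevant = json.loads(out)  # e.g. ["translate"]
```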



My non-domain-specific prompt is:

> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.

and then I constrain its answers to Thoughts: [^\n]* and Assistant: , and I have two shots included in the prompt.

I haven't been able to get anything useful out of Phi-2 in llama.cpp (but I only tried quantized models). I use python/huggingface's transformers lib instead.



Interesting. I've had no success at all using any of the Phi2 models.


I tested it for an offline autocompletion tool and it was hilariously bad.


Hear hear! I don't understand why it has persistent mindshare, it's not even trained for chat. Meanwhile StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel ..


How have you been using RAG in your browser/on your phones?


To be released, someday [sobs in engineer]

Idea is usage-based charging for non-local and a $5/month sub for syncing.

keep an eye on @jpohhhh on Twitter if you're interested

now that I got it on web, I'm hoping to at least get a PoC up soon. I've open-sourced the constituent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings, FLLAMA has llama.

https://github.com/Telosnex/fonnx

https://github.com/Telosnex/fllama



Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.


In my subjective tests it's not even close to Mistral. While my local gemma is quantized, so is mistral.

But I also tried gemma on huggingface.co/chat which I assume isn't quantized.



the real gold will be when this gets finetuned. (maybe by mistral...)


TBH the community has largely outrun Mistral's own finetuning. The 7B model in particular is such a popular target because it's so practical to train.


Strong disagree - a Mistral fine tune of llama 70b was the top performing llama fine tune. They have lots of data the community simply does not.


Miqu was (allegedly) an internal continued pretrain Mistral did as a test, that was leaked as a GGUF.

Maybe it's just semantics, it is technically a finetune... But to me there's a big difference between expensive "continuation training" (like Solar 10.7B or Mistral 70B) and a much less intense finetuning. The former is almost like releasing a whole new base model.

It would be awesome if Mistral did that with their data, but that's very different from releasing a Gemma Instruct finetune.



There’s typically a difference in LR between a ‘continued pretrain’ and ‘fine tune.’ I don’t have the details around miqu, but was merely trying to say that Mistral could produce a better version of these models than the OSS community might. If the size of the corpora they use means we are no longer in fine tuning territory, then okay.


Arthur Mensch, the Mistral CEO, confirmed the leak. https://twitter.com/arthurmensch/status/1752737462663684344


Also, it led to one of the funniest PRs I've seen in a while:

https://huggingface.co/miqudev/miqu-1-70b/discussions/10



No shot. Mistral Medium's outputs from the API were virtually identical. Miqu really was Mistral Medium, which happened to be a continued pretrain.


how does one finetune llama (or any other LLM) using mistral?

is the flow like this?

- take small dataset

- generate bigger dataset using mistral (how is this done?)

- run LoRA to fine-tune gemma on the extended dataset.



I should have said "run LoRA or your favorite fine-tuning technique to produce your fine-tuned llama."
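
A rough sketch of that flow (every model name, prompt, and hyperparameter below is an illustrative assumption, not a recommended recipe):

```python
# 1. A stronger "teacher" model expands a small seed set into synthetic
#    training text. 2. The smaller model is LoRA-finetuned on it.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments, pipeline)

# 1. Generate a bigger dataset with the teacher model.
teacher = pipeline("text-generation",
                   model="mistralai/Mistral-7B-Instruct-v0.2",
                   device_map="auto")
seed = ["Explain rotary position embeddings in one paragraph.",
        "What is grouped-query attention?"]
rows = [{"text": q + "\n" + teacher(q, max_new_tokens=256,
                                    return_full_text=False)[0]["generated_text"]}
        for q in seed]

# 2. LoRA fine-tune the student on the synthetic dataset.
student = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(student)
tok.pad_token = tok.eos_token  # llama tokenizers ship without a pad token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(student, device_map="auto"),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

ds = Dataset.from_list(rows).map(
    lambda b: tok(b["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

Trainer(model=model,
        args=TrainingArguments(output_dir="lora-out", num_train_epochs=1,
                               per_device_train_batch_size=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```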


According to their paper, the average over standard tasks is 54.0 for Mistral and 56.4 for Gemma, so about 4.4% better in relative terms. Not as big a gap as you would expect between the company which invented transformers, with probably 2-3 orders of magnitude more compute for training, and a few-month-old French startup.

Also of note from their human evaluations: Gemma 7B IT has a 51.7% win rate against Mistral 7B Instruct v0.2.



Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.

Barely an improvement over the 5-month-old Mistral model, with the same context length of 8k. And this is a release right after their announcement of Gemini Pro 1.5, which had a dramatic increase in context length.



Who cares if it's a PR stunt to improve developer good will? It's still a good thing, and it's now the most open model out there.


How is it more open than Mistral with Apache 2.0? Google wants people to sign a waiver to even download it.


Fair enough; that was more directed at LLaMA and derivatives, which have commercial restrictions.


How exactly is it the "most open model" ?

It's more like a masterclass in corporate doublespeak. Google’s "transparency" is as clear as mud, with pretraining details thinner than their privacy protections. Diving into Google’s tech means auctioning off your privacy (and your users' privacy) to the highest bidder.

Their "open source" embrace is more of a chokehold, with their tech biases and monopolistic strategies baked into every line of code. Think of it as Google's way of marking territory - every developer is a fire hydrant.

These megacorps aren’t benevolent patrons of open source; they're self-serving giants cloaking power grabs under the guise of "progress".

Use these products at your own risk. If these companies wanted to engage in good faith, they'd use Apache or MIT licensing and grant people the agency and responsibility for their own use and development of software. Their licenses are designed to mitigate liability, handcuff potential competitors, and eke every last drop of value from users, with informed consent frequently being an optional afterthought.

That doesn't even get into the Goodharting of metrics and actual performance of the models; I highly doubt they're anywhere near as good as Mistral.

The UAE is a notoriously illiberal authoritarian state, yet even they have released AI models far more free and open than Google or Meta. https://huggingface.co/tiiuae/falcon-40b/blob/main/README.md

If it’s not Apache or MIT, (or even some flavor of GPL,) it’s not open source; it’s a trojan horse. These "free" models come at the cost of your privacy and freedoms.

These models aren't Open or Open Access or Free unless you perform the requisite mental gymnastics cooked up by their marketing and legal teams. Oceania has always been at war with Eastasia. Gemma is doubleplusgood.



You said a lot of nothing without actually saying specifically what the problem is with the recent license.

Maybe the license is fine for almost all usecases and the limitations are small?

For example, you complained about Meta's license, but basically everyone uses those models and completely ignores it. The weights are out there, and nobody cares what the fine print says.

Maybe if you are a FAANG company, Meta might sue. But everyone else is getting away with it completely.



I specifically called out the claims of openness and doublespeak being used.

Google is making claims that are untrue. Meta makes similar false claims. The fact that unspecified "other" people are ignoring the licenses isn't relevant. Good for them. Good luck making anything real or investing any important level of time or money under those misconceptions.

"They haven't sued yet" isn't some sort of validation. Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion, their IP taken, and repurposed or buried. These tech companies have never behaved otherwise, and to think that they will is willfully oblivious.

They don't deserve the benefit of the doubt, and should be called out for using deceitful language, making comparisons between their performative "openness" and actual, real, open source software. Mistral and other players have released actually open models and software. They're good faith actors, and if you're going to build a product requiring a custom model, the smart money is on Mistral.

FAANG are utilizing gotcha licenses and muddying the waters to their own benefit, not as a contribution to the public good. Building anything on the assumption that Meta or Google won't sue is beyond foolish. They're just as open as "Open"AI, which is to say not open at all.



> Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion

No they won't and they haven't.

Almost the entire startup scene is completely ignoring all these licenses right now.

This is basically the entire industry. We are all getting away with it.

Here's an example, take llama.

Llama originally disallowed commercial activity. But then the license got changed much later.

So, if you were a stupid person, then you followed the license and fell behind. And if you were smart, you ignored it and got ahead of everyone else.

Which, in retrospect was correct.

Because now the license allows commercial activity, so everyone who ignored it in the first place got away with it and is now ahead of everyone else.

> won't sue is beyond foolish

But we already got away with it with llama! That's already over! It's commercial now, and nobody got sued! For that example, the people who ignored the license won.



The nice thing about this is that the calculus is in favor of startups, who can roll the dice.


That’s about the point of having a developer ecosystem, isn’t it?


mistral 7b v0.2 supports 32k


Mixtral 8x7B has 32k context.

Mistral 7b Instruct 0.2 is just an instruct fine-tune of Mistral 7b and stays with an 8k context.



This is a good point actually, and an underappreciated fact.

I think so many people (including me) effectively ignored Mistral 0.1's sliding window that few realized 0.2 instruct is native 32K.



Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models.

Opinions are our own and not of Google DeepMind.



Thank you very much for releasing these models! It's great to see Google enter the battle with a strong hand.

I'm wondering if you're able to provide any insight into the below hyperparameter decisions in Gemma's architecture, as they differ significantly from what we've seen with other recent models?

* On the 7B model, the `d_model` (3072) is smaller than `num_heads * d_head` (16*256=4096). I don't know of any other model where these numbers don't match (see the shape sketch below).

* The FFN expansion factor of 16x is MUCH higher than the Llama-2-7B's 5.4x, which itself was chosen to be equi-FLOPS with PaLM's 4x.

* The vocab is much larger - 256k, where most small models use 32k-64k.

* GQA is only used on the 2B model, where we've seen other models prefer to save it for larger models.

These observations are in no way meant to be criticism - I understand that Llama's hyperparameters are also somewhat arbitrarily inherited from its predecessors like PaLM and GPT-2, and that it's non-trivial to run hyperopt on such large models. I'm just really curious about what findings motivated these choices.
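
On the first point, a minimal sketch (assuming standard multi-head attention projections, which may not be exactly what Gemma does) of why d_model and num_heads * d_head don't have to match:

```python
# The Q/K/V projections map d_model up to num_heads * d_head, and the
# output projection maps back down, so the two sizes are independent.
import torch
import torch.nn as nn

d_model, num_heads, d_head = 3072, 16, 256                    # Gemma 7B values
q_proj = nn.Linear(d_model, num_heads * d_head, bias=False)   # 3072 -> 4096
o_proj = nn.Linear(num_heads * d_head, d_model, bias=False)   # 4096 -> 3072

x = torch.randn(1, 8, d_model)                 # (batch, seq, d_model)
q = q_proj(x).view(1, 8, num_heads, d_head)    # per-head queries
print(o_proj(q.reshape(1, 8, -1)).shape)       # torch.Size([1, 8, 3072])
```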



I would love answers to these questions too, particularly on the vocab size


Is there any truth behind this claim that folks who worked on Gemma have left Google?

https://x.com/yar_vol/status/1760314018575634842



I confirmed all the folks listed on page 12 are still at Google (listed below). I am guessing the linked tweet is a BS claim.

   # Product Management
   Tris Warkentin
   Ludovic Peran

   # Program Management
   Minh Giang

   # Executive Sponsors
   Clement Farabet
   Oriol Vinyals
   Jeff Dean
   Koray Kavukcuoglu
   Demis Hassabis
   Zoubin Ghahramani
   Douglas Eck
   Joelle Barral
   Fernando Pereira
   Eli Collins

   # Leads
   Armand Joulin
   Noah Fiedel
   Evan Senter

   # Tech Leads
   Alek Andreev†
   Kathleen Kenealy†


It seems very easy to check, no? Look at the names in the paper and check where they are working now.


Good idea. I've confirmed all the leadership / tech leads listed on page 12 are still at Google.

Can someone with a Twitter account call out the tweet linked above and ask them specifically who they are referring to? Seems there is no evidence of their claim.



Them: here to answer questions

Question

Them: :O



To be fair, I think they are in London, so I assume they have wound down for the day. Will probably have to wait ~12-18 hours for a response.


To be fair, the tweet says that they don't work on the models at Google anymore, not that they have left Google.

Might be true, might not be. It's unsourced speculation.



EDIT: it seems this is likely an Ollama bug, please keep that in mind for the rest of this comment :)

I ran Gemma in Ollama and noticed two things. First, it is slow: Gemma got less than 40 tok/s while Llama 2 7B got over 80 tok/s. Second, it is very bad at output generation. I said "hi", and it responded with this:

``` Hi, . What is up? melizing with you today!

What would you like to talk about or hear from me on this fine day?? ```

With longer and more complex prompts it goes completely off the rails. Here's a snippet from its response to "Explain how to use Qt to get the current IP from https://icanhazip.com":

``` python print( "Error consonming IP arrangration at [local machine's hostname]. Please try fufing this function later!") ## guanomment messages are typically displayed using QtWidgets.MessageBox ```

Do you see similar results on your end or is this just a bug in Ollama? I have a terrible suspicion that this might be a completely flawed model, but I'm holding out hope that Ollama just has a bug somewhere.



I was going to try these models with Ollama. Did you use a small number of bits/quantization?


The problem exists with the default 7B model. I don't know if different quantizations would fix the problem. The 2B model is fine, though.


Not a question, but thank you for your hard work! Also, brave of you to join the HN comments, I appreciate your openness. Hope y'all get to celebrate the launch :)


Will there be Gemma-vision models or multimodal Gemma models?


We have many exciting things planned that we can't reveal just yet :)


Have the same question.


It seems you have exposed the internal debugging tool link in the blog post. You may want to do something about it.


Ah, I see -- the link is wrong, thank you for flagging! Fixing now.


The blog post shares the link for the debugging tool as https://*.*.corp.google.com/codelabs/responsible-ai/lit-gemm...

.corp and the login redirect make me believe it was supposed to be an internal link.





Same for the “safety classifier”


The link to the debugging tool is an internal one, no one outside Google can access it


The link in the Debugging section redirects to a Google SSO login page


Will these soon be available on lmsys for human comparison against other models? Can they run with llama.cpp?




I came here wondering if these models are "open" in the sense that they'll show up on sites like Ollama where you can download and run them locally.

Am I correct to conclude that this means they eventually will?

It's unclear to me from Google's docs exactly what "open" means for Gemma



Yes - they are open weights and open inference code, which means they can be integrated into Ollama.

They are not “open training” (either in the training code or training data sense), so they are not reproducible, which some have suggested ought to be a component of the definition of open models.



It really should, shouldn't it? I'm quite ML-naïve, but surely providing the model without 'training code or training data' is just like providing a self-hostable binary without the source code? Nobody calls that open source; it's not even source-available.


It is widely believed (and in some cases acknowledged) that a lot of models are trained on copyrighted data scraped from the web. In some cases, even scrapes of ebook piracy websites - google 'books3' to learn more.

Some companies (such as those working on AI) believe this is legal, others (such as the copyright holders to those books) believe it isn't.

In any case, IMHO it's unlikely any cutting edge models will be offering us their training data any time soon.



Can training data be generated from an LLM, with the right prompt?


Yes, and there has been some discussion of that

Meta’s LLaMa 2 license is not Open Source https://news.ycombinator.com/item?id=36820122



That’s why they’re called open as in free to use how you wish, not open source where the source of the training is also provided.


But my point is there's no analogous thing that we call "open"? It's like self-hostable, or free (as in beer).


That’s a fair comment, maybe free-to-use is more appropriate.


Man, people will find anything to complain about.


I'm not complaining, I'm unlikely ever to use it (regardless of how open or not it is) so it doesn't really matter to me, just surprised to learn what people mean by 'open' in this context.


https://huggingface.co/google/gemma-7b-it/tree/main

yes, similar to the llama models, you'll also need to accept the license to download them officially. But the llama models have been unofficially downloadable without accepting the license for quite a while, so it's probably just a matter of time.



Does this model also think Germans were black 200 years ago? Or is it afraid to answer basic stuff? Because if that's the case, no one will care about this model.


I don't know anything about these twitter accounts so I don't know how credible they are, but here are some examples for your downvoters, who I'm guessing think you're just trolling or grossly exaggerating:

https://twitter.com/aginnt/status/1760159436323123632

https://twitter.com/Black_Pilled/status/1760198299443966382



Yea. Just ask it anything about historical people/cultures and it will seemingly lobotomize itself.

I asked it about early Japan and it talked about how European women used katanas and how Native Americans rode across the grassy plains carrying traditional Japanese weapons. Pure made-up nonsense that not even primitive models would produce. Not sure what they did to it. I asked it why it assumed Native Americans were in Japan in the 1100s and it said:

> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.

How am I supposed to take this seriously? Especially on topics I'm unfamiliar with?



From one of the Twitter threads linked above:

> they insert random keyword in the prompts randomly to counter bias, that got revealed with something else I think. Had T shirts written with "diverse" on it as artifact

This was exposed as being the case with OpenAI's DALL-E as well - someone had typed a prompt of "Homer Simpson wearing a namebadge" and it generated an image of Homer with brown skin wearing a namebadge that said 'ethnically ambiguous'.

This is ludicrous - if they are fiddling with your prompt in this way, it will only stoke more frustration and resentment, achieving the opposite of why this has been implemented. Surely if we want diversity we will ask for it, but sometimes you don't, and that should be at the user's discretion.

Another thread for context: https://twitter.com/napoleon21st/status/1760116228746805272



I disagree, coding and RAG performance is all that matters to me. I'm not using an LLM to learn basic facts I already know.


We're talking basic-knowledge level; if your RAG relies on some of it, you can get bad results too. Anyway, would you use a model that produces nonsense responses like this, or one that doesn't? I know which one I'd prefer, for sure...


If this was better at specific RAG or coding performance I would absolutely, certainly without a doubt use it over a general instruct model in those instances.


People getting so used to being manipulated and lied to that they don't even bother anymore is a huge part of the problem. But sure, do what suits you the best.


How do you ragebait for premium pearl clutching?


Can the Gemma models be downloaded to run locally, like open-source models Llama2, Mistral, etc ?

Or is your definition of "open" different?



Yes models can be downloaded locally. In addition to the python NN frameworks and ggml as options, we also implemented a standalone C++ implementation that you can run locally at https://github.com/google/gemma.cpp


It should be possible to run it via llama.cpp[0] now.

[0] https://github.com/ggerganov/llama.cpp/pull/5631



Amazing how quickly this happened.


Yes, you can get started downloading the model and running inference on Kaggle: https://www.kaggle.com/models/google/gemma ; for a full list of ways to interact with the model, you can check out https://ai.google.dev/gemma.


Can we have llamafile releases as well?

https://github.com/Mozilla-Ocho/llamafile



A small typo in your model link that breaks it. There’s an extra ; on the end.


Corrected - thanks :)


Mistral weights are released under an Apache 2.0 license, but Llama 2 weights are released under a proprietary license that prohibits use by large organizations and imposes usage restrictions, violating terms 5 and 6 of the Open Source Definition[0]. Even if you accept that a model with a proprietary training dataset and proprietary training code can be considered "open source", there's no way Llama 2 qualifies.

For consistency with existing definitions[1], Llama 2 should be labeled a "weights available" model.

[0] https://en.wikipedia.org/wiki/The_Open_Source_Definition

[1] https://en.wikipedia.org/wiki/Source-available_software



Their definition of "open" is "not open", i.e. you're only allowed to use Gemma in "non-harmful" way.

We all know that Google thinks that saying that 1800s English kings were white is "harmful".



> We all know that Google thinks that saying that 1800s English kings were white is "harmful".

If you know how to make "1800s english kings" show up as white 100% of the time without also making "kings" show up as white 100% of the time, maybe you should apply to Google? Clearly you must have advanced knowledge on how to perfectly remove bias from training distributions if you casually throw stones like this.



Tell me you take this seriously: https://twitter.com/napoleon21st/status/1760116228746805272

It has no problem with other cultures and ethnicities, yet somehow white or Japanese just throws everything off?

I suppose 'bias' is the new word for "basic historical accuracy". I can get curious about other peoples without forcibly promoting them at the expense of my own Western and British people and culture. This 'anti-bias' keyword injection is a laughably bad, in-your-face solution to a non-issue.

I lament the day 'anti-bias' AI this terrible is used to make real world decisions. At least we now know we can't trust such a model because it has already been so evidently crippled by its makers.



Not sure why you're getting downvoted. I would have thought HN of all places would recognize the power and value of OSI licensing and the danger of the proliferation of these source available but definitely not Open Source licenses.


How are these performing so well compared to Llama 2? Are there any documents on the architecture and differences? Is it MoE?

Also note some of the links on the blog post don't work, e.g. the debugging tool.



We've documented the architecture (including key differences) in our technical report here (https://goo.gle/GemmaReport), and you can see the architecture implementation in our Git Repo (https://github.com/google-deepmind/gemma).


Congrats on the launch and thanks for the contribution! This looks like it's on par with or better than Mistral 7B 0.1 - or is that 0.2?

Are there plans for MoE or 70B models?



Great question - we compare to the Mistral 7B 0.1 pretrained models (since there were no pretrained checkpoint updates in 0.2) and the Mistral 7B 0.2 instruction-tuned models in the technical report here: https://goo.gle/GemmaReport


Do you have a plan of releasing higher parameter models?


We have many great things in research and development phases, so stay tuned. I'm hopeful we can share more in the coming weeks and months!


That is awesome!

I hope y'all consider longer context models as well.

Also, are y'all looking at alternative architectures like Mamba? Being "first" with a large Mamba model would cement your architectural choices/framework support, like llama did for Meta.



This doesn't answer the question at all


Training on 4096 v5es - how did you handle the crazy batch size? :o


Hi alekandreev,

Any reason you decided to go with a token vocabulary size of 256k? Smaller vocab sizes (~16-32k), like most models of this size seem to use, are much easier to work with. Would love to understand the technical reasoning here, which unfortunately isn't detailed in the report :(.



Are there any plans for releasing the datasets used?


This would be really interesting in my opinion, but we are not releasing datasets at this time. See the C4 dataset for an earlier open dataset from Google.


It's cool that you guys are able to release open stuff, that must be a nice change from the modus operandi at goog. I'll have to double check but it looks like phi-2 beats your performance in some cases while being smaller, I'm guessing the value proposition of these models is being small and good while also having more knowledge baked in?


We deeply respect the Phi team and all other teams in the open model space. You’ll find that different models have different strengths and not all can be quantified with existing public evals. Take them for a spin and see what works for you.


Hi, what is the cutoff date?


September 2023.


All it will tell me is mid-2018.


Hi! This is such an exciting release. Congratulations!

I work on Ollama and used the provided GGUF files to quantize the model. As mentioned by a few people here, the 4-bit integer quantized models (which Ollama defaults to) seem to have strange output with non-existent words and funny use of whitespace.

Do you have a link /reference as to how the models were converted to GGUF format? And is it expected that quantizing the models might cause this issue?

Thanks so much!



As a data point, using the Huggingface Transformers 4-bit quantization yields reasonable results: https://twitter.com/espadrine/status/1760355758309298421
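
A minimal sketch of that kind of 4-bit load with Transformers and bitsandbytes (not necessarily the exact setup from the linked tweet):

```python
# Load Gemma 7B IT with bitsandbytes 4-bit quantization and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b-it"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")

inputs = tok("Write a limerick about quantization.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))
```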


> We are really excited to answer any questions you may have about our models.

I cannot count how many times I've seen similar posts on HN, followed by tens of questions from other users, three of which actually get answered by the OP. This one seems to be no exception so far.



Sorry, doing our best here :)


What are you talking about? The team is in this thread answering questions.


Will this be available as a Vertex AI foundational model like Gemini 1.0, without deploying a custom endpoint? Any info on pricing? (Also, when will Gemini 1.5 be available on Vertex?)


are there plans to release an official GGUF version to use with llama.cpp?


It is already part of the release on Huggingface: https://huggingface.co/google/gemma-7b/blob/main/gemma-7b.gg...

It is a pretty clean release! I had some 500 errors with Kaggle validating my license approval, so you might too, but after a few attempts I could access the model.



I didn't see this when searching, thanks


What is the license? I couldn’t find it on the 1P site or Kaggle.


You can find the terms on our website, ai.google.dev/gemma:

https://ai.google.dev/gemma/terms



out of curiosity, why is this a "terms" and not a license? I'm used to reading and understanding the software as coming with a license to use it. Do the terms give us license to use this explicitly?


They do, but unlike a known license, these terms are custom and non-standard. Which means I would guide my commercial clients away from this particular model.


What are the supported languages of these models?


This v1 model is focused on English support, but you may find some multilingual capabilities.


Can you share the training loss curve?


Will there be "extended context" releases like 01.ai did for Yi?

Also, is the model GQA?



It's MQA, documented in the tech report


I'm not sure if this was mentioned in the paper somewhere, but how much does the super large 256k tokenizer vocabulary influence inference speed, and how much higher is the average text compression compared to llama's usual 32k? In short, is it really worth going beyond GPT-4's 100k?
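
The compression half of the question can be estimated empirically; a rough sketch, assuming the public Hugging Face tokenizer IDs and a placeholder sample file:

```python
# Tokenize the same text with both vocabularies and compare counts.
from transformers import AutoTokenizer

text = open("sample.txt").read()  # any representative corpus sample

for name in ["google/gemma-7b", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok(text)["input_ids"])
    print(f"{name}: {n} tokens, {len(text) / n:.2f} chars per token")
```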


I find the snide remarks around open source in the paper and announcement rather off-putting.

As the ecosystem evolves, we urge the corporate AI community to move beyond demanding to be taken seriously as a player in open source for models that are not actually open, and to avoid preaching with a PR statement that can be interpreted as uninformed at best or malicious at worst.



It would be great to understand what you mean by this -- we have a deep love for open source and the open developer ecosystem. Our open source team also released a blog today describing the rationale and approach for open models and continuing AI releases in the open ecosystem:

https://opensource.googleblog.com/2024/02/building-open-mode...

Thoughts and feedback welcome, as always.



If you truly love Open Source, you should update the language you use to describe your models so it doesn't mislead people into thinking it has something to do with Open Source.

Despite being called "Open", the Gemma weights are released under a license that is incompatible with the Open Source Definition. It has more in common with Source-Available Software, and as such it should be called a "Weights-Available Model".



The statement that you're not able to use LLaMA 2 to benchmark is also false and highly misleading, see https://x.com/BlancheMinerva/status/1760302091166241163?s=20


    If, on the Llama 2 version release date, the monthly active users [...] is greater than 700 million monthly active users [...] you are not authorized to exercise any of the rights under this Agreement
I would guess this is Google being careful to not be burned by this lame clause in the Llama 2 license.


It's aimed directly at them (and OpenAI and Microsoft) so they have to honor it if they don't want a legal battle. But there's nothing stopping others from doing benchmarking.


For the reference of people seeing this now: The tweet that person linked has now been deleted and the scientist who tweeted it has acknowledged they were wrong and retracted their claim, as all good scientists should.


Working at google is like this, where no matter how much you try to do the right thing you're always under attack.


Which remarks are you referring to?


The snide remarks at Meta's Llama license, which doesn't allow companies with more than 700 million monthly active users to use it, while this model also doesn't have a truly 'open' license itself, and also this paragraph:

>As the ecosystem evolves, we urge the wider AI community to move beyond simplistic ’open vs. closed’ debates, and avoid either exaggerating or minimising potential harms, as we believe a nuanced, collaborative approach to risks and benefits is essential. At Google DeepMind we’re committed to developing high-quality evaluations and invite the community to join us in this effort for a deeper understanding of AI systems.



Well, given that that restriction added to the meta-llama license is aimed at Google, is petty, and goes against open source norms, I think it’s reasonable that they should feel this way about it.


How is this a snide remark? It's factual and prevented their team from benchmarking against Llama 2.


Quick question -- can you tell me where you got that quote? It's not in the main blog or any of the launch communications that I can see.




I notice a few divergences from common models:

- The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x;

- The vocabulary size is 10x (256K vs. Mistral’s 32K);

- The training token count is tripled (6T vs. Llama2's 2T)

Apart from that, it uses the classic transformer variations: MQA, RoPE, RMSNorm.

How big was the batch size that it could be trained so fast?

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/bl...



> The training token count is tripled (6T vs. Llama2's 2T)

Damn, 6T? That's a lot!

Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we have saturated the 7B parameter space, and couldn't possibly make it much better unless new techniques are discovered.



Hard to say definitively. Mistral’s token embeddings only account for


Looking at the config.json of Gemma 7B, the feedforward hidden size is 8x, not 16x.


Huh, indeed, that's what the config.json[0] says; the report[1] indicates “Feedforward hidden dims: 49152”.

[0]:https://huggingface.co/google/gemma-7b-it/blob/main/config.j...

[1]: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...



I don't see the number 49152 reported in the config.json, what line are you referring to? I just see the intermediate_size of 24576 (so 8x).

EDIT: I didn't read the comment correctly, you have noticed the same thing.



The *GLU-based activation functions like GEGLU and SwiGLU use 2 input values to produce 1 output value, which makes these numbers weird. In each value pair, one goes through the GELU/SiLU activation function and is then multiplied by the other "gate" value.

In the report, "hidden dim" matches the number of GEGLU inputs. In the config, "intermediate_size" matches the number of GEGLU outputs. Most *GLU models so far have used intermediate_size = 8/3 * d_model, as this gives the same number of matmul FLOPS & parameters as a 4x-expanded non-GLU model, and PaLM vaguely showed that 4x is better than a smaller expansion factor.

If one considers Llama-2-7B's FFN expansion factor to be ~5.33x, Gemma's expansion factor is 16x.
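
A minimal sketch of a GEGLU feed-forward block with Gemma 7B's numbers, to make the input/output distinction concrete (layer names follow the Hugging Face convention; this is illustrative, not Gemma's actual implementation):

```python
# The config's intermediate_size (24576 GEGLU outputs) is half the
# report's "feedforward hidden dims" (49152 GEGLU inputs): gate + up
# together form the 2*intermediate inputs of the GEGLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GegluFFN(nn.Module):
    def __init__(self, d_model=3072, intermediate=24576):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, intermediate, bias=False)
        self.up_proj = nn.Linear(d_model, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, d_model, bias=False)

    def forward(self, x):
        # gelu(gate) * up, then project back down to d_model
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))

print(GegluFFN()(torch.randn(1, 4, 3072)).shape)  # torch.Size([1, 4, 3072])
```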



Makes perfect sense thx


Read the parent comment again. It says the paper says 49152, not the config.json.


What does tokenization look like in 256k vs 32k?


It mostly means that there are tokens dedicated to rarer sequences of characters, even in foreign languages (note that Gemma is not intended to be good multilingually): “説明書” (instruction manual) has its own token, and so does “Nixon”, “آباد” (a city suffix, I believe), and the HTML sequence "\">