I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on", it will produce a better completion; I haven't seen this behaviour in any other model.
Not sure I understand your example? It's not an offensiveness benchmark; in fact, I can imagine a model trained to be inoffensive would do worse on a truthfulness benchmark. I wouldn't go so far as to say TruthfulQA actually tests how truthful a model is, or its reasoning. But it's one of the benchmarks least correlated with the others, which makes it one of the most interesting, much more so than running yet another test that is highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562
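For the curious, the correlation claim is easy to check against any leaderboard dump. A rough pandas sketch; the scores below are made up purely for illustration:

```python
import pandas as pd

# Illustrative scores only (model x benchmark); a real check would use a
# full leaderboard dump rather than these made-up numbers.
scores = pd.DataFrame(
    {
        "MMLU":       [70.6, 66.6, 64.3, 56.3],
        "ARC":        [66.4, 61.9, 59.3, 61.1],
        "HellaSwag":  [86.5, 82.3, 82.2, 73.8],
        "TruthfulQA": [46.8, 43.9, 44.8, 44.5],
    },
    index=["Mixtral-8x7B", "Llama-3-8B", "Mistral-7B", "Phi-2"],
)

# Pairwise Pearson correlations across models: benchmarks that track MMLU
# closely add little new signal, while an uncorrelated one is telling you
# something the others don't.
print(scores.corr(method="pearson")["MMLU"].round(2))
```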
Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild. (I'm sure there's a lot of nuance to it; for one, these benchmarks are not so hard to game. We'll see how the dust settles, but still...)

Phi-3-mini 3.8B: 71.2
Phi-3-small 7B: 74.9
Phi-3-medium 14B: 78.2
Phi-2 2.7B: 58.8
Mistral 7B: 61.0
Gemma 7B: 62.0
Llama-3-In 8B: 68.0
Mixtral 8x7B: 69.9
GPT-3.5 1106: 75.3

(These are averages across all tasks for each model, but looking at individual scores shows a similar picture.)
At a glance, it looks like Phi-3 was trained on an English-only, STEM-heavy dataset. See how the models are not as strong on HumanEval, Trivia, etc. But of course it's very good.
> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMSYS is just one very flawed benchmark.
Why is LMSYS flawed? Many people treat it as gospel because it's the only large-scale, up-to-date qualitative benchmark; all the numeric benchmarks seem to miss real-world applicability.
Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should have been far apart on every benchmark.
There is enormous value in polishing the user experience, especially if no one else is doing it (or maybe no one else is capable of doing it). It will never get old as long as they are the only ones doing it.
Did they ever claim to be a powerhouse in foundation models? Did your MacBook or iPhone become obsolete or stop working? They use the models; they don't release them, because they don't hoard data.
The opposite is the case: with all these advancements, even by doing nothing, Apple (like everyone, including hobbyists) is moving closer to the frontier. Hopefully this trend stays alive!
If anything this is good for them. Apple's play here has always been getting their devices ready for running LLMs locally. This makes it way easier.
That's a very eloquent variation on the word "censorship". Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?
Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chatbot?
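In case it helps anyone, here is a minimal sketch of top-1 retrieval plus a small local generator; the embedding model and Phi-3 checkpoint named below are illustrative choices, not recommendations:

```python
# Minimal RAG sketch: embed the domain docs, retrieve the best match,
# and stuff it into the prompt of a small local model. Assumes
# sentence-transformers and transformers are installed; model choices
# are illustrative.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of receiving a return.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
)

def answer(question: str) -> str:
    q_emb = embedder.encode(question, convert_to_tensor=True)
    best = util.cos_sim(q_emb, doc_emb).argmax().item()  # top-1 retrieval
    prompt = f"Context: {docs[best]}\nQuestion: {question}\nAnswer:"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("When can I call support?"))
```

For a narrow domain with simple queries, even this naive top-1 setup often gets you surprisingly far; the usual next steps are chunking, top-k retrieval, and a reranker.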
I have been on far larger author lists :) There's probably a whole team for the training-data generation and assessment, and a whole team for the safety assessment (section 4); that stuff adds up.
Both previous Phis have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway, though.
Fewer tokens than Llama 3 (3.3T vs. 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't talk about.
I feel like literal dictionaries would make good training data; I wonder if any of them have done that. LLMs are good at faking, so it's hard to tell by asking them.
That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again: you could optimally train a ginormous model and then distill it afterward for efficient inference.
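For reference, the classic form of that idea is logit distillation, where the student is trained to match the teacher's softened output distribution. A minimal PyTorch sketch of the loss; the temperature and mixing weight are the usual knobs, and the values here are arbitrary:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL divergence from the teacher's softened distribution to
    # the student's. The T**2 factor keeps gradient magnitudes comparable
    # across temperatures (Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard term: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 4, vocabulary of 10.
student = torch.randn(4, 10, requires_grad=True)  # student logits
teacher = torch.randn(4, 10)                      # frozen teacher logits
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```

With an API-only teacher like GPT-4 you generally can't get logits at all, which is presumably why the Phi-style approach "distills" through synthetic training text instead.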