I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on", it will produce a better completion; I haven't seen this behaviour in any other model.
Not sure I understand your example? It's not an offensiveness benchmark; in fact, I can imagine a model trained to be inoffensive would do worse on a truthfulness benchmark. I wouldn't go so far as to say TruthfulQA actually tests how truthful a model is, or its reasoning. But it's one of the benchmarks least correlated with the others, which makes it one of the most interesting, much more so than running yet another test that is highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562
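For the curious, the correlation claim is easy to check against any leaderboard dump. A rough pandas sketch; the scores below are made up purely for illustration:

```python
import pandas as pd

# Illustrative scores only (model x benchmark); a real check would use a
# full leaderboard dump rather than these made-up numbers.
scores = pd.DataFrame(
    {
        "MMLU":       [70.6, 66.6, 64.3, 56.3],
        "ARC":        [66.4, 61.9, 59.3, 61.1],
        "HellaSwag":  [86.5, 82.3, 82.2, 73.8],
        "TruthfulQA": [46.8, 43.9, 44.8, 44.5],
    },
    index=["Mixtral-8x7B", "Llama-3-8B", "Mistral-7B", "Phi-2"],
)

# Pairwise Pearson correlations across models: benchmarks that track MMLU
# closely add little new signal, while an uncorrelated one is telling you
# something the others don't.
print(scores.corr(method="pearson")["MMLU"].round(2))
```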
Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild. (I'm sure there's a lot of nuance to it; for one, these benchmarks are not so hard to game. We'll see how the dust settles, but still...)

Phi-3-mini 3.8B: 71.2
Phi-3-small 7B: 74.9
Phi-3-medium 14B: 78.2
Phi-2 2.7B: 58.8
Mistral 7B: 61.0
Gemma 7B: 62.0
Llama-3-In 8B: 68.0
Mixtral 8x7B: 69.9
GPT-3.5 1106: 75.3

(These are averages across all tasks for each model, but looking at individual scores shows a similar picture.)
At a glance, it looks like Phi-3 was trained on an English-only, STEM-heavy dataset. See how the models are not as strong on HumanEval, Trivia, etc. But of course it's very good.
> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMSYS is just one very flawed benchmark.
Why is LMSYS flawed? Many people treat it as gospel because it's the only large-scale, up-to-date qualitative benchmark; all the numeric benchmarks seem to miss real-world applicability.
Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should have been far apart on every benchmark.
There is enormous value in polishing the user experience, especially if no one else is doing it (or maybe no one else is capable of doing it). It will never get old as long as they are the only ones doing it.
Did they ever claim to be a powerhouse in foundation models? Did your MacBook or iPhone become obsolete or stop working? They use the models; they don't release them, because they don't hoard data.
The opposite is the case: with all these advancements, even by doing nothing, Apple (like everyone, including hobbyists) is moving closer to the frontier. Hopefully this trend stays alive!
If anything this is good for them. Apple's play here has always been getting their devices ready for running LLMs locally. This makes it way easier.
That's a very eloquent variation on the word "censorship". Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?
Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chatbot?
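In case it helps anyone, here is a minimal sketch of top-1 retrieval plus a small local generator; the embedding model and Phi-3 checkpoint named below are illustrative choices, not recommendations:

```python
# Minimal RAG sketch: embed the domain docs, retrieve the best match,
# and stuff it into the prompt of a small local model. Assumes
# sentence-transformers and transformers are installed; model choices
# are illustrative.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of receiving a return.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
)

def answer(question: str) -> str:
    q_emb = embedder.encode(question, convert_to_tensor=True)
    best = util.cos_sim(q_emb, doc_emb).argmax().item()  # top-1 retrieval
    prompt = f"Context: {docs[best]}\nQuestion: {question}\nAnswer:"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("When can I call support?"))
```

For a narrow domain with simple queries, even this naive top-1 setup often gets you surprisingly far; the usual next steps are chunking, top-k retrieval, and a reranker.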
I have been on far larger author lists :) There's probably a whole team for the training-data generation and assessment, and a whole team for the safety assessment (section 4); that stuff adds up.
Both previous Phis have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway, though.
Fewer tokens than Llama 3 (3.3T vs. 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't talk about.
I feel like literal dictionaries would make good training data; I wonder if any of them have done that. LLMs are good at faking, so it's hard to tell by asking them.
That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again: you could optimally train a ginormous model and then distill it afterward for efficient inference.
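For reference, the classic form of that idea is logit distillation, where the student is trained to match the teacher's softened output distribution. A minimal PyTorch sketch of the loss; the temperature and mixing weight are the usual knobs, and the values here are arbitrary:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL divergence from the teacher's softened distribution to
    # the student's. The T**2 factor keeps gradient magnitudes comparable
    # across temperatures (Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard term: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 4, vocabulary of 10.
student = torch.randn(4, 10, requires_grad=True)  # student logits
teacher = torch.randn(4, 10)                      # frozen teacher logits
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```

With an API-only teacher like GPT-4 you generally can't get logits at all, which is presumably why the Phi-style approach "distills" through synthetic training text instead.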