Original link: https://news.ycombinator.com/item?id=44010705
The Hacker News discussion revolves around the potential for AI systems to degrade due to training on AI-generated content, a concept termed "model collapse." Some fear that AI-generated inaccuracies and biases will pollute future training data, leading to a decline in performance.
However, counterarguments suggest that AI could learn to filter out "duff data" and that human-created content, itself an interpretation of reality, isn't necessarily superior. The possibility that human and AI-generated content will converge is also raised, and there is some discussion of watermarking AI-generated content.
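To make the "filter out duff data" idea concrete, here is a minimal sketch in which candidate training documents are scored and low-quality ones dropped. The `quality_score` heuristic is a hypothetical stand-in for a real quality or provenance classifier; nothing below comes from the thread itself.

```python
# Minimal sketch of pre-training data filtering: score each candidate
# document and keep only those above a quality threshold.

def quality_score(doc: str) -> float:
    """Hypothetical scorer: higher means more likely to be useful training
    text. A real system might use a trained classifier or perplexity-based
    heuristics; this toy version just penalizes repetitive text."""
    words = doc.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)  # fraction of distinct words

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    "spam spam spam spam spam",  # repetitive: score 0.2, filtered out
    "A varied sentence with many distinct words survives the filter.",
]
print(filter_corpus(corpus))  # only the second document remains
```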
Synthetic data is emerging as one way to combat this potential degradation: Meta, for example, reportedly used it to train its Llama 3 models, with existing LLMs classifying, filtering, and enhancing datasets for future models (sketched below). The conversation also touches on the quality of human-created data, the potential for AI to improve its reasoning and fact-checking, and concerns about humans adopting AI language patterns. The overall sentiment is mixed, with both pessimistic and optimistic views on the long-term impact of AI-generated content on AI development.
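Here is a hedged sketch of that kind of curation loop, in which an existing model judges candidate synthetic examples before they are admitted to the next model's training set. `call_llm` is a hypothetical placeholder for a real inference client; this is an illustration of the general technique, not Meta's actual pipeline.

```python
# Sketch of LLM-driven dataset curation: an existing model classifies
# candidate synthetic examples, and only accepted ones are kept for
# training the next model.

from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed here so the sketch runs standalone."""
    return "ACCEPT"  # a real judge model would return ACCEPT or REJECT

def judge(example: Example) -> bool:
    """Ask the judge model whether a candidate example is good enough."""
    verdict = call_llm(
        "Reply ACCEPT or REJECT. Is this response accurate and helpful?\n"
        f"Prompt: {example.prompt}\nResponse: {example.response}"
    )
    return verdict.strip().upper().startswith("ACCEPT")

def curate(candidates: list[Example]) -> list[Example]:
    """Keep only the examples the judge accepts for future training."""
    return [ex for ex in candidates if judge(ex)]

batch = [Example("What is 2+2?", "4"), Example("Capital of France?", "Paris")]
print(len(curate(batch)), "examples kept")
```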