(comments)

Original link: https://news.ycombinator.com/item?id=40001971

This thread discusses confabulation, the phenomenon of a person creating false memories or facts based on past experience. Commenters share personal experience with clinical confabulators and compare the phenomenon to children's content on YouTube. They argue that while children can say they don't know something, they can also be confidently wrong, a challenge humans have not yet figured out how to handle. They bring up the philosopher Ludwig Wittgenstein and his works, the Tractatus Logico-Philosophicus and the Philosophical Investigations, suggesting that Wittgenstein might have treated hallucination as part of language use, though the two books are at odds on the question. The discussion then turns to AI: glitch tokens that arise during model training, and the possible economic incentives behind certain websites' behavior. Commenters express skepticism about the motives behind the traffic from OpenAI's server farm and criticize the lack of transparency around the data sources used for model training. They also touch on Wittgenstein's view of language and a more "French" approach to understanding it.

Related articles

Original text


This reminds me of how GPT-2/3/J came across https://reddit.com/r/counting, wherein redditors repeatedly post incremental numbers to count to infinity. It considered their usernames, like SolidGoldMagikarp, such common strings on the Internet that, during tokenization, it treated them as top-level tokens of their own.

https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...

https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...

Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257 distinct tokens in its vocabulary. It does make me wonder - it's certainly not a linear relationship, but given the number of inferences run every day on GPT-3 while it was the flagship model, the incremental electricity cost of these Redditors' niche hobby, vs. having allocated those slots in the vocabulary to actually common substrings in real-world text and thus reducing average input token count, might have been measurable.
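For anyone curious, this is easy to check with OpenAI's tiktoken library (a minimal sketch, assuming tiktoken is installed; the "single token" behavior is as reported in the linked posts):

    # Inspect the GPT-2/GPT-3 vocabulary ("r50k_base", 50,257 tokens).
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    print(enc.n_vocab)  # 50257

    # " SolidGoldMagikarp" (with the leading space) reportedly encodes to a
    # single top-level token instead of being split into subwords.
    ids = enc.encode(" SolidGoldMagikarp")
    print(ids, len(ids))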

It would be hilarious if the subtitle on OP's site, "IECC ChurnWare 0.3," became a token in GPT-5 :)



I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs. I mean if someone posts a question on an internet forum that I don't know the answer to, I'm certainly not going to post "I don't know" since that wouldn't be useful.

In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.



Reminds me of a joke

Three logicians walk into a bar. The bartender says "what'll it be, three beers?" The first logician says "I don't know". The second logician says "I don't know". The third logician says "Yes".



If, like me, you didn't get the joke at first:

Both of the first two logicians wanted a beer; otherwise they would know the answer was "no". The third logician recognizes this, and therefore knows the answer.



He didn’t know perfectly, but he knew with great enough probability to place an order. In the very small chance that someone wanted two beers, someone would speak up.

This way is logically the most efficient and involves the least communication.



I recently heard this explained (*) in the following way: three is the smallest number where you can set up an expectation (with the first two) and then break it. This is why three is such a common number, not just in jokes but in all sorts of story-telling.

(*) In a lecture by the mathematician & author Sarah Hart.



It's "fabrication", plain and simple.

Fully agreed that "hallucination" is a bonkers word for it — sensational and melodramatic. But few people know what a confabulation is, and moreover it's an overly complex way to describe the phenomenon.

The LLM is making something up. It's a fabrication.

It's not fanciful; it's not spooky; it's mundane, as it should be.



My view is that hallucination is something related to the interpretation of reality; it's not really directly mapping to memory at all. The mechanisms of confabulation entirely surround the gluing together of memories, and what are these models other than some sort of representation of memory?

I believe that you can also cause something a bit like a transient dysphasia by giving them bad inputs as well, so there is that on the language production side. However there's still nothing that pertains to the experience aspects central to what hallucinations actually are.



I think hallucination is better and more accurate, as it implies a bit of imagination and buffoonish deceit.

I don't think confabulate matches as well, as it implies confusion or a mixture of different ideas.

ChatGPT isn’t confused, it’s making things up. It’s trying to bullshit as best it can in hope that what it makes up convinces its user.



That making things up based on memories of past things is entirely what confabulation is. Bullshitting in the large as it were. I've met quite a few clinical confabulators (people with Korsakov syndrome and the like) and I find the parallels remarkable.


It's not. Kids overhear what parents watch, their ears are like little recorders. Meanwhile, kids videos near-universally end with either a like&subscribe admonition, or some crap like "ask parents to download our tablet app". Even the quality videos, they all do that.

Even if you don't show children videos but just want to play some music, YouTube is still the least-hassle, least-bullshit music stream player (arguably still its main use for adults, too). Ain't anyone got time to deal with Spotify's ever more broken app. And this is the limit of technical skill of almost all parents. They can't exactly run SponsorBlock in YouTube's mobile app (and paid YouTube doesn't help here either, surprise surprise).

Not making excuses (though I'm not really blaming parents for this) - just saying how things actually are.



Which is crazy because there's plenty of good content for kids on YouTube (if you really need a break!). Blippi, Meekah, Sesame Street, even that mind-numbing drivel Cocomelon (which at least got my girls talking/singing really early).


Sure, and most of it starts with a jingle and ends with a begging block.

I used to cut all those things to shape with youtube-dl and Audacity; we have a library of a good hundred-plus sanitized songs to play, but with the modern world hating files and anything offline, it turned out to be quite a hassle to keep the practice up.



I get the sentiment, but when reality hits unrealistic parental expectations, things get messy.

If you have to put a show on TV to give them some songs to sing along to, or to distract them while you're making lunch, I'm not judging you, and I think it's best to view this content on a gradient rather than in black and white.



People used to have extended family around; nowadays they don't.

Parenting is easier if you have 5 family members in walking distance and they also have similarly aged kids who can all play together.



They can say they don't know, and have been trained to in at least some cases; I think the deeper problem — which we don't know how to fix in humans, the closest we have is the scientific method — is they can be confidently wrong.


This is Amazon's fault; they send an email that looks like it's specifically directed to you. "ceejayoz, a fellow customer has a question on..."

At some point fairly recently they added a "I don't know the answer" button to the email, but it's much less prominent than the main call-to-action.



I’d place in the same category the responses that I give to those chat popups so many sites have. They show a person saying to me “Can I help you with anything today?” so I always send back “No”.


A lot of LLM hallucination is because of the internal conflict between alignment for helpfulness and lack of a clear answer. It's much like when someone gets out of their depth in a conversation and dissembles their way through to try and maintain their illusion of competence. In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations.
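A minimal sketch of what that kind of instruction might look like (the wording is illustrative, not a tested recipe):

    # Illustrative system prompt giving the model explicit permission to decline.
    SYSTEM_PROMPT = (
        "Answer only if you are confident in the answer. "
        "If you are unsure, or the necessary information is not available to you, "
        "reply exactly: 'I don't know.' Do not guess."
    )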

A lot more of LLM hallucination is it getting the context confused. I was able to get GPT4 to hallucinate easily with questions related to the distance from one planet to another, since most distances on the internet are from the sun to individual planets, and the distances between planets vary significantly based on where they are in their orbits. These are probably slightly harder to fix.
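The ambiguity is easy to see with rough numbers (a sketch using mean orbital radii and a circular-orbit approximation, so the figures are only illustrative):

    # Why "the distance from Earth to Mars" has no single answer.
    EARTH_AU = 1.00   # mean distance from the Sun, in astronomical units
    MARS_AU = 1.52

    closest = MARS_AU - EARTH_AU    # ~0.52 AU, planets on the same side of the Sun
    farthest = MARS_AU + EARTH_AU   # ~2.52 AU, planets on opposite sides

    KM_PER_AU = 149_597_871
    print(f"roughly {closest * KM_PER_AU:,.0f} to {farthest * KM_PER_AU:,.0f} km")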



"In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations."

I've noticed that while this can help to prevent hallucinations, it can also cause it to go way too far in the other direction and start telling you it doesn't know for all kinds of questions it really can answer.



It also has a problem with quantities: it gets confused by things like the cube root of 750 L, insisting for a long time that it is around 9 m. It even suggests that 1 L is equal to 1 m³.
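For reference, the arithmetic it should have done (1 L is 0.001 m³, not 1 m³):

    volume_m3 = 750 * 0.001        # 750 L = 0.75 m³
    side_m = volume_m3 ** (1 / 3)  # edge length of a cube with that volume
    print(f"{side_m:.2f} m")       # ~0.91 m, nowhere near 9 m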


With Wittgenstein I think we see that "hallucinations" are a part of language in general, albeit one I could see being particularly vexing if you're trying to build a perfectly controllable chatbot.


I'm referring to his two works, the "Tractatus Logico-Philosophicus" and "Philosophical Investigations". There's a lot explored here, but Wittgenstein basically makes the argument that the natural logic of language—how we deduce meaning from terms in a context and naturally disambiguate the semantics of ambiguous phrases—is different from the sort of formal propositional logic that forms the basis of western philosophy. However, this is also the sort of logic that allows us to apply metaphors and conceive of (possibly incoherent, possibly novel, certainly not deductively-derived) terms—counterfactuals, conditionals, subjunctive phrases, metaphors, analogies, poetic imagery, etc. LLMs have shown some affinity for the former (linguistic) type of logic, with greatly reduced affinity for the latter (formal/propositional) sort of logical processing. Hallucinations as people describe them seem to be problems with not spotting "obvious" propositional incoherence.

What I'm pushing at is not that this linguistic ability naturally leads to the LLM behavior we're seeing and calling "hallucinating", just that LLMs may capture some of how humans process language, differentiate semantics, recall terms, etc, but without the mechanisms that enable rationally grappling with the resulting semantics and propositional (in)coherency that are fetched or generated.

I can't say this is very surprising—most of us seem to have thought processes that involve generating and rejecting thoughts when we e.g. "brainstorm" or engage in careful articulation that we haven't even figured out how to formally model with a chatbot capable of generating a single "thought", but I'm guessing if we want chatbots to keep their ability to generate things creatively there will always be tension with potentially generating factual claims, erm, creatively. Further evidence is anecdotal observations that some people seem to have wildly different thresholds for the propositional coherence they can spot—perhaps one might be inclined to correlate the complexity with which one can engage in spotting (in)coherence with "intelligence", if one considers that a meaningful term.



Wait, are you saying this something you read in both the Tractatus and the PI? They are quite opposed as texts! That's kinda why he wrote the PI at all..

I don't think Wittgenstein would agree, first of all, that there is a "natural logic" to language. At least in the PI, that kind of entity--"the natural logic of language"--is precisely the kind of weird and imprecise use of language he is trying to expose. Even more, to say that such a logic "allows" for anything (like metaphors) feels like a very very strange thing for Wittgenstein to assert. He would ask "what do you mean by 'allows'"?

All we know, according to him (in the PI), is that we find ourselves speaking in situations. Sometimes I say something, and my partner picks up the right brick, other times they do nothing, or hit me. In the PI, all the rest is doing away with things, like our idea of private language, the irreality of things like pain, etc. To conclude that he would make such assertions about the "nature" of language, of poetry, whatever, seems like maybe too quick a reading of the text. It is at best, a weirdly mystical reading of him, that he probably would not be too happy about (but don't worry about that, he was an asshole).

The argument you are making sounds much more French. Derrida or Lyotard have said similar things (in their earlier, more linguistic years). They might be better friends to you here.



I would assume GP is talking about the fallibility of human memory, or perhaps about the meanings of words/phrases/aphorisms that drift with time. C.S. Lewis talks about the meaning of the word "gentleman" in one of his books; at first the word just meant "land owner" and that was it. Then it gained social significance and began to be associated with certain kinds of behavior. And now, in the modern register, its meaning is so dilute that it can be anything from "my grandson was well behaved today" or "what an asshole" depending on its use context.

Dunno. GP?



I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.

Because expecting a behaviour, like knowing you don't know, that isn't represented in the training set is silly.

Kids make stuff up at first, then we correct them - so they have a way to learn not to.
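A sketch of what such "not knowing" examples might look like as supervised pairs (the format and wording here are illustrative, not any particular framework's):

    # Hypothetical fine-tuning pairs for a model trained on a closed corpus:
    # questions the corpus cannot answer get an explicit refusal.
    training_examples = [
        {"prompt": "What year was the Eiffel Tower completed?",
         "response": "1889."},
        {"prompt": "What did the site foreman eat for breakfast that morning?",
         "response": "I don't know; that is not in my reference material."},
    ]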



> I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.

The problem is that curating data is slow and expensive and downloading the entire web is fast and cheap.

See also https://en.wikipedia.org/wiki/Cyc



Agreed. Using an LLM to generate or curate training sets for later generations of models seems like a cool approach.

Maybe if you trained a small base model to know it doesn't know in general and THEN trained it on the entire web with embedded not-knowing preserving training examples, it would work?



No it won’t work.

This is not a brain. The best analogy is an English major.

They are good at language, not reasoning.

Humans see language and think reason. It seems we can’t separate the two.



> train a LLM on a closed set of curated knowledge

Google has one of these already, with an LLM that was trained on nothing but weather data and so can only give weather-data-prediction responses.

The 'knowing it doesn't know things' part is much harder to get reliable, though.



"I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs."

Probably true, but if you have quality, organized data, you will just want to search the data itself.



> I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs

I mean, it's inherent to LLMs to be unable to answer "I don't know" as a result of not knowing the answer. An LLM never "doesn't know" the answer. But they'll gladly answer "I don't know" if that's statistically the most likely response, right? (Although current public offerings are probably trained against ever saying that.)



Knowing to say "I don't know" instead of extrapolating is an explicitly learned skill in humans, not something innate or inherent in the structure of language, so we shouldn't expect LLMs to pick it up ex nihilo either.


I suspect this is going to be a disagreement on the meaning of "to know".

Along the same lines as why people argue about whether a tree falling in a wood where nobody can hear it makes a sound: some people implicitly regard sound as the qualia, while others regard it as the vibrations in the air.



LLMs don't know anything except the most frequent observed response to a context made up of a sequence of tokens.

How often do the words "I don't know" get uttered in books, papers, articles, stack overflow, or any other resource of knowledge?



Not really.

An LLM should have no problem replying "I don't know" if that's the most statistically likely answer to a given question, and if it's not trained against such a response.

What it fundamentally can't do is introspect and determine it doesn't have enough information to answer the question. It always has an answer. (disclaimer: I don't know jack about the actual mechanics. It's possible something could be constructed which does have that ability and still be considered an "LLM". But the ones we have now can't do that.)



No, that's the same misunderstanding previously stated.

Answering "I don't know" because it is a likely response to a particular string is completely different from being aware that one does not know the answer and saying so.

Both motivations lead to the same outcome, but they're unrelated processes. The response "I don't know" can represent either:

1. The most likely answer to a particular question, based on statistical data; or

2. An expression of an agent's internal state.

Figuring out that distinction is perhaps one of the most important questions ever raised.



>In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.

This isn't true. There are many contexts where it is true, but it doesn't actually generalize the way you say it does.

There are plenty of cases where experts in a non-one-on-one context will express a lack of knowledge. Sometimes this will be as part of making point about the broader epistemic state of the group, sometimes it will be simply to clarify the epistemic state of the speaker.



During tokenization, the usernames became tokens... but before training the actual model, they removed stuff like that from the training data, so it was never trained on text which contains those tokens. As such, it ended up with tokens which weren't associated with anything; glitch tokens.


It's interesting: perhaps the stability (from a change management perspective) of the tokenization algorithm, being able to hold it constant between old and new training runs, was deemed more important than trying to clean up the data at an earlier phase of the pipeline. And the eventuality of glitch tokens was deemed an acceptable consequence.


I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?


Except the first thing OpenAI does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of a robots.txt on the new domain.



> Except the first thing OpenAI does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.



All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.


Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt



This is a moderately persuasive argument.

Although the crawler should probably ignore all of the HTML body. But it does feel like a grey area if I accept your first point.



More directly, e.g. Tesla boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.


Jesus, that's one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work involves humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.


This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.

I'm sure the big players like Google would deal with it gracefully.



Here you go (1 req/min, 10 bytes/sec), please report results :)
  http {
    # Key on the client address only for the "mimo" user agent; other agents
    # map to an empty key, which limit_req does not account for.
    map $http_user_agent $mimo_key {
      default "";
      "mimo"  $binary_remote_addr;
    }
    limit_req_zone $mimo_key zone=ten_bytes_per_second:10m rate=1r/m;
    server {
      location / {
        limit_req zone=ten_bytes_per_second burst=5;
        # limit_rate (unlike limit_req) may be set inside "if in location"
        if ($http_user_agent = "mimo") {
          limit_rate 10;
        }
      }
    }
  }


Scrapers of the future won't be ifElse logic, they will be LLM agents themselves. The slow loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible. "OK I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."


You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits for it to become an invitation for (counter-)abuse.


Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.
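For example, with Python's widely used requests library (one common choice; the URL and limits here are placeholders):

    import requests

    try:
        # (connect timeout, read timeout) in seconds
        resp = requests.get("https://example.com/", timeout=(5, 30))
        resp.raise_for_status()
    except requests.exceptions.Timeout:
        print("gave up on a slow host")
    except requests.exceptions.RequestException as exc:
        print(f"fetch failed: {exc}")

One caveat: requests' read timeout bounds the gap between bytes, not the whole response, so a tarpit that drips a byte every few seconds can still tie up a naive worker unless an overall deadline is enforced as well.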


He says 3 million, and 1.8 million are for robots.txt

So 1.2 million non robots.txt requests, when his robots.txt file is configured as follows

    # buzz off
    User-agent: GPTBot
    Disallow: /
Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
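The semantics are easy to confirm with Python's standard-library robots.txt parser (a small sketch using the rules quoted above; the checked URLs are just examples):

    # "Disallow: /" for GPTBot means no URL on this host may be fetched by it.
    from urllib import robotparser

    rules = [
        "# buzz off",
        "User-agent: GPTBot",
        "Disallow: /",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))          # False
    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/anything"))  # False
    print(rp.can_fetch("SomeOtherBot", "https://www.web.sp.am/"))    # True: no rule for it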


A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check whether we already have a robots.txt that allows us to scan this site; if we don't, get and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask forgiveness than permission".


That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.


I agree with you. I only stated how the crawlers seem to work; if you read their pages or try to block or slow them down, it seems clear that they scan first and respect after. But somehow people understood that as me approving of that behaviour.

For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and only after that parse and respect robots.txt". Which I disapprove of and don't condone.



Except now it says
    # silly bing
    #User-agent: Amazonbot          
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60
Which means he's changing it. The default for all other bots is to allow crawling.


There are fewer than 10 links on each domain, how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt"


I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go head and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."



The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.

In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.



This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation, which will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative whose various efforts now date back multiple decades.

Notice how he casually drops a link to the landing page in the NANOG message. That's how the bots take the bait.



It's for shits-and-giggles and it's doing its job really well right now. Not everything needs to have an economic purpose, 100 trackers, ads, and a company behind it.


Am I the only one who was hoping—even though I knew it wouldn’t be the case—that OpenAI’s server farm was infested with actual spiders and they were getting into other people’s racks?


He's not done his robots.txt properly, he's commented out the bit that actually disallows it
  # silly bing
  #User-agent: Amazonbot          
  #Disallow: /

  # buzz off
  #User-agent: GPTBot
  #Disallow: /

  # Don't Allow everyone
  User-agent: *
  Disallow: /archive

  # slow down, dudes
  #Crawl-delay: 60


It's not just a problem for training, but the end user, too. There are so many times that I've tried to ask a question or request a summary for a long article only to be told it can't read it itself, so you have to copy-paste the text into the chat. Given the non-binding nature of robots.txt and the way they seem comfortable with vacuuming up public data in other contexts, I'm surprised they allow it to be such an obstacle for the user experience.


That’s the whole point. The site owner doesn’t want their information included in ChatGPT—they want you going to their website to view it instead.

It’s functioning exactly as designed.



It's a stretch to expect a human-initiated action to abide by robots.txt.

Also, once you click on a link in Chrome it's pretty much all robot-parsed and rendered from there as well.



I would say robots.txt is meant to filter access for interactions initiated by an automated process (ie automatic crawling). Since the interaction to request a site with a language model is manual (a human request) it doesn't make sense to me that it is used to block that request.

If you want to block information you provide from going through ClosedAI servers, block their IPs instead of using robots.txt.



Nothing.

This is why these things - search engines, AI crawlers, even adblock and video downloaders - exist in a slightly adversarial/parasitic relationship with the sites that provide their content, to which they give nothing back (or a net negative, if you cost them a page load without incurring an ad view).

I use adblock all the time but I'm very aware that it can only succeed as long as it doesn't win.



In the network security world, this is known as a tarpit. You can delay an attack, scan or any other type of automation by sending data either too slowly or in such a way as to cause infinite recursion. The result is wasted time and energy for the attacker and potentially a chance for us to ramp up the defences.
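A minimal sketch of the "too slowly" variant in Python (a toy tarpit that drips a response a byte at a time; the port and payload are arbitrary, and it handles one connection at a time):

    import socket
    import time

    BODY = b"<html><body>" + b"x" * 1000 + b"</body></html>"

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", 8080))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            try:
                with conn:
                    conn.recv(4096)  # read and ignore the request
                    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
                    for b in BODY:
                        conn.sendall(bytes([b]))  # one byte...
                        time.sleep(1)             # ...per second
            except OSError:
                pass  # client gave up; move on to the next victim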


From the content of the email, I get the impression that it's just a honeypot. Also I'm not seeing any delays in the content being returned.

A tarpit is different because it's designed to slow down scanning/scraping and deliberately waste an adversary's resources. There are several techniques but most involve throttling the response (or rate of responses) exponentially.



I’d let them do their thing, why not?! They want the internet? This is the real internet. It looks like he doesn’t really care that much that they’re retrieving millions of pages, so let them do their thing…


> It looks like he doesn’t really care that much that they’re retrieving millions of pages

It impacts the performance for the other legitimate users of that web farm ;)



Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of AI responses trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems that the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.


The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.

The irony of the whole thing is brutal.



Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.

Common crawl alone is only a few hundred TB, I have more content than that on a NAS sitting in my office that I built for a few grand (Granted I’m a bit of a data hoarder). The fears that we have “used all the data” are incredibly unfounded.



"Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.

> The fears that we have “used all the data” are incredibly unfounded.

The problem isn't whether we used all the real data or not, the problem is that it becomes increasingly difficult to distinguish real data from previous LLM outputs.



> "Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.

I don't know about that. If you scraped the same data and ran a search engine I think people would generally say you're fine. The copyright issue isn't the scraping step.



Gonna say you’re way off there. Once you decompress common crawl and index it for FTS and put it on fast storage you’re in for some serious pain, and that’s before you even put it in your ML pipeline.

Even refined web runs about 2TB once loaded into Postgres with TS vector columns, and that’s a substantially smaller dataset than common crawl.

It's not just dumping a ton of zip files on your NAS; it's making the data responsive and usable.



> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise

YMMV depending on the value of "you" and your budget.

If you're Google, Amazon or even lower tier companies like Comcast, Yahoo or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt)



> Facebook alone probably has more data than the entire dataset GPT4 was trained on and it’s all behind closed doors.

Meta is happily training their own models with this data, so it isn't going to waste.



Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.

However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.



I bet they are training their internal models on the data. Bet the real reason they are not training open source models on that data is because of fears of knowledge distillation, somebody else could distill LLaMa into other models. Once the data is in one AI, it can be in any AIs. This problem is of course exacerbated by open source models, but even closed models are not immune, as the Alpaca paper showed.


> The end state of training on web text has always been an ouroboros

And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?



I mean, you're not wrong. I've been building some unrelated web search tech and have considered just indexing all the sites I care about and making my own "non shit" search engine. Which really isn't too hard if you want to do say, 10-50 sites. You can fit that on one 4TB nvme drive on a local workstation.

I’m trying to work on monetization for my product now. The “personal Google” idea is really just an accidental byproduct of solving a much harder task. Not sure if people would pay for that alone.



Once we've got the first, making a billion is easy.

That said… are content creators collectively (all media, film and books as well as web) a thin tail or a fat tail?

I could easily believe most of the actual culture comes from 10k-100k people today, even if there's, IDK, ten million YouTubers or something (I have a YouTube channel, something like 14 k views over 14 years, this isn't "culturally relevant" scale, and even if it had been most of those views are for algorithmically generated music from 2010 that's a literal Markov chain).



It's true that there will no longer be any virgin forest to scrape but it's also true that content humans want will still be most popular and promoted and curated and edited etc etc. Even if it's impossible to train on organic content it'll still be possible to get good content


It is already solved. Look at how Microsoft trained Phi - they used existing models to generate synthetic data from textbooks. That allowed them to create a new dataset grounded in “fact” at a far higher quality than common crawl or others.

It looks less like an ouroboros and more like a bootstrapping problem.



AI training on AI-generated content is a future problem. Using textbooks is a good idea, until our textbooks are being written by AI.

This problem can't really be avoided once we begin using AI to write, understand, explain, and disseminate information for us. It'll be writing more than blogs and SEO pages.

How long before we start readily using AI to write academic journals and scientific papers? It's really only a matter of time, if it's not already happening.



You need to separate “content” and “knowledge.” GenAI can create massive amounts of content, but the knowledge you give it to create that content is what matters and why RAG is the most important pattern right now.

From “known good” sources of knowledge, we can generate an infinite amount of content. We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.

I agree there will be many issues keeping up with what “known good” is, but that’s always been an issue.



> We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.

That's my entire point -- AI only generates content right now, but it will also be the source of content for training purposes soon. We need a "known good" human knowledge-base, otherwise generative AI will degenerate as AI generated content proliferates.

Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.



> Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.

That is training on content.

The future will have models pre-trained on content and tuned on corpuses of knowledge. The knowledge it is trained on will be a selling point for the model.

Think of it this way - if you want to update the model so it knows the latest news, does it matter if the news was AI generated if it was generated from details of actual events?



Is this like, the AI equivalent of “another layer will fix it” that crypto fans used?

“It’s ok bro, another model will fix, just please, one more ~layer~ ~agent~ model”

It’s all fun and games until you can’t reliably generate your base models anymore, because all your _base_ data is too polluted.

Let’s not forget MS has a $10bn stake in the current crop of LLM’s turning out to be as magic as they claim, so I’m sure they will do anything to ensure that happens.



Oh I’m sure it works wonderfully for now.

My point is about the inevitable future when _those_ models start to struggle.

The phi approach doesn’t seem like breaking the ouroboros, it just feels like inserting another model/snake into the loop.



Well it will be multimodal, training and inferring on feeds of distributed sensing networks; radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.


Want a real conspiracy?

What do you think the NSA is storing in that datacenter in Utah? Power point presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft and friends.



> What do you think the NSA is storing in that datacenter in Utah?

A buffer with several-days-worth of the entire internet's traffic for post-hoc decryption/analysis/filtering on interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.



I am not sure why this would even be a conspiracy.

They would almost be failing in their purpose if they were not doing this.

On the other hand, this is an incredibly tough signal to noise problem. I am not sure we really understand what kind of scaling properties this would have as far as finding signals.



As I understand it, they don't have the capability to essentially PCAP all that data.. and the data wouldn't be that useful since most interesting traffic is encrypted as well. Instead they store the metadata around the traffic. Phone number X made an outgoing call to Y @ timestamp A, call ended at timestamp B, approximate location is Z, etc. Repeat that for internet IP addresses do some analysis and then you can build a pretty interesting web of connections and how they interact.


> most interesting traffic is encrypted as well

encrypted with an algorithm currently considered to be un-brute-forcible. If you presume we'll be able to decrypt today's encrypted transmissions in, say, 50-100 years, I'd record the encrypted transmission if I were the NSA.



It's a big data centre.

But is it big enough to store 50 years worth of encrypted transmissions?

Far cheaper to simply have spies infiltrate the ~3 companies that hold the keys to 98% of internet traffic.



Of everyone's? No. But enough to store the signal messages of the President, down a couple of levels? I hope so. After I'm dead, I hope the messages between the President, his cabinet, and their immediate contacts that weren't previously accessible get released to historians.


Though it seems like something that could exist, who is doing the technical work/programming? It seems impossible to be in the industry and not have associates and colleagues either from or going to an operation like that. This is what I've always pondered about when it comes to any idea like this. The number of engineers at the pointy end of the tech spear is pretty small.


> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content

What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?



I really really hope that five years from now we are not still using AI systems that behave the way today's do, based on probabilistic amalgamations of the whole of the internet. I hope we have designed systems that can reason about what they are learning and build reasonable mental models about what information is valuable and what can be discarded.


The only way out of this is robots that can go out in the world, collect data, and write up what they observed in natural language, which can then be used to train better LLMs.


I, for one, welcome the junk-data-ouroboros-meta-model-collapse. I think it'll force us out of this local maximum of "moar data moar good" mindset and give us, collectively, a chance to evaluate the effect these things have on our society. Some proverbial breathing room.


They've obviously been thinking about this for a while and are well aware of the pitfalls of training on AI-generated content. This is why they're making such aggressive moves into video, audio, and other better, more robust forms of ground truth. Do you really think they aren't aware of this issue?

It's funny whenever people bring this up: they think AI companies are some mindless juggernauts who will simply train without caring about data quality at all and end up with worse models that they'll still, for some reason, release. Don't people realize that attention to data quality is the core differentiating feature that led companies like OpenAI to their market dominance in the first place?



Is it (I am not a worker in this space, so genuine question)?

My thoughts - I teach myself all the time. Self reflection with a loss function can lead to better results. Why can't the LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious and perhaps only route to general intelligence.

We as humans can recognize botnets. Why wouldn't the LLM? Sort of in a hierarchical boost - learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do they have to re-ask a question - that should be part of the learning/loss function. How often do they copy the text into their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of ChatGPT:" should result in lower scores and suppression of that kind of thing.

I dream of the day where I have a local LLM (i.e. individualized, I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a stack overflow q/a that is just "this has already been answered" (just show me where it was answered), rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense who cares. If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.



Anyone care to explain the purpose of Levine's https://www.web.sp.am site? Are the names randomly generated? Pardon my ignorance.

This is the type of stuff the news organisations should be publishing about "AI". Instead I keep reading or hearing people referring to training data with phrases like, "The sum of all human knowledge..." Quite shocking anyone would believe that.



A "honeypot" is a system designed to trap unsuspecting entrants. In this case, the website is designed to be found by web crawlers and to then trap them in never-ending linked sites that are all pointless. Other honeypots include things like servers with default passwords designed to be found by hackers so as to find the hackers.


I would presume the crawlers have a queue-based architecture with thousands of workers. It’s an amplification attack.

When a worker gets a webpage for the honeypot, it crawls it, scrapes it, and finds X links on the page where X is greater than 1. Those links get put on the crawler queue. Because there’s more than 1 link per page, each worker on the honeypot will add more links to the queue than it removed.

Other sites will eventually leave the queue, because they have a finite number of pages so the crawlers eventually have nothing new to queue.

Not on the honeypot. It has a virtually infinite number of pages. Scraping a page will almost deterministically increase the size of the queue (1 page removed, a dozen added per scrape). Because other sites eventually leave the queue, the queue eventually becomes just the honeypot.

OpenAI is big enough this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole digit percentage. The author said 1.8M requests; I don’t know the duration, but that’s equivalent to 20 QPS for an entire day. Not a crazy amount, but not insignificant. It’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
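A back-of-the-envelope sketch of that amplification (the numbers are invented; each honeypot page is assumed to add a dozen links to the queue):

    # Toy model of a crawl queue that has discovered an endless link farm.
    from collections import deque

    LINKS_PER_PAGE = 12
    queue = deque(["https://www.web.sp.am/"])  # seed page
    crawled = 0

    while crawled < 10_000:  # fixed fetch budget
        queue.popleft()
        crawled += 1
        queue.extend(f"hypothetical-page-{crawled}-{i}" for i in range(LINKS_PER_PAGE))

    print(f"after {crawled:,} fetches, {len(queue):,} URLs still queued")
    # after 10,000 fetches, 110,001 URLs still queued -- and growing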



While the other comments are correct, I was alluding to a more subtle attack where you might try to indirectly influence the training of an LLM. Effectively, if OpenAI is crawling the open web for data to use for training, then if they don't handle sites like this properly their training dataset could be biased towards whatever content the site contains. Now in this instance this website was clearly not set up to target an LLM, but model poisoning (e.g. to insert backdoors) is an active area of research at the intersection of ML and security. Consider as a very simple example the tokenizer of previous GPTs that was biased by reddit data (as mentioned by other comments).


In this case there are >6bn pages with roughly zero value each. That could eat a substantial amount of time. It's unlikely to entirely trap a crawler, but a dumb crawler (as is implied here) will start crawling more and more pages, becoming very apparent to the operator of this honeypot (and therefore identifying new crawlers), and may take up more and more share of the crawl set.


What has it got to do with deduplication? I'm talking about crafting some kind of alternative (not necessarily duplicate) data. I agree some kind of post data collection cleaning/filtering of the data before training could potentially catch it. But maybe not!


The funny way to do this would be to use an LLM to generate the content you respond with. Have 2 smallish LLMs talk to each other about topics chosen at random and generate infinite nonsense pages that have a few hundred words.