People used to have extended family around; nowadays they don't. Parenting is easier if you have five family members within walking distance who also have similarly aged kids who can all play together.
I’d put the responses I give to those chat popups so many sites have in the same category. They show a person saying to me “Can I help you with anything today?”, so I always send back “No”.
It also has a problem with quantities, so it gets confused by things like the cube root of 750 l, which it maintains for a long time is around 9 m. It even suggests that 1 l is equal to 1 m³.
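For reference, a quick back-of-the-envelope check of the conversion the model gets wrong there (a minimal Python sketch, assuming the question is the side length of a cube that holds 750 litres):

```python
# 1000 litres = 1 m³ (not 1 litre = 1 m³), so 750 l = 0.75 m³.
volume_l = 750
volume_m3 = volume_l / 1000
side_m = volume_m3 ** (1 / 3)  # cube root
print(f"{volume_l} l = {volume_m3} m³, cube side ≈ {side_m:.3f} m")
# -> 750 l = 0.75 m³, cube side ≈ 0.908 m, nowhere near 9 m
```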
With Wittgenstein I think we see that "hallucinations" are a part of language in general, albeit one I could see being particularly vexing if you're trying to build a perfectly controllable chatbot.
> I've wondered if one could train an LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.

The problem is that curating data is slow and expensive, while downloading the entire web is fast and cheap. See also https://en.wikipedia.org/wiki/Cyc
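A minimal sketch of what curated training data that "models the behaviour of not knowing" might look like, assuming a simple prompt/response fine-tuning format (all examples here are hypothetical):

```python
# Hypothetical curated fine-tuning set: in-scope questions get factual
# answers, while out-of-scope questions get an explicit refusal, so the
# model actually sees the behaviour of not knowing during training.
curated_examples = [
    {"prompt": "What year was the Eiffel Tower completed?",
     "response": "1889."},
    {"prompt": "What is the boiling point of water at sea level?",
     "response": "100 °C (212 °F)."},
    {"prompt": "What did the curator eat for breakfast today?",
     "response": "I don't know; that is not in my curated sources."},
    {"prompt": "What will this stock be worth next week?",
     "response": "I don't know; I can't verify that from my sources."},
]
```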
No, it won’t work. This is not a brain. The best analogy is an English major: they are good at language, not reasoning. Humans see language and assume reasoning behind it; it seems we can’t separate the two.
I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?
All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.
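If you want to check that behaviour yourself, here is a small sketch using Python's standard-library robots.txt parser; the file contents below are a made-up illustration of "GPTBot lines commented out", not the site's actual robots.txt:

```python
from urllib import robotparser

# Hypothetical robots.txt in which the GPTBot rules are commented out;
# the parser never sees them, so they block nothing.
robots_txt = """\
# User-agent: GPTBot
# Disallow: /
User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("GPTBot", "https://example.com/some/page"))  # True
```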
This is a moderately persuasive argument, although the crawler should probably ignore the HTML body entirely. But it does feel like a grey area if I accept your first point.
This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it. I'm sure the big players like Google would deal with it gracefully.
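For a sense of what that looks like server-side, here is a minimal tarpit sketch in Python (stdlib only, single connection at a time, purely illustrative): it sends valid headers and then trickles the body out one byte per second.

```python
import socket
import time

HOST, PORT = "127.0.0.1", 8080  # hypothetical bind address

def tarpit():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            with conn:
                conn.recv(4096)  # read and discard the request
                body = b"<html><body>slow page</body></html>"
                headers = (
                    "HTTP/1.1 200 OK\r\n"
                    "Content-Type: text/html\r\n"
                    f"Content-Length: {len(body)}\r\n\r\n"
                ).encode()
                conn.sendall(headers)
                for b in body:            # iterate byte by byte
                    conn.sendall(bytes([b]))
                    time.sleep(1)         # one byte per second

if __name__ == "__main__":
    tarpit()
```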
That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.
This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation, which will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative whose various efforts now date back multiple decades. Notice how he casually drops a link to the landing page in the NANOG message; that's how the bots take the bait.
It's for shits and giggles, and it's doing its job really well right now. Not everything needs an economic purpose, 100 trackers, ads, and a company behind it.
Am I the only one who was hoping—even though I knew it wouldn’t be the case—that OpenAI’s server farm was infested with actual spiders and they were getting into other people’s racks?
That’s the whole point. The site owner doesn’t want their information included in ChatGPT; they want you going to their website to view it instead. It’s functioning exactly as designed.
It's a stretch to expect a human-initiated action to abide by robots.txt. Also, once you click on a link in Chrome, it's pretty much all robot-parsed and rendered from there as well.
> It looks like he doesn’t really care that much that they’re retrieving millions of pages

It impacts the performance for the other legitimate users of that web farm ;)
The only way out of this is robots that can go out into the world and collect data, writing down what they observe in natural language, which can then be used to train better LLMs.
Anyone care to explain the purpose of Levine's https://www.web.sp.am site? Are the names randomly generated? Pardon my ignorance. This is the type of stuff news organisations should be publishing about "AI". Instead, I keep reading or hearing people refer to training data with phrases like "the sum of all human knowledge..." Quite shocking that anyone would believe that.
https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...
Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257 distinct tokens in its vocabulary. It does make me wonder: it's certainly not a linear relationship, but given the number of inferences run on GPT-3 every day while it was the flagship model, the incremental electricity cost of these Redditors' niche hobby, versus allocating those slots in the vocabulary to genuinely common substrings in real-world text (and thus reducing average input token count), might have been measurable.
It would be hilarious if the subtitle on OP's site, "IECC ChurnWare 0.3," became a token in GPT-5 :)
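For anyone who wants to poke at this, a small sketch using the tiktoken package (assuming it is installed; "r50k_base" is the encoding usually associated with GPT-3's 50,257-token vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
print(enc.n_vocab)  # 50257

# One of the infamous anomalous strings reportedly occupies a whole
# vocabulary slot of its own:
print(enc.encode(" SolidGoldMagikarp"))       # a single token id
# An ordinary phrase of similar length splits into several common tokens:
print(enc.encode(" incremental electricity"))
```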