The AI bots that desperately need OSS for code training are now slowly killing OSS by overloading every site.
The curl website is now at 77TB/month, or 8GB every five minutes.
arstechnica.com/ai/2025/03/dev…
Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries
AI bots hungry for data are taking down FOSS sites by accident, but humans are fighting back.
Benj Edwards (Ars Technica)
Gerard Braad
in reply to daniel:// stenberg:// • • •What is the use of them hammering the website over and over again? They do the same for the Fedora wiki... It is not like they need to be near real-time.
Are you considering an IP block ?
Benjamin Sonntag-King
in reply to daniel:// stenberg:// • • •(I know some mastodon instances that does the same with messages 😉)
Chris [list of emoji]
in reply to daniel:// stenberg:// • • •There's always iocaine et al.
(algorithmic-sabotage.github.io…)
Putting something like that trained on source code behind your robots.txt will save you a lot of CPU (and maybe bandwidth). And if they complain, you can point to the robots.txt.
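A rough sketch of what "behind your robots.txt" could look like with nginx in front of the site (the /trap/ path, port and file locations are made up for illustration):
# well-behaved crawlers read this and stay out of /trap/
location = /robots.txt {
    root /var/www/antibot;              # contains a robots.txt with "Disallow: /trap/"
}
# anything that ignores robots.txt and wanders into /trap/ gets the garbage generator
location /trap/ {
    proxy_pass http://127.0.0.1:9000;   # placeholder: iocaine or a similar tarpit listening locally
}
Compliant crawlers never see the trap; the ones that ignore robots.txt burn their own CPU on generated garbage instead of yours.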
Trapping AI
ASRG
Oliver Vanderb
in reply to daniel:// stenberg:// • • •Time to start defending, beginning with the blog posts. Give 'em invalid data.
Wonder if an AI crawler will catch that or if it simply strips the tags away to read the text fully.
<div>
Lorem ipsum <span>and Trump and the AI bros have no balls</span> parvus principio...
</div>
/* hide the poisoned span from human readers; scrapers that strip tags still ingest it */
span {
display: none;
}
John Socks
in reply to daniel:// stenberg:// • • •Seriously, they could just offer them terabyte tapes.
But I think the problem in the future is that these things will democratize, and not always in the right way.
If high school kids are not crawling GitHub now, they will be when a Raspberry Pi can do it.
(When I was dabbling with synthetic portfolios I made use of a convenient Yahoo API to get historic stock prices.)
Lord Caramac the Clueless, KSC
in reply to daniel:// stenberg:// • • •The Tragedy of the Commons only exists when there is competition instead of cooperation. Competition is how we ruin everything by trying to grab it all before anybody else does. Cooperation is how we can give everybody whatever they need for free and still have enough for all of us.
Why train so many machine learning models that aren't all that different, which are owned and run by private enterprises, when we could instead train far fewer models that aren't owned by anybody and can be used for free?
Owen Beresford
in reply to daniel:// stenberg:// • • •I hope your project can survive.
You may remember/know that in the 2000s, thousands of DSL routers verified time against a few NTP servers, reported as "a thunderclap of traffic at the peak of every hour", so those organisations forced the setup of more local servers for the NTP service?
If the AI engineers were nice people, they could set up one of their laptops as a local relay/image of your site and poll that every second **much faster**, so the end users would get a better service.
daniel:// stenberg://
in reply to SkaveRat 🐀 • • •three years ago we were at less than 20TB/month, but there is no clear cut-off date, nor do I know exactly how much of this traffic is AI bots and not
(edit: I meant TB, not GB)
harmone
in reply to daniel:// stenberg:// • • •@skaverat An increase in traffic from 20 TB/month to 80 TB/month in 3 years seems normal to me.
Why would an AI crawl your site more than any search engine would? They all want just one copy of your site, once a month or so. If an AI visits your site on demand in real time, then that's actually just a visit from a biological human who asked their AI to get current data rather than rely on its memory of old data.
Sure, one biological human will cause more traffic than before AI. As expected.
Karl Pettersson
in reply to daniel:// stenberg:// • • •AI bots are destroying Open Access
go-to-hellman.blogspot.com
AndyK1970
in reply to daniel:// stenberg:// • • •I see the article says "cycling through residential IP addresses as proxies"
Does this imply that these AI crawlers are using botnets?
BigGrizzly
in reply to daniel:// stenberg:// • • •We collected 470K IPv4s from a botnet that was trying to get all the content from our social network; it was behaving in such a way that we could track every single request it made. Since we blocked it, the server has been working much better; it hasn't been running with such a low load for at least a year.
seenthis.net/messages/1105923
framapiaf.org/@biggrizzly/1142…
BigGrizzly (@biggrizzly@framapiaf.org)
Framapiaf
Markus Peuhkuri
in reply to daniel:// stenberg:// • • •@bagder
Corsaro Nero
in reply to daniel:// stenberg:// • • •Hi Daniel, I have read about Anubis lately, and I guess it would be a potential solution to this. Even if you make the proof of work very hard, normal people would not have a problem waiting a while if the alternative is not being able to download from entire countries :)
Also, I have IP lists of AWS, Huawei Cloud, and other big cloud providers if you are interested, as I am blocking them on my small Forgejo instance.
With nginx you can block based on string matching in the requests. That way I give most bots a 444 directly, and I have much less of a traffic problem.
I can provide you the syntax too, if you need it.
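A minimal sketch of that kind of nginx rule, assuming user-agent string matching (the bot names, port and layout below are illustrative, not Corsaro Nero's actual config):
# http context: flag known AI crawler user agents (names here are examples)
map $http_user_agent $blocked_bot {
    default        0;
    ~*GPTBot       1;
    ~*PetalBot     1;
    ~*Bytespider   1;
    ~*ClaudeBot    1;
}
server {
    listen 80;
    server_name example.org;   # placeholder
    # 444 is nginx-specific: close the connection without sending any response
    if ($blocked_bot) {
        return 444;
    }
    # ...rest of the site configuration...
}
The same map/return pattern can key on request paths or referers as well; anything that does not match simply falls through to the normal site config.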
Nova🐧✨
in reply to daniel:// stenberg:// • • •send them into the labyrinth!
arstechnica.com/ai/2025/03/clo…
Cloudflare turns AI against itself with endless maze of irrelevant facts
Benj Edwards (Ars Technica)
Dan Sugalski
in reply to daniel:// stenberg:// • • •Some folks have developed what seem to be both cheap and effective countermeasures. iocaine came across my feed this morning:
come-from.mad-scientist.club/@… chronicles.mad-scientist.club/…
and seems like it could be both extremely effective and potentially hilarious if the seed text is chosen appropriately.
A season on Iocaine - Chronicae Novis Rebus
chronicles.mad-scientist.club
Mx. Eddie R
in reply to daniel:// stenberg:// • • •*shakes fist theatrically*
Mike 🇨🇦 NuanceRhymesWithOrange
in reply to daniel:// stenberg:// • • •GitHub - mitchellkrogza/apache-ultimate-bad-bot-blocker: Apache Block Bad Bots, (Referer) Spam Referrer Blocker, Vulnerability Scanners, Malware, Adware, Ransomware, Malicious Sites, Wordpress Theme Detectors and Fail2Ban Jail for Repeat Offenders
GitHub
crazyeddie
in reply to daniel:// stenberg:// • • •Linux developers already make great use of PGP keys if they're producing installers and stuff.
Maybe we should all be required to use such a key to retrieve open source code. Keys could gain rep by appropriate use from all forges (whitelisted maybe?). Start out super painful so nobody wants to have to start over with a new key or re-gain rep. Good rep gets you standard use.
Sucks, but wtf.
Felix Palmen
in reply to daniel:// stenberg:// • • •Wow. For a few months, I was wondering why I suddenly have bandwidth issues when activating my camera in MS Teams meetings, so others can't understand me any more.
A look into my #nginx logs seems to clear it up. Bots are eagerly fetching my (partially pretty large) #poudriere build logs. 🧐 (#AI "watching shit scroll by"?)
I see GPTBot at least occasionally requests robots.txt, which I don't have so far. Other bots don't seem to be interested. Especially PetalBot is hammering my server. And there are others (bytedance, google, ...)
Now what? Robots.txt would actually *help* well-behaved bots here (I assume build logs aren't valuable for anything). The most pragmatic thing here would be to add some http basic auth in the reverse proxy for all poudriere stuff. It's currently only public because there's no reason to keep it private....
Have to admit I feel inclined to try one of the tarpitting/poisoning approaches, too. 😏
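A minimal sketch of the basic-auth idea, assuming nginx is the reverse proxy in front of the poudriere logs (realm, paths and upstream address are placeholders):
# put the poudriere build logs behind HTTP basic auth
location /poudriere/ {
    auth_basic           "build logs";
    # file created e.g. with: printf 'user:%s\n' "$(openssl passwd -apr1)" > htpasswd
    auth_basic_user_file /usr/local/etc/nginx/htpasswd;
    proxy_pass           http://127.0.0.1:8080;    # placeholder upstream serving the logs
}
Humans with the credentials can still browse the logs, while anonymous crawlers get a 401 before any large file is served.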
Dmian 🇪🇺
in reply to daniel:// stenberg:// • • •Block AI scrapers with Anubis
xeiaso.net