The AI bots that desperately need OSS for code training are now slowly killing OSS by overloading every site.

The curl website is now at 77TB/month, or 8GB every five minutes.

arstechnica.com/ai/2025/03/dev…


in reply to daniel:// stenberg://

There's always iocaine et al.

(algorithmic-sabotage.github.io…)

Putting something like that trained on source code behind your robots.txt will save you a lot of CPU (and maybe bandwidth). And if they complain, you can point to the robots.txt.
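
A minimal sketch of how that could be wired up in nginx (the decoy path, port, and server name are placeholders, not anything iocaine actually requires):

```nginx
# Sketch only, not a drop-in config: serve a robots.txt that disallows a
# decoy path, then route anything that ignores the disallow to a locally
# running garbage generator such as iocaine. Path, port and server name
# are placeholders.
server {
    listen 80;
    server_name example.org;

    location = /robots.txt {
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /maze/\n";
    }

    # Crawlers that respect robots.txt never come here; the rest get
    # endless generated text instead of the real content.
    location /maze/ {
        proxy_pass http://127.0.0.1:4242;   # wherever the tarpit listens
    }
}
```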

in reply to daniel:// stenberg://

seriously, they could just offer them terabyte tapes.

But I think the problem in the future is that these things will democratize, and not always in the right way.

If high school kids are not crawling GitHub now, they will be when a Raspberry Pi can do it.

(When I was dabbling with synthetic portfolios I made use of a convenient Yahoo API to get historic stock prices.)

in reply to daniel:// stenberg://

This is what happens when there is competition where there should be cooperation. AI research and development could be, _should_ be a collaborative project, not owned by anybody and open to everybody, but instead it's a bunch of corporations trying to outrun each other.
The Tragedy of the Commons only exists when there is competition instead of cooperation. Competition is how we ruin everything by trying to grab it all before anybody else does. Cooperation is how we can give everybody whatever they need for free and still have enough for all of us.
Why train so many machine learning models that aren't all that different, which are owned and run by private enterprises, when we could instead train much fewer models that aren't owned by anybody and can be used for free?
in reply to daniel:// stenberg://

I hope your project can survive.

Remember how in the 2000s thousands of DSL routers verified time against a few NTP servers, which was reported as "a thunderclap of traffic on the peak of every hour"? So the organisations forced the setup of more local servers for the NTP service.

If the AI engineers were nice people, they could set up one of their laptops as a local relay/image of your site and poll that every second, **much faster**, so the end users would get a better service.

in reply to daniel:// stenberg://

@skaverat An increase in traffic from 20 TB/month to 80 TB/month in 3 years seems normal to me.

Why would an AI crawl your site more than any search engine would? They all want just one copy of your site once/month or so. If an AI visits your site on demand in real time then that's actually just a visit from a biological human that asked their AI to get current data and not rely on its memory of old data.

Sure, one biological human will cause more traffic than before AI. As expected.

in reply to daniel:// stenberg://

We collected 470K IPv4s from a botnet that was trying to get all the content from our social network; it was behaving in such a way that we could track every single request it made. Since we blocked it, the server has been working much better; it hasn't been running with such a low load for at least a year.

seenthis.net/messages/1105923

framapiaf.org/@biggrizzly/1142…

in reply to daniel:// stenberg://

Recently I had an issue where a service I run for a friend had degraded performance out of the blue at a critical point in time for her. I never could fully nail down the issue, since my monitoring still sucks in that department, but I am reasonably sure my server was DoS'd, either maliciously because I pissed someone off the day before, or accidentally by the ever-growing traffic on my WordPress, which is hosted on the same machine... 🤮 I need to build up one of those AI mazes...
in reply to daniel:// stenberg://

Find which bots are the offenders and send them an invoice for the bandwidth. “Your crawler consumed XX TB this past month. This is a small project and we can’t sustain this level of access by just one party, so here’s an invoice for the consumed bandwidth. If you don’t stop your crawling we will be forced to keep sending these invoices. Regards” 😆
in reply to daniel:// stenberg://

Hi Daniel, I read about Anubis recently, and I guess it would be a potential solution to this. Even if you make the proof of work very hard, normal people would not have a problem waiting a bit if the alternative is not being able to download from entire countries :)

Also, I have IP lists for AWS, Huawei Cloud, and other big cloud providers if you are interested, as I am blocking them on my small Forgejo instance.

With nginx you can block based on string matching in the requests. That way I give most bots a 444 directly, and I have much less of a traffic problem.
I can provide you the syntax too, if you need it.
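
Roughly, it looks like this (an untested sketch; the user-agent list and backend address are just examples):

```nginx
# Rough sketch of user-agent based blocking in nginx; the bot list and
# the backend address are examples only.
map $http_user_agent $blocked_bot {
    default        0;
    ~*GPTBot       1;
    ~*PetalBot     1;
    ~*Bytespider   1;
    ~*ClaudeBot    1;
}

server {
    listen 80;
    server_name git.example.org;

    # 444 is nginx-specific: close the connection without sending a response.
    if ($blocked_bot) {
        return 444;
    }

    location / {
        proxy_pass http://127.0.0.1:3000;   # e.g. a small Forgejo instance
    }
}
```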

in reply to daniel:// stenberg://

Some folks have developed what seem to be both cheap and effective countermeasures. iocaine came across my feed this morning:

come-from.mad-scientist.club/@… chronicles.mad-scientist.club/…

and it seems like it could be both extremely effective and potentially hilarious if the seed text is chosen appropriately.

in reply to daniel:// stenberg://

AI bots were overloading my cgit instance, but github.com/mitchellkrogza/apac… helped immensely. Load levels are back to normal now.
in reply to daniel:// stenberg://

This reminds me so much of when spam took over the SMTP world. Originally, SMTP was an open system; I could even use % in an address to bounce through someone else's server. I don't know if 95% of the world's SMTP traffic is still spam, but essentially mass abuse of the system for fractional profit forced us all to fundamentally change how the system worked and accelerated the dominance of Google's Gmail etc.
in reply to daniel:// stenberg://

Linux developers already make great use of PGP keys if they're producing installers and stuff.

Maybe we should all be required to use such a key to retrieve open source code. Keys could gain rep through appropriate use across all forges (whitelisted, maybe?). Start out super painful so nobody wants to have to start over with a new key or re-gain rep. Good rep gets you standard use.

Sucks, but wtf.

in reply to daniel:// stenberg://

Wow. For a few months I was wondering why I suddenly had bandwidth issues when activating my camera in MS Teams meetings, so that others couldn't understand me any more.

A look into my #nginx logs seems to explain it. Bots are eagerly fetching my (partly pretty large) #poudriere build logs. 🧐 (#AI "watching shit scroll by"?)

I see GPTBot at least occasionally requests robots.txt, which I don't have so far. Other bots don't seem to be interested. PetalBot especially is hammering my server, and there are others (Bytedance, Google, ...).

Now what? A robots.txt would actually *help* the well-behaved bots here (I assume build logs aren't valuable for anything). The most pragmatic thing would be to add some HTTP basic auth in the reverse proxy for all the poudriere stuff. It's currently only public because there's no reason to keep it private...
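
A sketch of what that could look like in the nginx reverse proxy (the location path, htpasswd file and backend address are assumptions):

```nginx
# Sketch, assuming the poudriere build logs are proxied under /poudriere/;
# the htpasswd path and backend address are placeholders.
location /poudriere/ {
    auth_basic           "poudriere build logs";
    auth_basic_user_file /usr/local/etc/nginx/htpasswd;  # created with htpasswd(1) or similar
    proxy_pass           http://127.0.0.1:8080;
}
```

Well-behaved and misbehaving bots alike would just get a 401, and the logs stay reachable for anyone with credentials.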

Have to admit I feel inclined to try one of the tarpitting/poisoning approaches, too. 😏