How to (try to) block #AI scrapers with a tailored `robots.txt`:
You may want to use:
[Dark Visitors](darkvisitors.com/)
And:
[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)
Adding to the list, the `robots.txt` used by #VLC:
[VLC robots.txt](videolan.org/robots.txt)
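As a concrete starting point, here is a minimal `robots.txt` sketch in that spirit. The user-agent tokens below (GPTBot, CCBot, ClaudeBot) are documented by OpenAI, Common Crawl, and Anthropic respectively, but the crawler landscape shifts quickly, so cross-check against the lists above:

```
# Block some well-known GenAI training crawlers.
# A sketch, not an exhaustive set; verify tokens against
# the Dark Visitors / ai.robots.txt lists above.
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Allow: /
```

Per the robots exclusion standard (RFC 9309), several `User-agent` lines can share one group of rules, which keeps the file short as the list of bots grows.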
Source: **Bloquer les gaveurs d'IA** ("Blocking the AI force-feeders") by @lord (in French)
lord.re/fast-posts/76-bloquer-…
Seirdy (in reply to AGR Risk Intelligence):

robots.txt references are unfortunately quite prone to cargo-culting. I find ai.robots.txt to be generally misleading: lots of the UAs it lists aren't used to train generative AI.
AdsBot-Google, for instance, doesn't do anything if you don't use Google Ads; Google uses Google-Extended to train its GenAI offerings.
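So to opt out of Google's GenAI training without affecting Search or Ads crawling, the rule would look something like this (Google-Extended is a documented control token in Google's crawler docs, not a separate crawler):

```
# Google-Extended controls use of content for Google's GenAI
# training; it does not affect Googlebot (Search) or AdsBot-Google.
User-agent: Google-Extended
Disallow: /
```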
For img2dataset, it's better to use a `noai` meta robots directive and allow it to crawl to properly opt out of indexing, should it stumble upon a cached copy of your page.
See also: *Blocking certain bots* (Seirdy's Home)
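A minimal sketch of that opt-out, assuming the non-standard `noai`/`noimageai` directives that img2dataset's documentation describes honoring. Note that a scraper fetching image files directly will only see the HTTP-header form (`X-Robots-Tag: noai, noimageai`), not the HTML tag:

```
<!-- Page-level opt-out: still crawlable and indexable, but flagged
     as off-limits for AI training. Directives beyond the standard
     noindex/nofollow set are honored only by cooperating scrapers. -->
<meta name="robots" content="noai, noimageai">
```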