How to (try to) block #IA scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitors](darkvisitors.com/)

And:

[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)

You can also add #VLC's `robots.txt` to the list:

[VLC robots.txt](videolan.org/robots.txt)
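
For illustration, a tailored `robots.txt` built from those lists might look like the minimal sketch below. The user-agent tokens shown (GPTBot, CCBot, Google-Extended) are just a small, documented subset; the resources above are the maintained sources to draw the full set from:

```
# Block a few well-known AI crawlers / training opt-out tokens
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

# Leave everyone else unaffected
User-agent: *
Disallow:
```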

Source: **Bloquer les gaveurs d'IA** ("Blocking the AI force-feeders") by @lord (in French)

lord.re/fast-posts/76-bloquer-…

#ChatGPT #IA #Robots #Scraper

in reply to AGR Risk Intelligence

`robots.txt` recommendations are unfortunately quite prone to cargo-culting. I find ai.robots.txt generally misleading: many of the UAs it lists aren't used to train generative AI.

AdsBot-Google, for instance, doesn't do anything if you don't use Google Ads; Google uses the Google-Extended token to control training of its GenAI offerings.
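
A sketch of that distinction in `robots.txt` terms: this keeps ordinary Search crawling while opting out of AI training. Note that Google-Extended is a documented control token honored by Google's existing crawlers, not a separate bot:

```
# Regular Google Search crawling stays allowed
User-agent: Googlebot
Disallow:

# Opt out of use for Google's GenAI training
User-agent: Google-Extended
Disallow: /
```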

For img2dataset, it's better to use a `noai` robots meta directive and still allow it to crawl: that way the opt-out is actually seen, even if it stumbles upon a cached copy of your page.
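
A sketch of that opt-out, assuming you can add a meta tag to your pages; `noai` (and `noimageai`) are non-standard directives that tools like img2dataset honor voluntarily, unlike standard values such as `noindex`:

```html
<!-- Non-standard opt-out honored by img2dataset and similar tools;
     the page remains crawlable, so the directive is actually seen -->
<meta name="robots" content="noai, noimageai">
```

For non-HTML resources such as images, the same directive can be sent as an `X-Robots-Tag: noai` HTTP response header instead.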