How to (try to) block #AI scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitors](https://darkvisitors.com/)

And:

[ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)

Also worth adding to the list: #VLC's `robots.txt`:

[VLC robots.txt](https://www.videolan.org/robots.txt)
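
As a starting point, a minimal `robots.txt` sketch along those lines (the user agents below are a small, illustrative subset; pull the current names from the lists above):

```
# Illustrative subset of AI-crawler user agents; not exhaustive.
# Consecutive User-agent lines form a group sharing the rule below.
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
```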

Source: **Bloquer les gaveurs d'IA** ("Blocking the AI force-feeders") by @lord (in French)

https://lord.re/fast-posts/76-bloquer-les-gaveurs-dia/

#ChatGPT #IA #Robots #Scraper

in reply to AGR Risk Intelligence

`robots.txt` recommendations are unfortunately quite prone to cargo-culting. I find ai.robots.txt generally misleading: many of the UAs it lists aren't used to train generative AI.

AdsBot-Google, for instance, doesn't do anything if you don't use Google Ads; the token Google actually uses to control training of its GenAI offerings is Google-Extended.
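
To make that concrete, the entry that actually governs GenAI training by Google looks like this (a sketch; how much to disallow is your call):

```
# Google-Extended is a control token, not a separate crawler:
# it tells Google not to use fetched pages to train its GenAI models.
User-agent: Google-Extended
Disallow: /
```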

For img2dataset, it's better to serve a noai robots meta directive and let it crawl: if you block it outright, it never sees your opt-out should it stumble upon a cached copy of your page, whereas serving the directive opts you out of indexing properly.
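
A sketch of that approach; note that `noai`/`noimageai` are non-standard directives that only some scrapers honor (the exact tokens img2dataset checks are an assumption here):

```html
<!-- Non-standard opt-out directives: some scrapers (img2dataset among
     them) honor these; mainstream crawlers ignore them. The same tokens
     can also be served as an X-Robots-Tag HTTP header. -->
<meta name="robots" content="noai, noimageai">
```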