Another new LLM scraper just dropped: Ai2 Bot.
First-party documentation does not list any way to opt out other than filtering the user-agent on your server or firewall. The docs list the following User-Agent to filter:
Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:

Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That appears to be for Ai2’s Dolma product.
159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.
I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this seems tailored specifically for training LLMs with few other users.
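To see whether the crawler has already hit your server, you can grep your access logs for the user-agent and count hits per client IP. A sketch — the sample lines below stand in for a real combined-format access log (on a live server you would read something like /var/log/nginx/access.log instead; path and format vary by setup):

```shell
# Sample lines standing in for real access-log entries:
log_sample='174.174.51.252 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"
198.51.100.9 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"'

# Keep only Ai2 user-agents (case-insensitive), then count hits per IP:
printf '%s\n' "$log_sample" \
  | grep -i 'ai2bot' \
  | awk '{print $1}' | sort | uniq -c | sort -rn
```

This prints one line per client IP with its hit count; only the Ai2Bot-Dolma line survives the grep here.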
Dolma | Ai2
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. (allenai.org)
Seirdy
in reply to Seirdy:
Include that in the http section to make the variable globally available to all server blocks, then use an if directive in your server block to return 403 (or 444 if they’re causing some layer-7 load) if the variable is true.

How to block specific user agents on Nginx web server (www.xmodulo.com)
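The snippet this reply refers to wasn’t preserved in the thread; a plausible reconstruction using nginx’s map module, with the variable name as an assumption:

```nginx
# In the http block: map the User-Agent onto a flag variable.
# $blocked_agent is an assumed name; the original snippet isn't shown.
map $http_user_agent $blocked_agent {
    default     0;
    ~*ai2bot    1;   # case-insensitive, so it covers AI2Bot and Ai2Bot-Dolma
}

# Then in each server block:
server {
    listen 80;
    server_name example.com;

    if ($blocked_agent) {
        return 403;   # or 444 to close the connection without a response
    }
}
```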
Seirdy
in reply to Seirdy:
Seirdy’s Home

Slatian
in reply to Seirdy:
Serving suggestion for Caddy:
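The Caddy snippet itself didn’t survive extraction; a sketch of the kind of Caddyfile this reply describes, using the header matcher syntax mentioned below (site address and matcher name are assumptions):

```caddyfile
example.com {
	# Named matcher: repeated header lines for the same field match any of them.
	@aibots {
		header User-Agent *AI2Bot*        # documented user-agent
		header User-Agent *Ai2Bot-Dolma*  # user-agent seen in server logs
	}
	respond @aibots 403
}
```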
(Make sure the bot actually is in your robots.txt)
Future expansion is as simple as adding another header User-Agent *Other-AI-Bot* directive. @Seirdy