Another new LLM scraper just dropped: Ai2 Bot.
First-party documentation does not list any way to opt out other than filtering the user-agent on your server or firewall. The docs list the following User-Agent to filter:
Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)
My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)
That appears to be for Ai2’s Dolma product.
159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.
I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this one seems tailored specifically to training LLMs, with few other users.
Dolma | Ai2: “Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.” (allenai.org)
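A minimal nginx sketch of such a user-agent rule (the map/if/return directives are standard nginx, but the variable name and hostname here are my own illustration; the pattern matches ai2bot case-insensitively, which covers both UA strings above):

```nginx
# In the http block: set $block_ua to 1 for matching user-agents.
# "ai2bot" is illustrative; extend the pattern for other crawlers.
map $http_user_agent $block_ua {
    default      0;
    "~*ai2bot"   1;
}

# In each server block: refuse matching requests.
server {
    listen 80;
    server_name example.com;  # placeholder hostname

    if ($block_ua) {
        return 403;  # or 444 to close the connection without a response
    }
}
```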
Seirdy
in reply to Seirdy • • •
Include that in the http section to make the variable globally available to all server blocks, then use an if directive in your server block to return 403 (or 444 if they’re causing some layer-7 load) if the variable is true.
How to block specific user agents on Nginx web server (www.xmodulo.com)
Seirdy
in reply to Seirdy • • •
Seirdy’s Home
Slatian
in reply to Seirdy • • •
Serving suggestion for Caddy:
(Make sure the bot actually is in your robots.txt.)
Future expansion is as simple as adding another header User-Agent *Other-AI-Bot* directive.
@Seirdy
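The suggested Caddyfile might look roughly like this (a sketch under assumptions: the hostname is a placeholder, and the substring matcher mirrors the header User-Agent *…* form mentioned above):

```caddyfile
example.com {
	# Named matcher for unwanted crawlers; *AI2Bot* matches the
	# token anywhere in the User-Agent header.
	@aibots header User-Agent *AI2Bot*

	# Refuse matching requests before any other handler runs.
	respond @aibots 403
}
```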
Seirdy
in reply to Slatian • • •