Another new LLM scraper just dropped: Ai2 Bot.
First-party documentation does not list any way to opt-out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:
Mozilla/5.0 (compatible) AI2Bot (+<a href="https://www.allenai.org/crawler" rel="ugc">https://www.allenai.org/crawler</a>)
My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+<a href="https://www.allenai.org/crawler" rel="ugc">https://www.allenai.org/crawler</a>)
That appears to be for Ai2’s Dolma product.
159 hits came from 174.174.51.252
, a Comcast-owned IP in Oregon.
I recommend adding ai2bot
to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this seems tailored specifically for training LLMs with few other users.
Dolma | Ai2
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.allenai.org