Items tagged with: scraper
Anybody know anything about the following User Agent strings?
- ReplicantReaderBot: “Replicant” isn’t an entirely unique brand name. I hope this is unrelated to the Replicant LLM chatbots. If it is, is it used to train or is it just a client of the chatbots?
- ArenaBot/1.0 (+https://arena.im/bot/; contact@arena.im) (page is a 404; is this used to train LLMs or does an LLM use this as a client to fetch data?)
- SocialBeeAgent: again, used to train LLMs or a client of an LLM?
- Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5 Tencent/BrandProtection. Does this obey robots.txt or am I gonna have to add another Nginx rule? I normally block brand-protection bots.
Another new LLM scraper just dropped: Ai2 Bot.
First-party documentation does not list any way to opt out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:
Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)
My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)
That appears to be for Ai2’s Dolma product.
159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.
I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this seems tailored specifically for training LLMs, with few other uses.
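For anyone doing the matching in application code rather than in the web server, here is a minimal sketch of a case-insensitive user-agent check. The names `BLOCKED_UA_TOKENS` and `is_blocked` are made up for illustration; the only token taken from the post is `ai2bot`, and matching it case-insensitively is what lets one rule cover both the documented `AI2Bot` string and the `Ai2Bot-Dolma` string seen in the logs.

```python
import re

# Hypothetical blocklist for illustration; "ai2bot" is the token from the post.
BLOCKED_UA_TOKENS = ["ai2bot"]

# One case-insensitive pattern that matches any blocked token as a substring.
BLOCKED_UA_RE = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_UA_TOKENS),
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    return BLOCKED_UA_RE.search(user_agent) is not None
```

A request handler could call `is_blocked(request_user_agent)` and answer with a 403 when it returns True. This only filters crawlers that send an honest User-Agent header.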
Dolma | Ai2
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
allenai.org
New scraper just dropped (well, an old scraper was renamed):
Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:
User-Agent: meta-externalagent
Disallow: /
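If you want to sanity-check a rule like this before deploying it, a rough sketch using Python's standard `urllib.robotparser` follows. The helper name `allowed` and the `example.com` URL are placeholders, and this only shows how a compliant parser reads the rule; it says nothing about whether the crawler actually honors it.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-Agent: meta-externalagent
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The new token is disallowed everywhere; a rule for it alone does not
# cover the older FacebookBot token, which would need its own entry.
print(allowed("meta-externalagent", "https://example.com/post"))
print(allowed("FacebookBot", "https://example.com/post"))
```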
Official references:
- Facebook developer documentation for FacebookBot no longer mentions GenAI.
- Facebook developer documentation for web crawlers, including Meta-ExternalAgent, mentions “AI”.
Meta Web Crawlers - Sharing - Documentation - Meta for Developers
This page lists the User Agent (UA) strings that identify Meta’s most common web crawlers and what each of those crawlers are used for.
developers.facebook.com
How to (try to) block #IA scrapers with a tailored `robots.txt`:
You may want to use:
[Dark Visitor](darkvisitors.com/)
And:
[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)
Also worth adding to the list, the `robots.txt` used by #VLC:
[VLC robots.txt](videolan.org/robots.txt)
Source: **Bloquer les gaveurs d'IA** ("Blocking the AI force-feeders") by @lord (in French)
lord.re/fast-posts/76-bloquer-…
Bloquer les gaveurs d'IA (Blocking the AI force-feeders)
You have a nice little website with your own little hand-written content. It's your blog, your space for reflection, your creative zone, your very own space shared with the world, your offspring… It's really great, but now, in 2024, it…
/home/lord
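As a rough illustration of what a tailored `robots.txt` amounts to, the sketch below builds one group that disallows everything for a list of crawler tokens. The function name `build_robots_txt` is hypothetical, and the two tokens are just examples mentioned elsewhere in this feed, not the contents of the lists linked above.

```python
def build_robots_txt(agents: list[str]) -> str:
    """Build a single robots.txt group that disallows all paths for each agent."""
    if not agents:
        # Avoid emitting a Disallow rule with no User-agent line above it.
        return ""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# Example tokens only; a maintained list would be much longer.
print(build_robots_txt(["AI2Bot", "meta-externalagent"]), end="")
```

Several `User-agent` lines followed by one `Disallow` form a single group, so the rule applies to every listed crawler. As the "(try to)" in the post suggests, this only helps with crawlers that choose to read and respect `robots.txt`.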