Items tagged with: Scraper


Anybody know anything about the following User Agent strings?

  • ReplicantReaderBot: “Replicant” isn’t an entirely unique brand name. I hope this is unrelated to the Replicant LLM chatbots. If it is, is it used to train or is it just a client of the chatbots?
  • ArenaBot/1.0 (+https://arena.im/bot/; contact@arena.im) (page is a 404; is this used to train LLMs or does an LLM use this as a client to fetch data?)
  • SocialBeeAgent: again, used to train LLMs or a client of an LLM?
  • Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5 Tencent/BrandProtection. Does this obey robots.txt, or am I gonna have to add another Nginx rule? I normally block brand-protection bots (a sample rule is sketched after this list).
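
For reference, a minimal sketch of the kind of Nginx rule I mean: a case-insensitive map on the User-Agent header. The matched substrings are my own guesses based on the strings above, not confirmed bot tokens:

# Hypothetical snippet; the map goes in the http{} block
map $http_user_agent $blocked_ua {
    default                0;
    ~*ReplicantReaderBot   1;
    ~*ArenaBot             1;
    ~*SocialBeeAgent       1;
    ~*BrandProtection      1;
}

server {
    # ... existing listen/server_name directives ...
    if ($blocked_ua) {
        return 403;
    }
}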

#bot #scraper #LazyWeb


Another new LLM scraper just dropped: Ai2 Bot.

First-party documentation does not list any way to opt out other than filtering the user agent on your server or firewall. The docs list the following User-Agent string to filter:

Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That appears to be for Ai2’s Dolma dataset.

159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.

I recommend adding ai2bot (case-insensitively, to catch both variants) to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this dataset seems tailored specifically to LLM training, with few other consumers.
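
A minimal Nginx sketch of such a rule, assuming the same map-on-User-Agent approach as above; a case-insensitive substring match on ai2bot catches both variants:

# Hypothetical snippet; the map goes in the http{} block
map $http_user_agent $is_ai2bot {
    default     0;
    ~*ai2bot    1;  # matches both "AI2Bot" and "Ai2Bot-Dolma"
}

server {
    if ($is_ai2bot) {
        return 403;
    }
}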

#Scraper


New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /
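
If you want to cover the old token too while the rename shakes out, keeping the FacebookBot entry alongside the new one should be harmless. This combined sketch is my own suggestion, not Meta’s official guidance:

User-Agent: FacebookBot
Disallow: /

User-Agent: meta-externalagent
Disallow: /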

Official references:

#RobotsTxt #Scraper


How to (try to) block #IA scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitors](darkvisitors.com/)

And:

[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)

You can also add #VLC’s `robots.txt` to the list:

[VLC robots.txt](videolan.org/robots.txt)
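
Merged, the result might look something like this excerpt (illustrative only; the tokens below are a few widely known AI crawlers, not the full lists from the sources above):

User-Agent: GPTBot
Disallow: /

User-Agent: CCBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

User-Agent: meta-externalagent
Disallow: /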

Source: **Bloquer les gaveurs d'IA** (“Blocking the AI force-feeders”) by @lord (in French)

lord.re/fast-posts/76-bloquer-…

#ChatGPT #IA #Robots #Scraper