Items tagged with: scraper


Anybody know anything about the following User Agent strings?

  • ReplicantReaderBot: “Replicant” isn’t an entirely unique brand name, so I hope this one is unrelated to the Replicant LLM chatbots. If it is related, is it used to gather training data, or is it just a client of the chatbots?
  • ArenaBot/1.0 (+https://arena.im/bot/; contact@arena.im): the linked page is a 404. Is this used to train LLMs, or does an LLM use it as a client to fetch data?
  • SocialBeeAgent: again, used to train LLMs, or a client of an LLM?
  • Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5 Tencent/BrandProtection: does this obey robots.txt, or am I going to have to add another Nginx rule (see the sketch after this list)? I normally block brand-protection bots.
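
In case that last one doesn’t honor robots.txt, here’s a minimal Nginx sketch for refusing it (assuming the `BrandProtection` substring is a stable marker; verify against your own logs):

```nginx
# Refuse anything identifying as Tencent's brand-protection crawler.
# Assumption: "BrandProtection" stays present in the User-Agent string.
server {
    # ... existing listen / server_name / root directives ...

    if ($http_user_agent ~* "BrandProtection") {
        return 444;  # close the connection without sending a response
    }
}
```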

#bot #scraper #LazyWeb


Another new LLM scraper just dropped: Ai2 Bot.

First-party documentation does not list any way to opt out other than filtering the user agent on your server or firewall. The docs list the following User-Agent to filter:

Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That variant appears to crawl for Ai2’s Dolma dataset.

159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.

I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to end up in the public Dolma dataset; unlike Common Crawl, it appears tailored specifically to LLM training, with few other downstream uses.
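
One way to sketch that in Nginx, using a map so the list is easy to extend (the case-insensitive `ai2bot` regex is meant to catch both UA strings quoted above):

```nginx
# Flag LLM-training crawlers by User-Agent substring, then refuse them.
# The map goes in the http {} context; add more patterns as bots appear.
map $http_user_agent $llm_scraper {
    default    0;
    ~*ai2bot   1;  # matches both "AI2Bot" and "Ai2Bot-Dolma"
}

server {
    # ... existing listen / server_name / root directives ...

    if ($llm_scraper) {
        return 403;
    }
}
```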

#Scraper


New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta changed the user agent you need to list in robots.txt to opt out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /
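
If you’d rather keep the old entry around as well (harmless, and it covers any stragglers still using the old agent), list both:

User-Agent: FacebookBot
Disallow: /

User-Agent: meta-externalagent
Disallow: /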

Official references:

#RobotsTxt #Scraper


How to (try to) block AI (#IA) scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitors](darkvisitors.com/)

And:

[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)

Also worth adding to the list: the `robots.txt` used by #VLC:

[VLC robots.txt](videolan.org/robots.txt)
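
As a seed (the lists linked above are far more complete), a tailored `robots.txt` built from a few widely documented AI-crawler tokens might look like:

User-Agent: GPTBot
Disallow: /

User-Agent: ChatGPT-User
Disallow: /

User-Agent: CCBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

Keep in mind this only deters crawlers that actually honor robots.txt.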

Source: **Bloquer les gaveurs d'IA** (“Blocking the AI force-feeders”) by @lord (in French)

lord.re/fast-posts/76-bloquer-…

#ChatGPT #IA #Robots #Scraper