Items tagged with: Scraper


Anybody know anything about the following User Agent strings?

  • ReplicantReaderBot: “Replicant” isn’t an entirely unique brand name. I hope this is unrelated to the Replicant LLM chatbots. If it is, is it used to train or is it just a client of the chatbots?
  • ArenaBot/1.0 (+https://arena.im/bot/; contact@arena.im) (page is a 404; is this used to train LLMs or does an LLM use this as a client to fetch data?)
  • SocialBeeAgent: again, used to train LLMs or a client of an LLM?
  • Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5 Tencent/BrandProtection. Does this obey robots.txt, or am I gonna have to add another Nginx rule? I normally block brand-protection bots (a sample rule is sketched after this list).
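
For reference, a minimal sketch of the kind of Nginx rule I mean: a case-insensitive map on the User-Agent header. The matched substrings are my own guesses based on the strings above, not confirmed bot tokens:

# Hypothetical snippet; the map goes in the http{} block
map $http_user_agent $blocked_ua {
    default                0;
    ~*ReplicantReaderBot   1;
    ~*ArenaBot             1;
    ~*SocialBeeAgent       1;
    ~*BrandProtection      1;
}

server {
    # ... existing listen/server_name directives ...
    if ($blocked_ua) {
        return 403;
    }
}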

#bot #scraper #LazyWeb


Another new LLM scraper just dropped: Ai2 Bot.

First-party documentation does not list any way to opt out other than filtering the user agent on your server or firewall. The docs list the following User-Agent string to filter:

Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That appears to be for Ai2’s Dolma dataset.

159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.

I recommend adding ai2bot (case-insensitively, to catch both variants) to your server’s user-agent matching rules if you don’t want to be in the Dolma public dataset; unlike Common Crawl, this dataset seems tailored specifically to LLM training, with few other consumers.
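
A minimal Nginx sketch of such a rule, assuming the same map-on-User-Agent approach as above; a case-insensitive substring match on ai2bot catches both variants:

# Hypothetical snippet; the map goes in the http{} block
map $http_user_agent $is_ai2bot {
    default     0;
    ~*ai2bot    1;  # matches both "AI2Bot" and "Ai2Bot-Dolma"
}

server {
    if ($is_ai2bot) {
        return 403;
    }
}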

#Scraper


New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /
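
If you want to cover the old token too while the rename shakes out, keeping the FacebookBot entry alongside the new one should be harmless. This combined sketch is my own suggestion, not Meta’s official guidance:

User-Agent: FacebookBot
Disallow: /

User-Agent: meta-externalagent
Disallow: /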

Official references:

#RobotsTxt #Scraper


How to (try to) block #IA scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitors](darkvisitors.com/)

And:

[ai.robots.txt](github.com/ai-robots-txt/ai.ro…)

You can also add #VLC’s `robots.txt` to the list:

[VLC robots.txt](videolan.org/robots.txt)
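
Merged, the result might look something like this excerpt (illustrative only; the tokens below are a few widely known AI crawlers, not the full lists from the sources above):

User-Agent: GPTBot
Disallow: /

User-Agent: CCBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

User-Agent: meta-externalagent
Disallow: /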

Source: **Bloquer les gaveurs d'IA** (“Blocking the AI force-feeders”) by @lord (in French)

lord.re/fast-posts/76-bloquer-…

#ChatGPT #IA #Robots #Scraper