Excuse me, we are now living in a time where search engines strike exclusivity deals with websites, so that those websites’ results only show up on the one search engine?? Will I need to go to Google to look up Reddit posts, Bing to see movie information, and DuckDuckGo for info on ducks now?
I love our post-capitalist cyberpunk dystopia.
RE: mastodon.social/@Mojeek/112841…
Mojeek (@Mojeek@mastodon.social)
Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal
“They’re [Reddit] killing everything for search but Google,” Colin Hayhurst, CEO of the search engine Mojeek, told me on a call.


daughter of lilith
in reply to daughter of lilith • • •

Seirdy
in reply to daughter of lilith • • •Common Crawl is the closest thing we have to an open index, though it doesn’t meet your requirement of ignoring robots.txt for corporate websites while obeying it for personal sites. Unfortunately, being open and publicly available means that people use it to train LLMs. Google did this for initial versions of Bard, so a lot of sites block its crawler. Most robots.txt guides for blocking GenAI crawlers include an entry for it now.
Common Crawl powers Alexandria Search and was the basis of Stract’s initial index, both of which are upstart FOSS engines.
A similar EU-focused project is OpenWebSearch/Owler.
Originally posted on seirdy.one (POSSE).

Common Goals with Common Crawl - Open Web Search – Promoting Europe’s Independence in Web Search
ows.eu

Matija Šuklje
in reply to Seirdy • • •@Seirdy, do you have or know of a more nuanced analysis of Common Crawl (compared with OWS)?
I recently blocked it for my blog together with (other) GenAI crawlers, but I'm on the fence and would like to re-evaluate with more info.
@marta
Seirdy
in reply to Matija Šuklje • • •@hook CC is an index for anybody to use. This reduces the barrier to creating alternatives to other indexes, like those owned by Google/Bing/Yandex. On one hand, this makes upstart engines and research like the Web Data Commons possible; on the other hand, it lets all sorts of bad actors (non-consensual GenAI, for instance) index your pages easily.

If Common Crawl embedded information from the upcoming W3C TDM Reservation Protocol in each site/page, this would be partially resolved; bad actors could ignore it, but well-behaved actors would at least know what they’re welcome to do with each page. Right now, they don’t even know, so they can’t comply with your preferred rules easily; all they have are X-Robots tags, which are a bit limited.
I think that the CC does more good than harm on my site, so I allow it. I can’t speak for everyone else.
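(Purely as an illustration of the TDM Reservation Protocol mentioned above: going by my reading of the TDMRep draft’s /.well-known/tdmrep.json mechanism, a per-site reservation file could look roughly like the sketch below. The paths and policy URL are invented, and the field names reflect my understanding of the draft, not a confirmed final spec.)

```python
# Hedged sketch of TDMRep-style signals, based on the draft's
# /.well-known/tdmrep.json mechanism; all values here are made up.
import json

tdmrep = [
    {
        "location": "/blog/*",         # path pattern this rule covers
        "tdm-reservation": 1,          # 1 = text-and-data-mining rights reserved
        "tdm-policy": "https://example.com/tdm-policy.json",  # hypothetical policy URL
    },
    {
        "location": "/press/*",
        "tdm-reservation": 0,          # 0 = no reservation; mining is fine here
    },
]

# Would be served at https://example.com/.well-known/tdmrep.json
print(json.dumps(tdmrep, indent=2))

# Compare with the blunter signal mentioned above: an X-Robots-Tag response
# header only speaks to indexing/archiving, not to what a crawler may do with the text.
print("X-Robots-Tag: noindex, noarchive")
```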
Seirdy
Unknown parent • • •

LisPi
Unknown parent • • •@Jessica @404mediaco @Mojeek @privateger That's pretty much what I expected, yes.
Or for them to completely destroy what remains of old.reddit and only have a captcha-intensive, basically mobile-only site.
Seirdy
in reply to LisPi • • •