Štěpán Škorpil

2 years ago

I fixed some bugs in the #FediSearch crawler and cleared the whole index once again. It will take about two days to fill it back up.

News:
1️⃣ Fixed Mastodon API pagination
2️⃣ It now respects the #nobot tag in the bio
3️⃣ Added the robots-parser library for full compliance with the robots.txt specification (ua: "FediCrawl/1.0")
4️⃣ It now fully respects the Mastodon noindex option
5️⃣ I also added a page about opting out: fedisearch.skorpil.cz/optout
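
The opt-out checks in points 2 to 4 can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the crawler uses the robots-parser library, while this sketch swaps in Python's standard `urllib.robotparser`. The helper names `robots_allows`, `has_nobot_tag` and `may_index` are hypothetical, and the assumption that an account record carries a `note` (bio HTML) and a boolean `noindex` field is made for the sake of the example.

```python
import re
from urllib.robotparser import RobotFileParser

USER_AGENT = "FediCrawl/1.0"  # user agent string quoted in the post


def robots_allows(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_nobot_tag(bio_html: str) -> bool:
    """Return True if the bio contains the #nobot hashtag (case-insensitive)."""
    text = re.sub(r"<[^>]+>", "", bio_html)  # drop HTML tags, keep the text
    return re.search(r"(?<!\w)#nobot(?!\w)", text, re.IGNORECASE) is not None


def may_index(account: dict) -> bool:
    """Hypothetical policy: skip accounts that opted out via #nobot or noindex."""
    if has_nobot_tag(account.get("note", "")):
        return False
    # Assumption: the account record carries a boolean "noindex" flag.
    return not account.get("noindex", False)
```

With a robots.txt that contains `User-agent: FediCrawl` followed by `Disallow: /`, `robots_allows` returns `False` for any URL on that host, so the crawler would skip the instance entirely.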

FediSearch

Search people on Fediverse

fedisearch.skorpil.cz

NoLog.cz 🏴

in reply to Štěpán Škorpil 2 years ago

But somehow it still doesn't index the whole instance user directory 😢

Michal 🇨🇿

in reply to NoLog.cz 🏴 2 years ago

@nolog A typical obstacle for crawling is a firewall and a limit on the number of TCP sessions within some period. Check the logs.

Štěpán Škorpil

in reply to Michal 🇨🇿 2 years ago

@michal @nolog It should open only one connection at a time to each instance. There is no parallelism.
The only thing I noticed is some kind of rate limiting on newer PeerTube instances.
And large instances often time out because of their workload.
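
A one-request-at-a-time crawl that tolerates rate limiting and timeouts might look like the sketch below. It is an assumption-laden illustration, not the FediSearch code: `crawl_sequentially`, the injected `fetch` callable and the retry parameters are hypothetical names, and treating HTTP 429 as the rate-limit signal is an assumption about how the instances respond.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

# fetch(url) -> (status_code, body); raises TimeoutError when the server is too slow.
Fetch = Callable[[str], Tuple[int, str]]


def crawl_sequentially(
    urls: Iterable[str],
    fetch: Fetch,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Fetch URLs one at a time, backing off on HTTP 429 and timeouts."""
    for url in urls:
        for attempt in range(max_attempts):
            try:
                status, body = fetch(url)
            except TimeoutError:
                status, body = None, ""  # treat a timeout as retryable
            if status == 200:
                yield url, body
                break
            if status is not None and status != 429:
                break  # other errors: give up on this URL without retrying
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then twice that, and so on.
                sleep(base_delay * 2 ** attempt)
```

Because the loop never starts a new request before the previous one has finished, the target instance sees at most one open request from the crawler at any moment.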

Štěpán Škorpil

in reply to Štěpán Škorpil 2 years ago

@michal @nolog Actually, there is one place with parallelism: it used to push a huge batch of data to Elasticsearch at once. Although that worked, I added some limiting to the storing of feed data.
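
Limiting the size of each write can be sketched as below. This is a minimal illustration under assumptions, not the actual change: `chunked`, `store_in_batches`, the `store_batch` callable (standing in for whatever sends one bulk request to the search index) and the default batch size of 500 are all hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def store_in_batches(
    items: Iterable[T],
    store_batch: Callable[[List[T]], None],
    batch_size: int = 500,  # hypothetical default, tune for the index
) -> int:
    """Store items in bounded batches, one after another; return the count."""
    stored = 0
    for batch in chunked(items, batch_size):
        store_batch(batch)  # one bounded write at a time
        stored += len(batch)
    return stored
```

Each call to `store_batch` finishes before the next one starts, so the index never receives one huge payload or several writes at once.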

Michal 🇨🇿

in reply to Štěpán Škorpil 2 years ago

@nolog "It should", but does it really? For example, the Mastodon app in Chrome doesn't close connections properly. Does your script close its TCP connections properly? Are you using HTTP/1.1 keep-alive?

Michal 🇨🇿

in reply to Štěpán Škorpil 2 years ago

Another reason is:
{"error":"Search queries pagination is not supported without authentication"}

Štěpán Škorpil

in reply to Michal 🇨🇿 2 years ago

@michal on which service?

Michal 🇨🇿

in reply to Štěpán Škorpil 2 years ago

Every instance running a recent #Mastodon version. It was merged into the main branch a month ago.

github.com/mastodon/mastodon/p…

Change unauthenticated search to not support pagination in REST API by Gargron · Pull Request #19326 · mastodon/mastodon

Only support search queries with 5 or more characters. Do not support queries with offset (pagination). Return HTTP 401 on truthy resolve instead of overriding to false.

GitHub

Štěpán Škorpil

in reply to Michal 🇨🇿 2 years ago

@michal I am not using this endpoint
