

I fixed some bugs in the #FediSearch crawler and once again cleared the whole index. It will take about two days to fill it back up...
:fedisearch:
News:
1️⃣ Fixed Mastodon API pagination
2️⃣ It now respects the #nobot tag in the bio
3️⃣ Added the robots-parser library for full compliance with the robots.txt specification (ua: "FediCrawl/1.0")
4️⃣ It now fully respects the Mastodon noindex option.
5️⃣ I also added a page about opting out: fedisearch.skorpil.cz/optout
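
For illustration only, here is a minimal sketch of what a #nobot check on a profile bio could look like. This is not the crawler's actual code: the helper name `hasNoBotTag` is hypothetical, and it assumes the bio arrives as HTML, the way the Mastodon API returns it in the account's `note` field.

```typescript
// Hypothetical helper, not taken from the FediSearch crawler.
// Returns true when a profile bio contains the #nobot hashtag.
function hasNoBotTag(bioHtml: string): boolean {
  const text = bioHtml
    // Turn paragraph ends and line breaks into spaces so words stay separated.
    .replace(/<\/p>|<br\s*\/?>/gi, " ")
    // Drop all remaining tags; Mastodon wraps hashtags in <a> and <span>.
    .replace(/<[^>]*>/g, "");
  // Match #nobot as a whole hashtag, case-insensitively.
  return /(^|[^\w#])#nobot(?!\w)/i.test(text);
}
```

A crawler could call such a check before indexing an account and skip the account when it returns true.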
This entry was edited (2 years ago)

Archos reshared this.

in reply to Štěpán Škorpil

But somehow it still doesn't index the whole instance user directory 😢
in reply to NoLog.cz 🏴

@nolog A typical obstacle for crawling is a firewall and a limit on the number of TCP sessions within some period. Check the logs.
in reply to Michal 🇨🇿

@michal @nolog It should open only one connection at a time to the instance. There is no parallelism.
The only thing I noticed is some kind of rate limiting on newer PeerTube instances.
And large instances often time out because of their workload.
in reply to Štěpán Škorpil

@michal @nolog Actually, there is one bit of parallelism. It used to push a huge batch of data to Elastic at once. Although it worked, I added some limiting to the feed data storing.
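
A rough sketch of that kind of limiting, under assumptions: the names `chunk` and `storeInBatches` and the `store` callback are hypothetical and stand in for whatever actually writes to Elastic. The idea is simply to split one huge batch into smaller ones and send them one after another.

```typescript
// Hypothetical helpers, not the crawler's actual code.

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) {
    throw new RangeError("size must be at least 1");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Store the batches sequentially; `store` is an assumed callback
// that writes one batch to the index.
async function storeInBatches<T>(
  items: T[],
  size: number,
  store: (batch: T[]) => Promise<void>
): Promise<void> {
  for (const batch of chunk(items, size)) {
    // Wait for each write to finish before sending the next one.
    await store(batch);
  }
}
```

Awaiting each batch keeps only one write request in flight, which trades some speed for a steadier load on the index.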
in reply to Štěpán Škorpil

@nolog "It should", but does it really? For example, the Mastodon app in Chrome isn't closing connections properly. Does your script close TCP connections properly? Are you using HTTP/1.1 keep-alive?
in reply to Štěpán Škorpil

Another reason is:
{"error":"Search queries pagination is not supported without authentication"}
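
As a sketch only, a crawler could recognise this response and stop paginating that instance instead of retrying. The function name `isPaginationAuthError` is hypothetical, and matching on the message text is an assumption based on the error string quoted above.

```typescript
// Hypothetical helper, not the crawler's actual code.
// Returns true when a response body is the "pagination requires
// authentication" error quoted above.
function isPaginationAuthError(body: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    // Not JSON at all, so it cannot be this error.
    return false;
  }
  if (typeof parsed !== "object" || parsed === null) {
    return false;
  }
  const message = (parsed as { error?: unknown }).error;
  return (
    typeof message === "string" &&
    message.includes("pagination is not supported without authentication")
  );
}
```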
in reply to Štěpán Škorpil

Every instance with a fresh #Mastodon version. It was merged into the main branch a month ago.

github.com/mastodon/mastodon/p…