I fixed some bugs in the #FediSearch crawler and once again cleared the whole index. It will take about two days to fill it back up...
:fedisearch:
News:
1️⃣ Fixed Mastodon API pagination
2️⃣ It now respects the #nobot tag in the bio
3️⃣ Added the robots-parser library for full compliance with the robots.txt specification (UA: "FediCrawl/1.0")
4️⃣ It now fully respects the Mastodon noindex option.
5️⃣ I also added a page about opting out: https://fedisearch.skorpil.cz/optout
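
To illustrate what these opt-out checks might look like, here is a minimal sketch in Python. It uses the standard library's `urllib.robotparser` instead of the robots-parser library mentioned above, and the function names (`is_allowed`, `bio_has_nobot`, `should_index`) are hypothetical. It assumes the account data is a dict with an HTML `note` field and a boolean `noindex` field, as in the Mastodon Account entity; how the real crawler combines these checks is not shown in this post.

```python
import html
import re
from urllib.robotparser import RobotFileParser

USER_AGENT = "FediCrawl/1.0"  # user agent string quoted in the post


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def bio_has_nobot(bio_html: str) -> bool:
    """Detect a #nobot hashtag in an HTML bio; tags are stripped before matching."""
    text = html.unescape(re.sub(r"<[^>]+>", "", bio_html))
    return re.search(r"(?<!\w)#nobot\b", text, re.IGNORECASE) is not None


def should_index(account: dict) -> bool:
    """Hypothetical policy: skip accounts that opted out via noindex or #nobot."""
    if account.get("noindex"):
        return False
    return not bio_has_nobot(account.get("note", ""))
```

Stripping the tags first matters because a hashtag in an HTML bio is usually split across elements, so a plain text search for `#nobot` on the raw markup could miss it.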
This entry was edited (1 year ago)
in reply to NoLog.cz 🏴

@nolog A typical obstacle for crawling is a firewall and a limit on the number of TCP sessions within some period. Check for it in the logs.
in reply to Michal 🇨🇿

@michal @nolog It should open only one connection at a time to each instance. There is no parallelism.
The only thing I noticed is some kind of rate limiting on newer PeerTube instances.
And large instances often time out because of their workload.
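
A rough sketch of how a crawler could cope with rate limiting and timeouts while still making one request at a time. This is not the crawler's actual code: `fetch_with_backoff` and its parameters are hypothetical, and it assumes a rate-limited instance answers with HTTP 429 (optionally with a `Retry-After` header in seconds) and that the supplied `fetch` callable raises `TimeoutError` on a timeout.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status, headers, body)


def fetch_with_backoff(
    fetch: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Response]:
    """Call fetch() sequentially, retrying on HTTP 429 or TimeoutError.

    Returns the first response that is not a 429, or None after max_attempts.
    """
    for attempt in range(max_attempts):
        headers: Dict[str, str] = {}
        try:
            status, headers, body = fetch()
            if status != 429:
                return status, headers, body
        except TimeoutError:
            pass  # treat a timeout like a rate limit and retry later
        if attempt == max_attempts - 1:
            break  # no point in sleeping after the last attempt
        # Exponential backoff: base_delay, 2x, 4x, ...
        delay = base_delay * (2 ** attempt)
        # Assumption: Retry-After, if present, is a whole number of seconds.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = max(delay, float(retry_after))
        sleep(delay)
    return None
```

Because each retry waits for the previous attempt to finish, this keeps the "one connection at a time" behaviour described above.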
in reply to Štěpán Škorpil :skorpil_cz:

@michal @nolog Actually, there is one bit of parallelism. It used to push a huge batch of data to Elastic at once. Although it worked, I added some limiting to how feed data is stored.
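
One simple way to limit this kind of write is to split the data into fixed-size batches and store them one after another. The helper below is only an illustration: `chunked` is a hypothetical name, and the batch size and the actual Elastic client call are left out because they are not described here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch
```

Each yielded batch would then go into its own bulk request, with the next one sent only after the previous one has finished.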
in reply to Štěpán Škorpil :skorpil_cz:

@nolog "It should", but does it really? For example, the Mastodon app in Chrome isn't closing connections properly. Does your script close TCP connections properly? Are you using HTTP/1.1 keep-alive?
in reply to Štěpán Škorpil :skorpil_cz:

Another reason is:
{"error":"Search queries pagination is not supported without authentication"}
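
A crawler could detect this response and stop paginating instead of retrying. A minimal sketch, assuming the error arrives as a JSON body exactly like the one above; `is_pagination_auth_error` is a hypothetical helper, not part of any library.

```python
import json

PAGINATION_AUTH_ERROR = (
    "Search queries pagination is not supported without authentication"
)


def is_pagination_auth_error(body: str) -> bool:
    """Return True if body is the JSON error quoted above."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # not JSON at all, so some other kind of failure
    return isinstance(payload, dict) and payload.get("error") == PAGINATION_AUTH_ERROR
```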
in reply to Štěpán Škorpil :skorpil_cz:

Every instance with a fresh #Mastodon version. It was merged into the main branch a month ago.

https://github.com/mastodon/mastodon/pull/19326/commits/85b9310440d9dafc468e65c7a7326855f63958ca