Štěpán Škorpil

3 years ago

Yesterday I deleted the whole #FediSearch index and started crawling the #fediverse from scratch.

So, many new accounts should be discoverable now.

fedisearch.skorpil.cz

FediSearch

Search people on Fediverse

^{fedisearch.skorpil.cz}

in reply to Štěpán Škorpil

The index was polluted by a huge number of fake domains that all led to one badly configured Mastodon instance. This combination of problems overloaded the crawler, so even after 14 days new accounts still had not been discovered.

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

I need to improve the crawler so it can handle this situation, but I don't have time right now.
For now I have added the badly configured instance to the blacklist.

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

Remember that it uses public APIs for account discovery, which means you have to set your account as discoverable to be, well, discoverable.

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

And also remember that to be discoverable you have to fill your bio with the info you want to be discoverable by...

in reply to Štěpán Škorpil

NoLog.cz 🏴

in reply to Štěpán Škorpil 3 years ago

Is there any way to force the crawler to get the whole list of users for some instances? (Like all updated CZ instances)
I'm not sure how it works on pre-4.0, but now the user directory is limited to 80 records per page and it can't be overridden with the 'limit' argument.

in reply to NoLog.cz 🏴

Štěpán Škorpil

in reply to NoLog.cz 🏴 3 years ago

@nolog It uses this endpoint:
docs.joinmastodon.org/methods/…

directory API methods

A directory of profiles that your website is aware of.

^{docs.joinmastodon.org}

@NoLog.cz 🏴

in reply to Štěpán Škorpil

NoLog.cz 🏴

in reply to Štěpán Škorpil 3 years ago

If I understand it correctly, it sets the limit to 500 users per page, but the server has its own internal limit of 80.
So if the instance has fewer than 500 users, it only sees the first 80, and with more than 500 it probably undercounts by a lot if the same limit applies everywhere.

github.com/Stopka/fedicrawl/bl…

fedicrawl/retrieveLocalPublicUsersPage.ts at 29acce39063d1dbfbe69bab22348855ff5ca21c2 · Stopka/fedicrawl

Collect feeds to follow on Fediverse nodes. Contribute to Stopka/fedicrawl development by creating an account on GitHub.

^GitHub

in reply to NoLog.cz 🏴

Štěpán Škorpil

in reply to NoLog.cz 🏴 3 years ago

@nolog Aha, you can't get all users at once, you have to go through all the pages...

@NoLog.cz 🏴
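
The paging problem discussed above can be sketched roughly as follows. This is not the fedicrawl code, only a minimal illustration under some assumptions: the names `DirectoryAccount`, `PageFetcher`, `collectDirectory` and `mastodonFetcher` are hypothetical, the instance is assumed to expose the Mastodon directory endpoint with its `offset`, `limit` and `local` parameters, and a global `fetch` (Node 18+ or a browser) is assumed. The point is that the offset advances by the number of records actually received, so a server-side cap of 80 no longer causes skipped users.

```typescript
// Hypothetical shape: only the fields this sketch needs.
interface DirectoryAccount {
  id: string;
  acct: string;
}

// Fetches one directory page; injected so the loop can be exercised offline.
type PageFetcher = (offset: number, limit: number) => Promise<DirectoryAccount[]>;

// Walks the directory until the server returns an empty page.
// The offset grows by what was actually received, not by the requested
// limit, so a server that silently caps pages (e.g. at 80) is handled.
async function collectDirectory(
  fetchPage: PageFetcher,
  requestedLimit = 80,
  maxPages = 10000, // safety stop against a misbehaving server
): Promise<DirectoryAccount[]> {
  const accounts: DirectoryAccount[] = [];
  let offset = 0;
  for (let page = 0; page < maxPages; page++) {
    const batch = await fetchPage(offset, requestedLimit);
    if (batch.length === 0) break;
    accounts.push(...batch);
    offset += batch.length;
  }
  return accounts;
}

// Assumed fetcher for a Mastodon-compatible instance (local accounts only).
function mastodonFetcher(domain: string): PageFetcher {
  return async (offset, limit) => {
    const url = new URL(`https://${domain}/api/v1/directory`);
    url.searchParams.set("local", "true");
    url.searchParams.set("offset", String(offset));
    url.searchParams.set("limit", String(limit));
    const response = await fetch(url.toString());
    if (!response.ok) throw new Error(`HTTP ${response.status} from ${domain}`);
    return (await response.json()) as DirectoryAccount[];
  };
}
```

Usage would look like `await collectDirectory(mastodonFetcher("example.social"), 500)`, where `example.social` is a placeholder domain. Stopping on an empty page instead of a short page costs one extra request per instance, but it does not depend on knowing the server's real page size.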

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

@nolog Well this is a problem. I will have to reindex it all again 😄

@NoLog.cz 🏴

in reply to Štěpán Škorpil

NoLog.cz 🏴

in reply to Štěpán Škorpil 3 years ago

Sorry for that 🫣

in reply to Štěpán Škorpil

stop genocide punch nazis

in reply to Štěpán Škorpil 3 years ago

Does this respect robots.txt and opt-outs, and limit itself to profiles?

Would be nice to have a statement about this on the site.

in reply to stop genocide punch nazis

Hans Gaylordinen

in reply to stop genocide punch nazis 3 years ago

@nikodemus I've opted out from search engine indexing and I could still find my profile.

Do. Not. Like. This.

@stop genocide punch nazis

Unknown parent

Štěpán Škorpil

Unknown parent 3 years ago

It's "FediCrawl/1.0"

Unknown parent

Štěpán Škorpil

Unknown parent 3 years ago

@rootadmin Ok, will look at it.

@root.admin🔅

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

@rootadmin Fixed. I missed this setting; now the crawler respects it.

@root.admin🔅
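
The thread does not say which setting was fixed. Assuming it is the opt-out from search engine indexing mentioned earlier, a crawler-side check could look roughly like the sketch below. It relies on the `discoverable` and `noindex` fields of the Mastodon account entity; the names `AccountFlags` and `isIndexable` are hypothetical and not taken from fedicrawl.

```typescript
// Subset of the Mastodon account entity relevant to indexing decisions.
interface AccountFlags {
  acct: string;
  discoverable?: boolean | null; // user opted in to discovery features
  noindex?: boolean | null; // user opted out of search engine indexing
}

// Index an account only if it explicitly opted in to discovery
// and did not opt out of search engine indexing.
function isIndexable(account: AccountFlags): boolean {
  if (account.noindex === true) return false;
  return account.discoverable === true;
}

// Keeps only the accounts that may be added to the index.
function filterIndexable<T extends AccountFlags>(accounts: T[]): T[] {
  return accounts.filter(isIndexable);
}
```

Treating a missing or `null` `discoverable` value as "not indexable" is a deliberately conservative choice in this sketch: servers that do not report the flag simply contribute no accounts.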

in reply to Štěpán Škorpil

Štěpán Škorpil

in reply to Štěpán Škorpil 3 years ago

@admin What is the difference between regular search engines and a specialized fediverse search engine? Why would you want to allow Google but disallow FediSearch?

@Admin 🤓 Todon.eu (mod)
