We have a CI job to spot unwanted UTF-8 letters in #curl PRs, as we have noticed that GitHub will gladly show, for example, the (identical-looking) Cyrillic version of a letter next to the Latin version in a diff, and yes, it is entirely impossible for a human to spot the difference. I mean the diff is shown, but the significance of it is not.

Changing just a single letter like that in a URL hostname opens up a world of grief.
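
As a rough illustration of that kind of check (not the actual curl CI script, which is not shown here), a minimal Python sketch could flag every non-ASCII character on the added lines of a unified diff. The function name `find_non_ascii` and the sample diff are made up for this example, and treating everything above U+007F as suspicious is an assumption; a real project would likely need an allowlist for legitimate non-ASCII text.

```python
import unicodedata


def find_non_ascii(diff_text):
    """Return (diff_line, column, char, name) for non-ASCII chars on added lines.

    Hypothetical helper for illustration only.
    """
    hits = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect added lines; skip the "+++ b/file" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        # Drop the leading "+" so columns refer to the file content.
        for col, ch in enumerate(line[1:], start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Made-up diff: the "a" in the hostname is replaced by Cyrillic U+0430.
sample = "\n".join([
    "--- a/docs/example.md",
    "+++ b/docs/example.md",
    "-See https://example.com/ for details",
    "+See https://ex\u0430mple.com/ for details",
])

for lineno, col, ch, name in find_non_ascii(sample):
    print(f"diff line {lineno}, column {col}: U+{ord(ch):04X} {name}")
```

Printing the code point and its Unicode name is the part that matters: the glyph itself looks the same as the Latin letter, so only the name makes the swap visible.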

in reply to daniel:// stenberg://

I feel like there need to be tools that make safe handling of Unicode easier. Does anyone know of a full list of Unicode ranges? I know there are some sites that give partial ones. But I'd like the information needed to tell "this sentence contains Unicode characters consistent with language X" apart from "this sentence contains Unicode characters for 45 different languages".
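
For what it's worth, the Unicode Character Database publishes per-code-point script assignments in its `Scripts.txt` file, which is probably the closest thing to a full list. Python's standard library does not expose that property directly, so the sketch below uses a cruder stand-in: it takes the first word of each letter's Unicode name (`LATIN`, `CYRILLIC`, `GREEK`, ...) as the script. That shortcut is an assumption that fails for some characters (mathematical letters and a few symbols, for instance), and the names `scripts_in` and `looks_mixed` are invented for this example.

```python
import unicodedata
from collections import Counter


def scripts_in(text):
    """Count letters per approximate script.

    Heuristic: the first word of the Unicode character name stands in
    for the script. This is NOT the real Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def looks_mixed(text):
    """True when letters from more than one approximate script appear."""
    return len(scripts_in(text)) > 1


print(scripts_in("ex\u0430mple"))   # Latin letters plus one Cyrillic letter
print(looks_mixed("example"))       # pure Latin
```

A word-level check like this is a starting point rather than a full answer: legitimate text often mixes scripts across a sentence, so mixing inside a single word or hostname is the more telling signal.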
in reply to daniel:// stenberg://

GitHub recently added a warning for hidden Unicode characters.

Maybe they will get to homograph attacks next.

github.blog/changelog/2025-05-…

in reply to Mustaque Ahmed

@amustaque97 the check was merged into another script for generic checks, here: github.com/curl/curl/blob/mast…
in reply to kaiserkiwi

@kaiserkiwi the job does not do only this; the check runs as part of a bunch of other scanning duties handled by this script: github.com/curl/curl/blob/mast…
in reply to daniel:// stenberg://

A few years ago, when confusable homoglyphs were last a popular talking point, I ported a Python package to PHP so I could do similar filtering: github.com/photogabble/php-con…