Neil Harris wrote:
A suggestion, based on practices used for IDN registration:
Restrict new usernames on the en: Wikipedia to characters from the Latin
alphabet and selected punctuation only (and possibly digits as well).
Before allowing a username to be registered, generate a canonical
comparison form by Unicode normalization, lowercasing, punctuation and
space suppression and accent-stripping, followed by homograph
canonicalizations such as mapping both digit zero and letter O to the
latter, digit 1 and letter L to the latter, eth to lowercase d, etc.
A new username should then only be allowed to be registered if the
comparison form of the proposed new username is different from the
comparison form of every existing username (which are stored in an
indexed table, alongside the full, uncanonicalized name that actually
gets registered).
Doing this will eliminate the vast majority of all simple username
spoofing hacks.
Existing usernames get grandfathered in, of course.
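For illustration, the comparison-form scheme described above might be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the names comparison_form, HOMOGRAPHS and UsernameRegistry are hypothetical, the homograph table holds just the three mappings named in the suggestion (a real table would be much longer), and an in-memory dict stands in for the indexed database table.

```python
import unicodedata

# Hypothetical homograph table: only the pairs named in the suggestion
# (digit zero -> letter o, digit one -> letter l, eth -> d).
HOMOGRAPHS = str.maketrans({"0": "o", "1": "l", "\u00f0": "d"})


def comparison_form(name: str) -> str:
    """Return a canonical form used only for collision checks."""
    # Compatibility decomposition splits accented letters into a base
    # letter followed by combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Case-insensitive comparison.
    lowered = decomposed.casefold()
    # Keep letters and digits only; this drops combining marks (accents),
    # punctuation and spaces in one pass.
    kept = "".join(
        ch for ch in lowered if unicodedata.category(ch)[0] in ("L", "N")
    )
    # Finally fold the known look-alike characters together.
    return kept.translate(HOMOGRAPHS)


class UsernameRegistry:
    """Assumed stand-in for the indexed table of comparison forms."""

    def __init__(self):
        # comparison form -> full name as actually registered
        self._by_form = {}

    def register(self, name: str) -> bool:
        form = comparison_form(name)
        # Reject names that reduce to nothing or collide with an existing one.
        if not form or form in self._by_form:
            return False
        self._by_form[form] = name
        return True
```

With this sketch, registering "Neil Harris" succeeds and a later attempt to register "Nei1 Harris" is refused, because both reduce to the same comparison form. Grandfathered names would simply be loaded into the table without the collision check.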
Yes, we could do it based on normalisation. Another option would be to
disallow cross-script usernames.
Chris Lüer wrote:
Is there a reason why user names with weird Unicode characters
are even allowed? It would seem sensible to limit user names on each
Wikipedia to the alphabet that is used in that language.
On the English Wikipedia we could potentially limit usernames to Basic
Latin, but on the other wikis it's not so simple. Wikis which use another
alphabet for the main text often have a large number of usernames which are
written in ASCII.
Most homograph pairs occur across scripts, because within scripts,
characters are typically intended to be distinguishable. There are a few
unfortunate exceptions, such as l and I in some sans serif fonts, and if
you're not watching closely you can be fooled by substitutions such as I->1
and 0->O. But I would argue that users should most certainly be watching
closely before they go pointing the finger at an established user. You can't
protect against all forms of stupidity: an established user once told me
"actually it looks like I did do that edit", when a vandal with an all
lowercase surname impersonated a user who conventionally capitalises the
first letter.
Anyway, what I recommend is:
* Disallow control character ranges. I think this one is already
implemented.
* Disallow usernames which take letters from multiple scripts. Exceptions
would have to be made for certain punctuation and number characters.
* Remove errant composing characters by normalisation. This could be done
silently during Title creation, like first-letter capitalisation.
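
As a rough illustration only, the three checks above could look something like this in Python. The function names are hypothetical, and the script detection is a loud approximation: Python's standard library has no Unicode script property, so this sketch takes the first word of each letter's Unicode character name (LATIN, CYRILLIC, GREEK, ...) as its script. A real implementation would use proper script data. Only letters are examined, which is how digits and punctuation get their exemption here.

```python
import unicodedata


def normalise(name: str) -> str:
    # NFC folds a base letter plus combining marks into the precomposed
    # character where one exists.
    return unicodedata.normalize("NFC", name)


def has_control_chars(name: str) -> bool:
    # General category "Cc" covers the C0 and C1 control ranges.
    return any(unicodedata.category(ch) == "Cc" for ch in name)


def scripts_used(name: str) -> set:
    """Approximate the set of scripts that the letters in name come from."""
    scripts = set()
    for ch in name:
        # Only letters count; digits, spaces and punctuation are exempt.
        if unicodedata.category(ch).startswith("L"):
            # ASSUMPTION: first word of the character name ~ script name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_allowed(name: str) -> bool:
    name = normalise(name)
    return not has_control_chars(name) and len(scripts_used(name)) <= 1
```

Under these assumptions, is_allowed("Tim Starling 2") passes, while a name that swaps in a Cyrillic "а" (U+0430) for the Latin "a" is rejected because its letters come from two scripts.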
-- Tim Starling