...also, these names would be easy to spot if you used a
Unicode-incapable browser. Not that I encourage that, though.
Mark
On 20/06/06, Mark Williamson <node.ue(a)gmail.com> wrote:
One thing that could be done is to make a list of homographs -- most of
them are rather obvious -- and to consider, for example,
User:Маtthеw.Вrоwn as "taken" as soon as the username
User:Matthew.Brown is registered (in case you couldn't tell, the
former uses several Cyrillic letters).
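As a rough illustration of the "taken" idea, here is a minimal Python sketch. The HOMOGRAPHS table and the names skeleton and is_taken are made up for this example, and the table covers only a handful of Cyrillic/Latin look-alikes; a real list would be much longer.

```python
# Hypothetical, deliberately tiny homograph table: Cyrillic -> Latin look-alikes.
HOMOGRAPHS = {
    "\u0410": "A", "\u0430": "a",
    "\u0412": "B",
    "\u0415": "E", "\u0435": "e",
    "\u041c": "M",
    "\u041e": "O", "\u043e": "o",
    "\u0420": "P", "\u0440": "p",
    "\u0421": "C", "\u0441": "c",
}


def skeleton(name: str) -> str:
    """Replace every listed look-alike with its Latin counterpart."""
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in name)


def is_taken(candidate: str, registered: list[str]) -> bool:
    """A name counts as taken if its skeleton matches a registered name's."""
    taken = {skeleton(existing) for existing in registered}
    return skeleton(candidate) in taken


# The spoofed name from the example, written with escapes so the
# Cyrillic letters are visible in the source.
spoof = "\u041c\u0430tth\u0435w.\u0412r\u043ewn"
print(is_taken(spoof, ["Matthew.Brown"]))  # True
```

Comparing skeletons in both directions means it does not matter whether the Latin or the mixed-script spelling was registered first.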
The question you may ask is, why would people want characters from
multiple scripts? The answer is that, depending on the language, this
may be fashionable, a pun, or have some other effect. For example, a
username such as User:田中志郎jp is not unreasonable to allow for ja.wp,
especially for a Japanese person named Shiro Tanaka.
Also, with some scripts, there are few or no homographs with certain
other scripts -- for example, Hiragana and Basic Latin have no
homographs, and you can add Hangul, most Indic scripts, Thai...
Mark
On 20/06/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
Neil Harris wrote:
A suggestion, based on practices used for IDN registration:
Restrict new usernames on the en: Wikipedia to characters from the Latin
alphabet and selected punctuation only (and possibly digits as well).
Before allowing a username to be registered, generate a canonical
comparison form by Unicode normalization, lowercasing, punctuation and
space suppression and accent-stripping, followed by homograph
canonicalizations such as mapping both digit zero and letter O to the
latter, digit 1 and letter L to the latter, eth to lowercase d, etc.
A new username should then only be allowed to be registered if the
comparison form of the proposed new username is different from the
comparison form of every existing username (which are stored in an
indexed table, alongside the full, uncanonicalized name that actually
gets registered).
Doing this will eliminate the vast majority of all simple username
spoofing hacks.
Existing usernames get grandfathered in, of course.
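The scheme above could be sketched in Python roughly as follows. This is only an illustration under assumptions: NFKD is picked as the normalization form, "punctuation and space suppression" is read as "keep only letters and digits", the fold table holds just the three examples given, and the names comparison_form and UsernameRegistry are invented here. A dict stands in for the indexed table.

```python
import unicodedata

# Illustrative homograph folds, limited to the examples in the text:
# digit zero -> o, digit one -> l, eth -> d (applied after lowercasing).
HOMOGRAPH_FOLDS = str.maketrans({"0": "o", "1": "l", "\u00f0": "d"})


def comparison_form(name: str) -> str:
    """Build a canonical form used only for comparing usernames."""
    # Decompose so accents become separate combining marks, then lowercase.
    text = unicodedata.normalize("NFKD", name).lower()
    # Accent stripping: drop the combining marks.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Punctuation and space suppression: keep letters and digits only.
    text = "".join(ch for ch in text if ch.isalnum())
    return text.translate(HOMOGRAPH_FOLDS)


class UsernameRegistry:
    """Maps each comparison form to the uncanonicalized registered name."""

    def __init__(self) -> None:
        self._by_form: dict[str, str] = {}

    def register(self, name: str) -> bool:
        form = comparison_form(name)
        if form in self._by_form:
            return False  # collides with an existing username
        self._by_form[form] = name
        return True


registry = UsernameRegistry()
print(registry.register("Matthew.Brown"))  # True
print(registry.register("matthew brown"))  # False, same comparison form
```

Grandfathering existing names would simply mean loading them into the table without the collision check.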
Yes, we could do it based on normalisation. Another option would be to
disallow cross-script usernames.
Chris Lüer wrote:
Is there a reason why user names with weird Unicode characters
are even allowed? It would seem sensible to limit user names on each
Wikipedia to the alphabet that is used in that language.
On the English Wikipedia we could potentially limit usernames to Basic
Latin, but on the other wikis it's not so simple. Wikis which use another
alphabet for the main text often have a large number of usernames which are
written in ASCII.
Most homograph pairs occur across scripts, because within scripts,
characters are typically intended to be distinguishable. There are a few
unfortunate exceptions, such as l and I in some sans serif fonts, and if
you're not watching closely you can be fooled by substitutions such as I->1
and 0->O. But I would argue that users should most certainly be watching
closely before they go pointing the finger at an established user. You can't
protect against all forms of stupidity: an established user once told me
"actually it looks like I did do that edit", when a vandal with an all
lowercase surname impersonated a user who conventionally capitalises the
first letter.
Anyway, what I recommend is:
* Disallow control character ranges. I think this one is already
implemented.
* Disallow usernames which take letters from multiple scripts. Exceptions
would have to be made for certain punctuation and number characters.
* Remove errant composing characters by normalisation. This could be done
silently during Title creation, like first-letter capitalisation.
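A rough Python sketch of the three checks, for illustration only. Python's standard library has no Unicode script property, so the script is approximated here by the first word of the character's Unicode name, which is a crude heuristic (it would, for instance, treat kanji and Hiragana as different scripts). NFC is assumed as the normalisation form, only category Cc is treated as a control character, and the function names are made up.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a letter's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def clean_username(name: str) -> str:
    """Return the normalised name, or raise ValueError if it is rejected."""
    # Normalise first (NFC assumed), as could be done silently on creation.
    name = unicodedata.normalize("NFC", name)
    scripts = set()
    for ch in name:
        category = unicodedata.category(ch)
        if category == "Cc":
            raise ValueError("control character in username")
        # Only letters count towards the script check, so digits and
        # punctuation are exempt.
        if category.startswith("L"):
            scripts.add(rough_script(ch))
    if len(scripts) > 1:
        raise ValueError("letters from multiple scripts: " + ", ".join(sorted(scripts)))
    return name


print(clean_username("Matthew.Brown2"))  # accepted unchanged
try:
    clean_username("\u041c\u0430tth\u0435w.\u0412r\u043ewn")
except ValueError as err:
    print("rejected:", err)
```

A production version would want real script data (for example from the Unicode Scripts.txt file) rather than name prefixes.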
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
--
Refije dirije lanmè yo paske nou posede pwòp bato.