Re: [Wikitech-l] Eliminating homographs in usernames: was: Re: [WikiEN-l] Pruning "dead" accounts (was Re: New York Times article)

21 Jun 2006

...also, these names would be easy to spot if you used a
Unicode-incapable browser. Not that I encourage that, though.

Mark

On 20/06/06, Mark Williamson <node.ue(a)gmail.com> wrote:
...
  One thing that could be done is to make a list of homographs -- most
 of them are rather obvious -- and to consider, for example,
 User:Маtthеw.Вrоwn as "taken" as soon as the username
 User:Matthew.Brown is registered (in case you couldn't tell, the
 former uses several Cyrillic letters).
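
A minimal Python sketch of the homograph-list idea above. The table is a tiny hypothetical sample (five Cyrillic letters), and `skeleton` and `is_taken` are names invented for this illustration; none of this is existing MediaWiki code.

```python
# Hypothetical sample of Cyrillic letters that look like Latin letters.
# A real homograph list would be far longer.
HOMOGRAPHS = {
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(name):
    """Replace each listed homograph with its Latin look-alike."""
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in name)


def is_taken(candidate, registered):
    """True if the candidate looks identical to any registered name."""
    taken = {skeleton(existing) for existing in registered}
    return skeleton(candidate) in taken
```

With this table, a spoof of Matthew.Brown that swaps in those five Cyrillic letters has the same skeleton as the Latin original, so `is_taken` reports it as already registered.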

 The question you may ask is, why would people want characters from
 multiple scripts? The answer is that, depending on the language, this
 may be fashionable, a pun, or have some other effect. For example, a
 username such as User:田中志郎jp is not unreasonable to allow for ja.wp,
 especially for a Japanese person named Shiro Tanaka.

 Also, with some scripts, there are few or no homographs with certain
 other scripts -- for example, Hiragana and Basic Latin have no
 homographs; the same goes for Hangul, most Indic scripts, Thai...

 Mark

 On 20/06/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
  Neil Harris wrote:
  A suggestion, based on practices used for IDN registration:

 Restrict new usernames on the en: Wikipedia to characters from the Latin
 alphabet and selected punctuation only (and possibly digits as well).

 Before allowing a username to be registered, generate a canonical
 comparison form by Unicode normalization, lowercasing, punctuation and
 space suppression and accent-stripping, followed by homograph
 canonicalizations such as mapping both digit zero and letter O to the
 latter, digit 1 and letter L to the latter, eth to lowercase d, etc.

 A new username should then only be allowed to be registered if the
 comparison form of the proposed new username is different from the
 comparison form of every existing username (which are stored in an
 indexed table, alongside the full, uncanonicalized name that actually
 gets registered).
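
The canonicalisation and uniqueness check described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the proposed implementation: `comparison_form` and `UsernameRegistry` are invented names, the homograph table holds only the three mappings mentioned in the text, and a dictionary stands in for the indexed table.

```python
import unicodedata

# Only the mappings named in the proposal: zero -> o, one -> l, eth -> d.
HOMOGRAPHS = str.maketrans({"0": "o", "1": "l", "\u00f0": "d"})


def comparison_form(name):
    """Build the canonical comparison form of a username."""
    # Unicode normalisation: NFKD splits accented letters into
    # a base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Accent-stripping: drop the combining marks.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercasing (casefold is a more aggressive lower()).
    lowered = stripped.casefold()
    # Punctuation and space suppression: keep letters and digits only.
    kept = "".join(ch for ch in lowered if ch.isalnum())
    # Homograph canonicalisation.
    return kept.translate(HOMOGRAPHS)


class UsernameRegistry:
    """Stand-in for the indexed table of comparison forms."""

    def __init__(self):
        self._by_form = {}  # comparison form -> name as registered

    def register(self, name):
        form = comparison_form(name)
        if not form or form in self._by_form:
            return False  # empty or collides with an existing name
        self._by_form[form] = name
        return True
```

In this sketch, registering "Matthew.Brown" and then attempting "matthew br0wn" fails on the second call, because both reduce to the same comparison form.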

 Doing this will eliminate the vast majority of all simple username
 spoofing hacks.

 Existing usernames get grandfathered in, of course.

 Yes, we could do it based on normalisation. Another option would be to
 disallow cross-script usernames.

 Chris Lüer wrote:
  Is there a reason why user names with weird Unicode characters
 are even allowed? It would seem sensible to limit user names on each
 Wikipedia to the alphabet that is used in that language.

 On the English Wikipedia we could potentially limit usernames to basic
 latin, but on the other wikis it's not so simple. Wikis which use another
 alphabet for the main text often have a large number of usernames which are
 written in ASCII.

 Most homograph pairs occur across scripts, because within scripts,
 characters are typically intended to be distinguishable. There are a few
 unfortunate exceptions, such as l and I in some sans serif fonts, and if
 you're not watching closely you can be fooled by substitutions such as I->1
 and 0->O. But I would argue that users should most certainly be watching
 closely before they go pointing the finger at an established user. You can't
 protect against all forms of stupidity: an established user once told me
 "actually it looks like I did do that edit", when a vandal with an all
 lowercase surname impersonated a user who conventionally capitalises the
 first letter.

 Anyway, what I recommend is:

 * Disallow control character ranges. I think this one is already
 implemented.
 * Disallow usernames which take letters from multiple scripts. Exceptions
 would have to be made for certain punctuation and number characters.
 * Remove errant composing characters by normalisation. This could be done
 silently during Title creation, like first-letter capitalisation.
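
A rough Python sketch of the three recommendations above, with loud caveats. Python's standard library has no script property, so the first word of each character's Unicode name ("LATIN", "CYRILLIC", "CJK", ...) is used here as a crude stand-in for its script; that is an assumption of this example, and it would wrongly flag legitimate mixes such as kanji with hiragana. `scripts_used` and `check_username` are hypothetical names. NFC is chosen as the normalisation form for illustration; it composes base letters with their combining marks but does not by itself delete stray marks.

```python
import unicodedata


def scripts_used(name):
    """Approximate the set of scripts among the letters of a name."""
    scripts = set()
    for ch in name:
        # Only letters count; punctuation and digits are exempt.
        if unicodedata.category(ch).startswith("L"):
            # Heuristic: first word of the character name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def check_username(name):
    """Return the normalised name, or None if it is rejected."""
    # Normalise silently, as with first-letter capitalisation.
    normalised = unicodedata.normalize("NFC", name)
    # Disallow control characters (general category Cc).
    if any(unicodedata.category(ch) == "Cc" for ch in normalised):
        return None
    # Disallow letters drawn from more than one script.
    if len(scripts_used(normalised)) > 1:
        return None
    return normalised
```

Under this heuristic a purely Latin name passes unchanged, a name mixing Cyrillic and Latin letters is rejected, and so is a name such as 田中志郎jp, which is the trade-off raised earlier in the thread.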

 -- Tim Starling

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)wikimedia.org
 http://mail.wikipedia.org/mailman/listinfo/wikitech-l

 --
 Refije dirije lanmè yo paske nou posede pwòp bato.

