Re: [Wikitech-l] More aggressive DEFAULTSORT

12 May 2009

On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
...
  There is a way to avoid all such problems, namely by a more
 aggressive use of DEFAULTSORT that removes from sorting all upper
 case letters (except the initial one), all whitespace and all
 commas.  It would mean almost every article needs a DEFAULTSORT.
 In the examples above:

  {{DEFAULTSORT:Walesjimmy}}
  {{DEFAULTSORT:Europeancourtofauditors}}
  {{DEFAULTSORT:Europeanunionmission}}
  {{DEFAULTSORT:Europeanquarterofbrussels}}
  {{DEFAULTSORT:Moonillusion}}

This would be a good thing to do in the software.  We could implement
the framework reasonably easily, if anyone cares to, and then let each
language do its thing.  A basic English implementation like this would
be easy enough.
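
A minimal sketch of what such a basic English rule could look like,
following the quoted description (drop whitespace and commas, lower-case
everything except the initial character).  The function name
aggressive_sortkey and the input titles are assumptions made for
illustration; the original titles behind the quoted sort keys are not
shown in this message, and this is not MediaWiki code.

```python
def aggressive_sortkey(title: str) -> str:
    """Build a sort key by dropping whitespace and commas and
    lower-casing everything after the first character."""
    # Remove all whitespace and commas, as the proposal describes.
    stripped = "".join(ch for ch in title if not ch.isspace() and ch != ",")
    # Keep the initial character untouched; fold the rest to lower case.
    return stripped[:1] + stripped[1:].lower()


if __name__ == "__main__":
    # Hypothetical titles chosen to reproduce three of the quoted keys.
    for title in ["Wales, Jimmy", "European Court of Auditors", "Moon illusion"]:
        print(aggressive_sortkey(title))
    # Walesjimmy
    # Europeancourtofauditors
    # Moonillusion
```

Doing this in the software rather than in each article would make the
explicit DEFAULTSORT lines unnecessary for the common case.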

Of course, any change to the sortkey algorithm after the first will
require that all existing sort keys be changed by a batch job --
otherwise sorting will be a mess.  Every change to the algorithm would
require either that all pages be reparsed (very expensive) or that a
special conversion script be written to account for that exact change,
unless it's minor enough that the inconsistency is acceptable, I
guess.
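
To illustrate why mixed keys are a mess, here is a small sketch.  It
assumes stored keys are compared by code point (which matches binary
ordering for ASCII), that the old key is simply the raw title, and that
the new key follows the rule quoted above.  The titles, the dictionary
layout and the helper names are hypothetical, not MediaWiki's schema.

```python
from typing import Dict, List


def aggressive_sortkey(title: str) -> str:
    """New-style key: no whitespace or commas, lower case after the first character."""
    stripped = "".join(ch for ch in title if not ch.isspace() and ch != ",")
    return stripped[:1] + stripped[1:].lower()


def category_order(stored_keys: Dict[str, str]) -> List[str]:
    """Return page titles ordered by their stored sort key (code point comparison)."""
    return sorted(stored_keys, key=lambda title: stored_keys[title])


def rekey_all(stored_keys: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for the batch job: recompute every stored key with the new rule."""
    return {title: aggressive_sortkey(title) for title in stored_keys}


# Old-style keys: the raw title (hypothetical titles).
stored = {
    "European Court of Auditors": "European Court of Auditors",
    "European Union Mission": "European Union Mission",
    "European quarter of Brussels": "European quarter of Brussels",
}

# Only one page has been reparsed since the algorithm changed.
stored["European Court of Auditors"] = aggressive_sortkey("European Court of Auditors")
mixed = category_order(stored)
# ['European Union Mission', 'European quarter of Brussels', 'European Court of Auditors']

# After the batch job every key uses the same rule again.
consistent = category_order(rekey_all(stored))
# ['European Court of Auditors', 'European quarter of Brussels', 'European Union Mission']
```

The mixed ordering matches neither the old nor the new rule, which is
the inconsistency a batch job or conversion script would remove.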

On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec <petr.kadlec(a)gmail.com> wrote:
...
  Well, not really. Bug 164 would be fixed almost completely for
 Czech-language wikis by using database features designed for exactly
 this problem. [1] But, I guess you know the situation.
 ...
 [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html

Note the version.  Wikimedia uses MySQL 4.0, which doesn't contain any
charsets or collations other than binary.  If we used a newer
version, utf8 might be an option: that would use a Unicode collation,
I guess, which should at least be okay for most languages, if not
perfect.  (But MySQL's utf8 has other downsides, like being
variable-width and not supporting Unicode outside the BMP.)
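
A short sketch of the two points above: what binary ordering does to
mixed-case and accented titles, and what "variable-width" and "outside
the BMP" mean for UTF-8.  The sample words and helper names are chosen
only for illustration; nothing here talks to MySQL.

```python
from typing import List


def binary_sorted(words: List[str]) -> List[str]:
    """Sort by raw UTF-8 bytes, which is what a binary collation amounts to."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


def utf8_width(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8 (1 to 4)."""
    return len(ch.encode("utf-8"))


def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


if __name__ == "__main__":
    # Upper case sorts before lower case, and accented letters after 'z'.
    print(binary_sorted(["Zebra", "apple", "Čech", "cihla"]))
    # ['Zebra', 'apple', 'cihla', 'Čech']

    # One character can take anywhere from one to four bytes.
    for ch in ["a", "č", "€", "𝄞"]:
        print(repr(ch), utf8_width(ch), in_bmp(ch))
    # 'a' 1 True
    # 'č' 2 True
    # '€' 3 True
    # '𝄞' 4 False
```

The last character needs four bytes and lies outside the BMP, which is
the case the parenthetical above refers to.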

...
  If Swedish sorting rules are simple enough that removing all
 whitespace and punctuation and converting to lower case would solve
 most of the problems, I would say that such a feature would not be too
 difficult to implement right into MediaWiki (into LanguageSv.php);
 writing those DEFAULTSORT codes explicitly into every article would be
 nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything,
 I’m just trying to say that this is not really a solution for Czech
 language.)

There's no reason this couldn't be implemented for Czech as well in
the software, in principle.  Ideally we'd use something based on
Unicode collation as a baseline, with optional customizations per
language:

http://unicode.org/reports/tr10/
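
As a rough illustration of a per-language customization, here is a
sketch of a primary-level Czech sort key.  It is NOT an implementation
of the Unicode Collation Algorithm: it only orders base letters, treats
"ch" as one letter after "h", keeps č, ř, š and ž as separate letters,
and ignores every other accent, as well as case and punctuation.  All
names (CZECH_LETTERS, czech_primary_key) are hypothetical.

```python
import unicodedata
from typing import Tuple

# Primary letters of the Czech alphabet, in collation order.
CZECH_LETTERS = [
    "a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
    "l", "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t", "u", "v",
    "w", "x", "y", "z", "ž",
]
WEIGHT = {letter: index for index, letter in enumerate(CZECH_LETTERS)}


def czech_primary_key(text: str) -> Tuple[int, ...]:
    """Map text to a tuple of primary weights (simplified, see lead-in)."""
    # Compose accents first so that 'č' is a single character, then fold case.
    text = unicodedata.normalize("NFC", text).lower()
    key = []
    i = 0
    while i < len(text):
        # The digraph "ch" counts as one letter sorted after "h".
        if text.startswith("ch", i):
            key.append(WEIGHT["ch"])
            i += 2
            continue
        ch = text[i]
        # Letters without their own slot (á, é, ě, ...) fall back to the base letter.
        letter = ch if ch in WEIGHT else unicodedata.normalize("NFD", ch)[0]
        if letter in WEIGHT:
            key.append(WEIGHT[letter])
        # Anything else (spaces, punctuation, digits) is skipped in this sketch.
        i += 1
    return tuple(key)


if __name__ == "__main__":
    words = ["chata", "cihla", "hrad", "čaj", "ibis"]
    print(sorted(words))
    # ['chata', 'cihla', 'hrad', 'ibis', 'čaj']
    print(sorted(words, key=czech_primary_key))
    # ['cihla', 'čaj', 'hrad', 'chata', 'ibis']
```

A real implementation would start from the default collation table in
the report linked above and apply the language's tailoring on top of it,
rather than hand-writing an alphabet like this.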
