Re: [Wikitech-l] Architectural revisions to improve category sorting

22 Jul 2010

On 22/07/10 07:00, Aryeh Gregor wrote:
...
  Categories, files, and other types of pages cannot
 be moved to one another, as far as I know (it would hardly make
 sense), so it automatically stays consistent this way.

This restriction is enforced by Title::isValidMoveOperation().

...
  1) Change the way category sortkeys are generated.  Start them with a
 letter depending on namespace, like 'C' for category, 'P' for regular
 page, 'F' for file.  After that first letter, append a sortkey
 generated by ICU or whatever.

An alternative would be to add a column to the categorylinks table,
say cl_type. It could be an ENUM or some short text type. Then the
index could be altered to include this field at the start of it.

Presumably the rationale for combining these two things into
cl_sortkey is to avoid a schema change, and to make the paging code
slightly simpler. But I worry that future generations of MediaWiki
developers will curse us for laziness and obfuscation.
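
A minimal sketch of the cl_type alternative, using Python's sqlite3 module purely for illustration. The table is a simplified stand-in for categorylinks, and the type values ('subcat', 'page', 'file'), the index name, the position of cl_type within the index, and the page_of() helper are all assumptions made for the example, not anything settled in this thread.

```python
import sqlite3

# Simplified, hypothetical stand-in for the categorylinks table. Only the
# cl_type column is the proposed addition; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categorylinks (
    cl_from    INTEGER NOT NULL,
    cl_to      TEXT    NOT NULL,
    cl_type    TEXT    NOT NULL CHECK (cl_type IN ('subcat', 'page', 'file')),
    cl_sortkey TEXT    NOT NULL
);
-- Assumed index layout: cl_type ahead of the sort key, so each type is a
-- contiguous, already-sorted range within one category.
CREATE INDEX cl_type_sortkey
    ON categorylinks (cl_to, cl_type, cl_sortkey, cl_from);
""")

conn.executemany(
    "INSERT INTO categorylinks (cl_from, cl_to, cl_type, cl_sortkey) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "Fruit", "subcat", "CITRUS"),
        (2, "Fruit", "page", "APPLE"),
        (3, "Fruit", "page", "BANANA"),
        (4, "Fruit", "page", "CHERRY"),
        (5, "Fruit", "file", "APPLE.JPG"),
    ],
)


def page_of(category, cl_type, after_key="", limit=2):
    """Return up to `limit` (sortkey, page_id) rows of one type after `after_key`."""
    return conn.execute(
        "SELECT cl_sortkey, cl_from FROM categorylinks "
        "WHERE cl_to = ? AND cl_type = ? AND cl_sortkey > ? "
        "ORDER BY cl_sortkey, cl_from LIMIT ?",
        (category, cl_type, after_key, limit),
    ).fetchall()
```

Each of the three sections keeps its own offset (the last sort key shown), so a category page would call page_of() once per type. A real implementation would also need to tie-break on cl_from when two entries share a sort key, which this sketch does not do.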

...
  I think Tim has opinions on what would
 be a good choice to convert the article title into sort key -- if not,
 I'll have to research it and hopefully not come up with a completely
 incorrect answer.

Well, I've said ICU, possibly with a PHP simulation of some Western
European sort key algorithm for the benefit of users without access to
ICU. But I formed that opinion years ago, and I never properly
surveyed all the possible solutions in the first place. It probably
makes sense to do a little of your own research.
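
To illustrate the kind of fallback meant by a "simulation" for users without ICU, here is a rough sketch in Python rather than PHP. The function name simple_sortkey is hypothetical, and stripping diacritics and uppercasing is only a crude approximation chosen for the example. It is not ICU's algorithm, and it ignores language-specific rules (for instance, alphabets that sort accented letters as separate letters).

```python
import unicodedata


def simple_sortkey(title: str) -> str:
    """Crude Western European sort key: drop combining marks, then uppercase.

    Assumption for illustration only; real collation (e.g. ICU) handles
    tie-breaking between accented forms and per-language ordering.
    """
    # NFD splits "Á" into "A" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", title)
    # Keep only base characters, discarding the combining marks.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.upper()


titles = ["Zebra", "Árbol", "apple"]
ordered = sorted(titles, key=simple_sortkey)
```

With this key, "Árbol" sorts between "apple" and "Zebra" instead of after "Zebra" as it would under a plain binary comparison. Unlike an opaque binary key, the first character of this key is still a readable base letter.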

Note that I specifically excluded the actual implementation of
language-dependent sort keys from the requirements list when I wrote
up this project. It could easily eat up a lot of time, and it's not
necessary for a proof-of-principle implementation.

...
  2) On category pages, maintain three offsets and do three queries (or
 maybe UNION them together, doesn't matter), one for each of
 categories/regular pages/files.  Because of (1), this will be
 efficient and will also sort less unreasonably for non-English
 languages.

 One problem that was pointed out somewhere in the massive useless
 discussion on bug 164 is that we'd have to do something to display the
 first letter for each section.  Currently it's just the first letter
 of the sortkey, but if that's some binary string, that becomes a
 problem.  I'm not seeing an obvious solution, since the
 sortkey-generation algorithm will be opaque to us.  If it sorts Á the
 same as A, then how do we figure out that the "canonical" first letter
 for the section should be "A" and not "Á"?  How do we even figure out
 where the sections begin or end?  Would that even make sense in all
 cases?  At a first pass, I'd say we should just skip the first letter
 and display all the items straight from beginning to end without
 section divisions.  I don't think that's a big problem.

Roan is also asking in this thread for the plain text form to be stored.

Work out how much space we would need to additionally store the
category keys in plain text. Then we will know what sort of tradeoff
we are looking at. Have you got a toolserver account you can use to do
the sums?

Since we won't be sorting on the plain text form anymore, we could use
some tricks to save space. For instance, if the sort key is the same
as the article title, we could store NULL instead of another copy of
the article title. That should save 95% or so.
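
A small sketch of the NULL trick, with Python's None standing in for SQL NULL. The helper names (to_stored, from_stored, bytes_saved) are hypothetical, and the byte count only measures the UTF-8 text that is no longer duplicated, not real row or index overhead.

```python
from typing import Iterable, Optional, Tuple


def to_stored(title: str, plain_sortkey: str) -> Optional[str]:
    """Store None when the plain-text key merely repeats the title."""
    return None if plain_sortkey == title else plain_sortkey


def from_stored(title: str, stored: Optional[str]) -> str:
    """Recover the plain-text key, falling back to the title for None."""
    return title if stored is None else stored


def bytes_saved(pairs: Iterable[Tuple[str, str]]) -> int:
    """UTF-8 bytes not stored because the key equals the title."""
    return sum(
        len(key.encode("utf-8"))
        for title, key in pairs
        if to_stored(title, key) is None
    )


# Hypothetical (title, plain-text sort key) pairs.
pairs = [("Apple", "Apple"), ("Árbol", "Arbol")]
```

Running bytes_saved() over a real dump of titles and sort keys would give a rough figure for the tradeoff, though the actual saving depends on how many pages set a custom sort key.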

-- Tim Starling
