On 22/07/10 07:00, Aryeh Gregor wrote:
Categories, files, and other types of pages cannot
be moved to one another, as far as I know (it would hardly make
sense), so it automatically stays consistent this way.
This restriction is enforced by Title::isValidMoveOperation().
1) Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular
page, 'F' for file. After that first letter, append a sortkey
generated by ICU or whatever.
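
As a rough illustration of proposal (1), the sketch below prepends a one-letter type prefix to a sort key. The prefix letters come from the proposal; the function names are hypothetical, and collation_key() is only a stand-in (uppercased UTF-8 bytes) for whatever ICU or another collation library would actually produce.

```python
# MediaWiki's standard namespace numbers for the three kinds of category member.
NS_MAIN = 0
NS_FILE = 6
NS_CATEGORY = 14


def type_prefix(namespace: int) -> str:
    """Map a namespace to the one-letter prefix suggested in the proposal."""
    if namespace == NS_CATEGORY:
        return "C"
    if namespace == NS_FILE:
        return "F"
    return "P"  # everything else is treated as a regular page


def collation_key(title: str) -> bytes:
    """Placeholder for a real collation key; NOT what ICU would generate."""
    return title.upper().encode("utf-8")


def make_sortkey(namespace: int, title: str) -> bytes:
    """Type prefix first, then the (opaque) collation key."""
    return type_prefix(namespace).encode("ascii") + collation_key(title)


members = [
    (NS_MAIN, "Zebra"),
    (NS_CATEGORY, "Mammals"),
    (NS_FILE, "Apple.jpg"),
    (NS_MAIN, "aardvark"),
]
# A plain byte-wise sort now groups members by type, then by collation key.
ordered = sorted(members, key=lambda m: make_sortkey(*m))
```

With these particular letters a byte-wise sort yields categories, then files, then pages, since 'C' < 'F' < 'P'.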
An alternative would be to add a column to the categorylinks table,
say cl_type. It could be an ENUM or some short text type. Then the
index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into
cl_sortkey is to avoid a schema change, and to make the paging code
slightly simpler. But I worry that future generations of MediaWiki
developers will curse us for laziness and obfuscation.
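
To make the cl_type alternative concrete, here is a minimal sketch using an in-memory SQLite database in place of MySQL. The table is cut down to four columns, the three type values are hypothetical, a CHECK constraint stands in for ENUM (which SQLite lacks), and placing cl_type directly ahead of cl_sortkey in the index is only one possible reading of "at the start".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categorylinks (
        cl_from    INTEGER NOT NULL,  -- page ID of the member
        cl_to      TEXT    NOT NULL,  -- category name
        -- Hypothetical values; CHECK stands in for a MySQL ENUM.
        cl_type    TEXT    NOT NULL
                   CHECK (cl_type IN ('subcat', 'file', 'page')),
        cl_sortkey BLOB    NOT NULL
    );
    -- Assumed index layout: the type column sits ahead of the sort key.
    CREATE INDEX cl_type_sortkey
        ON categorylinks (cl_to, cl_type, cl_sortkey, cl_from);
    """
)

conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [
        (1, "Animals", "page", b"ZEBRA"),
        (2, "Animals", "subcat", b"MAMMALS"),
        (3, "Animals", "file", b"APPLE.JPG"),
        (4, "Animals", "page", b"AARDVARK"),
    ],
)


def members_of_type(db, category, cl_type):
    """Return member page IDs of one type, ordered by sort key."""
    rows = db.execute(
        "SELECT cl_from FROM categorylinks"
        " WHERE cl_to = ? AND cl_type = ?"
        " ORDER BY cl_sortkey",
        (category, cl_type),
    )
    return [row[0] for row in rows]


pages = members_of_type(conn, "Animals", "page")
```

The sort key stays a pure collation key here, and the type is visible as its own column rather than being hidden in the first byte.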
I think Tim has opinions on what would
be a good choice to convert the article title into sort key -- if not,
I'll have to research it and hopefully not come up with a completely
incorrect answer.
Well, I've said ICU, possibly with a PHP simulation of some Western
European sort key algorithm for the benefit of users without access to
ICU. But I formed that opinion years ago, and I never properly
surveyed all the possible solutions in the first place. It probably
makes sense to do a little of your own research.
Note that I specifically excluded the actual implementation of
language-dependent sort keys from the requirements list when I wrote
up this project. It could easily eat up a lot of time, and it's not
necessary for a proof-of-principle implementation.
2) On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter), one for each of
categories/regular pages/files. Because of (1), this will be
efficient and will also sort less unreasonably for non-English
languages.
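
The three-offset paging in (2) could look roughly like the sketch below. A sorted Python list stands in for the database index, the sort keys carry the one-letter prefixes from (1), and fetch_page() and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Stand-in for the index: (sortkey, title) pairs kept in sort-key order.
index = sorted(
    [
        (b"CMAMMALS", "Category:Mammals"),
        (b"CBIRDS", "Category:Birds"),
        (b"FAPPLE.JPG", "File:Apple.jpg"),
        (b"PAARDVARK", "Aardvark"),
        (b"PBADGER", "Badger"),
        (b"PZEBRA", "Zebra"),
    ]
)


def fetch_page(entries, prefix, after=None, limit=2):
    """Return up to `limit` entries of one type, starting after `after`.

    `prefix` selects the type; `after` is the last sort key shown on the
    previous page for that type, or None for the first page.
    """
    keys = [key for key, _ in entries]
    if after is None:
        pos = bisect_left(keys, prefix)  # first key of this type
    else:
        pos = bisect_right(keys, after)  # resume just past the offset
    page = []
    while pos < len(entries) and len(page) < limit and keys[pos].startswith(prefix):
        page.append(entries[pos])
        pos += 1
    return page


# One independent offset per section of the category page.
offsets = {b"C": None, b"F": None, b"P": None}
first_pages = {p: fetch_page(index, p, offsets[p]) for p in offsets}

# Advancing only the "pages" section leaves the other two offsets alone.
offsets[b"P"] = first_pages[b"P"][-1][0]
second_page_of_pages = fetch_page(index, b"P", offsets[b"P"])
```

Each lookup is a range scan starting at a known key, which is what makes the per-type queries cheap on a real index.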
One problem that was pointed out somewhere in the massive useless
discussion on bug 164 is that we'd have to do something to display the
first letter for each section. Currently it's just the first letter
of the sortkey, but if that's some binary string, that becomes a
problem. I'm not seeing an obvious solution, since the
sortkey-generation algorithm will be opaque to us. If it sorts Á the
same as A, then how do we figure out that the "canonical" first letter
for the section should be "A" and not "Á"? How do we even figure out
where the sections begin or end? Would that even make sense in all
cases? At a first pass, I'd say we should just skip the first letter
and display all the items straight from beginning to end without
section divisions. I don't think that's a big problem.
Roan is also asking for a store of the plain text form in this thread.
Work out how much space we would need to additionally store the
category keys in plain text. Then we will know what sort of tradeoff
we are looking at. Have you got a toolserver account you can use to do
the sums?
Since we won't be sorting on the plain text form anymore, we could use
some tricks to save space. For instance, if the sort key is the same
as the article title, we could store NULL instead of another copy of
the article title. That should save 95% or so.
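
The NULL trick could be sketched as follows. The function names and sample rows are made up for illustration, and the byte counts describe only this toy data, not any real wiki.

```python
from typing import Optional


def stored_plain_sortkey(title: str, sortkey: str) -> Optional[str]:
    """Store nothing (NULL/None) when the sort key just repeats the title."""
    return None if sortkey == title else sortkey


def effective_plain_sortkey(title: str, stored: Optional[str]) -> str:
    """Recover the plain-text sort key, falling back to the title."""
    return title if stored is None else stored


# (title, plain-text sort key) pairs; only the first has a custom key.
rows = [
    ("Albert Einstein", "Einstein, Albert"),
    ("Zebra", "Zebra"),
    ("Aardvark", "Aardvark"),
]

# Bytes needed if every plain-text key is stored verbatim.
naive_bytes = sum(len(key.encode("utf-8")) for _, key in rows)

# Bytes needed if keys equal to the title are stored as NULL.
stored = [stored_plain_sortkey(title, key) for title, key in rows]
compact_bytes = sum(len(s.encode("utf-8")) for s in stored if s is not None)
```

Running the same two sums over a real categorylinks dump would give the actual tradeoff.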
-- Tim Starling