On Wed, Jul 21, 2010 at 6:45 PM, Conrad Irwin <conrad.irwin(a)gmail.com> wrote:
I don't have an example to hand (as the page is not yet complete on
Wiktionary).
The Hungarian letter "cs" sorts after "c", so while in English "cs"
(for centi-seconds) should come before "CV", in Hungarian the entry
for the letter (which is missing) should come afterwards. Both English
and Hungarian would be on the same Wiktionary page.
Okay, I see. I don't think this would be terribly hard, although I
don't think it's needed for an initial implementation. The major
problem I see is that if the sort collation is per-category, then
changing it on a preexisting large category will require reparsing all
the pages, probably. (Unless we store the raw sortkeys as well.)
Some languages treat accented letters as the same primary letter, and
use the accent only in the secondary or tertiary sort key (which the
current category table's 80-byte keys are in danger of truncating).
Others have variations on a theme; again Hungarian makes a good
example: ö and ő are one letter with two stresses, while o and ó are a
different letter. It should be possible to automatically extract the first
letter from the words to be sorted (I don't know if ICU covers that,
if not, just ask some people who speak the language, or Wikipedia) -
but it's not possible to get that information from the sort keys
directly, so either we store the user provided sort key, and our
derived sort key, so we can use the former to find the first letter at
render time, or we just store the first letter.
I don't see an answer to my question here. Given a sorted list of
sortkeys, possibly including the raw sortkey as well as the one that's
been put through ICU/CLDR/whatever, what algorithm do you propose to
break it up into sections labeled with first letters? In particular,
any such algorithm should not conflict with the sort order, in the
sense that you should not have three words A, B, C sorted as A < B < C
where firstLetter(A) == firstLetter(C) != firstLetter(B). Is this
reasonably possible to guarantee in all alphabetic languages'
conventional sort orders?
If we do store the raw sort key, we could have some Language method to
retrieve the section name, and just write our own implementations for
various languages. However, I'm not sure this is worth the effort.
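The consistency condition above (no A < B < C with firstLetter(A) ==
firstLetter(C) != firstLetter(B)) amounts to saying that every section
label must occupy one unbroken run of the sorted list. A minimal sketch of
such a check, in Python for brevity; `sections_are_contiguous` and
`toy_hungarian_first_letter` are hypothetical names, and the toy
first-letter rule only knows about the "cs" digraph mentioned earlier in
the thread:

```python
from itertools import groupby


def sections_are_contiguous(sorted_words, first_letter):
    """Return True if every section label forms a single unbroken run.

    sorted_words must already be in the collation's order; first_letter
    maps a word to its section label.
    """
    seen = set()
    for label, _run in groupby(sorted_words, key=first_letter):
        if label in seen:
            # The label reappears after a different label intervened:
            # this is exactly the A < B < C conflict.
            return False
        seen.add(label)
    return True


def toy_hungarian_first_letter(word):
    # Assumption: a toy rule that treats only the digraph "cs" as one letter.
    lowered = word.lower()
    return "cs" if lowered.startswith("cs") else lowered[:1]


# Hungarian-style order: every "c" word sorts before every "cs" word.
hungarian_order = ["cet", "cukor", "csak", "csoda", "dal"]

# Plain code-point order interleaves "c" and "cs" words.
codepoint_order = sorted(hungarian_order)

print(sections_are_contiguous(hungarian_order, toy_hungarian_first_letter))
print(sections_are_contiguous(codepoint_order, toy_hungarian_first_letter))
```

The check passes only when the first-letter function and the sort order
come from the same locale rules; pairing a locale-aware first letter with
a binary sort produces split sections.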
On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw <roan.kattouw(a)gmail.com> wrote:
It doesn't make a great deal of sense and can be changed fairly easily
in Title::isValidMoveTarget().
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files
or vice versa?
An alternative would be to add a column to the categorylinks table,
say cl_type. It could be an ENUM or some short text type. Then the
index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into
cl_sortkey is to avoid a schema change, and to make the paging code
slightly simpler. But I worry that future generations of MediaWiki
developers will curse us for laziness and obfuscation.
I'm okay with this.
Well, I've said ICU, possibly with a PHP simulation of some Western
European sort key algorithm for the benefit of users without access to
ICU. But I formed that opinion years ago, and I never properly
surveyed all the possible solutions in the first place. It probably
makes sense to do a little of your own research.
Gerard Meijssen felt strongly that we should use something based on
CLDR. Apparently we have connections there and work with them a lot,
and I guess he feels it's higher-quality or such.
Note that I specifically excluded the actual implementation of
language-dependent sort keys from the requirements list when I wrote
up this project. It could easily eat up a lot of time, and it's not
necessary for a proof-of-principle implementation.
All right. Then I'll just do whatever's readily available and fits in
the database column.
Work out how much space we would need to additionally store the
category keys in plain text. Then we will know what sort of tradeoff
we are looking at. Have you got a toolserver account you can use to do
the sums?
Yes, I'm a toolserver root.
Since we won't be sorting on the plain text form anymore, we could use
some tricks to save space. For instance, if the sort key is the same
as the article title, we could store NULL instead of another copy of
the article title. That should save 95% or so.
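The NULL trick can be sketched as a pair of encode/decode helpers. This is
only an illustration in Python, not MediaWiki code; the function names are
hypothetical, and it assumes titles are stored with underscores while
sortkeys use spaces:

```python
def stored_sortkey(sortkey, page_title):
    """Return None (NULL) when the sortkey merely repeats the page title."""
    # Assumption: page_title uses underscores where the sortkey uses spaces.
    if sortkey.replace(" ", "_") == page_title:
        return None
    return sortkey


def effective_sortkey(stored, page_title):
    """Recover the sortkey to use, falling back to the title for NULL."""
    if stored is None:
        return page_title.replace("_", " ")
    return stored


# A default sortkey collapses to NULL and is rebuilt from the title.
print(stored_sortkey("Albert Einstein", "Albert_Einstein"))
print(effective_sortkey(None, "Albert_Einstein"))

# A hand-set sortkey is stored verbatim.
print(stored_sortkey("Einstein, Albert", "Albert_Einstein"))
```

One caveat of this sketch: a sortkey that itself contains underscores and
otherwise equals the title would come back with spaces instead.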
It doesn't seem like it would save nearly that much. On the Welsh
Wikipedia (small enough database to be manageable), I get the
following:
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON
cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                  551851 |
+-------------------------+
1 row in set (1.94 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON
cl_from=page_id;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                 1619747 |
+-------------------------+
1 row in set (0.44 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON
cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title AND
page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                  347539 |
+-------------------------+
1 row in set (0.20 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON
cl_from=page_id WHERE page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                 1067588 |
+-------------------------+
1 row in set (0.19 sec)
I restricted the last two queries to the main namespace to avoid false
positives from namespace prefixes. This suggests savings of maybe
50-75%. The story may be different on larger wikis. It's worth
remembering, though, that a lot of these sortkeys might be set to work
around deficiencies in the current default sortkey generation, so
maybe it would be higher savings in the long term.
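For reference, the arithmetic behind that estimate, using the four sums
from the queries above. It is a rough sketch: it counts only sortkey bytes
and ignores any per-row overhead of a nullable column. The variable and
function names are just for illustration:

```python
# Byte totals reported by the four queries on the Welsh Wikipedia.
differing_all = 551851    # sortkey != title, all namespaces
total_all = 1619747       # every sortkey, all namespaces
differing_main = 347539   # sortkey != title, main namespace only
total_main = 1067588      # every sortkey, main namespace only


def savings(differing_bytes, total_bytes):
    """Fraction of sortkey bytes that would become NULL."""
    return 1 - differing_bytes / total_bytes


print(f"all namespaces: {savings(differing_all, total_all):.1%}")
print(f"main namespace: {savings(differing_main, total_main):.1%}")
```

Both figures come out at roughly two thirds, inside the 50-75% range and
well short of 95%.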
It's still not at all clear to me that saving a raw copy in the
database is worth it. If we really need sectioning by first letter on
category pages, we could save the first letter instead, and leave that
NULL when it's the same as the first letter of the page title (all of
this for some locale-specific definition of "first letter"). But I
don't know if we need that.
On Thu, Jul 22, 2010 at 3:37 AM, Roan Kattouw <roan.kattouw(a)gmail.com> wrote:
There is another reason to prefer this schema, which is that the
originally proposed one is susceptible to weird transition bugs. After
this feature is deployed, there will be old-format (i.e. plain)
sortkeys sticking around in the database for quite some time after
(they won't go away until LinksUpdate fixes them), which means that
pages whose sortkey starts with a C or F will be recognized as
categories and files respectively, even if they're normal pages.
The best way to mitigate that is to populate the namespace information
prior to deployment. In Tim's schema, that means filling the cl_type
field based on page_namespace. In the sortkey prefix schema, that
means prefixing the sortkey with the relevant prefix, but that also
requires that the sortkey-updating code has already been updated at that
time (so it doesn't overwrite new-style sortkeys with old-style ones),
which means you'd have to partially deploy the code while running the
population script. Yuck.
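The transition bug described here can be illustrated with a small Python
sketch. Everything in it is hypothetical: the reader function, the exact
prefix letters, and the cl_type values ("page", "subcat", "file") are
assumptions for illustration; only the namespace numbers (6 for files, 14
for categories) are MediaWiki's standard ones:

```python
NS_FILE = 6
NS_CATEGORY = 14


def type_from_prefixed_sortkey(sortkey):
    """Hypothetical reader for the prefix schema: first character = type."""
    return {"C": "subcat", "F": "file"}.get(sortkey[:1], "page")


def cl_type_from_namespace(page_namespace):
    """Hypothetical population rule for a separate cl_type column."""
    if page_namespace == NS_CATEGORY:
        return "subcat"
    if page_namespace == NS_FILE:
        return "file"
    return "page"


# An old-format (plain) sortkey of an ordinary article, left over in the
# table after deployment, happens to start with "C".
old_sortkey = "Cardiff"
article_namespace = 0

# Prefix schema: the leftover row is misread as a subcategory.
print(type_from_prefixed_sortkey(old_sortkey))

# Separate column: the type never depends on the sortkey's contents.
print(cl_type_from_namespace(article_namespace))
```

With the separate column, the population script can run against old-format
rows without the sortkey-writing code having changed first.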
This whole problem arises for sortkey changes generally. It will be
just as much of a problem when going to a new sortkey type (based on
CLDR or whatever). The only way to avoid it is to create a new
column, populate it while maintaining both columns at once, start
using the new column once it's fully populated, and then drop the old
column. That seems excessive. Remember that we can convert the
current raw sortkey into ICU/CLDR/whatever without reparsing pages, as
long as we can reliably tell old from new sortkeys (should be pretty
easy to do heuristically). So it shouldn't take forever -- surely no
more than a day or two even for enwiki.
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard <dgerard(a)gmail.com> wrote:
Please don't remove the feature where the first letter of the sort key
is displayed in the rendered category page, and if necessary add what
it takes to keep it.
There are scripts where this will be a hard problem, but it's still
much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type
of list. We have zillions of types of page lists, and category pages
are the only ones that have the first letter displayed. It makes the
columns uneven, and is completely crazy for some scripts (like CJK,
AFAICT).