Magnus Manske wrote:
Hi,
I worte a little PHP object that takes
* a list of article IDs
* a list of categories
and returns those article IDs which belong to these categories *or their
subcategories*.
The algorithm is written to minimize the number of read queries when
parsing the category tree. The maximum number of read queries is the
"tallest" category tree in the set of articles (times two, one for the
"parents", one for their page_ids), which IMHO should mostly be below 10
(I guess it's usually 5 or less, but I have no data to back that up).
This problem seems like the top-down counterpart of bottom-up problem of
category tracing an article to the one top category. Or the similar
problem of using "What links here" to trace back to the Main Page. In
either case the path length should be the same, and many will find it
amazingly short. It can probably be estimated from the average number
of direct subcategories in a category. This yields the simple
exponential equation n^x=T where "n" is the average number of
subcategories in a category, and T is the total number of articles in
the database. In other words x = log T/log n. In a simple example with
base 10 logarithms: If each category has 10 sub-categories and there are
1,000,000 articles in the database the average path length will only be
6. Growing to 10,000,000 only increases that path length to 7.
So, if the tree depth for an article is 8, and I add
more articles to
search for which all have a depth of 8 or less, no additional database
queries are neccessary. The individual query will grow in size, though.
Yes
I intend to use it on the Tasks feature search, which
is why I put it
into that extension ("extensions" module,
"Tasks/categoryfinder.php").
But I hope this can be used for other searches as well. The idea is,
instead of searching for, say, 25 articles, to internally search for
more (e.g. a few hundred), then filter that set through the
categoryfinder, until there are 25 matching articles.
Before I start implementing the actual search interface, can anyone tell
me if that would put too much stress on the DB slaves? Keep in mind that
limiting several searches to "articles in [[Category:Physics]] and its
subcategories" appears to be immensely useful, so it might be well worth
the DB stress from a user standpoint.
I suspect that the bottom up problem should be easier, especially in a
strict herarchy where a sub-category only belongs to one category. This
would give a single tracing back to the source. Ethnologue uses it in
its website, and so do taxonomic databases. We show this in our species
taxoboxes, and it is a significant feature in Wikispecies.
When you go in the other direction the size of your results can grow
exponentially, and practically you can only show a limited depth of
categories. After an initial query you can perform a second more
limited query. In a sense it suggests that a more organized approach to
categories would be superior to the higgledy-piggledy one that currently
prevails.
Ec