Re: [Wikitech-l] Category finder

7 Jan 2006

Magnus Manske wrote:

...
 Hi,

I worte a little PHP object that takes
* a list of article IDs
* a list of categories
and returns those article IDs which belong to these categories *or their
subcategories*.

The algorithm is written to minimize the number of read queries when
parsing the category tree. The maximum number of read queries is the
"tallest" category tree in the set of articles (times two, one for the
"parents", one for their page_ids), which IMHO should mostly be below 10
(I guess it's usually 5 or less, but I have no data to back that up).
 This problem seems like the top-down counterpart of bottom-up problem of 
category tracing an article to the one top category.  Or the similar 
problem of using "What links here" to trace back to the Main Page.  In 
either case the path length should be the same, and many will find it 
amazingly short.  It can probably be estimated from the average number 
of direct subcategories in a category.  This yields the simple 
exponential equation n^x=T where "n" is the average number of 
subcategories in a category, and T is the total number of articles in 
the database.  In other words x = log T/log n.  In a simple example with 
base 10 logarithms: If each category has 10 sub-categories and there are 
1,000,000 articles in the database the average path length will only be 
6.  Growing to 10,000,000 only increases that path length to 7.

...
 So, if the tree depth for an article is 8, and I add
more articles to
search for which all have a depth of 8 or less, no additional database
queries are neccessary. The individual query will grow in size, though.
 Yes

...
 I intend to use it on the Tasks feature search, which
is why I put it
into that extension ("extensions" module,
"Tasks/categoryfinder.php").
But I hope this can be used for other searches as well. The idea is,
instead of searching for, say, 25 articles, to internally search for
more (e.g. a few hundred), then filter that set through the
categoryfinder, until there are 25 matching articles.

Before I start implementing the actual search interface, can anyone tell
me if that would put too much stress on the DB slaves? Keep in mind that
limiting several searches to "articles in [[Category:Physics]] and its
subcategories" appears to be immensely useful, so it might be well worth
the DB stress from a user standpoint. 
 I suspect that the bottom up problem should be easier, especially in a 
strict herarchy where a sub-category only belongs to one category.  This 
would give a single tracing back to the source.  Ethnologue uses it in 
its website, and so do taxonomic databases.  We show this in our species 
taxoboxes, and it is a significant feature in Wikispecies.

When you go in the other direction the size of your results can grow 
exponentially, and practically you can only show a limited depth of 
categories.  After an initial query you can perform a second more 
limited query.  In a sense it suggests that a more organized approach to 
categories would be superior to the higgledy-piggledy one that currently 
prevails.

Ec

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Category finder