[WikiEN-l] current categories and subcategories

Oskar Sigvardsson oskarsigvardsson at gmail.com
Fri Sep 8 18:52:17 UTC 2006


On 9/7/06, karen marcelo <karen at radarnetworks.com> wrote:
> hi all
>
> i just joined this list so i'm not sure if this is the correct forum to
> ask this, but how do i get a master list of
> all wikipedia categories and sub-categories?    i looked at the database
> download page and took a peek at
> pages_articles.xml and see that one can parse out Categories from the
> mediawiki tags embedded in the
> text.  but is there some other dump with all the categories (even ones
> that may not have articles) as well as their
> subcategories available somewhere?
>
> apologies in advance if this was the wrong list to post this question
> to, but if someone could direct me to the right one
> it would be most appreciated.
>
> thanks much!

If you look at http://download.wikimedia.org/enwiki/latest/ you can
see that there is a dump called categorylinks
(http://download.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz).
It's just under 100 MB, and it should be what you're after. The schema
for it can be found at http://meta.wikimedia.org/wiki/Categorylinks_table
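For what it's worth, here is a rough Python sketch of how the categorylinks dump could be read. It assumes the file is an ordinary MySQL dump made of INSERT INTO lines, and that the first two columns of each row are the page id (cl_from) and the category name (cl_to), as the schema page describes. The function names are my own, and the regular expression is a shortcut rather than a real SQL parser: it could misfire on a sort key that happens to look like the start of a row.

```python
import gzip
import re
from collections import defaultdict

# Matches the start of each row tuple: (page_id,'Category_name',
# Assumption: cl_from (integer page id) and cl_to (quoted category name)
# are the first two columns of every row.
ROW_START = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)',")


def parse_insert_line(line):
    """Yield (page_id, category_name) pairs from one INSERT statement line."""
    if not line.startswith("INSERT INTO"):
        return
    for match in ROW_START.finditer(line):
        page_id = int(match.group(1))
        # Drop the backslash from escapes such as \' and \\ in the name.
        name = re.sub(r"\\(.)", r"\1", match.group(2))
        yield page_id, name


def category_members(path):
    """Map each category name to the list of page ids filed under it."""
    members = defaultdict(list)
    # The encoding is a guess; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            for page_id, name in parse_insert_line(line):
                members[name].append(page_id)
    return members
```

Calling category_members("enwiki-latest-categorylinks.sql.gz") on the downloaded file would then give a dictionary keyed by category name, provided the dump really has the layout assumed above.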

It only identifies articles that contain a certain category by the
id-number of the article, which means that if you want to know which
categories are sub/super-categories of each other you need to download
another dump that contains the names of the categories as well as their
ids. cur certainly contains this, but at 1.5 GB it may be a bit large.
You could try enwiki-latest-all-titles-in-ns0.gz first; that might
contain the ids. It's just 14 MB, so it's not hard to download it
and check.
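To make the id-to-name step concrete, here is a small Python sketch of the join. It assumes you already have (page id, category name) pairs taken from categorylinks, plus a lookup from page id to category name covering only those pages that are themselves categories. That lookup is hypothetical: it would have to be built from whichever dump turns out to carry both ids and titles. The function and argument names are mine.

```python
from collections import defaultdict


def build_subcategory_map(links, category_page_titles):
    """Return a dict mapping each category name to its subcategory names.

    links: iterable of (page_id, parent_category_name) pairs.
    category_page_titles: dict of page id -> category name, containing
        only pages that are categories (a hypothetical input).
    """
    subcategories = defaultdict(set)
    for page_id, parent in links:
        child = category_page_titles.get(page_id)
        # Ids missing from the lookup belong to ordinary articles; skip them.
        if child is not None:
            subcategories[parent].add(child)
    return subcategories
```

Super-categories come from the same data read the other way around: swap parent and child when filling the dictionary.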

Note: I'm not a (very good) mediawiki developer ;)

Good luck!

--Oskar


