Here's something that looks promising: Using MySQL's fulltext search
against categories only. What I did was create a table having pagenumbers
and category text that looks like this:
<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&sql_query=SELECT+%2A+FROM+%60pages%60&session_max_rows=30&pos=0&disp_direction=horizontal&repeat_cells=100&goto=tbl_properties_structure.php&dontlimitchars=1>
ID<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60ID%60+ASC>
pagenum<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60pagenum%60+ASC>
pagename<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60pagename%60+ASC>
catcount<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60catcount%60+ASC>
catlist<http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60catlist%60+ASC>
7343 0 Kurt000 14 Chinese_Wikipedians Leo_Wikipedians User_cello Us...
7344 1 AaA 3 Redirects_from_CamelCase Redirects_from_other_cap...
7345 5 AlgeriA 2 Redirects_from_CamelCase Unprintworthy_redirects
7346 6 AmericanSamoa 2 Redirects_from_CamelCase Unprintworthy_redirects
So the categories have underscores instead of spaces, thus forcing mysql to
index the whole category phrase as one word. In a table with about 114,000
rows with categories (it takes a long time to recombine the categories into
one row - I'm sure the way I'm doing it is clunky), I get very reasonable
query result times. Take our worst case "Living_People" category for
example:
Showing rows 0 - 29 (4,105 total, Query took 0.0042 sec)
SQL query: SELECT *
FROM `pages`
WHERE MATCH (
catlist
)
AGAINST (
'Living_People'
)
LIMIT 0 , 30
This also give us an immediate solution to category math (Boolean searches):
Showing rows 0 - 29 (61 total, Query took 0.0058 sec)
SQL query: SELECT *
FROM `pages`
WHERE MATCH (
catlist
)
AGAINST (
'+Living_People +1969_births'
IN BOOLEAN
MODE
)
LIMIT 0 , 30
So, I know 114,000 records is much smaller than one million +, but this
looks quite promising.
I'm using PHP to cycle through records in the categorylinks table to
recreate this index. If one of you smart guys will tell me a better way,
I'll try this on a larger data set.
Best Regards,
Aerik