Okay, I've got some better numbers (and a better method of building the
category intersections table from an existing database).
I built a table with 2.6 million records, all the pages from the snapshot I
had. It's 90% populated with categories (this will make more sense in a
minute) and I'm still getting good results from intersection queries. For
instance:
Showing rows 0 - 1 (2 total, Query took 0.3459 sec)

SQL query:

SELECT *
FROM `pageindex`
WHERE MATCH (catlist)
    AGAINST ('+Living_People +People_from_Maine +1956_births' IN BOOLEAN MODE)
LIMIT 0, 30
(which has two results, btw, Cynthia McFadden and David Kelley).
This consistently takes a third of a second. Not bad, and that's for
2.6 million rows. I think we should restrict it to just current articles in
the main namespace, but that's my opinion.
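If you want to play with the intersection idea without a MySQL snapshot, here's a self-contained sketch using SQLite's FTS5 instead (the query syntax differs from MySQL's boolean mode, and the titles/data below are invented, not pulled from my table):

```python
import sqlite3

# Toy model of the boolean-AND category intersection using SQLite FTS5.
# tokenchars '_' keeps each underscored category name as a single token.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pageindex "
    "USING fts5(title, catlist, tokenize = \"unicode61 tokenchars '_'\")"
)
rows = [  # made-up data matching the example categories
    ("Cynthia McFadden", "Living_People People_from_Maine 1956_births"),
    ("David Kelley", "Living_People People_from_Maine 1956_births"),
    ("Someone Else", "Living_People 1956_births"),
]
conn.executemany("INSERT INTO pageindex VALUES (?, ?)", rows)

# FTS5's AND plays the role of MySQL's '+term' in boolean mode.
hits = sorted(row[0] for row in conn.execute(
    "SELECT title FROM pageindex WHERE pageindex MATCH "
    "'Living_People AND People_from_Maine AND 1956_births'"
))
```

Only the two pages carrying all three categories come back; the page missing People_from_Maine is filtered out, same as a '+' term would do it.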
I figured out a better way to populate the table, too:
UPDATE pageindex, categorylinks
SET pageindex.catlist = CONCAT_WS(' ', pageindex.catlist, categorylinks.cl_to)
WHERE pageindex.pageid = categorylinks.cl_from
  AND INSTR(pageindex.catlist, categorylinks.cl_to) = 0
You have to run this query a number of times equal to the greatest number of
distinct categories any one page has: a multiple-table UPDATE touches each
pageindex row at most once per statement, so each run appends at most one
new category per page. Most pages have fewer than 7, but a few apparently
have outrageous numbers. After doing a lot of searching around for a way to
concatenate multiple rows into one value, I think this is pretty good.
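To see why the number of runs is bounded by the deepest per-page category count, here's a toy Python model of the repeat-until-nothing-changes process (the table data is invented; also note it checks whole tokens, whereas INSTR is a substring test and could skip a category whose name is contained in one already present):

```python
# Toy model of re-running the multiple-table UPDATE: each pass appends at
# most ONE missing category per page, so the number of passes needed tracks
# the largest per-page category count. Data is made up.

categorylinks = [  # (cl_from pageid, cl_to category)
    (1, "Living_People"),
    (1, "People_from_Maine"),
    (1, "1956_births"),
    (2, "Living_People"),
]
catlist = {1: "", 2: ""}  # pageindex: pageid -> space-separated catlist

passes = 0
changed = True
while changed:
    changed = False
    touched = set()  # each pageindex row is updated at most once per pass
    for pageid, cat in categorylinks:
        if pageid in touched or cat in catlist[pageid].split():
            continue
        catlist[pageid] = (catlist[pageid] + " " + cat).strip()
        touched.add(pageid)
        changed = True
    passes += 1
```

Here page 1 has three categories, so it takes three appending passes plus one final pass that changes nothing, and that's when you know you can stop.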
Aerik