On Wed, 23 Apr 2008, Robert Stojnic wrote:
Roan Kattouw wrote:
Brion Vibber wrote:
Should check whether Robert's already hacked some of this stuff into the
Lucene server or what changes it would require.
If I understand correctly, Lucene shouldn't really care what it stores,
as long as it's text and it's searchable. Storing "Living_people
Articles_needing_cleanup" would work just fine, right? We do need to
think about case-sensitivity, though.
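To make the "one token per category" idea and the case-sensitivity question concrete, here is a minimal sketch in plain Python rather than Lucene itself. The function names, the full lowercasing rule, and the sample page titles are all assumptions made up for illustration; MediaWiki's real title handling is not modelled here.

```python
from collections import defaultdict


def normalize(category):
    """Turn a category title into a single index token (hypothetical scheme)."""
    # Assumption: spaces become underscores and case is folded completely,
    # so "Living people" and "living_people" map to the same token.
    return category.strip().replace(" ", "_").lower()


def build_index(pages):
    """Map each normalized category token to the set of pages carrying it."""
    index = defaultdict(set)
    for title, categories in pages.items():
        for category in categories:
            index[normalize(category)].add(title)
    return index


def intersect(index, *categories):
    """Return the pages that belong to every one of the given categories."""
    postings = [index.get(normalize(c), set()) for c in categories]
    if not postings:
        return set()
    return set.intersection(*postings)


# Made-up sample data, not real articles.
pages = {
    "Example Person A": ["Living people", "English comedy writers"],
    "Example Person B": ["Living people", "Articles needing cleanup"],
    "Example Person C": ["English comedy writers"],
}
index = build_index(pages)
result = intersect(index, "living people", "English_comedy_writers")
print(sorted(result))
```

Folding case at index time and at query time is the simplest way to sidestep the case question, at the cost of merging categories that differ only by case.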
Let me briefly repeat what I said earlier about my experience with this
category intersection thingy. Adding categories to the Lucene index is
easy *IF* they are inside the article, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Bin…
This will give you the category intersection of "Living People" and
"English comedy writers" in a fraction of a second.
Hey Robert,
That is really cool - but it seems to be doing a text match on the whole
article, not just the categories... ?
What I found is that the hard part is keeping the index updated. If we
want the fancy category intersection system discussed here before, we need
an index that is frequently updated, that is integrated with the job
queue, that understands templates, etc.
That is always the hurdle with Lucene, right? It doesn't do updates, just
delete, re-add, and then optimize (and I'd guess optimizing can get
resource-intensive on a big index).
Lucene is not that good with very frequent updates. The usual setup is to
have an indexer, make snapshots of the index at regular intervals, and then
rsync them onto the searchers. The whole process takes time, although for a
category-only index it will probably be fast. I assume there would be at
least a few tens of minutes of lag anyhow. Our current Lucene framework
could easily be used for index distribution and such.
What remains unsolved, however, is keeping the index updated with the
latest changes on the site. If one changes a template with a category in
it, the thing goes on the job queue. I assume there would need to be some
kind of hook that will either log the change somewhere or send data to
Lucene somehow. This is the part of the backend that needs thinking and
solving.
Well... this isn't a complete plan, just some thoughts (and maybe they're
naive, but I'll give it a shot anyway). I'm thinking of a new table that
holds the pageid and a text field that holds the category strings, leaving
the underscores in place. This gets updated via hook at the same time an
edit triggers an update to the categorylinks table (not familiar enough with
the code to have that data at hand) - this will handle categories in
templates etc (leverage the logic that already deals with this). Okay, so
once you build the table, the updates to that table aren't too bad. In
core, this would be a MyISAM table with a fulltext index. For larger wikis,
this would be an InnoDB table.
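A rough sketch of that table and the hook that maintains it, using an in-memory SQLite database purely as a stand-in for the MySQL table. The table name `category_text`, its columns, and the handler `on_categories_changed` are hypothetical names for illustration, not existing MediaWiki schema or hooks, and the fulltext index is left out.

```python
import sqlite3


def create_table(conn):
    """Create the hypothetical page-to-category-string table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS category_text ("
        " page_id INTEGER PRIMARY KEY,"
        " categories TEXT NOT NULL)"
    )


def on_categories_changed(conn, page_id, categories):
    """Hypothetical hook handler, called with the page's full category list.

    Assumption: the caller has already resolved categories coming from
    templates, the same way the categorylinks update does.
    """
    # One token per category, underscores left in place, space-separated.
    text = " ".join(c.replace(" ", "_") for c in sorted(categories))
    if text:
        conn.execute(
            "INSERT OR REPLACE INTO category_text (page_id, categories)"
            " VALUES (?, ?)",
            (page_id, text),
        )
    else:
        # A page with no categories left simply drops out of the table.
        conn.execute("DELETE FROM category_text WHERE page_id = ?", (page_id,))
    conn.commit()


def categories_for(conn, page_id):
    """Return the stored category string for a page, or None if absent."""
    row = conn.execute(
        "SELECT categories FROM category_text WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_table(conn)
on_categories_changed(conn, 42, ["Living people", "Articles needing cleanup"])
print(categories_for(conn, 42))
```

Rewriting the whole row on each change keeps the hook simple: it never has to work out which individual categories were added or removed.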
Then the question (the one I think you're raising) is at what point do we
refresh or update the Lucene index from that table? I'm not sure of the
best answer to that question. Is it feasible to do delete/add every time a
category is changed, and then optimize once a day or something? (probably
not, eh?)
What are we doing for the main search index? Rebuild daily or so? In an
initial implementation, why not follow the same type of schedule?
Alternately, perhaps do an update and optimize once an hour? Guess it
depends on how much time/resources it takes to update and optimize the
index... But certainly using the same schedule as the main index is a safe
and conservative plan?
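To illustrate the "delete/add often, optimize rarely" option, here is a minimal sketch. `InMemoryIndex` is a stand-in that only mimics the delete, add and optimize operations; it is not the Lucene API, and `BatchedUpdater` and its parameters are made-up names. Time is passed in explicitly so the schedule is easy to follow.

```python
class InMemoryIndex:
    """Stand-in for a real search index; it only records what was asked."""

    def __init__(self):
        self.docs = {}
        self.optimize_calls = 0

    def delete(self, page_id):
        self.docs.pop(page_id, None)

    def add(self, page_id, text):
        self.docs[page_id] = text

    def optimize(self):
        self.optimize_calls += 1


class BatchedUpdater:
    """Collect category changes, apply them in batches, optimize rarely."""

    def __init__(self, index, optimize_interval):
        self.index = index
        self.optimize_interval = optimize_interval  # in seconds
        self.pending = {}  # page_id -> new category string, or None if removed
        self.last_optimize = None

    def record_change(self, page_id, text):
        # A later change to the same page replaces the earlier pending one.
        self.pending[page_id] = text

    def flush(self, now):
        """Apply pending changes as delete + re-add; return how many pages."""
        for page_id, text in self.pending.items():
            self.index.delete(page_id)
            if text is not None:
                self.index.add(page_id, text)
        applied = len(self.pending)
        self.pending.clear()
        # Optimize on the first flush, then only once the interval has passed.
        due = (self.last_optimize is None
               or now - self.last_optimize >= self.optimize_interval)
        if due:
            self.index.optimize()
            self.last_optimize = now
        return applied


index = InMemoryIndex()
updater = BatchedUpdater(index, optimize_interval=24 * 3600)
updater.record_change(1, "Living_people")
updater.record_change(1, "Living_people English_comedy_writers")
updater.record_change(2, None)
first_batch = updater.flush(now=0)       # applies changes, optimizes
updater.record_change(3, "Articles_needing_cleanup")
second_batch = updater.flush(now=3600)   # applies changes, no optimize yet
```

How often `flush` can run in practice depends on the real cost of a delete and re-add on the production index, which this sketch does not measure.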
Best Regards,
Aerik
--
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!