Actually, I think the more generic Lucene library, which Nutch is
built upon, will be more useful. We should be indexing the wikitext,
not the HTML (which is a lower quality version ;))
That is the one open issue if you plan to use Lucene: you need a good
parser for the wikitext syntax, and that is very difficult.
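To give a feel for the problem, here is a minimal Python sketch of
turning wikitext into plain text before handing it to an indexer. The
function name strip_wikitext is made up for this example, and the
regexes only cover a few simple constructs (non-nested templates,
plain links, bold/italic, headings). Nested templates, tables, image
links and so on are left untouched, which is exactly where a real
parser gets hard.

```python
import re

def strip_wikitext(text: str) -> str:
    """Very rough wikitext-to-plain-text pass (hypothetical helper)."""
    # Drop simple, non-nested templates such as {{stub}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> plain text.
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading.
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()

sample = "'''Lucene''' is a [[search engine|search library]] {{stub}}"
print(strip_wikitext(sample))
```

The output of something like this would then be what goes into the
index, instead of the rendered HTML.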
Seriously, we also don't want a crawler. What is left in Nutch's
favour?
Nothing! Use Lucene - trust me. :-)
It will definitely save Wikipedia a great deal of load!