Hi all,
I am wondering what the fastest/best way is to get a local dump of
English Wikipedia in HTML? We are looking just for the current versions
(no edit history) of articles for the purposes of a research project.
We have been exploring using bliki [1] to convert the source markup in
the Wikipedia dumps to HTML, but the latest version seems to take
several seconds per article on average (even after the most common
templates have been downloaded and cached locally). At that rate,
converting the full dump would take several months.
We also considered using Nutch to crawl Wikipedia, but with a reasonable
crawl delay (5 seconds) it would take several months to fetch a copy of
every article in HTML (or at least the "reachable" ones).
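The back-of-envelope arithmetic for the crawl (the article count is an assumption, roughly the current size of English Wikipedia):

```python
articles = 4_500_000          # assumed article count (rough figure)
delay = 5                     # seconds between requests (crawl delay)
seconds_per_day = 86_400

days = articles * delay / seconds_per_day
print(f"{days:.0f} days")     # about 260 days at one request per 5 s
```

So even a single-threaded crawl that never wastes a request is well over eight months.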
Hence we are a bit stuck right now and not sure how to proceed. Any
help, pointers, or advice would be greatly appreciated!
Best,
Aidan
[1]
https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home