On 3 May 2018 at 19:54, Aidan Hogan <ahogan(a)dcc.uchile.cl> wrote:
Hi all,
I am wondering what is the fastest/best way to get a local dump of English
Wikipedia in HTML? We are looking just for the current versions (no edit
history) of articles for the purposes of a research project.
We have been exploring using bliki [1] to do the conversion of the source
markup in the Wikipedia dumps to HTML, but the latest version seems to take
on average several seconds per article (including after the most common
templates have been downloaded and stored locally). This means it would take
several months to convert the dump.
We also considered using Nutch to crawl Wikipedia, but with a reasonable
crawl delay (5 seconds) it would take several months to get a copy of every
article in HTML (or at least the "reachable" ones).
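[For context, a back-of-envelope check of that crawl estimate. The article count of ~5.6 million is an approximate figure for English Wikipedia around 2018, not a number from this thread:]

```python
# Rough crawl-time estimate: articles * per-request delay.
# ARTICLES is an approximation (~5.6M English Wikipedia articles, ca. 2018).
ARTICLES = 5_600_000
CRAWL_DELAY_S = 5  # the "reasonable crawl delay" mentioned above

total_seconds = ARTICLES * CRAWL_DELAY_S
days = total_seconds / 86_400   # seconds per day
months = days / 30

print(f"~{days:.0f} days (~{months:.1f} months)")  # roughly 324 days
```

So a single polite crawler is on the order of ten months, consistent with "several months".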
Hence we are a bit stuck right now and not sure how to proceed. Any help,
pointers or advice would be greatly appreciated!!
Best,
Aidan
[1]
https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?
Ref
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_…
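[A minimal sketch of how the XML dump route could start: streaming pages out of a `pages-articles` dump with the Python standard library, so the multi-GB file never has to fit in memory. The inline SAMPLE stands in for the real dump file, and the export namespace version (0.10) is an assumption that depends on the dump you download. Note this only extracts wikitext; rendering it to HTML would still need a wikitext parser such as bliki or Parsoid.]

```python
# Stream (title, wikitext) pairs out of a MediaWiki XML dump without
# loading the whole file into memory, using ElementTree's iterparse.
import io
import xml.etree.ElementTree as ET

# Export schema namespace; version 0.10 is an assumption, check your dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny inline stand-in for enwiki-latest-pages-articles.xml.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[test]] article.</text></revision>
  </page>
</mediawiki>"""

def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a dump file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            yield title, text
            elem.clear()  # release the parsed subtree to keep memory flat

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages[0][0])  # prints "Example"
```

On a real dump you would pass an open (possibly bz2-decompressed) file object instead of the StringIO sample.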
Fae
--
faewik(a)gmail.com
https://commons.wikimedia.org/wiki/User:Fae