Hi all,
Many thanks for all the pointers! In the end we wrote a small client to
grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase)
as suggested by Neil. The HTML looks perfect, and with the generous 200
requests/second limit (which we could not even manage to reach with our
local machine), it only took a couple of days to grab all current
English Wikipedia articles.
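For anyone wanting to do the same, a minimal sketch of such a client in Python (standard library only) might look like the following. It assumes the REST API's /page/html/{title} endpoint; the User-Agent string and the delay between requests are placeholders you would tune yourself, and a real crawl would of course add error handling and retries.

```python
# Minimal sketch of a RESTBase HTML fetcher. Assumptions: the REST API's
# /page/html/{title} endpoint; User-Agent and delay are placeholders.
import time
import urllib.parse
import urllib.request

REST_HTML = "https://en.wikipedia.org/api/rest_v1/page/html/"

def restbase_url(title: str) -> str:
    # Percent-encode the title so spaces, slashes, etc. are URL-safe.
    return REST_HTML + urllib.parse.quote(title, safe="")

def fetch_html(titles, delay=0.01):
    # Yield (title, html) pairs, pausing between requests to stay
    # politely below the advertised 200 requests/second limit.
    for title in titles:
        req = urllib.request.Request(
            restbase_url(title),
            headers={"User-Agent": "research-bot/0.1 (contact@example.org)"},
        )
        with urllib.request.urlopen(req) as resp:
            yield title, resp.read().decode("utf-8")
        time.sleep(delay)
```

Calling fetch_html(["Albert Einstein"]) would then yield the Parsoid HTML for that article one page at a time.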
@Kaartic, many thanks for the offers of help with extracting HTML from
ZIM! We investigated this option in parallel, converting ZIM to
HTML using Zimreader-Java [1], and indeed it looked promising, but we
ran into some issues with extracting links. We did not try the mwoffliner
tool you mentioned since we got what we needed through RESTBase in the
end. In any case, we appreciate the offers of help. :)
Best,
Aidan
[1]
https://github.com/openzim/zimreader-java
On 08-05-2018 9:34, Kaartic Sivaraam wrote:
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of
English Wikipedia in HTML? We are looking just for the current
versions (no edit history) of articles for the purposes of a research
project.
The Kiwix project provides HTML dumps of Wikipedia for offline reading:
http://www.kiwix.org/downloads/
In case you need pure HTML and not the ZIM file format, you could check
out mwoffliner[1], ...
Note that the HTML is (of course) not the same as what you see when
visiting Wikipedia. For example, the sidebar links are not present,
and the ToC would not be present either.