Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

3 May 2018

Hi Fae,

On 03-05-2018 16:18, Fæ wrote:
...
  On 3 May 2018 at 19:54, Aidan Hogan <ahogan(a)dcc.uchile.cl> wrote:
  Hi all,

 I am wondering what is the fastest/best way to get a local dump of English
 Wikipedia in HTML? We are looking just for the current versions (no edit
 history) of articles for the purposes of a research project.

 We have been exploring using bliki [1] to do the conversion of the source
 markup in the Wikipedia dumps to HTML, but the latest version seems to take
 on average several seconds per article (including after the most common
 templates have been downloaded and stored locally). This means it would take
 several months to convert the dump.

 We also considered using Nutch to crawl Wikipedia, but with a reasonable
 crawl delay (5 seconds) it would take several months to get a copy of every
 article in HTML (or at least the "reachable" ones).

 Hence we are a bit stuck right now and not sure how to proceed. Any help,
 pointers or advice would be greatly appreciated!!

 Best,
 Aidan

 [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home

 Just in case you have not thought of it, how about taking the XML dump
 and converting it to the format you are looking for?

 Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_…

Thanks for the pointer! We are currently attempting to do something like 
that with bliki. The issue is that we are interested in the 
semi-structured HTML elements (like lists, tables, etc.) which are often 
generated through external templates with complex structures. Often from 
the invocation of a template in an article, we cannot even tell if it 
will generate a table, a list, a box, etc. E.g., it might say "Weather 
box" in the markup, which gets converted to a table.

Although bliki can help us to interpret and expand those templates, each 
page takes several seconds, meaning months of computation time to get the 
semi-structured data we want from the dump. Due to these templates, we 
have not had much success yet with this route of taking the XML dump and 
converting it to HTML (or even parsing it directly); hence we're still 
looking for other options. :)
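
To make the "months" figure concrete, here is a rough back-of-the-envelope
sketch. The article count (about 5.6 million for English Wikipedia) is an
assumption for illustration, not a number taken from our runs, and the
function name and the `workers` parameter are hypothetical helpers; only the
"several seconds per article" and "5 second crawl delay" figures come from
the discussion above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_process(num_articles: int, seconds_per_article: float,
                    workers: int = 1) -> float:
    """Wall-clock days to handle every article at a fixed per-article cost.

    `workers` models naive parallelism (e.g. several conversion processes);
    it assumes the work divides evenly, which is optimistic.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = num_articles * seconds_per_article / workers
    return total_seconds / SECONDS_PER_DAY


# ASSUMPTION: approximate number of English Wikipedia articles.
ASSUMED_ARTICLES = 5_600_000

if __name__ == "__main__":
    for label, cost in [("3 s per article", 3.0), ("5 s per article", 5.0)]:
        days = days_to_process(ASSUMED_ARTICLES, cost)
        print(f"{label}: about {days:.0f} days on a single worker")
```

Under these assumptions, 5 seconds per article on a single worker comes to
about 324 days, which is why both the bliki conversion and a polite crawl
look impractical without substantial parallelism.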

Cheers,
Aidan
