Have a look at RESTBase, which offers an API for retrieving HTML versions
of Wikipedia pages. It's maintained by the Wikimedia Foundation and used
by a number of production Wikimedia services, so you can rely on it.
I don't believe there are any prepared dumps of this HTML, but you should
be able to iterate through the RESTBase API (a rough sketch of this follows
the list below), as long as you follow the usage rules from the API
documentation:
- *Limit your clients to no more than 200 requests/s to this API. Each
API endpoint's documentation may detail more specific usage limits.*
- *Set a unique User-Agent or Api-User-Agent header that allows us to
contact you quickly. Email addresses or URLs of contact pages work well.*
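
For example, here is a minimal Python sketch of how one might iterate over
a list of article titles and pull down their HTML; it assumes the REST
API's /page/html/{title} endpoint and a hypothetical titles.txt file with
one article title per line:

import time
import urllib.parse

import requests

# Identify yourself per the usage rules; the address here is a placeholder.
HEADERS = {"User-Agent": "html-dump-research/0.1 (contact: you@example.org)"}
BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"

def fetch_html(title):
    """Fetch the rendered HTML of one article from the REST API."""
    url = BASE + urllib.parse.quote(title.replace(" ", "_"), safe="")
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # titles.txt is a hypothetical input file with one article title per
    # line (e.g. taken from the all-titles dump).
    with open("titles.txt") as f:
        for line in f:
            title = line.strip()
            if not title:
                continue
            html = fetch_html(title)
            with open(title.replace("/", "_") + ".html", "w", encoding="utf-8") as out:
                out.write(html)
            # A short pause keeps a single-threaded client far below the
            # 200 requests/s limit.
            time.sleep(0.05)

At even a modest request rate (say 10-20 requests per second) this works
out to a few days for the full set of English Wikipedia articles, rather
than the months you were estimating for crawling.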
On Thu, 3 May 2018 at 14:26, Aidan Hogan <ahogan(a)dcc.uchile.cl> wrote:
Hi Fae,
On 03-05-2018 16:18, Fæ wrote:
On 3 May 2018 at 19:54, Aidan Hogan <ahogan(a)dcc.uchile.cl> wrote:
> Hi all,
>
> I am wondering what is the fastest/best way to get a local dump of
> English Wikipedia in HTML? We are looking just for the current versions
> (no edit history) of articles for the purposes of a research project.
>
> We have been exploring using bliki [1] to do the conversion of the
> source markup in the Wikipedia dumps to HTML, but the latest version
> seems to take on average several seconds per article (including after
> the most common templates have been downloaded and stored locally).
> This means it would take several months to convert the dump.
>
> We also considered using Nutch to crawl Wikipedia, but with a
> reasonable crawl delay (5 seconds) it would take several months to get
> a copy of every article in HTML (or at least the "reachable" ones).
>
> Hence we are a bit stuck right now and not sure how to proceed. Any
> help, pointers or advice would be greatly appreciated!!
>
> Best,
> Aidan
>
> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?
Ref
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_…
Thanks for the pointer! We are currently attempting to do something like
that with bliki. The issue is that we are interested in the
semi-structured HTML elements (like lists, tables, etc.) which are often
generated through external templates with complex structures. Often from
the invocation of a template in an article, we cannot even tell if it
will generate a table, a list, a box, etc. E.g., it might say "Weather
box" in the markup, which gets converted to a table.
Although bliki can help us to interpret and expand those templates, each
page takes quite a long time, meaning months of computation time to get the
semi-structured data we want from the dump. Due to these templates, we
have not had much success yet with this route of taking the XML dump and
converting it to HTML (or even parsing it directly); hence we're still
looking for other options. :)
Cheers,
Aidan
--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation