Parsoid template expansion - Wikitext-l

Gabriel Wicke

14 Feb

6:12 a.m.

Nitin,

Parsoid actually has some support for basic template expansion as well, but it does not re-implement all extensions and parser functions from MediaWiki. Some of those, like Lua scripting, are used on many pages.

However, there is a production API for Parsoid HTML at http://parsoid-lb.eqiad.wikimedia.org/. See https://www.mediawiki.org/wiki/Parsoid/API for the API docs. This API can only sustain moderate traffic right now (please don't send more than 10 req/s), but a faster API using RESTBase will come online in the next few weeks.

Gabriel

On Fri, Feb 13, 2015 at 9:49 PM, Nitin Gupta <nitingupta910(a)gmail.com> wrote:

...

Hi,

I created a Node.js service to convert wikitext to HTML, with a frontend (written in Golang) which reads the Wikipedia dump and feeds wikitext to this service [1]. However, after doing all this I discovered that Parsoid needs to contact the Wikimedia servers for template expansion. Since I want to convert the entire Wikipedia dump to HTML, I do not want to keep hitting the Wikimedia servers with template expansion requests.

So, are there any plans to add support to Parsoid for doing this expansion offline? Once the Wikipedia dump is downloaded, I want the entire process of converting to HTML to be offline (of course, I don't need images).

[1] https://github.com/nitingupta910/wikiparser

Thanks,
Nitin

_______________________________________________
Wikitext-l mailing list
Wikitext-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
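For client writers who want to stay under the 10 req/s limit mentioned above, a minimal client-side throttle might look like the sketch below. It is only an illustration: the `Throttle` class, `fetch_all`, and the `fetch` callable are hypothetical names, not part of Parsoid or its API, and the actual HTTP request is left to the caller.

```python
import time


class Throttle:
    """Spaces calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate   # minimum spacing between two calls
        self.clock = clock           # injectable for testing
        self.sleep = sleep           # injectable for testing
        self.next_allowed = None     # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.interval


def fetch_all(titles, fetch, throttle):
    """Call fetch(title) for every title, pacing the calls with `throttle`.

    `fetch` is a caller-supplied function (hypothetical here) that performs
    the real request and returns the HTML for one title.
    """
    results = []
    for title in titles:
        throttle.wait()
        results.append(fetch(title))
    return results
```

With `Throttle(10)`, consecutive calls are spaced at least 0.1 s apart, which corresponds to the 10 req/s ceiling; a lower rate leaves headroom for other clients.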


Nitin Gupta

7:03 a.m.

Gabriel,

I hope HTML will be made available with the same frequency as the XML (wikitext) dumps; it would save me yet another attempt at writing a wikitext parser. Thanks.

For any API endpoints you provide, it would be helpful if you could also mention the expected maximum load from a client (req/s), so client writers can throttle accordingly.

Thanks,
Nitin

On Fri, Feb 13, 2015 at 10:13 PM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:

...

Oh, I should also mention that we are looking into publishing full HTML dumps once RESTBase is online.

Gabriel


Nitin Gupta

7:52 p.m.

On Sat, Feb 14, 2015 at 2:21 AM, Emmanuel Engelhart <kelson(a)kiwix.org> wrote:

...

On 14.02.2015 08:03, Nitin Gupta wrote:

I hope HTML would be made available with the same frequency as XML (wikitext) dumps; it would save me yet another attempt to make wikitext parser. Thanks. For any API points you provide, it would be helpful if you could also mention expected maximum load from a client (req/s), so client writers can throttle accordingly.

Kiwix already publishes full HTML snapshots packed in ZIM files (snapshots with and without pictures). We publish monthly updates for most Wikimedia projects and are working to do it for all the projects: http://www.kiwix.org

I somehow missed the Kiwix project, and an HTML dump is all I'm interested in (text only for now, since images can have copyright issues). Surprisingly, I could not find a link to a Kiwix ZIM dump without images, assuming the default offered for download has thumbnails.

...

The solution is coded in Node.js and uses the Parsoid API: https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/ We face recurring stability problems (with the 'http' module), which are impairing the rollout for all projects. If you are a Node.js expert, your help is really welcome.

I'm no http expert, but I see that you are downloading full article content from the Parsoid API. Have you considered the approach of just downloading the entire XML dump and then extracting articles out of that? You would still need to download images and do template expansion over http, but it still saves a lot. I have used this approach here: https://github.com/nitingupta910/wikiparser It only requires http transfers (by the Parsoid Node.js module) for template expansion.

Thanks,
Nitin


Nitin Gupta

15 Feb

8:36 a.m.

On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <kelson(a)kiwix.org> wrote:

...

On 14.02.2015 20:52, Nitin Gupta wrote:

I hope HTML would be made available with the same frequency as XML (wikitext) dumps; it would save me yet another attempt to make a wikitext parser. Thanks. For any API points you provide, it would be helpful if you could also mention the expected maximum load from a client (req/s), so client writers can throttle accordingly.

Kiwix already publishes full HTML snapshots packed in ZIM files (snapshots with and without pictures). We publish monthly updates for most Wikimedia projects and are working to do it for all the projects: http://www.kiwix.org

I somehow missed the Kiwix project, and an HTML dump is all I'm interested in (text only for now, since images can have copyright issues). Surprisingly, I could not find a link to a Kiwix ZIM dump without images, assuming the default offered for download has thumbnails.

Have a look at the "all_nopic" links: http://www.kiwix.org/wiki/Wikipedia_in_all_languages

The latest all_nopic dump for English Wikipedia I can see is from 2014-01. Anyway, as Gabriel mentioned, it looks like Wikimedia is going to generate and provide regularly updated HTML dumps for various projects directly, hopefully sometime soon, so maybe those can then be used as the gold source.

...

The solution is coded in Node.js and uses the Parsoid API: https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/ We face recurring stability problems (with the 'http' module), which are impairing the rollout for all projects. If you are a Node.js expert, your help is really welcome.

I'm no http expert, but I see that you are downloading full article content from the Parsoid API. Have you considered the approach of just downloading the entire XML dump and then extracting articles out of that? You would still need to download images and do template expansion over http, but it still saves a lot. I have used this approach here: https://github.com/nitingupta910/wikiparser

Parsing wiki code is a nightmare (if you want to reach MediaWiki quality of output and maintain that code base). It's far easier to write a scraper based on the Parsoid API.

Yes, it's a nightmare to parse wikitext markup, but in this case the frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply encapsulates wikitext in well-structured XML. So all the frontend does is extract the wikitext from the XML and pass it to the backend service (server.js, running locally), which uses the Parsoid module to parse this wikitext to HTML.

Thanks,
Nitin
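The extraction step described here (pull each page's wikitext out of the XML dump without parsing the wikitext itself) can be sketched as follows. This is not the code from wikipedia.go; it is an illustrative Python version that assumes a dump with one `<revision>` per `<page>` (as in the pages-articles dumps), and the function names are made up for the example. Handing each result to a local Parsoid service is left out.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    The wikitext is returned untouched; no wikitext parsing happens here.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's subtree


def iter_dump(path):
    """Open a bz2-compressed XML dump and stream its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Because `iterparse` streams the file, the whole dump never has to fit in memory; each `(title, wikitext)` pair can be passed to the conversion backend as soon as it is produced.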


Gabriel Wicke

20 Mar

8:59 p.m.

Hi Dmitrijs,

we are currently waiting for hardware to be allocated. We hope to have a first set of dumps 1-2 weeks from now, with the intention of providing dumps at regular intervals. See https://phabricator.wikimedia.org/T17017 and its dependencies for progress on this.

We are also considering which distribution format to use for the HTML dumps. One option is an LZMA-compressed SQLite database. Please weigh in on this at https://phabricator.wikimedia.org/T93396.

Thanks,
Gabriel

On Mon, Mar 16, 2015 at 3:29 AM, Dmitrijs Milajevs <dimazest(a)gmail.com> wrote:

...

Hi,

Is there any progress regarding HTML dumps? I'm not interested in HTML dumps as such, but I believe that HTML is a much nicer way of getting the raw text of articles out of a wiki dump. See this proof of concept [1].

However, what I believe would be very useful for the scientific community are syntactically parsed dumps of Wikipedia. Right now everyone uses different pipelines to parse Wikipedia, which are often undocumented, outdated and unreproducible.

At IWCS we are running a two-day hackathon [2], and I think that one useful project would be to come up with a documented and easily reproducible way of getting parsed versions of Wikipedia dumps. I've started some notes as part of NLTK corpus readers [3], but this might grow into a separate project.

So, I see an easily deployable pipeline of:

enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2

as a perfect project for the hackathon. Ideally, this should be picked up by someone to produce regular dumps (but I don't know who would be willing to invest the computational resources).

Do you have any ideas or suggestions that I should take into account? In case you are in London on April 11-12, you are welcome to take part in the hackathon.

[1] http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/ti…
[2] http://iwcs2015.github.io/hackathon.html
[3] http://iwcs2015.github.io/hackathon.html#nltk-corpus-readers

-- Dima
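As a rough illustration of the "raw text" stage of the pipeline Dima describes, the sketch below strips markup from HTML using only the Python standard library. It is not his proof of concept. The input layout (one HTML document per line in a bz2 file) and the names `TextExtractor`, `html_to_text` and `convert` are assumptions made for the example, and the later syntactic parsing stage is not covered.

```python
import bz2
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def convert(src_path, dst_path):
    """Assumed layout: one HTML document per line in, one text line out."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(html_to_text(line) + "\n")
```

Keeping each stage as a plain file-to-file function makes the pipeline easy to rerun and document, which is the reproducibility concern raised in the message.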
