On 03/04/2012 02:09 PM, Gabriel Wicke wrote:
Hello Sebastian,
It comes down to these two options:
a) create one scraper configuration for each template, which captures
the intention of the creator and makes it possible to "correctly" scrape
the data from all pages.
b) load all necessary template definitions into MediaWiki, then do a
transformation to HTML or XML and use XPath (or jQuery)
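As a minimal sketch of option (b), assuming the expanded HTML has already been obtained (for example via MediaWiki's parse API), the XPath-style extraction could look like this. The HTML fragment, the "infobox" class, and the field names below are invented for illustration:

```python
# Sketch of option (b): extract template data from already-expanded HTML.
# The fragment, the "infobox" class, and the field names are illustrative
# assumptions, not actual MediaWiki output.
import xml.etree.ElementTree as ET

expanded_html = """
<div>
  <table class="infobox">
    <tr><th>Population</th><td>520000</td></tr>
    <tr><th>Country</th><td>Germany</td></tr>
  </table>
</div>
"""

root = ET.fromstring(expanded_html)
data = {}
# ElementTree supports a limited XPath subset; this selects each
# infobox row and pairs its header cell with its value cell.
for row in root.findall(".//table[@class='infobox']/tr"):
    header, value = row.find("th"), row.find("td")
    if header is not None and value is not None:
        data[header.text] = value.text
```

On real pages one would fetch and tidy the HTML first; a forgiving HTML parser (e.g. lxml) would be more robust than ElementTree against actual MediaWiki output.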
On 01/12/2012 03:38 PM, Oren Bochman wrote:
2. the only application which (correctly!?)
expands templates is
MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can
"correctly"
expand templates, as it can interpret the code on the template pages.
The MediaWiki parser can transform Wiki Markup into XML and HTML. (I am
currently not aware of any other transformation options.)
We are currently working on
http://www.mediawiki.org/wiki/Parsoid, a
JS parser that by now expands templates well and also supports a few
parser functions. We need to mark up template parameters for the
visual editor in any case, and plan to employ HTML5 microdata or RDFa
for this purpose (see
http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
intend to start implementing this sometime this month. Let us know if
you have feedback / ideas on the microdata or RDFa design.
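To illustrate what consuming such output might look like: if template parameters were exposed as HTML5 microdata (itemscope/itemprop attributes), a scraper could read them back roughly as below. The fragment, item type, and property names are guesses for illustration, not the actual Parsoid design:

```python
# Hypothetical sketch: reading template parameters back out of HTML5
# microdata attributes. The fragment, the itemtype URL, and the
# property names are assumptions, not actual Parsoid output.
import xml.etree.ElementTree as ET

fragment = """
<div itemscope="itemscope" itemtype="http://example.org/Template:Infobox_city">
  <span itemprop="name">Leipzig</span>
  <span itemprop="population">520000</span>
</div>
"""

root = ET.fromstring(fragment)
# Collect every element carrying an itemprop attribute into a dict
# of parameter name -> text value.
params = {el.get("itemprop"): el.text
          for el in root.iter() if el.get("itemprop")}
```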
Awesome!! I forwarded it
to DBpedia developers. I think the Parsoid
project might interest some of our people. How is it possible to join?
Or is it Wikimedia-internal development? Is there a Parsoid mailing list?
Can JS handle this? I read somewhere that it was several orders of
magnitude slower than other languages... Maybe this is not true for Node.js.
All the data in our mappings wiki was created to "mark up" Wikipedia
template parameters. So please try to reuse it. I think there are almost
200 active users in
http://mappings.dbpedia.org/ who have added extra
parsing information to thousands of templates in Wikipedia across 20
languages. You can download and reuse it or we can also add your
requirements to it.
All the best,
Sebastian
To ask more precisely:
Is there a best practice for scraping data from Wikipedia? What is the
smartest way to resolve templates for scraping? Is there a third option
I am not seeing?
AFAIK most scraping is based on parsing the WikiText source. This gets
you the top-most template parameters, which might already be good
enough for many of your applications.
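As a minimal sketch of this wikitext-based approach, the following pulls the top-level parameters out of a single template call by tracking brace depth, so nested templates inside values stay verbatim. The template name and parameters are invented for illustration, and only named (key=value) parameters are handled:

```python
# Sketch: extract top-level parameters from one wikitext template call.
# Only top-level "|" and "=" are split on; nested {{...}} and [[...]]
# inside values are kept verbatim. Handles named parameters only.
def top_level_params(wikitext):
    inner = wikitext.strip()
    assert inner.startswith("{{") and inner.endswith("}}")
    inner = inner[2:-2]
    parts, depth, current = [], 0, []
    for ch in inner:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    name, params = parts[0].strip(), {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return name, params

# Invented example call with a nested parser function in one value:
name, params = top_level_params(
    "{{Infobox city|name=Leipzig|population={{formatnum:520000}}}}"
)
```

Here `name` is "Infobox city" and `params["population"]` stays as the unexpanded "{{formatnum:520000}}", which illustrates the limitation Gabriel mentions: only the top-most parameters come out, not fully expanded values.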
We try to provide provenance information for expanded content in the
HTML DOM produced by Parsoid. Initially this will likely focus on
top-level arguments, as that is all we need for the editor. Extending
this to nested expansions should be quite straightforward, however, as
provenance is tracked per token internally.
Gabriel
_______________________________________________
Wikitext-l mailing list
Wikitext-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects:
http://nlp2rdf.org ,
http://dbpedia.org
Homepage:
http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group:
http://aksw.org