Hello Sebastian,

> It comes down to these two options:
> a) create one scraper configuration for each template, which captures
> the intention of the creator and allows one to "correctly" scrape the
> data from all pages.
> b) load all necessary template definitions into MediaWiki and then do
> a transformation to HTML or XML and use XPath (or jQuery)
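
For option (b) you do not even need your own transformation step: the
api.php "parse" action already returns the fully expanded HTML, which
you can then query with XPath. A minimal sketch in Python, assuming
English Wikipedia and a placeholder page title:

    import requests
    from lxml import html

    # Ask the MediaWiki API for the rendered (template-expanded) HTML.
    API = "https://en.wikipedia.org/w/api.php"
    resp = requests.get(API, params={
        "action": "parse",
        "page": "Berlin",  # placeholder title
        "prop": "text",
        "format": "json",
    }, timeout=30)
    rendered = resp.json()["parse"]["text"]["*"]

    # Query the expanded DOM, e.g. for infobox rows.
    doc = html.fromstring(rendered)
    for row in doc.xpath('//table[contains(@class, "infobox")]//tr'):
        print(row.text_content().strip()[:80])

The caveat is that class names like "infobox" are conventions of the
wiki's templates, not of MediaWiki itself, so such XPath expressions end
up being almost as template-specific as option (a).
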
> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>> 2. the only application which (correctly!?) expands templates is
>> MediaWiki itself.
> (Thanks for your answer) I agree that only MediaWiki can "correctly"
> expand templates, as it can interpret the code on the template pages.
> The MediaWiki parser can transform wiki markup into XML and HTML. (I am
> currently not aware of any other transformation options.)

We are currently working on
http://www.mediawiki.org/wiki/Parsoid, a JS
parser that by now expands templates well and also supports a few parser
functions. We need to mark up template parameters for the visual editor
in any case, and plan to employ HTML5 microdata or RDFa for this purpose
(see
http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
intend to start implementing this sometime this month. Let us know if
you have feedback / ideas on the microdata or RDFa design.

> To ask more precisely:
> Is there a best practice for scraping data from Wikipedia? What is the
> smartest way to resolve templates for scraping? Am I missing a third
> option?

AFAIK most scraping is based on parsing the WikiText source. This gets
you the top-most template parameters, which might already be good enough
for many of your applications.
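
For example, here is a rough sketch using the third-party
mwparserfromhell library to pull the top-level template parameters out
of the raw wikitext (again with a placeholder page title):

    import requests
    import mwparserfromhell

    # Fetch the raw wikitext of a page.
    API = "https://en.wikipedia.org/w/api.php"
    resp = requests.get(API, params={
        "action": "query",
        "titles": "Berlin",  # placeholder title
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }, timeout=30)
    page = next(iter(resp.json()["query"]["pages"].values()))
    wikitext = page["revisions"][0]["*"]

    # List each top-level template with its parameters; templates
    # nested inside parameter values are left unexpanded.
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates(recursive=False):
        print(str(template.name).strip())
        for param in template.params:
            print("  ", str(param.name).strip(), "=",
                  str(param.value).strip()[:60])

This stays entirely on the wikitext side, so any data that only appears
after template expansion remains invisible to it.
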
We try to provide provenance information for expanded content in the
HTML DOM produced by Parsoid. Initially this will likely focus on
top-level arguments, as that is all we need for the editor. Extending
this to nested expansions should be quite straightforward, however, as
provenance is tracked per-token internally.
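
To give a feel for the consumer side, here is a purely illustrative
sketch; the itemtype URL and property names are invented for the
example and do not reflect the final microdata/RDFa design discussed
above:

    from lxml import html

    # Invented stand-in for annotated Parsoid output: a transclusion
    # marked up with HTML5 microdata. All names are hypothetical.
    rendered = """
    <table itemscope itemtype="http://example.org/mw/Transclusion"
           itemid="Template:Infobox_city">
      <tr><td itemprop="name">Berlin</td></tr>
      <tr><td itemprop="population">3500000</td></tr>
    </table>
    """

    doc = html.fromstring(rendered)
    for node in doc.xpath('//*[@itemscope]'):
        print("template:", node.get("itemid"))
        for prop in node.xpath('.//*[@itemprop]'):
            print("  ", prop.get("itemprop"), "=",
                  prop.text_content().strip())
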
Gabriel