Dear all,
I had a misconfigured mail client and did not receive any of your
answers in January. I concluded that the mailing list simply had no
subscribers. I really have to apologize for not replying to your answers.
Since we assumed that nobody had replied, we went ahead and developed a
generic, configurable scraper and used it on the English and German
Wiktionaries. The config files and data can be found here (it is part of
DBpedia): [1] [2] [3]. We hope that it is generic enough to be applied
to all language editions of Wiktionary and that it can also be used on
other MediaWikis (e.g. travelwiki.org).
Normally such a transformation is done by an Extract-Transform-Load (ETL)
process. The E (extract) step can also be considered a "select" or
"query" procedure, hence my initial question about a "Wiki Query
Language". If you have a good language for E, then T and L are easy ;)
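To make the ETL framing concrete, here is a minimal sketch (the section
pattern and function names are only illustrative, not our actual
framework code) of what E, T and L look like over raw wiki markup:

```python
# Minimal ETL sketch over wiki markup. The regex and the store are
# illustrative stand-ins, not the real DBpedia extraction framework.
import re

def extract(page_text):
    """E: 'select' level-4 sections (title, body) from raw wiki markup."""
    return re.findall(r"====\s*(\w+)\s*====\n(.*?)(?=\n====|\Z)",
                      page_text, re.S)

def transform(sections):
    """T: normalize the selected data into (key, value) records."""
    return [(title.lower(), body.strip()) for title, body in sections]

def load(records, store):
    """L: write the records into a simple in-memory key/value store."""
    for key, value in records:
        store.setdefault(key, []).append(value)

store = {}
page = "====Inflection====\n{{la-decl-1&2|su}}\n====Etymology====\nFrom ..."
load(transform(extract(page)), store)
print(store)
```

As the sketch shows, once E is expressive enough to select the right
spans, T and L really are the easy part.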
One of the main problems still unsolved is scraping information from
templates: to build an effective generic scraper, we would need to be
able to "interpret" templates correctly. Templates are a good way to
structure information, and they are easy to scrape (technically
speaking). The problem is rather that you would need one config file per
template to get "good" data. In Wikipedia, infoboxes can all be parsed
with the same algorithm, but in DBpedia we still have to create
so-called "mappings" to get good data: http://mappings.dbpedia.org/
Infoboxes are a special case, however, as they are all structured in a
similar way, so the "mapping solution" only works for infoboxes.
It comes down to these two options:
a) create one scraper configuration per template, which captures the
intention of the template's creator and allows us to "correctly" scrape
the data from all pages.
b) load all necessary template definitions into MediaWiki, transform the
result to HTML or XML, and use XPath (or jQuery).
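To illustrate option b), here is a sketch of querying rendered output
with XPath. The HTML snippet is a hypothetical stand-in for what
MediaWiki might render from an inflection template; I use Python's
stdlib ElementTree here, which only supports a limited XPath subset
(lxml would give full XPath):

```python
# Option b) sketch: once templates are expanded to HTML/XML, the data
# can be selected from the tree instead of from wiki markup. The snippet
# below is a made-up example of a rendered inflection table.
import xml.etree.ElementTree as ET

rendered = """
<table class="inflection-table">
  <tr><th>nominative</th><td>suus</td></tr>
  <tr><th>genitive</th><td>sui</td></tr>
</table>
"""

root = ET.fromstring(rendered)
# Limited XPath: select every <tr> descendant, then pair header and cell.
forms = {row.find("th").text: row.find("td").text
         for row in root.findall(".//tr")}
print(forms)  # {'nominative': 'suus', 'genitive': 'sui'}
```

The catch, of course, is that you now depend on the CSS classes and tag
structure of the rendered output instead of the template syntax.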
On 01/12/2012 03:38 PM, Oren Bochman wrote:
> 2. the only application which (correctly!?) expands templates is
> MediaWiki itself.
(Thanks for your answer!) I agree that only MediaWiki can "correctly"
expand templates, as it can interpret the code on the template pages.
The MediaWiki parser can transform wiki markup into XML and HTML. (I am
currently not aware of any other transformation options.)
On 01/12/2012 07:06 PM, Gabriel Wicke wrote:
> Rendered HTML obviously misses some of the information available in
> the wiki source, so you might have to rely on CSS class / tag pairs to
> identify template output.
(Thanks for your answer!) It misses some information, but on the other
hand it also gains some.
A good example is the inflection of the Latin word "suus" in Wiktionary:
http://en.wiktionary.org/wiki/suus#Latin
====Inflection====
{{la-decl-1&2|su}}
To ask more precisely:
Is there a best practice for scraping data from Wikipedia? What is the
smartest way to resolve templates for scraping? Am I missing a third
option?
On 01/12/2012 06:56 PM, Platonides wrote:
> I don't think so. I think the most similar approach used is applying
> regexes to the page. Which you may find too powerful/low-level.
Regex is effective, but has
its limits. We included it as a tool.
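To show both its effectiveness and its limits, here is a regex sketch
(illustrative, not our framework's actual pattern) that pulls simple
template invocations, like the {{la-decl-1&2|su}} example above, out of
raw wiki markup:

```python
# Regex sketch for extracting simple template invocations from wiki
# markup. It handles flat, non-nested calls; nested templates like
# {{a|{{b}}}} defeat a single regex, which is exactly its limit.
import re

markup = "====Inflection====\n{{la-decl-1&2|su}}"

calls = []
for m in re.finditer(r"\{\{([^{}|]+)\|?([^{}]*)\}\}", markup):
    name = m.group(1)                                   # template name
    params = m.group(2).split("|") if m.group(2) else []  # positional args
    calls.append((name, params))
print(calls)  # [('la-decl-1&2', ['su'])]
```

For nested or parser-function-heavy templates, a regex stops being
enough, which is where options a) and b) above come in.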
I hope this has not been TL;DR and thanks again for your answers.
All the best,
Sebastian
[1]
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f…
[2]
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f…
[3]
http://downloads.dbpedia.org/wiktionary/
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org, http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org