Hi all,
as we wrote in a previous email, we are currently applying for a
Wikimedia grant to use the DBpedia extraction software to synchronize
infoboxes across Wikipedias, as well as between Wikipedia and Wikidata.
During the discussion on the talk page
(
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE)
the concern was raised that we at DBpedia have too much of a
bird's-eye view on things, which is true: we are used to bulk
extraction and to working with a lot of data rather than with
individual records.
The main problem here is that, for us, the prototype first of all
shows that we have the data, which we could exploit in several ways;
for Wikipedians and Wikidata users, however, the process of using it
is the main focus. We assumed that an article-centric view would suit
Wikipedians best, i.e. you can directly compare one article's infobox
with the infoboxes of all other articles and with Wikidata. For
Wikidata, however, the article/entity-centric view does not seem
practical, and we would like feedback on this. The options for
GlobalFactSync are:
1. entity-centric view, as it is now: the same infobox across all
   Wikipedias and Wikidata for one article/entity
2. template-centric (this will not work, as there are no, or only very
   few, equivalent infoboxes across Wikipedias)
3. template-parameter-centric: this is the current focus of Harvest
   Templates, i.e. one parameter in one template in one language
   https://tools.wmflabs.org/pltools/harvesttemplates/
     * Note that one improvement DBpedia could make here is to
       contribute the mappings we already have from template parameter
       to the DBpedia ontology to Wikidata (see the sketch after this
       list)
     * Another is that we could save the logs and persist the mappings
       entered by users in order to sync continuously; at the moment
       it is a one-time import
4. multilingual-template-parameter-centric or Wikidata-property-centric,
   i.e. one parameter/one Wikidata property across multiple templates
   across multiple languages. This would supercharge HarvestTemplates,
   but since it is a power tool for syncing, it gets more complex and
   keeping an overview is difficult.
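To make the mapping chain in option 3 concrete: the DBpedia mappings
wiki maps template parameters (e.g. |birth_date= in person infoboxes)
to DBpedia ontology properties, and the ontology links many of these
to Wikidata properties via owl:equivalentProperty. A minimal sketch
against the public DBpedia SPARQL endpoint, using dbo:birthDate as an
example (it should return the linked Wikidata property, P569, if that
mapping is present in the loaded ontology):

# look up the Wikidata property declared equivalent to dbo:birthDate
curl -G 'https://dbpedia.org/sparql' \
  --data-urlencode 'query=PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?wd WHERE { <http://dbpedia.org/ontology/birthDate> owl:equivalentProperty ?wd }' \
  --data-urlencode 'format=text/csv'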
All feedback is welcome; we have also created a topic here:
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncR…
# Motivation, Wikidata adoption report
One goal of Wikidata is to support Wikipedia's infoboxes. We are now
doing monthly releases at DBpedia and can provide statistics about
Wikidata adoption, and missing adoption, in Wikipedia:
https://docs.google.com/spreadsheets/d/1_aNjgExJW_b0MvDSQs5iSXHYlwnZ8nU2zrQ…
In total, 584 million facts are still maintained in Wikipedia without
using Wikidata. Where these facts already exist in Wikidata, the same
fact is maintained in two or more places, multiplying maintenance work
(unless the fact is static).
Code used to extract:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/s…
Data:
http://downloads.dbpedia.org/repo/dev/generic-spark/infobox-properties/2018…
Stat generation:
echo -n "" > res.csv
for i in *.ttl.bz2 ; do
  # per row: language code (dump prefix and extension stripped), a tab, triple count
  echo -n "$i" | sed 's/infobox-properties-2018.11.01_//;s/.ttl.bz2/\t/' >> res.csv
  lbzip2 -dc "$i" | wc -l >> res.csv
done
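As a quick sanity check of the dump files themselves (the per-language
file name, with e.g. an _en suffix, is inferred from the sed pattern
above):

# print a few raw N-Triples statements from the English dump
lbzip2 -dc infobox-properties-2018.11.01_en.ttl.bz2 | head -n 3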
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects:
http://dbpedia.org,
http://nlp2rdf.org,
http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage:
http://aksw.org/SebastianHellmann
Research Group:
http://aksw.org