Hi all,
as we wrote in a previous email, we are currently applying for a
Wikimedia grant to use the DBpedia extraction software to synchronize
infoboxes across Wikipedias, as well as between Wikipedia and Wikidata.
During the discussion on the talk page
(
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE)
the concern was raised that we at DBpedia have too much of a
bird's-eye view on things, which is true: we are used to bulk
extraction and to working with a lot of data rather than with
individual records.
The main problem here is that, for us, the prototype first of all
shows that we have the data, which we could exploit in several ways;
for Wikipedians and Wikidata users, however, the process of using it
is the main focus. We assumed that an article-centric view would suit
Wikipedians best, i.e. you can directly compare one article's infobox
with the infoboxes of all other articles and with Wikidata. For
Wikidata, however, the article/entity-centric view does not seem
practical, and we would like feedback on this. The options for
GlobalFactSync are:
1. entity-centric view, as it is now: the same infobox across all
   Wikipedias and Wikidata for one article/entity
2. template-centric (this will not work, as there are no, or only very
   few, equivalent infoboxes across Wikipedias)
3. template-parameter-centric: this is the current focus of Harvest
   Templates, i.e. one parameter in one template in one language
   https://tools.wmflabs.org/pltools/harvesttemplates/
     * Note that one improvement DBpedia could make here is to
       contribute the mappings we already have from template parameter
       to the DBpedia ontology to Wikidata (see the sketch after this
       list)
     * Another is that we could save the logs and persist the mappings
       entered by users in order to sync continuously; at the moment
       it is a one-time import
4. multilingual-template-parameter-centric or Wikidata-property-centric,
   i.e. one parameter/one Wikidata property across multiple templates
   across multiple languages. This would supercharge HarvestTemplates,
   but since it is a power tool for syncing, it gets more complex and
   keeping an overview is difficult.
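To make the mapping chain in option 3 concrete: the DBpedia mappings
wiki maps template parameters (e.g. |birth_date= in person infoboxes)
to DBpedia ontology properties, and the ontology links many of these
to Wikidata properties via owl:equivalentProperty. A minimal sketch
against the public DBpedia SPARQL endpoint, using dbo:birthDate as an
example (it should return the linked Wikidata property, P569, if that
mapping is present in the loaded ontology):

# look up the Wikidata property declared equivalent to dbo:birthDate
curl -G 'https://dbpedia.org/sparql' \
  --data-urlencode 'query=PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?wd WHERE { <http://dbpedia.org/ontology/birthDate> owl:equivalentProperty ?wd }' \
  --data-urlencode 'format=text/csv'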
All feedback is welcome; we have also created a topic here:
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncR…
# Motivation, Wikidata adoption report
One goal of Wikidata is to support Wikipedia's infoboxes. We are now
doing monthly releases at DBpedia and can provide statistics about
Wikidata adoption, and missing adoption, in Wikipedia:
https://docs.google.com/spreadsheets/d/1_aNjgExJW_b0MvDSQs5iSXHYlwnZ8nU2zrQ…
In total, 584 million facts are still maintained in Wikipedia without
using Wikidata. Where these facts already exist in Wikidata, the same
fact is maintained in two or more places, multiplying maintenance work
(unless the fact is static).
Code used to extract:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/s…
Data:
http://downloads.dbpedia.org/repo/dev/generic-spark/infobox-properties/2018…
Stat generation:
echo -n "" > res.csv
for i in *.ttl.bz2 ; do
  # per row: language code (dump prefix and extension stripped), a tab, triple count
  echo -n "$i" | sed 's/infobox-properties-2018.11.01_//;s/.ttl.bz2/\t/' >> res.csv
  lbzip2 -dc "$i" | wc -l >> res.csv
done
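As a quick sanity check of the dump files themselves (the per-language
file name, with e.g. an _en suffix, is inferred from the sed pattern
above):

# print a few raw N-Triples statements from the English dump
lbzip2 -dc infobox-properties-2018.11.01_en.ttl.bz2 | head -n 3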
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects:
http://dbpedia.org,
http://nlp2rdf.org,
http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage:
http://aksw.org/SebastianHellmann
Research Group:
http://aksw.org