On 31.08.2016 22:14, Sumit Asthana wrote:
> Hi,
> I've written code to scrape a Wikidata dump following the Wikidata
> Toolkit examples. In processItemDocument, I extract the target
> entityId of the property 'instanceof' for the current item. However,
> I cannot find a way to get the label of the target entity, given that
> I have the entityId but not the entityDocument. Help would be
> appreciated :)
When you process a dump, you don't have random access to the data of all
entities -- you just get to see them in order. Depending on your
situation, there are several ways to go forward:
(1) You can use the Wikidata Toolkit API support to query the labels
from Wikidata. This can be done in bulk at the end of the dump
processing (fewer requests, since you can ask for many labels at once),
or you can do it each time you need a label (more requests, slower, but
easiest to implement). In the latter case, you should probably cache
labels locally in a hashmap or similar to avoid repeated requests.
This solution works well if you have a small or medium amount of labels.
Otherwise, the API requests will take too long to be practical.
Moreover, this solution will give you *current* labels from Wikidata. If
you want to make sure that the labels are at a similar revision as your
dump data (e.g., for historic analyses), then you must get them from the
dump, not from the Web.
(2) If you need large amounts of labels (in the order of millions), then
Web requests will not be practical. In this case, the easiest solution
is to process the dump twice: first you collect all qids that you care
about, second you gather all of their labels. Takes twice the time, but
is very scalable: it will work for all data sizes (provided you can
store the qids/labels while your program is running; if your local
memory is very limited, you will need to use a database for this, which
would slow things down further).
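A sketch of this two-pass idea, assuming the line-by-line JSON dump
format (one entity per line inside a JSON array); the helper names are
made up for illustration:

```python
# Sketch of approach (2): pass 1 collects the QIDs of interest
# (here: all P31 targets), pass 2 gathers their labels.
import gzip
import json

def entities(dump_path):
    """Yield one entity document per line of a gzipped JSON dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def collect_p31_targets(dump_path):
    wanted = set()
    for e in entities(dump_path):
        for claim in e.get("claims", {}).get("P31", []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                wanted.add(snak["datavalue"]["value"]["id"])
    return wanted

def collect_labels(dump_path, wanted, language="en"):
    labels = {}
    for e in entities(dump_path):
        if e.get("id") in wanted:
            lab = e.get("labels", {}).get(language)
            if lab:
                labels[e["id"]] = lab["value"]
    return labels
```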
(1+2) You can do a combined approach of (1) and (2): do a single pass;
remember all ids that you need labels for; if you find such an id in the
dump, store the label; for ids that you did not find (because they
occurred before you knew you needed them), do Web API queries after the
dump processing.
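The bookkeeping for this combined pass can be sketched as follows (pure
in-memory logic; dump reading and API calls work as in (1) and (2); the
class is a hypothetical helper, not part of Wikidata Toolkit):

```python
# Sketch of approach (1+2): one pass that fills in labels on the fly
# and records which ids are still missing afterwards.
class LabelCollector:
    def __init__(self, language="en"):
        self.language = language
        self.wanted = set()  # ids we still need a label for
        self.labels = {}     # id -> label found in the dump

    def need(self, qid):
        """Register that we want a label for qid."""
        if qid not in self.labels:
            self.wanted.add(qid)

    def see_entity(self, qid, labels):
        """Call for every entity in the dump; labels is its JSON 'labels' map."""
        if qid in self.wanted and self.language in labels:
            self.labels[qid] = labels[self.language]["value"]
            self.wanted.discard(qid)

    def missing(self):
        # ids that occurred before we knew we needed them:
        # resolve these via the Web API after the pass
        return self.wanted
```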
(3) If you need to run such analyses a lot, you could also build up a
label database locally: just write a small program that processes the
dump and stores the label(s) for each id in an on-disk database. Then
your actual program can get the labels from this database rather than
asking the API. If your label set is not so large, you can also store
the labels in a file that you load into memory when you need it. In
fact, for the case of "class" items (things with an incoming P31 link),
you can find such a file online:
http://tools.wmflabs.org/sqid/data/classes.json
It contains some more information, but also all English labels. This is
26M, so quite manageable.
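A sketch of such a label database using sqlite3 from the Python standard
library (the schema is just an illustrative assumption):

```python
# Sketch of approach (3): a local on-disk label database, built once
# while processing a dump and reused by later analyses.
import sqlite3

def open_label_db(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels (qid TEXT PRIMARY KEY, label TEXT)")
    return conn

def store_label(conn, qid, label):
    conn.execute("INSERT OR REPLACE INTO labels VALUES (?, ?)", (qid, label))

def get_label(conn, qid):
    row = conn.execute(
        "SELECT label FROM labels WHERE qid = ?", (qid,)).fetchone()
    return row[0] if row else None
```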
(4) If the items that you need labels for can be described easily (e.g.,
"all items with incoming P31 links") and are not too many (e.g., around
100000), then you can use SPARQL to get all labels at once. This may
(sometimes) time out if the result set is big. For example, the
following query gets you all P31-targets + the number of their direct
"best rank" instances:
SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) AS ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
Do *not* run this in your browser! There are too many results to
display. Use the query service API programmatically instead. This query
times out in up to half of the cases, but so far I could always get it
to return a complete result after a few attempts (you have to wait at
least 60 seconds before trying again).
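A sketch of calling the query service API programmatically with retries,
following the advice above (the endpoint URL is the public one; the
User-Agent string and retry policy are illustrative):

```python
# Sketch for approach (4): run a SPARQL query against the Wikidata
# query service, retrying after timeouts with a 60-second pause.
import json
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

def build_url(query):
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})

def run_query(query, attempts=5, wait=60):
    """Return the result bindings, retrying on failure."""
    req = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "label-example/0.1 (example contact)"})
    for i in range(attempts):
        try:
            with urllib.request.urlopen(req, timeout=120) as resp:
                return json.load(resp)["results"]["bindings"]
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(wait)  # wait at least 60s before trying again
```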
My applications now do a single pass in WDTK for only the "hard" things,
and then complete the output file using (4) with a Python script filling
in labels. If the Python script's query does not time out, then the
update of all labels takes less than a minute in this way. We had an
implementation of (1+2) at some point, but it was more complicated to
program and less efficient in this case. We did not have a reason to do
(3) since we process each dump only once, so the effort of creating a
label file does not pay off compared to (2).
Best regards,
Markus