On 31.08.2016 22:14, Sumit Asthana wrote:
> Hi,
> I've written code to scrape a Wikidata dump following the Wikidata
> Toolkit examples. In processItemDocument, I extract the target
> entityId of the property 'instanceof' for the current item. However,
> I cannot find a way to get the label of the target entity, given that
> I have the entityId but not the entityDocument. Help would be
> appreciated :)
When you process a dump, you don't have random access to the data of all
entities -- you just get to see them in order. Depending on your
situation, there are several ways to go forward:
(1) You can use the Wikidata Toolkit API support to query the labels
from Wikidata. This can be done in bulk at the end of the dump
processing (fewer requests, since you can ask for many labels at once),
or you can do it each time you need a label (more requests, slower, but
easiest to implement). In the latter case, you should probably cache
labels locally in a hashmap or similar to avoid repeated requests.
This solution works well if you have a small or medium amount of labels.
Otherwise, the API requests will take too long to be practical.
Moreover, this solution will give you *current* labels from Wikidata. If
you want to make sure that the labels are at a similar revision as your
dump data (e.g., for historic analyses), then you must get them from the
dump, not from the Web.
(2) If you need large amounts of labels (in the order of millions), then
Web requests will not be practical. In this case, the easiest solution
is to process the dump twice: first you collect all qids that you care
about, second you gather all of their labels. Takes twice the time, but
is very scalable: it will work for all data sizes (provided you can
store the qids/labels while your program is running; if your local
memory is very limited, you will need to use a database for this, which
would slow things down further).
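A sketch of this two-pass idea, assuming the line-by-line JSON dump
format (one entity per line inside a JSON array); the helper names are
made up for illustration:

```python
# Sketch of approach (2): pass 1 collects the QIDs of interest
# (here: all P31 targets), pass 2 gathers their labels.
import gzip
import json

def entities(dump_path):
    """Yield one entity document per line of a gzipped JSON dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def collect_p31_targets(dump_path):
    wanted = set()
    for e in entities(dump_path):
        for claim in e.get("claims", {}).get("P31", []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                wanted.add(snak["datavalue"]["value"]["id"])
    return wanted

def collect_labels(dump_path, wanted, language="en"):
    labels = {}
    for e in entities(dump_path):
        if e.get("id") in wanted:
            lab = e.get("labels", {}).get(language)
            if lab:
                labels[e["id"]] = lab["value"]
    return labels
```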
(1+2) You can do a combined approach of (1) and (2): do a single pass;
remember all ids that you need labels for; if you find such an id in the
dump, store the label; for ids that you did not find (because they
occurred before you knew you needed them), do Web API queries after the
dump processing.
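The bookkeeping for this combined pass can be sketched as follows (pure
in-memory logic; dump reading and API calls work as in (1) and (2); the
class is a hypothetical helper, not part of Wikidata Toolkit):

```python
# Sketch of approach (1+2): one pass that fills in labels on the fly
# and records which ids are still missing afterwards.
class LabelCollector:
    def __init__(self, language="en"):
        self.language = language
        self.wanted = set()  # ids we still need a label for
        self.labels = {}     # id -> label found in the dump

    def need(self, qid):
        """Register that we want a label for qid."""
        if qid not in self.labels:
            self.wanted.add(qid)

    def see_entity(self, qid, labels):
        """Call for every entity in the dump; labels is its JSON 'labels' map."""
        if qid in self.wanted and self.language in labels:
            self.labels[qid] = labels[self.language]["value"]
            self.wanted.discard(qid)

    def missing(self):
        # ids that occurred before we knew we needed them:
        # resolve these via the Web API after the pass
        return self.wanted
```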
(3) If you need to run such analyses a lot, you could also build up a
label database locally: just write a small program that processes the
dump and stores the label(s) for each id in an on-disk database. Then
your actual program can get the labels from this database rather than
asking the API. If your label set is not so large, you can also store
the labels in a file that you load into memory when you need it. In
fact, for the case of "class" items (things with an incoming P31 link),
you can find such a file online:
http://tools.wmflabs.org/sqid/data/classes.json
It contains some more information, but also all English labels. This is
26M, so quite manageable.
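A sketch of such a label database using sqlite3 from the Python standard
library (the schema is just an illustrative assumption):

```python
# Sketch of approach (3): a local on-disk label database, built once
# while processing a dump and reused by later analyses.
import sqlite3

def open_label_db(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels (qid TEXT PRIMARY KEY, label TEXT)")
    return conn

def store_label(conn, qid, label):
    conn.execute("INSERT OR REPLACE INTO labels VALUES (?, ?)", (qid, label))

def get_label(conn, qid):
    row = conn.execute(
        "SELECT label FROM labels WHERE qid = ?", (qid,)).fetchone()
    return row[0] if row else None
```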
(4) If the items that you need labels for can be described easily (e.g.,
"all items with incoming P31 links") and are not too many (e.g., around
100000), then you can use SPARQL to get all labels at once. This may
(sometimes) time out if the result set is big. For example, the
following query gets you all P31-targets + the number of their direct
"best rank" instances:
SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) AS ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
Do *not* run this in your browser! There are too many results to
display. Use the query service API programmatically instead. This query
times out in up to half of the cases, but so far I could always get it
to return a complete result after a few attempts (you have to wait at
least 60 seconds before trying again).
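A sketch of calling the query service API programmatically with retries,
following the advice above (the endpoint URL is the public one; the
User-Agent string and retry policy are illustrative):

```python
# Sketch for approach (4): run a SPARQL query against the Wikidata
# query service, retrying after timeouts with a 60-second pause.
import json
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

def build_url(query):
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})

def run_query(query, attempts=5, wait=60):
    """Return the result bindings, retrying on failure."""
    req = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "label-example/0.1 (example contact)"})
    for i in range(attempts):
        try:
            with urllib.request.urlopen(req, timeout=120) as resp:
                return json.load(resp)["results"]["bindings"]
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(wait)  # wait at least 60s before trying again
```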
My applications now do a single pass in WDTK for only the "hard" things,
and then complete the output file using (4) with a Python script filling
in labels. If the Python script's query does not time out, then the
update of all labels takes less than a minute in this way. We had an
implementation of (1+2) at some point, but it was more complicated to
program and less efficient in this case. We did not have a reason to do
(3) since we process each dump only once, so the effort of creating a
label file does not pay off compared to (2).
Best regards,
Markus