Hi Daniel,
I started working on the DBpedia release and just wanted to check
what's the current status of the Wikidata dumps. I saw that RDF data
and RDF URIs like
are already available. Cool! Do you think there will be RDF dumps soon, i.e. in
the next few weeks?
If not, could you guys prepare a dump of the sitelinks table, as you
suggested below? If it's not too much effort, it would be cool if you
could generate CSV or a similar simple format. We won't load the data
into a DB, we just extract it, so with an SQL dump we would have to
write a parser for SQL insert statements. CSV would be much simpler.
Thanks a lot for your help!
Christopher
On 4 May 2013 23:36, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
On 04.05.2013 19:13, Jona Christopher Sahnwaldt wrote:
We will produce a DBpedia release pretty soon, I don't think we can
wait for the "real" dumps. The inter-language links are an important
part of DBpedia, so we have to extract data from almost all Wikidata
items. I don't think it's sensible to make ~10 million calls to the
API to download the external JSON format, so we will have to use the
XML dumps and thus the internal format.
Oh, if it's just the language links, this isn't an issue: there's an additional
table for them in the database, and we'll soon be providing a separate dump of
that table at
http://dumps.wikimedia.org/wikidatawiki/
If it's not there when you need it, just ask us for a dump of the sitelinks
table (technically, wb_items_per_site), and we'll get you one.
But I think it's not a big
deal that it's not that stable: we parse the JSON into an AST anyway.
It just means that we will have to use a more abstract AST, which I
was planning to do anyway. As long as the semantics of the internal
format remain more or less the same - it will contain the labels,
the language links, the properties, etc. - it's no big deal if the
syntax changes, even if it's not JSON anymore.
Yes, if you want the labels and properties in addition to the links, you'll have
to do that for now. But I'm working on the "real" data dumps.
-- daniel