On 26-08-2016 16:58, Stas Malyshev wrote:
> Hi!
>
>> I think in terms of the dump, /replacing/ the Turtle dump with the
>> N-Triples dump would be a good option. (Not sure if that's what you were
>> suggesting?)
>
> No, I'm suggesting having both. Turtle is easier to comprehend and also
> more compact for download, etc. (though I didn't check how much is the
> difference - compressed it may not be that big).
I would argue that human readability is not so important for a dump? For
dereferenced documents, sure, but less so for a dump perhaps.
Also, I'd expect that when [G|B]Zipped the difference would not justify
having both (my guess is that the compressed N-Triples file should end up
within +25% of the size of the compressed Turtle file, but that's purely
a guess; obviously worth trying it to see!).
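Since the compression question is easy to test, here is a toy sketch of the comparison. The two snippets below are fabricated miniature serializations of the same labels (not real dump excerpts); on a real dump one would simply gzip both files and compare:

```python
import gzip

# Toy data: the same three label triples, once as Turtle (prefixes and
# a predicate list) and once as N-Triples (full IRIs on every line).
turtle = """\
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

wd:Q42 rdfs:label "Douglas Adams"@en ;
    rdfs:label "Douglas Adams"@de ;
    rdfs:label "Douglas Adams"@fr .
"""

ntriples = """\
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@de .
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@fr .
"""

def sizes(text):
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

nt_raw, nt_gz = sizes(ntriples)
ttl_raw, ttl_gz = sizes(turtle)
print(f"N-Triples: {nt_raw} B raw, {nt_gz} B gzipped")
print(f"Turtle:    {ttl_raw} B raw, {ttl_gz} B gzipped")
```

The repeated full IRIs that make N-Triples verbose are exactly what gzip compresses best, which is why the compressed gap should be much smaller than the raw one.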
But yep, I get both points.
>> [...] to have both:
>> existing tools expecting Turtle shouldn't have a problem
>> with N-Triples.
> That depends on whether these tools actually understand RDF - some might
> be more simplistic (with text-based formats, you can achieve a lot even
> with dumber tools). But that definitely might be an option too. I'm not
> sure if it's the best one but a possibility, so we may discuss it too.
I'd imagine that anyone processing Turtle would be using a full-fledged
Turtle parser? A dumb tool would have to be pretty smart to do anything
useful with the Turtle I think. And it would not seem wise to parse the
precise syntax of Turtle that way. But you never know [1]. :)
Of course if providing both is easy, then there's no reason not to
provide both.
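To illustrate what I mean about dumb tools: because every N-Triples statement is exactly one line, even a grep-style line-oriented script can do useful work, whereas Turtle's @prefix declarations, ';' predicate lists and multi-line literals break the one-line-per-triple assumption. A minimal sketch with made-up data:

```python
# Toy N-Triples input (illustrative, not real dump lines). A "dumb"
# line-based tool can extract all subject IRIs with a plain split;
# the same trick would silently fail on abbreviated Turtle.
ntriples = [
    '<http://www.wikidata.org/entity/Q42> <http://schema.org/name> "Douglas Adams"@en .',
    '<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "English writer"@en .',
]

# The subject is always the first whitespace-delimited token on the line.
subjects = {line.split(" ", 1)[0] for line in ntriples if line.strip()}
print(subjects)  # {'<http://www.wikidata.org/entity/Q42>'}
```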
>> (Also just to
>> put the idea out there of perhaps (also) having N-Quads
>> where the fourth element indicates the document from which the RDF graph
>> can be dereferenced. This can be useful for a tool that, e.g., just [...]
> What you mean by "document" - like entity? That may be a problem since
> some data - like references and values, or property definitions - can be
> used by more than one entity. So it's not that trivial to extract all
> data regarding one entity from the dump. You can do it via export, e.g.:
> http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't
> extract it, it just generates it.
If it's problematic, then for sure it can be skipped as a feature. I'm
mainly just floating the idea.
Perhaps to motivate the feature briefly: for a while we worked on a
search engine over RDF data ingested from the open Web. Since we were
ingesting data from the Web, treating it all as one giant RDF graph was
not an option: we needed to keep track of which RDF triples came from
which Web documents for a variety of reasons. This simple notion of
provenance was easy to maintain when we crawled the individual documents
ourselves, because we knew which documents we were taking triples from.
But we could rarely, if ever, use dumps, because they did not provide
such information.
In this view, Wikidata is a website publishing RDF like any other.
It is useful in such applications to know the online RDF documents in
which a triple can be found. The document could be the entity, or it
could be a physical location like:
http://www.wikidata.org/entity/Q13794921.ttl
Mainly it needs to be an IRI that can be resolved by HTTP to a document
containing the triple. Ideally the quads would also cover all triples in
that document. Even more ideally, the dumps would somehow cover all the
information that could be obtained from crawling the RDF documents on
Wikidata, including all HTTP redirects, and so forth.
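Concretely, producing such quads is mostly a matter of appending the document IRI as the fourth element of each statement. A minimal sketch, assuming (purely for illustration) that the dereferenceable document is the entity's .ttl export:

```python
# Hypothetical document IRI: the .ttl export the triple can be
# dereferenced from (assumed form, for illustration only).
doc_iri = "<http://www.wikidata.org/entity/Q13794921.ttl>"

# Toy N-Triples statements taken from that document (fabricated data).
triples = [
    '<http://www.wikidata.org/entity/Q13794921> <http://schema.org/name> "example"@en .',
]

# Turn '<s> <p> "o" .' into the N-Quads form '<s> <p> "o" <doc> .'
# by dropping the trailing '.' and appending the document IRI.
quads = [f"{t.rstrip()[:-1].rstrip()} {doc_iri} ." for t in triples]
print(quads[0])
```

A consumer can then recover per-document provenance by grouping on the fourth element, without ever re-crawling the site.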
At the same time, I understand this is not a priority and there's
probably no immediate need for N-Quads or for publishing redirects. The
need is rather abstract at the moment, so perhaps it's best left until
it becomes more concrete.
tl;dr:
N-Triples or N-Triples + Turtle sounds good.
N-Quads would be a bonus if easy to do.
Best,
Aidan
[1]
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xht…