Hi!
I will look into the size of the jnl file, but shouldn't that be
located where Blazegraph is running the SPARQL endpoint from, or is
this a special flavour? I was also thinking of looking into a GitLab
runner which could occasionally generate an HDT file from the TTL dump,
if our server can handle it. For this, an MD5 sum file would be
preferable, or should a timestamp be sufficient?
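A minimal sketch of the checksum idea: publish an MD5 file next to the dump, so the runner can compare it against the last build and skip regeneration when nothing changed. Filenames here are hypothetical examples, not actual dump names.

```python
import hashlib
from pathlib import Path

def md5_of(path):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dump_changed(dump_path, md5_path):
    """True if the dump's MD5 differs from the recorded one (or none exists)."""
    md5_file = Path(md5_path)
    current = md5_of(dump_path)
    if not md5_file.exists() or md5_file.read_text().split()[0] != current:
        md5_file.write_text(current + "  " + str(dump_path) + "\n")
        return True
    return False

# Example with a stand-in dump file:
Path("example-dump.ttl").write_text("<s> <p> <o> .\n")
print(dump_changed("example-dump.ttl", "example-dump.ttl.md5"))  # first run: True
print(dump_changed("example-dump.ttl", "example-dump.ttl.md5"))  # unchanged: False
```

A timestamp alone would also work, but a checksum additionally catches re-uploaded dumps whose content did not actually change.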
Publishing the jnl file for Blazegraph may not be as useful as one would
think, because a jnl file is specific to a particular vocabulary and
certain other settings - i.e., unless you run the same WDQS code (which
customizes some of these) of the same version, you won't be able to use
the same file. Of course, since the WDQS code is open source, that may
be good enough, so in general publishing such a file may be possible.
Currently, it's about 300 GB uncompressed; no idea how much compressed.
Loading it takes a couple of days on a reasonably powerful machine, more
on labs ones (I haven't tried to load the full dump on labs for a while,
since labs VMs are too weak for that).
In general, I'd say it takes about 100 MB per million triples: less if
the triples reuse URIs, probably more if they contain a ton of text
data.
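The rule of thumb above can be turned into a quick back-of-the-envelope estimate (the 100 MB/million figure is the rough average stated above, not a guarantee):

```python
def jnl_size_gb(triples, mb_per_million=100):
    """Estimate Blazegraph .jnl size in GB from a triple count,
    using the rough ~100 MB per million triples rule of thumb."""
    return triples / 1_000_000 * mb_per_million / 1000

# A graph of ~3 billion triples lands near the ~300 GB figure
# mentioned above for the current journal.
print(jnl_size_gb(3_000_000_000))  # → 300.0
```

Actual size will drift with URI reuse and the amount of literal text, so treat this as an order-of-magnitude check, not a capacity plan.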
--
Stas Malyshev
smalyshev(a)wikimedia.org