Hi Adam,
On 2020-07-13 13:41, Adam Sanchez wrote:
Hi,
I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.
select ?s ?p ?o {
  ?s ?p ?o .
  filter (?s = ?param)
}

select ?s ?p ?o {
  ?s ?p ?o .
  filter (?o = ?param)
}
If I use a Java ThreadPoolExecutor, it takes 6 hours.
How can I speed up the query processing even more?
Perhaps I am a bit late to respond.
It's not really clear to me what you are aiming for, but if this is a
once-off task, I would recommend downloading the dump in Turtle or
N-Triples, loading your two million parameters in memory in a sorted or
hashed data structure in the programming language of your choice (this
should take considerably less than 1 GB of memory assuming typical
constants), using a streaming RDF parser for that language, and, for
each subject/object, checking whether it's in your in-memory set. This
is about as good as you can get for once-off batch processing.
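For illustration, here's a minimal sketch of that approach in Python. It assumes a well-formed N-Triples dump and uses a crude whitespace split rather than a real parser (a production run should use a proper streaming parser, e.g. rdflib's N-Triples parser); the IRIs and parameter set below are made up:

```python
# Sketch: stream an N-Triples dump and keep only triples whose subject
# or object is in an in-memory set of target terms.
# Assumes one triple per line, terms separated by single spaces;
# a real run should use a proper streaming RDF parser instead.

def matching_triples(lines, params):
    """Yield (s, p, o) for every triple whose subject or object is in params."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # In N-Triples the subject and predicate never contain spaces,
        # so split off the first two terms; the rest is the object
        # (minus the trailing ' .').
        s, p, o = line.rstrip(' .').split(' ', 2)
        if s in params or o in params:
            yield s, p, o

# Toy data standing in for the Wikidata dump (made-up IRIs).
dump = [
    '<http://example.org/Q1> <http://example.org/p> <http://example.org/Q2> .',
    '<http://example.org/Q3> <http://example.org/p> "a literal" .',
]
params = {'<http://example.org/Q1>'}
print(list(matching_triples(dump, params)))
```

In a real run, `lines` would be the (decompressed) dump file opened in streaming mode, so memory use stays bounded by the parameter set, not the dump.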
If your idea is to index the data so you can do 2 million lookups in
"interactive time", your problem is not what software to use, it's what
hardware to use.
Traditional hard disks have a physical arm that takes maybe 5-10 ms to
move. Solid state disks are quite a bit better, but random reads still
take on the order of 0.1 ms. Multiply those seek times by 2 million and you have
a long wait (caching will help, as will multiple disks, but not by
nearly enough). You would need to get the data into main memory (RAM) to
have any chance of approaching interactive times, and even then you
will probably not get interactive runtimes without leveraging further
assumptions about the task (e.g., if you're only interested in Q-IDs,
you can use integers or bit vectors, etc.). In the most general case,
you would probably need to
pre-filter the data as much as you can, and also use as much compression
as you can (ideally with compact data structures) to get the data into
memory on one machine, or you might think about something like Redis
(in-memory key-value store) on lots of machines. Essentially, if your
goal is interactive times on millions of lookups, you very likely need
to look at options purely in RAM (unless you have thousands of disks
available at least). The good news is that 512GB(?) sounds like a lot of
space to store stuff in.
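To put rough numbers on the above (the latency figures are ballpark assumptions, not measurements), and to sketch the Q-ID bit-vector idea:

```python
# Back-of-envelope: 2 million random lookups at typical seek latencies.
# The per-seek figures are rough assumptions, not benchmarks.
lookups = 2_000_000
hdd_seek_s = 0.005      # ~5 ms per seek on a spinning disk
ssd_seek_s = 0.0001     # ~0.1 ms per random read on an SSD

print(f"HDD: {lookups * hdd_seek_s / 3600:.1f} hours")   # ~2.8 hours
print(f"SSD: {lookups * ssd_seek_s / 60:.1f} minutes")   # ~3.3 minutes

# If only Q-IDs matter, membership testing can be a plain bit vector:
# one bit per possible Q number, so 100 million Q-IDs fit in ~12 MB.
MAX_Q = 100_000_000
bits = bytearray(MAX_Q // 8 + 1)

def add(q):
    bits[q >> 3] |= 1 << (q & 7)

def contains(q):
    return bool(bits[q >> 3] & (1 << (q & 7)))

add(42)
print(contains(42), contains(43))  # True False
```

Caching and parallel disks shave these numbers down, but not by the orders of magnitude needed; an in-memory structure like the bit vector sidesteps seeks entirely.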
Best,
Aidan
I was thinking:
a) to implement a Virtuoso cluster to distribute the queries or
b) to load Wikidata in a Spark dataframe (since the Sansa framework is
very slow, I would use my own implementation) or
c) to load Wikidata in a Postgresql table and use Presto to distribute
the queries or
d) to load Wikidata in a PG-Strom table to use GPU parallelism.
What do you think? I am looking for ideas.
Any suggestion will be appreciated.
Best,
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata