Hi Adam,
On 2020-07-13 13:41, Adam Sanchez wrote:
Hi,
I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.
select ?s ?p ?o {
  ?s ?p ?o .
  filter (?s = ?param)
}

select ?s ?p ?o {
  ?s ?p ?o .
  filter (?o = ?param)
}
If I use a Java ThreadPoolExecutor, it takes 6 hours.
How can I speed up the query processing even more?
Perhaps I am a bit late to respond.
It's not really clear to me what you are aiming for, but if this is a
once-off task, I would recommend downloading the dump in Turtle or
N-Triples, loading your two million parameters in memory in a sorted or
hashed data structure in the programming language of your choice (this
should take considerably less than 1 GB of memory assuming typical
constants), using a streaming RDF parser for that language, and, for
each subject/object, checking whether it's in your in-memory set. This
is about as good as you can get for once-off batch processing.
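For illustration, here's a minimal sketch of that approach in Python. It assumes a well-formed N-Triples dump and uses a crude whitespace split rather than a real parser (a production run should use a proper streaming parser, e.g. rdflib's N-Triples parser); the IRIs and parameter set below are made up:

```python
# Sketch: stream an N-Triples dump and keep only triples whose subject
# or object is in an in-memory set of target terms.
# Assumes one triple per line, terms separated by single spaces;
# a real run should use a proper streaming RDF parser instead.

def matching_triples(lines, params):
    """Yield (s, p, o) for every triple whose subject or object is in params."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # In N-Triples the subject and predicate never contain spaces,
        # so split off the first two terms; the rest is the object
        # (minus the trailing ' .').
        s, p, o = line.rstrip(' .').split(' ', 2)
        if s in params or o in params:
            yield s, p, o

# Toy data standing in for the Wikidata dump (made-up IRIs).
dump = [
    '<http://example.org/Q1> <http://example.org/p> <http://example.org/Q2> .',
    '<http://example.org/Q3> <http://example.org/p> "a literal" .',
]
params = {'<http://example.org/Q1>'}
print(list(matching_triples(dump, params)))
```

In a real run, `lines` would be the (decompressed) dump file opened in streaming mode, so memory use stays bounded by the parameter set, not the dump.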
If your idea is to index the data so you can do 2 million lookups in
"interactive time", your problem is not what software to use, it's what
hardware to use.
Traditional hard disks have a physical arm that takes maybe 5-10 ms to
move. Solid state disks are quite a bit better, but random reads still
take on the order of 0.1 ms. Multiply those seek times by 2 million and you have
a long wait (caching will help, as will multiple disks, but not by
nearly enough). You would need to get the data into main memory (RAM) to
have any chance of approaching interactive times, and even then you
will probably not get interactive runtimes without leveraging further
assumptions about the task (e.g., if you're only interested in Q-IDs,
you can use integers or bit vectors, etc.). In the most general case,
you would probably need to
pre-filter the data as much as you can, and also use as much compression
as you can (ideally with compact data structures) to get the data into
memory on one machine, or you might think about something like Redis
(in-memory key-value store) on lots of machines. Essentially, if your
goal is interactive times on millions of lookups, you very likely need
to look at options purely in RAM (unless you have thousands of disks
available at least). The good news is that 512GB(?) sounds like a lot of
space to store stuff in.
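To put rough numbers on the above (the latency figures are ballpark assumptions, not measurements), and to sketch the Q-ID bit-vector idea:

```python
# Back-of-envelope: 2 million random lookups at typical seek latencies.
# The per-seek figures are rough assumptions, not benchmarks.
lookups = 2_000_000
hdd_seek_s = 0.005      # ~5 ms per seek on a spinning disk
ssd_seek_s = 0.0001     # ~0.1 ms per random read on an SSD

print(f"HDD: {lookups * hdd_seek_s / 3600:.1f} hours")   # ~2.8 hours
print(f"SSD: {lookups * ssd_seek_s / 60:.1f} minutes")   # ~3.3 minutes

# If only Q-IDs matter, membership testing can be a plain bit vector:
# one bit per possible Q number, so 100 million Q-IDs fit in ~12 MB.
MAX_Q = 100_000_000
bits = bytearray(MAX_Q // 8 + 1)

def add(q):
    bits[q >> 3] |= 1 << (q & 7)

def contains(q):
    return bool(bits[q >> 3] & (1 << (q & 7)))

add(42)
print(contains(42), contains(43))  # True False
```

Caching and parallel disks shave these numbers down, but not by the orders of magnitude needed; an in-memory structure like the bit vector sidesteps seeks entirely.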
Best,
Aidan
I was thinking:
a) to implement a Virtuoso cluster to distribute the queries or
b) to load Wikidata in a Spark dataframe (since the Sansa framework is
very slow, I would use my own implementation) or
c) to load Wikidata in a Postgresql table and use Presto to distribute
the queries or
d) to load Wikidata in a PG-Strom table to use GPU parallelism.
What do you think? I am looking for ideas.
Any suggestion will be appreciated.
Best,
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata