On Fri, Nov 5, 2021 at 5:55 AM David Causse <dcausse(a)wikimedia.org> wrote:
Hi Thad,
I looked at this query and I have nothing to add to what was suggested
already to make it run faster.
I think the main issue is the size of the intermediate results that have
to have the language filter applied. Sadly, almost every time a FILTER
is used on a string literal, Blazegraph may have to fetch its
representation from its lexicon, which incurs a huge slowdown.
Regarding indices and ordering, I believe the right indices are being used;
otherwise the query would certainly time out. I doubt it can filter all
English labels before joining them to the property labels.
The criterion ?prop wdt:P31/wdt:P279* wd:Q18616576 does indeed seem
useless to me and is pulling a couple of false positives[1] into the join
(totally harmless regarding query perf, but they should perhaps be cleaned
up in Wikidata?).
So filtering & fetching the textual data is indeed what makes this query
slow. I tried various combinations but could not come up with reasonable &
stable sub-second response times. Fetching the textual data (possibly
lazily) from another service might help, but this would certainly mean a
substantial rewrite of the client relying on this query.
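To illustrate the lazy-fetch idea, here is a minimal Python sketch. It
assumes (this is an assumption, not something decided in this thread) that
the labels would come from the MediaWiki wbgetentities API on
www.wikidata.org, which accepts up to 50 ids per request for regular
clients. The helper names are hypothetical, and the sketch only builds the
request URLs; it performs no network calls.

```python
from urllib.parse import urlencode

# Assumption: labels are fetched from the MediaWiki API instead of WDQS.
API_URL = "https://www.wikidata.org/w/api.php"
BATCH_SIZE = 50  # wbgetentities limit on ids per request for regular clients


def chunked(ids, size=BATCH_SIZE):
    """Yield successive batches of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def label_request_urls(property_ids, language="en"):
    """Build one wbgetentities URL per batch of property ids (hypothetical helper)."""
    urls = []
    for batch in chunked(list(property_ids)):
        params = {
            "action": "wbgetentities",
            "ids": "|".join(batch),
            "props": "labels",
            "languages": language,
            "format": "json",
        }
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls
```

A client could then request only the batches it needs to display, rather
than asking the query service to filter and return every label up front.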
Caching is definitely going to help, especially if this data is not subject
to rapid/frequent changes. The WDQS infrastructure has a caching layer, but
retention might not be long enough to be useful for this particular tool.
The JSON output does indeed seem quite big (almost 5 MB); while not
enormous, it is still sizable, and if this data is relatively stable there
might be value in refreshing it deliberately (daily, as you suggest) and
making it available on static storage.
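A daily refresh onto static storage could be sketched roughly as below.
This is only an illustration under stated assumptions: the snapshot is a
local JSON file, `fetch` is a hypothetical callable that runs the query and
returns the decoded result, and the 24-hour limit mirrors the daily refresh
discussed in this thread.

```python
import json
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # assumption: refresh once a day


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the snapshot is missing or older than `max_age` seconds."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    return now - path.stat().st_mtime > max_age


def refresh_snapshot(path, fetch, max_age=MAX_AGE_SECONDS):
    """Rewrite the snapshot when it is stale; return True if it was refreshed."""
    path = Path(path)
    if not is_stale(path, max_age):
        return False
    data = fetch()  # hypothetical: runs the slow query and returns its result
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(data), encoding="utf-8")
    tmp.replace(path)  # swap in the new file so readers never see a partial write
    return True
```

Run from a scheduled job, this keeps the slow query off the request path:
the tool reads the static file and only the refresh job ever waits on the
query service.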
Another note about response times: you may see varying response times from
the query service, and the reason might be one of the following:
- it's cached on the query service caching layer (generally sub-100 ms
response time)
- the server the query hits is heavily loaded
- the server the query hits is an older generation (we have 2 different
kinds of hardware setup in the cluster at the moment, which might explain
some of the variance you see).
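To get a feel for that variance from the client side, a small timing
helper along these lines might be useful. It is a sketch only: `run_query`
is a hypothetical callable that issues the request, and the 0.1 s threshold
simply reuses the "sub 100ms" figure above as a rough hint that a response
came from the cache, not a guarantee.

```python
import statistics
import time

CACHE_HIT_THRESHOLD = 0.1  # seconds; rough hint taken from the figure above


def time_calls(run_query, repeats=5):
    """Call `run_query` several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations, threshold=CACHE_HIT_THRESHOLD):
    """Summarize the spread and count responses fast enough to be cache hits."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "likely_cached": sum(1 for d in durations if d < threshold),
    }
```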
Hope it helps a bit,
Regards,
David.
[1] https://w.wiki/4Lae
On Wed, Nov 3, 2021 at 11:39 PM Thad Guidry <thadguidry(a)gmail.com> wrote:
Thanks Kingsley, Thomas, Jeff,
From what I see, the live query is never sub-second, and that's likely
because of 2 things:
1. indexing not prioritizing this kind of query and aligning with it (David
Causse might know whether that could be changed); essentially it's metadata
about Wikidata (its available properties).
2. it's 2.2 MB of data
I think that Yi Liu's Wikidata Property Explorer service might then want
to cache the results for 24 hours instead, for the best of both worlds.
To be fair, the raw amount of data requested seems to be approximately
2.2 MB, and so it should probably be cached locally by his tool for some
fixed period (like 24 hours).
Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/
_______________________________________________
Wikidata mailing list -- wikidata(a)lists.wikimedia.org
To unsubscribe send an email to wikidata-leave(a)lists.wikimedia.org