On Sat, 21 Dec 2019 at 17:25, Lydia Pintscher <Lydia.Pintscher@wikimedia.de> wrote:
On Thu, Dec 19, 2019 at 11:16 PM Aidan Hogan <aidhog@gmail.com> wrote:
> - @Lydia, good point! I was thinking that filtering by wikilinks will
> just drop some more obscure nodes (like Q51366847 for example), but had
> not considered that there are some more general "concepts" that do not
> have a corresponding Wikipedia article. All the same, in a lot of the
> research we use Wikidata for, we are not particularly interested in one
> thing or another, but more interested in facilitating what other people
> are interested in. Examples would be query performance, finding paths,
> versioning, finding references, etc. But point taken! Maybe there is a
> way to identify "general entities" that do not have wikilinks, but do
> have a high degree or centrality, for example? Would a degree-based or
> centrality-based filter be possible in something like WDumper (perhaps
> it goes beyond the original purpose; certainly it does not seem trivial
> in terms of resources used)? Would it be a good idea?

I think it's definitely worth exploring but I fear it needs someone to
actually sit down and collect the different dumps use-cases and talk
to people to figure out which part of the data they need. Based on
that we could identify common patterns.

Yeah, the motivations for subsets are quite varied. I have found the topic of Wikidata subsetting and data dumps coming up again and again, most recently in a life science/bioinformatics setting, which is how we ended up collecting raw materials in the doc already shared here:
https://docs.google.com/document/d/1MmrpEQ9O7xA6frNk6gceu_IbQrUiEYGI9vcQjDvTL9c
It has also come up in other domains. If people here care to drop use cases, thoughts and notes (*however scrappy*) into that doc, I will make a pass over it to try to pull together a more readable summary of the various motivations for subsetting.

The work Adam wrote up at 
https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/ is also very relevant...

(I think this is something
that needs to be done but unfortunately I can't dedicate time to it in
the foreseeable future. https://phabricator.wikimedia.org/T46581 is a
good place for people who want to help think it through.)

That is also a fine place to record things! I don’t mean to fork the discussion. Maybe we could have a call for interested parties in the new year?

Dan

Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

