On Thu, Dec 19, 2019 at 11:16 PM Aidan Hogan <aidhog@gmail.com> wrote:
> - @Lydia, good point! I was thinking that filtering by wikilinks will
> just drop some more obscure nodes (like Q51366847 for example), but had
> not considered that there are some more general "concepts" that do not
> have a corresponding Wikipedia article. All the same, in a lot of the
> research we use Wikidata for, we are not particularly interested in one
> thing or another, but more interested in facilitating what other people
> are interested in. Examples would be query performance, finding paths,
> versioning, finding references, etc. But point taken! Maybe there is a
> way to identify "general entities" that do not have wikilinks, but do
> have a high degree or centrality, for example? Would a degree-based or
> centrality-based filter be possible in something like WDumper (perhaps
> it goes beyond the original purpose; certainly it does not seem trivial
> in terms of resources used)? Would it be a good idea?
I think it's definitely worth exploring, but I fear it needs someone to
actually sit down, collect the different dump use cases, and talk
to people to figure out which part of the data they need. Based on
that we could identify common patterns.
Yeah, there are a bunch of quite varied motivations for subsets. I have found the topic of Wikidata subsetting and data dumps coming up again and again, most recently in a life science/bioinformatics setting, which is how we ended up collecting raw materials in the doc already shared here,
The work Adam wrote up at
(I think this is something
that needs to be done but unfortunately can't dedicate time to it in
the foreseeable future. https://phabricator.wikimedia.org/T46581 is a
good place for people who want to help think it through.
That is also a fine place to record things! I don’t mean to fork the discussion. Maybe we could have a call for interested parties in the new year?
Dan
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit
by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.