This is a deep-seated semantic confusion going back to at least 2006 [1],
when the Protein Infobox had Entrez and OMIM gene IDs. Freebase naively
adopted it in its initial protein schema [2] in 2007 when it was importing
from those infoboxes. Although it made some progress in improving the
schema later, anything not aligned with how Wikipedians want to do things
is shoveling against the tide. It's also very difficult to manage
equivalences when Wikipedia articles are about multiple things, like the
protein/gene articles.
If you look at the recent merge of Reelin [3] you can see that it was done
by the same user who contributed substantially to the article back in 2006
[4], so, as the "owner" of that article, they clearly know what's
best. :-) It's going to be very difficult to get people to unlearn a
decade of habits.
Another issue is that, as soon as you start trying to split things out into
semantically clean pieces, you immediately run afoul of the notability
restrictions. Because human (and mouse) genes don't have their own
Wikipedia pages, they're clearly not notable, so they can't be added to
Wikidata.
This problem of chunking by notability (or lack thereof), length of text
article, relatedness, and other attributes rather than semantic
individuality is much more widespread than just proteins/genes. It also
affects things like pairs (or small sets) of people who aren't notable
enough to have an article on their own, articles which contain infoboxes
about people who aren't notable, so they got tacked onto a related article
to give them a home, etc.
The inverse problem exists as well where a single semantic topic is broken
up into multiple articles purely for reasons of length. Other types of
semantic mismatches include articles along precoordinated facets like
Transportation in New York City (or even History of Transportation in New
York City!) and list articles (* Filmography, * Discography, * Videography,
List of *). Of course, some lists, like the Fortune 500, make sense to
talk about as entities, but most Wikipedia lists are just mechanically
generated things for human browsing which don't really need a semantic
identifier. Freebase deleted most of this Wikipedia cruft.
Going back to Ben's original problem, one tool that Freebase used to help
manage incompatible type merges was a curated collection of sets of
incompatible types [5], which the merge tools used to warn users that
the merge they were proposing probably wasn't a good idea. People could
ignore the warning in the Freebase implementation, but Wikidata could make
it either a hard restriction or just a warning.
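
To make the idea concrete, here is a minimal sketch of that kind of check.
It is not the Freebase code: the type names, the contents of the sets, and
the function names (merge_conflicts, check_merge) are all made up for
illustration, and the "strict" flag stands in for the hard-restriction
option.

```python
# Hypothetical curated sets. Within each set, no two different types
# should end up on the same item. The contents are invented examples.
INCOMPATIBLE_TYPE_SETS = [
    frozenset({"gene", "protein"}),
    frozenset({"person", "location", "organization"}),
]


def merge_conflicts(types_a, types_b, incompatible_sets=INCOMPATIBLE_TYPE_SETS):
    """Return the sorted (type_a, type_b) pairs that make a merge suspicious."""
    conflicts = set()
    for group in incompatible_sets:
        # Only types that belong to the same curated set can conflict.
        for type_a in set(types_a) & group:
            for type_b in set(types_b) & group:
                if type_a != type_b:
                    conflicts.add((type_a, type_b))
    return sorted(conflicts)


def check_merge(types_a, types_b, strict=False):
    """Return "ok" or "warn"; with strict=True a conflict raises instead."""
    conflicts = merge_conflicts(types_a, types_b)
    if not conflicts:
        return "ok"
    if strict:
        # Hard restriction: refuse the merge outright.
        raise ValueError(f"incompatible types: {conflicts}")
    # Soft restriction: the caller shows a warning the user may override.
    return "warn"
```

With strict left at False this behaves like the overridable warning
described above; setting it to True turns the same check into a refusal.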
Tom
[1]
https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&old…
[2]
http://www.freebase.com/biology/protein/entrez_gene_id
[3]
https://www.wikidata.org/w/index.php?title=Q414043&type=revision&di…
[4]
https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=h…
[5]
http://www.freebase.com/dataworld/incompatible_types?instances=
On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <ben.mcgee.good(a)gmail.com>
wrote:
The Gene Wiki team is experiencing a problem that may suggest some areas
for improvement in the general wikidata experience.
When our project was getting started, we had some fairly long public
debates about how we should structure the data we wanted to load [1].
These resulted in a data model that, we think, remains pretty much true to
the semantics of the data, at the cost of distributing information about
closely related things (genes, proteins, orthologs) across multiple,
interlinked items. Now, as long as these semantic links between the
different item classes are maintained, this is working out great. However,
we are consistently seeing people merging items that our model needs to be
distinct. Most commonly, we see people merging items about genes with
items about the protein product of the gene (e.g. [2]). This happens
nearly every day - especially on items related to the more popular
Wikipedia articles. (More examples [3])
Merges like this, as well as other semantics-breaking edits, make it very
challenging to build downstream apps (like the wikipedia infobox) that
depend on having certain structures in place. My question to the list is
how to best protect the semantic models that span multiple entity types in
wikidata? Related to this, is there an opportunity for some consistent way
of explaining these structures to the community when they exist?
I guess the immediate solutions are to (1) write another bot that watches
for model-breaking edits and reverts them and (2) to create an article on
wikidata somewhere that succinctly explains the model and links back to the
discussions that went into its creation.
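
As a rough illustration of the detection half of option (1), the sketch
below checks a snapshot of items for two signs that a gene item and a
protein item have been merged. Everything in it is an assumption made for
the example: items are plain dictionaries rather than real Wikidata
entities, the keys "types", "encodes" and "encoded_by" and the item ids
are invented, and fetching items or reverting edits is left out entirely.

```python
def find_model_violations(items):
    """Return (item_id, problem) pairs for items that break the model.

    `items` maps a hypothetical item id to a dict with the keys
    "types", "encodes" and "encoded_by", each holding a set of strings.
    """
    violations = []
    for item_id, item in sorted(items.items()):
        types = item.get("types", set())
        # A merged gene/protein item usually ends up carrying both types.
        if {"gene", "protein"} <= types:
            violations.append((item_id, "typed as both gene and protein"))
        for target in sorted(item.get("encodes", set())):
            if target == item_id:
                # After a merge the gene -> protein link points at itself.
                violations.append((item_id, "encodes itself"))
            elif target not in items:
                violations.append((item_id, f"encodes missing item {target}"))
            elif item_id not in items[target].get("encoded_by", set()):
                violations.append((item_id, f"no reciprocal link from {target}"))
    return violations


# A healthy gene/protein pair, and what the pair looks like once merged.
healthy = {
    "G1": {"types": {"gene"}, "encodes": {"P1"}, "encoded_by": set()},
    "P1": {"types": {"protein"}, "encodes": set(), "encoded_by": {"G1"}},
}
merged = {
    "M1": {"types": {"gene", "protein"}, "encodes": {"M1"}, "encoded_by": {"M1"}},
}
```

Calling find_model_violations(healthy) returns an empty list, while the
merged snapshot is flagged on both counts. A real bot would still need to
decide whether to revert automatically or just report.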
It seems that anyone that works beyond a single entity type is going to
face the same kind of problems, so I'm posting this here in hopes that
generalizable patterns (and perhaps even supporting code) can be realized
by this community.
[1]
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#D…
[2]
https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
[3]
https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/m…