[WikiEN-l] Types of categories

Roger Luethi collector at hellgate.ch
Sat Jun 3 21:10:14 UTC 2006


On Sat, 03 Jun 2006 19:54:27 +0200, Steve Bennett wrote:
> I'm probably not the only one who envisages all the wonderful things
> that could be done with this massive collection of information that is
> Wikipedia, *if only* we could do something clever with the categories.
> And then you realise that you can't really do anything clever because
> "category" has all sorts of different meanings to different people.

Agreed. Still: can you give some specific examples of wonderful things that
could be done but are not possible now? That would tell us what problem you
are trying to solve.

> So far I have identified four rough types of categories. I'll invent
> the notion a(X) to mean that article X is in category a. a(b(X)) means
> that a is a subcategory of b, and X is in b.

ITYM "b is a subcategory of a".

> Taxonomies: Tend to end in "s" and satisfy the rule that "If a(X) then
> X is an a") is a logical sentence. Tend to form strict hierarchies,
> where if a(X) and b(a), then it's perfectly natural and normal that
> b(a(X)). Eg, Bridges in France is a subcat of Bridges, and every entry
> in "bridges in France" is definitely a Bridge.  It's rare for an
> article to be in more than two taxonomic categories at once.

"Bridges in France" may not be the best example. "Bridges in France" is
just an intersection of two attributes ("in France", "Bridges"), and their
relative position in a hierarchy is undefined. Hence more than one
hierarchy: you can drill down from "France" ... "Buildings and structures
in France" or from "Bridges" ... "Bridges by country".

Compare with taxons in the classification of species: an actual hierarchy,
and only one path from the top down to any species -- there you are
dividing into subsets (and intersections make no sense).
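
To make the contrast concrete, here is a rough sketch in Python. The parent
links are made up for illustration (they are not read from Wikipedia), the
species lineage is heavily simplified, and paths_to_roots is a hypothetical
helper. It only counts how many ways there are to reach a category from
the top.

```python
# Hypothetical parent links; category names follow the examples above.
CATEGORY_PARENTS = {
    "Bridges in France": ["Bridges by country",
                          "Buildings and structures in France"],
    "Bridges by country": ["Bridges"],
    "Buildings and structures in France": ["France"],
    "Bridges": [],
    "France": [],
}

# Simplified lineage (intermediate ranks left out): one parent per taxon.
TAXON_PARENTS = {
    "Panthera leo": ["Panthera"],
    "Panthera": ["Felidae"],
    "Felidae": [],
}


def paths_to_roots(node, parents):
    """Return every path from node up to an entry that has no parents."""
    ups = parents.get(node, [])
    if not ups:
        return [[node]]
    return [[node] + rest
            for up in ups
            for rest in paths_to_roots(up, parents)]


bridge_paths = paths_to_roots("Bridges in France", CATEGORY_PARENTS)
taxon_paths = paths_to_roots("Panthera leo", TAXON_PARENTS)
```

With this data the intersection category is reachable along two different
paths, while the species has exactly one.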

Categories based on such intersections of attributes are conceptually bad.
Look at the categories for an article like [[Marie Curie]]: She's French
three times, female four times, Polish four times (not counting "Natives of
Warsaw"), etc. Why not create [[Category:Polish women who were born in
1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?

If we don't have a term for (or an article about) it, there probably
shouldn't be a category for it, either (I'm sure a determined mind could
come up with an exception).

> Themes: Tend not to be plurals, and tend not to form strict
> hierarchies. Often it is the case that b looks like it belongs in a,
> but then a(b(X)) is nonsense for certain X. Eg, Paris might be in
> European cities, and the film Amelie might be in Paris, but it's silly
> to say that Amelie is in European cities. (or many worse examples)

Well yes, Amelie _is_ related to European cities. It is relevant for a list
of movies that are set in European cities. The real problem is that the
initial relation is entirely unqualified: Amelie is neither a part nor a
member of Paris.

You could conceivably create a category "set in Paris" for the film and
have that be a subcategory of "set in European cities". Problem is, you
need to propagate that modifier backwards all the way to the top or you
will have the same situation you described.

The best solution I've seen is qualified relations (something like
[[Semantic MediaWiki]]), where the link itself names the relation. For
instance, the article text would read: Amelie is set in [[set in::Paris]].
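
A minimal sketch of the idea in Python, not of how Semantic MediaWiki
actually stores or queries anything: relations are kept as (subject,
relation, object) triples. The relation names ("set in", "member of",
"located in"), the data and the helper functions are all assumptions made
up for this example.

```python
# Hypothetical triples; every relation carries a name.
TRIPLES = [
    ("Amelie", "set in", "Paris"),
    ("Paris", "member of", "European cities"),
    ("Pont Neuf", "located in", "Paris"),
]


def subjects(relation, obj, triples=TRIPLES):
    """Everything that has the given named relation to obj."""
    return {s for s, r, o in triples if r == relation and o == obj}


def films_set_in_members_of(category, triples=TRIPLES):
    """Subjects that are 'set in' some member of the given category."""
    members = subjects("member of", category, triples)
    return {s for s, r, o in triples if r == "set in" and o in members}


european_cities = subjects("member of", "European cities")
films = films_set_in_members_of("European cities")
```

Because the relation is named, Amelie shows up when you ask for things set
in European cities, but never as a member of "European cities" itself.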

> Attributes: The category exists to denote some very specific small
> detail of a subject, such that it would be conceivable to have dozens
> or more such categories on an article. Examples: 1943 deaths, Living
> persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies
> that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943
> and "1940s deaths", and these have parent categories of
> "1940s","Years" and so forth, eventually ending up in "History",
> whereupon things become chaos.

There is no way to make hierarchies not suck, especially if you have to
maintain them manually (as we do now). Don't try to impose hierarchies
unless they emerge quite naturally from the subject.

> Meta-attributes: These are categories about *articles* rather than
> article subjects. The most common examples are stubs ("France
> geography stubs"), sources ("1911 Encyclopaedia Britannica") and
> disputes of various kinds ("Articles lacking sources").

Actually, "France geography stubs" contains two attributes (France,
geography). Only the "stub" part is not about the subject. But yeah,
it's a problem.

Another one that you didn't mention is articles that merge several concepts
into one: This happens for instance if a biography is merged with the thing
that made the person notable. You get articles that are in people and
object categories at the same time (e.g. programmers, software).

> To me, these types of categories are all fairly incompatible, and
> really get in the way of using categories to do anything useful. It's
> pointless trying to draw tree structures when you have attributes and
> meta-attributes involved, for example.

So the problem you are trying to solve is drawing tree structures? I'm
afraid your problem may not be shortcomings in WP, but the real world.

> So my questions are these:
> *Can anyone think of other types of categories I might have missed?

Basically, you have identified:
1) is an intersection of [Bridges in France / in France & Bridges]
2) is a subset of [Bridges in France / Bridges]
3) is a member of [Paris / European Cities] and all your attribute examples
4) is related to (or more specifically: is set in) [Amelie (movie) / Paris]
5) information about the article

1) can be computed and shouldn't exist as categories. I'm not sure whether
we care about the difference between 2) and 3). 5) you can quite easily
deal with using namespaces (depending on the problem, of course). The meat
is in 4): You can add any number of named relations there, and most of
the current ugliness is there.
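
On point 1), a rough sketch of what "can be computed" might look like,
in Python. The attribute names and the tiny data set are illustrative
assumptions, and having() is a hypothetical helper, not an existing
MediaWiki feature. Each article only carries plain attributes; an
intersection is a query, not a stored category.

```python
# Illustrative data: one set of plain attributes per article.
ATTRIBUTES = {
    "Marie Curie": {"Polish", "women", "Nobel laureates in Physics",
                    "Nobel laureates in Chemistry"},
    "Pierre Curie": {"French", "men", "Nobel laureates in Physics"},
    "Pont Neuf": {"Bridges", "in France"},
    "Tower Bridge": {"Bridges", "in England"},
}


def having(*wanted):
    """Articles that carry every one of the wanted attributes."""
    need = set(wanted)
    return sorted(name for name, attrs in ATTRIBUTES.items()
                  if need <= attrs)


bridges_in_france = having("Bridges", "in France")
polish_women_with_both_prizes = having(
    "Polish", "women",
    "Nobel laureates in Physics", "Nobel laureates in Chemistry")
```

Any combination can be asked for after the fact, so nobody has to create
(or maintain) a category per combination.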

> *How could Wikipedia be better if this general problem was addressed?

What was the problem again?

Anyhow, I guess my main point is that hierarchies are overrated. They are
most useful when you don't have a computer to sort things out for you.

Roger


