Hello,
I read the thread "how bad is a category with ....", and I was wondering
how categories are assigned. If I understand correctly, categories are
assigned by the editors of each article. This assumes that these editors
know the whole set of categories and that these categories will not
change over time?
I was wondering whether there are projects to help *detect* categories
and then to help editors by *suggesting* categories?
I am thinking about two different technologies to help with
these two problems:
1) Text clustering to help find categories, but probably not with the
classical approach where a word space is used to describe a document
(applying part-of-speech tagging
<http://en.wikipedia.org/wiki/Part-of-speech_tagging>, stemming
<http://en.wikipedia.org/wiki/Stemmer>, ...). I am thinking about
clustering the link graph (which seems similar to the clique problem
<http://en.wikipedia.org/wiki/Clique_problem> but with different
constraints), i.e. each document is described not by its words (or
lemmas, LSA vector, ...) but by its links to other articles, using an
algorithm that does not need the number of clusters in advance but
needs a distance or similarity threshold. With this kind of
processing, you get a set of clusters of articles that are linked
together, but a cluster will probably not be a complete graph (this is
the difference from the clique problem). Once you have the clusters, you
need to try to label each of them with a category:
- let the user identify the category name
- use the word space to find the words that best describe this set
of articles
- ...
Then you can run this algorithm on a category to try to split it into
subcategories.
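
To make the clustering idea more concrete, here is a minimal sketch. It
assumes that each article is described by the set of articles it links
to, that similarity is the Jaccard index of two link sets, and that two
articles are joined whenever their similarity reaches the threshold
(single-link grouping). These choices, the function names and the toy
data are only illustrative assumptions, not an existing Wikimedia
feature.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two link sets: shared links / all links."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_links(links, threshold):
    """Group articles whose link sets are similar enough.

    links: dict mapping an article title to the set of titles it links to
    threshold: minimum Jaccard similarity for joining two articles
    The number of clusters is not given in advance; it follows from
    the threshold.
    """
    parent = {title: title for title in links}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of articles (quadratic, fine for a sketch).
    for a, b in combinations(links, 2):
        if jaccard(links[a], links[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for title in links:
        clusters.setdefault(find(title), set()).add(title)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data, for illustration only.
links = {
    "Paris": {"France", "Seine", "Europe"},
    "Lyon": {"France", "Rhone", "Europe"},
    "Python": {"Programming", "Guido van Rossum"},
    "Perl": {"Programming", "Larry Wall"},
}
clusters = cluster_by_links(links, threshold=0.3)
print(clusters)
```

Because articles are merged transitively, two articles can end up in
the same cluster without being directly similar to each other, which is
why a cluster is not necessarily a complete graph. Running the same
function on the articles of a single category, with a higher threshold,
is one way to look for subcategories.
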
2) Machine learning or link graph exploration to suggest categories
while an article is being edited.
The first idea is to learn the existing categories with a machine
learning algorithm (using the word space) in order to guess the
categories of a new article (but this algorithm will have to deal with
new categories and with the fact that the number of documents without a
category is greater than the number of documents with one).
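
As a rough illustration of this first idea, here is a small sketch that
uses a multinomial naive Bayes classifier over whitespace-separated
words. The choice of naive Bayes, the function names and the training
examples are my own assumptions. It only ranks categories it has
already seen, so it does not solve the problems of new categories and
uncategorised articles mentioned above.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, category) pairs from categorised articles."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of articles
    vocabulary = set()
    for text, category in examples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def suggest(text, model, top=3):
    """Rank the known categories by naive Bayes log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Words never seen during training carry no information here.
    words = [w for w in text.lower().split() if w in vocabulary]
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for w in words:
            # Laplace smoothing: a word unseen in this category
            # lowers its score without ruling it out.
            score += math.log(
                (word_counts[category][w] + 1)
                / (total_words + len(vocabulary))
            )
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical training data, for illustration only.
examples = [
    ("river city capital france", "Geography"),
    ("mountain river valley europe", "Geography"),
    ("language compiler code programming", "Computing"),
    ("code interpreter programming script", "Computing"),
]
model = train(examples)
ranking = suggest("a programming language with an interpreter", model)
print(ranking)
```
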
The second idea is much simpler and easier to implement: when you
edit an article, you can suggest the categories of the articles it
links to (this can be replaced by another graph-exploration algorithm).
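
This second idea could look roughly like the sketch below: each linked
article "votes" for its own categories, and the most frequent ones are
suggested. The data structures (`links`, `categories`), the function
name and the toy data are assumptions made for the example.

```python
from collections import Counter


def suggest_from_links(article, links, categories, top=3):
    """Suggest the categories most common among the linked articles.

    links: dict mapping an article title to the titles it links to
    categories: dict mapping an article title to its set of categories
    """
    already = categories.get(article, set())
    votes = Counter()
    for target in links.get(article, ()):
        for category in categories.get(target, ()):
            # Do not suggest a category the article already has.
            if category not in already:
                votes[category] += 1
    return [category for category, _ in votes.most_common(top)]


# Hypothetical toy data, for illustration only.
links = {"Lyon": ["Paris", "Marseille", "Rhone"]}
categories = {
    "Paris": {"Cities in France", "Capitals in Europe"},
    "Marseille": {"Cities in France"},
    "Rhone": {"Rivers of France"},
}
suggestions = suggest_from_links("Lyon", links, categories)
print(suggestions)
```

Counting only direct links is the simplest form of graph exploration.
Following links two steps away, or weighting the votes, would be
drop-in replacements for the inner loop.
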
Are there functions like these in Wikimedia? And do you think that
this kind of algorithm could help?
Finally, do you know of people working on this kind of functionality
(maybe people working on the semantic web)?
Best Regards.
Julien Lemoine