[Wikitech-l] Categories problem

27 Aug 2006

Hello,

I read the thread "how bad is a category with ....", and I was wondering 
how categories are filled. If I understand correctly, categories are 
assigned by the editors of each article. This assumes that these editors 
know the whole set of categories and that these categories will not 
change over time?
I was wondering whether there are projects to help *detect* categories 
and then to help editors by *suggesting* categories?

I am thinking about two different technologies to help deal with 
these two problems:
1) Text clustering to help find categories, but probably not using 
the classical approaches where a word space is used to describe a 
document (applying part-of-speech tagging 
<http://en.wikipedia.org/wiki/Part-of-speech_tagging>, stemming 
<http://en.wikipedia.org/wiki/Stemmer>, ...). I am thinking about 
clustering the link graph (this seems similar to the clique problem 
<http://en.wikipedia.org/wiki/Clique_problem> but with different 
constraints), i.e. each document would not be described by its words (or 
lemmas, LSA vector...) but by its links to other articles, using an 
algorithm that does not need the number of clusters in advance but 
needs a distance or similarity threshold. With this kind of 
processing, you get a set of clusters whose articles are linked 
together, but a cluster will probably not be a complete graph (this is 
the difference from the clique problem). Once you have the clusters, you 
need to try to label each of them with a category:
 - give the user the role of identifying the category name
 - use the word space to find the words that best describe this set 
of articles
 - ...
Then you can run this algorithm on a single category to try to split it 
into subcategories.
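
To make idea 1 a bit more concrete, here is a minimal sketch in Python. 
It is only one possible reading of the idea: I assume Jaccard similarity 
between the sets of outgoing links as the similarity measure, and I 
merge any two articles whose similarity reaches the threshold 
(single-linkage, via union-find). The function names and the toy data 
are hypothetical, not anything that exists in MediaWiki.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of link targets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_links(links, threshold):
    """Group articles by the similarity of their outgoing links.

    links: dict mapping an article title to the set of titles it links to.
    Two articles end up in the same cluster if they are connected by a
    chain of pairs whose Jaccard similarity is >= threshold.
    The number of clusters is not given in advance.
    """
    parent = {title: title for title in links}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of articles (quadratic, fine for a sketch).
    for a, b in combinations(links, 2):
        if jaccard(links[a], links[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for title in links:
        clusters.setdefault(find(title), set()).add(title)
    return list(clusters.values())


# Hypothetical toy data: article -> articles it links to.
links = {
    "Paris": {"France", "Seine", "Europe"},
    "Lyon": {"France", "Rhone", "Europe"},
    "Marseille": {"France", "Rhone", "Mediterranean"},
    "Python": {"Programming language", "Guido van Rossum"},
    "Perl": {"Programming language", "Larry Wall"},
}
clusters = cluster_by_links(links, threshold=0.3)
```

With this toy data, "Paris" and "Marseille" fall into the same cluster 
only through "Lyon" (their direct similarity is below the threshold), 
which illustrates why a cluster need not be a complete graph.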

2) Machine learning or link graph exploration to suggest categories 
while an article is being edited.
The first idea is to learn the existing categories with a machine 
learning algorithm (using the word space) in order to guess the 
categories of a new article (but this algorithm will have to deal with 
new categories and with the fact that the number of documents without a 
category is greater than the number of documents with a category).
The second idea is much simpler and easier to implement: when you 
edit an article, you can suggest the categories of the linked articles 
(this can be replaced by another graph-exploration algorithm).
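
The second idea could look roughly like the following Python sketch. 
It assumes two hypothetical in-memory mappings (article to linked 
articles, article to categories) and ranks candidate categories simply 
by how many linked articles carry them; this ranking rule is my own 
assumption.

```python
from collections import Counter


def suggest_categories(article, links, categories, top_n=3):
    """Suggest categories for `article` from the articles it links to.

    links: dict mapping a title to an iterable of linked titles.
    categories: dict mapping a title to a set of category names.
    Returns up to top_n (category, count) pairs, most frequent first,
    skipping categories the article already has.
    """
    counts = Counter()
    for target in links.get(article, ()):
        counts.update(categories.get(target, ()))

    already = set(categories.get(article, ()))
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(
        ((cat, n) for cat, n in counts.items() if cat not in already),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_n]


# Hypothetical toy data.
links = {"Lyon": ["Paris", "Marseille", "Rhone"]}
categories = {
    "Paris": {"Cities in France", "Capitals in Europe"},
    "Marseille": {"Cities in France", "Port cities"},
    "Rhone": {"Rivers of France"},
    "Lyon": set(),
}
suggestions = suggest_categories("Lyon", links, categories, top_n=2)
```

Here "Cities in France" comes first because two of the linked articles 
have it; the editor would still decide whether to accept a suggestion.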

Are there any features like these in Wikimedia? And do you think that 
this kind of algorithm could help?
Finally, do you know of people working on such functionality (maybe 
people working on the semantic web)?

Best Regards.
Julien Lemoine
