On Thu, Feb 28, 2008 at 7:14 AM, Neil Harris <usenet(a)tonal.clara.co.uk> wrote:
> Interesting. What they seem to be proposing is to store the tags for
> each article in a plain text field, and then use the built-in MySQL
> full-text search mechanism to index and search that, thus taking
> advantage of all the development already devoted to speeding up
> general-purpose full text search.
>
> I wonder how it would scale to Wikipedia's vast datasets?
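For concreteness, the scheme quoted above can be sketched as follows. SQLite's FTS5 module is used here purely as a stand-in for MySQL's FULLTEXT machinery, and the table and column names are mine, not from the paper:

```python
import sqlite3

# Denormalized approach: all of an item's tags live in one text column,
# and a full-text index does the lookup. (SQLite FTS5 stands in for
# MySQL FULLTEXT here; schema and names are illustrative.)
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE tagged_item USING fts5(item_id, tags)")
con.executemany(
    "INSERT INTO tagged_item VALUES (?, ?)",
    [("1", "physics quantum"), ("2", "history physics"), ("3", "biology")],
)
# A full-text MATCH on the tags column finds every item carrying the tag
rows = con.execute(
    "SELECT item_id FROM tagged_item WHERE tagged_item MATCH 'tags:physics'"
).fetchall()
print(sorted(r[0] for r in rows))  # ['1', '2']
```

Note that because the full-text tokenizer splits on whitespace and punctuation, multi-word tags would have to be joined (e.g. with underscores) before being stored this way.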
Well, the point is that exactly that solution has already been
discussed extensively on this list, in exactly this context, and the
answer is "not clear". Performance in various tests (primarily if not
exclusively by Aerik) was not terrible, but might or might not be good
enough. The thought was to use something like Lucene instead, which
would probably be faster. The normalized-schema approach was also
tested here, I think by Greg Maxwell, who concluded that PostgreSQL
was much faster (a finding this paper agrees with).
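For comparison, the normalized layout under discussion, with one row per tag-item association queried through an ordinary B-tree index, can be sketched like this; again SQLite stands in for the real server and all names are illustrative:

```python
import sqlite3

# Normalized approach: one row per tag-item association, in the spirit
# of enwiki's categorylinks table (schema and names are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item_tag (item_id INTEGER, tag TEXT)")
# A plain (tag, item_id) index serves the "all items with this tag" query
con.execute("CREATE INDEX idx_tag ON item_tag (tag, item_id)")
con.executemany(
    "INSERT INTO item_tag VALUES (?, ?)",
    [(1, "physics"), (1, "quantum"), (2, "history"),
     (2, "physics"), (3, "biology")],
)
rows = con.execute(
    "SELECT item_id FROM item_tag WHERE tag = ?", ("physics",)
).fetchall()
print(sorted(r[0] for r in rows))  # [1, 2]
```

In principle this equality lookup over a B-tree should beat a full-text scan, which is what makes the paper's result surprising.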
The interesting part is their remark that they basically have no idea
why fulltext is actually faster, since in principle it shouldn't be.
It makes me wonder whether we will see more effort from RDBMS vendors
in the future to implement efficient indexes for this kind of thing.
One thing that needs to be kept in mind is that their data set had
50,000 tag-item associations. This is about 500 times smaller than
enwiki's categorylinks. Scalability data with smaller and larger data
sets would have made an interesting addition to the paper.