[WikiEN-l] Using search to explore and find issues w/ articles

Derek Gottfrid dgottfrid at gmail.com
Mon Nov 13 15:43:40 UTC 2006


I have been developing a small search engine for the last 12 months and have
been using Wikipedia as sample data. It is available at futef.com. Recently,
I did further parsing of the Wikipedia data, and I now merge all the titles
that are attached to the same article (e.g. "Jimmy Wales" and "Jimbo Wales")
instead of leaving them connected only via #REDIRECT.
In doing this, along with some relevancy tuning, I uncovered some interesting
things about the dataset. A search for "jimbo wales" returns "exploding whale"
as a top article, since somebody has created redirects for "exploding jimbo
wales" as well as "king jimbo wales". I can fix my search: since nobody links
to "exploding jimbo wales", I can assume it is a junk link and exclude it.
But I wanted to know whether this community would be interested in using the
search facilities to verify and explore some of the data. If there is some
interest, I would be happy to create a richer interface to the search engine
that would expose more data anomalies.
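
To make the idea concrete, here is a minimal Python sketch of the heuristic
described above. It is not the actual FUTEF code. The function name
merge_titles, the input shapes, and the link count for "Jimbo Wales" are
assumptions made up for illustration. Only the two redirects, and the fact
that nobody links to "exploding jimbo wales", come from the examples above.

```python
def merge_titles(redirects, inbound_links):
    """Attach redirect titles to their target article.

    redirects:     dict of redirect title -> target article title
    inbound_links: dict of title -> number of pages linking to that title
    Returns (merged, suspects). merged maps each article to the titles
    accepted for it; suspects lists (redirect, target) pairs that no page
    links to, which are treated as probable junk.
    """
    merged = {}
    suspects = []
    for title, target in redirects.items():
        # Every article is always searchable under its own title.
        titles = merged.setdefault(target, [target])
        if inbound_links.get(title, 0) > 0:
            titles.append(title)
        else:
            # Nobody links to this redirect: keep it out of the index
            # and flag it for a human to look at.
            suspects.append((title, target))
    return merged, suspects


# Illustrative data only; the count for "Jimbo Wales" is invented.
redirects = {
    "Jimbo Wales": "Jimmy Wales",
    "exploding jimbo wales": "Exploding whale",
}
inbound_links = {"Jimbo Wales": 12, "exploding jimbo wales": 0}

merged, suspects = merge_titles(redirects, inbound_links)
for redirect, target in suspects:
    print(f"{redirect} -> {target}")
```

The suspects list is the part that could be exposed to this community: it
is a list of redirects that look odd, not a verdict that they are wrong.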

A few more examples:

Better than france -> Italy
Better than germany -> Italy
Cheese Eating Surrender Monkies -> France

Then there are more subtle issues, like:

Educational background of George W. Bush -> Yale

I haven't absorbed enough of the Wikipedia ethos to offer a strong position
on all of the things I have found, but it would be great if people who are
interested had a better set of tools to work with the data. What do you
think?


BTW, I have been lurking here for some time and watched a conversation appear
about the relevancy of FUTEF, and I took the criticism to heart. It was not
as good as it should be. It still isn't, but I have worked to make it better.
The several cases that were mentioned on the mailing list in particular have
been fixed, and in general the overall relevancy has greatly improved. If
anyone has issues, please let me know. The previous comments were very helpful.

thanks,
derek


--
http://futef.com
derek at futef.com
dgottfrid at gmail.com


