I have been building a machine-compiled lexicon created from link and
disambiguation pages in the XML dumps. Oddly, the associations
contained in [[ARTICLE_NAME | NAME]] links form a comprehensive "real-time"
thesaurus of common associations used by current English speakers on
Wikipedia, and may well comprise the world's largest and most comprehensive
thesaurus, embedded within the mesh of these links within
the dumps.
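As a rough sketch of the idea (the regex, function name, and sample text here are illustrative assumptions, not the actual lexicon-building program), the [[ARTICLE_NAME | NAME]] associations could be harvested from raw wikitext like this:

```python
import re
from collections import defaultdict

# Matches piped links of the form [[ARTICLE_NAME|LINK_NAME]];
# whitespace around the pipe is optional in real wikitext.
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+?)\s*\|\s*([^\[\]|]+?)\]\]")

def harvest_associations(wikitext, thesaurus=None):
    """Accumulate LINK_NAME -> set of ARTICLE_NAME associations
    from one page's wikitext."""
    if thesaurus is None:
        thesaurus = defaultdict(set)
    for article, label in PIPED_LINK.findall(wikitext):
        thesaurus[label.strip()].add(article.strip())
    return thesaurus

thesaurus = harvest_associations(
    "See [[Automobile|car]] and [[Motor vehicle | car]].")
print(sorted(thesaurus["car"]))
# ['Automobile', 'Motor vehicle']
```

Running this over every page in a dump and merging the per-page maps would give the raw association data the message describes.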
[... snip ...]
The first part of the message discusses a machine-created thesaurus
based on these links, which I will post as an XML
dump when the program is completed. That part may be of interest going
forward, as it would enable a built-in
thesaurus for MediaWiki. Wikitrans uses this thesaurus created from
within the dumps. It could have a lot
of applications for translators; I have found it very useful.
Hi Jeff,
I ran a small project a couple of years ago to try to create "missing"
redirects and disambiguation pages using this information
( http://en.wikipedia.org/w/index.php?title=User:Nickj/Redirects ) - I'll quickly
describe what it did, in case it helps anyone who wants to do something similar now.
A list of possible new redirects was created based on piped-link /
[[ARTICLE_NAME | LINK_NAME]] usage in articles in the main namespace
(using database dumps of enwiki), where:
* all or most of the source LINK_NAME "votes" agreed on what the target ARTICLE_NAME was;
* and a certain minimum threshold of votes was crossed (I think it might have been >= 3 votes);
* and where there was no article currently at [[LINK_NAME]];
* and where there was an article currently at ARTICLE_NAME (since redirects that point to non-existent articles should be deleted with extreme prejudice, IMHO).
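The filtering rules above could be sketched roughly like this (a minimal illustration only; the threshold value, the full-agreement default, and the data structures are assumptions standing in for the real dump-processing code):

```python
from collections import Counter, defaultdict

MIN_VOTES = 3  # assumed threshold; the original "might have been >= 3"

def suggest_redirects(piped_links, existing_titles, agreement=1.0):
    """piped_links: iterable of (link_name, article_name) pairs
    extracted from piped links in the dumps.
    existing_titles: set of article titles that currently exist.
    Returns {link_name: target} suggestions meeting the four rules."""
    votes = defaultdict(Counter)
    for link_name, article_name in piped_links:
        votes[link_name][article_name] += 1

    suggestions = {}
    for link_name, counter in votes.items():
        target, top = counter.most_common(1)[0]
        total = sum(counter.values())
        if (top / total >= agreement            # all/most votes agree
                and top >= MIN_VOTES            # minimum vote threshold
                and link_name not in existing_titles  # no page at LINK_NAME
                and target in existing_titles):       # target article exists
            suggestions[link_name] = target
    return suggestions

links = [("car", "Automobile")] * 3 + [("auto", "Automobile")]
print(suggest_redirects(links, {"Automobile"}))
# {'car': 'Automobile'}
```

Here "auto" is rejected for having only one vote, while "car" passes all four checks.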
These redirect suggestions were then reviewed by humans, and if they liked one,
they added it by clicking on a link (which used a GET request, to give a preview of
the result, and which supplied an edit description and all the body contents). This meant
that a new redirect could be added with just 2 mouse clicks, using a standard browser.
(Using the exact same method today is not currently possible due to
http://bugzilla.wikimedia.org/show_bug.cgi?id=3693 ,
although it is currently possible to use "Show Changes" instead of "Preview" to achieve
a very similar result using GET requests.)
A series of disambiguation pages was also suggested, and these suggestions were created
using the same methods, based on [[ARTICLE_NAME | LINK_NAME]] usage, but where the
LINK_NAME "votes" did _not_ agree on what the target ARTICLE_NAME was. In these cases,
it suggested a disambig page that basically said "LINK_NAME is either [[A]], [[B]] or [[C]]".
Anyone who wanted to give something like this a go (and I'm sure that in 2 years
there must have been tonnes more links added, which means a lot more raw data to
work with) would probably want to have a quick glance over the "Previously
Rejected Suggestions"
( http://en.wikipedia.org/wiki/User:Nickj/Redirects#Previously_rejected_sugge… )
to see what people did not like previously.
Oh, and once something like this was done, you could maybe start a thesaurus directly
from the redirects themselves, thus helping both the thesaurus people and Wikipedia -
win/win :-) And if you wanted to create a truly open thesaurus, you'd probably want to
tag the redirects that were worthy of inclusion with something like
[[Category:Thesaurus Redirect]], and you'd probably also want to tag the ones that
weren't worthy of inclusion somehow too; that way anyone could build on this data and
come up with new and cool ways of using it ;-)
All the best,
Nick.