Francis Tyers wrote:
Actually it is surprisingly difficult. I have a script which does it
here:
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-le…
It really needs to be redone for each Wikipedia. If you ask
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
he has some scripts which do it too. But there is no generic "nice" way
of getting Wikipedia as a nice plain text corpus so far. If anyone has
one I would love to hear about it.
Convert to HTML using MediaWiki, then filter out all HTML tags.
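
For what it's worth, the tag-filtering half of that could look roughly like the sketch below. It assumes the pages have already been rendered to HTML by MediaWiki (that step is not shown), and the names TextExtractor and html_to_text are made up for illustration. It uses only Python's standard html.parser module, and it does nothing about navigation boxes, tables or other non-prose markup, which would probably still need per-Wikipedia handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    SKIP = {"script", "style"}  # elements whose contents are not prose

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b>!</p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Using a real parser rather than a regular expression means entities and script/style blocks are handled without extra special cases.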