Hi,
This is a message I sent to the Python Wikipedia Bot mailing list today. It
should also concern MediaWiki development, as there might be other clients
that send invalid UTF-8 (or even the MediaWiki software itself, as seen in
[[es:Wikipedia:Registro_de_borrados]]).
---------- Forwarded Mail ----------
Subject: [pyWikipediaBot-users] Bot messed up databases on UTF-8 wikis
Date: Sunday, 18 July 2004 16:12
From: Daniel Herding <DHerding(a)gmx.de>
To: pywikipediabot-users(a)lists.sourceforge.net
Hi,
wikipedia.putpage() always sent its edit summaries as Latin-1 (or some other
non-UTF-8 encoding), even when it was editing a UTF-8 wiki, which expects
UTF-8 summaries. (This affects all bot functions whose summaries can contain
non-ASCII characters.)
This, of course, corrupted the SQL databases, and if you look at
http://fr.wikipedia.org/w/wiki.phtml?title=10_mars&action=history
with Mozilla, it will show flashy question marks instead of special
characters. The same happened on nds:, where Andre ran the interwiki bot, and
probably on many other Wikipedias. I just can't believe that nobody noticed
this, and I'm quite angry that nobody reported this bug.
I fixed this bug yesterday, as you can see here:
http://fr.wikipedia.org/w/wiki.phtml?title=Utilisateur:Head&action=hist…
but the databases are already corrupted. The XML export special page (which
is used by interwiki.py) gives out malformed XML, which leads to a SAX parse
error. And my newly created sqldump.py is unusable for these wikis.
So I guess we should ask the MediaWiki developers to help us out. Maybe they
can shut down the wiki for a while, then run over the 'old' database,
replacing every byte that is not valid UTF-8 with a question mark.
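
A rough sketch of that replacement step, in present-day Python 3 and working on a single byte string. The handler name "qmark" and the function scrub_invalid_utf8 are made up for the example, and a real repair would have to run over the affected database fields, which is not shown here:

```python
import codecs


def _question_mark_handler(error):
    """Decode error handler: one '?' per byte that could not be decoded."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # error.start and error.end delimit the undecodable bytes.
    return ("?" * (error.end - error.start), error.end)


# "qmark" is an arbitrary name chosen for this sketch.
codecs.register_error("qmark", _question_mark_handler)


def scrub_invalid_utf8(raw):
    """Return raw with every byte that is not valid UTF-8 replaced by b'?'.

    Valid UTF-8 sequences pass through unchanged.
    """
    return raw.decode("utf-8", errors="qmark").encode("utf-8")


# A summary that was stored as Latin-1 on a UTF-8 wiki.
damaged = "10 mars: été".encode("latin-1")
print(scrub_invalid_utf8(damaged))
```

The original characters are lost either way, but the stored text becomes valid UTF-8 again, so XML export and SQL dumps can be parsed.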
Daniel
----------- End of forwarded Mail ---------------
It would be nice if someone could repair the databases on fr:, nds:, es:, and
other affected Wikipedias. You should also consider implementing a filter
that rejects input containing invalid UTF-8. Mail me if you need additional
information.
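
Such a filter could look roughly like this. It is only a sketch in Python 3 to show the idea; MediaWiki itself would need the equivalent check in its own code, and the names InvalidEncodingError and require_valid_utf8 are made up for the example:

```python
class InvalidEncodingError(ValueError):
    """Raised when submitted bytes are not valid UTF-8 (hypothetical name)."""


def require_valid_utf8(raw):
    """Return the decoded text, or reject the input if it is not valid UTF-8."""
    try:
        # Strict decoding raises on the first invalid byte sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidEncodingError(
            "invalid UTF-8 at byte offset %d" % exc.start
        ) from exc


# A correctly encoded summary passes through.
print(require_valid_utf8("Robot: ajoute été".encode("utf-8")))
```

Rejecting the edit outright makes a broken client visible immediately, instead of silently storing bytes that cannot be displayed later.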
Daniel