On 12/4/06, Brion Vibber <brion(a)pobox.com> wrote:
Evan Martin wrote:
When trying to use mwdumper to import
enwiki-20061001-pages-articles.xml.bz2, I find that using the
--output=mysql:... --format=sql:1.5 options causes some piece to eat UTF-8. I
can verify this in the database: the text table will have data like
is:Stj�rnleysisstefna where the middle mojibake character is a single
byte.
[snip]
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
Assuming you're using MySQL 4.1 or higher, this is your problem.
When you let Java speak directly to MySQL, it's going to try to speak
"real" UTF-8.
Aha! I'm vaguely familiar with the UTF-8 problem with MySQL, but I
couldn't see how it was getting involved here. After sending my mail
I had thought $wgDBmysql5 was a red herring, because mwdumper (I
assume) doesn't use it, but I realize now that when the tables are
*created* that value is used.
To rephrase for anyone else who stumbles across this thread:
$wgDBmysql5 = false means that the tables are created with DEFAULT
CHARSET=latin1, which isn't especially a problem as long as the
software atop it (mediawiki) knows that it's actually storing UTF-8.
But when you use the Java library to speak to MySQL, it notices that
the table is marked as latin1 and tries to convert your UTF-8 data for
you while importing.
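
For anyone who wants to see the effect without a database, here is a
minimal sketch in Python of what I believe is happening. It only mimics
the byte-level conversion; it does not touch MySQL or mwdumper. The
title string is my assumption about what the damaged entry was meant to
be, and the variable names are made up for illustration.

```python
# Assumed original title; the accented letter is U+00F3.
title = "is:Stj\u00f3rnleysisstefna"

# What MediaWiki expects to find in a latin1-labelled column when
# $wgDBmysql5 = false: the raw UTF-8 bytes, stored untouched.
expected_bytes = title.encode("utf-8")

# What a charset-aware client effectively stores when it honours the
# latin1 label: each character converted to its one-byte latin1 form.
converted_bytes = title.encode("latin-1")

# MediaWiki later reads the column as if it held UTF-8. The lone
# latin1 byte is not valid UTF-8, so it decodes to a replacement mark.
read_back = converted_bytes.decode("utf-8", errors="replace")

print(expected_bytes)   # two bytes for the accented letter
print(converted_bytes)  # one byte for the accented letter
print(read_back)
```

The converted form is one byte shorter than the UTF-8 form, which
matches the single stray byte I saw in the text table.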
I now understand importing via a pipe to MySQL is my best bet.
Thanks for the quick response!