Hello,
When I use mwdumper to import
enwiki-20061001-pages-articles.xml.bz2 with
--output=mysql:... --format=sql:1.5, something along the way corrupts
the UTF-8. I can verify this in the database: the text table has data
like is:Stj�rnleysisstefna, where the mojibake character in the middle
is a single byte.
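
One thing that would leave a single stray byte like this is a
conversion from UTF-8 to Latin-1 somewhere between mwdumper and the
table. Below is a minimal Python sketch of that effect. It assumes the
lost character is ó (U+00F3), which cannot be confirmed from the
damaged output, so the sample title is hypothetical:

```python
# Hypothetical sample: assumes the damaged character was "ó" (U+00F3).
title = "Stj\u00f3rnleysisstefna"

utf8_bytes = title.encode("utf-8")      # 2 bytes for the accented letter
latin1_bytes = title.encode("latin-1")  # 1 byte for the accented letter

# Reading the Latin-1 bytes back as UTF-8 fails on that lone byte; with
# errors="replace" it is shown as U+FFFD, the usual "unknown" glyph.
shown = latin1_bytes.decode("utf-8", errors="replace")

print(len(utf8_bytes), len(latin1_bytes))
print(shown)
```

If the stored value has the shorter length, the text was most likely
transcoded before it reached the table rather than merely displayed
wrongly.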
If I use mwdumper with --output=stdout instead, I can verify that the
resulting SQL contains valid UTF-8. If I pipe that SQL into mysql with
a command like
mwdumper --output=stdout --format=sql:1.5 ... | mysql -uwiki -pwiki wikidb
things come out ok: the Greek in [[Anarchism]] (which is a useful test
article because it occurs early in the dump) displays fine.
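
A check of the stdout SQL along these lines could be sketched in Python
as follows. The function name stream_is_valid_utf8 and the chunk size
are made up for illustration; the sketch only tests that the byte
stream decodes as UTF-8, not that the characters are the right ones:

```python
import codecs
import io

def stream_is_valid_utf8(stream, chunk_size=65536):
    """Return True if every byte read from `stream` decodes as UTF-8."""
    # An incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # Flush: raises if the stream ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True

# Small self-check with in-memory data instead of a real dump.
good = io.BytesIO("INSERT ... '\u03b1\u03bd\u03b1\u03c1\u03c7\u03af\u03b1'".encode("utf-8"))
bad = io.BytesIO(b"INSERT ... 'Stj\xf3rnleysisstefna'")
print(stream_is_valid_utf8(good), stream_is_valid_utf8(bad))
```

To run it against real output, one would pass sys.stdin.buffer as the
stream and pipe the mwdumper --output=stdout result into the script.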
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
- My locale doesn't seem to affect it, but it's all en_US.UTF-8 in
case that matters.
- java version "1.5.0_07" / Java(TM) 2 Runtime Environment, Standard
Edition (build 1.5.0_07-b03)
- Command line I'm trying is:
  java -server -classpath /usr/share/java/mysql-3.1.11.jar:mwdumper/bin \
    org.mediawiki.dumper.Dumper \
    "--output=mysql://localhost/wikidb?user=wiki&password=wiki" \
    --format=sql:1.5 enwiki-20061001-pages-articles.xml.bz2
That mysql jarfile comes from libmysql-java on Ubuntu dapper, version
3.1.11-1, but I find the same behavior with
mysql-connector-java-5.0.4.
I could file this as a bug, but I wanted to first verify I wasn't
doing anything wrong.