Brion Vibber wrote:
> On some quick testing it looks like there are some encoding problems if
> UTF-8 isn't the locale charset; I'll try and get those worked out.
> In the meantime, try setting LANG=en_US.UTF-8 and rerunning it.
>
> Fixed version of mwdumper available:
> http://download.wikimedia.org/tools/
Thank you! The new version definitely makes a big difference, as it
gets past 29,000 articles without any errors.
However, it then died after 40 minutes with this error message:
=============================================================
637,000 pages (272.057/sec), 637,000 revs (272.057/sec)
638,000 pages (272.21/sec), 638,000 revs (272.21/sec)
639,000 pages (272.254/sec), 639,000 revs (272.254/sec)
640,000 pages (272.402/sec), 640,000 revs (272.402/sec)
641,000 pages (272.203/sec), 641,000 revs (272.203/sec)
642,000 pages (272.332/sec), 642,000 revs (272.332/sec)
643,000 pages (272.476/sec), 643,000 revs (272.476/sec)
644,000 pages (272.514/sec), 644,000 revs (272.514/sec)
645,000 pages (272.676/sec), 645,000 revs (272.676/sec)
646,000 pages (272.746/sec), 646,000 revs (272.746/sec)
647,000 pages (272.891/sec), 647,000 revs (272.891/sec)
648,000 pages (272.927/sec), 648,000 revs (272.927/sec)
649,000 pages (273.067/sec), 649,000 revs (273.067/sec)
650,000 pages (273.11/sec), 650,000 revs (273.11/sec)
651,000 pages (273.274/sec), 651,000 revs (273.274/sec)
652,000 pages (273.416/sec), 652,000 revs (273.416/sec)
653,000 pages (273.401/sec), 653,000 revs (273.401/sec)
654,000 pages (273.614/sec), 654,000 revs (273.614/sec)
655,000 pages (273.716/sec), 655,000 revs (273.716/sec)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR 1064 at line 4426: You have an error in your SQL syntax near
''<ul><li>15:38, 20 Sep 2004 [[User:Docu|Docu]] deleted \"Category:Liberal partie' at line 1
Tue Oct 25 10:44:45 EST 2005
ludo:/home/nickj/wikipedia# screendump 1 > screen1
=============================================================
(Note: the machine has 452324k of RAM and 787144k of swap, and wasn't
doing anything else at the time.)
MySQL article count at this time was:
=============================================================
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 655000 |
+----------+
1 row in set (0.00 sec)
=============================================================
As a workaround, I then tried changing the command line from:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar
--format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki
To:
/usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar
--format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki
(i.e. raised the maximum Java heap size to 200 MB), then did a "delete
from cur;", and then reran mwdumper.
With this, it went much further (to around 1933000 articles).
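For anyone wanting to try other heap sizes, here is a rough shell sketch that just prints the pipeline for a given -Xmx value. The function name import_cmd and its parameter are made up for illustration; the paths, file name and flags are the ones from the commands above.

```shell
#!/bin/sh
# Sketch only: print (do not run) the import pipeline for a given heap size.
# JAVA, DUMP and DB are copied from the command lines quoted above.
JAVA=/usr/java/jre1.5.0_05/bin/java
DUMP=20051020_pages_articles.xml.bz2
DB=enwiki

# import_cmd HEAP_MB (hypothetical helper): HEAP_MB is the maximum heap in MB
import_cmd() {
    heap_mb=$1
    case $heap_mb in
        # reject empty or non-numeric values
        ''|*[!0-9]*) echo "heap size must be a whole number of MB" >&2; return 1 ;;
    esac
    printf '%s -Xmx%sM -server -jar mwdumper.jar --format=sql:1.4 %s | mysql %s\n' \
        "$JAVA" "$heap_mb" "$DUMP" "$DB"
}

import_cmd 200
```

Printing the command rather than running it makes it easy to check before pasting it into a shell after the "delete from cur;".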
In case it helps with mwdumper, memory use during import (with the
-Xmx200M arg) looks like this:
===============================================================================
ludo:/home/nickj/wikipedia# top -n1
top - 12:45:30 up 2:53, 3 users, load average: 4.48, 4.48, 4.19
Tasks: 62 total, 2 running, 60 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.3% us, 3.0% sy, 0.0% ni, 0.0% id, 86.7% wa, 1.0% hi, 0.0% si
Mem: 452324k total, 449468k used, 2856k free, 476k buffers
Swap: 787144k total, 76k used, 787068k free, 270148k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1694 root 24 0 384m 142m 51m S 0.0 32.4 24:53.10 java
1697 root 16 0 384m 142m 51m S 0.0 32.4 0:00.00 java
1698 root 16 0 384m 142m 51m S 0.0 32.4 2:11.39 java
1699 root 16 0 384m 142m 51m S 0.0 32.4 0:00.00 java
1700 root 15 0 384m 142m 51m S 0.0 32.4 0:00.00 java
1701 root 16 0 384m 142m 51m S 0.0 32.4 0:00.00 java
1702 root 16 0 384m 142m 51m S 0.0 32.4 0:00.04 java
1703 root 16 0 384m 142m 51m S 0.0 32.4 0:05.24 java
1704 root 16 0 384m 142m 51m S 0.0 32.4 0:05.91 java
1705 root 16 0 384m 142m 51m S 0.0 32.4 0:00.00 java
1706 root 15 0 384m 142m 51m S 0.0 32.4 0:00.16 java
573 mysql 16 0 27232 11m 5380 S 0.0 2.6 0:00.05 mysqld
575 mysql 16 0 27232 11m 5380 S 0.0 2.6 0:00.00 mysqld
576 mysql 16 0 27232 11m 5380 S 0.0 2.6 0:00.00 mysqld
[...snip irrelevant processes...]
===============================================================================
and:
===============================================================================
ludo:/home/nickj/wikipedia# ps auxwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
[...snip irrelevant processes...]
root 823 0.0 0.2 2240 1280 tty1 Ss 09:52 0:00 -bash
root 1692 0.0 0.2 2240 1280 tty1 S+ 11:28 0:00 \_ -bash
root 1694 31.6 33.1 393228 149888 tty1 S+ 11:28 25:08 \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1697 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1698 2.7 33.1 393228 149888 tty1 S+ 11:28 2:11 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1699 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1700 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1701 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1702 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1703 0.1 33.1 393228 149888 tty1 S+ 11:28 0:05 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1704 0.1 33.1 393228 149888 tty1 S+ 11:28 0:05 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1705 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1706 0.0 33.1 393228 149888 tty1 S+ 11:28 0:00 | \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root 1695 2.5 1.9 11184 8696 tty1 S+ 11:28 2:02 \_ mysql enwiki
===============================================================================
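If it would be more useful to have memory figures over the whole run rather than one snapshot, something like the following could sample the resident size of the java process. This is only a sketch: rss_kb is a made-up helper name, and it assumes either a Linux /proc filesystem or a ps that accepts "-o rss=".

```shell
#!/bin/sh
# rss_kb PID (hypothetical helper): print the resident set size of PID in kB
rss_kb() {
    if [ -r "/proc/$1/status" ]; then
        # Linux: the VmRSS line in /proc/PID/status is reported in kB
        awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
    else
        # Fallback: ps prints RSS in kilobytes, possibly with leading spaces
        ps -o rss= -p "$1" | tr -d ' '
    fi
}

# Usage (not run here), with pid set to the java process, e.g. 1694 above:
# while kill -0 "$pid" 2>/dev/null; do
#     echo "$(date) $(rss_kb "$pid")"
#     sleep 60
# done
```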
At around 1933000 articles it seemed to get stuck. I left it overnight
(no change), then rebooted (for good measure), and then MySQL gave
strange errors for cur (e.g. "ERROR 1016: Can't open file: 'cur.MYD'.
(errno: 145)"), and refused to do anything with this table. Further
investigation showed that the disk partition that MySQL was using was
100% full (Doh! My bad). I'm fairly confident that if there had
been sufficient disk space, the mwdumper import would have
succeeded.
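To avoid repeating my mistake, a check along these lines before starting the import would have caught it. It is a sketch only: the helper names free_kb and enough_space are made up, and the data directory and required size in the usage comment are placeholders that depend on the local MySQL setup and the dump being imported.

```shell
#!/bin/sh
# free_kb DIR (hypothetical helper): print the available space, in 1K blocks,
# on the filesystem holding DIR. "df -P" forces one line per filesystem;
# this assumes the filesystem name contains no spaces.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR NEEDED_KB (hypothetical helper): succeed only if at least
# NEEDED_KB kilobytes are free on the filesystem holding DIR.
enough_space() {
    avail=$(free_kb "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Usage (not run here); both arguments are placeholders:
# enough_space /path/to/mysql/datadir 10000000 || {
#     echo "not enough free disk space for the import" >&2
#     exit 1
# }
```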
By the way, I noticed that in the TODO list in the README.txt, it has:
* Include table initialization in SQL output
This is a very good idea - i.e. for 1.4 output a "CREATE TABLE IF NOT
EXISTS cur (...);" before the insert statements. I'd also suggest a
table cleanout option, which does "DELETE FROM cur;" for 1.4 (would be
placed right after the table creation in the output, if this option
is invoked). The equivalents for 1.5 are, I guess, probably
CREATE TABLE IF NOT EXISTS for both 'page' and 'text', and "DELETE
FROM text; DELETE FROM page;". A "--table-cleanout" or
"--delete-current" or "--from-scratch" option would be very handy to
automate this as part of the dump import process.
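Until something like that exists in mwdumper, the cleanout part can be approximated outside it. The sketch below is not an mwdumper feature; cleanout_sql is a made-up shell helper that prints the DELETE statements mentioned above, so they can be sent to mysql ahead of the dump output. The table creation is left out because it needs the full column definitions.

```shell
#!/bin/sh
# cleanout_sql VERSION (hypothetical helper): print the DELETE statements
# that empty the tables used by the given schema version.
cleanout_sql() {
    case $1 in
        1.4) echo 'DELETE FROM cur;' ;;
        1.5) echo 'DELETE FROM text;'
             echo 'DELETE FROM page;' ;;
        *)   echo "unknown schema version: $1" >&2
             return 1 ;;
    esac
}

# Usage (not run here): send the cleanout first, then the dump, to one mysql session
# { cleanout_sql 1.4
#   /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar \
#       --format=sql:1.4 20051020_pages_articles.xml.bz2
# } | mysql enwiki
```

Grouping both commands into one pipe means the deletes and the inserts go through the same mysql session, in that order.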
All the best,
Nick.