Hi All,
Are other people having grief importing the new XML-format database dumps?
Today I tried three different methods of importing the EN
20051009_pages_articles.xml.bz2 dump, and not one of them worked
properly.
Incidentally, I have verified that the md5sum of the dump is correct,
to rule out download problems:
ludo:/home/nickj/wikipedia# md5sum 20051009_pages_articles.xml.bz2
4d18ffa1550196f3a6a0abc9ebbd7d06 20051009_pages_articles.xml.bz2
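For extra paranoia, bzip2 itself can also test the archive: the md5sum only
proves the download matches the server's copy, while "bzip2 -t" walks the
whole compressed stream and fails on any corrupt block.

```shell
# The check on the real dump would be:
#   bzip2 -tv 20051009_pages_articles.xml.bz2
# Exit status 0 means the stream is intact. Shown here on a small
# stand-in file, since the real dump is large:
printf 'sample data' | bzip2 > sample.bz2
bzip2 -t sample.bz2 && echo "stream OK"
```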
------------------------------------------------------------------------------------
Method 1:
Importing using importDump.php from MediaWiki 1.5.0 running on PHP 4.1.2.
I knew this one might have problems, given how old that version of
PHP is.
However, this one got the furthest of all the methods: it ran for 6
hours and 24 minutes, and imported around 60 percent of the articles.
Something (probably PHP) has a memory leak, though, which eventually
triggered Linux 2.6.8's Out-of-Memory killer, until it killed the
script in question. The machine has 448 MB of RAM, so it took a while
for the leak to consume all the memory.
Command line was:
bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | php
maintenance/importDump.php
But from the overnight system log we have:
Oct 21 03:05:01 ludo kernel: Out of Memory: Killed process 816 (apache).
Oct 21 03:13:04 ludo kernel: Out of Memory: Killed process 817 (apache).
Oct 21 03:20:41 ludo kernel: Out of Memory: Killed process 7677 (apache).
Oct 21 03:23:30 ludo kernel: Out of Memory: Killed process 946 (apache).
Oct 21 03:26:57 ludo kernel: Out of Memory: Killed process 7696 (apache).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 573 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 575 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 576 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 577 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 3111 (mysqld).
Oct 21 06:29:24 ludo kernel: Out of Memory: Killed process 7697 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 7699 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 3110 (php).
At that point importing stopped.
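One mitigation I haven't actually tried yet (so treat it as a sketch): run
the importer under a ulimit, so that when the leak hits the cap, php fails
on its own allocation instead of the OOM killer shooting mysqld and apache.
The 350000 KB figure below is just a guess for a 448 MB box.

```shell
# ulimit -v caps a shell's virtual memory (in KB) and is inherited by
# everything that shell starts; demonstrate that the cap sticks:
bash -c 'ulimit -v 350000; ulimit -v'
# The actual import would then be run under the same cap, e.g.:
#   ulimit -v 350000
#   bzip2 -dc 20051009_pages_articles.xml.bz2 | php maintenance/importDump.php
```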
------------------------------------------------------------------------------------
Method 2:
Importing using importDump.php from MediaWiki 1.5.0 with a fresh PHP 4.4
STABLE CVS snapshot build (from really old to really new).
I thought this one would work, but it didn't:
ludo:/var/www/hosts/local-wikipedia/wiki# bzip2 -dc
/home/nickj/wikipedia/20051009_pages_articles.xml.bz2 |
~root/tmp/php-5.1-dev/php4-STABLE-200510201252/sapi/cli/php
maintenance/importDump.php
100 (22.802267296596 pages/sec 22.802267296596 revs/sec)
200 (20.961060430845 pages/sec 20.961060430845 revs/sec)
300 (20.006219254115 pages/sec 20.006219254115 revs/sec)
[...snip lots of progress lines...]
64000 (41.86646431353 pages/sec 41.86646431353 revs/sec)
64100 (41.87977053847 pages/sec 41.87977053847 revs/sec)
64200 (41.891992792767 pages/sec 41.891992792767 revs/sec)
64300 (41.902506473828 pages/sec 41.902506473828 revs/sec)
64400 (41.920741784615 pages/sec 41.920741784615 revs/sec)
64500 (41.937710744276 pages/sec 41.937710744276 revs/sec)
64600 (41.945053966443 pages/sec 41.945053966443 revs/sec)
64700 (41.95428629711 pages/sec 41.95428629711 revs/sec)
PHP Fatal error: Call to a member function on a non-object in
/var/www/hosts/local-wikipedia/wiki/includes/Article.php on line 934
ludo:/var/www/hosts/local-wikipedia/wiki#
I.e., it died after 13 minutes, at around 4% of the articles.
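For what it's worth, the failing call can at least be eyeballed by pulling
the lines around the fatal out of the 1.5.0 source tree:

```shell
# Extract the lines around the fatal (line 934 of Article.php here):
#   sed -n '930,938p' includes/Article.php
# sed's range-print, shown on a stand-in file:
printf 'one\ntwo\nthree\nfour\nfive\n' > stub.txt
sed -n '2,4p' stub.txt
```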
------------------------------------------------------------------------------------
Method 3:
Using the latest mwdumper (from http://download.wikimedia.org/tools/),
plus the latest and greatest stable JRE (1.5.0_05), converting the
dump into 1.4 format, and then importing that into MySQL:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar
--format=sql:1.4 20051009_pages_articles.xml.bz2 | mysql enwiki
This ran without any errors, and looked really promising.
Before running it, the database held some 1.5 million articles (from
a June SQL dump, which was the last Wikipedia dump I've been able to
import properly):
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 1535910 |
+----------+
1 row in set (0.00 sec)
# Then I cleared the table:
mysql> delete from cur;
Query OK, 0 rows affected (4.11 sec)
# Then the above mwdumper command ran for 53 minutes before finishing,
which seemed way too quick. Checking how many articles had been
imported confirmed something was wrong:
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 29166 |
+----------+
1 row in set (0.00 sec)
I.e., less than 2% of the articles were imported.
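Next time I'll probably write mwdumper's output to a file instead of piping
it straight into mysql, so a short conversion is obvious before anything
touches the database. A rough check (hypothetical; note mwdumper batches
many rows per INSERT, so this is a sanity figure, not an exact row count):

```shell
# Convert to a file first, then sanity-check before loading:
#   /usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar \
#       --format=sql:1.4 20051009_pages_articles.xml.bz2 > enwiki.sql
#   grep -c '^INSERT' enwiki.sql
# The grep count itself, shown on a stand-in file:
printf 'INSERT INTO cur VALUES (1);\nINSERT INTO cur VALUES (2);\n' > stub.sql
grep -c '^INSERT' stub.sql
```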
------------------------------------------------------------------------------------
So, my question to the list is this:
What methods have you tried for importing the XML dumps? In
particular, what have you tried that actually _worked_? (And by
"working", I mean: runs without a memory leak, runs without dying
with an error message, and imports all of the articles into the
database.)
All the best,
Nick.