[Wikipedia-l] MediaWiki converter to Plain Text, XML, DocBook, PDF

Magnus Manske magnus.manske at web.de
Wed Mar 22 10:40:22 UTC 2006


Note: I cross-posted this to several lists, because I think this is of
interest to many; please reply on wikitech-l only.


A long, long time ago, I started writing a PHP script to convert
MediaWiki markup into XML. I believe it is now feature-complete and
relatively reliable. It can process not only a single wiki text but also
a list of articles, fetching the text from any MediaWiki-based site
online. Templates are fetched and replaced by the same method.
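
For readers who want to see what "taking the text from any
MediaWiki-based site" can look like, here is a minimal Python sketch. It
is not the wiki2xml code (which is PHP). It assumes the site serves
unrendered markup through index.php?action=raw; the helper names
(raw_url, fetch_wikitext, template_title) and the example.org base URL
used with them are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def raw_url(index_php: str, title: str) -> str:
    """Build the URL that returns a page's unrendered wiki markup.

    Assumption: the site serves raw markup via index.php?action=raw.
    """
    # MediaWiki writes spaces in page titles as underscores.
    query = urlencode({"title": title.replace(" ", "_"), "action": "raw"})
    return f"{index_php}?{query}"


def template_title(name: str) -> str:
    """Templates live in the Template: namespace, so they fetch the same way."""
    return f"Template:{name}"


def fetch_wikitext(index_php: str, title: str) -> str:
    """Download the markup of one page (article or template alike)."""
    with urlopen(raw_url(index_php, title)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_wikitext("http://example.org/w/index.php", "Cell biology")
would request the article, and passing template_title("Taxobox") instead
would request a template through exactly the same code path.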

The generated XML can now be converted into other formats. For
demonstration [1], I offer "plain text" and DocBook XML.

What I cannot demonstrate (due to limitations of my hosting service) is
the subsequent conversion to HTML or PDF from the DocBook XML. However,
it is quite easy to set up an automatic conversion locally if you have
the necessary DocBook files installed.
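
As a rough idea of what such a local setup might look like: the sketch
below assumes one common toolchain, namely xsltproc with the DocBook XSL
stylesheets to produce XSL-FO, followed by Apache FOP to render the PDF.
This post does not say which tools are meant, so treat the tool choice,
the stylesheet path and the function names as assumptions.

```python
import subprocess
from pathlib import Path


def docbook_to_pdf_commands(docbook_xml: str, fo_stylesheet: str, pdf_out: str):
    """Return the two commands for DocBook XML -> XSL-FO -> PDF.

    fo_stylesheet is the fo/docbook.xsl file of a DocBook XSL install;
    its location depends on your system (assumption).
    """
    # Intermediate XSL-FO file sits next to the PDF, e.g. topics.fo.
    fo_file = str(Path(pdf_out).with_suffix(".fo"))
    return [
        ["xsltproc", "--output", fo_file, fo_stylesheet, docbook_xml],
        ["fop", "-fo", fo_file, "-pdf", pdf_out],
    ]


def run_conversion(docbook_xml: str, fo_stylesheet: str, pdf_out: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in docbook_to_pdf_commands(docbook_xml, fo_stylesheet, pdf_out):
        subprocess.run(command, check=True)
```

Building the command lists separately from running them makes it easy
to print and inspect them before anything is executed.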

As an example, I have generated a PDF [2] by
1. Entering the titles of the articles I want to have
2. Choosing "DocBook PDF" as output format
3. Clicking "Convert"
4. Waiting for the PDF to open
Really, that easy! :-)

I am well aware of some shortcomings of the example PDF. However, most
of them (no left margin, gigantic tables, misshapen images) are flaws of
DocBook or of the default stylesheets I use. I'm not really familiar
with DocBook and hope for help from people who are.

While the converter seems to work pretty well, I'm sure there are lots
of fun bugs to find. If you do find a page that breaks, please mail me
the title so I can find the bug, or even better, fix it yourself! The
code is in CVS, "wiki2xml" module, "php" directory (ignore the old C
code in the main directory;-)

A word about speed: Yes, the process of creating a PDF takes some time.
However, most of it is DocBook at work, and of course the loading times
for articles and templates. Converting the example from wiki markup to
XML to DocBook XML to PDF takes 2 minutes 20 seconds total, but the
actual conversion wiki-to-XML is done in just 8 seconds.
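
To put those two figures side by side, a few lines of Python do the
arithmetic. Only the 2 minutes 20 seconds and the 8 seconds come from
the measurement above; the variable names are mine.

```python
total_seconds = 2 * 60 + 20        # whole run: wiki markup -> XML -> DocBook -> PDF
wiki_to_xml_seconds = 8            # the wiki-to-XML step alone

# Time spent on everything else: loading pages and templates, DocBook, PDF.
remaining_seconds = total_seconds - wiki_to_xml_seconds
share = wiki_to_xml_seconds / total_seconds

print(f"wiki-to-XML: {share:.1%} of the run; everything else: {remaining_seconds} s")
# prints "wiki-to-XML: 5.7% of the run; everything else: 132 s"
```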

Apart from bug fixing, my next priority is ODT (OpenOffice) format
output. Also, I would like to extend Special:Export in MediaWiki so it
can return a list of authors, which can then be added automagically to
all converted files.

Awaiting your feedback,
Magnus

[1] http://magnusmanske.de/wiki2xml/w2x.php
[2] http://magnusmanske.de/wiki2xml/Biology_topics.pdf (3.7 MB!)

