On Thu, 2006-06-15 at 11:47 +0200, Andy Rabagliati wrote:
On Wed, 14 Jun 2006, SJ wrote:
I have a related, orthogonal, request, regarding the process of
assembling a CD or other snapshot. My interest is less to do with
quality, and more to do with the process. My end result is either a CD,
or a Plucker document for PalmOS.
http://en.wikipedia.org/wiki/Wikipedia_talk:Version_0.5
I see the job as too big to be done via hand selection.
I agree, kinda. Too big for maybe one person, but we have a lot of
hands.
I am also more interested in coverage than quality - I figure the
quality will just get better.
Yes, I am too. For 0.5 and 1.0, both are important though.
So, I want automated methods, both for selecting good coverage, and
(less important at the moment) version selection. I also would like to
target a size - 128Meg, 512Meg, 600Meg, 1Gig, 4Gig.
I really like this. Thumbdrive? Check. CD? Check. DVD? Check. My
laptop? Check.
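For what it's worth, targeting a size could start out as a simple greedy pass: rank the candidate articles, then keep adding them until the budget is used up. This is only a rough sketch. The (title, size, priority) tuples and the function name select_for_budget are made up for illustration; nothing in the existing tools produces them as-is.

```python
def select_for_budget(articles, budget_bytes):
    """Greedily pick articles that fit within a size budget.

    articles: iterable of (title, size_bytes, priority) tuples.
              These fields are hypothetical; priority would come from
              whatever coverage ranking is eventually agreed on.
    Returns (list_of_titles, total_bytes_used).
    """
    chosen = []
    used = 0
    # Highest priority first; ties broken by title so runs are repeatable.
    for title, size, _priority in sorted(articles, key=lambda a: (-a[2], a[0])):
        # Skip anything that would overflow the target and keep looking,
        # since a smaller, lower-priority article may still fit.
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen, used
```

Calling it once per target (128Meg, 512Meg, and so on) with the same ranked list would give one article list per medium. Greedy selection is not optimal packing, but it keeps the highest-priority articles in every edition.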
I am also interested in post-processing - stripping redlinks, including
''main article'' references on core articles, like ''History of South
Africa'' etc. I want to be able to tweak parameters, then press a button
and get a new CD (from my downloaded XML dump of en and a picture
collection, and possibly via a live mediawiki snapshot of that content).
Live? How did you want to do this? Perhaps using the toolserver?
This is what I have tried, mostly with available tools, and a bit of perl.
* Download recent XML dump.
* Download list of articles from category (currently using the WPCD template)
* Trim the full dump to the above article list (natively performed by mwdumper
--exactlist)
* Import this to mysql
* Import (full) category dump to mysql (sql dump downloaded from wikipedia)
* Use mediawiki/maintenance/dumpHTML.php to convert this to HTML
* perl script removes categories with fewer than four included items from HTML dump
* redlink removal by un-anchoring HTML with class=new (red links) -
but not Categories (that always seem to appear red)
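
To make the redlink step concrete, here is a minimal sketch of the un-anchoring idea in Python rather than perl. It is not the script described above. It assumes that red links are emitted as <a> tags with a quoted class attribute containing "new", and that category links can be recognised by "Category:" appearing somewhere in the tag's attributes (href or title). Both are assumptions about the dumpHTML output that would need checking against a real dump.

```python
import re

# Any <a ...>...</a> element; group 1 is the attribute string, group 2 the body.
ANCHOR = re.compile(r'<a\b([^>]*)>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# A quoted class attribute; group 1 is its value.
CLASS_ATTR = re.compile(r'\bclass\s*=\s*"([^"]*)"', re.IGNORECASE)


def is_redlink(attrs):
    """True if the attribute string carries class "new" (assumed red-link marker)."""
    match = CLASS_ATTR.search(attrs)
    return bool(match) and "new" in match.group(1).split()


def strip_redlinks(html):
    """Replace red-link anchors with their inner text, leaving category links alone."""
    def unanchor(match):
        attrs, body = match.group(1), match.group(2)
        # Assumption: category links mention "Category:" in href or title.
        if is_redlink(attrs) and "Category:" not in attrs:
            return body
        return match.group(0)
    return ANCHOR.sub(unanchor, html)
```

A regular expression is fragile against unusual markup (unquoted attributes, nested tags), so a real HTML parser may be the safer choice if the dump turns out to be irregular.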
Problems I have come across:-
* templates (particularly <nowiki>{{main|History of Country}}</nowiki>
and the like) do not make it through dumpHTML.php. Maybe I have to
hack the php.
* Remove all the dross at the end, like inter-wiki links.
You don't want those? Why not? (assuming they reflect a link within
the dump)
Could this be done by tweaking the CSS from dumpHTML?
I don't know, but I would like to reproduce your steps. Can you show me
your perl and any other special things you are using?
Kyle