[Wikipedia-l] [Foundation-l] moving forward on article validation

Andy Rabagliati andyr at wizzy.com
Thu Jun 15 09:47:13 UTC 2006


On Wed, 14 Jun 2006, SJ wrote:

> Note that the "Wikipedia 0.5" WikiProject on en:wp is tackling this
> issue with some energy, and could use more input and nominations:
> 
> http://en.wikipedia.org/wiki/Wikipedia:Version_0.5_Nominations
> 

I have a related but orthogonal request regarding the process of
assembling a CD or other snapshot. My interest is less in quality
and more in the process. My end result is either a CD or a Plucker
document for PalmOS.

http://en.wikipedia.org/wiki/Wikipedia_talk:Version_0.5

I see the job as too big to be done by hand selection. I am also more
interested in coverage than quality - I figure the quality will just get
better. So I want automated methods, both for selecting good coverage
and (less important at the moment) for version selection. I would also
like to target a size - 128 MB, 512 MB, 600 MB, 1 GB or 4 GB. I am also
interested in post-processing - stripping redlinks, including ''main
article'' references on core articles, like ''History of South Africa'',
etc. I want to be able to tweak parameters, then press a button and get
a new CD (from my downloaded XML dump of en and a picture collection,
and possibly via a live MediaWiki snapshot of that content).
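
To make the "target a size" idea concrete, here is a minimal sketch of
one possible approach: a greedy pick of the highest-scoring articles
that still fit a byte budget. The per-article score, the function name
and the tuple layout are all assumptions for illustration, not part of
any existing tool.

```python
from typing import Iterable, List, Tuple

# Size targets from the message above, in bytes (binary megabytes assumed).
TARGETS = {
    "128MB": 128 * 2**20,
    "512MB": 512 * 2**20,
    "600MB": 600 * 2**20,
    "1GB": 2**30,
    "4GB": 4 * 2**30,
}


def select_for_budget(
    articles: Iterable[Tuple[str, int, float]], budget_bytes: int
) -> Tuple[List[str], int]:
    """Greedily pick articles, highest score first, skipping any that
    would push the total past the budget.

    Each article is a (title, size_bytes, score) tuple. How the score is
    computed (inbound links, category membership, ...) is left open here.
    """
    chosen: List[str] = []
    used = 0
    # Sort by descending score; ties are broken by title for repeatability.
    for title, size, _score in sorted(articles, key=lambda a: (-a[2], a[0])):
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen, used
```

Greedy selection is not optimal (this is a knapsack problem), but it is
simple, repeatable, and easy to re-run after tweaking the scores.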

This is what I have tried, mostly with available tools and a bit of Perl.

* Download a recent XML dump.
* Download the list of articles from a category (currently using the WPCD template).
* Trim the full dump to the above article list (natively supported by mwdumper's exactlist filter).
* Import this into MySQL.
* Import the (full) category dump into MySQL (SQL dump downloaded from Wikipedia).
* Use mediawiki/maintenance/dumpHTML.php to convert this to HTML.
* A Perl script removes categories with fewer than four included items from the HTML dump.
* Redlink removal by un-anchoring HTML links with class=new (red links) -
  but not Categories (which always seem to appear red).
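
The last two steps can be sketched roughly as follows. This is Python
rather than the Perl actually used, and it rests on assumptions: that
red links are rendered as <a> tags carrying the class "new", and that
category links can be recognised by a title attribute starting with
"Category:". The function names are made up for the example.

```python
import re
from typing import Dict, List, Sequence

ANCHOR_RE = re.compile(r"<a\b([^>]*)>(.*?)</a>", re.IGNORECASE | re.DOTALL)
CLASS_RE = re.compile(r'''\bclass\s*=\s*(?:"([^"]*)"|'([^']*)')''', re.IGNORECASE)
TITLE_RE = re.compile(r'''\btitle\s*=\s*(?:"([^"]*)"|'([^']*)')''', re.IGNORECASE)


def _attr(pattern: "re.Pattern[str]", attrs: str) -> str:
    """Return the quoted attribute value matched by pattern, or ''."""
    m = pattern.search(attrs)
    return (m.group(1) or m.group(2) or "") if m else ""


def unanchor_redlinks(html: str) -> str:
    """Replace red links with their plain text, leaving category links alone."""

    def repl(m: "re.Match[str]") -> str:
        attrs, text = m.group(1), m.group(2)
        is_red = "new" in _attr(CLASS_RE, attrs).split()
        # Assumption: category links carry title="Category:...".
        is_category = _attr(TITLE_RE, attrs).startswith("Category:")
        return text if is_red and not is_category else m.group(0)

    return ANCHOR_RE.sub(repl, html)


def categories_to_drop(
    members: Dict[str, Sequence[str]], minimum: int = 4
) -> List[str]:
    """List categories with fewer than `minimum` included articles."""
    return sorted(c for c, items in members.items() if len(items) < minimum)
```

A regular expression is fragile against unusual markup; a real HTML
parser would be the safer choice if the dump output varies much.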

Problems I have come across:

* Templates (particularly {{main|History of Country}} and the like) do
  not make it through dumpHTML.php. Maybe I have to hack the PHP.
* I still need to remove all the dross at the end of each page, such as
  inter-wiki links.
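
For the template problem, one alternative to hacking the PHP would be
to rewrite the template calls in the wikitext before import. The sketch
below is only an approximation: the output wording and layout are my
guess at what the template renders, not its actual definition, and the
function name is hypothetical.

```python
import re

# Matches {{main|Title}} or {{Main|Title 1|Title 2}} with no nested templates.
MAIN_RE = re.compile(r"\{\{\s*[Mm]ain\s*\|([^{}]*)\}\}")


def expand_main(wikitext: str) -> str:
    """Replace {{main|...}} calls with plain wiki links."""

    def repl(m: "re.Match[str]") -> str:
        # Keep positional parameters only; named ones (with "=") are skipped.
        titles = [
            t.strip() for t in m.group(1).split("|") if t.strip() and "=" not in t
        ]
        if not titles:
            return m.group(0)  # nothing usable, leave the call untouched
        label = "Main article" if len(titles) == 1 else "Main articles"
        links = ", ".join("[[%s]]" % t for t in titles)
        return ":''%s: %s''" % (label, links)

    return MAIN_RE.sub(repl, wikitext)
```

This only covers one template by name; anything more general would
need the real template text from the dump.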

Could the latter be done by tweaking the CSS from dumpHTML?
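
If CSS turns out to be enough, the tweak could be scripted as below.
This is a sketch under assumptions: the selector "#p-lang" is what I
believe the MonoBook skin uses for the language-links box, and the
stylesheet path is whatever the dump actually produced. Both need
checking against the real output.

```python
from typing import Iterable


def hide_rules(selectors: Iterable[str]) -> str:
    """Build one 'display: none' rule per CSS selector."""
    return "".join("%s { display: none; }\n" % s for s in selectors)


def append_hide_rules(css_path: str, selectors: Iterable[str]) -> None:
    """Append the hiding rules to an existing stylesheet."""
    with open(css_path, "a", encoding="utf-8") as f:
        f.write("\n/* hide elements not wanted in the snapshot */\n")
        f.write(hide_rules(selectors))
```

Hiding with CSS leaves the markup in every page, so it saves no space,
and a converter that ignores CSS would still show the links. Stripping
the elements from the HTML would avoid both issues.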

Cheers,     Andy!


