Kyle Anderson wrote:
On Thu, 2006-06-15 at 11:47 +0200, Andy Rabagliati wrote:
I am also interested in post-processing - stripping redlinks, including
''main article'' references on core articles, like ''History of South Africa''
etc. I want to be able to tweak parameters, then press a button and get
a new CD (from my downloaded XML dump of en and a picture collection,
and possibly via a live mediawiki snapshot of that content).
Live? How did you want to do this? Perhaps using the toolserver?
To shamelessly plug one of my toys again:
Included in my Wiki-to-XML-to-(insert your favourite format) is a script
that reads an XML dump and converts it into lots of single files, one
per article. This "file heap" can then be browsed using the
Wiki-to-XML-to-XHTML converter. It thus only depends on PHP and a web
server, which should be easy to bundle with a CD/DVD.
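Roughly, the splitting step works like this (just a sketch, not the actual
wiki2xml code - the file names dump.xml and heap/ are made up):

<?php
// Sketch: split a MediaWiki XML dump into one plain file per article.
$reader = new XMLReader();
$reader->open('dump.xml');                       // path to the dump (made up)

// skip forward to the first <page> element
while ($reader->read() && $reader->name !== 'page');

while ($reader->name === 'page') {
    // copy the current <page> into a DOM tree to read title and text
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($reader->expand(), true));

    $title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
    $texts = $doc->getElementsByTagName('text');
    $text  = $texts->length ? $texts->item(0)->nodeValue : '';

    // urlencode() keeps the title safe as a file name
    file_put_contents('heap/' . urlencode($title) . '.txt', $text);

    if (!$reader->next('page')) {                // jump to the next <page> sibling
        break;
    }
}
$reader->close();
?>

XMLReader streams the dump, so it handles even the full en dump one page
at a time instead of loading the whole thing into memory.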
I also had the whole "file heap" indexed with Lucene, but didn't bother
to hack a search frontend into the generated XHTML. Otherwise, I'd have
had fulltext search as well.
A next step would be to output the "file heap" as compressed files to
save disk space. Further, these compressed files could be merged into
one single huge file, since the file system has a little trouble coping
with that many files. That would require a separate file with title and
position-in-file indices, though.
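That could look roughly like this (again just a sketch, with made-up file
names articles.pack and articles.idx):

<?php
// Sketch: pack the "file heap" into one big file of individually
// compressed articles, plus an index of title / offset / length.
$pack  = fopen('articles.pack', 'wb');
$index = fopen('articles.idx', 'w');

foreach (glob('heap/*.txt') as $file) {
    $title = urldecode(basename($file, '.txt'));
    $blob  = gzcompress(file_get_contents($file), 9);   // each article on its own

    // index line: title TAB byte offset TAB compressed length
    fwrite($index, $title . "\t" . ftell($pack) . "\t" . strlen($blob) . "\n");
    fwrite($pack, $blob);
}
fclose($pack);
fclose($index);
?>

The index stays small, since it only holds title, offset and compressed
length per article.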
Note that one can't just use the compressed dump file directly, as
seeking in a compressed file is slow (it has to decompress everything
that comes before the position you seek). So I'd go for a single file
consisting of individually compressed files stacked on top of each other.
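Retrieval then only needs the index, one fseek() and one gzuncompress() -
something like (same made-up file names as above):

<?php
// Sketch: fetch one article from the packed file without touching
// anything else - seek to its offset, read its length, inflate it.
function fetch_article($wanted, $packFile = 'articles.pack', $idxFile = 'articles.idx') {
    foreach (file($idxFile) as $line) {
        list($title, $offset, $length) = explode("\t", rtrim($line, "\n"));
        if ($title === $wanted) {
            $fh = fopen($packFile, 'rb');
            fseek($fh, (int)$offset);
            $blob = fread($fh, (int)$length);
            fclose($fh);
            return gzuncompress($blob);    // only this article gets decompressed
        }
    }
    return null;                           // title not in the index
}

echo fetch_article('History of South Africa');
?>

The linear scan over the index is the lazy part; sorting the index by title
and doing a binary search (or loading it into a hash once) would be the
obvious next step.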
Templates work fine for that solution, categories don't - yet.
Magnus
P.S.: SVN, module "wiki2xml", directory "php", for those interested.