On Tuesday 20 May 2003 03:53, Alfio Puglisi wrote:
> I just subscribed (I'm the wikipedia user At18) to ask about the
> automatic html dump function. I see from the database page that it's
> "in development".
Welcome!
> If anyone is interested, I have a rudimentary Perl script that is
> capable of reading the downloadable SQL dump and outputting all the
> articles as separate files in a number of alphabetical directories.
> It's not very fast, but it works.
>
> What's missing from the script: wikimarkup -> HTML conversion, some
> intelligence to autodetect redirects, dealing with images, and so on.
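For the redirect autodetection, here is a minimal sketch in Perl of
roughly what's involved -- the function names and the bucketing scheme
are illustrative guesses of mine, not the actual script:

  use strict;
  use warnings;

  # A MediaWiki redirect page starts with "#REDIRECT [[Target]]",
  # case-insensitively.  Returns the target title, or undef.
  sub redirect_target {
      my ($text) = @_;
      return $text =~ /^\s*#REDIRECT\s*\[\[([^\]|]+)/i ? $1 : undef;
  }

  # Bucket a title into an alphabetical directory: first letter,
  # uppercased; anything outside A-Z goes into "other".
  sub bucket_dir {
      my ($title) = @_;
      my $c = uc substr($title, 0, 1);
      return $c =~ /[A-Z]/ ? $c : 'other';
  }

A page whose text matches could then be written out as a small redirect
stub pointing at the target, instead of as a full article.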
> I don't know if someone is in charge of this function. If so, I can
> post the script. Otherwise, I can develop it further myself, given
> some directions.
Cool! I don't think anyone's really actively working on this at the
moment, so if you'd like to, that would be great.
A few things to consider:
Last year someone started on a static HTML dump system with a hacked-up
version of the wiki code and some post-processing, but never quite
finished it up. I don't think he posted the code, but if you can get
ahold of him he may still have it available:
http://mail.wikipedia.org/pipermail/wikitech-l/2002-November/001292.html
There's also a partial, very experimental offline reader program which
sucks the data out of the dump files. This includes a simplified wiki
parser which, I believe, outputs HTML to use in the wxWindows HTML
viewer widget:
http://meta.wikipedia.org/wiki/WINOR
This may be useful to you.
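To give the flavor of what such a simplified parser does, here's a toy
wikitext-to-HTML converter in Perl; it covers only a few constructs and
is my own illustration, not WINOR's actual code:

  use strict;
  use warnings;

  # Convert a tiny subset of wiki markup to HTML.  Real articles need
  # far more (lists, tables, nesting), but this is the general shape.
  sub wiki_to_html {
      my ($text) = @_;
      # escape raw HTML first
      $text =~ s/&/&amp;/g;
      $text =~ s/</&lt;/g;
      $text =~ s/>/&gt;/g;
      # '''bold''' before ''italic'', so the quote runs pair up right
      $text =~ s{'''(.+?)'''}{<b>$1</b>}g;
      $text =~ s{''(.+?)''}{<i>$1</i>}g;
      # [[Page|label]] and [[Page]] internal links; the href scheme is
      # made up here and would have to match the dump's filenames
      $text =~ s{\[\[([^\]|]+)\|([^\]]+)\]\]}{<a href="$1.html">$2</a>}g;
      $text =~ s{\[\[([^\]]+)\]\]}{<a href="$1.html">$1</a>}g;
      # == section headings ==
      $text =~ s{^==\s*(.+?)\s*==\s*$}{<h2>$1</h2>}mg;
      # blank lines separate paragraphs
      $text =~ s{\n\s*\n}{</p>\n<p>}g;
      return "<p>$text</p>\n";
  }

The ordering of the substitutions matters; run them in the wrong order
and pages get mangled, which is a big part of why a full parser is more
work than it looks.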
The latest revisions of the wikipedia code can cache the rendered HTML
pages, but it's not clear how easy that would be to adapt for
generating static output.
A couple of the big questions that have come up before are:
* filenames -- making sure they stay within reasonable limits on
common filesystems, keeping in mind that non-ASCII characters and
case sensitivity may be handled differently on different OSs, and some
impose stricter limits on filename length (one sanitizing approach is
sketched after this list).
* search -- an offline search would be very useful for an offline
reader. JavaScript, Java, and local programs are all possibilities.
* size! -- with markup and header and footer text tacked onto every
page, a static HTML dump can be very large. The English wiki could at
this point approach or exceed the capacity of a CD-ROM uncompressed.
Is there a way to store the data compressed and still keep it
accessible to common web browsers reading the filesystem directly?
Less important for a mirror site than for a CD, perhaps.
* interlanguage links -- it would be nice to be able to include all
languages in a single browsable tree, with appropriate cross-links.
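On the filenames point, here is one possible sanitizing scheme, sketched
in Perl -- the length cutoff and the digest trick are arbitrary choices
of mine, not anything agreed on:

  use strict;
  use warnings;
  use Digest::MD5 qw(md5_hex);

  # Map an article title to a conservative, filesystem-safe filename.
  sub safe_filename {
      my ($title) = @_;
      my $name = $title;
      # Percent-encode every byte outside a small ASCII set, so spaces,
      # slashes and non-ASCII characters can't upset any filesystem.
      # (This does NOT solve case collisions: "Foo" and "FOO" still
      # clash on case-insensitive filesystems; encoding capitals too,
      # or always appending the digest below, would cover that.)
      $name =~ s{([^A-Za-z0-9_\-])}{sprintf '%%%02X', ord $1}ge;
      # Stay well under common 255-byte limits; when truncating, append
      # a short digest of the full title so distinct titles stay distinct.
      if (length($name) > 80) {
          $name = substr($name, 0, 70) . '-' . substr(md5_hex($title), 0, 8);
      }
      return "$name.html";
  }

Percent-encoding also has the nice property of being reversible, so the
original title can be recovered from an untruncated filename.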
-- brion vibber (brion @ pobox.com)