-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Finally got a chance to look at Alfio's script; here's some
non-copyright commentary. :)
On Thursday 29 May 2003 02:15, Alfio Puglisi wrote:
Looks nice! Cleaner interface than we have, too. ;)
- filenames should be OK for most filesystems not "8.3" limited
(max 63 chars, only a-z, 0-9 and underscore)
- despite the two-letter subdirectories, some of them have over 4,000
files in them!
Letters aren't distributed evenly, alas... If you want to even that out,
consider using a binary hash as the basis of the divisions. (We use the
first one and two hex digits of the MD5 hash of the title/filename for
the uploads and the rendered page cache, for instance.) The resulting
directory names aren't pretty, though.
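A minimal sketch of that kind of hashed layout, in Python for illustration only. The function name, the ".html" extension and the exact "x/xy/" nesting are assumptions made for the example, not a description of the actual upload or cache code:

```python
import hashlib

def hashed_path(title: str, ext: str = ".html") -> str:
    """Return 'x/xy/<title><ext>', where x and xy are the first one and
    two hex digits of the MD5 hash of the title (hypothetical layout)."""
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{title}{ext}"
```

Since MD5 output is close to uniform, this gives 16 top-level and 256 second-level directories of roughly equal size, whatever the distribution of first letters in the titles.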
- Time: the script takes more than 2 hours on my 1.3 GHz Athlon...
Whee... I'm gonna have to get a faster CPU...
- Size: this dump is about 800MB (tar.gz is just 110MB). I think that I
can bring it down to 600-650MB with a bit of trimming and eliminating
unnecessary redirects. BUT, without some form of compression, the
English Wikipedia will soon overflow a single CD. Maybe we should
target DVDs? :-)
A single CD would be preferable, of course, though a static HTML dump
can also target mirror sites, which don't have that limitation.
Possibilities:
* Nearly all (all?) browsers these days can read files sent in gzip
encoding. But can we do this on the filesystem reliably, where we have
no opportunity to send an encoding header? Browsers seem to treat
.html.gz files as application/gzip and send them to an external app,
and don't internally recognize a gzipped file if it's just named .html.
* Self-extracting JavaScript. :) I'm sure someone, somewhere has done
this; if not it's worth it for the evil factor: rewrite gunzip in
JavaScript, and have the content of the HTML files be a <script> tag
with a big string and a call to the gunzip() function pulled in from a
common .js file. Downsides are likely crappy performance and an
inability to function in non-JavaScript browsers.
* Use transparent filesystem compression on the CD. I think only Linux
supports this, and only if certain options are enabled in the kernel
configuration; not portable, but might be nice for personal use.
* Ship a server. Java applet or light executable for many platforms
which serves out the pages with the appropriate encoding header.
Downsides: hard to make portable, can't just browse the filesystem.
* Offline reader program with its own storage and display methods.
Again, hard to make portable, can't just browse the filesystem.
(cf. http://meta.wikipedia.org/wiki/WINOR )
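To make the "ship a server" option concrete, here is a rough sketch in Python of a tiny local server that hands out pre-gzipped pages with a Content-Encoding header. It is an illustration only: the directory name "dump", the port, and the names resolve(), GzipDumpHandler and serve() are all made up for the example, and it assumes every page is stored on disk as <name>.html.gz:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import unquote, urlsplit

DUMP_ROOT = Path("dump")  # hypothetical directory holding *.html.gz files

def resolve(root: Path, url_path: str):
    """Map a URL path to the matching .gz file under root.
    Returns None if the file is missing or lies outside root."""
    rel = unquote(urlsplit(url_path).path).lstrip("/") or "index.html"
    candidate = (root / (rel + ".gz")).resolve()
    if root.resolve() not in candidate.parents or not candidate.is_file():
        return None
    return candidate

class GzipDumpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = resolve(DUMP_ROOT, self.path)
        if target is None:
            self.send_error(404)
            return
        body = target.read_bytes()  # already gzip-compressed on disk
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # This header is what lets the browser decompress transparently.
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000):
    """Serve DUMP_ROOT on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), GzipDumpHandler).serve_forever()
```

Calling serve() and pointing a browser at http://127.0.0.1:8000/ would be the whole user experience. The portability downside stands: something has to run this on each platform.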
- Search: I tried a JavaScript search that worked well for small
sized databases: it's basically a big array of strings (article
titles and filenames) with some lines of code that do a regexp match
against them. For full-sized databases like this one, the search page
becomes an 8-megabyte monster that takes forever to process (IE
grabs 100 MB of memory and stops there, Opera is even worse). I'll
see if I can find a different solution.
Here are some thoughts, and since I'm not going to have time to
implement this myself for a while, take it or leave it as you like:
It may be better to go with something similar to MySQL's fulltext search
index: break the titles into words, and associate words with lists of
pages that contain them rather than full titles with their page names.
Instead of regexping a hundred thousand strings, you'd only need to
break the query into words, fetch the lists of pages for _just those
words_, and intersect or union the results as desired.
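A small sketch of that word-to-pages index, written in Python just to show the shape of the data (a real version for the CD would have to be JavaScript). The names tokenize(), build_index() and search(), and the simple a-z/0-9 tokenizer, are assumptions for the example:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(titles):
    """titles maps page filename -> article title.
    Returns a dict mapping each word to the set of filenames using it."""
    index = defaultdict(set)
    for filename, title in titles.items():
        for word in tokenize(title):
            index[word].add(filename)
    return index

def search(index, query, mode="and"):
    """Fetch the page lists for just the query words, then intersect
    (mode='and') or union (any other mode) them."""
    words = tokenize(query)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    if mode == "and":
        return set.intersection(*postings)
    return set.union(*postings)
```

The cost of a query then depends on the number of query words and the length of their page lists, not on the total number of titles.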
Space/memory could probably be saved using prefix codes of some sort...
Also, you could break up the index into several smaller files so not all
strings need to be loaded into memory. I don't recall JavaScript having
an include() command, but in the worst case you could pull some kind of
<frameset> or <iframe> thing and bring up the necessary sub-scripts in
another frame.
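As a sketch of the split-index idea, again in Python for illustration: the index is written out as one small file per leading character, and a lookup reads only the one file it needs. The file naming (idx_<char>.json), the JSON format and the function names are assumptions for the example, and words are assumed to be non-empty and already lowercased:

```python
import json
from pathlib import Path

def shard_key(word):
    """Pick the shard by the word's first character; anything that is
    not an ASCII letter or digit goes into a catch-all '_' shard."""
    first = word[0].lower()
    return first if first.isascii() and first.isalnum() else "_"

def write_shards(index, out_dir):
    """index maps word -> iterable of page filenames.
    Writes one idx_<key>.json file per shard into out_dir."""
    shards = {}
    for word, pages in index.items():
        shards.setdefault(shard_key(word), {})[word] = sorted(pages)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for key, part in shards.items():
        (out / f"idx_{key}.json").write_text(json.dumps(part), encoding="utf-8")

def lookup(word, out_dir):
    """Load only the shard that could contain word and return its pages."""
    if not word:
        return set()
    path = Path(out_dir) / f"idx_{shard_key(word)}.json"
    if not path.is_file():
        return set()
    shard = json.loads(path.read_text(encoding="utf-8"))
    return set(shard.get(word, []))
```

Splitting on the first character inherits the uneven letter distribution, so some shards will be much bigger than others; keying the shards on a hash of the word would even them out at the cost of less readable file names.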
- -- brion vibber (brion @ pobox.com)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
iD8DBQE+2lgdxVlOmwh1xjgRAvWRAJ9h3bqxYXAkLc4x09moo9KGLtGiKQCgiiD2
c7LtH7a05VbtmvIW8G/FvZk=
=MnDw
-----END PGP SIGNATURE-----