On Sun, 1 Jun 2003, Brion Vibber wrote:
> On Thursday, 29 May 2003 02:15, Alfio Puglisi wrote:
> Looks nice! Cleaner interface than we have, too. ;)
Being static, lots of the dynamic stuff and special pages didn't need to
be on the topbar :-)
> Letters aren't distributed evenly, alas... If you want to even that out,
> consider using a binary hash as the basis of the divisions. (We use the
> first one and two hex digits of the md5 hash of the title/filename for
> the uploads and the rendered page cache, for instance.) They're not
> pretty, though.
I'll save this for when/if it becomes a real problem.
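For reference, the hash-bucket scheme Brion describes can be sketched like this (the function name and `depth` parameter are my own, not from MediaWiki itself):

```python
import hashlib

def bucket_path(title, depth=2):
    """Return a subdirectory path like 'x/xy', built from the first
    one and two hex digits of the md5 hash of the title, so files are
    spread evenly across directories regardless of their initial letter."""
    h = hashlib.md5(title.encode("utf-8")).hexdigest()
    return "/".join(h[:i + 1] for i in range(depth))

print(bucket_path("Rome"))  # prints a two-level path like 'x/xy'
```

The same title always hashes to the same bucket, so links can be computed statically; the trade-off, as noted, is that the directory names are opaque hex rather than readable letters.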
>> - Size: this dump is about 800MB. (tar.gz is just 110MB).
>> [...]
> Single CD would be preferable, of course, though a static HTML dump can
> target mirror sites which don't have that limitation as well.
The new version of the script (not online yet) produces a dump that,
according to Nero, can be written to a 650MB CD. The main reasons are a
smaller HTML template and the elimination of redirects (though they are
still present for searches).
> * Self-extracting JavaScript. :) I'm sure someone, somewhere has done
> this; if not, it's worth it for the evil factor: rewrite gunzip in
> JavaScript, and have the content of the HTML files be a <script> tag
> with a big string and a call to the gunzip() function pulled in from a
> common .js file. Downsides are likely crappy performance and an
> inability to function in non-JavaScript browsers.
So I wasn't the only one thinking about this :) A few days ago I did a
little Google search, but found nothing. I'm also *sure* that someone has
already written this. Again, I'll postpone it to a future version.
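The underlying mechanism — shipping each page as a compressed string plus a shared decompressor — can be sketched in Python for clarity (the actual proposal would of course need gunzip reimplemented in JavaScript; the variable names and sample HTML here are illustrative):

```python
import base64
import gzip

# A stand-in for one article's HTML; repetitive markup compresses well.
html = "<html><body>" + "<p>An article paragraph.</p>" * 50 + "</body></html>"

# "Self-extracting" idea: store the gzipped, base64-encoded body as a
# big string in the page...
packed = base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")

# ...and inflate it on load via a shared gunzip() routine
# (done in Python here; the proposal would do this in a common .js file).
unpacked = gzip.decompress(base64.b64decode(packed)).decode("utf-8")

assert unpacked == html
print(len(html), len(packed))  # the packed string is much smaller
```

Note that base64 adds roughly one third overhead on top of the compressed size, which eats into the savings for already-small articles.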
I should point out that the main reason for the size bloat is the
proliferation of small files. Combined with the 2048-byte cluster size
of CD-ROM filesystems, it means that each article uses at least 2K, and
an average of 1K is wasted on bigger files. Just counting bytes, the HTML
version is around 490MB. So maybe some way to bundle files together
(maybe using frames and #anchors, or now-you-see-it-now-you-don't effects
in JavaScript :) could pay off.
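A quick back-of-the-envelope check of that cluster overhead (the file count below is an illustrative assumption, not a measured figure from the dump):

```python
import math

CLUSTER = 2048  # cluster size of CD-ROM filesystems, in bytes

def on_disk_size(file_size):
    """Space a file actually consumes on the CD: every file is
    rounded up to a whole number of 2K clusters (minimum one)."""
    return max(1, math.ceil(file_size / CLUSTER)) * CLUSTER

# A 500-byte stub article still costs a full 2K cluster:
print(on_disk_size(500))  # 2048

# With ~1K wasted per file on average, some 300,000 small files
# (an assumed count) would waste on the order of 300MB:
print(300_000 * (CLUSTER // 2))  # 307200000 bytes, roughly 293MB
```

That order of magnitude matches the gap between the ~490MB of actual bytes and the ~800MB dump, which is why bundling many articles into fewer files could pay off.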
> It may be better to go with something similar to MySQL's fulltext search
> index: break the titles into words, and associate words with lists of
> pages that contain them, rather than full titles with their page names.
> Instead of regexping a hundred thousand strings, you'd only need to
> break the query into words, fetch the lists of pages for _just those
> words_, and intersect or union the results as desired.
Hmm, this seems neat. Now, how many different words are there in the
average Wikipedia dump? :)) Also, a frameset would be necessary, as in the
next option, barring some black-magic communication between JavaScript
pages.
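A minimal sketch of the inverted-index idea (in Python for brevity; the titles and function names are mine, and a real static dump would serialize the word-to-pages map out as .js data):

```python
titles = ["Rome", "Ancient Rome", "Rome, Georgia", "Georgia"]

# Build the inverted index: lowercase word -> set of titles containing it.
index = {}
for title in titles:
    for word in title.replace(",", " ").lower().split():
        index.setdefault(word, set()).add(title)

def search(query):
    """AND search: intersect the page sets for each query word,
    instead of regexping every title string."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(sorted(search("rome georgia")))  # ['Rome, Georgia']
```

An OR search is the same loop with `set.union`; either way, only the lists for the query's own words are ever touched.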
> Also, you could break up the index into several smaller files so not all
> strings need to be loaded into memory. I don't recall JavaScript having
> an include() command, but in the worst case you could pull some kind of
> <frameset> or <iframe> thing and bring up the necessary sub-scripts in
> another frame.
This is what I was thinking of doing as the next step. The include() can
be hacked up, but memory use would still add up to the original size. Some
neat frameset should do the trick.
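Splitting the index into smaller per-letter files, as suggested above, might look like this (the shard layout and `index_<letter>.js` naming are hypothetical, and the sample index is illustrative):

```python
import json
import os
import tempfile

# A small sample word -> pages index.
index = {"rome": ["Rome", "Ancient Rome"],
         "ancient": ["Ancient Rome"],
         "georgia": ["Georgia"]}

outdir = tempfile.mkdtemp()

# Shard by first letter, so a search for "rome" only needs to load
# the small "r" file instead of the whole index.
shards = {}
for word, pages in index.items():
    shards.setdefault(word[0], {})[word] = pages

for letter, part in shards.items():
    with open(os.path.join(outdir, f"index_{letter}.js"), "w") as f:
        # Each shard is a plain JS assignment that a frame or iframe
        # can pull in on demand, standing in for a missing include().
        f.write("var INDEX = " + json.dumps(part) + ";")

print(sorted(os.listdir(outdir)))  # ['index_a.js', 'index_g.js', 'index_r.js']
```

Loading a shard into a hidden frame then gives the searching page one small `INDEX` object per query, keeping memory use proportional to the shard rather than the full index.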
Next version online soon :-)
Ciao,
Alfio