Tomasz Wegrzanowski queried:
What about storing all markup and all rendered (or semi-rendered)
html files on disk, under names being md5 hashes of them,
and having database store only pointers and metadata?
Storing all markup & rendered (cached) files on disk is one of
the options I've been raising as a possible performance enhancement.
However, I don't see there being a real performance advantage to
storing by hash value, instead of simply using an encoded article name
as the filename. And, if you use the article name as the filename,
the system's underpinnings become MUCH clearer (making it easier to
debug, etc.).
You probably don't want to just store "all the files" in one big
directory anyway.
Most filesystems do not handle large directories efficiently -
in fact, some are implemented as a linear search, which would make
finding the hashed file a real problem. This is easily handled
by using the first few characters of the article name as a hashing
function, e.g., Europe is stored in "E/u/r/Europe.wk".
The hash used here is "imperfect", but as a programmer I'd be
grateful for such a simple system when things go wrong.
And it makes lots of processing easier ("process articles in
ASCIIbetical order" is trivial).
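
To make the layout concrete, here is a minimal sketch in Python of how an
article name might be mapped to such a path. Everything beyond the
"E/u/r/Europe.wk" example is my assumption for illustration: the function
name article_path, the use of percent-encoding as the "encoded article
name", escaping "." so names can never form "." or ".." components, and
padding names shorter than three characters with "_".

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def article_path(title, depth=3, ext=".wk"):
    """Map an article title to a nested relative path, e.g. E/u/r/Europe.wk.

    Hypothetical helper: the encoding and padding rules are assumptions,
    not part of any existing system.
    """
    # Percent-encode everything outside the unreserved set, so "/" and
    # other awkward characters cannot appear in a file name.
    encoded = quote(title, safe="")
    # quote() never escapes ".", so escape it by hand to rule out "." and
    # ".." path components (assumption).
    encoded = encoded.replace(".", "%2E")
    # The first `depth` characters pick the directories; short names are
    # padded with "_" (assumption) so every file sits at the same depth.
    prefix = encoded[:depth].ljust(depth, "_")
    # Unpacking the prefix yields one directory level per character.
    return str(PurePosixPath(*prefix, encoded + ext))


if __name__ == "__main__":
    for name in ("Europe", "Eu", "A/B"):
        print(name, "->", article_path(name))
```

Because the full encoded name is kept as the file name, two articles that
share a prefix land in the same directory without colliding, and anyone
can read a path and see which article it holds.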
Some filesystems _DO_ handle massive directories efficiently.
Reiser does well, for example. However, a design that works well
on arbitrary filesystems would have the advantage of letting you
switch between filesystems, depending on other factors.
If you're willing to accept implementations that limit the
code utility to specific filesystems (like Reiser),
I still don't see the advantage of hashed names - the
underlying filesystem will use its own hashing system anyway,
so you may as well use reasonable names.
However, maybe there's something I've missed.
If there is, please let me know!
Thanks...