On 12/5/06, Ivan Krstić <krstic(a)solarsail.hcs.harvard.edu> wrote:
George Herbert wrote:
If there's anything currently in Google which would seriously benefit
WMF, other than cash, it is likely that we could approach the right
people and get access to it.
WMF has, up to this point, been committed to building Wikipedia only
with free software. While it's exceedingly likely that we can obtain
some kind of privileged access to Google's technologies, it's also
almost certain those technologies would not be opened up to the public,
thus going against the principles at play here.
That said, there definitely are Google technologies that would be useful
to Wikipedia. BigTable and GFS are examples.
Google has released quite a bit of IP as freeware; whether it does so in a
given case depends on whether the technology is a key commercial competitive
advantage for them.
I hadn't been familiar with BigTable, which is sort of annoying since I'm
rather familiar with [[Sybase IQ]], the commercial column-based database,
and I've done a bit of evangelizing of column-based databases since I heard
of the idea.
I do know GFS rather well, from the technical papers on up.
At first glance, GFS doesn't seem relevant to Wikimedia Foundation work.
GFS is all about giving common access to a very large, petabyte-scale disk
store across a wide sea of systems. As I understand it, one whole
database+static content en.wikipedia dump is less than a terabyte, and can
fit on the local disks of a single 1-2U rack server. There's no reason for
there to be a giant shared filesystem if the dataset fits on one system's
local disk. Am I missing something?
A column-based database (generalizing here) seems like a possible match for
our needs. As I understand it, database access on the Wikipedia servers is
99%-plus reads, with far fewer edits; is that right? If
that's true, then the database workload is more like a data warehousing job
(few updates, predominantly read operations), and column-based databases
seem to be around 10x faster for data warehousing work.
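
To make the row-versus-column distinction concrete, here is a minimal sketch in Python. The table, column names, and values are invented for illustration and are not Wikipedia's schema; the point is only that a read over one field touches a single contiguous list in the column layout, rather than every whole record.

```python
# Toy illustration only: the table, column names, and values are made up
# and do not reflect any real Wikipedia schema.
rows = [
    {"page_id": 1, "title": "Alpha", "views": 10},
    {"page_id": 2, "title": "Beta", "views": 20},
    {"page_id": 3, "title": "Gamma", "views": 30},
]


def total_views_row_store(rows):
    """Row layout: every whole record is visited to read one field."""
    return sum(row["views"] for row in rows)


def to_column_store(rows):
    """Pivot a list of records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_views_column_store(columns):
    """Column layout: only the 'views' column is read."""
    return sum(columns["views"])


columns = to_column_store(rows)
```

Both functions return the same total; the difference is how much data each has to read to get there, which is where a read-mostly, warehouse-style workload would be expected to benefit.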
The question I would have is whether the database scale is large enough to
justify that. If the indexes for the key tables can be kept in RAM all the
time (and given database sizes of only a few hundred GB now, I would hope
they can), then an index in RAM beats disk access to a column in terms
of performance, and the database's disk layout is therefore a second-order
effect.
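
A back-of-envelope sketch of that argument, in Python. The latency figures are assumed order-of-magnitude values (roughly 100 ns for a RAM access, roughly 10 ms for a disk seek), not measurements from WMF servers, and the cost model (a fixed number of index levels plus one disk access to fetch the row) is a deliberate simplification.

```python
# Assumed order-of-magnitude latencies; these are NOT measurements from
# any WMF server, just illustrative figures.
RAM_ACCESS_SECONDS = 100e-9  # about 100 nanoseconds per RAM access
DISK_SEEK_SECONDS = 10e-3    # about 10 milliseconds per disk seek


def lookup_cost(index_levels, index_in_ram):
    """Estimate the time for one indexed row lookup.

    Simplified model: each index level costs one access (RAM or disk),
    then one disk access fetches the row itself.
    """
    per_level = RAM_ACCESS_SECONDS if index_in_ram else DISK_SEEK_SECONDS
    return index_levels * per_level + DISK_SEEK_SECONDS


cached = lookup_cost(index_levels=3, index_in_ram=True)
uncached = lookup_cost(index_levels=3, index_in_ram=False)
```

Under these assumptions the cached-index lookup is dominated by the single disk access for the row, while the uncached lookup pays a seek per index level as well. That gap is why the answer to the RAM question matters more than the on-disk layout.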
Can WMF admins confirm that the DB servers are effectively keeping the DB
indexes cached in RAM now?
--
-george william herbert
george.herbert(a)gmail.com