Wikimedia developers <wikitech-l@Wikipedia.org> writes:
Now that
Magnus mentioned it on Meta, I'm more inclined to ask: is
writing Phase IV in C/C++ something that's being actively considered?
Sure, it's an effort, but we're looking at - minimally - an order of
magnitude improvement in performance. That does, however, raise the
question of whether it's worth it to reinvent the wheel (Phase III is
great software) for a problem that, in the foreseeable future, can be
solved cheaply by throwing more processor horsepower into the mix, and
by better caching.
-IK
Well, an order of magnitude means we can do the same amount of work with
a tenth of the machines we have now, or ten times as much with the same. That's a
pretty big deal. That said, rewriting the entire codebase in C/C++ will
take a huge amount of developer time. If there are any time-sensitive
spots in the code, it might be worth recoding those in C/C++ - but I
don't know about the entire codebase. I really don't think that's a
"win" in terms of developer time or future maintainability.
I suspect there is a lot that can be done without going to C/C++. With
some performance tuning you could probably get anywhere from a doubling to
an order-of-magnitude improvement out of the current code base. PHP is the
right-sized language for this product. At worst you might have to write
some C/C++ helper programs. Typically, if a product hasn't been written
with performance in mind and has never been tuned, 90% of the CPU cycles
occur in 5-10% of the code.
I've been writing C/C++ programs for over 20 years. It is not a great
language for working with string data. Unless you code carefully, you'll
end up worse off. In terms of features/programmer hour, PHP is probably
several times more productive than C/C++.
Speaking on a purely theoretical basis, an important part of writing or
tuning a program in something like PHP is understanding which operations
cost the most when executed in that particular language. Also, in almost
any language, repeatedly scanning, copying, and concatenating large
strings (like 100,000-byte articles) is really costly. Most string-based
programs spend 90% of their time doing storage allocation, so *looking* is
much better than *touching*. Another thing to avoid is operations whose
cost grows with the square of the number or size of objects; linear to
'n log n' is what you want to strive for.
Database tuning is quite important as well, though I get the feeling
you've been on top of that.
Has much work been done on seeing what takes the most time when running
MediaWiki? I am still quite new to it, and to PHP as well.
Nick