For some time we've been using HTML Tidy to do additional correction
and clean-up on output. Normally this is quick compared to the database
overhead on page edits, though the time can get fairly large for very
big pages.

More seriously, forking and spawning an external tidy program can be a
bigger problem when the system's under heavy load.

I've checked into CVS HEAD the ability to use the PECL extension which
exposes an interface to the tidy library in-process. This speeds things
up a bit:

Short page ([[Stuff]], about 2.5k of HTML):
     0.010ms no-op
     2.891ms internalTidy
     8.710ms externalTidy

Long page (a village pump page, 450k+ of HTML):
     1.783ms no-op
   266.066ms internalTidy
   306.098ms externalTidy

On a heavily loaded system the difference can go way up!

10 simultaneous tidy test threads:

Short page:
     0.010ms no-op
     2.736ms internalTidy
   213.108ms externalTidy

Long page:
     2.343ms no-op
   565.822ms internalTidy
  5868.871ms externalTidy

Heavy disk seeking (make clean on a GCC build) + 10 simultaneous tidy
test threads:

Short page:
     0.010ms no-op
     2.637ms internalTidy
   928.098ms externalTidy

Long page:
     2.353ms no-op
  4305.380ms internalTidy
  6686.658ms externalTidy

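
For anyone who wants to run a similar comparison, here is a rough Python sketch of a timing harness. It is not the script used for the numbers above. The function names are made up for illustration, and both "clean-up" steps are stand-ins: a plain string strip for the in-process call, and a child Python interpreter for the external program.

```python
import subprocess
import sys
import time


def time_ms(fn, runs=5):
    """Return the mean wall-clock time of fn() in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


def no_op(html):
    # Baseline: measures only the call overhead.
    return html


def in_process_clean(html):
    # Stand-in for an in-process library call (hypothetical clean-up step).
    return html.strip()


def external_clean(html):
    # Stand-in for spawning an external program: pipe the text through
    # a child Python interpreter that strips it and writes it back.
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().strip())"],
        input=html, capture_output=True, text=True, check=True,
    )
    return result.stdout


def benchmark(html, runs=5):
    """Time each strategy on the same input; returns {name: mean ms}."""
    strategies = [("no-op", no_op),
                  ("internal", in_process_clean),
                  ("external", external_clean)]
    return {name: time_ms(lambda f=f: f(html), runs) for name, f in strategies}


if __name__ == "__main__":
    page = "  <p>Stuff</p>\n" * 100
    for name, ms in benchmark(page).items():
        print(f"{ms:10.3f}ms {name}")
```

The external path pays for process creation on every call, which is the cost that grows when the machine is busy. To approximate the loaded cases, run several copies of the script at once.
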
This is coded for the PHP 4.3.x version of the extension, and may not
work on PHP 5. Once it's installed (run 'pear install tidy' and add
'extension=tidy.so' to php.ini) it should automatically be picked up if
you've got $wgUseTidy on.
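
The pick-up logic amounts to a small decision, sketched here in Python rather than the actual PHP. The function name and return values are hypothetical, and the sketch assumes the external program stays as the fallback when the extension isn't loaded.

```python
def choose_tidy_backend(use_tidy, extension_loaded):
    """Pick a clean-up path: None, "internal", or "external"."""
    if not use_tidy:
        return None           # tidy is switched off entirely
    if extension_loaded:
        return "internal"     # in-process call into the tidy library
    return "external"         # assumed fallback: spawn the tidy program


if __name__ == "__main__":
    print(choose_tidy_backend(use_tidy=True, extension_loaded=True))
```
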
The changes are localized and don't alter the code interface, so I'll
backport this to 1.4 as a performance fix.
-- brion vibber (brion @ pobox.com)