I've made a new diff extension, called wikidiff2. It uses the same diff
algorithm that we've been using in PHP, ported to C++. I've done some
benchmarks on test sets which require lots of word-level diffs. For the lines
"a b c d" -> "a b c e" repeated many times, on srv31, the timings are:
PHP DifferenceEngine: 10230us per line
wikidiff (old C++ extension): 379us per line
wikidiff2 (new C++ extension): 11.5us per line
No doubt the ratios will be different under realistic conditions, but I know
where I'm putting my money.
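For anyone wanting to try a similar test, here is a minimal C++ sketch of how such a test set could be built and how a per-line figure falls out of a total time. The names make_test_set and us_per_line are made up for illustration and are not part of wikidiff2; the harness actually used for the numbers above isn't shown in this post.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not part of wikidiff2): builds the "before" and
// "after" texts, each made of `lines` copies of a single line. Every line
// differs only in its last word, so each one needs a word-level diff.
std::pair<std::string, std::string> make_test_set(std::size_t lines) {
    std::string before;
    std::string after;
    for (std::size_t i = 0; i < lines; ++i) {
        before += "a b c d\n";
        after += "a b c e\n";
    }
    return {before, after};
}

// Hypothetical helper: converts a total elapsed time in microseconds into
// a per-line cost. Returns 0 for an empty test set to avoid dividing by zero.
double us_per_line(double total_us, std::size_t lines) {
    return lines == 0 ? 0.0 : total_us / static_cast<double>(lines);
}
```

The two strings would be passed to whichever diff implementation is being timed, and the total time divided by the line count gives a "us per line" figure comparable to the ones listed.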
We've been using the PHP version lately rather than wikidiff, because wikidiff
wasn't finding diffs as short as people were used to. Because the new extension
uses exactly the same algorithm as the PHP version, there should be no
user-visible differences.
wikidiff2 can also be compiled as a standalone executable and used to diff files.
It's not tested to my satisfaction yet, but once it is, I imagine we'll put it
live on the Wikimedia cluster. Eventually we could probably ditch the original
extension and rename the new one to wikidiff, but I wanted to keep both of them
around for the moment so that I could compare them.
-- Tim Starling