An enhanced version of wikidiff2, the C++ diff extension, is now running on both clusters.
It now does character-level diffs on Chinese, Japanese and Thai, so it produces much better
results than the PHP diff algorithm, and in a much shorter time to boot. Chinese previously
had an ad-hoc segmentation scheme that inserted a space between every character before the
diff and removed the spaces afterwards, but unfortunately that left spaces all over the
place where there shouldn't have been any. Anyway, it's fixed now.
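
To illustrate the idea of a character-level diff that never touches the spaces in the text, here is a minimal Python sketch. It is not wikidiff2's code: the Unicode ranges, the names `tokenize` and `diff_tokens`, and the use of Python's `difflib` in place of the extension's own diff algorithm are all assumptions made for illustration. Characters in the listed ranges become one token each, other text stays grouped into words, and the tokens are diffed directly, so nothing has to be inserted beforehand or stripped out afterwards.

```python
import difflib
import re

# Assumption: these ranges are a rough stand-in for "Chinese, Japanese and
# Thai". They are illustrative only, not the ranges wikidiff2 uses.
_PER_CHAR = (
    "\u0e00-\u0e7f"  # Thai
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u4e00-\u9fff"  # CJK Unified Ideographs
)

# A token is one character from the ranges above, a run of whitespace,
# or a run of anything else (which keeps Latin-script words whole).
_TOKEN = re.compile(f"[{_PER_CHAR}]|\\s+|[^\\s{_PER_CHAR}]+")


def tokenize(text):
    """Split text into tokens; joining the tokens gives the text back."""
    return _TOKEN.findall(text)


def diff_tokens(old, new):
    """Return (tag, old_text, new_text) for every changed token run."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, `diff_tokens("我喜欢苹果", "我喜欢香蕉")` reports a single replacement of `苹果` with `香蕉`. Because the tokens concatenate back to the original string, no spaces can be left behind in the output.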
We're still calling dl() every time a diff is needed, and I'm still waiting for profiling
results on the effect of that. The performance of the algorithm is quite good, though: on
our Opterons, it can diff 2 MB (each side) of the most pathological input text I've yet
been able to devise in 5.2 seconds, and it does so with only about 15 MB of memory.
-- Tim Starling