Gregory Maxwell wrote:
> On 3/22/06, Ilmari Karonen <nospam(a)vyznev.net> wrote:
>> One could ignore edits that have been reverted. Detecting reverts, in
>> the strict sense of the word, is easy: all you need is a hash value for
>> each revision.
>>
>> Of course, this wouldn't be perfect. But it'd be as close to perfect as
>> any automated system can be. And it _would_ skip most vandals.
>
> And what happens if the next edit merges some content back in from the
> reverted text?

This case falls under "not perfect but as close as can be". It's
essentially the same problem as someone pasting content from another
article, or from another source entirely. Even your diff-based scheme,
while nifty indeed, doesn't solve that. In general, nothing can.
By the way, it might be possible to optimize your scheme by using some
form of histogram analysis to quickly establish lower bounds on edit
distances. For that matter, if you're not using it already, even just
the difference in article lengths gives a weak lower bound on the edit
distance. Meanwhile, hashing can be used to establish upper bounds,
both by hashing the entire text to detect exact reversions and by
hashing deterministically chosen chunks (such as article sections) to
detect local changes.
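
The bounds above could be sketched roughly as follows, assuming a plain Levenshtein distance (unit-cost insert, delete, substitute) over characters. All function names are hypothetical, sections are assumed to start at lines beginning with "==", and hash collisions are ignored.

```python
import hashlib
from collections import Counter


def length_lower_bound(a, b):
    # Each insertion or deletion changes the length by exactly one.
    return abs(len(a) - len(b))


def histogram_lower_bound(a, b):
    # Every surplus character in one text must be deleted or substituted,
    # and every missing one inserted or substituted in; one edit fixes at
    # most one of each. Counter subtraction keeps only positive counts.
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def split_sections(text):
    # Deterministic chunking: a new chunk starts at each "==" heading line.
    # Concatenating the chunks reproduces the text exactly.
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("==") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def section_upper_bound(a, b):
    # Sections with matching hashes at the start and end need no edits;
    # the differing middle can always be rewritten in at most
    # max(len(middle_a), len(middle_b)) single-character edits.
    sa, sb = split_sections(a), split_sections(b)
    ha = [hashlib.sha1(s.encode("utf-8")).digest() for s in sa]
    hb = [hashlib.sha1(s.encode("utf-8")).digest() for s in sb]
    shortest = min(len(ha), len(hb))
    p = 0
    while p < shortest and ha[p] == hb[p]:
        p += 1
    s = 0
    while s < shortest - p and ha[-1 - s] == hb[-1 - s]:
        s += 1
    mid_a = sum(len(x) for x in sa[p:len(sa) - s])
    mid_b = sum(len(x) for x in sb[p:len(sb) - s])
    return max(mid_a, mid_b)


old = "== One ==\nfoo\n== Two ==\nbar\n"
new = "== One ==\nfoo\n== Two ==\nbaz!\n"
low = max(length_lower_bound(old, new), histogram_lower_bound(old, new))
high = section_upper_bound(old, new)
```

If `low` already exceeds some threshold, or `high` is zero (an exact reversion), the full diff can be skipped for that pair of revisions. The histogram bound is never weaker than the length bound, so the latter is only useful as a cheaper first check.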
--
Ilmari Karonen