Re: [Wikitech-l] MediaWiki converter (follow-up)

23 Mar 2006

Gregory Maxwell wrote:
...
> On 3/22/06, Ilmari Karonen <nospam(a)vyznev.net> wrote:
>> One could ignore edits that have been reverted.  Detecting reverts, in
>> the strict sense of the word, is easy: all you need is a hash value for
>> each revision.
>>
>> Of course, this wouldn't be perfect.  But it'd be as close to perfect as
>> any automated system can be.  And it _would_ skip most vandals.
>
> And what happens if the next edit merges some content back in from the
> reverted text?

This case falls under "not perfect but as close as can be".  It's 
essentially the same problem as someone pasting content from another 
article, or from another source entirely.  Even your diff-based scheme, 
while nifty indeed, doesn't solve that.  In general, nothing can.
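
For concreteness, the hash-based revert detection quoted above might look 
roughly like the following.  This is only a minimal sketch: the function 
name find_reverted, the choice of SHA-1, and the representation of the 
history as a plain list of revision texts (oldest first) are all 
assumptions made for illustration, not a description of any existing 
MediaWiki code.

```python
import hashlib


def find_reverted(revisions):
    """Return the indices of revisions that were undone by an exact revert.

    `revisions` is a list of revision texts, oldest first (an assumed,
    simplified stand-in for a real page history). Whenever revision j has
    the same hash as an earlier revision i, every revision strictly
    between i and j is treated as reverted.
    """
    last_seen = {}    # digest -> index of the latest revision with that text
    reverted = set()
    for j, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        i = last_seen.get(digest)
        if i is not None:
            # Revision j restores the text of revision i exactly.
            reverted.update(range(i + 1, j))
        last_seen[digest] = j
    return sorted(reverted)


history = ["a", "a b", "VANDALISM", "a b", "a b c"]
print(find_reverted(history))
```

As discussed, this only catches exact reversions; a later edit that 
merges part of the reverted text back in is invisible to it.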

By the way, it might be possible to optimize your scheme by using some 
form of histogram analysis to quickly establish lower bounds on edit 
distances.  For that matter, if you're not using it already, even just 
the difference in article lengths gives a weak lower bound on the edit 
distance.  Meanwhile, hashing can be used to establish upper bounds, 
both by hashing the entire text to detect exact reversions and by 
hashing deterministically chosen chunks (such as article sections) to 
detect local changes.
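
The two lower bounds mentioned above can be sketched as follows.  I'm 
assuming here that "edit distance" means character-level Levenshtein 
distance (insert, delete, substitute, each costing 1); a line-based diff 
would need the same idea applied to lines instead of characters.  The 
function names are made up for this example, and levenshtein() is 
included only as the expensive exact computation that the cheap bounds 
are meant to help avoid.

```python
from collections import Counter


def length_lower_bound(a, b):
    """Weak bound: each insertion or deletion changes the length by 1."""
    return abs(len(a) - len(b))


def histogram_lower_bound(a, b):
    """Bound from character counts.

    `surplus` counts characters a has too many of, `deficit` those it has
    too few of. One edit reduces each of these by at most 1, so at least
    max(surplus, deficit) edits are needed.
    """
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())   # Counter subtraction keeps positives only
    deficit = sum((cb - ca).values())
    return max(surplus, deficit)


def levenshtein(a, b):
    """Exact edit distance by the standard dynamic programme (O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


a, b = "kitten", "sitting"
print(length_lower_bound(a, b), histogram_lower_bound(a, b), levenshtein(a, b))
```

The histogram bound is never smaller than the length bound, since the 
length difference equals |surplus - deficit|.  If either bound already 
exceeds whatever threshold one cares about, the exact computation can 
be skipped for that pair of revisions.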

-- 
Ilmari Karonen
