On 3/22/06, Ilmari Karonen <nospam(a)vyznev.net> wrote:
One could ignore edits that have been reverted.
Detecting reverts, in
the strict sense of the word, is easy: all you need is a hash value for
each revision.
Of course, this wouldn't be perfect. But it'd be as close to perfect as
any automated system can be. And it _would_ skip most vandals.
And what happens if the next edit merges some content back in from the
reverted text?
This happens, perhaps not frequently but it isn't that rare either.
You can't assume that a edit didn't count just because it was
reverted.
I have a tool I call 'entropy flow' which I've been using for my
"wikipedia text backend where all revisions fit in ram" project. I'll
get around to releasing some stuff sooner or later, but it's easy
enough to code yourself. It's a useful too to find out who were the
real contributors...
It works like this, take all n revisions of an article:
*compress with a whole word replacement memoryless compression
algorithm (this gets around 2:1 compression on enwiki text and it's
"diff invarent") where common words are replaced with high byte
huffman encoded keys from a pre computed dictionary (unicode properly
escaped). (this is important because long words along aren't signs of
greater contribution but the addition of more rare words probably is)
*using xdelta3, compute the n^2 compressed binary diffs (n*(n-1+the
null article)) between all versions. Yes, this is a number of diffs
easily in the millions.
*The diff sizes are the costs between the article nodes. Add a slight
penalty to diffs that run in reverse direction to break ties.
*Compute the minimum spanning tree between all these nodes. (I use the
MST algo in the parallel boost graph Library
http://www.osl.iu.edu/research/pbgl/ ... MST is polynomial time, but
it's still expensive on millions of edges)
*The users who have made edits to the nodes in the large subgraph
containing the current revision are the contributors. All the little
subgraphs hanging off the null article node are mostly vandalism and
nonsense.
What seems to be a spookily correct ranking of contributors can be
found by totaling the diff sizes of each user in the main subgraph and
sorting by size.
Of course, no warranties that this actually tells you the
contributors.. the best way to be safe is just to list all editors.
I created this approach for finding the optimal diff order for delta
compressing articles... this application is just a cute side effect.
(When I'm actually compressing articles I just use a sliding window of
100ish revisions, much cheaper to compute and pretty much just as
good, although it sometimes misses when someone reverts to an ancient
version which is why it's better for finding contributors to consider
the entire history in one MST pass).