The point is that I wanted to make a language-independent tool.
If you go into misspellings, you then need a spelling tool for each language you want to support, and you need to worry about support for special names, locations, etc.  I did not want to go into that.  Should we get into that?  Would there be a marked advantage?  I would be interested to know.
Moving blocks is also tricky.  I can cut and paste a "did not" ... and change the meaning of the text at the destination.
So as you say, you need to look for self-contained blocks, but even changing the order of blocks can affect meaning...

Maybe a good proposal is the following:
  1. Still flag for trust as we do now, paying attention even to minor changes
  2. When giving reputation, only give reputation to authors who contribute non-negligible amounts of text.
But we tried 2, and it did not work well: it decreased the predictive power of the reputation we computed.  Many editors do mostly small edits; they would not receive much reputation for their work under option 2.  We found that valuing the authors of even small changes actually led to a better reputation system (as measured by the predictive power).
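
To make the difference concrete, here is a toy sketch in Python.  It is not our actual code: the edit log, the function name assign_reputation, the min_words threshold, and the scoring rule (credit or debit by number of words changed) are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical edit log: (author, words_changed, edit_survived).
# Names and numbers are illustrative only, not data from the real system.
EDITS = [
    ("alice", 120, True),
    ("bob", 2, True),
    ("bob", 1, True),
    ("bob", 3, True),
    ("carol", 1, False),
]


def assign_reputation(edits, min_words=0):
    """Credit each surviving edit and debit each reverted one.

    Edits that change fewer than `min_words` words are ignored entirely,
    which is what option 2 above amounts to.
    """
    reputation = defaultdict(float)
    for author, words_changed, survived in edits:
        if words_changed < min_words:
            continue  # option 2: negligible edits earn (and cost) nothing
        reputation[author] += words_changed if survived else -words_changed
    return dict(reputation)


all_edits = assign_reputation(EDITS)                 # every edit counts
large_only = assign_reputation(EDITS, min_words=10)  # option 2
```

With the threshold in place, an editor like "bob", who only makes small edits, never shows up in the result at all, so the system has nothing to say about the text he touches.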

Luca


On Dec 21, 2007 11:57 AM, Jonathan Leybovich <jleybov@gmail.com> wrote:
> Date: Fri, 21 Dec 2007 10:34:47 -0800
> From: "Luca de Alfaro" <luca@dealfaro.org>
>
> If you want to pick out the malicious changes, you need to flag also small
> changes.
>
> "Sen. Hillary Clinton did *not* vote in favor of war in Iraq"
>
> "John Doe, born in *1947*"
>
> The ** indicates changes.
>

Yes, and I did not mean to include cases such as this, which involve
the insertion of a few words that could radically alter the semantic
content of a unit of text.  But legitimate spelling corrections (which
can be easily determined using any of the various spell-checker
databases to determine the set of common misspellings for a word) do
not.  In short, I cannot imagine a case where someone changing
"Senater Clinton" to "Senator Clinton" could involve vandalism (the
"smoother" algorithm should of course also take into account that if a
"misspelling" appears repeatedly in an article, or even better,
related subject articles by different authors, it is probably a valid
technical term or a proper name).  I also cannot imagine how moving a
large block of relatively self-contained text (i.e. a paragraph, since
even parsing at the level of sentences is problematic given all the
uses for the period '.') without modifying its interior could have any
large semantic repercussions (readability is, of course, a matter for
a different discussion ;-)
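
As a rough sketch of the kind of "smoother" check described above, in Python: DICTIONARY is a tiny stand-in for a real spell-checker database, and the function name is_spelling_fix and both thresholds are assumptions chosen for illustration, not part of any existing tool.

```python
import difflib
from collections import Counter

# Hypothetical stand-in for a real spell-checker word list.
DICTIONARY = {"senator", "clinton", "did", "not", "vote",
              "in", "favor", "of", "war", "iraq"}


def is_spelling_fix(old_word, new_word, article_words,
                    max_repeats=1, min_similarity=0.8):
    """Guess whether replacing old_word with new_word is a benign spelling fix."""
    old, new = old_word.lower(), new_word.lower()
    # The old word must be unknown and the new word known.
    if old in DICTIONARY or new not in DICTIONARY:
        return False
    # A "misspelling" that recurs in the article is probably a technical
    # term or a proper name, so do not treat the change as a mere fix.
    counts = Counter(w.lower() for w in article_words)
    if counts[old] > max_repeats:
        return False
    # The two words must be nearly identical character by character.
    similarity = difflib.SequenceMatcher(None, old, new).ratio()
    return similarity >= min_similarity
```

For example, is_spelling_fix("Senater", "Senator", ["Senater", "Clinton"]) passes the check, while an inserted "not" does not, since there is no similar unknown word that it replaces.  Note that a check like this needs a word list per language.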

Again, these are mainly quibbles, but for the articles I sampled it
was quite annoying to have my eye repeatedly drawn to a single orange
word that represented nothing more than a minor, good-faith
correction.  And overall the system seems to work well!

_______________________________________________
Wikiquality-l mailing list
Wikiquality-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikiquality-l