Brian wrote:
Delerium, you do make it sound as if merely having the
tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances of this rule in the edit history *really
are* vandalism). The other is learning about new reasons that this
edit is vandalism based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features. Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.
Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true,
though, that machine-learning systems, unlike people, struggle with
generalization.
My point is mainly that using the *results* of an automated rule system
as *input* to a machine-learning algorithm won't constitute training on
"vandalism", but on "what the current rule set considers vandalism". I
don't see a particularly good reason to find new reasons an edit is
vandalism for edits that we already correctly predict. What we want is
new discriminators for edits we *don't* correctly predict. And for
those, you can't use the labels-given-by-the-current rules as the
training data, since if the current rule set produces false positives,
those are now positives in your training set; and if the rule set has
false negatives, those are now negatives in your training set.
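To make that concrete, here is a toy sketch (hypothetical features and a made-up rule, not anything from the actual filter set) of why training on rule-set output just reproduces the rule set: a learner that perfectly fits the rule labels still inherits every rule mistake when scored against ground truth.

```python
# Toy data: (features, truly_vandalism) pairs. Features and the
# "all caps" rule below are invented for illustration only.
edits = [
    ({"all_caps": True},  True),   # vandalism, and the rule catches it
    ({"all_caps": True},  False),  # rule false positive
    ({"adds_link": True}, True),   # rule false negative
    ({"adds_link": True}, False),  # legitimate edit, correctly passed
]

def rule_label(features):
    # Stand-in for the current rule set: flag anything in all caps.
    return features.get("all_caps", False)

# Training on rule_label() means the *best possible* learner is one
# that agrees with rule_label() everywhere -- so its error rate
# against ground truth is exactly the rule set's error rate.
errors = sum(1 for features, truth in edits if rule_label(features) != truth)
print(errors)
```

Here the ideal student of the rule set still gets 2 of the 4 edits wrong: the rule's one false positive and one false negative, which the training labels presented as correct answers.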
I suppose it could be used for proposing hypotheses to human
discriminators. For example, you can propose new feature X, if you find
that 95% of the time the existing rule set flags edits with feature X as
vandalism, and by human inspection determine that the remaining 5% were
false negatives, so actually feature X should be a new "this is
vandalism" feature. But you need that human inspection--- you can't
automatically distinguish rules that improve the filter set's
performance from rules that degrade it if your labeled data set is the
one with the mistakes in it.
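The hypothesis-proposal loop could look something like this sketch (the helper name, the edit ids, and the predicates are all hypothetical): when nearly all edits carrying candidate feature X are already flagged by the rules, queue the disagreeing remainder for a human to judge.

```python
def propose_feature(edits, has_feature, rule_flags, threshold=0.95):
    """If at least `threshold` of the edits carrying the candidate
    feature are flagged by the current rule set, return the
    unflagged remainder for human review; otherwise return None."""
    with_x = [e for e in edits if has_feature(e)]
    if not with_x:
        return None
    flagged = sum(1 for e in with_x if rule_flags(e))
    if flagged / len(with_x) >= threshold:
        # A human decides whether these are rule-set false negatives,
        # which would make X a new "this is vandalism" feature.
        return [e for e in with_x if not rule_flags(e)]
    return None

# Made-up example: 100 edits, 40 carry feature X, the rules miss two.
edits = list(range(100))
has_x = lambda e: e < 40
flags = lambda e: has_x(e) and e not in (7, 13)
print(propose_feature(edits, has_x, flags))  # [7, 13]
```

Note the human stays in the loop: the function only nominates edits 7 and 13 for inspection; it never promotes X to a vandalism feature on its own, for exactly the reason above.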
-Mark