[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Thomas Dalton thomas.dalton at gmail.com
Fri Aug 21 12:04:28 UTC 2009


2009/8/21 Anthony <wikimail at inbox.org>:
>> If we are only interested in whether the most
>> recent revision is vandalised then that is a simpler problem but would
>> require a much larger sample to get the same quality of data.
>
>
> How much larger?  Do you know anything about this, or you're just guessing?
>  The number of random samples needed for a high degree of confidence tends
> to be much much less than most people suspect.  That much I know.

I have a Master's degree in Mathematics, so I know a little about the
subject. (I didn't study much statistics, but you can't do 4 years of
Maths at Uni without picking up some basic understanding of it.)

You say it requires 7649 articles, which sounds about right to me. If
we looked through the entire history (or just the last year or six
months, if you only want recent data) then we could do it with
significantly fewer articles.
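
For what it's worth, here is a rough back-of-the-envelope sketch in
Python of the kind of calculation a figure like 7649 usually comes
from: the standard normal-approximation margin of error for an
estimated proportion. The confidence level and margin behind your
exact number are your assumptions, not mine, so treat the parameters
below as purely illustrative.

    import math

    def margin_of_error(p, n, z=1.96):
        """Normal-approximation margin of error for an estimated
        proportion p from a simple random sample of size n;
        z = 1.96 corresponds to 95% confidence."""
        return z * math.sqrt(p * (1 - p) / n)

    def sample_size(p, e, z=1.96):
        """Smallest n giving margin of error e for proportion p."""
        return math.ceil(z * z * p * (1 - p) / (e * e))

    # Worst case (p = 0.5): a one-off sample of 7649 articles pins the
    # vandalised proportion down to about +/- 1.1 percentage points.
    print(margin_of_error(0.5, 7649))    # ~0.0112

    # If the true rate really is near 0.4%, the same sample gives
    # roughly +/- 0.14 percentage points.
    print(margin_of_error(0.004, 7649))  # ~0.0014

    # The classic textbook case: +/- 5% at 95% confidence needs only a
    # few hundred samples even in the worst case.
    print(sample_size(0.5, 0.05))        # 385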

I'm not sure how many we would need, though. I think we need to know
the distribution of the proportion of its time a randomly chosen
article spends in a vandalised state before we can work out what the
distribution of the average would be. My statistics isn't good enough
to work out even what kind of distribution it is likely to be, let
alone guess at the parameters. It obviously ranges between 0% and
100%, with the mean somewhere close to 0% (0.4% seems like a good
estimate), and it will presumably have a long tail (truncated at
100%): there are articles that spend their entire life in a vandalised
state (attack pages, for example), and there is a chance we'll
completely miss such a page and it will last the entire length of the
survey period, so the probability mass at exactly 100% won't be 0. I'm
sure there is a distribution that satisfies those requirements, but I
don't know what it is.
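
To make the shape I am describing concrete, one family that would
satisfy those requirements is a mixture: a large point mass at exactly
0% (articles never vandalised during the period), a small point mass
at exactly 100% (attack pages that survive the whole period), and
something like a Beta distribution squashed up against 0% for
everything in between. The Python sketch below is purely illustrative
and the parameters are made up; I'm not claiming this is the right
model, only that distributions with roughly the right shape exist.

    import random

    def fraction_vandalised(p_zero=0.90, p_one=0.001, a=0.5, b=16.0):
        """Draw one article's fraction of the survey period spent in a
        vandalised state, from a made-up zero/one-inflated Beta mixture:
          - with probability p_zero the article is never vandalised (0.0),
          - with probability p_one it is vandalised throughout (1.0),
          - otherwise a Beta(a, b) draw, which piles up near 0 and has
            a long right tail."""
        u = random.random()
        if u < p_zero:
            return 0.0
        if u < p_zero + p_one:
            return 1.0
        return random.betavariate(a, b)

    sample = [fraction_vandalised() for _ in range(100000)]
    # With these made-up parameters the mean works out at about 0.004,
    # i.e. roughly 0.4% of article-time spent vandalised.
    print(sum(sample) / len(sample))

The interesting question is then how quickly the sample mean of a
distribution like that settles down, since that is what determines how
many articles a full-history survey would actually need.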


