Re: [Wikipedia-l] Methods of protection of minor Wikipedias

5 Nov 2005

...
 Nevertheless, the vast majority of spam on inactive _Wikipedias_ is
 from unloggedin users. 
 But only because we don't make them log in, not because it's hard to
 do so. It's far more of a deterrent to genuine editors than to spam
 bots. 
Unfortunately, that's probably true.

...
 Blocking "that-medicine-that-starts-with-c" will
 prevent anyone writing about socialism (which was rather a problem for
 socialism.wikicities.com) :) 
 See-eye-a-ell-eye-ess is related to _socialism_?? 
 At the risk of ending up in everyone's spam bins, I'll spell it out:
 "so...Cialis...m". Blocking the word blocks any words that contain it. 
Ohh. Duh. But, surely, it would take only a few lines of code to add a
feature so that it only blocked the _whole word_?
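For what it's worth, a whole-word check might look roughly like the sketch below. This is Python rather than whatever the spam filter is actually written in, and the `BLOCKED_WORDS` list and the `contains_blocked_word` function are names made up for illustration, not part of any existing extension.

```python
import re

# Hypothetical blocklist, invented for this example.
BLOCKED_WORDS = ["cialis"]

# \b matches only at a word boundary, so a blocked term buried inside a
# longer word (as in "socialism") does not trigger the filter.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def contains_blocked_word(text: str) -> bool:
    """Return True only when a blocked term appears as a whole word."""
    return _PATTERN.search(text) is not None
```

Under these assumptions, "Buy Cialis now" is caught while "an article about socialism" passes. It would not catch deliberately disguised spellings, which is a separate problem.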

I mean, does anybody get spam e-mails that say "Free socialism! Click
here now"... or even "Get free soCialiSm! cl**k here no*" or anything
like that? I don't think spammers are sophisticated enough to realise
that there are legitimate words that contain spam-filter'd words.

Also, there is the occurrence of _phrases_: "free (name of product or
medicine)" is significantly more likely to be spam than even "(name of
product or medicine) is a". If you add "get" before the "free", that
is even more likely (exponentially?) to be spam. Add a "now"
afterwards, and more likely. Add "by" after that, or "for"... For the
_second_ one, add the word "nat**al". Then add "h*rb*l"... then
"s*p*l*m't", then "for", then "m*le", then that word that you know
oh-so-well comes next due to the extreme odds!

Of course, anything that filtered on something as complex as this
would be very, very complex programming.
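A very crude version of the phrase idea is perhaps less work than it sounds. The sketch below (Python, with phrase weights that are pure guesses made up for illustration, not measured from any spam corpus) adds a score for each known phrase, so that longer matches such as "get free" pile on top of the shorter "free".

```python
import re

# Hypothetical weights, invented for this example. A real system would
# have to learn them from edits already marked as spam or legitimate.
PHRASE_WEIGHTS = {
    ("free",): 1.0,
    ("get", "free"): 3.0,
    ("click", "here"): 2.0,
    ("click", "here", "now"): 4.0,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phrase_score(text):
    """Sum the weight of every listed phrase each time it occurs."""
    tokens = tokenize(text)
    score = 0.0
    for phrase, weight in PHRASE_WEIGHTS.items():
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                score += weight
    return score
```

With these made-up weights, "Get free pills, click here now" scores far higher than "Free software is a matter of liberty", which only matches the single word "free". Where to put the cut-off is left open.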

Perhaps instead, somebody could adapt a Free numerical rating system
for spam e-mails (which gives "likelihoods" that e-mails are spam) --
Google may or may not be willing to help out there given how massive
their database must be and their commitment to Goodness on the
Internet, but if not there would be another project I'm sure.

...
From that, some things could be adapted. For example, the "from",
"to", and "cc" lines aren't present, and neither is the subject. HTML
codes would have to have aliases using WikiCode. Things which might be
"automatic kill" for a spam killer would, in many instances, have to
be significantly downgraded, at least for the English Wikipedia (for
example, the-medicine-that-starts-with-c is a legitimate topic, but in
very limited contexts). Talk pages would have to give a certain degree
of slack. The greater the length of a page, the more times its title
should occur within it, or *related* terms (ie, links to articles
which link back to it). So, to a certain extent, "subject" and
"article title" would correspond, although the length-title ratio
would be significantly different.
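As a rough illustration of the title-versus-length idea, the sketch below counts how often a page's title appears per 1000 words of its text. It is Python, the function name `title_density` is made up, and it ignores the "related terms" part entirely, since that would need the link graph.

```python
import re


def title_density(title: str, body: str) -> float:
    """Occurrences of the title per 1000 words of body text."""
    words = re.findall(r"\w+", body)
    if not words:
        return 0.0
    # Whole-word, case-insensitive matches of the title in the body.
    hits = len(
        re.findall(r"\b" + re.escape(title) + r"\b", body, re.IGNORECASE)
    )
    return 1000.0 * hits / len(words)
```

A page that never mentions its own title would score 0.0, which could be one small "bad" signal among many rather than grounds for blocking by itself.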

Certain IPs would be greylisted based on the relative frequency of
spam from them. In fact, every IP range would be assigned a %age based
on existing data. If 90% of the content from an IP range is spam, the
system might notice if any subranges or particular IPs had a
significantly lower frequency, and if they did, semiwhitelist them (ie,
"good" percentage points). An IP range with 90% of submissions
legitimate, on the other hand, would have "good" points. If there were
any particular subranges or IPs with a significantly higher percentage
of spam, they would be semiblacklisted ("bad" percentage points, or
less "good" percentage points, depending on the exact frequency).
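The greylisting bookkeeping could be sketched along these lines. This is Python using the standard `ipaddress` module. The /24 range size, the 0.3 margin, and the function names `spam_rates` and `classify` are all assumptions picked for the example, not anything taken from existing software.

```python
import ipaddress
from collections import defaultdict


def spam_rates(edits, prefix=24):
    """edits: iterable of (ip_string, is_spam) pairs.

    Returns (range_rates, ip_rates): the fraction of spam per network
    and per individual IP.
    """
    range_counts = defaultdict(lambda: [0, 0])  # network -> [spam, total]
    ip_counts = defaultdict(lambda: [0, 0])     # ip -> [spam, total]
    for ip, is_spam in edits:
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for counts in (range_counts[net], ip_counts[ip]):
            counts[0] += int(is_spam)
            counts[1] += 1
    range_rates = {n: s / t for n, (s, t) in range_counts.items()}
    ip_rates = {ip: s / t for ip, (s, t) in ip_counts.items()}
    return range_rates, ip_rates


def classify(ip, range_rates, ip_rates, prefix=24, margin=0.3):
    """Compare one IP's spam rate with the rate of its whole range."""
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    range_rate = range_rates.get(net)
    ip_rate = ip_rates.get(ip)
    if range_rate is None or ip_rate is None:
        return "unknown"
    if ip_rate <= range_rate - margin:
        return "semi-whitelist"   # much cleaner than its neighbours
    if ip_rate >= range_rate + margin:
        return "semi-blacklist"   # much worse than its neighbours
    return "range-default"        # just inherit the range's percentage
```

An IP with only one or two edits on record would give a very noisy percentage, so a real version would presumably want a minimum number of edits before trusting the per-IP figure.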

I'm going into too much detail here, and obviously it would be a
massive undertaking, but given the massive amount of work it would
solve, it's not the sort of pipe dream that I feel guilty bringing up
in front of people who could actually bring it to fruition (I know I
couldn't without learning a programming language first -- right now, I
have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but
nothing else, and the latter two aren't exactly programming
languages).

...
 if I were to receive an e-mail every two
 hours with a list of suspicious edits, I could revert them immediately
 as necessary. 
 I'd also be more likely to check edits sent to me by email. Perhaps
 http://meta.wikimedia.org/wiki/EmailNotification could be adapted.
 Currently, I don't think it will send diffs, and there's no way of
 filtering for "suspicious" edits. 
Ahh, but there are already three-halves party applications (meaning,
by Wikipedians, but not software integrated to MeW) which monitor for
"suspicious" edits. Nothing complex, but helpful nonetheless in
filtering out The Good Edits to give only the bad ones, based on a few
very basic observations, as well as feedback.

Mark
