Having dabbled in this initiative a couple years back when it first
started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It
basically took the title of a new article, searched for that term via
the Yahoo! Search API, and looked for nearly exact text matches among
the first results (using an edit-distance metric).
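To make the matching step concrete, here is a minimal sketch of that kind of near-duplicate check. CSB's actual metric and thresholds are not documented in this thread, so `difflib.SequenceMatcher` and the 0.8 cutoff are stand-in assumptions, not the bot's real implementation:

```python
# Hedged sketch: flag a candidate web page whose text is nearly
# identical to a new article's text. The real CorenSearchBot used an
# edit-distance metric; difflib's similarity ratio is a stand-in here,
# and the 0.8 threshold is an illustrative assumption.
from difflib import SequenceMatcher

def likely_copy(article_text: str, web_text: str, threshold: float = 0.8) -> bool:
    """Return True when the two texts are near-exact matches."""
    ratio = SequenceMatcher(None, article_text.lower(), web_text.lower()).ratio()
    return ratio >= threshold
```

In a pipeline like CSB's, each top search result for the article's title would be fetched and run through a check like this, with matches reported for human review.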
Through the hard work of Jake Orlowitz and others we got free access to
the TurnItIn API (academic plagiarism detection). Their tool is much
more sophisticated in terms of text matching and has access to material
behind many pay-walls.
In terms of Jane's concern, we are (rather, "we imagine being")
primarily limited to finding violations originating at new article
creation or massive text insertions, because content already on WP has
been scraped and re-copied so many times.
*I want to emphasize this is a gift-wrapped academic research project*.
Jake, User:Madman, and I even began amassing ground truth to
evaluate our approach. This was nearly a chapter in my dissertation. I
would be very pleased for someone to come along, build a tool of
practice, and also get themselves a WikiSym/CSCW paper in the process. I
don't have the free cycles to do low-level coding, but I'd be happy to
advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website:
Isn't that what Corenbot does/did? I always found it very confusing
whenever I ran into it, though, and the false positives are huge (so many
sites copy Wikimedia content these days).
On Mon, Jul 21, 2014 at 9:11 AM, Pine W <wiki.pine@gmail.com> wrote:
It should be relatively easy to catch a significant percentage of those
copyright violations with the assistance of automated search tools. The
trick is to do it at a large scale in near-realtime, which might require
some computationally intensive and bandwidth-intensive work. James, can I
suggest that you take this discussion to Wiki-Research-l? There are a
number of ways that the copyright violation problem could be addressed,
and I think this would be a good subject for discussion on that list, or
at Wikimania. Depending on how the discussion on Research goes, it might
be good to invite some dev or tech ops people to participate in the
discussion as well.
Pine
On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter <osamadre@hotmail.com> wrote:
This is one of the best ideas I've read on here!
> Date: Sun, 20 Jul 2014 20:00:28 -0600
> From: jmh649@gmail.com
> To: wikimedia-l@lists.wikimedia.org; eloquence@gmail.com;
fschulenburg@wikimedia.org; ladsgroup@gmail.com; jorlowitz@gmail.com;
madman.enwiki@gmail.com; west.andrew.g@gmail.com
> Subject: [Wikimedia-l] Catching copy and pasting early
>
> Came across another few thousand edits of copy-and-paste violations
again
> today. These have occurred over more than a year. It is wearing me out.
> Really, what is the point of collaborating on Wikipedia if it is simply
a
> copyright violation? We need a solution, and one was proposed here a
> couple of years ago: https://en.wikipedia.org/wiki/Wikipedia:Turnitin
>
> We now need programmers to carry it out. The Wiki Education Foundation
has
> expressed interest. We will need support from the foundation, as this
> software will likely need to mesh closely with edits as they come in. I
am
> willing to offer $5,000 Canadian (almost the same as American) for a
> working solution that tags potential copyright issues in near real time
> with greater than 90% accuracy. It is to function on at least all
medical
> and pharmacology articles, but I would not complain if it worked on all
of
> Wikipedia. The WMF is free to apply.
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
>
> www.opentextbookofmedicine.com
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l