On 12/4/06, Alexandre NOUVEL <alexandre.nouvel(a)alnoprods.net> wrote:
Hi all,
--- According to Magnus Manske <magnusmanske(a)googlemail.com>:
Well, cross-checking one million commons images against a few hundred
thousand on one of the larger wikipedias might kill the toolserver
quite efficiently ;-)
Well, I agree that image processing is a very CPU-consuming task, and
cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature
for each file and sort the signatures to find duplicates. The hashing
itself would take some time, but it could be split across several
servers. The resulting hash lists could then be sorted, so that
matching signatures point back to images that need a closer look.
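For what it's worth, here is a rough sketch of that idea in Python. It is
purely illustrative: it assumes the files are readable locally and uses MD5
from hashlib as the signature.

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def md5_of_file(path, chunk_size=1 << 20):
        # Hash the file in chunks so large images need not fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
        return h.hexdigest()

    def find_duplicate_candidates(directory):
        # Group file paths by their hash; any group with more than one
        # entry is a set of candidate duplicates needing a closer look.
        groups = defaultdict(list)
        for path in Path(directory).rglob("*"):
            if path.is_file():
                groups[md5_of_file(path)].append(path)
        return {h: paths for h, paths in groups.items() if len(paths) > 1}

The per-server hash lists could then be merged and sorted centrally, which
is cheap compared to the hashing itself.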
There was a discussion somewhere (maybe on this list? I don't
remember) about storing MD5 hashes of the image data in the table with
the other image information (size etc.). Nothing came of it, I'm
afraid. Too bad.
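Had such a column existed, finding duplicates within one wiki would reduce
to a single GROUP BY query. A toy sketch with SQLite follows; the table
layout and the img_md5 column are made up for illustration, not the real
MediaWiki schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (img_name TEXT PRIMARY KEY,"
        " img_size INTEGER, img_md5 TEXT)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?)",
        [("Foo.jpg", 1024, "d41d8cd9"),
         ("Bar.jpg", 1024, "d41d8cd9"),
         ("Baz.png", 2048, "9e107d9d")],
    )
    # Every hash occurring more than once points to candidate duplicates.
    duplicates = conn.execute(
        "SELECT img_md5, GROUP_CONCAT(img_name) FROM image "
        "GROUP BY img_md5 HAVING COUNT(*) > 1"
    ).fetchall()
    print(duplicates)  # e.g. [('d41d8cd9', 'Foo.jpg,Bar.jpg')]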
One drawback of this solution is having to maintain a huge index of
all the signatures (each one associated with the image name and the
originating wiki).
With images being replaced, deleted, undeleted, etc., the only
practical place is indeed the image table on the respective wiki. An
outside solution (i.e. the toolserver) is out of the question IMHO.
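If each wiki keeps the hashes in its own image table, the cross-wiki check
itself is cheap: it amounts to comparing two hash lists, roughly like the
sketch below (the argument format is an assumption; how the lists would be
exported from each wiki is left open).

    def cross_wiki_duplicates(commons_hashes, local_hashes):
        # Both arguments: dict mapping hash -> list of image names on
        # that wiki. Any hash present in both is a candidate duplicate
        # between Commons and the local wiki.
        shared = set(commons_hashes) & set(local_hashes)
        return [(h, commons_hashes[h], local_hashes[h]) for h in sorted(shared)]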
Or perhaps I'm just writing bullshit :)
Nope :-)
Magnus