On Sat, Jul 18, 2009 at 3:20 PM, David Gerard <dgerard(a)gmail.com> wrote:
2009/7/18 Alexandre Dulaunoy <a(a)foo.be>:
I was wondering if it would be possible to allow web robots to access
http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
the media files. As this is pure HTTP, the mirroring could benefit from
the caching mechanisms of HTTP objects (instead of having a large dump
containing all the media files, which is more difficult to cache/update).
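The caching idea mentioned above can be sketched as a conditional GET: a mirror keeps the validators from the last response and re-downloads a file only when it has changed. The URL, date, and ETag below are illustrative, not actual values from upload.wikimedia.org:

```python
import urllib.request

# Hypothetical media file URL; the path layout mirrors the commons upload dir.
url = "http://upload.wikimedia.org/wikipedia/commons/8/8c/Example.jpg"

# Send the validators saved from a previous fetch. A server that supports
# conditional requests answers 304 Not Modified when the cached copy is
# still current, so only changed files are transferred again.
req = urllib.request.Request(url, headers={
    "If-Modified-Since": "Sat, 18 Jul 2009 00:00:00 GMT",
    "If-None-Match": '"abc123"',
})

# The request now carries both validator headers.
print(req.get_header("If-modified-since"))
```

This is only a sketch of the mechanism; a real mirror would persist the `Last-Modified`/`ETag` values per file and handle the 304 response.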
I see lots of files on upload.wikimedia.org on Google Image Search
already. Is that actually forbidden by our robots.txt?
It'd actually be better if Google properly indexed text pages whose
names end in .jpg or whatever ... but they're aware we'd like that, so
it's up to them.
But the current directory listing (upload dir) is disallowed, for example:
http://upload.wikimedia.org/wikipedia/commons/8/8c/
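How a crawler interprets such a Disallow rule can be shown with the standard-library robots.txt parser. The rule below is a hypothetical fragment written for illustration; the live robots.txt on upload.wikimedia.org may differ:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt fragment blocking one directory subtree,
# similar in shape to the disallow being discussed.
rules = """\
User-agent: *
Disallow: /wikipedia/commons/8/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The directory listing under the disallowed prefix is off-limits to bots...
blocked = rp.can_fetch("*", "http://upload.wikimedia.org/wikipedia/commons/8/8c/")
# ...while a path outside that prefix remains fetchable.
allowed = rp.can_fetch("*", "http://upload.wikimedia.org/wikipedia/commons/9/9a/Example.jpg")
print(blocked, allowed)
```

Because matching is by path prefix, disallowing the listing directory also blocks everything beneath it for rule-obeying crawlers.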
Of course, a bot could get the media files by following links from
other pages, but this is not very handy or effective for making an
exact mirror of just the current media files repository.
Would it be possible to enable directory listing of
http://upload.wikimedia.org/wikipedia/commons
and its subdirectories?
Thanks for the feedback,
--
-- Alexandre Dulaunoy (adulau) --
http://www.foo.be/
--
http://www.foo.be/cgi-bin/wiki.pl/Diary
-- "Knowledge can create problems, it is not through ignorance
-- that we can solve them" Isaac Asimov