Neil Kandalgaonkar wrote:
So lately Google has been pinging the WMF about the
lack of sitemaps on
Commons. If you don't know what those are, sitemaps are a way of telling
search engines about all the URLs that are hosted on your site, so they
can find them more easily, or more quickly.[1]
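For the curious, a minimal sitemap file is just an XML list of URLs; the URL and date below are illustrative, not real entries:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://commons.wikimedia.org/wiki/Main_Page</loc>
    <lastmod>2010-05-01</lastmod>
  </url>
</urlset>
```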
We have traditionally had problems with images: description pages
getting assumed to be the images themselves...
I investigated this issue and found that we do have a
sitemaps script in
maintenance, but it hasn't been enabled on the Wikipedias since
2007-12-27. In the meantime it was discovered that Google had written a
custom crawler for our Recent Changes, so it was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory
examination of Google Image Search I don't think they are crawling our
Recent Changes. Even if they were, there's more to life than Google --
we also want to be in other search engines, tools like TinEye, etc. So
it would be good to have this back again.
a) any objections, volunteers, whatever, for re-enabling the sitemaps
script on Commons? This probably just means adding it back into the daily cron.
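If we do re-enable it, a cron entry along these lines would do the job; the paths and schedule here are a sketch, not the production config:

```
# hypothetical /etc/cron.d entry; script path and options are assumptions
0 2 * * * www-data /usr/bin/php /srv/mediawiki/maintenance/generateSitemap.php \
    --fspath=/srv/sitemaps > /dev/null 2>&1
```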
Have you tested it first? How long does it take?
b) anyone want to work on making it more efficient
and/or better?
Commons has 13M pages. At the protocol's limit of 50,000 URLs per
sitemap file, that means generating at least 260 sitemaps.
You could play some tricks, grouping pages into sitemaps by page_id and
then updating the relevant sitemap whenever a page changes, but updating
one URL among 10,000 inside a text file would lead to lots of Apaches
waiting on the file lock. That could be overcome with some kind of
journal applied to the sitemaps later, but coming full circle, that's
equivalent to updating the sitemaps based on recentchanges data.
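The page_id bucketing idea above can be sketched as follows. This is a toy sketch, not the real maintenance script: only the 50,000-URLs-per-file limit comes from the sitemap protocol; the function and names are assumptions.

```python
# Bucket pages into sitemap files by page_id, so each page lands in a
# stable, predictable sitemap file.
URLS_PER_SITEMAP = 50_000  # protocol maximum URLs per sitemap file

def sitemap_index(page_id: int) -> int:
    """Return the sitemap file number that a given page_id falls into."""
    return page_id // URLS_PER_SITEMAP

# ~13M pages / 50,000 URLs per file = at least 260 sitemap files
print(sitemap_index(12_999_999))  # -> 259
```

The stable mapping is what makes in-place updates conceivable at all; without it, every regeneration shuffles URLs between files.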
Google has introduced some nifty extensions to the
Sitemap protocol,
including geocoding and (especially dear to our hearts) licensing![2]
However, we don't have such information easily available in the
database, so using those extensions would require parsing every File
page, which would take several millennia.
This will not work at all with the current sitemaps script, as it scans
the entire database every time and regenerates a number of sitemap
files from scratch. So, what we need is something more iterative, that
only scans recent stuff. (Or, using such extensions will have to wait
until someone brings licensing into the database).
We can start using <image:image> <image:loc> now.
The other extensions will have to wait.
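For reference, the image extension in a sitemap entry looks roughly like this; the file page and upload URLs below are made up for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://commons.wikimedia.org/wiki/File:Example.jpg</loc>
    <image:image>
      <image:loc>https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```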