if a spider goes to Recent Changes and then to
"Last 5000 changes"
(and last 90 days, and last 30 days, and last 2500 changes, and last
1000 changes, and every such combination) it seems to me the server
load could get pretty high. Perhaps talk pages should be spidered,
but not recent changes or the history (diff/changes).
I agree. Every RecentChanges page contains links to 13 other
RecentChanges views, one of which changes its URL each time the page is
loaded. The other special: pages (statistics, all pages, most
wanted, etc.) seem to be good candidates for robot exclusion as well:
they stress the database but provide little useful information for
search indexes.
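For concreteness, here is a minimal robots.txt sketch of the exclusions discussed above. The script name and query parameters are assumptions (the actual URLs depend on the wiki engine), and note that Disallow uses simple prefix matching, so query-string rules only work if the excluded parameter comes first in the URL.

```
# Hypothetical sketch -- actual script name and URL layout will differ.
User-agent: *
Disallow: /wiki.cgi?action=rc        # Recent Changes and its variants
Disallow: /wiki.cgi?action=history   # page histories
Disallow: /wiki.cgi?action=diff      # diff pages
Disallow: /wiki/Special:             # statistics, all pages, most wanted, ...
```

Talk:, wikipedia: and user: pages are simply not listed, so they remain crawlable.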
Regarding talk:, wikipedia: and user: pages, I don't see any reason not
to have them indexed.
Diff pages seem useless to spiders, since the same information
is already contained in the two article versions being compared.
The remaining question is: what about article histories and old
versions of articles? Do we want Google to have a copy of every
version of every article, or only the current one?
Axel