Andre Engels wrote:
In my discussion with Ilse (which led to my recent request to reduce the
put_throttle), we also got to the subject of XML feeds. I mentioned that
Yahoo was already getting one, and my contact at Ilse said he would be
interested in one as well. Thus my questions:
* Would sending out XML feeds to other parties with a good reason for interest
be a good idea, or would it be detrimental? (For example, how does the server
load of sending out an XML feed compare to that of being spidered by a search
engine?)
* If the answer is positive, whom can my contact or I get in touch with to
discuss the possibility?
Andre Engels
I would imagine that an XML feed would allow a significant reduction in
load, or a significant increase in freshness for the same level of load,
if it was used widely by spider operators.
Each spider that simply spiders the site repetitively will have to hit
every page with at least a last-modification-time request to get up to
date, and then download all the pages that have changed. The more
up-to-date a spider operator wants to be, the more load they need to
generate.
For example, for 500,000 articles, keeping up to date within a day makes
for 500,000 hits per day, or about 6 hits per second 24/7, just to check
the timestamps. Then the updated articles (about 1000 to 3000) would still
need to be downloaded each day.
On the other hand, if they switch to using an XML feed, they can be
up-to-date to within an hour or so, and only download the pages that
have changed: perhaps 1000 to 3000 hits per day. This only corresponds
to one hit every 30 to 90 seconds. The overhead of polling the XML feed
every hour or so would be negligible.
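
As a rough sanity check on the arithmetic above, here is a small Python
sketch. The article and change counts are the estimates from this message;
the hourly poll (24 feed requests per day) and all function and variable
names are my own assumptions for illustration, not anything that exists on
the servers.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def hits_per_second(hits_per_day):
    """Average request rate if the hits are spread evenly over a day."""
    return hits_per_day / SECONDS_PER_DAY


def seconds_between_hits(hits_per_day):
    """Average gap between requests, in seconds."""
    return SECONDS_PER_DAY / hits_per_day


# Estimates taken from the discussion above.
ARTICLES = 500_000
CHANGED_LOW, CHANGED_HIGH = 1_000, 3_000

# Assumption: the feed is polled once an hour.
FEED_POLLS_PER_DAY = 24

# Plain spidering: one timestamp check per article, plus the changed pages.
spider_hits = ARTICLES + CHANGED_HIGH
# Feed-driven: the feed polls, plus the changed pages.
feed_hits = FEED_POLLS_PER_DAY + CHANGED_HIGH

if __name__ == "__main__":
    print(f"timestamp checks alone: {hits_per_second(ARTICLES):.2f} hits/s")
    print(f"spider total:           {spider_hits} hits/day")
    print(f"feed total:             {feed_hits} hits/day")
    print(f"gap at {CHANGED_HIGH} changes/day: "
          f"{seconds_between_hits(CHANGED_HIGH):.1f} s")
    print(f"gap at {CHANGED_LOW} changes/day: "
          f"{seconds_between_hits(CHANGED_LOW):.1f} s")
```

Even at the high end of the change estimate, the feed-driven total is well
under 1% of the spidering total, and that is per spider operator.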
We should consider adding a bit of filtering to the XML feed, so that
users can select their degree of granularity: for example, allowing them
to be notified of every single change, or only the last change in an
hour, a day, or other time period. Programmable hysteresis-based
filtering would also be interesting, to suppress notification until the
end of a "burst" of editing on a page: for example "1 hour after the
first edit since the last notification, or 10 minutes since the most
recent edit, whichever is earlier". With the right tuning this could try
to make sure that articles were "stable" when notification was sent via
the XML feed.
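
To make the hysteresis rule concrete, here is a minimal Python sketch of
how the notification times for one page could be worked out from its edit
times. It is only an illustration of the rule quoted above: the function
name, the parameter names, and the choice to treat an edit arriving exactly
at the deadline as the start of a new burst are all my assumptions.

```python
def notification_times(edit_times, max_delay=3600, quiet=600):
    """Return the times (in seconds) at which notifications would be sent.

    A notification for a burst of edits is sent at whichever comes first:
      * max_delay seconds after the first edit since the last notification
      * quiet seconds after the most recent edit
    """
    notifications = []
    first = None      # first edit since the last notification
    deadline = None   # when the pending notification is due

    for t in sorted(edit_times):
        # The pending notification fired before this edit arrived,
        # so this edit starts a new burst.
        if deadline is not None and t >= deadline:
            notifications.append(deadline)
            first = None
        if first is None:
            first = t
        # Each new edit pushes the quiet-period deadline back, but the
        # deadline can never move past first + max_delay.
        deadline = min(first + max_delay, t + quiet)

    # Flush the notification for the final burst.
    if deadline is not None:
        notifications.append(deadline)
    return notifications
```

With the defaults, edits at 0, 300 and 700 seconds give one notification
at 1300 seconds (10 minutes after the last edit), while a page edited
every 5 minutes without pause is still reported once an hour.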
-- Neil