I did some data gathering last fall that is more or less the same as
what Claudia is asking about. Looking up the bot flag or checking the
username is often regarded as a reasonable way of filtering out the
bots. I chose to apply both: if there's no bot flag, we look for a
typical bot signature in the username (regex: "bot$| ", username
either ends with bot or a part of it does), using a
case-insensitive match since some users have usernames like "FoObOt".
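
A minimal sketch of the two-step check described above, in Python. The
pattern here is an assumed stand-in rather than the exact regex quoted
above: it matches "bot" at the end of any word in the username. The
function name and the has_bot_flag parameter are hypothetical; the flag
would come from wherever you look up the account's groups.

```python
import re

# Assumed pattern (not the original regex): "bot" at the end of any word
# in the username, matched case-insensitively so "FoObOt" is caught too.
BOT_NAME_RE = re.compile(r"bot\b", re.IGNORECASE)


def looks_like_bot(username, has_bot_flag=False):
    """Return True if the account is flagged or has a bot-like name."""
    # Step 1: trust the bot flag when it is set.
    if has_bot_flag:
        return True
    # Step 2: otherwise fall back to the username heuristic.
    return BOT_NAME_RE.search(username) is not None
```

A name-based check like this will misclassify some accounts either way,
so it is only a heuristic on top of the flag.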
Checking the edit history to find when interwiki links were first
added can be time-consuming if the page has had lots of activity. I
therefore chose to use a binary search, halving the distance between
two test points until either the actual edit is found or we're down
to so few edits that they can all be fetched efficiently through the API
(e.g. using Pywikibot's PreloadingGenerator). Otherwise you might be
examining thousands of edits for no reason.
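
The search described above could be sketched roughly like this. All
names are hypothetical: has_links(i) stands in for whatever fetches
revision i (0 = oldest) and checks its text for interwiki links, and
batch_size is the point at which we stop halving and just look at every
remaining revision. The sketch assumes that once the links are added
they stay in all later revisions; reverts or blanking in between can
break that assumption and would need extra checks.

```python
def first_revision_with_links(n_revisions, has_links, batch_size=50):
    """Return the index of the first revision where has_links(i) is True.

    Revisions are indexed 0 (oldest) to n_revisions - 1 (newest).
    Returns None if even the newest revision has no links.
    """
    if n_revisions == 0 or not has_links(n_revisions - 1):
        return None

    # Invariant: has_links(hi) is True and the answer lies in [lo, hi].
    lo, hi = 0, n_revisions - 1
    while hi - lo + 1 > batch_size:
        mid = (lo + hi) // 2
        if has_links(mid):
            hi = mid        # links already present: look earlier
        else:
            lo = mid + 1    # no links yet: look later

    # Few enough revisions left: check them all, oldest first.
    # (In practice this window would be fetched in one batch.)
    for i in range(lo, hi + 1):
        if has_links(i):
            return i
    return None
```

For a page with a few thousand revisions this needs only a handful of
single-revision lookups before the final batch, instead of walking the
whole history.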
Having Toolserver access simplifies the process a lot since all the
metadata is more easily accessible, but the revision text will still
have to be grabbed from the API.
Hope some of this helps. Let me know if there are any questions.
Cheers,
Morten
On 8 May 2012 08:39, Bináris <wikiposta(a)gmail.com> wrote:
> 2012/5/8 Merlijn van Deen <valhallasw(a)arctus.nl>
>> This is not completely true - the bot flag is also a property of the
>> user account. You can query e.g.
>> http://nl.wikipedia.org/w/index.php?title=Speciaal:Gebruikerslijst&offs…
> Yes, that's true. And if you want to be quite accurate, you must also
> determine the date of acquiring the bot flag from bureau logs and compare it
> to the page history. :-)
> --
> Bináris