> >Another option to this thread would be: cancelling the convention and
> >continue working on regexps
> I think regardless of our convention we will always be doing regex
> detection of self-identified bots. Maybe I am missing some nuance here?
No, no, Nuria, you're right. I meant that we should continue to improve the
regexps and the other means we have to identify bots. I didn't mean to imply
that we should stop doing regexps if we establish the convention.
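To illustrate how the convention and the regexps could coexist, here is a minimal sketch. It is an assumption for illustration only, not the code running on the analytics cluster: the function names and the fallback patterns are hypothetical.

```python
import re

# Hypothetical fallback patterns for clients that do not follow the
# convention. These are illustrative assumptions, not the regexps
# actually used by the Analytics team.
FALLBACK_PATTERNS = re.compile(r"(spider|crawler|curl|python-requests)", re.IGNORECASE)


def follows_convention(user_agent: str) -> bool:
    """Return True if the User-Agent contains "bot", case-insensitively."""
    return "bot" in user_agent.lower()


def is_self_identified_bot(user_agent: str) -> bool:
    """Check the simple convention first, then fall back to the regexps."""
    return follows_convention(user_agent) or bool(FALLBACK_PATTERNS.search(user_agent))
```

The convention check is a plain substring test, so it stays cheap; the regexp only matters for clients that have not adopted the policy.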
On Mon, Feb 1, 2016 at 7:44 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> >In the past, the Analytics team also considered enforcing the convention
> >by blocking those bots that don't follow it. And that is still an option to
> >consider.
> I would like to point out that I think this is probably the prerogative of
> api's team rather than analytics.
>
> >Another option to this thread would be: cancelling the convention and
> >continue working on regexps
> I think regardless of our convention we will always be doing regex
> detection of self-identified bots. Maybe I am missing some nuance here?
>
>
>
>
>
> On Mon, Feb 1, 2016 at 10:42 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> >It will take time for frameworks to implement an amended User-Agent
>> >policy.
>> >For example, pywikipedia (pywikibot compat) is not actively
>> >maintained.
>> That doesn't imply we shouldn't have a policy that anyone can refer to;
>> these bots will not follow it until they get some maintainers.
>>
>> >There was a task filed against Analytics for this, but Dan Andreescu
>> >removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>>
>> Sorry that the tagging is confusing. I think the Analytics tag was removed
>> because this is a request for data and our team doesn't do data retrieval. We
>> normally tag with "analytics" the Phabricator items that have actionables for
>> our team.
>> I am cc-ing Bryan, who has already done some analysis on bot requests to
>> the API and can probably provide some data.
>>
>>
>>
>>
>> On Mon, Feb 1, 2016 at 6:39 AM, John Mark Vandenberg <jayvdb(a)gmail.com>
>> wrote:
>>
>>> Hi Marcel,
>>>
>>> It will take time for frameworks to implement an amended User-Agent
>>> policy.
>>> For example, pywikipedia (pywikibot compat) is not actively
>>> maintained. We don't know how much traffic is generated by compat.
>>> There was a task filed against Analytics for this, but Dan Andreescu
>>> removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>>>
>>> There are a lot of clients that need to be upgraded or be
>>> decommissioned for this 'add bot' strategy to be effective in the near
>>> future. See https://www.mediawiki.org/wiki/API:Client_code
>>>
>>> The all-important missing step is
>>>
>>> 3. Create a plan to block clients that don't implement the (amended)
>>> User-Agent policy.
>>>
>>> Without that plan, successfully implemented, you will not get quality
>>> data (i.e. using 'Netscape' in the U-A to guess 'human' would perform
>>> better).
>>>
>>> On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
>>> wrote:
>>> > So, trying to join everyone's points of view, what about the following?
>>> >
>>> > 1. Use the existing https://meta.wikimedia.org/wiki/User-Agent_policy and
>>> > modify it to encourage adding the word "bot" (case-insensitive) to the
>>> > User-Agent string, so that it can be easily used to identify bots in the
>>> > analytics cluster (no regexps). And link that page from whatever other
>>> > pages we think necessary.
>>> >
>>> > 2. Do some advertising and outreach and get some bot maintainers and maybe
>>> > some frameworks to implement the User-Agent policy. This would make the
>>> > existing policy less useless.
>>> >
>>> > Thanks all for the feedback!
>>> >
>>> > On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
>>> > wrote:
>>> >>>
>>> >>> Clearly Wikipedia et al. uses bot to refer to automated software that
>>> >>> edits the site but it seems like you are using the term bot to refer
>>> >>> to all automated software and it might be good to clarify.
>>> >>
>>> >>
>>> >> OK, in the documentation we can make that clear. And looking into that,
>>> >> I've seen that some bots, in the process of doing their "editing" work,
>>> >> can also generate pageviews. So we should also include them as a potential
>>> >> source of pageview traffic. Maybe we can reuse the existing User-Agent
>>> >> policy.
>>> >>
>>> >>
>>> >>> This makes a lot of sense. If I build a bot that crawls wikipedia and
>>> >>> facebook public pages it really doesn't make sense that my bot has a
>>> >>> "wikimediaBot" user agent, just the word "Bot" should probably be
>>> >>> enough.
>>> >>
>>> >>
>>> >> Totally agree.
>>> >>
>>> >>
>>> >>> I guess a bigger question is why try to differentiate between "spiders"
>>> >>> and "bots" at all?
>>> >>
>>> >>
>>> >> I don't think we need to differentiate between "spiders" and "bots". The
>>> >> most important question we want to answer is: how much of the traffic we
>>> >> consider "human" today is actually "bot". So, +1 "bot"
>>> >> (case-insensitive).
>>> >>
>>> >>
>>> >> On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg <jayvdb(a)gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" <mforns(a)wikimedia.org>
>>> >>> wrote:
>>> >>> >>
>>> >>> >> Why not just "Bot", or "MediaWikiBot" which at least encompasses all
>>> >>> >> sites that the client can communicate with.
>>> >>> >
>>> >>> > I personally agree with you, "MediaWikiBot" seems to have better
>>> >>> > semantics.
>>> >>>
>>> >>> For clients accessing the MediaWiki api, it is redundant.
>>> >>> All it does is identify bots that comply with this edict from
>>> analytics.
>>> >>>
>>> >>> --
>>> >>> John Vandenberg
>>> >>>
>>> >>>
>>> >>> _______________________________________________
>>> >>> Analytics mailing list
>>> >>> Analytics(a)lists.wikimedia.org
>>> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Marcel Ruiz Forns
>>> >> Analytics Developer
>>> >> Wikimedia Foundation
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Marcel Ruiz Forns
>>> > Analytics Developer
>>> > Wikimedia Foundation
>>> >
>>>
>>>
>>>
>>> --
>>> John Vandenberg
>>>
>>
>>
>
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation