I confirmed this on IRC, but I'm repeating it here for the archives.  I'm also convinced that the client IP hashing bug we just found explains this problem.  It was good that we looked at the other problems, but the main one seems to be the IP hashing.  We'll brain bounce more tomorrow on how to fix that.

On Tue, Sep 15, 2015 at 6:23 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Update: I read Dan's thread about hashing, read this thread, and a
penny dropped ;).

This is totally explainable by the fact that we /expect/ to see
multiple pageIDs per IP. And we are! The hashing problem just means
those aren't /appearing/ to be the same IP.
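
To make that concrete, here is a minimal sketch in Python. It is not the
actual pipeline code: hash_ip, the salts, the sample address and the token
are all made up for illustration, and I'm assuming the bug amounts to the
same IP being hashed with a salt that varies between events, which may not
be exactly what Dan's thread describes.

```python
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Keyed hash of an IP address (hypothetical stand-in for the real hashing)."""
    return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()


def distinct_counts(events):
    """Return (distinct hashed IPs, distinct tokens) for (hashed_ip, token) pairs."""
    ips = {hashed_ip for hashed_ip, _ in events}
    tokens = {token for _, token in events}
    return len(ips), len(tokens)


# One user, one page view token, three logged events from the same address.
ip = "192.0.2.10"   # documentation-range address, not real data
token = "token-a"   # stands in for the 64-bit pageViewToken

# Assumption: the salt differs from event to event, so one IP hashes three ways.
buggy = [(hash_ip(ip, salt), token) for salt in (b"s1", b"s2", b"s3")]

# With a single shared salt the same three events collapse to one hashed IP.
fixed = [(hash_ip(ip, b"s1"), token) for _ in range(3)]

print(distinct_counts(buggy))  # (3, 1)
print(distinct_counts(fixed))  # (1, 1)
```

Under that assumption a single token shows up with several "distinct"
clientIp values, which is the same shape as the 42-versus-13 counts below.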

On 15 September 2015 at 18:05, Erik Bernhardson
<ebernhardson@wikimedia.org> wrote:
> We've deployed the change to bucketing, but we are still seeing the same
> issue in the collected data.
>
> Again, we are generating a unique 64-bit random number when the user gets to
> the page. We are seeing this same 64-bit number being reported by
> multiple IP addresses.
>
> Since deploying the new schema number with the updated bucket selection we
> have seen 13 distinct tokens coming from 42 distinct ip addresses. This
> shouldn't be possible.
>
> mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct clientIp) from CompletionSuggestions_13630018;
> +--------------------------+
> | count(distinct clientIp) |
> +--------------------------+
> |                       42 |
> +--------------------------+
> 1 row in set (0.00 sec)
>
> mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_pageViewToken) from CompletionSuggestions_13630018;
>
> +-------------------------------------+
> | count(distinct event_pageViewToken) |
> +-------------------------------------+
> |                                  13 |
> +-------------------------------------+
> 1 row in set (0.00 sec)
>
>
>
> My best guess at this point is that something has changed in the way these
> clientIp values are collected, and that they are now incorrect.
>
>
> On Mon, Sep 14, 2015 at 1:32 PM, Erik Bernhardson
> <ebernhardson@wikimedia.org> wrote:
>>
>> Thanks for taking a look over this. I've incorporated your suggestions
>> into a patch[1] and if all looks good will send that out in SWAT. We should
>> be able to look at the data collected overnight and see if things are more
>> sane tomorrow.
>>
>> [1] https://gerrit.wikimedia.org/r/#/c/238306/
>>
>> On Mon, Sep 14, 2015 at 11:56 AM, Gergo Tisza <gtisza@wikimedia.org>
>> wrote:
>>>
>>> You are queueing a logging callback every time a request is sent (which
>>> is roughly every time the user types another character in the search box)
>>> until the tracking module finishes loading and mw.searchSuggest.request is
>>> restored. On a slow connection the user might type several characters and
>>> trigger several log events by then. If you filter for queries from the same
>>> non-unique IP, you will probably see something like "a", "ab", "abc"...
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>



--
Oliver Keyes
Count Logula
Wikimedia Foundation
