Last week we started a new A/B test[1] comparing the existing completion suggestions against a new completion suggestion API.  The bucketing is very simple: 1 in 10000 users goes into the test bucket, and a further 1 in 10000 users goes into the control bucket, like so:

  • function oneIn( populationSize ) {
  •     return Math.floor( Math.random() * populationSize ) === 0;
  • }

  • if ( oneIn( 10000 ) ) {
  •     // test bucket
  • } else if ( oneIn( 10000 ) ) {
  •     // control bucket
  • } else {
  •     return; // rejected
  • }
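
As a sanity check on the bucket sizes, here is a minimal sketch of the expected share of page loads per bucket. It assumes the two `oneIn` draws are independent and that `Math.random()` is uniform; `bucketProbabilities` is a hypothetical helper for illustration, not part of the deployed code.

```javascript
// Hypothetical helper: expected fraction of page loads landing in each bucket,
// assuming the two oneIn( n ) draws are independent and uniform.
function bucketProbabilities( n ) {
    var test = 1 / n;
    // The control draw is only reached when the test draw failed.
    var control = ( 1 - test ) * ( 1 / n );
    return {
        test: test,
        control: control,
        rejected: 1 - test - control
    };
}

var p = bucketProbabilities( 10000 );
console.log( p.test, p.control, p.rejected );
```

Because the second draw only happens after the first one fails, the control bucket is marginally smaller than the test bucket (0.9999 in 10000 rather than 1 in 10000), which is negligible for this test.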

On every page load we generate a random 64-bit number via `mw.user.generateRandomSessionId()`.  This is used to correlate events that were performed by the same user on the same page, and it is logged with all our events as event_pageId.  In older tests (this was turned off September 3rd) using this same event_pageId scheme, roughly 0.36% of event_pageId values (1500 of 412604) came from multiple IP addresses, which seems sane and normal:

  • mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count) from (select count(distinct clientIp) as count from TestSearchSatisfaction_12423691 group by event_pageId) x group by count;                                           
  • +-------+--------------+
  • | count | count(count) |
  • +-------+--------------+
  • |     1 |       411104 |
  • |     2 |         1500 |
  • +-------+--------------+
  • 2 rows in set (3.11 sec)
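
For scale, a rough birthday-bound estimate suggests that accidental ID collisions should not contribute to these numbers at all, if the IDs really are uniformly random 64-bit values. That uniformity is an assumption here, not something we have verified, and `expectedCollidingPairs` is a hypothetical helper.

```javascript
// Rough birthday-bound estimate of accidental ID collisions.
// Assumes IDs are uniformly random over 2^bits values (an idealisation).
function expectedCollidingPairs( idCount, bits ) {
    // There are n * ( n - 1 ) / 2 pairs, each colliding with probability 2^-bits.
    return ( idCount * ( idCount - 1 ) / 2 ) / Math.pow( 2, bits );
}

// 411104 + 1500 distinct event_pageId values in the table above.
console.log( expectedCollidingPairs( 411104 + 1500, 64 ) );
```

Under that assumption the expected number of colliding pairs is far below one, so the IDs seen from two IP addresses are more plausibly users whose address changed mid-page than two users sharing an ID.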

On the test we just started, though, we are seeing 48% of event_pageId values being reported by multiple IP addresses. We can't find any explanation for why this has changed so much, and as a result we are uncertain whether we can rely on the other data collected by this same test.

  • mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count) from (select count(distinct clientIp) as count from CompletionSuggestions_13424343 group by event_pageId) x group by count;      
  • +-------+--------------+
  • | count | count(count) |
  • +-------+--------------+
  • |     1 |         1176 |
  • |     2 |          243 |
  • |     3 |          254 |
  • |     4 |          212 |
  • |     5 |          143 |
  • |     6 |          102 |
  • |     7 |           64 |
  • |     8 |           36 |
  • |     9 |           16 |
  • |    10 |           14 |
  • |    11 |            8 |
  • |    12 |            5 |
  • +-------+--------------+
  • 12 rows in set (0.03 sec)
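
The percentages quoted here can be reproduced from the two histograms. This is a small sketch using the counts copied from the query results; `multiIpShare` is a hypothetical helper name.

```javascript
// Histogram copied from the CompletionSuggestions_13424343 query above:
// index i holds the number of event_pageId values seen from i + 1 distinct IPs.
var histogram = [ 1176, 243, 254, 212, 143, 102, 64, 36, 16, 14, 8, 5 ];

// Hypothetical helper: fraction of IDs reported by more than one IP.
function multiIpShare( counts ) {
    var total = counts.reduce( function ( a, b ) {
        return a + b;
    }, 0 );
    return ( total - counts[ 0 ] ) / total;
}

// New test: 1097 of 2273 IDs have more than one IP.
console.log( ( multiIpShare( histogram ) * 100 ).toFixed( 1 ) + '%' ); // 48.3%
// Older TestSearchSatisfaction_12423691 data: 1500 of 412604 IDs.
console.log( ( multiIpShare( [ 411104, 1500 ] ) * 100 ).toFixed( 2 ) + '%' ); // 0.36%
```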

We have a third schema in production that has been collecting events the entire time.  It seems to have started showing this issue on September 10th, which lines up with a Thursday train deployment:
    
  • mysql:research@analytics-store.eqiad.wmnet [log]> select date, MAX(count) from (select substr(timestamp, 1, 8) as date, count(distinct clientIp) as count from TestSearchSatisfaction2_13223897 group by substr(timestamp, 1, 8), event_pageId) x group by date;
  • +----------+------------+
  • | date     | MAX(count) |
  • +----------+------------+
  • | 20150902 |          1 |
  • | 20150903 |          2 |
  • | 20150904 |          2 |
  • | 20150905 |          4 |
  • | 20150906 |          3 |
  • | 20150907 |          3 |
  • | 20150908 |          3 |
  • | 20150909 |          3 |
  • | 20150910 |         11 |
  • | 20150911 |         12 |
  • | 20150912 |         14 |
  • | 20150913 |         18 |
  • | 20150914 |         13 |
  • +----------+------------+
  • 13 rows in set (1.74 sec)
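
To make the onset date explicit, here is a small sketch that finds the first day whose maximum clearly exceeds everything seen before it, and reports that day's weekday. The helpers `firstJump` and `weekdayUtc` are hypothetical, and the factor of 2 is an arbitrary threshold chosen for illustration.

```javascript
// Daily MAX(count) values copied from the query result above.
var dailyMax = [
    [ '20150902', 1 ], [ '20150903', 2 ], [ '20150904', 2 ], [ '20150905', 4 ],
    [ '20150906', 3 ], [ '20150907', 3 ], [ '20150908', 3 ], [ '20150909', 3 ],
    [ '20150910', 11 ], [ '20150911', 12 ], [ '20150912', 14 ], [ '20150913', 18 ],
    [ '20150914', 13 ]
];

// Hypothetical helper: first day whose value exceeds `factor` times the
// largest value seen on any earlier day, or null if there is none.
function firstJump( rows, factor ) {
    var maxSoFar = rows[ 0 ][ 1 ];
    for ( var i = 1; i < rows.length; i++ ) {
        if ( rows[ i ][ 1 ] > factor * maxSoFar ) {
            return rows[ i ][ 0 ];
        }
        maxSoFar = Math.max( maxSoFar, rows[ i ][ 1 ] );
    }
    return null;
}

// Hypothetical helper: weekday name (UTC) for a YYYYMMDD string.
function weekdayUtc( yyyymmdd ) {
    var names = [ 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday' ];
    var date = new Date( Date.UTC(
        Number( yyyymmdd.slice( 0, 4 ) ),
        Number( yyyymmdd.slice( 4, 6 ) ) - 1,
        Number( yyyymmdd.slice( 6, 8 ) )
    ) );
    return names[ date.getUTCDay() ];
}

var onset = firstJump( dailyMax, 2 );
console.log( onset, weekdayUtc( onset ) ); // 20150910 Thursday
```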

Does anyone have any ideas for where this change could have come from?