After reviewing a week's worth of data for the commons terms A/B test, we have decided that we have not collected enough information. The initial sampling was:

1:1000 users chosen to participate in test

Those users split into 6 buckets, giving each bucket a 1:6000 sampling

This has collected ~100 events per bucket, and far fewer in the "strict" bucket.

We are increasing the main sampling by 5x, to 1:200. This will give each bucket a 1:1200 sampling of users. The buckets collect so little data because quite a few queries don't meet the minimum requirements to be affected by the tests: the "aggressive recall" test requires at least 3 words in the query, and the "strict" test requires at least 6 words in the query.
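
For anyone who wants to check the arithmetic, here is a rough Python sketch of the sampling rates and the word-count requirement described above. It is not the test's actual code. The function names, the MIN_WORDS table, and counting words by splitting on whitespace are all assumptions made for illustration; only the 1:1000 and 1:200 rates, the 6 buckets, and the 3-word and 6-word minimums come from this message.

```python
from fractions import Fraction

NUM_BUCKETS = 6  # users in the test are split evenly into 6 buckets

# Minimum query length per test, as stated above. Buckets not listed here
# are assumed (for this sketch only) to have no minimum.
MIN_WORDS = {"aggressive recall": 3, "strict": 6}


def per_bucket_rate(one_in_n: int, buckets: int = NUM_BUCKETS) -> Fraction:
    """Fraction of all users that land in a single bucket."""
    return Fraction(1, one_in_n) / buckets


def is_affected(query: str, bucket: str) -> bool:
    """True if the query has enough words for the bucket's test to apply.

    Assumption: words are counted by whitespace splitting; the real
    tokenization may differ.
    """
    return len(query.split()) >= MIN_WORDS.get(bucket, 0)


old_rate = per_bucket_rate(1000)  # initial sampling: 1:6000 per bucket
new_rate = per_bucket_rate(200)   # increased sampling: 1:1200 per bucket
increase = new_rate / old_rate    # 5x more users per bucket
```

With these definitions, a three-word query would count toward "aggressive recall" but not toward "strict", which is consistent with the "strict" bucket collecting the fewest events.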

Erik B.