Hi Daniel, I think your request is probably one that can be worked out in such a way that private information is sufficiently protected. The request from Michal, at least as I understand its current form, is of a much different scope. Thanks for following up.

Pine

On Wed, Mar 23, 2016 at 7:29 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
>In my understanding these fields do not include any "personal information" as per the WMF privacy policy. Please correct me if I'm wrong here.
This is correct for data requested here: https://phabricator.wikimedia.org/T128132 

On Wed, Mar 23, 2016 at 1:23 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Hi everyone,

As the one who requested data for performance research/testing, I'm happy to participate in the discussion.

The second request, by Michal, might not be about performance. I believe Michal hasn't provided any details yet. I thought I could help Michal by pointing out similarities to my request, but I now see that the two requests might be quite different.

My goal is to compile a dataset that does not include any private data. My request essentially asks for a higher-resolution version of the publicly available pagecounts data, and for an update to a dataset that was made public in 2007 [1].

Specifically, the dataset would hold the same fields as the pagecounts data, at a higher sampling rate: 1:10 instead of hourly.
In addition to the pagecounts fields, the public 2007 dataset has one additional field, "save_flag", which indicates whether the request changed a web page. To compile this save_flag, three other webrequest fields need to be accessed, as pointed out in Tim Starling's email [2]. Tim was the one who helped compile the 2007 dataset.

In my understanding these fields do not include any "personal information" as per the WMF privacy policy. Please correct me if I'm wrong here.


I would also like to point out that I'm asking to make this dataset public (as opposed to giving it only to my research group). If helpful, I'd be willing to host this dataset on my institution's web server, or in a public AWS S3 bucket, to facilitate access by the community.

I made a few updates to clarify these points in the Phabricator item, where you can find further information: https://phabricator.wikimedia.org/T128132
The comments on that page discuss how we can restrict the scope to only the English Wikipedia and to individual WMF caching servers to scale down the dataset size.


Let me know what you think.

Best,
Daniel

[1] http://www.wikibench.eu/?page_id=60
[2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408



On 03/22/2016 08:55 PM, Pine W wrote:
Hi Dan,

Agreed, I think it makes sense to consider a subject-specific request for pages that are within the scope of epidemiology, such as influenza, where we have reason to think that there could be public health benefits in analyzing the data and there are reasonable safeguards to protect user anonymity.

A request for 1 month of the private data requested here, which appears to be for all pages on all projects, is far too broadly scoped. Also, in general, my instinct would be to deny external requests for WMF private data for purposes of performance testing. It seems to me that the risks far outweigh the benefits to Wikimedia, and that processing requests like these would be a suboptimal use of WMF staff time.

Pine

On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Pine, there are actually two separate requests and they shouldn't be mixed.  The performance-related one is research as far as I understand, and for the other one we have no details yet.  I welcome a public discussion of either, and of course would respect any opinions held by the analytics community at large.  We have every intention to be good stewards of this data and, for what it's worth, I'm very skeptical of allowing access to private data, unless for obviously beneficial purposes like flu forecasting, etc.

On Tue, Mar 22, 2016 at 1:37 PM, Pine W <wiki.pine@gmail.com> wrote:

I'd appreciate a clarification about the purpose of this request if Wikimedia private data is involved. If I am understanding correctly, the purpose of this request is for access to Wikimedia private data for assistance with 3rd party performance testing. If that is the case, I believe that the access request for private data should simply be denied.

Pine


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


