as per the WMF privacy policy. Please correct me if
I'm wrong here.
This is correct for data requested here:
On Wed, Mar 23, 2016 at 1:23 AM, Daniel Berger <berger(a)cs.uni-kl.de> wrote:
Hi everyone,
As the one who requested data for performance research/testing, I'm happy
to participate in the discussion.
The second request, by Michal, might not be about performance. I believe
Michal hasn't provided any details yet. I thought I could help Michal
by pointing out similarities to my request, but I now see that the two
requests might be quite different.
It is my goal to compile a dataset that does not include any private
data. My request essentially asks for a higher-resolution version of the
publicly available pagecounts data, and for an update to a dataset that
was made public in 2007 [1].
Specifically, the dataset would hold the same fields as the pagecounts
data, but at a higher resolution: a 1:10 sample of requests instead of
hourly counts.
In addition to the pagecounts fields, the public 2007 dataset has one
additional field "save_flag", which indicates whether the request changed a
web page. In order to compile this save_flag, three other webrequest fields
need to be accessed, as pointed out in Tim Starling's email [2]. Tim was
the one who helped compile the 2007 dataset.
In my understanding these fields do not include any "personal information"
as per the WMF privacy policy. Please correct me if I'm wrong here.
I also would like to point out that I'm asking to make this dataset public
(as opposed to giving it to only my research group). If helpful, I'd be
willing to host this dataset on my institution's web server, or in a public
AWS S3 bucket to facilitate access by the community.
I made a few updates to clarify these points in the Phabricator item, where
you can find further information:
https://phabricator.wikimedia.org/T128132
The comments on that page discuss how we can restrict the scope to only
the English Wikipedia and to individual WMF caching servers to scale down
the dataset size.
Let me know what you think.
Best,
Daniel
[1] http://www.wikibench.eu/?page_id=60
[2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
On 03/22/2016 08:55 PM, Pine W wrote:
Hi Dan,
Agreed, I think it makes sense to consider a subject-specific request for
pages that are within the scope of epidemiology, such as influenza, where
we have reason to think that there could be public health benefits in
analyzing the data and there are reasonable safeguards to protect user
anonymity.
A request for 1 month of the private data requested here, which appears to
be for all pages on all projects, is far too broadly scoped. Also, in
general, my instinct would be to deny external requests for WMF private
data for purposes of performance testing. It seems to me that the risks far
outweigh the benefits to Wikimedia, and that processing requests like these
would be a suboptimal use of WMF staff time.
Pine
On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
Pine, there are actually two separate requests and they shouldn't be
mixed. The performance-related one is research as far as I understand, and
for the other one we have no details yet. I welcome a public discussion of
either, and of course would respect any opinions held by the analytics
community at large. We have every intention to be good stewards of this
data and for what it's worth, I'm very skeptical of allowing access to
private data, unless for obviously beneficial purposes like flu
forecasting, etc.
On Tue, Mar 22, 2016 at 1:37 PM, Pine W <wiki.pine(a)gmail.com> wrote:
I'd appreciate a clarification about the purpose of this request if
Wikimedia private data is involved. If I am understanding correctly, the
purpose of this request is for access to Wikimedia private data for
assistance with 3rd party performance testing. If that is the case, I
believe that the access request for private data should simply be denied.
Pine
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics