Hi everyone,
As the one who requested data for performance research/testing, I'm
happy to participate in the discussion.
The second request, by Michal, might not be about performance. I believe
Michal hasn't provided any details yet. I thought I could help
Michal by pointing out similarities to my request, but I now see that
the two requests might be quite different.
It is my goal to compile a dataset that does not include any private
data. My request essentially asks for a higher-resolution version of the
publicly available pagecounts data, and for an update to a dataset that
was made public in 2007 [1].
Specifically, the dataset would hold the same fields as the pagecounts
data, but at a higher resolution: a 1:10 sample of requests instead of
hourly aggregates.
In addition to the pagecounts fields, the public 2007 dataset has one
additional field "save_flag", which indicates whether the request
changed a web page. In order to compile this save_flag, three other
webrequest fields need to be accessed, as pointed out in Tim Starling's
email [2]. Tim was the one who helped compile the 2007 dataset.
In my understanding these fields do not include any "personal
information" as per the WMF privacy policy. Please correct me if I'm
wrong here.
I also would like to point out that I'm asking to make this dataset
public (as opposed to giving it to only my research group). If helpful,
I'd be willing to host this dataset on my institution's web server, or in
a public AWS S3 bucket to facilitate access by the community.
I made a few updates to clarify these points in the Phabricator item,
where you can find further information:
https://phabricator.wikimedia.org/T128132
The comments on that page discuss how we can restrict the scope to only
the English Wikipedia and to individual WMF caching servers to scale
down the dataset size.
Let me know what you think.
Best,
Daniel
[1]
http://www.wikibench.eu/?page_id=60
[2]
http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
On 03/22/2016 08:55 PM, Pine W wrote:
Hi Dan,
Agreed, I think it makes sense to consider a subject-specific request
for pages that are within the scope of epidemiology, such as
influenza, where we have reason to think that there could be public
health benefits in analyzing the data and there are reasonable
safeguards to protect user anonymity.
A request for 1 month of the private data requested here, which
appears to be for all pages on all projects, is far too broadly
scoped. Also, in general, my instinct would be to deny external
requests for WMF private data for purposes of performance testing. It
seems to me that the risks far outweigh the benefits to Wikimedia, and
that processing requests like these would be a suboptimal use of WMF
staff time.
Pine
On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu
<dandreescu(a)wikimedia.org> wrote:
Pine, there are actually two separate requests and they shouldn't
be mixed. The performance-related one is research as far as I
understand, and for the other one we have no details yet. I welcome a
public discussion of either, and of course would respect any
opinions held by the analytics community at large. We have every
intention to be good stewards of this data and for what it's
worth, I'm very skeptical of allowing access to private data,
unless for obviously beneficial purposes like flu forecasting, etc.
On Tue, Mar 22, 2016 at 1:37 PM, Pine W <wiki.pine(a)gmail.com>
wrote:
I'd appreciate a clarification about the purpose of this
request if Wikimedia private data is involved. If I am
understanding correctly, the purpose of this request is for
access to Wikimedia private data for assistance with 3rd party
performance testing. If that is the case, I believe that the
access request for private data should simply be denied.
Pine
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics