Hi Alexander!

This indeed seems like an interesting project. Responding to your suggestions:

> First, I am ready to collaborate with you on making this data available as other researchers have done in the past. I would appreciate if you let me know which steps I need to take in order to work with you on this task.

I'd suggest you apply for a research project here[1]. The research team will discuss the project with you, and if it gets approved, you can sign an NDA and get access to the raw data. You can also apply for a grant here[2].
 
> Second, you can consider making this data available after achieving the necessary level of confidentiality. For example, you can group request types so that each group has at least 1000 unique IP-addresses.

There are a couple of tasks[3] in our backlog about effectively anonymizing the pageview data for general-purpose use. We used an algorithm similar to what you proposed. Our experience, though, is that general-purpose anonymization is a non-trivial task. We plan to work on this in the mid-term (we have actually already started, see the tasks), but we have other priorities for the next quarter. Again, I'd suggest that you apply for a specific project for the needs of your study here[1][2].
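
To make the thresholding idea concrete, here is a minimal sketch of the grouping you proposed. It is only an illustration and not the algorithm from the tasks above: the record layout (ip, country, language, category) and the function name are assumptions, and suppressing small groups is on its own not enough for general-purpose anonymization.

```python
from collections import defaultdict

# Threshold taken from the proposal in this thread.
MIN_UNIQUE_IPS = 1000


def aggregate_with_threshold(requests, min_unique_ips=MIN_UNIQUE_IPS):
    """Count pageviews per (country, language, category) group and keep
    only the groups seen from at least `min_unique_ips` distinct IPs.

    `requests` is an iterable of (ip, country, language, category) tuples.
    This schema is hypothetical, chosen only for the example.
    """
    ips_by_group = defaultdict(set)
    views_by_group = defaultdict(int)

    for ip, country, language, category in requests:
        key = (country, language, category)
        ips_by_group[key].add(ip)
        views_by_group[key] += 1

    # Groups below the threshold are dropped entirely, so no IP
    # information leaves this function, only aggregate counts.
    return {
        key: views_by_group[key]
        for key, ips in ips_by_group.items()
        if len(ips) >= min_unique_ips
    }
```

With a threshold of 2, for example, a group with three views from two distinct IPs would be kept, while a group with a single view from one IP would be dropped.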

Another challenge, I guess, would be categorizing the articles as educational or entertainment. Wikipedia's categories are a cool way to browse, but not an exact way of clustering content. And I guess the boundary between educational and entertainment can sometimes be fuzzy, no? A very interesting challenge anyway.

cheers!

[1] https://meta.wikimedia.org/wiki/Research:New_project
[2] https://meta.wikimedia.org/wiki/Grants:Project
[3]
https://phabricator.wikimedia.org/T114675
https://phabricator.wikimedia.org/T118839
https://phabricator.wikimedia.org/T118838
https://phabricator.wikimedia.org/T118841


On Wed, Dec 14, 2016 at 5:02 PM, Alexander Ugarov <augarov@email.uark.edu> wrote:
Dear members of the Analytics Team!

Please consider my request for information or collaboration. I am conducting a research project on the international determinants of education quality. In my view, Wikimedia statistics are a priceless source of information on how much learning people do outside of educational institutions.

I would like to access data on Wikipedia pageviews by country, language, and content area to measure private learning in different countries. My previous empirical results suggest that Wikipedia pageviews are highly correlated with education quality. Unfortunately, the available data does not allow separating educational pageviews from pure entertainment pageviews (for example, celebrity biographies).

I am aware that this data is currently not part of the publicly available dataset. Please consider two options. First, I am ready to collaborate with you on making this data available as other researchers have done in the past. I would appreciate if you let me know which steps I need to take in order to work with you on this task. Second, you can consider making this data available after achieving the necessary level of confidentiality. For example, you can group request types so that each group has at least 1000 unique IP-addresses.

I look forward to hearing from you about my opportunities to use this data. I think it would be very interesting to know how much people learn from Wikipedia, for example, in India versus Brazil and Egypt. Do people in Indonesia learn less than people in Germany due to poor-quality school systems or low private incentives for learning? I am also sure that many social scientists would benefit from such information (if you make it available) and would produce policy-relevant research.


Best regards,
Alexander Ugarov,
Ph. D. Candidate.
Sam M. Walton College of Business
Department of Economics
University of Arkansas
Office: ECOB260
E-mail: augarov@uark.edu.

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics




--
Marcel Ruiz Forns
Analytics Developer
Wikimedia Foundation