Hi! And thanks for the question. The pageview hourly dataset includes
sensitive data and our policy does not allow moving it outside servers we
manage. To work with it, you would have to apply for a formal
collaboration via
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
I do wonder if for cases like this we could establish some kind of lighter
weight process whereby you practice on some sample data and then submit a
proposal for a data dump for public review. Once it's reviewed by enough
people, which could take a while, we could in theory just run the code and
publish the data somewhere. I'll talk to my team about this later today
and write back here. This would only work if the results of the query
preserved the anonymity of our users, but I think DDoS research should
probably fall in that category.
On Thu, Sep 30, 2021 at 05:30 Charel Felten via Analytics <
analytics(a)lists.wikimedia.org> wrote:
Dear Wikimedia analytics team,
We are three master's students from Vrije Universiteit Amsterdam (VU) and
University of Amsterdam (UVA) doing a large-scale data engineering project
about detecting DDoS attacks on Wikipedia by analysing page views and
traffic and trying to distinguish e.g. DDoS attacks from trending topics.
For this project, we need a lot of data. We found two sources of public
data, Pageview complete (
https://dumps.wikimedia.org/other/pageview_complete/) and the filtered
version thereof (
https://dumps.wikimedia.org/other/pageviews/). While
these dumps are already quite useful, we also found that there is a dataset
with even more information (
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_ho…).
In particular, it contains the country a pageview came from and the referer,
which could both be very useful for our project.
According to the above page, this dataset has been private since
2018. We would like to ask whether it is possible to have access to this
dataset for our research, or any other extended version of the public dump,
which would enable us to do more in-depth research. We have our own cluster
so we could work on a copy of the data. Moreover, we would like to share our
project and all our results with you to help contribute to your security
measures.
Best regards,
Charel Felten, Gilles Magalhaes and Aleksander Janczewski