Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when you launch the hive client saying that it is being replaced with
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
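If you script your queries, here is a minimal sketch (my own, not from the
wiki page) of driving the wrapper non-interactively from Python. The
database and table names are placeholders, and it assumes the wrapper passes
standard beeline flags through unchanged:

    import subprocess

    # Placeholder database/table; substitute your own query.
    query = "SELECT page_title, COUNT(*) AS hits FROM my_db.my_table GROUP BY page_title LIMIT 10;"

    # -e runs a single query; tsv2 output is easy to parse downstream.
    result = subprocess.run(
        ["beeline", "-e", query, "--outputformat=tsv2"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)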
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article (see the sketch after this list)
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
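As a small illustration of the second use above, here is a sketch of
counting the most common referers for one article. It is mine, not part of
the release; the file name and the column names ("prev_title", "curr_title",
"n") are assumptions, so please check the figshare page for the exact schema:

    import csv
    from collections import Counter

    ARTICLE = "London"  # hypothetical target article
    top_referers = Counter()

    # "2015_01_clickstream.tsv" is a placeholder file name; adjust to the
    # actual download.
    with open("2015_01_clickstream.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr_title"] == ARTICLE:
                top_referers[row["prev_title"]] += int(row["n"])

    for referer, n in top_referers.most_common(10):
        print(referer, n)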
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need to be anonymized before being made public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
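To make the suggestion concrete, here is a rough sketch of how one raw
request record could be reduced to these seven fields. The input keys are
placeholders, not the real Wikimedia log field names, and nothing derived
from the client (IP, user agent, cookies) is carried over:

    # Reduce one raw request record to the seven proposed trace fields.
    def to_trace_record(raw, counter):
        return {
            "counter": counter,                    # 1) running counter
            "timestamp": raw["timestamp"],         # 2) request timestamp
            "url": raw["url"],                     # 3) requested URL
            "update": raw.get("is_update", 0),     # 4) update flag
            "cache_host": raw["cache_host"],       # 5) cache's hostname
            "cache_status": raw["cache_status"],   # 6) hit / miss / pass
            "response_size": raw["response_size"], # 7) response size in bytes
        }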
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Just a reminder: we will be deprecating the pagecounts datasets at the end
of May, as we mentioned earlier this year [0]. This means the existing files
will remain available to researchers, but new files will not be generated
in the future.
*Pagecounts datasets that will be deprecated*
pagecounts-raw
pagecounts-all-sites
Options for switching to the new datasets [1]:
pageviews for the same format but better quality data
pagecounts-ez for compressed data
[0] https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html
[1] https://dumps.wikimedia.org/other/analytics/
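For anyone updating scripts, here is a tiny sketch of reading one of the new
hourly pageviews files. The file name is hypothetical and the space-separated
layout (project, page title, view count, byte count) is my assumption based
on the old pagecounts layout, so please check the README under [1] first:

    import gzip

    def read_pageviews(path):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) >= 3:
                    yield parts[0], parts[1], int(parts[2])

    # Example file name only; adjust to an actual dump file.
    for project, title, views in read_pageviews("pageviews-20160501-000000.gz"):
        if project == "en" and views > 1000:
            print(title, views)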
Hello Wikimedia analytics mailing list,
As part of research into how people read Wikipedia, a friend and I created
a short survey. We are interested in seeing how people on this mailing list
(not a representative sample of Wikipedia readers, for sure!) fill out the
survey. The survey should take 2 to 10 minutes to complete.
https://www.surveymonkey.com/r/QBCCVFY
I would also appreciate it if any of you could circulate the
survey to a different audience. If you are interested in doing that, please
let me know (off-list, if you prefer) and I will give you a separate URL
for each such audience. Each URL corresponds to the audience
the survey is shared with, so that it is easier to understand
how responses differ by audience.
Any feedback on the survey questions would also be appreciated, on- or
off-thread.
Thank you very much!
Vipul
This might be of interest: https://clickhouse.yandex/
ClickHouse is an open-source column-oriented database management
system that allows generating analytical data reports in real time.
ClickHouse manages extremely large volumes of data in a stable and
sustainable manner. It currently powers Yandex.Metrica, the world's second
largest web analytics platform, with over 13 trillion database records
and over 20 billion events a day, generating customized reports
on the fly, directly from non-aggregated data. The system was also
successfully used at CERN's LHCb experiment to store and process
metadata on 10 billion events, with over 1000 attributes per event,
registered in 2011.
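For anyone curious, a minimal sketch (not from the linked page) of querying a
local ClickHouse instance over its HTTP interface on port 8123; the table and
column names are made up for illustration:

    import requests

    query = (
        "SELECT uri_path, count() AS hits "
        "FROM requests GROUP BY uri_path ORDER BY hits DESC LIMIT 10"
    )
    resp = requests.post("http://localhost:8123/", data=query)
    resp.raise_for_status()
    print(resp.text)  # tab-separated rows by default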
+ Analytics
On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
> Hello,
>
> I have a question for you regarding pageviews datadumps.
>
> I am considering studying reader engagement for different article topics
> in different languages. Because of this, I would like to know if there is
> any plan to make available pageview dumps detailing activity logs at the
> session level per user - in a similar way to editor sessions.
>
> Since this would be for a research project for which I might seek funding, I
> would like to know whether I could count on that, what the nature of the
> available data is, what the procedure to obtain it would be, and whether
> there would be any implications because of privacy concerns.
>
> Thank you very much!
>
> Best,
>
> Marc Miquel
>
Adding analytics@, a public e-mail list where you can post questions such
as this one.
>that doesn’t tell us how often entities are accessed through
Special:EntityData or wbgetclaims
>Does this data already exist, even in the form of raw access logs?
Is this data always requested via HTTP from an API endpoint that will hit a
Varnish cache? (Daniel can probably answer this.)
From what I see in our data, we have requests like the following:
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q633155
www.wikidata.org /w/api.php
?callback=jQuery11130020702992017004984_1465195743367&format=json&action=wbgetclaims&property=P373&entity=Q5296&_=1465195743368
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q573612
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q472729
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q349797
www.wikidata.org /w/api.php
?action=compare&torev=344163911&fromrev=344163907&format=json
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2356135
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2355988
www.wikidata.org /w/api.php
?action=compare&torev=344164023&fromrev=344163948&format=json
If the data you are interested in can be inferred from these requests there
is no additional data gathering needed.
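As a rough sketch of what inferring this from such requests could look like,
something along these lines could count (action, entity) pairs from request
lines like the ones above. The field names and the sample line are
illustrative only, not our actual log schema:

    from collections import Counter
    from urllib.parse import parse_qs

    usage = Counter()

    def count_request(uri_path, uri_query):
        # Tally (action, entity) pairs for the Wikidata API actions of interest.
        if uri_path != "/w/api.php":
            return
        params = parse_qs(uri_query.lstrip("?"))
        action = params.get("action", [""])[0]
        if action in ("wbgetclaims", "wbgetentities"):
            # "ids" may be pipe-separated; splitting is omitted for brevity.
            for entity in params.get("entity", []) + params.get("ids", []):
                usage[(action, entity)] += 1

    count_request("/w/api.php", "?action=wbgetclaims&format=json&entity=Q633155")
    print(usage.most_common(5))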
>If not, what effort would be required to gather this data? For the
purposes of my proposal to the U.S. Census Bureau I am estimating around
six weeks of effort for this for one person working full-time. If it will
take more time I will need to know.
I think I have mentioned this before on an e-mail thread, but without
knowing the details of what you want to do we cannot give you a time
estimate. What are the exact metrics you are interested in? Is the
project described anywhere on Meta?
Thanks,
Nuria
On Thu, Jun 30, 2016 at 11:45 AM, James Hare <james(a)hxstrategy.com> wrote:
> Copying Lydia Pintscher and Daniel Kinzler (with whom I’ve discussed this
> very topic).
>
> I am interested in metrics that describe how Wikidata is used. While we do
> have views on individual pages, that doesn’t tell us how often entities are
> accessed through Special:EntityData or wbgetclaims. Nor does it tell us how
> often statements/RDF triples show up in the Wikidata Query Service. Does
> this data already exist, even in the form of raw access logs? If not, what
> effort would be required to gather this data? For the purposes of my
> proposal to the U.S. Census Bureau I am estimating around six weeks of
> effort for this for one person working full-time. If it will take more time
> I will need to know.
>
>
> Thank you,
> James Hare
>
> On Thursday, June 2, 2016 at 2:18 PM, Nuria Ruiz wrote:
>
> James:
>
> >My current operating assumption is that it would take one person,
> working on a full time basis, around six weeks to go from raw access logs
> >to a functioning API that would provide information on how many times a
> Wikidata entity was accessed through the various APIs and the >query
> service. Do you believe this to be an accurate level of effort estimation
> based on your experience with past projects of this nature?
> You are starting from the assumption that we do have the data you are
> interested in in the logs, which I am not sure is the case. Have you done
> your checks in this regard with the Wikidata developers?
>
> Analytics 'automagically' collects data from logs about *page* requests;
> any other request collection (and it seems that yours fits this
> scenario) needs to be instrumented. I would send an e-mail to the analytics@
> public list and the Wikidata folks to ask how to harvest the data you are
> interested in. It doesn't sound like it is being collected at this time, so
> your project scope might be quite a bit bigger than you think.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Thu, Jun 2, 2016 at 5:06 AM, James Hare <james(a)hxstrategy.com> wrote:
>
> Hello Nuria,
>
> I am currently developing a proposal for the U.S. Census Bureau to
> integrate their datasets with Wikidata. As part of this, I am interested in
> getting Wikidata usage metrics beyond the page view data currently
> available. My concern is that the page views API gives you information only
> on how many times a *page* is accessed – but Wikidata is not really used
> in this way. More often it is the case that Wikidata’s information is
> accessed through the API endpoints (wbgetclaims etc.), through
> Special:EntityData, and the Wikidata Query Service. If we have information
> on usage through those mechanisms, that would give me much better
> information on Wikidata’s usage.
>
> To the extent these metrics are important to my prospective client, I am
> willing to provide in-kind support to the analytics team to make this
> information available, including expenses associated with the NDA process
> (I understand that such a person may need to deal with raw access logs that
> include PII.) My current operating assumption is that it would take one
> person, working on a full time basis, around six weeks to go from raw
> access logs to a functioning API that would provide information on how many
> times a Wikidata entity was accessed through the various APIs and the query
> service. Do you believe this to be an accurate level of effort estimation
> based on your experience with past projects of this nature?
>
> Please let me know if you have any questions. I am happy to discuss my
> idea with you further.
>
>
> Regards,
> James Hare
>
>
>
>
Hi!
Tomorrow morning (Jun 30th, CET timezone) I'll need to reboot stat1002,
stat1003 and stat1004 for kernel upgrades (Ubuntu security patches). This
could potentially terminate long-running queries or jobs, so please ping me
on IRC or email me if your work can't be postponed or stopped.
Thanks!
Regards,
Luca
Yaron Koren has proposed reopening the "Unacceptable behavior" section
(https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Suggested_change_…).
His perspective and mine are given on the talk page.
In brief:
* He disagrees with how "marginalized and otherwise underrepresented
groups" and "encouraged" are handled in the original text.
* I support the current text and process, and have explained why on the
talk page.
Thanks,
Matt Flaschen