Analytics September 2018

analytics@lists.wikimedia.org

14 participants
15 discussions

Statistics about republication of Wikimedia content

by Pine W

Hi Analytics, Are views of republished Wikimedia content, such as on Google and Youtube, something that we could include in addition to Wikimedia pageview statistics? I imagine that this would require cooperation from Alphabet and other companies that reuse Wikimedia content. It would be nice if we could get that cooperation. Also, is this republication taken into account in website traffic rankings? My guess is that the answer is no, and that other types of republication such as embedded Youtube videos are not taken into account for their content provider's site rankings, although I think that Youtube would count views of embedded videos in its own statistics of video views. I am thinking that for Youtube and Wikipedia, and other similar sites for which republication or embedding are common, site rankings which are based on pageviews could significantly underestimate the popularity and influence of the sites. Regards, Pine ( https://meta.wikimedia.org/wiki/User:Pine )

5 years, 7 months

Beeline as Hive client

by Madhumitha Viswanathan

Hi all, For all Hive users using stat1002/1004, you might have seen a deprecation warning when you launch the hive client, saying that it's being replaced with Beeline. The Beeline shell has always been available to use, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual, and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline. If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>. (If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!) Best, --Madhu :)
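As a rough illustration of what the wrapper saves: without it, every Beeline invocation has to spell out the JDBC connection string. The sketch below only builds such a command line in Python and does not run it. The host name and port are placeholders, not the cluster's actual values, and `build_beeline_command` is a hypothetical helper; `-u` (JDBC URL) and `-e` (query string) are standard Beeline options.

```python
def build_beeline_command(jdbc_url, query):
    """Return the argv list for a one-off Beeline query.

    jdbc_url is the connection string that plain Beeline needs on
    every launch; the wrapper described above supplies it for you.
    """
    return ["beeline", "-u", jdbc_url, "-e", query]


# Placeholder connection string (assumption, not the real cluster host).
EXAMPLE_URL = "jdbc:hive2://hive-server.example.invalid:10000"

command = build_beeline_command(EXAMPLE_URL, "SHOW DATABASES;")
print(" ".join(command))
```

With the wrapper in place, the same session is started by typing `beeline` alone.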

5 years, 8 months

Brief unavailability scheduled for the Event Logging database replica

by Luca Toscano

Hi everybody, Tomorrow Sept 27th at 10 CEST db1108 (alias analytics-slave) will be down for a brief (max 30 mins) maintenance (MariaDB and Linux kernel upgrade). This means that the log database will not be available for querying during this time frame. Please reach out to me or to the Analytics team if this impacts your work (elukey or #wikimedia-analytics on IRC Freenode). Thanks! Luca

5 years, 8 months

Question about data in pageview api

by Felix J. Scholz

Hey, I've been looking through the documentation on the pageview api in recent days, and have a question that I have not been able to come up with a solution to so far. Per my understanding, the data accessible through the "aggregated by project" pageview api [1], when filtered to just query "user" agents, should return the same results as can be found in the hourly pageview dumps data [2 / 3]. However, while the data is close, in two of my brief tests (for the data of October 1, 2015) the values did not match up.
Data from "aggregate" API:
en.wikipedia & excluding spiders [4]: 238.845.634
pt.wikipedia & excluding spiders [5]: 11.390.043
Data from pageview dumps [3]:
en & en.zero & en.m: 238.840.836
pt & pt.zero & pt.m: 11.389.979
As you can see, while the values are close, they do not match. What am I missing here? Am I maybe mistaken in the notion that the two data sources are providing data from the same source and thus should be compatible?
Felix
[1] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
[3] https://dumps.wikimedia.org/other/pageviews/
[4] https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/…
[5] https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/pt.wikipedia/…
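To put the size of the mismatch in numbers, the four totals quoted in the post can be compared directly. This is only arithmetic on the figures above, with the dotted thousands separators removed; it does not query the API or the dumps, and the dictionary and function names are made up for the example.

```python
# Totals quoted in the post for October 1, 2015.
api_totals = {"en.wikipedia": 238_845_634, "pt.wikipedia": 11_390_043}
dump_totals = {"en.wikipedia": 238_840_836, "pt.wikipedia": 11_389_979}


def compare_totals(api, dumps):
    """Return {project: (absolute gap, gap as a fraction of the API total)}."""
    gaps = {}
    for project, api_value in api.items():
        diff = api_value - dumps[project]
        gaps[project] = (diff, diff / api_value)
    return gaps


gaps = compare_totals(api_totals, dump_totals)
for project, (diff, fraction) in gaps.items():
    print(f"{project}: API minus dumps = {diff} views ({fraction:.4%} of the API total)")
```

The API figure is higher in both cases, by 4798 views for en.wikipedia and 64 views for pt.wikipedia, which is a few thousandths of a percent or less of each total.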

5 years, 8 months

Analytics Hadoop cluster full shutdown scheduled for Sept 25th

by Luca Toscano

Hi everybody, the Analytics team needs to replace the Hadoop master node hosts (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of regular hardware refresh (hosts getting out of warranty). In order to do things safely we decided to proceed with a full cluster shutdown on Sept 25th at 10 AM CEST. The maintenance should last a couple of hours and there shouldn't be any noticeable change for the Hadoop users afterwards. This means that during the maintenance:
- HDFS will not be available
- Yarn will not be available
- Hive/Spark (cluster mode)/Oozie/etc. will not be available
Please let us know if this impacts your work in https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics Freenode IRC channel. Thanks a lot! Luca

5 years, 8 months

Persisting some temp data in hive, so that others can access it

by Ian Marlier

Hi there -- I've been doing some analysis using the raw pageviews table in Hive, in order to try to understand the effect that adding a sitemap to it.wikipedia.org had on traffic[1]. As part of this analysis, I created three temporary tables. But, of course, those tables only exist within the context of my own session, which is sub-optimal since I'm not the only one trying to understand this. What's the best way to go about persisting these tables? I can SELECT INTO to move the data in to a non-temp table, but don't want to do so willy-nilly. (They'll probably need to stick around for about 2 weeks, I would guess, and each of the three tables in question is about 5 million rows with three columns each (a string, and two int)) Thanks! - Ian
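For readers with the same question: in HiveQL, copying a temporary table into a regular one is usually written as `CREATE TABLE ... AS SELECT` rather than `SELECT INTO`. The sketch below only generates such statements as strings. The database and table names are hypothetical, `ctas_statement` is an illustrative helper, and where such tables should live on the cluster is a question for the Analytics team.

```python
import re

# Conservative identifier check so the generated SQL stays well-formed.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def ctas_statement(target_db, target_table, source_table):
    """Build a HiveQL CREATE TABLE ... AS SELECT statement as a string."""
    for name in (target_db, target_table, source_table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE TABLE {target_db}.{target_table} "
        f"AS SELECT * FROM {source_table};"
    )


# Hypothetical names: three session-local temp tables copied into a personal database.
temp_tables = ["tmp_sitemap_a", "tmp_sitemap_b", "tmp_sitemap_c"]
statements = [ctas_statement("example_user_db", name[4:], name) for name in temp_tables]
for statement in statements:
    print(statement)
```

Each statement has to run in the same session that created the temporary table, since the temp tables are not visible anywhere else.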

5 years, 8 months

Results from 2018 global Wikimedia survey are published!

by Edward Galvez

Hi everyone, I'm excited to share that our annual survey about Wikimedia communities is now published! This survey included 170 questions and reached over 4,000 community members across four audiences: Contributors, Affiliate organizers, Program Organizers, and Volunteer Developers. This survey helps us hear about the experience of Wikimedians from across the movement so that teams are able to use community feedback in their planning and their work. This survey also helps us learn about long term changes in communities, such as community health or demographics. The report is available on meta: https://meta.wikimedia.org/wiki/Community_Engagement_Insights/2018_Report For this survey, we worked with 11 teams to develop the questions. Once the results were analyzed, we spent time with each team to help them understand their results. Most teams have already identified how they will use the results to help improve their work to support you. The report could be useful for your work in the Wikimedia movement as well! What are you learning from the data? Take some time to read the report and share your feedback on the talk pages. We have also published a blog that you can read.[1] We are hosting a livestream presentation[2] on September 20 at 1600 UTC. Hope to see you there! Feel free to email me directly with any questions. All the best, Edward [1] https://wikimediafoundation.org/2018/09/13/what-we-learned-surveying-4000-c… [2] https://www.youtube.com/watch?v=qGQtWFP9Cjc -- Edward Galvez Evaluation Strategist, Surveys Learning & Evaluation Community Engagement Wikimedia Foundation

5 years, 8 months

[Wikimedia Research Showcase] Wednesday September 19, 2018 at 11:30 AM (PDT) 18:30 UTC

by Sarah R

Hi Everyone, The next Wikimedia Research Showcase will be live-streamed Wednesday, September 19 2018 at 11:30 AM (PDT) 18:30 UTC. YouTube stream: https://www.youtube.com/watch?v=OY8vZ6wES9o As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here. <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Upcoming_Showcase> Hope to see you there! This month's presentations are:
The impact of news exposure on collective attention in the United States during the 2016 Zika epidemic
By *Michele Tizzoni, André Panisson, Daniela Paolotti, Ciro Cattuto*
In recent years, many studies have drawn attention to the important role of collective awareness and human behaviour during epidemic outbreaks. A number of modelling efforts have investigated the interaction between the disease transmission dynamics and human behaviour change mediated by news coverage and by information spreading in the population. Yet, given the scarcity of data on public awareness during an epidemic, few studies have relied on empirical data. Here, we use fine-grained, geo-referenced data from three online sources - Wikipedia, the GDELT Project and the Internet Archive - to quantify population-scale information seeking about the 2016 Zika virus epidemic in the U.S., explicitly linking such behavioural signal to epidemiological data. Geo-localized Wikipedia pageview data reveal that visiting patterns of Zika-related pages in Wikipedia were highly synchronized across the United States and largely explained by exposure to national television broadcast. Contrary to the assumption of some theoretical models, news volume and Wikipedia visiting patterns were not significantly correlated with the magnitude or the extent of the epidemic.
Attention to Zika, in terms of Zika-related Wikipedia pageviews, was high at the beginning of the outbreak, when public health agencies raised an international alert and triggered media coverage, but subsequently exhibited an activity profile that suggests nonlinear dependencies and memory effects in the relationship between information seeking, media pressure, and disease dynamics. This calls for a new and more general modelling framework to describe the interaction between media exposure, public awareness, and disease dynamics during epidemic outbreaks.
Deliberation and resolution on Wikipedia: A case study of requests for comments
By *Amy Zhang, Jane Im*
Resolving disputes in a timely manner is crucial for any online production group. We present an analysis of Requests for Comments (RfCs), one of the main vehicles on Wikipedia for formally resolving a policy or content dispute. We collected an exhaustive dataset of 7,316 RfCs on English Wikipedia over the course of 7 years and conducted a qualitative and quantitative analysis into what issues affect the RfC process. Our analysis was informed by 10 interviews with frequent RfC closers. We found that a major issue affecting the RfC process is the prevalence of RfCs that could have benefited from formal closure but that linger indefinitely without one, with factors including participants' interest and expertise impacting the likelihood of resolution. From these findings, we developed a model that predicts whether
-- Sarah R. Rodlund Technical Writer, Developer Advocacy <https://meta.wikimedia.org/wiki/Developer_Advocacy> srodlund(a)wikimedia.org

5 years, 8 months

New viz.: Wikipedias, participation per language

by Erik Zachte

Hi all, I just published a new visualization: Wikipedias, compared by participation per language (= active editors per million speakers). There are several pages:
one for a global overview https://stats.wikimedia.org/wikimedia/participation/d3_participation_global…
one with breakdown by continent https://stats.wikimedia.org/wikimedia/participation/d3_participation_contin…
You can also zoom in on one continent by clicking on it. Any feedback is welcome. Erik Zachte

5 years, 8 months

EventLogging MySQL Schema Whitelist

by Andrew Otto

Hi all you EventLogging users out there! tl;dr We will switch EventLogging MySQL ingestion to be based on a schema whitelist rather than blacklist. As you know, we currently import EventLogging events into two locations for analysis: The MySQL ‘log’ database, and the Hive ‘event’ database. MySQL is not able to handle high volume events. We currently blacklist <https://github.com/wikimedia/puppet/blob/production/hieradata/common.yaml#L…> any schemas that we know have high volumes from being ingested into MySQL. This can cause problems when a new high volume schema is deployed, as it requires knowledge and communication from the schema owners to the Analytics team, and it requires an Analytics Operations engineer to make a Puppet commit to blacklist the schema. To address this problem, we will switch the EventLogging MySQL schema blacklist to a whitelist. All schemas that are actively being ingested into MySQL today will be whitelisted. In the future, if you want an event schema to be ingested into MySQL, you’ll need to ask the Analytics team to whitelist it. Hive has been working for EventLogging analysis for a while now. It has almost all of the schemas that MySQL does, plus the high volume ones. One day in the (distant?) future, we’d like to decommission MySQL storage of events. (Don’t worry yet, MySQL decommissioning has a lot of blockers and this work is not planned.) By not ingesting events into MySQL by default, we hope to encourage more users to switch to Hive. This switch to a whitelist will happen this week. If you are deploying a new schema and expect it to show up in MySQL, let us know so we can whitelist it! Thanks! -Andrew + the Analytics Engineering team https://phabricator.wikimedia.org/T203596
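The practical difference between the two policies shows up when a schema nobody has listed yet starts producing events. The sketch below is a toy model of the decision only, not the actual EventLogging or Puppet configuration; the schema names and both helper functions are invented for the example.

```python
def ingested_with_blacklist(schema, blacklist):
    """Old policy: everything goes to MySQL unless explicitly excluded."""
    return schema not in blacklist


def ingested_with_whitelist(schema, whitelist):
    """New policy: nothing goes to MySQL unless explicitly included."""
    return schema in whitelist


# Hypothetical schema names, for illustration only.
blacklist = {"KnownHighVolumeSchema"}
whitelist = {"ExistingLowVolumeSchema"}
new_schema = "BrandNewHighVolumeSchema"  # just deployed, not on any list

print(ingested_with_blacklist(new_schema, blacklist))  # reaches MySQL by default
print(ingested_with_whitelist(new_schema, whitelist))  # stays out until requested
```

Under the blacklist, an unannounced high volume schema lands in MySQL until someone notices and excludes it; under the whitelist it stays out until its owners ask for it to be added.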

5 years, 9 months
