Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
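As an illustration of the first and last use cases, the pairs can be loaded into a simple transition table. This is a minimal Python sketch, assuming a tab-separated file with referer, article, and count columns (the actual release format and column names may differ):

```python
import csv
from collections import defaultdict

def load_clickstream(path):
    """Build a referer -> {article: count} table from TSV rows of (referer, article, count)."""
    counts = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for referer, article, n in csv.reader(f, delimiter="\t"):
            counts[referer][article] = int(n)
    return counts

def top_outlinks(counts, referer, k=5):
    """The k most frequently clicked links leading out of a given page."""
    return sorted(counts.get(referer, {}).items(), key=lambda kv: -kv[1])[:k]

def transition_probs(counts, referer):
    """Markov-chain transition probabilities out of one page (counts normalized to sum to 1)."""
    row = counts.get(referer, {})
    total = sum(row.values())
    return {article: n / total for article, n in row.items()} if total else {}
```

Applying `transition_probs` to every referer yields the row-stochastic matrix of the Markov chain mentioned above.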
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience parsing user agents in
EventLogging data to detect browser family, version, etc., and would
like to share advice on how to do this most efficiently? (In the past,
I have written some MySQL expressions to extract the app version
number for the Wikipedia apps, but doing that for classifying browsers
in general seems like a bit of a pain. One option would be to export
the data and use the Python version of ua-parser; however, doing it
directly in MySQL would fit better into existing workflows.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
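For the export-and-parse route, even without the full ua-parser library, a rough stdlib-only Python sketch looks like the following. The regex patterns and UA strings here are illustrative assumptions only, not a complete or accurate classifier; a real solution should use ua-parser's maintained rule set:

```python
import re

# Illustrative patterns only; order matters, since a UA string can match
# several of them (e.g. Chrome UAs also contain "Safari").
BROWSER_PATTERNS = [
    ("WikipediaApp", re.compile(r"WikipediaApp/(\d+(?:\.\d+)*)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]

def classify_user_agent(ua):
    """Return (family, version) for the first matching pattern, else ('Other', None)."""
    for family, pattern in BROWSER_PATTERNS:
        m = pattern.search(ua)
        if m:
            return family, m.group(1)
    return "Other", None
```

The same approach is hard to replicate in plain MySQL string functions, which is exactly why a pre-parsed user_agent_map field would help.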
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Hi,
Hope this finds you all well. I'm wondering if there is a way or tool to
identify articles that exist in one edition of Wikipedia and have
counterparts in another. I'm also wondering if there is a way to generate a
list of these articles' titles for certain categories.
Best,
Reem
--
*Kind regards,*
*Reem Al-Kashif*
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other
datasets will be unavailable due to maintenance. This should take no
longer than 10 minutes. Thanks for your understanding.
Ariel
Hi everybody,
due to a severe kernel vulnerability (
https://access.redhat.com/security/vulnerabilities/2706661) I need to
reboot the stat1002, stat1003 and stat1004 hosts to install the new kernel.
The reboots are scheduled for 9 AM CEST tomorrow (Oct 21st), please follow
up with me or anybody in the Analytics team if you have ongoing work that
can't be stopped.
The Analytics Hadoop and Kafka clusters will also be rebooted over the
next few hours. Even though this maintenance shouldn't cause any major
issues, you might experience some service degradation. More up-to-date
information will be available on IRC in the analytics and operations
channels.
Thanks and apologies in advance for the trouble!
Luca
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, October
19, 2016, at 11:30 AM PST (18:30 UTC).
Link for remote presenters to join the Hangout on Air:
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#October_2016>.
YouTube stream: https://www.youtube.com/watch?v=cBImUZ_si5s
This month's showcase includes:
Human-centered design for using and editing structured data in Wikipedia
infoboxes
By *Charlie Kritschmar
<https://www.mediawiki.org/wiki/User:Charlie_Kritschmar_(WMDE)>*, UX
Intern, Wikimedia Deutschland
<https://meta.wikimedia.org/wiki/Wikimedia_Deutschland>
Wikidata is a
Wikimedia project which stores structured data to be used by other
Wikimedia projects like Wikipedia. Currently, integrating its data into
Wikipedia is difficult for users, since there's no predefined way to do
so and it requires some technical knowledge. To tackle these issues,
human-centered design methods were applied to find needs from which
solutions were generated and evaluated with the help of the community. The
concept may serve as a basis which may be implemented into various Wiki
projects in the future to make editing Wikidata from within another
Wikimedia project more user-friendly and improve the project’s acceptance
in the community.
Emergent Work in Wikipedia
By *Ofer Arazy
<http://oferarazy.com/>* (University of Haifa)
Online production communities
present an exciting opportunity for investigating novel organizational
forms. Extant theoretical accounts of knowledge co-production point to
organizational policies, norms, and communication as key mechanisms
enabling the coordination of work. Yet, in practice participants in
initiatives such as Wikipedia are often occasional contributors who are
unaware of community policies and do not communicate with other members.
How then is work coordinated and how does the organization maintain
stability in the face of dynamics in individuals’ task enactment? In this
study we develop a conceptualization of emergent roles (the prototypical
activity patterns that organically emerge from individuals’ spontaneous
actions) and investigate the temporal dynamics of emergent role behaviors.
Conducting a multi-level large-scale empirical study stretching over a
decade, we tracked co-production of a thousand Wikipedia articles, logging
two hundred thousand distinct participants and seven hundred thousand
co-production activities. Using a combination of manual tagging and machine
learning, we annotated each activity type, and then clustered participants’
activity profiles to arrive at seven prototypical emergent roles. Our
analysis shows that participants’ behavior is turbulent, with substantial
flow in and out of co-production work and across roles. Our findings at the
organizational level, however, show that work is organized around a highly
stable set of emergent roles, despite the absence of traditional
stabilizing mechanisms such as pre-defined work procedures or role
expectations. We conceptualize this dualism in emergent work as “Turbulent
Stability”. Further analyses suggest that co-production is
artifact-centric, where contributors mutually adjust according to the
artifact’s changing needs. Our study advances the theoretical
understandings of self-organizing knowledge co-production and particularly
the nature of emergent roles.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
Hello!
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
Important deadlines:
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates:
https://www.mediawiki.org/wiki/Topic:Td5wfd70vptn8eu4
Please feel free to forward this email to anyone who might be interested in
attending!
Thanks,
Srishti
--
Srishti Sethi
ssethi(a)wikimedia.org
In case you recently observed unexpected drops in Wikimedia site traffic
from France, see below.
Pine
---------- Forwarded message ----------
From: "geni" <geniice(a)gmail.com>
Date: Oct 17, 2016 1:55 PM
Subject: [Wikimedia-l] We appear to have been partially blocked in France
(probably accidentally)
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Apparently, on the orders of the French government, Orange added us to
their list of blocked terrorist sites. This apparently had the fun
effect of DoSing the government page people were redirected to. Source
(among others):
http://www.lemonde.fr/pixels/article/2016/10/17/une-erreur-bloque-l-acces-a-google-pour-les-clients-d-orange_5014900_4408996.html
--
geni