Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
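As an illustration of the first and last use cases, the pairs can be loaded into a simple transition table. This is a minimal Python sketch, assuming a tab-separated file with referer, article, and count columns (the actual release format and column names may differ):

```python
import csv
from collections import defaultdict

def load_clickstream(path):
    """Build a referer -> {article: count} table from TSV rows of (referer, article, count)."""
    counts = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for referer, article, n in csv.reader(f, delimiter="\t"):
            counts[referer][article] = int(n)
    return counts

def top_outlinks(counts, referer, k=5):
    """The k most frequently clicked links leading out of a given page."""
    return sorted(counts.get(referer, {}).items(), key=lambda kv: -kv[1])[:k]

def transition_probs(counts, referer):
    """Markov-chain transition probabilities out of one page (counts normalized to sum to 1)."""
    row = counts.get(referer, {})
    total = sum(row.values())
    return {article: n / total for article, n in row.items()} if total else {}
```

Applying `transition_probs` to every referer yields the row-stochastic matrix of the Markov chain mentioned above.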
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience parsing user agents in
EventLogging data to detect browser family, version, etc., and would
like to share advice on how to do this most efficiently? (In the past,
I have written some MySQL expressions to extract the app version
number for the Wikipedia apps, but doing that for classifying browsers
in general seems like a bit of a pain. One option would be to export
the data and use the Python version of ua-parser; however, doing it
directly in MySQL would fit better into existing workflows.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
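For the export-and-parse route, even without the full ua-parser library, a rough stdlib-only Python sketch looks like the following. The regex patterns and UA strings here are illustrative assumptions only, not a complete or accurate classifier; a real solution should use ua-parser's maintained rule set:

```python
import re

# Illustrative patterns only; order matters, since a UA string can match
# several of them (e.g. Chrome UAs also contain "Safari").
BROWSER_PATTERNS = [
    ("WikipediaApp", re.compile(r"WikipediaApp/(\d+(?:\.\d+)*)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]

def classify_user_agent(ua):
    """Return (family, version) for the first matching pattern, else ('Other', None)."""
    for family, pattern in BROWSER_PATTERNS:
        m = pattern.search(ua)
        if m:
            return family, m.group(1)
    return "Other", None
```

The same approach is hard to replicate in plain MySQL string functions, which is exactly why a pre-parsed user_agent_map field would help.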
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Hi,
Hope this finds you all well. I'm wondering if there is a way or tool to
identify articles that exist in one edition of Wikipedia and have
counterparts in another. I'm also wondering if there is a way to generate a
list of these articles' titles for certain categories.
Best,
Reem
--
*Kind regards,*
*Reem Al-Kashif*
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other
datasets will be unavailable due to maintenance. This should take no
longer than 10 minutes. Thanks for your understanding.
Ariel
Hi everybody,
due to a severe kernel vulnerability (
https://access.redhat.com/security/vulnerabilities/2706661) I need to
reboot the stat1002, stat1003 and stat1004 hosts to install the new kernel.
The reboots are scheduled for 9 AM CEST tomorrow (Oct 21st), please follow
up with me or anybody in the Analytics team if you have ongoing work that
can't be stopped.
The Analytics Hadoop and Kafka clusters will also be rebooted over the
next few hours. Even though this maintenance shouldn't cause any major
issues, you might experience some service degradation. More up-to-date
information will be available on IRC in the analytics and operations
channels.
Thanks and apologies in advance for the trouble!
Luca
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, October
19, 2016, at 11:30 AM PST (18:30 UTC).
Link for remote presenters to join the Hangout on Air:
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#October_2016>.
YouTube stream: https://www.youtube.com/watch?v=cBImUZ_si5s
This month's showcase includes:
Human-centered design for using and editing structured data in Wikipedia
infoboxes
By *Charlie Kritschmar
<https://www.mediawiki.org/wiki/User:Charlie_Kritschmar_(WMDE)>*, UX
Intern, Wikimedia Deutschland
<https://meta.wikimedia.org/wiki/Wikimedia_Deutschland>
Wikidata is a
Wikimedia project which stores structured data to be used by other
Wikimedia projects like Wikipedia. Currently, integrating its data into
Wikipedia is difficult for users, since there's no predefined way to do
so and it requires some technical knowledge. To tackle these issues,
human-centered design methods were applied to find needs from which
solutions were generated and evaluated with the help of the community. The
concept may serve as a basis which may be implemented into various Wiki
projects in the future to make editing Wikidata from within another
Wikimedia project more user-friendly and improve the project’s acceptance
in the community.
Emergent Work in Wikipedia
By *Ofer Arazy
<http://oferarazy.com/>* (University of Haifa)
Online production communities
present an exciting opportunity for investigating novel organizational
forms. Extant theoretical accounts of knowledge co-production point to
organizational policies, norms, and communication as key mechanisms
enabling the coordination of work. Yet, in practice participants in
initiatives such as Wikipedia are often occasional contributors who are
unaware of community policies and do not communicate with other members.
How then is work coordinated and how does the organization maintain
stability in the face of dynamics in individuals’ task enactment? In this
study we develop a conceptualization of emergent roles (the prototypical
activity patterns that organically emerge from individuals’ spontaneous
actions) and investigate the temporal dynamics of emergent role behaviors.
Conducting a multi-level large-scale empirical study stretching over a
decade, we tracked co-production of a thousand Wikipedia articles, logging
two hundred thousand distinct participants and seven hundred thousand
co-production activities. Using a combination of manual tagging and machine
learning, we annotated each activity type, and then clustered participants’
activity profiles to arrive at seven prototypical emergent roles. Our
analysis shows that participants’ behavior is turbulent, with substantial
flow in and out of co-production work and across roles. Our findings at the
organizational level, however, show that work is organized around a highly
stable set of emergent roles, despite the absence of traditional
stabilizing mechanisms such as pre-defined work procedures or role
expectations. We conceptualize this dualism in emergent work as “Turbulent
Stability”. Further analyses suggest that co-production is
artifact-centric, where contributors mutually adjust according to the
artifact’s changing needs. Our study advances the theoretical
understandings of self-organizing knowledge co-production and particularly
the nature of emergent roles.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
Hello!
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
Important deadlines:
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates:
https://www.mediawiki.org/wiki/Topic:Td5wfd70vptn8eu4
Please feel free to forward this email to anyone who might be interested in
attending!
Thanks,
Srishti
--
Srishti Sethi
ssethi(a)wikimedia.org
In case you recently observed unexpected drops in Wikimedia site traffic
from France, see below.
Pine
---------- Forwarded message ----------
From: "geni" <geniice(a)gmail.com>
Date: Oct 17, 2016 1:55 PM
Subject: [Wikimedia-l] We appear to have been partially blocked in France
(probably accidentally)
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Apparently, on the orders of the French government, Orange added us to
their list of blocked terrorist sites. This apparently had the fun
effect of DoSing the government page people were redirected to. Source
(among others):
http://www.lemonde.fr/pixels/article/2016/10/17/une-erreur-bloque-l-acces-a-google-pour-les-clients-d-orange_5014900_4408996.html
--
geni