Hi all,
As you might know, I have a few GLAM-related tools on the Toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly stats updates have always been slow to run, but recently
they almost ground to a halt, and the on-demand tools have stalled
completely. All these tools get their data from stats.grok.se, which works
well but is not exactly high-speed; my on-demand tools have apparently been
shut out recently because so many people were using them that they were
effectively DDoSing the server :-(
I know you are working on page view numbers, and from what I gather the
system is already up and running internally. My requirements are simple: I
have a list of pages on many Wikimedia projects, and I need view counts for
these pages for a specific month, per page.
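For context, here is roughly how my tools pull the numbers today (a
minimal sketch, in 2013-era Python 2, against the stats.grok.se JSON
endpoint as I understand it; the URL scheme and the daily_views field are
what that service exposes, nothing official):

    # One HTTP request per (page, month) pair, which is exactly why these
    # tools are slow and easy to overload.
    import json
    import urllib2

    def monthly_views(project, month, title):
        """Monthly view count for one page from stats.grok.se.
        project: e.g. 'en'; month: 'YYYYMM'; title: URL-encoded page title."""
        url = 'http://stats.grok.se/json/%s/%s/%s' % (project, month, title)
        data = json.load(urllib2.urlopen(url))
        return sum(data['daily_views'].values())  # sum days -> month total

    print monthly_views('en', '201303', 'Rembrandt')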
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
Cheers,
Magnus
Hey,
As part of ongoing cleanup and debugging of varnishncsa, part of the Varnish toolset for UDP logging of access requests, I'm removing our custom code for escaping spaces in headers. Since we've now switched to using tabs as field separators, this escaping shouldn't be necessary anymore.
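To illustrate why (a toy sketch in Python, not the varnishncsa code
itself): with space-separated fields, any header value containing spaces
shifts the field boundaries unless it is escaped, while with tab
separators a plain split stays unambiguous as long as no header contains a
literal tab:

    # Toy example: three log fields, where the user agent contains spaces.
    fields = ['127.0.0.1', 'GET /wiki/Main_Page', 'Mozilla/5.0 (X11; Linux)']

    print len(' '.join(fields).split(' '))    # 6 -- space-split breaks fields
    print len('\t'.join(fields).split('\t'))  # 3 -- tab-split keeps them intact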
Please let me know if there are any objections - if not I'll deploy this change in the next couple of days.
--
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
Hey folks,
Two days ago I was running a script on stat1 to update the revert tables
on db1047, and it had some bad disk access patterns. (FYI: don't use
Python's shelve as an on-disk cache for a dict; there's a sketch of why
below.) As soon as I saw the load come up, I killed the script. I'm very
sorry for any difficulty this caused in the meantime. I've since rewritten
things to behave much better.
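For the curious, the failure mode looks roughly like this (a schematic
sketch, not the actual script; 'revisions' is a stand-in for the real
data):

    import shelve

    revisions = [(1, 'ab12'), (2, 'cd34')]  # stand-in for (rev_id, sha1) pairs

    # Bad: shelve hits the disk on every assignment and lookup, so a tight
    # loop over millions of revisions becomes millions of tiny random I/Os.
    cache = shelve.open('reverts.cache')
    for rev_id, sha1 in revisions:
        cache[sha1] = rev_id  # small random disk write, every iteration
    cache.close()

    # Better: keep the working set in memory and touch the disk once.
    cache = {}
    for rev_id, sha1 in revisions:
        cache[sha1] = rev_id
    with open('reverts.tsv', 'w') as f:
        for sha1, rev_id in cache.items():
            f.write('%s\t%d\n' % (sha1, rev_id))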
I currently have two processes running on the machine:
- sessions.py - Updating session table on db1047. Useful for measuring
editor labor hours.
- reverts.py - Updating revert tables on db1047. Fixed to not need a
disk cache.
Both of these processes are nice'd, so they should wait in line for CPU
access behind any non-nice'd processes you have running. If the processes
cause any trouble, please feel free to kill them or let me know and I'll
kill them.
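(For anyone re-running these scripts: the renicing is nothing special. A
one-line sketch of the same effect from inside Python:

    import os
    os.nice(10)  # lower our priority so normal-priority work runs first

which is equivalent to launching them with the `nice` command.)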
For Science,
-Aaron
In the coming weeks, the Wikimedia Foundation will organize its analytics
capabilities into a single department to better support the entire
organization in data-driven decision-making. We're looking for a
Director of Analytics to lead this effort. Details here:
http://hire.jobvite.com/Jobvite/Job.aspx?j=oJriXfw9&c=qSa9VfwQ
Please pass this on to people with the right qualifications in your
respective networks.
Thanks,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
Hi!
Here's the summary of Wednesday's Analytics sprint planning and demo.
Apologies for cross-posting; ideally you should receive this on the
Analytics mailing list so we can have one focal point for conversation. If
you are not on the Analytics list, please subscribe at
https://lists.wikimedia.org/mailman/listinfo/analytics
# TL;DR #
Our most recent sprint continued our focus on improving visibility into
mobile initiatives (the mobile site, support for mobile applications, and
Wikipedia Zero), fine-tuning the cluster, and improving monitoring of our
data streams.
## Defects & Features taken during Sprint ending 2013-03-27 ##
#68 F - Visualize Commons Mobile App (Android & iOS) metrics in Limn
dashboard (N/E) - Done, requested by Mobile
#78 F - Document pageview business logic for analysts (N/E) - Done,
requested by Analytics
#361 D - HTTPS generates two hits in server log, count only one of those -
Done, requested by Analytics / Community
#155 F - Improve accuracy of packet loss monitoring (N/E) - Done,
requested by Analytics
#147 I - iptables for NameNode (N/E) - Done, requested by Ops
#154 F - Provide unsampled blog web traffic as a datastream (N/E) -
Shipping, requested by Communications
#272 F - Dump stats: tally wikis by activity level (# active users) (N/E) -
Ready for Showcase, requested by Erik & Sue
#461 F - Configure FairScheduler on Kraken (N/E) - Ready for Showcase,
requested by Analytics
## Planned for Showcase on 2013-04-03 ##
Mingle:#52 - Puppetize Limn (N/E) - Coding
Mingle:#60 F - Mobile pageview requests reporting in Wikistats
Mingle:#61 F - Mobile Site Pageviews by Device Class
Mingle:#92 - Page View Metrics Report for Official Wikipedia Mobile Apps
(5) - Testing
Mingle:#244 F - Track user adoption of Wikipedia Zero
Mingle:#426 F - Authenticate users of Metrics API Admin UI (5)
## Current Sprint (ending 2013-04-03) ##
The current sprint's theme remains Mobile.
Stories in progress from last sprint:
#52 - Puppetize Limn (N/E) - Coding
#61 F - Mobile site pageviews by device class (N/E) - Testing
#92 - Page View Metrics Report for Official Wikipedia Mobile Apps (5) -
Testing
#60 D - Mobile pageview requests reporting in Wikistats (N/E) - Testing
Stories started but blocked:
#240 - Session Analysis of mobile site visits by mode (alpha/beta/standard)
(8) - Coding
New stories:
#94 F - Create chart through GUI (13)
#353 D - Wikistats cannot run from origin/master (N/E)
#378 F - Update Reportcard for April Metrics Meeting (N/E)
#426 F - Authenticate users of Metrics API Admin UI (5)
#460 I - udp2log server maintenance (13)
(Number in parentheses) = estimate of complexity
N/E = not estimated
F = Feature
D = Defect
I = Infrastructure Task
Best,
Diederik
There's discussion at
https://bugzilla.wikimedia.org/show_bug.cgi?id=44448 about how skin
usage correlates with who's an active editor.
It would be great to know what percentage of active editors (5+ edits in
the main namespace) use each skin on English Wikipedia, perhaps over the
last three months.
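Something like this might be a starting point (a sketch only, untested; it
assumes replica access and the stock schema, where user_properties has a
'skin' row only for users who changed away from the default skin, and the
host and connection details below are placeholders):

    import os
    import MySQLdb

    QUERY = """
    SELECT COALESCE(up.up_value, 'default') AS skin, COUNT(*) AS editors
    FROM (
        SELECT rev_user
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_namespace = 0
          AND rev_timestamp >= DATE_FORMAT(NOW() - INTERVAL 90 DAY,
                                           '%Y%m%d%H%i%S')
          AND rev_user > 0               -- skip anonymous edits
        GROUP BY rev_user
        HAVING COUNT(*) >= 5             -- "active": 5+ mainspace edits
    ) active
    LEFT JOIN user_properties up
           ON up.up_user = active.rev_user AND up.up_property = 'skin'
    GROUP BY skin ORDER BY editors DESC
    """

    conn = MySQLdb.connect(host='enwiki.labsdb', db='enwiki_p',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    cur = conn.cursor()
    cur.execute(QUERY)
    for skin, editors in cur.fetchall():
        print skin, editors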
Matt Flaschen
Hi,
just a bit confused (and this is probably a 101 question): the number of
active editors that I get here
<http://stats.wikimedia.org/EN/TablesWikipediaSIMPLE.htm>
is very well defined and says what it means. The trouble is, it is
inconsistent with the number that I get here
<http://simple.wikipedia.org/wiki/Special:Statistics>
for active editors, and if I understand the definition at the latter
correctly, it means "one edit or action within the last 30 days".
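For reference, here is how I read the second number: it is the live
'activeusers' statistic from the standard siteinfo API (the same value
Special:Statistics displays, as far as I understand):

    import json
    import urllib2

    url = ('http://simple.wikipedia.org/w/api.php'
           '?action=query&meta=siteinfo&siprop=statistics&format=json')
    stats = json.load(urllib2.urlopen(url))['query']['statistics']
    print stats['activeusers']  # the number shown on Special:Statistics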
Can someone please help me to interpret the numbers appropriately?
Cheers,
Denny
P.S.: this is true for all WMF projects, I took simple as an example.
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
(Society for the Promotion of Free Knowledge). Registered in the register
of associations of the Amtsgericht Berlin-Charlottenburg under number
23855 B. Recognized as charitable by the Finanzamt für Körperschaften I
Berlin, tax number 27/681/51985.
-------- Original Message --------
Subject: [E3-team] Account creation campaign support
Date: Tue, 26 Mar 2013 12:03:53 -0700
From: Steven Walling <swalling(a)wikimedia.org>
Reply-To: E3 team discussion list <e3-team(a)lists.wikimedia.org>
To: WMF Product Team <wmfproduct(a)lists.wikimedia.org>, Internal E3 team
discussion list <e3-team(a)lists.wikimedia.org>
Hi all,
This is just a heads up about an impending change and our future plans.
*Background: *As most of you know, we are hard at work trying to
implement our tests of a new account creation interface in core.[1] For
the past several months during our tests, Extension:E3Experiments and
the new signup interface on enwiki have supported tracking of signup
campaigns via a URL parameter (e.g. &campaign=foo).
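To make that concrete, a call to action simply appends the parameter to
the signup URL, along these lines (a sketch; the campaign value here is
made up):

    # Illustrative only: how a call to action tags signups under the
    # current implementation. 'asa2013' is a made-up campaign value.
    import urllib

    params = [('title', 'Special:UserLogin'), ('type', 'signup'),
              ('campaign', 'asa2013')]
    signup_url = ('https://en.wikipedia.org/w/index.php?'
                  + urllib.urlencode(params))
    print signup_url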
Though we considered it a "nice to have" addition to the account
creation A/B tests, this piece of functionality has been very useful for
understanding the differences in editing activity and overall retention
for users who sign up via various calls to action, such as the Article
Feedback Tool or an outside party like the American Sociological
Association.
*What's changing: *Starting potentially as soon as this Thursday, E3
infrastructure won't be supporting the current implementation of
campaign tracking via URL parameters. This needs to come down as part of
our removal of the old code for the test version of the new signup page,
since we're in the process of committing the new UI modifications to
core. For those interested, this will most likely happen during the
regularly scheduled E3 deployment window, which occurs every Thursday.
As far as I am aware, there aren't any currently-running campaigns
undergoing active analysis, but I wanted to send out an announcement in
case anyone was planning on running any campaigns in the near future.
*Future work on campaign support: *We are taking down the current
campaign tracking for the foreseeable future, but implementing it
properly and permanently is on our roadmap. Responding to the bugs and
feedback we get on the key functionality of account creation and login
is our first priority, but after that, campaign support is something we
want to invest in.
Please let me know if you have any questions about what precisely we
plan to support, and please forward this email on to anyone relevant.
Steven
1. https://www.mediawiki.org/wiki/Account_creation_user_experience
Hey all, quick update with a summary of where we stand and the plan of
attack.
tl;dr: It's probably the known-to-be-busted nginx sequence numbers on the
SSL boxes polluting the average. We should modify PacketLossLogtailer.py to
drop their entries.
1. Ganglia Numbers
As the Ganglia numbers are our main operational metric for
data-integrity/analytics system health, it's essential we understand how
they're generated.
A rough outline:
- The PacketLossLogtailer.py script aggregates packet-loss.log and
reports the results to Ganglia
- packet-loss.log, in turn, is generated by bin/packet-loss from the udplog
project. From Andrew's previous email, lines look like (SPOILER ALERT):
[2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%
[2013-03-20T21:38:47] ssl1001 lost: (66.71974 +/- 0.20411)%
[2013-03-20T21:38:47] ssl1002 lost: (66.78863 +/- 0.19653)%
Andrew is digging into the source to ensure we understand how these error
intervals are generated, especially given that the logtailer drops entries
with a high error margin.
We assume we won't find anything shocking in packet-loss.cpp, but we need
to be sure we understand what these numbers mean.
2. Current Alerts
Erik Zachte analyzed sequence number deltas in the sampled 1:1000 squid
logs and found that the SSL hosts had 60%+ loss; Andrew verified this
using the packet-loss logs. It's a known bug that nginx generates
busted sequence numbers, so this is probably the explanation. Andrew is
following up.
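For those unfamiliar with the method, it works roughly like this (a
sketch of my understanding; Erik's actual scripts may differ): each
frontend host stamps its log lines with a per-host sequence number, and
the 1:1000 sampled stream should therefore see per-host sequence deltas
of about 1000 when nothing is lost. Larger deltas mean loss:

    def loss_from_sequences(seqs, sample_rate=1000):
        """Estimate loss from per-host sequence numbers seen in a sampled log.
        seqs must be sorted; assumes uniform 1:sample_rate sampling."""
        sent = seqs[-1] - seqs[0]                 # lines the host emitted
        received = (len(seqs) - 1) * sample_rate  # lines that made it through
        return max(0.0, 1.0 - float(received) / sent) if sent else 0.0

    print loss_from_sequences([1000, 2000, 3000, 11000])  # 0.7 -> 70% lost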
There's a clear trend visible if you look at the loss numbers over the last
month -- the 90th percentile shows it most clearly:
- 90th Percentile:
http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=Packet…
- Average:
http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=packet…
- Last week:
http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=Packet+…
As Oxygen does not receive logging from the SSL proxies, this seems almost
certain to be the answer.
However, while filtering out entries in the packet-loss log from the SSL
hosts might fix the alerts, it doesn't explain why we're seeing the
increase in the first place. More troubling, there was a huge increase in
October that we somehow did not notice:
- Average:
http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=packet_…
- 90th:
http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=Packet+…
There are theories aplenty, with both load and filters contributing to the
loss numbers. Further, different filters run on each box (themselves with a
wide range of hardware), so their performance isn't perfectly comparable.
Andrew's on point looking into these questions, but the immediate next
step is to ignore the SSL machines in the packet-loss log. Modifying
PacketLossLogtailer.py (rather than the logging itself) lets us preserve
the data for future analysis; a sketch of the change is below.
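Concretely, the change amounts to something like this (a sketch, not the
final patch; the line format is taken from the examples above, and the
host names in the sample are illustrative):

    import re

    # packet-loss.log lines: [2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%
    LINE_RE = re.compile(r'^\[(?P<ts>[^\]]+)\] (?P<host>\S+) lost: '
                         r'\((?P<loss>[\d.]+) \+/- (?P<margin>[\d.]+)\)%')

    def average_loss(lines, skip_ssl=True):
        """Mean loss over all parseable lines, optionally ignoring SSL hosts."""
        losses = []
        for line in lines:
            m = LINE_RE.match(line)
            if m and not (skip_ssl and m.group('host').startswith('ssl')):
                losses.append(float(m.group('loss')))
        return sum(losses) / len(losses) if losses else 0.0

    sample = ['[2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%',
              '[2013-03-20T21:38:47] cp1041 lost: (0.31000 +/- 0.05000)%']
    print average_loss(sample)         # 0.31 -- ssl1 ignored
    print average_loss(sample, False)  # ~31.38 -- ssl1 pollutes the average

Filtering here, at the reporting layer, keeps the raw SSL entries on disk
for later analysis.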
--
David Schoonover
dsc(a)wikimedia.org
On Wed, Mar 20, 2013 at 3:11 PM, Diederik van Liere <dvanliere(a)wikimedia.org
> wrote:
> I do think that the Nginx / SSL servers are skewing the packetloss numbers
> because https://rt.wikimedia.org/Ticket/Display.html?id=859 has never
> been resolved.
> D
>
>
> On Wed, Mar 20, 2013 at 6:02 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
>
>> (Tim, I'm CCing you here, maybe you can help me interpret udp2log packet
>> loss numbers?)
>>
>> Hm, so, I've set up gadolinium as a 4th udp2log webrequest host. It's
>> running some (but not all) of the same filters that locke is currently
>> running. udp2log packet_loss_average has hovered at 4 or 5% the entire
>> time it has been up.
>>
>>
>> http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%2…
>>
>> Is that normal? I don't think so. This leads me to believe that my
>> previous diagnosis (increased traffic hitting some thresholds) was wrong.
>>
>> I see high numbers for packet loss from the SSL machines in both locke
>> and gadolinium's packet-loss.log. e.g.
>>
>> [2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%
>> [2013-03-20T21:38:47] ssl1001 lost: (66.71974 +/- 0.20411)%
>> [2013-03-20T21:38:47] ssl1002 lost: (66.78863 +/- 0.19653)%
>> …
>>
>> Also, some of the sq*.wikimedia.org hosts (I think these are the Tampa
>> squids) have higher than normal packet loss numbers, whereas the eqiad and
>> esams hosts are mostly close to 0%.
>>
>> Could it be that a few machines here are skewing the packet loss average?
>>
>> The fact that gadolinium is seeing packet loss with not much running on
>> it indicates to me that I'm totally missing something here. Since the data
>> on gadolinium isn't being consumed yet, and it's still duplicated on locke,
>> I'm going to turn off pretty much all filters there and see what happens.
>>
>> -Ao
>>
>>
>>
>>
>> On Mar 19, 2013, at 10:11 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>>
>> (Jeff, I'm CCing you here because you're involved in locke stuff.)
>>
>>
>>
>> 1. how well does the solution reduce/eliminate packet loss? (Erik Z's
>> input useful here)
>>
>> I could be totally wrong with my diagnosis of the problem (especially
>> since there was packet loss on emery yesterday too), but I believe that if
>> we do the proposed solutions (replace locke, reduce filters on oxygen) this
>> should eliminate the regular packet loss for the time being. emery might
>> need some help too. Perhaps we can keep locke around for a while longer
>> and spread out emery's filters between locke and gadolinium.
>>
>> 2. is the solution temporary or permanent?
>>
>> If it solves the problem, this is as permanent as it gets. We can expect
>> traffic to only increase, so eventually this problem could happen again.
>> Also, people always want more data, so it's possible that there will be
>> enough demand for new filters that our current hardware would reach
>> capacity again. Eventually we'd like for Kafka to replace much of
>> udp2log's functionality. That is our actual permanent solution.
>>
>> 3. how soon could we implement the solution?
>>
>> I will work on gadolinium (locke replacement) today. Oxygen filter
>> reduction is pending verification from Evan that it is ok to do so. He's
>> not so sure.
>>
>> 4. what kind of investment in time/resources would it take, and from
>> whom (likely)?
>>
>> locke replacement shouldn't take more than a day. The production
>> frontend cache servers all need config changes deployed to them. I'd need
>> to get that reviewed and a helpful ops babysitter.
>>
>> Jeff Green has special fundraising stuff on locke, so we'd need his help
>> to migrate that off. Perhaps we should leave locke running and let Jeff
>> Green keep it for Fundraising?
>>
>>
>>
>>
>> On Mar 18, 2013, at 9:31 PM, Kraig Parkinson <kparkinson(a)wikimedia.org>
>> wrote:
>>
>> Andrew/Diederik, we've captured a lot of info in this thread so far.
>> Could you briefly articulate/summarize the set of options we have for
>> solving this problem as well as help us understand the trade-offs?
>>
>> I'm interested in seeing the differences in terms of the following:
>> 1. how well does the solution reduce/eliminate packet loss? (Erik Z's
>> input useful here)
>> 2. is the solution temporary or permanent?
>> 3. how soon could we implement the solution?
>> 4. what kind of investment in time/resources would it take, and from
>> whom (likely)?
>>
>> Kraig
>>
>>
>> On Mon, Mar 18, 2013 at 2:02 PM, Rob Lanphier <robla(a)wikimedia.org>wrote:
>>
>>> Hi Diederik,
>>>
>>> I don't think it'd be responsible to let this go on for another 2-3
>>> days. It's already arguably been going on for too long as it stands.
>>> If you all haven't been cautioning people to adjust their numbers
>>> based on known loss, and you haven't been doing that yourselves, you
>>> probably should let folks know.
>>>
>>> At any rate, I'm probably not the customer here. In the team, Erik
>>> Zachte probably has the best perspective on what is acceptable and
>>> what isn't, so I'm cc'ing him.
>>>
>>> Rob
>>>
>>> On Mon, Mar 18, 2013 at 12:31 PM, Diederik van Liere
>>> <dvanliere(a)wikimedia.org> wrote:
>>> > Robla: when you say 'soonish', what exact timeframe do you have in
>>> > mind for solving this?
>>> > D
>>> >
>>> >
>>> > On Mon, Mar 18, 2013 at 3:25 PM, Diederik van Liere
>>> > <dvanliere(a)wikimedia.org> wrote:
>>> >>
>>> >> Office wiki: https://office.wikimedia.org/wiki/Partner_IP_Ranges
>>> >> udp-filter config:
>>> >> https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=temp…
>>> >> D
>>> >>
>>> >>
>>> >> On Mon, Mar 18, 2013 at 2:51 PM, David Schoonover <dsc(a)wikimedia.org>
>>> >> wrote:
>>> >>>
>>> >>> For reference, here's the varnish config in operations/puppet.
>>> >>>
>>> >>>
>>> >>> https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=temp…
>>> >>>
>>> >>> I don't have links to the office wiki list or the udp-filter config;
>>> >>> Diederik, could you link them just to double-check?
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> David Schoonover
>>> >>> dsc(a)wikimedia.org
>>> >>>
>>> >>>
>>> >>> On Mon, Mar 18, 2013 at 11:41 AM, Diederik van Liere
>>> >>> <dvanliere(a)wikimedia.org> wrote:
>>> >>>>
>>> >>>>
>>> >>>>> It's not a big difference, but it seems like a good reason to be a
>>> >>>>> little wary about switching between the IP logic and the varnish
>>> >>>>> logic. Thoughts?
>>> >>>>
>>> >>>>
>>> >>>> There is no fundamental difference between the IP logic and the
>>> >>>> varnish logic: varnish sets the X-CS header on the same IP ranges as
>>> >>>> udp-filter does. I just checked the office wiki, the varnish config,
>>> >>>> and udp-filter, and for Orange Uganda everything looks good.
>>> >>>>
>>> >>>> D
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>
>>
>>
>>
>
All,
Fundraising is proposing an experiment to model user behavior on our
properties. I've written an RfC on exactly what I'm proposing here [1]. I
would love any comments, concerns, methodology changes, and additional
considerations you might have.
[1] http://meta.wikimedia.org/wiki/User_site_behavior_collection
Thanks
~Matt Walker
Wikimedia Foundation
Fundraising Technology Team