Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but they
almost ground to a halt recently, and the on-demand tools have stalled
completely.
All these tools get their data from stats.grok.se, which works well
but is not exactly high-speed; my on-demand tools have apparently been
shut out recently because too many people were using them, effectively
DDoSing the server :-(
I know you are working on page view numbers, and from what I gather
it's already up and running internally. My requirements are simple: I
have a list of pages on many Wikimedia projects; I need per-page view
counts for those pages for a specific month.
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
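For concreteness, here's roughly what my tools do today, as a minimal
Python sketch; the stats.grok.se URL scheme and the "daily_views"
field name are from memory, so treat them as assumptions:

  # Sketch: per-page monthly view counts via stats.grok.se's JSON
  # interface. URL scheme and field names are assumptions.
  import json
  import urllib
  import urllib2

  def monthly_views(project, month, title):
      # e.g. project="en", month="201211", title="Rembrandt"
      url = "http://stats.grok.se/json/%s/%s/%s" % (
          project, month, urllib.quote(title))
      data = json.load(urllib2.urlopen(url))
      # Sum the daily counts to get the monthly total for the page.
      return sum(data["daily_views"].values())

  for project, month, title in [("en", "201211", "Rembrandt"),
                                ("nl", "201211", "Rembrandt")]:
      print project, title, monthly_views(project, month, title)

Anything that answers the same per-page, per-month lookup would do.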
Cheers,
Magnus
I thought people might like to add to this list:
https://www.mediawiki.org/wiki/Analytics/Dreams
of queries that people want to run, once the infrastructure exists.
Mine include: What proportion of site vandalism comes from behind Tor
proxies? Does it differ across wikis?
And Rachel Farrand suggested the question: What proportion of longtime
logged-in editors have customized user scripts (JS or CSS in their
userspace)?
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
Hi folks,
David and Andrew gave me an in-depth overview of the current state of
the analytics cluster, so I figured it might be good for me to clean
up the notes from that and send them here, both to check my
understanding of these things, and in hopes that this may be useful.
The official docs for this are here:
https://www.mediawiki.org/wiki/Analytics/Kraken
One prominent item on that page is the System Architecture Overview,
which is out of date:
https://upload.wikimedia.org/wikipedia/mediawiki/3/38/Kraken_flow_diagram.p…
In particular, here are the things that are different:
"Pixel service" - no longer will have bundlers.
"ReqLog service" - no longer will have bundlers. Currently udp2log.
More on this below.
"Bundle queue [Cassandra]" - Currently a combination of udp2log and Kafka.
"Datawarehouse [Cassandra]" - this is Hadoop now
The long term plan is to use Kafka here. Short term, udp2log is in
use (piped from udp2log into Kafka in a couple cases)
Here's what the current data pipeline looks like for pageviews going
into Hadoop and coming out as useful information for analysts, using
the terminology from the Kraken flow diagram (even though it's
probably no longer appropriate in many cases). These are each separate
hosts or clusters unless otherwise noted:
1. ReqLog service: Squid/Varnish udp logging emitters
2. Bundle queue: udp2log
3. Bundle queue: Kafka producers (on udp2log host)
4. Bundle queue: Kafka broker
5. Canonicalization topology: Kafka consumer
6. Data Warehouse (ETL phase): Log storage on Hadoop cluster
7. MapReduce Jobs: Pig scripts (running on Hadoop cluster)
8. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)
Routing of Kafka producers to Kafka brokers is done via Zookeeper.
The glorious future looks something like:
1. ReqLog service: Squid/Varnish Kafka producers
2. Bundle queue: Kafka broker
3. Canonicalization topology: Multiple Kafka consumers/Storm Spouts
(on Storm cluster)
4. Canonicalization topology: Storm bolts for
anonymization/geotagging, etc (on Storm cluster)
5. Data Warehouse (ETL phase): Log storage on Hadoop cluster
6. MapReduce Jobs: Pig scripts (running on Hadoop cluster)
7. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)
More detail on the pipeline:
ReqLog service:
Currently udp2log - 1:100 sampling for most things, plus finer
granularity for some things. We eventually need to decide whether
we're going to put Kafka producers directly on the Varnish hosts in
production, or continue to emit the logs via udp to a udp2log
collector.
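The sampling itself is simple: deterministic, every Nth line. A
Python sketch of the 1:100 behavior (udp2log's real implementation is
in C; this only illustrates the idea):

  # Pass every 100th log line through, drop the rest; this mimics
  # udp2log-style deterministic 1:N sampling.
  import sys

  SAMPLE_FACTOR = 100
  counter = 0
  for line in sys.stdin:
      counter += 1
      if counter % SAMPLE_FACTOR == 0:
          sys.stdout.write(line)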
Pixel service:
We didn't get into talking about this part of the architecture.
Rather than restating my jumbled recollection of what's going on here
(randomly throwing around terms like "ClickTracking extension",
"EventLogging extension", "varnishncsa"), I'll ask that someone else
venture a correct assessment of the current/future state of this.
Bundle queue:
Kafka producers produce from udp2log for now (one per stream). We
currently have 2 Kafka brokers, managed via Zookeeper. Raw logs are
persisted for a one-week sliding window, straight to disk. They are
not sanitized, so this host will be locked down for the foreseeable
future.
If we weren't sampling, we would typically get about 16.8 terabytes of
uncompressed data in one week's sliding window (roughly 28 MB/s
sustained). Snappy compression (not implemented yet) should reduce the
size by roughly two-thirds, to around 5.6 TB per week.
Apache Zookeeper (not in the diagram) - Zookeeper provides
application-level routing of data, keeps track of unavailable nodes,
and signals senders to use different hosts when necessary.
We have 3 Zookeeper nodes. See this thread about Zookeeper/Kafka
failover:
http://lists.wikimedia.org/pipermail/analytics/2012-August/000095.html
Canonicalization Topology:
We currently have a single Kafka consumer - an hourly cron job pulling
from the brokers and pushing the raw logs straight into Hadoop (in
approximately 1 GB chunks). We'd like to do much more processing at
this stage. Most importantly, we'd like to anonymize the data here so
that we don't have to be stingy about who we give access to the Hadoop
cluster.
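In rough terms, that hourly job does something like the sketch below.
The chunk size, the HDFS path layout, and the idea of piping lines in
from a Kafka console consumer are all illustrative assumptions:

  # Sketch: read raw log lines from stdin (e.g. piped from a Kafka
  # console consumer), write ~1 GB local chunk files, and push each
  # finished chunk into HDFS. Paths and sizes are hypothetical.
  import subprocess
  import sys

  CHUNK_BYTES = 1 << 30  # ~1 GB per file
  HDFS_DIR = "/wmf/raw/webrequest/2012/12/05/14"  # hypothetical

  def flush(path):
      # 'hadoop fs -put' copies a local file into HDFS.
      subprocess.check_call(["hadoop", "fs", "-put", path, HDFS_DIR])

  chunk_no, written = 0, 0
  out = open("chunk-%04d.log" % chunk_no, "w")
  for line in sys.stdin:
      out.write(line)
      written += len(line)
      if written >= CHUNK_BYTES:
          out.close()
          flush(out.name)
          chunk_no, written = chunk_no + 1, 0
          out = open("chunk-%04d.log" % chunk_no, "w")
  out.close()
  if written:
      flush(out.name)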
The plan is to use Storm <http://storm-project.net/> as an ETL layer.
Some custom dev work needs to happen at this stage, since we're not
aware of any off-the-shelf log anonymization solutions that fit into
this architecture (in particular, ones that can anonymize in real
time). We plan to do a formal privacy and security review before
relying on whatever solution we develop.
Glossary for Storm:
"Worker nodes": run Bolts.
"Bolt": a small unit of processing
"Topology": arrangement of bolts, flow of data. Bolts operates on
data. Bolts can be writte in any language.
"Tuple": Unit of data in storm.
"Spout": source of tuples.
"Nimbus": job manager/scheduler, doesn't actually do any processing.
If Nimbus goes down, topology still works.
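To give a feel for how small a bolt can be, here's a sketch of an
anonymizing bolt written against Storm's Python multilang shim
(storm.py ships with Storm). The (ip, line) tuple layout and the
zero-the-last-octet scheme are illustrative assumptions, not the
design that will come out of the privacy review:

  # Sketch of an anonymization Bolt using Storm's multilang support.
  # Tuple layout and anonymization scheme are assumptions only.
  import storm

  class AnonymizeBolt(storm.BasicBolt):
      def process(self, tup):
          ip, line = tup.values
          # Crude anonymization: zero out the last IPv4 octet.
          anon_ip = ".".join(ip.split(".")[:3] + ["0"])
          storm.emit([anon_ip, line])

  AnonymizeBolt().run()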
We plan to have 8 hosts running as Storm workers. This is probably
more than we need; it may be as few as 2-3. If we're right about
this, it's easy enough to put that hardware to other uses.
Data Warehouse:
We're using Hadoop/HDFS to store data. The logs currently come in as
hourly blocks of 1:100 sampled data, which translates into lots of
1 GB files stored hierarchically by date/time.
HDFS is a block filesystem with 256 MB blocks. Each block is
replicated 3 times, with each replica stored on different physical
hardware. A central NameNode coordinates what gets stored where.
MapReduce Jobs:
Our batch processing system is Hadoop/MapReduce under the hood. "Hue"
provides the web management user interface. "Oozie" manages XML-based
workflow definitions and does things like schedule regular updates to
data. "Pig" and "Hive" are two different layers for building MapReduce
jobs.
Pig is a simple domain-specific language for defining batch processing
jobs. For those who have access to the analytics cluster, here's the
blog stats pig script:
http://hue.analytics.wikimedia.org/filebrowser/view/user/diederik/blog/blog…
Hive is a different technology: it turns SQL queries into MapReduce
jobs. Pig is currently the preferred tool, but we'll support Hive if
it catches on.
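For comparison, here's the shape of that kind of job without Pig: a
trivial per-URL request counter as a Hadoop Streaming job in Python.
The log field layout (URL in the 9th space-separated column) is a
hypothetical assumption:

  # Sketch: count requests per URL with Hadoop Streaming. Run the
  # same file as mapper ("python job.py map") and reducer
  # ("python job.py reduce") via the streaming jar's -mapper and
  # -reducer options. Field layout is hypothetical.
  import sys

  def mapper():
      for line in sys.stdin:
          fields = line.split(" ")
          if len(fields) > 8:
              print "%s\t1" % fields[8].strip()

  def reducer():
      # Streaming sorts mapper output by key before the reduce step.
      current, count = None, 0
      for line in sys.stdin:
          key, n = line.rsplit("\t", 1)
          if key != current:
              if current is not None:
                  print "%s\t%d" % (current, count)
              current, count = key, 0
          count += int(n)
      if current is not None:
          print "%s\t%d" % (current, count)

  if __name__ == "__main__":
      if sys.argv[1] == "map":
          mapper()
      else:
          reducer()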
A few other notes:
* Infrastructure and Hardware
** https://www.mediawiki.org/wiki/Analytics/Kraken/Infrastructure
** Current allocation plans reflect a priori estimates -- we'll be
revising them as we get a feel for the workloads (e.g., 8 machines on
ETL is a ridiculous overestimate; we'll probably only need 2-3).
* Alternatives & Comparisons
** We benchmarked and tested Cassandra as a datastore, but ultimately
went with HDFS
** Request Logging
*** Recommendation:
https://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging
*** Alternatives:
https://www.mediawiki.org/wiki/Analytics/Kraken/Logging_Solutions_Overview
I hope this is helpful to everyone, and please let me know if there's
anything I missed or got wrong here.
Thanks,
Rob
Hay errybody -- Jus lettin y'all know the Analytics roadmap page has been updated through Jan 2013 and I've propagated the changes to the larger Engineering roadmap page. Please review everything to ensure its continued accuracy, veracity, and sanctity.
https://www.mediawiki.org/wiki/Analytics/Roadmap
https://www.mediawiki.org/wiki/Roadmap#Analytics
Cheers,
--
David Schoonover
dsc(a)wikimedia.org
Hiya all,
Yesterday we had the analytics quarterly review meeting, which was a joy to behold and fun for all ages, as evidenced by the meeting notes and this deck prepared by Diederik and me (but mostly Diederik).
Meeting Notes: https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/2012_Q2_Q…
Deck: https://docs.google.com/a/wikimedia.org/presentation/d/1EutD_z6Koyv71JY8qM1…
A huge thanks to Erik Zachte (I believe) for the notes -- without him we'd have none at all, as everyone else was engaged in the discussion.
The notes are a bit sparse, but we'll be updating our plans and fleshing out the roadmap based on the meeting's feedback, so there will be more helpful info forthcoming.
As always, y'all are encouraged to ask if you have any questions.
Cheers,
--
David Schoonover
dsc(a)wikimedia.org
We (fundraising) got an invitation to the DRIVE 2013 Data Processing/Data
Analytics Conference in February with instructions to forward along to
anyone else who might be interested. See below, edited for clarity.
--
I want to personally invite you to our Data Processing/Data Analytics
Conference this upcoming February. It's a two day conference at the Bell
Harbor Conference Center in Seattle. I encourage you to visit our website
at engage.washington.edu/drive to learn more about DRIVE 2013.
Should you have any questions that aren’t addressed on the website, please
contact us at drive(a)uw.edu.
--
Peter Gehres
Wikimedia Foundation
https://donate.wikimedia.org
-------- Original Message --------
Subject: November community metrics report
Date: Wed, 05 Dec 2012 14:19:23 -0800
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Second issue of the MediaWiki community metrics monthly report!
We have added a bunch of bug tracking data in order to highlight some of
the QA and testing activities. Hopefully next month we will show
mediawiki.org data to reflect the documentation work.
http://www.mediawiki.org/wiki/Community_metrics/November_2012
The monthly community metrics reports are still very much a work in
progress. Your feedback and help are welcome!
--
Quim Gil
Technical Contributor Coordinator
Wikimedia Foundation
Please respond on
https://www.mediawiki.org/wiki/Amsterdam_Hackathon_2013#Straw_Poll . Thanks!
-Sumana
-------- Original Message --------
Subject: [Toolserver-l] Amsterdam Hackathon 2013
Date: Sat, 01 Dec 2012 13:43:18 +0100
From: Maarten Dammers <maarten(a)mdammers.nl>
Reply-To: Wikimedia Toolserver <toolserver-l(a)lists.wikimedia.org>
To: wikitech-l(a)lists.wikimedia.org, toolserver-l(a)lists.wikimedia.org,
pywikipedia-l(a)lists.wikimedia.org
Hi everyone,
Unlike previous years, the big European Hackathon won't be in Berlin
but in Amsterdam. We're aiming to hold the hackathon in May 2013, with
a preference for the weekend of Saturday the 25th. To make sure this
is a good weekend, I've set up a straw poll at
https://www.mediawiki.org/wiki/Amsterdam_Hackathon_2013#Straw_Poll .
Please fill it out so we can finalize the date!
Thank you,
Maarten
Wikimedia Nederland
PS: Please forward to any relevant lists I might have missed.
FYI
-------- Original Message --------
Subject: Complete (basic) analysis of MediaWiki
Date: Mon, 03 Dec 2012 08:22:03 -0800
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, "Jesus M.
Gonzalez-Barahona" <jgb(a)bitergia.com>
Last October we got a bunch of MediaWiki developer stats thanks to the
aggregation of data by Ohloh [1]; now we are getting plenty more stats
from Bitergia, including data from bug reporting and mailing lists:
http://blog.bitergia.com/2012/12/03/complete-basic-analysis-of-mediawiki/
Bitergia is a company based in Madrid, formed by a small team of
developers who have been working on FLOSS stats software for a long
time. All the tools they develop are free software, publicly
available and open to contributions.
They have been kind enough to contribute some time and work setting up
stats for the MediaWiki community. They also welcome feedback about the
service and the data collected. I'm CCing Jesús M. González-Barahona,
who has been my regular contact for this task in the past weeks.
All good news for http://www.mediawiki.org/wiki/Community_Metrics !
[1] https://www.ohloh.net/orgs/wikimedia
--
Quim Gil
Technical Contributor Coordinator
Wikimedia Foundation
As of last week [1], Google Knowledge Graph started displaying information on drugs pulled from PubMed. Links to Wikipedia articles on drugs (which still appear in organic search results) have entirely disappeared from the infobox. I expect this to have a significant impact on our inbound traffic from search engines on these topics.
[1] http://techcrunch.com/2012/11/30/google-adds-key-facts-about-medicines-to-i…