Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but they
almost ground to a halt recently, and the on-demand tools have stalled
completely.
All these tools get their data from stats.grok.se, which works well
but is not exactly high-speed; my on-demand tools have apparently been
shut out recently because too many people were using them, effectively
DDoSing the server :-(
I know you are working on page view numbers, and from what I gather
it's already up and running internally. My requirements are simple: I
have a list of pages on many Wikimedia projects; I need per-page view
counts for those pages for a specific month.
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
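For concreteness, here's roughly what my tools do today, as a minimal
Python sketch; the stats.grok.se URL scheme and the "daily_views"
field name are from memory, so treat them as assumptions:

  # Sketch: per-page monthly view counts via stats.grok.se's JSON
  # interface. URL scheme and field names are assumptions.
  import json
  import urllib
  import urllib2

  def monthly_views(project, month, title):
      # e.g. project="en", month="201211", title="Rembrandt"
      url = "http://stats.grok.se/json/%s/%s/%s" % (
          project, month, urllib.quote(title))
      data = json.load(urllib2.urlopen(url))
      # Sum the daily counts to get the monthly total for the page.
      return sum(data["daily_views"].values())

  for project, month, title in [("en", "201211", "Rembrandt"),
                                ("nl", "201211", "Rembrandt")]:
      print project, title, monthly_views(project, month, title)

Anything that answers the same per-page, per-month lookup would do.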
Cheers,
Magnus
I thought people might like to add to this list:
https://www.mediawiki.org/wiki/Analytics/Dreams
of queries that people want to run, once the infrastructure exists.
Mine include: What proportion of site vandalism comes from behind Tor
proxies? Does it differ across wikis?
And Rachel Farrand suggested the question: What proportion of longtime
logged-in editors have customized user scripts (JS or CSS in their
userspace)?
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
Hi folks,
David and Andrew gave me an in-depth overview of the current state of
the analytics cluster, so I figured it might be good for me to clean
up the notes from that and send them here, both to check my
understanding of these things, and in hopes that this may be useful.
The official docs for this are here:
https://www.mediawiki.org/wiki/Analytics/Kraken
One prominent item on that page is the System Architecture Overview,
which is out of date:
https://upload.wikimedia.org/wikipedia/mediawiki/3/38/Kraken_flow_diagram.p…
In particular, here are the things that are different:
"Pixel service" - no longer will have bundlers.
"ReqLog service" - no longer will have bundlers. Currently udp2log.
More on this below.
"Bundle queue [Cassandra]" - Currently a combination of udp2log and Kafka.
"Datawarehouse [Cassandra]" - this is Hadoop now
The long term plan is to use Kafka here. Short term, udp2log is in
use (piped from udp2log into Kafka in a couple cases)
Here's what the current data pipeline looks like for pageviews going
into Hadoop and coming out as useful information for analysts, using
the terminology from the Kraken flow diagram (even though it's
probably no longer appropriate in many cases). These are each separate
hosts or clusters unless otherwise noted:
1. ReqLog service: Squid/Varnish udp logging emitters
2. Bundle queue: udp2log
3. Bundle queue: Kafka producers (on udp2log host)
4. Bundle queue: Kafka broker
5. Canonicalization topology: Kafka consumer
6. Data Warehouse (ETL phase): Log storage on Hadoop cluster
7. MapReduce Jobs: Pig scripts (running on Hadoop cluster)
8. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)
Routing of Kafka producers to Kafka brokers is done via Zookeeper.
The glorious future looks something like:
1. ReqLog service: Squid/Varnish Kafka producers
2. Bundle queue: Kafka broker
3. Canonicalization topology: Multiple Kafka consumers/Storm Spouts
(on Storm cluster)
4. Canonicalization topology: Storm bolts for
anonymization/geotagging, etc (on Storm cluster)
5. Data Warehouse (ETL phase): Log storage on Hadoop cluster
6. MapReduce Jobs: Pig scripts (running on Hadoop cluster)
7. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)
More detail on the pipeline:
ReqLog service:
Currently udp2log - 1:100 sampling for most things, plus finer
granularity for some things. We eventually need to decide whether
we're going to put Kafka producers directly on the Varnish hosts in
production, or continue to emit the logs via udp to a udp2log
collector.
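The sampling itself is simple: deterministic, every Nth line. A
Python sketch of the 1:100 behavior (udp2log's real implementation is
in C; this only illustrates the idea):

  # Pass every 100th log line through, drop the rest; this mimics
  # udp2log-style deterministic 1:N sampling.
  import sys

  SAMPLE_FACTOR = 100
  counter = 0
  for line in sys.stdin:
      counter += 1
      if counter % SAMPLE_FACTOR == 0:
          sys.stdout.write(line)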
Pixel service:
We didn't get into talking about this part of the architecture.
Rather than restating my jumbled recollection of what's going on here
(randomly throwing around terms like "ClickTracking extension",
"EventLogging extension", "varnishncsa"), I'll ask that someone else
venture a correct assessment of the current/future state of this.
Bundle queue:
Kafka producers produce from udp2log for now (one per stream). We
currently have 2 Kafka brokers, managed via Zookeeper. Raw logs are
persisted for a one-week sliding window, straight to disk. They are
not sanitized, so this host will be locked down for the foreseeable
future.
If we weren't sampling, we would typically get about 16.8 terabytes of
uncompressed data in one week's sliding window (roughly 28 MB/s
sustained). Snappy compression (not implemented yet) should reduce the
size by roughly two-thirds, to around 5.6 TB per week.
Apache Zookeeper (not in the diagram) - Zookeeper provides
application-level routing of data, keeps track of unavailable nodes,
and signals senders to use different hosts when necessary.
We have 3 Zookeeper nodes. See this thread about Zookeeper/Kafka
failover:
http://lists.wikimedia.org/pipermail/analytics/2012-August/000095.html
Canonicalization Topology:
We currently have a single Kafka consumer - an hourly cron job pulling
from the brokers and pushing the raw logs straight into Hadoop (in
approximately 1 GB chunks). We'd like to do much more processing at
this stage. Most importantly, we'd like to anonymize the data here so
that we don't have to be stingy about who we give access to the Hadoop
cluster.
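In rough terms, that hourly job does something like the sketch below.
The chunk size, the HDFS path layout, and the idea of piping lines in
from a Kafka console consumer are all illustrative assumptions:

  # Sketch: read raw log lines from stdin (e.g. piped from a Kafka
  # console consumer), write ~1 GB local chunk files, and push each
  # finished chunk into HDFS. Paths and sizes are hypothetical.
  import subprocess
  import sys

  CHUNK_BYTES = 1 << 30  # ~1 GB per file
  HDFS_DIR = "/wmf/raw/webrequest/2012/12/05/14"  # hypothetical

  def flush(path):
      # 'hadoop fs -put' copies a local file into HDFS.
      subprocess.check_call(["hadoop", "fs", "-put", path, HDFS_DIR])

  chunk_no, written = 0, 0
  out = open("chunk-%04d.log" % chunk_no, "w")
  for line in sys.stdin:
      out.write(line)
      written += len(line)
      if written >= CHUNK_BYTES:
          out.close()
          flush(out.name)
          chunk_no, written = chunk_no + 1, 0
          out = open("chunk-%04d.log" % chunk_no, "w")
  out.close()
  if written:
      flush(out.name)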
The plan is to use Storm <http://storm-project.net/> as an ETL layer.
Some custom dev work needs to happen at this stage, since we're not
aware of any off-the-shelf log anonymization solutions that fit into
this architecture (in particular, ones that can anonymize in real
time). We plan to do a formal privacy and security review before
relying on whatever solution we develop.
Glossary for Storm:
"Worker nodes": run Bolts.
"Bolt": a small unit of processing
"Topology": arrangement of bolts, flow of data. Bolts operates on
data. Bolts can be writte in any language.
"Tuple": Unit of data in storm.
"Spout": source of tuples.
"Nimbus": job manager/scheduler, doesn't actually do any processing.
If Nimbus goes down, topology still works.
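To give a feel for how small a bolt can be, here's a sketch of an
anonymizing bolt written against Storm's Python multilang shim
(storm.py ships with Storm). The (ip, line) tuple layout and the
zero-the-last-octet scheme are illustrative assumptions, not the
design that will come out of the privacy review:

  # Sketch of an anonymization Bolt using Storm's multilang support.
  # Tuple layout and anonymization scheme are assumptions only.
  import storm

  class AnonymizeBolt(storm.BasicBolt):
      def process(self, tup):
          ip, line = tup.values
          # Crude anonymization: zero out the last IPv4 octet.
          anon_ip = ".".join(ip.split(".")[:3] + ["0"])
          storm.emit([anon_ip, line])

  AnonymizeBolt().run()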
We plan to have 8 hosts running as Storm workers. This is probably
more than we need; it may be as few as 2-3. If we're right about
this, it's easy enough to put that hardware to other uses.
Data Warehouse:
We're using Hadoop/HDFS to store data. The logs currently come in as
hourly blocks of 1:100 sampled data, which translates into lots of
1 GB files stored hierarchically by date/time.
HDFS is a block filesystem with 256 MB blocks. Each block is
replicated 3 times, with each replica stored on different physical
hardware. A central NameNode coordinates what gets stored where.
MapReduce Jobs:
Our batch processing system is Hadoop/MapReduce under the hood. "Hue"
provides the web management user interface. "Oozie" manages XML-based
workflow definitions and does things like schedule regular updates to
data. "Pig" and "Hive" are two different layers for building MapReduce
jobs.
Pig is a simple domain-specific language for defining batch processing
jobs. For those who have access to the analytics cluster, here's the
blog stats pig script:
http://hue.analytics.wikimedia.org/filebrowser/view/user/diederik/blog/blog…
Hive is a different technology: it turns SQL queries into MapReduce
jobs. Pig is currently the preferred tool, but we'll support Hive if
it catches on.
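For comparison, here's the shape of that kind of job without Pig: a
trivial per-URL request counter as a Hadoop Streaming job in Python.
The log field layout (URL in the 9th space-separated column) is a
hypothetical assumption:

  # Sketch: count requests per URL with Hadoop Streaming. Run the
  # same file as mapper ("python job.py map") and reducer
  # ("python job.py reduce") via the streaming jar's -mapper and
  # -reducer options. Field layout is hypothetical.
  import sys

  def mapper():
      for line in sys.stdin:
          fields = line.split(" ")
          if len(fields) > 8:
              print "%s\t1" % fields[8].strip()

  def reducer():
      # Streaming sorts mapper output by key before the reduce step.
      current, count = None, 0
      for line in sys.stdin:
          key, n = line.rsplit("\t", 1)
          if key != current:
              if current is not None:
                  print "%s\t%d" % (current, count)
              current, count = key, 0
          count += int(n)
      if current is not None:
          print "%s\t%d" % (current, count)

  if __name__ == "__main__":
      if sys.argv[1] == "map":
          mapper()
      else:
          reducer()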
A few other notes:
* Infrastructure and Hardware
** https://www.mediawiki.org/wiki/Analytics/Kraken/Infrastructure
** Current allocation plans reflect a priori estimates -- we'll be
revising them as we get a feel for the workloads (e.g., 8 machines on
ETL is a ridiculous overestimate; we'll probably only need 2-3).
* Alternatives & Comparisons
** We benchmarked and tested Cassandra as a datastore, but ultimately
went with HDFS
** Request Logging
*** Recommendation:
https://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging
*** Alternatives:
https://www.mediawiki.org/wiki/Analytics/Kraken/Logging_Solutions_Overview
I hope this is helpful to everyone, and please let me know if there's
anything I missed or got wrong here.
Thanks,
Rob
Hay errybody -- Jus lettin y'all know the Analytics roadmap page has been updated through Jan 2013 and I've propagated the changes to the larger Engineering roadmap page. Please review everything to ensure its continued accuracy, veracity, and sanctity.
https://www.mediawiki.org/wiki/Analytics/Roadmap
https://www.mediawiki.org/wiki/Roadmap#Analytics
Cheers,
--
David Schoonover
dsc(a)wikimedia.org
Hiya all,
Yesterday we had the analytics quarterly review meeting, which was a joy to behold and fun for all ages, as evidenced by the meeting notes and this deck prepared by Diederik and me (but mostly Diederik).
Meeting Notes: https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/2012_Q2_Q…
Deck: https://docs.google.com/a/wikimedia.org/presentation/d/1EutD_z6Koyv71JY8qM1…
A huge thanks to Erik Zachte (I believe) for the notes -- without him we'd have none at all, as everyone else was engaged in the discussion.
The notes are a bit sparse, but we'll be updating our plans and fleshing out the roadmap based on the meeting's feedback, so there will be more helpful info forthcoming.
As always, y'all are encouraged to ask if you have any questions.
Cheers,
--
David Schoonover
dsc(a)wikimedia.org
We (fundraising) got an invitation to the DRIVE 2013 Data Processing/Data
Analytics Conference in February with instructions to forward along to
anyone else who might be interested. See below, edited for clarity.
--
I want to personally invite you to our Data Processing/Data Analytics
Conference this upcoming February. It's a two day conference at the Bell
Harbor Conference Center in Seattle. I encourage you to visit our website
at engage.washington.edu/drive to learn more about DRIVE 2013.
Should you have any questions that aren’t addressed on the website, please
contact us at drive(a)uw.edu.
--
Peter Gehres
Wikimedia Foundation
https://donate.wikimedia.org
-------- Original Message --------
Subject: November community metrics report
Date: Wed, 05 Dec 2012 14:19:23 -0800
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Second issue of the MediaWiki community metrics monthly report!
We have added a bunch of bug tracking data in order to highlight some of
the QA and testing activities. Hopefully next month we will show
mediawiki.org data to reflect the documentation work.
http://www.mediawiki.org/wiki/Community_metrics/November_2012
The monthly community metrics reports are still very much a work in
progress. Your feedback and help are welcome!
--
Quim Gil
Technical Contributor Coordinator
Wikimedia Foundation
Please respond on
https://www.mediawiki.org/wiki/Amsterdam_Hackathon_2013#Straw_Poll . Thanks!
-Sumana
-------- Original Message --------
Subject: [Toolserver-l] Amsterdam Hackathon 2013
Date: Sat, 01 Dec 2012 13:43:18 +0100
From: Maarten Dammers <maarten(a)mdammers.nl>
Reply-To: Wikimedia Toolserver <toolserver-l(a)lists.wikimedia.org>
To: wikitech-l(a)lists.wikimedia.org, toolserver-l(a)lists.wikimedia.org,
pywikipedia-l(a)lists.wikimedia.org
Hi everyone,
Unlike previous years, the big European Hackathon won't be in Berlin
but in Amsterdam. We're aiming to hold the hackathon in May 2013, with
a preference for the weekend of Saturday the 25th. To make sure this
is a good weekend, I've set up a straw poll at
https://www.mediawiki.org/wiki/Amsterdam_Hackathon_2013#Straw_Poll .
Please fill it out so we can finalize the date!
Thank you,
Maarten
Wikimedia Nederland
PS: Please forward to any relevant lists I might have missed.
FYI
-------- Original Message --------
Subject: Complete (basic) analysis of MediaWiki
Date: Mon, 03 Dec 2012 08:22:03 -0800
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, "Jesus M.
Gonzalez-Barahona" <jgb(a)bitergia.com>
Last October we got a bunch of MediaWiki developer stats thanks to the
aggregation of data by Ohloh [1]; now we are getting plenty more stats
from Bitergia, including data from bug reporting and mailing lists:
http://blog.bitergia.com/2012/12/03/complete-basic-analysis-of-mediawiki/
Bitergia is a company based in Madrid, formed by a small team of
developers who have been working on FLOSS stats software for a long
time. All the tools they develop are free software, publicly
available and open to contributions.
They have been kind enough to contribute some time and work setting up
stats for the MediaWiki community. They also welcome feedback about the
service and the data collected. I'm CCing Jesús M. González-Barahona,
who has been my regular contact for this task in the past weeks.
All good news for http://www.mediawiki.org/wiki/Community_Metrics !
[1] https://www.ohloh.net/orgs/wikimedia
--
Quim Gil
Technical Contributor Coordinator
Wikimedia Foundation
As of last week [1], Google Knowledge Graph started displaying information on drugs pulled from PubMed. Links to Wikipedia articles on drugs (which still appear in organic search results) have entirely disappeared from the infobox. I expect this to have a significant impact on our inbound traffic from search engines on these topics.
[1] http://techcrunch.com/2012/11/30/google-adds-key-facts-about-medicines-to-i…