Hey all :). We've been doing some comparative analysis of all the
different Pageviews options we have, as a first step towards putting
the new definitions in Production. TL;DR: 'promising' doesn't cover
how happy these results make me - see
http://ironholds.org/deimos/qa_tests.png
Some background: we've been working on a new definition to address
deficiencies in the existing one. At the same time, we've also been
working on writing UDFs - User Defined Functions - in Java, so that
analysts can conveniently apply whatever we come up with to our data
store of requests. This is a step forward from where we are at the
moment, which is relying on complex Hive queries, or on implementations
in various languages that run over the /sampled/ rather than unsampled
logs and so can't be used for per-page stats.
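To make the workflow concrete, here's a minimal sketch of what such a
UDF might look like, using the standard Hive UDF API - the class name,
parameters, and filtering rules below are illustrative assumptions, not
the actual implementation:

import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical pageview-tagging UDF; the real rules are more involved.
public class IsPageviewUDF extends UDF {
    // Returns true if a single webrequest row looks like a pageview.
    public Boolean evaluate(String uriPath, String httpStatus,
                            String contentType) {
        if (uriPath == null || httpStatus == null || contentType == null) {
            return false;
        }
        return httpStatus.equals("200")
            && contentType.startsWith("text/html")
            && uriPath.startsWith("/wiki/");
    }
}

Once registered, analysts can apply it straight from Hive - e.g.
SELECT SUM(IF(is_pageview(uri_path, http_status, content_type), 1, 0))
FROM webrequest WHERE ... - rather than re-encoding the rules as query
logic every time.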
As part of the robustness testing of the new definition I compared
four options. These were:
1. The legacy definition, through a Hive query;
2. The legacy definition, through its new UDF;
3. The new definition, through the sampled logs;
4. The new definition, through its new UDF.
The results can be seen at http://ironholds.org/deimos/qa_tests.png -
the reason there appear to be only three lines most of the time is that
the results match so closely that they can't be visually distinguished.
This is a pretty good heuristic for "good implementations" :D.
So, what does this mean? It means we're confident in the UDFs' ability
to replicate the previous implementations. That means we can
conveniently use them to test the validity of the actual definition,
and can rely on them for production-ready analysis once that definition
is signed off on.
What's next? Digging into the spike on 29 January, and if it doesn't
show anything scary, hand-coding the output of the two UDFs to see if
we're confident in the new definition. And if we are...well. We
release that data :D.
Tremendous thanks to Aaron Halfaker, Christian A. and Andrew Otto for
(respectively) their poking, code review and introduction to Java!
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi,
As part of my first assignment, I'll recompute our historical webrequest
dataset, adding client_ip and geocoded information.
While it seems correct to compute the historical client_ip from the
existing ip and x_forwarded_for fields, using the current state of the
MaxMind geocoding database to compute historical data is more
error-prone. I can either compute it anyway, knowing that there'll be
some errors, or put null values for data older than a given point in
time.
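For the client_ip part, the logic I have in mind is roughly the
following sketch (Java, with hypothetical names; validation of entries
and trusted-proxy handling are omitted):

// Derive the client IP from the raw ip and x_forwarded_for fields.
// x_forwarded_for is a comma-separated chain in which the leftmost
// entry is the address the request claims to originate from.
public final class ClientIpResolver {
    private ClientIpResolver() {}

    public static String resolve(String ip, String xForwardedFor) {
        if (xForwardedFor == null || xForwardedFor.isEmpty()
                || xForwardedFor.equals("-")) {
            return ip;  // no proxy chain recorded; use the raw ip
        }
        String first = xForwardedFor.split(",")[0].trim();
        return first.isEmpty() ? ip : first;
    }
}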
I'll launch the script to recompute the data as soon as max(a consensus
is found on this matter, operations gives me the right to run the
script) :)
Thanks
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Dear List,
We, the WMF Analytics team, want to make more of our internal
discussions public. We benefit tremendously from everyone who
participates on this list and want to have as much transparency as
possible into what we do.
We also know that many people on this list feel overwhelmed by the
large amount of content. So we propose a new way to tag our subject
lines to enable filtering. We welcome feedback:
https://wikitech.wikimedia.org/wiki/Analytics/MailingList
Much appreciation and respect,
Dan
milimetric
I think this is a great idea!
(This is probably mostly of interest to WMFers, but I thought I'd post
it extra-publicly since transparency is good and so is peer usage.
Whee!)
A while back, Toby pointed out that I keep having to implement the
pageviews definition in different forms because there are always
different questions - and that making sure everything works nicely
together, every time, is time-consuming. It'd be a lot better if there
were just a library of standalone pageviews functions, which would also
give us things like unit tests.
Toby is usually right, so I put together
https://github.com/Ironholds/pageviews . It contains a sampled log
reader, a fast implementation of the pageviews definition, an "is this
to the mobile web, the desktop site or the app?" request tagger, and a
host of other things. Hopefully it'll be useful to someone other than
me, although I'm okay with it just being useful to me :)
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
As previously mentioned, I've been digging into QA testing the new
pageviews definitions, and noticed a weird spike.[0] This was narrowed
down to 27 January, and thence to 22:00-23:00 UTC that day,[1] with a
breakpoint then visible at approximately 22:35 UTC.[2]
TL;DR: either I don't understand how sequence number/hostname
combinations work or there's massive duplication and sometimes
triplication happening in the webrequest table.
I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint,
coming to 6 million rows in total. My first hypothesis was that we
were looking at some form of external attack (automata, say?), but the
requests were evenly distributed between the desktop and mobile
sites,[3] were not linked to any particular user agent or class of
user agents,[4] and were not linked to any particular IP address.[5]
With that hypothesis looking shaky, I instead investigated internal
snafus. The most obvious candidate was duplicate events. As I understand it (and
I really hope I'm wrong about this), each hostname issues a
unique-to-the-host sequence number with each request, incrementing
each time. Accordingly, in a universe where we have no duplicate
events, the {hostname, sequence_number} tuples in a dataset of
requests should contain zero duplicates.
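To check that, something like the following sketch is enough
(hypothetical names; in practice one would run this as a query over the
webrequest table - the Java is just to make the check concrete):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count rows whose {hostname, sequence_number} pair has already been
// seen; in a duplicate-free dataset this should return zero.
public final class DuplicateTupleCounter {
    public static long countDuplicates(List<String[]> rows) {
        Map<String, Integer> seen = new HashMap<>();
        long duplicates = 0;
        for (String[] row : rows) {
            String key = row[0] + "\t" + row[1];  // hostname, seq number
            Integer count = seen.get(key);
            count = (count == null) ? 1 : count + 1;
            seen.put(key, count);
            if (count > 1) {
                duplicates++;  // every occurrence beyond the first
            }
        }
        return duplicates;
    }
}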
I dug into this and looked at how many duplicate tuples we had.
And...bingo. We have many, /many/ duplicate tuples, and the point at
which the duplication drops off lines up with the point at which the
pageview count drops.[6] Moreover, the number of duplicates is not
proportionate to the number of pageviews.[7] So it looks like what
we're dealing with here is a tremendous rise in duplicate events in the
webrequest table. After the duplicate requests were removed, we ended
up with a more natural pattern[8] - IOW, a chaotic pattern matching the
chaotic pattern we see in the number of distinct IPs.
Thoughts:
1. I thought we had systems in place to stop this? We should be
calculating a per-host arithmetic series over the sequence numbers when
data is loaded (see the sketch after this list).
2. Please tell me that my understanding of how unique sequence numbers
are is terribly terribly wrong, because the alternative is...trouble.
3. I'm not sure what this means for our "actual" pageviews, given that,
as [7] shows, we still have a lot of duplicates after the artificial
spike ends.
4. How many issues do I have to ID before people take me up on my
request to be exclusively referred to as 'Count Logula'? ;)
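For reference, the arithmetic-series check from thought 1 amounts to
something like this sketch (assuming per-host sequence numbers that
increment by one per request; names are hypothetical):

// For one host's slice of requests: if sequence numbers increment by
// one, the expected row count is max(seq) - min(seq) + 1. A positive
// excess suggests duplication; a negative one suggests data loss.
public final class SequenceChecker {
    public static long excess(long[] sequenceNumbers) {
        if (sequenceNumbers.length == 0) {
            return 0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long seq : sequenceNumbers) {
            min = Math.min(min, seq);
            max = Math.max(max, seq);
        }
        long expected = max - min + 1;
        return sequenceNumbers.length - expected;
    }
}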
[0] https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.…
[1] https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_…
[2] https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_…
[3] https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_so…
[4] https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_ag…
[5] https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distin…
[6] https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplic…
[7] https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplic…
[8] https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_dup…
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi Everyone,
I'd like to welcome Joseph Allemandou to the Analytics team! We are
really excited to have someone of Joseph's calibre to help take our
analytics work to the next level.
In his own words:
Joseph's experience has mostly been with private companies and has
almost always involved open source software. After an M.S. in Computer
Science with a specialization in programming language theory and a PhD
in the fields of Natural Language Processing and Dialog Systems, Joseph
worked for four years in Ireland. He spent two years at IBM learning
and applying project management and process improvement methodologies,
and another two years building a start-up to help teachers of English
as a foreign language find up-to-date teaching material. He then moved
back to France and worked at Criteo as a specialist in scalability for
one year, and as a manager for another year. Most recently, Joseph
worked at Fotolia, where he built the analytics architecture and team.
Working with the Wikimedia Foundation allows him to really apply his
energy and skills in the direction he wishes the world to move.
Joseph is based in Brittany, France. Welcome Joseph!
-Toby
Hi guys,
Thank you for your welcoming messages :)
My first work assignments are in the Hadoop zone, trying to get my
hands around jobs, flows, and data structures.
I'll surely get to learn and help on other things in the near future!
I am eager to meet more of you soon :)
Cheers
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
I am thrilled to announce our speaker lineup for this month’s research showcase <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase#Februar…>.
Our own Haitham Shammaa will present results from the Global South survey. We also invited Stamen’s Alan McConchie, an OpenStreetMap expert, to talk about the challenges the OSM community is facing with external data imports.
The showcase will be recorded and publicly streamed at 11:30 PT on Wednesday, February 18 (livestream link will follow). We’ll hold a discussion and take questions from remote participants via the Wikimedia Research IRC channel (#wikimedia-research <http://webchat.freenode.net/?channels=wikimedia-research> on freenode).
Looking forward to seeing you there.
Dario
Global South User Survey 2014
By Haitham Shammaa <https://meta.wikimedia.org/wiki/User:HaithamS_(WMF)>
Users' trends in the Global South have changed significantly over the past two years, and given the increased interest in Global South communities and their activities, we wanted this survey to focus on understanding the statistics and needs of our users (both readers and editors) in the regions listed in the WMF's New Global South Strategy <https://m.mediawiki.org/wiki/File:WMF%27s_New_Global_South_Strategy.pdf>. This survey aims to provide a better understanding of the specific needs of local user communities in the Global South, as well as provide data that supports the product and program development decision-making process.
Ingesting Open Geodata: Observations from OpenStreetMap
By Alan McConchie <http://stamen.com/studio/alan>
As Wikidata grapples with the challenges of ingesting external data sources such as Freebase, what lessons can we learn from other open knowledge projects that have had similar experiences? OpenStreetMap, often called "The Wikipedia of Maps", is a crowdsourced geospatial data project covering the entire world. Since the earliest years of the project, OSM has combined user contributions with existing data imported from external sources. Within the OSM community, these imports have been controversial; some core OSM contributors complain that imported data is lower quality than user-contributed data, or that it discourages the growth of local mapping communities. In this talk, I'll review the history of data imports in OSM, and describe how OSM's best practices have evolved over time in response to these critiques.