Hey all :). We've been doing some comparative analysis of all the
different Pageviews options we have, as a first step towards putting
the new definitions in Production. TL;DR: 'promising' doesn't cover
how happy these results make me - see
http://ironholds.org/deimos/qa_tests.png
Some background: we've been working on a new definition to address
deficiencies in the existing one. At the same time, we've also been
working on writing UDFs - User Defined Functions - in Java, so that
analysts can conveniently apply whatever we come up with to our data
store of requests. This is a step forward from where we are at the
moment, which is relying on complex Hive queries, or on implementations
in various languages that run over the /sampled/ rather than unsampled
logs and so can't be used for per-page stats.
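To make the workflow concrete, here's a minimal sketch of what such a
UDF might look like, using the standard Hive UDF API - the class name,
parameters, and filtering rules below are illustrative assumptions, not
the actual implementation:

import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical pageview-tagging UDF; the real rules are more involved.
public class IsPageviewUDF extends UDF {
    // Returns true if a single webrequest row looks like a pageview.
    public Boolean evaluate(String uriPath, String httpStatus,
                            String contentType) {
        if (uriPath == null || httpStatus == null || contentType == null) {
            return false;
        }
        return httpStatus.equals("200")
            && contentType.startsWith("text/html")
            && uriPath.startsWith("/wiki/");
    }
}

Once registered, analysts can apply it straight from Hive - e.g.
SELECT SUM(IF(is_pageview(uri_path, http_status, content_type), 1, 0))
FROM webrequest WHERE ... - rather than re-encoding the rules as query
logic every time.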
As part of the robustness testing of the new definition I compared
four options. These were:
1. The legacy definition, through a Hive query;
2. The legacy definition, through its new UDF;
3. The new definition, through the sampled logs;
4. The new definition, through its new UDF.
The results can be seen at http://ironholds.org/deimos/qa_tests.png -
the reason there appear to be only three lines most of the time is that
the results match so closely that they can't be visually distinguished.
This is a pretty good heuristic for "good implementations" :D.
So, what does this mean? It means we're confident in the UDFs' ability
to replicate the previous implementations. That means we can
conveniently use them to test the validity of the actual definition,
and can rely on them for production-ready analysis once that definition
is signed off on.
What's next? Digging into the spike on 29 January, and if it doesn't
show anything scary, hand-coding the output of the two UDFs to see if
we're confident in the new definition. And if we are...well. We
release that data :D.
Tremendous thanks to Aaron Halfaker, Christian A. and Andrew Otto for
(respectively) their poking, code review and introduction to Java!
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi,
As part of my first assignment, I'll recompute our historical webrequest
dataset, adding client_ip and geocoded information.
While it seems correct to compute the historical client_ip from the
existing ip and x_forwarded_for fields, using the current state of the
MaxMind geocoding database to compute historical data is more
error-prone. I can either compute it anyway, knowing that there'll be
some errors, or put null values for data older than a given point in
time.
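For the client_ip part, the logic I have in mind is roughly the
following sketch (Java, with hypothetical names; validation of entries
and trusted-proxy handling are omitted):

// Derive the client IP from the raw ip and x_forwarded_for fields.
// x_forwarded_for is a comma-separated chain in which the leftmost
// entry is the address the request claims to originate from.
public final class ClientIpResolver {
    private ClientIpResolver() {}

    public static String resolve(String ip, String xForwardedFor) {
        if (xForwardedFor == null || xForwardedFor.isEmpty()
                || xForwardedFor.equals("-")) {
            return ip;  // no proxy chain recorded; use the raw ip
        }
        String first = xForwardedFor.split(",")[0].trim();
        return first.isEmpty() ? ip : first;
    }
}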
I'll launch the script to recompute the data as soon as max(a consensus
is found on this matter, operations gives me the right to run the
script) :)
Thanks
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Dear List,
We, the WMF Analytics team, want to make more of our internal
discussions public. We benefit tremendously from everyone who
participates on this list and want to have as much transparency as
possible into what we do.
We also know that many people on this list feel overwhelmed by the
large amount of content. So we propose a new way to tag our subject
lines to enable filtering. We welcome feedback:
https://wikitech.wikimedia.org/wiki/Analytics/MailingList
Much appreciation and respect,
Dan
milimetric
I think this is a great idea!
(This is probably mostly of interest to WMFers, but I thought I'd post
it extra-publicly since transparency is good and so is peer usage.
Whee!)
A while back, Toby pointed out that I keep having to implement the
pageviews definition in different forms because there are always
different questions - and that making sure everything works nicely
together, every time, is time-consuming. It'd be a lot better if there
were just a library of standalone pageviews functions, which would also
give us things like unit tests.
Toby is usually right, so I put together
https://github.com/Ironholds/pageviews . It contains a sampled log
reader, a fast implementation of the pageviews definition, an "is this
to the mobile web, the desktop site or the app?" request tagger, and a
host of other things. Hopefully it'll be useful to someone other than
me, although I'm okay with it just being useful to me :)
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
As previously mentioned, I've been digging into QA testing the new
pageviews definitions, and noticed a weird spike.[0] This was narrowed
down to 27 January, and thence to 22:00-23:00 UTC that day,[1] with a
breakpoint then visible at approximately 22:35 UTC.[2]
TL;DR: either I don't understand how sequence number/hostname
combinations work or there's massive duplication and sometimes
triplication happening in the webrequest table.
I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint,
coming to 6 million rows in total. My first hypothesis was that we
were looking at some form of external attack (automata, say?), but the
requests were evenly distributed between the desktop and mobile
sites,[3] were not linked to any particular user agent or class of
user agents,[4] and were not linked to any particular IP address.[5]
With that hypothesis looking shaky, I instead investigated internal
snafus. The most obvious candidate was duplicate events. As I understand it (and
I really hope I'm wrong about this), each hostname issues a
unique-to-the-host sequence number with each request, incrementing
each time. Accordingly, in a universe where we have no duplicate
events, the {hostname, sequence_number} tuples in a dataset of
requests should contain zero duplicates.
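To check that, something like the following sketch is enough
(hypothetical names; in practice one would run this as a query over the
webrequest table - the Java is just to make the check concrete):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count rows whose {hostname, sequence_number} pair has already been
// seen; in a duplicate-free dataset this should return zero.
public final class DuplicateTupleCounter {
    public static long countDuplicates(List<String[]> rows) {
        Map<String, Integer> seen = new HashMap<>();
        long duplicates = 0;
        for (String[] row : rows) {
            String key = row[0] + "\t" + row[1];  // hostname, seq number
            Integer count = seen.get(key);
            count = (count == null) ? 1 : count + 1;
            seen.put(key, count);
            if (count > 1) {
                duplicates++;  // every occurrence beyond the first
            }
        }
        return duplicates;
    }
}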
I dug into this and looked at how many duplicate tuples we had.
And...bingo. We have many, /many/ duplicate tuples, and the point at
which the duplication drops off lines up with the point at which the
pageview count drops.[6] Moreover, the number of duplicates is not
proportionate to the number of pageviews.[7] So it looks like what
we're dealing with here is a tremendous rise in duplicate events in the
webrequest table. After the duplicate requests were removed, we ended
up with a more natural pattern[8] - IOW, a chaotic pattern matching the
chaotic pattern we see in the number of distinct IPs.
Thoughts:
1. I thought we had systems in place to stop this? We should be
calculating a per-host arithmetic series over the sequence numbers when
data is loaded (see the sketch after this list).
2. Please tell me that my understanding of how unique sequence numbers
are is terribly terribly wrong, because the alternative is...trouble.
3. I'm not sure what this means for our "actual" pageviews, given that,
as [7] shows, we still have a lot of duplicates after the artificial
spike ends.
4. How many issues do I have to ID before people take me up on my
request to be exclusively referred to as 'Count Logula'? ;)
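For reference, the arithmetic-series check from thought 1 amounts to
something like this sketch (assuming per-host sequence numbers that
increment by one per request; names are hypothetical):

// For one host's slice of requests: if sequence numbers increment by
// one, the expected row count is max(seq) - min(seq) + 1. A positive
// excess suggests duplication; a negative one suggests data loss.
public final class SequenceChecker {
    public static long excess(long[] sequenceNumbers) {
        if (sequenceNumbers.length == 0) {
            return 0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long seq : sequenceNumbers) {
            min = Math.min(min, seq);
            max = Math.max(max, seq);
        }
        long expected = max - min + 1;
        return sequenceNumbers.length - expected;
    }
}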
[0] https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.…
[1] https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_…
[2] https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_…
[3] https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_so…
[4] https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_ag…
[5] https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distin…
[6] https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplic…
[7] https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplic…
[8] https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_dup…
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi Everyone,
I'd like to welcome Joseph Allemandou to the Analytics team! We are
really excited to have someone of Joseph's calibre to help take our
analytics work to the next level.
In his own words:
Joseph's experience has mostly been with private companies and has
almost always involved open source software. After an M.S. in Computer
Science with a specialization in programming language theory and a PhD
in the fields of Natural Language Processing and Dialog Systems, Joseph
worked for four years in Ireland. He spent two years at IBM learning
and applying project management and process improvement methodologies,
and another two years building a start-up to help teachers of English
as a foreign language find up-to-date teaching material. He then moved
back to France and worked at Criteo as a specialist in scalability for
one year, and as a manager for another year. Most recently, Joseph
worked at Fotolia, where he built the analytics architecture and team.
Working with the Wikimedia Foundation allows him to really apply his
energy and skills in the direction he wishes the world to move.
Joseph is based in Brittany, France. Welcome Joseph!
-Toby
Hi guys,
Thank you for your welcoming messages :)
My first work assignments are in the Hadoop zone, trying to get my
hands around jobs, flows, and data structures.
I'll surely get to learn and help on other things in the near future!
I am eager to meet more of you soon :)
Cheers
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
I am thrilled to announce our speaker lineup for this month’s research showcase <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase#Februar…>.
Our own Haitham Shammaa will present results from the Global South survey. We also invited Stamen’s Alan McConchie, an OpenStreetMap expert, to talk about the challenges the OSM community is facing with external data imports.
The showcase will be recorded and publicly streamed at 11:30 PT on Wednesday, February 18 (livestream link will follow). We’ll hold a discussion and take questions from remote participants via the Wikimedia Research IRC channel (#wikimedia-research <http://webchat.freenode.net/?channels=wikimedia-research> on freenode).
Looking forward to seeing you there.
Dario
Global South User Survey 2014
By Haitham Shammaa <https://meta.wikimedia.org/wiki/User:HaithamS_(WMF)>
Users' trends in the Global South have changed significantly over the past two years, and given the increased interest in Global South communities and their activities, we wanted this survey to focus on understanding the statistics and needs of our users (both readers and editors) in the regions listed in the WMF's New Global South Strategy <https://m.mediawiki.org/wiki/File:WMF%27s_New_Global_South_Strategy.pdf>. This survey aims to provide a better understanding of the specific needs of local user communities in the Global South, as well as provide data that supports the product and program development decision-making process.
Ingesting Open Geodata: Observations from OpenStreetMap
By Alan McConchie <http://stamen.com/studio/alan>
As Wikidata grapples with the challenges of ingesting external data sources such as Freebase, what lessons can we learn from other open knowledge projects that have had similar experiences? OpenStreetMap, often called "The Wikipedia of Maps", is a crowdsourced geospatial data project covering the entire world. Since the earliest years of the project, OSM has combined user contributions with existing data imported from external sources. Within the OSM community, these imports have been controversial; some core OSM contributors complain that imported data is lower quality than user-contributed data, or that it discourages the growth of local mapping communities. In this talk, I'll review the history of data imports in OSM, and describe how OSM's best practices have evolved over time in response to these critiques.