Hi all -
I'm checking with people in ops, but we're planning to add a well-defined
parameter to the end of URLs to see the level of clickthroughs on such
links. For example:
https://en.wikipedia.org/wiki/Epirus?analytics=ios_share_a_fact_v1
(If there are existing params on the URL - not an issue so far that I know
of for the apps, as they canonicalize the title and URL - then the param
would go last in the ampersand-separated query string.)
And then we'd use Varnish to remove the parameter to reduce the risk of
cache fragmentation.
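The normalization Varnish would perform can be sketched in Python. This is a hypothetical illustration of the logic only (the real thing would be a VCL regex on the request URL); the parameter name "analytics" comes from the example above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Sketch of the cache-key normalization: drop the "analytics" parameter
# so that otherwise-identical URLs don't fragment the cache.
def strip_analytics_param(url, param="analytics"):
    parts = urlsplit(url)
    # Keep every query pair except the analytics one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))
```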
We "know" this is probably only a short term solution, and as a follow up
from the meeting with the people on the CC line, I'm emailing to open the
discussion on options for a more generic option.
So far I think there are a few options from what we've discussed, if we're
to support additional bucketing.
(1) More parameters (e.g., ?analytics=ios_share_a_fact&version=1)
Downside: potentially harder to standardize and remove things from the URL
(2) More conventional provenance (e.g.,
https://en.wikipedia.org/w/index.php?title=Castle&oldid=645632619/ref=_wref…<...more
provenance info as desired>/).
Downside: technically speaking, may break the schema of well-formed titles
(3) Rely upon (1) or (2), or perhaps an even more RESTful shortlinker (it
could have features like target - web, or a w:// or wiki:// protocol, or
whatever - versioning, etc.).
Downside: maybe a little more work to stand up the service. As we recalled,
there's an extension out there that may, perhaps with some tweaks, fit the
bill.
-Adam
Hello,
EventLogging had some issues on March 20th due to an inflow of client-side
events higher than the system can support. The inflow was due to the new
instrumentation deployed for the wikitext editor, to be able to compare
wikitext usage with VisualEditor usage.
The issues were resolved promptly, and the Analytics team will backfill the
client-side events that were dropped on the 20th as a result of the outage.
Details here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150320-EventLo…
Thanks,
Nuria
Today we will have two presentations:
1. User Session Identification by Aaron Halfaker
2. Mining Missing Hyperlinks in Wikipedia by Bob West.
You can follow the talks on YouTube <http://youtu.be/CgkwLXbALQg>.
We will hold a discussion and take questions from remote participants via
the Wikimedia Research IRC channel (#wikimedia-research on freenode).
See you there,
Ellery
Gabriel:
I have run through the data and have a rough estimate of how many of our
pageviews are requested from browsers w/o strong JavaScript support. It is
a preliminary rough estimate, but I think it is pretty useful.
TL;DR
According to our new pageview definition (
https://meta.wikimedia.org/wiki/Research:Page_view), about 10% of pageviews
come from clients w/o much JavaScript support. But - BIG CAVEAT - this
includes bot requests. If you remove the easy-to-spot big bots, the
percentage is <3%.
Details here (still some homework to do regarding IE6 and IE7)
https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
Thanks,
Nuria
Sounds great. I believe that normalization of the title will be very useful for future researchers and usages, as is adding the pageId.
Currently it is not always straightforward to correlate the Wikipedia page with the unnormalized title.
> On Mar 14, 2015, at 14:00, analytics-request(a)lists.wikimedia.org wrote:
>
> Today's Topics:
>
> 1. [Technical][Request for Comment] A new format for the
> pageview dumps (Oliver Keyes)
>
> [digest of Fri, 13 Mar 2015 snipped; Oliver's message appears in full below]
According to Dan Garry, the Apps team is now sending sections=all
instead of sections=0 on recent iOS app requests. The result is that
apps will be underreported, since the existing implementation of the
pageview definition does not know about this.[0]
I've filed a Phabricator ticket,[1] but this is just a note to make
sure it's surfaced more widely - pageview numbers based on the
"new" definition are not currently reliable for apps. This is the Nth
reminder to !analytics that if you're planning on (a) asking Analytics
for data and (b) getting useful numbers, it's probably nice to tell
them about this sort of change /before/ you make it.
[0] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[1] https://phabricator.wikimedia.org/T93255
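The mismatch can be illustrated with a sketch. This is a hypothetical Python predicate, not the actual refinery code (which is linked at [0]); it only mirrors the idea that app pageviews were recognized by sections=0 in the query string:

```python
# Hypothetical sketch of the failure mode: a definition that recognizes
# app pageviews by "sections=0" silently stops counting requests that
# now send "sections=all" instead.
def counted_as_app_pageview(uri_query):
    return "sections=0" in uri_query

old_style = "action=mobileview&page=Epirus&sections=0"    # counted
new_style = "action=mobileview&page=Epirus&sections=all"  # dropped
```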
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi,
This month's research showcase
<https://www.mediawiki.org/w/index.php?title=Analytics/Research_and_Data/Sho…>
is scheduled for Wednesday, March 25, 11:30 (PST).
We will have two presentations on user session identification by Aaron
Halfaker, and mining missing hyperlinks in Wikipedia by Bob West.
As usual, the event will be recorded and publicly streamed on YouTube
(links will follow). We will hold a discussion and take questions from
remote participants via the Wikimedia Research IRC channel
(#wikimedia-research on freenode).
Looking forward to seeing you there.
Leila
So, we've got a new pageviews definition; it's nicely integrated and
spitting out TRUE/FALSE values on each row with the best of em. But
what does that mean for third-party researchers?
Well...not much, at the moment, because the data isn't being released
anywhere. But one resource we do have that third parties use a heck
of a lot is the per-page pageviews dumps on dumps.wikimedia.org.
Due to historical size constraints and decision-making (and by
historical I mean: last decade) these have a number of weirdnesses in
formatting terms; project identification is done using a notation
style not really used anywhere else, mobile/zero/desktop appear on
different lines, and the files are space-separated. I'd like to put
some volunteer time into spitting out dumps in an easier-to-work-with
format, using the new definition, to run in /parallel/ with the
existing logs.
*The new format*
At the moment we have the format:
project_notation - encoded_title - pageviews - bytes
This puts zero and mobile requests to pageX in a different place to
desktop requests, requires some reconstruction of project_notation,
and contains (for some use cases) extraneous information - that being
the byte-count. The files are also headerless, unquoted and
space-separated, which saves space but is sometimes...I think the term
is "eeeeh-inducing".
What I'd like to use as a new format is:
full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
This file would:
1. Include a header row;
2. Be formatted as a tab-separated, rather than space-separated, file;
3. Exclude bytecounts;
4. Include desktop and mobile pageview counts on the same line;
5. Use the full project URL ("en.wikivoyage.org") instead of the
pagecounts-specific notation ("en.v")
So, as a made-up example, instead of:
de.m.v Florence 32 9024
de.v Florence 920 7570
we'd end up with:
de.wikivoyage.org Florence 920 32
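The merge step can be sketched in Python. This is not the actual script, just an illustration; the one-entry notation map is hypothetical, and a real converter would cover every project (and the zero notation) rather than just the example above:

```python
import collections

# Illustrative subset of the pagecounts notation -> domain mapping.
PROJECT_DOMAINS = {"de.v": "de.wikivoyage.org"}

def reformat(lines):
    """Merge desktop and mobile rows for the same title into one
    tab-separated row: domain, title, desktop views, mobile/zero views."""
    counts = collections.defaultdict(lambda: [0, 0])
    for line in lines:
        notation, title, views, _bytes = line.split()  # old space-separated row
        mobile = ".m." in notation or notation.endswith(".m")
        base = notation.replace(".m", "", 1) if mobile else notation
        domain = PROJECT_DOMAINS[base]
        counts[(domain, title)][1 if mobile else 0] += int(views)
    header = "domain\ttitle\tdesktop_pageviews\tmobile_and_zero_pageviews"
    rows = ["%s\t%s\t%d\t%d" % (d, t, c[0], c[1])
            for (d, t), c in sorted(counts.items())]
    return [header] + rows
```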
In the future we could also work to /normalise/ the title - replacing
it with the page title that refers to the actual pageID. This won't
impact legacy files, and is currently blocked on the Apps team, but
should be viable as soon as that blocker goes away.
I've written a script capable of parsing and reformatting the legacy
files, so we should be able to backfill in this new format too, if
that's wanted (see below).
*The size constraints*
There really aren't any. Like I said, the historical rationale for a
lot of these decisions seems to have been keeping the files small. But
by putting requests to the same title from different site versions on
the same line, and dropping byte-count, we save enough space that the
resulting files are approximately the same size as the old ones - or
in many cases, actually smaller.
*What I'm asking for*
Feedback! What do people think of the new format? What would they like
to see that they don't? What don't they need, here? How useful would
normalisation be? How useful would backfilling be?
*What I'm not asking for*
WMF time! Like I said, this is a spare-time project; I've also got
volunteers for Code Review and checking, too (Yuvi and Otto).
The replacement of the old files! Too many people depend on that
format and that definition, and I don't want to make them sad.
Thoughts?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi,
I'm looking to generate some data around what percentage of claims in
Wikidata have references.
What's the best way for me to get this data?
As a bonus, I would like to find out what percentage of claims in Wikidata
have references other than "XX Wikipedia".
Thanks,
Kaldari
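One possible approach is to walk the claims in the Wikidata entity JSON (each claim may carry a "references" list of snak groups). A minimal sketch, using P143 ("imported from") as the marker for Wikipedia-sourced references; the dump-reading loop is omitted, and whether P143 alone captures every Wikipedia-style reference is an assumption:

```python
# Sketch: for one entity's JSON, count claims, claims with references,
# and claims with at least one reference other than P143 ("imported
# from", which is how Wikipedia-sourced references are usually recorded).
def reference_stats(entity):
    total = referenced = non_wiki = 0
    for claims in entity.get("claims", {}).values():
        for claim in claims:
            total += 1
            refs = claim.get("references", [])
            if refs:
                referenced += 1
                if any(prop != "P143"
                       for ref in refs for prop in ref.get("snaks", {})):
                    non_wiki += 1
    return total, referenced, non_wiki
```

Running this over every entity in the JSON dump and summing the three counters would give both percentages.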