We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
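For illustration, the first two use cases can be sketched in a few lines of Python. The column names (prev, curr, n) and the sample rows below are assumptions for the sketch; the actual release's schema may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in a (referer, article, count) shape; real column
# names and values may differ from the published dataset.
SAMPLE_TSV = (
    "prev\tcurr\tn\n"
    "Dilbert\tScott_Adams\t120\n"
    "Dilbert\tComic_strip\t80\n"
    "other-google\tDilbert\t500\n"
    "Scott_Adams\tDilbert\t60\n"
)

def load_pairs(tsv_text):
    """Parse (referer, article, count) rows from a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(r["prev"], r["curr"], int(r["n"])) for r in reader]

def top_outgoing(pairs, article, k=5):
    """Most frequently clicked links *from* the given article."""
    counts = defaultdict(int)
    for prev, curr, n in pairs:
        if prev == article:
            counts[curr] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]

def top_incoming(pairs, article, k=5):
    """Most common referers *to* the given article."""
    counts = defaultdict(int)
    for prev, curr, n in pairs:
        if curr == article:
            counts[prev] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]
```

The same aggregation, normalized per source article, would give the transition probabilities of the Markov chain mentioned above.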
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public list: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
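To make the mapping concrete, here is a sketch of the kind of lookup we have in mind. The code-table slice and the override list below are deliberately tiny and illustrative, not our full mapping:

```python
# Tiny illustrative slice of an ISO 639-1 -> ISO 639-3 table; a full
# mapping would come from the ISO 639-3 code tables.
ISO1_TO_ISO3 = {"en": "eng", "de": "deu", "fr": "fra", "ar": "ara"}

# Hand-curated overrides for Wikipedia subdomains that are not plain
# ISO 639-1 codes (examples only; the real exception list is longer).
WIKI_OVERRIDES = {"simple": "eng", "als": "gsw"}

def wiki_to_iso3(wiki_code):
    """Map a Wikipedia language subdomain to an ISO 639-3 code, or None."""
    if wiki_code in WIKI_OVERRIDES:
        return WIKI_OVERRIDES[wiki_code]
    return ISO1_TO_ISO3.get(wiki_code)
```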
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad@strategyand.pwc.com
www.strategyand.com
This discussion is about needed updates to the definition and Analytics
implementation of the mobile apps page view metrics. There is also an
associated Phab task[4]. Please add the proper Analytics project there.
Background / Changes
As you probably remember, the Android app splits a page view into two
requests: one for the lead section and metadata, plus another one for the
remainder.
The mobile apps are going to change the way they load pages in two
different ways:
1. We'll add a link preview when someone clicks on a link from a page.
2. We're planning on switching over to using RESTBase for loading pages
and also for the link preview (initially just the Android beta, later more)
This will have implications for the pageviews definition and how we count
user engagement.
The big question is
Should we count link previews as a page view since it's an indication of
user engagement? Or should there be a separate metric for link previews?
Counting page views
IIRC we currently count api.php requests with action=mobileview&sections=0
query parameters as a page view. When we publish link previews for all
Android app users then we would either want to also count the calls to
action=query&prop=extracts as a page view or add them to another metric.
Once the apps use RESTBase the HTTPS requests will be very different:
- Page view: Instead of action=mobileview&sections=0 the app would call
the RESTBase endpoint for the lead request[1] instead of the PHP API
mentioned above. Then it would call [2].
- Link preview: Instead of action=query&prop=extracts it would call the
lead request[1], too, since there is a lot of overlap. At least that's our
current plan. The advantage of that is that the client doesn't need to
execute the lead request a second time if the user clicks on the link
preview (either through caching or app logic).
So, in the RESTBase case we either want to count the
mobile-html-sections-lead requests or the
mobile-html-sections-remaining requests
depending on what our definition for page views actually is. We could also
add a query parameter or extra HTTP header to one of the
mobile-html-sections-lead requests if we need to distinguish between
previews and page views.
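To make the counting rules concrete, here is a sketch of how such a classifier might look. The "X-Preview: true" header is a hypothetical stand-in for the distinguishing query parameter or header mentioned above; the real pageview definition may differ.

```python
import re
from urllib.parse import urlparse, parse_qs

# Endpoint patterns taken from this proposal; the exact rules in the
# production pageview definition may differ.
LEAD_RE = re.compile(r"/api/rest_v1/page/mobile-html-sections-lead/")
REMAINING_RE = re.compile(r"/api/rest_v1/page/mobile-html-sections-remaining/")

def classify(url, headers=None):
    """Return 'pageview', 'preview', or None for a single request.

    Assumes a hypothetical 'X-Preview: true' header marks link previews
    on the shared lead endpoint.
    """
    headers = headers or {}
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    if parsed.path.endswith("api.php"):
        # Legacy PHP API page view: action=mobileview&sections=0
        if qs.get("action") == ["mobileview"] and qs.get("sections") == ["0"]:
            return "pageview"
        # Legacy link preview: action=query&prop=extracts
        if qs.get("action") == ["query"] and "extracts" in qs.get("prop", [""])[0]:
            return "preview"
        return None
    if LEAD_RE.search(parsed.path):
        return "preview" if headers.get("X-Preview") == "true" else "pageview"
    # In this sketch, remaining-sections calls are not counted separately,
    # since the lead request already was.
    return None
```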
Both the current PHP API and the RESTBase-based metrics would need to be
compatible and collected in parallel, since we cannot control when users
update their apps.
[1]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Dilbert
[2]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-remaining/Di…
[3]
https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_ap…
[4] https://phabricator.wikimedia.org/T109383
Cheers,
Bernd
A number of us are discussing the year-to-date editor population stats.
When can we anticipate seeing the August stats? It would be helpful to have
them published at least a week before the publication of the monthly
Recent Research report for September.
Thanks,
Pine
Hi,
I've been asked in a private email why WMF forked ua-parser [1]
(a library used to extract information from User-Agent headers).
There is no need to discuss this in private, hence I am replying to
the mailing list.
TL;DR: It was not a real fork. We just worked around issues with
upstream's release management.
-----------------------------------------------------
What follows is a bit detailed, but given the context I decided it was
better to err on the side of being over-verbose.
Back in October 2014, WMF pushed towards analyzing User-Agent headers
in the logs, for example to allow more accurate estimates of how many
requests WMF sees from Android vs. iPhone devices, which browsers get
used in which versions, etc.
Extracting information from User-Agent strings is a bit tricky as there
are quite a few corner cases, so it was decided to use a third-party
library for it. ua-parser [1] was chosen for this purpose.
ua-parser comes with a Java build, so it naturally matched the log
processing's Java eco-system. However, (at least) back then ua-parser
did not offer usable prebuilt jars, and the versioning and release
cycle of its Java part was broken.
The latest release was about a year old, and no proper release was in
sight. So all upstream gave us was a jar versioned as
ua-parser-1.3.0-SNAPSHOT.jar
Deploying such a jar to the cluster is a bad idea, as its name gives no
clue which commit it is based on. In this concrete setting, there were
about 250 commits in ua-parser that would have produced the same
version number. That would make debugging hard and nix
reproducibility.
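To illustrate the problem, a deploy-time check could simply reject ambiguous artifact names. The sketch below is purely illustrative, not an actual WMF tool:

```python
import re

# Reject -SNAPSHOT artifact names, which can correspond to many different
# commits, and accept tagged release names like ua-parser-1.3.0-wmf1.jar.
RELEASE_RE = re.compile(r"^[\w.-]+-\d+(\.\d+)*(-wmf\d+)?\.jar$")

def is_deployable(jar_name):
    """True only for jars whose name pins a unique release."""
    if "SNAPSHOT" in jar_name:
        return False
    return bool(RELEASE_RE.match(jar_name))
```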
Since WMF cannot do a proper release for ua-parser, the typical
workaround for WMF in such cases is to produce a “wmf” branch in
Gerrit and do “wmf” releases at known commits. And that's what the
ua-parser “fork” in Gerrit does.
Comparing upstream with the “fork” in Gerrit, the only difference is:
https://gerrit.wikimedia.org/r/#/c/169204/
That commit allows for a wmf release, is tagged 1.3.0-wmf1 and results
in an artifact name of
ua-parser-1.3.0-wmf1.jar
which (due to the 1.3.0-wmf1 tag) is good for releasing [2].
As one of the questions in the private email was whether WMF could
switch back to upstream ... I hope you see that WMF never switched
away from upstream and WMF never “forked” upstream. WMF only rolled
their own release.
If upstream now provides proper releases, sure, just use them :-)
Have fun,
Christian
P.S.:
* How can I find out who actually created a repository?
Look at the first commit to the meta/config branch. Like here:
https://git.wikimedia.org/commit/analytics%2Fua-parser/2fd5dc00ac9e087b307f…
* How can I see the difference between branches?
Use `git cherry` (Yes, really. Just “cherry”, no trailing “-pick”)
An example session is at [3].
* How could one have found out about the wmf1 thing?
For example from the IRC logs of the day from the commit [4]:
[20:23:08] <ottomata> we can just make wmf1 be our release of the current master?
[20:23:13] <qchris> k
----------------------------------------------------------------------
[1] Back then at
https://github.com/tobie/ua-parser
now the relevant repos for WMF seem to be at
https://github.com/ua-parser/uap-core
https://github.com/ua-parser/uap-java
[2] It made it into archiva:
https://archiva.wikimedia.org/#artifact/ua_parser/ua-parser/1.3.0-wmf1
into the refinery-hive jars:
https://gerrit.wikimedia.org/r/#/c/166142/11..14/refinery-hive/pom.xml
and also to the cluster:
https://gerrit.wikimedia.org/r/#/c/170373/1/refinery-hive/pom.xml
https://gerrit.wikimedia.org/r/#/c/170375/
[3]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:40:28 // exit code: 0
cwd: ~/tmp
git clone https://github.com/tobie/ua-parser
Cloning into 'ua-parser'...
remote: Counting objects: 4507, done.
remote: Total 4507 (delta 0), reused 0 (delta 0), pack-reused 4507
Receiving objects: 100% (4507/4507), 4.31 MiB | 923 KiB/s, done.
Resolving deltas: 100% (2301/2301), done.
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:10 // exit code: 0
cwd: ~/tmp
cd ua-parser
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:14 // exit code: 0
cwd: ~/tmp/ua-parser
git remote add gerrit https://gerrit.wikimedia.org/r/analytics/ua-parser
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:33 // exit code: 0
cwd: ~/tmp/ua-parser
git fetch gerrit
remote: Finding sources: 100% (4/4)
remote: Total 4 (delta 3), reused 4 (delta 3)
Unpacking objects: 100% (4/4), done.
From https://gerrit.wikimedia.org/r/analytics/ua-parser
* [new branch] master -> gerrit/master
* [new branch] wmf -> gerrit/wmf
* [new tag] v1.3.0-wmf1 -> v1.3.0-wmf1
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:41:38 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master gerrit/master
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:42:10 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master gerrit/wmf
+ 2a44875355b558d9f880a63c86630af229044a63
_________________________________________________________________
christian@spencer // jobs: 0 // time: 21:42:17 // exit code: 0
cwd: ~/tmp/ua-parser
git cherry origin/master v1.3.0-wmf1
+ 2a44875355b558d9f880a63c86630af229044a63
[4] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20141027.txt
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello all!
Almost all of the graphs on the Vital Signs dashboard
<https://vital-signs.wmflabs.org/> display no data (the only exception is
legacy pageviews). Could someone explain to me why that is, and whether
there's a plan to fix it?
I ask because Vital Signs includes several metrics from the editor model
<https://meta.wikimedia.org/wiki/Research:Editor_model>, which the Editing
department really wants to track on an ongoing basis. I need to find out
whether we need to pursue other ways of doing so.
Thanks!
--
Neil P. Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>,
product analyst
Wikimedia Foundation
Is the pageviews_hourly table meant to contain pageviews according to
the new or old definition? If old, where can I find aggregates for the
new one?
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Heyo, Discovery team!
(Analytics CCd)
This is just a quick writeup of the Scalable Event Systems meeting
that Erik, Dan, Stas and I went to (although just from my
perspective).
For people not in the initial thread, this is a proposal to replace
the internal architecture of EventLogging and similar services with
Apache Kafka brokers
(http://www.confluent.io/blog/stream-data-platform-1/). What that
means in practice is that the current 1-2k events/second limit on
EventLogging will disappear and we can stop worrying about sampling
and accidentally bringing down the system. We can be a lot less
cautious about our schemas and a lot less cautious about our sampling
rate!
It also offers up a lot of opportunities around streaming data and
making it available in a layered fashion. While I don't think we want
to explore that right now, it's nice to have as an option once we
better understand our search data and how we can safely distribute it.
I'd like to thank the Analytics team, particularly Andrew, for putting
this together; it was a super-helpful discussion to be in and this
sort of product is precisely what I, at least, have been hoping for
out of the AnEng brain trust. Full speed ahead!
--
Oliver Keyes
Count Logula
Wikimedia Foundation