We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
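For illustration, here is a minimal Python sketch of the last two uses. It assumes a tab-separated dump whose first three columns are (referer, article, count); check the figshare page for the actual schema:

    from collections import Counter, defaultdict

    def top_referers(path, article, k=10):
        """Most common referers leading to a given article."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                referer, curr, n = line.rstrip("\n").split("\t")[:3]
                if curr == article:
                    counts[referer] += int(n)
        return counts.most_common(k)

    def transition_matrix(path):
        """Row-normalize the counts into a Markov chain over articles."""
        counts = defaultdict(Counter)
        with open(path, encoding="utf-8") as f:
            for line in f:
                referer, curr, n = line.rstrip("\n").split("\t")[:3]
                counts[referer][curr] += int(n)
        probs = {}
        for referer, row in counts.items():
            total = sum(row.values())
            probs[referer] = {article: c / total for article, c in row.items()}
        return probs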
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts - does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter code for a Wikipedia in the "List of Wikipedias" sometimes matches the ISO 639-1 code, but not always. Is there an easy way to do the mapping?
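For illustration, this is the kind of two-step lookup we have been doing by hand (a minimal Python sketch; the tables are abridged placeholders, and a complete ISO 639-1 to 639-3 table would come from the ISO registry):

    ISO_639_1_TO_3 = {  # abridged placeholder; fill from the ISO 639 registry
        "en": "eng",
        "de": "deu",
        "ar": "ara",
    }
    WIKI_EXCEPTIONS = {  # Wikipedia codes that are not plain ISO 639-1
        "simple": "eng",  # Simple English Wikipedia
    }

    def wiki_to_iso639_3(code):
        """Map a Wikipedia language code to ISO 639-3, where known."""
        if code in WIKI_EXCEPTIONS:
            return WIKI_EXCEPTIONS[code]
        if len(code) == 3:  # often already an ISO 639-3 code, but verify
            return code
        return ISO_639_1_TO_3.get(code)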
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
Hello,
I'm inquiring about the delay in publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta: I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet, though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
Thanks,
Michael
Hello,
My username is rbaasland and I would like to contribute to the analytics
project. Could you give me access to the project, or tell me how to go
about contributing?
Thank you very much,
Ron Baasland
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]
Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location [2]. This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.
Feedback on the proposal is welcome on the lists or on the project talk page on Meta [3].
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
Thank you! Would you mind posting a note on Analytics(a)lists.wikimedia.org
when it is working normally again?
On Wed, Feb 11, 2015 at 1:36 PM, Henrik Abelsson <henrik(a)abelsson.com>
wrote:
> Hi Kevin,
>
> Looking into it!
>
> -henrik
>
>
> On 11/02/15 16:36, Kevin Leduc wrote:
>
> Hi Henrik,
>
> stats.grok.se has been missing data for the last week. Can you restart the
> service to see if that helps?
>
> Thanks!
> Kevin Leduc
> Analytics Product Manager
>
>
>
Hi all -
I'm checking with people in ops, but we're planning to add a well-defined
parameter to the end of URLs to measure clickthroughs on shared
links. For example:
https://en.wikipedia.org/wiki/Epirus?analytics=ios_share_a_fact_v1
(If there are existing params on the URL - not an issue so far that I know
of for the apps, as they canonicalize the title and URL - then the param
would be last in the ampersand-separated query string.)
And then we'd use Varnish to remove the parameter to reduce the risk of
cache fragmentation.
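(To illustrate what the Varnish rule would do - the production rule itself would live in VCL, not Python - here is a rough sketch of the normalization:)

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def strip_analytics(url, param="analytics"):
        """Drop the analytics parameter so cached URLs stay canonical."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    # strip_analytics("https://en.wikipedia.org/wiki/Epirus?analytics=ios_share_a_fact_v1")
    # -> "https://en.wikipedia.org/wiki/Epirus"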
We "know" this is probably only a short-term solution, and as a follow-up
from the meeting with the people on the CC line, I'm emailing to open the
discussion on options for a more generic approach.
So far I think there are a few options from what we've discussed, if we're
to support additional bucketing.
(1) More parameters (e.g., ?analytics=ios_share_a_fact&version=1)
Downside: potentially harder to standardize and remove things from the URL
(2) More conventional provenance (e.g.,
https://en.wikipedia.org/w/index.php?title=Castle&oldid=645632619/ref=_wref…<...more
provenance info as desired>/).
Downside: technically speaking, may break the schema of well-formed titles
(3) Rely upon (1) or (2), or perhaps an even more RESTful shortlinker (it
could have features like target - web, or a w:// or wiki:// protocol, or
whatever - versioning, etc.).
Downside: maybe a little more work to stand up the service. As we recalled,
there's an extension out there that may, perhaps with some tweaks, fit the
bill.
-Adam
Gabriel:
I have run through the data and have a rough estimate of how many of our
pageviews are requested from browsers w/o strong javascript support. It is
a preliminary rough estimate, but I think it is pretty useful.
TL;DR
According to our new pageview definition (
https://meta.wikimedia.org/wiki/Research:Page_view) about 10% of pageviews
come from clients w/o much javascript support. But - BIG CAVEAT - this
includes bot requests. If you remove the easy-to-spot big bots, the
percentage is <3%.
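(To give a flavor of what "easy-to-spot big bots" means here - the real heuristics are part of the pageview definition work, and the substrings below are illustrative assumptions:)

    def is_big_bot(user_agent):
        """Crude check for crawlers that identify themselves in the UA."""
        ua = user_agent.lower()
        return any(s in ua for s in ("bot", "crawler", "spider"))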
Details here (still some homework to do regarding IE6 and IE7)
https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
Thanks,
Nuria
Hi all,
Erik Zachte is working on file view stats and is looking for a way to track
Media Viewer image views (for which there is no 1:1 relation between server
hits and actual image views); after some back and forth in
https://phabricator.wikimedia.org/T86914 I proposed the following hack:
whenever the javascript code in MediaViewer determines that an image view
happened (e.g. an image has been displayed for a certain amount of time),
it makes a request to a certain fake image, say
upload.wikimedia.org/wikipedia/commons/thumb/0/00/Virtual-imageview-<real
image name>/<size>px-thumbnail.<ext>. These hits can then be easily
filtered from the Varnish request logs and added to the normal requests. We
would add a rule to Varnish to make sure it does not try to look up such
requests in Swift but returns a 404 immediately.
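(A rough sketch of that filtering step, assuming one request URL per log line and the fixed 0/00 fake hash path from the example above; the real log format will differ:)

    import re
    from collections import Counter

    VIRTUAL = re.compile(r"/thumb/0/00/Virtual-imageview-([^/]+)/")

    def count_virtual_views(log_lines):
        """Tally MediaViewer virtual image views from a request log."""
        views = Counter()
        for line in log_lines:
            match = VIRTUAL.search(line)
            if match:
                views[match.group(1)] += 1
        return views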
This would be a temporary workaround until there is a proper way to log
virtual image views, such as EventLogging with a non-SQL backend.
Do you see any fundamental problem with this?
I think we should split up EventLogging and the other m2 clients (OTRS and
some minor players). Several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication
for EL is easy because it is all simple bulk-INSERT statements, but the
same does not apply for the other clients. They need different approaches.
- Master disk space. Even with the data purging discussed at the MW Summit,
I would feel better if EL had more headroom than it does currently, and
zero possibility of unexpected spikes in disk activity and usage affecting
other services.
- EL is the service most sensitive to connection dropouts. Recently Ori and
Nuria have been tweaking SQLAlchemy, but future connection problems like
those seen last week would be easier to debug without having to risk
affecting other services.
I am therefore arranging to promote the current m2 slave db1046 to master
of an m4 cluster tuned for EL, including backfilling. Analytics-store,
s1-analytics-slave, and the new CODFW server will simply switch to
replicate from the new master.
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant
network access, and then presumably do a little backfilling. When would be
a reasonable time within the next fortnight or so?
Sean