Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, claiming it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004 whaaat?" - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
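As a quick illustration of the first two uses, here is a minimal sketch. It assumes a simplified TSV of (referer, article, count) rows; the actual column layout of the figshare release differs slightly, so check the dataset documentation before running this against the real file:

```python
import csv
from collections import defaultdict

def top_clicked_links(tsv_path, article, k=10):
    """Return the k most frequent links clicked *from* a given article,
    assuming rows of the form: referer <TAB> target <TAB> count."""
    counts = defaultdict(int)
    with open(tsv_path, encoding="utf-8") as f:
        for referer, target, n in csv.reader(f, delimiter="\t"):
            if referer == article:
                counts[target] += int(n)
    # Sort by descending count and keep the top k pairs.
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]
```

Swapping the roles of referer and target in the comparison gives the second use case (most common links followed *to* an article).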
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
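To make the proposal concrete, here is a sketch of how one such anonymized record could be derived from a parsed log entry. The field names are hypothetical illustrations, not the actual Wikimedia cache log schema:

```python
def anonymize(entry, counter):
    """Reduce a parsed cache-log entry (a dict; field names assumed)
    to the proposed 7-field record, dropping client-identifying data
    such as the IP address and user agent."""
    return (
        counter,                    # 1) monotonically increasing counter
        entry["timestamp"],         # 2) request timestamp
        entry["url"],               # 3) requested URL
        entry["method"] != "GET",   # 4) update flag (non-GET = update)
        entry["cache_host"],        # 5) cache server hostname
        entry["cache_status"],      # 6) e.g. hit / miss / pass
        entry["response_size"],     # 7) bytes sent to the client
    )
```

The point of the sketch is that the record is built by whitelisting fields rather than blacklisting sensitive ones, so nothing client-identifying can slip through by omission.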
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Would it be crazy to ask for statistics of user agents per page or per
namespace?
I'd hypothesize, for example, that IE is used much less outside of the
article and portal namespaces.
In case you're wondering what it is useful for: when I have a patch that
requires browser-compatibility trickery, I may want to invest less time in
IE compatibility on a page that is unlikely to be viewed in IE much.
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Hi all,
as a new subscriber to this mailing list I would like to introduce myself.
My name is Sander; I'm a student currently working on a project
using Wikimedia APIs. A Dutch cultural institution has asked a team of
students to analyze how its uploaded material is being used. Some
examples of what they're interested in: knowing where their material is
reused, how many visitors view the pages, whether visitors open the media
on the page, etc. Which APIs would you suggest we look at that could have
valuable information? Also, is there any general documentation about the
various APIs? Any advice would be greatly appreciated.
Regards,
Sander
Hi, is there anybody in this list planning to attend OSCON (Austin, May
16-19)?
http://conferences.oreilly.com/oscon/open-source-us
Alongside that event there will be a workshop about Software Development
Analytics and the new Grimoire toolkit platform, on May 16th (10am-1pm CDT):
https://www.eventbrite.com/e/software-development-analytics-workshop-ticket…
The registration is not free, but the organizers (Bitergia, the developers
of http://korma.wmflabs.org/ ) are offering us a couple of invitations.
Andre and I are not attending OSCON (we have participated in the FOSDEM
edition of this workshop). If you or someone you know is interested,
contact me.
--
Quim Gil
Engineering Community Manager @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
Gergo Tisza makes a valid point.
The nl-wiki has 300.000-350.000 hits on the main page per day. The rest of
the top 10 drops quickly to about 5000 hits per day, a reasonable amount.
But the 1.500.000 unique visitors per day then seems overstated. When I do
a rough estimate, it looks like 1.500.000 is the total number of page
views, so the number of unique devices must be a lot smaller.
See
https://wikimedia.org/api/rest_v1/metrics/unique-devices/nl.wikipedia.org/a…
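For anyone who wants to repeat this rough estimate, here is a small sketch of building such a query URL and summing the returned counts. The path layout and the "items"/"devices" field names are my reading of the API docs, so verify them against an actual response:

```python
def unique_devices_url(project, access_site, granularity, start, end):
    """Build a Wikimedia REST API unique-devices query URL
    (path segments assumed from the published docs)."""
    return ("https://wikimedia.org/api/rest_v1/metrics/unique-devices/"
            f"{project}/{access_site}/{granularity}/{start}/{end}")

def total_devices(response_json):
    """Sum the per-interval device counts in a decoded JSON response,
    assuming items of the form {"devices": <int>, ...}."""
    return sum(item["devices"] for item in response_json["items"])
```

Fetching the URL with any HTTP client and running the monthly total through `total_devices` makes it easy to compare against the daily page view numbers above.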
Edo de Roo
nl-wiki, wikidata
On Tue, Apr 19, 2016 at 11:50 PM, <analytics-request(a)lists.wikimedia.org>
wrote:
> Today's Topics:
>
> 1. Unique Devices data available on API (Nuria Ruiz)
> 2. Hive & Oozie downtime tomorrow (Andrew Otto)
> 3. Re: Unique Devices data available on API (Gergo Tisza)
> 4. Re: Unique Devices data available on API (Kevin Leduc)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 19 Apr 2016 12:17:12 -0700
> From: Nuria Ruiz <nuria(a)wikimedia.org>
> To: "A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>, Wikimedia developers
> <wikitech-l(a)lists.wikimedia.org>,
> wiki-research-l(a)lists.wikimedia.org
> Subject: [Analytics] Unique Devices data available on API
> Message-ID:
> <CAMpYYkGngUfQOu-f6sL1TVTDbDUX_=
> MG9FXEqUV5xmZVZL3nqQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello!
>
> The analytics team is happy to announce that the Unique Devices data is now
> available to be queried programmatically via an API.
>
> This means that getting the daily number of unique devices [1] for English
> Wikipedia for the month of February 2016, for all sites (desktop and
> mobile) is as easy as launching this query:
>
>
> https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/a…
>
> You can get started by taking a look at our docs:
> https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices#Quick_Start
>
> If you are not familiar with the Unique Devices data, the main thing you
> need to know is that it is a good proxy metric for measuring Unique Users;
> more info below.
>
> Since 2009, the Wikimedia Foundation has used comScore to report data about
> unique web visitors. In January 2016, however, we decided to stop
> reporting comScore numbers [2] because of certain limitations in the
> methodology; these limitations translated into misreported mobile usage. We
> are now ready to replace comScore numbers with the Unique Devices dataset.
> While unique devices do not equal unique visitors, they are a good proxy for
> that metric, meaning that a major increase in the number of unique devices
> is likely to come from an increase in distinct users. We understand that
> counting uniques raises fairly big privacy concerns, so we use a very
> privacy-conscious way to count unique devices: it does not involve any
> cookie by which your browsing history can be tracked [3].
>
>
> [1] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [2] https://meta.wikimedia.org/wiki/ComScore/Announcement
> [3]
>
> https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
> devices.3F
>