Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when you launch the hive client saying that it is being replaced with
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
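If you script your queries, here is a minimal sketch (my own, not from the
wiki page) of driving the wrapper non-interactively from Python. The
database and table names are placeholders, and it assumes the wrapper passes
standard beeline flags through unchanged:

    import subprocess

    # Placeholder database/table; substitute your own query.
    query = "SELECT page_title, COUNT(*) AS hits FROM my_db.my_table GROUP BY page_title LIMIT 10;"

    # -e runs a single query; tsv2 output is easy to parse downstream.
    result = subprocess.run(
        ["beeline", "-e", query, "--outputformat=tsv2"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)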
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article (see the sketch after this list)
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
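As a small illustration of the second use above, here is a sketch of
counting the most common referers for one article. It is mine, not part of
the release; the file name and the column names ("prev_title", "curr_title",
"n") are assumptions, so please check the figshare page for the exact schema:

    import csv
    from collections import Counter

    ARTICLE = "London"  # hypothetical target article
    top_referers = Counter()

    # "2015_01_clickstream.tsv" is a placeholder file name; adjust to the
    # actual download.
    with open("2015_01_clickstream.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr_title"] == ARTICLE:
                top_referers[row["prev_title"]] += int(row["n"])

    for referer, n in top_referers.most_common(10):
        print(referer, n)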
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need to be anonymized before being made public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
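To make the suggestion concrete, here is a rough sketch of how one raw
request record could be reduced to these seven fields. The input keys are
placeholders, not the real Wikimedia log field names, and nothing derived
from the client (IP, user agent, cookies) is carried over:

    # Reduce one raw request record to the seven proposed trace fields.
    def to_trace_record(raw, counter):
        return {
            "counter": counter,                    # 1) running counter
            "timestamp": raw["timestamp"],         # 2) request timestamp
            "url": raw["url"],                     # 3) requested URL
            "update": raw.get("is_update", 0),     # 4) update flag
            "cache_host": raw["cache_host"],       # 5) cache's hostname
            "cache_status": raw["cache_status"],   # 6) hit / miss / pass
            "response_size": raw["response_size"], # 7) response size in bytes
        }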
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Just a reminder: we will be deprecating the pagecounts datasets at the end
of May, as we mentioned earlier this year [0]. This means the existing files
will remain available to researchers, but new files will not be generated
in the future.
*Pagecounts datasets that will be deprecated*
pagecounts-raw
pagecounts-all-sites
Options for switching to the new datasets [1]:
pageviews for the same format but better quality data
pagecounts-ez for compressed data
[0] https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html
[1] https://dumps.wikimedia.org/other/analytics/
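For anyone updating scripts, here is a tiny sketch of reading one of the new
hourly pageviews files. The file name is hypothetical and the space-separated
layout (project, page title, view count, byte count) is my assumption based
on the old pagecounts layout, so please check the README under [1] first:

    import gzip

    def read_pageviews(path):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) >= 3:
                    yield parts[0], parts[1], int(parts[2])

    # Example file name only; adjust to an actual dump file.
    for project, title, views in read_pageviews("pageviews-20160501-000000.gz"):
        if project == "en" and views > 1000:
            print(title, views)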
Hello Wikimedia analytics mailing list,
As part of research into how people read Wikipedia, a friend and I created
a short survey. We are interested in seeing how people on this mailing list
(not a representative sample of Wikipedia readers, for sure!) fill out the
survey. The survey should take 2 to 10 minutes to complete.
https://www.surveymonkey.com/r/QBCCVFY
I would also appreciate it if any of you could circulate the
survey to a different audience. If you are interested in doing that, please
let me know (off-list, if you prefer) and I will give you a separate URL
for each such audience. Each URL corresponds to the audience
the survey is shared with, so that it is easier to understand
how responses differ by audience.
Any feedback on the survey questions would also be appreciated, on- or
off-thread.
Thank you very much!
Vipul
This might be of interest: https://clickhouse.yandex/
ClickHouse is an open-source column-oriented database management
system that allows generating analytical data reports in real time.
ClickHouse manages extremely large volumes of data in a stable and
sustainable manner. It currently powers Yandex.Metrica, the world's second
largest web analytics platform, with over 13 trillion database records
and over 20 billion events a day, generating customized reports
on the fly, directly from non-aggregated data. The system was also
successfully used at CERN's LHCb experiment to store and process
metadata on 10 billion events, with over 1000 attributes per event,
registered in 2011.
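For anyone curious, a minimal sketch (not from the linked page) of querying a
local ClickHouse instance over its HTTP interface on port 8123; the table and
column names are made up for illustration:

    import requests

    query = (
        "SELECT uri_path, count() AS hits "
        "FROM requests GROUP BY uri_path ORDER BY hits DESC LIMIT 10"
    )
    resp = requests.post("http://localhost:8123/", data=query)
    resp.raise_for_status()
    print(resp.text)  # tab-separated rows by default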
+ Analytics
On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
> Hello,
>
> I have a question for you regarding pageviews datadumps.
>
> I am considering studying reader engagement for different article topics
> in different languages. Because of this, I would like to know if there is
> any plan to make available pageview dumps detailing activity logs at the
> session level per user - in a similar way to editor sessions.
>
> Since this would be for a research project for which I might seek funding, I
> would like to know whether I could count on that, what the nature of the
> available data is, what the procedure to obtain it would be, and whether
> there would be any implications because of privacy concerns.
>
> Thank you very much!
>
> Best,
>
> Marc Miquel
>
Adding analytics@, a public e-mail list where you can post questions such
as this one.
>that doesn’t tell us how often entities are accessed through
Special:EntityData or wbgetclaims
>Does this data already exist, even in the form of raw access logs?
Is this data always requested via HTTP from an API endpoint that will hit a
Varnish cache? (Daniel can probably answer this.)
From what I see in our data, we have requests like the following:
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q633155
www.wikidata.org /w/api.php
?callback=jQuery11130020702992017004984_1465195743367&format=json&action=wbgetclaims&property=P373&entity=Q5296&_=1465195743368
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q573612
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q472729
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q349797
www.wikidata.org /w/api.php
?action=compare&torev=344163911&fromrev=344163907&format=json
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2356135
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2355988
www.wikidata.org /w/api.php
?action=compare&torev=344164023&fromrev=344163948&format=json
If the data you are interested in can be inferred from these requests there
is no additional data gathering needed.
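As a rough sketch of what inferring this from such requests could look like,
something along these lines could count (action, entity) pairs from request
lines like the ones above. The field names and the sample line are
illustrative only, not our actual log schema:

    from collections import Counter
    from urllib.parse import parse_qs

    usage = Counter()

    def count_request(uri_path, uri_query):
        # Tally (action, entity) pairs for the Wikidata API actions of interest.
        if uri_path != "/w/api.php":
            return
        params = parse_qs(uri_query.lstrip("?"))
        action = params.get("action", [""])[0]
        if action in ("wbgetclaims", "wbgetentities"):
            # "ids" may be pipe-separated; splitting is omitted for brevity.
            for entity in params.get("entity", []) + params.get("ids", []):
                usage[(action, entity)] += 1

    count_request("/w/api.php", "?action=wbgetclaims&format=json&entity=Q633155")
    print(usage.most_common(5))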
>If not, what effort would be required to gather this data? For the
purposes of my proposal to the U.S. Census Bureau I am estimating around
six weeks of effort for this for one person working full-time. If it will
take more time I will need to know.
I think I have mentioned this before on an e-mail thread, but without
knowing the details of what you want to do we cannot give you a time
estimate. What are the exact metrics you are interested in? Is the
project described anywhere on Meta?
Thanks,
Nuria
On Thu, Jun 30, 2016 at 11:45 AM, James Hare <james(a)hxstrategy.com> wrote:
> Copying Lydia Pintscher and Daniel Kinzler (with whom I’ve discussed this
> very topic).
>
> I am interested in metrics that describe how Wikidata is used. While we do
> have views on individual pages, that doesn’t tell us how often entities are
> accessed through Special:EntityData or wbgetclaims. Nor does it tell us how
> often statements/RDF triples show up in the Wikidata Query Service. Does
> this data already exist, even in the form of raw access logs? If not, what
> effort would be required to gather this data? For the purposes of my
> proposal to the U.S. Census Bureau I am estimating around six weeks of
> effort for this for one person working full-time. If it will take more time
> I will need to know.
>
>
> Thank you,
> James Hare
>
> On Thursday, June 2, 2016 at 2:18 PM, Nuria Ruiz wrote:
>
> James:
>
> >My current operating assumption is that it would take one person,
> working on a full time basis, around six weeks to go from raw access logs
> >to a functioning API that would provide information on how many times a
> Wikidata entity was accessed through the various APIs and the >query
> service. Do you believe this to be an accurate level of effort estimation
> based on your experience with past projects of this nature?
> You are starting from the assumption that we do have the data you are
> interested in in the logs, which I am not sure is the case. Have you done
> your checks in this regard with the Wikidata developers?
>
> Analytics 'automagically' collects data from logs about *page* requests;
> any other request collection (and it seems that yours fits this
> scenario) needs to be instrumented. I would send an e-mail to the analytics@
> public list and the Wikidata folks to ask how to harvest the data you are
> interested in. It doesn't sound like it is being collected at this time, so
> your project scope might be quite a bit bigger than you think.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Thu, Jun 2, 2016 at 5:06 AM, James Hare <james(a)hxstrategy.com> wrote:
>
> Hello Nuria,
>
> I am currently developing a proposal for the U.S. Census Bureau to
> integrate their datasets with Wikidata. As part of this, I am interested in
> getting Wikidata usage metrics beyond the page view data currently
> available. My concern is that the page views API gives you information only
> on how many times a *page* is accessed – but Wikidata is not really used
> in this way. More often it is the case that Wikidata’s information is
> accessed through the API endpoints (wbgetclaims etc.), through
> Special:EntityData, and the Wikidata Query Service. If we have information
> on usage through those mechanisms, that would give me much better
> information on Wikidata’s usage.
>
> To the extent these metrics are important to my prospective client, I am
> willing to provide in-kind support to the analytics team to make this
> information available, including expenses associated with the NDA process
> (I understand that such a person may need to deal with raw access logs that
> include PII.) My current operating assumption is that it would take one
> person, working on a full time basis, around six weeks to go from raw
> access logs to a functioning API that would provide information on how many
> times a Wikidata entity was accessed through the various APIs and the query
> service. Do you believe this to be an accurate level of effort estimation
> based on your experience with past projects of this nature?
>
> Please let me know if you have any questions. I am happy to discuss my
> idea with you further.
>
>
> Regards,
> James Hare
>
>
>
>
Hi!
Tomorrow morning (Jun 30th, CET timezone) I'll need to reboot stat1002,
stat1003 and stat1004 for kernel upgrades (Ubuntu security patches). This
could potentially terminate long-running queries or jobs, so please ping me
on IRC or email me if your work can't be postponed or stopped.
Thanks!
Regards,
Luca
Yaron Koren has proposed reopening the "Unacceptable behavior" section
(https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Suggested_change_…).
His perspective and mine are given on the talk page.
In brief:
* He disagrees with how "marginalized and otherwise underrepresented
groups" and "encouraged" are handled in the original text.
* I support the current text and process, and have explained why on the
talk page.
Thanks,
Matt Flaschen