Analytics May 2014

analytics@lists.wikimedia.org

35 participants
39 discussions

by Fabrice Florin

Hi guys, Does anyone know why the Media Viewer metrics dashboards seem to be stuck with old data from Friday? http://multimedia-metrics.wmflabs.org/dashboards/mmv Is there anything we could fiddle with to get the new data to show up? Thanks for any insights :) Fabrice _______________________________ Fabrice Florin Product Manager Wikimedia Foundation http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)

9 years, 12 months

EventLogging ballooning

by Federico Leva (Nemo)

* From 40 to 260 events logged per second in a month: what's going on? <https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Miscellan…> * Schema docs outdated: https://meta.wikimedia.org/wiki/Research_talk:Schemas * Confirmations needed: https://www.mediawiki.org/?diff=1006822&oldid=964133 Nemo

9 years, 12 months

generating sequences on analytics slaves

by Sean Pringle

Hi All The query form listed inline below runs periodically on s1-analytics-slave and now analytics-store. It generates 30 days worth of data without gaps by joining a table known to have more than 30 rows. That old trick is perfectly OK. However the table chosen is information_schema.columns and that's NOT OK :-) The problem is that the metadata in some information_schema tables must be materialized every time they're accessed. In this case the .frm file for *every table in the system* must be opened and checked before the query runs[1]. Yes, every table. This was probably always slow on s1-analytics-slave with enwiki + log + personal tables. It's even slower now that analytics-store holds all wikis. SELECT Month.Date, COALESCE(Web.Web, 0) AS Web FROM (SELECT DATE_FORMAT(ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d') AS Date FROM information_schema.columns, (SELECT @num:=-1) num LIMIT 30) AS Month LEFT JOIN (SELECT DATE(timestamp) AS Date, SUM(1) AS Web FROM ( SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 ) as MobileWebEditing WHERE event_action = 'error' AND wiki != 'testwiki' GROUP BY Date ) AS Web ON Month.Date = Web.Date; MariaDB 10 has an SQL trick for generating sequences. Hackish, but simple: SELECT seq FROM seq_1_to_5; +-----+ | seq | +-----+ | 1 | | 2 | | 3 | | 4 | | 5 | +-----+ https://mariadb.com/kb/en/sequence/ Alternatively, we can put a real sequence table somewhere handy with, say, 1000 integers. BR Sean [1] The file accesses can be alleviated by having large table [definition] cache(s), however then we're talking hundreds of thousands more open file handles which is a whole new ceiling waiting to be hit :-) -- DBA @ WMF

10 years

Re: [Analytics] [Wiki-research-l] Wikipedia traffic: selected language versions

by Dario Taraborelli

+ analytics On May 16, 2014, at 8:50 AM, Taha Yasseri <taha.yaseri(a)gmail.com> wrote: > Very useful Han-Teng, but one should note that the original data is about "the percentage of requesting ip addresses", excluding duplications of a single IP address within the same day, and not for example the number of edits. These two can be very different depending on dynamic/static IP address models in different countries. > And that explains the discrepancy between your results and our earlier analysis based on circadian patterns and edits timestamps. > > Again, very interesting and well done. > Best, > Taha > > > On Fri, May 16, 2014 at 4:28 PM, h <hanteng(a)gmail.com> wrote: > Dear all, > > With the aim to compare Wikipedia traffic report data (e.g. viewing versus editing, regional differences within a language version, etc.), I have made a few more interactive infographics which show the historical changes since late 2011. (Historical numbers are scraped from the past versions archived by the Internet archive) > > For more, please visit follow the link below: > http://people.oii.ox.ac.uk/hanteng/2014/05/16/wikipedia-traffic/ > > It has at least one nice interactive feature: a user can zoom and pan to view the chart easily with a mouse or mousepad. The SVG vector-based presentation insures the picture quality is consistent when users zoom in to compare data points. (I haven't figured out how mpld3's html tooltip work for this project, though.) > > It is also possible to extend the prototype with dynamic json objects so that the chart/tables can be updated automatically. > > Any suggestions and comments are welcome. > > Best, > han-teng liao > > > _______________________________________________ > Wiki-research-l mailing list > Wiki-research-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > > -- > .t > _______________________________________________ > Wiki-research-l mailing list > Wiki-research-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

10 years

Re: [Analytics] [Multimedia] Media Viewer Metrics + Research Questions

by Fabrice Florin

Hi Mark, Thanks so much for applying Dan’s proposed fix so quickly! As discussed, I am puzzled by the numbers on these updated graphs: http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab Global image views used to be up above 3 million on Sunday 4/27, after the launch to 9 new pilot sites but before the outage (1). However, they now only appear to be 2 million. And the decline is surprising, compared to other pilot launches where image views keep increasing. Also, local image views on the Polish Wikipedia used to be above 1 million on the same date before the outage, but are now down to 500k for the same date: http://multimedia-metrics.wmflabs.org/dashboards/mmv_plwiki Dan, Gilles or others, any ideas what’s going on — and how we can get the metrics back on track? I’m Cc:ing Analytics in case others have some insights that could help us. We’re launching today on French and Dutch Wikipedias, and would like to have stable metrics to discuss with our communities. Thanks for any help you can provide. :) Fabrice (1) https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Release_Plan#Pilots On May 1, 2014, at 9:19 AM, Mark Holmquist <mtraceur(a)member.fsf.org> wrote: > On Thu, May 01, 2014 at 11:40:08AM -0400, Dan Andreescu wrote: >> https://gerrit.wikimedia.org/r/#/c/130833/1/mobile/config.yaml > > I've done this on stat1003; we don't have any of yer high-falutin' config > files in the multimedia analytics repos. > > Thanks for the suggestion :) > > -- > Mark Holmquist > Software Engineer, Multimedia > Wikimedia Foundation > mtraceur(a)member.fsf.org > https://wikimediafoundation.org/wiki/User:MHolmquist > _______________________________________________ > Multimedia mailing list > Multimedia(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/multimedia On May 1, 2014, at 8:40 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote: > db1047 (where all the event logging data is being replicated to) is still lagging and we'll update the thread on the analytics list when it's done. > > In the meantime, you can switch your queries from s1 to a new box that's caught up already. Here's a similar change I submitted for the mobile web team to consider: > > https://gerrit.wikimedia.org/r/#/c/130833/1/mobile/config.yaml > > > On Thu, May 1, 2014 at 11:37 AM, Fabrice Florin <fflorin(a)wikimedia.org> wrote: > Hi guys, > > Any word from analytics as to when the Media Viewer metrics dashboards will be updated? Is the EventsLogging migration almost done? Now is the time when we need this data. :) > > http://multimedia-metrics.wmflabs.org/dashboards/mmv > > Also, do we have enough data now to define acceptance performance criteria for Media Viewer and complete card #149? > > https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/149 > > Other research questions which we would like to answer based on pilot results include: > > * how fast do images load? > I think we now have enough data to make a statement, both on a global and local basis, with breakdowns for warm and cold caches, as well as large images. Want to take a first stab at it, or want me to? > > * does performance improve with more users? > It would be great if we could confirm our hypothesis that images load faster when more users are clicking on them. > > * any slowdowns during peak hours? > Do we have any data that would show what happens during peak hours? Would this require us to create hourly dashboards in a few pilot sites? > > * is the overall performance acceptable? > This question can be answered once we define acceptable performance criteria. It could also be confirmed to some extent by survey responses, particularly if we see a decrease in complaints about the speed. > > To that end, I will do another sweep at latest survey results and compile more results tomorrow on this page: > https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Survey > > Look forward to discussing these questions together in coming days, based on the data now on hand. > > Thanks, > > > Fabrice > > > _______________________________ > > Fabrice Florin > Product Manager > Wikimedia Foundation > > http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF) > > _______________________________ Fabrice Florin Product Manager Wikimedia Foundation http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)

10 years

Post mortem: slow queries on analytics slaves

by Dario Taraborelli

If – like me – you had jobs or queries on the analytics slaves that used to finish in seconds or minutes and suddenly started taking several minutes to hours to complete, you were probably affected by a problem with the table types used on the slaves. Sean nailed down the issue which has now been temporarily solved by restoring recentchanges tables as InnoDB. Post mortem below (thanks Sean!) Dario >> Aside from tuning the SQL, I think the slow down you saw was at least partly due to the migration of tables from InnoDB to TokuDB. The combination of table scan on either change_tag or tag_summary plus an eq_ref join on recentchanges hits a performance wall somewhere around 3M rows. It isn't due to obvious things like memory, swap, mutexes, or temp tables hitting disk. So I guess I need to search the upstream bugs list :-) >> > Found it: https://mariadb.atlassian.net/browse/MDEV-5595 -- Related specifically to slow-downs on tables with high delete rates, precisely like our recentchanges :-) > > Will take a couple days to organize packages with the relevant fix, so for now I've reverted recentchanges to InnoDB on analytics boxes. No other tables or queries seem affected.

10 years

s1-analytics-slave enwiki replication

by Sean Pringle

Hi! Just a heads-up: Today s1-analytics-slave db1047 encountered an issue with a enwiki table, which in turn stopped S1 replication. The bug has been idetified and the affected table is being repaired, so replag should catch up within a few hours. Note that this did not affect M2 replication (eventlogging). BR Sean -- DBA @ WMF

10 years

Fwd: [Wiki-Medicine] need traffic data for health content...

by Lane Rasberry

Hello Analytics list! I am following up on a thread I started in October 2013 in which I asked for guidance about framing claims on the popularity of Wikipedia's health content. <http://lists.wikimedia.org/pipermail/analytics/2013-October/001085.html> Thank you all. With the help of your comments I got a feature article published in BMJ, the "British Medical Journal". Even though I did not engage you all in conversation I really put a lot of thought into everything you all wrote, and found the response very encouraging. In this article in various ways I said "Health content on Wikipedia is more requested and accessed than comparable information from most other sources." When I originally wrote to this board I asked for analytic backing to say this, and I appreciate the comments that I got. If anyone in the future would like to talk more about Wikipedia's health traffic then please post to this board and contact me or contact me and others through WikiProject Medicine on English Wikipedia. My article is "Wikipedia: what it is and why it matters for healthcare" and it can be accessed by those with a BMJ magazine subscription at the first link or through an alternative method in the second link. <http://www.bmj.com/content/348/bmj.g2478> < http://bluerasberry.com/2014/04/wikipedia-and-health-information-published-… > I also wish to respond to the common concern that people ought not get their health information from Wikipedia, and I wanted to share with you all what I tell people when they ask me why I care about Wikipedia's health information. Wikipedia is an extremely popular source of health information, and it is also a source with quality problems. All other sources of health information are unpopular, and they may or may not have good quality. It is my opinion that it would be less expensive by orders of magnitude to improve the quality of Wikipedia's health information than it would be to increase the popularity and accessibility of any other source or health information to a level of accessibility comparable to Wikipedia. Right now the Wikimedia movement is not imagined as a public or global health movement, but I feel that there is something here and that analytics might be the argument on which to base a call to action. The request I originally expressed to this board still stands - I still would like whatever information might be available describing the audience accessing health content on Wikipedia, and I think comprehensive information would be appreciated in health more than anywhere else in a Wikimedia project. Thank you all, and thank you again if you commented months ago. yours, On Mon, Oct 7, 2013 at 12:15 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>wrote: > Can I just say how geektastically awesome it is that we're having a > discussion about how to frame claims about Wikipedia's popularity? Now this > is what lists are FOR. > > But in the interest of avoiding stasis<https://en.wikipedia.org/wiki/Stasis_(rhetoric)#Stasis>, > I also want to say to Lane: don't sweat the language too much ;) You're not > going to be spouting untruths or despoiling the brand if you say Wikipedia > is the, or one of the, highest trafficked websites in the world for health > info. Wikipedia researchers make claims like that frequently, and often > with less data to back it up than you're offering. > > Also, Lane: do you want someone to script up that pageview request? I > agree with Erik that using WP Med/WP Health categories will get you better > results. I've been on the hook for getting some similar data for > Biosthmores for about... 6 months now. I could work on it on my own time > some evening this week. > > - J > > > On Fri, Oct 4, 2013 at 5:15 PM, Lars Aronsson <lars(a)aronsson.se> wrote: > >> On 10/04/2013 11:39 PM, Matthew Flaschen wrote: >> >>> "Search engines increasingly lead people to Wikipedia, which is one of >>> the factors in making Wikipedia the single highest traffic source of health >>> information in the world." >>> >> >> I can search for images, but only when they have words >> associated with them, e.g. descriptions, tags or categories. >> >> In this sense, doctors examining a patient and giving them >> a diagnosis is similar to tagging an image. Suddenly, the >> illness that this patient felt becomes possible to search. >> >> >> -- >> Lars Aronsson (lars(a)aronsson.se) >> Aronsson Datateknik - http://aronsson.se >> >> >> >> >> _______________________________________________ >> Analytics mailing list >> Analytics(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/analytics >> > > > > -- > Jonathan T. Morgan > Learning Strategist > Wikimedia Foundation > > _______________________________________________ > Analytics mailing list > Analytics(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > > -- Lane Rasberry user:bluerasberry on Wikipedia 206.801.0814 lane(a)bluerasberry.com On Mon, Oct 7, 2013 at 12:15 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>wrote: > Can I just say how geektastically awesome it is that we're having a > discussion about how to frame claims about Wikipedia's popularity? Now this > is what lists are FOR. > > But in the interest of avoiding stasis<https://en.wikipedia.org/wiki/Stasis_(rhetoric)#Stasis>, > I also want to say to Lane: don't sweat the language too much ;) You're not > going to be spouting untruths or despoiling the brand if you say Wikipedia > is the, or one of the, highest trafficked websites in the world for health > info. Wikipedia researchers make claims like that frequently, and often > with less data to back it up than you're offering. > > Also, Lane: do you want someone to script up that pageview request? I > agree with Erik that using WP Med/WP Health categories will get you better > results. I've been on the hook for getting some similar data for > Biosthmores for about... 6 months now. I could work on it on my own time > some evening this week. > > - J > > > On Fri, Oct 4, 2013 at 5:15 PM, Lars Aronsson <lars(a)aronsson.se> wrote: > >> On 10/04/2013 11:39 PM, Matthew Flaschen wrote: >> >>> "Search engines increasingly lead people to Wikipedia, which is one of >>> the factors in making Wikipedia the single highest traffic source of health >>> information in the world." >>> >> >> I can search for images, but only when they have words >> associated with them, e.g. descriptions, tags or categories. >> >> In this sense, doctors examining a patient and giving them >> a diagnosis is similar to tagging an image. Suddenly, the >> illness that this patient felt becomes possible to search. >> >> >> -- >> Lars Aronsson (lars(a)aronsson.se) >> Aronsson Datateknik - http://aronsson.se >> >> >> >> >> _______________________________________________ >> Analytics mailing list >> Analytics(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/analytics >> > > > > -- > Jonathan T. Morgan > Learning Strategist > Wikimedia Foundation > > _______________________________________________ > Analytics mailing list > Analytics(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > > -- Lane Rasberry user:bluerasberry on Wikipedia 206.801.0814 lane(a)bluerasberry.com -- Lane Rasberry user:bluerasberry on Wikipedia 206.801.0814 lane(a)bluerasberry.com

10 years

Adding indexes to EventLogging tables

by Gilles Dubuc

Is it possible? What's the process for it? Some of the tables on db1047 I'm dealing with are getting massive (the main offender is 27.8GB) and some of our daily queries used to build tsvs are badly in need on indexes. The SQL credentials I have access to don't have the rights to alter those tables.

10 years

Enwiki replication lag on s1-analytics-slave.eqiad.wmnet >12 hours

by Christian Aistleitner

Hi, just a quick heads up that the replication lag for the enwiki database on the analytics s1 slave (s1-analytics-slave.eqiad.wmnet, db1047.eqiad.wmnet) is again >12 hours [1]. If you run jobs that rely on current enwiki data, you can temporary switch to using analytics-store.eqiad.wmnet, which /currently/ does not suffer replication lag. The usual research user has access to the enwiki database there. Best regards, Christian P.S.: This time I did not file an RT ticket yet, as we already knew that the s1 slave is suffering replication lag for EventLogging tables. The new thing is only that now also enwiki is affected. [1] Do not trust Ganglia saying mysql_slave_lag being 0 for this host. Since the start of EventLogging database migration, Ganglia is saying 0 for all mysql metrics of this host. Even for those that are not expected to be 0. That's likely a separate issue. -- ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Gruendbergstrasze 65a Email: christian(a)quelltextlich.at 4040 Linz, Austria Phone: +43 732 / 26 95 63 Fax: +43 732 / 26 95 63 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------

10 years

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics May 2014