Hi All
The query form listed inline below runs periodically on s1-analytics-slave
and now analytics-store. It generates 30 days worth of data without gaps by
joining a table known to have more than 30 rows.
That old trick is perfectly OK. However the table chosen is
information_schema.columns and that's NOT OK :-)
The problem is that the metadata in some information_schema tables must be
materialized every time they're accessed. In this case the .frm file for
*every table in the system* must be opened and checked before the query
runs[1]. Yes, every table.
This was probably always slow on s1-analytics-slave with enwiki + log +
personal tables. It's even slower now that analytics-store holds all wikis.
SELECT
Month.Date,
COALESCE(Web.Web, 0) AS Web
FROM
(SELECT
DATE_FORMAT(ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1),
'%Y-%m-%d') AS Date
FROM information_schema.columns, (SELECT @num:=-1) num LIMIT 30) AS
Month
LEFT JOIN
(SELECT DATE(timestamp) AS Date, SUM(1) AS Web FROM (
SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount
FROM MobileWebEditing_5644223
UNION
SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount
FROM MobileWebEditing_6077315
UNION
SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount
from MobileWebEditing_6637866
UNION
SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount
from MobileWebEditing_7675117
) as MobileWebEditing
WHERE
event_action = 'error'
AND wiki != 'testwiki'
GROUP BY Date
) AS Web
ON Month.Date = Web.Date;
MariaDB 10 has an SQL trick for generating sequences. Hackish, but simple:
SELECT seq FROM seq_1_to_5;
+-----+
| seq |
+-----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+-----+
https://mariadb.com/kb/en/sequence/
Alternatively, we can put a real sequence table somewhere handy with, say,
1000 integers.
BR
Sean
[1] The file accesses can be alleviated by having large table [definition]
cache(s), however then we're talking hundreds of thousands more open file
handles which is a whole new ceiling waiting to be hit :-)
--
DBA @ WMF
+ analytics
On May 16, 2014, at 8:50 AM, Taha Yasseri <taha.yaseri(a)gmail.com> wrote:
> Very useful Han-Teng, but one should note that the original data is about "the percentage of requesting ip addresses", excluding duplications of a single IP address within the same day, and not for example the number of edits. These two can be very different depending on dynamic/static IP address models in different countries.
> And that explains the discrepancy between your results and our earlier analysis based on circadian patterns and edits timestamps.
>
> Again, very interesting and well done.
> Best,
> Taha
>
>
> On Fri, May 16, 2014 at 4:28 PM, h <hanteng(a)gmail.com> wrote:
> Dear all,
>
> With the aim to compare Wikipedia traffic report data (e.g. viewing versus editing, regional differences within a language version, etc.), I have made a few more interactive infographics which show the historical changes since late 2011. (Historical numbers are scraped from the past versions archived by the Internet archive)
>
> For more, please visit follow the link below:
> http://people.oii.ox.ac.uk/hanteng/2014/05/16/wikipedia-traffic/
>
> It has at least one nice interactive feature: a user can zoom and pan to view the chart easily with a mouse or mousepad. The SVG vector-based presentation insures the picture quality is consistent when users zoom in to compare data points. (I haven't figured out how mpld3's html tooltip work for this project, though.)
>
> It is also possible to extend the prototype with dynamic json objects so that the chart/tables can be updated automatically.
>
> Any suggestions and comments are welcome.
>
> Best,
> han-teng liao
>
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
>
>
> --
> .t
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi Mark,
Thanks so much for applying Dan’s proposed fix so quickly!
As discussed, I am puzzled by the numbers on these updated graphs:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab
Global image views used to be up above 3 million on Sunday 4/27, after the launch to 9 new pilot sites but before the outage (1). However, they now only appear to be 2 million. And the decline is surprising, compared to other pilot launches where image views keep increasing.
Also, local image views on the Polish Wikipedia used to be above 1 million on the same date before the outage, but are now down to 500k for the same date:
http://multimedia-metrics.wmflabs.org/dashboards/mmv_plwiki
Dan, Gilles or others, any ideas what’s going on — and how we can get the metrics back on track? I’m Cc:ing Analytics in case others have some insights that could help us.
We’re launching today on French and Dutch Wikipedias, and would like to have stable metrics to discuss with our communities.
Thanks for any help you can provide. :)
Fabrice
(1) https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Release_Plan#Pilots
On May 1, 2014, at 9:19 AM, Mark Holmquist <mtraceur(a)member.fsf.org> wrote:
> On Thu, May 01, 2014 at 11:40:08AM -0400, Dan Andreescu wrote:
>> https://gerrit.wikimedia.org/r/#/c/130833/1/mobile/config.yaml
>
> I've done this on stat1003; we don't have any of yer high-falutin' config
> files in the multimedia analytics repos.
>
> Thanks for the suggestion :)
>
> --
> Mark Holmquist
> Software Engineer, Multimedia
> Wikimedia Foundation
> mtraceur(a)member.fsf.org
> https://wikimediafoundation.org/wiki/User:MHolmquist
> _______________________________________________
> Multimedia mailing list
> Multimedia(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/multimedia
On May 1, 2014, at 8:40 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
> db1047 (where all the event logging data is being replicated to) is still lagging and we'll update the thread on the analytics list when it's done.
>
> In the meantime, you can switch your queries from s1 to a new box that's caught up already. Here's a similar change I submitted for the mobile web team to consider:
>
> https://gerrit.wikimedia.org/r/#/c/130833/1/mobile/config.yaml
>
>
> On Thu, May 1, 2014 at 11:37 AM, Fabrice Florin <fflorin(a)wikimedia.org> wrote:
> Hi guys,
>
> Any word from analytics as to when the Media Viewer metrics dashboards will be updated? Is the EventsLogging migration almost done? Now is the time when we need this data. :)
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv
>
> Also, do we have enough data now to define acceptance performance criteria for Media Viewer and complete card #149?
>
> https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/149
>
> Other research questions which we would like to answer based on pilot results include:
>
> * how fast do images load?
> I think we now have enough data to make a statement, both on a global and local basis, with breakdowns for warm and cold caches, as well as large images. Want to take a first stab at it, or want me to?
>
> * does performance improve with more users?
> It would be great if we could confirm our hypothesis that images load faster when more users are clicking on them.
>
> * any slowdowns during peak hours?
> Do we have any data that would show what happens during peak hours? Would this require us to create hourly dashboards in a few pilot sites?
>
> * is the overall performance acceptable?
> This question can be answered once we define acceptable performance criteria. It could also be confirmed to some extent by survey responses, particularly if we see a decrease in complaints about the speed.
>
> To that end, I will do another sweep at latest survey results and compile more results tomorrow on this page:
> https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Survey
>
> Look forward to discussing these questions together in coming days, based on the data now on hand.
>
> Thanks,
>
>
> Fabrice
>
>
> _______________________________
>
> Fabrice Florin
> Product Manager
> Wikimedia Foundation
>
> http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)
>
>
_______________________________
Fabrice Florin
Product Manager
Wikimedia Foundation
http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)
If – like me – you had jobs or queries on the analytics slaves that used to finish in seconds or minutes and suddenly started taking several minutes to hours to complete, you were probably affected by a problem with the table types used on the slaves.
Sean nailed down the issue which has now been temporarily solved by restoring recentchanges tables as InnoDB. Post mortem below (thanks Sean!)
Dario
>> Aside from tuning the SQL, I think the slow down you saw was at least partly due to the migration of tables from InnoDB to TokuDB. The combination of table scan on either change_tag or tag_summary plus an eq_ref join on recentchanges hits a performance wall somewhere around 3M rows. It isn't due to obvious things like memory, swap, mutexes, or temp tables hitting disk. So I guess I need to search the upstream bugs list :-)
>>
> Found it: https://mariadb.atlassian.net/browse/MDEV-5595 -- Related specifically to slow-downs on tables with high delete rates, precisely like our recentchanges :-)
>
> Will take a couple days to organize packages with the relevant fix, so for now I've reverted recentchanges to InnoDB on analytics boxes. No other tables or queries seem affected.
Hi!
Just a heads-up:
Today s1-analytics-slave db1047 encountered an issue with a enwiki table,
which in turn stopped S1 replication. The bug has been idetified and the
affected table is being repaired, so replag should catch up within a few
hours.
Note that this did not affect M2 replication (eventlogging).
BR
Sean
--
DBA @ WMF
Hello Analytics list!
I am following up on a thread I started in October 2013 in which I asked
for guidance about framing claims on the popularity of Wikipedia's health
content.
<http://lists.wikimedia.org/pipermail/analytics/2013-October/001085.html>
Thank you all. With the help of your comments I got a feature article
published in BMJ, the "British Medical Journal". Even though I did not
engage you all in conversation I really put a lot of thought into
everything you all wrote, and found the response very encouraging.
In this article in various ways I said "Health content on Wikipedia is more
requested and accessed than comparable information from most other
sources." When I originally wrote to this board I asked for analytic
backing to say this, and I appreciate the comments that I got. If anyone in
the future would like to talk more about Wikipedia's health traffic then
please post to this board and contact me or contact me and others through
WikiProject Medicine on English Wikipedia. My article is "Wikipedia: what
it is and why it matters for healthcare" and it can be accessed by those
with a BMJ magazine subscription at the first link or through an
alternative method in the second link.
<http://www.bmj.com/content/348/bmj.g2478>
<
http://bluerasberry.com/2014/04/wikipedia-and-health-information-published-…
>
I also wish to respond to the common concern that people ought not get
their health information from Wikipedia, and I wanted to share with you all
what I tell people when they ask me why I care about Wikipedia's health
information. Wikipedia is an extremely popular source of health
information, and it is also a source with quality problems. All other
sources of health information are unpopular, and they may or may not have
good quality. It is my opinion that it would be less expensive by orders of
magnitude to improve the quality of Wikipedia's health information than it
would be to increase the popularity and accessibility of any other source
or health information to a level of accessibility comparable to Wikipedia.
Right now the Wikimedia movement is not imagined as a public or global
health movement, but I feel that there is something here and that analytics
might be the argument on which to base a call to action.
The request I originally expressed to this board still stands - I still
would like whatever information might be available describing the audience
accessing health content on Wikipedia, and I think comprehensive
information would be appreciated in health more than anywhere else in a
Wikimedia project.
Thank you all, and thank you again if you commented months ago.
yours,
On Mon, Oct 7, 2013 at 12:15 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>wrote:
> Can I just say how geektastically awesome it is that we're having a
> discussion about how to frame claims about Wikipedia's popularity? Now this
> is what lists are FOR.
>
> But in the interest of avoiding stasis<https://en.wikipedia.org/wiki/Stasis_(rhetoric)#Stasis>,
> I also want to say to Lane: don't sweat the language too much ;) You're not
> going to be spouting untruths or despoiling the brand if you say Wikipedia
> is the, or one of the, highest trafficked websites in the world for health
> info. Wikipedia researchers make claims like that frequently, and often
> with less data to back it up than you're offering.
>
> Also, Lane: do you want someone to script up that pageview request? I
> agree with Erik that using WP Med/WP Health categories will get you better
> results. I've been on the hook for getting some similar data for
> Biosthmores for about... 6 months now. I could work on it on my own time
> some evening this week.
>
> - J
>
>
> On Fri, Oct 4, 2013 at 5:15 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
>
>> On 10/04/2013 11:39 PM, Matthew Flaschen wrote:
>>
>>> "Search engines increasingly lead people to Wikipedia, which is one of
>>> the factors in making Wikipedia the single highest traffic source of health
>>> information in the world."
>>>
>>
>> I can search for images, but only when they have words
>> associated with them, e.g. descriptions, tags or categories.
>>
>> In this sense, doctors examining a patient and giving them
>> a diagnosis is similar to tagging an image. Suddenly, the
>> illness that this patient felt becomes possible to search.
>>
>>
>> --
>> Lars Aronsson (lars(a)aronsson.se)
>> Aronsson Datateknik - http://aronsson.se
>>
>>
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
>
> --
> Jonathan T. Morgan
> Learning Strategist
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
On Mon, Oct 7, 2013 at 12:15 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>wrote:
> Can I just say how geektastically awesome it is that we're having a
> discussion about how to frame claims about Wikipedia's popularity? Now this
> is what lists are FOR.
>
> But in the interest of avoiding stasis<https://en.wikipedia.org/wiki/Stasis_(rhetoric)#Stasis>,
> I also want to say to Lane: don't sweat the language too much ;) You're not
> going to be spouting untruths or despoiling the brand if you say Wikipedia
> is the, or one of the, highest trafficked websites in the world for health
> info. Wikipedia researchers make claims like that frequently, and often
> with less data to back it up than you're offering.
>
> Also, Lane: do you want someone to script up that pageview request? I
> agree with Erik that using WP Med/WP Health categories will get you better
> results. I've been on the hook for getting some similar data for
> Biosthmores for about... 6 months now. I could work on it on my own time
> some evening this week.
>
> - J
>
>
> On Fri, Oct 4, 2013 at 5:15 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
>
>> On 10/04/2013 11:39 PM, Matthew Flaschen wrote:
>>
>>> "Search engines increasingly lead people to Wikipedia, which is one of
>>> the factors in making Wikipedia the single highest traffic source of health
>>> information in the world."
>>>
>>
>> I can search for images, but only when they have words
>> associated with them, e.g. descriptions, tags or categories.
>>
>> In this sense, doctors examining a patient and giving them
>> a diagnosis is similar to tagging an image. Suddenly, the
>> illness that this patient felt becomes possible to search.
>>
>>
>> --
>> Lars Aronsson (lars(a)aronsson.se)
>> Aronsson Datateknik - http://aronsson.se
>>
>>
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
>
> --
> Jonathan T. Morgan
> Learning Strategist
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
Is it possible? What's the process for it?
Some of the tables on db1047 I'm dealing with are getting massive (the main
offender is 27.8GB) and some of our daily queries used to build tsvs are
badly in need on indexes.
The SQL credentials I have access to don't have the rights to alter those
tables.
Hi,
just a quick heads up that the replication lag for the enwiki database
on the analytics s1 slave (s1-analytics-slave.eqiad.wmnet,
db1047.eqiad.wmnet) is again >12 hours [1].
If you run jobs that rely on current enwiki data, you can temporary
switch to using analytics-store.eqiad.wmnet, which /currently/ does
not suffer replication lag. The usual research user has access to the
enwiki database there.
Best regards,
Christian
P.S.: This time I did not file an RT ticket yet, as we already knew
that the s1 slave is suffering replication lag for EventLogging
tables. The new thing is only that now also enwiki is affected.
[1] Do not trust Ganglia saying mysql_slave_lag being 0 for this
host. Since the start of EventLogging database migration, Ganglia is
saying 0 for all mysql metrics of this host. Even for those that are
not expected to be 0. That's likely a separate issue.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------