📈 Wikimedia production errors help

Tyler Cipriani

14 Sep 2020

8:49 p.m.

Hello all!

Over the past few months we've reached the ignominious milestone of the most open tasks of all time on the wikimedia-production-error dashboard[0].

Background: The wikimedia-production-error dashboard is a workboard of tasks created while digging through the Wikimedia production error logs. All tasks there are log messages that have originated on production servers.

The number of new tasks being created with this tag in a given week is outpacing the number of tasks being closed in a given week: this past week we added 41 tasks and only closed 22. This is beginning to be unsustainable :(

There are currently 281 open tasks filed for errors in production. Although we're triaging this workboard weekly, we rely on the expertise of developers most familiar with the error messages to triage them, prioritize them, and "fix" them (for whatever value of "fix" is appropriate).

Below is a smattering of selected issues that could use some attention:

1. PHP Fatal error: Out of memory in cdb/src/Reader/DBA.php[1]
2. Uncaught ReferenceError: collectionCall is not defined[2]
3. Flow: PHP Notice: Undefined index: flow-workflow-change[3]
4. PHP Warning: unpack(): Type H: not enough input, need 4, have 0[4]
5. TypeError: undefined is not an object (evaluating 'this.getMIMEType')[5]
6. Elastica\Exception\ResponseException from line 56 of GeoData/includes/Searcher.php[6]
7. Wikimedia\CSS\Objects\ComponentValueList may not contain tokens of type "[".[7]

Please help to triage or resolve these problems or any of the other 166 tasks needing triage[8] if you are able.

<3
-- Tyler

[0]: <https://phabricator.wikimedia.org/tag/wikimedia-production-error/>
[1]: <https://phabricator.wikimedia.org/T260234>
[2]: <https://phabricator.wikimedia.org/T259809>
[3]: <https://phabricator.wikimedia.org/T259739>
[4]: <https://phabricator.wikimedia.org/T259592>
[5]: <https://phabricator.wikimedia.org/T259419>
[6]: <https://phabricator.wikimedia.org/T258641>
[7]: <https://phabricator.wikimedia.org/T258093>
[8]: <https://phabricator.wikimedia.org/maniphest/query/LW5WTEnToXDn/#R.>
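
The arithmetic behind the "unsustainable" concern can be sketched in a few lines. This is only an illustration: the weekly figures (41 opened, 22 closed) and the 281 open tasks come from the message above, while the idea that those rates stay constant is a hypothetical assumption, as are the function and parameter names.

```python
def project_backlog(open_now: int, opened_per_week: int,
                    closed_per_week: int, weeks: int) -> int:
    """Project the open-task count, assuming (hypothetically) constant weekly rates."""
    net_per_week = opened_per_week - closed_per_week
    return open_now + net_per_week * weeks


# Figures quoted in the message: 41 opened, 22 closed, 281 currently open.
print(41 - 22)                            # net growth per week: 19
print(project_backlog(281, 41, 22, 4))    # 281 + 19 * 4 = 357
```

Under that assumption the backlog grows by 19 tasks per week; real weekly rates will vary, so the projection is a rough guide at best.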

Niklas Laxström

15 Sep

8:55 a.m.

New subject: [Wikitech-l] 📈 Wikimedia production errors help

On Mon, 14 Sep 2020 at 23:49, Tyler Cipriani (tcipriani(a)wikimedia.org) wrote:

...

The number of new tasks being created with this tag in a given week is outpacing the number of tasks being closed in a given week: this past week we added 41 tasks and only closed 22.

The majority of the recently created tasks are frontend JavaScript errors. The logging of these errors has only started recently. These issues may have been present for years already, but they are only being reported now.

...

This is beginning to be unsustainable :(

If there is an increase in the number of real new issues and/or a decrease in the number of issues fixed, then I would be worried. Given what I said above, it's difficult to see if this is the case.

Regardless, I do agree that we should aim to minimize production errors to make it easier to spot any new issues. I would encourage all maintainers and development teams to ensure that they have a regular process to check for and triage any production issues in code they maintain.

I think we should expect the number to go up while the backlog of unreported frontend errors is being reported, and then it would start going down as developers work to reduce the backlog of reported issues. It will probably stabilize at some level, higher than previously, indicating that some areas of code lack maintainers or maintenance resources.

Ending with a question: do we want to have both frontend and backend errors on the same tag/board, or should they be on separate ones?

-Niklas

Derk-Jan Hartman

11:23 a.m.

In particular I count 13 frontend problems with the old TMH Kaltura player. There is clearly no intent to fix those (volunteer or employee), as the Kaltura player has been unmaintained for 8 years. The choices, as far as I can tell, are to ignore them, undeploy a/v playback, or direct C-level management to get the audio and video stuff together.

DJ

On Tue, Sep 15, 2020 at 11:00 AM Niklas Laxström <niklas.laxstrom(a)gmail.com> wrote:

...

On Mon, 14 Sep 2020 at 23:49, Tyler Cipriani (tcipriani(a)wikimedia.org) wrote:

The number of new tasks being created with this tag in a given week is outpacing the number of tasks being closed in a given week: this past week we added 41 tasks and only closed 22.

This is beginning to be unsustainable :(

Tyler Cipriani

3:35 p.m.

On Tue, Sep 15, 2020 at 5:24 AM Derk-Jan Hartman <d.j.hartman+wmf_ml(a)gmail.com> wrote:

...

The tasks that I mentioned in my original message are, likewise, tasks that I'm not sure belong to any team or any particular person. I have been using the phab tag/milestone "Release Engineering (Logspam)" to ensure that we don't lose track of tasks that are:

1. problems in production
2. tagged in phabricator with a team or component (in contrast to problems with unknown components/team tags)
3. no longer resourced or maintained in a discernible way

Feel free to apply that tag if those 3 conditions apply to these tasks. Tracking these will make it easier to raise awareness later.

Thanks!

-- Tyler
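
The three conditions above can be read as a simple predicate. The sketch below is illustrative only: the `Task` class and its fields are hypothetical stand-ins, not Phabricator's data model or API, and whether something is "maintained in a discernible way" still needs a human judgment call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical, simplified view of a task (not the real Phabricator model)."""
    is_production_error: bool
    team_or_component_tags: List[str] = field(default_factory=list)
    actively_maintained: bool = True


def qualifies_for_logspam_tag(task: Task) -> bool:
    """True only when all three conditions listed in the message hold."""
    return (
        task.is_production_error                 # 1. problem in production
        and len(task.team_or_component_tags) > 0  # 2. has a team/component tag
        and not task.actively_maintained          # 3. no longer resourced or maintained
    )


# Example: a production error tagged with a component nobody maintains.
print(qualifies_for_logspam_tag(Task(True, ["TMH"], False)))  # True
```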

Alex Ezell

3:43 p.m.

Hi y'all,

Do we use levels for any of these error log outputs? That is, are they classified on output as High, Medium, Low, Info, or something like that? Or do we have to triage each of them as we examine them?

I was just thinking that if they were somehow leveled, we could measure the number of each type and set targets for lowering the number of those kinds of logs. That would potentially help visualize and prioritize the work. It might be easier to say something like, "Let's have a goal to produce 10% fewer High errors in the next two months," than to have a more nebulous approach that seems to require Tyler or someone from his team to highlight tasks that are especially impactful.

I'm mostly ignorant of exactly how these processes work now, so if I'm telling y'all something you already know, forgive me. I was mostly thinking out loud about how we could start to approach the work more systematically.

Alex Ezell (he/him)
Senior Engineering Manager
Wikimedia Foundation

On Tue, Sep 15, 2020 at 10:36 AM Tyler Cipriani <tcipriani(a)wikimedia.org> wrote:

...

On Tue, Sep 15, 2020 at 5:24 AM Derk-Jan Hartman <d.j.hartman+wmf_ml(a)gmail.com> wrote:

> In particular I count 13 frontend problems with the old TMH kaltura player. There is clearly no intent to fix those (volunteer or employee), as the Kaltura player has been unmaintained for 8 years. The choices as far as I can tell are to ignore them, undeploy a/v playback or to direct C-level management to get the audio and video stuff together.

The tasks that I mentioned in my original message are, likewise, tasks that I'm not sure belong to any team or any particular person. I have been using the phab tag/milestone "Release Engineering (Logspam)" to ensure that we don't lose track of tasks that are:

1. problems in production
2. tagged in phabricator with a team or component (in contrast to problems with unknown components/team tags)
3. no longer resourced or maintained in a discernible way

Feel free to apply that tag if those 3 conditions apply to these tasks. Tracking these will make it easier to raise awareness later.

Thanks!

-- Tyler

_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
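
Alex's idea of counting errors per level and setting a reduction target could be sketched roughly as follows. The level names are the ones floated in his message; the sample data, function names, and the choice of integer rounding are assumptions made for illustration, and nothing here reflects how the production logs are actually classified.

```python
from collections import Counter


def count_by_level(levels):
    """Count logged errors per severity level."""
    return Counter(levels)


def reduction_target(current_count: int, percent: int = 10) -> int:
    """Largest count that still meets a `percent` reduction (integer math)."""
    return current_count * (100 - percent) // 100


# Hypothetical sample of levels attached to logged errors.
sample = ["High", "Low", "High", "Medium", "Info", "High"]
counts = count_by_level(sample)
print(counts["High"])        # 3

# "10% fewer High errors": with 40 High errors today, the goal is at most 36.
print(reduction_target(40))  # 36
```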

Brennen Bearnes

5:05 p.m.

On 9/15/20 9:43 AM, Alex Ezell wrote:

...

Do we use levels for any of these error log outputs? That is, are they classified on output as High, Medium, Low, Info, or something like that?

To an extent, yes. We have separate channels for PHP errors and exceptions, for example, and although I don't think we currently differentiate in logstash, maybe we could plausibly draw a further distinction between PHP error levels. Intuitively, a low number of PHP notices probably indicates something of lower severity than a high number of fatals, and so forth.

Teasing out more detail about reported error severity could be a useful exercise, but I'm not sure it would result in much more meaningful signals than we currently have about production health. Serious problems can manifest as trivial-seeming notices, some issues start out that way and cascade over time, and generally any form of recurring logspam needs human evaluation before we can easily say much more than "this is a problem".

...

Or do we have to triage each of them as we examine them?

Yeah. There are doubtless a lot of ways to improve the tooling we use for that process, but right now I think it would be most helpful if we just had more eyes _routinely_ on the logs and the workboard. (See Tyler's earlier and much more detailed/thoughtful response to this thread.)

-- Brennen Bearnes
Release Engineering

Tyler Cipriani

16 Sep

11:16 p.m.

On Tue, Sep 15, 2020 at 11:06 AM Brennen Bearnes <bbearnes(a)wikimedia.org> wrote:

...

On 9/15/20 9:43 AM, Alex Ezell wrote:

Do we use levels for any of these error log outputs? That is, are they classified on output as High, Medium, Low, Info, or something like that?

Teasing out more detail about reported error severity could be a useful exercise, but I'm not sure it would result in much more meaningful signals than we currently have about production health. Serious problems can manifest as trivial-seeming notices, some issues start out that way and cascade over time, and generally any form of recurring logspam needs human evaluation before we can easily say much more than "this is a problem".

This aligns with my view of our team's ability to assign meaningful priorities. High-level general knowledge about our deployment, errors, and error logging can't substitute for domain expertise. Teams with expertise in a particular codebase are best positioned to understand the impact of a particular message and derive a useful priority.

...

it would be most helpful if we just had more eyes _routinely_ on the logs and the workboard. (See Tyler's earlier and much more detailed/thoughtful response to this thread.)

+1

An interface between the log triage workboard and process with team/maintainer workflows is a missing component of assigning priorities.

There is a long developer feedback loop past integration. Hopefully, this process helps to shorten the feedback loop to developers and reduce the opacity of the process beyond integration through release and monitoring. Having the expertise of developers writing the code be a part of the deployment and monitoring of that code in production is the goal of this process and the key to its utility.

-- Tyler

Tyler Cipriani

15 Sep

3:27 p.m.

Hi! Thanks for the feedback, this is useful information.

On Tue, Sep 15, 2020 at 3:00 AM Niklas Laxström <niklas.laxstrom(a)gmail.com> wrote:

...

On Mon, 14 Sep 2020 at 23:49, Tyler Cipriani (tcipriani(a)wikimedia.org) wrote:

If there is an increase in the amount of real new issues and/or decrease in the amount of issues fixed, then I would be worried. Given what I said above, it's difficult to see if this is the case.

Indeed, a trendline for production quality is difficult to compare if a large backlog is being added.

...

Regardless, I do agree that we should aim to minimize production errors to make it easier to spot any new issues. I would encourage all maintainers and development teams to ensure that they have a regular process to check if they have and triage any production issues in code they maintain.

+100 to checking for production errors. It's my hope that folks who have code that is going out on a train are:

1. Aware their code is going to production that week
2. Watching for related logs and alerts (where possible)
3. Performing other software quality assurance activities on their code as it rolls out (manual testing, for example)

My assessment of risk as a person deploying software to production is necessarily linked to my view into quality assurance activities. If production errors are growing, I worry about sustainability.

The production error dashboard's past stability has provided assurances about shared awareness and priority of a given week's deployment. That is, I know there are software quality activities that take place sometime after code hits group0 or group1 or group2; however, much of that activity remains opaque. This is why this dashboard is crucial for deployment.

Having the explicit assurances of folks whose code is going to production that week would be preferable to any inference I can make from this dashboard. It's my hope that maintainers and teams triaging and grooming this dashboard will create an emergent process that can be used to provide real insight. That is, if we all are keeping this dashboard up-to-date collectively, it will be easier to see when quality assurance activities have taken place. Further, if we collectively fret over this dashboard then we'll share a collective awareness of anomalies.

...

Ending with a question: do we want to have both frontend and backend errors on the same tag/board, or should they be on separate ones?

That's a good question. I think that having a single workboard is nice as there are reporting features[0] that provide some insights about the overall health of production. Those insights are, as evidenced, only as good as their inputs, but they remain valuable to me. Additionally, a single tag may be used in saved searches and custom dashboards to make it easy to stay on top of issues seen in production (that is my hope, which may not align with how folks triage in practice).

Thanks for the feedback. This anomaly makes more sense to me than it did :)

-- Tyler

[0]: <https://phabricator.wikimedia.org/project/reports/1055/>

Krinkle

16 Sep

5:32 p.m.

On Tue, Sep 15, 2020 at 10:00 AM Niklas Laxström <niklas.laxstrom(a)gmail.com> wrote:

...

On Mon, 14 Sep 2020 at 23:49, Tyler Cipriani (tcipriani(a)wikimedia.org) wrote:

The number of new tasks being created with this tag in a given week is outpacing the number of tasks being closed in a given week: this past week we added 41 tasks and only closed 22.

The majority of the recently created tasks are frontend JavaScript errors. The logging of these errors has only started recently.

Aye, this is indeed a distraction currently. In talking with Tyler prior to this email I failed to highlight what I think the main area of concern is, which is indeed not just the total number of reports from this and last month.

Rather, my main concern is that over the past six months (incl. long before the JS stuff came along), we've fallen quite a bit behind in addressing ongoing production errors.

For example, of the 30-odd backend errors reported in June, 14 were still open a month later in July [1], and 12 were still open – three months later – in September. The majority of these haven't even been triaged, assigned, or otherwise acknowledged yet. And meanwhile we've got more (non-JavaScript) stuff from July, August and September adding pressure.

We have to do better.

-- Timo

[1] https://phabricator.wikimedia.org/phame/post/view/203/production_excellence…
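
The June figures above can be turned into rough "still open" shares. This is a small sketch for illustration: it treats "30 odd" as exactly 30, which is an assumption, and the function name is hypothetical.

```python
def still_open_share(reported: int, still_open: int) -> float:
    """Fraction of reported errors that remain open."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return still_open / reported


# June backend errors, taking "30 odd" as exactly 30 (an assumption).
print(round(still_open_share(30, 14) * 100))  # 47 (percent open after one month)
print(round(still_open_share(30, 12) * 100))  # 40 (percent open after three months)
```

Read this way, only two more tasks were closed between the one-month and three-month marks, which is the slowdown the message is pointing at.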

Dan Andreescu

17 Sep

1:04 a.m.

...

For example, of the 30-odd backend errors reported in June, 14 were still open a month later in July [1], and 12 were still open – three months later – in September. The majority of these haven't even been triaged, assigned, or otherwise acknowledged yet. And meanwhile we've got more (non-JavaScript) stuff from July, August and September adding pressure. We have to do better. -- Timo

This feels like it needs some higher level coordination. Like perhaps managers getting together and deciding production issues are a priority and diverting resources dynamically to address them. Building an awesome new feature will have a lot less impact if the users are hurting from growing disrepair. It seems to me like if individual contributors and maintainers could have solved this problem, they would have by now. I'm a little worried that the only viable solution right now seems like heroes stepping up to fix these bugs.

Concretely, I think expanding something like the Core Platform Team's clinic duty might work. Does anyone have a very rough idea of the time it would take to tackle 293 (wow we went up by a dozen since this thread started) tasks?

AntiCompositeNumber

3:24 a.m.

There is an impression among many community members, myself included, that Foundation development generally prioritizes new features over fixing existing problems. Foundation teams will sprint for a few months to put together a minimum viable product, release it, then move on to the new hotness, leaving user requests, bugfixes, and the like behind. It often seems that the only way to get a bug fixed is to get a volunteer developer to look at it. This is likely unintentional, but it happens nonetheless.

Putting a higher priority within the Foundation on cleaning up old toys before taking out new ones is necessary for the long-term stability of the projects.

ACN

On Wed, Sep 16, 2020 at 9:05 PM Dan Andreescu <dandreescu(a)wikimedia.org> wrote:

...

C. Scott Ananian

3:39 p.m.

ACN -- for what it's worth, I've been working for the foundation for a while now, and I can report from the inside that the trend is definitely in a positive direction. There is a lot more internal focus on addressing code debt and giving maintenance a fair spot at the table. (In fact, my entire team is sitting inside 'maintenance' now, apparently; we used to be 'platform evolution'.) This email thread is one visible aspect of that focus on code quality, not just features.

That said, the one aspect which hasn't improved much in my time at the foundation has been the tendency of teams to work in silos. This thread also seems to be a symptom of that: a bunch of production issues are being dropped on the floor ('not resolved in over a month') because they are falling between the silos and nobody knows who is best able to fix them. There are knowledge/expertise gaps among the silos as well: someone qualified to fix a DB issue might be at sea trying to track down a front end bug, and vice-versa---a number of generalists in the org could technically tackle a bug no matter where it lies, but it will take them much longer to grok an unfamiliar codebase than it would for someone more familiar with that silo. So bug triage is an increasingly technical task in its own right.

This thread, as I read it sitting inside the org, isn't so much asking for more attention to be paid to maintenance -- we're winning that battle, internally -- as it is a plea for those folks on the edges of their silos to keep an eye out for these things which are currently falling between them and help with the triage.

--scott, speaking only for myself and my view here

On Wed, Sep 16, 2020 at 11:25 PM AntiCompositeNumber <anticompositenumber(a)gmail.com> wrote:

...

-- (http://cscott.net)

Ed Sanders

22 Sep

4:58 p.m.

Speaking specifically about the new JavaScript error logging, and specifically to Alex's point about triaging these tasks, it would be very helpful if the reports included some indication of how often the error is occurring.

For example, VisualEditor is loaded several hundred thousand times per day. If an error has occurred 4 times in the last 30 days (based on a recent example) then it is probably very low priority.

On Thu, 17 Sep 2020 at 16:40, C. Scott Ananian <cananian(a)wikimedia.org> wrote:

...
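
Ed's point about frequency can be made concrete by normalizing an error count against load volume. The sketch below is an illustration under stated assumptions: the message only says "several hundred thousand" loads per day, so the 300,000 figure is a hypothetical placeholder, and the function name is invented for this example.

```python
def errors_per_million_loads(error_count: int, loads_per_day: int, days: int) -> float:
    """Normalize an error count by the number of loads in the same period."""
    total_loads = loads_per_day * days
    if total_loads <= 0:
        raise ValueError("total loads must be positive")
    return error_count / total_loads * 1_000_000


# 4 errors in 30 days, with a hypothetical 300,000 loads per day.
rate = errors_per_million_loads(4, 300_000, 30)
print(f"{rate:.2f} errors per million loads")  # roughly 0.44
```

A rate like this only says how often an error occurs, not how much harm each occurrence does, so it is one input to triage rather than a priority on its own.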

Jon Robson

23 Sep

2:45 p.m.

I'd be careful about using numbers in triage right now. The numbers are a little misleading as the error logging is only enabled on smaller wikis. Also, if an error results in data loss but only impacts a small number of people, I would say that's worse than a benign error that occurs for lots.

We rolled out to Spanish, German and Japanese Wikipedia yesterday, so these numbers will start becoming more useful, but English Wikipedia will severely skew these numbers when we finally enable it.

On Tue, Sep 22, 2020, 9:59 AM Ed Sanders <esanders(a)wikimedia.org> wrote:

...

participants (11)

Alex Ezell
AntiCompositeNumber
Brennen Bearnes
C. Scott Ananian
Dan Andreescu
Derk-Jan Hartman
Ed Sanders
Jon Robson
Krinkle
Niklas Laxström
Tyler Cipriani