[Engineering] Update from the Wikimedia Performance Team

Tue Feb 2 19:14:17 UTC 2016

Hi,

This is the monthly report from the Wikimedia Performance Team for January
2016.

## Our progress ##

### Multi-datacenter

* The central login system started to use DB slaves for some actions
instead of the master DB.
* MediaWiki Special pages and Action classes now support defining DB query
and write expectations (with logging for violations thereof).
* Cleanup of haphazard database transaction methods is largely finished in
MediaWiki core and extensions.
* Lock acquisition time reduced. The previous logic was inefficient and
wasted time. The fix reduced time spent in the backend when saving edits.
* "Rebound purges" enabled in production. To compensate for DB lag, a
secondary purge for articles avoids stale content in Varnish. –
https://phabricator.wikimedia.org/T113192
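
To illustrate the "rebound purge" idea from the last item: the page is
purged once right away and once more after a delay, so that a copy
re-cached from a lagged DB slave is evicted again. This is only a minimal
sketch and not the MediaWiki implementation; the class name, method names
and the delay value are assumptions made up for the example.

```python
import heapq


class ReboundPurgeQueue:
    """Toy scheduler: every purge is queued twice, now and after a delay."""

    def __init__(self, rebound_delay):
        # rebound_delay is a hypothetical setting, in seconds. It should be
        # longer than the replication lag one wants to compensate for.
        self.rebound_delay = rebound_delay
        self._heap = []  # entries are (due_time, url)

    def enqueue(self, url, now):
        # Primary purge: due immediately.
        heapq.heappush(self._heap, (now, url))
        # Rebound purge: due once lagged slaves have had time to catch up.
        heapq.heappush(self._heap, (now + self.rebound_delay, url))

    def pop_due(self, now):
        """Return the URLs whose purge is due at or before `now`."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

A caller would send one cache purge request for each URL returned by
pop_due(). The second purge is redundant when there is no lag, which is
the price paid for not serving stale content when there is.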

### Navigation Timing
* We experimented with creating a new metric, "Time to first image" (how
long it takes for the principal image to show). Based on video capture, we
were unable to correlate User Timing API measurements with when an image
actually becomes visible. To be revisited at a later time. –
https://phabricator.wikimedia.org/T115600

### Media handling (Thumbor)
* Added cgroup support for controlling resource consumption.
* Implemented ability to pre-render multiple thumbnail sizes with a single
request.
* Video thumbnails render faster by loading only the relevant frame from
Swift (not the whole video).
* Working on a strategy for Thumbor deployment in Wikimedia production.
Thumbor is stateless and acts as a drop-in replacement for the current
MediaWiki PHP image scalers. – https://phabricator.wikimedia.org/T121388
* Our Thumbor plugins have been consolidated into a single Git repo and
moved from Gerrit to Phabricator Diffusion.

### ResourceLoader
* Work has started on the solution for the cache performance problem with
static MediaWiki resources. – https://phabricator.wikimedia.org/T99096

### Metric dashboards
* Earlier this month, Timo Tijhof gave a tech talk on "Creating Useful
Dashboards with Grafana" – https://www.youtube.com/watch?v=UlL6UoRUQAM
* New dashboard: https://grafana.wikimedia.org/dashboard/db/edit-count
(Global edit rate of Wikimedia wikis)
* New dashboard:
https://grafana.wikimedia.org/dashboard/db/time-to-first-byte (Navigation
Timing "responseStart" metric)

## How are we doing? ##

### Metrics

Client-side performance has remained stable over the past month. Save
Timing has also remained stable, around the 1s median mark.

Backend Save Processing Timing has improved slightly and was consistently
45ms (median) and 95ms (p75) lower in January compared to December.
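
For readers less familiar with the notation: "median" is the 50th
percentile and "p75" is the 75th percentile of the timing samples. The
sketch below shows one way such a month-over-month difference could be
computed. It is not how the dashboards compute it; the function names are
invented for the example, and the choice of the "inclusive" quantile
method is an assumption.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Return the median and 75th percentile of timing samples (in ms)."""
    # quantiles(n=4) returns the three quartile cut points; the last one
    # is the 75th percentile.
    _, _, p75 = quantiles(samples_ms, n=4, method="inclusive")
    return {"median": median(samples_ms), "p75": p75}


def improvement(before_ms, after_ms):
    """Positive values mean `after_ms` is faster than `before_ms`."""
    before, after = summarize(before_ms), summarize(after_ms)
    return {name: before[name] - after[name] for name in before}
```

Calling improvement() with one list of samples per month returns how many
milliseconds lower the second month was at each of the two percentiles.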

https://performance.wikimedia.org/#!/month
https://grafana.wikimedia.org/dashboard/db/save-timing?from=now-50d
https://grafana.wikimedia.org/dashboard/db/navigation-timing?from=now-50d
https://grafana.wikimedia.org/dashboard/db/time-to-first-byte?from=now-50d

### Job Queue

On 1 January, the Job Queue started growing rapidly with htmlCacheUpdate
jobs. This was mitigated on 21 January by adding a dedicated runner for
that job type. Total queue size reached over 7 million before the dedicated
runner went live (typical size is under 300K). –
https://phabricator.wikimedia.org/T124194. The cause of the increase in
those jobs is still being investigated at
https://phabricator.wikimedia.org/T124418.

https://grafana.wikimedia.org/dashboard/db/job-queue-health?from=now-50d

Until the next time,
Gilles, Peter, Aaron, Ori, and Timo.

https://www.mediawiki.org/wiki/Wikimedia_Performance_Team