Every Thursday a new WMF branch is cut and deployed to the group0
wikis (test* and mw.o). On the following Tuesday it is promoted to the
group1 wikis (everything except the wikipedias). Finally, on the next
Thursday it is promoted to group2 (wikipedias) while the group0 wikis
start using another new version. At the current release cadence (one
new branch a week), a branch is no longer used after 2 weeks in
production. There can be minor
exceptions to this due to major difficulties with a branch and/or
holiday conflicts, but for the sake of this discussion those
differences can be mostly ignored.

However, a branch can't be deleted from the server cluster immediately
after it is removed from the last wiki. For better or worse, each
branch contains static assets from core (resources & skins) and
extensions that are served by the apaches. These assets are served
using versioned URLs such as
https://bits.wikimedia.org/static-1.23wmf17/skins/common/images/poweredby_m….
Varnish caches pages containing these URLs for anons for up to 30
days. That means that static content from the 1.23wmf17 branch could
be needed to satisfy an apache request for up to 30 days after that
branch is no longer being used to satisfy PHP-backed requests.
Assuming the weekly release cadence, this means that
the static assets from a branch are needed on the cluster for at least
45 days (14 days of active branch use + 31 days of cached page use).
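
As a rough sketch of the arithmetic above, the earliest date a branch's
static assets stop being needed can be computed from its cut date. The
function name and the example date are made up for illustration; the
14 and 31 day figures are the ones quoted above.

```python
from datetime import date, timedelta

ACTIVE_DAYS = 14   # weekly cadence: a branch serves PHP requests for about 2 weeks
CACHED_DAYS = 31   # cached page use, as counted in the paragraph above
RETENTION_DAYS = ACTIVE_DAYS + CACHED_DAYS  # 45 days in total


def earliest_safe_removal(cut_date: date) -> date:
    """Return the first date on which cached pages should no longer
    reference static assets from a branch cut on `cut_date`."""
    return cut_date + timedelta(days=RETENTION_DAYS)


# Illustrative only: a branch cut on Thursday 2014-03-06.
print(earliest_safe_removal(date(2014, 3, 6)))
```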

At the moment we don't have a well-documented procedure for cleaning
up old branches on tin and servers that rsync with tin (directly and
indirectly). It seems to be a process that Sam does occasionally. The
last commits that cleaned up old branches were merged on 2014-02-15:
https://gerrit.wikimedia.org/r/#/c/113640/, https://gerrit.wikimedia.org/r/#….
These commits cleaned up some truly ancient branches.

A slightly different but related problem is the amount of disk space
consumed by the l10n cache files for unused MW versions. The combined
json and CDB files for the current 1.23 branches consume ~1.7G per
version. It looks like Sam has been pruning these at some point as
well, since the cache/l10n directories for versions 1.23wmf12 and
earlier are empty.
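
To see how much space each version's l10n cache takes, something like
the following could be run against the directory holding the branch
checkouts. This is only a sketch: it assumes each branch is checked out
in its own subdirectory containing cache/l10n, and the function names
and the staging root argument are hypothetical.

```python
from pathlib import Path


def l10n_cache_bytes(branch_dir: Path) -> int:
    """Total size in bytes of the files under <branch_dir>/cache/l10n."""
    cache = branch_dir / "cache" / "l10n"
    if not cache.is_dir():
        return 0
    return sum(f.stat().st_size for f in cache.rglob("*") if f.is_file())


def l10n_cache_usage(staging_root: Path) -> dict[str, int]:
    """Map each subdirectory of staging_root to its l10n cache size."""
    return {
        d.name: l10n_cache_bytes(d)
        for d in sorted(staging_root.iterdir())
        if d.is_dir()
    }
```

A version whose cache has already been pruned shows up with a size of 0.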

I recommend that we add two new weekly cleanup steps:
* When we deploy a new branch to group0 (Thursdays), all branches
retired more than 5 weeks ago should be removed. This should really
only include multiple branches the first time it's done to catch up.
After that it will be an "add a branch, kill a branch" situation. With
the current release cadence this will keep us at 7 checked-out
branches on tin: 2 versions in active use and 5 waiting for potential
cache references to expire.
* When we move group1 to the newest branch (Tuesdays), the cache/l10n
directory of all non-active branches should be purged. By this point
there is little chance that we will be reverting the wikipedias to the
N-2 branch and thus the l10n cache is just taking up disk space and
slowing down rsync comparisons.
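
The selection logic for the two steps above could look roughly like
this. It is a sketch, not an existing tool: the function names, the
branch names, and the retirement dates used with it are all made up,
and it only decides what to clean up without touching the filesystem.

```python
from datetime import date, timedelta

RETENTION = timedelta(weeks=5)  # "retired more than 5 weeks ago"


def branches_to_remove(retired_on: dict[str, date], today: date) -> list[str]:
    """Thursday step: branches whose retirement date is more than
    5 weeks before `today`. `retired_on` maps a branch name to the
    date it was removed from the last wiki."""
    return [b for b, d in retired_on.items() if today - d > RETENTION]


def l10n_to_purge(checked_out: list[str], active: set[str]) -> list[str]:
    """Tuesday step: every checked-out branch that is not in active
    use should have its cache/l10n directory purged."""
    return [b for b in checked_out if b not in active]
```

Called on a Thursday with a branch retired 6 weeks earlier and one
retired exactly 5 weeks earlier, branches_to_remove returns only the
first, since "more than 5 weeks" is treated as a strict comparison.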

Are there any objections to adding these procedures to the MW deploy process?
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855