My cable modem is being super flaky again today. I'll leave Hangouts
running on my phone in case anybody needs to ping me.
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Hello,
I am wondering how we ended up with a second mailing list for our mwcore team.
We had one created back in October to facilitate cross-team
communication: mwcore-l(a)wikimedia.org
And I found out that my filter also grabbed mails from a second mailing
list: mediawiki-core(a)lists.wikimedia.org
I suspect the first is internal with no archive and the second on
lists.wm.o is public/archived.
If we are to keep both, I find them totally confusing.
I do not see the use for a public list since we are all using wikitech-l
already.
Sorry if I missed something.
--
Antoine "hashar" Musso
Let's plan an offsite this calendar year for the team...I don't think
we've ever done one just for us. Somewhere not the Bay Area.
-Chad
---------- Forwarded message ----------
From: Erik Moeller <erik(a)wikimedia.org>
Date: Tue, Mar 18, 2014 at 10:23 AM
Subject: [Engineering] 3/18 - this week in WMF engineering
To: Development and Operations Engineers <engineering(a)lists.wikimedia.org>
[snip]
Upcoming:
* Ops offsite coming up April 8 - 11 in Athens, Greece
* Language Eng face-to-face May 1-10 in Valencia, Spain
On Thursday every week a new WMF branch is cut and deployed to the
group0 wikis (test* and wm.o). On the following Tuesday it is promoted
to the group1 wikis (everything that isn't a Wikipedia). Finally, on
Thursday it is promoted to group2 (the Wikipedias) while the group0
wikis start using another new version. At the current release cadence
(one new branch a week), a branch is no longer used after 2 weeks in
production. There can be minor exceptions to this due to major
difficulties with a branch and/or holiday conflicts, but for the sake
of this discussion those exceptions can mostly be ignored.
A branch can't be deleted from the server cluster immediately after it
is removed from the last wiki, however. For better or worse, each
branch contains static assets from core (resources & skins) and
extensions that are served by the apaches. These assets are served
using versioned URLs such as
https://bits.wikimedia.org/static-1.23wmf17/skins/common/images/poweredby_m….
Varnish caches pages containing these URLs for anons for up to 30
days. That means that static content from the 1.23wmf17 branch could
still be requested from the apaches for up to 30 days after that
branch is no longer serving PHP-backed requests. Assuming the weekly
release cadence, this means that the static assets from a branch are
needed on the cluster for at least 45 days (14 days of active branch
use + 31 days of cached page use).
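To make the retention math concrete, here's a back-of-the-envelope
sketch (the constants come straight from the cadence described above):

    from datetime import date, timedelta

    ACTIVE_DAYS = 14  # Thursday group0 cut through Thursday group2 retirement
    CACHE_DAYS = 31   # up to 30 days of varnish-cached pages, plus a day of slop

    def earliest_cleanup(branch_cut):
        """Earliest date a branch's static assets can safely be deleted."""
        return branch_cut + timedelta(days=ACTIVE_DAYS + CACHE_DAYS)

    # A branch cut on 2014-03-06 shouldn't be removed before 2014-04-20:
    print(earliest_cleanup(date(2014, 3, 6)))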
At the moment we don't have a well-documented procedure for cleaning
up old branches on tin and servers that rsync with tin (directly and
indirectly). It seems to be a process that Sam does occasionally. The
last commits that cleaned up old branches were merged on 2014-02-15:
https://gerrit.wikimedia.org/r/#/c/113640/, https://gerrit.wikimedia.org/r/#….
These commits cleaned up some truly ancient branches.
A slightly different but related problem is the amount of disk space
consumed by the l10n cache files for unused MW versions. The combined
json and CDB files for the current 1.23 branches consume ~1.7G per
version. It looks like Sam has been pruning these at some point as
well, as the cache/l10n directories for version 1.23wmf12 and earlier
are empty.
I recommend that we add two new weekly cleanup steps (rough sketch
after the list):
* When we deploy a new branch to group0 (Thursdays), all branches
retired more than 5 weeks ago should be removed. This should really
only involve multiple branches the first time it's done, to catch up.
After that it will be an "add a branch, kill a branch" situation. With
the current release cadence this will keep us at 7 checked-out
branches on tin: 2 versions in active use and 5 waiting for potential
cache references to expire.
* When we move group1 to the newest branch (Tuesdays), the cache/l10n
directory of all non-active branches should be purged. By this point
there is little chance that we will be reverting the wikipedias to the
N-2 branch and thus the l10n cache is just taking up disk space and
slowing down rsync comparisons.
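A minimal sketch of what those two steps might look like (the staging
path, naming pattern, and active-branch discovery are assumptions; the
real thing would live in the deployment tooling):

    import os
    import re
    import shutil

    STAGE = '/a/common'  # assumed staging directory on tin
    KEEP = 7             # 2 active branches + 5 waiting for caches to expire

    def branch_dirs():
        """All php-1.XXwmfYY checkouts, oldest first."""
        dirs = [d for d in os.listdir(STAGE)
                if re.match(r'php-1\.\d+wmf\d+$', d)]
        return sorted(dirs,
                      key=lambda d: [int(n) for n in re.findall(r'\d+', d)])

    def prune_old_branches():
        """Thursday step: drop everything but the newest KEEP branches."""
        for d in branch_dirs()[:-KEEP]:
            shutil.rmtree(os.path.join(STAGE, d))

    def purge_inactive_l10n(active_versions):
        """Tuesday step: empty cache/l10n for branches not in active use."""
        for d in branch_dirs():
            l10n = os.path.join(STAGE, d, 'cache', 'l10n')
            if d not in active_versions and os.path.isdir(l10n):
                for f in os.listdir(l10n):
                    os.unlink(os.path.join(l10n, f))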
Are there any objections to adding these procedures to the MW deploy process?
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
This week's checklist is in my personal bug tracker [0]. Highlights
and comments below.
* mw-update-l10n continues to have problems dealing with a new branch.
This week I tried adding a step to bootstrapping that removed the stub
English l10n file [1] before running the full
rebuildLocalisationCache.php step. I missed the file permission
protection on this file in the first patch and chose to cancel the
scap and correct it with a new patch [2]. On re-running scap the code
executed as desired, but I still ended up syncing an incomplete
English l10n file to the cluster. Broken l10n on mw.o was reported by
several people on IRC at ~21:00Z. I re-ran a full scap and was able to
confirm that l10n was fixed.
I have the before and after json dumps of the en l10n cache in my home
directory on tin but haven't had time to dig into them very deeply.
What was obvious from the errors seen is that some extension l10n was
not picked up. Confusingly, this didn't seem to affect all extensions.
The saga of this problem is chronicled in bug 51174 [3]. While I was
writing this up I had a minor epiphany about a potential fix:
rebuildLocalisationCache.php has a `--force` option that could be used
after stubbing the pre-extension l10n file, rather than trying to
clean things up afterwards. I'll make a patch to try that before next
week's deploy; there's a rough sketch at the end of these notes.
* Creating the on-wiki deploy notes is still a PITA. I made some
changes to make-deploy-notes this week [4] that fixed my problems with
generating a blank report. Sam looked at the report I generated and
found it to be lacking, however [5]. The diff looks funny (a big hunk
missing in the middle) and may have been caused by my use of
cut-and-paste to publish the report. I was having problems
authenticating to the API to upload directly, but have tracked that
down to PEBKAC (problem exists between keyboard and chair): I was
trying to use an old password to authenticate to mw.o.
* There were some errors in the fatal log caused by the merge of the
pmtpa dsh cleanup [6] and subsequent wmf-config [7] patches. srv270,
mw31 and mw40 were barfing because they no longer had the files needed
to answer the icinga checks. Mutante and I played whack-a-mole with
touching missing files, which just moved the problem around, until I
scp'd the prior *-pmtpa.php files back to these hosts. Afterwards I
submitted a patch to remove the pmtpa rsync slaves [8] and the
snapshot[1234] hosts that had snuck back into the dsh group [9].
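For the record, the `--force` idea from the first note might look
something like this in the l10n bootstrap (the wiki name and paths are
illustrative only):

    import subprocess

    VERSION = '1.23wmf17'  # whichever branch is being bootstrapped

    # After building the bootstrap English-only cache so that
    # mergeMessageFileList.php can run, force a full rebuild instead
    # of trying to clean the stub up afterwards.
    subprocess.check_call([
        'mwscript', 'rebuildLocalisationCache.php',
        '--wiki=testwiki',  # any wiki already on the target version
        '--outdir=/a/common/php-%s/cache/l10n' % VERSION,
        '--force',          # rebuild even if the cache looks up to date
    ])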
[0]: https://github.com/bd808/wmf-kanban/issues/61
[1]: https://gerrit.wikimedia.org/r/#/c/117154/
[2]: https://gerrit.wikimedia.org/r/#/c/117236/
[3]: https://bugzilla.wikimedia.org/show_bug.cgi?id=51174
[4]: https://gerrit.wikimedia.org/r/#/q/status:merged+project:mediawiki/tools/re…
[5]: https://www.mediawiki.org/w/index.php?title=MediaWiki_1.23%2Fwmf17%2FChange…
[6]: https://gerrit.wikimedia.org/r/#/c/108070/
[7]: https://gerrit.wikimedia.org/r/#/c/116036/
[8]: https://gerrit.wikimedia.org/r/#/c/117244/
[9]: https://gerrit.wikimedia.org/r/#/c/117326/
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
My wife is not feeling well, so I'm going to spend my son's waking hours
helping her out. I will make up the hours later tonight.
---
Ori Livneh
ori(a)wikimedia.org
https://bugzilla.wikimedia.org/show_bug.cgi?id=46014
The common thread here seems to be that when a vandal edit is reverted
within seconds, sometimes the page content from the vandal edit is
served while the oldids in the skin data (JS, print footer, etc.)
indicate the revert.
This doesn't seem to have anything to do with missed cache invalidation via
the API:
* A UI edit calls EditPage::attemptSave, which calls
EditPage::internalAttemptSave and then processes the status without doing
any additional cache checks. EditPage::internalAttemptSave itself calls
WikiPage::doEditContent, then just checks the return value and doesn't do
any additional cache checks.
* An API edit calls EditPage::internalAttemptSave and then processes the
status without doing any additional cache checks.
* An API rollback calls WikiPage::doRollback, which calls
WikiPage::commitRollback, which calls WikiPage::doEditContent and doesn't
do any other cache checks.
The fact that the page content is from an old revision but the skin
data is new makes me suspect a parser cache issue. Gerrit change 85917
probably improved the situation but didn't completely eliminate it,
and with the number of reversions ClueBot does it probably manages to
hit some race occasionally.
Looking at the parser cache handling:
* WikiPage::doEditContent updates page_touched, then calls
WikiPage::doEditUpdates which saves the just-parsed revision to the parser
cache.
* PoolWorkArticleView also saves its parsed data into the parser cache.
* ApiPurge with forcelinksupdate saves its parsed data into the parser
cache.
* RefreshLinksJob saves its parsed data into the parser cache, if it took
over 1 second to parse.
I suspect there's a race where the vandal saves their edit, then gets
redirected back to action=view on the article, and makes it into the
branch of Article::view that uses PoolWorkArticleView to reparse the
text. Article pre-fetches the content to be parsed and passes it into
PoolWorkArticleView. Then, before the PoolWorkArticleView gets a
chance to run, ClueBot comes along and reverts, updating the parser
cache in the process. Then the PoolWorkArticleView gets to run,
re-parses the vandal's content, and saves that into the parser cache,
replacing the newer version.
If this really is what's happening, it might be enough to just have
PoolWorkArticleView store a timestamp in __construct (maybe only when
$content is passed) instead of calculating the timestamp in doWork. Or we
could look at storing the revid in the parser cache (fetch it from the
$page passed to ParserCache::save if it's not explicitly provided, for BC)
and consider the parser cache entry stale if the stored revid doesn't match
the current revid, just like we do with the page_touched timestamp now.
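In pseudocode, the second idea amounts to something like this (names
are illustrative, not MediaWiki's actual API):

    def parser_cache_entry_is_fresh(entry, page):
        """Decide whether a cached ParserOutput may be reused."""
        if entry.cache_time < page.touched:
            return False  # the existing page_touched check
        if entry.rev_id is not None and entry.rev_id != page.latest_rev_id:
            return False  # proposed: parsed from a different revision
        return True  # a missing rev_id means an old entry, allowed for BC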
--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
On Mon, Mar 3, 2014 at 9:25 AM, Antoine Musso <amusso(a)wikimedia.org> wrote:
> > * Cutting the branch should be automatic. Jenkins could do this easily
> > and make the timing predictable for all parties.
>
> Can you file that as a bug under either the Deployment or Continuous
> Integration component? I would be more than happy to pair with someone
> to craft the job.
>
> We might want to have that job on a secured/private Jenkins instead of
> the CI one though.
I know we talked about maybe someday doing this, as it would solve a lot of
issues. Is this something we should wait on? It would definitely help with
the tarball process too, and as I was thinking about it, it seems like this
is something a few of us could do pretty quickly. I don't actually know
Gerrit or Jenkins, but it seems like these would all be things someone on
our team has done before, so it should just be a matter of banging it out
instead of trying to do something new.
Assuming ops was able to get the hardware, all we need to do is:
* We set up a machine (well, 2, mirrored somehow for redundancy) on
an isolated network inside the cluster
* It runs Gerrit and Jenkins
* We merge in new patches from the current git repo automatically
** If there's a conflict with a security patch, alert/page/etc. (this is
the major downside I see: if someone is trying to merge an emergency patch
to deploy, but it conflicts with an existing security patch, and it's
midnight in SFO, we will have problems; sketch after the list)
* Security patches are submitted and +2'ed in the Gerrit instance
* Jenkins cuts the weekly (or daily) branches on some schedule and tarballs
when new tags are pushed
* Tin points to the private repo to get its code, and a cron job creates
the new directory structure when new wmf branches are available.
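The auto-merge step with conflict alerting could be as dumb as this
sketch (the repo URL, working directory, and alert hook are all
assumptions):

    import subprocess

    UPSTREAM = 'https://gerrit.wikimedia.org/r/mediawiki/core'  # public repo
    WORKDIR = '/srv/private-mediawiki'  # assumed clone carrying security patches

    def sync_from_public():
        """Pull the public master into the private repo; scream on conflict."""
        subprocess.check_call(['git', 'fetch', UPSTREAM, 'master'], cwd=WORKDIR)
        merged = subprocess.call(['git', 'merge', '--no-edit', 'FETCH_HEAD'],
                                 cwd=WORKDIR)
        if merged != 0:
            subprocess.call(['git', 'merge', '--abort'], cwd=WORKDIR)
            alert('public master conflicts with a security patch; merge by hand')

    def alert(message):
        # stand-in for whatever paging/IRC/email mechanism ops prefers
        raise RuntimeError(message)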
Am I way off in thinking this isn't actually that difficult? Maybe something
we could schedule for Q2?
The checklist I ran through is in my personal bug tracker [0]. Almost
everything worked and almost everything went smoothly. Almost.
Here's some of the notes I made as things went along:
* Running make-wmf-branch on bast1001 needed a change to the build dir
because reedy already owns the default dir there.
** Since the first thing the script does is `rm -rf` that dir, I think
we could do something smarter, like prompting to delete after
completion and/or adding the script's pid to the dir name.
* Cutting the branch should be automatic. Jenkins could do this easily
and make the timing predictable for all parties.
* It seems like the php-1.XwmfY checkouts on tin could be either
shallow or single-branch checkouts. I think Chad started playing with
having multiple working copies that share the same repository, which
might be even nicer.
* Speaking of Chad's prototype work, /a/common/php-git makes
`updateWikiversions` throw a warning: "updateBitsBranchPointers: link
target /usr/local/apache/common-local/php-git/skins does not exist."
* I tried to make a script to automate copying security patches from
one branch checkout to the next. It sort of worked. `git apply` wasn't
smart enough to figure out that the patches I pulled off of the wmf15
checkout were already applied in the wmf16 branch. It would be nice to
figure this out and get it automated, or to find a better way to
manage security patches in general (see the sketch at the end of these
notes).
* Creating the on-wiki deploy notes is a PITA. I read the script a
couple of times and tried running it on my own and with tips from Sam.
I never did get it to work for me (I kept getting empty output). Sam
ran it and it worked fine, but he said "I recall the script being
temperamental". We should definitely make this an automated job in
Jenkins or elsewhere. Nobody should have to babysit this kind of
communications process. (The cobbler's children have no shoes.)
* My ssh-agent (OS X 10.8.5) croaked badly when trying to run
sync-wikiversions. This seems to be triggered by the full-fanout (not
batched) dsh call. Aaron had to step in and run both sync-wikiversions
invocations for me.
* l10n sync went badly again. wmf16 got partial en l10n data, and then
my `scap --versions php-1.23wmf16` to fix it blew up badly in the
scap-rebuild-cdbs step:
** Bug 62018 [1] - scap-rebuild-cdbs fails when scap is called with
the `--versions` command line flag - is in the Python scap code, I'm
pretty sure, and I'll get on fixing that.
** Bug 51174 [2] - Scap broken for deploying new versions of MediaWiki
due to ExtensionMessage file not being created - it looks like the
things I added in I5467ac8 [3] were necessary but not sufficient to
fix this. I stupidly didn't save a copy of the first .json files, but
wmf16 didn't get a full English l10n cache in the CDBs until a second
scap was run. It seems likely to me that this is related to the
"bootstrap" en l10n build that I put in there to get
mergeMessageFileList.php to run.
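On the security patch copying above, `git apply`'s check modes might
be enough to detect already-applied patches; a sketch (the patch and
checkout paths are made up):

    import glob
    import subprocess

    CHECKOUT = '/a/common/php-1.23wmf16'        # assumed target checkout
    PATCHES = '/srv/patches/1.23wmf15/*.patch'  # assumed export from wmf15

    def copy_security_patches():
        for patch in sorted(glob.glob(PATCHES)):
            # Applies cleanly? Then apply it.
            if subprocess.call(['git', 'apply', '--check', patch],
                               cwd=CHECKOUT) == 0:
                subprocess.check_call(['git', 'apply', patch], cwd=CHECKOUT)
            # Reverse-applies cleanly? Then it's already in the branch.
            elif subprocess.call(['git', 'apply', '--reverse', '--check',
                                  patch], cwd=CHECKOUT) == 0:
                print('%s already applied, skipping' % patch)
            else:
                raise RuntimeError('%s needs a manual rebase' % patch)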
Things actually went pretty smoothly though. Thanks a lot to Chad and
Sam for helping me make a checklist, and to Aaron for being around to
lend a hand when I fell and couldn't get up.
[0]: https://github.com/bd808/wmf-kanban/issues/57
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=62018
[2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=51174
[3]: https://gerrit.wikimedia.org/r/#/c/113260/
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855