My cable modem is being super flaky again today. I'll leave Hangouts
running on my phone in case anybody needs to ping me.
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Hello,
I am wondering how we ended up with a second mailing list for our mwcore team.
We had one created back in October to facilitate cross-team
communication: mwcore-l(a)wikimedia.org
And I found out that my filter also grabbed mails from a second mailing
list: mediawiki-core(a)lists.wikimedia.org
I suspect the first is internal with no archive and the second on
lists.wm.o is public/archived.
If we are to keep both, I find them totally confusing.
I do not see the use for a public list since we are all using wikitech-l
already.
Sorry if I missed something.
--
Antoine "hashar" Musso
Let's plan an offsite this calendar year for the team...I don't think
we've ever done one just for us. Somewhere not the Bay Area.
-Chad
---------- Forwarded message ----------
From: Erik Moeller <erik(a)wikimedia.org>
Date: Tue, Mar 18, 2014 at 10:23 AM
Subject: [Engineering] 3/18 - this week in WMF engineering
To: Development and Operations Engineers <engineering(a)lists.wikimedia.org>
[snip]
Upcoming:
* Ops offsite coming up April 8 - 11 in Athens, Greece
* Language Eng face-to-face May 1-10 in Valencia, Spain
On Thursday every week a new WMF branch is cut and deployed to the
group0 wikis (test* and wm.o). On the following Tuesday it is promoted
to the group1 wikis (everything that isn't a Wikipedia). Finally, on
Thursday it is promoted to group2 (the Wikipedias) while the group0
wikis start using another new version. At the current release cadence
(one new branch a week), a branch is no longer used after 2 weeks in
production. There can be minor exceptions to this due to major
difficulties with a branch and/or holiday conflicts, but for the sake
of this discussion those exceptions can mostly be ignored.
A branch can't be deleted from the server cluster immediately after it
is removed from the last wiki, however. For better or worse, each
branch contains static assets from core (resources & skins) and
extensions that are served by the apaches. These assets are served
using versioned URLs such as
https://bits.wikimedia.org/static-1.23wmf17/skins/common/images/poweredby_m….
Varnish caches pages containing these URLs for anons for up to 30
days. That means that static content from the 1.23wmf17 branch could
still be requested from the apaches for up to 30 days after that
branch is no longer serving PHP-backed requests. Assuming the weekly
release cadence, this means that the static assets from a branch are
needed on the cluster for at least 45 days (14 days of active branch
use + 31 days of cached page use).
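To make the retention math concrete, here's a back-of-the-envelope
sketch (the constants come straight from the cadence described above):

    from datetime import date, timedelta

    ACTIVE_DAYS = 14  # Thursday group0 cut through Thursday group2 retirement
    CACHE_DAYS = 31   # up to 30 days of varnish-cached pages, plus a day of slop

    def earliest_cleanup(branch_cut):
        """Earliest date a branch's static assets can safely be deleted."""
        return branch_cut + timedelta(days=ACTIVE_DAYS + CACHE_DAYS)

    # A branch cut on 2014-03-06 shouldn't be removed before 2014-04-20:
    print(earliest_cleanup(date(2014, 3, 6)))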
At the moment we don't have a well-documented procedure for cleaning
up old branches on tin and servers that rsync with tin (directly and
indirectly). It seems to be a process that Sam does occasionally. The
last commits that cleaned up old branches were merged on 2014-02-15:
https://gerrit.wikimedia.org/r/#/c/113640/, https://gerrit.wikimedia.org/r/#….
These commits cleaned up some truly ancient branches.
A slightly different but related problem is the amount of disk space
consumed by the l10n cache files for unused MW versions. The combined
json and CDB files for the current 1.23 branches consume ~1.7G per
version. It looks like Sam has been pruning these at some point as
well, as the cache/l10n directories for version 1.23wmf12 and earlier
are empty.
I recommend that we add two new weekly cleanup steps (rough sketch
after the list):
* When we deploy a new branch to group0 (Thursdays), all branches
retired more than 5 weeks ago should be removed. This should really
only involve multiple branches the first time it's done, to catch up.
After that it will be an "add a branch, kill a branch" situation. With
the current release cadence this will keep us at 7 checked-out
branches on tin: 2 versions in active use and 5 waiting for potential
cache references to expire.
* When we move group1 to the newest branch (Tuesdays), the cache/l10n
directory of all non-active branches should be purged. By this point
there is little chance that we will be reverting the wikipedias to the
N-2 branch and thus the l10n cache is just taking up disk space and
slowing down rsync comparisons.
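A minimal sketch of what those two steps might look like (the staging
path, naming pattern, and active-branch discovery are assumptions; the
real thing would live in the deployment tooling):

    import os
    import re
    import shutil

    STAGE = '/a/common'  # assumed staging directory on tin
    KEEP = 7             # 2 active branches + 5 waiting for caches to expire

    def branch_dirs():
        """All php-1.XXwmfYY checkouts, oldest first."""
        dirs = [d for d in os.listdir(STAGE)
                if re.match(r'php-1\.\d+wmf\d+$', d)]
        return sorted(dirs,
                      key=lambda d: [int(n) for n in re.findall(r'\d+', d)])

    def prune_old_branches():
        """Thursday step: drop everything but the newest KEEP branches."""
        for d in branch_dirs()[:-KEEP]:
            shutil.rmtree(os.path.join(STAGE, d))

    def purge_inactive_l10n(active_versions):
        """Tuesday step: empty cache/l10n for branches not in active use."""
        for d in branch_dirs():
            l10n = os.path.join(STAGE, d, 'cache', 'l10n')
            if d not in active_versions and os.path.isdir(l10n):
                for f in os.listdir(l10n):
                    os.unlink(os.path.join(l10n, f))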
Are there any objections to adding these procedures to the MW deploy process?
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
This week's checklist is in my personal bug tracker [0]. Highlights
and comments below.
* mw-update-l10n continues to have problems dealing with a new branch.
This week I tried adding a step to bootstrapping that removed the stub
English l10n file [1] before running the full
rebuildLocalisationCache.php step. I missed the file permission
protection on this file in the first patch and chose to cancel the
scap and correct it with a new patch [2]. On re-running scap the code
executed as desired, but I still ended up syncing an incomplete
English l10n file to the cluster. Broken l10n on mw.o was reported by
several people on IRC at ~21:00Z. I re-ran a full scap and was able to
confirm that l10n was fixed.
I have the before and after json dumps of the en l10n cache in my home
directory on tin but haven't had time to dig into them very deeply.
What was obvious from the errors seen is that some extension l10n was
not picked up. Confusingly, this didn't seem to affect all extensions.
The saga of this problem is chronicled in bug 51174 [3]. While I was
writing this up I had a minor epiphany about a potential fix:
rebuildLocalisationCache.php has a `--force` option that could be used
after stubbing the pre-extension l10n file, rather than trying to
clean things up afterwards. I'll make a patch to try that before next
week's deploy; there's a rough sketch at the end of these notes.
* Creating the on-wiki deploy notes is still a PITA. I made some
changes to make-deploy-notes this week [4] that fixed my problems with
generating a blank report. Sam looked at the report I generated and
found it to be lacking, however [5]. The diff looks funny (a big hunk
missing in the middle) and may have been caused by my use of
cut-and-paste to publish the report. I was having problems
authenticating to the API to upload directly, but have tracked that
down to PEBKAC (problem exists between keyboard and chair): I was
trying to use an old password to authenticate to mw.o.
* There were some errors in the fatal log caused by the merge of the
pmtpa dsh cleanup [6] and subsequent wmf-config [7] patches. srv270,
mw31 and mw40 were barfing because they no longer had the files needed
to answer the icinga checks. Mutante and I played whack-a-mole with
touching missing files, which just moved the problem around, until I
scp'd the prior *-pmtpa.php files back to these hosts. Afterwards I
submitted a patch to remove the pmtpa rsync slaves [8] and the
snapshot[1234] hosts that had snuck back into the dsh group [9].
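For the record, the `--force` idea from the first note might look
something like this in the l10n bootstrap (the wiki name and paths are
illustrative only):

    import subprocess

    VERSION = '1.23wmf17'  # whichever branch is being bootstrapped

    # After building the bootstrap English-only cache so that
    # mergeMessageFileList.php can run, force a full rebuild instead
    # of trying to clean the stub up afterwards.
    subprocess.check_call([
        'mwscript', 'rebuildLocalisationCache.php',
        '--wiki=testwiki',  # any wiki already on the target version
        '--outdir=/a/common/php-%s/cache/l10n' % VERSION,
        '--force',          # rebuild even if the cache looks up to date
    ])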
[0]: https://github.com/bd808/wmf-kanban/issues/61
[1]: https://gerrit.wikimedia.org/r/#/c/117154/
[2]: https://gerrit.wikimedia.org/r/#/c/117236/
[3]: https://bugzilla.wikimedia.org/show_bug.cgi?id=51174
[4]: https://gerrit.wikimedia.org/r/#/q/status:merged+project:mediawiki/tools/re…
[5]: https://www.mediawiki.org/w/index.php?title=MediaWiki_1.23%2Fwmf17%2FChange…
[6]: https://gerrit.wikimedia.org/r/#/c/108070/
[7]: https://gerrit.wikimedia.org/r/#/c/116036/
[8]: https://gerrit.wikimedia.org/r/#/c/117244/
[9]: https://gerrit.wikimedia.org/r/#/c/117326/
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
My wife is not feeling well, so I'm going to spend my son's waking hours
helping her out. I will make up the hours later tonight.
---
Ori Livneh
ori(a)wikimedia.org
https://bugzilla.wikimedia.org/show_bug.cgi?id=46014
The common thread here seems to be that when a vandal edit is reverted
within seconds, sometimes the page content from the vandal edit is
served while the oldids in the skin data (JS, print footer, etc.)
indicate the revert.
This doesn't seem to have anything to do with missed cache invalidation via
the API:
* A UI edit calls EditPage::attemptSave, which calls
EditPage::internalAttemptSave and then processes the status without doing
any additional cache checks. EditPage::internalAttemptSave itself calls
WikiPage::doEditContent, then just checks the return value and doesn't do
any additional cache checks.
* An API edit calls EditPage::internalAttemptSave and then processes the
status without doing any additional cache checks.
* An API rollback calls WikiPage::doRollback, which calls
WikiPage::commitRollback, which calls WikiPage::doEditContent and doesn't
do any other cache checks.
The fact that the page content is from an old revision but the skin
data is new makes me suspect a parser cache issue. Gerrit change 85917
probably improved the situation but didn't completely eliminate it,
and with the number of reversions ClueBot does it probably manages to
hit some race occasionally.
Looking at the parser cache handling:
* WikiPage::doEditContent updates page_touched, then calls
WikiPage::doEditUpdates which saves the just-parsed revision to the parser
cache.
* PoolWorkArticleView also saves its parsed data into the parser cache.
* ApiPurge with forcelinksupdate saves its parsed data into the parser
cache.
* RefreshLinksJob saves its parsed data into the parser cache, if it took
over 1 second to parse.
I suspect there's a race where the vandal saves their edit, then gets
redirected back to action=view on the article, and makes it into the
branch of Article::view that uses PoolWorkArticleView to reparse the
text. Article pre-fetches the content to be parsed and passes it into
PoolWorkArticleView. Then, before the PoolWorkArticleView gets a
chance to run, ClueBot comes along and reverts, updating the parser
cache in the process. Then the PoolWorkArticleView gets to run,
re-parses the vandal's content, and saves that into the parser cache,
replacing the newer version.
If this really is what's happening, it might be enough to just have
PoolWorkArticleView store a timestamp in __construct (maybe only when
$content is passed) instead of calculating the timestamp in doWork. Or we
could look at storing the revid in the parser cache (fetch it from the
$page passed to ParserCache::save if it's not explicitly provided, for BC)
and consider the parser cache entry stale if the stored revid doesn't match
the current revid, just like we do with the page_touched timestamp now.
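In pseudocode, the second idea amounts to something like this (names
are illustrative, not MediaWiki's actual API):

    def parser_cache_entry_is_fresh(entry, page):
        """Decide whether a cached ParserOutput may be reused."""
        if entry.cache_time < page.touched:
            return False  # the existing page_touched check
        if entry.rev_id is not None and entry.rev_id != page.latest_rev_id:
            return False  # proposed: parsed from a different revision
        return True  # a missing rev_id means an old entry, allowed for BC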
--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
On Mon, Mar 3, 2014 at 9:25 AM, Antoine Musso <amusso(a)wikimedia.org> wrote:
> > * Cutting the branch should be automatic. Jenkins could do this easily
> > and make the timing predictable for all parties.
>
> Can you file that as a bug under either the Deployment or Continuous
> Integration component? I would be more than happy to pair with someone
> to craft the job.
>
> We might want to have that job on a secured/private Jenkins instead of
> the CI one though.
I know we talked about maybe someday doing this, as it would solve a lot of
issues. Is this something we should wait on? It would definitely help with
the tarball process too, and as I was thinking about it, it seems like this
is something a few of us could do pretty quickly. I don't actually know
Gerrit or Jenkins, but it seems like these would all be things someone on
our team has done before, so it should just be a matter of banging it out
instead of trying to do something new.
Assuming ops was able to get the hardware, all we need to do is:
* We set up a machine (well, 2, mirrored somehow for redundancy) on
an isolated network inside the cluster
* It runs Gerrit and Jenkins
* We merge in new patches from the current git repo automatically
** If there's a conflict with a security patch, alert/page/etc. (this is
the major downside I see: if someone is trying to merge an emergency patch
to deploy, but it conflicts with an existing security patch, and it's
midnight in SFO, we will have problems; sketch after the list)
* Security patches are submitted and +2'ed in the Gerrit instance
* Jenkins cuts the weekly (or daily) branches on some schedule and tarballs
when new tags are pushed
* Tin points to the private repo to get its code, and a cron job creates
the new directory structure when new wmf branches are available.
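The auto-merge step with conflict alerting could be as dumb as this
sketch (the repo URL, working directory, and alert hook are all
assumptions):

    import subprocess

    UPSTREAM = 'https://gerrit.wikimedia.org/r/mediawiki/core'  # public repo
    WORKDIR = '/srv/private-mediawiki'  # assumed clone carrying security patches

    def sync_from_public():
        """Pull the public master into the private repo; scream on conflict."""
        subprocess.check_call(['git', 'fetch', UPSTREAM, 'master'], cwd=WORKDIR)
        merged = subprocess.call(['git', 'merge', '--no-edit', 'FETCH_HEAD'],
                                 cwd=WORKDIR)
        if merged != 0:
            subprocess.call(['git', 'merge', '--abort'], cwd=WORKDIR)
            alert('public master conflicts with a security patch; merge by hand')

    def alert(message):
        # stand-in for whatever paging/IRC/email mechanism ops prefers
        raise RuntimeError(message)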
Am I way off in thinking this isn't actually that difficult? Maybe something
we could schedule for Q2?
The checklist I ran through is in my personal bug tracker [0]. Almost
everything worked and almost everything went smoothly. Almost.
Here's some of the notes I made as things went along:
* Running make-wmf-branch on bast1001 needed a change to the build dir
because reedy already owns the default dir there.
** Since the first thing the script does is `rm -rf` that dir, I think
we could do something smarter, like prompting to delete after
completion and/or adding the script's pid to the dir name.
* Cutting the branch should be automatic. Jenkins could do this easily
and make the timing predictable for all parties.
* It seems like the php-1.XwmfY checkouts on tin could be either
shallow or single-branch checkouts. I think Chad started playing with
having multiple working copies that share the same repository, which
might be even nicer.
* Speaking of Chad's prototype work, /a/common/php-git makes
`updateWikiversions` throw a warning: "updateBitsBranchPointers: link
target /usr/local/apache/common-local/php-git/skins does not exist."
* I tried to make a script to automate copying security patches from
one branch checkout to the next. It sort of worked. `git apply` wasn't
smart enough to figure out that the patches I pulled off of the wmf15
checkout were already applied in the wmf16 branch. It would be nice to
figure this out and get it automated, or to find a better way to
manage security patches in general (see the sketch at the end of these
notes).
* Creating the on-wiki deploy notes is a PITA. I read the script a
couple of times and tried running it on my own and with tips from Sam.
I never did get it to work for me (I kept getting empty output). Sam
ran it and it worked fine, but he said "I recall the script being
temperamental". We should definitely make this an automated job in
Jenkins or elsewhere. Nobody should have to babysit this kind of
communications process. (The cobbler's children have no shoes.)
* My ssh-agent (OS X 10.8.5) croaked badly when trying to run
sync-wikiversions. This seems to be triggered by the full-fanout (not
batched) dsh call. Aaron had to step in and run both sync-wikiversions
invocations for me.
* l10n sync went badly again. wmf16 got partial en l10n data, and then
my `scap --versions php-1.23wmf16` to fix it blew up badly in the
scap-rebuild-cdbs step:
** Bug 62018 [1] - scap-rebuild-cdbs fails when scap is called with
the `--versions` command line flag - is in the Python scap code, I'm
pretty sure, and I'll get on fixing that.
** Bug 51174 [2] - Scap broken for deploying new versions of MediaWiki
due to ExtensionMessage file not being created - it looks like the
things I added in I5467ac8 [3] were necessary but not sufficient to
fix this. I stupidly didn't save a copy of the first .json files, but
wmf16 didn't get a full English l10n cache in the CDBs until a second
scap was run. It seems likely to me that this is related to the
"bootstrap" en l10n build that I put in there to get
mergeMessageFileList.php to run.
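On the security patch copying above, `git apply`'s check modes might
be enough to detect already-applied patches; a sketch (the patch and
checkout paths are made up):

    import glob
    import subprocess

    CHECKOUT = '/a/common/php-1.23wmf16'        # assumed target checkout
    PATCHES = '/srv/patches/1.23wmf15/*.patch'  # assumed export from wmf15

    def copy_security_patches():
        for patch in sorted(glob.glob(PATCHES)):
            # Applies cleanly? Then apply it.
            if subprocess.call(['git', 'apply', '--check', patch],
                               cwd=CHECKOUT) == 0:
                subprocess.check_call(['git', 'apply', patch], cwd=CHECKOUT)
            # Reverse-applies cleanly? Then it's already in the branch.
            elif subprocess.call(['git', 'apply', '--reverse', '--check',
                                  patch], cwd=CHECKOUT) == 0:
                print('%s already applied, skipping' % patch)
            else:
                raise RuntimeError('%s needs a manual rebase' % patch)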
Things actually went pretty smoothly though. Thanks a lot to Chad and
Sam for helping me make a checklist, and to Aaron for being around to
lend a hand when I fell and couldn't get up.
[0]: https://github.com/bd808/wmf-kanban/issues/57
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=62018
[2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=51174
[3]: https://gerrit.wikimedia.org/r/#/c/113260/
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855