On Wed, Jan 3, 2018 at 2:49 PM, Manuel Arostegui <marostegui@wikimedia.org> wrote:

Happy new year!

Tomorrow I will deploy this change on s7, so expect some delay there.

Thanks
Manuel.

On Mon, Dec 18, 2017 at 4:54 PM, Manuel Arostegui <marostegui@wikimedia.org> wrote:
Hello again!

I will be altering s1 tomorrow, early in the European morning. Expect some delay on labs!

Manuel.

On Tue, Dec 12, 2017 at 5:03 PM, Manuel Arostegui <marostegui@wikimedia.org> wrote:
Hello!

It is time for s4. I will be doing it tomorrow on the sanitarium master. There will be around 3h of delay, as the logging table is quite big and takes around 2-3h to ALTER.

Manuel.

On Thu, Dec 7, 2017 at 8:38 AM, Manuel Arostegui <marostegui@wikimedia.org> wrote:
Hello,

I will be running this schema change on s2 on Monday. Expect delay on the s2 replicas.

Manuel.

On Wed, Nov 29, 2017 at 1:53 PM, Manuel Arostegui <marostegui@wikimedia.org> wrote:
Hey Cloud Team,

I am now running this schema change on s3, for all the wikis (around 900). I have throttled it a bit and it has been running for an hour without any significant delay on the new replicas.
labsdb1003 is delayed a bit, but it usually has been lately, so I don't think it is related to this change.
This should take another 15h or so to finish completely.

Cheers
Manuel.

On Wed, Nov 15, 2017 at 6:45 PM, Manuel Arostegui <manuel@wikimedia.org> wrote:

On Wed, Nov 15, 2017 at 6:39 PM, Bryan Davis <bd808@wikimedia.org> wrote:
On Wed, Nov 15, 2017 at 9:48 AM, Manuel Arostegui <manuel@wikimedia.org> wrote:
> Hello Cloud Admins!
>
> As part of https://phabricator.wikimedia.org/T174569 we have to alter some
> big tables.
> One of them is logging, which, for instance, takes around 8h on wikidata,
> the shard I am currently working on.
>
> Because of the nature of the change (some columns being added) and ROW based
> replication (what we use in sanitariums) this change needs to be done with
> replication (from sanitarium, or their masters, to the labs servers).
>
> This will obviously generate lag, but if it is not done that way, replication
> will break until the column is added on the labs hosts, which is less
> desirable than replication lag.
>
> I am planning to run the alter probably tomorrow or Monday (I will notify
> when I start it) on the sanitarium host in s5. That means there will
> be lag on the labs servers, for a few hours, on the s5 instance (which will
> also affect s1 and s3 because we are using the same replication thread for
> those shards too - which is a FIXME we have pending).
>
> s2, s4, s6 and s7 will remain unaffected as they each have their own
> replication thread.
>
> Should you have any questions, let me know!

Should we send a message to cloud-announce about this, or just be
ready to tell people that the lag is a known issue due to production
schema changes?

I don't think it is necessary to send an announcement about it; it is just maintenance. I would suggest you just point people to that task so they know when the other shards will be done too :-)

Manuel.