[Labs-l] Accessing the databases from labs - A comparison with the toolserver

Platonides platonides at gmail.com
Fri Jul 12 17:59:55 UTC 2013


On 12/07/13 17:43, Marc A. Pelletier wrote:
> On 07/12/2013 11:13 AM, Platonides wrote:
>> - A toolserver table, available on all database servers.
>
> The problem is that, AFAICT, there is currently no agreement on what
> that table should contain and what its schema should be.  Viz bz 48626[1]

Everything the toolserver one has (and maybe more columns).


>> - sql-sX-rr.labsdb and sql-sX-userdb.labsdb "dns" entries. I would need
>> to detect in my tools if it should
>> append .toolserver.org or .labsdb (is there a supported way of detecting
>> that you are running on tool labs?)
>> But that step seems reasonable. (BTW, What about adding labsdb to
>> resolv.conf(5) search?)
>
> I'm not sure where the difference lies between deciding what to add to
> the host name and what the host name should be.
>
> Toolserver:  %s-p.rrdb.toolserver.org
> Tool Labs:   %s.labsdb
>
> Also, relying on a particular mapping between shards and databases is a
> Bad Thing regardless; this way maintenance woes lie.  You shouldn't be
> connecting to "shard N which happens to be where foowiki_p is" but "to
> where foowiki_p is".  Not only can the mapping of database to cluster
> change in production, but there is no reason why that mapping needs to
> remain the same for the replicas.
>
> In other words, on Labs, connecting to shards "by number" is an error
> (and I don't expect to preserve the undocumented s?.labsdb names at all
> once things are moved to DNS).


In my tools I use two primitives:
- DBforCluster($number, $dbName = false)
- DBforName($dbName, $strict = true)

DBforCluster connects you to the given cluster number, while DBforName
first resolves the name to a cluster number using the toolserver table
and then connects there. A sketch of both is below.
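
Roughly like this (a simplified sketch, not the real code; it assumes
the dbname and server columns of toolserver.wiki, which is available on
every server, and it elides credentials):

    function DBforCluster($number, $dbName = false) {
        static $cache = array();
        if (!isset($cache[$number])) {
            // TS host naming; on labs the pattern would differ (see above)
            $host = "sql-s$number-rr.toolserver.org";
            $cache[$number] = new mysqli($host); // credentials elided
        }
        if ($dbName !== false) {
            $cache[$number]->select_db($dbName);
        }
        return $cache[$number];
    }

    function DBforName($dbName, $strict = true) {
        // toolserver.wiki maps dbname -> server (the cluster number)
        $meta = DBforCluster(1, 'toolserver');
        $res = $meta->query("SELECT server FROM wiki WHERE dbname = '" .
            $meta->real_escape_string($dbName) . "'");
        $row = $res ? $res->fetch_row() : false;
        if (!$row) {
            if ($strict) {
                throw new Exception("No such database: $dbName");
            }
            return false;
        }
        return DBforCluster((int) $row[0], $dbName);
    }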

These connections are cached, so if I connect to fiwiki and then to
eowiki (both on the same cluster), the same db object is returned. This
way, iterating over all of them (Wikipedias, Wiktionaries, every
wiki...) uses number-of-clusters connections, instead of connecting
number-of-databases times (many of them to the same server), and
obviously doesn't leave number-of-databases open connections in a cache.
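
So a loop over every wiki touches each server only once ($allWikis here
is a hypothetical list of db names, e.g. pulled from toolserver.wiki):

    foreach ($allWikis as $wiki) {
        $db = DBforName($wiki);
        // ... per-wiki queries against $db; at most one connection
        // per cluster is ever opened by the cache above ...
    }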

A quick grep shows that I have only hardcoded the cluster numbers 4 
and 1 (often with "-user" appended), which are safe. There are also a 
couple of tools which connect by cluster number right after obtaining it 
from the toolserver.wiki table.
The worst offender is a DBforCluster( 'z-dat-s7-a', 'centralauth_p' ); 
call, but centralauth_p isn't in toolserver.wiki for some reason. :(

If the mapping of database to cluster changes, I wouldn't blame you. 
Unless you didn't update that “bridge” table I am requesting, of course. ;)

I don't know why you would want a different mapping in labs than in 
production (it's unlikely to be worth splitting the binlogs), but as 
long as the cluster view from labs stays consistent...

I'm surprised by the lack of differentiation between replica-only 
servers and those that must contain user dbs (the -rr and -user servers 
on the TS). The former are easy to round-robin, while the latter will 
become a pain the moment a single server can't handle everything. If 
the hosts were separated as on the TS, the transition would be very 
simple when the time comes.
We could go even further and also differentiate ro and rw access to 
user tables; a hypothetical scheme is sketched below.
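
Something like this, say (all of these .labsdb alias names are purely
hypothetical; this is the split I'm suggesting, not anything that
exists today):

    function labsHostFor($dbName, $userDb = false, $write = false) {
        if (!$userDb) {
            return "$dbName.labsdb";      // replica-only, round-robin friendly
        }
        // user dbs; ro and rw access could be told apart too:
        return $write ? "$dbName.rw.userdb.labsdb"
                      : "$dbName.userdb.labsdb";
    }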



>> - Database names compatible with those of the toolserver. References to
>> the dbs are sometimes spread on
>> the codebase, and migrating shouldn't require a hunt for them if it's
>> avoidable.
>
> In this particular case, it's not avoidable (for user databases).

Why?


> AFAICT, the replicated databases names are the same.

Yes (once you connect to the right host)


>> - dns names like project-p.labsdb for compatibility with TS tools?
>> Perhaps *.(rr|user)db.toolserver.org
>> should be aliased to .labsdb
>
>> - Marking the global dbs in that toolserver table would also be nice.
>
> Having database hostnames in /etc/hosts rather than in DNS is a
> temporary hack that is, actually, scheduled to go away shortly (days).
> Providing toolserver-like aliases is entirely possible, but I'm not
> certain I see the point (because of the first section above).
>
>> - How to detect if you are running in labs? (for dual tools)
>
> Possibly the very simplest way to do this would be to provide (say)
> /etc/labs on every tool labs host, testing for its presence should be a
> reliable indication.  I'll add this shortly.
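
That would make the dual-tool check trivial; a one-liner, assuming the
marker really does end up at /etc/labs:

    // Sketch: detect Tool Labs by the proposed /etc/labs marker file.
    $onLabs = file_exists('/etc/labs');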
>
> --
>
> All of that said, I agree that having the same code run unchanged on
> both the Toolserver and Tool Labs would require some adaptation but:
>
> (a) this is, at any rate, unavoidable.  /All/ Tool Labs projects need to
> be multi-maintainer (with a different and simpler system)

I haven't used it yet, but it doesn't seem _that_ different.

I don't know where I'd put the code shared between the tools, but that's 
a problem of changing from single-user to multi-maintainer, not a 
complete difference.


> and run through the grid engine both of which implies bigger changes than
> database names to connect to;

There is also a grid engine on the TS. BTW, it's not possible in labs 
to specify a cluster as a requisite the way the TS allows. See 
https://wiki.toolserver.org/view/Batch_job_scheduling#Optional_resources

> and
>
> (b) the effort of having the same code run unchanged on both
> "variations" of replicas does not seem to be a worthwhile investment of
> time and effort for maintainers since the toolserver replicas are going
> away for good on 2014-06-30 at the very latest (and possibly earlier).

I want to be able to run (test) on labs the same scripts I use on the 
toolserver. If the library were dual, this would be a breeze; something 
like the sketch below.
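
A sketch only, using the two host patterns you gave above and the
proposed /etc/labs marker:

    // A "dual" host resolver: calling code stays identical on both
    // sides, only the template changes. $wiki is e.g. 'enwiki'.
    function hostFor($wiki) {
        if (file_exists('/etc/labs')) {
            return "$wiki.labsdb";                // Tool Labs: %s.labsdb
        }
        return "$wiki-p.rrdb.toolserver.org";     // TS: %s-p.rrdb.toolserver.org
    }

    $db = new mysqli(hostFor('enwiki'));          // credentials elided
    $db->select_db('enwiki_p');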


