Hello!
Fyi, I just added 2 new panels to the Elasticsearch Grafana
dashboard [1] (located at the bottom).
1) Nodes with < 25% disk free
The idea is to be able to catch cluster imbalance earlier. There is a
filter on the data to show only nodes with < 25% disk free. At the
moment there are no nodes satisfying the criteria, so that graph is
empty. You can play with it by increasing the threshold...
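For illustration, the panel's filter boils down to a simple threshold check on free-disk percentage per node. A minimal sketch in Python (node names and byte counts are made up; the real panel works directly on the metrics Grafana already collects):

```python
def nodes_low_on_disk(nodes, threshold_pct=25.0):
    """Return names of nodes whose free-disk percentage is below threshold.

    `nodes` is a list of dicts like
    {"name": ..., "disk_total": <bytes>, "disk_free": <bytes>},
    roughly what Elasticsearch's _cat/allocation API reports per node.
    """
    low = []
    for node in nodes:
        free_pct = 100.0 * node["disk_free"] / node["disk_total"]
        if free_pct < threshold_pct:
            low.append(node["name"])
    return low


# Example: only the first node is below the 25% default threshold.
sample = [
    {"name": "elastic1001", "disk_total": 1000, "disk_free": 100},
    {"name": "elastic1002", "disk_total": 1000, "disk_free": 500},
]
print(nodes_low_on_disk(sample))
```

Raising `threshold_pct` is the equivalent of "playing with the threshold" in the panel until some nodes match.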
2) Request rate (nginx)
This is mostly the same data as the QPS graphs we already have, but at
a lower level. I like having different measurements of similar things,
so that when things are not as expected, we might have a chance to
understand why. For example, this shows ~500 requests/second in
eqiad, which are probably the Translate extension and the index
updates (I need to check).
Let me know if you have other ideas for things that make sense...
MrG
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
Erik, Trey, David, Kevin, and I met this morning to discuss how we're going
to handle data collection for the upcoming TextCat test [1]. A big problem
in this particular case is that the system wasn't designed/engineered in a
way that's conducive to cross-wiki logging / session tracking. And
recently we even lost the ability to use referrer info to see which
page a user came from when navigating between wikis. (I was told this
was done for user privacy reasons.)
Erik said he had recently implemented a click event in the
TestSearchSatisfaction2 schema that we might be able to hook into to
measure clickthrough rate for users who are eligible for TextCat language
detection & get shown results in the language their non-English query
probably is written in. Whether we use this and how much we rely on this
particular method of measuring whether TextCat is successful (beyond just
measuring how it impacts the zero results rate) depends on the validation
[2] of the click events and how they compare to page visit events (which
cannot be fired in an interwiki context).
We also discussed an alternative approach which uses web requests with the
caveat being that if a user is selected for the test once, they'll be
selected every time. So if a particular IP+UA combination is part of the
test and performs 2 million searches (as is sometimes the case), then we'll
have to do some very careful filtering which will also exclude some
completely valid use cases (a computer lab in a school or a country with
only 2 public IP addresses). But we're shooting for being able to use
TestSearchSatisfaction2 :)
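To illustrate the caveat: a deterministic bucketing scheme like the sketch below (a hypothetical example, not the actual sampling code) always makes the same decision for a given IP+UA combination, which is exactly why a high-volume client that lands in the test stays in it for all of its searches:

```python
import hashlib


def in_test_bucket(ip, user_agent, sample_rate=0.01):
    """Deterministically decide whether an IP+UA combination is in the test.

    The decision is a pure function of IP+UA: the same client gets the
    same answer on every request, so one very busy client can contribute
    millions of events to the test bucket.
    """
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()
    # Map the first 8 hex digits to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 2**32
    return fraction < sample_rate
```

The careful filtering mentioned above would then have to happen downstream, e.g. capping events per IP+UA, at the cost of also dropping legitimate shared-IP users.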
[1] https://phabricator.wikimedia.org/T121542
[2] https://phabricator.wikimedia.org/T132706
--
*Mikhail Popov* // Count Logula, Discovery
<https://www.mediawiki.org/wiki/Wikimedia_Discovery>
https://wikimediafoundation.org/
*Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment.* Donate
<https://donate.wikimedia.org/>.
On Wed, Apr 13, 2016 at 6:16 PM, Kevin Smith <ksmith(a)wikimedia.org> wrote:
> This is a ping to remind you about action items we ended up with from our
> March retrospective, in case you haven't acted on them. Sorry this is so
> late!
>
> Mikhail: Check whether meta Discovery/Testing page is up-to-date
> Guillaume: talk to RobH to understand how procurement works
-> Done. I have probably not understood everything yet, but I'm
getting there (slowly). One of the things I've understood that might
interest some of you (you probably know it already, but you never know): there
are 2 workboards worth watching if you want to know the status of
hardware requests:
Hardware-request board [1]:
Tracks all hardware requests. If you are waiting for hardware but the
task does not show up on that board, your task is probably lost and
RobH is not working on it (and I'm probably not going to find it
either). Columns are self-explanatory and give you some visibility into
the status.
Procurement board [2]:
Once the hardware request is approved, a procurement task is created
and will move forward on the procurement board. As those tasks
contain price information that we are not allowed to share,
access is restricted.
The full procurement process is documented on Office wiki [3] (again,
some private info there, so not public).
> Should probably have an automatic task to announce each test
> Think more about velocity question: Hire more? Change process? Is it OK as
> is? Start doing guesstimations?
Before changing the way we do things (hire, process improvement,
estimation, ...) I think we should put in place a few metrics, so that
we know if our changes improve the situation or not. I had a quick
look into the reports available out of the box from Phabricator, and
they seem a bit lightweight (to say the least). I might just not be
looking in the right place (I have been known to do that).
The first metric I'd like to see is something about cycle time (how
long do we take to finish a task once we started working on it). Or a
Cumulative Flow Diagram, which should give us visibility on our cycle
time and might give more insight on its evolution over time.
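As a rough illustration of what I mean by cycle time (dates and helper names below are made up; Phabricator would be the real data source):

```python
from datetime import datetime


def cycle_time_days(started, finished):
    """Cycle time: elapsed days between starting and finishing a task.

    Dates are "YYYY-MM-DD" strings, e.g. from task activity timestamps.
    """
    fmt = "%Y-%m-%d"
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    return delta.days


def median_cycle_time(tasks):
    """Median cycle time over a list of (started, finished) date pairs.

    Median rather than mean, so one stuck task doesn't dominate.
    """
    times = sorted(cycle_time_days(s, f) for s, f in tasks)
    n = len(times)
    mid = n // 2
    return times[mid] if n % 2 else (times[mid - 1] + times[mid]) / 2
```

A Cumulative Flow Diagram is essentially the same per-task data, aggregated per day and per workboard column.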
As you can see, I'm a big fan of metrics. Now that I've made my point,
I'm not actually sure it makes sense to invest in better visibility for
our team. So far I have not felt like we have an issue delivering
value. Digging into how we work, creating metrics, and following them
does not come for free. So the usual "if it's not broken, don't fix it"
probably applies here. You know that better than I do...
> Announce the past test(s)
>
> I think Dan already did that third item. The fourth was probably on my
> plate, but I haven't had a chance to do it, so I'll probably leave it as an
> action item coming out of the April retro.
>
> Who would be best to handle that last item, or has someone already done it?
>
>
> Kevin Smith
> Agile Coach, Wikimedia Foundation
>
[1] https://phabricator.wikimedia.org/project/view/1014/
[2] https://phabricator.wikimedia.org/project/view/1155/
[3] https://office.wikimedia.org/wiki/Operations/Procurement
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
@Antoine: you'll need to give a bit more context.
JustinO floated an idea on #wikimedia-discovery to provide an easy
way to use the dumps of our Elasticsearch indices. The idea itself
came from a Stack Overflow post [1]. At the moment, we do provide
the dumps [2], and while it is not rocket science to import them, it
isn't as straightforward as we could wish. Providing a Vagrant project
that takes care of using the correct version of Elasticsearch, has the
correct scripts, etc. would be nice.
As far as I know, there is no Phabricator ticket filed for this idea yet.
[1] http://stackoverflow.com/questions/36485614/import-wikipedias-indices-into-…
[2] https://dumps.wikimedia.org/other/cirrussearch/
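For what it's worth, the dumps are intended for bulk import (alternating action and source lines, as the Elasticsearch _bulk API expects), so an import helper mostly has to split them into reasonably sized chunks. A hypothetical sketch of that chunking step (the POSTing to _bulk is left out):

```python
def bulk_chunks(lines, chunk_size=500):
    """Split a dump in Elasticsearch bulk format into chunks of documents.

    `lines` yields the decompressed dump line by line; each document is
    two lines (an action line followed by a source line), so a chunk of
    `chunk_size` documents is 2 * chunk_size lines. Each yielded string
    is a ready-to-POST _bulk request body.
    """
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size * 2:
            yield "\n".join(chunk) + "\n"
            chunk = []
    if chunk:
        yield "\n".join(chunk) + "\n"
```

The Vagrant project would wrap this with `gzip.open()` on the dump file and an HTTP POST of each chunk to the local Elasticsearch's `/_bulk` endpoint.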
On Tue, Apr 12, 2016 at 5:23 PM, Antoine Boegli
<antoine.boegli(a)gmail.com> wrote:
> vagrant boxes are not finished for the moment, and I think they will not be
> available until next week (not enough time)
>
> Q : is it okay if I put the boxes themselves in the Atlas service by
> Hashicorp ? If yes, is there already some WMF account ?
>
> 2016-04-12 16:57 GMT+02:00 Giuseppe Lavagetto <glavagetto(a)wikimedia.org>:
>>
>> On Mon, Apr 11, 2016 at 12:06 PM, Guillaume Lederrey
>> <glederrey(a)wikimedia.org> wrote:
>> > Short status about the micro-hackathon that took place in my kitchen
>> > last Saturday:
>> >
>> > First of all, thanks to Joe for his support! And loads of thanks to
>> > Jan, Alex, Nicko and Antoine for participating!
>> >
>>
>> [CUT]
>>
>> This is seriously great and sorry for not being around more - I'd have
>> loved to actually help instead of delivering some random advice on IRC
>> in the evening.
>>
>> Thanks to everyone involved, and please bug me on irc/phabricator if
>> you need help/feedback with your patches :)
>>
>> Cheers,
>>
>> Giuseppe
>> --
>> Giuseppe Lavagetto, Ph.d.
>> Senior Technical Operations Engineer, Wikimedia Foundation
>
>
>
>
> --
> Antoine Boegli
> software engineer & linux expert
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
Short status about the micro-hackathon that took place in my kitchen
last Saturday:
First of all, thanks to Joe for his support! And loads of thanks to
Jan, Alex, Nicko and Antoine for participating!
The main goal was to have fun and introduce a few people to what we
are doing. At the same time, we did manage to get some actual work
done.
We did have a look at the following:
* T128786 [1]: Improve robustness of es-tool
Implementation is done, needs to be tested somewhere and deployed.
Thanks Alex!
* T78342 [2]: Create a basic RSpec unit test for operations/puppet
Some work has been done, but it is not yet in a state where it can
be merged. Nicko will continue to look into it and let us know.
* T130861 [3]: Investigate possible simplification of Cassandra
Logstash filtering
Implementation is done, needs to be tested somewhere and deployed.
Thanks Jan!
* T131760 [4]: Add icinga monitoring for varnish statistics daemons
Implementation is done, needs to be tested somewhere and deployed.
Thanks Alex!
Antoine also had a look into offering a Vagrant image with a fully
working Elasticsearch (with indices from our dumps). Not yet working.
We might see Jan, Alex, Nicko and Antoine on IRC, trying to push their
changes up to completion...
Note: you know you have a great job when you are happy to continue
doing it on the weekend AND you have friends coming over to do it with you
just for fun!
[1] https://phabricator.wikimedia.org/T128786
[2] https://phabricator.wikimedia.org/T78342
[3] https://phabricator.wikimedia.org/T130861
[4] https://phabricator.wikimedia.org/T131760
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
So wdqs1002 is not going well. Long story short: we're going to need
to reinstall it. That's a first for me, so I'm going to need some
help. I'll ping the Ops side to get the low-level stuff, but I'm
probably going to need a bit of time from Stas and a few pointers in
the right direction.
@Stas: expect to hear from me in your morning ...
I'll give you all the details tomorrow, but right now I need some sleep...
Good night.
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
WDQS servers were scheduled for a reboot during the weekly deployment
window. While the first server rebooted without issue, things did not
go as well with the second one. We do not know yet what the issue is,
but we are investigating [1].
This has no direct impact on end users: we can run on a single
server, as long as we don't lose it as well...
[1] https://phabricator.wikimedia.org/T132387
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation