Well, no, HDFS is a means to an end: storing data in a form that can
be cleaned with ETL processes so that /then/ it can go to the
somewhere/something, which covers a lot of use cases but most
prominently our dashboards and ad-hoc research tasks.
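To make concrete what "cleaned with ETL" means here, a minimal sketch of the transform step (standalone Python rather than an actual distributed job over HDFS; the combined-log format, the field names, and the `/sparql` example path are illustrative assumptions, not our real schema):

```python
import re
from datetime import datetime

# Hypothetical "clean" step of such a pipeline: turn raw nginx request-log
# lines (combined log format assumed) into tidy records a dashboard could
# aggregate. The real pipeline would run over HDFS; this shows the shape
# of the transform only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def clean_record(raw_line):
    """Parse one raw log line into a record, or None if it is malformed."""
    m = LOG_PATTERN.match(raw_line)
    if m is None:
        return None  # malformed lines are dropped, not guessed at
    return {
        "day": datetime.strptime(m.group("ts").split()[0],
                                 "%d/%b/%Y:%H:%M:%S").date().isoformat(),
        "uri": m.group("uri"),
        "status": int(m.group("status")),
        "bytes": int(m.group("bytes")),
    }

raw = '10.0.0.1 - - [17/Jun/2015:03:22:00 +0000] "GET /sparql?query=x HTTP/1.1" 200 512'
print(clean_record(raw))
# → {'day': '2015-06-17', 'uri': '/sparql?query=x', 'status': 200, 'bytes': 512}
```

The point is that this same parse-and-load shape serves any downstream consumer once the data is in HDFS, which is exactly why rebuilding it per-project is wasteful.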
Let me be clear here that this isn't a theoretical exercise existing
in a vacuum; we do not want /an/ answer that can be hooked up to the
dashboards. That's easy. That's a hideous shell script that scps
nginx log files over. We want an answer that can be hooked up to the
dashboards for many, many, many things, because we're not just wanting
metrics and analytics for WDQS; we also want them for the production
API, for user events, for the Cirrus logs, and for high-level KPIs,
and that's just the things we've wanted this month.
I can't be building out an entirely new pipeline every single time
someone builds a thing. That's not an efficient use of our analysts'
time, and it massively increases the chance that something will go
wrong. I'm not asking for an alternative to HDFS, because I don't want
to be doing that. I'm asking for HDFS because then we don't need to
reinvent the wheel every time we build a thing. If we can't do HDFS
and going to production isn't going to work, then let's talk about
what the alternatives are. Until then the use case is "the data being
in HDFS so that analysts can consume it" and higher-level use cases
are overthinking.
On 17 June 2015 at 03:22, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
> Hi!
>
>> The problem, as we've gone back and forth about for a while on
>> Phabricator, is that labs has absolutely zero inbuilt infrastructure
>> for analytics.
>> If things are in production they go through the frontend varnishes,
>> which are hooked up to HDFS, and all is fine. We have the request
>> logs. If things are in labs... nothing. There is no access to HDFS,
>> there is no consistent varnish setup that pipes things there, and
>> analytics engineering has pretty much no plans to set up that sort of
>> infrastructure.
>
> Right. What I am still missing is that HDFS, varnish, etc. are means
> to an end, the end being delivering info (in this case, usage logs)
> somewhere, and then doing something with it. So right now I do not
> have a clear picture of what that somewhere/something is, or what data
> it consumes and in what form. Maybe if I were more up to speed on
> this, or at least understood what inputs are required and which forms
> of these inputs are acceptable, I could have a better picture.
>
> --
> Stas Malyshev
> smalyshev(a)wikimedia.org
> _______________________________________________
> Wikimedia-search mailing list
> Wikimedia-search(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
--
Oliver Keyes
Research Analyst
Wikimedia Foundation