[Engineering] Hadoop - Last week data needs to be backfilled

Joseph Allemandou jallemandou at wikimedia.org
Tue Mar 1 15:24:26 UTC 2016


Hey Oliver,
It depends on what data you've used: if page_title or other
'encoding-sensitive' data (I can't think of any other, but ...) is part of
it, then yes, you should!
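
To illustrate why page_title is 'encoding sensitive', here is a minimal
sketch of what happens when UTF-8 bytes are decoded with another default
encoding. It is only an illustration: the sample title, the helper name
decode_title and the choice of latin-1 as the wrong encoding are
assumptions, not the actual cluster setup or job code.

```python
# Hypothetical illustration: the same UTF-8 bytes decoded with two
# different "default" encodings. Names and values are assumptions.

def decode_title(raw: bytes, default_encoding: str) -> str:
    """Decode raw page-title bytes with the given default encoding."""
    return raw.decode(default_encoding, errors="replace")

title = "Caf\u00e9"                    # sample title with a non-ASCII character
raw = title.encode("utf-8")            # titles are stored as UTF-8 bytes

good = decode_title(raw, "utf-8")      # round-trips to the original title
bad = decode_title(raw, "latin-1")     # each UTF-8 byte becomes its own character

print(good == title)                   # True
print(bad == title)                    # False: the title is garbled
```

Pure-ASCII fields decode identically under both encodings, which is why
only data containing non-ASCII characters, such as page titles, would be
affected.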

On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes at wikimedia.org> wrote:

> Hey Joseph,
>
> Thanks for letting us know. So we should delete and backfill last
> week's data, for our regularly scheduled scripts?
>
> On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou at wikimedia.org>
> wrote:
> > Hi,
> >
> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
> >
> > Last week the Analytics Team performed an upgrade of the Hadoop cluster.
> > It went reasonably well, except that many of the Hadoop processes were
> > launched with a special option to NOT use UTF-8 as the default encoding.
> > This issue caused trouble particularly in page title extraction and was
> > detected last Sunday (many kudos to the people who filed bugs on the
> > Analytics API about encoding :)
> > We found the bug and fixed it yesterday, and the backfill starts today,
> > with the cluster recomputing every dataset from 2016-02-23 onward.
> > This means you shouldn't query last week's data during this week, first
> > because it is incorrect, and second because you'll curse the cluster for
> > being too slow :)
> >
> > We are sorry for the inconvenience.
> > Don't hesitate to contact us if you have any questions.
> >
> >
> > --
> > Joseph Allemandou
> > Data Engineer @ Wikimedia Foundation
> > IRC: joal
> >
> > _______________________________________________
> > Engineering mailing list
> > Engineering at lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/engineering
> >
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal