[Engineering] [Analytics] Hadoop - Last week data needs to be backfilled

Andrew Otto otto at wikimedia.org
Tue Mar 1 19:26:09 UTC 2016


https://phabricator.wikimedia.org/T128295

On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <bo.ning.han at gmail.com> wrote:

> Hi,
>
> Would you mind linking the bug fix here? I couldn't find it on phabricator.
>
> Thanks,
> Bo
>
> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
> <jallemandou at wikimedia.org> wrote:
> > Hey Oliver,
> > It depends on what data you've used: if page_title or other 'encoding
> > sensitive' data (I can't think of any other, but ...) is part of it, then
> > yes, you should!
> >
> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes at wikimedia.org> wrote:
> >>
> >> Hey Joseph,
> >>
> >> Thanks for letting us know. So we should delete and backfill last
> >> week's data, for our regularly scheduled scripts?
> >>
> >> On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou at wikimedia.org>
> >> wrote:
> >> > Hi,
> >> >
> >> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
> >> >
> >> > Last week the Analytics Team performed an upgrade to the Hadoop cluster.
> >> > It went reasonably well, except that many of the Hadoop processes were
> >> > launched with a special option to NOT use UTF-8 as the default encoding.
> >> > This issue caused trouble, particularly in page title extraction, and
> >> > was detected last Sunday (many kudos to the people who filed bugs on
> >> > the Analytics API about encoding :)
> >> > We found the bug and fixed it yesterday. The backfill starts today, with
> >> > the cluster recomputing every dataset from 2016-02-23 onward.
> >> > This means you shouldn't query last week's data during this week: first
> >> > because it is incorrect, and second because you'll curse the cluster
> >> > for being too slow :)
> >> >
> >> > We are sorry for the inconvenience.
> >> > Don't hesitate to contact us if you have any questions.
> >> >
> >> >
> >> > --
> >> > Joseph Allemandou
> >> > Data Engineer @ Wikimedia Foundation
> >> > IRC: joal
> >> >
> >> > _______________________________________________
> >> > Engineering mailing list
> >> > Engineering at lists.wikimedia.org
> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
> >> >
> >>
> >>
> >>
> >> --
> >> Oliver Keyes
> >> Count Logula
> >> Wikimedia Foundation
> >
> >
> >
> >
> > --
> > Joseph Allemandou
> > Data Engineer @ Wikimedia Foundation
> > IRC: joal
> >
> > _______________________________________________
> > Analytics mailing list
> > Analytics at lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>