[Engineering] [Analytics] Hadoop - Last week data needs to be backfilled

Tilman Bayer tbayer at wikimedia.org
Wed Mar 2 03:22:44 UTC 2016


Thanks Joseph! Is it reasonable to assume that the aggregate data in
projectview_hourly
<https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly> has
not been affected?
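
For context, here is a minimal sketch of the failure mode described in the quoted message below: UTF-8 bytes decoded with a non-UTF-8 default charset. The thread does not say which encoding the affected processes actually fell back to, so the ASCII and Latin-1 cases, the example title, and the decode_title helper are all assumptions made for illustration. This is not the cluster's actual code.

```python
# Hypothetical illustration only: decode_title and the sample title are
# made up for this sketch and do not come from the Hadoop jobs themselves.

def decode_title(raw: bytes, encoding: str) -> str:
    """Decode raw title bytes with the given charset.

    Undecodable bytes are replaced with U+FFFD instead of raising.
    """
    return raw.decode(encoding, errors="replace")


title = "Café_au_lait"        # example title containing a non-ASCII character
raw = title.encode("utf-8")   # the bytes as they would be stored in UTF-8

good = decode_title(raw, "utf-8")          # round-trips correctly
bad_ascii = decode_title(raw, "ascii")     # non-ASCII bytes become U+FFFD
bad_latin1 = decode_title(raw, "latin-1")  # each byte becomes its own character

print(good)
print(bad_ascii)
print(bad_latin1)

# ASCII-only values decode identically under all three charsets.
ascii_only = "en.wikipedia".encode("utf-8")
same_everywhere = (
    decode_title(ascii_only, "ascii")
    == decode_title(ascii_only, "latin-1")
    == decode_title(ascii_only, "utf-8")
)
print(same_everywhere)
```

In this sketch only the value containing a non-ASCII character is corrupted, which matches the description of page_title as 'encoding sensitive'. Whether a given dataset was actually affected still depends on how it was computed.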

On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou <jallemandou at wikimedia.org> wrote:

> Hey Oliver,
> It depends on what data you've used: if page_title or other
> 'encoding-sensitive' data (I can't think of any other, but ...) is part of
> it, then yes, you should!
>
> On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes at wikimedia.org> wrote:
>
>> Hey Joseph,
>>
>> Thanks for letting us know. So we should delete and backfill last
>> week's data, for our regularly scheduled scripts?
>>
>> On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou at wikimedia.org>
>> wrote:
>> > Hi,
>> >
>> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
>> >
>> > Last week the Analytics Team performed an upgrade of the Hadoop cluster.
>> > It went reasonably well, except that many of the Hadoop processes were
>> > launched with a special option to NOT use UTF-8 as the default encoding.
>> > This issue caused trouble particularly in page title extraction and was
>> > detected last Sunday (many kudos to the people who filed bugs on the
>> > Analytics API about encoding :)
>> > We found the bug and fixed it yesterday, and the backfill starts today,
>> > with the cluster recomputing every dataset from 2016-02-23 onward.
>> > This means you shouldn't query last week's data during this week, first
>> > because it is incorrect, and second because you'll curse the cluster for
>> > being too slow :)
>> >
>> > We are sorry for the inconvenience.
>> > Don't hesitate to contact us if you have any questions.
>> >
>> >
>> > --
>> > Joseph Allemandou
>> > Data Engineer @ Wikimedia Foundation
>> > IRC: joal
>> >
>> > _______________________________________________
>> > Engineering mailing list
>> > Engineering at lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> >
>>
>>
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation
>>
>
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>
> _______________________________________________
> Analytics mailing list
> Analytics at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB