[Engineering] [Analytics] Hadoop - Last week data needs to be backfilled

Joseph Allemandou jallemandou at wikimedia.org
Wed Mar 2 09:27:51 UTC 2016


Hi Tilman,
Your assumption is correct, you can trust projectview_hourly :)

On Wed, Mar 2, 2016 at 4:22 AM, Tilman Bayer <tbayer at wikimedia.org> wrote:

> Thanks Joseph! Is it reasonable to assume that the aggregate data in
> projectview_hourly
> <https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly> has
> not been affected?
>
> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou <
> jallemandou at wikimedia.org> wrote:
>
>> Hey Oliver,
>> It depends on what data you've used: if page_title or other
>> 'encoding-sensitive' data (I can't think of any other, but ...) is part of
>> it, then yes, you should!
>>
>> On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes at wikimedia.org>
>> wrote:
>>
>>> Hey Joseph,
>>>
>>> Thanks for letting us know. So we should delete and backfill last
>>> week's data, for our regularly scheduled scripts?
>>>
>>> On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou at wikimedia.org>
>>> wrote:
>>> > Hi,
>>> >
>>> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
>>> >
>>> > Last week the Analytics Team performed an upgrade to the Hadoop cluster.
>>> > It went reasonably well, except that many of the Hadoop processes were
>>> > launched with a special option to NOT use UTF-8 as the default encoding.
>>> > This issue caused trouble, particularly in page title extraction, and was
>>> > detected last Sunday (many kudos to the people who filed bugs on the
>>> > Analytics API about encoding :)
>>> > We found the bug and fixed it yesterday, and the backfill starts today,
>>> > with the cluster recomputing every dataset from 2016-02-23 onward.
>>> > This means you shouldn't query last week's data during this week, first
>>> > because it is incorrect, and second because you'll curse the cluster for
>>> > being too slow :)
>>> >
>>> > We are sorry for the inconvenience.
>>> > Don't hesitate to contact us if you have any questions.
>>> >
>>> >
>>> > --
>>> > Joseph Allemandou
>>> > Data Engineer @ Wikimedia Foundation
>>> > IRC: joal
>>> >
>>> > _______________________________________________
>>> > Engineering mailing list
>>> > Engineering at lists.wikimedia.org
>>> > https://lists.wikimedia.org/mailman/listinfo/engineering
>>> >
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Count Logula
>>> Wikimedia Foundation
>>>
>>
>>
>>
>> --
>> *Joseph Allemandou*
>> Data Engineer @ Wikimedia Foundation
>> IRC: joal
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal