[Engineering] Hadoop - Last week data needs to be backfilled

Joseph Allemandou jallemandou at wikimedia.org
Tue Mar 1 13:26:32 UTC 2016


Hi,

*TL;DR: Please don't use Hive / Spark / Hadoop before next week.*

Last week the Analytics Team performed an upgrade of the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were
launched with a special option to NOT use UTF-8 as the default encoding.
This issue caused trouble, particularly in page title extraction, and was
detected last Sunday. Many kudos to the people who filed bugs on the
Analytics API about encoding :)
We found the bug and fixed it yesterday, and the backfill starts today,
with the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data during this week, first
because it is incorrect, and second because you'll curse the cluster for
being too slow :)
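
For anyone wondering how a default-encoding setting can corrupt page
titles, here is a minimal Python sketch. It is only an illustration: the
title and the ISO-8859-1 charset are assumptions, and the actual option
and charset involved on the cluster are not stated in this message.

```python
# Hypothetical page title; not taken from the affected datasets.
title = "Café_de_Flore"

# Page titles are stored and transmitted as UTF-8 bytes.
raw = title.encode("utf-8")

# A process whose default charset is not UTF-8 (ISO-8859-1 is an
# assumption here) decodes each byte separately, so the two-byte
# sequence for "é" turns into two unrelated characters.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the original title back.
restored = raw.decode("utf-8")

print(garbled)            # the corrupted title
print(restored == title)  # True
```

The stored bytes are fine in this sketch; only the decoding step is
wrong, which is why recomputing with the right encoding repairs the
output.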

We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal