[Engineering] Introducing the HDFS Trash directory in the Analytics Hadoop cluster

Luca Toscano ltoscano at wikimedia.org
Thu Apr 12 08:37:46 UTC 2018


Hi everybody,

in T189051 the Analytics team introduced a new feature in the Hadoop
cluster, namely the HDFS Trash directory. This means that if you now use
the hdfs dfs -rm CLI command you will not delete a file or a directory
right away; you will move it under /user/$yourusername/.Trash instead. The
Trash directory is "partitioned" into daily directories (called
checkpoints), and files are kept there for a month before being deleted
for good. Here's a quick FAQ about how to recover data if needed:

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#recover_files_deleted_by_mistake_using_the_hdfs_CLI_rm_command
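
As a rough illustration of what recovery can look like (a sketch, not
copied from the FAQ above): it assumes the usual Hadoop layout, in which
files deleted since the last checkpoint sit under .Trash/Current followed
by their original absolute path. The helper names trash_path and
restore_from_trash are made up for this example; older deletions live
under a timestamped checkpoint directory instead of Current, so list
/user/$yourusername/.Trash first if in doubt.

```shell
# Hypothetical helper: print where a freshly deleted HDFS path is expected
# to land. Assumes the standard .Trash/Current/<original absolute path>
# layout; already checkpointed data uses a timestamped directory instead.
# $1 = username, $2 = original absolute HDFS path
trash_path() {
    printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}

# Hypothetical helper: move a path back from the Trash to where it was.
# Only meaningful on a host where the hdfs CLI is installed and the parent
# directory of the original path still exists.
restore_from_trash() {
    hdfs dfs -mv "$(trash_path "$1" "$2")" "$2"
}
```

For example, after an accidental "hdfs dfs -rm -r /user/$USER/tmp/old_output"
(a made-up path), running "hdfs dfs -ls /user/$USER/.Trash/Current" shows what
is currently recoverable, and "restore_from_trash $USER /user/$USER/tmp/old_output"
would move the directory back.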

If you want to skip the Trash directory you can pass the -skipTrash option
to hdfs dfs -rm, but of course this should be done only when you are really
sure about what you are doing :)

We hope that this extra safety net will help all Hadoop users preserve
data that might otherwise get deleted by mistake.

If you have comments or suggestions, feel free to reach out to the
Analytics team via the mailing list or via IRC.

Thanks!

Luca (on behalf of the Analytics team)