On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
On May 20, 2014, at 10:09 PM, Sean Pringle <springle@wikimedia.org> wrote:

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90-day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many rows with timestamps older than 90 days still in the db, apparently largely untouched by queries?

Perhaps we're in a better position now to do this properly, given that the data now lives in multiple places: log files, database, Hadoop...

Can we please purge stuff? :-)

BR
Sean

Hi Sean, 

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below).

All, I wanted to hear your thoughts informally (before posting to the lists) on two ideas that have been floating around recently:

1) add support for optional sampling in EventLogging via JSON schemas (given the sheer number of teams who have asked for it). See https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
Not to hijack the thread, but: to do this in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient.
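
To make that concrete, the whole helper could be as small as the sketch below. (Python for illustration only; the real helpers would be JavaScript and PHP, and every name here is made up.)

    import random

    def in_sample(factor):
        """Return True for roughly 1 in `factor` calls (hypothetical helper).

        Keeping the rate here, next to the instrumentation code, makes
        changing it a one-line code change rather than a schema change.
        """
        if factor <= 1:
            return True
        return random.randrange(factor) == 0

    # Usage sketch: log roughly 1 in 100 events.
    event = {'action': 'save'}  # example event payload
    if in_sample(100):
        print('would log:', event)  # stand-in for the real logging call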

2) introduce 90-day pruning by default for all logs (adding a dedicated schema element to override the default).
Same problem. To illustrate: suppose we're two months into a data collection job. The researcher carelessly forgot to modify the pruning policy, so it's set to the default 90 days, whereas the data is needed for 180. At this point our options are:

1) Decline to help, even though there's a full month before the pruning kicks in.
2) Somehow alter the schema revision without creating a new revision. EventLogging assumes that schema revisions are immutable and it exploits this property to provide guarantees about data validity and consistency, so this is a nonstarter.
3) Create a new schema revision that declares a 180-day expiration and then populate its table with a copy of each event logged under the previous schema (roughly the backfill sketched below).
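
(For the record, option 3 would amount to roughly the sketch below. EventLogging stores each schema revision in its own SQL table, so with the event fields unchanged the backfill is a single table-to-table insert. Host, database, and table names are all made up.)

    import pymysql  # assumption: plain MySQL access to the log database

    conn = pymysql.connect(host='db.example.org', db='log')
    with conn.cursor() as cur:
        # MySchema_111 / MySchema_222 stand in for the old and new
        # schema_revision tables; identical columns are assumed.
        cur.execute('INSERT INTO MySchema_222 SELECT * FROM MySchema_111')
    conn.commit()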

The motivation behind your proposal is (I think) a desire to have a unified configuration interface for data collection jobs. This makes total sense and it's worth pursuing. I just don't think we should stuff everything into the schema. The schema is just that: a schema. It's a data model.



This would make the customers responsible for ensuring that the right data is collected and retained.

I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL data stored in SQL). Obviously, we would need to audit existing logs to make sure that we don’t discard data that needs to be retained in a sanitized or aggregate form past 90 days.


Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs.
 
For this reason, I am proposing that we enable 90-day pruning by default for new schemas, with the ability to override the default.

Sounds good to me. I figure that the overrides would be specified as configuration values for the script that does the actual pruning. We could Puppetize that and document the process for adding exemptions.
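
A rough sketch of how the pieces could fit together, assuming per-schema overrides in a config dict and MediaWiki-style 14-digit timestamps in the log tables (all names and values below are hypothetical):

    import datetime

    import pymysql  # assumption: plain MySQL access to the log database

    DEFAULT_RETENTION_DAYS = 90

    # Per-schema overrides, maintained in Puppet; None means "never prune".
    RETENTION_OVERRIDES = {
        'MyLongStudy': 180,      # hypothetical study that needs 180 days
        'MyExemptSchema': None,  # hypothetical schema exempt from pruning
    }

    def prune(cur, table, schema):
        """Delete rows older than the schema's retention window."""
        days = RETENTION_OVERRIDES.get(schema, DEFAULT_RETENTION_DAYS)
        if days is None:
            return  # exempt from pruning
        cutoff = (datetime.datetime.utcnow()
                  - datetime.timedelta(days=days)).strftime('%Y%m%d%H%M%S')
        # Table names come from a fixed, trusted list, hence the interpolation.
        cur.execute('DELETE FROM {} WHERE timestamp < %s'.format(table),
                    (cutoff,))

    conn = pymysql.connect(host='db.example.org', db='log')
    with conn.cursor() as cur:
        prune(cur, 'MyLongStudy_1234', 'MyLongStudy')
    conn.commit()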
 
Existing schemas would need to be audited on a case-by-case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.