[Foundation-l] Release of squid log data

Brian Brian.Mingus at colorado.edu
Sat Sep 15 14:29:18 UTC 2007


Wikiresearch-l had a roundtable about this at Wikimania two years ago. We
reached no conclusion. I would love to pipe this data through my quality
classifier, especially combined with the edit histories of the associated
users. But do you realize what kind of a double whammy that is? Not only do
you have their surfing habits, you've got their editing habits. On one of
the largest websites in the world. This data is of unspeakable value not
only to researchers, but to spammers, would-be identity thieves and others.

Although having this data is a wet dream of mine, I find it unconscionable
to release it, and I feel that whoever was responsible for releasing it has
already overstepped their bounds. We already know from the New York Times
analyzing AOL's search logs that persons can be identified from search logs,
and we know from Microsoft's Non-Disclosure Agreements with universities
around the world for portions of the Windows 2000 source code that these
NDAs, even to universities, are not effective in stopping the data from
being leaked.

Now that the data has already been released, it is imperative that the
Foundation create an explicit philosophy about data retention policies and
the circumstances under which user data may be released. I suggest that it
never be released, and that the foundation hire and/or appoint a
statistician for analyzing logs in-house. Perhaps this person can act as a
liaison in certain well-defined situations that do not compromise the
personal information of anyone beyond what is already available in database
dumps. This is the only ethical approach in my opinion.

On 9/15/07, Erik Moeller <erik at wikimedia.org> wrote:
>
> On 9/14/07, Tim Starling <tstarling at wikimedia.org> wrote:
> > For a while now, we've been releasing squid log data, stripped of
> > personally identifying information such as IP addresses, to groups at
> > two universities: Vrije Universiteit and the University of Minnesota. We
> > now have a request pending from a third group, at Universidad Rey Juan
> > Carlos in Spain. They are asking if they can have the full data stream
> > including IP addresses, and they are prepared to sign a confidentiality
> > agreement to get it.
>
> "Wikimedia will not sell or share private information, such as email
> addresses, with third parties, unless you agree to release this
> information, or it is required by law to release the information."
> http://wikimediafoundation.org/wiki/Privacy_policy
>
> Under the current policy I would not support it, even if "private
> information" is somewhat ambiguous: we must err on the side of
> caution.
>
> I might support a research exemption clause in future versions of the
> policy _if_ a compelling case can be made that such an exemption is
> needed, and that no alternative research method would produce results
> of approximately the same quality. So far no such case has been made.
>
> Whatever we do, it is crucial that we make it clear to our users
> through our privacy policy what is going on. In that spirit, I would
> also appreciate it if the privacy policy could be updated to describe
> the existing agreements with universities, and the work that is being
> done on the toolserver.
> --
> Toward Peace, Love & Progress:
> Erik
>
> DISCLAIMER: This message does not represent an official position of
> the Wikimedia Foundation or its Board of Trustees.

