This month, our research showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2016> hosts
Andrei Rizoiu (Australian National University) to talk about his work
<http://cm.cecs.anu.edu.au/post/wikiprivacy/> on *how private traits of
Wikipedia editors can be exposed from public data* (such as edit histories)
using off-the-shelf machine learning techniques (abstract below).
If you're interested in learning what the combination of machine learning
and public data means for privacy and surveillance, come and join us
this *Wednesday,
March 16* at *1pm Pacific Time*.
The event will be recorded and publicly streamed
<https://www.youtube.com/watch?v=Xle0oOFCNnk>. As usual, we will be hosting
the conversation with the speaker and Q&A on the #wikimedia-research
channel on IRC.
Looking forward to seeing you there,
Dario
Evolution of Privacy Loss in Wikipedia

The cumulative effect of collective
online participation has an important and adverse impact on individual
privacy. As an online system evolves over time, new digital traces of
individual behavior may uncover previously hidden statistical links between
an individual’s past actions and her private traits. To quantify this
effect, we analyze the evolution of individual privacy loss by studying the
edit history of Wikipedia over 13 years, including more than 117,523
different users performing 188,805,088 edits. We trace each Wikipedia
contributor using apparently harmless features, such as the number of edits
performed on predefined broad categories in a given time period (e.g.
Mathematics, Culture or Nature). We show that even at this unspecific level
of behavior description, it is possible to use off-the-shelf machine
learning algorithms to uncover usually undisclosed personal traits, such as
gender, religion or education. We provide empirical evidence that the
prediction accuracy for almost all private traits consistently improves
over time. Surprisingly, the prediction performance for users who stopped
editing after a given time still improves. The activities performed by new
users seem to have contributed more to this effect than additional
activities from existing (but still active) users. Insights from this work
should help users, system designers, and policy makers understand and make
long-term design choices in online content creation systems.
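The setup described in the abstract can be illustrated with a small sketch. Everything below is synthetic and hypothetical (made-up category weights and a trivial threshold rule standing in for the paper's actual features and off-the-shelf classifiers); it only shows the shape of the task: predicting an undisclosed binary trait from per-category edit counts.

```python
# Hypothetical illustration of the abstract's setup: predict a private
# binary trait from per-category edit counts. Synthetic data and a
# trivial threshold rule stand in for the paper's real features/models.
import random

random.seed(0)
CATEGORIES = ["Mathematics", "Culture", "Nature"]

def make_editor(trait):
    # Editors with trait=True are biased toward Mathematics edits
    # (an invented correlation, purely for illustration).
    weights = [10, 1, 1] if trait else [1, 5, 5]
    counts = [random.randint(0, 10 * w) for w in weights]
    return counts, trait

data = [make_editor(random.random() < 0.5) for _ in range(200)]

def predict(counts):
    # Toy "classifier": guess trait=True when Mathematics edits dominate.
    return counts[0] > counts[1] + counts[2]

accuracy = sum(predict(c) == t for c, t in data) / len(data)
```

Because the synthetic trait leaks into the category distribution, even this trivial rule recovers it well above chance, which is the core privacy point the abstract makes.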
*Dario Taraborelli* Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
Hi folks!
Due to a kernel upgrade for a security fix, we need to reboot each node of
the Hadoop cluster. The task will start later today and will be
done in small batches to avoid causing major delays to outstanding jobs.
Please contact me if you notice any major issues (elukey or the
#wikimedia-analytics channel on Freenode).
Thanks!
Regards,
Luca
Hi,
here's a heads-up for people with long-running queries/jobs:
I need to reboot stat1002 and stat1003 for a kernel update tomorrow
morning (16th of March) at 9am UTC. Please ping me if that should
happen to be a really bad time and we can possibly reschedule.
Cheers,
Moritz
Hi,
I'm forwarding this email here in the hope of gathering more feedback
and exploring whether EventLogging could be the right choice for
gathering the data (the NDA for access does not look good :P)
Thanks,
Strainu
---------- Forwarded message ----------
From: Strainu <strainu10(a)gmail.com>
Date: 2016-03-11 14:44 GMT+02:00
Subject: Usage of correct diacritics in readers of Romanian Wikipedia
To: mobile-l(a)lists.wikimedia.org
Hi,
I have proposed a new research project about the support for correct
diacritics in the readers of the Romanian Wikipedia [1]. The plan I
made is (probably) limited to the desktop site, but Adam suggested
there might be some overlap with the work you are doing around
emerging communities. So, if someone is interested in extending the
study to mobile users or you have any feedback on the project, please
leave a message on the talk page or contact me by email.
Thanks,
Strainu
P.S. Please keep me in the CC for any responses, as I don't get emails
from mobile-l.
[1] https://meta.wikimedia.org/wiki/Research:Usage_of_correct_diacritics_in_rea…
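For context on what "correct diacritics" refers to here (my own illustration, not part of Strainu's proposal): Romanian text is frequently typed with legacy cedilla characters (ş, ţ) in place of the correct comma-below ones (ș, ț). A minimal sketch of normalizing the legacy forms:

```python
# Illustrative sketch (not from the proposal): map the legacy Romanian
# cedilla characters to the correct comma-below characters.
CEDILLA_TO_COMMA = {
    "\u015E": "\u0218",  # S-cedilla -> S-comma
    "\u015F": "\u0219",  # s-cedilla -> s-comma
    "\u0162": "\u021A",  # T-cedilla -> T-comma
    "\u0163": "\u021B",  # t-cedilla -> t-comma
}

def normalize_diacritics(text: str) -> str:
    """Replace legacy cedilla forms with comma-below forms."""
    return text.translate(str.maketrans(CEDILLA_TO_COMMA))

# "paşi şi ţară" (legacy) -> "pași și țară" (correct)
fixed = normalize_diacritics("pa\u015Fi \u015Fi \u0163ar\u0103")
```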
Hi all,
The Analytics team, in an effort to collect less sensitive data, plans to
drop the clientIp field from the EventCapsule
(https://meta.wikimedia.org/wiki/Schema:EventCapsule), which is the wrapper
for all events flowing into EventLogging (currently IPs and user agents get
purged after the 90-day mark). The field was originally meant only for
debugging, but has served some research use cases. Most of these cases have
been wrapped up at this point. It has also been used as a proxy to count
the number of devices visiting sites like our blog, and since IPs are not a
good measure of that anyway, we plan to move such cases to Piwik.
The rollout of the change will happen in stages (drop clientIp first on
the EL end, then in the EventCapsule on meta, and finally on the VarnishKafka
end). It should be a clean deployment and there's no scheduled downtime;
EL will keep working as is. What does change? clientIp will start being
set to NULL in your MySQL tables. If you update an EventLogging schema you
maintain (causing new tables to be created), the new tables will not have
the clientIp field in them. The change is planned to be rolled out the week
of March 11th or 18th, 2016, pending the completion of data collection for
the ongoing QuickSurveys-based research work.
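As a rough sketch of what the first rollout stage means for consumers (my illustration; the schema name and field values below are made up, only the clientIp field name comes from the EventCapsule):

```python
# Hypothetical sketch of the EL-side change: clientIp is nulled out
# before events are written, so existing table shapes keep working
# while the value itself disappears.
def strip_client_ip(capsule: dict) -> dict:
    """Return a copy of an EventCapsule-style dict with clientIp set to None."""
    cleaned = dict(capsule)
    if "clientIp" in cleaned:
        cleaned["clientIp"] = None  # becomes NULL in the MySQL tables
    return cleaned

# Example capsule (schema name and payload are invented for illustration).
event = {
    "schema": "SomeSurveySchema",
    "clientIp": "198.51.100.7",
    "event": {"response": "yes"},
}
cleaned = strip_client_ip(event)
```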
Let us know if you have any questions/concerns on the list or on
#wikimedia-analytics. The related phab ticket is here -
https://phabricator.wikimedia.org/T128407.
Thanks,
Madhu Viswanathan
Software Engineer, Analytics
Hello Analytics Team,
We would like to have one-time access to wmf.webrequest data. What is
the correct way of accessing the data?
In our research group, we want to simulate the requests for a specific
version of Wikimedia.
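For what it's worth, once access is sorted out, queries against wmf.webrequest are normally restricted to specific partitions so a one-time job doesn't scan the whole table. A sketch of building such a query (the column names and the exact partition layout here are my assumptions, not a documented contract):

```python
# Sketch (column and partition names are assumptions) of building a
# partition-pruned HiveQL query against wmf.webrequest, so a one-time
# job reads only the hour it needs instead of the whole table.
def build_webrequest_query(year: int, month: int, day: int, hour: int,
                           uri_host: str = "en.wikipedia.org") -> str:
    return (
        "SELECT dt, uri_host, uri_path\n"
        "FROM wmf.webrequest\n"
        f"WHERE year = {year} AND month = {month}\n"
        f"  AND day = {day} AND hour = {hour}\n"
        f"  AND uri_host = '{uri_host}'"
    )

query = build_webrequest_query(2016, 3, 15, 12)
```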
Thanks,
Michal Bystricky
Hi,
*TL,DR: Please don't use hive / spark / hadoop before next week.*
Last week the Analytics team performed an upgrade to the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were
launched with a special option to NOT use UTF-8 as the default encoding.
This issue caused trouble particularly in page title extraction and was
detected last Sunday (many kudos to the people who filed bugs about
encoding against the Analytics API :)
We found the bug and fixed it yesterday, and backfilling starts today, with
the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data during this week: first
because it is incorrect, and second because you'll curse the cluster for
being too slow :)
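A tiny illustration (mine, not the actual cluster code) of the failure mode: when a process decodes UTF-8 page titles using a non-UTF-8 default encoding, the titles come out as mojibake.

```python
# Illustration (not the actual cluster code): a UTF-8 encoded page
# title decoded with a non-UTF-8 default encoding turns into mojibake.
title_bytes = "Espa\u00F1a".encode("utf-8")  # b'Espa\xc3\xb1a'

correct = title_bytes.decode("utf-8")    # 'España'
mangled = title_bytes.decode("latin-1")  # each UTF-8 byte read as one char
```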
We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Fallback is: cable up the old 1Gb NIC (Chris has done this and set up the
port), PXE install on that, and move to the 10Gb NIC once we're back up.
Gross, but it gets the job done.
Set for tomorrow (Friday), 1 to 4 pm UTC; this time should be much smoother.
Same caveats apply as before.
Ariel
On Wed, Mar 2, 2016 at 8:47 PM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
> PXE boot from non-embedded nic failed spectacularly despite our best
> efforts. This means we'll have to schedule another window once we have
> something new to try. I apologize for the extra inconvenience. All services
> are back exactly the way they were.
>
> Ariel
>
> On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
> wrote:
>
>> Extending this downtime window because we ran into unexpected issues with
>> PXE boot.
>>
>> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
>> wrote:
>>
>>> Dataset1001, the host which serves dumps and other datasets to the
>>> public, as well as providing access to various datasets directly on
>>> stats100x, will be unavailable tomorrow for an upgrade to jessie. While I
>>> don't expect to need nearly 3 hours for the upgrade, better safe than
>>> sorry. In the meantime all files will be accessible via
>>> ms1001.wikimedia.org via the web, and all dumps and page view files
>>> from our mirrors as well.
>>>
>>> Thanks for your understanding.
>>>
>>> Ariel Glenn
>>>
>>>
>>>
>>
>
cc-ing Analytics list and Ariel who maintains dumps.
On Wed, Mar 2, 2016 at 8:31 AM, Gonzalo Diaz <gonzalo.diaz(a)cs.ox.ac.uk>
wrote:
> Dear Nuria Ruiz,
>
> My name is Gonzalo Diaz, and I am a PhD student of Computer Science at the
> University of Oxford. You can see my profile here:
> https://www.cs.ox.ac.uk/people/gonzalo.diaz/
>
> I am writing because I am currently working on a research project which
> would benefit from processing Wikipedia pagecount files.
>
> On Monday, 29 February 2016, we began downloading pagecount files from
> http://dumps.wikimedia.org/other/pagecounts-raw/. For the next 48 hours
> we managed to download ~15 months of raw pagecount files, using 3 different
> computers, and 3 instances of "wget" on each computer (for a total of 9
> concurrent downloads at any given moment).
>
> Since this morning, however, we have no longer been able to download the
> pagecount files. Furthermore, the site dumps.wikimedia.org seems to be down.
>
> Hopefully, our downloads are not responsible for this. If they are,
> however, we would like to apologise for the inconvenience.
>
> In any case, we would like to request permission to continue downloading
> the raw pagecount files, as soon as the site is back online.
>
> I thank you very much for your time!
>
> Kindest regards,
> Gonzalo Diaz
> John Mittermeier