Hi all,
The next Research Showcase, focused on *Rules on Wikipedia*, will be
live-streamed on Wednesday, September 20, at 9:30 AM PDT / 16:30 UTC. Find
your local time here <https://zonestamp.toolforge.org/1695227400>.
YouTube stream: https://youtube.com/live/h89l9JWZBCU?feature=share
As usual, you can join the conversation in the YouTube chat as soon as the
showcase goes live.
This month's presentations:
Variation and overlap in the peer production of community rules: the case
of five Wikipedias
By *Sohyeon Hwang, Northwestern University*
In this talk, I present work analyzing the rules and rule-making on
Wikipedia. The governance of many online communities relies on rules
created by participants. However, work predominantly focuses on efforts
within a single community or on a platform as a whole. Here we investigate
the comparative and relational dimensions of online self-governance in a
set of similar communities by looking at the five largest language editions
of Wikipedia. Using exhaustive trace data spanning almost 20 years since
their founding, we examine patterns in rule-making and overlaps in rule
sets. Our findings show that language editions have similar trajectories of
rule-making activity, replicating and extending a rich body of work that
has focused on English-language Wikipedia alone. We also find that the
language editions have increasingly unique rule sets, even as editing
activity concentrates on rules shared between them. The results suggest
that self-governing communities aligned in key ways may share a common core
of rules and rule-making practices even as they develop and sustain
institutional variations.
Wikipedia Community Policies and Experiential Epistemology: Critical
Information Literacy, Social Justice, and Inclusive Practices
By *Zachary J. McDowell, University of Illinois at Chicago*
Drawing from a meta-analysis of
research on learning outcomes in Wikipedia-based education, this
presentation addresses Wikipedia community policies and practices through
the Framework for Information Literacy in Higher Education from the
Association of College and Research Libraries (ACRL). Wikipedia-based
educational practices, which promote newcomers’ active engagement in the
encyclopedia, have been shown to support experiential learning in critical
information literacy, communication and research outcomes, and social
justice. Exploring the connections between participation in Wikipedia and
transferable skills for information literacy in the context of the current
new media landscape, this presentation grapples with new questions for the
future of information literacies alongside the implications of large
language models (LLMs), systemic biases, and the representation and
inclusion of non-western and indigenous knowledge sources.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
After experiencing some strange behavior re-fetching pageview data, I am
wondering if it is possible that the daily pageview count for an article
could change *after* the data is originally published to the API.
For example, if I fetch the daily pageviews on an article for the date
14-08-23, and then re-fetch the daily pageviews for the same article in the
future, is it expected that the value for 14-08-23 could be different?
Is there a backfill or correction process that can update daily pageview
counts for days that are already available via the API?
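A minimal sketch of the comparison described above, assuming the per-article endpoint of the Wikimedia REST API and a response whose `items` entries carry `timestamp` and `views` fields. The helper names, the example article, and the view counts are hypothetical. The snippet makes no network request; it only builds the request URL and diffs two responses that were already fetched.

```python
from urllib.parse import quote

# Base path of the per-article pageviews endpoint (assumed for this sketch).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user"):
    """Build a daily per-article request URL; start/end are YYYYMMDD strings."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/daily/{start}/{end}"


def changed_days(old_items, new_items):
    """Return {timestamp: (old_views, new_views)} for days whose count differs."""
    old = {item["timestamp"]: item["views"] for item in old_items}
    new = {item["timestamp"]: item["views"] for item in new_items}
    shared = sorted(old.keys() & new.keys())
    return {ts: (old[ts], new[ts]) for ts in shared if old[ts] != new[ts]}


# Hypothetical snapshots of the same day, fetched at two different times.
first_fetch = [{"timestamp": "2023081400", "views": 100}]
later_fetch = [{"timestamp": "2023081400", "views": 105}]
print(changed_days(first_fetch, later_fetch))
```

Storing the raw `items` list from each fetch alongside the fetch time makes it possible to tell a later correction apart from a client-side error.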
Any information is appreciated!
Thanks,
Duncan
--
Duncan Grubbs
Software Engineer
he/him
E: duncan(a)predata.com <first.lastname(a)predata.com>
Time Zone: ET (UTC-5/-4)
predata.com <https://www.predata.com/>
Hello,
We have to perform some scheduled maintenance of the analytics client
servers <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients>,
which are named stat100[4-9].
This maintenance requires a reboot of each server, so I'm planning to do
this next Monday the 14th of August at approximately 09:00 UTC.
I'll reboot each of the five servers in numeric sequence and I expect
the work to take no more than 1 hour in total.
Please do let me know if this will adversely impact your work and I will
try my best to work around your requirements.
I'll send another announcement nearer the time, as a reminder to save
any work that you may have in progress on these servers.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello all users of Airflow,
We need to perform some scheduled maintenance on all of our Airflow
instances, so I'm scheduling a maintenance window for tomorrow at 08:30
UTC and I expect the work to take no more than 30 minutes. The work
involves a reboot of the shared PostgreSQL database that serves all of
our instances, as well as a reboot of some instances themselves.
I will pause all active DAGs on all Airflow instances prior to the work,
allow some time for running tasks to complete, then un-pause the DAGs
afterwards.
Naturally, you are also free to pause your own DAGs prior to the
maintenance and un-pause them afterwards, should you wish to minimize
the risk of disruption.
Please do let me know if there is anything specific that you would like
me to check, either before or after this maintenance.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Dear members of the Analytics Team,
I am currently conducting research about the excludability of free
knowledge available on the Wikimedia projects as an example of a public
good. In order to calibrate the model, I need aggregate data on the page
views and edits by country and language.
After having carefully read Research:Data
<https://meta.wikimedia.org/wiki/Research:Data>, I was only able to find
data on page views by country and language, which would be enough to
calibrate the demand side of my model. So, is it possible to get aggregate
data on edits by country and language, which are similar to those on page
views available at WikiStats?
Thanks in advance.
Best regards,
Kiril Simeonovski
Hello everyone,
The next Research Showcase, focused on *Improving knowledge integrity in
Wikimedia projects*, will be live-streamed Wednesday, July 19, at 9:30 AM
PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1689784256>.
The event is on the WMF Staff Calendar.
YouTube stream: https://youtube.com/live/_8DevIsi44s?feature=share
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Assessment of Reference Quality on Wikipedia
By *Aitolkyn Baigutanova, KAIST*
In this talk, I will present our research on the reliability of Wikipedia
through the lens of its references. I will primarily discuss our paper on
the longitudinal assessment of reference quality on English Wikipedia,
where we operationalize the notion of reference quality by defining
reference need (RN), i.e., the percentage of sentences missing a citation,
and reference risk (RR), i.e., the proportion of non-authoritative
references. I will share our research findings on two key aspects: (1) the
evolution of reference quality over a 10-year period and (2) factors that
affect reference quality. We discover that the RN score has dropped by 20
percentage points, with more than half of verifiable statements now
accompanied by references. The RR score has remained below 1% over the years
as a result of the efforts of the community to eliminate unreliable
references. As an extension of this work, we explore how community
initiatives, such as the perennial source list, help with maintaining
reference quality across multiple language editions of Wikipedia. We hope
our work encourages more active discussions within Wikipedia communities to
improve reference quality of the content.
- Paper: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper,
Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023.
Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings
of the ACM Web Conference 2023 (WWW '23). Association for Computing
Machinery, New York, NY, USA, 2831–2839.
<https://dl.acm.org/doi/abs/10.1145/3543507.3583218>
Multilingual approaches to support knowledge integrity in Wikipedia
By *Diego Saez-Trumper & Pablo Aragón, Wikimedia Foundation*
Knowledge integrity in Wikipedia is key to ensuring the quality and
reliability of information. For that reason, editors devote a substantial
amount of their time to patrolling tasks in order to detect low-quality or
misleading content. In
this talk we will cover recent multilingual approaches to support knowledge
integrity. First, we will present a novel design of a system aimed at
assisting the Wikipedia communities in addressing vandalism. This system
was built by collecting a massive dataset of multiple languages and then
applying advanced filtering and feature engineering techniques, including
multilingual masked language modeling to build the training dataset from
human-generated data. Second, we will showcase the Wikipedia Knowledge
Integrity Risk Observatory, a dashboard that relies on a language-agnostic
version of the former system to monitor high risk content in hundreds of
Wikipedia language editions. We will conclude with a discussion of
different challenges to be addressed in future work.
- Papers:
- Trokhymovych, M., Aslam, M., Chou, A. J., Baeza-Yates, R., & Saez-Trumper,
D. (2023). Fair multilingual vandalism detection system for Wikipedia.
arXiv e-prints, arXiv-2306. https://arxiv.org/pdf/2306.01650.pdf
- Aragón, P., & Sáez-Trumper, D. (2021). A preliminary approach to knowledge
integrity risk assessment in Wikipedia projects. arXiv preprint
arXiv:2106.15940.
Best,
Kinneret
--
Kinneret Gordon
Senior Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th
2023. Its replacement, Spark 3, is already available and all of our
production data pipelines have been migrated successfully to this new
version
<https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Sp…>.
We have also assisted in the migration of several other teams’ Spark 2
pipelines to Spark 3, but there may still be other Spark 2 jobs that are
configured in code outside of our control.
We encourage you, therefore, to review any Spark jobs that
you run, to verify that they have been upgraded to work with Spark 3. In
most cases, this will mean checking that the command-line interfaces for
Spark
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#…>
use one of the supported forms, such as spark3-submit or pyspark3. In some
cases this may also mean upgrading your conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migratin…>
on the stats servers from anaconda-wmf to conda-analytics, if you have not
already done so.
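As a rough aid for that review, the following sketch scans the text of a script for Spark 2 launcher names. The legacy names in the pattern (spark2-submit, pyspark2, spark2-shell) are an assumption made for illustration and may not match every form in use; the function name is hypothetical.

```python
import re

# Legacy launcher names are assumed for illustration; the supported
# forms named in this announcement are spark3-submit and pyspark3.
LEGACY = re.compile(r"\b(spark2-submit|pyspark2|spark2-shell)\b")


def find_spark2_calls(text):
    """Return (line_number, line) pairs that mention a legacy Spark 2 launcher."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if LEGACY.search(line)
    ]


# Hypothetical script contents used only to demonstrate the scan.
example_script = (
    "#!/bin/bash\n"
    "spark2-submit job.py\n"
    "spark3-submit other.py\n"
    "pyspark2 --master yarn\n"
)
for number, line in find_spark2_calls(example_script):
    print(f"line {number}: {line}")
```

A match only indicates a line worth checking by hand; jobs that configure Spark through other means would not be found this way.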
The specific change that is scheduled to happen on July 5th is a switch
of the Spark shuffler version used by YARN
<https://phabricator.wikimedia.org/T332765> from 2 to 3. This should
bring significant performance benefits for existing spark3 jobs, but it
is more than likely that any spark2 jobs attempting to use this new
shuffler will fail.
Please do reach out
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact> to the
Data Engineering team if you have any queries or concerns about this
change, or would like help in identifying whether or not you are likely
to be affected.
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of *Wikimedia and LGBTQIA+*,
will be live-streamed Wednesday, June 21 at 16:30 UTC. Find your local time
here <https://zonestamp.toolforge.org/1687365012>.
YouTube stream: https://www.youtube.com/watch?v=AOD2ZdxRNfo
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
- *Multilingual Contextual Affective Analysis of LGBT People Portrayals
in Wikipedia*
- *Speaker*: Chan Park, Carnegie Mellon University
- *Abstract*: In this talk, I present our research on analyzing the
portrayal of LGBT individuals in their biographies on Wikipedia, with a
particular focus on subtle word connotations and cross-cultural
comparisons. We aim to address two primary research questions: 1) How can
we effectively measure the nuanced connotations of words in multilingual
texts, which reflect sentiments, power dynamics, and agency? 2) How can we
analyze the portrayal of a specific group, such as the LGBT community, and
compare these portrayals across different languages? To answer these
questions, we collect the Multilingual Contextualized Connotation Frames
dataset, comprising 2,700 examples in English, Spanish, and Russian. We
also develop a new multilingual model based on pre-trained multilingual
language models. Additionally, we devise a matching algorithm to construct
a comparison corpus for the target corpus, isolating the attribute of
interest. Finally, we showcase how our developed models and constructed
corpora enable us to conduct cross-cultural analysis of LGBT People
Portrayals on Wikipedia. Our results reveal systematic differences in how
the LGBT community is portrayed across languages, surfacing cultural
differences in narratives and signs of social biases.
- *Paper:* Park, C. Y., Yan, X., Field, A., & Tsvetkov, Y. (2021,
May). Multilingual contextual affective analysis of LGBT people portrayals
in Wikipedia. In Proceedings of the International AAAI Conference on Web
and Social Media (Vol. 15, pp. 479-490).
<https://arxiv.org/pdf/2010.10820.pdf>
- *Visual gender biases in Wikipedia: A systematic evaluation across the
ten most spoken languages*
- *Speaker*: Daniele Metilli, University College London
- *Abstract*: Wikidata Gender Diversity (WiGeDi) is a one-year
project funded through the Wikimedia Research Fund. The project is studying
gender diversity in Wikidata, focusing on marginalized gender identities
such as those of trans and non-binary people, and adopting a queer and
intersectional feminist perspective. The project is organised in three
strands — model, data, and community. First, we are looking at how the
current Wikidata ontology model represents gender, and the extent to which
this representation is inclusive of marginalized gender identities. Second,
we are analysing the data stored in the knowledge base to gather insights and
identify possible gaps and biases. Finally, we are looking at how the
community has handled the move towards the inclusion of a wider spectrum of
gender identities by studying a corpus of user discussions through
computational linguistics methods. This presentation will report on the
current status of the Wikidata Gender Diversity project and the envisioned
outcomes. We will discuss the main challenges that we are facing and the
opportunities that our project will potentially enable, on Wikidata and
beyond.
- *Paper:* Metilli D. & Paolini C. (in press). ‘Non-binary gender
representation in Wikidata’. In: Provo A., Burlingame K. & Watson B.M.
Ethics in Linked Data. Litwin Books. <https://wigedi.com/chapter.pdf>
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
--
*Pablo Aragón (he/him)*
Research Scientist
Wikimedia Foundation
https://research.wikimedia.org
Hi all,
It seems like the Wikimedia AQS Pageviews API isn't returning data for
yesterday (2023-06-19). Is there any update on when that data will
be available?
Thanks,
Ben
Hello,
We need to schedule a reboot of the servers that provide copies of the
MediaWiki databases for analytics purposes.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB
These are the servers: dbstore1003, dbstore1005, and dbstore1007.
I'm intending to carry out this work at 09:30 UTC next Tuesday the 9th
of May. I will restart all three servers in succession, so I expect the
maintenance to be complete within approximately 30 minutes.
Please note that the Wiki Replica databases are not affected by this
maintenance: https://wikitech.wikimedia.org/wiki/Wiki_Replicas
Please do let me know if you have any queries or if this choice of
maintenance window is likely to cause you any inconvenience.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>