Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th,
2023. Its replacement, Spark 3, is already available, and all of our
production data pipelines have been migrated successfully to this new
version
<https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Sp…>.
We have also assisted in the migration of several other teams’ Spark 2
pipelines to Spark 3, but there may still be other Spark 2 jobs that are
configured in code outside of our control.
We encourage you, therefore, to review any Spark jobs that you run and
to verify that they have been upgraded to work with Spark 3. In most
cases, this will mean checking that the command-line interfaces for
Spark
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#…> use
one of the supported forms, such as spark3-submit or pyspark3. In some
cases this may also mean upgrading your conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migratin…> on
the stats servers from anaconda-wmf to conda-analytics, if you have not
already done so.
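As a rough aid for the review described above, the following Python sketch scans the text of a job script or crontab for command names that still mention Spark 2. It is only an illustration: the assumption that legacy commands contain "spark2" in their name, the helper find_legacy_spark_commands, and the example script (including the spark2-submit line) are all hypothetical and are not taken from the announcement.

```python
import re

# Assumption (not from the announcement): legacy commands can be recognised
# because their name contains "spark2", e.g. a hypothetical "spark2-submit".
LEGACY_PATTERN = re.compile(r"[\w-]*spark2[\w-]*", re.IGNORECASE)


def find_legacy_spark_commands(text):
    """Return (line_number, token) pairs for every token that mentions spark2."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PATTERN.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Hypothetical job script used purely for illustration.
example_script = """\
#!/bin/bash
spark2-submit --master yarn my_job.py
spark3-submit --master yarn my_other_job.py
"""

for line_number, token in find_legacy_spark_commands(example_script):
    print(f"line {line_number}: {token}")
```

A match only indicates a line worth checking by hand; the sketch cannot tell whether a job has actually been tested against Spark 3.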
The specific change that is scheduled to happen on July 5th is a switch
of the Spark shuffler version used by YARN
<https://phabricator.wikimedia.org/T332765> from 2 to 3. This should
bring significant performance benefits for existing Spark 3 jobs, but
any Spark 2 jobs attempting to use this new shuffler will very likely
fail.
Please do reach out
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact> to the
Data Engineering team if you have any queries or concerns about this
change, or would like help in identifying whether or not you are likely
to be affected.
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of *Wikimedia and LGBTQIA+*,
will be live-streamed Wednesday, June 21 at 16:30 UTC. Find your local time
here <https://zonestamp.toolforge.org/1687365012>.
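For readers who prefer to convert the start time themselves, here is a minimal Python sketch. It assumes that the number at the end of the zonestamp link is a Unix timestamp in seconds; the helper name zonestamp_to_utc is hypothetical.

```python
from datetime import datetime, timezone


def zonestamp_to_utc(url):
    """Read the last path segment of the link as a Unix timestamp (assumption)."""
    seconds = int(url.rstrip("/").rsplit("/", 1)[-1])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = zonestamp_to_utc("https://zonestamp.toolforge.org/1687365012")
print(start.isoformat())  # 2023-06-21T16:30:12+00:00
# The same instant, expressed in the time zone configured on this machine.
print(start.astimezone().isoformat())
```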
YouTube stream: https://www.youtube.com/watch?v=AOD2ZdxRNfo
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
- *Multilingual Contextual Affective Analysis of LGBT People Portrayals
  in Wikipedia*
  - *Speaker*: Chan Park, Carnegie Mellon University
  - *Abstract*: In this talk, I present our research on analyzing the
    portrayal of LGBT individuals in their biographies on Wikipedia, with a
    particular focus on subtle word connotations and cross-cultural
    comparisons. We aim to address two primary research questions: 1) How
    can we effectively measure the nuanced connotations of words in
    multilingual texts, which reflect sentiments, power dynamics, and
    agency? 2) How can we analyze the portrayal of a specific group, such
    as the LGBT community, and compare these portrayals across different
    languages? To answer these questions, we collect the Multilingual
    Contextualized Connotation Frames dataset, comprising 2,700 examples in
    English, Spanish, and Russian. We also develop a new multilingual model
    based on pre-trained multilingual language models. Additionally, we
    devise a matching algorithm to construct a comparison corpus for the
    target corpus, isolating the attribute of interest. Finally, we
    showcase how our developed models and constructed corpora enable us to
    conduct cross-cultural analysis of LGBT People Portrayals on Wikipedia.
    Our results reveal systematic differences in how the LGBT community is
    portrayed across languages, surfacing cultural differences in
    narratives and signs of social biases.
  - *Paper:* Park, C. Y., Yan, X., Field, A., & Tsvetkov, Y. (2021, May).
    Multilingual contextual affective analysis of LGBT people portrayals
    in Wikipedia. In Proceedings of the International AAAI Conference on
    Web and Social Media (Vol. 15, pp. 479-490).
    <https://arxiv.org/pdf/2010.10820.pdf>
- *Visual gender biases in Wikipedia: A systematic evaluation across the
  ten most spoken languages*
  - *Speaker*: Daniele Metilli, University College London
  - *Abstract*: Wikidata Gender Diversity (WiGeDi) is a one-year project
    funded through the Wikimedia Research Fund. The project is studying
    gender diversity in Wikidata, focusing on marginalized gender
    identities such as those of trans and non-binary people, and adopting a
    queer and intersectional feminist perspective. The project is organised
    in three strands: model, data, and community. First, we are looking at
    how the current Wikidata ontology model represents gender, and the
    extent to which this representation is inclusive of marginalized gender
    identities. Second, we are analysing the data stored in the knowledge
    base to gather insights and identify possible gaps and biases. Finally,
    we are looking at how the community has handled the move towards the
    inclusion of a wider spectrum of gender identities by studying a corpus
    of user discussions through computational linguistics methods. This
    presentation will report on the current status of the Wikidata Gender
    Diversity project and the envisioned outcomes. We will discuss the main
    challenges that we are facing and the opportunities that our project
    will potentially enable, on Wikidata and beyond.
  - *Paper:* Metilli D. & Paolini C. (in press). ‘Non-binary gender
    representation in Wikidata’. In: Provo A., Burlingame K. & Watson B.M.
    Ethics in Linked Data. Litwin Books. <https://wigedi.com/chapter.pdf>
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
--
*Pablo Aragón (he/him)*
Research Scientist
Wikimedia Foundation
https://research.wikimedia.org
Hi all,
It seems like the Wikimedia AQS Pageviews API isn't returning data for
yesterday (2023-06-19). Is there any update on when that data will
be available?
Thanks,
Ben
Hello,
We need to schedule a reboot of the servers that provide copies of the
MediaWiki databases for analytics purposes.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB
These are the servers: dbstore1003, dbstore1005, and dbstore1007.
I'm intending to carry out this work at 09:30 UTC next Tuesday, the 9th
of May. I will restart all three servers in succession, so I expect the
maintenance to be complete within approximately 30 minutes.
Please note that the Wiki Replica databases are not affected by this
maintenance: https://wikitech.wikimedia.org/wiki/Wiki_Replicas
Please do let me know if you have any queries or if this choice of
maintenance window is likely to cause you any inconvenience.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of Images on Wikipedia, will be
live-streamed Wednesday, April 19, at 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1681921857>.
YouTube stream: https://www.youtube.com/watch?v=vW0waU-QArU
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
A large scale study of reader interactions with images on Wikipedia
By *Daniele Rama, University of Turin*
Wikipedia is the largest source of free encyclopedic knowledge and one of
the most visited sites on the Web. To increase reader understanding of the
article, Wikipedia editors add images within the text of the article’s
body. However, despite their widespread usage on web platforms and the huge
volume of visual content on Wikipedia, little is known about the importance
of images in the context of free knowledge environments. To bridge this
gap, we collect data about English Wikipedia reader interactions with
images during one month and perform the first large-scale analysis of how
interactions with images happen on Wikipedia. First, we quantify the
overall engagement with images, finding that one in 29 pageviews results in
a click on at least one image, one order of magnitude higher than
interactions with other types of article content. Second, we study what
factors associate with image engagement and observe that clicks on images
occur more often in shorter articles, in articles about visual arts or
transport, and in biographies of less well-known people. Third, we look at
interactions with Wikipedia article previews and find that images help
support readers' information needs when navigating through the site,
especially for more popular pages. The findings in this study deepen our
understanding of the role of images for free knowledge and provide a guide
for Wikipedia editors and web user communities to enrich the world’s
largest source of encyclopedic knowledge.
- Paper:
https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-021-0…
Visual gender biases in Wikipedia: A systematic evaluation across the ten
most spoken languages
By *Pablo Beytía, Catholic University of Chile*
The existing research suggests a significant gender gap in Wikipedia
biographical articles, with a minimal representation of women and gender
asymmetries in the textual content. However, the visual aspects of this gap
(e.g., image volume and quality) have received little attention. This study
examined asymmetries between women's and men's biographies, exploring
written and visual content across the ten most widely spoken languages. The
cross-lingual analysis reveals that (1) the most salient male biases appear
when editors select which personalities should have a Wikipedia page, (2)
the trends in written and visual content are dissimilar, (3) male
biographies tend to have more images across languages, and (4) female
biographies have better visual quality on average. The open database of
this study provides eight indicators of gender asymmetries in ten
occupational domains and ten languages. That information allows for a
granular view of gender biases, as well as exploring more macroscopic
phenomena, such as the similarity between Wikipedia versions according to
their gender bias structures.
- Papers:
Beytía, P., Agarwal, P., Redi, M., & Singh, V. K. (2022). Visual Gender
Biases in Wikipedia: A Systematic Evaluation across the Ten Most Spoken
Languages. Proceedings of the International AAAI Conference on Web and
Social Media, 16(1), 43-54. https://doi.org/10.1609/icwsm.v16i1.19271
https://ojs.aaai.org/index.php/ICWSM/article/view/19271
Beytía, P. & Wagner, C. (2022). Visibility layers: a framework for
systematizing the gender gap in Wikipedia content. Internet Policy Review,
11(1). https://doi.org/10.14763/2022.1.1621
https://policyreview.info/articles/analysis/visibility-layers-framework-sys…
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
Emily
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello,
Apologies for the short notice. The SRE team will be carrying out an
upgrade of the switches in eqiad row D later today
(https://phabricator.wikimedia.org/T333377) at approximately 14:00 UTC.
The network outage to this row resulting from this work is expected to
be around 30 minutes, all being well.
In support of this work, the Data Engineering team will be putting the
HDFS file system into safe mode at approximately 13:30 UTC today, which
means that write operations to the cluster will be refused.
Jobs sent to the YARN cluster will also be refused from around the same
time, so please try to plan any work that you may have for the cluster
to avoid this maintenance window.
Read-only access to Hive, Presto, Superset, and Turnilo should continue
to function normally throughout the maintenance window.
Finally, two of the stats servers (stat1005 and stat1006) will be
unavailable, so please save any work that you may have on these servers
before the loss of connectivity.
Please do reach out via any of the normal channels (email:
analytics(a)lists.wikimedia.org, IRC: #wikimedia-analytics, Slack:
#data-engineering) if you have any queries or concerns.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
Tomorrow the SRE team will be carrying out an upgrade of the switches in
eqiad row B (https://phabricator.wikimedia.org/T330165) at 14:00 UTC.
The network outage to this row resulting from this work is expected to
be around 30 minutes, all being well.
In support of this work, the Data Engineering team will be putting the
HDFS file system into safe mode at approximately 13:30 UTC tomorrow,
which means that write operations to the cluster will be refused.
Jobs sent to the YARN cluster will also be refused from around the same
time, so please try to plan any work that you may have for the cluster
to avoid this maintenance window.
Some additional internal-facing services for analytics such as Hive,
Superset, Presto, and the Druid-analytics cluster will also be largely
unavailable for some periods while the switch upgrade takes place.
The public-facing Analytics Query Service (AQS) will continue to
function, albeit with a degraded response to some queries. However,
Wikistats (stats.wikimedia.org) will be unavailable whilst the switch
upgrade is in progress.
Finally, two of the stats servers, stat1007 and stat1009, will be
unavailable, so please save any work that you may have on these servers
before the loss of connectivity.
Please do reach out via any of the normal channels (email:
analytics(a)lists.wikimedia.org, IRC: #wikimedia-analytics, Slack:
#data-engineering) if you have any queries or concerns.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi Folks,
We encourage each and every one of you to create a program submission. You can submit an interactive workshop or panel, a lecture, a short lightning talk, or a poster for our dedicated poster session. Submissions can be for onsite, online (live or pre-recorded), or hybrid sessions. We would love to see submissions from all over the world, and this year there is an 'Open Data' track for projects relating to Linked Open Data.
The theme for this year's Wikimania is Diversity, Collaboration, Future. We would like to see topics that strengthen collaboration on Open Data, including Data Analytics, this year.
Session submissions for Wikimania 2023 are open until 28 March.
Visit the following links for further info:
Wiki page: https://wikimania.wikimedia.org/wiki/2023:Program/Submissions
Diff post: https://diff.wikimedia.org/2023/02/28/be-part-of-the-wikimania-2023-program/
Program Submission Form: https://pretalx.com/wm2023/cfp
Kind regards,
Butch Bustria
Chair, Program Subcommittee
Event lead, ESEAP Wikimania 2023 Core Organizing Team
Hi friends,
Just a quick note that the Wikigrowth site has been updated to include wiki page creation data from 2021 and 2022.
https://francisco.dance/wikigrowth/
Sorry for the two-year hiatus. Any suggestions to improve the tool and make it more useful (or even merge it with a current site) are always welcome.
Much love,
Fran Dans
Hi all,
The next Research Showcase, focused on Gender and Equity on Wikipedia, will
be live-streamed Wednesday, March 15, at 9:30 AM PST / 16:30 UTC. Find your
local time here <https://zonestamp.toolforge.org/1678897840>.
YouTube stream: https://www.youtube.com/watch?v=lw4MzJgDIzo
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia
By *Jiao Sun, University of Southern California*
Human activities can be
seen as sequences of events, which are crucial to understanding societies.
Disproportional event distribution for different demographic groups can
manifest and amplify social stereotypes, and potentially jeopardize the
ability of members in some groups to pursue certain goals. In this paper,
we present the first event-centric study of gender biases in a Wikipedia
corpus. To facilitate the study, we curate a corpus of career and personal
life descriptions with demographic information consisting of 7,854
fragments from 10,412 celebrities. Then we detect events with a
state-of-the-art event detection model, calibrate the results using
strategically generated templates, and extract events that have asymmetric
associations with genders. Our study discovers that the Wikipedia pages
tend to intermingle personal life events with professional events for
females but not for males, which calls for the awareness of the Wikipedia
community to formalize guidelines and train the editors to mind the
implicit biases that contributors carry. Our work also lays the foundation
for future works on quantifying and discovering event biases at the corpus
level.
- Paper: Sun, J. & Peng, N. (2021). Men Are Elected, Women Are Married:
Events Gender Bias on Wikipedia. Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics and the 11th International
Conference on Natural Language Processing, 350-360.
<https://aclanthology.org/2021.acl-short.45.pdf>
Twitter reacts to absence of women on Wikipedia: a mixed-methods analysis
of #VisibleWikiWomen campaign
By *Sneh Gupta, Guru Gobind Singh Indraprastha University*
Digital gender divide (DGD) is visible in access, participation,
representation, and biases against women embedded in Wikipedia, the largest
digital reservoir of co-created content. This article examined the content
of #VisibleWikiWomen, a global digital advocacy campaign aimed at
encouraging inclusion of women voices in the global technology conversation
and improving digital sustainability of feminist data on Wikipedia. In a
mixed-methods study, Sentiment Analysis followed by a Feminist Critical
Discourse Analysis of the campaign tweets reveals how digital gender divide
manifested in the public response. An overwhelming majority of tweets
expressed positive sentiment towards the objective of the campaign. An
inductive reading of the coded tweets (n = 1067) generated five themes:
Feminist Activism, Invisibility & Marginalization of Women, Technology for
Women Empowerment, Gendered Knowledge Inequity, and Power Dynamics in the
Digital Sphere. Twitter discourse presented many agitated digital users
calling out the epistemic injustice on Wikipedia that goes beyond the
invisibility of women. Their tweets reveal that they want an equal social
platform inclusive of women of color and varied identities currently absent
in the Wikipedia universe. Extracting ideas, values, and themes from new
media campaigns holds unparalleled potential in the diffusion of
interventions and messages on a larger scale.
- Paper: Gupta, S., & Trehan, K. (2022). Twitter reacts to absence of
women on Wikipedia: a mixed-methods analysis of #VisibleWikiWomen campaign.
Media Asia, 49(2), 130-154.
<https://www.researchgate.net/publication/356909618_Twitter_reacts_to_absenc…>
Warm regards,
Emily
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation