Mike's suggestion is good. You would likely get better responses by asking the Wikimedia developers, so I am forwarding this to that list.
Risker
On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <wikimedia-l(a)lists.wikimedia.org> wrote:
Hello everyone,
It is my first time interacting on this mailing list, so I will be happy to receive feedback on how to better interact with the community :)
I am trying to access Wikipedia metadata in a streaming, time- and resource-sustainable manner. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, the list of editors, page views, etc.
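To make this concrete, the kind of per-article signal I have in mind looks roughly like this (a minimal sketch against the public Wikimedia pageviews REST API; the article title, date range, and User-Agent contact are placeholders of mine):

    import requests

    # Wikimedia REST API: daily pageviews for one article.
    URL = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia/all-access/all-agents/{article}/daily/{start}/{end}"
    )

    def daily_pageviews(article, start, end):
        # The API asks clients to identify themselves via User-Agent.
        headers = {"User-Agent": "my-research-script/0.1 (contact: me@example.org)"}
        resp = requests.get(URL.format(article=article, start=start, end=end),
                            headers=headers, timeout=30)
        resp.raise_for_status()
        return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

    print(daily_pageviews("Rome", "20210901", "20210915"))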
I would like to do this for an online-classifier type of setup: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.
I tried using the MediaWiki API; however, it is expensive in time and resources, both for me and for Wikipedia.
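For reference, my current approach looks roughly like this (a sketch using the Action API; the point is that it takes at least one request per page, which is what becomes expensive over a big number of pages):

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    HEADERS = {"User-Agent": "my-research-script/0.1 (contact: me@example.org)"}

    def recent_revisions(title, limit=50):
        # One request per page: timestamp, user, and size of the latest edits.
        params = {
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvprop": "timestamp|user|size",
            "rvlimit": limit,
            "format": "json",
        }
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        page = next(iter(resp.json()["query"]["pages"].values()))
        return page.get("revisions", [])

    for rev in recent_revisions("Rome")[:5]:
        print(rev["timestamp"], rev["user"], rev["size"])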
My preferred option now would be to query the relevant tables in the Wikipedia database directly, the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without having to depend on a user interface like Quarry. Do you think this is possible? I am still fairly new to all of this and I don't know exactly what the best direction is.
I saw [1] that I could access the wiki replicas through both Toolforge and PAWS; however, I didn't understand which one would serve me better. Could I ask you for some feedback?
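From what I read in the wiki-replicas documentation, a standalone script running on Toolforge should be able to talk to the replicas directly, roughly like this (a sketch, assuming the documented analytics replica hostname and the auto-generated ~/replica.my.cnf credentials file; the query itself is only illustrative, and a real one would need to be similarly selective to stay fast):

    import pymysql

    # On Toolforge, credentials live in ~/replica.my.cnf and the English
    # Wikipedia replica is reachable under this documented hostname.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",
        charset="utf8mb4",
    )

    # The same kind of query I would otherwise run in Quarry:
    # edit counts for a handful of article pages.
    SQL = """
    SELECT page_title, COUNT(*) AS edits
    FROM revision JOIN page ON rev_page = page_id
    WHERE page_namespace = 0 AND page_title IN ('Rome', 'Milan')
    GROUP BY page_title
    """

    with conn.cursor() as cur:
        cur.execute(SQL)
        for title, edits in cur.fetchall():
            print(title.decode("utf8"), edits)
    conn.close()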
Also, as far as I understood [2], directly accessing the DB through Hive is too technical for what I need, right? It seems that I would need an account with production shell access, and I honestly don't think I would be granted one. In any case, I am not interested in accessing sensitive or private data.
The last resort would be parsing the analytics dumps; however, this seems a less organic way of retrieving and cleaning the data. It would also be heavily decentralised and tied to a physical machine, unless I uploaded the cleaned data somewhere each time.
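For completeness, the dump route I mean would look roughly like this (a sketch against the hourly pageview dumps at https://dumps.wikimedia.org/other/pageviews/; the exact file name below is a placeholder):

    import gzip
    import io
    import urllib.request

    # One hourly pageview dump; the date/hour in the name is a placeholder.
    URL = ("https://dumps.wikimedia.org/other/pageviews/2021/2021-09/"
           "pageviews-20210916-140000.gz")

    wanted = {"Rome", "Milan"}
    raw = urllib.request.urlopen(URL).read()

    # Each line is: project page_title view_count bytes_transferred,
    # where project code "en" means desktop en.wikipedia.
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == "en" and parts[1] in wanted:
                print(parts[1], parts[2])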
Sorry for the long message, but I thought it was better to give you a clearer picture (hoping this is clear enough). Any hints would be highly appreciated.
Best,
Cristina
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake