On Fri, Sep 17, 2021 at 3:03 PM Cristina Gava via Analytics <analytics@lists.wikimedia.org> wrote:
Hi Jaime,

Thank you so much for the thorough reply :) All the references are super useful and I'll go through them now. I'll start with Toolforge, since it seems there is consensus on it being the most appropriate tool, and leave the dumps for later if needed.
I'll keep you posted.

It will depend a lot on the type of research needed. For example (to play devil's advocate with a simple case), if you wanted to count the total number of words written in Wikipedia and observe their frequency, which means reading all edits in the history, dumps would be a much better option, as the wikireplicas only give access to metadata, not the actual content. On top of that, reading all edits sequentially will be much faster from a downloaded bundle, while on the live MariaDB database access is faster for small portions with specific conditions, or for small to medium ranges.
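
To make the dump side of that example concrete, here is a minimal sketch of a sequential word count over a MediaWiki-style XML export. Treat it as an illustration under assumptions rather than a recommended pipeline: the function name count_words, the tiny inline SAMPLE document, and the simple \w+ tokenizer are all made up for this example, and a real dump would be a large compressed file rather than an in-memory string.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tokenizer for illustration: any run of word characters is a "word".
WORD_RE = re.compile(r"\w+")


def count_words(xml_stream):
    """Stream an XML export and count the words found in every <text> element."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # Tags may carry a namespace such as "{...}text"; compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            counts.update(w.lower() for w in WORD_RE.findall(elem.text or ""))
            elem.clear()  # drop the revision text once it has been counted
        elif local_name == "page":
            elem.clear()  # release the finished page's children
    return counts


# Tiny stand-in for a dump file (assumed shape: pages containing revisions with text).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>A</title>
    <revision><text>Hello world</text></revision>
    <revision><text>Hello again world</text></revision>
  </page>
</mediawiki>"""

counts = count_words(io.BytesIO(SAMPLE))
print(counts.most_common(3))
```

For a bz2-compressed file, the same function could in principle be fed a stream opened with bz2.open(path, "rb") from the standard library, which keeps the read sequential and avoids loading everything into memory.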

In general, I think starting with the wikireplicas and moving to the dumps later, if you find they are not working for you, is a totally reasonable decision, as it will require less investment in your local setup.

--
Jaime Crespo
<http://wikimedia.org>