After several weeks of delay, I finally had a chance to submit three Phab
tasks on this topic. I have at least ten more queries that I could add
to the list, but I don't want to overwhelm people, so I will wait for those
three to be resolved first.
Thanks!
Huji
On Tue, Apr 14, 2020 at 8:53 AM Huji Lee <huji.huji(a)gmail.com> wrote:
I completely appreciate the points you are making, Bryan and Jaime. And I
would very much enjoy "dealing with you" if we end up going the "Cloud VPS
project" route! If anything, I keep learning new things from you all.
Let's start where you suggested. I will create Phab tickets on which I
will seek advice about how to optimize those super-slow queries (if at all
possible).
Thank you for your attention!
On Tue, Apr 14, 2020 at 8:49 AM Huji Lee <huji.huji(a)gmail.com> wrote:
> Yes. If you go to the source of any of those pages, there is a hidden HTML
> element (an <!-- --> comment) that contains the SQL code for that report.
>
> Here is one example: [1]
>
> [1]
>
https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8…
>
> On Mon, Apr 13, 2020 at 7:19 PM MusikAnimal <musikanimal(a)gmail.com>
> wrote:
>
>> Is the source code public? Maybe the queries could be improved. I ran
>> into many such issues too after the actor migration, but after taking
>> advantage of specialized views[0] and join decomposition (get just the
>> actor IDs, i.e. rev_actor, then the actor_names in a separate query), my
>> tools are seemingly as fast as they were before.
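The join decomposition described above can be sketched in miniature with SQLite stand-ins for the revision/actor tables (column names follow the MediaWiki schema, but the tables, data, and queries here are invented purely for illustration):

```python
import sqlite3

# Toy stand-ins for the MediaWiki revision/actor tables; the data is made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_actor INTEGER);
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
    INSERT INTO revision VALUES (1, 10), (2, 11), (3, 10);
    INSERT INTO actor VALUES (10, 'Alice'), (11, 'Bob');
""")

# Query 1: pull only the actor IDs out of the (potentially huge) revision table.
actor_ids = [r[0] for r in cur.execute("SELECT DISTINCT rev_actor FROM revision")]

# Query 2: resolve the names in a separate, cheap lookup, instead of JOINing
# the actor table into the first query.
placeholders = ",".join("?" * len(actor_ids))
names = dict(cur.execute(
    f"SELECT actor_id, actor_name FROM actor WHERE actor_id IN ({placeholders})",
    actor_ids,
))
print(sorted(names.values()))  # prints ['Alice', 'Bob']
```

Splitting the JOIN this way keeps the expensive scan confined to the revision table and turns the name resolution into a small IN-list lookup.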
>>
>> ~ MA
>>
>> [0]
>>
https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_…
>>
>> On Mon, Apr 13, 2020 at 5:03 PM Huji Lee <huji.huji(a)gmail.com> wrote:
>>
>>> I understand. However, I think the use case we are looking at is fairly
>>> uncommon. I also think the indexes we need may not be desirable for all
>>> the Wiki Replicas (they would often be multi-column indexes geared toward
>>> a specific set of queries), and I honestly don't want to go through the
>>> several weeks (months?) of discussion needed to justify them.
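As a toy illustration of the kind of multi-column, query-specific index being discussed, sketched with SQLite (the table is a trimmed stand-in for the MediaWiki revision table; the index and column choices are hypothetical, not an actual fawiki proposal):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,"
            " rev_actor INTEGER, rev_timestamp TEXT)")

# A multi-column index geared toward one specific query pattern:
# "all revisions by a given actor, ordered by time".
cur.execute("CREATE INDEX rev_actor_timestamp"
            " ON revision (rev_actor, rev_timestamp)")

# The query plan confirms the index serves both the WHERE clause and the
# ORDER BY, so no full scan or separate sort step is needed.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT rev_id FROM revision"
    " WHERE rev_actor = ? ORDER BY rev_timestamp", (10,)
).fetchall()
print(plan[0][3])
```

An index like this is great for the one report it serves, which is exactly why it may be hard to justify replica-wide.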
>>>
>>> Note that if we open the "more indexes on Wiki Replicas" can of worms,
>>> this would suddenly become an all-wiki discussion. I'm not sure there
>>> are more than a handful of wikis that do the level of page-level and
>>> user-level analytics that fawiki does, which means that for most wikis
>>> (and for most Wiki Replica databases) those additional indexes may not
>>> even be justified.
>>>
>>> Even if we were to generalize parts of this approach and bring it to the
>>> Wiki Replicas, I would still argue that doing it on a smaller scale (one
>>> wiki DB for now) would be a reasonable starting point, no?
>>>
>>> On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd808(a)wikimedia.org>
>>> wrote:
>>>
>>>> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.huji(a)gmail.com> wrote:
>>>> >
>>>> > One possible solution is to create a script that is scheduled to
>>>> > run once a month; the script would download the latest dump of the wiki
>>>> > database,[3] load it into MySQL/MariaDB, create some additional indexes
>>>> > that would make our desired queries run faster, and generate the reports
>>>> > using this database. A separate script can then purge the data a few
>>>> > days later.
>>>>
>>>> If I am understanding your proposal correctly, I think the main
>>>> difference from the current Wiki Replicas would be "create some
>>>> additional indexes that would make our desired queries run faster". We
>>>> do have some indexes and views in the Wiki Replicas that are
>>>> specifically designed to make common things faster today. If possible,
>>>> adding to these rather than building a one-off process for moving lots
>>>> of data around for your tool would be nice.
>>>>
>>>> I say this not because what you are proposing is a ridiculous solution,
>>>> but because it is a unique solution for your current problem that will
>>>> not help others who are having similar problems. Having one tool use
>>>> ToolsDB or a custom Cloud VPS project like this is possible, but having
>>>> 100 tools try to follow that pattern themselves is not.
>>>>
>>>> > Out of an abundance of caution, I thought I should ask for permission
>>>> > now, rather than forgiveness later. Do we have a process for getting
>>>> > approval for projects that require gigabytes of storage and hours of
>>>> > computation, or is what I proposed not even remotely considered a
>>>> > "large" project, meaning I am being overly cautious?
>>>>
>>>> <https://phabricator.wikimedia.org/project/view/2875/>
>>>>
>>>> Bryan
>>>> --
>>>> Bryan Davis Technical Engagement Wikimedia Foundation
>>>> Principal Software Engineer Boise, ID USA
>>>> [[m:User:BDavis_(WMF)]] irc: bd808
>>>>
>>>> _______________________________________________
>>>> Wikimedia Cloud Services mailing list
>>>> Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
>>>> https://lists.wikimedia.org/mailman/listinfo/cloud