[Labs-l] long running queries

Jaime Crespo jcrespo at wikimedia.org
Sat Sep 23 17:50:05 UTC 2017


Sure, I was assuming at least 50 of the links-from-ns-0 would be on
the 5000 list (I haven't checked, though), making the filtering
effort much faster.
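Something like this (a rough, untested sketch; the column names come
from the MediaWiki schema, but the join is only illustrative) could do
that recount, rechecking ns-0 incoming links only for the ~5000 titles
already cached for Special:WantedPages instead of scanning all of
pagelinks:

SELECT qc_title AS page, COUNT(*) AS ns0_links
FROM querycache
JOIN pagelinks
  ON pl_namespace = 0
  AND pl_title = qc_title
WHERE qc_type = 'Wantedpages'
  AND qc_namespace = 0
  AND pl_from_namespace = 0
GROUP BY qc_title
ORDER BY ns0_links DESC
LIMIT 50;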

Maybe you or someone else can come up with a MediaWiki patch for a new
special page?

On Sat, Sep 23, 2017 at 7:30 PM, John <phoenixoverride at gmail.com> wrote:
> However, there is one key difference between my query and WantedPages. If
> you take your first example, "HaRav_Moshe_Nehemia_Cohenov" is listed with
> almost 20k incoming links, but if you filter for only ns 0 links you will
> discover that it actually has zero incoming links. I was not only
> attempting to get results from ns 0, but also looking to count only links
> originating in ns 0. (Yes, I know some templates can still taint the
> results.) But it will stop templates such as
> https://en.wikipedia.org/wiki/Template:WikiProject_Israel from making the
> report completely useless. Because of the size of most wikiprojects, if
> they add a wanted-pages section to their talk page banner, the validity
> of WantedPages becomes significantly skewed.
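>
> For example, a quick (untested) check of that for a single title:
>
> SELECT COUNT(*) AS ns0_incoming
> FROM pagelinks
> WHERE pl_namespace = 0
>   AND pl_title = 'HaRav_Moshe_Nehemia_Cohenov'
>   AND pl_from_namespace = 0;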
>
> On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <jcrespo at wikimedia.org> wrote:
>>
>> John,
>>
>> Despite your query returning only 50 results, it is unlikely to ever
>> finish :-)
>>
>> Your query is reading (almost) every single row of pagelinks (1
>> billion rows, and the table is by itself not very optimized), joining
>> it with page (35 million rows), and then sorting all the results, all
>> without using a proper index, only to discard everything except the
>> top 50.
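>>
>> (A quick way to confirm this is to prefix the query with EXPLAIN; a
>> sketch of that check:
>>
>> EXPLAIN SELECT pl_title, COUNT(*)
>> FROM pagelinks
>> LEFT JOIN page ON pl_title = page_title
>> WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
>> GROUP BY pl_title
>> HAVING COUNT(*) > 100
>> LIMIT 50;
>>
>> The "type" and "rows" columns of the output make the scans visible.)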
>>
>> I wonder whether you have considered using the cached functionality
>> found at https://en.wikipedia.org/wiki/Special:WantedPages, or whether
>> you absolutely have to run your own query? Despite the wrong
>> overridden MediaWiki messages shown there, those results are updated
>> twice a month. This is the same list as shown there for namespace 0:
>>
>> root@dbstore1001[enwiki]> SELECT qc_title AS page, qc_value AS numlinks
>> FROM querycache
>> WHERE qc_type = 'Wantedpages' AND qc_namespace = 0
>> ORDER BY qc_value DESC LIMIT 50;
>> +--------------------------------------------+----------+
>> | page                                       | numlinks |
>> +--------------------------------------------+----------+
>> | HaRav_Moshe_Nehemia_Cohenov                |    19729 |
>> | HaRav_Meir_Shehenboim_(Etz_Hadar)          |    19728 |
>> | Wrestling_at_the_2010_South_American_Games |    12659 |
>> | Post-structuralist_ballet                  |    12025 |
>> | Robert_MacDonnell                          |    10541 |
>> | Charles_Michael_Lavery_QC                  |    10540 |
>> | Alison_Campbell                            |     9747 |
>> | Shauna_Gunn                                |     9744 |
>> | Sophie_Hoopman                             |     9742 |
>> | District_Policing_Partnerships             |     9741 |
>> | Stanley_Leisure                            |     9740 |
>> | Robert_Hall_(doctor)                       |     9739 |
>> | Hunt_Committee                             |     9738 |
>> | Brian_Baird_(newsreader)                   |     9738 |
>> | Choo_Chiat_Goh                             |     7483 |
>> | Michael_Cook_(ballet)                      |     7480 |
>> | Zoulikha_Naciri                            |     7127 |
>> | Mounir_Charïbi                             |     7127 |
>> | Mohamed_Moatassim                          |     7127 |
>> | Khalil_Hachimi_Idrissi                     |     7127 |
>> | Hassan_Aourid                              |     7127 |
>> | Faycal_Laraichi                            |     7127 |
>> | Campocraspedon                             |     6982 |
>> | Al-Masdar                                  |     6811 |
>> | Mackay_Radio                               |     6809 |
>> | Lebanese_literature                        |     6042 |
>> | Al_Nabad                                   |     6007 |
>> | Badran_Roy_Badran                          |     6006 |
>> | Mohammad_Abdul_Jabbar_Khairy               |     6001 |
>> | Toufic_Awwad                               |     5999 |
>> | Maurice_Chehab                             |     5999 |
>> | Ghazi_Aad                                  |     5999 |
>> | Kawasaki_Vulcan_500_LTD                    |     5724 |
>> | Buddhism_in_Cuba                           |     5706 |
>> | Thomas_steam_bicycle                       |     5698 |
>> | Cementerio_de_Santa_Ifigenia               |     5696 |
>> | Cuban_Liberation_Army                      |     5695 |
>> | Fidel_Castro_Díaz-Balart                   |     5694 |
>> | Consejo_popular_(Cuba)                     |     5694 |
>> | Moto_Guzzi_V7                              |     5687 |
>> | Dirt_Rider                                 |     5676 |
>> | Dick_O'Brien                               |     5673 |
>> | Motus_Motorcycles                          |     5671 |
>> | Greenhorn_Enduro                           |     5671 |
>> | Werner_Motocyclette                        |     5670 |
>> | Vietnam_Veterans_Motorcycle_Club           |     5670 |
>> | Moto_Guzzi_Eldorado                        |     5670 |
>> | Moto_Guzzi_Ambassador                      |     5670 |
>> | Bimota_Tesi                                |     5670 |
>> | Sara_Liberte                               |     5669 |
>> +--------------------------------------------+----------+
>> 50 rows in set (0.00 sec)
>>
>> Last updated 11 days ago:
>>
>> root@dbstore1001[enwiki]> SELECT * FROM querycache_info
>> WHERE qci_type = 'Wantedpages';
>> +-------------+----------------+
>> | qci_type    | qci_timestamp  |
>> +-------------+----------------+
>> | Wantedpages | 20170912074914 |
>> +-------------+----------------+
>> 1 row in set (0.04 sec)
>>
>> This table, querycache, could be replicated to the wiki replicas, but
>> we need someone to help with the sanitization (removing the unwatched
>> pages info).
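>>
>> A hypothetical sketch of what that sanitization could look like (the
>> view definition and the 'Unwatchedpages' qc_type value are
>> assumptions, not a deployed configuration):
>>
>> -- Expose querycache on the replicas minus the unwatched-pages rows.
>> CREATE OR REPLACE VIEW querycache AS
>> SELECT qc_type, qc_value, qc_namespace, qc_title
>> FROM enwiki.querycache
>> WHERE qc_type <> 'Unwatchedpages';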
>>
>> On Sat, Sep 23, 2017 at 2:40 AM, John <phoenixoverride at gmail.com> wrote:
>> > The query I am using in this case is:
>> >
>> > SELECT pl_title, COUNT(*)
>> > FROM pagelinks
>> > LEFT JOIN page ON pl_title = page_title
>> > WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
>> > GROUP BY pl_title
>> > HAVING COUNT(*) > 100
>> > LIMIT 50;
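>> >
>> > A variant that also matches the namespace in the join (so a title
>> > that exists only outside ns 0 still counts as a red link) might look
>> > like this untested sketch:
>> >
>> > SELECT pl_title, COUNT(*)
>> > FROM pagelinks
>> > LEFT JOIN page ON page_title = pl_title AND page_namespace = 0
>> > WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
>> > GROUP BY pl_title
>> > HAVING COUNT(*) > 100
>> > LIMIT 50;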
>> >
>> > I am also running this on enwiki, where the table is probably the
>> > largest. I am attempting to locate red-linked pages by number of
>> > incoming links. I would expect it to take an hour or more to run,
>> > given the size of the table, but instead I am getting the error
>> > message from the phab ticket: ERROR 2006 (HY000): MySQL server has
>> > gone away
>> >
>> > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan
>> > <mviswanathan at wikimedia.org> wrote:
>> >>
>> >> Hi John,
>> >>
>> >> Please provide us with a bit more information so we can help: which
>> >> host are you connecting to, what long-running queries are you trying
>> >> to run, etc. I believe you may be running into some timeout
>> >> mechanisms that we have set up to prevent abuse; see
>> >> https://phabricator.wikimedia.org/T76956#948591.
>> >>
>> >> As always, feel free to reach out to us on #wikimedia-cloud for any
>> >> questions.
>> >>
>> >> Thanks,
>> >>
>> >> On Fri, Sep 22, 2017 at 1:29 PM, John <phoenixoverride at gmail.com>
>> >> wrote:
>> >>>
>> >>> Do we have a way of letting SQL queries run for a day or so? I'm
>> >>> using the interactive sql <host> and running them that way, but the
>> >>> DB host is going away in the middle of the process.
>> >>>
>> >> --
>> >> --Madhu :)
>> >>
>> --
>> Jaime Crespo
>> <http://wikimedia.org>
>>



-- 
Jaime Crespo
<http://wikimedia.org>