[Labs-l] long running queries

John phoenixoverride at gmail.com
Sat Sep 23 17:30:58 UTC 2017


However, there is one key difference between my query and WantedPages. Take
your first example: "HaRav_Moshe_Nehemia_Cohenov" is listed with almost 20k
incoming links, but if you filter for links originating in ns 0 only, you
will discover that it actually has zero incoming links. I was not only
restricting results to ns 0 targets but also counting only links that
originate in ns 0. (Yes, I know some templates can still taint the results.)
Still, it stops templates such as
https://en.wikipedia.org/wiki/Template:WikiProject_Israel from making the
report completely useless. Because of the size of most wikiprojects, if they
add a wanted-pages section to their talk page banner, the validity of
WantedPages becomes significantly skewed.
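
For anyone who wants to check a single title, here is a minimal sketch
(assuming the standard pagelinks schema on the replicas; the title is just
the example above):

SELECT COUNT(*) AS incoming_ns0_links
FROM pagelinks
WHERE pl_namespace = 0        -- link target is in ns 0
  AND pl_from_namespace = 0   -- link originates from a ns 0 page
  AND pl_title = 'HaRav_Moshe_Nehemia_Cohenov';

Dropping the pl_from_namespace condition should reproduce the ~20k figure
from the cached report.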

On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <jcrespo at wikimedia.org> wrote:

> John,
>
> Despite your query returning only 50 results, it is unlikely to ever
> finish :-)
>
> Your query is reading (almost) every single row of pagelinks (1
> billion rows), which is by itself not very optimized, joining it with
> page (35 million rows), and then sorting all the results without
> using a proper index, only to discard all of them except the top 50.
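>
> (If you want to see the plan for yourself, a sketch; EXPLAIN is cheap
> and does not actually execute the query:)
>
> EXPLAIN SELECT pl_title, COUNT(*)
> FROM pagelinks LEFT JOIN page ON pl_title = page_title
> WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
> GROUP BY pl_title HAVING COUNT(*) > 100 LIMIT 50;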
>
> I wonder if you have considered using the cached functionality found
> at https://en.wikipedia.org/wiki/Special:WantedPages, unless you
> absolutely have to run your own query. Despite the wrong overridden
> MediaWiki messages shown there, the results are updated twice a
> month. This is the same list as shown there for namespace 0:
>
> root at dbstore1001[enwiki]> SELECT qc_title AS page, qc_value AS numlinks
> FROM querycache WHERE qc_type = 'Wantedpages' AND qc_namespace = 0
> ORDER BY qc_value DESC LIMIT 50;
> +--------------------------------------------+----------+
> | page                                       | numlinks |
> +--------------------------------------------+----------+
> | HaRav_Moshe_Nehemia_Cohenov                |    19729 |
> | HaRav_Meir_Shehenboim_(Etz_Hadar)          |    19728 |
> | Wrestling_at_the_2010_South_American_Games |    12659 |
> | Post-structuralist_ballet                  |    12025 |
> | Robert_MacDonnell                          |    10541 |
> | Charles_Michael_Lavery_QC                  |    10540 |
> | Alison_Campbell                            |     9747 |
> | Shauna_Gunn                                |     9744 |
> | Sophie_Hoopman                             |     9742 |
> | District_Policing_Partnerships             |     9741 |
> | Stanley_Leisure                            |     9740 |
> | Robert_Hall_(doctor)                       |     9739 |
> | Hunt_Committee                             |     9738 |
> | Brian_Baird_(newsreader)                   |     9738 |
> | Choo_Chiat_Goh                             |     7483 |
> | Michael_Cook_(ballet)                      |     7480 |
> | Zoulikha_Naciri                            |     7127 |
> | Mounir_Charïbi                             |     7127 |
> | Mohamed_Moatassim                          |     7127 |
> | Khalil_Hachimi_Idrissi                     |     7127 |
> | Hassan_Aourid                              |     7127 |
> | Faycal_Laraichi                            |     7127 |
> | Campocraspedon                             |     6982 |
> | Al-Masdar                                  |     6811 |
> | Mackay_Radio                               |     6809 |
> | Lebanese_literature                        |     6042 |
> | Al_Nabad                                   |     6007 |
> | Badran_Roy_Badran                          |     6006 |
> | Mohammad_Abdul_Jabbar_Khairy               |     6001 |
> | Toufic_Awwad                               |     5999 |
> | Maurice_Chehab                             |     5999 |
> | Ghazi_Aad                                  |     5999 |
> | Kawasaki_Vulcan_500_LTD                    |     5724 |
> | Buddhism_in_Cuba                           |     5706 |
> | Thomas_steam_bicycle                       |     5698 |
> | Cementerio_de_Santa_Ifigenia               |     5696 |
> | Cuban_Liberation_Army                      |     5695 |
> | Fidel_Castro_Díaz-Balart                   |     5694 |
> | Consejo_popular_(Cuba)                     |     5694 |
> | Moto_Guzzi_V7                              |     5687 |
> | Dirt_Rider                                 |     5676 |
> | Dick_O'Brien                               |     5673 |
> | Motus_Motorcycles                          |     5671 |
> | Greenhorn_Enduro                           |     5671 |
> | Werner_Motocyclette                        |     5670 |
> | Vietnam_Veterans_Motorcycle_Club           |     5670 |
> | Moto_Guzzi_Eldorado                        |     5670 |
> | Moto_Guzzi_Ambassador                      |     5670 |
> | Bimota_Tesi                                |     5670 |
> | Sara_Liberte                               |     5669 |
> +--------------------------------------------+----------+
> 50 rows in set (0.00 sec)
>
> Last updated 11 days ago:
> root at dbstore1001[enwiki]> SELECT * FROM querycache_info WHERE
> qci_type='Wantedpages';
> +-------------+----------------+
> | qci_type    | qci_timestamp  |
> +-------------+----------------+
> | Wantedpages | 20170912074914 |
> +-------------+----------------+
> 1 row in set (0.04 sec)
>
> This table, querycache, could be replicated to the wiki replicas, but
> we need someone to help with the sanitization (removing the
> unwatched-pages info).
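>
> (A minimal sketch of what that sanitization could look like; the view
> name is hypothetical, and it assumes the only sensitive rows are the
> Unwatchedpages ones:)
>
> CREATE VIEW querycache_public AS
>   SELECT qc_type, qc_value, qc_namespace, qc_title
>   FROM querycache
>   WHERE qc_type <> 'Unwatchedpages';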
>
> On Sat, Sep 23, 2017 at 2:40 AM, John <phoenixoverride at gmail.com> wrote:
> > The query I am using in this case is:
> >
> > select pl_title, count(*) from pagelinks left join page
> >   on pl_title = page_title
> >   where pl_namespace = 0 and pl_from_namespace = 0 and page_id IS NULL
> >   group by pl_title having COUNT(*) > 100 LIMIT 50;
> >
> > I am also running this on enwiki, where the table is probably the
> > largest. I am attempting to locate red-linked pages via the number
> > of incoming links. I would expect it to take an hour or more to run,
> > given the size of the table. But I am getting the error message from
> > the phab ticket: ERROR 2006 (HY000): MySQL server has gone away
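> >
> > (Two hedged ideas for working around that, sketched against the
> > standard schema: join on namespace as well as title, so that the
> > (page_namespace, page_title) index can be used and a page with the
> > same title in another namespace does not hide the red link, and
> > batch by title range so each statement stays short:)
> >
> > SELECT pl_title, COUNT(*)
> > FROM pagelinks
> > LEFT JOIN page
> >   ON page_namespace = pl_namespace AND page_title = pl_title
> > WHERE pl_namespace = 0 AND pl_from_namespace = 0
> >   AND page_id IS NULL
> >   AND pl_title >= 'A' AND pl_title < 'B'  -- one batch; loop over ranges
> > GROUP BY pl_title
> > HAVING COUNT(*) > 100;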
> >
> > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan
> > <mviswanathan at wikimedia.org> wrote:
> >>
> >> Hi John,
> >>
> >> Please provide us with a bit more information so we can help - which
> >> host are you connecting to, what long-running queries are you trying
> >> to run, etc. I believe you may be running into some timeout
> >> mechanisms that we have set up to prevent abuse; see
> >> https://phabricator.wikimedia.org/T76956#948591.
> >>
> >> As always, feel free to reach out to us on #wikimedia-cloud for any
> >> questions.
> >>
> >> Thanks,
> >>
> >> On Fri, Sep 22, 2017 at 1:29 PM, John <phoenixoverride at gmail.com> wrote:
> >>>
> >>> Do we have a way of letting SQL queries run for a day or so? I'm
> >>> using the interactive sql <host> wrapper and running them that way,
> >>> but the DB host is going away in the middle of the process.
> >>>
> >>
> >>
> >>
> >> --
> >> --Madhu :)
> >>
> >
> >
>
>
>
> --
> Jaime Crespo
> <http://wikimedia.org>
>

