Severe slowness with a particular English Wikipedia API query


This, that and the other

6 Sep 2014

3:52 a.m.

A tool I have written, For the Common Good [1], uses the following type of query to fetch a list of "random" files that users may like to transfer to Commons. The category name may differ but the structure is the same:

https://en.wikipedia.org/w/api.php?format=xml&cmnamespace=6&cmtitle=Category%3ACopy%20to%20Wikimedia%20Commons%20(bot-assessed)&action=query&list=categorymembers&cmsort=timestamp&cmprop=title&cmlimit=500

In 2011 when I was first writing FtCG, this query ran at an acceptable speed. Recently, though, it has become extremely slow, to the point where timeouts are now a regular occurrence. It sometimes takes 4 or 5 tries (and several minutes) before results are returned. From then on, however, it works quickly. If you run this exact query now, there's a good chance it will work quickly because others have been running the query before you.

The cause seems to be the "cmsort=timestamp" portion of the request. If this is removed, it works essentially instantaneously. However, I don't really want the files in alphabetical order, as it doesn't seem very "random".

Four questions:
1. Why does this query take so long?
2. Can anything be done on the server side to make it faster?
3. Why does it take so much longer now than it did in 2011?
4. Is there a better way to fetch a random cross-section of files in a particular category?

TTO

[1] https://en.wikipedia.org/wiki/User:This,_that_and_the_other/For_the_Common_Good


Brad Jorsch (Anomie)

6 Sep

7:58 p.m.


The database query for that is simple enough:

    SELECT /* ApiQueryCategoryMembers::run Anomie */
    cl_from,cl_sortkey,cl_type,page_namespace,page_title,cl_timestamp FROM
    `page`,`categorylinks` FORCE INDEX (cl_timestamp) WHERE cl_to =
    'Copy_to_Wikimedia_Commons_(bot-assessed)' AND (cl_from=page_id) ORDER BY
    cl_timestamp,cl_from LIMIT 501;

And the PHP code doesn't do anything complicated either. Maybe Sean can give us more insight if there's some subtle database thing going on here.

Note, though, that you're not getting anything at all random here; you're always getting the files that have been in the category longest first.

-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
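
Since cmsort=timestamp returns the oldest category members first rather than a random set, one possible workaround (not proposed by anyone in the thread) is to request the default, fast ordering and pick files at random on the client. The following is only a minimal sketch of that idea: the helper names `categorymembers_url` and `pick_random` are hypothetical, the use of format=json is an assumption (the original request used format=xml), and no network request is made.

```python
import random
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def categorymembers_url(category, limit=500, sort=None):
    """Build a list=categorymembers request URL (hypothetical helper).

    The cm* parameters mirror the ones in the original request; cmsort is
    only added when a sort order is passed explicitly.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": 6,  # File: namespace, as in the original query
        "cmprop": "title",
        "cmlimit": limit,
        "format": "json",  # assumption; the thread used format=xml
    }
    if sort:
        params["cmsort"] = sort
    return f"{API}?{urlencode(params)}"


def pick_random(titles, k, seed=None):
    """Return up to k distinct titles chosen at random on the client side."""
    rng = random.Random(seed)
    return rng.sample(titles, min(k, len(titles)))


if __name__ == "__main__":
    print(categorymembers_url("Copy to Wikimedia Commons (bot-assessed)"))
    print(pick_random(["File:A.jpg", "File:B.jpg", "File:C.jpg"], 2))
```

Note that sampling from a single page of 500 results only randomises within that page; a sample of the whole category would need every page to be fetched first.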

Nikolas Everett

9:30 p.m.


I don't know the MySQL syntax super well, but that query implies that it will always use the time index and then iterate in time order looking for a link to the category you want. Unless that category is an appreciable fraction of the total links, I don't imagine that is a good plan.

Nikolas Everett

9:34 p.m.

You might try sending the query to the full-text search backend instead. I'm on mobile or I would whip up an example. I believe the syntax is incategory:"your_category". You might need to use CirrusSearch (srbackend=CirrusSearch) if the category is added by a template. I don't know if it'd be faster, but you can try.
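
The full-text search suggestion in Nikolas's 9:34 p.m. message could look roughly like the request built below. This is a sketch under assumptions rather than a tested recipe: the helper name `incategory_search_url` is hypothetical, format=json and the default limit are illustrative choices, and the srbackend value is taken from the message as written, so whether the API still accepts it is not confirmed here. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def incategory_search_url(category, limit=50, backend=None):
    """Build a list=search request using the incategory: keyword.

    Hypothetical helper; `backend` maps to the srbackend parameter that the
    thread mentions and is omitted unless given.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'incategory:"{category}"',
        "srnamespace": 6,  # File: namespace, matching cmnamespace=6
        "srlimit": limit,
        "format": "json",  # assumption; not specified in the thread
    }
    if backend:
        params["srbackend"] = backend
    return f"{API}?{urlencode(params)}"


if __name__ == "__main__":
    print(incategory_search_url("Copy_to_Wikimedia_Commons_(bot-assessed)"))
```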

Brad Jorsch (Anomie)

7 Sep

12:54 a.m.


On Sat, Sep 6, 2014 at 5:30 PM, Nikolas Everett <neverett(a)wikimedia.org> wrote:

...

The index is on (cl_to,cl_timestamp) (and then implicitly cl_from at the end). See maintenance/tables.sql.

-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
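
To illustrate why a composite index on (cl_to, cl_timestamp) suits this query, here is a small self-contained sketch using SQLite in place of MySQL. It is an analogy only: the table is a cut-down stand-in for categorylinks, the sample rows are invented, cl_from is added to the index explicitly (in MediaWiki it is implicit), and SQLite's INDEXED BY plays the role of MySQL's FORCE INDEX. It does not reproduce or explain the slowness reported in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categorylinks (
    cl_from      INTEGER NOT NULL,
    cl_to        TEXT    NOT NULL,
    cl_timestamp TEXT    NOT NULL,
    PRIMARY KEY (cl_from, cl_to)
);
-- Composite index: category name first, then timestamp, then page id.
CREATE INDEX cl_timestamp ON categorylinks (cl_to, cl_timestamp, cl_from);
""")

# Invented sample data: three links to "Cats", one to "Dogs".
rows = [
    (1, "Cats", "2014-01-03"),
    (2, "Dogs", "2014-01-01"),
    (3, "Cats", "2014-01-01"),
    (4, "Cats", "2014-01-02"),
]
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", rows)

QUERY = """
SELECT cl_from, cl_timestamp
FROM categorylinks INDEXED BY cl_timestamp
WHERE cl_to = ?
ORDER BY cl_timestamp, cl_from
LIMIT 501
"""

# Rows for one category come back oldest-first.
result = conn.execute(QUERY, ("Cats",)).fetchall()

# The fourth column of each EXPLAIN QUERY PLAN row is the plan description.
plan = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("Cats",))
)

print(result)
print(plan)
```

Because the equality test on cl_to uses the leading index column, the matching entries are already stored in (cl_timestamp, cl_from) order, so the engine can seek to the category and read forward without scanning links to other categories.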


mediawiki-api@lists.wikimedia.org
