-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org
[mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical
Sent: 07 March 2008 18:41
To: Wikimedia developers
Subject: Re: [Wikitech-l] [Commons-l] Category intersection: New extension available
> On Fri, Mar 7, 2008 at 12:15 PM, Ilya Haykinson
> <haykinson@gmail.com> wrote:
>> For what it's worth, the extension
>> http://www.mediawiki.org/wiki/DynamicPageList has been in use on
>> various Wikimedia sites for a while now with great success to allow
>> for category intersections, and I think the latest versions support
>> image galleries etc.
> We know.  DPL is not suitable for use on large wikis.
> On Fri, Mar 7, 2008 at 12:17 PM, Jared Williams
> <jared.williams1@ntlworld.com> wrote:
>> Yeah, did notice that; think it could be replaced with something like:
>>
>> SELECT ci_page FROM {$table_categoryintersections}
>> WHERE ci_hash IN (implode(',', $hashes))
>> GROUP BY ci_page
>> HAVING COUNT(*) = count($hashes)
>> LIMIT $this->max_hash_results
> I'm not going to spend too much time parsing that, but it's an
> automatic filesort of the entire set included by the WHERE clause,
> i.e., the union of all the category intersections in question, since
> MySQL doesn't support loose index scans for WHERE x IN (...) GROUP BY
> y.  A repeated join seems likely to be faster, although maybe not; I
> haven't benchmarked it or anything.
Ah yeah, I forget MySQL specifics.  I wouldn't be surprised if the
optimal style of query changes with the number of categories.
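The two query shapes under discussion can be sketched against a toy
table.  Only the ci_page/ci_hash column names come from the thread;
the table name, schema details, and sample data below are illustrative
assumptions, using SQLite rather than MySQL:

```python
import sqlite3

# Toy stand-in for the extension's category-intersection table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categoryintersections (ci_page INTEGER, ci_hash TEXT)")
rows = [
    (1, "h_art"), (1, "h_paris"),  # page 1 is in both categories
    (2, "h_art"),                  # page 2 is in only one
    (3, "h_art"), (3, "h_paris"),
]
conn.executemany("INSERT INTO categoryintersections VALUES (?, ?)", rows)

hashes = ["h_art", "h_paris"]  # hashes of the categories to intersect

# Form 1: IN (...) GROUP BY ... HAVING, as quoted in the thread.
# Note this assumes (ci_page, ci_hash) rows are unique, otherwise the
# COUNT(*) comparison over-counts.
placeholders = ",".join("?" for _ in hashes)
group_by = conn.execute(
    f"SELECT ci_page FROM categoryintersections "
    f"WHERE ci_hash IN ({placeholders}) "
    f"GROUP BY ci_page HAVING COUNT(*) = ?",
    hashes + [len(hashes)],
).fetchall()

# Form 2: one self-join per requested hash, the shape Simetrical
# suggests may avoid the filesort on MySQL.
joins = " ".join(
    f"JOIN categoryintersections c{i} ON c{i}.ci_page = c0.ci_page "
    f"AND c{i}.ci_hash = ?"
    for i in range(1, len(hashes))
)
repeated_join = conn.execute(
    f"SELECT c0.ci_page FROM categoryintersections c0 {joins} "
    f"WHERE c0.ci_hash = ?",
    hashes[1:] + [hashes[0]],
).fetchall()

print(sorted(group_by))        # pages in both categories: 1 and 3
print(sorted(repeated_join))   # same result via repeated join
```

Both forms return the same pages; which one the optimizer handles
better is exactly the open question above.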
>> Yeah, I think chances of hash collisions are unlikely; what's far
>> more likely is someone recategorizing a page after a search.  Which
>> means the double check could be removed.
> It's not just unlikely, it's so unlikely as to be impossible to all
> intents and purposes, barring deliberately constructed collisions
> (which are possible with MD5, although maybe not for such short
> strings, I forget).  Worry about a meteor wiping out the data center
> before you worry about MD5 collisions by chance on sets with
> cardinality in the billions.
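To put a rough number on "so unlikely as to be impossible": a minimal
birthday-bound sketch, where the 10^10 cardinality ("in the billions")
is an assumed figure and MD5's 128-bit output is the only fact used:

```python
# Birthday approximation: for n random items in a space of size 2**128,
# the chance of any collision is roughly n^2 / (2 * 2**128).
n = 10**10          # assumed cardinality, "in the billions"
p = n**2 / (2 * 2**128)
print(p)            # on the order of 1e-19, i.e. effectively zero
```

Even at ten billion category sets, the chance collision probability is
around 10^-19, which supports the "worry about a meteor first" framing.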
My point was that even if it was remotely likely, it doesn't really
matter: as soon as the data leaves the DB it could be stale in any
case.  So inaccuracies should be expected, making the extra queries
redundant.
Jared