Re: [Wikitech-l] Category intersection: New extension available

7 Mar 2008

On Fri, Mar 7, 2008 at 2:48 PM, Simetrical
<Simetrical+wikilist(a)gmail.com> wrote:
...
  On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske
  <magnusmanske(a)googlemail.com> wrote:
    I tried it on my (mostly empty) MediaWiki test setup, and it works
    peachy. However, *I NEED HELP* with
    * testing it on a large-scale installation
    * integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
    * Brionizing the code, so it actually has a chance to be used on
      Wikipedia and/or Commons
  I would help out, but I don't think there's any reason to settle for a
  sharply limited number of intersections, which I guess this approach
  requires. 
Well, unless you have a better idea...?
Yes, there's fulltext, but is that really the most efficient way? It
certainly isn't the most elegant ;-)

...
    * More than two intersections are implemented by nesting subqueries
  Subqueries only work in MySQL 4.1 and later.  You'll need to rewrite those as
  joins if you want this to run on Wikimedia, or probably to perform
  acceptably on any version of MySQL (MySQL is pretty terrible even in
  5.0 at optimizing subqueries).  And then we're back to the poor join
  performance that was an issue to start with, just with one join less,
  aren't we? 
Huh. Remind me again, why are we stuck at 4.0.26?

Even so, we can do fast A|B intersections.
Also, we can do 4 intersections with A|B and C|D, which is "half the joins".

For more intersections, I see three ways out:
* Move to MySQL 4.1 :-)
* Request the complete list for every A|B combination, and merge in PHP.
  Too much memory usage?
* Store "higher order" hashes as well - A|B|C, A|B|C|D... Waste of disk space?
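
To make the A|B plus C|D idea and the "merge in the application" option concrete, here is a minimal sketch. It is written in Python rather than PHP, works on an in-memory dictionary instead of a database table, and uses the sorted pair itself as the key where the extension would store a hash. The names build_pair_index and intersect are hypothetical and not taken from the extension.

```python
from itertools import combinations


def build_pair_index(page_cats):
    """Map each unordered category pair to the set of pages carrying both.

    page_cats: dict of page id -> set of category names (illustrative data
    layout, not the extension's schema).
    """
    index = {}
    for page, cats in page_cats.items():
        for pair in combinations(sorted(cats), 2):
            index.setdefault(pair, set()).add(page)
    return index


def intersect(index, categories):
    """Intersect N categories using pair lookups: A|B, C|D, ..."""
    cats = sorted(categories)
    if len(cats) < 2:
        raise ValueError("need at least two categories")
    # Pair up neighbours, so four categories need only two lookups.
    pairs = [tuple(cats[i:i + 2]) for i in range(0, len(cats) - 1, 2)]
    # With an odd count, pair the leftover category with the first one.
    if len(cats) % 2:
        pairs.append((cats[0], cats[-1]))
    result = None
    for pair in pairs:
        pages = index.get(pair, set())
        # Merging the per-pair lists is the application-side step.
        result = pages if result is None else result & pages
    return result
```

Each pair lookup returns a full page list, so the memory concern raised above applies directly: the merge holds every per-pair result set before intersecting.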

...
    * Hash values are implemented as VARCHAR(32). Could easily switch to
      INTEGER if desirable (less storage, faster lookup, but more false
      positives)
  BIGINT would give a trivial number of false positives.  INT would
  probably be a bit faster, especially on 32-bit machines, and while it
  would inevitably give some false positives, those should be rare
  enough to be easily filtered on the application side, if you don't
  have to run extra queries to do the filtering. 
I've switched to INT UNSIGNED, based on the first 8 hex chars of the MD5.
Also, I've made the second check optional, turned off by default.
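
As an illustration of the hashing scheme, the following sketch derives a 32-bit value from the first 8 hex characters of an MD5 digest and shows what an optional second check could look like. It is in Python, not the extension's PHP. The key format (sorted names joined with "|") and the function names pair_hash and confirm are assumptions made for this example.

```python
import hashlib


def pair_hash(cat_a, cat_b):
    """Hypothetical helper: 32-bit hash for an unordered category pair."""
    # Assumption: names are sorted so that A|B and B|A give the same key.
    first, second = sorted((cat_a, cat_b))
    key = "{}|{}".format(first, second).encode("utf-8")
    # 8 hex characters are 32 bits, which fits an INT UNSIGNED column.
    return int(hashlib.md5(key).hexdigest()[:8], 16)


def confirm(page_categories, wanted):
    """Optional second check: drop pages matched only by a hash collision."""
    return set(wanted) <= set(page_categories)
```

With the second check turned off, a page whose pair happens to collide with the queried hash would be returned as a false positive; confirm() filters those out at the cost of needing each page's real category list.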

Do you think we can rescue this?
Even if not, it has most of the infrastructure needed for other
approaches, e.g., fulltext.

Cheers,
Magnus
