-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org
[mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical
Sent: 07 March 2008 18:41
To: Wikimedia developers
Subject: Re: [Wikitech-l] [Commons-l] Category intersection: New extension available
> On Fri, Mar 7, 2008 at 12:15 PM, Ilya Haykinson
> <haykinson@gmail.com> wrote:
>> For what it's worth, the extension
>> http://www.mediawiki.org/wiki/DynamicPageList has been in use on
>> various Wikimedia sites for a while now with great success to allow
>> for category intersections, and I think the latest versions support
>> image galleries etc.
> We know.  DPL is not suitable for use on large wikis.
> On Fri, Mar 7, 2008 at 12:17 PM, Jared Williams
> <jared.williams1@ntlworld.com> wrote:
>> Yeah, did notice that; think it could be replaced with something like:
>>
>> SELECT ci_page FROM {$table_categoryintersections}
>> WHERE ci_hash IN (implode(',', $hashes))
>> GROUP BY ci_page
>> HAVING COUNT(*) = count($hashes)
>> LIMIT $this->max_hash_results
> I'm not going to spend too much time parsing that, but it's an
> automatic filesort of the entire set included by the WHERE clause,
> i.e., the union of all the category intersections in question, since
> MySQL doesn't support loose index scans for WHERE x IN (...) GROUP BY
> y.  A repeated join seems likely to be faster, although maybe not; I
> haven't benchmarked it or anything.
Ah yeah, I forget MySQL specifics.  I wouldn't be surprised if the
optimal style of query changes with the number of categories.
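The two query shapes under discussion can be sketched against a toy
table.  Only the ci_page/ci_hash column names come from the thread;
the table name, schema details, and sample data below are illustrative
assumptions, using SQLite rather than MySQL:

```python
import sqlite3

# Toy stand-in for the extension's category-intersection table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categoryintersections (ci_page INTEGER, ci_hash TEXT)")
rows = [
    (1, "h_art"), (1, "h_paris"),  # page 1 is in both categories
    (2, "h_art"),                  # page 2 is in only one
    (3, "h_art"), (3, "h_paris"),
]
conn.executemany("INSERT INTO categoryintersections VALUES (?, ?)", rows)

hashes = ["h_art", "h_paris"]  # hashes of the categories to intersect

# Form 1: IN (...) GROUP BY ... HAVING, as quoted in the thread.
# Note this assumes (ci_page, ci_hash) rows are unique, otherwise the
# COUNT(*) comparison over-counts.
placeholders = ",".join("?" for _ in hashes)
group_by = conn.execute(
    f"SELECT ci_page FROM categoryintersections "
    f"WHERE ci_hash IN ({placeholders}) "
    f"GROUP BY ci_page HAVING COUNT(*) = ?",
    hashes + [len(hashes)],
).fetchall()

# Form 2: one self-join per requested hash, the shape Simetrical
# suggests may avoid the filesort on MySQL.
joins = " ".join(
    f"JOIN categoryintersections c{i} ON c{i}.ci_page = c0.ci_page "
    f"AND c{i}.ci_hash = ?"
    for i in range(1, len(hashes))
)
repeated_join = conn.execute(
    f"SELECT c0.ci_page FROM categoryintersections c0 {joins} "
    f"WHERE c0.ci_hash = ?",
    hashes[1:] + [hashes[0]],
).fetchall()

print(sorted(group_by))        # pages in both categories: 1 and 3
print(sorted(repeated_join))   # same result via repeated join
```

Both forms return the same pages; which one the optimizer handles
better is exactly the open question above.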
>> Yeah, I think chances of hash collisions are unlikely; what's far
>> more likely is someone recategorizing a page after a search.  Which
>> means the double check could be removed.
> It's not just unlikely, it's so unlikely as to be impossible to all
> intents and purposes, barring deliberately constructed collisions
> (which are possible with MD5, although maybe not for such short
> strings, I forget).  Worry about a meteor wiping out the data center
> before you worry about MD5 collisions by chance on sets with
> cardinality in the billions.
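To put a rough number on "so unlikely as to be impossible": a minimal
birthday-bound sketch, where the 10^10 cardinality ("in the billions")
is an assumed figure and MD5's 128-bit output is the only fact used:

```python
# Birthday approximation: for n random items in a space of size 2**128,
# the chance of any collision is roughly n^2 / (2 * 2**128).
n = 10**10          # assumed cardinality, "in the billions"
p = n**2 / (2 * 2**128)
print(p)            # on the order of 1e-19, i.e. effectively zero
```

Even at ten billion category sets, the chance collision probability is
around 10^-19, which supports the "worry about a meteor first" framing.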
My point was that even if it was remotely likely, it doesn't really
matter: as soon as the data leaves the DB it could be stale in any
case.  So inaccuracies should be expected, making the extra queries
redundant.
Jared