From: Brion Vibber <brion(a)pobox.com>
On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
Off the top of my head, I can't think of any
simple mathematical way to
do want you want to do. (that being making articles selected randomly
less likely to be selected randomly over time)
Not sure who wants to do that... For me it's quite sufficient to make
the probability of selection random, and let the size of the selection
field make multiple selections _reasonably_ rare.
With over 100,000 articles, it's unlikely the user will ever see the same
article come up twice. A previous flawed algorithm had the property of
making a few articles much more likely to be selected.
The resetting of the random weight upon load is just
meant to keep the
pot stirring, and "in theory" shouldn't have any effect at all on
randomness if the random indexes were random to begin with.
Quite right. I would recommend only setting cur_random on article creation.
It would save that little tiny bit of database load. There's no reason to
think this "pot stirring" will make the selection more random, now that the
pot has been pre-stirred.
Seems to me
that the best way to do this is either:
1) Be truly random. In PostgreSQL, this would be something like:
SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id)
FROM cur)
(this function should be really fast, not sure
if faster than:)
Mysql doesn't seem to let you use a non-constant expression for
'LIMIT offset,count' (and besides, still no subqueries), but we can pick
a random number easily enough in PHP. The problem is that we aren't sure
what range to use; hence a random index attached to each article whose
range is known (between 0 and 1).
The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1", for
versions later than 3.23. But I think the current algorithm is quite clever.
Its only drawback is the overhead associated with maintaining the extra
index, and initialising the field on article creation.
We don't want to select non-article pages or
redirects, but we don't
currently have a count of exactly that. There's the "good articles"
count, but it has other criteria, so if we used it we'd always crop off
relatively recent articles. If we inflate it based on an estimate we'd
get overemphasis on the _most_ recent article if we estimate too high.
Sure, there's count(*), but that takes time to run... might be faster
with an index specifically on cur_namespace,cur_is_redirect. Or we could
maintain another column in the stats table which counts non-redirect
article pages with _no_ other criteria (such as size or comma/bracket
count). Rebuilding indexes on cur is _very_ slow, so if we need to
change the indexes, best to plan ahead.
Maybe we could have a "good article" flag, stored in cur on edit.
-- Tim Starling.
_________________________________________________________________
MSN Instant Messenger now available on Australian mobile phones. Go to
http://ninemsn.com.au/mobilecentral/hotmail_messenger.asp