Ian Gilfiallan wrote:
> - it's not the volume of queries; it's often a particular query that
> starts things backing up. So look at the first query in the log, and the
> first queries after a period of running OK. Two that I saw this way are:
>
> SELECT cur_id,cur_namespace,cur_title,cur_text
> FROM cur,searchindex
> WHERE cur_id=si_page AND (MATCH (si_text) AGAINST ('math characters') AND
> (cur_is_redirect=0) ) AND (cur_namespace=0) LIMIT 0, 20;
This query looks extremely suspect to me because it seems to combine the worst
of two worlds: the locking overhead of InnoDB tables (for table 'cur') and
large locking granularity of MyISAM tables (for table 'searchindex', which
needs to be updated every time an article is saved). As I said before, I
really don't think we should be using InnoDB tables, but perhaps Ian has some
good arguments why we should.
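If the engine mix really is the culprit, one way to test it (a rough
sketch only, not a drop-in replacement) would be to split the join so the
MyISAM read lock on searchindex is held only for the short fulltext pass:

SELECT si_page
FROM searchindex
WHERE MATCH (si_text) AGAINST ('math characters')
LIMIT 0, 20;

-- then fetch the matching articles from cur by id; note that filtering
-- on redirect/namespace after the LIMIT changes the semantics slightly,
-- so in practice one would over-fetch from the first query:
SELECT cur_id,cur_namespace,cur_title,cur_text
FROM cur
WHERE cur_id IN ( ...ids from above... )
AND cur_is_redirect=0 AND cur_namespace=0;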
-- Jan Hidders
I just made one more tweak to the random page selection; I had been
getting duplicates more frequently than I would expect. The new
cur_random>RAND() condition below should spread out the results better:
$sqlget = "SELECT cur_id,cur_title
FROM cur USE INDEX (cur_random)
WHERE cur_namespace=0 AND cur_is_redirect=0
AND cur_random>RAND()
ORDER BY cur_random
LIMIT 1";
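One caveat: if RAND() draws a value higher than every cur_random in the
table, the query above comes back empty, so the code needs a fallback;
something like this (variable name illustrative), i.e. the same query
minus the new condition:

$sqlfallback = "SELECT cur_id,cur_title
FROM cur USE INDEX (cur_random)
WHERE cur_namespace=0 AND cur_is_redirect=0
ORDER BY cur_random
LIMIT 1";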
Please check it experimentally; currently on test and English.
-- brion vibber (brion @ pobox.com)
Allowing sysops to generally ban signed in users would reverse our
existing policy that this can only be done with Jimbo's blessing. But we
don't have to go that far. It would be relatively simple to do the
following:
1) Rename the "Block IP" special page to "Block user"
2) Add an additional text field on that page to ban a signed in user
3) If you try to ban a signed in user, the CUR and OLD tables are queried
to check whether that user has made any edits older than 14 days (see the
sketch after this list)
4) If this is the case, the user is not banned and a warning message is
displayed. Otherwise, the user is banned.
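For step 3, the check could be a pair of simple queries along these lines
(a sketch; I'm assuming the user name lives in cur_user_text/old_user_text
and that the char(14) timestamps compare correctly as strings):

SELECT COUNT(*) FROM cur
WHERE cur_user_text='VandalName'
AND cur_timestamp < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 14 DAY),
                                '%Y%m%d%H%i%S');

SELECT COUNT(*) FROM old
WHERE old_user_text='VandalName'
AND old_timestamp < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 14 DAY),
                                '%Y%m%d%H%i%S');

If either count is nonzero, the user is immune and we show the warning
instead of banning.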
By doing this, we raise the trust level required to be "immune" from sysop
bans, while making it possible to quickly block persistent vandals.
As an additional refinement, we could add a past_ban flag to the user
table. If a user has been banned in the past, he can be banned again by
any sysop.
I don't know if we can retrieve the IP address from our tables - I don't
see how since we do not store it in CUR. But if we can, we could also
provide sysops with the IP address of the blocked user, after they have
successfully blocked them.
Regards,
Erik
I have a trick for combining the history of badly moved pages,
which I've applied successfully a few times over the past month.
When I tried it just now, it failed miserably.
I will describe my actions:
First, I checked that the histories of [[Pennsylvania Dutch]]
and [[Pennsylvania German]] didn't overlap, except for redirecting.
Then I deleted [[Pennsylvania Dutch]] (the current redirect).
Then I moved [[Pennsylvania German]] to [[Pennsylvania Dutch]].
Then I restored the deleted revisions of [[Pennsylvania Dutch]].
Et voilà! The entire history is combined at [[Pennsylvania Dutch]]
(except for the unavoidable intrusion of redirects at the moving point),
ready to be moved back to [[Pennsylvania German]].
Well, that's how it had always worked before.
But in fact, the deleted history of [[Pennsylvania Dutch]] was *not* restored!
I restored it again, to see if that would help, but it didn't.
So I sadly moved it back to [[Pennsylvania German]],
and now there seems to be no way to get the history of [[Pennsylvania Dutch]].
Where did it go? History shouldn't be permanently deletable
with our nice soft delete. I suspect a database bug, but I don't know how.
Further hints of a database bug:
Thinking that I might have misspelled "Pennsylvania" at some point,
I did a search on the term "Dutch".
Although I got several redirects in the page title matches
and these matches easily fit onto a single page of results,
[[Pennsylvania Dutch]] never appeared on the list.
Yet the page undoubtedly exists, historyless though it may be.
So there is a bug somewhere that lost track of [[Pennsylvania Dutch]];
and there is some history that was truly deleted forever,
but which we might be able to dredge from the database (or yesterday's log),
and which it would be good to restore to its proper place
in the history of [[Pennsylvania German]].
-- Toby Bartels
The English-language Wikipedia will be in _read-only mode_ for a few
hours tonight (Feb 9->Feb 10) while the database structure is altered to
eliminate some of the remaining performance drags before the heavy
weekday traffic gets into full swing.
This will also fix the out-of-order page history bug, which was caused
by a temporary workaround to the slow history problem.
The downtime window will cover roughly 04:00-08:00 UTC (currently the
same as UK time). For us
USians, that's 8pm-midnight PST / 11pm-3am EST. Australians & others,
please check http://www.wikipedia.org/wiki/Time_zone for the appropriate
offset to your local time. (I'm not certain of the end time; it may run
shorter, and hopefully won't run longer. It's a big database, and mass
modifications take a *long* time.)
During that time the wiki should remain accessible for reading, but no
one will be able to log in or make edits.
I'll also be upgrading various other languages to the current software
revision; those should go relatively quickly, and will be staggered.
Generally, meta, wiktionary, and Wikipedia's other languages should
remain accessible for both reading and editing through the night except
for some brief read-only outages.
-- brion vibber (brion @ pobox.com)
ps Tech note: I figure we should hold off on the mysql 4 upgrade until
they can release a version that isn't followed within a day by a dire
warning about a new bug that directly affects a type of query that we
use a lot for Wikipedia. :)
In the meantime, I'm adding the inverse timestamp column, which mysql 3
can sort using the index (thus speeding up, and fixing bugs in, history,
contribs, & recentchangeslinked), and a random index column to replace
the separate random table and its insanely slow refilling operation.
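The trick, roughly (a sketch; the exact column name may differ from what
goes live): mysql 3 can't use an index for a descending sort, so we store
the complement of each timestamp and sort ascending on that instead:

-- fill the new column once (timestamps are strings like YYYYMMDDHHMMSS):
UPDATE old SET inverse_timestamp = 99999999999999 - old_timestamp;

-- a newest-first page history becomes an ascending, index-friendly sort:
SELECT old_timestamp FROM old
WHERE old_namespace=0 AND old_title='Some_page'
ORDER BY inverse_timestamp
LIMIT 50;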
I've overwritten the experimental stuff in test.wikipedia.org with the
current work code (including the new timestamp & random code which will
be going live in a few hours -- test fast! :)
Copies of the recent state should be in the 'backup2' subdirectory.
-- brion vibber (brion @ pobox.com)
Question: would anyone else like to receive automatic e-mail
notification of bug reports and feature requests filed at the tracker on
our SourceForge project?
If so, I can set up an alias address which will reflect to interested
parties and tell SF to send notices there instead of to me directly.
(Yes, people *do* file bug reports at the tracker.)
-- brion vibber (brion @ pobox.com)
I'd like to set up a read-only mirror of wikipedia on ross.bomis.com,
and point a URL at it. Possibly we can think of a good automatic
redirection scheme or something. Well, I say "I'd like to...", but
actually that means "I'm willing to pay Jason to..."
Jason posted the other day about a zero budget, but that's not
*exactly* true. :-)
We have some loose hardware sitting around that we might as well press
into service.
Ideally, we would have a way to _automatically_ redirect requests to
the read-only site when the full site is dead.
If throwing hardware at the problem is likely to help, I'll do it.
--Jimbo
Earlier today I noticed that it was taking quite a long time to refill
the random page queue; this currently works by:
INSERT
INTO random(ra_current,ra_title)
SELECT 0,cur_title
FROM cur
WHERE cur_namespace=0 AND cur_is_redirect=0
ORDER BY RAND()
LIMIT 1000
Is it actually trying to reorder all 100,000+ entries randomly and then
taking the first 1000, or does it just _seem_ that slow? And is there
actually a faster way of doing this?
Off the top of my head: what if the random queue listed *every* page;
each page would be queued when it was created (and removed if deleted or
changed to non-"article" state), and associated with a random index
number. When asked to view a random page, we sort on the random index
column (which would be indexed!) and take the lowest number; then assign
that same article a new random index.
This wouldn't require the occasional delay to refill the queue, and
since the random index would be indexed, retrieval should be quick
enough even with a large number of articles.
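In SQL terms, serving a random page would then look something like this
(a sketch, with 'cur_random' as a stand-in column name):

-- grab the lowest random index; with the column indexed this is cheap:
SELECT cur_id,cur_title FROM cur
WHERE cur_namespace=0 AND cur_is_redirect=0
ORDER BY cur_random
LIMIT 1;

-- then put the served article back in the pool with a fresh number:
UPDATE cur SET cur_random=RAND() WHERE cur_id=<id just served>;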
Thoughts?
-- brion vibber (brion @ pobox.com)