Could we please, please enable this option by default (and for anons)?
People keep reporting that the preview button doesn't work ...
This has been requested before (by me or others, I don't remember
anymore) and nobody disagreed.
Kurt
Lee Daniel Crocker <lee(a)piclab.com> said:
> I just did an ad-hoc benchmark on Piclab of the same installation
> with and without link-checking, and on the limited set of pages
> used by the test suite, the speedup was only about 3%. Of course
> all benchmarks on single-servers may be less applicable to the
> multiple-server installation we're going to have soon.
Was this a representative setup (e.g., with the full database)?
If so, that sounds like removing link-checks won't help
much - at least by itself and for single servers.
Thanks for the info!!
As I said, measuring is the only real answer, and I'm willing to
accept "nope, wrong guess" as the answer.
So turning to other ideas and performance measurements...
The swapping overhead does suggest that an excessive use of memory
by MySQL is resulting in the performance hit.
Perhaps we could reduce the amount of data
stored in MySQL by moving the cur and old text into
the filesystem and OUT of MySQL entirely.
As another poster noted, the "old" table is
especially large, but presumably only a few rare
articles are actually read from "old".
Obviously reading an article from the filesystem will read the
data into memory too, but the underlying OS doesn't try to preload
the entire filesystem into memory, which MySQL appears to be
trying to do. Doing this would mean that an archive of the
encyclopedia has to include both the MySQL metadata and the
article filesystem, but archiving a filesystem is a rather
well-understood problem :-).
> A quick start might be to temporarily disable all checking
> of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to
tell where the bottleneck is, but I think link checking is a good guess.
And a fairly simple experiment (now that Lee has created a functioning test
suite) could probably tell us if this is a bottleneck. If so, then at
least we know where to focus our optimization efforts.
If this is the problem, we are in luck because there have been a lot of
good improvement suggestions. But they all add complexity to the code
(or database setup) and "premature optimization is the root of all
evil," so if link checking isn't a bottleneck it would be
counterproductive to spend a lot of time trying to optimize it.
--Mark
Quoting various people:
> >Changes in 'existence' (yes or no) come up infrequently. When someone
> >creates a brand new article, all other cached articles that are
> >affected by the change could be updated at that time. ...
> >
> More than that, you don't have to re-generate the cached pages, you
> only have to invalidate them.
> Thus, when a page is created, you have to "touch" all pages with a
> link to that not-yet-existing article,
> and when a page is deleted, you do the same to all pages that link to
> that article.
Yes. If stored in a filesystem, the caches of generated HTML for
the linking articles can be simply removed when the existence state
of an article is changed. This would be relatively rare; looking
at the "recent changes" log shows clearly that most edits are
of EXISTING articles. Editing a previously existing article shouldn't
modify any caches except that of the article being edited.
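As a sketch of what that invalidation could look like (the cache
location and file naming here are invented for the example):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"   # assumed location of rendered pages

    def invalidate_linking_pages(titles_linking_here):
        # Called only when an article's existence changes (created or deleted);
        # the caller supplies the titles of the pages that link to it.
        for title in titles_linking_here:
            try:
                os.remove(os.path.join(CACHE_ROOT, title + ".html"))
            except FileNotFoundError:
                pass   # never rendered, or already invalidated - nothing to do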
...
> By not updating articles until they are accessed, you can defer a lot of
> work that would otherwise bog the system down at update time.
Yes. It also means that if someone creates a number of
related "previously non-existing" articles, cached HTML files are
only created when they're needed.
> Lazy evaluation is much nicer: when the page is demanded, the code should
> first look for a cached page, and generate it if necessary, before using
> that data to generate the article output.
Agreed.
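In code, the lazy path could be as small as this sketch (render_article
stands in for whatever turns wikitext into HTML; the cache location is
again invented):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"

    def get_article_html(title, render_article):
        # Serve the cached rendering if it exists; otherwise render once and cache it.
        path = os.path.join(CACHE_ROOT, title + ".html")
        try:
            with open(path) as f:            # cache hit: no parsing, no link checks
                return f.read()
        except FileNotFoundError:
            html = render_article(title)     # cache miss: do the expensive work once
            with open(path, "w") as f:
                f.write(html)
            return html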
> Potential downsides:
> * some articles are linked by 1000s+ of pages (like the pages linked by
>   auto-generated articles). Hitting these could cause a significant pause
>   as the cache is invalidated -- or will it?
No, it shouldn't be a problem in normal circumstances.
A cached page should only need to be invalidated if
the EXISTENCE of a page it links to changes. If an article is
linked to by 1000s+ of pages, it almost certainly already exists,
so no cache (other than that of the edited article itself) would be invalidated.
If an article is created (and didn't exist before), or completely
deleted from the database, I think the usual case is that it would
have relatively few links (say, 0-10).
Thus, you'd only have a few cache invalidations.
The NON-normal case would actually be interesting.
If an article was widely linked - but didn't exist - then it's
likely that someone finally created "the article everybody wanted".
If that's so, a small hiccup to store a widely-desired article would
be reasonable - people would be glad for the article!
In practice, I doubt there'd be many - the people who monitor the
"most requested" article lists will create article before too many
people link to the non-existent article.
Alternatively, perhaps all those links were created by a
vandalizing bot - in which case, you'd like to know about it.
If an article is deleted, and MANY pages refer to it, I'd
worry - that could signal serious vandalism.
Just to be clear: I'm specifically talking about
caches of stored generated HTML in this email.
This could be done by "front end" web servers inside their
filesystem, without touching MySQL, as long as the web servers
were told when to invalidate their caches.
Say, via a separate process that simply gets told via broadcast to
"invalidate cache of article X" - it then removes the
corresponding file (there's a risk of getting "old" articles in
some circumstances - whether or not that's a problem is worth
discussion). You can even imagine, say, 4-5 front-end
webservers with caches of articles to serve read requests,
talking to the database only when updates occur.
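Something like the following could run on each front end; the port
number and the "invalidate <title>" message format are placeholders for
whatever protocol we'd actually pick:

    import os
    import socket

    CACHE_ROOT = "/var/wiki/html-cache"

    def invalidation_listener(port=4040):
        # Wait for broadcast "invalidate <title>" messages and drop that file.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        while True:
            data, _addr = sock.recvfrom(4096)
            command, _, title = data.decode("utf-8", "replace").partition(" ")
            if command == "invalidate" and title:
                try:
                    os.remove(os.path.join(CACHE_ROOT, title + ".html"))
                except FileNotFoundError:
                    pass   # this front end never cached that article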
A separate issue is the possibility of storing _all_ of the article
text in the filesystem, instead of MySQL.
If that's done, a cache may or may not be useful; it depends on
how long it takes to render the HTML from the wikitext.
If the rendering to HTML is sufficiently fast, it's possible that
the memory pressure from having both cached & original text will
cause the system to be less efficient than if it regenerated the
text each time. I suspect caching would still be useful, even
if the original text is in the filesystem; sendfile() and friends
are blazingly fast, while PHP working to render the HTML
simply can't be.
It'd be possible to store both the original wikitext and the
rendered HTML in the filesystem. If you did that, perhaps they
should be in separate directories,
to simplify moving to multiple front-ends if that's desired later.
Tomasz Wegrzanowski queried:
>What about storing all markup and all rendered (or semi-rendered)
>html files on disk, under names being md5 hashes of them,
>and having database store only pointers and metadata ?
Storing all markup & rendered (cached) files on disk is one of
the options I've been raising as a possible performance enhancement.
However, I don't see there being a real performance advantage to
storing by hash value, instead of simply using an encoded article name
as the filename. And, if you use the article name as the filename,
the system's underpinnings become MUCH clearer (making it easier to
debug, etc.).
You probably don't want to just store
"all the files" in one big directory anyway.
Most filesystems do not handle large directories efficiently -
in fact, some are implemented as a linear search, which would make
finding the hashed file a real problem. This is easily handled
by using the first few characters of the article name as a hashing
function, e.g., Europe is stored in "E/u/r/Europe.wk".
The hash used here is "imperfect", but as a programmer I'd be
grateful for such a simple system when things go wrong.
And it makes lots of processing easier ("process articles in
ASCIIbetical order" is trivial).
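The whole "hashing function" is only a few lines of code; the .wk
suffix, three-level depth, and root directory below are just for
illustration:

    import os

    def article_path(title, root="/var/wiki/text", depth=3):
        # The first characters of the title pick the directory,
        # so "Europe" lands at <root>/E/u/r/Europe.wk.
        safe = title.replace("/", "_")   # keep path separators out of titles
        parts = [safe[i] if i < len(safe) else "_" for i in range(depth)]
        return os.path.join(root, *parts, safe + ".wk")

    # article_path("Europe") -> "/var/wiki/text/E/u/r/Europe.wk"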
Some filesystems _DO_ handle massive directories efficiently.
Reiser does well, for example. However, a design that works well
on arbitrary filesystems has the advantage of letting you
switch filesystems later, depending on other factors.
If you're willing to accept implementations that limit the
code's utility to specific filesystems (like Reiser),
I still don't see the advantage of hashed names - the
underlying filesystem will use its own hashing system anyway,
so you may as well use reasonable names.
However, maybe there's something I've missed.
If there is, please let me know!
Thanks...
erik_moeller(a)gmx.de (Erik Moeller) declared:
> Hi David,
>
> nice to see you here -- I enjoyed reading your Linux/OSS-related
> papers.
Thanks!!
> I have to say that disabling link checking on the live Wikipedia,
> even for a short time, is hardly acceptable.
Well, at least a few measurements to determine if it _IS_
a bottleneck would help. If the Wikipedia can be made to run
efficiently with link checking, then link checking should be left in.
I'm not against link checking, per se.
It's just that link checking is less important than
having a running Wikipedia. Faster hardware will
probably help, but software solutions are still worth pursuing.
A hardware speedup of 2x won't be enough if usage increases 20x.
Other alternatives include storing the "current" text - or even
ALL text - in the filesystem, using special filenames so that
MySQL doesn't need to be consulted (or only consulted a little)
for certain common queries. It could store
either the original wiki text, or the HTML'ized versions, or both.
MySQL is actually a pretty good SQL database. However,
it's optimized for serving structured data. The underlying OS
has received FAR more optimization work to find and retrieve
unstructured data, and it also has more information available to it
so it can swap/drop information more efficiently.
As was noted elsewhere, if old text is stored in the filesystem,
it would reduce the memory usage of MySQL substantially.
If all text were stored in the filesystem, MySQL would then
primarily be used for metadata storage and search index uses
(if search is on).
Anyway, I'm just typing ideas, hoping that some are useful.
My real goal is that I don't want the Wikipedia to be
a victim of its own success :-). A solution that works is,
by definition, a good solution :-).
> From: "Mark Christensen" <mchristensen(a)humantech.com>
> To: <wikitech-l(a)wikipedia.org>
> Reply-To: wikitech-l(a)wikipedia.org
>
> > A quick start might be to temporarily disable all checking
> > of links, and see if that helps much.
>
> This seems to be a helpful suggestion. Without profiling, it's hard to
> tell where the bottleneck is, but I think link checking is a good guess.
Thanks very much!! I think measuring without link-checking would be
great; it would certainly answer many questions. I don't have a
machine I can test on, sadly; does someone else?
It's worth noting that link-checking doesn't just cost processing by
itself - if link-checking is disabled, and user
formatting is limited, many OTHER optimizations become easy.
In particular, caching becomes really easy if article text
doesn't depend on other state (i.e., doesn't require link-checking and
processing to support fancy user options). For example, without
link-checking, you don't have to follow lists to invalidate
"related" caches. The most effective optimization is to do
nothing at all :-). In-filesystem caches of HTML fragments would
make sense in such a situation, and Linux's sendfile()
could do a rather impressive job of improving performance
when sending cached article text. Allowing users to select
which stylesheet to blast back at them would give them a limited
amount of control while seriously improving performance.
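To illustrate the sendfile() point, here's a deliberately simplified
sketch (a real front end would do this inside the web server, and the
paths and headers are made up):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"

    def send_cached_article(client_sock, title):
        # The kernel copies the cached file straight to the socket;
        # PHP (or Python) never touches the article bytes.
        path = os.path.join(CACHE_ROOT, title + ".html")
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            header = ("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n"
                      "Content-Length: %d\r\n\r\n" % size)
            client_sock.sendall(header.encode("ascii"))
            sent = 0
            while sent < size:
                sent += os.sendfile(client_sock.fileno(), f.fileno(), sent, size - sent)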
Another point about link-checking is that it's not as useful
as you'd wish, anyway. After all, it only identifies
existence. An article with only a tiny amount of content appears
"complete" to the link-check, but clearly you'd want people to
work on that article too! If disabling link-checking (and enabling the
optimizations it currently complicates) turns out to seriously improve
performance, then I think it's an obvious capability to disable
(at least as a configurable option).
My two cents, hope they help.
One fine day, Brion Vibber said:
> Do feel free to ask other free projects and universities if they'd be
> interested in supporting the project...
I thought it might be a good idea to ask, so I sent out an email to ibiblio:
> Greetings - my name is Nick, and I help out with a site called Wikipedia.
> (http://www.wikipedia.org). It is a free, multi-lingual project to
> create a complete, accurate, and more importantly open content
> encyclopedia. All of the content is licensed under the GNU FDL (GFDL),
> meaning that anybody has the freedom to copy and redistribute it, with
> or without modifications, either commercially or non-commercially -
> although they may not put in place technical measures to conceal the
> content.
>
> The English language Wikipedia has over 117 thousand articles already,
> and by our calculations, we have about half the content of the
> Encyclopedia Britannica. We have many other languages, which are also
> quickly growing in size and diversity.
>
> However, we are currently facing both a budget and capacity crunch.
> In short, we have neither. Since Wikipedia is a volunteer based
> program, it is hard for us to raise the funds to purchase additional
> hardware. Right now, we are running off of a dual Athlon 1800+ server
> with 2GB of RAM and 36GB of SCSI storage. Unfortunately, the system is
> being pushed to its absolute limits, with little relief in sight. We
> are installing a second system this weekend as a front end, but even
> with that, we are not sure how long we can hold out. The dual Athlon
> system runs at a typical load of 15-20 during normal US working hours.
>
> Since our mission seems to be very much in line with ibiblio's mission, I was
> wondering if there was any way that Wikipedia could be hosted by ibiblio?
> It would be a great help to our project and the community.
>
> Thanks!
Much to my surprise, they replied:
> hi nick,
>
> we would LOVE to host wikipedia.org. our only concern is with the
> additional load wikipedia might put on our mysql server. BUT... if you
> could possibly hold off moving the site for two weeks or so john and
> fred will have us up on our new hardware - we're moving to a web cluster
> and will have a much more powerful database machine.
>
> if this is acceptable to you, please check out
> http://www.ibiblio.org/faq/ for more information about our setup, and
> just drop me a list of what accounts, dbs, unix groups, web directories,
> etc and we'll go from there. just let me know?
>
> thanks,
> donald
> 919.843.8215
> www.ibiblio.org - formerly known as SunSITE and stoof.
Seems like a good deal to me. We should probably tell them that they should
keep our database on a separate MySQL server, as it will absolutely demolish
just about anything they make available.
Anyways, if somebody who knows the server requirements, layout, and other
what-not wants to let Donald and/or the list know, that would be awesome
(assuming that Brion is still interested in having someone else host the
site). Having ibiblio make all the outlays seems like a good deal to me
(plus, they're a non-profit org, so you could likely make tax-deductible
donations to them).
Anyways, that's that. :)
--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
There seems to be a runaway Apache process (8453) that has already
sucked up about 800 minutes of CPU time. Can somebody with root kill
it off?
--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN