Could we please, please enable this option by default (and for anons)?
People keep reporting that the preview button doesn't work ...
This has been requested before (by me or others, I don't remember
anymore) and nobody disagreed.
Kurt
Lee Daniel Crocker <lee(a)piclab.com> said:
> I just did an ad-hoc benchmark on Piclab of the same installation
> with and without link-checking, and on the limited set of pages
> used by the test suite, the speedup was only about 3%. Of course
> all benchmarks on single-servers may be less applicable to the
> multiple-server installation we're going to have soon.
Was this a representative setup (e.g., with the full database)?
If so, that sounds like removing link-checks won't help
much - at least by itself and for single servers.
Thanks for the info!!
As I said, measuring is the only real answer, and I'm willing to
accept "nope, wrong guess" as the answer.
So turning to other ideas and performance measurements...
The swapping overhead does suggest that an excessive use of memory
by MySQL is resulting in the performance hit.
Perhaps we could reduce the amount of data
stored in MySQL by moving the cur and old text into
the filesystem and OUT of MySQL entirely.
As another poster noted, the "old" table is
especially large, but presumably only a few rare
articles are actually read from "old".
Obviously reading an article from the filesystem will read the
data into memory too, but the underlying OS doesn't try to preload
the entire filesystem into memory, which MySQL appears to be
trying to do. Doing this would mean that an archive of the
encyclopedia has to include both the MySQL metadata and the
article filesystem, but archiving a filesystem is a rather
well-understood problem :-).
> A quick start might be to temporarily disable all checking
> of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to
tell where the bottleneck is, but I think link checking is a good guess.
And a fairly simple experiment (now that Lee has created a functioning test
suite) could probably tell us if this is a bottleneck. If so, then at
least we know where to focus our optimization efforts.
If this is the problem, we are in luck because there have been a lot of
good improvement suggestions. But they all add complexity to the code
(or database setup) and "premature optimization is the root of all
evil," so if link checking isn't a bottleneck it would be
counterproductive to spend a lot of time trying to optimize it.
--Mark
Quoting various people:
> >Changes in 'existence' (yes or no) come up infrequently. When someone
> >creates a brand new article, all other cached articles that are
> >affected by the change could be updated at that time. ...
> >
> More than that, you don't have to re-generate the cached pages, you
> only have to invalidate them.
> Thus, when a page is created, you have to "touch" all pages with a
> link to that not-yet-existing article,
> and when a page is deleted, you do the same to all pages that link to
> that article.
Yes. If stored in a filesystem, the caches of generated HTML for
the linking articles can be simply removed when the existence state
of an article is changed. This would be relatively rare; looking
at the "recent changes" log shows clearly that most edits are
of EXISTING articles. Editing a previously existing article shouldn't
modify any caches except that of the article being edited.
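As a sketch of what that invalidation could look like (the cache
location and file naming here are invented for the example):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"   # assumed location of rendered pages

    def invalidate_linking_pages(titles_linking_here):
        # Called only when an article's existence changes (created or deleted);
        # the caller supplies the titles of the pages that link to it.
        for title in titles_linking_here:
            try:
                os.remove(os.path.join(CACHE_ROOT, title + ".html"))
            except FileNotFoundError:
                pass   # never rendered, or already invalidated - nothing to do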
...
> By not updating articles until they are accessed, you can defer a lot of
> work that would otherwise bog the system down at update time.
Yes. It also means that if someone creates a number of
related "previously non-existing" articles, cached HTML files are
only created when they're needed.
> Lazy evaluation is much nicer: when the page is demanded, the code should
> first look for a cached page, and generate it if necessary, before using
> that data to generate the article output.
Agreed.
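In code, the lazy path could be as small as this sketch (render_article
stands in for whatever turns wikitext into HTML; the cache location is
again invented):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"

    def get_article_html(title, render_article):
        # Serve the cached rendering if it exists; otherwise render once and cache it.
        path = os.path.join(CACHE_ROOT, title + ".html")
        try:
            with open(path) as f:            # cache hit: no parsing, no link checks
                return f.read()
        except FileNotFoundError:
            html = render_article(title)     # cache miss: do the expensive work once
            with open(path, "w") as f:
                f.write(html)
            return html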
> Potential downsides:
> * some articles are linked by 1000s+ of pages (like the pages linked by
>   auto-generated articles). Hitting these could cause a significant pause
>   as the cache is invalidated -- or will it?
No, it shouldn't be a problem in normal circumstances.
A cached page should only need to be invalidated if
the EXISTENCE of a page it links to changes. If an article is
linked to by 1000s+ of pages, it almost certainly already exists,
so no cache (other than that of the edited article itself) would be invalidated.
If an article is created (and didn't exist before), or completely
deleted from the database, I think the usual case is that it would
have relatively few links (say, 0-10).
Thus, you'd only have a few cache invalidations.
The NON-normal case would actually be interesting.
If an article was widely linked - but didn't exist - then it's
likely that someone finally created "the article everybody wanted".
If that's so, a small hiccup to store a widely-desired article would
be reasonable - people would be glad for the article!
In practice, I doubt there'd be many - the people who monitor the
"most requested" article lists will create article before too many
people link to the non-existent article.
Alternatively, perhaps all those links were created by a
vandalizing bot - in which case, you'd like to know about it.
If an article is deleted, and MANY pages refer to it, I'd
worry - that could signal serious vandalism.
Just to be clear: I'm specifically talking about
caches of stored generated HTML in this email.
This could be done by "front end" web servers inside their
filesystem, without touching MySQL, as long as the web servers
were told when to invalidate their caches.
Say, via a separate process that simply gets told via broadcast to
"invalidate cache of article X" - it then removes the
corresponding file (there's a risk of getting "old" articles in
some circumstances - whether or not that's a problem is worth
discussion). You can even imagine, say, 4-5 front-end
webservers with caches of articles to serve read requests,
talking to the database only when updates occur.
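Something like the following could run on each front end; the port
number and the "invalidate <title>" message format are placeholders for
whatever protocol we'd actually pick:

    import os
    import socket

    CACHE_ROOT = "/var/wiki/html-cache"

    def invalidation_listener(port=4040):
        # Wait for broadcast "invalidate <title>" messages and drop that file.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        while True:
            data, _addr = sock.recvfrom(4096)
            command, _, title = data.decode("utf-8", "replace").partition(" ")
            if command == "invalidate" and title:
                try:
                    os.remove(os.path.join(CACHE_ROOT, title + ".html"))
                except FileNotFoundError:
                    pass   # this front end never cached that article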
A separate issue is the possibility of storing _all_ of the article
text in the filesystem, instead of MySQL.
If that's done, a cache may or may not be useful; it depends on
how long it takes to render the HTML from the wikitext.
If the rendering to HTML is sufficiently fast, it's possible that
the memory pressure from having both cached & original text will
cause the system to be less efficient than if it regenerated the
text each time. I suspect caching would still be useful, even
if the original text is in the filesystem; sendfile() and friends
are blazingly fast, while PHP working to render the HTML
simply can't be.
It'd be possible to store both the original wikitext and the
rendered HTML in the filesystem. If you did that, perhaps they
should be in separate directories,
to simplify moving to multiple front-ends if that's desired later.
Tomasz Wegrzanowski queried:
>What about storing all markup and all rendered (or semi-rendered)
>html files on disk, under names being md5 hashes of them,
>and having database store only pointers and metadata ?
Storing all markup & rendered (cached) files on disk is one of
the options I've been raising as a possible performance enhancement.
However, I don't see there being a real performance advantage to
storing by hash value, instead of simply using an encoded article name
as the filename. And, if you use the article name as the filename,
the system's underpinnings become MUCH clearer (making it easier to
debug, etc.).
You probably don't want to just store
"all the files" in one big directory anyway.
Most filesystems do not handle large directories efficiently -
in fact, some are implemented as a linear search, which would make
finding the hashed file a real problem. This is easily handled
by using the first few characters of the article name as a hashing
function, e.g., Europe is stored in "E/u/r/Europe.wk".
The hash used here is "imperfect", but as a programmer I'd be
grateful for such a simple system when things go wrong.
And it makes lots of processing easier ("process articles in
ASCIIbetical order" is trivial).
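The whole "hashing function" is only a few lines of code; the .wk
suffix, three-level depth, and root directory below are just for
illustration:

    import os

    def article_path(title, root="/var/wiki/text", depth=3):
        # The first characters of the title pick the directory,
        # so "Europe" lands at <root>/E/u/r/Europe.wk.
        safe = title.replace("/", "_")   # keep path separators out of titles
        parts = [safe[i] if i < len(safe) else "_" for i in range(depth)]
        return os.path.join(root, *parts, safe + ".wk")

    # article_path("Europe") -> "/var/wiki/text/E/u/r/Europe.wk"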
Some filesystems _DO_ handle massive directories efficiently.
Reiser does well, for example. However, a design that works well
on arbitrary filesystems has the advantage of letting you
switch filesystems later, depending on other factors.
If you're willing to accept implementations that limit the
code's utility to specific filesystems (like Reiser),
I still don't see the advantage of hashed names - the
underlying filesystem will use its own hashing system anyway,
so you may as well use reasonable names.
However, maybe there's something I've missed.
If there is, please let me know!
Thanks...
erik_moeller(a)gmx.de (Erik Moeller) declared:
> Hi David,
>
> nice to see you here -- I enjoyed reading your Linux/OSS-related
> papers.
Thanks!!
> I have to say that disabling link checking on the live Wikipedia,
> even for a short time, is hardly acceptable.
Well, at least a few measurements to determine if it _IS_
a bottleneck would help. If the Wikipedia can be made to run
efficiently with link checking, then link checking should be left in.
I'm not against link checking, per se.
It's just that link checking is less important than
having a running Wikipedia. Faster hardware will
probably help, but software solutions are still worth pursuing.
A hardware speedup of 2x won't be enough if usage increases 20x.
Other alternatives include storing the "current" text - or even
ALL text - in the filesystem, using special filenames so that
MySQL doesn't need to be consulted (or only consulted a little)
for certain common queries. It could store
either the original wiki text, or the HTML'ized versions, or both.
MySQL is actually a pretty good SQL database. However,
it's optimized for serving structured data. The underlying OS
has received FAR more optimization work to find and retrieve
unstructured data, and it also has more information available to it
so it can swap/drop information more efficiently.
As was noted elsewhere, if old text is stored in the filesystem,
it would reduce the memory usage of MySQL substantially.
If all text were stored in the filesystem, MySQL would then
primarily be used for metadata storage and search index uses
(if search is on).
Anyway, I'm just typing ideas, hoping that some are useful.
My real goal is that I don't want the Wikipedia to be
a victim of its own success :-). A solution that works is,
by definition, a good solution :-).
> From: "Mark Christensen" <mchristensen(a)humantech.com>
> To: <wikitech-l(a)wikipedia.org>
> Reply-To: wikitech-l(a)wikipedia.org
>
> > A quick start might be to temporarily disable all checking
> > of links, and see if that helps much.
>
> This seems to be a helpful suggestion. Without profiling, it's hard to
> tell where the bottleneck is, but I think link checking is a good guess.
Thanks very much!! I think measuring without link-checking would be
great; it would certainly answer many questions. I don't have a
machine I can test on, sadly; does someone else?
It's worth noting that link-checking doesn't just cost processing by
itself - if link-checking is disabled, and user
formatting is limited, many OTHER optimizations become easy.
In particular, caching becomes really easy if article text
doesn't depend on other state (i.e., doesn't require link-checking and
processing to support fancy user options). For example, without
link-checking, you don't have to follow lists to invalidate
"related" caches. The most effective optimization is to do
nothing at all :-). In-filesystem caches of HTML fragments would
make sense in such a situation, and Linux's sendfile()
could do a rather impressive job of improving performance
when sending cached article text. Allowing users to select
which stylesheet to blast back at them would give them a limited
amount of control while seriously improving performance.
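To illustrate the sendfile() point, here's a deliberately simplified
sketch (a real front end would do this inside the web server, and the
paths and headers are made up):

    import os

    CACHE_ROOT = "/var/wiki/html-cache"

    def send_cached_article(client_sock, title):
        # The kernel copies the cached file straight to the socket;
        # PHP (or Python) never touches the article bytes.
        path = os.path.join(CACHE_ROOT, title + ".html")
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            header = ("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n"
                      "Content-Length: %d\r\n\r\n" % size)
            client_sock.sendall(header.encode("ascii"))
            sent = 0
            while sent < size:
                sent += os.sendfile(client_sock.fileno(), f.fileno(), sent, size - sent)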
Another point about link-checking is that it's not as useful
as you'd wish, anyway. After all, it only identifies
existence. An article with only a tiny amount of content appears
"complete" to the link-check, but clearly you'd want people to
work on that article too! If disabling link-checking (and enabling the
optimizations it currently complicates) turns out to seriously improve
performance, then I think it's an obvious capability to disable
(at least as a configurable option).
My two cents, hope they help.
One fine day, Brion Vibber said:
> Do feel free to ask other free projects and universities if they'd be
> interested in supporting the project...
I thought it might be a good idea to ask, so I sent out an email to ibiblio:
> Greetings - my name is Nick, and I help out with a site called Wikipedia.
> (http://www.wikipedia.org). It is a free, multi-lingual project to
> create a complete, accurate, and more importantly open content
> encyclopedia. All of the content is licensed under the GNU FDL (GFDL),
> meaning that anybody has the freedom to copy and redistribute it, with
> or without modifications, either commercially or non-commercially -
> although they may not put in place technical measures to conceal the
> content.
>
> The English language Wikipedia has over 117 thousand articles already,
> and by our calculations, we have about half the content of the
> Encyclopedia Britannica. We have many other languages, which are also
> quickly growing in size and diversity.
>
> However, we are currently facing both a budget and capacity crunch.
> In short, we have neither. Since Wikipedia is a volunteer based
> program, it is hard for us to raise the funds to purchase additional
> hardware. Right now, we are running off of a dual Athlon 1800+ server
> with 2GB of RAM and 36GB of SCSI storage. Unfortunately, the system is
> being pushed to its absolute limits, with little relief in sight. We
> are installing a second system this weekend as a front end, but even
> with that, we are not sure how long we can hold out. The dual Athlon
> system runs at a typical load of 15-20 during normal US working hours.
>
> Since our mission seems to be very much in line with ibiblio's mission, I was
> wondering if there was any way that Wikipedia could be hosted by ibiblio?
> It would be a great help to our project and the community.
>
> Thanks!
Much to my surprise, they replied:
> hi nick,
>
> we would LOVE to host wikipedia.org. our only concern is with the
> additional load wikipedia might put on our mysql server. BUT... if you
> could possibly hold off moving the site for two weeks or so john and
> fred will have us up on our new hardware - we're moving to a web cluster
> and will have a much more powerful database machine.
>
> if this is acceptable to you, please check out
> http://www.ibiblio.org/faq/ for more information about our setup, and
> just drop me a list of what accounts, dbs, unix groups, web directories,
> etc and we'll go from there. just let me know?
>
> thanks,
> donald
> 919.843.8215
> www.ibiblio.org - formerly known as SunSITE and stoof.
Seems like a good deal to me. We should probably tell them that they should
keep our database on a separate MySQL server, as it will absolutely demolish
just about anything they make available.
Anyways, if somebody who knows the server requirements, layout, and other
what-not wants to let Donald and/or the list know, that would be awesome
(assuming that Brion is still interested in having someone else host the
site). Having ibiblio make all the outlays seems like a good deal to me
(plus, they're a non-profit org, so you could likely make tax-deductible
donations to them).
Anyways, that's that. :)
--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
There seems to be a runaway Apache process (8453) that has already
sucked up about 800 minutes of CPU time. Can somebody with root kill
it off?
--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN