Back in January we had some discussion about how difficult it is to
edit multiple cross-linked pages about subjects within a context now
that subpages are gone. There were several suggestions, but none of
them really clicked and none were ever implemented. The issue has
come up again, and there are now more pages with disambiguating
contexts, so I think this is a good time to revisit it.
I also have a proposal that I like better than all the earlier
ones (including mine). Rather than adding a special tag like Base
or Context, and rather than using a special character, let's
just change our interpretation of links with a missing portion
on either side of the pipe, that is [[link| ]] and [[ |link]].
Here's the proposal: on pages whose titles end with a parenthesized
(context), [[ |link]] is interpreted as [[link (context)|link]].
On all pages, [[link (context)| ]] is interpreted that way as well.
All other uses of [[link| ]] or [[ |link]] are simply interpreted
as [[link]].
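For concreteness, here's a rough sketch of the rewrite rule as it
might look at render time. The function name and calling convention
are invented for illustration, not taken from the current code:

<?php
# Sketch only: expand "empty-side" pipe links per the proposal above.
# $pageTitle is the title of the page being rendered, e.g. "Flush (poker)".
function expandContextLink( $target, $label, $pageTitle ) {
    $target = trim( $target );
    $label  = trim( $label );

    # [[link (context)| ]] -- empty label: strip the "(context)" suffix.
    if ( $label == "" && $target != "" ) {
        $stripped = preg_replace( '/\s*\([^)]*\)$/', "", $target );
        return array( $target, $stripped != "" ? $stripped : $target );
    }

    # [[ |link]] -- empty target: borrow the context of the current page.
    if ( $target == "" && $label != "" ) {
        if ( preg_match( '/(\([^)]*\))$/', $pageTitle, $m ) ) {
            return array( $label . " " . $m[1], $label );
        }
        return array( $label, $label );   # no (context): plain [[link]]
    }

    return array( $target, $label );      # ordinary piped link
}
?>

So on a page titled "Flush (poker)", [[ |Bluff]] would link to
"Bluff (poker)" but display as "Bluff", and on any page
[[Bluff (poker)| ]] would do the same.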
That will make fixing all the links in the Middle Earth, Poker,
and other pages much easier, and I don't think it will add any
temptation to over-categorize or cause other problems.
It is an open question whether these links are interpreted at save
time or at render time; the latter makes things easier, I think,
but the former has advantages too.
I just committed a function (sysops only, I hope ;) to move a page to a new
title, with complete history. A checkbox is used to create a redirect. Now:
1. Please check for errors.
2. Please make it use variables instead of English text constants (Brion, I
know you wouldn't be happy if you couldn't do that yourself ;)
3. The function doesn't move subpages, as there are, officially, no subpages.
Neither does it move talk: pages. Might be worth adding.
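(For reviewers without the CVS handy, the overall shape of such a move
function is roughly the following sketch; the helper names and table
layout are illustrative guesses, not the committed code:)

<?php
# Illustrative outline only -- not the committed implementation.
function movePage( $oldTitle, $newTitle, $makeRedirect ) {
    # Refuse to clobber an existing article at the target title.
    if ( wikiPageExists( $newTitle ) ) return false;   # hypothetical helper

    # Retitle the current revision and the entire edit history together,
    # so the history follows the article to its new name.
    wikiSQL( "UPDATE cur SET title='" . addslashes( $newTitle ) .
             "' WHERE title='" . addslashes( $oldTitle ) . "'" );   # hypothetical helper
    wikiSQL( "UPDATE old SET title='" . addslashes( $newTitle ) .
             "' WHERE title='" . addslashes( $oldTitle ) . "'" );

    # Optionally leave a redirect behind so existing links keep working.
    if ( $makeRedirect ) {
        wikiCreatePage( $oldTitle, "#REDIRECT [[" . $newTitle . "]]" );   # hypothetical helper
    }
    return true;
}
?>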
Have fun with the coding,
Magnus
Dear Wikipedians!
Today I read an article on a mailing list which led me to a
Slashdot discussion about a project called "World Wide Lexicon". There
seem to be some wrong expectations about it, so I mailed the author of
WWL to ask him and to point him to Wikipedia. I think it's a very
interesting project, but you can take a look at it yourself:
the project:
www.worldwidelexicon.org
the /. discussion:
http://slashdot.org/articles/02/04/05/1911255.shtml?tid=95
the answers to my emails:
-----Original Message-----
From: Brian McConnell <brianmsf(a)yahoo.com>
To: Kurt Jansson <kurt(a)jansson.de>
Sent: Sunday, 7 April 2002 19:55
Subject: RE: Why "Lexicon"?
Kurt,
Thank you for your email.
I called it the worldwide lexicon because the system can be used to
retrieve definitions for words as well as translations. For example,
if you are doing a monolingual search, you can submit several
different types of queries to a WWL server, including:
- syn : returns synonymous words and phrases
- ant : returns antonymous words and phrases
- def : returns verbose description for a word or phrase
- pcat : returns parent categories that the word, phrase or resource
locator belongs to
- ccat : returns child categories that are associated with the entry
- vis : returns words that represent visually similar objects
I like Wikipedia, and would like to talk to someone about joining it to
the WWL system. I think it could be very useful in processing
monolingual queries. All they will need to do is write a PHP script
that recognizes several simple SOAP methods.
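(To make "a PHP script that recognizes several simple SOAP methods"
concrete, a minimal hand-rolled responder might look something like
the sketch below. The method and element names are guesses, since the
WWL spec isn't quoted in this thread, and wikiLookupDefinition() is a
stub standing in for a real database lookup:)

<?php
# Sketch of a minimal WWL-style SOAP responder -- names are hypothetical.
function wikiLookupDefinition( $word ) {
    return "definition of " . $word;   # stub standing in for a DB query
}

# Crude extraction of the query type (syn/ant/def/...) and the word
# from the POSTed SOAP envelope.
$request = file_get_contents( "php://input" );
preg_match( '/<type>([a-z]+)<\/type>/', $request, $t );
preg_match( '/<word>([^<]+)<\/word>/', $request, $w );
$type = isset( $t[1] ) ? $t[1] : "def";
$word = isset( $w[1] ) ? $w[1] : "";

$answer = ( $type == "def" ) ? wikiLookupDefinition( $word ) : "";

header( "Content-Type: text/xml" );
echo "<?xml version=\"1.0\"?>\n";
echo "<SOAP-ENV:Envelope xmlns:SOAP-ENV=" .
     "\"http://schemas.xmlsoap.org/soap/envelope/\">\n";
echo " <SOAP-ENV:Body>\n  <queryResponse>\n";
echo "   <result>" . htmlspecialchars( $answer ) . "</result>\n";
echo "  </queryResponse>\n </SOAP-ENV:Body>\n</SOAP-ENV:Envelope>\n";
?>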
I would also like to talk to the wikipedia software developer about
the possibility of modifying the system to be used as a translation
dictionary. I don't like to reinvent the wheel, and it seems that the
system they have built can be modified to host a user supported
database of language pair translations.
The benefit of joining is that Wikipedia will appear as a data source
along with other web dictionaries, lexicons and semantic network
servers.
The most useful feature of our system is that it will enable client
applications, a browser plug-in for example, to locate WWL data
sources on the fly, and then submit standardized queries to them.
Thus, one fairly simple piece of code can talk to lots of dictionaries
throughout the web (you might use it one day to look up translations
for words in a Spanish document, and another to look for verbose
definitions for words in your home language).
The main goal of WWL is to create a GNUtella-like system for locating
and communicating with dictionary and semantic network servers on the
web (there are many). The problem today is that each system has its
own proprietary front end, so all of this information is fragmented.
By creating a simple protocol for locating and talking to systems, it
is possible to create what appears to be a single worldwide
dictionary/semantic network that can be accessed with a few lines of
code.
Thanks for writing. Best regards,
Brian McConnell
-----Original Message-----
From: Brian McConnell <brianmsf(a)yahoo.com>
To: Kurt Jansson <kurt(a)jansson.de>
Sent: Sunday, 7 April 2002 23:53
Subject: RE: Why "Lexicon"?
Kurt,
Thanks for the quick reply.
Another point... WWL does not do full text translation. It is designed
to assist word and phrase translation, as well as monolingual
dictionary or encyclopedia searches. As you know, translating full
text without human intervention is a very difficult problem. While I
could see translation systems using WWL to query dictionaries (to
expand the scope of their vocabularies), the WWL specification does
not say anything about full text translation.
Our primary goal is to create a distributed dictionary/encyclopedia
protocol that is very easy to implement in client and server software,
and that does not require dictionary servers to make changes to their
systems besides writing a few scripts to generate SOAP responses
instead of HTML. WWL's purpose is to make it easy to automatically
locate and communicate with WWL-aware dictionary and semantic net
servers. I like to describe this as GNUtella for dictionaries.
You are welcome to forward this email to the wikipedia list or
developers. As I mentioned, I think you could do some interesting
things by making your systems accessible via the WWL SOAP interface.
Thanks again for your email. Best regards.
Brian McConnell
Pierre, thanks for your comment... I'm forwarding it to wikitech-l, which is where
the developers hang out. Coincidentally, we are just now discussing performance
issues. Would you be interested in joining us?
----- Forwarded message from Pierre Abbat <phma(a)webjockey.net> -----
From: Pierre Abbat <phma(a)webjockey.net>
Date: Thu, 11 Apr 2002 09:14:25 -0400
To: webmaster(a)wikipedia.com
Subject: HTTP response from wikipedia takes too long
I am trying to read Wikipedia, and Konqueror frequently times out,
resulting in an edit conflict if I'm trying to submit something.
Accessing ross.bomis.com does not take nearly as long. Can you fix it?
phma
---
[phma@neofelis abi]$ time webserver ross.bomis.com
Server: Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12
0.11user 2.00system 0:10.61elapsed 19%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1095major+404minor)pagefaults 0swaps
[phma@neofelis abi]$ time webserver ross.bomis.com
Server: Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12
0.09user 0.10system 0:00.91elapsed 20%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1052major+404minor)pagefaults 0swaps
[phma@neofelis abi]$ time webserver ross.bomis.com
Server: Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12
0.10user 0.09system 0:00.44elapsed 42%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1052major+404minor)pagefaults 0swaps
[phma@neofelis abi]$ time webserver www.wikipedia.com
Server: Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12
0.11user 0.22system 0:32.39elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (957major+376minor)pagefaults 0swaps
[phma@neofelis abi]$ time webserver www.wikipedia.com
Server: Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12
0.10user 1.02system 1:17.12elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1125major+404minor)pagefaults 0swaps
[phma@neofelis abi]$ time webserver www.wikipedia.com
Looking up www.wikipedia.com.
Making HTTP connection to www.wikipedia.com.
Sending HTTP request.
HTTP request sent; waiting for response.
Alert!: Unexpected network read error; connection aborted.
Can't Access `http://www.wikipedia.com/'
Alert!: Unable to access document.
lynx: Can't access startfile
Command exited with non-zero status 1
0.10user 2.28system 5:16.30elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1176major+406minor)pagefaults 0swaps
----- End forwarded message -----
Jim accidentally sent this just to me; I'm sending it back to the list:
On mer, 2002-04-10 at 18:27, Jimmy Wales wrote:
> Brion L. VIBBER wrote:
> > > My best guess is that the parsing and lookups on regular pages are
> > > currently the main load, not editing or exotic database queries -- is
> > > this right?
> >
> > Not a clue. Initially, the database certainly was the main load, but I
> > haven't heard any newer figures. Jimbo?
>
> I'll reset the slow-query log and make a new version available after a few
> hours of data collection.
>
> > We used to cache rendered articles, but Jimbo disabled this feature some
> > time ago, claiming he was unable to find a performance advantage. (See
> > mailing list archives circa February 13.)
>
> But, I'm willing to try it again.
>
> > Personally, I've always found that idea suspicious; caching is definitely
> > faster on my test machine, and is going to be a particularly big help
> > with, say, long pages full of HTML tables! But then, my test machine has
> > a much, much lower load to deal with than the real Wikipedia. :)
> > Nonetheless, if caching really isn't helping, that's because something
> > isn't being done right. It should be found, fixed, and reenabled.
>
> I would say that I agree with that.
>
> Here's a question for everyone.
>
> Let's say we have some portion of the page pre-calculated and cached.
> Is it faster to keep that cached text *in the database*, or *on the
> hard drive*?
>
> I'm very strongly biased towards thinking that keeping it on the hard
> drive is faster, and by a significant margin -- but only because I've
> never tested it, and because I know (from long experience at Bomis) that
> opening up a text file on disk and spitting it out can be *really* fast,
> if the machine has enough RAM that the filesystem can cache lots of
> popular files in memory.
>
> But, everything I read about MySQL talks about how screamingly fast it
> allegedly is, so...
>
> --Jimbo
>
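To make Jimbo's question concrete: a file-based render cache amounts to
something like the sketch below. This is an illustration of the idea
only, not code from the tree; the helper names are invented, and
renderPageFromDatabase() is a stub for the real parser.

<?php
# Sketch of a file-based render cache -- illustration only.
$cacheDir = "/var/cache/wiki";    # hypothetical location

function renderPageFromDatabase( $title ) {
    return "<html>rendered " . $title . "</html>";   # stub for the parser
}

function cachedPagePath( $title ) {
    global $cacheDir;
    # Hash the title so we never trust it as a filename.
    return $cacheDir . "/" . md5( $title ) . ".html";
}

function fetchRenderedPage( $title ) {
    $path = cachedPagePath( $title );
    if ( file_exists( $path ) ) {
        # Hot path: no parsing, no DB. The kernel's filesystem cache
        # keeps popular pages in RAM, as in the Bomis experience above.
        readfile( $path );
        return;
    }
    # Miss: render from the database, store for next time, then send.
    $html = renderPageFromDatabase( $title );
    $fp = fopen( $path, "w" );
    fwrite( $fp, $html );
    fclose( $fp );
    echo $html;
}

# On every successful edit or deletion, the cached copy must go:
function invalidateCachedPage( $title ) {
    @unlink( cachedPagePath( $title ) );
}
?>

Whether the file_exists()/readfile() path beats a SELECT of
pre-rendered text from MySQL is exactly the open question; the only
honest answer is to benchmark both under realistic load.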
Here's a patch for wikiSettings.php that fixes the problem of variables
being used before they are defined. See my previous post for some of the
rationale for this patch.
This fix makes the code actually work, so the pages end up being in the
colours assigned here. But we don't really want multicoloured pages, so
I've changed most of the colours to #FFFFFF so that there is no effective
change of page colouring.
This patch also fixes the problem of using $wikiCharset before it is
defined. I've just used "iso-8859-1" and "Latin-1" instead.
Zundark
*** wikiSettings.php.old Tue Feb 26 18:17:10 2002
--- wikiSettings.php Wed Apr 10 17:46:38 2002
***************
*** 21,33 ****
$wikiDBconnection = ""; # global variable to hold the current DB
# connection; should be empty initially.
! # Namespace backgrounds
$wikiNamespaceBackground = array () ;
! $wikiNamespaceBackground[$wikiTalk] = "#eeFFFF" ;
! $wikiNamespaceBackground["user_talk"] = $wikiNamespaceBackground["talk"] ;
! $wikiNamespaceBackground["wikipedia_talk"] = $wikiNamespaceBackground["talk"] ;
! $wikiNamespaceBackground[$wikiUser] = "#FFeeee" ;
! $wikiNamespaceBackground[$wikiWikipedia] = "#eeFFee" ;
$wikiNamespaceBackground["log"] = "#FFFFcc" ;
$wikiNamespaceBackground["special"] = "#eeeeee" ;
--- 21,30 ----
$wikiDBconnection = ""; # global variable to hold the current DB
# connection; should be empty initially.
! # Namespace backgrounds. (Those with variable indices are assigned later.)
$wikiNamespaceBackground = array () ;
! $wikiNamespaceBackground["user_talk"] = "#FFFFFF" ;
! $wikiNamespaceBackground["wikipedia_talk"] = "#FFFFFF" ;
$wikiNamespaceBackground["log"] = "#FFFFcc" ;
$wikiNamespaceBackground["special"] = "#eeeeee" ;
***************
*** 41,48 ****
include_once ( "wikiLocalSettings.php" ) ;
# Initialize list of available character encodings to the default if none was set up.
! if ( ! isset ( $wikiEncodingCharsets ) ) $wikiEncodingCharsets = array($wikiCharset);
! if ( ! isset ( $wikiEncodingNames ) ) $wikiEncodingNames = array($wikiCharset); # Localised names
#
# This file loads up the default English message strings
--- 38,47 ----
include_once ( "wikiLocalSettings.php" ) ;
# Initialize list of available character encodings to the default if none was set up.
! if ( ! isset ( $wikiEncodingCharsets ) )
! $wikiEncodingCharsets = array("iso-8859-1");
! if ( ! isset ( $wikiEncodingNames ) )
! $wikiEncodingNames = array("Latin-1"); # Localised names
#
# This file loads up the default English message strings
***************
*** 54,59 ****
--- 53,68 ----
include_once ( "wikiText" . ucfirst ( $wikiLanguage ) . ".php" ) ;
}
+ # More namespace backgrounds, now that the required variables have
+ # been defined. We must be careful not to overwrite any values that
+ # have been assigned elsewhere.
+ if ( ! isset ( $wikiNamespaceBackground[$wikiTalk] ) )
+ $wikiNamespaceBackground[$wikiTalk] = "#FFFFFF" ;
+ if ( ! isset ( $wikiNamespaceBackground[$wikiUser] ) )
+ $wikiNamespaceBackground[$wikiUser] = "#FFFFFF" ;
+ if ( ! isset ( $wikiNamespaceBackground[$wikiWikipedia] ) )
+ $wikiNamespaceBackground[$wikiWikipedia] = "#FFFFFF" ;
+
# Functions
# Is there any reason to localise this function? Ever?
I have been thinking about the performance of Wikipedia, and how it
might be improved.
Before I go off and investigate in detail, I'd just like to check my
basic concept of how the code works (based on reading this list -- I
haven't pulled down the CVS to look at it yet).
=== Total guesswork follows ===
Am I right in thinking that, for each ordinary page request,
* the raw text is pulled out of the database
* the text is parsed and reformatted
* links are looked up to see whether their target pages exist, and are
treated accordingly
* the final HTML page is generated, with page decorations added as per
the theme
My general impressions of the activity rate are:
* about 100 pages per day are created or deleted
* roughly one edit every 30 seconds
* roughly one page hit every second
Packet loss seems negligible, so you don't seem to be running out of
bandwidth.
Although I guesstimate the hit rate at around one per second, pages
seem to be taking around 5 seconds to serve, suggesting that the
system is probably running at a load average of, say, 5 or so.
My best guess is that the parsing and lookups on regular pages are
currently the main load, not editing or exotic database queries -- is
this right?
Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is
unlikely to be the bottleneck: it's more likely to be CPU and
inter-process locking problems.
If so, I think careful page content caching could greatly improve
performance, by reducing the number of page parsings, renderings and
link lookups, at the cost of a slight increase in the cost of page
deletion and creation. And by freeing up resources, it should improve
performance on all operations across the board.
If I'm right, I think suitably intelligent caching could be applied not
only to ordinary pages, but also to some special pages, without any
major redesign or excessive complexity.
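(One concrete example of the kind of saving described above: if link
existence is currently checked with one database query per link,
batching the checks into a single query cuts the per-page query count
dramatically. A hypothetical sketch, not a description of the current
code -- "cur" is assumed here to be the table of current articles:)

<?php
# Hypothetical sketch: check the existence of all linked titles in one
# query instead of one query per link.
function lookupLinkTargets( $db, $titles ) {
    if ( count( $titles ) == 0 ) return array();
    $quoted = array();
    foreach ( $titles as $t ) {
        $quoted[] = "'" . addslashes( $t ) . "'";
    }
    $res = mysql_query( "SELECT title FROM cur WHERE title IN (" .
                        implode( ",", $quoted ) . ")", $db );
    $exists = array();
    while ( $row = mysql_fetch_row( $res ) ) {
        $exists[$row[0]] = true;    # render these as live links
    }
    return $exists;                 # missing titles become edit links
}
?>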
Before I start to look at things in more detail, could anyone confirm
whether I am even vaguely making sense?
-- Neil
Magnus Manske wrote:
> I think we all work with "standard settings", and there are no warnings
> showing up, just like the Bomis server uses standard and doesn't show
> anything like that, either.
But at least one major Wikipedia bug was caused by ignoring these
warnings, so this certainly isn't a good idea. (In development, that is.
The Bomis server is obviously a different matter.)
> > Here, $wikiTalk, $wikiNamespaceBackground["talk"], $wikiUser and
> > $wikiWikipedia are all undefined. I've no idea how to clean this
> > up, because I don't understand what it's supposed to look like.
> > Why are some of the indices variables and other constants?
> > In particular, what is the intended distinction between
> > $wikiNamespaceBackground["talk"] and $wikiNamespaceBackground[$wikiTalk]?
> > What should be done with this code?
>
> The reason (without looking at the code right now) is probably the missing
> "global" statement at the beginning of the function.
No, it's caused by using them before they are defined. The file
wikiTextEn.php which defines them is included later on. (This doesn't
apply to $wikiNamespaceBackground["talk"], which isn't defined anywhere.
I assume it's a mistake for $wikiNamespaceBackground[$wikiTalk].)
For the same reason, $wikiCharset is also used before being defined.
So the values in $wikiNamespaceBackground need to be assigned
after wikiTextEn.php (and any other language-specific setting file)
has been included. But these files can modify $wikiNamespaceBackground
(at least, the Esperanto one does - perhaps it shouldn't), so the only
solution appears to be to declare $wikiNamespaceBackground first,
then include the language-specific files, then assign values to
$wikiNamespaceBackground, but making sure not to overwrite any values
that have already been assigned. I'll post a patch for this later.
--
Zundark
Right now, the search box and last date of change appear to the
right, not centered. Is that intentional?
Also, I noticed in Mozilla that the article text is very close to the
left border of the browser, much closer than in the other skins. It
looks strange.
Axel