Disk space on suda is really, *really* tight. Every couple of days we
find ourselves clearing off files because it's *full*.
Would it be possible to temporarily toss in another disk, any disk, to
which we could migrate ~20 GB of the database plus future binlogs?
-- brion vibber (brion @ pobox.com)
So, the way that MediaWiki is currently set up, we have two fields for
identifying a contributor:
* user name
* nick
I think (but don't know) that the idea here is that your "user name"
is your "real" name, like "Evan Prodromou", and your "nick" is going
to be a nickname, handle, or pseudonym, like "Mister Bad". This may
come from the tradition on some wikis, like Ward's Wiki, Meatball, and
others, where using your real name is the norm.
It seems that on Wikipedia, other Wikimedia projects, and Wikitravel
(which I'm most interested in), this is not the case. People treat a
user name like a Unix, IRC, or other "user account": an abbreviated
name or a pseudonym. The "nick" field is generally used just for
making fancy signatures; in other cases it provides a _second_
pseudonym or abbreviation.
Now, I'm the last person to put down pseudonyms. I think they're a
crucial part of Internet culture. But real names can be useful for,
say, getting credit as a contributor to an article. Somewhere along
the way here we lost the slot for adding a "real name" to a user
account.
You can't provide your real name even if you want to. Putting your
real name in the user name slot gets lost in the noise; when I see a
user account like "Bob Frapples", I don't know whether that's a clever
pseudonym or actually a real name. Contributors who want their real
name recognized now put it on their user pages, but that makes it
difficult for software to determine what a user's real name is.
I'd like to embrace the reality of the situation and have two identity
fields, plus a display field:
* User account name -- a pseudonym or abbrev or whatever
* Real name -- preferred form of legal name
* Signature -- fancy formatting for signatures
For these reasons, I'd like to propose the following:
* We add a nullable user field "user_real_name".
* The login/account creation page has an additional field for "real
name", with an explanation that it's optional, and only for
attribution, etc.
* The preferences page lets you change your real name.
* We change the documentation for the user name to note that it's a
nickname and doesn't need to be your real name.
* We change the documentation for the "nick" field to note its use as
a "signature" format.
Automatic attribution tools can use the real name field if it's
provided, or the preferred pseudonym ("Wikitravel user Hogwallop") if
not.
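For illustration, here is a rough sketch of that attribution rule in Python
(the field names are just assumed to match the proposal above):

def attribution_name(user, project="Wikitravel"):
    """Prefer the real name for attribution; fall back to the pseudonym."""
    if user.get("user_real_name"):
        return user["user_real_name"]
    return "%s user %s" % (project, user["user_name"])

# attribution_name({"user_name": "Hogwallop", "user_real_name": None})
# -> 'Wikitravel user Hogwallop'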
The user account name would continue to be shown everywhere it is now,
and the "nick" field would continue to be used primarily
(exclusively?) in the ~~~ signature areas. The main thing is that if
contributors want attribution under their real name, but identity in
the system under a nickname, they get it.
Lastly, I think an easy way to change your user name is necessary, to
make this shift in emphasis easier for those who want it. That's a
whole can of worms, but I don't think it's impossible to deal with.
~ESP
--
Evan Prodromou <evan(a)wikitravel.org>
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide
Ordinarily, the POST request for an edit must contain the correct wpEditTime
parameter, otherwise an edit conflict will be triggered. This has the
side-effect of forcing bots (malicious or otherwise) to request an edit page
before they POST their text. However, if wpSection=new, that is, the "post a
comment" feature is being used, the edit time is not checked.
At around 2003-04-08 14:30 UTC, an attack was performed on the Chinese
Wikipedia, using this fact. The attacker sent a flood of POST requests with
wpSection=new, apparently not waiting for the response from the server
before sending the next one. This allowed him/her to vandalise about 2 pages
per second. The content of the message was not corporate spam, as previous
bot attacks have been; rather, it was a puerile anti-China message about
eating dogs and cats.
My suggestions for dealing with this kind of attack are:
* Limiting the rate at which any given IP address can send POST requests,
i.e. throttling (see the sketch after this list)
* A facility for fast filter configuration, perhaps even at the sysop level.
Frantically editing EditPage.php at 1:30am is not my idea of fun.
* Securing edit submission such that a bot must request an edit page first
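As a sketch of the first suggestion, a simple per-IP sliding-window throttle
could look something like this (Python purely for illustration; the window
and limit are made-up numbers):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # made-up value
MAX_POSTS = 10        # made-up value: POSTs allowed per IP per window
_recent = defaultdict(deque)

def allow_post(ip):
    """Return True if this IP is still under the POST limit."""
    now = time.time()
    q = _recent[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_POSTS:
        return False   # throttle: reject or delay the request
    q.append(now)
    return True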
-- Tim Starling
Hello.
I am writing to introduce a project that complements Wikipedia. Jimmy suggested
I post here, so here goes...
The goal of the Worldwide Lexicon Project (www.worldwidelexicon.org) is to
create a standard procedure for discovering and querying a wide range of
language resources, including machine translation servers, dictionaries,
encyclopedias (e.g. Wikipedia), and even human translators. It sounds
ambitious, but all WWL does is create a machine-readable directory of
language services and define a standard set of CGI parm/value pairs and XML
responses as a poor man's web services interface. Read about the REST method
of building web services; we borrowed heavily from this model.
We would like to ask the Wikipedia community to consider supporting WWL. This
is very easy to do. All that is required is to modify the existing search
script to recognize the standard parameters used in WWL queries, and to
generate an XML response in place of HTML. To show how simple this is, here's
an example.
A WWL client wants to search a Spanish Wikipedia for "Antoni Gaudi". This is a
two-step process. First the client, perhaps an embedded browser plug-in,
queries a WWL directory server as follows:
http://www.trekmail.com/wwl/sn.asp?action=findservices&sl=esp&servicetype=c…
Note: if the client does not already know the location of a directory server,
it simply loads http://root.worldwidelexicon.org to get a list of directory
servers.
The directory server replies with an XML dataset containing a list of Spanish
encyclopedias including, presumably, a Spanish wiki. Each record contains a
baseurl for each server, which informs the client which script to invoke. The
record for the Spanish Wikipedia might point to:
http://esp.wikipedia.org/search.php
The client then queries this script as follows:
http://esp.wikipedia.org/search.php?qt=wwl&action=searchtext&searchscope=title&sl=esp&stext=antonio+gaudi
The wiki server replies with an XML dataset containing matching records, full
text, and pointers to HTML URLs for each record. All it does is respond with
XML versus HTML when it sees the qt=wwl parm/value pair. The client application
parses the XML and displays it as desired. (Note that I am using querystring
notation; the queries could also be submitted via the POST method.)
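Putting the two steps together, a minimal client might look something like
this (Python purely for illustration; the servicetype value is a placeholder
since the directory URL above is truncated, "baseurl" comes from the
description, and everything else about the response format is an assumption):

import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

def fetch_xml(url):
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

# Step 1: ask a directory server for Spanish-language encyclopedia services.
directory_url = ("http://www.trekmail.com/wwl/sn.asp?action=findservices&sl=esp"
                 "&servicetype=" + urllib.parse.quote("encyclopedia"))
directory = fetch_xml(directory_url)
baseurls = [el.text for el in directory.iter("baseurl")]

# Step 2: query each returned search script, e.g. the Spanish Wikipedia's.
for baseurl in baseurls:
    query = (baseurl + "?qt=wwl&action=searchtext&searchscope=title"
             "&sl=esp&stext=antonio+gaudi")
    results = fetch_xml(query)
    # ...walk the result records and pull out full text and HTML URLs here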
If client side developers prefer, they can also access WWL services via
JavaBean or ActiveX objects. These are basically a collection of wrapper
functions that further simplify this process and also handle common errors.
These will be available later this month when we begin testing a multilingual
instant messaging client. These tools are not required, though; developers
can build applications in any development environment that can open a URL
and parse XML.
Supporting a basic WWL implementation requires only trivial changes to the
existing search script. In the future, you may also consider implementing an extended
feature set that describes parent and child categories related to an entry. I
believe this will also be easy to implement, but the basic implementation
described above is a quick and easy job.
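On the server side, the change really is just a branch on qt=wwl. A sketch of
the idea (Python for illustration, since the actual search script is PHP; the
XML element names are invented here because the response schema isn't spelled
out in this mail):

from xml.sax.saxutils import escape

def render_search_results(params, records):
    """records: list of dicts with 'title', 'text' and 'url' keys (assumed shape)."""
    if params.get("qt") != "wwl":
        return render_html(records)   # existing HTML path, unchanged
    rows = []
    for r in records:
        rows.append("<record><title>%s</title><text>%s</text><url>%s</url></record>"
                    % (escape(r["title"]), escape(r["text"]), escape(r["url"])))
    return '<?xml version="1.0"?><results>%s</results>' % "".join(rows)

def render_html(records):
    # stand-in for the wiki's existing HTML search output
    return "<html>...</html>"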
Why do this? What WWL does is create a peer-to-peer equivalent of the Google
API. So application developers will be able to build tools that can talk to
many types of WWL enabled resources. The range of possible applications is
quite broad. One application (among many) that is well suited to Wikipedia is a
translation memory. Commercial translation memory tools are expensive and often
not well maintained. By combining WWL and Wikipedia, it will be possible to do
the same thing with a lightweight client app.
If anyone is interested in creating a WWL front-end to Wikipedia, you can
contact me at brian AT mcconnell.net. I will be glad to explain the system in
detail and answer questions regarding implementation.
Thank you for your time.
Brian McConnell, Project Leader
Worldwide Lexicon
PS - we will also be releasing a Java based Jabber client that talks to machine
translation servers, and that also matches users with bilingual users who are
willing to translate for other IM users.
I agree about the quote from my writing. I've changed it from:
"RAID 1 mirroring offers about twice the number of read seeks as write seeks (each drive seeks independently). RAID 5 does not offer more read seeks than write seeks because each stripe is on all disks and all must seek together to get the data"
to:
"RAID 1 mirroring or RAID 10 offers about twice the number of read seeks as write seeks (each drive or stripe seeks independently). RAID 5 does not offer more read seeks than a single drive, RAID 1 or RAID 10 can deliver because each stripe is on all disks and all must seek together to get the data. In addition, in RAID 5 writes are slowed because at least one read is required to get the parity data unless it has been cached."
so it doesn't ignore the write rate reduction in RAID 5. That might be significant if it turns out that we become write-rate limited, and it might be one part of the reason why LiveJournal, which is almost exclusively write-limited, is switching from RAID 5 to RAID 10. It doesn't affect what I was intending to write about, though, which was the read rates of the various RAID systems.
Experience at Wikipedia is that Suda with a 3-disk RAID 5 setup is far slower than Geoffrin with a 4-disk RAID 10 setup. I'm interested in reading your views on why that is the case. Either way, though, I'm inclined to go with what we've seen of performance in the Wikipedia environment until we can make Suda with RAID 5 faster than it has been. If you can come up with some proposals which might do that, it's worth considering trying them, since the greater space efficiency of RAID 5 will be useful eventually.
RAID 5 compared to RAID 10 is interesting when it comes to sequential read rates because the RAID 5 system can read the data from more drives, so it can get a higher sequential transfer rate. The catch is that this is a database system and database systems are generally considered to be limited primarily by their seek rate, not their sequential transfer rate. There are some potential gotchas in that though - cases with large chunk/cluster sizes in the database and some access patterns might change it. "Transaction rate" rather than seek rate or sequential transfer rate has lots of significant details not spelled out, which is one reason why I stuck to the comparatively unambiguous seek and sustained transfer rate measures (though those have a fair amount of varying potential as well).
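To make those ratios concrete under the model in the quote above (a made-up
figure of 100 random seeks/second per drive, reads on RAID 1/10 served by any
one mirror, and a RAID 5 stripe spanning all drives):

SEEKS_PER_DRIVE = 100   # made-up figure; only the ratios matter

def raid1_read_seeks(drives=2):
    return drives * SEEKS_PER_DRIVE   # each mirror seeks independently

def raid10_read_seeks(drives=4):
    return drives * SEEKS_PER_DRIVE   # same, across all mirrored drives

def raid5_read_seeks(drives=3):
    return SEEKS_PER_DRIVE            # per the model above, all drives seek together

print(raid1_read_seeks(), raid10_read_seeks(), raid5_read_seeks())  # 200 400 100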
Yes, I agree that it's possible to have RAID 5 systems set up not to have striping across all drives in the RAID 5 system box. However, that's not how people normally think of RAID 5 - they are normally thinking in terms of one set of drives. The RAID 5 option offers fewer independent seeks than RAID 1, unless you start to do things like splitting the array stripes as you described. I'm not really sure what I'd call that, but RAID 5 probably isn't it. Maybe a pair of RAID 5s. In any case, I expect that to offer fewer seeks than RAID 1, because that two-drive minimum per stripe has to seek together and RAID 1 drives can seek independently.
I do not agree that RAID 5 offers the highest read transaction rate, in general. Please support that claim compared to RAID 1 and RAID 10 over in the article talk page. It'll be interesting to see your data and any you can point to which compares the systems. Since we're considering Wikipedia use, data with Wikipedia access patterns, including transfer sizes, is what really interests me. I don't know the typical transfer size per seek for Wikipedia, though.
In a past life I was disk then overall manager for CompuServe's benchmarks and standards community, so I'm always happy to discuss disk system performance - it's a fun subject for me.:) But probably best not done on this list.:)
A summary of current performance issues and discussions in IRC, combined with some purchase options discussions is available at:
http://meta.wikipedia.org/wiki/Upgrade_discussion_April_2004
Also linked from there are the new, excellent, Ganglia statistics Tim Starling set up a few days ago.
I'll shortly integrate the discussion from the mailing list which isn't yet covered there.
Hi,
I don't know if this error was already reported; at least I found no bug
report at SourceForge. When viewing the deletion log
(http://de.wikipedia.org/wiki/Wikipedia:L%F6sch-Logbuch) of the German
Wikipedia, there are several deleted entries which get parsed and break
the layout of the page. I just checked en: and there it works normally?!
Any clues?
Regards
Thomas aka Urbanus
There's a batch compression of de's old revisions running on moreri.
This doesn't seem to enjoy coexisting with apache+wiki, sending the
machine into *huge* loads (~50) every few minutes, so I've shut down
apache on that machine to take it out of the rotation and avoid bogging
things down.
-- brion vibber (brion @ pobox.com)
Hi!
Because of IMDB's no-bot policy I inquired a couple of weeks ago what the
best way would be to submit links to Wikipedia articles about movies, so
they can add them to the "External reviews" section for each movie. Here's
the reply:
----------------------------------------
I apologize for the delay in getting back to you (I was out of the office
for the past 2 weeks). The easiest way to automatically submit several
links is to send them to our mail server. You will need to send a
specially formatted email to adds(a)imdb.com with details of the title and
link.
Please note that our mail server doesn't normally accept incoming email
from unauthorized users so you will need to let us know the email address
of the sender so we can allow access to it.
This is the syntax you will need to use.
URLTITLE
title|type|URL|description|
END
or
URLNAME
name|type|URL|description|
END
where:
title = Title of film exactly as it appears on IMDb.com
type = a 3-letter code that identifies the type of link
URL = the link
description = description of the link
"type" can be one of the following:
COM comments/reviews
IMG image
SND sound
MOV movie
FAQ Frequently Asked Questions list.
OFF official sites
POS movie posters
TRA movie trailers
MSC miscellaneous i.e. anything that doesn't fit into a type above.
a few examples:
URLTITLE
Alien (1979)|COM|http://crazy4cinema.com/Review/FilmsA/f_alien.html|Crazy for Cinema
Alien (1979)|COM|http://efilmcritic.com/hbs.cgi?movie=583|eFilmCritic
Alien (1979)|COM|http://www.igs.net/~mtr/haiku-reviews.html#Alien|Haiku Reviews
Cheaper by the Dozen (2003)|COM|http://www.suntimes.com/ebert/ebert_reviews/2003/12/122402.html|R… Ebert, Chicago Sun-Times
Jaws (1975)|MSC|http://www.sharks.net/bigger_boats.html|Real shark attacks
END
or
URLNAME
Garbo, Greta|IMG|http://www.goldensilents.com/stars/gretagarbo.html|Golden Silents Portrait Photos|
Eastwood, Clint|MSC|http://www.sensesofcinema.com/contents/directors/03/eastwood.html… of Cinema - Great Directors Critical Database|
Hitchcock, Alfred (I)|MSC|http://hitchcock.tv/|Alfred Hitchcock - The Master Of
Suspense|
END
What I suggest is that you create a sample submission and send it to
me along with the email address that you would like to use as a
submitter, so I can check that everything looks ok and set things
up on our side.
GC
------------------------------------
So what we need is a mailbot which takes a list of Wikipedia articles and
the corresponding IMDB titles, apparently of the syntax "Title (year)" and
generates and sends the respective mails. Preferably it would keep track
of its submissions (could be done easily in Perl using a tie'd hash) so we
can update the list. For extra points, it could try to auto-guess the
title in the IMDB, say, by filtering out "(movie)" from the Wikipedia
title and looking for the first [12][089][0-9][0-9] match in the
article.
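As a sketch of what such a mailbot could look like (the suggestion above is
Perl with a tie'd hash; this is the same idea in Python using shelve, and the
sender address, subject line and COM type code are assumptions):

import shelve
import smtplib
from email.message import EmailMessage

def build_submission(entries):
    """entries: list of (imdb_title, wikipedia_url, description) tuples."""
    lines = ["URLTITLE"]
    for title, url, desc in entries:
        # COM = comments/reviews, per the type list above (assumed appropriate here)
        lines.append("%s|COM|%s|%s|" % (title, url, desc))
    lines.append("END")
    return "\n".join(lines)

def submit(entries, sender="bot@example.org"):
    seen = shelve.open("imdb_submissions.db")  # remembers what was already sent
    new = [e for e in entries if e[1] not in seen]
    if new:
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = "adds@imdb.com"
        msg["Subject"] = "Wikipedia article links"
        msg.set_content(build_submission(new))
        smtplib.SMTP("localhost").send_message(msg)
        for _, url, _ in new:
            seen[url] = True
    seen.close()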
Mike, are you still interested?
Regards,
Erik