One of the fun new things in MediaWiki 1.4 is validation and
normalization of UTF-8 text input. The wiki will strip out malformed
and illegal UTF-8 sequences, and normalize combining character
sequences to help avoid almost-but-not-quite-equal oddities. (See
http://www.unicode.org/reports/tr15/ for background.)
I figured it would be wise to do some spot-checks of the existing
databases to see just how much trouble we're in already... I checked
the October 30 'cur' table dumps for the Russian, Portuguese, and
Korean Wikipedias.
The normalization routine UtfNormal::cleanUp() does a first quick pass
to strip malformed UTF-8 byte sequences (this is extra-optimized for
predominantly Latin or pure-ASCII text); if any characters are
found during this pass that might indicate a non-normalized string, a
slower, full normalization pass is conducted.
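To give an idea of the shape of it (this is only a rough sketch of the
two-pass idea using PHP's iconv and intl Normalizer extensions, not the
actual UtfNormal code; cleanUpSketch is a made-up name):

// Sketch only -- not the real UtfNormal::cleanUp().
// Assumes the iconv and intl (Normalizer) extensions are available.
function cleanUpSketch( $text ) {
    // Quick pass: silently drop malformed UTF-8 byte sequences.
    $stripped = iconv( 'UTF-8', 'UTF-8//IGNORE', $text );
    if ( $stripped !== false ) {
        $text = $stripped;
    }
    // Slow pass only when the text is not already in normal form C.
    if ( !Normalizer::isNormalized( $text, Normalizer::FORM_C ) ) {
        $text = Normalizer::normalize( $text, Normalizer::FORM_C );
    }
    return $text;
}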
Portuguese (40422 pages):
text requiring slow check: 193 (0.5%)
non-normal or invalid text: 13 (0.0%)
non-normal or invalid title: 7 (0.0%)
non-normal or invalid comment: 1 (0.0%)
(All of the broken titles and the comment are illegal 8-bit latin-1
names on image pages, from an upload bot.)
Russian (14733 pages):
text requiring slow check: 571 (3.9%)
non-normal or invalid text: 18 (0.1%)
non-normal or invalid comment: 3 (0.0%)
(A lot of these are Greek text fragments with non-normalized accent
characters.)
Korean (7998 pages):
text requiring slow check: 780 (9.6%)
non-normal or invalid text: 745 (9.3%)
(Most of these are Han characters which appear in a special
'compatibility' duplicate encoding area, and are normalized to the
standard unified Han encoding of the same character. Many more are the
Greek (!?) middle-dot character being replaced by the Latin one, which
is the preferred encoding. This seems to get used in the formatting of
lists.)
The full normalization check of Korean text is the worst case for my
code -- every syllable gets decomposed into constituent parts and
reassembled, and it can add about a second to the save/preview time for
a 30k article on my (otherwise unloaded) 2GHz Athlon. Not too awful all
things considered (most articles are much shorter than 30kb, and since
less than 10% of Korean-language edits need the slow check, it should
not be a huge burden overall), but it should be possible to do much
better by running the slow pass only on substrings around the 'maybe'
points.
-- brion vibber (brion @ pobox.com)
Hello,
What's the sysadmins' consensus about using memcached or Tugela Cache? Is
either of the two being used, and are the experiences good?
I ran across this today:
http://sharedance.pureftpd.org/
Might be interesting for PHP session storage. Haven't tried it, though.
--
Greetings from Troels Arvin, Copenhagen, Denmark
Hello again.
I have put online a proof-of-concept version of a file verification
mechanism. The details:
I have played around with the Adobe SVG plugin a bit (version 3.01 /
Firefox 1.0PR / Linux). It seems that JavaScript is supported, but does
not have access to the HTML DOM, which is good. I thought about trying
this in MSIE, but then I thought some more...
Basically, this leaves us with the situation that JavaScript in SVG is
as secure as JavaScript in HTML - it isn't. So the solution would be, IMHO, to:
a) reject all files that (somehow) look like HTML (see the sketch below).
b) reject all files that (somehow) contain JavaScript.
c) scan all uploaded files for viruses.
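For (a), roughly the kind of test I have in mind (looksLikeHtml is a
made-up helper; this is NOT MSIE's real sniffing algorithm, only a crude
approximation, and the marker list is just an example):

// Crude sketch of option (a): does the start of the file look like HTML?
function looksLikeHtml( $contents ) {
    $head = strtolower( substr( $contents, 0, 1024 ) );
    $markers = array( '<!doctype html', '<html', '<head', '<body',
                      '<script', '<title', '<a href' );
    foreach ( $markers as $marker ) {
        if ( strpos( $head, $marker ) !== false ) {
            return true;
        }
    }
    return false;
}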
I have put up a crude prototype of such a checker:
http://area23.brightbyte.de/checkfile-test.php
The source is available there, but it would need some modifications to be
integrated into MediaWiki (I guess - I have never looked into the
source, and I don't plan to). To discuss what I have done here, please
go to
http://de.wikipedia.org/w/wiki/Benutzer:Duesentrieb/checkfile
I hope you like it and that it's not too hard to put it in. It would be
extremely helpful if we could again upload "obscure" things like MIDI
and SVG to the Wikipedia.
tnx,
daniel
Brion wrote:
> We have a heuristic check which attempts to match MSIE's heuristic test
> for HTML and rejects anything that matches. Hopefully it's good enough
> for that, though there may be other dangerous formats that it attempts
> to recognize, or other checks in the HTML heuristic which I might have
> missed.
OK... can I somehow help test this?
> MSIE's MIME type "detection" (the process in which it throws away the
> server's specified content-type information and pulls a new one out of
> its butt in an unreliable, insecure manner) is partially documented here:
> http://msdn.microsoft.com/workshop/networking/moniker/overview/
> appendix_a.asp
Urg. And that page does *not* state *how* the detection works... more
guesswork :(
> MIDI is probably safe. It doesn't seem to be in IE's internally
> recognized list of types, so it shouldn't try to autodetect.
So *please* just enable it, OK?
> SVG is a more dangerous format; IIRC it explicitly allows for the use
> of JavaScript. Would you mind testing the main SVG-supporting browsers
> (particularly the Adobe SVG Viewer plug-in running in MSIE and Mozilla)
> to ensure that JavaScript in a SVG file can't access cookies or hijack
> the containing browser window?
Hmpf, that would require me to boot into Windows ;) Well, OK, I'll have a
look. Last time I checked, JavaScript in SVG was specified but not really
supported.
Also, we could just scan any SVG and other XML formats for "<script" and
"javascript:" and deny all files that contain such a string. That's a
little crude, but would work in 99% of cases, I guess.
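Roughly like this (containsScriptMarkers is a hypothetical helper, not
anything that exists in MediaWiki; $tmpName below just stands for wherever
the uploaded file lands):

// Hypothetical helper: reject XML/SVG uploads that contain anything
// script-like. Crude, but cheap to run on every upload.
function containsScriptMarkers( $contents ) {
    // Case-insensitive search for "<script" and "javascript:".
    return stripos( $contents, '<script' ) !== false
        || stripos( $contents, 'javascript:' ) !== false;
}

// Usage sketch:
// if ( containsScriptMarkers( file_get_contents( $tmpName ) ) ) {
//     ... refuse the upload ...
// }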
>> * when a file is uploaded, run "file -bi" against that file and
>> remember the output, which is (a pretty good guess of) the mime-type
>> of the file.
>
> MediaWiki can't generally rely on 'file' since it's an external
> program. It may not give consistent results on all platforms, and is
> completely absent on some (such as Windows). It's also known to fail to
> catch the MSIE holes, which can detect HTML on actual valid image files.
Well, one could always make that check optional, so one could just
disable it on systems where it is not available. I believe Cygwin
supplies a file command for Windows, though. But the problem that file
may be "smarter" than MSIE remains - there you have a point.
>> * have a map of mime-types-to-file-extensions. Look up the mime-type
>> returned by file in that table. If it mismatches the file extension,
>> warn about it and refuse to upload. Skip the test if the mime-type is
>> not in the table.
>
> For known image types, we already check that the detected image type
> matches the extension.
Good. Is it easy to extend the list of known mime/ext pairs?
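Just to show what I mean by a mime/extension map (the entries and the
extensionMatchesMime helper are only examples I made up; a real list
would need filling out):

// Example map only; a real deployment would need a much fuller list.
$mimeToExtensions = array(
    'image/png'     => array( 'png' ),
    'image/jpeg'    => array( 'jpg', 'jpeg', 'jpe' ),
    'image/gif'     => array( 'gif' ),
    'image/svg+xml' => array( 'svg' ),
    'audio/midi'    => array( 'mid', 'midi' ),
);

function extensionMatchesMime( $filename, $mime, $map ) {
    $ext = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );
    if ( !isset( $map[$mime] ) ) {
        return true; // unknown MIME type: skip the test, as suggested
    }
    return in_array( $ext, $map[$mime] );
}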
>> If we are concerned about viruses in general, why not run a virus
>> scanner against every uploaded files? Uploads are not the frequent,
>> CPU should be able to cope with that.
>
> Mainly we're concerned about JavaScript session hijacking, but other
> problems are a concern as well. Feel free to whip up a wrapper around
> clamav or something, that might be useful...
OK, I'll have a look at it; it should be trivial enough. But I'll leave
the integration to you, because for me it would be a lot more work to
find out where to put this than to write the function itself...
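For what it's worth, the kind of trivial wrapper I have in mind (fileIsClean
is a made-up name; it assumes the clamscan command-line tool is installed,
which exits with 0 for clean, 1 for a virus found, and 2 for errors):

// Hypothetical wrapper around the clamscan command-line scanner.
// clamscan exit codes: 0 = clean, 1 = virus found, 2 = error.
function fileIsClean( $path ) {
    exec( 'clamscan --no-summary ' . escapeshellarg( $path ) . ' 2>/dev/null',
          $output, $status );
    if ( $status === 0 ) {
        return true;
    }
    if ( $status === 1 ) {
        return false;
    }
    return null; // scanner error; up to the caller what to do with that
}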
Thanks,
Daniel
I've been running MediaWiki on Linux with no problems, with multiple
trouble-free installations, and am now doing my first installation on Mac
OS X.
I have Mac OS X 10.3.2 and am running MySQL 4.1. I already ran into
the problem of having to set the root password the "old way" on
MySQL, which allowed me to connect to the database server (the
MediaWiki user guide was good on this point). However, it's now
failing on this (see below, note tables were created by previous
attempt). I was wondering if someone intimately familiar with the
installation script could give a suggestion about how to proceed.
Also, the output line states that wikidb exists. In fact, when I check
using the command-line client, the wikidb database does not exist.
Any suggestions from the experts? I'm at a loss.
Thanks,
Matthew Trump
================================================
PHP 4.3.2: ok
PHP server API is apache; ok, using pretty URLs (index.php/Page_Title)
Have XML / Latin1-UTF-8 conversion support.
PHP is configured with no memory_limit.
Have zlib support; enabling output compression.
Couldn't find GD library or ImageMagick; image thumbnailing disabled.
Installation directory:
/Users/trump/vhosts/multnomah.decumanus.com/htdocs/mediawiki-1.3.7
Script URI path: /mediawiki
Connected as root (automatic)
Connected to database... 4.1.7-standard; enabling MySQL 4 enhancements
Database wikidb exists
There are already MediaWiki tables in this database. Checking if
updates are needed...
...ipblocks is up to date.
...already have interwiki table
...indexes seem up to 20031107 standards
...have linkscc table.
...linkscc is up to date, or does not exist. Good.
...have hitcounter table.
Converting links table to ID-ID...
Sorry! The wiki is experiencing some technical difficulties, and
cannot contact the database server.
The first of the two conferences went quite well.
I only ran into a couple of problems, and these
could perhaps be solved for the next conference.
It was held at Wentworth-by-the-Sea in Newcastle,
NH, right next to Rye and Hampton Beach... the
place was gorgeous! And the facilities were awesome;
the AV folks there were very helpful in getting the
equipment all connected, and very accommodating.
However, the costs were pricey: a live internet
connection went for $150, and a live AV support person
was $500 for the day... I didn't need an AV person,
but even so, they were helpful anyway... no charge
for getting the connectors and making sure I was
OK to start the presentation. They were very helpful
indeed and very friendly!
All in all, I presented on Friday and Saturday, and
the headcount totals were over 100 educators,
school admins, and staff from the Northern New England
area.
The presentation on Wikipedia was well received...
the major objection, or should I say concern, was the
question: "How can Wikipedia content be
accurate without 'experts' writing the articles?"
My response was that the articles can be checked
against other online encyclopedia content, and thus
one can be reassured of the quality of the content.
Also, there are safeguards like 'peer editing'
and 'NPOV' which help make the articles both
accurate and unbiased, or as much as an NPOV
objective can guarantee.
If there are other good points to counteract this
objection, I'd be interested in putting them into the
next presentation... although for this next one,
at the McAuliffe Center Tech Conference, I'll have a
fraction of the time I had at the NHAWLT presentation.
One of the problems I ran into was that I lost 'mousage'
(the mouse pointer disappeared) when I projected the laptop screen.
I used two different projectors and also tried an attached
USB mouse instead of the glidepoint.
But the instance of MediaWiki running
on the Linspire laptop was flawless in itself... a really
good example of how powerful, and yet how easy to use and
learn to edit, the wiki engine is...
OpenOffice was very handy as a presentation tool.
I used the suggested presentation from French and
edited it and thinned it down for the time I had to
work with... I also used the screenshots of the
different language Wikipedias (that was done
around February)... And I incorporated the Alexa
charts to compare with Britannica, Groliers, Encyclopedia.com,
and refer.com, so as to give some idea of Wikipedia's
growth and expansion.
I also briefly explained some of the other sister projects
and highlighted why the wiki software lends itself so
well to collaborative projects...
The next conference is in December.
With thanks to all who helped me get MediaWiki
running on Linspire, and also to those who gave me
good suggestions on what to say during the presentation!
Best regards,
Jay Bowks
[[w:en:User:ILVI]]
On Wed, 27 Oct 2004 18:42:54 +0200, Ashar Voultoiz <hashar(a)altern.org> wrote:
> ilooy wrote:
>
>
> > Hi fellow Wikipedians,
> >
> > I'll be doing a presentation on Wikipedia for
> > the McAuliffe Conference and the NHAWLT
> > Teachers' Conference in the coming weeks
> > and was wondering if anyone else has done
> > a presentation to a large group... what features
> > would be good to highlight... and what seems
> > to work well with a group that has not heard
> > much about the project yet.
> >
> > If you have some ideas as to what might
> > be good points to bring out please let me
> > know. I appreciate any insights you may
> > offer on this subject.
> >
> > with sincere regards,
> > Jay B.
> > [[w:en:User:ILVI]]
>
Hello all.
The fact that SVG, MIDI, and other formats are blocked is getting really
annoying. People complain about it over and over, and it's also a bad
situation given that the GFDL calls for the
"transparent source" of a document.
As I understand it, those formats are blocked because MSIE interprets
everything that *looks* like HTML as HTML. It was then stated that in
order to circumvent this, a verifier would have to be written for all
formats. I do not understand why this is so, and I would like to suggest
a simple solution:
* when a file is uploaded, run "file -bi" against that file and remember
the output, which is (a pretty good guess of) the mime-type of the file.
* if the mime type is "text/html", refuse the upload.
* if the mime type is a forbidden format (exe, etc), refuse the upload.
That should be enough. If you want to be picky about the file's type,
also do the following:
* have a map of mime-types-to-file-extensions. Look up the mime-type
returned by file in that table. If it mismatches the file extension,
warn about it and refuse to upload. Skip the test if the mime-type is
not in the table.
If we are concerned about viruses in general, why not run a virus
scanner against every uploaded file? Uploads are not that frequent; the CPU
should be able to cope with that.
BTW: may I also suggest converting file extensions to lowercase in
the same step where the " "-to-"_" conversion happens? That would be great...
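For example (just a sketch with a made-up normalizeUploadName helper,
independent of how the space conversion is actually implemented):

// Sketch only: normalize an upload name in one step --
// spaces become underscores and the extension becomes lowercase.
function normalizeUploadName( $name ) {
    $name = str_replace( ' ', '_', $name );
    $dot = strrpos( $name, '.' );
    if ( $dot !== false ) {
        $name = substr( $name, 0, $dot ) . strtolower( substr( $name, $dot ) );
    }
    return $name;
}

// normalizeUploadName( "My Picture.JPG" ) gives "My_Picture.jpg"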
Please excuse me if this was all a pile of rubbish based on a
misunderstanding - just point it out. Furthermore, I'm willing to write
a routine that does the above, or anything else necessary, provided I
do not have to dig deep into the MediaWiki code. Just tell me the specs
of the function, and I'll post it here.
--
Homepage: http://brightbyte.de
I'm installing MediaWiki for the first time. I ran into some problems with
the install in which I had to set the MySQL root account to use the old
password authentication method. I seem to have fixed that problem, but now
I'm running into this:
Checking environment...
PHP 5.0.2: the MonoBook skin will be disabled due to an
incompatibility between the PHPTAL template library and PHP 5. The
wiki should function normally, but with the older look and feel.
PHP server API is apache; ok, using pretty URLs (index.php/Page_Title
)
Have XML / Latin1-UTF-8 conversion support.
PHP is configured with no memory_limit.
No zlib support.
Couldn't find GD library or ImageMagick; image thumbnailing disabled.
Installation directory: /data/local/apache-dso/htdocs/wiki
Script URI path:
Connected as root (automatic)
Connected to database... 4.1.7-standard-log; enabling MySQL 4
enhancements
Created database wikidb
Creating tables... done.
Initializing data...
Granting user permissions...
Sorry! The wiki is experiencing some technical difficulties, and
cannot contact the database server.
The thing that I don't understand is that the installation was able
to contact the server; it actually created the database and created
the user. Am I missing something obvious here? I'm not that good with
MySQL. Any help that anyone can provide would be appreciated.
Thanks,
Christian Lair