On Fri, Oct 16, 2009 at 3:25 PM, Jona Christopher Sahnwaldt
<jcsahnwaldt(a)gmail.com> wrote:
> How to fix this? I think MediaWiki should make sure
> that a comment contains only valid UTF-8 sequences,
> even when it is truncated. This may mean that it
> has to be truncated to less than 255 bytes.
> Alternatively, the dump process could drop invalid
> UTF-8 sequences instead of replacing them.
> Yet another fix: mwdumper should make sure
> that a comment is at most 255 bytes long and
> truncate it if necessary.
The silent truncation thing is ridiculous on a lot of levels:
* It creates invalid UTF-8, if MySQL isn't using the utf8 character set
(which has its own problems, and Wikimedia doesn't use it).
* How many characters can be stored depends on how many bytes per
character the relevant language's writing system needs in UTF-8, so
Chinese/Arabic/Hebrew/Greek/Russian/etc. users get <150 characters.
(Unless you use the utf8 character set, which has its own problems, and
Wikimedia doesn't use it.)
* It will cause a fatal error if MySQL is in strict mode.
* It makes it difficult to impossible for a specific wiki to decide to
allow longer edit summaries. (Personally, I find 255 characters is
often too short. I think Citizendium is hacked to allow more.)
* The limit in MySQL is counted in bytes (unless you use the utf8
character set, etc.), but HTML maxlength is counted in characters, so we
have no way to effectively limit things client-side without
JavaScript. Currently we fake it by setting a maxlength of 200
characters, and hoping that that winds up being less than 255 bytes.
That leaves enough breathing room so languages like French don't
overrun, but I assume speakers of
Chinese/Arabic/Hebrew/Greek/Russian/etc. languages are just resigned
to the fact that their edit summaries get unpredictably truncated.
Also, 200 is unnecessarily small for English, where 255 characters would
usually fit -- in fact enwiki has a Gadget that hacks this up, and
I've sometimes edited up the maxlength manually when I found I wanted
a little more space.
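To put rough numbers on the byte/character mismatch in the list above, here is a small Python sketch. The sample characters are picked purely for illustration, and utf8_len and max_chars_in are hypothetical helper names, not anything in MediaWiki.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_chars_in(limit_bytes: int, sample_char: str) -> int:
    """How many copies of sample_char fit in limit_bytes of UTF-8."""
    return limit_bytes // utf8_len(sample_char)


# One representative character per script (illustrative, worst case).
SAMPLES = {
    "English": "a",        # 1 byte per character
    "French": "\u00e9",    # e-acute, 2 bytes
    "Russian": "\u044f",   # Cyrillic ya, 2 bytes
    "Chinese": "\u4e2d",   # CJK ideograph, 3 bytes
}

if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        summary = ch * 200  # what maxlength=200 lets through
        print(f"{name}: 200 chars = {utf8_len(summary)} bytes, "
              f"{max_chars_in(255, ch)} chars fit in 255 bytes")
```

Real French summaries are mostly ASCII, so the all-accented case is a deliberate worst case. For scripts where every character takes two or three bytes, a 200-character summary is well past 255 bytes.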
The correct fix is just to make the field TEXT/BLOB so the length
limit is enforced purely in the application. The same goes for
log_comment, where last I checked we weren't even doing the
maxlength=200 hack. Are there any objections to finally doing this?
What's the procedure these days for schema changes? I'd check in
something right now, in fact, except that I have to go in like ten
minutes.
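For illustration only, enforcing a byte limit in the application without ever producing invalid UTF-8 could look roughly like the Python sketch below. This is not MediaWiki code (which is PHP); truncate_utf8 is a hypothetical name, and the 255 default is just the current column size.

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes,
    without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may leave an incomplete sequence at the end;
    # errors="ignore" drops that partial character instead of failing.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    summary = "\u044f" * 200  # 200 Cyrillic characters, 400 bytes
    safe = truncate_utf8(summary)
    print(len(safe), len(safe.encode("utf-8")))
```

Because the input is valid UTF-8 to begin with, the only invalid bytes after the slice are the partial character at the very end, so dropping them yields the longest valid prefix that fits. The result can be shorter than max_bytes by up to three bytes.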
(erm, end rant)