On Mon, 2002-02-25 at 07:55, Jan Hidders wrote:
From: "Brion Vibber" <brion(a)pobox.com>
On Sat, 2002-02-23 at 08:24, Jan Hidders wrote:
>
> Yes, and the accept-charset header is indeed the meta-tag I was thinking of.
Does it vary appropriately with the default language of the browser? I
tried a few browsers on my system (US English or non-specific versions)
and got:
Mozilla 0.9.8: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Konqueror: unicode, utf-8, *
Netscape 4.78: iso-8859-1,*,utf-8
Opera 6: windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0,iso-8859-1;q=0.6, *;q=0.1
Internet Explorer 5.0: (nothing)
lynx: (nothing)
FWIW, I tried it with nl-be and, as you would expect, it is the same for
Netscape and Explorer as above. I'll see what happens if I turn on
multi-language support in Windows.
In the case of multiple supported encodings, do we use the first one, or
the "best" one?
I would say the best one. The only reason not to do so is to show users
how to write certain characters as entities, but if somebody decides to edit
a page on the Russian Wikipedia we might expect that this person knows how
to enter Cyrillic characters on his or her keyboard, or at least knows how to
enter them as entities. I'm not sure what we should do when q != 1.0 for the
best one, though.
Besides, if they would benefit from seeing the entities, they would
benefit just as well from cut-n-paste, wouldn't they? Who's really going
to memorise the numbers?
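For concreteness, picking the "best" one really just means sorting the header's
entries on their q values and taking the highest-rated charset the wiki can
actually serve. A rough PHP sketch (the function names are invented, not from
the actual script):

<?php
// Rough sketch, not the real code: split an Accept-Charset header into
// charset => q pairs (q defaults to 1.0) and sort with the highest q first.
function parseAcceptCharset( $header ) {
    $prefs = array();
    foreach ( explode( ',', strtolower( $header ) ) as $item ) {
        $parts = explode( ';', trim( $item ) );
        $charset = trim( $parts[0] );
        $q = 1.0;
        if ( isset( $parts[1] ) && preg_match( '/q=([0-9.]+)/', $parts[1], $m ) ) {
            $q = (float)$m[1];
        }
        if ( $charset !== '' ) {
            $prefs[$charset] = $q;
        }
    }
    arsort( $prefs );
    return $prefs;
}

// Return the best charset the wiki can serve, or false if nothing matches.
function pickBestCharset( $header, $canServe ) {
    foreach ( parseAcceptCharset( $header ) as $charset => $q ) {
        if ( $q > 0 && in_array( $charset, $canServe ) ) {
            return $charset;
        }
    }
    return false;
}

// Opera 6's header from above would come out as "utf-8":
echo pickBestCharset(
    'windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0,iso-8859-1;q=0.6, *;q=0.1',
    array( 'utf-8', 'iso-8859-1' )
), "\n";

A browser that sends nothing at all (Explorer and lynx above) simply falls
through to whatever default the wiki is configured with.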
With regards to L. Crocker's earlier comment about UTF-8 being
potentially problematic with older browsers; after checking the
Accept-Charset field we *know* that a browser supports UTF-8, and can
provide an alternate, more limited, encoding if not.
That, I think, would be my personal preference.
Mine too.
While I'm at it -- if instead of HTML entities we were to use UTF-8
internally, we could get around the search problem by simply changing
high characters in the fulltext index to hex codes.
Just to be clear on this: if we want indexing to work we need to encode it
in a format that uses only the Latin-1 numbers, letters, "_" and "'". The
alternative is to run different MySQL servers with appropriate character sets
for different Wikipedias.
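Folding the high characters into hex codes built only from plain letters and
digits satisfies exactly that constraint. A sketch of what the folding could
look like (PHP; toIndexForm() is an invented name, not the actual code):

<?php
// Sketch only: turn runs of high (non-ASCII) bytes in UTF-8 text into hex
// tokens made of plain letters and digits, which the fulltext index has no
// trouble with.
function toIndexForm( $text ) {
    return preg_replace_callback(
        '/[\x80-\xff]+/',
        function ( $m ) {
            return 'u8' . bin2hex( $m[0] );
        },
        $text
    );
}

// "ĉapelo" (c-circumflex is 0xC4 0x89 in UTF-8) indexes as "u8c489apelo"
echo toIndexForm( "\xc4\x89apelo" ), "\n";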
Can you give an example? What is at the moment the behaviour of the
Esperanto Wikipedia?
The current behavior that you'll find on visiting eo.wikipedia.com
(running the old software):
- Article contents, titles, comments, user interface messages, etc. are
shown in UTF-8.
- The article edit box is in limited UTF-8, with the standard diacritics
transliterated (X-system).
- Any input (edit box, comment field, search box, username) is
internally normalised to X-system, so users can type any way they wish.
- On selecting a convenient link at the top of the screen, article
contents, user interface messages etc. are optionally shown in X-system.
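For reference, the transliteration table behind that is tiny. A sketch (PHP,
lower case only; a full version would also have to handle capitals and the
occasional literal "ux"):

<?php
// Sketch of the X-system mapping for the six Esperanto letters with
// diacritics (lower case only).
$xDigraphs = array(
    'cx' => "\xc4\x89",   // ĉ
    'gx' => "\xc4\x9d",   // ĝ
    'hx' => "\xc4\xa5",   // ĥ
    'jx' => "\xc4\xb5",   // ĵ
    'sx' => "\xc5\x9d",   // ŝ
    'ux' => "\xc5\xad",   // ŭ
);

function xSystemToUtf8( $text ) {
    global $xDigraphs;
    return strtr( $text, $xDigraphs );
}

function utf8ToXSystem( $text ) {
    global $xDigraphs;
    return strtr( $text, array_flip( $xDigraphs ) );
}

// Both spellings normalise to the same stored form:
echo xSystemToUtf8( 'cxirkaux' ), "\n";                  // ĉirkaŭ
echo utf8ToXSystem( "\xc4\x89irka\xc5\xad" ), "\n";      // cxirkaux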
Ok. I see your point. Does this also work for Windows users?
Yes, it works for everybody.
Okay, but what's "special behavior" and what's "default behavior"? Say I
visit the Japanese wikipedia, the French wikipedia, and the Esperanto
wikipedia; is using a native encoding for native editing in each one
"special behavior"? Or is using a common encoding for all "special
behavior"?
I would say that the default behaviour is that the edit box uses the best
character encoding your browser accepts. If the browser doesn't specify this,
then the site falls back to Latin-1 or whatever is specified as
$defaultEncoding on that site.
Sounds reasonable to me.
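A minimal sketch of that fallback, assuming a per-wiki $defaultEncoding and a
hypothetical list of encodings the wiki can serve (the crude substring check
stands in for the proper q-value handling sketched earlier):

<?php
// Sketch: use the best charset the browser accepts if it sent an
// Accept-Charset header at all, otherwise the per-wiki default.
$defaultEncoding    = 'ISO-8859-1';
$supportedEncodings = array( 'utf-8', 'iso-8859-1' );

$header = isset( $_SERVER['HTTP_ACCEPT_CHARSET'] )
    ? strtolower( $_SERVER['HTTP_ACCEPT_CHARSET'] )
    : '';

$editEncoding = $defaultEncoding;
foreach ( $supportedEncodings as $enc ) {
    if ( strpos( $header, $enc ) !== false ) {
        $editEncoding = $enc;
        break;
    }
}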
[...] Whatever happens, though, we still need the custom transliteration
functions available for the X-system conversion. Those users without special
keyboard drivers (or worse yet, with Netscape 4's Unicode support) need to be
able to type and sometimes read that way.
By the way, is this specified in a charset header sent with the POST?
There's no real way to get that information, to my knowledge. This
particular conversion is clean enough that it can be run over any input
the user gives, though, so my preference would be to just keep filter
functions available over all editable text.
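In practice that just means running the filter over every editable field that
comes in, regardless of what the browser did or didn't declare. A sketch (the
field names and normalizeInput() are invented for the example):

<?php
// Sketch: since the POST doesn't tell us the charset, run the wiki's input
// filter over every editable field unconditionally. On a wiki with nothing
// to recode the filter would simply return its argument.
function normalizeInput( $text ) {
    return strtr( $text, array( 'cx' => "\xc4\x89" ) );  // illustrative only
}

foreach ( array( 'text', 'summary', 'search' ) as $field ) {
    if ( isset( $_POST[$field] ) ) {
        $_POST[$field] = normalizeInput( $_POST[$field] );
    }
}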
HTML entities are unacceptably difficult for typing, and not helpful for
reading -- a browser that doesn't show the UTF-8 won't do any better with
the entities.
My main interest is that we maintain one software package that can be adapted
by setting some variables and includes. If we really cannot get around
having extra special encoding links that can be turned on or off depending
upon some variables, then so be it.
Agreed, which is why I added the little recoding functions in the first
place. They wouldn't be needed most of the time, but when they were, a
function or two could be dropped in with the localised messages.
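Something along these lines, perhaps, with every name here hypothetical: the
language include that carries the localised messages also names its recoding
functions, and the default include points both hooks at a no-op.

<?php
// Hypothetical sketch of dropping the recoding functions in with the
// localised messages; none of these names are from the real script.
$inputFilter  = 'eoInputFilter';   // applied to edit box, search, summaries
$outputFilter = 'eoOutputFilter';  // applied when serving the X-system view

function eoInputFilter( $text ) {
    return strtr( $text, array( 'cx' => "\xc4\x89", 'ux' => "\xc5\xad" ) );
}
function eoOutputFilter( $text ) {
    return strtr( $text, array( "\xc4\x89" => 'cx', "\xc5\xad" => 'ux' ) );
}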
-- brion vibber (brion @ pobox.com)