On Tue, 5 Mar 2002, Jimmy Wales wrote:
> So, where do we stand on the issue of international upgrades?
> I'd like to get back to these quickly, if possible. Starting with
> esperanto, and then polish. And then probably spanish, although of
> course we'll now need to co-ordinate with the forpas forked group, so
> that we minimize the extent of the forkage in the hopes of bringing
> things back together soon.
There was a discussion some days ago on how to best implement a more or
less character set independent underlying system, but it sort of died out
without any clear consensus. In particular, no response from lcrocker, who
made the initial claim that the present system of using language-tied
encodings was Wrong.
As it left off, Jan and I had more or less agreed on something like:
* Check the browser's Accept-Charset header; if it lists UTF-8, use UTF-8.
If not, use the most likely encoding used for that wiki's language.
* Where necessary, convert characters into/out of HTML entities so that
non-Unicode browsers can safely handle all characters.
* Internally, non-ASCII characters will need to be escaped somehow in the
search index field to allow correct indexing.
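
To make the first and third points concrete, here is a rough sketch in Python (not the wiki's actual PHP). The function names, the per-language fallback table, and the "uXXXX" token format for the search index are all assumptions made up for illustration; none of them were settled in the discussion.

```python
# Hypothetical fallback table: the "most likely" legacy encoding per wiki
# language. These particular choices are assumptions, not agreed values.
FALLBACK_CHARSET = {"eo": "iso-8859-3", "pl": "iso-8859-2", "es": "iso-8859-1"}


def choose_charset(accept_charset, wiki_lang):
    """Return 'utf-8' if the Accept-Charset header lists it, else the fallback."""
    if accept_charset:
        for item in accept_charset.split(","):
            parts = item.strip().split(";")
            name = parts[0].strip().lower()
            q = 1.0  # a missing q parameter means full preference
            for param in parts[1:]:
                key, _, value = param.strip().partition("=")
                if key.strip().lower() == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        q = 0.0
            # Only an explicit utf-8 entry counts; "*" is deliberately ignored.
            if name == "utf-8" and q > 0:
                return "utf-8"
    return FALLBACK_CHARSET.get(wiki_lang, "iso-8859-1")


def escape_for_index(text):
    """Lower-case the text and replace each non-ASCII character with an
    ASCII-only token such as 'u0109', so an ASCII-only index can store it."""
    return "".join(
        ch if ord(ch) < 128 else "u%04x" % ord(ch) for ch in text.lower()
    )
```

A search query would have to go through the same escape_for_index() step before being matched against the index, otherwise the escaped terms would never be found.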
Somewhat less solid was how to store the actual text internally: Lee
suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
* UTF-8 is more space- and bandwidth-efficient and doesn't require
outgoing transliteration for the many users with relatively current,
UTF-8-savvy browsers, but needs to be translated into the native
encoding / HTML entities for non-UTF-8-savvy browsers (old old old ones,
and Netscape 4, which has very buggy Unicode support).
* ASCII+HTML entities won't require outgoing translation for
non-UTF-8-savvy browsers that nonetheless understand Unicode-numbered
character entities, but may not be much of an improvement for older
browsers that don't know that numeric character entities always refer to
Unicode code points, not to positions in the current character set. Thus
outgoing translation to the browser's character set is recommended to be
safe.
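
As an illustration of the outgoing translation under either storage option, here is a minimal Python sketch (again, not the wiki's PHP). It assumes the text is held as Unicode in memory and that the target charset has already been chosen; encode_for_browser is a made-up name.

```python
def encode_for_browser(text, charset):
    """Encode text into the browser's charset. Characters the charset cannot
    represent are emitted as numeric references such as &#265; instead."""
    return text.encode(charset, errors="xmlcharrefreplace")


# Latin-2 has a code for the Polish l-with-stroke but not for the Esperanto
# c-circumflex, so only the latter falls back to a numeric entity.
latin2_page = encode_for_browser("\u0109 z\u0142", "iso-8859-2")

# The same character costs 2 bytes in UTF-8 versus 6 as a numeric entity.
utf8_size = len(encode_for_browser("\u0109", "utf-8"))
entity_size = len(encode_for_browser("\u0109", "ascii"))
```

Note that this only helps browsers that read numeric entities as Unicode; for the older ones described above, a character missing from the target charset has no safe representation at all.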
Incoming translation is always required, as edited text will come to us in
the character encoding used by the browser and may or may not have HTML
entities typed by the user mixed in.
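
A sketch of that incoming step in Python, assuming we know which charset the edit form was served in (and hence which one the browser is posting back). decode_submission is a hypothetical name, and the choice to resolve only non-ASCII numeric references is an assumption, made so that things like a user-typed &#60; are not silently turned into markup.

```python
import re

# Matches decimal (&#265;) and hexadecimal (&#x109;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_submission(raw, charset):
    """Decode posted bytes from the browser's charset, then fold non-ASCII
    numeric character references into real characters."""
    text = raw.decode(charset, errors="replace")

    def _resolve(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave ASCII, surrogates and out-of-range values untouched.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return _NUMERIC_REF.sub(_resolve, text)
```

Named entities (&eacute; and friends) are left alone here; whether to fold those in as well is a separate decision.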
The character set translation can probably be done mostly via PHP's iconv
support -- however, this is an optional component and must be enabled
at compile time (same as the annoying 4-letter minimum for search
index terms). Also, some slight customisation of the process is necessary
for, for instance, the Esperanto transliteration schema (basically in place
in $RecodeInput/$RecodeOutput).
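
For the Esperanto case, the customisation might look roughly like the following Python sketch. It assumes the schema is the common "x-system" (cx, gx, hx, jx, sx, ux standing for the accented letters), which is not spelled out above; recode_input and recode_output are hypothetical stand-ins for whatever $RecodeInput/$RecodeOutput actually do.

```python
import re

# Assumed x-system mapping; the real schema may differ (e.g. in how it
# handles all-caps "CX" or a literal "x" after one of these letters).
X_TO_UNICODE = {
    "cx": "\u0109", "gx": "\u011d", "hx": "\u0125",
    "jx": "\u0135", "sx": "\u015d", "ux": "\u016d",
    "Cx": "\u0108", "Gx": "\u011c", "Hx": "\u0124",
    "Jx": "\u0134", "Sx": "\u015c", "Ux": "\u016c",
}
UNICODE_TO_X = {accented: pair for pair, accented in X_TO_UNICODE.items()}

_X_PAIR = re.compile(r"[cghjsuCGHJSU]x")
_ACCENTED = re.compile("[" + "".join(UNICODE_TO_X) + "]")


def recode_input(text):
    """Turn typed x-system pairs into the accented Unicode letters."""
    return _X_PAIR.sub(lambda m: X_TO_UNICODE[m.group(0)], text)


def recode_output(text):
    """Turn accented letters back into x-system pairs for display/editing."""
    return _ACCENTED.sub(lambda m: UNICODE_TO_X[m.group(0)], text)
```

This step would run on already-decoded text, i.e. after the iconv-style charset conversion on the way in and before it on the way out.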
If there's some consensus on this, we can get crackin' and get this
implemented so the upgrades can proceed.
-- brion vibber (brion @ pobox.com)