On Mon, 2002-02-25 at 07:55, Jan Hidders wrote:
From: "Brion Vibber" <brion(a)pobox.com>
On Sat, 2002-02-23 at 08:24, Jan Hidders wrote:
>
> Yes, and the accept-charset header is indeed the meta-tag I was thinking of.
Does it vary appropriately with the default language of the browser? I
tried a few browsers on my system (US English or non-specific versions)
and got:
Mozilla 0.9.8: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Konqueror: unicode, utf-8, *
Netscape 4.78: iso-8859-1,*,utf-8
Opera 6: windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0,iso-8859-1;q=0.6, *;q=0.1
Internet Explorer 5.0: (nothing)
lynx: (nothing)
FWIW, I tried it with nl-be and, as you would expect, it is the same for
Netscape and Explorer as above. I'll see what happens if I turn on
multi-language support in Windows.
In the case of multiple supported encodings, do we use the first one, or
the "best" one?
I would say the best one. The only reason not to do so is to show users
how to write certain characters as entities, but if somebody decides to edit
a page on the Russian Wikipedia we might expect that this person knows how
to enter Cyrillic characters on his or her keyboard, or at least knows how to
enter them as entities. I'm not sure what we should do when q != 1.0 for the
best one, though.
Besides, if they would benefit from seeing the entities, they would
benefit just as well from cut-n-paste, wouldn't they? Who's really going
to memorise the numbers?
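For concreteness, picking the "best" one really just means sorting the header's
entries on their q values and taking the highest-rated charset the wiki can
actually serve. A rough PHP sketch (the function names are invented, not from
the actual script):

<?php
// Rough sketch, not the real code: split an Accept-Charset header into
// charset => q pairs (q defaults to 1.0) and sort with the highest q first.
function parseAcceptCharset( $header ) {
    $prefs = array();
    foreach ( explode( ',', strtolower( $header ) ) as $item ) {
        $parts = explode( ';', trim( $item ) );
        $charset = trim( $parts[0] );
        $q = 1.0;
        if ( isset( $parts[1] ) && preg_match( '/q=([0-9.]+)/', $parts[1], $m ) ) {
            $q = (float)$m[1];
        }
        if ( $charset !== '' ) {
            $prefs[$charset] = $q;
        }
    }
    arsort( $prefs );
    return $prefs;
}

// Return the best charset the wiki can serve, or false if nothing matches.
function pickBestCharset( $header, $canServe ) {
    foreach ( parseAcceptCharset( $header ) as $charset => $q ) {
        if ( $q > 0 && in_array( $charset, $canServe ) ) {
            return $charset;
        }
    }
    return false;
}

// Opera 6's header from above would come out as "utf-8":
echo pickBestCharset(
    'windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0,iso-8859-1;q=0.6, *;q=0.1',
    array( 'utf-8', 'iso-8859-1' )
), "\n";

A browser that sends nothing at all (Explorer and lynx above) simply falls
through to whatever default the wiki is configured with.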
With regards to L. Crocker's earlier comment about UTF-8 being
potentially problematic with older browsers; after checking the
Accept-Charset field we *know* that a browser supports UTF-8, and can
provide an alternate, more limited, encoding if not.
That, I think, would be my personal preference.
Mine too.
While I'm at it -- if instead of HTML entities we were to use UTF-8
internally, we could get around the search problem by simply changing
high characters in the fulltext index to hex codes.
Just to be clear on this: if we want indexing to work we need to encode it
in a format that uses only the Latin-1 numbers, letters, "_" and "'". The
alternative is to run different MySQL servers with appropriate character sets
for different Wikipedias.
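Folding the high characters into hex codes built only from plain letters and
digits satisfies exactly that constraint. A sketch of what the folding could
look like (PHP; toIndexForm() is an invented name, not the actual code):

<?php
// Sketch only: turn runs of high (non-ASCII) bytes in UTF-8 text into hex
// tokens made of plain letters and digits, which the fulltext index has no
// trouble with.
function toIndexForm( $text ) {
    return preg_replace_callback(
        '/[\x80-\xff]+/',
        function ( $m ) {
            return 'u8' . bin2hex( $m[0] );
        },
        $text
    );
}

// "ĉapelo" (c-circumflex is 0xC4 0x89 in UTF-8) indexes as "u8c489apelo"
echo toIndexForm( "\xc4\x89apelo" ), "\n";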
Can you give an example? What is at the moment the behaviour of the
Esperanto Wikipedia?
The current behavior that you'll find on visiting eo.wikipedia.com
(running the old software):
- Article contents, titles, comments, user interface messages, etc. are
shown in UTF-8.
- The article edit box is in limited UTF-8, with the standard diacritics
transliterated (X-system).
- Any input (edit box, comment field, search box, username) is
internally normalised to X-system, so users can type any way they wish.
- On selecting a convenient link at the top of the screen, article
contents, user interface messages etc. are optionally shown in X-system.
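For reference, the transliteration table behind that is tiny. A sketch (PHP,
lower case only; a full version would also have to handle capitals and the
occasional literal "ux"):

<?php
// Sketch of the X-system mapping for the six Esperanto letters with
// diacritics (lower case only).
$xDigraphs = array(
    'cx' => "\xc4\x89",   // ĉ
    'gx' => "\xc4\x9d",   // ĝ
    'hx' => "\xc4\xa5",   // ĥ
    'jx' => "\xc4\xb5",   // ĵ
    'sx' => "\xc5\x9d",   // ŝ
    'ux' => "\xc5\xad",   // ŭ
);

function xSystemToUtf8( $text ) {
    global $xDigraphs;
    return strtr( $text, $xDigraphs );
}

function utf8ToXSystem( $text ) {
    global $xDigraphs;
    return strtr( $text, array_flip( $xDigraphs ) );
}

// Both spellings normalise to the same stored form:
echo xSystemToUtf8( 'cxirkaux' ), "\n";                  // ĉirkaŭ
echo utf8ToXSystem( "\xc4\x89irka\xc5\xad" ), "\n";      // cxirkaux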
Ok. I see your point. Does this also work for Windows users?
Yes, it works for everybody.
Okay, but what's "special behavior" and what's "default behavior"? Say I
visit the Japanese wikipedia, the French wikipedia, and the Esperanto
wikipedia; is using a native encoding for native editing in each one
"special behavior"? Or is using a common encoding for all "special
behavior"?
I would say that the default behaviour is that the edit box uses the best
character encoding your browser accepts. If the browser doesn't specify this,
then the site falls back to Latin-1 or whatever is specified as
$defaultEncoding on that site.
Sounds reasonable to me.
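A minimal sketch of that fallback, assuming a per-wiki $defaultEncoding and a
hypothetical list of encodings the wiki can serve (the crude substring check
stands in for the proper q-value handling sketched earlier):

<?php
// Sketch: use the best charset the browser accepts if it sent an
// Accept-Charset header at all, otherwise the per-wiki default.
$defaultEncoding    = 'ISO-8859-1';
$supportedEncodings = array( 'utf-8', 'iso-8859-1' );

$header = isset( $_SERVER['HTTP_ACCEPT_CHARSET'] )
    ? strtolower( $_SERVER['HTTP_ACCEPT_CHARSET'] )
    : '';

$editEncoding = $defaultEncoding;
foreach ( $supportedEncodings as $enc ) {
    if ( strpos( $header, $enc ) !== false ) {
        $editEncoding = $enc;
        break;
    }
}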
[...] Whatever happens, though, we still need the custom transliteration
functions available for the X-system conversion. Those users without special
keyboard drivers (or worse yet, with Netscape 4's Unicode support) need to be
able to type and sometimes read that way.
By the way, is this specified in a charset header sent with the POST?
There's no real way to get that information, to my knowledge. This
particular conversion is clean enough that it can be run over any input
the user gives, though, so my preference would be to just keep filter
functions available over all editable text.
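In practice that just means running the filter over every editable field that
comes in, regardless of what the browser did or didn't declare. A sketch (the
field names and normalizeInput() are invented for the example):

<?php
// Sketch: since the POST doesn't tell us the charset, run the wiki's input
// filter over every editable field unconditionally. On a wiki with nothing
// to recode the filter would simply return its argument.
function normalizeInput( $text ) {
    return strtr( $text, array( 'cx' => "\xc4\x89" ) );  // illustrative only
}

foreach ( array( 'text', 'summary', 'search' ) as $field ) {
    if ( isset( $_POST[$field] ) ) {
        $_POST[$field] = normalizeInput( $_POST[$field] );
    }
}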
HTML entities are unacceptably difficult for typing, and not helpful for
reading -- a browser that doesn't show the UTF-8 won't do any better with
the entities.
My main interest is that we maintain one software package that can be adapted
by setting some variables and includes. If we really cannot get around
having extra special encoding links that can be turned on or off depending
upon some variables, then so be it.
Agreed, which is why I added the little recoding functions in the first
place. They wouldn't be needed most of the time, but when they were, a
function or two could be dropped in with the localised messages.
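Something along these lines, perhaps, with every name here hypothetical: the
language include that carries the localised messages also names its recoding
functions, and the default include points both hooks at a no-op.

<?php
// Hypothetical sketch of dropping the recoding functions in with the
// localised messages; none of these names are from the real script.
$inputFilter  = 'eoInputFilter';   // applied to edit box, search, summaries
$outputFilter = 'eoOutputFilter';  // applied when serving the X-system view

function eoInputFilter( $text ) {
    return strtr( $text, array( 'cx' => "\xc4\x89", 'ux' => "\xc5\xad" ) );
}
function eoOutputFilter( $text ) {
    return strtr( $text, array( "\xc4\x89" => 'cx', "\xc5\xad" => 'ux' ) );
}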
-- brion vibber (brion @ pobox.com)