Re: [Wikitech-l] International upgrades

8 Mar 2002

On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
...
  From: "Brion Vibber" <brion(a)pobox.com>

 Somewhat less solid was how to store the actual text internally: Lee
 suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.  
 We cannot index UTF-8. 
Nor HTML entities, naturally.

...
  But if we introduce a redundant indexable field where
 all characters (even the ASCII ones) are represented with their unicode
 number, then we would have a way around the 4-letter indexing boundary and
 the problem that you cannot index anything but letters. So in that case I
 would vote for UTF-8 since that would probably be the most efficient anyway. 
Hmm, that's an idea.
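
A minimal sketch of the redundant index field idea, for illustration only. The function names and the "u" + hex code point format are assumptions, not anything agreed in this thread. Every character, ASCII included, becomes a token of at least five characters, so even a one-letter word ends up longer than the 4-letter indexing boundary:

```python
def to_index_form(text: str) -> str:
    """Encode each word as a run of 'u' + hex code point (hypothetical format)."""
    words = text.split()  # whitespace is normalised to single spaces
    return " ".join(
        "".join(f"u{ord(ch):04x}" for ch in word) for word in words
    )


def from_index_form(indexed: str) -> str:
    """Reverse to_index_form; 'u' is not a hex digit, so it delimits code points."""
    words = []
    for token in indexed.split():
        words.append("".join(chr(int(h, 16)) for h in token.split("u") if h))
    return " ".join(words)


if __name__ == "__main__":
    print(to_index_form("a b"))
    print(from_index_form(to_index_form("Ĉefpaĝo estas")))
```

Whether the indexer treats digits as word characters is also an assumption here; if it does not, the hex digits would have to be mapped onto letters.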

[Incidentally; if we are to switch to UTF-8, we'll obviously want to do
something about the fact that the current English wikipedia uses
ISO-8859-1 high characters extensively. These pages can be converted
fairly easily, either as a one-time search & replace or as a
normalise-an-old-page-when-we-first-load-it thing.]
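
A rough sketch of the normalise-on-first-load variant, assuming page text arrives as raw bytes. The function name is hypothetical, and the "already valid UTF-8" check is only a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so a real conversion would probably want a per-page "already converted" flag instead.

```python
def normalise_to_utf8(raw: bytes) -> bytes:
    """Return page bytes as UTF-8, treating non-UTF-8 input as ISO-8859-1."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (pure ASCII also lands here)
    except UnicodeDecodeError:
        # Every byte is a valid ISO-8859-1 character, so this decode cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    old_page = b"caf\xe9"  # "café" stored with an ISO-8859-1 high character
    print(normalise_to_utf8(old_page))
```

Running it a second time on its own output leaves the bytes unchanged, which is what makes the lazy, per-page approach safe to repeat.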

...
   If there's
some consensus on this, we can get crackin' and get this
 implemented so the upgrades can proceed.  
 Er, I would suggest that before coding we set up a document that describes
 what the consensus is. It should say what codings are used for what, when and
 where. It should also say which functions take care of this coding. This
 would also include the coding used in URLs. 
Well, I was hoping there would be some evidence of some kind of
consensus before anyone goes writing documents or code! :)

We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_for…

The only sane format for URLs would be url-encoded UTF-8. This is the
recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
it is the most future-proof (can you imagine if we kept all our URLs in
EBCDIC instead of ASCII because everybody still had links & bookmarks
from their old IBM mainframe days?), and it allows links across
languages to be consistently represented.
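
For illustration, url-encoded UTF-8 as produced by Python's standard urllib.parse. The helper names are hypothetical and this is not the wiki's own code; it only shows what such a URL component looks like next to the ISO-8859-1 form of the same title.

```python
from urllib.parse import quote, unquote


def title_to_url(title: str) -> str:
    """Percent-encode the UTF-8 bytes of a page title (hypothetical helper)."""
    return quote(title, safe="", encoding="utf-8")


def url_to_title(component: str) -> str:
    """Decode a percent-encoded UTF-8 URL component back to a title."""
    return unquote(component, encoding="utf-8")


if __name__ == "__main__":
    print(title_to_url("Ĉefpaĝo"))                      # UTF-8 form
    print(quote("é", safe="", encoding="iso-8859-1"))   # Latin-1 form, for contrast
    print(title_to_url("é"))                            # same letter as UTF-8
```

The same letter encodes to different byte sequences under the two charsets, which is why links only stay consistent across languages if every wiki uses one encoding for URLs.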

-- brion vibber (brion @ pobox.com)
