On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" <brion(a)pobox.com>
Somewhat less solid was how to store the actual text internally: Lee
suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
But if we introduce a redundant indexable field where
all characters (even the ASCII ones) are represented with their Unicode
number, then we would have a way around the 4-letter indexing boundary and
the problem that you cannot index anything but letters. So in that case I
would vote for UTF-8 since that would probably be the most efficient anyway.
Hmm, that's an idea.
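
To make the redundant-field idea concrete, here is a minimal sketch in Python. It assumes the indexer splits on whitespace and accepts tokens made of letters and digits; the function name to_index_key and the "u" + hex format are hypothetical choices for illustration, not anything that has been agreed on.

```python
def to_index_key(text: str) -> str:
    """Build a redundant, index-friendly copy of `text`.

    Every character (ASCII included) is spelled out as 'u' followed by
    its Unicode code point in hex, so even a two-letter word becomes a
    token well over four characters long.
    """
    encoded_words = []
    for word in text.split():
        # One 'uXXXX' group per character; words stay whitespace-separated.
        encoded_words.append("".join(f"u{ord(ch):04x}" for ch in word))
    return " ".join(encoded_words)


if __name__ == "__main__":
    print(to_index_key("UN é"))
```

A search query would have to be run through the same function before it is matched against this field; the readable text stays in its own column.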
[Incidentally, if we are to switch to UTF-8, we'll obviously want to do
something about the fact that the current English Wikipedia uses
ISO-8859-1 high characters extensively. These pages can be converted
fairly easily, either as a one-time search & replace or as a
normalise-an-old-page-when-we-first-load-it thing.]
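
A rough sketch of the normalise-on-load variant, in Python. It assumes old pages are available as raw bytes and that anything which is not valid UTF-8 was written as ISO-8859-1; the function name normalise_to_utf8 is made up for this example.

```python
def normalise_to_utf8(raw: bytes) -> bytes:
    """Return `raw` as UTF-8, converting from ISO-8859-1 when needed."""
    try:
        # Pure ASCII and already-converted pages decode cleanly: keep them.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is an old
        # ISO-8859-1 page. Every byte maps to exactly one character.
        return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    print(normalise_to_utf8(b"caf\xe9"))
```

One caveat: a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so this guess can be wrong for unusual pages. A one-time conversion with a per-page "already converted" flag would avoid the guessing.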
If there's some consensus on this, we can get crackin' and get this
implemented so the upgrades can proceed.
Er, I would suggest that before coding we set up a document that describes
what the consensus is. It should say what codings are used for what, when and
where. It should also say which functions take care of this coding. This
would also include the coding used in URLs.
Well, I was hoping there would be some evidence of some kind of
consensus before anyone goes writing documents or code! :)
We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_for…
The only sane format for URLs would be URL-encoded UTF-8. This is the
recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
it is the most future-proof (can you imagine if we kept all our URLs in
EBCDIC instead of ASCII because everybody still had links & bookmarks
from their old IBM mainframe days?), and it allows links across
languages to be consistently represented.
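
For illustration, URL-encoded UTF-8 can be sketched in Python with the standard urllib.parse module. The helper names title_to_url and url_to_title are hypothetical; the only real calls are quote and unquote, which use UTF-8 by default.

```python
from urllib.parse import quote, unquote


def title_to_url(title: str) -> str:
    """Percent-encode the UTF-8 bytes of a page title."""
    # safe="" also escapes '/', so the title is always a single path segment.
    return quote(title, safe="", encoding="utf-8")


def url_to_title(fragment: str) -> str:
    """Reverse the encoding: percent-escapes back to a Unicode title."""
    return unquote(fragment, encoding="utf-8")


if __name__ == "__main__":
    print(title_to_url("Ĉapelo"))
```

The same title gives the same URL no matter which language's wiki produced the link, which is the cross-language consistency argued for above.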
-- brion vibber (brion @ pobox.com)