From: "Brion L. VIBBER" <brion(a)pobox.com>
On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" <brion(a)pobox.com>
Somewhat less solid was how to store the actual text internally: Lee
suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
That's not so obvious. If you replace the &, # and ; with something that is
indexed like ' and _ then indexing would work. But like I said, I favour
UTF-8 anyway and we can solve the indexing problem with an extra column
where everything is more or less represented as entities anyway. (We could
even replace uppercase with lowercase there and have case insensitive
indexing.)
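A rough sketch of what such an extra index column could contain, purely as an illustration: the function name index_key and the exact substitution (' for & and ;, _ for #) are my assumptions, not anything agreed on in this thread. Non-ASCII characters become entity-like tokens built from indexable characters, and everything is lowercased first.

```python
def index_key(text: str) -> str:
    """Build an ASCII-only, lowercase key for a hypothetical index column.

    Assumption: "&#NNN;" is rewritten as "'_NNN'" so that the result only
    contains characters the indexer already handles.
    """
    out = []
    for ch in text.lower():  # lowercase first for case-insensitive lookup
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII is kept as-is
        else:
            out.append(f"'_{ord(ch)}'")  # entity-like token, e.g. é -> '_233'
    return "".join(out)


print(index_key("Café"))
```

The article text itself would stay in UTF-8; only this derived key would be indexed and searched. Literal text that already looks like '_233' would collide with an encoded character, so a real scheme would need some escaping rule.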
[Incidentally; if we are to switch to UTF-8, we'll obviously want to do
something about the fact that the current English wikipedia uses
ISO-8859-1 high characters extensively. These pages can be converted
fairly easily, either as a one time search & replace or as a
normalise-an-old-page-when-we-first-load-it thing.]
I like the one-time-search-and-replace approach. No need to complicate
and/or slow down the run-time code with checks and translation code.
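For what it's worth, the one-time conversion is very little code. A minimal sketch, assuming every stored page really is ISO-8859-1 and the pass runs exactly once (latin1_to_utf8 is a made-up name, and how pages are read from and written back to the database is left out):

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode one page from ISO-8859-1 to UTF-8.

    Assumption: the input has not been converted before. Running this
    twice on the same page would double-encode the high characters.
    """
    return raw.decode("iso-8859-1").encode("utf-8")


old_page = b"na\xefve caf\xe9"          # "naïve café" stored as ISO-8859-1
new_page = latin1_to_utf8(old_page)
print(new_page.decode("utf-8"))
```

Every byte sequence is valid ISO-8859-1, so the decode step cannot fail; the flip side is that it cannot detect pages that were stored in some other encoding.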
Well, I was hoping there would be some evidence of some kind of
consensus before anyone goes writing documents or code! :)
Of course. :-) I am still wondering what our great leader thinks of all
this.
We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters
This page is more about the local policy on the English Wikipedia. I would
like to see a page with a title like "A common architecture for Wikipedias
in all languages". I'd start writing it, but work is really busy at the
moment.
The only sane format for URLs would be url-encoded UTF-8. This is the
recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
it is the most future-proof (can you imagine if we kept all our URLs in
EBCDIC instead of ASCII because everybody still had links & bookmarks
from their old IBM mainframe days?), and it allows links across
languages to be consistently represented.
Completely agreed.
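To make the difference concrete, here is a small sketch of what url-encoded UTF-8 looks like next to url-encoded ISO-8859-1. The helper title_to_url_path and the space-to-underscore rule are assumptions for illustration only; the encoding itself comes from the standard urllib.parse module.

```python
from urllib.parse import quote, unquote


def title_to_url_path(title: str) -> str:
    """Percent-encode a page title as UTF-8 (hypothetical helper)."""
    # Assumption: spaces in titles are written as underscores in URLs.
    return quote(title.replace(" ", "_"), safe="")


utf8_form = title_to_url_path("Zürich")           # UTF-8 bytes, percent-encoded
latin1_form = quote("Zürich", encoding="latin-1")  # same title, Latin-1 bytes

print(utf8_form)
print(latin1_form)
print(unquote(utf8_form))  # decodes back to the original title
```

The two forms differ for the same title, which is why links between wikis only work reliably if everyone picks one encoding. The UTF-8 form also covers titles that ISO-8859-1 cannot represent at all.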
-- Jan Hidders