[Wikipedia-l] Switching everything to UTF-8

Tomasz Wegrzanowski taw at users.sf.net
Mon Nov 17 23:02:20 UTC 2003


Staying so long with ISO 8859 was a mistake.

So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8.
The procedure should be as follows:
1. prepare a new LanguageXX.php and put it in place under a temporary name
2. make backups
3. create tables curutf8 and oldutf8
4. disable write access
5. convert all data - numeric HTML codes are going to be replaced by UTF-8 characters too.
6. rename tables cur and old to cur88591 and old88591
7. rename tables curutf8 and oldutf8 to cur and old
8. replace old LanguageXX.php with utf8-enabled version
9. reenable write access
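
As a rough illustration of step 5, here is a minimal sketch in Python of what
converting a single piece of stored text could look like. It is not the actual
conversion script. The function name convert_text is made up, the source
encoding is assumed to be ISO 8859-1, and leaving references below 160
untouched (so that something like &#60; does not turn into a literal "<") is
my own assumption about what is safe.

```python
import re

# Matches decimal (&#321;) and hexadecimal (&#x141;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def _replace_ref(match):
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    # Assumption: references to ASCII and C1 control code points are left
    # alone, because a literal "<" or "&" could change the meaning of markup.
    if code < 160:
        return match.group(0)
    # Surrogates and out-of-range values are not valid characters; keep them.
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return match.group(0)
    return chr(code)


def convert_text(raw, source_encoding="iso-8859-1"):
    """Hypothetical helper: re-encode one stored text value as UTF-8."""
    text = raw.decode(source_encoding)           # bytes -> characters
    text = _NUMERIC_REF.sub(_replace_ref, text)  # &#281; -> the character itself
    return text.encode("utf-8")                  # characters -> UTF-8 bytes
```

Running something like this over every row of cur and old, and writing the
results into curutf8 and oldutf8, would be the bulk of the work.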

The conversion script should be tested on test.* Wikipedia first.

During step 5 Wikipedia is going to be read-only. This may take some time,
especially for the English Wikipedia, so it's better to convert each Wikipedia
separately. During steps 6-8 Wikipedia may not work at all, but that is going
to take less than a minute.
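
To keep the window in steps 6 and 7 short, the four renames could possibly be
issued as one statement. The sketch below only builds the SQL text. It assumes
the database is MySQL, whose RENAME TABLE accepts several renames at once, and
it uses old88591 as the backup name for old. The function name and its
parameters are made up for illustration.

```python
def swap_statement(tables=("cur", "old"), new_suffix="utf8", backup_suffix="88591"):
    """Hypothetical helper: build one RENAME TABLE statement for steps 6 and 7."""
    renames = []
    # Step 6: move the current ISO 8859-1 tables out of the way.
    for name in tables:
        renames.append(f"{name} TO {name}{backup_suffix}")
    # Step 7: move the converted tables into their place.
    for name in tables:
        renames.append(f"{name}{new_suffix} TO {name}")
    return "RENAME TABLE " + ", ".join(renames) + ";"


print(swap_statement())
# RENAME TABLE cur TO cur88591, old TO old88591, curutf8 TO cur, oldutf8 TO old;
```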

Does anybody have any really good reason why I shouldn't proceed?
These reasons aren't good enough:
* broken URLs     - all old URLs are going to work after upgrade
* size increase   - size is going to stay about the same
* broken browsers - they should be upgraded. If someone has a browser so old
			that it doesn't grok UTF-8, it's not going to grok CSS,
			PNGs, and the other things we're using either.
			Unless we want to remove all CSS and PNGs, there's
			no point in not using UTF-8.
* ISO 8859-N is good enough - no, it's not. Not if someone wants to write about
			people and places from countries where non-8859-1 Latin
			characters are used, or about linguistics, or math, etc.


