I totally agree with Timwi: proper Unicode support is a requirement,
not a feature. However, can someone tell me why PHP comes with no
adequate out-of-the-box support for such a vital feature in the 21st century?
The root cause of my diff engine ignoring Unicode at the moment is
that many PHP functions simply don't work with UTF-8 encoded
strings. The PHP team promises proper Unicode support only in version 6.
Yeah, I guess we are still in the nineties …
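To show what "don't work with UTF-8" means in practice, here is a rough sketch in Python rather than PHP, just so anyone can run it. It is not code from my engine. Working on Python `bytes` stands in for a byte-oriented string function (PHP's strlen(), for instance, counts bytes), and the sample string is made up for illustration.

```python
# Sketch only: emulate byte-oriented string handling with Python bytes.
# The sample text is an invented example, not data from any real diff.
text = "na\u00efve \u2018quote\u2019"   # contains U+00EF, U+2018 and U+2019
raw = text.encode("utf-8")

char_count = len(text)   # number of Unicode code points
byte_count = len(raw)    # number of bytes, which a byte-based length reports

# Cutting after 3 bytes slices the two-byte sequence for U+00EF in half,
# so the result is no longer valid UTF-8.
truncated = raw[:3]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```

The two lengths disagree, and a cut at a byte offset can land in the middle of a character. Any diff code built on byte-based functions inherits both problems.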
However, I think it's much better to say honestly upfront that Unicode
isn't properly supported than to claim that it is. For example, look no
further than Wikipedia's current diff engine. Self-appointed Unicode
expert Tim Starling brags that it is extremely easy to build UTF-8
support from scratch. Well, let's check that.
For example, if you use an ordinary single quote (the one from damned
Latin-1; you can easily find it on your keyboard) to separate two
words in Wikipedia, there are no problems: the diff engine sees two
separate words. However, if you use a left
single quotation mark (Unicode code point U+2018, the one MS Word likes to
use) to separate two words, oops, now these two words are treated as
one.
Test Case for everyone to check:
Using an ordinary single quote:
First edit:
One'two
Second edit:
One'three
Diff engine output:
Correctly highlights words two and three
Using a left single quotation mark (Unicode code point U+2018; you might need
to type it rather than copy and paste it, all thanks, of course, to the excellent
Unicode support in each and every e-mail program):
First edit:
One‘two
Second edit:
One‘three
Diff engine output:
Incorrectly highlights both strings
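The same test case can be reproduced outside the wiki with a few lines of Python. This is only my guess at the mechanism, not Wikipedia's actual code: the hypothetical `is_ascii_separator` treats only ASCII punctuation and whitespace as word boundaries, the hypothetical `is_unicode_separator` uses Unicode general categories instead, and `changed_words` is a naive position-by-position comparison standing in for a real diff.

```python
import string
import unicodedata

# Assumption: a word splitter that only knows ASCII punctuation and whitespace.
ASCII_SEPARATORS = set(string.punctuation + string.whitespace)

def is_ascii_separator(ch):
    return ch in ASCII_SEPARATORS

def is_unicode_separator(ch):
    # General categories starting with P are punctuation, Z are separators.
    return ch.isspace() or unicodedata.category(ch)[0] in "PZ"

def split_words(text, is_separator):
    """Split text into words, dropping the separator characters."""
    words, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

def changed_words(old, new, is_separator):
    """Naive stand-in for a diff: compare words position by position."""
    old_words = split_words(old, is_separator)
    new_words = split_words(new, is_separator)
    return [(a, b) for a, b in zip(old_words, new_words) if a != b]

# The two test cases from this message, under both splitting rules.
print(changed_words("One'two", "One'three", is_ascii_separator))
print(changed_words("One\u2018two", "One\u2018three", is_ascii_separator))
print(changed_words("One\u2018two", "One\u2018three", is_unicode_separator))
```

With the ASCII-only rule, the first case reports just the changed word, while the second reports the whole string as changed because U+2018 is never seen as a boundary. With the Unicode-aware rule, both cases report only the changed word.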
So my question to all the Unicode Nazis here is: why is the quote from the Latin-1
charset treated *differently* from a slightly different Unicode
quote?
On 08/06/06, Rob Church <robchur(a)gmail.com> wrote:
On 08/06/06, Timwi <timwi(a)gmx.net> wrote:
It is already confrontational of a programmer to pretend the whole world
could make do with Latin-1. It is one of the most devastating and
accordingly infuriating assumptions that still prevails despite the fact
that Unicode is decades old. We're in the 21st century; it is no longer
appropriate to even start programming anything where any user-visible
text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept
has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
Rob Church
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l