Re: [Wikitech-l] New diff feature for MediaWiki

8 Jun 2006

I totally agree with Timwi – proper Unicode support is a requirement,
not a feature. However, can someone tell me why PHP comes with no
adequate out-of-the-box support for such a vital feature in the 21st
century? The root cause of my diff engine ignoring Unicode at the moment
is that many PHP functions simply don't work with UTF-8 encoded
strings. The PHP team promises proper Unicode support only in version 6.
Yeah, I guess we are still in the nineties …
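
To illustrate the kind of failure meant here, a minimal sketch follows. It is written in Python rather than PHP, and it uses a `bytes` object as a stand-in for a string handled by a byte-oriented function; that substitution is an assumption made for illustration and is not code from any diff engine.

```python
# A byte-oriented string function sees UTF-8 as raw bytes, not characters.
text = "One\u2018two"          # U+2018 LEFT SINGLE QUOTATION MARK between the words
raw = text.encode("utf-8")     # what a byte-based function actually operates on

char_count = len(text)         # counts code points
byte_count = len(raw)          # counts bytes; U+2018 occupies three (E2 80 98)

# Truncating at a byte offset can cut a multi-byte character in half.
broken = raw[:4]               # "One" plus only the first byte of U+2018
try:
    broken.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False      # the truncated data is no longer valid UTF-8
```

The character count and the byte count disagree, and a cut made at a byte offset leaves invalid UTF-8 behind, which is why length, substring, and splitting functions all need to be encoding-aware.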

However, I think it's much better to say honestly upfront that Unicode
isn't properly supported than to claim that it is. For example, look no
further than Wikipedia's current diff engine. Self-appointed Unicode
expert Tim Starling brags that it is extremely easy to build UTF-8
support from scratch. Well, let's check that.

For example, if you use an ordinary single quote (the one from damned
Latin-1, you can easily find it on your keyboard) to separate two
words in Wikipedia, then no problem: the diff engine will see two
separate words. However, if you use a left single quotation mark
(Unicode code point U+2018, the one MS Word likes to use) to separate
two words, oops, now the two words are treated as one.

Test Case for everyone to check:

Using ordinary single quote:
First edit:
One'two
Second Edit:
One'three
Diff engine output:
Correctly highlights words two and three

Using a left single quotation mark (Unicode code point U+2018; you might
need to type it rather than copy and paste it, of course all due to the
excellent Unicode support in each and every e-mail program):
First edit:
One‘two
Second Edit:
One‘three
Diff engine output:
Incorrectly highlights both strings
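
The symptom in this test case can be reproduced with a small sketch. This is Python, not the MediaWiki diff code, and `split_bytewise` is a hypothetical splitter that assumes the cause is a byte-oriented word pattern treating every non-ASCII byte as part of a word; the real cause in the current engine may differ. `split_unicode` shows the behaviour one would expect instead.

```python
import re

def split_bytewise(text):
    # Hypothetical byte-oriented splitter: ASCII letters, digits, underscore
    # and every byte >= 0x80 count as "word" bytes; anything else separates.
    raw = text.encode("utf-8")
    tokens = re.findall(rb"[A-Za-z0-9_\x80-\xff]+|[^A-Za-z0-9_\x80-\xff]", raw)
    return [t.decode("utf-8") for t in tokens]

def split_unicode(text):
    # Unicode-aware splitter: on str patterns \w matches letters and digits
    # of any script, so U+2018 is a separator just like the ASCII quote.
    return re.findall(r"\w+|\W", text)

ascii_case = "One'two"
curly_case = "One\u2018two"    # same words, separated by U+2018

bytewise_ascii = split_bytewise(ascii_case)   # three tokens
bytewise_curly = split_bytewise(curly_case)   # one token: words are merged
unicode_curly = split_unicode(curly_case)     # three tokens again
```

With the byte-oriented pattern the three bytes of U+2018 glue the neighbouring words into a single token, so a word-level diff of "One‘two" against "One‘three" would mark the whole string as changed. The Unicode-aware pattern yields the two words and the quote as separate tokens in both cases.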

So my question to all Unicode Nazis here is: why is a quote from the
Latin-1 charset treated *differently* from a slightly different Unicode
quote?

On 08/06/06, Rob Church <robchur(a)gmail.com> wrote:
...
  On 08/06/06, Timwi <timwi(a)gmx.net> wrote:
  It is already confrontational of a programmer to pretend the whole world
 could make do with Latin-1. It is one of the most devastating and
 accordingly infuriating assumptions that still prevails despite the fact
 that Unicode is decades old. We're in the 21st century; it is no longer
 appropriate to even start programming anything where any user-visible
 text is restricted to Latin-1 or any other 8-bit charset. 
 Of course, of course, I clean forgot. Because a quick proof of concept
 has to be PERFECT, doesn't it. Do excuse that little oversight.

 It's not perfect yet. Get over it and give some feedback on the idea.

 Rob Church
 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)wikimedia.org
 http://mail.wikipedia.org/mailman/listinfo/wikitech-l
