Regarding UTF-8 support: perhaps it would be better if I tried to
explain some of the problems I'm facing. For example, I'm not tracking
the most frequently used English words (a, the, and, or …). In my
opinion every language should be tweaked separately, which is why I'm
suggesting that we first test this on the English Wikipedia.
Also, I don't have a problem with finding spaces in UTF-8 encoded
strings and splitting them there. The problem is that some Unicode
characters, like ẅ (letter w with two dots on top, code point U+1E85),
are used to write words, while others, such as ' (left single
quotation mark, code point U+2018), are used to separate words. I also
believe these characters can appear encoded as HTML entities in
wikitext.
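To illustrate the entity problem (in Python, purely for illustration
since my tool's code isn't shown here; the input string is a made-up
example): decoding numeric character references first means both
spellings of U+2018/U+2019 end up as the same separator characters.

```python
import html

# Hypothetical wikitext fragment: the same quotation marks appear both
# as raw UTF-8 characters and as numeric HTML entities.
raw = "&#x2018;quoted&#x2019; text and \u2018quoted\u2019 text"

# html.unescape turns the entities into real code points, so both
# halves of the string become byte-for-byte identical.
decoded = html.unescape(raw)
print(decoded)
```

After this step a single splitting rule can handle both forms.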
As I'm tracking words, I need to distinguish between these "character
classes", as they are known in regular expressions (i.e. \w, a word
character, and \W, a non-word character). If Tim Starling has a silver
bullet that solves these problems, he should feel free to e-mail it to
me. However, in my opinion, implementing that kind of UTF-8 support
from scratch can be a somewhat tricky business.
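As a sketch of the distinction (Python shown for illustration only;
Python 3's re module applies Unicode-aware \w by default, which may
not match whatever regex engine the tool actually uses): U+1E85 is
matched as a word character, while U+2018 falls into \W and acts as a
separator.

```python
import re

# U+1E85 (ẅ) is a letter, so Unicode-aware \w keeps it inside a word;
# U+2018 (') is punctuation, so it separates words.
text = "\u2018\u1e85eird\u2019 words"
print(re.findall(r"\w+", text))  # ['ẅeird', 'words']
```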
The bottom line is that the problems above *can* be solved, but what I
suggest is trying this on the English Wikipedia first, to see how it
works in general and whether it's a useful feature. Support for other
languages could and should be added later, one language at a time.
On 08/06/06, Rob Church <robchur(a)gmail.com> wrote:
On 08/06/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
Gerard Meijssen wrote:
> Hoi,
> This small Unicode issue is a show stopper. When software is suggested
> that only works on Latin script, you do not appreciate the amount of
> work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who
are we to assume that someone else doesn't appreciate the amount of
effort put in elsewhere? It might be correct, but then again, there
might be no specific bias against it.
> Apart from that why would it be boring.. this is a technical list.
> Personally I am interested in two things as well, what other projects
> are you referring to and how you want to see this attribution done.
Apart from why what would be boring? The post was to get feedback,
don't withhold it. I would imagine standard attribution for the code
under GNU GPL blah blah blah. We won't be adding flashing banners,
"Wikipedia now uses a feature from XYZ". Or are we to start crediting
developers with individual features? "Thanks for clearing your
watchlist, c/o Rob Church."
I discussed unicode support with the original poster on IRC. I
couldn't get through to him that adding UTF-8 support to a PHP
application is trivial,
My impression of the poster was that he didn't completely understand
the whole UTF-8/Unicode/blah thing nor its implications, and looked
somewhat confused.
and requires no special UTF-8 support within PHP itself. MediaWiki's
UTF-8 support is mostly implemented from scratch using PHP's
binary-safe string handling. My wikidiff2 module in C++ also contains
a simple UTF-8 decoder within the word splitting routine. It's not
difficult.
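For what it's worth, the kind of simple decoder described above really
is short. Here is a sketch (Python rather than wikidiff2's C++; the
function name is my own, and continuation bytes are not validated
since this is only an illustration of the lead-byte logic):

```python
def utf8_codepoints(data: bytes):
    """Yield Unicode code points from UTF-8 bytes, decoded by hand."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 1-byte sequence: 0xxxxxxx
            cp, extra = b, 0
        elif b >> 5 == 0b110:        # 2-byte sequence: 110xxxxx
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:       # 3-byte sequence: 1110xxxx
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:      # 4-byte sequence: 11110xxx
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("invalid UTF-8 lead byte")
        # Each continuation byte (10xxxxxx) contributes 6 payload bits.
        for j in range(1, extra + 1):
            cp = (cp << 6) | (data[i + j] & 0x3F)
        yield cp
        i += extra + 1

print(list(utf8_codepoints("a\u1e85".encode("utf-8"))))  # [97, 7813]
```

A real decoder would also reject overlong encodings and malformed
continuation bytes, but the core loop is about this small.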
If the *idea* is found to be viable, adding the UTF-8 goodies will be
trivial, and we'll put the damn effort in.
Rob Church
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l