Regarding UTF-8 support: perhaps it would be better if I tried to
explain some of the problems I'm facing. For example, I'm not tracking
the most frequently used English words (a, the, and, or …). In my
opinion every language should be tweaked separately, which is why I'm
suggesting that we first test this on the English Wikipedia.
Also, I don't have a problem with finding spaces in UTF-8 encoded
strings and splitting them there. The problem is that some Unicode
characters, like ẅ (letter w with two dots on top, code point U+1E85),
are used to write words, while others, such as ' (left single
quotation mark, code point U+2018), are used to separate words. I also
believe these characters can appear encoded as HTML entities in
wikitext.
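To illustrate the entity problem (in Python, purely for illustration
since my tool's code isn't shown here; the input string is a made-up
example): decoding numeric character references first means both
spellings of U+2018/U+2019 end up as the same separator characters.

```python
import html

# Hypothetical wikitext fragment: the same quotation marks appear both
# as raw UTF-8 characters and as numeric HTML entities.
raw = "&#x2018;quoted&#x2019; text and \u2018quoted\u2019 text"

# html.unescape turns the entities into real code points, so both
# halves of the string become byte-for-byte identical.
decoded = html.unescape(raw)
print(decoded)
```

After this step a single splitting rule can handle both forms.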
As I'm tracking words, I need to distinguish between these "character
classes", as they are known in regular expressions (i.e. \w, a word
character, and \W, a non-word character). If Tim Starling has a silver
bullet that solves these problems, he should feel free to e-mail it to
me. However, in my opinion, implementing that kind of UTF-8 support
from scratch can be a somewhat tricky business.
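As a sketch of the distinction (Python shown for illustration only;
Python 3's re module applies Unicode-aware \w by default, which may
not match whatever regex engine the tool actually uses): U+1E85 is
matched as a word character, while U+2018 falls into \W and acts as a
separator.

```python
import re

# U+1E85 (ẅ) is a letter, so Unicode-aware \w keeps it inside a word;
# U+2018 (') is punctuation, so it separates words.
text = "\u2018\u1e85eird\u2019 words"
print(re.findall(r"\w+", text))  # ['ẅeird', 'words']
```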
The bottom line is that the problems above *can* be solved, but what I
suggest is trying this on the English Wikipedia first, to see how it
works in general and whether it's a useful feature. Support for other
languages could and should be added later, one language at a time.
On 08/06/06, Rob Church <robchur(a)gmail.com> wrote:
On 08/06/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
Gerard Meijssen wrote:
> Hoi,
> This small Unicode issue is a show stopper. When software is suggested
> that only works on Latin script, you do not appreciate the amount of
> work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who
are we to assume that someone else doesn't appreciate the amount of
effort put in elsewhere? It might be correct, but then again, there
might be no specific bias against it.
> Apart from that why would it be boring.. this is a technical list.
> Personally I am interested in two things as well, what other projects
> are you referring to and how you want to see this attribution done.
Apart from why what would be boring? The post was to get feedback,
don't withhold it. I would imagine standard attribution for the code
under GNU GPL blah blah blah. We won't be adding flashing banners,
"Wikipedia now uses a feature from XYZ". Or are we to start crediting
developers with individual features? "Thanks for clearing your
watchlist, c/o Rob Church."
I discussed unicode support with the original poster on IRC. I
couldn't get through to him that adding UTF-8 support to a PHP
application is trivial,
My impression of the poster was that he didn't completely understand
the whole UTF-8/Unicode/blah thing nor its implications, and looked
somewhat confused.
and requires no special UTF-8 support within PHP itself. MediaWiki's
UTF-8 support is mostly implemented from scratch using PHP's
binary-safe string handling. My wikidiff2 module in C++ also contains
a simple UTF-8 decoder within the word splitting routine. It's not
difficult.
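For what it's worth, the kind of simple decoder described above really
is short. Here is a sketch (Python rather than wikidiff2's C++; the
function name is my own, and continuation bytes are not validated
since this is only an illustration of the lead-byte logic):

```python
def utf8_codepoints(data: bytes):
    """Yield Unicode code points from UTF-8 bytes, decoded by hand."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 1-byte sequence: 0xxxxxxx
            cp, extra = b, 0
        elif b >> 5 == 0b110:        # 2-byte sequence: 110xxxxx
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:       # 3-byte sequence: 1110xxxx
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:      # 4-byte sequence: 11110xxx
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("invalid UTF-8 lead byte")
        # Each continuation byte (10xxxxxx) contributes 6 payload bits.
        for j in range(1, extra + 1):
            cp = (cp << 6) | (data[i + j] & 0x3F)
        yield cp
        i += extra + 1

print(list(utf8_codepoints("a\u1e85".encode("utf-8"))))  # [97, 7813]
```

A real decoder would also reject overlong encodings and malformed
continuation bytes, but the core loop is about this small.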
If the *idea* is found to be viable, adding the UTF-8 goodies will be
trivial, and we'll put the damn effort in.
Rob Church
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l