On 6/8/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
Roman Nosov wrote:
Regarding UTF-8 support, perhaps it would be better if I explain some of
the problems I'm facing. For example, I'm not tracking the most
frequently used English words (a, the, and, or …). In my opinion
every language should be tweaked separately, and that's why I'm
suggesting testing it first on the English Wikipedia.
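As an illustration only, the per-language stop-word filtering described above might be sketched like this (the names and the tiny word list are hypothetical, not Roman's actual code):

```cpp
#include <set>
#include <string>

// Hypothetical English stop-word list; each language would need its
// own hand-tuned list, which is why per-language tweaking is needed.
const std::set<std::string>& english_stopwords()
{
    static const std::set<std::string> sw = {"a", "the", "and", "or"};
    return sw;
}

// A word is tracked only if it is not a stop word.
inline bool is_tracked(const std::string& word)
{
    return english_stopwords().count(word) == 0;
}
```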
Also, I don't have a problem with finding spaces in UTF-8 encoded
strings and splitting them there. The problem is that some Unicode
characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85),
are used to write words, while some Unicode characters, such as ' (left
single quotation mark, Unicode code 0x2018), are used to separate
words. I also believe these characters could be encoded as HTML
entities in wikitext.
As I'm tracking words, I need to distinguish between these "character
classes", as they are known in regular expressions (i.e. \w, a word
character, and \W, a non-word character). If Tim Starling has a silver
bullet that can solve these problems, feel free to e-mail it to me.
However, in my opinion, implementing that kind of UTF-8 support from
scratch can be a somewhat tricky business.
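To make the two examples concrete, here is a purely illustrative classifier (not Roman's code) that treats U+1E85 as word-forming and U+2018 as word-separating, using hard-coded Unicode block ranges; a real implementation would consult the Unicode character database instead:

```cpp
#include <cstdint>

// Illustrative classification of code points into "word" and
// "separator" classes. U+1E85 (w with diaeresis, in Latin Extended
// Additional) forms words; U+2018 (left single quotation mark, in
// General Punctuation) separates them.
inline bool is_word_codepoint(std::uint32_t cp)
{
    // ASCII letters, digits and underscore are word characters
    if ((cp >= '0' && cp <= '9') || cp == '_' ||
        (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return true;
    // General Punctuation block (U+2000..U+206F) covers U+2018
    if (cp >= 0x2000 && cp <= 0x206F)
        return false;
    // Latin Extended Additional (U+1E00..U+1EFF) covers U+1E85
    if (cp >= 0x1E00 && cp <= 0x1EFF)
        return true;
    // Everything else below U+00C0 is treated as punctuation here
    return cp >= 0xC0;
}
```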
The bottom line is that the problems above *can* be solved, but what I
suggest is to try it on the English Wikipedia first, to see how it works
in general and whether it's a useful feature. Support for other
languages could and should be added later, one language at a time.
High-numbered punctuation characters are rare, so the approach I took in
wikidiff2 was to consider them part of the word. I treated all
non-alphanumeric characters below 0xC0 as word-splitting punctuation
characters.
The Unicode character database actually includes information on which
characters are letters, which are punctuation, etc. Some programming
languages incorporate this into appropriate functions such as isalpha(),
ispunct() or the like. I believe Perl has them. I don't know whether PHP
has them or not, but if it doesn't, that might be considered a bug.
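In C and C++, for instance, the standard wide-character classification functions play this role. As a sketch of the idea (note that their behaviour above ASCII depends on the active locale, so only the ASCII cases can be relied on portably):

```cpp
#include <cwctype>

// Classify a character the way a diff tokenizer might, using the
// standard wide-character tables. For code points above ASCII the
// result depends on the active locale's character tables, so this
// is only a sketch of the idea, not a portable Unicode classifier.
inline bool is_word_char(wint_t ch)
{
    return std::iswalnum(ch) || ch == L'_';
}
```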
There are three languages that I'm aware of that don't use spaces to
separate words, and thus require special handling: Chinese, Japanese
and Thai. They are the only ones I was able to find while searching
the web for word segmentation information, and nobody from any other
language wiki has complained.
The other language I can think of that doesn't use spaces is Khmer, but
it doesn't have many fonts yet, and so there are very few web sites, if
any, and surely no wikis. Some other Southeast Asian scripts may fall
into the same category.
Chinese and Japanese are adequately
handled by doing character-level diffs -- I received lots of praise from the
Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation
for search or machine translation is a much more difficult problem, but
luckily solving it is unnecessary for diff formatting. Character-level diffs
may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received
any complaints from the Wikipedians, I believe this is less than ideal. Thai
has lots of composing characters, so you often end up highlighting little
dots on top of letters and the like. Really what is required here is
dictionary-based word segmentation.
I believe there are free dictionary-based word segmentation algorithms
available for Thai. They're known not to be perfect either, but I'm not
aware of any free Thai word segmenters that do better.
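As a sketch of the idea only: greedy longest match ("maximal matching") against a dictionary is the simplest such algorithm. The toy dictionary below uses ASCII stand-ins rather than real Thai, and a real segmenter would handle out-of-vocabulary runs far more carefully:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy dictionary-based segmentation by greedy longest match.
// At each position, take the longest dictionary entry that matches;
// if nothing matches, emit a single byte and move on.
std::vector<std::string> segment(const std::string& text,
                                 const std::set<std::string>& dict)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t best = 0;
        // Try the longest possible match first
        for (std::size_t len = text.size() - i; len > 0; --len) {
            if (dict.count(text.substr(i, len))) {
                best = len;
                break;
            }
        }
        if (best == 0)
            best = 1;  // unknown material: fall back to one byte
        out.push_back(text.substr(i, best));
        i += best;
    }
    return out;
}
```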
Andrew Dunbar (hippietrail)
Our search engine is also next to useless on the Thai Wikipedia due to
the lack of word segmentation. But that's not a problem Roman has to
solve.
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch)
{
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') ||
        (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') ||
        (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai: return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese: same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
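For what it's worth, here is a sketch of how my_istext might drive an actual word splitter. The my_istext body is Tim's code from above; the minimal UTF-8 decoder and the split_words helper are my own illustration (they assume well-formed input and do no error checking), not code from wikidiff2:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Word-character test as posted above (from wikidiff2).
inline bool my_istext(int ch)
{
    if ((ch >= '0' && ch <= '9') || (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return true;
    if (ch < 0xc0) return false;                    // punctuation/control
    if (ch >= 0xe00 && ch <= 0xee7) return false;   // Thai
    if (ch >= 0x3000 && ch <= 0x9fff) return false; // Chinese/Japanese
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    return true;
}

// Illustrative splitter: decode each UTF-8 sequence to a code point,
// accumulate word characters, and break on everything else.
std::vector<std::string> split_words(const std::string& s)
{
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte (assumes valid UTF-8)
        std::size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2
                        : (b < 0xF0) ? 3 : 4;
        std::uint32_t cp = (len == 1) ? b : (b & (0x7F >> len));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        std::string ch = s.substr(i, len);
        i += len;
        if (my_istext(static_cast<int>(cp))) {
            current += ch;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Note how "café" stays a single word: é is 0xE9, above 0xC0 and outside the excluded ranges, so my_istext treats it as part of the word.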
Now this might not sound so "trivial" anymore. UTF-8 support is trivial,
and I'll stand by that, but supporting all the languages of the world is
not so trivial. Even so, as you can see, language support isn't as hard
as you might think, because a lot of the research has already been done.
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
--
http://linguaphile.sf.net