On 6/8/06, Tim Starling <t.starling(a)physics.unimelb.edu.au> wrote:
Roman Nosov wrote:
Regarding UTF-8 support, perhaps it would be better if I explain some of
the problems I'm facing. For example, I'm not tracking the most
frequently used English words (a, the, and, or …). In my opinion
every language should be tweaked separately, and that's why I'm
suggesting testing it first on the English Wikipedia.
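As an illustration only, the per-language stop-word filtering described above might be sketched like this (the names and the tiny word list are hypothetical, not Roman's actual code):

```cpp
#include <set>
#include <string>

// Hypothetical English stop-word list; each language would need its
// own hand-tuned list, which is why per-language tweaking is needed.
const std::set<std::string>& english_stopwords()
{
    static const std::set<std::string> sw = {"a", "the", "and", "or"};
    return sw;
}

// A word is tracked only if it is not a stop word.
inline bool is_tracked(const std::string& word)
{
    return english_stopwords().count(word) == 0;
}
```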
Also, I don't have a problem with finding spaces in UTF-8 encoded
strings and splitting them there. The problem is that some Unicode
characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85),
are used to write words, while some Unicode characters, such as ' (left
single quotation mark, Unicode code 0x2018), are used to separate
words. I also believe these characters could be encoded as HTML
entities in wikitext.
As I'm tracking words, I need to distinguish between these "character
classes", as they are known in regular expressions (i.e. \w, a word
character, and \W, a non-word character). If Tim Starling has a silver
bullet that can solve these problems, feel free to e-mail it to me.
However, in my opinion, implementing that kind of UTF-8 support from
scratch can be a somewhat tricky business.
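To make the two examples concrete, here is a purely illustrative classifier (not Roman's code) that treats U+1E85 as word-forming and U+2018 as word-separating, using hard-coded Unicode block ranges; a real implementation would consult the Unicode character database instead:

```cpp
#include <cstdint>

// Illustrative classification of code points into "word" and
// "separator" classes. U+1E85 (w with diaeresis, in Latin Extended
// Additional) forms words; U+2018 (left single quotation mark, in
// General Punctuation) separates them.
inline bool is_word_codepoint(std::uint32_t cp)
{
    // ASCII letters, digits and underscore are word characters
    if ((cp >= '0' && cp <= '9') || cp == '_' ||
        (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return true;
    // General Punctuation block (U+2000..U+206F) covers U+2018
    if (cp >= 0x2000 && cp <= 0x206F)
        return false;
    // Latin Extended Additional (U+1E00..U+1EFF) covers U+1E85
    if (cp >= 0x1E00 && cp <= 0x1EFF)
        return true;
    // Everything else below U+00C0 is treated as punctuation here
    return cp >= 0xC0;
}
```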
The bottom line is that the problems above *can* be solved, but what I
suggest is to try it on the English Wikipedia first, to see how it works
in general and whether it's a useful feature. Support for other
languages could and should be added later, one language at a time.
High-numbered punctuation characters are rare, so the approach I took in
wikidiff2 was to consider them part of the word. I treated all
non-alphanumeric characters below 0xC0 as word-splitting punctuation
characters.
The Unicode character database actually includes information on which
characters are letters, which are punctuation, etc. Some programming
languages incorporate this into appropriate functions such as isalpha(),
ispunct() or the like. I believe Perl has them. I don't know whether PHP
has them or not, but if it doesn't, that might be considered a bug.
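In C and C++, for instance, the standard wide-character classification functions play this role. As a sketch of the idea (note that their behaviour above ASCII depends on the active locale, so only the ASCII cases can be relied on portably):

```cpp
#include <cwctype>

// Classify a character the way a diff tokenizer might, using the
// standard wide-character tables. For code points above ASCII the
// result depends on the active locale's character tables, so this
// is only a sketch of the idea, not a portable Unicode classifier.
inline bool is_word_char(wint_t ch)
{
    return std::iswalnum(ch) || ch == L'_';
}
```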
There are three languages that I'm aware of that don't use spaces to
separate words, and thus require special handling: Chinese, Japanese
and Thai. They are the only ones I was able to find while searching
the web for word segmentation information, and nobody from any other
language wiki has complained.
The other language I can think of that doesn't use spaces is Khmer, but
it doesn't have many fonts yet, and so there are very few web sites, if
any, and surely no wikis. Some other Southeast Asian scripts may fall
into the same category.
Chinese and Japanese are adequately
handled by doing character-level diffs -- I received lots of praise from the
Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation
for search or machine translation is a much more difficult problem, but
luckily solving it is unnecessary for diff formatting. Character-level diffs
may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received
any complaints from the Wikipedians, I believe this is less than ideal. Thai
has lots of composing characters, so you often end up highlighting little
dots on top of letters and the like. Really what is required here is
dictionary-based word segmentation.
I believe there are free dictionary-based word segmentation algorithms
available for Thai. They're known not to be perfect either, but I'm not
aware of any free Thai word segmenters that do better.
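As a sketch of the idea only: greedy longest match ("maximal matching") against a dictionary is the simplest such algorithm. The toy dictionary below uses ASCII stand-ins rather than real Thai, and a real segmenter would handle out-of-vocabulary runs far more carefully:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy dictionary-based segmentation by greedy longest match.
// At each position, take the longest dictionary entry that matches;
// if nothing matches, emit a single byte and move on.
std::vector<std::string> segment(const std::string& text,
                                 const std::set<std::string>& dict)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t best = 0;
        // Try the longest possible match first
        for (std::size_t len = text.size() - i; len > 0; --len) {
            if (dict.count(text.substr(i, len))) {
                best = len;
                break;
            }
        }
        if (best == 0)
            best = 1;  // unknown material: fall back to one byte
        out.push_back(text.substr(i, best));
        i += best;
    }
    return out;
}
```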
Andrew Dunbar (hippietrail)
Our search engine is also next to useless on the Thai Wikipedia due to
the lack of word segmentation. But that's not a problem Roman has to
solve.
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch)
{
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') ||
        (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') ||
        (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai: return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese: same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
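For what it's worth, here is a sketch of how my_istext might drive an actual word splitter. The my_istext body is Tim's code from above; the minimal UTF-8 decoder and the split_words helper are my own illustration (they assume well-formed input and do no error checking), not code from wikidiff2:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Word-character test as posted above (from wikidiff2).
inline bool my_istext(int ch)
{
    if ((ch >= '0' && ch <= '9') || (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return true;
    if (ch < 0xc0) return false;                    // punctuation/control
    if (ch >= 0xe00 && ch <= 0xee7) return false;   // Thai
    if (ch >= 0x3000 && ch <= 0x9fff) return false; // Chinese/Japanese
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    return true;
}

// Illustrative splitter: decode each UTF-8 sequence to a code point,
// accumulate word characters, and break on everything else.
std::vector<std::string> split_words(const std::string& s)
{
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte (assumes valid UTF-8)
        std::size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2
                        : (b < 0xF0) ? 3 : 4;
        std::uint32_t cp = (len == 1) ? b : (b & (0x7F >> len));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        std::string ch = s.substr(i, len);
        i += len;
        if (my_istext(static_cast<int>(cp))) {
            current += ch;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Note how "café" stays a single word: é is 0xE9, above 0xC0 and outside the excluded ranges, so my_istext treats it as part of the word.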
Now this might not sound so "trivial" anymore. UTF-8 support is trivial,
and I'll stand by that, but supporting all the languages of the world is
not so trivial. Even so, as you can see, language support isn't as hard
as you might think, because a lot of the research has already been done.
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
--
http://linguaphile.sf.net