Update: A Tamil Wikipedian, Mahir, went to the core of the issue that I cited in
my previous email and identified that the issue in that instance was due to the
superfluous use of the zero width non-joiner HTML entity. We're going to file a
bug asking MediaWiki to chomp those entities when they occur in inappropriate
places.
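
To make the proposed clean-up concrete, here is a minimal Python sketch of one way stray ZWNJs could be dropped from a title. The rule for what counts as an "appropriate" place (directly after a Tamil pulli and directly before another Tamil character) is an assumption made for illustration, and strip_stray_zwnj is a hypothetical helper, not MediaWiki code.

```python
import html

ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
PULLI = "\u0bcd"  # TAMIL SIGN VIRAMA (pulli)


def is_tamil(ch: str) -> bool:
    """True if ch lies in the Tamil block U+0B80..U+0BFF."""
    return "\u0b80" <= ch <= "\u0bff"


def strip_stray_zwnj(title: str) -> str:
    """Remove ZWNJs that are not between a pulli and a Tamil character.

    Assumption (illustrative only): a ZWNJ is meaningful only when it
    follows a pulli and precedes a Tamil character, where it can stop a
    conjunct from forming. Every other ZWNJ is treated as superfluous.
    """
    # Turn the HTML entity form (&zwnj;) into the actual character first.
    text = html.unescape(title)
    kept = []
    for i, ch in enumerate(text):
        if ch == ZWNJ:
            after_pulli = i > 0 and text[i - 1] == PULLI
            before_tamil = i + 1 < len(text) and is_tamil(text[i + 1])
            if not (after_pulli and before_tamil):
                continue  # drop the stray ZWNJ
        kept.append(ch)
    return "".join(kept)
```

With this rule a trailing "&zwnj;" at the end of a title is removed, while a ZWNJ placed between a pulli and the next consonant is kept.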
- Sundar
"That language is an instrument of human reason, and not merely a medium for
the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture
----- Original Message ----
From: BalaSundaraRaman <sundarbecse(a)yahoo.com>
To: Discussion list on Indian language projects of Wikimedia.
<wikimediaindia-l(a)lists.wikimedia.org>
Cc: Wikita <wikita-l(a)lists.wikimedia.org>
Sent: Wed, December 29, 2010 11:34:06 AM
Subject: Re: [Wikimediaindia-l] Indic languages & unicode issues.
Ragib,
(copied Tamil Wiki list)
We've faced an issue similar to Bug #5948. Due to non-canonicalisation, there
are two articles with the same title in Tamil Wikipedia!
http://ta.wikipedia.org/wiki/%E0%AE%AA%E0%AF%87%E0%AE%9A%E0%AF%8D%E0%AE%9A%…
8
(Tamil discussion)
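
As a rough illustration of how such duplicate titles could be detected, the Python sketch below groups titles by a comparison key. The key (NFC normalisation followed by removal of ZWNJ and ZWJ) is an assumption chosen for this example, and the function names are hypothetical. Removing every joiner is lossy, so the key is meant only for finding candidates to review, not for rewriting titles.

```python
import unicodedata
from collections import defaultdict

JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def comparison_key(title: str) -> str:
    """Key under which visually identical titles should collide.

    Assumption: NFC handles canonically equivalent sequences, and
    joiners are ignored for comparison purposes only.
    """
    nfc = unicodedata.normalize("NFC", title)
    return "".join(ch for ch in nfc if ch not in JOINERS)


def find_collisions(titles):
    """Return groups of distinct titles that share a comparison key."""
    groups = defaultdict(list)
    for title in titles:
        groups[comparison_key(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample: the same syllable written with the precomposed
# vowel sign U+0BCA and with the two-part sequence U+0BC6 U+0BBE.
sample = ["\u0b95\u0bca", "\u0b95\u0bc6\u0bbe", "\u0ba4"]
for group in find_collisions(sample):
    print([t.encode("unicode_escape").decode("ascii") for t in group])
```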
- Sundar
----- Original Message ----
> From: Ragib Hasan <ragibhasan(a)gmail.com>
> To: Discussion list on Indian language projects of Wikimedia.
><wikimediaindia-l(a)lists.wikimedia.org>
> Sent: Wed, December 29, 2010 10:23:06 AM
> Subject: Re: [Wikimediaindia-l] Indic languages & unicode issues.
>
> I'm curious about the issue you are discussing ... is this similar to
> a long-standing bug that affects Bengali, Assamese, and Bishnupriya
> Manipuri wikipedias?
>
> https://bugzilla.wikimedia.org/show_bug.cgi?id=5948
>
>
> Ragib
>
>
> User:Ragib on en and bn
>
>
> --
> Ragib Hasan, Ph.D
> NSF Computing Innovation Fellow and
> Assistant Research Scientist
>
> Dept of Computer Science
> Johns Hopkins University
> 3400 N Charles Street
> Baltimore, MD 21218
>
> Website: http://www.ragibhasan.com
>
>
>
> On Mon, Dec 27, 2010 at 1:29 AM, BalaSundaraRaman <sundarbecse(a)yahoo.com> wrote:
> >> Unicode's decision to bring the second encoding in
> >> standard was widely debated and opposed mainly by FOSS developer
> >> community from Malayalam. Unicode announced the dual encoding scheme
> >> without canonical equivalence definition in 2005 and reverted it when
> >> scholars and developers opposed it.
> >
> > Sadly, you're not alone in this, Santhosh.
> > We have had canonical non-equivalence issues and many more (similar to the
> > atomic chillu issue) in Tamil too. :(
> > Part of it was inherited from the umbrellaish ISCII model (done with good
> > intentions, I believe).
> > They put the abugidas of the Indo-Aryan languages and other systems like Tamil
> > (haven't studied other writing systems enough to comment upon) into one bucket
> > and we're still suffering for that. They cite stability when legitimate changes
> > are sought, but allow such breaking changes.
> >
> > I'm sure you'll be working with the search engines to map the equivalent glyph
> > sequences. Also, please explore MediaWiki tech solutions to add redirects or
> > hidden texts (though not ideal).
> >
> > - Sundar
> >
> > ----- Original Message ----
> >> From: Santhosh Thottingal <santhosh.thottingal(a)gmail.com>
> >> To: Discussion list on Indian language projects of Wikimedia.
> >><wikimediaindia-l(a)lists.wikimedia.org>
> >> Sent: Sun, December 26, 2010 10:28:17 PM
> >> Subject: Re: [Wikimediaindia-l] Indic languages & unicode issues.
> >>
> >> On Sun, Dec 26, 2010 at 7:43 PM, CherianTinu Abraham
> >> <tinucherian(a)gmail.com> wrote:
> >> > Hi all,
> >> > Happened to see Gerard's blog post on issues with Malayalam Wikipedia
> >> > & Unicode upgrade to 5.1:
> >> > http://ultimategerardm.blogspot.com/2010/12/malayalam-enigma.html
> >>
> >>
> >> The issue is very complex. There were heated debates around this topic
> >> on the Unicode Indic mailing list for years. In short, the issue is about
> >> dual encoding: representing a letter using two types of Unicode
> >> character codes. Unicode's decision to bring the second encoding into the
> >> standard was widely debated and opposed, mainly by the FOSS developer
> >> community from Malayalam. Unicode announced the dual encoding scheme
> >> without a canonical equivalence definition in 2005 and reverted it when
> >> scholars and developers opposed it.
> >> The same proposal was then introduced again. The FOSS community and language
> >> scholars protested the proposal. The SMC community submitted a document with 17
> >> reasons why dual encoding should not be introduced; see
> >> http://wiki.smc.org.in/images/2/23/SMC_Unicode_5.1.pdf
> >> Similarly, a seminar conducted by the University of Kerala to discuss the
> >> issue opposed the proposal; see
> >> http://images2.wikia.nocookie.net/__cb20080131071131/fci/images/1/19/Report_of_Workshop.pdf
> >> But the Unicode Consortium did not bother to answer either of
> >> these reports and went ahead with the decision in Unicode 5.1. The
> >> dual encoding scheme has no canonical equivalence definition.
> >> Since the equivalence is not in the standard, I doubt whether operating
> >> systems will implement it, not to mention search engines.
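
The missing canonical equivalence can be checked directly. The Python sketch below pairs each atomic chillu added in Unicode 5.1 with the consonant + virama + ZWJ sequence used under Unicode 5.0 and shows that the normalisation forms keep the two spellings distinct. The fold_chillus helper and the choice to fold towards the 5.0 sequences are assumptions made for illustration (the direction is arbitrary), not something specified by Unicode or implemented by any search engine mentioned in this thread.

```python
import unicodedata

VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Atomic chillu (Unicode 5.1) -> sequence used with Unicode 5.0.
CHILLU_TO_SEQUENCE = {
    "\u0d7a": "\u0d23" + VIRAMA + ZWJ,  # CHILLU NN
    "\u0d7b": "\u0d28" + VIRAMA + ZWJ,  # CHILLU N
    "\u0d7c": "\u0d30" + VIRAMA + ZWJ,  # CHILLU RR
    "\u0d7d": "\u0d32" + VIRAMA + ZWJ,  # CHILLU L
    "\u0d7e": "\u0d33" + VIRAMA + ZWJ,  # CHILLU LL
    "\u0d7f": "\u0d15" + VIRAMA + ZWJ,  # CHILLU K
}


def equal_under_normalization(a: str, b: str) -> bool:
    """True if any standard normalisation form makes a and b identical."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
        for f in forms
    )


def fold_chillus(text: str) -> str:
    """Hypothetical search key: rewrite atomic chillus as 5.0 sequences."""
    for atomic, sequence in CHILLU_TO_SEQUENCE.items():
        text = text.replace(atomic, sequence)
    return text


for atomic, sequence in CHILLU_TO_SEQUENCE.items():
    name = unicodedata.name(atomic)
    print(name, "equal under normalisation:",
          equal_under_normalization(atomic, sequence))
```

Because normalisation alone does not unify the two spellings, any software that wants to treat them as the same text needs an explicit mapping of this kind.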
>>
> >> Since the new encoding scheme is defined without backward
> >> compatibility, that is, against Unicode's stability policy, the Malayalam
> >> FOSS community decided not to implement it until the issues are resolved
> >> and is continuing with the Unicode 5.0 encoding. Malayalam news portals also
> >> follow Unicode 5.0. Most of the tools from Google also continue with
> >> Unicode 5.0 based encoding. Malayalam Wikipedia decided to go ahead
> >> with the latest version of Unicode. I had resisted this move in the
> >> discussion pages of Malayalam Wikipedia. The decision was taken based
> >> on voting by a small community of editors and not based on proper
> >> technical analysis.
> >>
> >>
> >> Believe it or not, this is how Malayalam wiki is rendered in a Windows XP
> >> IE 8 box with the OS default font:
> >> http://thottingal.in/tmp/ml-wiki-winxp-IE8.png
> >> I hope it gives some clue about the issue that Gerard mentioned.
>>
> >> Most of the discussion around the encoding issue took place in
> >> Malayalam (on the Malayalam wiki or in blogs), but this English blog post
> >> might summarize it:
> >> http://www.j4v4m4n.in/2009/11/07/unicode-or-malayalam/
> >>
> >>
> >> The discussion on Malayalam Wikipedia (content in Malayalam):
> >> http://ml.wikipedia.org/wiki/വിക്കിപീഡിയ:പഞ്ചായത്ത്_(സാങ്കേതികം)/യൂണികോഡ്_5.1.0/ചർച്ച_(പഴയവ)
> >>
> >> Thanks
> >> Santhosh Thottingal
> >> http://thottingal.in
_______________________________________________
Wikimediaindia-l mailing list
Wikimediaindia-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l