On Sep 27, 2013, at 9:27 PM, शंतनू <shantanoo(a)gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> +++ Bakul Shah [26-Sep-2013 21:41 +0530]
> | Google for javascript spell check.
>
> 'Search the fine web (STFW) for...' is more correct than 'Google for...'
Languages evolve.
http://searchenginewatch.com/article/2058373/Google-Now-A-Verb-In-The-Oxfor…
> | typo.js for example can use the same
> | dictionaries and affix rules as hunspell (what is used in OpenOffice and
> | Firefox) so any improvements we make with guj spellchecking will work
> | with it as well. It can do spell checking in multiple languages.
>
> I have not come across a good FOSS spell-check algorithm implementation
> for Indic languages. 'typo.js' seems to be nice. Has it been tried with
> Indic languages? How is the accuracy?
No idea if typo.js is good enough, but it is a start if you want something
in JavaScript, and it is open source.
> | Hunspell can be used locally to spell check .txt files and I think it
> | knows a few other formats. It also allows use of private dictionaries
> | to store words the general dictionary doesn't know about but you use
> | such as proper names.
> |
> | The difficulty with Gujarati and other Indian scripts is that most
> | command line programs don't know how to display them properly so
> | you pretty much need to spellcheck using a word processor program or
> | a browser.
>
> I am a little confused regarding the initial aim. From the discussion,
> I understand that the following things need to be done.
>
> - - Correct the words which have been there already on the wiki
This may work for some misspelled words, but typical algorithms
don't have the semantic information needed to make the right choice
when there are two or more correctly spelled words that are
pretty close. I believe that even for text already on the wiki
you'd want an interactive spellchecker.
> - - Make sure that when user types incorrect words, correct spelling is
> suggested and let user decide what to do.
>
> On mrwiki we had the first one. Correcting the spelling (not
> grammatical) errors.
>
> IMO, start using what is available and simultaneously work on other
> ideas would be better than not using any tool(s).
Right. But what is available in open source doesn't seem to work, at least
in Gujarati (and, from the little I tried, in Hindi too). My guess, which
may be completely wrong, is that there are at least two issues.
1. The affix rules need to be expanded greatly.
2. Maybe algorithms written originally for European languages don't work
well with Indic languages. There may be implicit assumptions in the
code.
Hunspell's affix rule format is documented but there is a bunch of work
to be done here. Hopefully Kartik Mistry and others maintaining other
Indic dictionaries have a better handle on this.
The only way I know how to understand 2. is to write some code. But
hunspell has over 38,000 lines of C++ code -- so I don't want to touch
it. But it should not be hard to figure this out using a separate little
program and then if it makes sense we can merge it back in hunspell.
But note: I am telling you what I will do; not what you or anyone else
should do!
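
As a first probe for issue 2, here is a minimal sketch in Python (my choice of language for illustration; hunspell itself is C++). It shows one such implicit assumption: Python's built-in `re` module does not count Indic vowel signs as word characters, so `\w+` cuts a Gujarati word into pieces. The function names are made up for this example, and the mark-aware tokenizer is only an assumption about what a fix could look like.

```python
import re
import unicodedata
from itertools import groupby

def naive_tokens(text):
    # \w in Python's re covers letters and digits, but not combining
    # vowel signs (categories Mn/Mc), so Indic words get split apart.
    return re.findall(r"\w+", text)

def is_word_char(ch):
    # Treat letters (L*), marks (M*) and numbers (N*) as word characters.
    # ZWJ/ZWNJ (category Cf) are not handled in this sketch.
    return unicodedata.category(ch)[0] in "LMN"

def mark_aware_tokens(text):
    # Group consecutive word characters into tokens.
    return ["".join(run)
            for is_word, run in groupby(text, key=is_word_char)
            if is_word]

if __name__ == "__main__":
    sample = "ગુજરાતી ભાષા"
    print(naive_tokens(sample))       # word is broken at the vowel signs
    print(mark_aware_tokens(sample))  # whole words
```

If the same kind of assumption hides in a spellchecker's tokenizer or suggestion code, it would explain poor results regardless of how good the word list is.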
> Another observation: many people are talking about dictionaries. IMO, it
> should be a *word list*. For spell checking one does not require a dictionary.
> A word list is necessary and sufficient. The effort required for a dictionary
> is huge compared to generating a word list.
In the software world, words get used all the time in ways not originally
intended. For example, we "execute" a program! And when it "hangs" we
"kill" it! In particular, the word "dictionary" is used in multiple ways.
Languages evolve and get messier. So it goes.
>
> There are a few other issues which may result in failure of spell check.
> We are assuming that the data entered uses the correct Unicode character
> set. But in the case of Devanagari we have seen some issues, e.g.
>
> स्वतः is written as स्वत:
> दऱ्यावर is written as द-यावर
>
> This will probably break the logic.
> Another example, in case of roman script there is only one way of
> writing the word. In case of devnagari same thing can be written
> in different ways. e.g.
>
> अॅ ॲ
> दऱ्यावर दऱ्यावर दर्यावर (same output is rendered)
Even in English multiple spellings are allowed (-ize vs -ise suffix).
Even in the same dialect (American or British) a word may be spelled
differently: hyphenated words lose the hyphen over time. Well-known
vs wellknown. And spelling evolves with time. For example, the old
Sarth Jodnikosh uses ગૂજરાતી but now ગુજરાતી is the accepted spelling. In
any case, we can't capture all these nuances.
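
That said, some of the purely mechanical encoding variants mentioned above (ॲ typed as two code points, or an ASCII colon typed in place of visarga) could be folded into one form before spell checking. A minimal sketch in Python follows. The fold table and the colon heuristic are assumptions for illustration, not an established list, and the colon rule is off by default because it will misfire on a genuine colon.

```python
import re
import unicodedata

# Hypothetical fold table: variant encoding -> preferred encoding.
VARIANTS = {
    "\u0905\u0945": "\u0972",  # A + vowel sign candra E -> letter candra A
}

# Heuristic (assumption): an ASCII colon directly after a Devanagari
# character and before whitespace or end of text stands for visarga.
COLON_AS_VISARGA = re.compile(r"(?<=[\u0900-\u097F]):(?=\s|$)")

def fold_variants(text, colon_as_visarga=False):
    # NFC takes care of canonically equivalent sequences;
    # the table handles variants that NFC leaves alone.
    text = unicodedata.normalize("NFC", text)
    for variant, preferred in VARIANTS.items():
        text = text.replace(variant, preferred)
    if colon_as_visarga:
        text = COLON_AS_VISARGA.sub("\u0903", text)
    return text
```

Running both the word list and the text being checked through the same fold keeps lookups consistent, whichever form is chosen as the preferred one.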
>
> (why and when such things happen could be discussed in another thread)
I suggest we set up a google group 'indic-spell' to discuss spelling
issues. Unless a group already exists for this purpose. Reply to just
me (and *not* the wikipedia-gu mailing list) to discuss this!
>
> - --
> शंतनू
>
> |
> | On Sep 25, 2013, at 9:58 PM, शंतनू महाजन <shantanoo(a)gmail.com> wrote:
> |
> | > -----BEGIN PGP SIGNED MESSAGE-----
> | > Hash: SHA1
> | >
> | > Hi Dhaval,
> | > Reply inline.
> | >
> | > +++ Dhaval S. Vyas [26-Sep-2013 00:31 +0530]
> | > | Something like that I do on gu.wiki with my bot. Correct most of the
> | > | obvious spelling or grammatical mistakes in Gujarati.
> | >
> | > Nice to know about it. Can you share more details regarding the following:
> | > - - In which language (python/ruby/perl/...) is the bot written?
> | > - - How do you maintain the updated list of words/mistakes which need to
> | > be corrected? Can you share the current list?
> | >
> | > How are you fixing the grammatical mistakes/errors? Is that code
> | > specific to Gujarati or can it be used easily for other languages?
> | >
> | > | Here we are looking
> | > | for something broader, which could be used not only online or onwiki, but
> | > | offline and/or at least offwiki.
> | >
> | > Can you elaborate more on this? By offline do you mean that the user
> | > provides some file (let's consider a Unicode .txt file), runs the program
> | > locally on their machine, and the spell check is done on that data?
> | >
> | > | Also, bot corrects errors already committed, while it would be better to
> | > | have good spell checker handy that can correct on-the-fly, with user input
> | > | obviously. This is more useful as it would help users learn correct Jodani,
> | > | in longer term. Users could build their own pool, etc will be other
> | > | benefits.
> | >
> | > Had thought about it in the past, but since I did not find anyone with
> | > JavaScript expertise who could help with writing the auto-correction code, I was not
> | > able to do much about it. :(
> | >
> | > - --
> | > शंतनू
> | >
> | > |
> | > | Regards,
> | > | Dhaval
> | > | On 25 Sep 2013 19:46, "Arnav Sonara" <sonara.arnav(a)gmail.com> wrote:
> | > |
> | > | > Hello friends :-)
> | > | >
> | > | > Whether the dialect changes every twelve gau or every twelve villages, whether it is હૂતો or હુતી or સુતી or સુતો,
> | > | > GPL, BSD or CC by SA ;-) right now our main concern is to have/develop
> | > | > a tool which can replace the commonly misspelled words.
> | > | >
> | > | > I remembered that Marathi Wikipedia Community is using such a tool and got
> | > | > in touch with one of the guys running it.
> | > | >
> | > | > So they have a common list of words with correct spelling (hrsv and
> | > | > dirgha) against what people generally tend to type incorrectly, and their
> | > | > bot replaces the incorrect one.
> | > | >
> | > | > The word list can be found here<http://toolserver.org/~shantanoo/replace_words.txt.html> and
> | > | > the code for reading/modifying wiki pages can be found here<https://github.com/pune-lug/supersimplemediawiki>.
> | > | > So if anyone is up for taking this task, please go ahead and start
> | > | > developing the tool :-D maybe I can help if and when it is possible.
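
The list-driven replacement described above can be sketched in a few lines of Python. This is only an illustration: the single entry below is hypothetical (borrowed from the colon-for-visarga case discussed elsewhere in this thread), the real list's file format is not assumed, and fetching or saving wiki pages is left out entirely.

```python
import re

# Hypothetical "wrong -> right" entries; the actual Marathi list
# is not reproduced here.
REPLACEMENTS = {
    # word ending in an ASCII colon -> same word ending in visarga
    "\u0938\u094d\u0935\u0924:": "\u0938\u094d\u0935\u0924\u0903",
}

def replace_words(text, replacements):
    # Split on whitespace but keep the separators, so the original
    # spacing and line breaks survive the round trip.
    parts = re.split(r"(\s+)", text)
    return "".join(replacements.get(part, part) for part in parts)
```

Splitting on whitespace rather than on `\b` is deliberate, since regex word boundaries are unreliable around Indic vowel signs. The price is that a word with punctuation attached will not match its list entry.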
> | > | >
> | > | > I've also CC'd a friend of mine from Marathi community who helped me
> | > | > understanding this, feel free to connect to him. I'm sure he will love to
> | > | > help you out. Thanks.
> | > | >
> | > | >
> | > | >
> | > | > Thanks,
> | > | > Arnav <http://arnavs.com/>.
> | > | > (User:Rangilo_Gujarati)<http://en.wikipedia.org/wiki/User:Rangilo_Gujarati>
> | > | > .
> | > | >
> | > | >
> | > | > On Sun, Sep 22, 2013 at 5:15 AM, Dhaval S. Vyas <dsvyas(a)gmail.com> wrote:
> | > | >
> | > | >> Dear Bakulbhai,
> | > | >>
> | > | >>
> | > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser.
> | > | >>> Or in a word processor and then copy and paste into the browser.]
> | > | >>
> | > | >>
> | > | >> Most of us prefer to work straight on the wiki as wikiformatting could be
> | > | >> done simultaneously. But, either way, it is OK as long as we have the
> | > | >> spellchecker somewhere.
> | > | >>
> | > | >>
> | > | >>> In terms of software, any new software I write will be under a BSD like
> | > | >>> license (basically: you can do anything you do with it, including use it in
> | > | >>> commercial work, but a) don't hold me responsible for anything and b) don't
> | > | >>> take credit for it). Any existing software I enhance is under its existing
> | > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but
> | > | >>> the affix related work in it is under BSD since it came from ispell).
> | > | >>
> | > | >>
> | > | >> That's simply great, we are lucky to have come across a person like you
> | > | >> and Thank Rajeshbhai for looping in :-)
> | > | >>
> | > | >>
> | > | >>> I will leave any organizational issues to Rajesh and other more capable.
> | > | >>> Rajesh, rather than use individual email addresses, pick an existing
> | > | >>> mailing list to discuss this further.
> | > | >>
> | > | >> Suggested a different thread because, all interested parties have
> | > | >> already responded. Also, this being a mailing list, many people have opted
> | > | >> for a digest instead of individual emails. So, if we continued discussion
> | > | >> here, their feedback might not be in time (e.g. Vihangbhai's reply today),
> | > | >> while if it came as personal email, it could be replied by interested
> | > | >> person. But, both options have their own pros and cons, so will leave it
> | > | >> entirely on you all.
> | > | >>
> | > | >> Regards,
> | > | >> Dhaval
> | > | >>
> | > | >>
> | > | >>
> | > | >>
> | > | >>
> | > | >> On Sat, Sep 21, 2013 at 10:46 PM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> | > | >>
> | > | >>> This is perhaps not the right mailing list to continue this discussion
> | > | >>> but I am already on 3 or 4 gujarati related lists.... So for now I will
> | > | >>> continue here.
> | > | >>>
> | > | >>> On Sep 20, 2013, at 9:36 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
> | > | >>>
> | > | >>> Once we have such functionality, and it is available under CC
> | > | >>> licence/public domain, it could be embedded in wiki (if not easily, without
> | > | >>> much trouble). There are several ways we could take it on board.
> | > | >>>
> | > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser.
> | > | >>> Or in a word processor and then copy and paste into the browser.]
> | > | >>>
> | > | >>> *Licensing:*
> | > | >>>
> | > | >>> The spellchecking dictionary needs the kind of enhancements I talked
> | > | >>> about earlier. One starting point is the word list in the *dict-gu_IN.oxt* OpenOffice extension.
> | > | >>>
> | > | >>> Before doing any real work on it, we need to get the licensing issues
> | > | >>> clarified. *dict-gu_IN.oxt* has a README file that says the original
> | > | >>> word list was prepared by Utkarsh Project volunteers and the list is under the GPL.
> | > | >>> GPL is actually for software so it is strange to see a dictionary using
> | > | >>> GPL. See also:
> | > | >>>
> | > | >>>
> | > | >>> http://stackoverflow.com/questions/4329467/is-it-okay-to-include-gpled-file…
> | > | >>>
> | > | >>> I don't want to get into a discussion of pros and cons of GPL but I am
> | > | >>> unwilling to work on anything GPL due to its more restrictive licensing.
> | > | >>> The best option in my view is to ask the Utkarsh volunteers to make it
> | > | >>> public domain or licence it under a dual license (GPL as well as BSD). I
> | > | >>> checked out http://www.utkarsh.org but it is not clear how to do this.
> | > | >>> Rajesh, can you help sort this out?
> | > | >>>
> | > | >>> The other alternative is to do what Rajesh suggested, which is to feed
> | > | >>> lots of text to a program and derive a list. If we do this, we will make
> | > | >>> this completely public domain.
> | > | >>>
> | > | >>> In terms of software, any new software I write will be under a BSD like
> | > | >>> license (basically: you can do anything you do with it, including use it in
> | > | >>> commercial work, but a) don't hold me responsible for anything and b) don't
> | > | >>> take credit for it). Any existing software I enhance is under its existing
> | > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but
> | > | >>> the affix related work in it is under BSD since it came from ispell).
> | > | >>>
> | > | >>> Can all of us who have an interest in developing such functionality and
> | > | >>> a passion for the language as well as expertise form a taskforce and take it
> | > | >>> off list? I will be delighted to work on it in whatever capacity I can.
> | > | >>>
> | > | >>> I will leave any organizational issues to Rajesh and other more capable.
> | > | >>> Rajesh, rather than use individual email addresses, pick an existing
> | > | >>> mailing list to discuss this further.
> | > | >>>
> | > | >>> On 20 Sep 2013 15:48, "Bakul Shah" <bakul(a)bitblocks.com> wrote:
> | > | >>>
> | > | >>>> Rajesh,
> | > | >>>>
> | > | >>>> Proof readers will have to use a word processor or browser as other
> | > | >>>> tools are not very good at displaying Indic languages.
> | > | >>>>
> | > | >>>> Googledoc is no good at spell checking.
> | > | >>>>
> | > | >>>> OpenOffice (or LibreOffice) has a number of dictionaries including for
> | > | >>>> Gujarati. I suspect it doesn't work well & we have work to do. I have no
> | > | >>>> desire or time to work on openOffice -- it is massive -- but there may be a
> | > | >>>> way....
> | > | >>>>
> | > | >>>> [The rest is a bit too technical. Feel free to skip]
> | > | >>>>
> | > | >>>> There are a number of open source standalone spell checking programs
> | > | >>>> such as ispell, aspell, hunspell etc. Most were derived from or influenced
> | > | >>>> by the original unix spell program written by S.C.Johnson. For the curious,
> | > | >>>> here's a paper by Doug McIlroy about it:
> | > | >>>> http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf
> | > | >>>>
> | > | >>>> ispell was pre-Unicode and only worked with western languages but it
> | > | >>>> made some major advances that seem to have been carried over to aspell. I dug
> | > | >>>> into aspell some and it seems to support Gujarati.
> | > | >>>>
> | > | >>>> Anyway, aspell can be used from other programs (has an API), can handle
> | > | >>>> multiple languages etc. Its documentation is not sufficient (IMHO) to
> | > | >>>> understand affix rules. ispell documentation has more details. I used to
> | > | >>>> know ispell fairly well but that was 20+ years ago!
> | > | >>>>
> | > | >>>> The *dict-gu.oxt* extension (used in OpenOffice) contains a file
> | > | >>>> called *gu_IN.dic* that contains a word list and *gu_IN.aff* that
> | > | >>>> should have *affix* rules for Gujarati but it is very small (compared
> | > | >>>> to English) and seems to need a bunch more work. I see that this extension
> | > | >>>> is maintained by Kartik Mistry (did I see an email from him in this
> | > | >>>> thread?) so maybe he and I can figure out how to add more affix rules?
> | > | >>>>
> | > | >>>> The basic idea with some example: given a rule like
> | > | >>>>
> | > | >>>> BOTH/R
> | > | >>>>
> | > | >>>> This can expand into BOTH and BOTHER, as -ER is a common english
> | > | >>>> extension (cart, carter and so on). Another example: may have
> | > | >>>>
> | > | >>>> ACIDIFY/NR
> | > | >>>>
> | > | >>>> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to
> | > | >>>> IER). These rules make the spellchecking dictionary quite compact as well
> | > | >>>> as indicate how a word should be taken apart for efficient matching.
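
The expansion described above can be sketched in Python. The flag letters R and N follow the examples in the message; the rule bodies are a simplified approximation written for illustration, not a copy of any real ispell or hunspell affix file.

```python
def add_er(word):
    # Simplified -ER rule: ...E -> ...ER, consonant + Y -> ...IER, else + ER.
    if word.endswith("E"):
        return word + "R"
    if word.endswith("Y") and word[-2:-1] not in "AEIOU":
        return word[:-1] + "IER"
    return word + "ER"

def add_ion(word):
    # Simplified -ION rule: ...E -> ...ION, ...Y -> ...ICATION, else + EN.
    if word.endswith("E"):
        return word[:-1] + "ION"
    if word.endswith("Y"):
        return word[:-1] + "ICATION"
    return word + "EN"

# Each flag letter names one affix rule.
FLAGS = {"R": add_er, "N": add_ion}

def expand(entry):
    # "STEM/FLAGS" -> the stem plus one derived word per flag.
    stem, _, flags = entry.partition("/")
    return [stem] + [FLAGS[flag](stem) for flag in flags]

if __name__ == "__main__":
    print(expand("BOTH/R"))
    print(expand("ACIDIFY/NR"))
```

A checker does not need to generate every form up front. It can also run the rules backwards, stripping a suffix from the input word and looking up the remaining stem with the matching flag.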
> | > | >>>>
> | > | >>>> aspell is capable of deriving such rules for English but I suspect it
> | > | >>>> will need help in Indian languages. This is where the *.aff* file
> | > | >>>> comes in. So for example in Gujarati we would like to render the following
> | > | >>>> words in single rule
> | > | >>>>
> | > | >>>> *ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના*
> | > | >>>>
> | > | >>>> etc. For this we can write something like
> | > | >>>>
> | > | >>>> * ગધેડ/XYZABC*
> | > | >>>>
> | > | >>>> where each letter denotes a particular suffix. And yet, there is no
> | > | >>>> such word as *ગધેડ*, and I am not sure these programs can handle
> | > | >>>> this. And we have compound words such as *ઘોડાગાડી*, which will
> | > | >>>> require more complex rules. In fact Indic languages should have a much
> | > | >>>> larger set of affix rules than English! We should also check out what is
> | > | >>>> being done for Hindi.
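
A minimal Python sketch of the stem-plus-flags idea for the example above. The flag letters and the one-suffix-per-flag table are hypothetical, chosen only to reproduce the eight forms listed. The `needs_affix` switch models a stem that is not a word by itself; Hunspell's affix format has a NEEDAFFIX directive that looks intended for this case, though whether it behaves well for Gujarati is untested here.

```python
STEM = "ગધેડ"  # not a valid word on its own

# Hypothetical flag letter -> suffix, written as escapes because bare
# vowel signs are hard to read in source code.
SUFFIXES = {
    "X": "\u0acb",                    # -o
    "Y": "\u0ac0",                    # -i
    "Z": "\u0ac1\u0a82",              # -un
    "A": "\u0abe",                    # -a
    "B": "\u0abe\u0aa8\u0ac1\u0a82",  # -anun
    "C": "\u0abe\u0aa8\u0ac0",        # -ani
    "D": "\u0abe\u0aa8\u0acb",        # -ano
    "E": "\u0abe\u0aa8\u0abe",        # -ana
}

def expand(stem, flags, needs_affix=False):
    # The bare stem counts as a word only when needs_affix is False.
    forms = [] if needs_affix else [stem]
    forms.extend(stem + SUFFIXES[flag] for flag in flags)
    return forms

if __name__ == "__main__":
    for form in expand(STEM, "XYZABCDE", needs_affix=True):
        print(form)
```

A real rule set would more likely factor the case endings out of the gender and number endings instead of listing every combination, which is where the count of rules grows well beyond English.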
> | > | >>>>
> | > | >>>> Next, we need rules for `similar' letters (or letters near each other
> | > | >>>> on a keyboard) so that if there is not an exact match, we first try such
> | > | >>>> similar or neighbor letters.
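
The "similar letters" step can be sketched as follows in Python. The pairs below cover only the short and long i and u vowel signs in Gujarati and are an assumption for illustration; a usable table would also need similar-sounding consonants and keyboard neighbours.

```python
# Hypothetical table of easily confused characters (Gujarati vowel signs).
SIMILAR = [
    ("\u0abf", "\u0ac0"),  # short i sign <-> long i sign
    ("\u0ac1", "\u0ac2"),  # short u sign <-> long u sign
]

def candidates(word):
    # Every word obtained by swapping exactly one character for its partner.
    found = set()
    for i, ch in enumerate(word):
        for a, b in SIMILAR:
            if ch == a:
                found.add(word[:i] + b + word[i + 1:])
            elif ch == b:
                found.add(word[:i] + a + word[i + 1:])
    return found

def suggest(word, known_words):
    # No suggestions needed for a word that is already known.
    if word in known_words:
        return []
    return sorted(candidates(word) & set(known_words))
```

With a word list containing ગુજરાતી, the older spelling ગૂજરાતી yields that single suggestion, because the two differ only in the length of the u sign.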
> | > | >>>>
> | > | >>>> Anyway, once we fix up the dictionary, very likely the same dictionary
> | > | >>>> can be used with word processors such as openOffice etc. An easier idea may
> | > | >>>> be to do a web based frontend.
> | > | >>>>
> | > | >>>> These programs do a lot of work: create dictionaries, read various file
> | > | >>>> formats, update screen, etc. etc. that make them complicated and hard to
> | > | >>>> modify. Ideally I would want a single function for checking:
> | > | >>>> check(Speller, String)
> | > | >>>>
> | > | >>>> That returns a quad: (correctly spelled prefix, misspelled word, list
> | > | >>>> of suggestions, remaining string). A separate program can generate the
> | > | >>>> dictionaries. The Speller object will read whatever dictionaries it needs.
> | > | >>>> But I don't have time to implement this.
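
One possible shape for such a function, sketched in Python. The `Speller` class, its method names, whitespace tokenization, and the use of `difflib` for suggestions are all assumptions made for this illustration; none of it comes from aspell or hunspell.

```python
import difflib
import re

class Speller:
    """Holds a word list; a real one would also load affix rules."""

    def __init__(self, words):
        self.words = set(words)

    def knows(self, word):
        return word in self.words

    def suggest(self, word, limit=5):
        # Closest known words by difflib's similarity ratio.
        return difflib.get_close_matches(word, sorted(self.words), n=limit)

def check(speller, text):
    # Returns (correct prefix, misspelled word, suggestions, remaining text).
    # Words are whitespace-separated; punctuation is not stripped here.
    for match in re.finditer(r"\S+", text):
        word = match.group()
        if not speller.knows(word):
            return (text[:match.start()], word,
                    speller.suggest(word), text[match.end():])
    # Nothing misspelled: the whole text is the correct prefix.
    return (text, None, [], "")
```

A caller would loop: show the misspelled word and its suggestions, take the user's choice, then call `check` again on the remaining text. That keeps the checker free of any display or file-format code.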
> | > | >>>>
> | > | >>>> Bakul
> | > | >>>>
> | > | >>>> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com>
> | > | >>>> wrote:
> | > | >>>>
> | > | >>>> Has anyone tried Microsoft office Gujarati spell checker? It is
> | > | >>>> available with office 2010.
> | > | >>>>
> | > | >>>> Sent from the old new iPad!
> | > | >>>>
> | > | >>>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> | > | >>>>
> | > | >>>> Googling "hindi spell checker algorithm" found a number of papers. The
> | > | >>>> basic idea is to compare how "similar" a word being checked is to a word
> | > | >>>> known to be correct, where similarity is computed using some algorithm. You
> | > | >>>> don't store all the ways people can misspell a word. Plus, logic is used to
> | > | >>>> derive related words from a root word, which depend on plurality, gender,
> | > | >>>> tense, etc. These rules are more complex in Indic languages than in western ones.
> | > | >>>> And I think we may need to look at "clusters" instead of individual
> | > | >>>> Unicode code points. But all this must have been worked out years ago. Maybe not
> | > | >>>> for Gujarati but for Hindi, Marathi, Bengali. You should check with the
> | > | >>>> usual suspects (Google, Microsoft, SIL, language researchers etc.).
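
To make the "clusters" point concrete, here is a minimal Python sketch that measures similarity as edit distance over clusters instead of code points. The clustering rule (attach combining marks to the preceding character, and join consonants across a virama) is a simplification assumed for illustration; it is not the full Unicode grapheme cluster algorithm and it ignores ZWJ/ZWNJ.

```python
import unicodedata

VIRAMAS = {"\u094d", "\u0acd"}  # Devanagari and Gujarati virama

def clusters(word):
    # Split a word into rough orthographic clusters.
    out = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        after_virama = bool(out) and out[-1][-1] in VIRAMAS
        if out and (is_mark or after_virama):
            out[-1] += ch   # extend the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out

def edit_distance(a, b):
    # Classic Levenshtein distance over any two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cluster_distance(w1, w2):
    return edit_distance(clusters(w1), clusters(w2))
```

Counting in clusters means a conjunct typed wrongly is one error, not two or three, which changes which known words look "closest" to a misspelling.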
> | > | >>>>
> | > | >>>> For OCR you may need something slightly different from spellcheckers
> | > | >>>> that deal with human errors. Here a more common problem will be mistaking
> | > | >>>> similar looking letters and joining or splitting of words due to too little
> | > | >>>> or too much white space.
> | > | >>>>
> | > | >>>> Ultimately there should be support for language variations too (surati,
> | > | >>>> kathiawadi, amdavadi etc)!
> | > | >>>>
> | > | >>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com>
> | > | >>>> wrote:
> | > | >>>>
> | > | >>>> Dhavalbhai,
> | > | >>>>
> | > | >>>> As we get text that is generated using OCR, I see a need for a good
> | > | >>>> Gujarati dictionary. I tried to use the GL dictionary. It was not effective
> | > | >>>> because it is just a corpus of words. It cannot recognize any variation on a
> | > | >>>> word. In that model, we would possibly need over ten times the corpus the GL
> | > | >>>> dictionary has for it to be useful. Otherwise, it flags too many
> | > | >>>> correct words as errors.
> | > | >>>>
> | > | >>>> The same dictionary could be used for Gujarati proof readers.
> | > | >>>>
> | > | >>>> One way is to generate a larger corpus by scraping words from Gujarati
> | > | >>>> Internet pages (those in Unicode); a better way is to think about building
> | > | >>>> better dictionary logic. I may be able to interest exceptionally good
> | > | >>>> volunteer developers if we can think of a smarter way of creating a
> | > | >>>> dictionary. For example, we could codify grammar rules to form derivative
> | > | >>>> words.
> | > | >>>>
> | > | >>>> Should we pursue this course?
> | > | >>>>
> | > | >>>>
> | > | >>>>
> | > | >>>> Sent from the old new iPad!
> | > | >>>>
> | > | >>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
> | > | >>>>
> | > | >>>> Dear Roopalben,
> | > | >>>>
> | > | >>>> I second your concern regarding correct language. I often say that
> | > | >>>> newspapers are the only LITERATURE most of us end up reading and have
> | > | >>>> access to. The language and the (increasingly common Hindi) words used in them
> | > | >>>> shape the language of society today, and hence it is great that
> | > | >>>> you are introducing this course.
> | > | >>>>
> | > | >>>> Unfortunately, on wiki we don't have spelling correction tool or
> | > | >>>> dictionary lookup facility. But, Vishal Monpara has been developing one.
> | > | >>>> Gujarati Lexicon has recently developed pop-up dictionary as well, which
> | > | >>>> could be adapted for this purpose.
> | > | >>>>
> | > | >>>> On gu.wikipedia, there is a lot of content translated from either
> | > | >>>> English or Hindi, and most of these lack the original Gujarati language.
> | > | >>>> When read, these translations look so artificial. For the course, it could
> | > | >>>> be good idea to show such examples and get the course attendees correct it,
> | > | >>>> may be offline if they are not computer savvy or hesitant to use wikipedia.
> | > | >>>>
> | > | >>>> Please let me and community here know if you have any suggestions on
> | > | >>>> how we can help with the task you are carrying out.
> | > | >>>>
> | > | >>>> Kind Regards,
> | > | >>>> Dhaval
> | > | >>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:
> | > | >>>>
> | > | >>>>> Basically there are not many good proofreaders available in the
> | > | >>>>> publishing industry - and the demand is high. That was the main reason for
> | > | >>>>> starting this course.
> | > | >>>>>
> | > | >>>>> Wikipedia is an important source for information. However, the concern
> | > | >>>>> here is about correct use of language too. Today we see a lot many errors
> | > | >>>>> in Gujarati newspapers, publishing, media and almost everywhere. That is a
> | > | >>>>> high concern for us.
> | > | >>>>>
> | > | >>>>> If Wiki is going to be an important tool for the next generation, we
> | > | >>>>> have to make sure that it conveys correct language to the society.
> | > | >>>>>
> | > | >>>>> I would like to know, whether any auto-correction of spelling etc. are
> | > | >>>>> available while editing an article in Wiki ?
> | > | >>>>>
> | > | >>>>> Thank you.
> | > | >>>>>
> | > | >>>>>
> | > | >>>>> Roopal
> | > | >>>>>
> | > | >>>>>
> | > | >>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <
> | > | >>>>> kartik.mistry(a)gmail.com> wrote:
> | > | >>>>>
> | > | >>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com>
> | > | >>>>>> wrote:
> | > | >>>>>> > At Gujarati Sahitya Parishad, we are running proof reading course
> | > | >>>>>> and we are including a session of modern methods of proof reading, which
> | > | >>>>>> includes editing on (Guj) Wiki articles.
> | > | >>>>>> >
> | > | >>>>>> > Please send suggestions if you have. This is the first batch of
> | > | >>>>>> students from various fields.
> | > | >>>>>>
> | > | >>>>>> Few suggestions (some may be offtopic, sorry for that!)
> | > | >>>>>> 1. Please follow Wikipedia's guideline for article.
> | > | >>>>>> 2. Make sure person is logged in before making changes.
> | > | >>>>>> 3. Please do not change anything other than spelling/grammar etc.
> | > | >>>>>> 4. If you're at it already, donating pictures of 'સાહિત્યકાર' (literary figures) for
> | > | >>>>>> various articles from GSP is a good idea, isn't it? :)
> | > | >>>>>>
> | > | >>>>>> Thanks for good work!
> | > | >>>>>>
> | > | >>>>>> --
> | > | >>>>>> Kartik Mistry | IRC: kart_
> | > | >>>>>> {0x1f1f, kartikm}.wordpress.com
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (Darwin)
> Comment: GPGTools - http://gpgtools.org
>
> iQIcBAEBAgAGBQJSRlrIAAoJEPC2gDV1D+BTzeAP/2YQf6jDK0aQS6p/ArFsAZ0E
> 0RrGBpbAjXPYSLiDVo5R163Fkae3zF2mOv76LeNyCqYZdfwUx80xudeG8Ytiph4M
> lq0phdsRCF+Hhdewx3aLGIeekQJKRXXkXFjEkIZJPNlnhmG5wwJ9O86Nx3KQI2vu
> QYBlEr+WHx73IdUiujmtM4uAfQD1NdB6M8cs136+A2I2rCAdFjgErLqvLHaDeZAR
> D1ih/6lNQnI3PKRFpwCIKq9GKcccTG6Thf4QKgTIKfB18uYVnzBIpP/t+J4y/+zT
> PYNVK3v4pEWaHdtQa/QVet4Q0oXaVeXbpvW2BagbvdoqTLyQZiMiOGHQRUhmOWiA
> +3dBI5McFuRRHCiV+HvqgUwQPgb/i3Q9YqAmSHGRqd1+2EnALYEVJgtmT0f/AL5v
> /lwBXk70NUD18x6BhNiAJg0O4QeGyVs1kalG5xX2P95WG8LCLwiYhfMX9nBLi5Tb
> 5LxPSFT/d+sKaqjju69vrukjSoN3hosDUt3Kn5vDDnB2Ep+ba0Y/ROwSZGkTv/VW
> +rbL8Cn+ZCZF4uRopyMc3jfG3VnrQK6HzpIz+4LDUqPnlgbxLmabU4kPIZ1+LZEc
> RJfa3EUi3AHzB68Jd1lLi9oQ0mbT+pi7QjS70AoB5d8Cn+rIWZyPln5uASS3pyDf
> Em7XlKr5Tp2u+rWXvB+L
> =wcOi
> -----END PGP SIGNATURE-----
Google for javascript spell check. typo.js for example can use the same
dictionaries and affix rules as hunspell (what is used in OpenOffice and
Firefox) so any improvements we make with guj spellchecking will work
with it as well. It can do spell checking in multiple languages.
Hunspell can be used locally to spell check .txt files and I think it
knows a few other formats. It also allows use of private dictionaries
to store words the general dictionary doesn't know about but you use
such as proper names.
The difficulty with Gujarati and other Indian scripts is that most
command line programs don't know how to display them properly so
you pretty much need to spellcheck using a word processor program or
a browser.
On Sep 25, 2013, at 9:58 PM, शंतनू महाजन <shantanoo(a)gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Dhaval,
> Reply inline.
>
> +++ Dhaval S. Vyas [26-Sep-2013 00:31 +0530]
> | Something like that I do on gu.wiki with my bot. Correct most of the
> | obvious spelling or grammatical mistakes in Gujarati.
>
> Nice to know about it. Can you share more details regarding following:
> - - In which language (python/ruby/perl/...) the bot is written?
> - - How do you maintain the updated list of words/mistakes which need to
> be corrected? Can you share the current list?
>
> How are you fixing the grammatical mistakes/errors? Is that code
> specific to Gujarati or can it be used easily for other languages?
>
> | Here we are looking
> | for something broader, which could be used not only online or onwiki, but
> | offline and/or at least offwiki.
>
> Can you elaborate more on this? By offline do you mean that user
> provides some file (lets consider unicode .txt file), runs the program
> locally on its machine, and spell check is done on that data?
>
> | Also, bot corrects errors already commited, while it would be better to
> | have good spell checker handy that can correct on-the-fly, with user input
> | obviously. This is more useful as it would help users learn correct Jodani,
> | in longer term. Users could build their own pool, etc will be other
> | benefits.
>
> Had thought about it in past, but since did not find any javascript
> expertise who could help with writing the auto-correction code, was not
> able to do much regarding it. :(.
>
> - --
> शंतनू
>
> |
> | Regards,
> | Dhaval
> | On 25 Sep 2013 19:46, "Arnav Sonara" <sonara.arnav(a)gmail.com> wrote:
> |
> | > નમસ્કાર મિત્રો :-)
> | >
> | > બાર ગાંઉએ બોલી બદલાય કે બાર ગામે બોલી બદલાય, હૂતો કે હુતી કે સુતી કે સુતો,
> | > GPL, BSD કે પછી CC by SA ;-) right now our main concern is to have/develop
> | > such a tool which can replace the commonly misspelled words.
> | >
> | > I remembered that Marathi Wikipedia Community is using such a tool and got
> | > in touch with one of the guys running it.
> | >
> | > So they have a common list of words with correct spelling (hrsv and
> | > dirgha) against what people generally tend to type incorrectly, and their
> | > bot replaces the incorrect one.
> | >
> | > The word list can be found here<http://toolserver.org/~shantanoo/replace_words.txt.html> and
> | > the code for reading/modifying wiki pages can be found here<https://github.com/pune-lug/supersimplemediawiki>.
> | > So if anyone is up for taking this task, please go ahead and start
> | > developing the tool :-D maybe I can help if and when it is possible.
> | >
> | > I've also CC'd a friend of mine from Marathi community who helped me
> | > understanding this, feel free to connect to him. I'm sure he will love to
> | > help you out. Thanks.
> | >
> | >
> | >
> | > Thanks,
> | > Arnav <http://arnavs.com/>.
> | > (User:Rangilo_Gujarati)<http://en.wikipedia.org/wiki/User:Rangilo_Gujarati>
> | > .
> | >
> | >
> | > On Sun, Sep 22, 2013 at 5:15 AM, Dhaval S. Vyas <dsvyas(a)gmail.com> wrote:
> | >
> | >> Dear Bakulbhai,
> | >>
> | >>
> | >>> [No need to "embed" in wikipedia. You edit articles in your web browser.
> | >>> Or in a word processor and then copy and paste into the browser.]
> | >>
> | >>
> | >> Most of us prefer to work straight on the wiki as wikiformatting could be
> | >> done simultaneously. But, eitherway, it is OK so far as we have the
> | >> spellchecker somewhere.
> | >>
> | >>
> | >>> In terms of software, any new software I write will be under a BSD like
> | >>> license (basically: you can do anything you do with it, including use it in
> | >>> commercial work, but a) don't hold me responsible for anything and b) don't
> | >>> take credit for it). Any existing software I enhance is under its existing
> | >>> license. Note that ispell was under BSD license. aspell is under GPL (but
> | >>> the affix related work in it is under BSD since it came from ispell).
> | >>
> | >>
> | >> That's simply great, we are lucky to have come across a person like you
> | >> and Thank Rajeshbhai for looping in :-)
> | >>
> | >>
> | >>> I will leave any organizational issues to Rajesh and other more capable.
> | >>> Rajesh, rather than use individual email addresses, pick an existing
> | >>> mailing list to discuss this further.
> | >>
> | >> Suggested a different thread because, all interested parties have
> | >> already responded. Also, this being a mailing list, many people have opted
> | >> for a digest instead of individual emails. So, if we continued discussion
> | >> here, their feedback might not be in time (e.g. Vihangbhai's reply today),
> | >> while if it came as personal email, it could be replied by interested
> | >> person. But, both options have their own pros and cons, so will leave it
> | >> entirely on you all.
> | >>
> | >> Regards,
> | >> Dhaval
> | >>
> | >>
> | >>
> | >>
> | >>
> | >> On Sat, Sep 21, 2013 at 10:46 PM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> | >>
> | >>> This is perhaps not the right mailing list to continue this discussion
> | >>> but I am already on 3 or 4 gujarati related lists.... So for now I will
> | >>> continue here.
> | >>>
> | >>> On Sep 20, 2013, at 9:36 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
> | >>>
> | >>> Once we have such functionality, and it is available under CC
> | >>> licence/public domain, it could be embedded in wiki (if not easily, without
> | >>> much trouble). There are several ways we could take it on board.
> | >>>
> | >>> [No need to "embed" in wikipedia. You edit articles in your web browser.
> | >>> Or in a word processor and then copy and paste into the browser.]
> | >>>
> | >>> *Licensing:*
> | >>>
> | >>> The spellchecking dictionary needs the kind of enhancements I talked
> | >>> about earlier. One starting point is the word list in the *dict-gu_IN.oxt* OpenOffice extension.
> | >>>
> | >>> Before doing any real work on it, we need to get the licensing issues
> | >>> clarified. *dict-gu_IN.oxt* has a README file that says the original
> | >>> word list was prepared by Utkarsh Project volunteers and is licensed under
> | >>> the GPL. The GPL is really meant for software, so it is strange to see a
> | >>> dictionary using it. See also:
> | >>>
> | >>>
> | >>> http://stackoverflow.com/questions/4329467/is-it-okay-to-include-gpled-file…
> | >>>
> | >>> I don't want to get into a discussion of pros and cons of GPL but I am
> | >>> unwilling to work on anything GPL due to its more restrictive licensing.
> | >>> The best option in my view is to ask the Utkarsh volunteers to make it
> | >>> public domain or licence it under a dual license (GPL as well as BSD). I
> | >>> checked out http://www.utkarsh.org but it is not clear how to do this.
> | >>> Rajesh, can you help sort this out?
> | >>>
> | >>> The other alternative is to do what Rajesh suggested, which is to feed
> | >>> lots of text to a program and derive a list. If we do this, we will make
> | >>> this completely public domain.
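A minimal Python sketch of the "feed lots of text to a program and derive a list" idea. The function names, the punctuation set, and the `min_count` threshold are illustrative assumptions, not part of any existing tool. A token is kept only if every character falls in the Unicode Gujarati block (U+0A80 to U+0AFF), so words containing joiners or other scripts are dropped by this simplification.

```python
from collections import Counter

# Punctuation stripped from token edges (an assumed, incomplete set).
PUNCTUATION = ".,;:!?\"'()[]"


def is_gujarati(token: str) -> bool:
    """True if the token is non-empty and uses only Gujarati-block characters."""
    return bool(token) and all("\u0a80" <= ch <= "\u0aff" for ch in token)


def word_list(texts, min_count: int = 2) -> list[str]:
    """Collect Gujarati tokens seen at least `min_count` times across `texts`."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            token = token.strip(PUNCTUATION)
            if is_gujarati(token):
                counts[token] += 1
    return sorted(word for word, n in counts.items() if n >= min_count)
```

The frequency threshold is only a crude filter against one-off typos in the source text; a list derived this way would still need review by a proofreader.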
> | >>>
> | >>> In terms of software, any new software I write will be under a BSD like
> | >>> license (basically: you can do anything you want with it, including use it in
> | >>> commercial work, but a) don't hold me responsible for anything and b) don't
> | >>> take credit for it). Any existing software I enhance is under its existing
> | >>> license. Note that ispell was under BSD license. aspell is under GPL (but
> | >>> the affix related work in it is under BSD since it came from ispell).
> | >>>
> | >>> Can all of us who have an interest in developing such functionality, a
> | >>> passion for the language, and the expertise form a taskforce and take it
> | >>> off list? I will be delighted to work on it in whatever capacity I can.
> | >>>
> | >>> I will leave any organizational issues to Rajesh and others more capable.
> | >>> Rajesh, rather than use individual email addresses, pick an existing
> | >>> mailing list to discuss this further.
> | >>>
> | >>> On 20 Sep 2013 15:48, "Bakul Shah" <bakul(a)bitblocks.com> wrote:
> | >>>
> | >>>> Rajesh,
> | >>>>
> | >>>> Proof readers will have to use a word processor or browser as other
> | >>>> tools are not very good at displaying Indic languages.
> | >>>>
> | >>>> Google Docs is no good at spell checking.
> | >>>>
> | >>>> OpenOffice (or LibreOffice) has a number of dictionaries, including one for
> | >>>> Gujarati. I suspect it doesn't work well & we have work to do. I have no
> | >>>> desire or time to work on OpenOffice -- it is massive -- but there may be a
> | >>>> way....
> | >>>>
> | >>>> [The rest is a bit too technical. Feel free to skip]
> | >>>>
> | >>>> There are a number of open source standalone spell checking programs
> | >>>> such as ispell, aspell, hunspell etc. Most were derived from or influenced
> | >>>> by the original unix spell program written by S.C.Johnson. For the curious,
> | >>>> here's a paper by Doug McIlroy about it:
> | >>>> http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf
> | >>>>
> | >>>> ispell was pre-Unicode and only worked with Western languages, but it
> | >>>> made some major advances that seem to have been carried over to aspell. I
> | >>>> dug into aspell some and it seems to support Gujarati.
> | >>>>
> | >>>> Anyway, aspell can be used from other programs (has an API), can handle
> | >>>> multiple languages etc. Its documentation is not sufficient (IMHO) to
> | >>>> understand affix rules. ispell documentation has more details. I used to
> | >>>> know ispell fairly well but that was 20+ years ago!
> | >>>>
> | >>>> The *dict-gu.oxt* extension (used in OpenOffice) contains a file
> | >>>> called *gu_IN.dic* that contains a word list and *gu_IN.aff* that
> | >>>> should have *affix* rules for Gujarati, but it is very small (compared
> | >>>> to English) and seems to need a lot more work. I see that this extension
> | >>>> is maintained by Kartik Mistry (did I see an email from him in this
> | >>>> thread?) so maybe he and I can figure out how to add more affix rules?
> | >>>>
> | >>>> The basic idea, with some examples: given a rule like
> | >>>>
> | >>>> BOTH/R
> | >>>>
> | >>>> This can expand into BOTH and BOTHER, as -ER is a common English
> | >>>> suffix (cart, carter and so on). Another example: we may have
> | >>>>
> | >>>> ACIDIFY/NR
> | >>>>
> | >>>> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to
> | >>>> IER). These rules make the spellchecking dictionary quite compact as well
> | >>>> as indicate how a word should be taken apart for efficient matching.
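As a rough illustration of how such flagged entries expand, here is a small Python sketch. The rule table is a hypothetical, heavily simplified stand-in for an ispell-style affix file (real rules also carry conditions on the word ending), and only the two flags used in the examples above are modelled.

```python
# Hypothetical suffix rules keyed by flag letter.
# Each rule is (strip, add): if the word ends with `strip`, remove it and
# append `add`. An empty `strip` matches any word. The first match wins.
RULES = {
    "R": [("Y", "IER"), ("E", "ER"), ("", "ER")],
    "N": [("Y", "ICATION"), ("E", "ION"), ("", "EN")],
}


def expand(entry: str) -> list[str]:
    """Expand a 'WORD/FLAGS' dictionary entry into the word plus derived forms."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:
        for strip, add in RULES[flag]:
            if word.endswith(strip):  # always true when strip is ""
                stem = word[: len(word) - len(strip)]
                forms.append(stem + add)
                break
    return forms
```

With this table, `expand("BOTH/R")` yields BOTH and BOTHER, and `expand("ACIDIFY/NR")` yields ACIDIFY, ACIDIFICATION and ACIDIFIER, matching the examples in the message. A checker can run the same rules in reverse to strip a suffix and look up the stem.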
> | >>>>
> | >>>> aspell is capable of deriving such rules for English but I suspect it
> | >>>> will need help in Indian languages. This is where the *.aff* file
> | >>>> comes in. So for example in Gujarati we would like to cover the following
> | >>>> words with a single rule:
> | >>>>
> | >>>> ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના
> | >>>>
> | >>>> etc. For this we can write something like
> | >>>>
> | >>>> ગધેડ/XYZABC
> | >>>>
> | >>>> where each letter denotes a particular suffix. And yet, there is no
> | >>>> such word as *ગધેડ* -- and I am not sure these programs can handle
> | >>>> this. And we have compound words such as *ઘોડાગાડી* -- which will
> | >>>> require more complex rules. In fact Indic languages should have a much
> | >>>> larger set of affix rules than English! We should also check out what is
> | >>>> being done for Hindi.
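A sketch of the Gujarati case in Python, under stated assumptions: the flag letters A to H and their suffixes are invented here for illustration (the message's XYZABC flags are not defined anywhere), and the stem is treated as a "virtual" entry that is never emitted as a word on its own. Hunspell documents a NEEDAFFIX flag for stems that are only valid when affixed, which may cover this concern, though that has not been tried with the Gujarati dictionary here.

```python
# Hypothetical flag-to-suffix table. Suffixes are written as escapes because
# they start with combining vowel signs that attach to the stem's last letter.
SUFFIXES = {
    "A": "\u0acb",                    # vowel sign O          -> ગધેડો
    "B": "\u0ac0",                    # vowel sign II         -> ગધેડી
    "C": "\u0ac1\u0a82",              # vowel sign U+anusvara -> ગધેડું
    "D": "\u0abe",                    # vowel sign AA         -> ગધેડા
    "E": "\u0abe\u0aa8\u0ac1\u0a82",  # -ાનું                  -> ગધેડાનું
    "F": "\u0abe\u0aa8\u0ac0",        # -ાની                  -> ગધેડાની
    "G": "\u0abe\u0aa8\u0acb",        # -ાનો                  -> ગધેડાનો
    "H": "\u0abe\u0aa8\u0abe",        # -ાના                  -> ગધેડાના
}


def expand(entry: str, stem_is_word: bool = False) -> list[str]:
    """Expand 'STEM/FLAGS'; the bare stem is included only if it is a real word."""
    stem, _, flags = entry.partition("/")
    forms = [stem] if stem_is_word else []
    forms.extend(stem + SUFFIXES[flag] for flag in flags)
    return forms
```

Calling `expand("ગધેડ/ABCDEFGH")` produces the eight forms listed in the message without ever accepting the bare stem. Compound words such as ઘોડાગાડી are not handled by this sketch.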
> | >>>>
> | >>>> Next, we need rules for `similar' letters (or letters near each other
> | >>>> on a keyboard) so that if there is not an exact match, we first try such
> | >>>> similar or neighbor letters.
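One way to picture the "similar letters" step is a substitution table tried one position at a time. The Python sketch below assumes a tiny table of short/long vowel sign pairs, which is only a guess at what a Gujarati table might contain; keyboard-neighbour pairs could be added to the same table.

```python
# Assumed confusion pairs (short vs. long vowel signs); not an authoritative list.
SIMILAR = {
    "\u0abf": "\u0ac0",  # vowel sign I  -> vowel sign II
    "\u0ac0": "\u0abf",  # vowel sign II -> vowel sign I
    "\u0ac1": "\u0ac2",  # vowel sign U  -> vowel sign UU
    "\u0ac2": "\u0ac1",  # vowel sign UU -> vowel sign U
}


def suggestions(word: str, dictionary: set[str]) -> list[str]:
    """Return the word itself if known, else known words one similar-letter swap away."""
    if word in dictionary:
        return [word]
    found = []
    for i, ch in enumerate(word):
        for alt in SIMILAR.get(ch, ""):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found
```

Only single substitutions are tried, so the candidate set stays small; a fuller checker would fall back to a general similarity measure when this finds nothing.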
> | >>>>
> | >>>> Anyway, once we fix up the dictionary, very likely the same dictionary
> | >>>> can be used with word processors such as OpenOffice etc. An easier idea may
> | >>>> be to do a web-based frontend.
> | >>>>
> | >>>> These programs do a lot of work: create dictionaries, read various file
> | >>>> formats, update screen, etc. etc. that make them complicated and hard to
> | >>>> modify. Ideally I would want a single function for checking:
> | >>>> check(Speller, String)
> | >>>>
> | >>>> that returns a quad: (correctly spelled prefix, misspelled word, list
> | >>>> of suggestions, remaining string). A separate program can generate the
> | >>>> dictionaries. The Speller object will read whatever dictionaries it needs.
> | >>>> But I don't have time to implement this.
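The `check(Speller, String)` interface described above could look roughly like the following Python sketch. Everything beyond the four-part return value is an assumption: the `Speller` class here just wraps a set of words, suggestions come from the standard library's `difflib.get_close_matches`, and words are split on whitespace only (punctuation handling is left out).

```python
import difflib
import re
from dataclasses import dataclass


@dataclass
class Speller:
    """Hypothetical speller: a plain set of known words."""
    words: set[str]

    def suggest(self, word: str) -> list[str]:
        # Up to three known words that look similar to `word`.
        return difflib.get_close_matches(word, sorted(self.words), n=3)


def check(speller: Speller, text: str) -> tuple[str, str, list[str], str]:
    """Return (correct prefix, first misspelled word, suggestions, remainder)."""
    for match in re.finditer(r"\S+", text):
        word = match.group()
        if word not in speller.words:
            return (text[: match.start()], word,
                    speller.suggest(word), text[match.end():])
    # Nothing misspelled: the whole text is the correct prefix.
    return text, "", [], ""
```

A caller would loop: take the misspelled word, let the user pick a suggestion, then call `check` again on the remaining string. That keeps the interactive part entirely outside the checker.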
> | >>>>
> | >>>> Bakul
> | >>>>
> | >>>> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com>
> | >>>> wrote:
> | >>>>
> | >>>> Has anyone tried the Microsoft Office Gujarati spell checker? It is
> | >>>> available with Office 2010.
> | >>>>
> | >>>> Sent from the old new iPad!
> | >>>>
> | >>>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> | >>>>
> | >>>> Googling "hindi spell checker algorithm" found a number of papers. The
> | >>>> basic idea is to compare how "similar" a word being checked is to a word
> | >>>> known to be correct, where similarity is computed using some algorithm. You
> | >>>> don't store all the ways people can misspell a word. In addition, rules
> | >>>> are used to derive related words from a root word, depending on number,
> | >>>> gender, tense, etc. These rules are more complex in Indic languages than
> | >>>> in Western ones. And I think we may need to look at "clusters" instead of
> | >>>> individual Unicode code points. But all this must have been worked out
> | >>>> years ago, maybe not for Gujarati but for Hindi, Marathi, Bengali. You
> | >>>> should check with the usual suspects (Google, Microsoft, SIL, language
> | >>>> researchers etc.).
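To make the "similarity" and "clusters" points concrete, here is a small Python sketch. It uses Levenshtein edit distance as the similarity measure, which is one common choice and not necessarily what the papers use, and a simplified cluster rule that attaches every combining mark to the preceding character. That rule is an approximation, not full Unicode grapheme segmentation.

```python
import unicodedata


def clusters(word: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    out: list[str] = []
    for ch in word:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # vowel sign, anusvara, virama, etc.
        else:
            out.append(ch)
    return out


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or cluster lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]
```

Comparing ગધેડું with ગધેડો gives a distance of 2 over code points but 1 over clusters, since the whole final syllable counts as one changed unit. That difference is the reason to consider clusters when ranking suggestions.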
> | >>>>
> | >>>> For OCR you may need something slightly different from spellcheckers
> | >>>> that deal with human errors. Here a more common problem will be mistaking
> | >>>> similar looking letters and joining or splitting of words due to too little
> | >>>> or too much white space.
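The joining and splitting problem can be sketched separately from letter confusion. The Python function below is an illustrative assumption about how such a repair pass might work: an unknown token is first joined with its neighbour (too much white space), then split in two (too little), and otherwise left alone.

```python
def repair_spacing(tokens: list[str], words: set[str]) -> list[str]:
    """Fix OCR spacing errors by joining or splitting unknown tokens."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in words:
            out.append(token)
            i += 1
            continue
        # Too much white space: try gluing this token to the next one.
        if i + 1 < len(tokens) and token + tokens[i + 1] in words:
            out.append(token + tokens[i + 1])
            i += 2
            continue
        # Too little white space: try every split into two known words.
        for k in range(1, len(token)):
            if token[:k] in words and token[k:] in words:
                out.extend([token[:k], token[k:]])
                break
        else:
            out.append(token)  # leave it for the ordinary spell checker
        i += 1
    return out
```

For Gujarati text the split positions would better be taken at cluster boundaries than at arbitrary code points, and the first valid split is not always the intended one, so results should be shown to a proofreader and not applied blindly.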
> | >>>>
> | >>>> Ultimately there should be support for language variations too (surati,
> | >>>> kathiawadi, amdavadi etc)!
> | >>>>
> | >>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com>
> | >>>> wrote:
> | >>>>
> | >>>> Dhavalbhai,
> | >>>>
> | >>>> As we get text that is generated using OCR, I see the need for a good
> | >>>> Gujarati dictionary. I tried to use the GL dictionary. It was not effective
> | >>>> because it is only a fixed corpus of words and cannot recognize any
> | >>>> variation of a word. In that model, the corpus would need to be possibly
> | >>>> over ten times the size of the GL dictionary's to be useful. Otherwise, it
> | >>>> flags too many correct words as errors.
> | >>>>
> | >>>> The same dictionary could be used for Gujarati proof readers.
> | >>>>
> | >>>> One way is to generate a larger corpus by scraping words from Gujarati
> | >>>> Internet pages (those in Unicode); a better way is to think about building
> | >>>> better dictionary logic. I may be able to interest exceptionally good
> | >>>> volunteer developers if we can think of a smarter way of creating a
> | >>>> dictionary. For example, we could codify grammar rules to form derivative
> | >>>> words.
> | >>>>
> | >>>> Should we pursue this course?
> | >>>>
> | >>>>
> | >>>>
> | >>>> Sent from the old new iPad!
> | >>>>
> | >>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
> | >>>>
> | >>>> Dear Roopalben,
> | >>>>
> | >>>> I second your concern regarding correct language. I often say that
> | >>>> newspapers are the only LITERATURE most of us end up reading and have
> | >>>> access to. The language and the (increasingly Hindi) words used in them
> | >>>> shape the language of society today, and hence it is great that
> | >>>> you are introducing this course.
> | >>>>
> | >>>> Unfortunately, on the wiki we don't have a spelling correction tool or
> | >>>> dictionary lookup facility. But Vishal Monpara has been developing one.
> | >>>> Gujarati Lexicon has recently developed a pop-up dictionary as well, which
> | >>>> could be adapted for this purpose.
> | >>>>
> | >>>> On gu.wikipedia, there is a lot of content translated from either
> | >>>> English or Hindi, and most of it lacks a natural Gujarati idiom.
> | >>>> When read, these translations look quite artificial. For the course, it
> | >>>> could be a good idea to show such examples and have the attendees correct
> | >>>> them, maybe offline if they are not computer savvy or are hesitant to use
> | >>>> Wikipedia.
> | >>>>
> | >>>> Please let me and the community here know if you have any suggestions on
> | >>>> how we can help with the task you are carrying out.
> | >>>>
> | >>>> Kind Regards,
> | >>>> Dhaval
> | >>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:
> | >>>>
> | >>>>> Basically there are not many good proofreaders available in the
> | >>>>> publishing industry - and the demand is high. That was the main reason for
> | >>>>> starting this course.
> | >>>>>
> | >>>>> Wikipedia is an important source of information. However, the concern
> | >>>>> here is about correct use of language too. Today we see a great many errors
> | >>>>> in Gujarati newspapers, publishing, media and almost everywhere. That is a
> | >>>>> major concern for us.
> | >>>>>
> | >>>>> If Wiki is going to be an important tool for the next generation, we
> | >>>>> have to make sure that it conveys correct language to society.
> | >>>>>
> | >>>>> I would like to know whether any auto-correction of spelling etc. is
> | >>>>> available while editing an article in Wiki?
> | >>>>>
> | >>>>> Thank you.
> | >>>>>
> | >>>>>
> | >>>>> Roopal
> | >>>>>
> | >>>>>
> | >>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <
> | >>>>> kartik.mistry(a)gmail.com> wrote:
> | >>>>>
> | >>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com>
> | >>>>>> wrote:
> | >>>>>> > At Gujarati Sahitya Parishad, we are running proof reading course
> | >>>>>> and we are including a session of modern methods of proof reading, which
> | >>>>>> includes editing on (Guj) Wiki articles.
> | >>>>>> >
> | >>>>>> > Please send suggestions if you have. This is the first batch of
> | >>>>>> students from various fields.
> | >>>>>>
> | >>>>>> A few suggestions (some may be off-topic, sorry for that!)
> | >>>>>> 1. Please follow Wikipedia's guidelines for articles.
> | >>>>>> 2. Make sure each person is logged in before making changes.
> | >>>>>> 3. Please do not change anything other than spelling/grammar etc.
> | >>>>>> 4. If you're at it already, donating pictures of 'સાહિત્યકાર'
> | >>>>>> (litterateurs) from GSP for various articles is a good idea, isn't it? :)
> | >>>>>>
> | >>>>>> Thanks for good work!
> | >>>>>>
> | >>>>>> --
> | >>>>>> Kartik Mistry | IRC: kart_
> | >>>>>> {0x1f1f, kartikm}.wordpress.com
> | >>>>>>
> | >>>>>> _______________________________________________
> | >>>>>> Wikipedia-gu mailing list
> | >>>>>> Wikipedia-gu(a)lists.wikimedia.org
> | >>>>>> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu
> | >>>>>>
> | >>>>>
> | >>>>>
> | >>>>
> | >>>>
> | >>>>
> | >>>
> | >>
> | >>
> | >>
> | >
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (Darwin)
> Comment: GPGTools - http://gpgtools.org
>
> iQIcBAEBAgAGBQJSQ78EAAoJEPC2gDV1D+BTb2kP/1XZiJK74jx3V+13Q5QXxiV4
> CdyRYb4YDlZsdBd1EdXXons6YB5EC9ro9xdljSgDv7MoXdjcfjPY2RHSHYsAqf6I
> 0SyIBAM1sBgBbjNRB9JCHl2d9yG5sRxgF5IFW0oOjWJBT4UF18iBNHk7O0cr5TJR
> V6KX6hGMs7Koft3+cJLSce32laSC2VDi3b3Z4I6EAgongmd8HD+WdIJrJw1q8NP2
> X6lBcCnsobbyA54oET8+i8tIX9tFi32PIcYZI5+mAi3T3jauQ3VJ9kxpqEim6CNj
> 6eI8X0R7tbISlwLOZERNxRcjpbaw2AAXQbKuONJsxaoQgIxC+cm+67VBkkRvbe7k
> qZEyPpBgtlfi7FgiGbG0ljuzbWpo04lErMS3ogtzi8dtyXBy5uSP2uV5B4kip52V
> BtW8gfcX6vuVUoKLEx9e5NNY+Mp99ela8QV5b5FjavBiGyz2SNEBlmXJ4BhGDDj0
> NWrXLUw+VXh8FyJf6m/fUvKYIKS5maKREIBSsKxBleCB3WrflH88nLMrW96BYXFY
> ULP53BkwQCqh72V6XPbXVffes4raS5egn6dFmMvff+WYccZRrCuBLnF/L+OSzgVI
> eaL6Q5scHou9a3UlzLYaL9KoOHoakmOMY+XLg+MFPySXRbQsCot1NsfoDrDGSRLb
> rliQquEJWulHZWC2dpcx
> =w7Ri
> -----END PGP SIGNATURE-----
Has anyone tried the Microsoft Office Gujarati spell checker? It is available
with Office 2010.
Sent from the old new iPad!
On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
Googling "hindi spell checker algorithm" found a number of papers. The
basic idea is to compare how "similar" a word being checked is to a word
known to be correct, where similarity is computed using some algorithm. You
don't store all the ways people can misspell a word. In addition, rules are
used to derive related words from a root word, depending on number, gender,
tense, etc. These rules are more complex in Indic languages than in Western
ones. And I think we may need to look at "clusters" instead of individual
Unicode code points. But all this must have been worked out years ago, maybe
not for Gujarati but for Hindi, Marathi, Bengali. You should check with the
usual suspects (Google, Microsoft, SIL, language researchers etc.).
For OCR you may need something slightly different from spellcheckers that
deal with human errors. Here a more common problem will be mistaking
similar looking letters and joining or splitting of words due to too little
or too much white space.
Ultimately there should be support for language variations too (surati,
kathiawadi, amdavadi etc)!
On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
Dhavalbhai,
As we get text that is generated using OCR, I see the need for a good Gujarati
dictionary. I tried to use the GL dictionary. It was not effective because it
is only a fixed corpus of words and cannot recognize any variation of a word.
In that model, the corpus would need to be possibly over ten times the size of
the GL dictionary's to be useful. Otherwise, it flags too many correct words
as errors.
The same dictionary could be used for Gujarati proof readers.
One way is to generate a larger corpus by scraping words from Gujarati
Internet pages (those in Unicode); a better way is to think about building
better dictionary logic. I may be able to interest exceptionally good
volunteer developers if we can think of a smarter way of creating a
dictionary. For example, we could codify grammar rules to form derivative
words.
Should we pursue this course?
Sent from the old new iPad!
On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
Dear Roopalben,
I second your concern regarding correct language. I often say that
newspapers are the only LITERATURE most of us end up reading and have
access to. The language and the (increasingly Hindi) words used in them
shape the language of society today, and hence it is great that
you are introducing this course.
Unfortunately, on the wiki we don't have a spelling correction tool or
dictionary lookup facility. But Vishal Monpara has been developing one.
Gujarati Lexicon has recently developed a pop-up dictionary as well, which
could be adapted for this purpose.
On gu.wikipedia, there is a lot of content translated from either English
or Hindi, and most of it lacks a natural Gujarati idiom. When read,
these translations look quite artificial. For the course, it could be a good
idea to show such examples and have the attendees correct them, maybe
offline if they are not computer savvy or are hesitant to use Wikipedia.
Please let me and the community here know if you have any suggestions on how we
can help with the task you are carrying out.
Kind Regards,
Dhaval
On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:
> Basically there are not many good proofreaders available in the publishing
> industry - and the demand is high. That was the main reason for starting
> this course.
>
> Wikipedia is an important source of information. However, the concern
> here is about correct use of language too. Today we see a great many errors
> in Gujarati newspapers, publishing, media and almost everywhere. That is a
> major concern for us.
>
> If Wiki is going to be an important tool for the next generation, we have
> to make sure that it conveys correct language to society.
>
> I would like to know whether any auto-correction of spelling etc. is
> available while editing an article in Wiki?
>
> Thank you.
>
>
> Roopal
>
>
> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <kartik.mistry(a)gmail.com> wrote:
>
>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com>
>> wrote:
>> > At Gujarati Sahitya Parishad, we are running proof reading course and
>> we are including a session of modern methods of proof reading, which
>> includes editing on (Guj) Wiki articles.
>> >
>> > Please send suggestions if you have. This is the first batch of
>> students from various fields.
>>
>> A few suggestions (some may be off-topic, sorry for that!)
>> 1. Please follow Wikipedia's guidelines for articles.
>> 2. Make sure each person is logged in before making changes.
>> 3. Please do not change anything other than spelling/grammar etc.
>> 4. If you're at it already, donating pictures of 'સાહિત્યકાર'
>> (litterateurs) from GSP for various articles is a good idea, isn't it? :)
>>
>> Thanks for good work!
>>
>> --
>> Kartik Mistry | IRC: kart_
>> {0x1f1f, kartikm}.wordpress.com
>>
>>
>
>
Greetings from CIS-A2K!
We request the pleasure of your company at the event 'Re-releasing Konkani Vishwakosh & Building Konkani Wikipedia'. The event will take place on Thursday, 26th September, 10am - 11am at the Conference Hall, Goa University, Taleigao.
Upon CIS-A2K's explicit request, Goa University has approved the re-release of Konkani Vishwakosh under a Creative Commons License (CC-BY-SA 3.0) to make it freely available to the public and thus preserve Konkani language and culture in the digital era. This encyclopedia will also serve as one of the main sources for building and writing articles on Konkani Wikipedia (which is currently under incubation).
This is a major win for the Wikimedia movement in India. It is probably the first time ever that a State-run higher education institution has re-released copyrighted content under Creative Commons. We hope more such initiatives will be underway to "Let the Knowledge Go around"!
We'd like you to be a part of this event and help showcase Konkani community and language on a global digital platform such as Konkani Wikipedia.
Please see the link for the invite.[1] We look forward to seeing you at the event.
Best,
Nitika Tandon
Program Manager
Access to Knowledge
The Centre for Internet & Society
[1] https://commons.wikimedia.org/wiki/File:Re-release_of_Konkani_Vishwakosh_un…
Dear All,
On behalf of CIS-A2K, I am glad to share with you that Goa University has
signed an MoU with CIS-A2K.
As part of this MoU, Goa University and CIS-A2K will work together to
digitize “Konkani Vishwakosh” under a Creative Commons license and build a
Digital Knowledge Partnership in order to enhance digital literacy in the
Konkani language, facilitate collaborative knowledge production, and
disseminate the same free of cost through Konkani Wikipedia (currently under
incubation). Goa University and CIS-A2K will co-design and jointly
implement relevant training programmes to achieve this objective.
CIS-A2K is grateful for the support and encouragement received from the Goa
University Vice-Chancellor Dr. Satish Shetye; Prof. Alito Siqueira; Prof.
Priyadarshini Tadkodkar; Dr. Madhavi Sardesai; Dr. Gopakumar; and other
faculty of Goa University.
We are also equally grateful to Wikipedians Harriet Vidyasagar and
Frederick Noronha who have engaged with CIS-A2K team and been a constant
source of support.
Wish us best of luck!
Vishnu
Friends:
At Gujarati Sahitya Parishad, we are running a proofreading course, and we are including a session on modern methods of proofreading, which includes editing (Guj) Wiki articles.
Please send suggestions if you have any. This is the first batch of students from various fields.
Thank you
Roopal
As for proofreading on Gujarati Wikipedia, at present it is mainly
Dhavalbhai who does this work. Ashokbhai does it too, but he cannot give
time to it regularly. I used to proofread as well, but I will not be able
to give time to the wiki for a little while yet. Proofreading is still
pending on older articles. There is certainly a need for a whole team to be
formed for this. With the help of Lexicon, other members can also
contribute to this work. I will be able to set aside time for it after a
while.
-yogesh kavishwar
Friends,
On Wikisource, Project 29, "મૂરખરાજ અને તેના બે ભાઈઓ", has just been completed. Project 28, "વનવૃક્ષો", is about 96% complete. Alongside this, under the new Project number 30, Jhaverchand Meghani's collection of folk tales "સૌરાષ્ટ્રની રસધાર ભાગ-૧" is being uploaded. This book is included in the syllabus for the Gujarati subject of the UPSC examination, so it will be useful to Gujarati students. Friends who would like to take part in this collaborative work should get in touch at the link given below.
https://gu.wikisource.org/wiki/%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%B…
Thanks,
Sushant
Friends,
Thanks to the cooperation of the participating friends, the work of uploading Gandhiji's moral tale "મૂરખરાજ અને તેના બે ભાઈઓ" to Wikisource has been completed. Based on the ideas of Tolstoy, Gandhiji's favourite and world-famous author, this moral tale is about simple living and good conduct. The project started on 07-09-2013 and was completed on 13-09-2013.
Ashokbhai Vaishnav (Ahmedabad), Kartikbhai Mistry, Kokilaben Mistry and Sushant (Mumbai) took part in this project. Many, many thanks to all friends.
Everyone is invited to read and enjoy this old work of Gandhiji's.
Sushant Savla