Wikipedia-gu September 2013

wikipedia-gu@lists.wikimedia.org

14 participants
19 discussions

by Bakul Shah

On Sep 27, 2013, at 9:27 PM, शंतनू <shantanoo(a)gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> +++ Bakul Shah [26-Sep-2013 21:41 +0530]
> | Google for javascript spell check.
>
> 'Search the fine web (STFW) for...' is correct than 'Google for...'

Languages evolve.
http://searchenginewatch.com/article/2058373/Google-Now-A-Verb-In-The-Oxfor…

> | typo.js for example can use the same
> | dictionaries and affix rules as hunspell (what is used in OpenOffice and
> | Firefox) so any improvements we make with guj spellchecking will work
> | with it as well. It can do spell checking in multiple languages.
>
> I have not come across good FOSS spell check algorithm implementation
> for Indic languages. 'typo.js' seems to be nice. Has it been tried with
> Indic languages? How is the accuracy?

No idea if typo.js is good enough but it is a start if you want something in javascript and it is opensource.

> | Hunspell can be used locally to spell check .txt files and I think it
> | knows a few other formats. It also allows use of private dictionaries
> | to store words the general dictionary doesn't know about but you use
> | such as proper names.
> |
> | The difficulty with Gujarati and other Indian scripts is that most
> | command line programs don't know how to display them properly so
> | you pretty much need to spellcheck using a word processor program or
> | a browser.
>
> Am little confused regarding the initial aim. From the discussion,
> i understand there are following things which need to be done.
>
> - - Correct the words which have been there already on the wiki

This may work for some misspelled words but typical algorithms don't have needed semantic information to make the right choice when there are two or more correctly spelled words that are pretty close. I believe that even for text already on the wiki you'd want an interactive spellchecker.
> - - Make sure that when user types incorrect words, correct spelling is
> suggested and let user decide what to do.
>
> On mrwiki we had the first one. Correcting the spelling (not
> grammatical) errors.
>
> IMO, start using what is available and simultaneously work on other
> ideas would be better than not using any tool(s).

Right. But what is available in open source doesn't seem to work. At least in Gujarati (and the little I tried, in Hindi too). My guess, which may be completely wrong, is that there are at least two issues.

1. The affix rules need to be expanded greatly.
2. Maybe algorithms written originally for European languages don't work well with Indic languages. There may be implicit assumptions in the code.

Hunspell's affix rule format is documented but there is a bunch of work to be done here. Hopefully Kartik Mistry and others maintaining other Indic dictionaries have a better handle on this.

The only way I know how to understand 2. is to write some code. But hunspell has over 38,000 lines of C++ code -- so I don't want to touch it. But it should not be hard to figure this out using a separate little program and then if it makes sense we can merge it back in hunspell. But note: I am telling you what I will do; not what you or anyone else should do!

> Another observation, many people are talking about dictionaries. IMO, it
> should be *word list*. For spellcheck one does not require dictionary.
> Word list is necessary and sufficient. Efforts required for dictionary
> are huge compared to generating the word list.

In the software world words get used all the time in a way not originally intended. For example we "execute" a program! And when it "hangs" we "kill" it! In particular, the word "dictionary" is used in multiple ways. Languages evolve and get messier. So it goes.

>
> There are few other issues which may result in failure of spell check.
> We are assuming that data entered is using correct unicode character
> set.
But in case of devnagari we have seen some issues. e.g. > > स्वतः is written as स्वत: > दऱ्यावर is written as द-यावर > > This will probably break the logic. > Another example, in case of roman script there is only one way of > writing the word. In case of devnagari same thing can be written > in different ways. e.g. > > अॅ ॲ > दऱ्यावर दऱ्यावर दर्‍यावर (same output is rendered) Even in English multiple spellings are allowed (-ize vs -ise suffix). Even in the same dialect (American or British) a word may be spelled differently: hyphenated words lose the hyphen over time. Well-known vs wellknown. And spelling evolves with time. For example, the old Sarth Jodnikosh uses ગૂજરાતી but now ગુજરાતી is the accepted spelling. In any case, we can't capture all these nuances. > > (why and when such things happen could be discussed in another thread) I suggest we set up a google group 'indic-spell' to discuss spelling issues. Unless a group already exists for this purpose. Reply to just me (and *not* the wikipedia-gu mailing list) to discuss this! > > - -- > शंतनू > > | > | On Sep 25, 2013, at 9:58 PM, शंतनू महाजन <shantanoo(a)gmail.com> wrote: > | > | > -----BEGIN PGP SIGNED MESSAGE----- > | > Hash: SHA1 > | > > | > Hi Dhaval, > | > Reply inline. > | > > | > +++ Dhaval S. Vyas [26-Sep-2013 00:31 +0530] > | > | Something like that I do on gu.wiki with my bot. Correct most of the > | > | obvious spelling or grammatical mistakes in Gujarati. > | > > | > Nice to know about it. Can you share more details regarding following: > | > - - In which language (python/ruby/perl/...) the bot is written? > | > - - How do you maintain the updated list of words/mistakes which need to > | > be corrected? Can you share the current list? > | > > | > How are you fixing the grammatical mistakes/errors? Is that code > | > specific to Gujarati or can it be used easily for other languages? 
> | > > | > | Here we are looking > | > | for something broader, which could be used not only online or onwiki, but > | > | offline and/or at least offwiki. > | > > | > Can you elaborate more on this? By offline do you mean that user > | > provides some file (lets consider unicode .txt file), runs the program > | > locally on its machine, and spell check is done on that data? > | > > | > | Also, bot corrects errors already commited, while it would be better to > | > | have good spell checker handy that can correct on-the-fly, with user input > | > | obviously. This is more useful as it would help users learn correct Jodani, > | > | in longer term. Users could build their own pool, etc will be other > | > | benefits. > | > > | > Had thought about it in past, but since did not find any javascript > | > expertise who could help with writing the auto-correction code, was not > | > able to do much regarding it. :(. > | > > | > - -- > | > शंतनू > | > > | > | > | > | Regards, > | > | Dhaval > | > | On 25 Sep 2013 19:46, "Arnav Sonara" <sonara.arnav(a)gmail.com> wrote: > | > | > | > | > નમસ્કાર મિત્રો :-) > | > | > > | > | > બાર ગાંઉએ બોલી બદલાય કે બાર ગામે બોલી બદલાય, હૂતો કે હુતી કે સુતી કે સુતો, > | > | > GPL, BSD કે પછી CC by SA ;-) right now our main concern is to have/develop > | > | > such a tool which can replace the commonly misspelled words. > | > | > > | > | > I remembered that Marathi Wikipedia Community is using such a tool and got > | > | > in touch with one of the guys running it. > | > | > > | > | > So they have a common list of words with correct spelling (hrsv and > | > | > dirgha) against what people generally tend to type incorrectly, and their > | > | > bot replaces the incorrect one. > | > | > > | > | > The word list can be found here<http://toolserver.org/~shantanoo/replace_words.txt.html> and > | > | > the code for reading/modifying wiki pages can be found here<https://github.com/pune-lug/supersimplemediawiki>. 
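A word-list replacement of the kind described for the Marathi bot might look roughly like the sketch below. It is only an illustration: the names `REPLACEMENTS` and `fix_text` are invented here and are not taken from the linked word list or bot code, and the single sample entry is the ગૂજરાતી / ગુજરાતી pair mentioned elsewhere in this thread. The sketch splits on whitespace instead of relying on regex word boundaries, on the assumption that Indic vowel signs and the virama may not be treated as word characters by every regex engine.

```python
import re

# Hypothetical sample entry: misspelled form -> accepted form.
# A real list would be much longer and maintained separately.
REPLACEMENTS = {
    "ગૂજરાતી": "ગુજરાતી",
}

# Alternating runs of whitespace and non-whitespace, both kept,
# so the original spacing survives the round trip.
_TOKEN = re.compile(r"\s+|\S+")

# Punctuation that may cling to a word; an assumption, extend as needed.
_PUNCT = ".,;:!?\"'()[]{}।"


def fix_text(text, replacements=REPLACEMENTS):
    """Replace whole words found in the replacement list, leaving the rest as is."""
    out = []
    for token in _TOKEN.findall(text):
        core = token.strip(_PUNCT)
        if core in replacements:
            # Swap only the word itself; surrounding punctuation is preserved.
            token = token.replace(core, replacements[core], 1)
        out.append(token)
    return "".join(out)
```

Because only exact whole-word matches are replaced, this cannot choose between two correctly spelled words that are close to each other; that case still needs a person to decide.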
> | > | > So if anyone is up for taking this task, please go ahead and start > | > | > developing the tool :-D maybe I can help if and when it is possible. > | > | > > | > | > I've also CC'd a friend of mine from Marathi community who helped me > | > | > understanding this, feel free to connect to him. I'm sure he will love to > | > | > help you out. Thanks. > | > | > > | > | > > | > | > > | > | > Thanks, > | > | > Arnav <http://arnavs.com/>. > | > | > (User:Rangilo_Gujarati)<http://en.wikipedia.org/wiki/User:Rangilo_Gujarati> > | > | > . > | > | > > | > | > > | > | > On Sun, Sep 22, 2013 at 5:15 AM, Dhaval S. Vyas <dsvyas(a)gmail.com> wrote: > | > | > > | > | >> Dear Bakulbhai, > | > | >> > | > | >> > | > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser. > | > | >>> Or in a word processor and then copy and paste into the browser.] > | > | >> > | > | >> > | > | >> Most of us prefer to work straight on the wiki as wikiformatting could be > | > | >> done simultaneously. But, eitherway, it is OK so far as we have the > | > | >> spellchecker somewhere. > | > | >> > | > | >> > | > | >>> In terms of software, any new software I write will be under a BSD like > | > | >>> license (basically: you can do anything you do with it, including use it in > | > | >>> commercial work, but a) don't hold me responsible for anything and b) don't > | > | >>> take credit for it). Any existing software I enhance is under its existing > | > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but > | > | >>> the affix related work in it is under BSD since it came from ispell). > | > | >> > | > | >> > | > | >> That's simply great, we are lucky to have come across a person like you > | > | >> and Thank Rajeshbhai for looping in :-) > | > | >> > | > | >> > | > | >>> I will leave any organizational issues to Rajesh and other more capable. 
> | > | >>> Rajesh, rather than use individual email addresses, pick an existing > | > | >>> mailing list to discuss this further. > | > | >> > | > | >> Suggested a different thread because, all interested parties have > | > | >> already responded. Also, this being a mailing list, many people have opted > | > | >> for a digest instead of individual emails. So, if we continued discussion > | > | >> here, their feedback might not be in time (e.g. Vihangbhai's reply today), > | > | >> while if it came as personal email, it could be replied by interested > | > | >> person. But, both options have their own pros and cons, so will leave it > | > | >> entirely on you all. > | > | >> > | > | >> Regards, > | > | >> Dhaval > | > | >> > | > | >> > | > | >> > | > | >> > | > | >> > | > | >> On Sat, Sep 21, 2013 at 10:46 PM, Bakul Shah <bakul(a)bitblocks.com> wrote: > | > | >> > | > | >>> This is perhaps not the right mailing list to continue this discussion > | > | >>> but I am already on 3 or 4 gujarati related lists.... So for now I will > | > | >>> continue here. > | > | >>> > | > | >>> On Sep 20, 2013, at 9:36 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote: > | > | >>> > | > | >>> Once we have such functionality, and it is available under CC > | > | >>> licence/public domain, it could be embedded in wiki (if not easily, without > | > | >>> much trouble). There are several ways we could take it on board. > | > | >>> > | > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser. > | > | >>> Or in a word processor and then copy and paste into the browser.] > | > | >>> > | > | >>> *Licensing:* > | > | >>> > | > | >>> The spellchecking dictionary needs the kind of enhancements I talked > | > | >>> about earlier. One starting point is the word list in *dict-gu_IN.oxt*OpenOffice extension. > | > | >>> > | > | >>> Before doing any real work on it, we need to get the licensing issues > | > | >>> clarified. 
*dict-gu_IN.oxt *has a README file that says the original > | > | >>> word list was prepared by Utkarsh Project volunteers and the list has GPL. > | > | >>> GPL is actually for software so it is strange to see a dictionary using > | > | >>> GPL. See also: > | > | >>> > | > | >>> > | > | >>> http://stackoverflow.com/questions/4329467/is-it-okay-to-include-gpled-file… > | > | >>> > | > | >>> I don't want to get into a discussion of pros and cons of GPL but I am > | > | >>> unwilling to work on anything GPL due to its more restrictive licensing. > | > | >>> The best option in my view is to ask the Utkarsh volunteers to make it > | > | >>> public domain or licence it under a dual license (GPL as well as BSD). I > | > | >>> checked out http://www.utkarsh.org but it is not clear how to do this. > | > | >>> Rajesh, can you help sort this out? > | > | >>> > | > | >>> The other alternative is to do what Rajesh suggested, which is to feed > | > | >>> lots of text to a program and derive a list. If we do this, we will make > | > | >>> this completely public domain. > | > | >>> > | > | >>> In terms of software, any new software I write will be under a BSD like > | > | >>> license (basically: you can do anything you do with it, including use it in > | > | >>> commercial work, but a) don't hold me responsible for anything and b) don't > | > | >>> take credit for it). Any existing software I enhance is under its existing > | > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but > | > | >>> the affix related work in it is under BSD since it came from ispell). > | > | >>> > | > | >>> Can we all, who has interest in developing such functionality and > | > | >>> passion for the language as well as expertise, form a taskforce and take it > | > | >>> off list? I will be delighted to work on it in whatever capacity I can. > | > | >>> > | > | >>> I will leave any organizational issues to Rajesh and other more capable. 
> | > | >>> Rajesh, rather than use individual email addresses, pick an existing > | > | >>> mailing list to discuss this further. > | > | >>> > | > | >>> On 20 Sep 2013 15:48, "Bakul Shah" <bakul(a)bitblocks.com> wrote: > | > | >>> > | > | >>>> Rajesh, > | > | >>>> > | > | >>>> Proof readers will have to use a word processor or browser as other > | > | >>>> tools are not very good at displaying Indic languages. > | > | >>>> > | > | >>>> Googledoc is no good at spell checking. > | > | >>>> > | > | >>>> OpenOffice (or LibreOffice) has a number of dictionaries including for > | > | >>>> Gujarati. I suspect it doesn't work well & we have work to do. I have no > | > | >>>> desire or time to work on openOffice -- it is massive -- but there may be a > | > | >>>> way.... > | > | >>>> > | > | >>>> [The rest is a bit too technical. Feel free to skip] > | > | >>>> > | > | >>>> There are a number of open source standalong spell checking programs > | > | >>>> such as ispell, aspell, hunspell etc. Most were derived from or influenced > | > | >>>> by the original unix spell program written by S.C.Johnson. For the curious, > | > | >>>> here's a paper by Doug McIlroy about it: > | > | >>>> http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf > | > | >>>> > | > | >>>> ispell was pre-unicode and only worked with western languages but it > | > | >>>> made some major advances that seemed to be carried over to aspell. I dug > | > | >>>> into apell some and it seems to support Gujarati. > | > | >>>> > | > | >>>> Anyway, aspell can be used from other programs (has an API), can handle > | > | >>>> multiple languages etc. Its documentation is not sufficient (IMHO) to > | > | >>>> understand affix rules. ispell documentation has more details. I used to > | > | >>>> know ispell fairly well but that was 20+ years ago! 
> | > | >>>>
> | > | >>>> The *dict-gu.oxt* extension (used in OpenOffice) contains a file
> | > | >>>> called *gu_IN.dic* that contains a word list and *gu_IN.aff* that
> | > | >>>> should have *affix* rules for Gujarati but it is very small (compared
> | > | >>>> to English) and seems to need a bunch more work. I see that this extension
> | > | >>>> is maintained by Kartik Mistry (did I see an email from him in this
> | > | >>>> thread?) so maybe he and I can figure out how to add more affix rules?
> | > | >>>>
> | > | >>>> The basic idea with some example: given a rule like
> | > | >>>>
> | > | >>>> BOTH/R
> | > | >>>>
> | > | >>>> This can expand into BOTH and BOTHER, as -ER is a common English
> | > | >>>> extension (cart, carter and so on). Another example: we may have
> | > | >>>>
> | > | >>>> ACIDIFY/NR
> | > | >>>>
> | > | >>>> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to
> | > | >>>> IER). These rules make the spellchecking dictionary quite compact as well
> | > | >>>> as indicate how a word should be taken apart for efficient matching.
> | > | >>>>
> | > | >>>> aspell is capable of deriving such rules for English but I suspect it
> | > | >>>> will need help in Indian languages. This is where the *.aff* file
> | > | >>>> comes in. So for example in Gujarati we would like to render the following
> | > | >>>> words in a single rule
> | > | >>>>
> | > | >>>> *ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના*
> | > | >>>>
> | > | >>>> etc. For this we can write something like
> | > | >>>>
> | > | >>>> *ગધેડ/XYZABC*
> | > | >>>>
> | > | >>>> where each letter denotes a particular suffix. And yet, there is no
> | > | >>>> such word as *ગધેડ* -- And I am not sure these programs can handle
> | > | >>>> this. And we have compound words such as *ઘોડાગાડી* -- which will
> | > | >>>> require more complex rules. In fact Indic languages should have a much
> | > | >>>> larger set of affix rules than English!
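The stem-plus-flags idea above can be sketched in a few lines. This is a toy, not Hunspell's actual .aff format: the flag letters and the suffix assigned to each are made up for illustration, and `SUFFIXES` and `expand` are hypothetical names. The `stem_is_word` switch covers the case raised above, where the bare stem is not itself a valid word.

```python
# Hypothetical flag table: each flag letter names one suffix appended to the stem.
# The letters and their assignments are invented; they are not from gu_IN.aff.
SUFFIXES = {
    "X": "\u0acb",                    # vowel sign O
    "Y": "\u0ac0",                    # vowel sign II
    "Z": "\u0ac1\u0a82",              # vowel sign U + anusvara
    "A": "\u0abe",                    # vowel sign AA
    "B": "\u0abe\u0aa8\u0ac1\u0a82",  # AA + NA + U + anusvara
    "C": "\u0abe\u0aa8\u0ac0",        # AA + NA + II
}


def expand(entry, suffixes=SUFFIXES, stem_is_word=False):
    """Expand an entry of the form 'stem/FLAGS' into its surface forms."""
    stem, _, flags = entry.partition("/")
    # Include the bare stem only when it is a real word on its own.
    forms = [stem] if stem_is_word else []
    forms += [stem + suffixes[flag] for flag in flags]
    return forms
```

With this table, `expand("ગધેડ/XYZ")` yields the three forms ending in the O, II and U-anusvara signs, and leaves out the bare stem. Hunspell's affix file documentation describes, as far as I recall, a NEEDAFFIX option for stems that are valid only when affixed; whether it is enough for Gujarati would need to be tried.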
We should also check out what is > | > | >>>> being done for Hindi. > | > | >>>> > | > | >>>> Next, we need rules for `similar' letters (or letters near each other > | > | >>>> on a keyboard) so that if there is not an exact match, we first try such > | > | >>>> similar or neighbor letters. > | > | >>>> > | > | >>>> Anyway, once we fix up the dictionary, very likely the same dictionary > | > | >>>> can be used with word processors such as openOffice etc. An easier idea may > | > | >>>> be to do a web based frontend. > | > | >>>> > | > | >>>> These programs do a lot of work: create dictionaries, read various file > | > | >>>> formats, update screen, etc. etc. that make them complicated and hard to > | > | >>>> modify. ideally I would want a single function for checking: > | > | >>>> check(Speller, String) > | > | >>>> > | > | >>>> That returns a quad: (correctly spelled prefix, misspelled word, list > | > | >>>> of suggestions, remaining string). A separate program can generate the > | > | >>>> dictionaries. The Speller object will read whatever dictionaries it needs. > | > | >>>> But I don't have time to implement this. > | > | >>>> > | > | >>>> Bakul > | > | >>>> > | > | >>>> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com> > | > | >>>> wrote: > | > | >>>> > | > | >>>> Has anyone tried Microsoft office Gujarati spell checker? It is > | > | >>>> available with office 2010. > | > | >>>> > | > | >>>> Sent from the old new iPad! > | > | >>>> > | > | >>>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote: > | > | >>>> > | > | >>>> Googling "hindi spell checker algorithm" found a number of papers. The > | > | >>>> basic idea is to compare how "similar" a word being checked is to a word > | > | >>>> known to be correct, where similarity is computed using some algorithm. You > | > | >>>> don't store all the ways people can misspell a word. 
plus logic is used to > | > | >>>> derive related words from a root word, which depend on plurality, gender, > | > | >>>> tense, etc. These rules are more complex in Indic languages than western. > | > | >>>> And i think we may need to look at "clusters" instead of individual > | > | >>>> unicode points. But all this must have been worked years ago. May be not > | > | >>>> for Gujarati but for Hindi, Marathi, Bengali. You should check with the > | > | >>>> usual suspects (google, Microsoft, SIL, language researchers etc.). > | > | >>>> > | > | >>>> For OCR you may need something slightly different than spellcheckers > | > | >>>> that deal with human errors. Here a more common problem will be mistaking > | > | >>>> similar looking letters and joining or splitting of words due to too little > | > | >>>> of too much white space. > | > | >>>> > | > | >>>> Ultimately there should be support for language variations too (surati, > | > | >>>> kathiawadi, amdavadi etc)! > | > | >>>> > | > | >>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> > | > | >>>> wrote: > | > | >>>> > | > | >>>> Dhavalbhai, > | > | >>>> > | > | >>>> As we get text that is generated using OCR, I see need for a good > | > | >>>> Gujarati dictionary. I tried to use GL dictionary. It was not effective > | > | >>>> because it has corpus of words. It can not recognize any variation on the > | > | >>>> word. In that model, we need possibly over ten times the corpus GL > | > | >>>> dictionary has to be useful. Otherwise, it finds error with too many > | > | >>>> correct words. > | > | >>>> > | > | >>>> The same dictionary could be used for Gujarati proof readers. > | > | >>>> > | > | >>>> One way is to generate larger corpus by scrapping words from Gujarati > | > | >>>> Internet pages (those in Unicode), a better way is to think about building > | > | >>>> better dictionary logic. 
I may be able to interest exceptionally good > | > | >>>> volunteer developers if we can think of smarter way of creating a > | > | >>>> dictionary. For example, we could codify grammar rules to form derivative > | > | >>>> words. > | > | >>>> > | > | >>>> Should we pursue this course? > | > | >>>> > | > | >>>> > | > | >>>> > | > | >>>> Sent from the old new iPad! > | > | >>>> > | > | >>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote: > | > | >>>> > | > | >>>> Dear Roopalben, > | > | >>>> > | > | >>>> I second your concern regarding the correct language. I often say that > | > | >>>> Newspapers are the only LITERATURE most of us end up reading and have > | > | >>>> access to. The language and (more becoming common Hindi) words used in them > | > | >>>> shapes the language of society in present day and hence it is great that > | > | >>>> you are introducing this course. > | > | >>>> > | > | >>>> Unfortunately, on wiki we don't have spelling correction tool or > | > | >>>> dictionary lookup facility. But, Vishal Monpara has been developing one. > | > | >>>> Gujarati Lexicon has recently developed pop-up dictionary as well, which > | > | >>>> could be adapted for this purpose. > | > | >>>> > | > | >>>> On gu.wikipedia, there is a lot of content translated from either > | > | >>>> English or Hindi, and most of these lack the original Gujarati language. > | > | >>>> When read, these translations look so artificial. For the course, it could > | > | >>>> be good idea to show such examples and get the course attendees correct it, > | > | >>>> may be offline if they are not computer savvy or hesitant to use wikipedia. > | > | >>>> > | > | >>>> Please let me and community here know if you have any suggestions on > | > | >>>> how we can help with the task you are carrying out. 
> | > | >>>> > | > | >>>> Kind Regards, > | > | >>>> Dhaval > | > | >>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote: > | > | >>>> > | > | >>>>> Basically there are not many good proofreaders available in the > | > | >>>>> publishing industry - and the demand is high. That was the main reason for > | > | >>>>> starting this course. > | > | >>>>> > | > | >>>>> Wikipedia is an important source for information. However, the concern > | > | >>>>> here is about correct use of language too. Today we see a lot many errors > | > | >>>>> in Gujarati newspapers, publishing, media and almost everywhere. That is a > | > | >>>>> high concern for us. > | > | >>>>> > | > | >>>>> If Wiki is going to be an important tool for the next generation, we > | > | >>>>> Have to make sure that it conveys correct language to the society. > | > | >>>>> > | > | >>>>> I would like to know, whether any auto-correction of spelling etc. are > | > | >>>>> available while editing an article in Wiki ? > | > | >>>>> > | > | >>>>> Thank you. > | > | >>>>> > | > | >>>>> > | > | >>>>> Roopal > | > | >>>>> > | > | >>>>> > | > | >>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry < > | > | >>>>> kartik.mistry(a)gmail.com> wrote: > | > | >>>>> > | > | >>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com> > | > | >>>>>> wrote: > | > | >>>>>> > At Gujarati Sahitya Parishad, we are running proof reading course > | > | >>>>>> and we are including a session of modern methods of proof reading, which > | > | >>>>>> includes editing on (Guj) Wiki articles. > | > | >>>>>> > > | > | >>>>>> > Please send suggestions if you have. This is the first batch of > | > | >>>>>> students from various fields. > | > | >>>>>> > | > | >>>>>> Few suggestions (some may be offtopic, sorry for that!) > | > | >>>>>> 1. Please follow Wikipedia's guideline for article. > | > | >>>>>> 2. Make sure person is logged in before making changes. > | > | >>>>>> 3. 
Please do not change anything other than spelling/grammar etc. > | > | >>>>>> 4. If you're that already, donating pictures of 'સાહિત્યકાર' in > | > | >>>>>> various articles from GSP, is good idea. Isn't it? :) > | > | >>>>>> > | > | >>>>>> Thanks for good work! > | > | >>>>>> > | > | >>>>>> -- > | > | >>>>>> Kartik Mistry | IRC: kart_ > | > | >>>>>> {0x1f1f, kartikm}.wordpress.com > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.14 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQIcBAEBAgAGBQJSRlrIAAoJEPC2gDV1D+BTzeAP/2YQf6jDK0aQS6p/ArFsAZ0E > 0RrGBpbAjXPYSLiDVo5R163Fkae3zF2mOv76LeNyCqYZdfwUx80xudeG8Ytiph4M > lq0phdsRCF+Hhdewx3aLGIeekQJKRXXkXFjEkIZJPNlnhmG5wwJ9O86Nx3KQI2vu > QYBlEr+WHx73IdUiujmtM4uAfQD1NdB6M8cs136+A2I2rCAdFjgErLqvLHaDeZAR > D1ih/6lNQnI3PKRFpwCIKq9GKcccTG6Thf4QKgTIKfB18uYVnzBIpP/t+J4y/+zT > PYNVK3v4pEWaHdtQa/QVet4Q0oXaVeXbpvW2BagbvdoqTLyQZiMiOGHQRUhmOWiA > +3dBI5McFuRRHCiV+HvqgUwQPgb/i3Q9YqAmSHGRqd1+2EnALYEVJgtmT0f/AL5v > /lwBXk70NUD18x6BhNiAJg0O4QeGyVs1kalG5xX2P95WG8LCLwiYhfMX9nBLi5Tb > 5LxPSFT/d+sKaqjju69vrukjSoN3hosDUt3Kn5vDDnB2Ep+ba0Y/ROwSZGkTv/VW > +rbL8Cn+ZCZF4uRopyMc3jfG3VnrQK6HzpIz+4LDUqPnlgbxLmabU4kPIZ1+LZEc > RJfa3EUi3AHzB68Jd1lLi9oQ0mbT+pi7QjS70AoB5d8Cn+rIWZyPln5uASS3pyDf > Em7XlKr5Tp2u+rWXvB+L > =wcOi > -----END PGP SIGNATURE-----
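The single `check(Speller, String)` entry point wished for earlier in this thread, returning a quad of (correctly spelled prefix, misspelled word, list of suggestions, remaining string), could be sketched as below. Everything here is an assumption made for illustration: the `Speller` class just holds a word list, and suggestions come from plain Levenshtein edit distance, which is only one possible notion of "similar". Distance is counted per code point, so it does not yet treat an Indic cluster (consonant plus vowel signs) as one unit.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


class Speller:
    """Toy speller: a set of known words plus a distance cutoff."""

    def __init__(self, words, max_distance=2):
        self.words = set(words)
        self.max_distance = max_distance

    def suggest(self, word):
        scored = sorted((edit_distance(word, w), w) for w in self.words)
        return [w for d, w in scored if d <= self.max_distance]


def check(speller, text):
    """Return (correct_prefix, misspelled_word, suggestions, rest)."""
    pos = 0
    for word in text.split():
        start = text.index(word, pos)   # locate this word in the original text
        end = start + len(word)
        if word not in speller.words:
            return text[:start], word, speller.suggest(word), text[end:]
        pos = end
    return text, "", [], ""
```

A caller would loop: show the misspelled word with its suggestions, let the user pick, then call `check` again on the remaining string. Scanning the whole word list for every suggestion is slow for large lists; a real implementation would need an index.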


Re: [Wikipedia-gu] Proof reading

by Bakul Shah

Google for javascript spell check. typo.js for example can use the same dictionaries and affix rules as hunspell (what is used in OpenOffice and Firefox) so any improvements we make with guj spellchecking will work with it as well. It can do spell checking in multiple languages. Hunspell can be used locally to spell check .txt files and I think it knows a few other formats. It also allows use of private dictionaries to store words the general dictionary doesn't know about but you use such as proper names. The difficulty with Gujarati and other Indian scripts is that most command line programs don't know how to display them properly so you pretty much need to spellcheck using a word processor program or a browser. On Sep 25, 2013, at 9:58 PM, शंतनू महाजन <shantanoo(a)gmail.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Dhaval, > Reply inline. > > +++ Dhaval S. Vyas [26-Sep-2013 00:31 +0530] > | Something like that I do on gu.wiki with my bot. Correct most of the > | obvious spelling or grammatical mistakes in Gujarati. > > Nice to know about it. Can you share more details regarding following: > - - In which language (python/ruby/perl/...) the bot is written? > - - How do you maintain the updated list of words/mistakes which need to > be corrected? Can you share the current list? > > How are you fixing the grammatical mistakes/errors? Is that code > specific to Gujarati or can it be used easily for other languages? > > | Here we are looking > | for something broader, which could be used not only online or onwiki, but > | offline and/or at least offwiki. > > Can you elaborate more on this? By offline do you mean that user > provides some file (lets consider unicode .txt file), runs the program > locally on its machine, and spell check is done on that data? > > | Also, bot corrects errors already commited, while it would be better to > | have good spell checker handy that can correct on-the-fly, with user input > | obviously. 
This is more useful as it would help users learn correct Jodani, > | in longer term. Users could build their own pool, etc will be other > | benefits. > > Had thought about it in past, but since did not find any javascript > expertise who could help with writing the auto-correction code, was not > able to do much regarding it. :(. > > - -- > शंतनू > > | > | Regards, > | Dhaval > | On 25 Sep 2013 19:46, "Arnav Sonara" <sonara.arnav(a)gmail.com> wrote: > | > | > નમસ્કાર મિત્રો :-) > | > > | > બાર ગાંઉએ બોલી બદલાય કે બાર ગામે બોલી બદલાય, હૂતો કે હુતી કે સુતી કે સુતો, > | > GPL, BSD કે પછી CC by SA ;-) right now our main concern is to have/develop > | > such a tool which can replace the commonly misspelled words. > | > > | > I remembered that Marathi Wikipedia Community is using such a tool and got > | > in touch with one of the guys running it. > | > > | > So they have a common list of words with correct spelling (hrsv and > | > dirgha) against what people generally tend to type incorrectly, and their > | > bot replaces the incorrect one. > | > > | > The word list can be found here<http://toolserver.org/~shantanoo/replace_words.txt.html> and > | > the code for reading/modifying wiki pages can be found here<https://github.com/pune-lug/supersimplemediawiki>. > | > So if anyone is up for taking this task, please go ahead and start > | > developing the tool :-D maybe I can help if and when it is possible. > | > > | > I've also CC'd a friend of mine from Marathi community who helped me > | > understanding this, feel free to connect to him. I'm sure he will love to > | > help you out. Thanks. > | > > | > > | > > | > Thanks, > | > Arnav <http://arnavs.com/>. > | > (User:Rangilo_Gujarati)<http://en.wikipedia.org/wiki/User:Rangilo_Gujarati> > | > . > | > > | > > | > On Sun, Sep 22, 2013 at 5:15 AM, Dhaval S. Vyas <dsvyas(a)gmail.com> wrote: > | > > | >> Dear Bakulbhai, > | >> > | >> > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser. 
> | >>> Or in a word processor and then copy and paste into the browser.] > | >> > | >> > | >> Most of us prefer to work straight on the wiki as wikiformatting could be > | >> done simultaneously. But, eitherway, it is OK so far as we have the > | >> spellchecker somewhere. > | >> > | >> > | >>> In terms of software, any new software I write will be under a BSD like > | >>> license (basically: you can do anything you do with it, including use it in > | >>> commercial work, but a) don't hold me responsible for anything and b) don't > | >>> take credit for it). Any existing software I enhance is under its existing > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but > | >>> the affix related work in it is under BSD since it came from ispell). > | >> > | >> > | >> That's simply great, we are lucky to have come across a person like you > | >> and Thank Rajeshbhai for looping in :-) > | >> > | >> > | >>> I will leave any organizational issues to Rajesh and other more capable. > | >>> Rajesh, rather than use individual email addresses, pick an existing > | >>> mailing list to discuss this further. > | >> > | >> Suggested a different thread because, all interested parties have > | >> already responded. Also, this being a mailing list, many people have opted > | >> for a digest instead of individual emails. So, if we continued discussion > | >> here, their feedback might not be in time (e.g. Vihangbhai's reply today), > | >> while if it came as personal email, it could be replied by interested > | >> person. But, both options have their own pros and cons, so will leave it > | >> entirely on you all. > | >> > | >> Regards, > | >> Dhaval > | >> > | >> > | >> > | >> > | >> > | >> On Sat, Sep 21, 2013 at 10:46 PM, Bakul Shah <bakul(a)bitblocks.com> wrote: > | >> > | >>> This is perhaps not the right mailing list to continue this discussion > | >>> but I am already on 3 or 4 gujarati related lists.... So for now I will > | >>> continue here. 
> | >>> > | >>> On Sep 20, 2013, at 9:36 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote: > | >>> > | >>> Once we have such functionality, and it is available under CC > | >>> licence/public domain, it could be embedded in wiki (if not easily, without > | >>> much trouble). There are several ways we could take it on board. > | >>> > | >>> [No need to "embed" in wikipedia. You edit articles in your web browser. > | >>> Or in a word processor and then copy and paste into the browser.] > | >>> > | >>> *Licensing:* > | >>> > | >>> The spellchecking dictionary needs the kind of enhancements I talked > | >>> about earlier. One starting point is the word list in *dict-gu_IN.oxt*OpenOffice extension. > | >>> > | >>> Before doing any real work on it, we need to get the licensing issues > | >>> clarified. *dict-gu_IN.oxt *has a README file that says the original > | >>> word list was prepared by Utkarsh Project volunteers and the list has GPL. > | >>> GPL is actually for software so it is strange to see a dictionary using > | >>> GPL. See also: > | >>> > | >>> > | >>> http://stackoverflow.com/questions/4329467/is-it-okay-to-include-gpled-file… > | >>> > | >>> I don't want to get into a discussion of pros and cons of GPL but I am > | >>> unwilling to work on anything GPL due to its more restrictive licensing. > | >>> The best option in my view is to ask the Utkarsh volunteers to make it > | >>> public domain or licence it under a dual license (GPL as well as BSD). I > | >>> checked out http://www.utkarsh.org but it is not clear how to do this. > | >>> Rajesh, can you help sort this out? > | >>> > | >>> The other alternative is to do what Rajesh suggested, which is to feed > | >>> lots of text to a program and derive a list. If we do this, we will make > | >>> this completely public domain. 
> | >>> > | >>> In terms of software, any new software I write will be under a BSD like > | >>> license (basically: you can do anything you do with it, including use it in > | >>> commercial work, but a) don't hold me responsible for anything and b) don't > | >>> take credit for it). Any existing software I enhance is under its existing > | >>> license. Note that ispell was under BSD license. aspell is under GPL (but > | >>> the affix related work in it is under BSD since it came from ispell). > | >>> > | >>> Can we all, who has interest in developing such functionality and > | >>> passion for the language as well as expertise, form a taskforce and take it > | >>> off list? I will be delighted to work on it in whatever capacity I can. > | >>> > | >>> I will leave any organizational issues to Rajesh and other more capable. > | >>> Rajesh, rather than use individual email addresses, pick an existing > | >>> mailing list to discuss this further. > | >>> > | >>> On 20 Sep 2013 15:48, "Bakul Shah" <bakul(a)bitblocks.com> wrote: > | >>> > | >>>> Rajesh, > | >>>> > | >>>> Proof readers will have to use a word processor or browser as other > | >>>> tools are not very good at displaying Indic languages. > | >>>> > | >>>> Googledoc is no good at spell checking. > | >>>> > | >>>> OpenOffice (or LibreOffice) has a number of dictionaries including for > | >>>> Gujarati. I suspect it doesn't work well & we have work to do. I have no > | >>>> desire or time to work on openOffice -- it is massive -- but there may be a > | >>>> way.... > | >>>> > | >>>> [The rest is a bit too technical. Feel free to skip] > | >>>> > | >>>> There are a number of open source standalong spell checking programs > | >>>> such as ispell, aspell, hunspell etc. Most were derived from or influenced > | >>>> by the original unix spell program written by S.C.Johnson. 
For the curious,
> | >>>> here's a paper by Doug McIlroy about it:
> | >>>> http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf
> | >>>>
> | >>>> ispell was pre-Unicode and only worked with western languages but it
> | >>>> made some major advances that seem to have been carried over to aspell. I dug
> | >>>> into aspell some and it seems to support Gujarati.
> | >>>>
> | >>>> Anyway, aspell can be used from other programs (has an API), can handle
> | >>>> multiple languages etc. Its documentation is not sufficient (IMHO) to
> | >>>> understand affix rules. ispell documentation has more details. I used to
> | >>>> know ispell fairly well but that was 20+ years ago!
> | >>>>
> | >>>> The *dict-gu.oxt* extension (used in OpenOffice) contains a file
> | >>>> called *gu_IN.dic* that contains a word list and *gu_IN.aff* that
> | >>>> should have *affix* rules for Gujarati but it is very small (compared
> | >>>> to English) and seems to need a bunch more work. I see that this extension
> | >>>> is maintained by Kartik Mistry (did I see an email from him in this
> | >>>> thread?) so maybe he and I can figure out how to add more affix rules?
> | >>>>
> | >>>> The basic idea with some examples: given a rule like
> | >>>>
> | >>>> BOTH/R
> | >>>>
> | >>>> This can expand into BOTH and BOTHER, as -ER is a common English
> | >>>> extension (cart, carter and so on). Another example: we may have
> | >>>>
> | >>>> ACIDIFY/NR
> | >>>>
> | >>>> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to
> | >>>> IER). These rules make the spellchecking dictionary quite compact as well
> | >>>> as indicate how a word should be taken apart for efficient matching.
> | >>>>
> | >>>> aspell is capable of deriving such rules for English but I suspect it
> | >>>> will need help in Indian languages. This is where the *.aff* file
> | >>>> comes in. So for example in Gujarati we would like to render the following
> | >>>> words in a single rule
> | >>>>
> | >>>> *ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના*
> | >>>>
> | >>>> etc. For this we can write something like
> | >>>>
> | >>>> *ગધેડ/XYZABC*
> | >>>>
> | >>>> where each letter denotes a particular suffix. And yet, there is no
> | >>>> such word as *ગધેડ* -- and I am not sure these programs can handle
> | >>>> this. And we have compound words such as *ઘોડાગાડી* -- which will
> | >>>> require more complex rules. In fact Indic languages should have a much
> | >>>> larger set of affix rules than English! We should also check out what is
> | >>>> being done for Hindi.
> | >>>>
> | >>>> Next, we need rules for `similar' letters (or letters near each other
> | >>>> on a keyboard) so that if there is not an exact match, we first try such
> | >>>> similar or neighbor letters.
> | >>>>
> | >>>> Anyway, once we fix up the dictionary, very likely the same dictionary
> | >>>> can be used with word processors such as OpenOffice etc. An easier idea may
> | >>>> be to do a web based frontend.
> | >>>>
> | >>>> These programs do a lot of work: create dictionaries, read various file
> | >>>> formats, update screen, etc. etc. that make them complicated and hard to
> | >>>> modify. Ideally I would want a single function for checking:
> | >>>> check(Speller, String)
> | >>>>
> | >>>> That returns a quad: (correctly spelled prefix, misspelled word, list
> | >>>> of suggestions, remaining string). A separate program can generate the
> | >>>> dictionaries. The Speller object will read whatever dictionaries it needs.
> | >>>> But I don't have time to implement this.
> | >>>>
> | >>>> Bakul
> | >>>>
> | >>>> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com>
> | >>>> wrote:
> | >>>>
> | >>>> Has anyone tried the Microsoft Office Gujarati spell checker? It is
> | >>>> available with Office 2010.
> | >>>>
> | >>>> Sent from the old new iPad!
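A minimal sketch of the affix-flag expansion described in the quoted message (BOTH/R, ACIDIFY/NR, ગધેડ/XYZ...), assuming a much simplified rule format: each flag letter maps to a list of (suffix to strip, suffix to add) pairs, and the first pair whose strip suffix matches the stem is used. The rule tables, the name `expand`, and the `stem_is_word` switch are invented here for illustration. They are not the real ispell, aspell or hunspell affix tables or API, and the Gujarati flag letters X, Y, Z are arbitrary.

```python
# Toy rule tables: flag letter -> list of (strip, add) suffix pairs.
# These are illustrative assumptions, not a real affix file.
ENGLISH = {
    "R": [("Y", "IER"), ("", "ER")],        # ACIDIFY -> ACIDIFIER, BOTH -> BOTHER
    "N": [("Y", "ICATION"), ("", "EN")],    # ACIDIFY -> ACIDIFICATION
}
GUJARATI = {
    "X": [("", "ો")],   # masculine ending
    "Y": [("", "ી")],   # feminine ending
    "Z": [("", "ું")],  # neuter ending
}


def expand(entry, rules, stem_is_word=True):
    """Expand a 'STEM/FLAGS' dictionary entry into the set of words it covers."""
    stem, _, flags = entry.partition("/")
    # Some stems (e.g. the Gujarati stem in the thread) are not words themselves.
    words = {stem} if stem_is_word else set()
    for flag in flags:
        for strip, add in rules.get(flag, []):
            if stem.endswith(strip):
                # Remove the matched suffix (if any) and attach the new one.
                words.add(stem[: len(stem) - len(strip)] + add)
                break  # the first matching pair for this flag wins
    return words


if __name__ == "__main__":
    print(sorted(expand("BOTH/R", ENGLISH)))
    print(sorted(expand("ACIDIFY/NR", ENGLISH)))
    print(sorted(expand("ગધેડ/XYZ", GUJARATI, stem_is_word=False)))
```

With these toy tables, BOTH/R covers BOTH and BOTHER, and ACIDIFY/NR covers ACIDIFY, ACIDIFICATION and ACIDIFIER, matching the examples in the message. Passing `stem_is_word=False` is one possible way to deal with a stem such as ગધેડ that is not a word by itself.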
> | >>>> > | >>>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote: > | >>>> > | >>>> Googling "hindi spell checker algorithm" found a number of papers. The > | >>>> basic idea is to compare how "similar" a word being checked is to a word > | >>>> known to be correct, where similarity is computed using some algorithm. You > | >>>> don't store all the ways people can misspell a word. plus logic is used to > | >>>> derive related words from a root word, which depend on plurality, gender, > | >>>> tense, etc. These rules are more complex in Indic languages than western. > | >>>> And i think we may need to look at "clusters" instead of individual > | >>>> unicode points. But all this must have been worked years ago. May be not > | >>>> for Gujarati but for Hindi, Marathi, Bengali. You should check with the > | >>>> usual suspects (google, Microsoft, SIL, language researchers etc.). > | >>>> > | >>>> For OCR you may need something slightly different than spellcheckers > | >>>> that deal with human errors. Here a more common problem will be mistaking > | >>>> similar looking letters and joining or splitting of words due to too little > | >>>> of too much white space. > | >>>> > | >>>> Ultimately there should be support for language variations too (surati, > | >>>> kathiawadi, amdavadi etc)! > | >>>> > | >>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> > | >>>> wrote: > | >>>> > | >>>> Dhavalbhai, > | >>>> > | >>>> As we get text that is generated using OCR, I see need for a good > | >>>> Gujarati dictionary. I tried to use GL dictionary. It was not effective > | >>>> because it has corpus of words. It can not recognize any variation on the > | >>>> word. In that model, we need possibly over ten times the corpus GL > | >>>> dictionary has to be useful. Otherwise, it finds error with too many > | >>>> correct words. > | >>>> > | >>>> The same dictionary could be used for Gujarati proof readers. 
> | >>>> > | >>>> One way is to generate larger corpus by scrapping words from Gujarati > | >>>> Internet pages (those in Unicode), a better way is to think about building > | >>>> better dictionary logic. I may be able to interest exceptionally good > | >>>> volunteer developers if we can think of smarter way of creating a > | >>>> dictionary. For example, we could codify grammar rules to form derivative > | >>>> words. > | >>>> > | >>>> Should we pursue this course? > | >>>> > | >>>> > | >>>> > | >>>> Sent from the old new iPad! > | >>>> > | >>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote: > | >>>> > | >>>> Dear Roopalben, > | >>>> > | >>>> I second your concern regarding the correct language. I often say that > | >>>> Newspapers are the only LITERATURE most of us end up reading and have > | >>>> access to. The language and (more becoming common Hindi) words used in them > | >>>> shapes the language of society in present day and hence it is great that > | >>>> you are introducing this course. > | >>>> > | >>>> Unfortunately, on wiki we don't have spelling correction tool or > | >>>> dictionary lookup facility. But, Vishal Monpara has been developing one. > | >>>> Gujarati Lexicon has recently developed pop-up dictionary as well, which > | >>>> could be adapted for this purpose. > | >>>> > | >>>> On gu.wikipedia, there is a lot of content translated from either > | >>>> English or Hindi, and most of these lack the original Gujarati language. > | >>>> When read, these translations look so artificial. For the course, it could > | >>>> be good idea to show such examples and get the course attendees correct it, > | >>>> may be offline if they are not computer savvy or hesitant to use wikipedia. > | >>>> > | >>>> Please let me and community here know if you have any suggestions on > | >>>> how we can help with the task you are carrying out. 
> | >>>> > | >>>> Kind Regards, > | >>>> Dhaval > | >>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote: > | >>>> > | >>>>> Basically there are not many good proofreaders available in the > | >>>>> publishing industry - and the demand is high. That was the main reason for > | >>>>> starting this course. > | >>>>> > | >>>>> Wikipedia is an important source for information. However, the concern > | >>>>> here is about correct use of language too. Today we see a lot many errors > | >>>>> in Gujarati newspapers, publishing, media and almost everywhere. That is a > | >>>>> high concern for us. > | >>>>> > | >>>>> If Wiki is going to be an important tool for the next generation, we > | >>>>> Have to make sure that it conveys correct language to the society. > | >>>>> > | >>>>> I would like to know, whether any auto-correction of spelling etc. are > | >>>>> available while editing an article in Wiki ? > | >>>>> > | >>>>> Thank you. > | >>>>> > | >>>>> > | >>>>> Roopal > | >>>>> > | >>>>> > | >>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry < > | >>>>> kartik.mistry(a)gmail.com> wrote: > | >>>>> > | >>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com> > | >>>>>> wrote: > | >>>>>> > At Gujarati Sahitya Parishad, we are running proof reading course > | >>>>>> and we are including a session of modern methods of proof reading, which > | >>>>>> includes editing on (Guj) Wiki articles. > | >>>>>> > > | >>>>>> > Please send suggestions if you have. This is the first batch of > | >>>>>> students from various fields. > | >>>>>> > | >>>>>> Few suggestions (some may be offtopic, sorry for that!) > | >>>>>> 1. Please follow Wikipedia's guideline for article. > | >>>>>> 2. Make sure person is logged in before making changes. > | >>>>>> 3. Please do not change anything other than spelling/grammar etc. > | >>>>>> 4. If you're that already, donating pictures of 'સાહિત્યકાર' in > | >>>>>> various articles from GSP, is good idea. 
Isn't it? :) > | >>>>>> > | >>>>>> Thanks for good work! > | >>>>>> > | >>>>>> -- > | >>>>>> Kartik Mistry | IRC: kart_ > | >>>>>> {0x1f1f, kartikm}.wordpress.com > | >>>>>> > | >>>>>> _______________________________________________ > | >>>>>> Wikipedia-gu mailing list > | >>>>>> Wikipedia-gu(a)lists.wikimedia.org > | >>>>>> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu > | >>>>>> > | >>>>> > | >>>>> > | >>>>> _______________________________________________ > | >>>>> Wikipedia-gu mailing list > | >>>>> Wikipedia-gu(a)lists.wikimedia.org > | >>>>> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu > | >>>>> > | >>>>> _______________________________________________ > | >>>> Wikipedia-gu mailing list > | >>>> Wikipedia-gu(a)lists.wikimedia.org > | >>>> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu > | >>>> > | >>>> > | >>>> > | >>> > | >> > | >> _______________________________________________ > | >> Wikipedia-gu mailing list > | >> Wikipedia-gu(a)lists.wikimedia.org > | >> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu > | >> > | >> > | > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.14 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQIcBAEBAgAGBQJSQ78EAAoJEPC2gDV1D+BTb2kP/1XZiJK74jx3V+13Q5QXxiV4 > CdyRYb4YDlZsdBd1EdXXons6YB5EC9ro9xdljSgDv7MoXdjcfjPY2RHSHYsAqf6I > 0SyIBAM1sBgBbjNRB9JCHl2d9yG5sRxgF5IFW0oOjWJBT4UF18iBNHk7O0cr5TJR > V6KX6hGMs7Koft3+cJLSce32laSC2VDi3b3Z4I6EAgongmd8HD+WdIJrJw1q8NP2 > X6lBcCnsobbyA54oET8+i8tIX9tFi32PIcYZI5+mAi3T3jauQ3VJ9kxpqEim6CNj > 6eI8X0R7tbISlwLOZERNxRcjpbaw2AAXQbKuONJsxaoQgIxC+cm+67VBkkRvbe7k > qZEyPpBgtlfi7FgiGbG0ljuzbWpo04lErMS3ogtzi8dtyXBy5uSP2uV5B4kip52V > BtW8gfcX6vuVUoKLEx9e5NNY+Mp99ela8QV5b5FjavBiGyz2SNEBlmXJ4BhGDDj0 > NWrXLUw+VXh8FyJf6m/fUvKYIKS5maKREIBSsKxBleCB3WrflH88nLMrW96BYXFY > ULP53BkwQCqh72V6XPbXVffes4raS5egn6dFmMvff+WYccZRrCuBLnF/L+OSzgVI > eaL6Q5scHou9a3UlzLYaL9KoOHoakmOMY+XLg+MFPySXRbQsCot1NsfoDrDGSRLb > rliQquEJWulHZWC2dpcx > =w7Ri > -----END PGP 
SIGNATURE-----
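
A rough sketch of the single-function interface wished for earlier in this thread, `check(Speller, String)` returning a quad of (correctly spelled prefix, misspelled word, list of suggestions, remaining string). Everything here is an assumption made for illustration: the `Speller` class just wraps an in-memory word list, tokens are taken to be whitespace-separated runs, and suggestions come from Python's standard `difflib.get_close_matches`, which is a generic string-similarity helper and not what ispell, aspell or hunspell do.

```python
import re
from difflib import get_close_matches

TOKEN = re.compile(r"\S+")  # assumption: words are whitespace-separated runs


class Speller:
    """Hypothetical speller backed by a plain in-memory word list."""

    def __init__(self, words):
        self.words = sorted(set(words))
        self._known = set(self.words)

    def knows(self, word):
        return word in self._known

    def suggest(self, word, n=3):
        # Generic similarity ranking from the standard library.
        return get_close_matches(word, self.words, n=n)


def check(speller, text):
    """Return (correct prefix, misspelled word, suggestions, remaining text).

    Stops at the first unknown token. If every token is known, the whole
    text is returned as the prefix and the other three fields are empty.
    """
    for match in TOKEN.finditer(text):
        word = match.group()
        if not speller.knows(word):
            return text[: match.start()], word, speller.suggest(word), text[match.end():]
    return text, "", [], ""


if __name__ == "__main__":
    speller = Speller(["the", "quick", "brown", "fox"])
    print(check(speller, "the quikc fox"))
```

An interactive front end could call `check` repeatedly on the remaining string, which keeps dictionary building and screen handling out of the checking function, as the message suggests.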

10 years, 7 months

Re: [Wikipedia-gu] Proof reading

by Rajesh Mashruwala

Has anyone tried the Microsoft Office Gujarati spell checker? It is available with Office 2010.

Sent from the old new iPad!

On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:

Googling "hindi spell checker algorithm" found a number of papers. The basic idea is to compare how "similar" a word being checked is to a word known to be correct, where similarity is computed using some algorithm. You don't store all the ways people can misspell a word. In addition, logic is used to derive related words from a root word, depending on plurality, gender, tense, etc. These rules are more complex in Indic languages than in western ones. And I think we may need to look at "clusters" instead of individual Unicode code points. But all this must have been worked out years ago. Maybe not for Gujarati but for Hindi, Marathi, Bengali. You should check with the usual suspects (Google, Microsoft, SIL, language researchers etc.).

For OCR you may need something slightly different than spellcheckers that deal with human errors. Here a more common problem will be mistaking similar looking letters and joining or splitting of words due to too little or too much white space.

Ultimately there should be support for language variations too (Surati, Kathiawadi, Amdavadi etc.)!

On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:

Dhavalbhai,

As we get text that is generated using OCR, I see the need for a good Gujarati dictionary. I tried to use the GL dictionary. It was not effective because it is only a corpus of words: it cannot recognize any variation on a word. In that model, the GL dictionary would need possibly over ten times its current corpus to be useful. Otherwise, it flags too many correct words as errors.

The same dictionary could be used by Gujarati proofreaders.

One way is to generate a larger corpus by scraping words from Gujarati Internet pages (those in Unicode); a better way is to think about building better dictionary logic. I may be able to interest exceptionally good volunteer developers if we can think of a smarter way of creating a dictionary. For example, we could codify grammar rules to form derivative words.

Should we pursue this course?

Sent from the old new iPad!

On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:

Dear Roopalben,

I second your concern regarding correct language. I often say that newspapers are the only LITERATURE most of us end up reading and have access to. The language and the (increasingly common Hindi) words used in them shape the language of society today, and hence it is great that you are introducing this course.

Unfortunately, on wiki we don't have a spelling correction tool or dictionary lookup facility. But Vishal Monpara has been developing one. Gujarati Lexicon has recently developed a pop-up dictionary as well, which could be adapted for this purpose.

On gu.wikipedia, there is a lot of content translated from either English or Hindi, and most of it does not read like original Gujarati. When read, these translations look so artificial. For the course, it could be a good idea to show such examples and get the course attendees to correct them, maybe offline if they are not computer savvy or are hesitant to use Wikipedia.

Please let me and the community here know if you have any suggestions on how we can help with the task you are carrying out.

Kind Regards,
Dhaval

On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:

> Basically there are not many good proofreaders available in the publishing
> industry - and the demand is high. That was the main reason for starting
> this course.
>
> Wikipedia is an important source of information. However, the concern
> here is about correct use of language too. Today we see a great many errors
> in Gujarati newspapers, publishing, media and almost everywhere. That is a
> serious concern for us.
>
> If Wiki is going to be an important tool for the next generation, we have
> to make sure that it conveys correct language to society.
>
> I would like to know whether any auto-correction of spelling etc. is
> available while editing an article in Wiki?
>
> Thank you.
>
> Roopal
>
> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <kartik.mistry(a)gmail.com> wrote:
>
>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com>
>> wrote:
>> > At Gujarati Sahitya Parishad, we are running a proofreading course and
>> > we are including a session on modern methods of proofreading, which
>> > includes editing (Guj) Wiki articles.
>> >
>> > Please send suggestions if you have any. This is the first batch of
>> > students from various fields.
>>
>> A few suggestions (some may be off-topic, sorry for that!)
>> 1. Please follow Wikipedia's guidelines for articles.
>> 2. Make sure the person is logged in before making changes.
>> 3. Please do not change anything other than spelling/grammar etc.
>> 4. If you're that already, donating pictures of 'સાહિત્યકાર' (literary
>> authors) for various articles from GSP is a good idea. Isn't it? :)
>>
>> Thanks for the good work!
>>
>> --
>> Kartik Mistry | IRC: kart_
>> {0x1f1f, kartikm}.wordpress.com

_______________________________________________
Wikipedia-gu mailing list
Wikipedia-gu(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu
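
A small sketch of the "similarity" idea and the remark about looking at "clusters" instead of individual Unicode code points, from Bakul's message quoted above. It assumes a deliberately simplified notion of a cluster (a base character plus any combining marks, with a virama gluing the next consonant on) and uses plain Levenshtein edit distance as the similarity measure. Both choices are illustrative assumptions, not the algorithm from any of the papers mentioned, and full Unicode grapheme segmentation (ZWJ/ZWNJ and so on) is not handled.

```python
import unicodedata

VIRAMA = "\u0acd"  # Gujarati sign virama (halant), joins consonants into conjuncts


def clusters(word):
    """Split a word into simplified clusters: base + marks, conjuncts kept whole."""
    out = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")  # vowel signs, virama, etc.
        if out and (is_mark or out[-1].endswith(VIRAMA)):
            out[-1] += ch   # attach to the cluster being built
        else:
            out.append(ch)  # start a new cluster
    return out


def distance(a, b):
    """Levenshtein edit distance counted in clusters rather than code points."""
    a, b = clusters(a), clusters(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(clusters("ગધેડો"))
    print(distance("ગધેડો", "ગધેડી"))
```

Counting in clusters means that swapping one vowel sign for another (for example a short versus a long vowel) is a single substitution of one cluster, which is closer to how a reader perceives the error than counting raw code points.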

10 years, 7 months

Re-releasing Konkani Viswakosh under CC-BY-SA 3.0 (Event Invitation)

by Nitika Tandon

Greetings from CIS-A2K!

We request the pleasure of your company at the event 'Re-releasing Konkani Vishwakosh & Building Konkani Wikipedia'. The event will take place on Thursday, 26th September, 10am - 11am, at the Conference Hall, Goa University, Taleigao.

Upon CIS-A2K's explicit request, Goa University has approved the re-release of Konkani Vishwakosh under a Creative Commons license (CC-BY-SA 3.0) to make it freely available to the public and thus preserve Konkani language and culture in the digital era. This encyclopedia will also serve as one of the main sources for building and writing articles on Konkani Wikipedia (which is currently under incubation).

This is a major win for the Wikimedia movement in India. This is probably the first time a state-run higher education institution has re-released copyrighted content under Creative Commons. We hope more such initiatives will follow to "Let the Knowledge Go around"!

We'd like you to be a part of this event and help showcase the Konkani community and language on a global digital platform such as Konkani Wikipedia. Please see the link for the invite.[1] We look forward to seeing you at the event.

Best,
Nitika Tandon
Program Manager
Access to Knowledge
The Centre for Internet & Society

[1] https://commons.wikimedia.org/wiki/File:Re-release_of_Konkani_Vishwakosh_un…

10 years, 7 months

CIS-A2K MoU with Goa University (Announcement)

by Vishnu T

Dear All,

On behalf of CIS-A2K, I am glad to share with you that Goa University has signed an MoU with CIS-A2K. As part of this MoU, Goa University and CIS-A2K will work together to digitize "Konkani Vishwakosh" under a Creative Commons license and build a Digital Knowledge Partnership in order to enhance digital literacy in the Konkani language, facilitate collaborative knowledge production, and disseminate the same free of cost through Konkani Wikipedia (currently under incubation). Goa University and CIS-A2K will co-design and jointly implement relevant training programmes to achieve this objective.

CIS-A2K is grateful for the support and encouragement received from the Goa University Vice-Chancellor Dr. Satish Shetye; Prof. Alito Siqueira; Prof. Priyadarshini Tadkodkar; Dr. Madhavi Sardesai; Dr. Gopakumar; and other faculty of Goa University. We are equally grateful to Wikipedians Harriet Vidyasagar and Frederick Noronha, who have engaged with the CIS-A2K team and been a constant source of support.

Wish us the best of luck!

Vishnu

10 years, 7 months

Proof reading

by Roopal Mehta

Friends:

At Gujarati Sahitya Parishad, we are running a proofreading course, and we are including a session on modern methods of proofreading, which includes editing (Guj) Wiki articles.

Please send suggestions if you have any. This is the first batch of students from various fields.

Thank you.

Warm Regards,
Roopal

10 years, 7 months

Re: [Wikipedia-gu] Wikipedia-gu Digest, Vol 22, Issue 11

by yogesh kavishvar

For proofreading on Gujarati Wikipedia, at present it is mainly Dhavalbhai who does this work. Ashokbhai does it too, but he cannot give time to it regularly. I also used to proofread, but I will not be able to give time to the wiki for a little while yet. Proofreading is still pending on older articles. There is certainly a need to get a whole team ready for this. With the help of Lexicon, other members can also contribute to this work. I will be able to set aside time for this work after some time.

-yogesh kavishwar

10 years, 7 months

Access to Knowledge [CIS-A2K] Newsletter August 2013

by Subhashish Panigrahi

Dear all, We thank you all for your support and collaboration. Please find below the details of our work in the month of August 2013. Updates from Konkani Wikipedia The CIS-A2K programme organised a four-day workshop in collaboration with the Goa University for students of M.A. (Konkani) at the Central State Library, Konkani Department, University of Goa from August 21 to 24, 2013. Nitika Tandon and Subhashish Panigrahi conducted this workshop. The event saw 38 students creating 43 new articles on Konkani Wikipedia, which is incubation. The main objective is to bring this seven-year-old project out of incubation into a live Wikipedia project. Event reports, blog entries, videos, guest column and media coverage arising from the workshop can be accessed by clicking on the links below: ► Event Reports - Konkani Wikipedia — Climbing up the Indian Language Ladder? (by Subhashish Panigrahi, August 31, 2013): http://cis-india.org/a2k/blog/konkani-wikipedia-climbing-up-the-indian-lang… - Konkani Wikipedia Advances in 4 Days — From 90 Articles to 130 Articles! (by Nitika Tandon, August 31, 2013): http://cis-india.org/openness/blog/konkani-wikipedia-advances-in-four-days <http://cis-india.org/openness/blog/konkani-wikipedia-advances-in-four-days> ► Guest Columns - Konkani Wikipedia climbing up the Indian language ladder (by Subhashish Panigrahi, DNA, September 6, 2013). - http://www.dnaindia.com/blogs/1885294/post-konkani-wikipedia-climbing-up-th… - Recap on Konkani Wikipedia Workshop (by Subhashish Panigrahi, Startup Goa Blog, September 9, 2013). 
- http://cis-india.org/openness/blog/start-up-goa-blog-september-10-2013-subh… - ଅବସର ପରର ଦ୍ବିତୀୟ ଜୀବନ, ଅବସର ପରେ ସକ୍ରିୟ ଭାବେ ଓଡ଼ିଆ ଉଇକିପିଡ଼ିଆରେ ଲେଖାଲେଖି ଜାରୀ ରଖିଥିବା ଜଣେ ଡାକ୍ତରଙ୍କ ସ‌ହ ଭାବାଲୋଚନା http://cis-india.org/openness/blog/odiapua-subhashish-sep-10-odia-wikipedia… <http://cis-india.org/openness/blog/odiapua-subhashish-sep-10-odia-wikipedia…> ► Media Coverage - Wikipedia writes a new script (by Joyce Dias, August 24, 2013, The Goan). CIS-A2K workshop held in Goa is mentioned extensively. Nitika Tandon and Subhashish Panigrahi are quoted. http://cis-india.org/news/thegoan-joyce-dias-august-24-2013-wikipedia-write… - Konkani Wikipedia makes headway (by Diana Fernandes, OHeraldO, August 24, 2013). Nitika Tandon is quoted. http://cis-india.org/news/epaperoheraldo-august-24-2013-diana-fernandes-kon… - Konkani Wikipedians Speak (Goan Voice Daily Newsletter, September 4, 2013). Konkani Wikipedia workshop organized in Goa from August 21 to 24, 2013 is mentioned in this newsletter. http://www.goanvoice.org.uk/?ref=Guzels.TV <http://www.goanvoice.org.uk/?ref=Guzels.TV> ► Blog Entries - Voices from Goa: Frania Pereira tells Why She Writes Articles on Konkani Wikipedia (by Subhashish Panigrahi, August 27, 2013). http://cis-india.org/openness/blog/voices-from-goa - Voices from Goa: Wikipedia Editor Rusita Paryekar (by Subhashish Panigrahi, August 27, 2013). http://cis-india.org/openness/blog/voices-from-goa-wikipedia-editor-rusita-… <http://cis-india.org/openness/blog/voices-from-goa-wikipedia-editor-rusita-…> Other Wikipedia Updates ► Event Reports - A Kannada Wikipedia Workshop at Krishnarajapet (by Dr. U.B. Pavanaja, August 14, 2013). The workshop was co-organized by the CIS-A2K team along with Kannada Sahitya Parishat of KR Pet. http://cis-india.org/openness/blog/a-kannada-wikipedia-workshop-at-krishnar… - Wikimania 2013: Wikipedians represent Indian Languages in Hong Kong (by Subhashish Panigrahi, August 19, 2013). The event was organised by Wikimedia Foundation. 
Subhashish participated in the event. http://cis-india.org/openness/blog/wikimania2013 - An Odia Wikipedia Workshop at Sambalpur (by Gorvachove Pothal, August 27, 2013). This workshop was held at Veer Surendra Sai University of Technology, Burla, Sambalpur on July 26 and 27, 2013. Odia Wikipedian Gorvachove Pothal organized this workshop with financial support from the CIS-A2K programme. http://cis-india.org/openness/blog/odia-wikipedia-workshop-sambalpur ► Events Co-organised - Digitization of Books for Indic Language WikiSource (co-organised by Wikimedia India and CIS-A2K, CIS, Bangalore, August 18, 2013). http://cis-india.org/openness/events/digitization-of-books-for-indic-langua… - A Workshop on Editing Wikipedia in Mumbai (organised by the Centre for Indian Languages in Higher Education and CIS-A2K, Tata Institute of Social Sciences, Mumbai, August 24, 2013). The workshop was aimed at assisting students to take part in the Indian Languages Mela at the Tata Institute of Social Sciences (September 20-21, 2013) which is hosting a competition for best Indian language entries on Wikipedia. http://cis-india.org/openness/events/workshop-on-editing-wikipedia-in-mumbai ► Event Organised - Mobile Training Workshop @ CIS (CIS, Bangalore, August 29, 2013). Rachita and Keerthana Chandrashekar gave a talk on mobile campaigns. http://cis-india.org/openness/events/mobile-training-workshop - Wikipedia Training in Telugu for Dr. B.R. Ambedkar Open University, Hyderabad (Dr. B.R. Ambedkar Open University, Hyderabad, September 5-6, 2013). T. Vishnu Vardhan taught a module on "Knowledge and Openness in the Digital Era". http://cis-india.org/openness/events/wikipedia-training-in-telugu-for-b-r-a… ► Events Participated - Wikimania 2013: The International Wikimedia Conference (organised by Wikimedia Foundation, Hong Kong Polytechnic University, August 7 – 11, 2013). T. Vishnu Vardhan and Subhashish Panigrahi participated in the event. 
http://wikimania2013.wikimedia.org/wiki/Main_page - Wikimedia Asia Meeting (organised by Wikimedia community, Hong Kong, August 10, 2013). T. Vishnu Vardhan and Subhashish Panigrahi participated in the meeting. http://cis-india.org/news/wikimedia-asia-meeting.<http://cis-india.org/news/wikimedia-asia-meeting>Unedited transcript of the entire conversation is posted online http://cis-india.org/news/wikimedia-asia-meeting - వికీపీడియా:సమావేశం/హైదరాబాద్/ఆగష్టు (Hyderabad, August 25, 2013). T.Vishnu Vardhan participated in the meeting through Skype. http://cis-india.org/news/telugu-wiki-meet-up-hyderabad-august-2013 <http://cis-india.org/news/telugu-wiki-meet-up-hyderabad-august-2013> ► Media Coverage - Wikipedia Gains Massive Traffic Thanks To Vernacular Languages (Before It’s News, August 1, 2013). T. Vishnu Vardhan is quoted. http://cis-india.org/news/beforeitnews-august-1-2013-wikipedia-gains-massiv… - Wikipedia boom in Marathi, Malayalam and other desi languages (by Sandhya Soman, The Times of India, August 1, 2013). T. Vishnu Vardhan is quoted. http://cis-india.org/news/times-of-india-august-1-2013-sandhya-soman-wikipe… - Wikipedia boom in vernacular languages (by Divya Saboo, DNA, August 1, 2013). The Centre for Internet and Society is mentioned. http://cis-india.org/news/dna-august-1-2013-divya-saboo-wikipedia-boom-in-v… - Stress on posting articles on Kannada Wikipedia (by R. Krishna Kumar, Hindu, August 2, 2013). Dr. U.B.Pavanaja is quoted. http://cis-india.org/news/hindu-r-krishna-kumar-august-2-2013-stress-on-pos… - India’s Indigenous Languages Drive Wikipedia’s Growth (by Mahesh Sharma, TechCrunch, August 6, 2013). T. Vishnu Vardhan is quoted. http://cis-india.org/news/techcrunch-august-6-2013-mahesh-sharma-indias-ind… - Krishnarajapet Wikipedia Workshop Coverage (Prajavani, August 12, 2013). http://cis-india.org/news/prajavani-august-12-2013-krishnarajapet-workshop - Krishnarajapet Wikipedia Workshop Coverage (Vijaya Vani, August 12, 2013). 
http://cis-india.org/news/vijaya-vani-august-12-2013-krishnarajapet-wikiped… - Krishnarajapet Wikipedia Workshop Coverage (Suvarna Times of Karnataka, August 12, 2013). http://cis-india.org/news/suvarna-times-of-karnataka-august-12-2013-krishna… ► Other guest columns: - A second life after retirement: a conversation with a doctor who has continued to write actively on Odia Wikipedia after retiring (in Odia; by Subhashish Panigrahi, Odiapua, September 10, 2013): http://cis-india.org/openness/blog/odiapua-september-10-2013-subhashish-pan… ► Ongoing / Upcoming Events - Digital Resources in Telugu: A Workshop for Research Scholars (co-organised by the Department of Cultural Studies, English and Foreign Languages University, Hyderabad and CIS-A2K, September 13, 2013). http://cis-india.org/openness/events/digital-resources-in-telugu - Train the Trainer: Four-day Residential Training Workshop in Bangalore (organised by CIS-A2K, Bangalore, October 1 – 5, 2013). The programme will be held in the first week of October. http://cis-india.org/openness/events/train-the-trainer The Wikimedia Foundation has funded A2K to anchor the growth of the Wikimedia movement in India. The A2K team consists of three members based in Bangalore (T. Vishnu Vardhan, Dr. U.B. Pavanaja and Subhashish Panigrahi) and one member, Nitika Tandon, in Delhi. Archives of our newsletters can be accessed here (http://cis-india.org/about/newsletters). Wikipedians from various communities can request support for outreach programmes, technical bugs, logistics and merchandise, and media, public relations and communications at http://bit.ly/TOcXId. 
About CIS The Centre for Internet and Society is a non-profit research organization that works on policy issues relating to freedom of expression, privacy, accessibility for persons with disabilities, access to knowledge and IPR reform, and openness (including open government, FOSS, open standards, etc.), and engages in academic research on digital natives and digital humanities. Follow us elsewhere ● Twitter: https://twitter.com/CISA2K ● Facebook group: https://www.facebook.com/cisa2k ● Visit us at: <https://cis-india.org/> https://meta.wikimedia.org/wiki/India_Access_To_Knowledge ● E-mail: a2k(a)cis-india.org Support Us Please help us defend consumer / citizen rights on the Internet! Write a cheque in favour of ‘The Centre for Internet and Society’ and mail it to us at No. 194, 2nd ‘C’ Cross, Domlur, 2nd Stage, Bengaluru – 560 071. Request for Collaboration: We invite researchers, practitioners, and theoreticians, both organisationally and as individuals, to collaboratively engage with Internet and society and improve our understanding of this new field. To discuss research collaborations, write to Sunil Abraham, Executive Director, at sunil(a)cis-india.org or Nishant Shah, Director – Research, at nishant(a)cis-india.org. To discuss collaborations on Indic language Wikipedia, write to T. Vishnu Vardhan, Programme Director, A2K, at vishnu(a)cis-india.org. CIS is grateful to its donors, Wikimedia Foundation, Ford Foundation, Privacy International, UK, Hans Foundation and the Kusuma Trust which was founded by Anurag Dikshit and Soma Pujari, philanthropists of Indian origin, for its core funding and support for most of its projects. -- Best! Subhashish Panigrahi Programme Officer, Access To Knowledge Centre for Internet and Society @psubhashish

10 years, 7 months

New collaboration on Wikisource: Project No. 30 - "સૌરાષ્ટ્રની રસધાર ભાગ-૧" (Saurashtra ni Rasdhar, Part 1)

by sushant savla

Friends, Project 29, "મૂરખરાજ અને તેના બે ભાઈઓ" (Murakhraj and His Two Brothers), has now been completed on Wikisource. Project 28, "વનવૃક્ષો" (Forest Trees), is about 96% complete. Alongside these, the folk-tale collection "સૌરાષ્ટ્રની રસધાર ભાગ-૧" (Saurashtra ni Rasdhar, Part 1) by Jhaverchand Meghani is being uploaded as the new Project No. 30. This book is part of the syllabus for the Gujarati subject in the UPSC examination, so it should be useful to Gujarati students. Friends who wish to take part in this collaboration can get in touch through the link below. https://gu.wikisource.org/wiki/%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%B… Thanks, Sushant

10 years, 7 months

Wikisource Project 29 - "મૂરખરાજ અને તેના બે ભાઈઓ" (Murakhraj and His Two Brothers)! (Completed)

by sushant savla

Friends, thanks to the collaboration of the participating friends, the work of uploading Gandhiji's moral tale "મૂરખરાજ અને તેના બે ભાઈઓ" (Murakhraj and His Two Brothers) to Wikisource has been completed. Based on the ideas of Tolstoy, the world-famous author dear to Gandhiji, this moral tale is about simple living and good conduct. The project began on 07-09-2013 and was completed on 13-09-2013. Ashokbhai Vaishnav (Ahmedabad), Kartikbhai Mistry, Kokilaben Mistry and Sushant (Mumbai) took part in it. Many thanks to all the friends. Everyone is invited to read and enjoy this old work of Gandhiji. Sushant Savla

10 years, 7 months
