Re: [Wikipedia-gu] Proof reading - Wikipedia-gu

26 Sep 2013

Search for "javascript spell check". Typo.js, for example, can use the same
dictionaries and affix rules as Hunspell (which is used in OpenOffice and
Firefox), so any improvements we make to Gujarati spellchecking will work
with it as well. It can do spell checking in multiple languages.

Hunspell can be used locally to spell check .txt files and I think it
knows a few other formats. It also allows use of private dictionaries
to store words the general dictionary doesn't know about but you use
such as proper names.
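
A rough sketch of the private-dictionary idea in Python. This is not Hunspell
itself: the function names and the sample words are made up for illustration,
and the tokenizer only splits on whitespace and strips common punctuation.

```python
import string

def load_words(lines):
    """Build a word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}

def unknown_words(text, general, private=frozenset()):
    """Return words found in neither the general nor the private list."""
    known = general | private
    seen, missing = set(), []
    for token in text.split():
        # Strip ASCII punctuation and the danda from both ends of the token.
        word = token.strip(string.punctuation + "।")
        if word and word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing

general = load_words(["ગધેડો", "ઘોડાગાડી", "છે"])   # stand-in for the general dictionary
private = load_words(["ધવલ"])                      # a proper name the general list lacks
typo = "ઘોડાગડી"                                   # deliberately misspelled sample word
text = "ધવલ ગધેડો છે, " + typo + " છે."

print(unknown_words(text, general))            # flags the name and the typo
print(unknown_words(text, general, private))   # flags only the typo
```

With the private list loaded, the proper name is no longer reported, which is
the behaviour described above.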

The difficulty with Gujarati and other Indian scripts is that most
command line programs don't know how to display them properly so
you pretty much need to spellcheck using a word processor program or
a browser.

On Sep 25, 2013, at 9:58 PM, शंतनू महाजन <shantanoo(a)gmail.com> wrote:

  -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1

 Hi Dhaval,
 Reply inline.

 +++ Dhaval S. Vyas [26-Sep-2013 00:31 +0530]
 | Something like that I do on gu.wiki with my bot. Correct most of the
 | obvious spelling or grammatical mistakes in Gujarati.

 Nice to know about it. Can you share more details regarding following:
 - - In which language (python/ruby/perl/...) the bot is written?
 - - How do you maintain the updated list of words/mistakes which need to
  be corrected? Can you share the current list?

 How are you fixing the grammatical mistakes/errors? Is that code
 specific to Gujarati or can it be used easily for other languages?

 | Here we are looking
 | for something broader, which could be used not only online or onwiki, but
 | offline and/or at least offwiki.

 Can you elaborate more on this? By offline do you mean that the user
 provides some file (let's consider a Unicode .txt file), runs the program
 locally on their machine, and the spell check is done on that data?

 | Also, the bot corrects errors already committed, while it would be better to
 | have a good spell checker handy that can correct on the fly, with user input
 | obviously. This is more useful as it would help users learn correct Jodani
 | in the longer term. Users being able to build their own word pool, etc. would
 | be other benefits.

 I had thought about it in the past, but since I did not find anyone with
 JavaScript expertise who could help with writing the auto-correction code,
 I was not able to do much about it. :(

 - -- 
 शंतनू

 |
 | Regards,
 | Dhaval
 | On 25 Sep 2013 19:46, "Arnav Sonara" <sonara.arnav(a)gmail.com> wrote:
 |
 | > Hello friends :-)
 | >
 | > Whether the dialect changes every twelve gau or every twelve villages,
 | > whether it is હૂતો or હુતી or સુતી or સુતો, GPL, BSD or CC by SA ;-) right
 | > now our main concern is to have/develop a tool which can replace the
 | > commonly misspelled words.
 | >
 | > I remembered that Marathi Wikipedia Community is using such a tool and got
 | > in touch with one of the guys running it.
 | >
 | > So they have a common list of correctly spelled words (hrasva and
 | > dirgha vowels) set against what people generally tend to type
 | > incorrectly, and their bot replaces the incorrect ones.
 | >
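
A minimal Python sketch of such a word-list replacement pass. The
tab-separated "wrong, right" file format, the function names, and the sample
entry are assumptions for illustration; they are not taken from the actual
replace_words list or from the Marathi bot's code.

```python
import re

# Runs of characters that are neither whitespace nor common punctuation,
# so vowel signs stay attached to their word.
TOKEN = re.compile(r"[^\s.,;:!?।]+")

def load_replacements(lines):
    """Parse 'wrong<TAB>right' lines into a dict (assumed file format)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        wrong, right = line.split("\t", 1)
        table[wrong] = right
    return table

def fix_text(text, table):
    """Replace whole words that appear in the table; leave the rest alone."""
    return TOKEN.sub(lambda m: table.get(m.group(0), m.group(0)), text)

# Illustrative entry only: a short (hrasva) vowel sign typed for a long one.
sample = ["# wrong<TAB>right", "વિકિપિડિયા\tવિકિપીડિયા"]
table = load_replacements(sample)
print(fix_text("વિકિપિડિયા, એક જ્ઞાનકોશ.", table))
```

Matching whole tokens rather than substrings keeps the bot from rewriting a
listed misspelling when it only appears inside a longer, correct word.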
 | > The word list can be found here
 | > <http://toolserver.org/~shantanoo/replace_words.txt.html> and
 | > the code for reading/modifying wiki pages can be found here
 | > <https://github.com/pune-lug/supersimplemediawiki>.
 | > So if anyone is up for taking this task, please go ahead and start
 | > developing the tool :-D maybe I can help if and when it is possible.
 | >
 | > I've also CC'd a friend of mine from Marathi community who helped me
 | > understanding this, feel free to connect to him. I'm sure he will love to
 | > help you out. Thanks.
 | >
 | >
 | >
 | > Thanks,
 | > Arnav <http://arnavs.com/>.
 | > (User:Rangilo_Gujarati)<http://en.wikipedia.org/wiki/User:Rangilo_Gujarat…
 | > .
 | >
 | >
 | > On Sun, Sep 22, 2013 at 5:15 AM, Dhaval S. Vyas <dsvyas(a)gmail.com> wrote:
 | >
 | >> Dear Bakulbhai,
 | >>
 | >>
 | >>> [No need to "embed" in wikipedia. You edit articles in your web
 | >>> browser. Or in a word processor and then copy and paste into the
 | >>> browser.]
 | >>
 | >>
 | >> Most of us prefer to work straight on the wiki, as wiki formatting can be
 | >> done simultaneously. But either way, it is OK as long as we have the
 | >> spellchecker somewhere.
 | >>
 | >>
 | >>> In terms of software, any new software I write will be under a BSD like
 | >>> license (basically: you can do anything you do with it, including use it
 | >>> in commercial work, but a) don't hold me responsible for anything and b)
 | >>> don't take credit for it). Any existing software I enhance is under its
 | >>> existing license. Note that ispell was under BSD license. aspell is under
 | >>> GPL (but the affix related work in it is under BSD since it came from
 | >>> ispell).
 | >>
 | >>
 | >> That's simply great, we are lucky to have come across a person like you
 | >> and Thank Rajeshbhai for looping in :-)
 | >>
 | >>
 | >>> I will leave any organizational issues to Rajesh and other more capable.
 | >>> Rajesh, rather than use individual email addresses, pick an existing
 | >>> mailing list to discuss this further.
 | >>
 | >> I suggested a different thread because all interested parties have
 | >> already responded. Also, this being a mailing list, many people have opted
 | >> for a digest instead of individual emails. So, if we continued the
 | >> discussion here, their feedback might not be in time (e.g. Vihangbhai's
 | >> reply today), while if it came as a personal email, it could be replied to
 | >> by the interested person. But both options have their own pros and cons,
 | >> so I will leave it entirely to you all.
 | >>
 | >> Regards,
 | >> Dhaval
 | >>
 | >>
 | >>
 | >>
 | >>
 | >> On Sat, Sep 21, 2013 at 10:46 PM, Bakul Shah <bakul(a)bitblocks.com> wrote:
 | >>
 | >>> This is perhaps not the right mailing list to continue this discussion
 | >>> but I am already on 3 or 4 gujarati related lists.... So for now I will
 | >>> continue here.
 | >>>
 | >>> On Sep 20, 2013, at 9:36 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
 | >>>
 | >>> Once we have such functionality, and it is available under CC
 | >>> licence/public domain, it could be embedded in wiki (if not easily,
 | >>> without much trouble). There are several ways we could take it on board.
 | >>>
 | >>> [No need to "embed" in wikipedia. You edit articles in your web
 | >>> browser. Or in a word processor and then copy and paste into the
 | >>> browser.]
 | >>>
 | >>> *Licensing:*
 | >>>
 | >>> The spellchecking dictionary needs the kind of enhancements I talked
 | >>> about earlier. One starting point is the word list in the
 | >>> *dict-gu_IN.oxt* OpenOffice extension.
 | >>>
 | >>> Before doing any real work on it, we need to get the licensing issues
 | >>> clarified. *dict-gu_IN.oxt* has a README file that says the original
 | >>> word list was prepared by Utkarsh Project volunteers and the list is
 | >>> under the GPL. GPL is actually for software so it is strange to see a
 | >>> dictionary using GPL. See also:
 | >>>
 | >>>
 | >>> http://stackoverflow.com/questions/4329467/is-it-okay-to-include-gpled-file…
 | >>>
 | >>> I don't want to get into a discussion of pros and cons of GPL but I am
 | >>> unwilling to work on anything GPL due to its more restrictive licensing.
 | >>> The best option in my view is to ask the Utkarsh volunteers to make it
 | >>> public domain or licence it under a dual license (GPL as well as BSD). I
 | >>> checked out http://www.utkarsh.org but it is not clear how to do this.
 | >>> Rajesh, can you help sort this out?
 | >>>
 | >>> The other alternative is to do what Rajesh suggested, which is to feed
 | >>> lots of text to a program and derive a list. If we do this, we will make
 | >>> this completely public domain.
 | >>>
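
A rough Python sketch of deriving a word list from a pile of text. The
min_count threshold is an assumption (a crude way to drop one-off typos), and
the function name and punctuation set are made up for illustration.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]।"

def derive_word_list(texts, min_count=2):
    """Count words across texts and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            word = token.strip(PUNCTUATION)
            if word:
                counts[word] += 1
    return sorted(word for word, n in counts.items() if n >= min_count)

pages = ["ગધેડો છે. ગધેડો નથી.", "ઘોડાગાડી છે!"]
print(derive_word_list(pages))
```

A frequency cutoff alone will still let common misspellings through, so a
list derived this way would need review before being used as a dictionary.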
 | >>> In terms of software, any new software I write will be under a BSD like
 | >>> license (basically: you can do anything you do with it, including use it
 | >>> in commercial work, but a) don't hold me responsible for anything and b)
 | >>> don't take credit for it). Any existing software I enhance is under its
 | >>> existing license. Note that ispell was under BSD license. aspell is under
 | >>> GPL (but the affix related work in it is under BSD since it came from
 | >>> ispell).
 | >>>
 | >>> Can all of us who have an interest in developing such functionality,
 | >>> a passion for the language, and the expertise form a taskforce and take
 | >>> it off list? I will be delighted to work on it in whatever capacity I can.
 | >>>
 | >>> I will leave any organizational issues to Rajesh and other more capable.
 | >>>  Rajesh, rather than use individual email addresses, pick an existing
 | >>> mailing list to discuss this further.
 | >>>
 | >>> On 20 Sep 2013 15:48, "Bakul Shah" <bakul(a)bitblocks.com> wrote:
 | >>>
 | >>>> Rajesh,
 | >>>>
 | >>>> Proof readers will have to use a word processor or browser as other
 | >>>> tools are not very good at displaying Indic languages.
 | >>>>
 | >>>> Googledoc is no good at spell checking.
 | >>>>
 | >>>> OpenOffice (or LibreOffice) has a number of dictionaries including for
 | >>>> Gujarati. I suspect it doesn't work well & we have work to do. I have no
 | >>>> desire or time to work on openOffice -- it is massive -- but there may
 | >>>> be a way....
 | >>>>
 | >>>> [The rest is a bit too technical. Feel free to skip]
 | >>>>
 | >>>> There are a number of open source standalone spell checking programs
 | >>>> such as ispell, aspell, hunspell etc. Most were derived from or
 | >>>> influenced by the original unix spell program written by S.C.Johnson.
 | >>>> For the curious, here's a paper by Doug McIlroy about it:
 | >>>>  http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf
 | >>>>
 | >>>> ispell was pre-unicode and only worked with western languages but it
 | >>>> made some major advances that seem to have been carried over to aspell.
 | >>>> I dug into aspell some and it seems to support Gujarati.
 | >>>>
 | >>>> Anyway, aspell can be used from other programs (has an API), can handle
 | >>>> multiple languages etc. Its documentation is not sufficient (IMHO) to
 | >>>> understand affix rules. ispell documentation has more details. I used to
 | >>>> know ispell fairly well but that was 20+ years ago!
 | >>>>
 | >>>> The *dict-gu.oxt* extension (used in OpenOffice) contains a file
 | >>>> called *gu_IN.dic* that contains a word list and *gu_IN.aff* that
 | >>>> should have *affix* rules for Gujarati but it is very small (compared
 | >>>> to English) and seems to need a bunch more work. I see that this
 | >>>> extension is maintained by Kartik Mistry (did I see an email from him in
 | >>>> this thread?) so maybe he and I can figure out how to add more affix
 | >>>> rules?
 | >>>>
 | >>>> The basic idea with some example: given a rule like
 | >>>>
 | >>>> BOTH/R
 | >>>>
 | >>>> This can expand into BOTH and BOTHER, as -ER is a common English
 | >>>> suffix (cart, carter and so on).  Another example: we may have
 | >>>>
 | >>>> ACIDIFY/NR
 | >>>>
 | >>>> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to
 | >>>> IER). These rules make the spellchecking dictionary quite compact as
 | >>>> well as indicate how a word should be taken apart for efficient
 | >>>> matching.
 | >>>>
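
The two expansions above can be sketched in Python. The rule table is a
trimmed-down approximation of ispell-style N and R suffix flags written from
memory for illustration, not a copy of any shipped affix file, and the
function name is made up.

```python
import re

# Each rule: (condition on the end of the stem, letters to strip, letters to add).
# The first rule whose condition matches is applied.
RULES = {
    "N": [(r"E$", "E", "ION"), (r"Y$", "Y", "ICATION"), (r"[^EY]$", "", "EN")],
    "R": [(r"E$", "", "R"), (r"[^AEIOU]Y$", "Y", "IER"),
          (r"[AEIOU]Y$", "", "ER"), (r"[^EY]$", "", "ER")],
}

def expand(entry):
    """Expand a 'STEM/FLAGS' dictionary entry into the words it stands for."""
    stem, _, flags = entry.partition("/")
    words = [stem]
    for flag in flags:
        for condition, strip, add in RULES[flag]:
            if re.search(condition, stem):
                base = stem[: len(stem) - len(strip)]
                words.append(base + add)
                break
    return words

print(expand("BOTH/R"))       # ['BOTH', 'BOTHER']
print(expand("ACIDIFY/NR"))   # ['ACIDIFY', 'ACIDIFICATION', 'ACIDIFIER']
```

Run in reverse, the same table tells a checker how to strip a suffix and look
up the remaining stem, which is the "taken apart" half of the idea.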
 | >>>> aspell is capable of deriving such rules for English but I suspect it
 | >>>> will need help in Indian languages. This is where the *.aff* file
 | >>>> comes in. So for example in Gujarati we would like to render the
 | >>>> following words in a single rule
 | >>>>
 | >>>> *ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના*
 | >>>>
 | >>>> etc. For this we can write something like
 | >>>>
 | >>>> * ગધેડ/XYZABC*
 | >>>>
 | >>>> where each letter denotes a particular suffix. And yet, there is no
 | >>>> such word as *ગધેડ* -- and I am not sure these programs can handle
 | >>>> this. And we have compound words such as *ઘોડાગાડી* -- which will
 | >>>> require more complex rules. In fact Indic languages should have a much
 | >>>> larger set of affix rules than English! We should also check out what is
 | >>>> being done for Hindi.
 | >>>>
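
A small Python sketch of the Gujarati case. The flag letters and the suffixes
attached to them are hypothetical (a real gu_IN.aff would define its own), and
expand_stem is a made-up name. On the worry about stems that are not words:
Hunspell's affix format has a NEEDAFFIX flag for stems that are only valid
with an affix, and the stem_is_word switch below imitates that.

```python
# Hypothetical flags, for illustration only: each maps to the suffixes it adds.
SUFFIXES = {
    "X": ["ો"],
    "Y": ["ી"],
    "Z": ["ું"],
    "A": ["ા"],
    "G": ["ાનું", "ાની", "ાનો", "ાના"],
}

def expand_stem(entry, stem_is_word=False):
    """Expand 'STEM/FLAGS'; the bare stem is listed only if it is a real word."""
    stem, _, flags = entry.partition("/")
    words = [stem] if stem_is_word else []
    for flag in flags:
        words.extend(stem + suffix for suffix in SUFFIXES[flag])
    return words

forms = expand_stem("ગધેડ/XYZAG")
print(" ".join(forms))
```

One dictionary entry yields the eight forms listed above without ever
accepting the bare stem. Compound words would need a separate mechanism.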
 | >>>> Next, we need rules for `similar' letters (or letters near each other
 | >>>> on a keyboard) so that if there is not an exact match, we first try such
 | >>>> similar or neighbor letters.
 | >>>>
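
One way to sketch the similar-letters idea in Python. The SIMILAR table below
only pairs short and long vowel signs and is an assumption; a keyboard-neighbor
table could be passed in the same way. Function names are made up.

```python
from itertools import product

# Groups of characters that are easily confused (assumed, not exhaustive).
SIMILAR = [("િ", "ી"), ("ુ", "ૂ")]

def variants(word, groups=SIMILAR):
    """All spellings reachable by swapping characters within their group."""
    choices = []
    for ch in word:
        group = next((g for g in groups if ch in g), (ch,))
        choices.append(group)
    return {"".join(combo) for combo in product(*choices)} - {word}

def suggest(word, dictionary, groups=SIMILAR):
    """If word is unknown, return dictionary words that differ only by similar letters."""
    if word in dictionary:
        return []
    return sorted(variants(word, groups) & dictionary)

dictionary = {"વિકિપીડિયા"}
print(suggest("વિકિપિડિયા", dictionary))
```

The number of variants doubles with every confusable character in the word,
so a real checker would cap it or test candidates one substitution at a time.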
 | >>>> Anyway, once we fix up the dictionary, very likely the same dictionary
 | >>>> can be used with word processors such as openOffice etc. An easier idea
 | >>>> may be to do a web based frontend.
 | >>>>
 | >>>> These programs do a lot of work: create dictionaries, read various file
 | >>>> formats, update screen, etc. etc. that make them complicated and hard to
 | >>>> modify. Ideally I would want a single function for checking:
 | >>>>  check(Speller, String)
 | >>>>
 | >>>> That returns a quad: (correctly spelled prefix, misspelled word, list
 | >>>> of suggestions, remaining string). A separate program can generate the
 | >>>> dictionaries. The Speller object will read whatever dictionaries it
 | >>>> needs. But I don't have time to implement this.
 | >>>>
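
A minimal Python sketch of that check(Speller, String) interface. Everything
inside Speller is an assumption: it holds a plain word set and borrows
difflib.get_close_matches from the standard library for suggestions, and words
are taken to be whitespace-separated runs.

```python
import re
from difflib import get_close_matches

class Speller:
    """Holds the dictionary; how it gets loaded is left out of this sketch."""
    def __init__(self, words):
        self.words = set(words)

    def knows(self, word):
        return word in self.words

    def suggest(self, word, n=3):
        # Sorted so that ties are broken the same way on every run.
        return get_close_matches(word, sorted(self.words), n=n)

def check(speller, text):
    """Return (correct prefix, misspelled word, suggestions, remaining text)."""
    for match in re.finditer(r"\S+", text):
        word = match.group(0)
        if not speller.knows(word):
            return (text[:match.start()], word,
                    speller.suggest(word), text[match.end():])
    return text, "", [], ""   # nothing misspelled

speller = Speller(["the", "cat", "sat"])
print(check(speller, "the cta sat"))
```

A caller can loop by feeding the returned remaining string back into check
until the misspelled-word slot comes back empty.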
 | >>>> Bakul
 | >>>>
 | >>>> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
 | >>>>
 | >>>> Has anyone tried Microsoft office Gujarati spell checker? It is
 | >>>> available with office 2010.
 | >>>>
 | >>>> Sent from the old new iPad!
 | >>>>
 | >>>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
 | >>>>
 | >>>> Googling "hindi spell checker algorithm" found a number of papers. The
 | >>>> basic idea is to compare how "similar" a word being checked is to a word
 | >>>> known to be correct, where similarity is computed using some algorithm.
 | >>>> You don't store all the ways people can misspell a word. Plus, logic is
 | >>>> used to derive related words from a root word, which depend on plurality,
 | >>>> gender, tense, etc. These rules are more complex in Indic languages than
 | >>>> western. And I think we may need to look at "clusters" instead of
 | >>>> individual unicode points. But all this must have been worked out years
 | >>>> ago. Maybe not for Gujarati but for Hindi, Marathi, Bengali. You should
 | >>>> check with the usual suspects (google, Microsoft, SIL, language
 | >>>> researchers etc.).
 | >>>>
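
A Python sketch of comparing words by "clusters" rather than code points. The
clusters function is a rough approximation (attach combining marks to the
preceding letter, and join a consonant that follows a virama); it is not a
full implementation of Unicode grapheme segmentation. The similarity measure
is plain Levenshtein edit distance, chosen here only as an example.

```python
import unicodedata

VIRAMA = "\u0acd"  # Gujarati sign virama, used to form conjuncts

def clusters(word):
    """Split a word into approximate written units (base letter plus marks)."""
    out = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if out and (is_mark or out[-1].endswith(VIRAMA)):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

print(clusters("ગધેડો"))
print(edit_distance(clusters("ગધેડો"), clusters("ગધેડી")))
```

Working on clusters means a conjunct or a letter with its vowel sign counts
as one unit, so one mistyped unit costs one edit regardless of how many code
points it contains.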
 | >>>> For OCR you may need something slightly different than spellcheckers
 | >>>> that deal with human errors. Here a more common problem will be mistaking
 | >>>> similar looking letters and joining or splitting of words due to too
 | >>>> little or too much white space.
 | >>>>
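
A short Python sketch of the joining and splitting part of the OCR problem.
Function names are made up, and the dictionary is assumed to be a plain set of
known words. Similar-looking letters could be handled with a substitution
table, which is not shown here.

```python
def split_candidates(token, dictionary):
    """Ways to cut one token into two known words (too little white space)."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in dictionary and token[i:] in dictionary]

def join_candidates(tokens, dictionary):
    """Adjacent pairs that form a known word when joined (too much white space)."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in dictionary]

words = {"spell", "check", "dictionary"}
print(split_candidates("spellcheck", words))           # [('spell', 'check')]
print(join_candidates(["dic", "tionary", "is"], words))  # [(0, 'dictionary')]
```

For Indic text the cut points should fall between clusters rather than between
code points, otherwise a vowel sign can be separated from its consonant.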
 | >>>> Ultimately there should be support for language variations too (surati,
 | >>>> kathiawadi, amdavadi etc)!
 | >>>>
 | >>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
 | >>>>
 | >>>> Dhavalbhai,
 | >>>>
 | >>>> As we get text that is generated using OCR, I see the need for a good
 | >>>> Gujarati dictionary. I tried to use the GL dictionary. It was not
 | >>>> effective because it is only a fixed corpus of words. It cannot recognize
 | >>>> any variation of a word. In that model, we would need possibly over ten
 | >>>> times the corpus the GL dictionary has for it to be useful. Otherwise, it
 | >>>> finds errors in too many correct words.
 | >>>>
 | >>>> The same dictionary could be used for Gujarati proof readers.
 | >>>>
 | >>>> One way is to generate a larger corpus by scraping words from Gujarati
 | >>>> Internet pages (those in Unicode); a better way is to think about
 | >>>> building better dictionary logic. I may be able to interest exceptionally
 | >>>> good volunteer developers if we can think of a smarter way of creating a
 | >>>> dictionary. For example, we could codify grammar rules to form
 | >>>> derivative words.
 | >>>>
 | >>>> Should we pursue this course?
 | >>>>
 | >>>>
 | >>>>
 | >>>> Sent from the old new iPad!
 | >>>>
 | >>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
 | >>>>
 | >>>> Dear Roopalben,
 | >>>>
 | >>>> I second your concern regarding correct language. I often say that
 | >>>> newspapers are the only LITERATURE most of us end up reading and have
 | >>>> access to. The language and the (increasingly common Hindi) words used in
 | >>>> them shape the language of society in the present day, and hence it is
 | >>>> great that you are introducing this course.
 | >>>>
 | >>>> Unfortunately, on wiki we don't have a spelling correction tool or
 | >>>> dictionary lookup facility. But Vishal Monpara has been developing one.
 | >>>> Gujarati Lexicon has recently developed a pop-up dictionary as well,
 | >>>> which could be adapted for this purpose.
 | >>>>
 | >>>> On gu.wikipedia, there is a lot of content translated from either
 | >>>> English or Hindi, and most of it lacks the feel of original Gujarati.
 | >>>> When read, these translations look so artificial. For the course, it
 | >>>> could be a good idea to show such examples and get the course attendees
 | >>>> to correct them, maybe offline if they are not computer savvy or are
 | >>>> hesitant to use wikipedia.
 | >>>>
 | >>>> Please let me and community here know if you have any suggestions on
 | >>>> how we can help with the task you are carrying out.
 | >>>>
 | >>>> Kind Regards,
 | >>>> Dhaval
 | >>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:
 | >>>>
 | >>>>> Basically there are not many good proofreaders available in the
 | >>>>> publishing industry - and the demand is high. That was the main reason
 | >>>>> for starting this course.
 | >>>>>
 | >>>>> Wikipedia is an important source for information. However, the concern
 | >>>>> here is about correct use of language too. Today we see a great many
 | >>>>> errors in Gujarati newspapers, publishing, media and almost everywhere.
 | >>>>> That is a high concern for us.
 | >>>>>
 | >>>>> If Wiki is going to be an important tool for the next generation, we
 | >>>>> have to make sure that it conveys correct language to the society.
 | >>>>>
 | >>>>> I would like to know whether any auto-correction of spelling etc. is
 | >>>>> available while editing an article in Wiki?
 | >>>>>
 | >>>>> Thank you.
 | >>>>>
 | >>>>>
 | >>>>> Roopal
 | >>>>>
 | >>>>>
 | >>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry
 | >>>>> <kartik.mistry(a)gmail.com> wrote:
 | >>>>>
 | >>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com>
 | >>>>>> wrote:
 | >>>>>> > At Gujarati Sahitya Parishad, we are running a proof reading course
 | >>>>>> > and we are including a session on modern methods of proof reading,
 | >>>>>> > which includes editing (Guj) Wiki articles.
 | >>>>>> >
 | >>>>>> > Please send suggestions if you have any. This is the first batch of
 | >>>>>> > students from various fields.
 | >>>>>>
 | >>>>>> Few suggestions (some may be offtopic, sorry for that!)
 | >>>>>> 1. Please follow Wikipedia's guideline for article.
 | >>>>>> 2. Make sure person is logged in before making changes.
 | >>>>>> 3. Please do not change anything other than spelling/grammar etc.
 | >>>>>> 4. If you're at it already, donating pictures of 'સાહિત્યકાર' (writers)
 | >>>>>> in various articles from GSP is a good idea. Isn't it? :)
 | >>>>>>
 | >>>>>> Thanks for good work!
 | >>>>>>
 | >>>>>> --
 | >>>>>> Kartik Mistry | IRC: kart_
 | >>>>>> {0x1f1f, kartikm}.wordpress.com
 | >>>>>>
 | >>>>>> _______________________________________________
 | >>>>>> Wikipedia-gu mailing list
 | >>>>>> Wikipedia-gu(a)lists.wikimedia.org
 | >>>>>> https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu
 | >>>>>>
 | >>>>>
 | >>>>>
 | >>>>>
 | >>>>
 | >>>>
 | >>>>
 | >>>
 | >>
 | >>
 | >>
 | >
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (Darwin)
 Comment: GPGTools - http://gpgtools.org

 iQIcBAEBAgAGBQJSQ78EAAoJEPC2gDV1D+BTb2kP/1XZiJK74jx3V+13Q5QXxiV4
 CdyRYb4YDlZsdBd1EdXXons6YB5EC9ro9xdljSgDv7MoXdjcfjPY2RHSHYsAqf6I
 0SyIBAM1sBgBbjNRB9JCHl2d9yG5sRxgF5IFW0oOjWJBT4UF18iBNHk7O0cr5TJR
 V6KX6hGMs7Koft3+cJLSce32laSC2VDi3b3Z4I6EAgongmd8HD+WdIJrJw1q8NP2
 X6lBcCnsobbyA54oET8+i8tIX9tFi32PIcYZI5+mAi3T3jauQ3VJ9kxpqEim6CNj
 6eI8X0R7tbISlwLOZERNxRcjpbaw2AAXQbKuONJsxaoQgIxC+cm+67VBkkRvbe7k
 qZEyPpBgtlfi7FgiGbG0ljuzbWpo04lErMS3ogtzi8dtyXBy5uSP2uV5B4kip52V
 BtW8gfcX6vuVUoKLEx9e5NNY+Mp99ela8QV5b5FjavBiGyz2SNEBlmXJ4BhGDDj0
 NWrXLUw+VXh8FyJf6m/fUvKYIKS5maKREIBSsKxBleCB3WrflH88nLMrW96BYXFY
 ULP53BkwQCqh72V6XPbXVffes4raS5egn6dFmMvff+WYccZRrCuBLnF/L+OSzgVI
 eaL6Q5scHou9a3UlzLYaL9KoOHoakmOMY+XLg+MFPySXRbQsCot1NsfoDrDGSRLb
 rliQquEJWulHZWC2dpcx
 =w7Ri
 -----END PGP SIGNATURE-----