[Wikipedia-l] [Wikisource-l] [Commons-l] Dream a little...

Andrew Gray shimgray at gmail.com
Mon Oct 16 19:21:50 UTC 2006


On 16/10/06, Yann Forget <yann at forget-me.net> wrote:

> > 2. While OCR capacities exist for some languages, they do not exist for
> > other languages, where the material is much more likely to get lost.
> > Manuscripts in Tibetan monasteries, for example, can be scanned but not
> > OCRed easily. To make this information available, developers should be
> > paid to create adequate OCR tools for these languages. Rough cost: $5
> > million.
>
> What limits Wikisource most right now is the capacity to scan and OCR
> documents. There is no good free OCR software apart from the software
> recently released under the GPL by Google, and that works only for
> English and still has limitations. So developing good, free,
> multilingual OCR software would be my priority. AFAIK there is no good
> OCR software (free or not) for any Indian language, including Sanskrit.
> I have never seen any for Tibetan either.
>
> But having the software is not enough. A few OCR servers managed by the
> Foundation, where anyone can send an automated OCR request, would be
> very useful. There is already proprietary OCR software that can do that.

This is a very, very, very good idea. Having a dedicated system to
input TIFF images (or the like) and spit out high-grade OCR, rather
than just relying on whatever the scanning volunteer can come up with,
would help the wikisource-like projects leap ahead.

...has anyone proposed this to Project Gutenberg? If they can get the
money together, it might free up an *awful* lot of their volunteer
time.

-- 
- Andrew Gray
  andrew.gray at dunelm.org.uk


