Hi,
Here are the answers:
> What does "converted to Unicode" mean? Converted from what exactly? Do
> you maybe mean "converted via OCR (Optical character recognition) from
> images in file formats (JPG, PNG, images in a PDF) which don't allow
> marking text to a file format which allows marking text in those files"?
There is no good OCR for languages like Malayalam, so each scanned image is
manually typed and proofread. For example, see the 7th page of this book
<http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>. You
can see the scan image on the right and the transcribed text for that page
on the left in the *Transcript* tab. This was done for 136 books, which
together have close to 25,700 pages.
> What would you want the script to do exactly? Pull the files from the
> Tuebingen Digital Library and then mass-upload these files to Commons?
Yes, this is what is required. We will handle the Unicode migration separately.
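
Since the scans range from 100 MB to 1.5 GB, a mass-upload script would probably need to send each file in pieces rather than in one request. The MediaWiki upload API supports chunked uploads, where each chunk is sent with its byte offset. Below is a minimal sketch of only the chunk-planning step, not a full uploader. The 50 MiB chunk size and the name `plan_chunks` are assumptions made for illustration, not part of any existing tool.

```python
from typing import List, Tuple

# Assumed chunk size (50 MiB); adjust to whatever the target server accepts.
CHUNK_SIZE = 50 * 1024 * 1024


def plan_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that together cover file_size bytes.

    Hypothetical helper: each pair describes one chunk a chunked upload
    would send, starting at byte `offset` and `length` bytes long.
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    chunks = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    # A 1.5 GB scan, the upper end of the file sizes mentioned in this thread.
    plan = plan_chunks(1_500_000_000)
    print(len(plan), "chunks; last chunk (offset, length):", plan[-1])
```

A real script would first download each book from the Tuebingen Digital Library, then read the file at every planned offset and send the chunks in order. Recording the last offset sent would let an interrupted upload resume instead of starting over.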
Shiju Alex
On Sun, Dec 2, 2018 at 4:07 PM Andre Klapper <aklapper(a)wikimedia.org> wrote:
> Hi,
> Great! Some questions below for better understanding what's wanted:
> On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> > Recently Tuebingen University
> > <https://uni-tuebingen.de/en/university.html> (with the support from
> > German Research Foundation) ran a project titled *Gundert Legacy project*
> > to digitize close to 137,000 pages from *850 public domain books*.
>
> > All these public domain books are in the South Indian languages
> > *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are in
> > Malayalam, 187 in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
>
> > Also there was a separate sub-project which was run as part of this
> > project to convert 136 titles in Malayalam to Malayalam Unicode. The number
> > of pages that were converted to Unicode is close to *25,700*. The
> > Unicode conversion project was run only for Malayalam. For the other
> > languages it is just the scanning of books.
> What does "converted to Unicode" mean? Converted from what exactly? Do
> you maybe mean "converted via OCR (Optical character recognition) from
> images in file formats (JPG, PNG, images in a PDF) which don't allow
> marking text to a file format which allows marking text in those files"?
> > The project is complete now and the results of the project are available
> > in the Hermann Gundert Portal https://www.gundert-portal.de/?language=en
> > which was released on Nov 20. A news report is available here.
> > <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mi…>
>
> > To view the books in each language you can navigate through the various
> > links in the portal. For example, Malayalam books are available here:
> > https://www.gundert-portal.de/?page=malayalam
>
> > Now we need to upload these scans to Wikimedia Commons and Unicode text
> > to Malayalam Wikisource (25,700 Unicode converted pages).
>
> > The first priority is for the scans that are converted to Unicode. Is it
> > possible to write a script to migrate the scans from Tuebingen Digital
> > library to Wikimedia Commons? (I can share the exact details of books
> > converted to Unicode if needed)
> What would you want the script to do exactly? Pull the files from the
> Tuebingen Digital Library and then mass-upload these files to Commons?
> OCR (identify letters in pure images and converting those letters to
> text which could be marked and copied)? Something else?
> To convert image files available on Wikimedia Commons to recognized
> text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
> is also https://phabricator.wikimedia.org/T120788 for more info/tools.
> > All the digitized files are heavy and the size ranges from 100 MB to 1.5
> > GB depending on the number of pages in the books. So manually managing this
> > is going to be a big challenge.
>
> > Can someone help with this?
> Cheers,
> andre
> --
> Andre Klapper | Bugwrangler / Developer Advocate
> https://blogs.gnome.org/aklapper/
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l