[Wikisource-l] Wikisource & DjVu at Commons

Thu Nov 16 02:46:33 UTC 2006

Alexander Klauer wrote:
> on how to upload scanned texts:
> 
> it would be great if the MediaWiki DjVu inline renderer and the 
> ProofreadPage extension could be made to work together. Then one 
> could upload texts as DjVu with all its benefits (plain 
> text/image mixing, efficient storage, only one single file 
> upload), but one would still be able to extract single pages 
> into Wikisource's Page: namespace.

Ultimately, upload and download should be possible as DjVu, PDF, 
TIFF, or a ZIP archive.  All of those formats can store many 
pages in one file.  As far as I know, DjVu and PDF can also mix 
image and (OCR) text in one file, including the mapping of 
individual words to positions in the image.  In a ZIP archive, 
you could store each scanned image as 0001.jpg (or .png or .tif) 
together with its OCR text as 0001.txt, and so on.
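
A minimal sketch of that ZIP layout, in Python.  The function 
names build_volume and read_volume are made up for illustration, 
and the four-digit numbering and UTF-8 text encoding are my 
assumptions; only the 0001.jpg / 0001.txt pairing comes from the 
paragraph above.

```python
import io
import zipfile


def build_volume(pages):
    """Pack (image_bytes, ocr_text) pairs into one ZIP archive.

    Page N is stored as NNNN.jpg plus NNNN.txt (hypothetical layout).
    Returns the archive as bytes.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for number, (image, text) in enumerate(pages, start=1):
            zf.writestr(f"{number:04d}.jpg", image)
            # A str passed to writestr is stored UTF-8 encoded.
            zf.writestr(f"{number:04d}.txt", text)
    return buf.getvalue()


def read_volume(data):
    """Return {page_stem: (image_bytes, ocr_text)} for every paired page."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        volume = {}
        for name in sorted(names):
            stem, _, ext = name.rpartition(".")
            if ext == "jpg" and f"{stem}.txt" in names:
                text = zf.read(f"{stem}.txt").decode("utf-8")
                volume[stem] = (zf.read(name), text)
        return volume
```

Unlike DjVu or PDF, this layout keeps no word positions; those 
would need an extra file per page.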

A download (e.g. in PDF format, for facsimile printing) should be 
possible for all pages in a volume or for all pages belonging to a 
chapter.

Currently, pages in fr.wikisource have names such as
[[Page:Fermat - Livre 1-000008.jpg]]
so "Fermat - Livre 1" could be the ZIP filename, and 000008.jpg 
would be the image contained within the ZIP archive.  Instead of 
the dash before the page number, one might consider "/", so that 
each page becomes a subpage of the volume.
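
To illustrate, here is a rough Python sketch that splits such a 
page title into the volume (ZIP) name and the image name, and 
rewrites it in the "/" subpage form.  The regular expression and 
both function names are my own assumptions, based only on the one 
example title above.

```python
import re

# Assumed pattern: "Page:<volume>-<digits>.<ext>", as in the example title.
PAGE_TITLE = re.compile(
    r"^Page:(?P<volume>.+)-(?P<member>\d+\.(?:jpg|png|tif))$"
)


def split_page_title(title):
    """Return (volume_name, image_name) for a Page: title."""
    match = PAGE_TITLE.match(title)
    if match is None:
        raise ValueError(f"unrecognised page title: {title!r}")
    return match.group("volume"), match.group("member")


def to_subpage_title(title):
    """Rewrite the title with "/" instead of the dash before the page number."""
    volume, member = split_page_title(title)
    return f"Page:{volume}/{member}"


volume, member = split_page_title("Page:Fermat - Livre 1-000008.jpg")
print(volume)   # Fermat - Livre 1
print(member)   # 000008.jpg
print(to_subpage_title("Page:Fermat - Livre 1-000008.jpg"))
```

Only the last dash followed by digits is treated as the separator, 
so the dash inside "Fermat - Livre 1" is left alone.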

Next challenge: If the OCR text holds the position of each word in 
the image, can you combine this with JavaScript (AJAX?) to 
highlight (in yellow) in the image the word you are currently 
editing in the wiki text?  And how do you update that position 
when you move text around?
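
One possible approach, sketched in Python rather than JavaScript: 
align the edited text against the original OCR words and carry 
the bounding boxes over to the words that still match.  All names 
here (tokens, carry_boxes, box_at_cursor) and the (left, top, 
right, bottom) box format are assumptions for illustration, not 
something any existing extension provides.

```python
import difflib
import re


def tokens(text):
    """Return (word, start, end) for each whitespace-separated word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def carry_boxes(ocr_words, ocr_boxes, edited_text):
    """Give each word of edited_text the box of its matching OCR word.

    Words with no match (corrected or newly typed) get None.
    """
    edited = tokens(edited_text)
    boxes = [None] * len(edited)
    matcher = difflib.SequenceMatcher(
        None, ocr_words, [word for word, _, _ in edited], autojunk=False
    )
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            boxes[b + k] = ocr_boxes[a + k]
    return edited, boxes


def box_at_cursor(edited, boxes, cursor):
    """Return the image box of the word under the cursor offset, or None."""
    for (_, start, end), box in zip(edited, boxes):
        if start <= cursor <= end:
            return box
    return None
```

For example, if the OCR read "Cubum autcm in duos cubos" and the 
editor corrects it to "Cubum autem in duos cubos", the four 
unchanged words keep their boxes, and the corrected word has none 
until it is matched again by some other rule.  A plain diff like 
this copes with insertions and deletions, but a word moved to a 
different place will generally lose its box, so the second 
question above remains open.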

How does commercial PDF/DjVu proofreading software handle this?

There is still a lot of programming to be done for this.

-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se