Full support for djvu files - Wikitech-l

3 May 2012

...
  Message: 5

 Date: Thu, 3 May 2012 08:33:45 +0200

 From: Alex Brollo <alex.brollo(a)gmail.com>

 To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>

 Subject: [Wikitech-l] Full support for djvu files

 Message-ID:

       <CAH_M_mPXxD9LeMjHCm65CRAvoqN5W45O5dGO+TeH1C0f_hc4rg(a)mail.gmail.com>

 Content-Type: text/plain; charset=ISO-8859-1

 Djvu files are the wikisource standard supporting proofreading. They have

 very interesting features, being fully "open" in structure and layering,

 and allowing fast and effective sharing over the web when they are

 stored in their "indirect" mode. Most interesting, their text layer - which

 can be easily extracted - contains both the mapped text from OCR and

 metadata. A free library - DjVuLibre - allows full command line access to

 any file content.
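
 As a rough illustration of that command line access, the sketch below builds
 invocations of DjVuLibre's djvused tool that dump the mapped text layer
 (print-txt) and the metadata (print-meta). It assumes djvused is installed;
 the file name book.djvu and the helper names are hypothetical. The commands
 are only assembled here, not executed.

```python
def djvused_command(djvu_path, script):
    """Build the argument list for a single djvused invocation."""
    # -e passes the djvused commands inline instead of reading a script file.
    return ["djvused", djvu_path, "-e", script]


def text_layer_command(djvu_path, page=None):
    """Dump the hidden (mapped) text; optionally select one page first."""
    script = "print-txt" if page is None else f"select {page}; print-txt"
    return djvused_command(djvu_path, script)


def metadata_command(djvu_path):
    """Dump the metadata stored in the annotations."""
    return djvused_command(djvu_path, "print-meta")


# Hypothetical file name, used only for illustration.
cmd = text_layer_command("book.djvu", page=3)
print(" ".join(cmd))
```

 Any of these lists could be handed to subprocess.run(cmd, capture_output=True,
 text=True) to capture the output for further processing.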

 Presently, the structure and features of djvu files are minimally used. Indirect

 mode is IMHO not supported at all, there's no means of accessing the mapped text

 layer or the metadata, and the "full text" can be accessed only once, when

 creating a new page in the Page namespace.

 It would be great IMHO:

 * to support indirect mode as the standard;

 * to allow free, easy access to the full text layer content from the wikisource

 user interface.

 Alex

The text layer is stored in img_metadata, which means it can be retrieved
by the API (using ?action=query&prop=imageinfo&iiprop=metadata).
However, when I tried to test this, it didn't seem to work. Maybe
trying to return the entire text layer hit some maximum API result size
limit or something. (It'd be really nice if we had some nicer place to
store information about files, especially for huge things like the
text layer, which we generally don't want to load in its entirety all
the time. There's a bug about that somewhere in Bugzilla land.)
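
For reference, a minimal sketch of how that query could be assembled. The
endpoint (Wikimedia Commons) and the file title are assumptions picked for
illustration; only the action, prop and iiprop parameters come from the query
above. titles and format=json are standard API parameters added so the request
names a file and returns JSON.

```python
from urllib.parse import urlencode

# Assumed endpoint; any MediaWiki installation exposes its own api.php.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_metadata_query(file_title, endpoint=API_ENDPOINT):
    """Return the URL asking the API for a file's stored metadata."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": file_title,
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical file title, used only for illustration.
url = build_metadata_query("File:Example.djvu")
print(url)
```

Fetching the URL (for instance with urllib.request.urlopen) is left out, since
whether the full text layer actually comes back is exactly the open question.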

Indirect mode (from what I can find out from Google) is when you have
an index djvu file that has links to all the pages making up the djvu
file, so you can start viewing immediately and pages are only
downloaded as needed. I'm not sure how such a format would work in
terms of uploading it. Unless we convert it on the server side, how
would we upload all the constituent files? (I suppose we could tell
people to upload tarballs. Then we'd have to make sure to validate the
contents, and communicate to people that tarballs are only for
uploading djvu files.) [Of course, until 5 minutes ago I'd never heard
of an indirect djvu file, so I could be misunderstanding.]
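
If the conversion were done on the server side, one possible route is
DjVuLibre's djvmcvt tool, which can split a bundled file into an index file
plus per-page component files. The sketch below only builds the command. It
assumes djvmcvt is installed, and the paths and helper name are hypothetical.

```python
def indirect_conversion_command(bundled_path, output_dir, index_name="index.djvu"):
    """Build the djvmcvt call that converts a bundled djvu to indirect form."""
    # -i selects indirect output: component files are written to output_dir,
    # together with an index file called index_name.
    return ["djvmcvt", "-i", bundled_path, output_dir, index_name]


# Hypothetical paths, used only for illustration.
cmd = indirect_conversion_command("upload.djvu", "upload_pages")
print(" ".join(cmd))
```

Running it would be a matter of subprocess.run(cmd, check=True) after the
uploaded file has been validated, with output_dir already created.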

-bawolff