Hi Erik,
I've previously given something along these lines some thought, and rapidly came to
the conclusion that it's a huge amount of work,
yet potentially exceedingly useful (assuming it scales up to something that can handle
book-sized documents).
Essentially, something that exports a list of articles you give it into PDF format, and
then you can print the PDF file, or have
someone else professionally print the PDF for you into a nicely-bound book format, or you
can just read the PDF electronically.
So why would anyone do this? To see why, imagine a world in which:
* All teachers have complete control over their textbooks, instead of having them handed
down from on high.
* Where the content in those books is built by clicking together prefabricated blocks of
text (namely articles, or sections in
articles), like bits of Lego.
* Where the content that makes up those articles is (hopefully) free and public (such as
the Wikipedia), such that any corrections
or updates or expansions are available to all (including other teachers).
* Where impoverished high-school or uni students no longer have to pay $50 for a textbook
that they're only going to read bits of (and
probably only then right before their exam) - instead they can get the PDF and read it
electronically, or they can just print out
the bits they want, or if they really want the book they can presumably get it printed
for a better price than $50, because
there can be competition for printing/binding in a way that there isn't currently for
textbooks.
In practical terms, a use-case something along these lines was what I had in mind:
* User logs into MediaWiki, and goes to a new extension page, possibly called
[[Special:MakeMyBook]], which allows tagging multiple
articles for PDF export.
* The user can create a new document, which allows ordering / reordering the articles as
desired.
* Ideally, it would also allow including excerpts from articles (e.g. Labelled Section
Transclusion).
* Ideally, it would also allow using particular versions of articles, or maybe even
better, the latest stable version of the
article.
* Ideally, it would also allow including content from other remote wikis too, not just the
local wiki, and it would respect the
copyright terms of those wikis. For example, suppose that Wikitravel or Wikia or
Wikisource had some great material that
complemented some GFDL material in the Wikipedia. It would be really powerful to be able
to have those bits of content side-by-side
(assuming the licenses were compatible), and have it automatically include all the
necessary license-term legalese in an appendix in
5-point font. Yuri's API already includes a mechanism for getting the content license
from a wiki (assuming the remote wiki is
running a recent SVN revision).
* Once happy with the structure, the user would then click a "make my book" or
"Engage!" button.
* Some server-side job would then go off and do the processing, and email the user when
it's done and/or leave them a message on
their talk page.
* The user could then download the resulting PDF file from an FTP or web site, and
presumably after a while the file would be
deleted to free up disk space.
* The document structure / index should be saved and public, so that it can be viewed by
others (who are interested in the same
topic), or worked on by others (when collaboratively making a document), or modified by
the original author (if they want to update
or expand the document at a later date).
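To make the saved-structure idea concrete, it could be stored using something like the
<structure> wikitext tag Erik proposes below. Every element and attribute name here is
invented for illustration - no such schema exists yet - but it covers the ordering,
excerpts, pinned revisions, and remote wikis described above:

```xml
<structure>
  <book title="Volcanoes: A Reader">
    <chapter title="Basics">
      <page>Volcano</page>
      <!-- excerpt: pull in just one labelled section of an article -->
      <page section="Eruption types">Types of volcanic eruptions</page>
      <!-- pinned to a particular (e.g. stable) revision -->
      <page revision="98765432">Lava</page>
    </chapter>
    <chapter title="Further reading">
      <!-- content pulled from a remote wiki, licence terms permitting -->
      <page wiki="http://wikitravel.org/en/">Hawaii</page>
    </chapter>
  </book>
</structure>
```

Being plain wikitext, a page like this would get history, watchlists and collaborative
editing for free.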
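As for the licence mechanism: the book-builder could query each remote wiki before
mixing its content in. A minimal sketch in Python, assuming the remote wiki's api.php
is recent enough to expose the siteinfo "rightsinfo" property (that property name is my
assumption - check it against the actual API revision; the URL is a placeholder):

```python
# Sketch: ask a remote wiki for its content licence through the MediaWiki
# API, so licence terms can be collected for the appendix. Assumes the
# remote wiki exposes siprop=rightsinfo; example.org is a placeholder.
import json
import urllib.request
from urllib.parse import urlencode

def rights_query_url(api_base):
    """Build the API URL that asks a wiki for its licence name and URL."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "rightsinfo",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

def fetch_rights(api_base):
    """Fetch and decode the rightsinfo block (needs network access)."""
    with urllib.request.urlopen(rights_query_url(api_base)) as resp:
        info = json.load(resp)
    # Typically {"url": "...licence URL...", "text": "...licence name..."}
    return info["query"]["rightsinfo"]
```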
The "PDF link" in the sidebar for exporting just the current article is a great
idea that I hadn't even considered, but building
books I think is probably the most strategically important application. In a similar
vein though, you could maybe have an "Add to
book" link, which tacks the currently active article onto the end of the currently
active PDF document / book (so that people could
browse around pages related to a topic, and tag the stuff they found relevant).
The three main issues that come to mind though are:
* CPU power + disk space requirements: If it gets used in the way I would hope it would be
used, it's going to get used a lot, and
creating/optimizing large PDF files is not a computationally cheap operation. Multiply
that by many users, making multiple revisions
of each book, and you have the potential for a huge backlog of tasks, each producing some
very large files. So even if it was
super-efficient, I expect it'd need some serious CPU power, plus have large disk space
requirements.
* How to do the actual conversion to PDF (discussed below).
* Coming up with a decent user-interface for allowing the user to add / move / delete /
rearrange content (or maybe just start out
with an unordered list in text format, to start simple), and working out where to store
that information (e.g. Should it get its own
namespace, rather than cluttering up the current namespaces?)
... And then of course, there's the small issue of actually building the damn thing,
once it's determined what's being built, and
that it can theoretically be built ;-)
> - Within this budget, do you believe an alternative approach which
> utilizes an intermediate format is viable (e.g.
> wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax,
> its various extensions, and the need to keep up with parser changes?
Well, I was wondering if it was possible to "cheat", and avoid the whole
compatibility problem, by doing something like this:
+--------------------+
| Private web server |
+--------------------+
|
|
v
+--------------------+
| MediaWiki |
+--------------------+
|
|
v
+--------------------+
| Embedded Firefox |
+--------------------+
|
| Automatically print
v
+--------------------+
| PDF export |
+--------------------+
I.e. a quasi-embedded version of what already happens, just all on one box, instead of
over-the-network, and behaving like an
integrated pipeline, instead of as separate independent bits. The benefit of this is that
you can use already-existing known-working
software, and avoid reinventing the wheel. And you don't have to worry about keeping
compatibility with the parser - just leave it
as MediaWiki's problem to convert wikitext to XHTML. The downsides of this are:
* It may not even be possible to do this in a sensible way (i.e. to take these currently
completely separate bits of software, and
embed them into one large process, which takes wiki-text as input, and spits out PDF at
the end), although my current guess is that
it probably is (with a lot of hacking), but that's just a guess.
* You probably wouldn't get the "Support filters on the rendered HTML"
functionality (or at the very least it might make it harder),
so thumbnails on the current print output would look like thumbnails on the PDF output.
* Wouldn't get internal PDF hyperlinks (e.g. clicking on something in the index to
take you to somewhere in the body probably
wouldn't work).
* Would probably produce lots of separate intermediate PDF files (one per article), so
there would need to be a step for combining
many separate PDFs into one large one, whilst not leaving blank gaps between
articles. Alternatively, you'd have to create
one huge document which includes all of the required articles, and print that in one go,
which would avoid this problem, but might
be slow (just as rendering a 200-page document through MediaWiki at the moment would be
slow) and use lots of RAM.
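For a first cut of that pipeline you might not need to embed anything at all: the same
flow could be approximated by shelling out to existing tools - wget to fetch each
article's printable XHTML from MediaWiki, HTMLDOC (which Erik mentions below) to turn
each page into a PDF, and Ghostscript to merge the per-article PDFs. A hedged sketch
that only builds the command lines (the wiki URL and article titles are placeholders;
a real driver would execute these with subprocess and handle errors):

```python
# Sketch: drive existing tools as one integrated pipeline instead of
# embedding a browser. Builds (but does not run) the wget / htmldoc /
# ghostscript command lines for a list of article titles.
def build_commands(wiki_base, titles):
    cmds = []
    pdfs = []
    for i, title in enumerate(titles, 1):
        html = "page%d.html" % i
        pdf = "page%d.pdf" % i
        # &printable=yes asks MediaWiki for its print-stylesheet output,
        # leaving the wikitext-to-XHTML conversion entirely to the parser
        cmds.append(["wget", "-q", "-O", html,
                     "%s?title=%s&printable=yes" % (wiki_base, title)])
        # htmldoc --webpage renders a continuous document with no
        # per-article title page or table of contents
        cmds.append(["htmldoc", "--quiet", "--webpage", "-f", pdf, html])
        pdfs.append(pdf)
    # Ghostscript's pdfwrite device re-emits many PDFs as one document,
    # which avoids hand-stitching page objects (and blank gaps) yourself
    cmds.append(["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
                 "-sOutputFile=book.pdf"] + pdfs)
    return cmds
```

Merging after the fact sidesteps the one-huge-document rendering cost, at the price of
losing cross-article hyperlinks.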
BTW, the best "see also" web links for this would probably be
http://meta.wikimedia.org/wiki/Paper_Wikipedia and
http://pediapress.com/ ; The pediapress example in particular is interesting as they have
a working commercial implementation of some of this stuff, but it has a number of
drawbacks:
a) it only seems to be for the English Wikipedia;
b) it was using a dump from July 2006 last time I looked, so it's very out of date, plus
errors are frozen in time and cannot be removed or corrected;
c) the preview PDF they give you has "SAMPLE" stamped across every page in red letters;
d) they're professional printers, so they really want to sell printed material, not
create PDF files;
e) you can't reorder articles - everything is in alphabetical order only;
f) you can't include partial content, such as sections;
g) you can't pull content from non-local wikis;
h) there's no way to collaborate on books, or to share with others the stuff you have
worked on but not yet finished.
That said, their site is still kinda neat, and certainly worth looking at to see
what works well and what doesn't work so well.
All the best,
Nick.
> -----Original Message-----
> From: wikitech-l-bounces(a)wikimedia.org
> [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Erik Moeller
> Sent: Friday, 5 January 2007 3:02 PM
> To: Wikimedia developers; MediaWiki announcements and site admin list
> Subject: [Wikitech-l] RfC: PDF and document management for MediaWiki
>
>
> I have identified an organization which is willing to spend up to
> about EUR 10,000 on adding support for exporting MediaWiki pages as
> PDF files, and improving document management for documents consisting
> of multiple pages.
>
> My current thinking is that the functionality implemented, as a
> minimum, would be as follows:
> a) Using an extension, integrate of a "PDF link" on any wiki page
> which would call an external library like HTMLDOC on a single wiki
> page
> b) Support filters on the rendered HTML (replacing image thumbnails
> with high resolution images, filter content by regular expression,
> etc.), and revision filters (export last revision edited by user on
> whitelist Y, or approximating currentdate-Z)
> c) Create a "PDF basket" UI which makes it possible to compile a PDF
> from multiple pages easily (and rearrange the pages in a hierarchy).
> The resulting structures could potentially also be stored as wikitext,
> using a new <structure> extension tag, so that they can be used both
> by individuals compiling PDFs for personal use, and by groups
> collaborating on complex documents.
>
> Possibly some budget could also be allocated for improving the
> external PDF library used, especially if we can allocate additional
> funds for this project.
>
> I'd like to request comments on this approach, specifically:
> - Besides HTMLDOC, do you know a good (X)HTML-to-PDF library which
> could be used for this purpose?
> - Within this budget, do you believe an alternative approach which
> utilizes an intermediate format is viable (e.g.
> wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax,
> its various extensions, and the need to keep up with parser changes?
> - If you are a developer, would you be interested in working on this
> project, and available to do so? (If so, please contact me privately.)
>
> Any other comments would also be appreciated.
> --
> Peace & Love,
> Erik
>
> DISCLAIMER: This message does not represent an official position of
> the Wikimedia Foundation or its Board of Trustees.
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)wikimedia.org
>
> http://mail.wikipedia.org/mailman/listinfo/wikitech-l