In a separate thread (sorry--digest mode bit me), Dominic wrote:
Many cultural institutions are developing their own crowdsourced transcription
projects. I think Wikisource can be a much more robust platform than these
one-off projects, with a more well-developed community that aggregates the
transcription efforts of texts from many institutions in a single place
with a proven process.
I'm a big fan of Wikisource, and have recommended it, but I don't think that data extraction is the biggest barrier to adoption the GLAM sector faces. Branding is a much, much bigger deal. I talked about this at the ALA this summer ( http://manuscripttranscription.blogspot.com/2014/07/collaborative-digitization-at-ala-2014.html -- see the slide with a screenshot of Wikisource next to one of Letters 1916, which uses DIY History/Scripto as its platform):

"The first one is is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia that was spun off around 2003 that allows people to transcribe documents and do OCR correction both. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

"So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together indiviual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

"By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site. "

There were a lot of nods in the room, and even more when I revisited the slide in a crowdsourcing workshop a month later.

If an institution were able to attach a custom stylesheet to pages displaying its 'project', if it were able to send users to an attractive homepage for its 'project', showing the project's materials, and recent activity on them, with ways for admins to monitor their volunteers' questions or discussions on talk pages, or announce news -- that would drop that barrier to entry.  At the moment, a GLAM that points its users to Wikisource effectively 'loses' them -- they're sending them off to a different community and a different site that just happens to contain copies of the institution's material, with no easy way for the users to get back to the institution.

That said, I think bulk export of transcripts would help, especially if there were an easy way for the institution to match each transcript to the identifier in its own system.  Plaintext may be good enough for, e.g., a library that's using a CMS and just wants its docs to be searchable.  I've seen TEI recommended in the past, and while I'm a big fan, I suspect it's of secondary importance.
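
To make the identifier-matching idea concrete, here is a rough sketch in Python. It is not an existing Wikisource feature: the function name, the page titles, and the title-to-identifier mapping are all hypothetical, and it assumes the institution already has the transcripts in hand as plain text and that its identifiers are safe to use as file names.

```python
import csv
from pathlib import Path


def export_transcripts(transcripts, id_map, out_dir):
    """Write one plaintext file per transcript, named by the institution's identifier.

    transcripts: dict mapping a wiki page title to its transcribed text.
    id_map: dict mapping those page titles to identifiers in the
            institution's own system (hypothetical; supplied by the institution).
    Returns the page titles that had no identifier in id_map.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    unmatched = []
    rows = []
    for title, text in sorted(transcripts.items()):
        identifier = id_map.get(title)
        if identifier is None:
            # No match in the institution's system: report it instead of guessing.
            unmatched.append(title)
            continue
        filename = f"{identifier}.txt"
        (out / filename).write_text(text, encoding="utf-8")
        rows.append((identifier, title, filename))

    # A manifest lets a CMS import step tie each file back to its record.
    with open(out / "manifest.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["identifier", "page_title", "file"])
        writer.writerows(rows)

    return unmatched
```

The point of the manifest is that a library could load the plaintext into its CMS for search without ever needing to understand wiki page titles, and the returned list tells it which transcripts still need an identifier.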

Ben