The Swedish Wikisource is copying scanned books from
various sources. You typically find a PDF or DJVU file,
containing both scanned images and raw OCR text,
which you upload to Commons; you then create an Index: page
with the <pagelist/> tag.
Some of these books have pretty miserable OCR text,
perhaps because the Norwegian National Library scanned
a Swedish book with their OCR software set to Norwegian.
Somebody with an OCR program needs to run a new OCR
on these images. Fortunately, it is quite easy to
feed the PDF or DJVU file into an OCR program such
as Finereader, and use a bot to update the pages.
We now have one user on sv.wikisource doing this.
For these Index: pages, I created a category:OCR-kö
(meaning: queue of OCR requests). When trying to interwiki
link, I found a similar category on de.wikisource, but
similar categories on fr, en, and pt had been removed.
What's the story behind that? Don't you need OCR
requests in these languages? The comment on the English
page mentions an OCR robot on the toolserver. Really?
These exist:
http://de.wikisource.org/wiki/Kategorie:OCR-Anfragen
http://sv.wikisource.org/wiki/Kategori:OCR-k%C3%B6
These were removed in June 2009:
http://en.wikisource.org/wiki/Category:OCR_Requests
http://fr.wikisource.org/wiki/Cat%C3%A9gorie:Demandes_d%27OCR
http://pt.wikisource.org/wiki/Categoria:!Pedidos_de_OCR
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
At the Wikimedia chapter meeting in Berlin last week,
Wikisource was mentioned as an interesting project in
several different settings.
I know a lot of interesting projects and attempts are
being tried in various languages of Wikisource, but
perhaps there isn't enough coordination and exchange
of ideas and experience between all volunteers.
How could we improve this? I personally think this
mailing list is the first place to start. We could
all write short notices of any new idea or project
that we are undertaking. Then we should probably
get together at a session during the Wikimania
conference in Gdansk this summer.
The Wikimedia "chapters" are national or regional
membership associations that provide means to go
beyond the ordinary project volunteer communities,
for example when expenses need to be covered for
travel or equipment, or when contracts need to be
signed. One example is that Wikimedia France recently
signed a deal with the Bibliothèque nationale de France
to provide access to scanned images of books,
which can be proofread on fr.wikisource.org.
Deals of this kind fit in with a larger pattern,
where chapters seek collaboration with galleries,
libraries, archives and museums (= GLAM), hoping
that they will contribute free images to Wikimedia
Commons and Wikipedia. Or where museums will allow
wikipedians to take photos of their collections.
Other chapters are buying scanners for volunteers
to use. But perhaps digital cameras are more
useful than scanners these days. How many know
how to use them correctly? Maybe we need workshops.
Many of the chapters are now growing fast and are
quite successful at fundraising. This creates an
interesting challenge to fund projects that really
make a difference. The chapters need good projects
to fund, which they can show off to donors at the
coming fundraiser in fall 2010/winter 2011.
I think Wikisource has a lot of potential for
supplying chapters with good projects to fund.
--
Lars Aronsson (lars(a)aronsson.se)
Wikimedia Sverige - stöd fri kunskap - http://wikimedia.se/
Hi everyone,
The next strategic planning office hours are:
Tuesday, 6 April, from 20:00-21:00 UTC, which is:
-Tuesday (1-2pm PDT)
-Tuesday (4-5pm EDT)
Office hours will be a great opportunity to discuss the work that's
happened as well as the work to come.
As always, you can access the chat by going to
https://webchat.freenode.net and filling in a username and the channel
name (#wikimedia-strategy). You may be prompted to click through a
security warning. It's fine. More details at:
http://strategy.wikimedia.org/wiki/IRC_office_hours
Thanks! Hope to see many of you there.
____________________
Philippe Beaudette
Facilitator, Strategy Project
Wikimedia Foundation
philippe(a)wikimedia.org
Imagine a world in which every human being can freely share in
the sum of all knowledge. Help us make it a reality!
http://wikimediafoundation.org/wiki/Donate
It is increasingly common to add books to Wikisource
by finding a PDF or DjVu file, uploading it to Commons,
and then creating an Index: page on Wikisource
for proofreading.
But this would be much easier if:
1) The fields (author, title, etc.) of the Index
page were filled in from the data already given
on Commons. (Yes, those could be wrong or need
additional care, but this could always be
edited afterwards, if initial values are fetched
from Commons.)
2) The <pagelist/> tag was already in the
"pages" box.
3) All pages were created automatically
with the OCR text from Commons, instead
of leaving a long list of red links. (This
would require the text for each page to be
extracted, something that pdftotext can do
in seconds, but Commons takes weeks to do.)
Could this be automated? Is there already
some tool or bot that does this?
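
As a rough illustration of point 3 above, here is a minimal Python
sketch, not an existing tool. It assumes that pdftotext (from
poppler-utils) is installed, that its output ends each page with a
form feed character, and that pages live under titles of the form
"Page:<file name>/<number>" (the namespace name differs between
language editions). The function names are hypothetical, and the
step that actually saves the pages to the wiki is left out.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the whole text layer of the file."""
    # "-" as the output file name sends the text to standard output.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(text):
    """Split pdftotext output into one string per page.

    Assumption: every page ends with a form feed, so the piece after
    the last form feed is empty and is dropped. Blank pages in the
    middle are kept so that page numbers stay aligned with the scan.
    """
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def page_titles(file_name, page_count):
    """Build titles such as "Page:Book.pdf/1" (namespace is assumed)."""
    return ["Page:%s/%d" % (file_name, n) for n in range(1, page_count + 1)]


def build_page_texts(file_name, text):
    """Map each page title to the OCR text of that page."""
    pages = split_pages(text)
    return dict(zip(page_titles(file_name, len(pages)), pages))
```

A bot could then loop over build_page_texts("Book.pdf",
extract_text("Book.pdf")) and create each page that is still a red
link.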
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Ok for this one, then. But many others on other wikisources are, I guess, unfair. Does any admin or developer have the rights to modify this stuff on each domain?
Regards,
Syagrius
From: Michael Jörgens
Sent: 15.03.10 18:18
To: discussion list for Wikisource, the free library
Subject: Re: [Wikisource-l] Are the Page Views statistics fair enough ?
{{externesBild}} is quite understandable. On the German-language Wikisource it is mandatory that the scan of the page is publicly available and linked; I think it is something like Special:Filepath/XX. A lot of old books are scanned by universities and we link to the pages there, but our main source for the scans is Commons.
greetings
2010/3/15 <syagrius(a)gmx.fr>
Since the wikisource sub-domains are now ranked by page view count, I guess the count should at least be fair. If we look at the statistics for December 2009, we see that an important part of the English wikisource traffic (http://stats.grok.se/en.s/ ) comes from "Special:AutoLogin", of the Russian one (http://stats.grok.se/ru.s/ ) from "Special:Filepath/XXpng", of the German one (http://stats.grok.se/de.s/ ) from "{{{EXTERNESBILD}}}" (I don't know what it is) and some "png", and of the Spanish one (http://stats.grok.se/es.s/ ) from "Special:Filepath/XXpng", etc.
There was the same problem on French wikisource some months ago, and it was fortunately corrected. Could anyone correct all this on every sub-domain?
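One way to make the comparison fairer, sketched here in Python under stated assumptions, is to leave such technical titles out before summing the views. The rule below (titles starting with "Special:" or with an unexpanded "{{{" parameter) is only a guess based on the examples above; localized special-page prefixes would need to be added, and the numbers are hypothetical.

```python
def is_artifact(title):
    """Guess whether a title is a technical artifact, not a real page."""
    # Assumption: special pages use the canonical "Special:" prefix and
    # unexpanded template parameters show up with triple braces.
    return title.startswith("Special:") or title.startswith("{{{")


def fair_total(views):
    """Sum page views over the titles that are not artifacts."""
    return sum(count for title, count in views.items() if not is_artifact(title))


# Hypothetical numbers, for illustration only.
sample = {
    "Main Page": 1000,
    "Special:AutoLogin": 5000,
    "Special:Filepath/XXpng": 3000,
    "{{{EXTERNESBILD}}}": 2000,
    "Hamlet": 400,
}
print(fair_total(sample))
```

Only "Main Page" and "Hamlet" count toward the total in this sample.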
Regards,
Syagrius
_______________________________________________
Wikisource-l mailing list
Wikisource-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Hi everyone,
I hope I'm not off topic on this mailing-list.
I'm looking for a scanner to scan books for Wikisource. I'm rather
confused by the quantity of products on the market, plus many reviews
focus on film scanners. Also, I need the scanner to be Mac compatible.
Would you have some advice? Thanks in advance.
--
Marie-Lan / Jastrow