On 07/30/2013 07:17 PM, Amgine wrote:
> Of course this is referencing spoken language which, in most
> cases, differs significantly from written language, but a
> running word corpus of 100,000,000 seems a useful target, with
> samples weighted between transcripts, periodicals, and texts
> from a delimited time and region.
> Lemmatized corpus of 6,000-10,000.

If you want to compare one year or decade to the next,
you need a similar sample from both years. One way
to get this is to narrow down to a corpus of just one
journal or newspaper. Wikisource can do this with
Popular Science Monthly,
https://en.wikisource.org/wiki/PSM
You'll get popular science and only that for every year.
You won't have romantic poetry for one year, and
theological texts for the next year. You can spot trends
in the use of words like engine/motor or steam/electricity,
just because that is what this journal is about, and
you get the same number of issues and pages each year.
Some assembly required: Most volumes of PSM are
not complete yet. Lots of proofreading remains.
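
As a rough illustration of the year-to-year comparison
described above, here is a minimal Python sketch. It assumes
you already have the proofread text of each year as a plain
string. The function names (relative_frequency, trend), the
naive tokenizer, and the two toy sentences are my own
placeholders, not real PSM data or any Wikisource API.
Counts are normalized per million running words so that
years of slightly different size stay comparable.

```python
import re
from collections import Counter


def relative_frequency(text, words):
    """Return occurrences per million running words for each target word."""
    # Naive tokenizer (assumption): lowercase runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        w: (counts[w] * 1_000_000 / total if total else 0.0)
        for w in words
    }


def trend(texts_by_year, words):
    """Map each year (in ascending order) to its per-million frequencies."""
    return {
        year: relative_frequency(text, words)
        for year, text in sorted(texts_by_year.items())
    }


# Toy stand-in texts (hypothetical), one string per year.
sample = {
    1875: "the steam engine drives the mill and the steam engine drives the pump",
    1905: "the electric motor drives the mill and the engine is idle",
}

for year, freqs in trend(sample, ["engine", "motor"]).items():
    print(year, freqs)
```

With real data you would replace the toy strings with the
text of one year's issues each. No lemmatization is done
here, so "engine" and "engines" are counted separately.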
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se