Re: [Wiktionary-l] [Wikitech-l] Listing missing words of wiktionnaries

31 Jul 2013

On 07/30/2013 07:17 PM, Amgine wrote:
...
 Of course this is referencing spoken language which, in most cases,
 differs significantly from written language, but a running word corpus
 of 100,000,000 seems a useful target, with samples weighted between
 transcripts, periodicals, and texts from a delimited time and region.
 Lemmatized corpus of 6,000-10,000.

If you want to compare one year or decade to the next,
you need a similar sample from both years. One way
to get this is to narrow down to a corpus of just one
journal or newspaper. Wikisource can do this with
Popular Science Monthly,
https://en.wikisource.org/wiki/PSM

You'll get popular science and only that for every year.
You won't have romantic poetry for one year, and
theological texts for the next year. You can spot trends
in the use of words like engine/motor or steam/electricity,
just because that is what this journal is about, and
you get the same number of issues and pages each year.
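
A minimal sketch of such a year-by-year comparison, assuming the
proofread text of each volume is already available as a plain string
keyed by year. The function name relative_frequencies and the two
sample sentences are made up for illustration; they are not PSM text
or anything Wikisource provides. Counts are normalized per 1,000
running words, so years with slightly different page counts stay
comparable.

```python
import re
from collections import Counter


def relative_frequencies(texts_by_year, words):
    """Return {year: {word: occurrences per 1,000 running words}}."""
    result = {}
    for year, text in sorted(texts_by_year.items()):
        # Crude tokenizer: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        total = len(tokens)
        result[year] = {
            w: (1000.0 * counts[w] / total if total else 0.0)
            for w in words
        }
    return result


# Hypothetical sample sentences standing in for one volume per year.
samples = {
    1875: "The steam engine drives the mill and the steam pump.",
    1905: "The electric motor replaces the engine where electricity is cheap.",
}

freqs = relative_frequencies(
    samples, ["engine", "motor", "steam", "electricity"]
)
for year, row in freqs.items():
    print(year, row)
```

This counts surface forms only. Matching "engines" with "engine"
would need a lemmatizer, which is left out here.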

Some assembly required: Most volumes of PSM are
not complete yet. Lots of proofreading remains.

-- 
   Lars Aronsson (lars(a)aronsson.se)
   Aronsson Datateknik - http://aronsson.se
