Responding inline:
Brian Wolff <bawolff <at> gmail.com> writes:
The most immediate thing that comes to mind is why create a new
interface where users can "add" words, instead of just scraping
Wiktionary? (I take it from your proposal that you plan to create a new
project where users can submit words for consideration for inclusion
in the dictionary.)
Additionally, as for experts rejecting or accepting words:
* Is that actually needed?
* Do experts actually exist who would be willing to do that sort of
thing? (This varies depending on your definition of "expert". For
example, if you mean people with PhDs in said language who will
verify the word is proper, the answer would be no. If you mean people
who are XX-3 or XX-N in the language then maybe, but I'm not really
sure how much of a benefit the review would provide relative to the
costs.)
This should not be a problem. Since we expect the public to collaborate,
those who know a given language can help "verify" a word without needing
any formal degree.
I recognize scraping is difficult for a whole host of reasons (mostly
the fact that it's semi-unstructured turns it into an NLP project, and
that standards aren't consistent across languages. However, in this case
it seems like the information needed would not be that hard [famous last
words] to extract simply by looking at categories). It seems like
making users add data to a new project is duplicating effort going on
in Wiktionary.
Since the project is a "spelling dictionary", our main concern is with
spellings. This distinction between providing details of a word
(Wiktionary) and providing spellings (our project) is what sets the two
apart.
Even if this project can't use Wiktionary for some reason, it seems to
overlap slightly with either Wikidata or OmegaWiki, and could
perhaps reuse some work from those projects in terms of storing data.
Yes, scraping Wiktionary in different languages would be a cumbersome task.
We are exploring other options to ensure that no work is duplicated and
that we stay consistent with the existing resources in the Wikimedia
community.
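For reference, the category-based extraction suggested above could start
from the MediaWiki API's list=categorymembers query. The following is
only a rough sketch: the helper name category_members_url and the
category "English lemmas" are assumptions for illustration, and it only
builds the request URL without sending it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def category_members_url(category, lang="en", limit=500, cont=None):
    """Build a MediaWiki API URL listing the pages in a Wiktionary category.

    Hypothetical helper: the category name and language edition are
    placeholders chosen by the caller.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": 0,   # main namespace only, i.e. dictionary entries
        "cmlimit": limit,
        "format": "json",
    }
    if cont:
        # Continuation token returned by a previous response, if any.
        params["cmcontinue"] = cont
    return f"https://{lang}.wiktionary.org/w/api.php?" + urlencode(params)


url = category_members_url("English lemmas")
query = parse_qs(urlsplit(url).query)
print(query["cmtitle"])
```

Page titles in the main namespace are the headwords themselves, so for a
spelling list the titles alone may be enough, without parsing page text.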
Last of all, in your proposal you give some potential db schemas. I
imagine the schema should have a language column for what language the
word is for (not to mention things get more complicated with related
languages, e.g. EN vs EN-US vs EN-CA vs EN-GB).
For this, I was considering separate tables for every different language.
I am not sure if it would be a good idea to include a language column in
a single given table.
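To make the comparison concrete, here is a minimal sketch of the
single-table alternative with a language column, using SQLite. The table
name words, the column names, and the fallback rule (a regional tag such
as en-GB also picks up words tagged with the base language en) are all
assumptions for illustration, not a settled design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE words (
        id       INTEGER PRIMARY KEY,
        lang     TEXT NOT NULL,      -- language tag, e.g. 'en' or 'en-GB'
        spelling TEXT NOT NULL,
        UNIQUE (lang, spelling)      -- one row per spelling per language
    )
""")
conn.executemany(
    "INSERT INTO words (lang, spelling) VALUES (?, ?)",
    [("en-US", "color"), ("en-GB", "colour"), ("en", "dictionary")],
)


def spellings_for(conn, lang):
    """Return spellings tagged with `lang` or with its base language."""
    base = lang.split("-")[0]
    rows = conn.execute(
        "SELECT spelling FROM words WHERE lang IN (?, ?) ORDER BY spelling",
        (lang, base),
    )
    return [row[0] for row in rows]


print(spellings_for(conn, "en-GB"))  # ['colour', 'dictionary']
print(spellings_for(conn, "en-US"))  # ['color', 'dictionary']
```

With one table per language, the same lookup would need the table name
chosen at run time and shared words duplicated across the regional
tables, which is the main trade-off to weigh.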
Also, words can have multiple meanings; perhaps you might want to split
up meaning from the word. It's not really needed if the meaning is
"immutable", but if meanings can be modified, you may want some way to
be able to identify which individual meaning was edited (and then there
are issues with history, etc., which again leads back to seeing if you
can have an existing project that has already solved those issues for
where the data comes from, instead of making a new one).
While browsing various dictionary structures (existing ones and those
proposed in research papers), I came across dicollecte (this has also
been mentioned once before). They have quite an elaborate structure,
which also covers the possibility of spellings changing.
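As a rough illustration of how edits and history could be tracked, the
sketch below keeps every spelling change as a new row instead of
overwriting the old one. This is not dicollecte's actual structure; the
tables entries and revisions and the helper functions are hypothetical
names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (
        id   INTEGER PRIMARY KEY,
        lang TEXT NOT NULL
    );
    CREATE TABLE revisions (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order
        entry_id INTEGER NOT NULL REFERENCES entries(id),
        spelling TEXT NOT NULL
    );
""")


def add_revision(conn, entry_id, spelling):
    """Record a new spelling for an entry without deleting older ones."""
    conn.execute(
        "INSERT INTO revisions (entry_id, spelling) VALUES (?, ?)",
        (entry_id, spelling),
    )


def current_spelling(conn, entry_id):
    """Return the most recently recorded spelling, or None if there is none."""
    row = conn.execute(
        "SELECT spelling FROM revisions WHERE entry_id = ? "
        "ORDER BY id DESC LIMIT 1",
        (entry_id,),
    ).fetchone()
    return row[0] if row else None


def history(conn, entry_id):
    """Return all recorded spellings for an entry, oldest first."""
    rows = conn.execute(
        "SELECT spelling FROM revisions WHERE entry_id = ? ORDER BY id",
        (entry_id,),
    )
    return [row[0] for row in rows]


conn.execute("INSERT INTO entries (id, lang) VALUES (1, 'en')")
add_revision(conn, 1, "recieve")   # initial submission with a typo
add_revision(conn, 1, "receive")   # later correction
print(current_spelling(conn, 1))   # receive
print(history(conn, 1))            # ['recieve', 'receive']
```

Because each revision has its own id, a reviewer's accept or reject
decision could later be attached to the exact revision it refers to.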
We are really grateful for the wonderful feedback from your side! I am
discussing the various possibilities you mentioned with my mentor now.
We shall keep you all updated about the progress of the project. :)
Thanks a lot!
Ankita Shukla