Phonetic indexing - Discovery - lists.wikimedia.org

Kevin Smith

14 Jan

10 p.m.

Cool idea. I would also be inclined to limit it to searches containing 4 or fewer words/tokens. My only experience is with soundex, which was invented in 1918, so I'm probably not the one to ask. :P Kevin Smith Agile Coach, Wikimedia Foundation On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <tjones(a)wikimedia.org> wrote:


For some reason today I wanted to look up Mikhail Baryshnikov. It's been a while, so I forgot how to spell his last name. I didn't try very hard, and I got no enwiki result. Google, of course, found the correct spelling, which I then used on enwiki.

Since I used to do name searching and matching, this gave me an idea, which generalizes beyond just names. For every article title (and maybe each redirect; we could look into that) we could generate a phonetic index[1] and store those in a special Elasticsearch index. (We could look at storing multiple phonetic indexes for better recall, possibly generated by multiple algorithms; some, like Double Metaphone, generate multiple indexes by themselves.)

Then, under certain circumstances (say, zero results and no suggestion from any other source, or no result with a score above a certain cutoff, or too few results, etc.), we could make a suggestion and/or show results based on a matching phonetic index plus some score (say, a mix of page views and page rank, or whatever scoring we've got going on).

So, when some doofus (hey, that's me!) comes along and searches for "borishnakoff" (worse than what I actually searched for), we could correct to *baryshnikov* (there's a page with that title) or give *Mikhail Baryshnikov* as a result (likely the top scoring item with the same phonetic index in the title), or something similar.

Other algorithms exist (and can be devised) for languages other than English, so the maximally fleshed out version of this would offer a choice of phonetic indexing algorithms, but I get ahead of myself.

*Has anyone looked into this kind of phonetic indexing for enwiki, Wikipedia in general, or other wikimedia projects before?*

I have some additional thoughts on how to test the effectiveness of phonetic indexing on zero results for enwiki without having to fully implement everything, if the index sounds like something we could afford to build. Thoughts?

- Trey

[1] https://en.wikipedia.org/wiki/Phonetic_algorithm - Briefly, as an example, you drop non-initial vowels and duplicate letters, and collapse letters that tend to sound alike, while taking into account orthographic conventions like sh, ch, th, initial kn- or pt-, etc. So both *baryshnikov* and *borishnakoff* are likely to come out something like BRXNGV.

Trey Jones Software Engineer, Discovery Wikimedia Foundation

_______________________________________________ discovery mailing list discovery(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
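To make the footnote's recipe and the zero-results fallback a bit more concrete, here is a rough Python sketch. It is not Soundex, Metaphone, or anything that exists in the search stack: the rule tables, the function names (`phonetic_key`, `build_index`, `suggest`), and the scores are all made up for illustration. With these toy rules both spellings come out as BRXNKF rather than the BRXNGV guessed at above, which is fine since the only thing that matters is that they agree.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny rule set (an assumption for illustration).
INITIAL_CLUSTERS = (("kn", "n"), ("pt", "t"))            # initial kn-, pt-
DIGRAPHS = (("tch", "x"), ("sch", "x"), ("sh", "x"),
            ("ch", "x"), ("th", "t"), ("ph", "f"), ("ck", "k"))
SOUND_ALIKE = str.maketrans("pvcqgdz", "bfkkkts")        # collapse similar letters
DROPPED = set("aeiouyhw")                                # dropped when non-initial


def phonetic_key(word: str) -> str:
    """Return a crude code that similar-sounding spellings tend to share."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    for cluster, repl in INITIAL_CLUSTERS:
        if w.startswith(cluster):
            w = repl + w[len(cluster):]
            break
    for digraph, repl in DIGRAPHS:
        w = w.replace(digraph, repl)
    w = w.translate(SOUND_ALIKE)
    w = re.sub(r"(.)\1+", r"\1", w)                      # drop duplicate letters
    return (w[0] + "".join(c for c in w[1:] if c not in DROPPED)).upper()


def build_index(title_scores: dict) -> dict:
    """Map each phonetic key to the (score, title) pairs whose title contains it."""
    index = defaultdict(list)
    for title, score in title_scores.items():
        for word in title.split():
            key = phonetic_key(word)
            if key:
                index[key].append((score, title))
    return index


def suggest(query: str, index: dict, limit: int = 3) -> list:
    """Fallback for a zero-result query: best-scoring titles sharing a key."""
    hits = {}
    for word in query.split():
        for score, title in index.get(phonetic_key(word), []):
            hits[title] = max(score, hits.get(title, 0))
    ranked = sorted(hits.items(), key=lambda kv: -kv[1])
    return [title for title, _ in ranked[:limit]]


# Made-up scores standing in for page views or similar.
index = build_index({"Mikhail Baryshnikov": 900, "Baryshnikov": 300,
                     "Boris Johnson": 5000})
print(phonetic_key("baryshnikov"), phonetic_key("borishnakoff"))  # BRXNKF BRXNKF
print(suggest("borishnakoff", index))  # ['Mikhail Baryshnikov', 'Baryshnikov']
```

Indexing one key per title word is just one possible choice; a real version would have to decide how to treat multi-word queries and how to blend the phonetic match with existing scoring.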


Oliver Keyes

10:25 p.m.

I really love this idea! On 14 January 2016 at 14:11, Deborah Tankersley <dtankersley(a)wikimedia.org> wrote:


I was thinking about something like that earlier this week, when I was hearing about searching for a term in a different language (other than English) on the en.wikipedia site and not getting any results. Could the phonetic 'search' be used for that too? Do we have any idea how many pages (in English and otherwise) have the phonetic spelling for the main topic? Just some additional thoughts.... Deb On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <ksmith(a)wikimedia.org> wrote:

Cool idea. I would also be inclined to limit it to searches containing 4 or fewer words/tokens. My only experience is with soundex, which was invented in 1918, so I'm probably not the one to ask. :P Kevin Smith Agile Coach, Wikimedia Foundation

-- Deb Tankersley Product Manager, Discovery Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation


Deborah Tankersley

11:07 p.m.

Fuzzy matching <https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)> FTW? ;) On Thu, Jan 14, 2016 at 2:51 PM, Trey Jones <tjones(a)wikimedia.org> wrote:


There are lots of possible implementations of phonetic searching. Limiting based on query term count would save lots of overhead, and limiting it to terms that aren't in the index (or have very, very low counts) could work, too. These are things we could test beforehand, to see what the expense and benefit of computing various things work out to be.

Soundex *is* pretty old, but it works okay. It's easily modified to be a bit smarter, too. The baseline implementation only considers the first few consonants to maximize recall for genealogists who are willing to sort through lots of hay to find that needle. Double Metaphone seems to be out there and available (may require a consultation with a lawyer), while Metaphone 3 is clearly for sale (the license is pretty nice as long as you don't want to share it).

As for using it with other languages, hmmm, I have to think. The phonetic "index" generally would not be directly searchable in normal text; it isn't a phonetic representation of the word, it's just a code that similar-sounding words tend to share.

Phonetic spelling comes in a few varieties on enwiki. There are IPA spellings[1] and dictionary-style phonetic spellings. The dictionary spellings can have different conventions (I don't know how well standardized they are on enwiki; linguists have been pushing for IPA since it is standardized). But even IPA can have differences of detail that make it unsearchable. Gorbachev has three IPA pronunciations: /ˈɡɔrbəˌtʃɔːf, -ˌtʃɒf/ in English, and ɡərbɐˈtɕɵf in Russian. The first one includes primary and secondary stress information, the second one is only the last syllable of the name, and the third one has primary stress info. Leave out any of the stress info, or try to search for the second pronunciation, and you don't get a match. So, I don't think we can leverage the phonetic spellings that are in articles.

However, it would definitely work for reasonable spellings of many words of non-English origin. Possibly *aparrachick* for *apparatchik*, probably *shadenfroid* for *schadenfreude*, but probably not *paree* for *Paris* (there's already a redirect for that, though!). It depends a lot on the spelling system of the source language (French has too many silent letters, for example) or the transliteration system used, and the history of the borrowing (when spelling and sound don't match up, English tends to keep one and adapt the other, which is good, but sometimes it turns weird).

[1] https://en.wikipedia.org/wiki/International_Phonetic_Alphabet - favored by linguists, woo hoo!

- Trey

Trey Jones Software Engineer, Discovery Wikimedia Foundation
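The remark about baseline Soundex only looking at the first few consonants can be illustrated with a small sketch of the commonly described American Soundex rules. This is only an illustration, not a proposal for what the search backend should run; the function name `soundex` and the `length` parameter (4 is the classic code length) are choices made for this example.

```python
import string

# Standard Soundex consonant classes; vowels and y get no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str, length: int = 4) -> str:
    """First letter plus digits for the following consonants, zero-padded."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code               # a vowel resets prev, allowing a repeat
    return "".join(out)[:length].ljust(length, "0")


# Both spellings agree, but so does an unrelated name: lots of hay.
print(soundex("baryshnikov"), soundex("borishnakoff"), soundex("barsanti"))
# B625 B625 B625

# A longer code (one easy way to make it "a bit smarter") separates them.
print(soundex("baryshnikov", 6), soundex("borishnakoff", 6), soundex("barsanti", 6))
# B62521 B62521 B62530
```

The four-character cutoff is what buys the high recall: everything after the third coded consonant is ignored, so precision has to come from whatever scoring is applied afterwards.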

-- Deb Tankersley Product Manager, Discovery Wikimedia Foundation


Oliver Keyes

15 Jan

1:05 a.m.

See, I think of Scalzi's "Fuzzy Nation", which, if you haven't read it, you should. It's really fun (and my copy is signed, because being friends with famous authors' daughters can be tremendously valuable, but only in very limited ways). On 14 January 2016 at 15:15, Trey Jones <tjones(a)wikimedia.org> wrote:


On Thu, Jan 14, 2016 at 3:07 PM, Deborah Tankersley <dtankersley(a)wikimedia.org> wrote:

Fuzzy matching FTW? ;)

Ha! Unfortunately, I always read "fuzzy" as "expensive" in this context.

-- Oliver Keyes Count Logula Wikimedia Foundation
