Explored, Designed, Delivered.sm |
Creativyst Docs Understanding the SoundEx Algorithm |
|
How To:
Understanding Classic SoundEx Algorithms
Search Names & Phrases Based on Phonetic Similarity (w/source code)
Overview
Terms that are often misspelled can be a problem for database designers.
Names, for example, are variable length, can have strange spellings, and they
are not unique. American names have a diversity of ethnic origins, which give
us names pronounced the same way but spelled differently and vice versa.
Words too, can be misspelled or have multiple spellings, especially across different cultures or national sources. To help solve this problem, we need phonetic algorithms which can find similar sounding terms and names. Just such a family of algorithms exist and have come to be called SoundEx algorithms. A Soundex search algorithm takes a written word, such as a person's name, as input, and produces a character string that identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data. The method used by Soundex is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which are themselves based on where you put your lips and tongue to make the sounds. The algorithm itself is fairly straight forward to code and requires no backtracking or multiple passes over the input word. In fact, it is so straight forward, I will start (after a history section) by presenting it as an outline. Further on, I will give C, JavaScript, Perl, and VB code that implements the two standard algorithms used in the American Census as well as an enhanced version, which is described in this article.
History
The original Soundex algorithm was not named Soundex.
Index systems similar to what we now call "Soundex" were originally
developed and used for indexing American census records in the late
nineteenth, and early 20th century.
An algorithm that is now considered the predessesor to Soundex began in a patent application filed by Robert C. Russell in 1917. It was simply titled "INDEX". The name "Soundex" came along later, and was first registered as a trademark in 1926. The trademarked term was originally issued to Kardex Systems, Inc. Later, in the late 1960's through early 1980's, Soundex was used by the American Telegraph & Telephone company (AT&T) in their 411 telephone information service. In the years before speech recognition systems came into use, information inquiries were handled by human operators. Soundex proved to be the best way to access phone number information in response to person and place names that were spoken over telephones. The original patent titled "INDEX" was issued in 1918 as patent number 1,261,167 (filed Oct. 25, 1917). Other patents by Russell involved index systems and indexing gadgets. Some were variations and improvements of his 1917 index system:
There are considerable and reliable sources on the Web (as well as the usual unreliable source) naming one Margaret O'Dell as a co-inventor, but I have not been able to find any actual evidence of this. If you have information showing Margaret O'Dell to be a co-inventor of any of these patents, please contact me. so that she can be given credit here for her contributions. Also the patent filed in 1917 (1,261,167) is not a perfect match for present-day Soundex algorithms but is considered the first Soundex because it is very similar to the algorithms we use today. Russell did produce at least one earlier index system that assigned numbers to "key-letters" which he had filed a patent application for a couple of years earlier. None of these patents, as far as I can tell, list any inventor named O'Dell. **Finally, Russell's patent number 1,261,167 was reissued in 1923 as # Re.15,582.
The Algorithm as an Outline
Discussion
The algorithm presented above is slightly different than the
originally patented algorithm.
The original Soundex algorithm of 1918 starts to fail when the number of words in the database gets to be large. For example, the diversity of names in a large database with many foreign spellings starts to put more and more phonetically unlike names into the same code. The slightly improved algorithm presented here differs from the original in a variety of way. For example, it differs in step three, where the vowel sounds are replaced with zeros ('0'), and in step (6.0) where those zeros are removed. The original algorithm eliminated the vowels completely in step 3.0, before duplicate adjacent digits are removed. This improved version will produce SoundEx codes with duplicate digits in them; for example, "HERMAN" will code to "H06505" which will then reduce to "H655" in the last step. In the original version, "HERMAN" would code to "H655" which will then reduce to "H65" and finally to "H650" in the last step. Other SoundEx algorithms exist as well, such as LEAA codes used in crime prevention databases, and Cutter Tables used by libraries to encode author names. Each has its advantages and disadvantages. For example, Cutter Tables have as an advantage, that the codes produced maintain the alphabetical order of the original material.
Enhancements
It's not hard to think of improvements that will make this already powerful
algorithm even more robust. An example (at least to American pronunciation
sensibilities) might include replacing many multi-letter sequences that produce
unrelated sounds before performing the steps of the basic algorithm. For
example, before starting the above procedure, replace:
Challenge: Can you think of more ways to improve this basic SoundEx algorithm?
SoundEx and the Census
The U.S. census has been making use of SoundEx codes to index surnames since
the late 1800's. Those doing census lookups must use the same method to encode
surnames as the census takers did when they generated the database. That
means, for starters, our clever set of enhancements can't be used.
With one exception, the 'normal' method of encoding the census soundex codes is
identical to that described in the Algorithm as
Outline section above. The exception is a special rule for the
letters H and W.
In the method described above, if two letters from the same group are on either
side of a vowel, an 'H', or a 'W', they are considered two separate letters
and they are not removed by the 'remove adjacent digits' step (5).
But in 'normal' census SoundEx, the 'H' and 'W' were completely removed from
the conversion when surrounding letters were in the same group, so letters from
the same group were combined into a single digit when only 'H' or 'W' were
between them. For example:
normal census method:
Also of importance for those searching through census records: Often census takers would leave off the prefix in names that had prefixes when producing SoundEx codes (nobody's really sure why). In this case the SoundEx for the name VanDrake (for example) might have the soundex calculated only for 'Drake'. One more thing - Out on the web, some are saying the incorrect "special" census codes documented here are actually the correctly done census codes, while those that followed the special rule for 'H' and 'W' were the ones done in error. They are wrong. Fortunately though, you don't have to decide which side to believe (but, still, have I ever lied to you? :-) ). Simple logic should provide enough evidence to allow you to draw your own conclusion. Consider...
If logic makes your eyes glaze over, don't worry. A very convincing appeal to authority exists: No less than the U.S. National Archives (those who maintain census data) agree with the position documented here. :-) See the Related Reading section below for a link.
SoundEx Limitations
SoundEx acts as a bridge between the fuzzy and inexact process of human vocal
interaction, and the concise true/false processes at the foundation of computer
communication. As such, SoundEx is an inherently unreliable interface.
For this reason, SoundEx is only usable in applications that can tolerate high false positives (when words that don't match the sound of the inquiry are returned) and high false negatives (when words that match the sound of the inquiry are NOT returned). This limitation is true even of the best SoundEx improvement techniques available. As long as you accept and honor this limitation, SoundEx and its derivatives can be a very useful tool in helping to improve the quality and usefulness of databases. In many instances, unreliable interfaces are used as a foundation, upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible.
Permissions
This article is © Copyright, Creativyst, Inc. 2002 - 2013 ALL RIGHTS
RESERVED.
Permissions printed over code, DTD, and schema files are supported as our permission statement for these constructs. Specifically, the C, JavaScript, Perl, and VB code, -and ONLY the code-, printed in this paper may be copied and modified freely under Berkeley style open licensing as described in the comments of each code example. Links to this paper are always welcome. However, you may not copy, modify, or distribute this article without first obtaining express written permission from Creativyst, Inc. Those wishing to obtain permission to distribute this paper or derivatives in any form should contact Creativyst.
|
|
Source Code
Below you will find a set of SoundEx functions that produce identical results
based on identical inputs. They are currently provided in the following
languages:
Related Reading
The following links provide resources for those who'd like to learn more about
the SoundEx algorithm, it's limitations, enhancements, and uses.
SoundEx Converter Form
The following form uses the JavaScript SoundEx function presented above to
produce a SoundEx code for any word you enter into the field labeled
"Input Word:". It returns the SoundEx codes for three different
algorithms in the fields below. You may also enter a size for the output code
within limits (4 to 10 characters).
To test differences in the enhanced SoundEx option, use words and names like knightridder, psychology, and Pflanders. The word Knight will demonstrate two different enhancements as you type it. To see the differences between the two census options, try Ashcroft.
Contribute
If you've written a SoundEx function in another
language and would like to share it as part of this article, please contact me. If used,
your code will be attributed to you with a link to your site.
|
Note: Use of this information as the basis for
your own article without citing it is dishonest and
constitutes plagiarism. If you base any part of
your own article on information from this article
PLEASE CITE IT!
Thanks for your help & contributions:
John Nye (Tool guy) for finding bugs in Perl code
Dmitry S. Denisov for bug-finding (thanks!)
Mark Westenberger (mjw -at: igs.net) for finding bugs in C code
tests: v1.0e: Gh bug (non-census extended code): If Gh begins word it should not be replaced with 'H' only. I first noticed this ghastly little ghoul while visiting a ghetto in Ghana. I was sitting down to a meal, munching on some Gherkins, when I was approached by a Ghazi. He pulled out his sword and I was terrified (for you see, Jesus is my Lord and Savior). Imagine my relief when he handed the sword to his Ghillie. They had just caught some fish and wanted to share them and my gherkins for lunch. Once his ghillie had cleaned the fish and returned from the ghat we all sat down to a meal. The Ghazi told me he has been questioning his faith. I replied that we all go through times of questioning our faith and that such times only help to clarify that faith, like a ghee. In so doing our faith becomes stronger, I continued. Such times of questioning serve —not only— to help prove our own faith through clarity, but they also improve the institutions who claim to be the keepers of our faith. For you see, while they exercise authority over us, they also need us in order to have relevance. This dynamic is comparable to a subtle, and ongoing (if tacit) ghareo. I'm not sure if it was because I helped him find his faith, or because he simply disagreed, but at that moment he nodded to his ghillie, who swatted at me with the sword. Perhaps they just thought I was eating more than my share of the fish... AFTER-WORD: I got away with just a scratch (a director's "cut" is NOT available). Now you know why I'm a programmer and not a ghostwriter. dammit jim. v1.0d: Include test-strings with non-alpha characters. v1.0c: Schmit S530 (not S253) Schneider S536 (not S253) Lloyd L300 (not L430) Pfister P236 v1.0b: Holden (first letter is 'H') Write (first letter is 'W' followed by #6 letter) White (oooh, H/W together in 1st letter) Htacky H320 Atacky A320 ashcroft (to test normal H squeezes)