This is a discussion on What is Soundex? within the PHP Programming forums, part of the Web Development category.
An improved algorithm:
1 Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
2 Retain the first letter of the word.
3 Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4 Change letters from the following sets into the digit given:
* 1 = 'B', 'F', 'P', 'V'
* 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
* 3 = 'D', 'T'
* 4 = 'L'
* 5 = 'M', 'N'
* 6 = 'R'
5 Collapse each run of identical digits that occur beside each other in the string that resulted from step (4) into a single digit.
6 Remove all zeros (placed there in step 3) from the string that results from step (5).
7 Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
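The seven steps above can be sketched in PHP roughly as follows. This is only an illustration, not the poster's code: the function name `improved_soundex` is made up, the blank padding from step 1 is skipped because PHP strings do not need it, and it assumes that the digit standing for the first letter takes part in the duplicate collapse of step 5 before the letter itself is put back.

```php
<?php
// Hypothetical helper that follows the seven listed steps.
function improved_soundex(string $word): string
{
    // Step 1: uppercase, then drop everything that is not A-Z.
    $word = preg_replace('/[^A-Z]/', '', strtoupper($word));
    if ($word === '') {
        return '';
    }

    // Step 2: remember the first letter.
    $first = $word[0];

    // Steps 3 and 4: translate every letter into its digit.
    $digits = strtr(
        $word,
        'AEIOUHWYBFPVCGJKQSXZDTLMNR',
        '00000000111122222222334556'
    );

    // Step 5: collapse runs of identical adjacent digits
    // (assumption: the first letter's digit is part of this pass).
    $digits = preg_replace('/(\d)\1+/', '$1', $digits);

    // Drop the digit that stands for the first letter,
    // then step 6: remove the zeros placed there in step 3.
    $digits = str_replace('0', '', substr($digits, 1));

    // Step 7: pad with zeros and keep four positions.
    return substr($first . $digits . '000', 0, 4);
}

echo improved_soundex('Robert'), "\n";      // R163
echo improved_soundex('Dostoyevski'), "\n"; // D231
```

Because step 3 turns 'H' and 'W' into zeros, they separate equal digits just like vowels do, so this variant can differ from PHP's built-in soundex() on names such as 'Ashcraft'.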
__________________ With, J. Jeyaseelan Everything Possible
It's not hard to think of improvements that will make this already powerful algorithm even more robust. One example (at least to American pronunciation sensibilities) is to replace many multi-letter sequences that produce unrelated sounds before performing the steps of the basic algorithm. For example, before starting the above procedure, replace:
* DG with G
The conversion enhancement for PF would not normally be needed, because both letters are in the same group (group 1). However, since this conversion applies only at the start of the word, it must be included, because the first letter is preserved both in this algorithm and in classic SoundEx.
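As a rough sketch of the pre-substitution idea, the snippet below applies only the DG to G replacement named in the post and then calls PHP's built-in soundex(). The helper name `soundex_with_dg_rule` is hypothetical, and no other replacement rules are assumed.

```php
<?php
// Hypothetical helper: applies only the DG -> G replacement,
// then hands the result to PHP's built-in soundex().
function soundex_with_dg_rule(string $word): string
{
    $prepared = str_replace('DG', 'G', strtoupper($word));
    return soundex($prepared);
}

echo soundex('Bridge'), "\n";              // B632
echo soundex_with_dg_rule('Bridge'), "\n"; // B620
echo soundex('Brige'), "\n";               // B620
```

With the rule in place, 'Bridge' and the misspelling 'Brige' get the same code, which plain soundex() does not give.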
SoundEx Limitations
SoundEx acts as a bridge between the fuzzy and inexact process of human vocal interaction and the concise true/false processes at the foundation of computer communication. As such, SoundEx is an inherently unreliable interface. For this reason, SoundEx is only usable in applications that can tolerate a high rate of false positives (words are returned that don't match the sound of the inquiry) and a high rate of false negatives (words that match the sound of the inquiry are NOT returned). This limitation is true even of the best SoundEx improvement techniques available. As long as you accept and honor this limitation, SoundEx and its derivatives can be very useful tools for improving the quality and usefulness of databases. In many instances, unreliable interfaces are used as a foundation upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible.
Although the standard soundex string is 4 characters long, and this is what PHP's soundex() function returns, some database programs return a string of arbitrary length. MySQL is one example. The MySQL documentation covers this and suggests using SUBSTRING() to get the standard 4 characters. Let's take 'Dostoyevski' as an example.
SELECT SOUNDEX("Dostoyevski"); returns D2312
SELECT SUBSTRING(SOUNDEX("Dostoyevski"), 1, 4); returns D231
PHP will return the value as 'D231'.
So, to use the soundex function to generate a WHERE parameter in a MySQL SELECT statement, you might try this:
$s = soundex('Dostoyevski');
$query = 'SELECT * FROM authors WHERE SUBSTRING(SOUNDEX(lastname), 1, 4) = "' . $s . '"';
Or, if you want to bypass the PHP function (note that this gives MySQL's unshortened value, D2312):
$result = mysql_query("SELECT SOUNDEX('Dostoyevski')");
$s = mysql_result($result, 0, 0);
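The mysql_* functions used in that post were removed in PHP 7.0, so here is a hedged sketch of the same lookup written with a bound parameter instead of string concatenation. The helper name `build_soundex_lookup` is hypothetical, the `authors` table and `lastname` column come from the post, and the PDO connection in `$pdo` is assumed to exist already.

```php
<?php
// Hypothetical helper: returns the SQL text and the value to bind,
// so the name is never concatenated into the query string.
function build_soundex_lookup(string $name): array
{
    $sql = 'SELECT * FROM authors '
         . 'WHERE SUBSTRING(SOUNDEX(lastname), 1, 4) = ?';
    return [$sql, [soundex($name)]];
}

[$sql, $params] = build_soundex_lookup('Dostoyevski');
// $params is now ['D231'].

// With an existing PDO connection in $pdo (assumed, not created here):
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $rows = $stmt->fetchAll();
```

Truncating on the MySQL side keeps both codes at the standard four characters, so the comparison is like with like.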
To search for words like Clansy and Klansy, just reverse the strings before passing them to soundex().
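The code from this post did not survive, so the following is only a guess at what reversing the strings could look like. The helper name `reverse_soundex` is hypothetical, and strrev() works on bytes, so this assumes plain ASCII names.

```php
<?php
// Hypothetical helper: Soundex code of the reversed word, so the LAST
// letter is the one kept verbatim instead of the first.
function reverse_soundex(string $word): string
{
    return soundex(strrev($word));
}

// Plain soundex() keeps the differing first letters: C452 vs K452.
var_dump(soundex('Clansy') === soundex('Klansy'));                 // bool(false)

// Reversed, both words start with 'y' and both encode to Y254.
var_dump(reverse_soundex('Clansy') === reverse_soundex('Klansy')); // bool(true)
```

The trade-off is that the code is now sensitive to the last letter instead, so words that differ only in their ending will stop matching.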
levenshtein() is used to calculate the Levenshtein distance between two strings. This function returns the Levenshtein distance between the two argument strings, or, in PHP versions before 8.0, -1 if one of the argument strings is longer than the limit of 255 characters.
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where m and n are the lengths of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive). In its simplest form the function takes only the two strings as parameters and calculates just the number of insert, replace and delete operations needed to transform str1 into str2. A second variant takes three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.
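To show where the O(m*n) cost comes from, here is a minimal dynamic-programming sketch. It is not PHP's internal implementation: the function name `levenshtein_sketch` is hypothetical, it works on bytes, and it keeps only two rows of the m-by-n table in memory.

```php
<?php
// Hypothetical re-implementation for illustration only.
// Returns the cheapest cost of turning $a into $b.
function levenshtein_sketch(
    string $a,
    string $b,
    int $ins = 1,
    int $rep = 1,
    int $del = 1
): int {
    $m = strlen($a);
    $n = strlen($b);

    // $prev[$j]: cost of turning the empty prefix of $a into $b[0..$j).
    $prev = [];
    for ($j = 0; $j <= $n; $j++) {
        $prev[$j] = $j * $ins;
    }

    // One pass per character of $a, one inner step per character of $b:
    // that nested loop is the O(m*n) part.
    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i * $del];
        for ($j = 1; $j <= $n; $j++) {
            $same = $a[$i - 1] === $b[$j - 1];
            $curr[$j] = min(
                $prev[$j - 1] + ($same ? 0 : $rep), // keep or replace
                $prev[$j] + $del,                   // delete from $a
                $curr[$j - 1] + $ins                // insert into $a
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

echo levenshtein_sketch('kitten', 'sitting'), "\n"; // 3
```

With the default costs of 1 this should agree with the built-in levenshtein() on short ASCII strings.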
levenshtein() example
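The example code from this post is missing, so what follows is a stand-in sketch rather than the original. It picks the dictionary word with the smallest Levenshtein distance to a misspelled input; the helper name `closest_word` and the word list are made up for illustration.

```php
<?php
// Hypothetical helper: returns the candidate with the smallest
// Levenshtein distance to $input, or null for an empty list.
function closest_word(string $input, array $words): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;

    foreach ($words as $word) {
        $distance = levenshtein($input, $word);
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }

    return $best;
}

$words = ['apple', 'banana', 'orange', 'radish', 'carrot', 'potato'];

echo closest_word('carrrot', $words), "\n"; // carrot
echo levenshtein('carrrot', 'carrot'), "\n"; // 1
```

On ties the first candidate in the list wins, which may or may not be what a real spell-suggestion routine wants.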
Unfortunately, soundex() is very sensitive to the first character. It is not possible to use it directly and have Clansy and Klansy return the same value. If you want to do a phonetic search on such names, you will still need to write a routine that evaluates C452 as being similar to K452.
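One possible shape for such a routine, offered as a sketch only: treat two codes as similar when their digits match and their first letters fall into the same Soundex digit group. Both function names are hypothetical, and the letter groups are the standard Soundex ones (vowels, H, W and Y share group 0 here).

```php
<?php
// Hypothetical helper: maps a letter to its Soundex digit group.
function first_letter_group(string $letter): string
{
    return strtr(
        strtoupper($letter),
        'AEIOUHWYBFPVCGJKQSXZDTLMNR',
        '00000000111122222222334556'
    );
}

// Hypothetical routine: same digits, and first letters in the same group.
function soundex_similar(string $a, string $b): bool
{
    $codeA = soundex($a);
    $codeB = soundex($b);
    if ($codeA === '' || $codeB === '') {
        return false;
    }

    return substr($codeA, 1) === substr($codeB, 1)
        && first_letter_group($codeA[0]) === first_letter_group($codeB[0]);
}

var_dump(soundex_similar('Clansy', 'Klansy')); // bool(true):  C452 ~ K452
var_dump(soundex_similar('Clansy', 'Blansy')); // bool(false): C and B differ in group
```

This loosens the match further, so expect even more false positives than plain soundex() gives.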