IT Community - Software Programming, Web Development and Technical Support

What is Soundex?

This is a discussion on What is Soundex? within the PHP Programming forums, part of the Web Development category; Some improved algorithm 1 Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost ...


Go Back   IT Community - Software Programming, Web Development and Technical Support > Web Development > PHP Programming

Register FAQ Members List Calendar Mark Forums Read
  #21 (permalink)  
Old 04-22-2008, 02:14 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

Some improved algorithm

1 Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
2 Retain the first letter of the word.
3 Change all occurrence of the following letters to '0' (zero):
'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4 Change letters from the following sets into the digit given:

* 1 = 'B', 'F', 'P', 'V'
* 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
* 3 = 'D','T'
* 4 = 'L'
* 5 = 'M','N'
* 6 = 'R'

5 Remove all pairs of digits which occur beside each other from the string that resulted after step (4).
6 Remove all zeros from the string that results from step 5.0 (placed there in step 3)
7 Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Sponsored Links
  #22 (permalink)  
Old 04-22-2008, 03:59 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

It's not hard to think of improvements that will make this already powerful algorithm even more robust. An example (at least to American pronunciation sensibilities) might include replacing many multi-letter sequences that produce unrelated sounds before performing the steps of the basic algorithm. For example, before starting the above procedure, replace:
* DG with G
* GH with H
* GN with N (not 'ng')
* KN with N
* PH with F
* MP with M ...WHEN... it is followed by S, Z, or T
* PS with S ...WHEN... it starts a word
* PF with F ...WHEN... it starts a word
* MB with M
* TCH with CH
* A or I with E WHEN - starts word+followed by[AEIO]
(Not in source code...)
The conversion enhancement for PF would not normally be needed because both letters are in the same group (group 1). However, since this conversion improvement is only for the start of the word, it must be included, since the first letter is preserved in this and classic SoundEx.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-22-2008, 04:01 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

SoundEx Limitations

SoundEx acts as a bridge between the fuzzy and inexact process of human vocal interaction, and the concise true/false processes at the foundation of computer communication. As such, SoundEx is an inherently unreliable interface.

For this reason, SoundEx is only usable in applications that can tolerate high false positives (when words that don't match the sound of the inquiry are returned) and high false negatives (when words that match the sound of the inquiry are NOT returned).

This limitation is true even of the best SoundEx improvement techniques available. As long as you accept and honor this limitation, SoundEx and its derivatives can be a very useful tool in helping to improve the quality and usefulness of databases.

In many instances, unreliable interfaces are used as a foundation, upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-22-2008, 04:04 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

Although the standard soundex string is 4 characters long, and this is what's returned by the php function, some database programs return an arbitrary number of strings. MySQL, for instance.

The MySQL documentation covers this, recommending that you may wish to use substring to output the standard 4 characters. Let's take 'Dostoyevski' as an example.

select soundex("Dostoyevski")
returns D2312
select substring(soundex("Dostoyevski"), 1, 4);
returns D231

PHP will return the value as 'D231'

So, to use the soundex function to generate a WHERE parameter in a MySQL SELECT statement, you might try this:
$s = soundex('Dostoyevski');
SELECT * FROM authors WHERE substring(soundex(lastname), 1 , 4) = "' . $s . '"';

Or, if you want to bypass the php function
$result = mysql_query("select soundex('Dostoyevski')");
$s = mysql_result($result, 0, 0);
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-22-2008, 04:07 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

To search for words like Clansy and Klansy, just reverse the strings:

PHP Code:
$s1 "Clansy";
  
$s2 "Klansy";
  if((
soundex($s1) == soundex($s2)) ||
     (
soundex(strrev($s1)) == soundex(strrev($s2))))
    echo(
"Match"); 
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-23-2008, 01:31 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

levenshtein() uses to calculate Levenshtein distance between two strings

This function returns the Levenshtein-Distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-23-2008, 01:32 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).

In its simplest form the function will take only the two strings as parameter and will calculate just the number of insert, replace and delete operations needed to transform str1 into str2.

A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-23-2008, 01:33 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

levenshtein() example

PHP Code:
<?php
// input misspelled word
$input 'carrrot';

// array of words to check against
$words  = array('apple','pineapple','banana','orange',
                
'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    
// calculate the distance between the input word,
    // and the current word
    
$lev levenshtein($input$word);

    
// check for an exact match
    
if ($lev == 0) {

        
// closest word is this one (exact match)
        
$closest $word;
        
$shortest 0;

        
// break out of the loop; we've found an exact match
        
break;
    }

    
// if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    
if ($lev <= $shortest || $shortest 0) {
        
// set the closest match, and shortest distance
        
$closest  $word;
        
$shortest $lev;
    }
}

echo 
"Input word: $input\n";
if (
$shortest == 0) {
    echo 
"Exact match found: $closest\n";
} else {
    echo 
"Did you mean: $closest?\n";
}
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-23-2008, 01:33 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

The above example will output:
Code:
Input word: carrrot
Did you mean: carrot?
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-23-2008, 01:40 AM
Jeyaseelansarc Jeyaseelansarc is offline
D-Web Genius
 
Join Date: Mar 2007
Location: Chennai
Posts: 1,162
Jeyaseelansarc is on a distinguished road
Send a message via AIM to Jeyaseelansarc
Default Re: What is Soundex?

soundex() unfortunately is very sensitive about the first character. It is not possible to use it and have Clansy and Klansy return the same value. If you want to do a phonetic search on such names you will still need to write a routine to evaluate C452 as being similar to K452.
__________________
With,
J. Jeyaseelan

Everything Possible
Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



All times are GMT -7. The time now is 02:17 PM.


Copyright ©2004 - 2007, DiscussWeb. All Rights Reserved.

SEO by vBSEO 3.0.0