Tuesday, January 31, 2012

Spelling Corrector Algorithm - Java Code

I was amazed at how Google does spelling correction so well. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: spelling. Impressive!

In one of my past projects, I was working on a RealEstate infrastructure where we had to support spelling corrections for City/State searches. At that point, we used Solr's spelling corrector. But I wanted to dig into the algorithm and write something simple and works without Solr. Just did some research online, went through some papers publishings and wrote an algorithm that serves what we are getting from Solr.

The code maintains a static dictionary of correct spellings in Hashmap, which we load on-boot. Below is the algorithm written in Java.

It works in O(n), n being the length of the word.

Spelling Corrector - Java code


package main.java.algo;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// References
//  - http://norvig.com/spell-correct.html
//  - http://raelcunha.com/spell-correct.php
//  - http://surguy.net/articles/scala-spelling.xml


// Components

//- dictionary (pre populated), map with word to weight
//- edits : deletes, transpose, replace, inserts
//- candidates : loop edits and populate. map to weight to word

public class SpellingCorrector {
 
 public static void main(String[] args) {

  dictionary.put("spelling", 1);
  dictionary.put("ashish", 1);
  dictionary.put("param", 1);
  
  String input = "spiling";
  String correct = correct(input);
  System.out.println(correct);
 }
 
 
 // word to count map - how may times a word is present - or a weight attached to a word
 public static Map<String, Integer> dictionary = new HashMap<String, Integer>();

 public static String correct(String word) {
  
  if(dictionary.containsKey(word))
   return word;
  
  // getting all possible edits of input word
  List<String> edits = edits(word);
  
  // Here we can either iterate through list of edits and find the 1st word that is in dictionary and return.
  // Or we can go to below logic to return the word with most weight (that we would have already stored in dictionary map)
  
  // map to hold the possible matches
  Map<Integer, String> candidates = new HashMap<Integer, String>();

  // keep checking the dictionary and populate the possible matches
  // it stores the count (or weight) of word and the actual word
  for(String s : edits) {
   if(dictionary.containsKey(s)) {
    candidates.put(dictionary.get(s), s);
   }
  }
  
  // one we have found possible matches - we want to return most popular (most weight) word
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  
  
  // If none matched.
  // We will pick every word from edits and again do edits (using same previous logic) on that to see if anything matches
  // We don't do recursion here because we don't the loop to continue infinitely if none matches
  // If even after doing edits of edits, we don't find the correct word - then return.
  
  for(String s : edits) {
   
   List<String> newEdits = edits(s);
   for(String ns : newEdits) {
    if(dictionary.containsKey(ns)) {
     candidates.put(dictionary.get(ns), ns);
    }
   }
  }
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  else 
   return word;
 }
 
 // Word EDITS
 // Getting all possible corrections c of a given word w. 
 // It is the edit distance between two words: the number of edits it would take to turn one into the other
 
 public static List<String> edits(String word) {
  
  if(word == null || word.isEmpty())
   return null;
  
  List<String> list = new ArrayList<String>();
  
  String w = null;
  
  // deletes (remove one letter)
  for (int i = 0; i < word.length(); i++) {
   w = word.substring(0, i) + word.substring(i + 1); // ith word skipped
   list.add(w);
  }
  
  // transpose (swap adjacent letters)
  // note here i is less than word.length() - 1
  for (int i = 0; i < word.length() - 1; i++) { // < w.len -1 :: -1 because we swapped last 2 elements in go.
   w = word.substring(0, i) + word.charAt(i + 1) + word.charAt(i) + word.substring(i + 2); // swapping ith and i+1'th char
   list.add(w);
  }
  
  // replace (change one letter to another)
  for (int i = 0; i < word.length(); i++) {
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i + 1); // replacing ith char with all possible alphabets
    list.add(w);
   }
  }
  
  // insert (add a letter)
  // note here i is less than and EQUAL to
  for (int i = 0; i <= word.length(); i++) { // <= because we want to insert the new char at the end as well
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i); // inserting new char at ith position
    list.add(w);
   }
  }
  
  return list;
 }
 
}

14 comments:

  1. Thank you very much very helpful indeed

    ReplyDelete
  2. // one we have found possible matches - we want to return most popular (most weight) word
    if(candidates.size() > 0)
    return candidates.get(Collections.max(candidates.keySet()));

    I think this step would serve better if you use the Bayes theorem to find which correction is more likely to occur instead of just assuming the most frequent word in the dictionary to be the correction.

    ReplyDelete
  3. Dictionary populated can be in anothe class.

    ReplyDelete
  4. Thank you for sharing such a nice story with us. I really enjoyed very much And please keep updating like this with this site.


    Java Training

    ReplyDelete
  5. Great work, programming was very easily understand and more important coding are provided on this post and this is very valuable in my studies,all coding easily understand and develop more skills,thanks for sharing this post.
    php training

    ReplyDelete
  6. Another great articles and very interesting to read,thanks for sharing that wonderful useful information,given programming coding was very excellent and i can easily observe all provided information.
    sas

    ReplyDelete
  7. your blog is really helpful and useful it really helped me during coding process and it is useful thanks for sharing those information.

    digital marketing training in chennai

    ReplyDelete
  8. Thank you for your post. This was really an appreciating one. You done a good job. Keep on blogging like this unique information with us.

    SAS Training in Chennai

    ReplyDelete
  9. This only considers words at an edit distance of 1 from the original word, is that correct?

    ReplyDelete
  10. Great blog..Step by step explanation is too good to understand..Its very useful for me to understand..Keep on sharing..
    SEO Training in Chennai

    ReplyDelete
  11. Wow amazing i saw the article with execution models you had posted. It was such informative. Really its a wonderful article. Thank you for sharing and please keep update like this type of article because i want to learn more relevant to this topic.

    SAP Training in Chennai

    ReplyDelete
  12. Thank you for sharing the explanation which is related with spelling corrector. It is much essential for typing through the document. So please keep update like this.

    SAP Training in Chennai

    ReplyDelete
  13. thanks for sharing wonderful information keep sharing more.
    Software Testing Training in Chennai

    ReplyDelete