Tuesday, January 31, 2012

Spelling Corrector Algorithm - Java Code

I was amazed at how Google does spelling correction so well. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: spelling. Impressive!

In one of my past projects, I was working on a RealEstate infrastructure where we had to support spelling corrections for City/State searches. At that point, we used Solr's spelling corrector. But I wanted to dig into the algorithm and write something simple and works without Solr. Just did some research online, went through some papers publishings and wrote an algorithm that serves what we are getting from Solr.

The code maintains a static dictionary of correct spellings in Hashmap, which we load on-boot. Below is the algorithm written in Java.

It works in O(n), n being the length of the word.

Spelling Corrector - Java code


package main.java.algo;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// References
//  - http://norvig.com/spell-correct.html
//  - http://raelcunha.com/spell-correct.php
//  - http://surguy.net/articles/scala-spelling.xml


// Components

//- dictionary (pre populated), map with word to weight
//- edits : deletes, transpose, replace, inserts
//- candidates : loop edits and populate. map to weight to word

public class SpellingCorrector {
 
 public static void main(String[] args) {

  dictionary.put("spelling", 1);
  dictionary.put("ashish", 1);
  dictionary.put("param", 1);
  
  String input = "spiling";
  String correct = correct(input);
  System.out.println(correct);
 }
 
 
 // word to count map - how may times a word is present - or a weight attached to a word
 public static Map<String, Integer> dictionary = new HashMap<String, Integer>();

 public static String correct(String word) {
  
  if(dictionary.containsKey(word))
   return word;
  
  // getting all possible edits of input word
  List<String> edits = edits(word);
  
  // Here we can either iterate through list of edits and find the 1st word that is in dictionary and return.
  // Or we can go to below logic to return the word with most weight (that we would have already stored in dictionary map)
  
  // map to hold the possible matches
  Map<Integer, String> candidates = new HashMap<Integer, String>();

  // keep checking the dictionary and populate the possible matches
  // it stores the count (or weight) of word and the actual word
  for(String s : edits) {
   if(dictionary.containsKey(s)) {
    candidates.put(dictionary.get(s), s);
   }
  }
  
  // one we have found possible matches - we want to return most popular (most weight) word
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  
  
  // If none matched.
  // We will pick every word from edits and again do edits (using same previous logic) on that to see if anything matches
  // We don't do recursion here because we don't the loop to continue infinitely if none matches
  // If even after doing edits of edits, we don't find the correct word - then return.
  
  for(String s : edits) {
   
   List<String> newEdits = edits(s);
   for(String ns : newEdits) {
    if(dictionary.containsKey(ns)) {
     candidates.put(dictionary.get(ns), ns);
    }
   }
  }
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  else 
   return word;
 }
 
 // Word EDITS
 // Getting all possible corrections c of a given word w. 
 // It is the edit distance between two words: the number of edits it would take to turn one into the other
 
 public static List<String> edits(String word) {
  
  if(word == null || word.isEmpty())
   return null;
  
  List<String> list = new ArrayList<String>();
  
  String w = null;
  
  // deletes (remove one letter)
  for (int i = 0; i < word.length(); i++) {
   w = word.substring(0, i) + word.substring(i + 1); // ith word skipped
   list.add(w);
  }
  
  // transpose (swap adjacent letters)
  // note here i is less than word.length() - 1
  for (int i = 0; i < word.length() - 1; i++) { // < w.len -1 :: -1 because we swapped last 2 elements in go.
   w = word.substring(0, i) + word.charAt(i + 1) + word.charAt(i) + word.substring(i + 2); // swapping ith and i+1'th char
   list.add(w);
  }
  
  // replace (change one letter to another)
  for (int i = 0; i < word.length(); i++) {
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i + 1); // replacing ith char with all possible alphabets
    list.add(w);
   }
  }
  
  // insert (add a letter)
  // note here i is less than and EQUAL to
  for (int i = 0; i <= word.length(); i++) { // <= because we want to insert the new char at the end as well
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i); // inserting new char at ith position
    list.add(w);
   }
  }
  
  return list;
 }
 
}

Monday, January 30, 2012

Closest pair of points - Algorithm


Credit: wikipedia
Problem Statement : The closest pair of points problem or closest pair problem is a problem of computational geometry: given n points in metric space, find a pair of points with the smallest distance between them.

Basically you are given N points in a plane - and you ahve to find two closest points.

This is algorithm may be used to find the nearest location  given your current geo location. This question was posted to me as -

'Write the algorithm to find the nearest Star bucks coffee shop given your lat longs'

The first things that comes to the mind is a brute force approach, which is O(n^2) and may not be ideal.


minDist = infinity
 for each p in P:
  for each q in P:
   if p ≠ q and dist(p, q) < minDist:
    minDist = dist(p, q)
    closestPair = (p, q)
 return closestPair


So, can we optimize it? Yes, lets get it in O(n log n).

This algorithm very similar to what we do in Merge Sort. The algorithm described here uses a divide and conquer approach to find out the closest  pair of points in a plane. It takes an input array of points of the type described as input  above

PSEUDO CODE

P set of points - SORTED by X-axis
procedure findclosest ( P, int n)
  { n is the number of elements in set P }
  Set: PLeft, PRight, Pleftmin, Prightmin, Pclosest
  if (n  <= 3) 
   return SHORTEST(P)
  else 
   PLeft = { p(1), p(2), …. , p(n/2) }
   PRight = { p(n/2 + 1), ……. , p(n) }
   Pleftmin = findclosest (PLeft , n/2)
   Prightmin = findclosest (PRight , n/2)
   Pclosest = MERGEPLANES (Pleftmin, Pleft, Prightmin, Pright)
   return Pclosest
  endif
 endprocedure
 
 procedure MERGEPLANES (P1min, P1, P2min, P2min)
  
  D = MINIMUM among (P1min, P2min)

  There might be points in each around mid-x-axis that do not form the minimum on left/right but if joined across points on other side - might give min.
  We create a rectange to consider such points.

  xi - xj < D == Means any 2 points whoes x-axis coordinates are less than D distance apart.
  yi +D > yj > yi – D = means any 2 points with y-coordinates are less than D distance apart. Considering above and below 7-axis, means they lie in 2D range.
  
  
  for (every i in P1) 
   for every j in P2 such that
    xi - xj < D AND  (yi +D > yj > yi – D )
    if DISTANCE (i, j) <  D
     return (i, j)
   endfor
  endfor
   
 endprocedure
 



Inside loop will at the max have 6 elements falling in that range. O(constant)
Outer loop worst case goes n times. O(n)
Total complexity of MERGEPLANES - O(n) * O(constant) = O(n)
Recursive loop - O(log n)

Thus, Complexity - O(n log n)

Some places it is also said that it may be O(n)
It is because only points in both the planes which are not more than MinDistance distance apart will have to be considered in order to find the closest pair.

Explanation

- Say P is a set of points with x and y coordinates. 
 - sort all points by by x-coordinate (Write a comparatr which does that)
 - Recursively, devide the list into halves. Pleft, Pright
 - pass halved lists recursively to the function. We follow bottoms up approach. We'll keep deviding till we reach base case of 2-3 points and we find nearest points amont them.
 - Then while keep coming out of recursion, we keep maitainining the closest pair amont Pleft and Pright.
 - leftClosest = recursive(Pleft)
 - rightClosest = recursive(Pright)
 - Sat we have ClosestPoint data structure: - that has Point1, Point2, distance
 - base case: if list.length < 3 -- return Shortest Distance ClosestPoint.
 - now we have closest point from left & right halves.
 - closest point will be either one of them, or points joining across the deviding line
 - ClosestPoint c = merge( leftClosest, Pleft, rightClosest, Pright )
 - merge
  -- D = min distance from leftClosest, rightClosest
  -- find min reactangle around deviding line
  -- for(i: 0-Pleft.len)
  --- for (j: 0-Pright.len)
  ----  if(Mod(pi.x-pj.x) < D || Mod(pi.y-pj.y) < d)
  -----   d = Distance(...)
  -----   if(d < D)
  ------    maintain minDistance and closest pair

 



Some places it is also said that it may be O(n)
It is because only points in both the planes which are not more than MinDistance distance apart will have to be considered in order to find the closest pair.


References

  •  http:people.cis.ksu.edu/~subbu/Papers/Closest%20pair.pdf
  •  http:en.wikipedia.org/wiki/Closest_pair_of_points_problem