Tuesday, January 31, 2012

Spelling Corrector Algorithm - Java Code

I was amazed at how Google does spelling correction so well. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: spelling. Impressive!

In one of my past projects, I worked on a real estate platform where we had to support spelling correction for city/state searches. At that point we used Solr's spelling corrector, but I wanted to dig into the algorithm and write something simple that works without Solr. I did some research online, went through a few published papers, and wrote an algorithm that gives us roughly what we were getting from Solr.

The code maintains a static dictionary of correct spellings in a HashMap, which we load at startup. Below is the algorithm written in Java.

Generating the single-edit candidates works in O(n), n being the length of the word (each position contributes a constant number of deletes, transposes, replaces, and inserts).

Spelling Corrector - Java code


package main.java.algo;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// References
//  - http://norvig.com/spell-correct.html
//  - http://raelcunha.com/spell-correct.php
//  - http://surguy.net/articles/scala-spelling.xml


// Components

//- dictionary (pre populated), map with word to weight
//- edits : deletes, transpose, replace, inserts
//- candidates : loop edits and populate. map to weight to word

public class SpellingCorrector {
 
 public static void main(String[] args) {

  dictionary.put("spelling", 1);
  dictionary.put("ashish", 1);
  dictionary.put("param", 1);
  
  String input = "spiling";
  String correct = correct(input);
  System.out.println(correct);
 }
 
 
 // word to count map - how many times a word is present, or a weight attached to a word
 public static Map<String, Integer> dictionary = new HashMap<String, Integer>();

 public static String correct(String word) {
  
  if(dictionary.containsKey(word))
   return word;
  
  // getting all possible edits of input word
  List<String> edits = edits(word);
  
  // Here we can either iterate through the list of edits and return the first word found in the dictionary,
  // or use the logic below to return the word with the highest weight (already stored in the dictionary map).
  
  // map to hold the possible matches
  Map<Integer, String> candidates = new HashMap<Integer, String>();

  // keep checking the dictionary and populate the possible matches
  // it stores the count (or weight) of word and the actual word
  for(String s : edits) {
   if(dictionary.containsKey(s)) {
    candidates.put(dictionary.get(s), s);
   }
  }
  
  // once we have found possible matches - we want to return the most popular (highest weight) word
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  
  
  // If none matched,
  // we pick every word from edits and apply edits again (using the same logic) to see if anything matches.
  // We don't use recursion here because we don't want the loop to continue indefinitely if nothing ever matches.
  // If even the edits of the edits don't produce a dictionary word, return the input as-is.
  
  for(String s : edits) {
   
   List<String> newEdits = edits(s);
   for(String ns : newEdits) {
    if(dictionary.containsKey(ns)) {
     candidates.put(dictionary.get(ns), ns);
    }
   }
  }
  if(candidates.size() > 0)
   return candidates.get(Collections.max(candidates.keySet()));
  else 
   return word;
 }
 
 // Word EDITS
 // Getting all possible candidate corrections c of a given word w.
 // These are all the strings at edit distance 1 from w (the edit distance being the number of edits it takes to turn one word into the other).
 
 public static List<String> edits(String word) {
  
  if(word == null || word.isEmpty())
   return new ArrayList<String>(); // empty list, so callers can iterate safely
  
  List<String> list = new ArrayList<String>();
  
  String w = null;
  
  // deletes (remove one letter)
  for (int i = 0; i < word.length(); i++) {
   w = word.substring(0, i) + word.substring(i + 1); // ith char removed
   list.add(w);
  }
  
  // transpose (swap adjacent letters)
  // note here i is less than word.length() - 1
  for (int i = 0; i < word.length() - 1; i++) { // -1 because we swap the ith char with the one after it
   w = word.substring(0, i) + word.charAt(i + 1) + word.charAt(i) + word.substring(i + 2); // swapping the ith and (i+1)th chars
   list.add(w);
  }
  
  // replace (change one letter to another)
  for (int i = 0; i < word.length(); i++) {
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i + 1); // replacing the ith char with every letter a-z
    list.add(w);
   }
  }
  
  // insert (add a letter)
  // note here i goes up to and INCLUDING word.length()
  for (int i = 0; i <= word.length(); i++) { // <= because we also want to insert the new char at the end
   for (char c = 'a'; c <= 'z'; c++) {
    w = word.substring(0, i) + c + word.substring(i); // inserting the new char at position i
    list.add(w);
   }
  }
  
  return list;
 }
 
}
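
The dictionary above is hard-coded in main; the post mentions loading it at startup instead. A minimal sketch of how that might look, assuming a hypothetical plain-text word list with an optional count per line (the DictionaryLoader class and file format are not part of the original code):

package main.java.algo;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DictionaryLoader {

 // Loads a plain-text word list (hypothetical format: "word [count]" per line)
 // into SpellingCorrector.dictionary at application startup.
 public static void load(String fileName) throws IOException {
  BufferedReader reader = new BufferedReader(new FileReader(fileName));
  try {
   String line;
   while ((line = reader.readLine()) != null) {
    String[] parts = line.trim().split("\\s+");
    if (parts[0].isEmpty())
     continue; // skip blank lines
    // default weight of 1 if the line has no count column
    int weight = parts.length > 1 ? Integer.parseInt(parts[1]) : 1;
    SpellingCorrector.dictionary.put(parts[0].toLowerCase(), weight);
   }
  } finally {
   reader.close();
  }
 }
}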

Monday, January 30, 2012

Closest pair of points - Algorithm


Credit: wikipedia
Problem Statement : The closest pair of points problem or closest pair problem is a problem of computational geometry: given n points in metric space, find a pair of points with the smallest distance between them.

Basically, you are given N points in a plane and you have to find the two closest points.

This algorithm may be used to find the nearest location given your current geo location. The question was posed to me as -

'Write an algorithm to find the nearest Starbucks coffee shop given your lat/longs'

The first thing that comes to mind is a brute force approach, which is O(n^2) and may not be ideal.


minDist = infinity
 for each p in P:
  for each q in P:
   if p ≠ q and dist(p, q) < minDist:
    minDist = dist(p, q)
    closestPair = (p, q)
 return closestPair
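
A direct Java translation of the brute force pseudocode, assuming a simple Point class with x/y fields (both the Point class and the distance helper here are just for illustration):

public class BruteForceClosestPair {

 static class Point {
  final double x, y;
  Point(double x, double y) { this.x = x; this.y = y; }
 }

 static double distance(Point a, Point b) {
  double dx = a.x - b.x, dy = a.y - b.y;
  return Math.sqrt(dx * dx + dy * dy);
 }

 // O(n^2): compare every pair once (starting j at i + 1 makes the p != q check implicit)
 static Point[] closestPair(Point[] points) {
  double minDist = Double.POSITIVE_INFINITY;
  Point[] closest = null;
  for (int i = 0; i < points.length; i++) {
   for (int j = i + 1; j < points.length; j++) {
    double d = distance(points[i], points[j]);
    if (d < minDist) {
     minDist = d;
     closest = new Point[] { points[i], points[j] };
    }
   }
  }
  return closest;
 }
}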


So, can we optimize it? Yes, let's get it down to O(n log n).

This algorithm is very similar to what we do in merge sort. The algorithm described here uses a divide and conquer approach to find the closest pair of points in a plane. It takes an array of points (sorted by x-coordinate, as noted below) as input.

PSEUDO CODE

P set of points - SORTED by X-axis
procedure findclosest ( P, int n)
  { n is the number of elements in set P }
  Set: PLeft, PRight, Pleftmin, Prightmin, Pclosest
  if (n  <= 3) 
   return SHORTEST(P)
  else 
   PLeft = { p(1), p(2), …. , p(n/2) }
   PRight = { p(n/2 + 1), ……. , p(n) }
   Pleftmin = findclosest (PLeft , n/2)
   Prightmin = findclosest (PRight , n/2)
    Pclosest = MERGEPLANES (Pleftmin, PLeft, Prightmin, PRight)
   return Pclosest
  endif
 endprocedure
 
 procedure MERGEPLANES (P1min, P1, P2min, P2)
  
  D = MINIMUM among (P1min, P2min)

  There might be points near the dividing line that do not form the minimum pair on the left or the right, but which, paired with a point on the other side, give a smaller distance.
  We create a rectangle (strip) around the dividing line to consider such points.

  xi - xj < D == means any two points whose x-coordinates are less than D apart.
  yi + D > yj > yi - D == means any two points whose y-coordinates are less than D apart (looking D above and D below, i.e. a 2D-tall range).
  
  
  Pclosest = the pair that gave D (from P1min or P2min)
  for (every i in P1)
   for (every j in P2 such that
    xi - xj < D AND (yi + D > yj > yi - D))
    if DISTANCE (i, j) < D
     D = DISTANCE (i, j)
     Pclosest = (i, j)
   endfor
  endfor
  return Pclosest

 endprocedure
 



The inner loop considers at most a constant number of points (no more than about 6 fall inside that D x 2D rectangle). O(constant)
The outer loop in the worst case goes n times. O(n)
Total complexity of MERGEPLANES - O(n) * O(constant) = O(n)
The recursion has O(log n) levels.

Thus, Complexity - O(n log n)

The merge step is sometimes described as O(n) because only the points in the two halves that are within MinDistance of the dividing line have to be considered in order to find the closest pair.

Explanation

- Say P is a set of points with x and y coordinates.
 - Sort all points by x-coordinate (write a comparator that does this).
 - Recursively divide the list into halves: PLeft, PRight.
 - Pass the halved lists recursively to the function. We follow a bottom-up approach: keep dividing until we reach the base case of 2-3 points, where we find the nearest pair among them directly.
 - Then, while coming back out of the recursion, we keep maintaining the closest pair among PLeft and PRight.
 - leftClosest = recursive(PLeft)
 - rightClosest = recursive(PRight)
 - Say we have a ClosestPoint data structure that holds Point1, Point2, distance.
 - Base case: if list.length <= 3 -- return the shortest-distance ClosestPoint by brute force.
 - Now we have the closest pair from the left and right halves.
 - The closest pair overall is either one of them, or a pair joining points across the dividing line.
 - ClosestPoint c = merge( leftClosest, PLeft, rightClosest, PRight )
 - merge (see the Java sketch after this list)
  -- D = min distance from leftClosest, rightClosest
  -- consider only the min rectangle (strip) around the dividing line
  -- for(i: 0-PLeft.len)
  --- for (j: 0-PRight.len)
  ----  if(Mod(pi.x-pj.x) < D AND Mod(pi.y-pj.y) < D)
  -----   d = Distance(pi, pj)
  -----   if(d < D)
  ------    maintain minDistance and the closest pair
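
A minimal Java sketch of this divide and conquer approach, assuming a simple Point class with x/y fields (hypothetical, like the one in the brute force sketch above). It returns only the minimum distance rather than the pair, and it re-sorts the strip by y at every level, which makes it O(n log^2 n); the strictly O(n log n) version keeps a y-sorted copy of the points, merge-sort style.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ClosestPairDivideAndConquer {

 static class Point {
  final double x, y;
  Point(double x, double y) { this.x = x; this.y = y; }
 }

 static double distance(Point a, Point b) {
  double dx = a.x - b.x, dy = a.y - b.y;
  return Math.sqrt(dx * dx + dy * dy);
 }

 // Entry point: sort by x once, then recurse.
 public static double closest(Point[] points) {
  Point[] byX = points.clone();
  Arrays.sort(byX, new Comparator<Point>() {
   public int compare(Point a, Point b) { return Double.compare(a.x, b.x); }
  });
  return closest(byX, 0, byX.length - 1);
 }

 // Divide and conquer on the x-sorted array between indexes lo and hi (inclusive).
 private static double closest(Point[] p, int lo, int hi) {
  int n = hi - lo + 1;
  if (n <= 3)
   return bruteForce(p, lo, hi); // base case: 2-3 points

  int mid = lo + n / 2;
  double midX = p[mid].x; // the dividing line

  double dLeft = closest(p, lo, mid - 1);
  double dRight = closest(p, mid, hi);
  double d = Math.min(dLeft, dRight);

  // Strip: points whose x-distance from the dividing line is less than d.
  List<Point> strip = new ArrayList<Point>();
  for (int i = lo; i <= hi; i++)
   if (Math.abs(p[i].x - midX) < d)
    strip.add(p[i]);

  Collections.sort(strip, new Comparator<Point>() {
   public int compare(Point a, Point b) { return Double.compare(a.y, b.y); }
  });

  // For each strip point, only a constant number of the following points
  // can be within d in the y direction - hence the "at most 6 elements" bound.
  for (int i = 0; i < strip.size(); i++)
   for (int j = i + 1; j < strip.size() && strip.get(j).y - strip.get(i).y < d; j++)
    d = Math.min(d, distance(strip.get(i), strip.get(j)));

  return d;
 }

 // O(n^2) check used for the small base case.
 private static double bruteForce(Point[] p, int lo, int hi) {
  double min = Double.POSITIVE_INFINITY;
  for (int i = lo; i <= hi; i++)
   for (int j = i + 1; j <= hi; j++)
    min = Math.min(min, distance(p[i], p[j]));
  return min;
 }
}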

 




References

  •  http://people.cis.ksu.edu/~subbu/Papers/Closest%20pair.pdf
  •  http://en.wikipedia.org/wiki/Closest_pair_of_points_problem

Thursday, October 6, 2011

Inverted index - Lucene Data Structure

Inverted index

An inverted index is an index data structure that stores a mapping from content, such as words, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. It is the most popular data structure used in search engines.

Inverted Index Example

T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana".
We have the following inverted file index, where the integers in {} refer to the documents in which the term is present (T0, T1, etc.):

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

A term search for the terms "what", "is" and "it" would give
-> {0,1} intersect {0,1,2} intersect {0,1,2} = {0,1}.

With the same texts, we get the following full inverted index, where the pairs are document numbers and local word numbers. Like the document numbers, local word numbers also begin with zero. So, "banana": {(2, 3)} means the word "banana" is in the third document (T2), and it is the fourth word in that document (position 3).

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

If we run a phrase search for "what is it" we get hits for all the words in both document 0 and 1. But the terms occur consecutively only in document 1.
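
A minimal Java sketch (not Lucene's actual implementation) that builds this kind of term-to-documents index for T0, T1, T2 and answers the term query by intersecting posting lists. The lower-casing and whitespace tokenization are simplifying assumptions:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexDemo {

 // term -> sorted set of ids of the documents containing that term
 static Map<String, TreeSet<Integer>> index = new HashMap<String, TreeSet<Integer>>();

 static void addDocument(int docId, String text) {
  for (String term : text.toLowerCase().split("\\s+")) {
   TreeSet<Integer> postings = index.get(term);
   if (postings == null) {
    postings = new TreeSet<Integer>();
    index.put(term, postings);
   }
   postings.add(docId);
  }
 }

 // AND query: intersect the posting lists of all the query terms
 static Set<Integer> search(String... terms) {
  Set<Integer> result = null;
  for (String term : terms) {
   Set<Integer> postings = index.get(term.toLowerCase());
   if (postings == null)
    return Collections.<Integer>emptySet(); // a term with no postings empties the AND result
   if (result == null)
    result = new TreeSet<Integer>(postings);
   else
    result.retainAll(postings);
  }
  return result == null ? Collections.<Integer>emptySet() : result;
 }

 public static void main(String[] args) {
  addDocument(0, "it is what it is");
  addDocument(1, "what is it");
  addDocument(2, "it is a banana");

  System.out.println(search("what", "is", "it")); // prints [0, 1]
 }
}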

Inverted Index Applications

The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document as in the forward index, the inverted index data structure is developed, which lists the documents per word.

With the inverted index created, the word-to-document mapping can be stored in a hash map, and a query can now be resolved by jumping straight to the word's entry (via random access) in the inverted index.

References -
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
http://en.wikipedia.org/wiki/Search_engine_indexing
http://en.wikipedia.org/wiki/Inverted_index
http://lucene.apache.org/java/2_3_2/fileformats.html

Thursday, September 29, 2011

Creating a Custom LRU Cache in Java

Caching is an extremely useful concept for improving performance. There are many commercial and open-source caching packages available for Java. These packages require setting up the whole cache infrastructure in dev, QA, and prod environments and then hooking it into the code. This is fine, but sometimes we just want a quick and simple solution that addresses exactly what we need.

In one of my past projects, we had a Gallery promotion every few days, and under peak traffic the site started to melt because of read timeouts. One option was to use a caching tool like Memcached. But did we really need that? We quickly wrote a custom LRU cache that fit all the needs. The solution was well received and lives in production today.

Here I'll illustrate the construction of a custom LRU cache. I used LinkedHashMap's accessOrder and removeEldestEntry capabilities to create the LRU behavior of the cache.

import java.util.LinkedHashMap;
import java.util.Map;

// Generics are used for illustration; static members cannot refer to the class's type parameters.
// For real usage, make the cache a static (singleton) instance, which effectively makes get, put, etc. static.
// Also, the methods have to be synchronized, since we are deleting old entries and reordering them by access order. Under multiple threads this would lead to race conditions.
public class LRUCache<K, V> {

 private final int CACHE_SIZE;
 private final int initialCapacity = 16;
 private final float loadFactor = 0.75F;

 public LRUCache(int size) {
  this.CACHE_SIZE = size;
 }

 // LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder)
 // accessOrder - to maintain in order of elements from least-recently accessed to most-recently. Invoking the put or get method results in an access.
 public LinkedHashMap<K, V> cache = new LinkedHashMap<K, V>(initialCapacity, loadFactor, true) {

  private static final long serialVersionUID = 1L;

  // The removeEldestEntry(Map.Entry) - is a method from LinkedHashMap, that should be overridden to impose a policy for removing OLD mappings automatically when new mappings are added to the map.
  // Returns true if this map should remove its eldest entry. This method is invoked by put and putAll after inserting a new entry into the map. 
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
   boolean ifRemove = this.size() > CACHE_SIZE;
   return ifRemove;
  }
  
 };

 // Adds an entry to this cache. The new entry becomes the MRU (most recently used) entry. If an entry with the specified key already exists in the
 // cache, it is replaced by the new entry. If the cache is full, the LRU (least recently used) entry is removed from the cache.
 // it has to be synchronized, since we are deleting old elements and updating as per access order. Under multiple threads - it will be an issue.
 public synchronized void put(K key, V value) {
  if (value == null)
   return;
  else
   cache.put(key, value);
 }
 
 // Retrieves an entry from the cache. The retrieved entry becomes the MRU (most recently used) entry.
 public synchronized V get(K key) {
  return cache.get(key);
 }

 public synchronized void clear() {
  cache.clear();
 }
 
 
 // Test routine for the LRUCache class.
 public static void main(String[] args) {
  LRUCache<String, String> c = new LRUCache<String, String>(3);
  c.put("1", "one"); // 1
  c.put("2", "two"); // 2 1
  c.put("3", "three"); // 3 2 1
  c.put("4", "four"); // 4 3 2
  c.get("3");
  for (Map.Entry<String, String> e : c.cache.entrySet()) {
   System.out.println(e.getKey() + " : " + e.getValue());
  }
 }
}
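
With the test routine above, inserting "4" pushes the map past its size of 3, so the eldest entry "1" is evicted; get("3") then marks "3" as most recently used. The loop therefore prints the entries in least-to-most recently used order: 2 : two, 4 : four, 3 : three.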


Generics are used for illustration; static members cannot refer to the class's type parameters. For real usage, make the cache a static (singleton) instance, which effectively makes get and put static as well.

Also, the methods have to be synchronized, since we are deleting old entries and reordering them by access order; under multiple threads this would otherwise lead to race conditions.

Sunday, September 18, 2011

Breadth-First Binary Tree Traversal: Implementation & Complexity Analysis

BFS processes the tree level by level using a queue: each node is added to the queue exactly once and processed exactly once when it is removed, so the work is linear in the number of nodes.

Time Complexity - O ( n )
Space Complexity - O ( w ), where w is the maximum width of the tree (up to roughly n/2 nodes for the last level of a complete tree, so O ( n ) in the worst case)

Any tree traversal - Inorder, PreOrder, PostOrder, BFS, DFS - visits every node a constant number of times, so they are all O ( n ) in time; they differ only in the auxiliary space they need (a queue proportional to the width for BFS, a stack proportional to the height for the depth-first traversals).
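
The traversal classes below import main.java.algo.datastructures.binarytree.Node, which isn't shown in the post. A minimal sketch of what that class is assumed to look like (double data, so the 0.5 node in the pre-order example works):

package main.java.algo.datastructures.binarytree;

// Minimal assumed Node class for the traversal examples below.
public class Node {

	public double data;
	public Node left;
	public Node right;

	public Node(double data) {
		this.data = data;
	}
}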

package main.java.algo.datastructures.binarytree.traversal;

import java.util.LinkedList;
import java.util.Queue;

import main.java.algo.datastructures.binarytree.Node;

public class BFSTreeTraversal {
	
	public void mirrorTreeWithOutRecursion(Node root) {

		Queue<Node> queue = new LinkedList<Node>();

		if (root == null)
			return;

		queue.add(root);

		Node current = null;

		while (!queue.isEmpty()) {
			current = queue.poll();
			
			// do the processing on a node
			process(current);

			if (current.left != null)
				queue.add(current.left);
			if (current.right != null)
				queue.add(current.right);
		}
		return;
	}
	
	// process
	// example - here we are processing to create mirror of a tree
	public void process(Node n) {
		Node temp = n.left;
		n.left = n.right;
		n.right = temp;
	}

}

Depth-First Binary Tree Traversal

Time Complexity - O ( n )
Space Complexity - O ( h ), where h is the height of the tree - O ( log n ) for a balanced tree, O ( n ) in the worst case

With an explicit stack (or with recursion), each node is pushed and popped exactly once and processed exactly once when it is popped, so the work is linear in the number of nodes. The stack only ever holds nodes along the path currently being explored (plus their pending siblings), which is bounded by the height of the tree.

As noted above, any tree traversal - Inorder, PreOrder, PostOrder, BFS, DFS - is O ( n ) in time; only the auxiliary space differs.


package main.java.algo.datastructures.binarytree.traversal;

import java.util.Stack;

import main.java.algo.datastructures.binarytree.Node;

public class DFSTreeTraversal {
	
	public void mirrorTreeWithOutRecursion(Node root) {

		Stack<Node> stack = new Stack<Node>();

		if (root == null)
			return;

		stack.push(root);

		Node current = null;

		while (!stack.isEmpty()) {
			current = stack.pop();
			
			// do the processing on a node
			process(current);

			if (current.left != null)
				stack.push(current.left);
			if (current.right != null)
				stack.push(current.right);
		}
		return;
	}
	
	// process
	// example - here we are processing to create mirror of a tree
	public void process(Node n) {
		Node temp = n.left;
		n.left = n.right;
		n.right = temp;
	}

}

Pre-Order Binary Tree Traversal

Time Complexity - O ( n )
Space Complexity - O ( h ), i.e. O ( log n ) for a balanced tree and O ( n ) in the worst case

Pre-order traversal visits each node exactly once: process the node first, then recurse on the left and right subtrees. The recursion (or the explicit stack in the iterative version below) is never deeper than the height of the tree.

As with the other traversals, the time is O ( n ); only the auxiliary space differs.

package main.java.algo.datastructures.binarytree.traversal;

import java.util.Stack;
import main.java.algo.datastructures.binarytree.Node;

/**
 ** Given a binary search tree
       4 
      / \ 
     2   5 
    / \ 
   1   3
  / 
0.5
 
** Produces the output "4 2 1 0.5 3 5". 

** http://codepad.org/QEVUNzYU
**/

public class PreOrder {
	
	
	/** ************************** PreOrder : Recursive ************************** **/

	public void preOrderRecursive(Node node) {
		
		if (node == null)
			return;
		
		// first deal with the node (print with a trailing space so the output reads "4 2 1 0.5 3 5")
		System.out.print(node.data + " ");

		// then recur on both subtrees
		preOrderRecursive(node.left);
		preOrderRecursive(node.right);

	}
	
	/** ************************** PreOrder : Iterative ************************** **/

	
	public void preOrderIterative (Node n) {
		
		// using an explicit stack instead of recursion
		Stack<Node> s = new Stack<Node>();
		
		// what preOrderRecursive(n.left) does
		preOrderpPushLeftChildsToStack(s, n);

		// what happens when preOrderRecursive(n.right) is called
		while (!s.isEmpty()) {
			
			Node t = s.pop();
			
			// preOrderRecursive(n.right) is called
			t = t.right;
			// what preOrderRecursive(n.left) does
			preOrderpPushLeftChildsToStack(s, t);	
		}
	}
	
	// it does what preOrderRecursive(n.left) was doing
	// note - in preorder - the processing of node is done as soon as it is discovered.
	public void preOrderpPushLeftChildsToStack(Stack<Node> s, Node n) {
		while (n != null) {
			
			// processing on a node
			process(n);
			
			s.push(n);
			n = n.left;
		}
	}
	
	public void process(Node n) {
		System.out.print(n.data + " ");
	}

}
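
A hypothetical usage of the recursive version, building the tree from the comment above with the assumed Node class (double data), so the values print with a decimal point:

import main.java.algo.datastructures.binarytree.Node;
import main.java.algo.datastructures.binarytree.traversal.PreOrder;

public class PreOrderDemo {

	public static void main(String[] args) {
		// build the example tree: 4 at the root, 2 and 5 below it, and so on
		Node root = new Node(4);
		root.left = new Node(2);
		root.right = new Node(5);
		root.left.left = new Node(1);
		root.left.right = new Node(3);
		root.left.left.left = new Node(0.5);

		new PreOrder().preOrderRecursive(root); // prints: 4.0 2.0 1.0 0.5 3.0 5.0
	}
}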