Essentials of Algorithmic Design with proper Data structures.

There can be multiple solutions exist for a problem, choosing the best solution from around set of solutions is the key for solving problems. Algorithmic design approach enables us to solve computational problems that would n’t other wise be addressed. Now a days product companies facing problems with performance of applications. When it comes to end user satisfaction, performance is the key and is the factor that differs you from others. It’s not about developing a working solution with output, It’s about the key terms performance, scalability and availability for web applications. This post high lights Essentials of Algorithmic Design with proper Data structures.

Every programmer developing a working solution should ask a question like Will my program be able to solve a large practical input? , He may get around with set of questions, Why is my program so slow ?  and Why does it run out of memory ?  Well, you have developed a working solution, when it comes to large sets of practical input, your program fails!  This is the root cause and a motivation factor for us to analyze our programs for Time Complexity and Space Complexity, which would lead to best algorithmic solutions for a problem.

Google faced problems in using data structures with their Big Data, so they have come up with additions to Java Collections (java.util.*) like Google Collections (Guava) showing the importance of data structures into programming which would ease the faster phase development.

Linus Torvalds, creator of Linux says that

I will, in fact, claim that the difference between a bad programmer and good one is whether he considers his ode or his data structures
more important. Bad Programmers worry about the code, good programmers worry about data structures and their relationships.

The Big names like Google, Microsoft , Amazon, Facebook, Yahoo, LinkedIn etc., successful just because the underlying programs having good algorithms and they are making use of proper data structures.

Robert Sedgewick, Computer scientist and a professor at Princeton University says that

A good Algorithm is better than a super computer!

Here are some of the real world examples that motivate us to think in terms of algorithmic design with proper data structures.

Databases: B-Trees, B+ Trees are extensively used in databases, secondary memory data structures for faster searching in secondary memory hard disk. If you are a programmer you might have created indexes on your databases by something like this.CREATE INDEX ON TABLE <TABLE_NAME>(<COLUMN>); In fact, no RDBMS would exist with out using these underlying B-tree, B+Tree data structures. This is a motivation for studying Tree data structures like BST, Balanced Binary Search Trees, Red-Black BSTs,(2-3 Tree), Bayer’s Tree (B-Tree, B+-Tree).

Sorting: Sorting is every where in computer science, with sorting, your problem statement can be simplified and can be solved easily. Java library provides system sorts Arrays.sort()  and Collections.sort() using a combination of classic algorithms like Insertion Sort, Quick Sort for primitive types and Merge Sort for Object type data.Recently Java 7 introduced a new Dual Pivot Quick Sort algorithm implementation for improved performance. By knowing classic algorithms in sorting like Bubble sort, Selection sort, Insertion sort, Shell sort, Merge sort, Quick sort, Counting sort and Radix sort problems can be solved easily for performance, with out them, wouldn’t otherwise be addressed.

Google Search : Right from our classic searching algorithms like linear search, binary search,Binary Search Trees, Red-Black Binary Search Trees, and B-Trees. Inverted Index is the key data structure for Google Search. Every Search Engine Algorithm would follow this key Inverted Index data Structure. If you are a java programmer, LUCENE  from Apache, a Java library that implements Inverted Indexing data structure, using this, you can create your own Google Search! There are other searching related projects on Apache, which would solve searching problems in web apps are SolrNutch , Katta and Elastic Search. These real world implementations motivate us to study classic searching algorithms like Linear Search, Binary Search, Symbol Tables and Hash Tables. Hash Table is a key data structure for searching in the Order of constant time O(1).

Facebook Graph Search: Recently Facebook introduced a Graph Search, what do you think that they would do internally ?!  They would maintain a social graph of friends list and do processing on the graphs. Pregel is a Big Graph processing API used by Google. Inspiring from Pregel, Apache has similar open source implementation callled Giraph even twitter has their own Graph processing API called Cassovary which is coded in modern JVM language, Scala. You can find source code for Graph API in the twitter GitHub repository . These real world implementations motivate us to study classic Algorithms like Breadth First Search, Depth First Search and Spanning Tree algorithms like Kruskal, Prims, Dijkistra and Bellman-Ford to understand better of present Graph APIs.

Google Top Hitters in the Search :

At the end of the year, Google releases top searches related to people,movies, cities and travel etc., in the Google Trends, http://www.google.co.in/zeitgeist/2012/#india . As Google receives data continuously through searches (Streaming data), How do they really compute these top searches?!. Well, if you want to know the under hood of it, the answer is Streaming Algorithms . This is the motivation for us to study Streaming Algorithms problem statements like Finding frequent items, Count Min sketch and Finding a random sample from streaming data.

From Wikipedia – “Streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.”

Pattern matching in text documents, Text Editors:

We might wonder how searching in a word document work for particular pattern?! The internals of any document pattern search would follow String algorithms. Eclipse follows similar pattern matching algorithms (Eclipse, a Java Implemenation). These real world implementations are motivation for us to study our classic String Pattern matching algorithms like Knuth-Morris-Pratt (KMP), Boyer-Moore, Rabin-Karp, regular expression matching, run-length coding, Huffman coding and LZW compression. and to perform prefix based search, using Trie and Patricia Trie as data structures.

Distributed Computing / Parallel Processing Algorithms:

In the modern world of internet. data sharing is growing enormously across web apps.In the Facebook every day 10 billion photos would take up 1 peta byte of storage. For processing such a huge data, normal servers even production blade servers doesn’t scale up. Google published a paper on Distributed / Parallel processing, they used an algorithmic strategy called Map-Reduce. For their Distributed parallel processing Google uses their own file system GFS (Google File System) and Big Table (a large scale , NO SQL data base) as database. Inspiring from Google, a similar implementation from Apache is Hadoop. What ever may be the distributed computing platform those internals use probabilistic data structures called as Bloom Filters. a Bloom Filter is a space efficient probabilistic data structure used to test set membership.These real world scenarios are motivation for us to study probabilistic algorithms and data structures such as Bloom Filters. Static Bloom Filter, Dynamic Bloom Filter and Counting Bloom Filter are variations of Bloom Filter.

NO-SQL Data bases (Not only SQL):

According to http://nosql-database.org/ The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term “nosql” (the community now translates it mostly with “not only sql“).

Whether it is Google Big Table or Facebooks Cassendra or Amazon’s Dynamo or Hadoop’s  HBase. All these NO-SQL data bases uses Probabilistic algorithms. and variations of Bloom Filters. Bloom Filters basically implemented with BitSet class in java and set of Hashing algorithms like MD5, SHA, Jenkins and Murmur hashes.

 Modern Algorithms and Data Structures :

From among the above algorithms and data structures, Inverted Indexing is a data structure for search engine algorithms. Streaming algorithms and Bloom Filters are used mostly in distributed large scale data web applications. LinkedIn(a Java Web App) uses LUCENE  as core for searching in their Web App. Google uses, implements streaming algorithms for problem statements such as finding top-k hitters in search. Facebook uses Cassendra as data base which internally uses Bloom Filters. Yahoo servers run on Hadoop cluster which again uses Bloom Filter variations. Every modern internet applications which care about scalability, and availability they should go by parallel processing, Google’s Map-Reduce  algorithmic paradigm is the key for distributed parallel computing.

Here is the list of data structures from Wiki http://en.wikipedia.org/wiki/List_of_data_structures.

Right algorithmic design with proper data structures is the key to success of modern web applications!

Do let us know, your thoughts on this.

Leave a Comment