Wednesday, December 5, 2012

Key research papers behind the growth in Big Data tools

Many big data tools, including Hadoop, owe their beginnings to research papers published by Google and Amazon. These papers are a good place to start when trying to understand the technologies and tools that drive big data analytics. Below are descriptions of each technology and links to the corresponding research papers.

Bigtable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.

MapReduce - MapReduce is a programming model and an associated implementation for processing and generating large data sets. Google introduced MapReduce, inspired by the map and reduce primitives present in Lisp and many other functional languages, and has implemented hundreds of special-purpose computations with it that process large amounts of raw data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
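
To make the map and reduce roles concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the programming model, not Google's or Hadoop's implementation: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the in-memory grouping stands in for the distributed shuffle a real framework performs.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take (document name, text) and emit an intermediate (word, 1) pair per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same intermediate key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver: group intermediate pairs by key, then reduce each group.

    A real framework would spread this work across many machines; here
    everything runs in one process purely for illustration.
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for intermediate_key, intermediate_value in map_fn(key, value):
            groups[intermediate_key].append(intermediate_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Made-up sample input: (document name, document text) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

In this sketch the user supplies only map_fn and reduce_fn; the grouping step in between is the part a MapReduce framework takes care of.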

Google File System - Google designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs.

Dynamo - Dynamo is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.