Big Data Rookie

Saturday, January 19, 2013

Resources for Learning more on Hadoop

I recently attended a Hadoop Users Group meetup. Here are some of the suggested materials for learning more about Hadoop and its ecosystem.

Essentials for Apache Hadoop - Register and watch six recorded webinars from Cloudera on Hadoop.

Yahoo! Hadoop Tutorial - A series of tutorials on how to use the Hadoop distributed data processing environment.

Hadoop: The Definitive Guide - The Hadoop Bible

O’Reilly Ecosystem books - Hive, Pig, HBase, Cassandra, and others

Hadoop in Action and Hadoop in Practice - Example-based books. Very pragmatic; they get you up to speed quickly.

Other Links:

Cloudera Training & Distributions
http://www.cloudera.com/resources/
https://ccp.cloudera.com/display/SUPPORT/Downloads
Hortonworks Training & Distributions
http://hortonworks.com/community/
http://hortonworks.com/download/
Hadoop World 2010, 2011, 2012 - Slides and video
http://www.hadoopworld.com/
Cloudera Essentials Series – 1 to 6 (Audio)
http://www.cloudera.com/search/?q=essentials
Apache Hadoop – Petabytes and Terawatts
http://www.youtube.com/watch?v=SS27F-hYWfU
History of Hadoop
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop/
Adam Bosworth Interview from 2005 (source of some quotes in this presentation)
http://itc.conversationsnetwork.org/shows/detail571.html
An Intro to Hadoop – Mark Fei
http://cdn.oreillystatic.com/en/assets/1/event/85/An%20Introduction%20to%20Hadoop%20Presentation.pdf
YARN/MRv2 Information (Next-Generation Hadoop)
http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
http://hadoop.apache.org/docs/r0.23.0/index.html
Brad Hedlund - Understanding Hadoop Clusters and the Network
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/#download

Wednesday, December 5, 2012

Key research papers behind the growth in Big Data tools

Many big data tools, including Hadoop, owe their beginnings to research papers published by Google and Amazon. These papers are a good place to start when trying to understand the technologies and tools that drive big data analytics. Below are descriptions of the technologies and links to the research papers.

Bigtable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.

MapReduce - MapReduce is a programming model and an associated implementation for processing and generating large data sets. Google has implemented hundreds of special-purpose computations that process large amounts of raw data using MapReduce. Inspired by the map and reduce primitives present in Lisp and many other functional languages, Google introduced MapReduce. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
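To make the map and reduce roles concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It is not Google's or Hadoop's implementation: the names run_mapreduce, map_word_count and reduce_word_count are hypothetical, and the sample documents are made up for illustration. A real framework would distribute the map, grouping and reduce steps across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take a (document name, text) pair and emit one (word, 1) pair per word."""
    for word in value.lower().split():
        yield word, 1


def reduce_word_count(key: str, values: List[int]) -> int:
    """Reduce: merge all intermediate values that share the same word."""
    return sum(values)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: map every record, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in map_fn(key, value):
            # Grouping step: collect values under their intermediate key.
            groups[intermediate_key].append(intermediate_value)
    return {k: reduce_fn(k, v) for k, v in groups.items()}


# Made-up sample input: (document name, document text) pairs.
documents = [("doc1", "big data big tools"), ("doc2", "data tools")]
counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(counts)  # {'big': 2, 'data': 2, 'tools': 2}
```

The user only writes the two small functions; the driver owns the grouping by intermediate key, which is the part a real framework parallelizes.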

Google File System - Google designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing.

Dynamo - Dynamo is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.

Monday, November 12, 2012

A few thoughts on visualizing Big Data

Big data is about volume, velocity and data in a variety of formats. How do you communicate meaning from petabytes of data? If you had only one page or slide to capture the attention of your audience, which medium would you choose to reach them? I would say a visual, specifically an image that visualizes the big data. Data visualization can be the key to exposing something new about the underlying patterns and relationships contained in the thousands or even millions of rows of data. It is increasingly becoming the language preferred by customers and collaborators to understand the data you present: dynamic images are a better way to communicate than long lists of numbers.

My first positively awe-inspiring moment on the power of visual data came when viewing Hans Rosling's presentation (available on TED) on global trends in health and economics. Another dynamic visual example is the History of the World in 100 Seconds, created by parsing an XML dump of all Wikipedia articles and pulling out 424,000 articles and 35,000 references to events with coordinates. How do art, storytelling and information come together to create a dynamic visual? Below are links to some interesting articles exploring this theme:

Big Data : A Picture is Worth a Thousand Words : Visualize big data to make better decisions. Visualization provides data in a format that’s easy for business users to digest and use.

Conflicting Advice on Data Visualization : Data visualization is often thought of as a simple communication tool. Is there room for artistic expression and what design features and labeling methodology should you consider while creating a visual image of data?

Data Visualizations Are More Than Just Pictures : When visualization is done right, it can reveal so much. Data visualizations are a kind of bidirectional encoding that lets ideas and information be transported from the database into your brain. Software and automation help you quickly iterate on data and experiment with it to find the signal within the noise. Interactive visualizations add a new dimension to complex data sets, enhancing the audience's ability to understand a company's business. How will you ensure that the graphs or visuals are not incomplete or misleading representations of the knowledge your company holds? Should you hire a professional who understands principles of design and visual communication?

Making Data Beautiful : Making data visually beautiful so that it becomes a pleasure for us to absorb.

The New York Times has a team dedicated to data visualization and information design. Here is a link to a page with some interesting graphics from nytimes.com. The Times also has some interesting examples of how graphics can be used in the classroom and a list of places to start learning about infographics. There is a sequel to this article with more information on teaching how to create and interpret infographics.

Friday, November 2, 2012

Cloudera's Impala

Cloudera is the best-known Hadoop vendor around. Last week Cloudera announced its latest offering, Project Impala.

Project Impala is a parallel real-time query engine that can run atop the raw Hadoop Distributed File System (HDFS) or the HBase tabular overlay for HDFS that makes it look somewhat like a relational database.

Impala does not work through Hadoop MapReduce. Impala uses a SQL-like syntax, including SELECT, JOIN, and aggregate functions, and lets you query data stored in HDFS or Apache HBase in real time.

Here is a list of articles and opinions on what this will mean for Hadoop users:

Thursday, November 1, 2012

A place for the latest Hadoop news

For a novice in big data, Hadoop is a moment of truth. Learning what Hadoop does stirred my imagination, and it was the first time I realized that it is possible to bring together a variety of data types to find patterns without going through the conventional RDBMS path. Here is a link to the latest news on Hadoop.

Tuesday, October 30, 2012

Wow ... Microsoft

There are very few recent examples where Microsoft makes you dream of the possibilities of software. This article on nytimes.com is a great example of the payoff from investment in R&D. Microsoft is likely to create a new revenue stream with its entry into the big data scene.

Monday, October 29, 2012

HBR on Big Data

Harvard Business Review has a 'big' segment on Big Data. This link leads to a rich collection of articles documenting experiences from retailers and social networking sites, management perspectives, emerging careers in the field, and a host of opinions on the impact of analytics on decision making.