Monday, November 12, 2012

A few thoughts on visualizing Big Data

Big data is about volume, velocity and data in a variety of formats. How do you communicate meaning from petabytes of data? If you had only one page or slide to capture the attention of your audience, which medium would you choose to reach them? I would say a visual, specifically an image that visualizes the big data. Data visualization can be the key to exposing something new about the underlying patterns and relationships contained in thousands or even millions of rows of data. It is increasingly becoming the language preferred by customers and collaborators to understand the data you present: dynamic images are a better way to communicate than long lists of numbers.

My first positively awe-inspiring moment on the power of visual data came when viewing Hans Rosling's presentation (available on TED) on global trends in health and economics. Another dynamic visual example is the History of the World in 100 seconds, created by parsing an XML dump of all Wikipedia articles and pulling out 424,000 articles and 35,000 references to events with coordinates. How do art, storytelling and information come together to create a dynamic visual? Below are links to some interesting articles exploring this theme:


Big Data : A Picture is Worth a Thousand Words : Visualize big data to make better decisions. Visualization provides data in a format that’s easy for business users to digest and use.

Conflicting Advice on Data Visualization : Data visualization is often thought of as a simple communication tool. Is there room for artistic expression, and what design features and labeling methodology should you consider while creating a visual image of data?

Data Visualizations Are More Than Just Pictures : When visualization is done right, it can reveal so much. Data visualizations are a kind of bidirectional encoding that lets ideas and information be transported from the database into your brain. Software and automation help you quickly iterate on data and experiment with it to find the signal within the noise. Interactive visualizations add a new dimension to complex data sets, enhancing the audience's ability to understand a company's business. How will you ensure that the graphs or visuals are not incomplete or misleading representations of the knowledge your company holds? Should you hire a professional who understands the principles of design and visual communication?

Making Data Beautiful : Making data visually beautiful so that it becomes a pleasure for us to absorb.

The New York Times has a team dedicated to data visualization and information design. Here is a link to a page with some interesting graphics from nytimes.com. The Times also has some interesting examples of how graphics can be used in the classroom and a list of places to start learning about infographics. There is a sequel to this article with more information on teaching how to create and interpret infographics.

Friday, November 2, 2012

Cloudera's Impala

Cloudera is the best-known Hadoop vendor around. Last week Cloudera announced its latest offering, Project Impala.

Project Impala is a parallel, real-time query engine that can run atop the raw Hadoop Distributed File System (HDFS) or atop HBase, the tabular overlay that makes HDFS look somewhat like a relational database.

Impala does not work through Hadoop MapReduce. It uses a SQL-like syntax, including SELECT, JOIN, and aggregate functions, and lets you query data in real time whether it is stored in HDFS or Apache HBase.
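To give a feel for the kind of query this covers, here is a minimal sketch of a SELECT with a JOIN and an aggregate function. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in rather than Impala itself, and the table names, columns, and rows (`users`, `page_views`) are made up for the example.

```python
import sqlite3


def run_demo_query():
    """Run a SELECT with a JOIN and an aggregate on a tiny in-memory database.

    sqlite3 is a stand-in here, not Impala; tables and data are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Two small hypothetical tables: who the users are, and what they viewed.
    cur.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
    cur.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
    cur.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "US"), (2, "IN"), (3, "US")],
    )
    cur.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [(1, "/home"), (1, "/docs"), (2, "/home"), (3, "/home")],
    )

    # SELECT + JOIN + aggregate: count page views per country.
    cur.execute(
        """
        SELECT u.country, COUNT(*) AS views
        FROM page_views v
        JOIN users u ON u.user_id = v.user_id
        GROUP BY u.country
        ORDER BY views DESC
        """
    )
    rows = cur.fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for country, views in run_demo_query():
        print(country, views)
```

The point of the sketch is the shape of the query, not the engine: the promise of Impala is that a familiar statement like this could run directly against data sitting in Hadoop instead of a conventional database.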

Here is a list of articles and opinions on what this will mean for Hadoop users:

Thursday, November 1, 2012

A place for the latest Hadoop news

For a novice in big data, Hadoop is a moment of truth. Learning what Hadoop does stirred my imagination, and it was the first time I realized that it is possible to bring together a variety of data types to find patterns without going through the conventional RDBMS path. Here is a link to the latest news on Hadoop.