Wednesday, December 5, 2012

Key research papers behind the growth in Big Data tools

A large chunk of big data tools, including Hadoop, owe their beginnings to the research papers published by Google and Amazon. These papers are a good place to start when trying to understand the technologies and tools that drive big data analytics. Below are descriptions of the technologies and links to the research papers.

BigTable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.

MapReduce - MapReduce is a programming model and an associated implementation for processing and generating large data sets. Google has implemented hundreds of special-purpose computations that process large amounts of raw data using MapReduce. Inspired by the map and reduce primitives present in Lisp and many other functional languages, Google introduced MapReduce. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
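To make the map and reduce idea concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the programming model, not Google's or Hadoop's implementation; the names map_fn, reduce_fn and run_mapreduce and the sample documents are my own assumptions.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map step: for one (document name, text) pair, emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce step: merge all intermediate values that share one key."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver; a real system spreads this over many machines."""
    # Group every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Call the reduce function once per intermediate key.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


# Made-up sample input: (document name, document text) pairs.
documents = [("doc1", "big data tools"), ("doc2", "big data papers")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The user only writes map_fn and reduce_fn; the grouping in the middle is the part a real framework handles across thousands of servers.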

Google File System - Google designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs.

Dynamo - Dynamo is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.

Monday, November 12, 2012

Few thoughts on visualizing Big Data

Big data is about volume, velocity and data in a variety of formats. How do you communicate meaning from petabytes of data? If you had only one page or slide to capture the attention of your audience, which medium would you choose? I would say a visual, specifically an image that visualizes the big data. Data visualization can be the key to exposing something new about the underlying patterns and relationships contained in the thousands or even millions of rows of data. It is increasingly becoming the language preferred by customers and collaborators to understand the data you present - dynamic images are a better way to communicate than long lists of numbers.

My first positively awe-inspiring moment on the power of visual data came when viewing Hans Rosling's presentation (available on TED) on global trends in health and economics. Another dynamic visual example is the History of the World in 100 seconds, created by parsing an XML dump of all Wikipedia articles and pulling out 424,000 articles and 35,000 references to events with coordinates. How do art, story-telling and information come together to create a dynamic visual? Below are links to some interesting articles exploring this theme -

Big Data : A Picture is Worth a Thousand Words : Visualize big data to make better decisions. Visualization provides data in a format that’s easy for business users to digest and use.

Conflicting Advice on Data Visualization : Data visualization is often thought of as a simple communication tool. Is there room for artistic expression, and what design features and labeling methodology should you consider while creating a visual image of data?

Data Visualizations Are More Than Just Pictures : When visualization is done right, it can reveal so much. Data visualizations are a kind of bidirectional encoding that lets ideas and information be transported from the database into your brain. Software and automation help to quickly iterate on data and experiment with it to find the signal within the noise. Interactive visualizations add a new dimension to complex data sets, enhancing the audience's ability to understand a company's business. How will you ensure that the graphs or visuals are not incomplete or misleading representations of the knowledge your company holds? Should you hire a professional who understands principles of design and visual communication?

Making Data Beautiful : Making data visually beautiful so that it becomes a pleasure for us to absorb.

The New York Times has a team dedicated to data visualization and information design. Here is a link to a page with some of its interesting graphics. The Times also has some interesting examples of how graphics can be used in the classroom and a list of places to start learning about infographics. There is a sequel to this article with more information on teaching how to create and interpret infographics.

Friday, November 2, 2012

Cloudera's Impala

Cloudera is the best known Hadoop vendor around. Last week Cloudera announced its latest offering, Project Impala.

Project Impala is a parallel real-time query engine that can run atop the raw Hadoop Distributed File System (HDFS) or the HBase tabular overlay for HDFS that makes it look somewhat like a relational database. 

Impala does not work through Hadoop MapReduce. Impala uses a SQL-like syntax and allows you to query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.
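To give a feel for the kind of query this means, here is a small sketch of a SELECT with a JOIN and aggregate functions. It runs against Python's built-in sqlite3 module purely as a stand-in, since I cannot show Impala's own client or exact dialect here; the table names, columns and rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL query engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# SELECT + JOIN + aggregate functions: order count and total amount per region.
query = """
    SELECT c.region, COUNT(*) AS num_orders, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The point is only the shape of the query: the same style of SQL, aimed at data sitting in HDFS or HBase, is what Impala promises to answer without a MapReduce job.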

Here is a list of articles and opinions on what this will mean for Hadoop users:

Thursday, November 1, 2012

A place for the latest Hadoop news

For a novice in big data, Hadoop is a moment of truth. Learning what Hadoop does stirred my imagination, and it was the first time I realized that it is possible to bring together a variety of data types to find patterns without going through the conventional RDBMS path. Here is a link to the latest news on Hadoop.

Tuesday, October 30, 2012

Wow ... Microsoft

There are very few recent examples where Microsoft makes you dream of the possibilities of software. This article is a great example of the payoff from investment in R&D. Microsoft is likely to create a new revenue stream with its entry into the big data scene.

Monday, October 29, 2012

HBR on Big Data

Harvard Business Review has a 'big' segment on Big Data. This link leads to a rich collection of articles documenting experiences from retailers and social networking sites, management perspectives, emerging careers in the field and a host of opinions on the impact of analytics on decision making.

Monday, October 22, 2012

A 'go to' for the latest tech news

I always wonder which is the best place to see what is buzzing in technology. Traditional news sources all cover tech in detail, but most stories make it to these papers only after they have been news for a while. Where can we hear about something when it is just a concept or an idea that is popular among techies? I stumbled across Hacker News while following something mentioned in an article. So far, it has proved to be a reliable source.

Monday, October 8, 2012

Who is a Data Scientist?

A data scientist is central to extracting facts and figures from large volumes of data. Apart from being able to find patterns from large data sets, the data scientist should be able to mine for the most important and business-focused parts and present it to business users from all levels of the enterprise. They are part geek, part story-teller and part graphic illustrator as they deal with algorithms, use narratives to explain their findings and make visual/graphic illustrations to communicate it all. Charles Roe tackles this question in an article published at Dataversity. Ray Rivera, Director, Solutions Management, Workforce Planning and Analytics, SAP has written a Forbes guest article on this very subject. He chooses to call data scientists 'analytical wizards'!

Monday, October 1, 2012

A look at the possibilities of R

Came across an article, written about three years back, talking about the possibilities of R. It is well written and touches on many of the striking features of R that make it attractive to users.

Tuesday, August 28, 2012

Marriage of Cloud Computing and Big Data Analytics

The links between the power of cloud computing and big data analytics are becoming increasingly strong. Without the storage capacity and the cheap computing power offered by the cloud, it would be virtually impossible for many companies to engage in the business of analyzing large volumes of data. This New York Times article explores how Amazon's web server provides cloud services to companies across the globe, changing conventional business models as well as traditional company structures and resource utilization. A Forbes article explores another side of cloud computing: the changing skill set needed to be successful when using cloud computing power. The two articles indicate close ties between cloud and analytics, and it is clear that this relationship is likely to grow even tighter in the future.

Thursday, August 16, 2012

Storing Data on DNA

Today I read two posts on how huge volumes of data can be stored outside of conventional devices like computer chips, drives and discs. One option that a Wall Street Journal article discusses is DNA. A research report in the journal Science describes how a group of Harvard researchers translated the English text of an upcoming book on genomic engineering into actual DNA. The article can be read here. Another article in Forbes examines the issue of data storage and suggests storing data in bacteria and diamonds. These ideas are in preliminary stages and commercial applications are likely to be a long way off, but one thing is certain - the explosive growth of digital data is driving the development of alternative storage solutions.

Thursday, July 19, 2012

Data and analysis, the job engine of the future?

The growth of cloud computing and storage capacity has given rise to new capabilities in data analytics. This article in Forbes explores world-wide job growth in these areas.

Wednesday, July 18, 2012

Big Data on Campus

An article examines how big data is being used by higher learning institutions to shape how students choose courses and classes. It sounded a bit too Orwellian for my taste. It is worth reading as a futuristic look into how big data and analytics shape our choices. Click here for the link.

Thursday, July 5, 2012

I came across an academic paper written a couple of years back. It uses machine learning algorithms to analyse consumer credit risk and predict the probability of default. The paper comes from the MIT Sloan School of Management. What I take away from it is the possibilities opened up by the practical application of big data analytics tools to broad issues like credit risk that have a bearing on the economy at large. Click here for the link.

Tuesday, June 19, 2012

A real time application of a tracking tool

Check out this article on how search results are used to evaluate how well a candidate knows what s/he claims to be a core skill. There is a long way to go before this becomes a general application tool, but it is sufficient to get one's imagination fired up about the possibilities it holds.

Saturday, June 16, 2012

Monday, June 11, 2012

Thursday, May 31, 2012

Here is an online book that could help those learning to forecast using R.

Wednesday, May 2, 2012

R is an important open source tool for analytics. Here is a link to R tutorials from a blog that recently came to my attention!

Monday, April 30, 2012

A conversation on Big Data

Sometimes you want to hear from the companies that are actually innovating in the big data space. You want to know that they are not just talking possibilities but are talking about applications they use. Click here to hear such a conversation. The panelists include experts from leading companies like Symantec and Google and also others like Baynote and Collective(i). The video is over an hour long but exciting, and worth listening to.

Wednesday, April 25, 2012

Making data human using visuals

Imagine using data visualisation tools to put data into a human context. Here is a link that explores making data human.

Answers to drowning in data

Sometimes when you get online, do you feel that you are drowning in a sea of data, trying to find that exact piece of information? I do. Is there an easy way to extract quality content, as opposed to what a search engine wants you to read (a click that will maximise their revenue)? Could sophisticated algorithms be the answer, with their capability to understand context and semantic relationships to match web information with specific customer information needs? Click here for an article that examines just this question.

Monday, April 23, 2012

Just read an interesting piece on reviewing if big data is a strategic fit for an organisation. Go here for more.

Saturday, April 14, 2012

Data visualization

Data visualization tools are key to communicating findings, and this can be very effective when dealing with volumes and volumes of data. Forbes has a slideshow featuring some interesting data visualizations ranging from charts used by Florence Nightingale to more contemporary examples. Check them out here.

Thursday, April 12, 2012

An interesting look into the job market for techies and how companies are wooing potential candidates. Read here

Saturday, April 7, 2012

Data visualization is important to managing and making sense of big data. Click here for an article exploring the art of data visualization.

Sunday, March 18, 2012

Haven't read anything exciting in a while, but this article caught my attention. It explores how physicists are using big data technologies to trace word origins and their usage over time. It is yet another exciting example of how data crunching is fast eliminating the divide between science and humanities when it comes to research. Click here for a link to the article. Go here for the scientific paper.

Thursday, March 1, 2012

Big data in HR

Big data is making its way to HR, and analytics is helping companies make decisions on managing their workforce. Forbes has an article examining this facet of big data analytics.

Thursday, February 16, 2012

Data crunching customer habits

When I first started reading more on big data, I used to feel very lucky. Most of the magazines I read regularly would have articles on big data. It looks possible that an algorithm has identified my reading habits and makes sure I see the big data articles. The New York Times has an article about how retailers like Target use data from our shopping habits to calculate the likelihood of a life-changing event like a pregnancy, and send us apt coupons that encourage us to shop at their store. Click here for the link.

Saturday, February 11, 2012

Crowd source science?

This is so exciting ... Here is a link to how crowd sourcing helped scientists come up with enzyme designs 18 times more efficient than the lab version. Gamers did it!!!

Tweets as Data

Tweets can be an important source of public opinion. Given the complexity of language structure, how easy is it to successfully translate tweets into quantifiable sentiment? WSJ has an article examining just this question; click here for the link.

Wednesday, February 8, 2012

I did not think it was unusual to use more than one monitor. This article discusses this question.

Monday, January 30, 2012

In the 90's, B2B and B2C were all the rage. Many debated how the Internet would change brick and mortar stores. Amazon was both admired and loathed. Today, we hear a similar debate about cloud computing, big data and high-end manufacturing. WSJ has an article exploring the technologies of the future; click here to read the story.

Monday, January 23, 2012

Universities are creating programs to meet industry demand for data analytics professionals. Forbes has the story; the link is here.

Saturday, January 21, 2012

Here is an interesting article from TIME on how the virtual footprints we leave behind are mined to create profiles of our habits and traits. Click here for link.

Thursday, January 19, 2012

Wednesday, January 4, 2012

Harnessing the Power of Algorithms to Reduce Human Bias

Just read a Wall Street Journal article on how self-learning algorithms, capable of chewing through billions of bits of data, are being used by businesses to reduce human bias in decision making. The potential for Data Analytics to improve decision making is increasingly being recognized by businesses. This article reviews how some companies are taking advantage of this computing technology to increase revenues. Link:

Tuesday, January 3, 2012

Popular Science Nov 2011 - Data is Power

Popular Science's November 2011 issue was a special on 'Big Data'. There are several interesting interviews and articles that explore the world of big data and examine how the deluge of data is influencing decision making at different levels. The article on Albert-László Barabási's work in applying network theory to data mining is very interesting and a very clear example of the inter-disciplinary nature of 'big data'. The list of the ten most amazing databases in the world and the possibilities they hold is very impressive. Click here to go to the Nov 2011 issue of Popular Science.