During 2010, the human race took 48 hours to generate the same amount of data as was produced between the dawn of civilization and the year 2003. Being able to manage this flood of data and turn it into meaningful insight means your business can make better decisions.
An example of this is the Google search engine. Imagine the challenge of reading and indexing all of the information on the world wide web. It is perhaps poetic that Google is named after the number 10^100 (a googol is the digit 1 followed by 100 zeroes).
Some applications, like that of the Google search engine, require the collection and analysis of data sets that are so large and/or complex that they become phenomenally difficult to process and store. This data is now commonly known as “Big Data”, and Hadoop is one of the tools used to process and store it.
Whilst Google might seem like an extreme example, we are starting to see the staggering impact on organisations that embrace Hadoop. For example, Tesco expects to cut refrigeration costs by around €20m as a result of processing 70 million pieces of data gathered from refrigerators. The company identified that power was being wasted by keeping the units cooler than they needed to be, and that the cost of failing equipment could be reduced by detecting problems much sooner.
How does it work?
The challenge of processing large data sets within traditional databases arises from the fact that data is centralised – a set of computer processors read and process data that is held in one location.
Hadoop approaches the problem differently, separating data into manageable chunks, each of which is stored and processed independently. As a result, Hadoop does not need shared storage or memory and can use commodity hardware.
To illustrate the impact of this, imagine for a moment that you work for a secret spy organisation. Your team is given just 10 minutes to find out if the name of a public figure occurs in a 50 page report before it is released to the press. You know that you won’t have time to read all 50 pages yourself. Instead, you split the pages between members of the team. Each person then reads through their pages at the same time.
By splitting the information (pages of the report) to different people, and having each person process their own data, you are able to process the data in the required time. This is the principle behind Hadoop.
Splitting the data up is handled by the Hadoop Distributed File System (HDFS). As new data comes in, it is split across multiple servers (nodes) and a duplicate copy is then stored on another node in case the node fails. Hadoop can scale massively, unlike many traditional file systems and storage devices that have limitations to their scale.
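To make the splitting and duplication concrete, here is a minimal sketch of the idea in Python. The block size, replication factor and node names are illustrative inventions for this example; real HDFS defaults to much larger blocks (128 MB) and a replication factor of three.

```python
BLOCK_SIZE = 4    # bytes per block (tiny, purely for illustration)
REPLICATION = 2   # copies kept of each block
NODES = ["node1", "node2", "node3"]  # hypothetical server names

def store(data: bytes) -> dict:
    """Split data into fixed-size blocks and place each block,
    plus a replica, on distinct nodes (round-robin placement)."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Primary copy plus replicas land on the next nodes in the ring,
        # so losing any single node never loses a block.
        targets = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = {"block": block, "nodes": targets}
    return placement

layout = store(b"hello big data world")
for idx, info in layout.items():
    print(idx, info["block"], info["nodes"])
```

Each block ends up on two different nodes, so the data survives the failure of any one server; this is the essence of how HDFS achieves resilience on commodity hardware.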
The processing of data is then handled by MapReduce. This instructs each of your servers to perform a required action (for example, to find the words ‘storage’ and ‘critical’ in a log file) but only on the data held on that server. This means the data and the processors are local to each other (on the same server), and avoids the need to send this data over the network.
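The map and reduce steps can be sketched in a few lines of Python. This is not Hadoop's actual API, just an illustration of the pattern: each "node" counts the target words only in its own local chunk (the map step), and the partial results are merged afterwards (the reduce step).

```python
from collections import Counter

def map_count(chunk: str, targets: set) -> Counter:
    """Map step: count occurrences of the target words
    within one node's local chunk of the log file."""
    counts = Counter()
    for word in chunk.split():
        if word in targets:
            counts[word] += 1
    return counts

def reduce_counts(partials: list) -> Counter:
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Three chunks of a log file, as if held on three separate servers
log_chunks = [
    "storage critical storage ok",
    "warning critical disk",
    "storage full",
]
partials = [map_count(chunk, {"storage", "critical"}) for chunk in log_chunks]
print(reduce_counts(partials))  # Counter({'storage': 3, 'critical': 2})
```

Because each map call touches only its own chunk, no raw data needs to cross the network; only the small partial counts are shipped to the reduce step.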
So should your business be looking at Hadoop?
Hadoop is gaining traction, but the pool of experienced professionals is naturally smaller than that for traditional database skills. Embracing Hadoop therefore brings the challenge of growing your in-house skills, or going to the market for Hadoop expertise. You may find that limited supply means a premium price.
So what is the right approach, given that the Hadoop ecosystem is evolving? As some tools and techniques may fall out of favour as the technology settles, be careful not to invest too heavily in one approach. Also, keep an eye on projects that make it easier to consume Hadoop with existing skillsets, such as SQL access to Hadoop.
For companies without a burning demand, it may be sensible to start exploring the technology on small projects whilst you see how the market emerges. You may want to act sooner if your data is too 'big', unstructured or computationally intensive for a traditional database to handle.
Why not experiment with Hadoop to help shape your future strategy? Node4 has significant open source expertise and can help you to quickly spin up virtual machines on N4Cloud for a temporary proof-of-concept project. At the end of the project, your servers can be decommissioned, meaning you are not left with stranded IT costs.
Join us at our next Cloud Clarity event in October to learn more about open source technologies including Hadoop.
Mark Wilson is a technology fanatic who works for Node4. He is focused on helping our customers benefit from innovation and new technology. With a mix of technical and commercial expertise, he has developed innovative IT services for major global outsourcers, midmarket service providers and SME businesses, and is one of just over 1000 IT evangelists recognised worldwide in 2014 and 2015 as a VMware vExpert. Follow him on Twitter @markwilsontech.