Helping IT, Information and Data professionals reduce cost and complexity whilst driving innovation across Public Cloud Platforms.
Hadoop hit our datacentres nearly 15 years ago and provided democratised access to distributed storage and processing for Big Data workloads, without the need for expensive Enterprise Software and dedicated High-Performance Compute.
Over the years, the ecosystem (and I include projects such as Pig, Hive and Spark in this) has grown into a feature-rich, Open Source, High Availability (HA) distributed storage and processing platform that has stayed true to its heritage of deploying across commodity hardware. Many large Enterprises looking to exploit the ever-growing mine of data available have taken advantage of the low cost of entry into Big Data processing and have built large clusters processing ever larger workloads.
It is this success that is now starting to shine a spotlight on the Total Cost of Ownership of these large On-Premise clusters. What started out as Open Source on commodity hardware has turned into ever larger clusters requiring maintenance, data centre real estate, performance management and a strategic approach to capacity planning.
It’s the high availability architecture that is now turning into a negative from a cost management perspective. In high-level terms, a multi-node Hadoop cluster works by having a master manage jobs and tasks and farm them out to worker nodes to compute, whilst at the same time managing where the data is stored. It’s easy to add a worker for additional processing and storage capacity. As workloads increase and the value of the data is realised, more money is spent on the commodity nodes (extra memory, faster disks etc). To support exceptional events like Black Friday, additional nodes need to be added (sometimes lots of additional nodes) that then sit idle for the rest of the year.
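As a toy illustration of that master/worker split (this is plain Python, not Hadoop itself — the word-count task and sample data are invented for the example), a coordinator can farm independent chunks of data out to a pool of workers and then aggregate the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk):
    # Each "worker" processes only the chunk of data it was handed.
    return len(chunk.split())

# The "master" holds the list of data blocks to be processed.
data_blocks = ["the quick brown fox", "jumps over", "the lazy dog"]

# Farm the blocks out to a pool of workers in parallel...
with ThreadPoolExecutor(max_workers=3) as workers:
    results = list(workers.map(word_count, data_blocks))

# ...then the master aggregates the partial results.
total = sum(results)
print(total)  # prints 9
```

Adding capacity in this model is as simple as adding another worker to the pool — which is exactly why it is so easy for a real cluster to keep growing.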
Suddenly you find yourself with a large team of highly qualified technical staff required solely for the purpose of feeding and watering your cluster. Often, I see that the analytics team actually using the data is smaller and less well funded than the maintenance team.
This incremental growth in support creeps up on most companies and it’s not until a formal hardware refresh that the cost is really seen.
The good news is that you have options. The rise in cloud compute has democratised high availability compute in a way that the co-founders of Hadoop (Doug Cutting and Mike Cafarella) would (I’m sure) applaud. Add in the functionality to burst (grow then shrink back in response to peak demands) and you no longer need to build a cluster optimised for peak loads and then spend most of your Operational Budgets in maintaining it.
Option 1 would be to ‘lift and shift’ a minimal footprint to cloud using HA and burst capability, reducing maintenance and data centre spend whilst building in scale and performance gains. The advantage is that it’s seamless to your analytics team, who continue to use the same software and all of the routines they have developed over the years, and carry on as normal.
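To make the burst idea concrete, managed Hadoop services let you declare scaling limits rather than buying for peak. As one illustrative sketch (using the field names from Google Cloud Dataproc’s autoscaling policy format — check the current documentation before relying on them), a policy might keep a small core of workers and allow cheaper secondary workers to grow and shrink with YARN demand:

```yaml
# Sketch of a Dataproc autoscaling policy: a small permanent core,
# with burst capacity that scales back down after the peak.
workerConfig:
  minInstances: 2      # always-on core workers
  maxInstances: 10
secondaryWorkerConfig:
  minInstances: 0      # burst capacity, e.g. for Black Friday
  maxInstances: 50
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
```

The point is the shape of the model: the exceptional-event capacity exists only while the exceptional event does, instead of depreciating in a data centre for the other eleven months of the year.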
Option 2 would be to take the opportunity to ‘transform’ and take advantage of born-in-the-cloud big data storage and processing capability such as Google Cloud Platform’s Cloud Bigtable or BigQuery. Other cloud services are available from the likes of Azure, Oracle OCI or AWS, but I like the symmetry that Doug and Mike’s original efforts on Hadoop were inspired by Google’s white papers on MapReduce and the Google File System.