When it comes to optimizing Hadoop performance, DevOps professionals and the administrators who manage distributed storage and processing systems might want to take a page or two from their high school chemistry textbooks. As one might have learned in grade 10 science, to optimize a chemical reaction, the focus must be on the rate-determining step. Many different reactions may take place to transform one molecule into another, but no matter how many intermediate reactions occur along the way, it's the one that takes the longest that determines how fast the overall reaction proceeds.
Identifying the rate-determining step in a poorly performing Hadoop cluster has become a real problem. Troubleshooting was a simpler science four or five years ago when a Hadoop installation was little more than the core Apache Hadoop binaries with a peripheral Apache project thrown in for good measure.
But the world of distributed clock cycles and data has evolved, and a typical installation will incorporate everything from MapReduce, the Hadoop Distributed File System and YARN to complementary Apache Hadoop projects like Hive, Pig and Zookeeper. When a system misbehaves and DevOps professionals try to fix Hadoop, figuring out what is wrong -- and where the rate-determining step for a given transaction is located -- can be a darned near impossible problem to solve.
"Each one of the components that come out of the Apache Software Foundation have their own settings and parameters that you need to fiddle with in order to get them configured and deployed properly," said Tim Hall, Hortonworks' vice president of product management. He empathized with the plight of administrators and operations people trying to squeeze as much performance out of the infrastructure as possible. But why would someone from Hortonworks be so empathetic?
Projects competing during runtime
The fact is, the Hortonworks Data Platform (HDP) is made up of more than 20 different Apache projects, all of which compete for clock cycles, storage space and network bandwidth at runtime. Perhaps it's a curse that the company has brought upon itself, but in order for customers to have faith in Hortonworks, every single component in HDP must be tested, vetted and proven to work with every other one. And even if Hortonworks can prove that its software works in a controlled environment, there's no telling what type of hardware or infrastructure customers will run Hadoop on.
"We're testing, validating that they work together in concert, and we want to provide recommendations to customers so that everything works properly, but there are [all] kinds of variations in hardware, the number of disks, the CPUs, the memory footprint and network I/O," Hall said. "There is a lot of work for the administrators to do."
So in response to Hadoop users asking time and time again for recommendations on optimizing their systems, Hortonworks is introducing the notion of SmartSense and, more specifically, Smart Configuration. The goal is to help users focus on the rate-determining steps in their data-crunching solutions.
This goal is achieved by guiding users to the 15 or 20 most critical configuration settings and suggesting how to tune those settings given specific hardware and the way customers are using various Apache projects. Which projects are most critical to an HDP setup is a big part of the information that Smart Configuration provides, because with each suggested setting, advice is given on how tweaking a parameter to make one component more efficient might affect another component in the domain. "If I set one value, does it affect any other values in that configuration? Smart Configuration will tell me before I get into trouble," Hall said.
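The kind of cross-setting check Hall describes can be illustrated with a minimal Python sketch. This is not Smart Configuration's actual logic, which is not detailed here. The function name `check_memory_settings`, the `map_heap_mb` key and the 80% heap rule of thumb are assumptions made for illustration; the other keys are standard YARN and MapReduce property names.

```python
def check_memory_settings(conf):
    """Return warnings for memory settings that conflict with one another.

    Hypothetical helper: it only illustrates how changing one value can
    invalidate another. It is not part of any Hortonworks product.
    """
    warnings = []
    node_mb = conf["yarn.nodemanager.resource.memory-mb"]
    max_alloc_mb = conf["yarn.scheduler.maximum-allocation-mb"]
    map_mb = conf["mapreduce.map.memory.mb"]
    # Hypothetical key: the -Xmx value from mapreduce.map.java.opts, in MB.
    map_heap_mb = conf["map_heap_mb"]

    # The scheduler cannot hand out a container larger than a node offers.
    if max_alloc_mb > node_mb:
        warnings.append("maximum allocation exceeds NodeManager memory")
    # A map task's container request must fit under the scheduler maximum.
    if map_mb > max_alloc_mb:
        warnings.append("map container exceeds scheduler maximum allocation")
    # Assumed rule of thumb: keep the JVM heap at or below 80% of the container.
    if map_heap_mb > 0.8 * map_mb:
        warnings.append("map JVM heap leaves too little container headroom")
    return warnings


baseline = {
    "yarn.nodemanager.resource.memory-mb": 8192,
    "yarn.scheduler.maximum-allocation-mb": 8192,
    "mapreduce.map.memory.mb": 4096,
    "map_heap_mb": 3276,
}
print(check_memory_settings(baseline))

# Shrinking only the map container makes the unchanged heap setting invalid.
tweaked = dict(baseline, **{"mapreduce.map.memory.mb": 2048})
print(check_memory_settings(tweaked))
```

In this sketch the baseline passes cleanly, while halving the map container size triggers a warning about a heap value that was never touched, which is the sort of side effect an administrator would want flagged before applying a change.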
Prepackaged configurations are coming
Smart Configuration is only one part of the three-step journey that Hortonworks is embarking upon to make it easier to get the most out of a Hadoop system:
- The first part is the aggregating of all of the various parameters that can be tweaked to change how a system behaves at runtime, something Hortonworks already provides.
- The second step is the Smart Configuration concept, in which an entire system can be modified simply by moving a slider or turning a dial on the program's user interface (UI). The UI input then gets transformed into actual system settings, and the product takes care of pushing those configuration changes out to all of the nodes on the network.
- To complete the trifecta, Hortonworks is looking at introducing profiles, in which an entire set of configurations comes prepackaged, and administrators can apply an entire suite of optimizations simply by choosing the right profile from a drop-down list. It's not available yet, but it's coming.
Optimizing enterprise software is always a challenge, especially when your software suite is a conglomeration of different Apache projects that may not have been integration tested as rigorously as they were unit tested. Organizations like Hortonworks are filling in the gaps for customers by providing ideas about how best to optimize an enterprise-strength Hadoop installation.
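The second step, turning a slider position into concrete settings and distributing them, might look roughly like the following Python sketch. Everything here is hypothetical: the function names `slider_to_settings` and `push_to_nodes`, the 1,024 MB floor, the half-of-node ceiling and the 80% heap ratio are illustrative assumptions, not how the Hortonworks UI works. Only the two property names are real MapReduce settings.

```python
def slider_to_settings(position, node_memory_mb):
    """Translate a slider position in [0, 1] into map-task memory settings.

    Hypothetical mapping: 0 gives the smallest container, 1 the largest.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("slider position must be between 0 and 1")
    min_mb = 1024                  # assumed floor for a map container
    max_mb = node_memory_mb // 2   # assumed ceiling: half of a node's memory
    container_mb = int(min_mb + position * (max_mb - min_mb))
    heap_mb = container_mb * 8 // 10  # assumed 80% heap-to-container ratio
    return {
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
    }


def push_to_nodes(settings, nodes):
    """Stand-in for distribution: give every node its own copy of the settings."""
    return {node: dict(settings) for node in nodes}


settings = slider_to_settings(0.5, node_memory_mb=8192)
rollout = push_to_nodes(settings, ["node1", "node2", "node3"])
print(settings)
print(sorted(rollout))
```

The point of the sketch is that a single UI input fans out into several related values that stay consistent with one another, and the same result is applied to every node rather than being edited by hand on each machine.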
What tips and tricks do you have for optimizing Hadoop performance? Let us know.