High-performance computing with cloud clusters

An expert offers tips for using cloud clusters to roll out high-performance computing applications in the cloud.

Cloud computing began by focusing on improving the application architecture for systems of engagement, but had little to offer in terms of high-performance computing. Today, leading cloud providers are refactoring their offerings and the associated infrastructure to make computationally intensive applications practical and cost effective.

Traditionally, clouds have been architected to deliver services that combine applications with storage, such as Dropbox, Gmail, iTunes and Evernote. "Clusters are architected to expose resources in addition to storage, such as, for example, those required to execute vendor-provided or user-built applications in customized networks," said Matthijs Van Leeuwen, CEO of Bright Computing.

Much like traditional clusters running on dedicated hardware, a cloud-based cluster comprises distinct distributed resources that have been combined for some purpose. That purpose could be delivering a platform for a cluster-aware database management system (DBMS), a high-performance computing (HPC) application or a big data analytics application. Public cloud providers like Amazon and Rackspace expose predefined instances of resources that can be used in building clusters on their cloud infrastructure.

OpenStack allows organizations to define their own resource instances and then use these to build clusters in their own private clouds. Dedicated on-premises clusters, by contrast, are typically built from physical servers or from virtual machines (VMs) running on hypervisors on those servers. For developers, the key difference between cloud and dedicated clusters is this resource-instance abstraction.

Common cluster uses

Van Leeuwen said cloud clusters can be used to replace or complement dedicated resources. In the first use case, an organization with minimal dedicated hardware, such as a laptop, can use the cloud to instantiate, use and de-instantiate clusters. Here the laptop is no more than an end-user device that accesses the cloud-based cluster. It provides none of the instantiated resources used to execute calculations or craft networks.

In the second common use case, cloud-based resources can be used to complement dedicated resources. In this case, the on-premises resources are extended by those available in the cloud via cloud bursting processes. The cloud-based resources need only be instantiated, used and de-instantiated as demand dictates. This distinction between on-premises and in-the-cloud resources can be made transparent to end users and many types of applications.
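The instantiate-as-demand-dictates logic behind cloud bursting can be sketched in a few lines. This is a minimal illustration, not how any particular cluster manager or cloud provider implements it: the function names, the parameters and the idea of sizing the burst from queue depth are all assumptions made for the example.

```python
import math


def plan_burst(queued_jobs, on_prem_slots, jobs_per_cloud_node, max_cloud_nodes):
    """Return how many cloud nodes the current queue depth calls for.

    All parameters are hypothetical inputs that a scheduler would supply.
    """
    # Only work that does not fit on premises spills over to the cloud.
    overflow = max(0, queued_jobs - on_prem_slots)
    needed = math.ceil(overflow / jobs_per_cloud_node)
    # Cap the burst so a demand spike cannot run up an unbounded bill.
    return min(needed, max_cloud_nodes)


def reconcile(current_nodes, target_nodes):
    """Return (nodes_to_start, nodes_to_stop) to move from current to target."""
    delta = target_nodes - current_nodes
    return (max(delta, 0), max(-delta, 0))


# Example: 30 queued jobs, 16 on-premises slots, 4 jobs per cloud node.
target = plan_burst(30, 16, 4, max_cloud_nodes=8)
to_start, to_stop = reconcile(current_nodes=0, target_nodes=target)
```

When the queue drains, the same two calls return a target of zero and a stop count equal to the running nodes, which is the de-instantiate half of the cycle.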

Both use cases can be applied to public or private clouds. Organizations can architect their application to do this directly or leverage a tool like Bright Cluster Manager to create a cluster in the AWS public cloud or in an OpenStack private cloud with less up-front development and configuration work.

Bridge the abstraction gap

The biggest challenge developers face is that cloud resources like networking, CPU and storage are provisioned through a different abstraction model than dedicated hardware. Clouds are dependent upon instantiated resources. In addition to storage, the exposure of cloud-based CPU instances is quite mature in both public and private cloud offerings. The latest cloud offerings come with services and hooks for specifying exotic requirements like InfiniBand interconnects, GPU acceleration and customized IP networks.

Any resource must be exposed as an instance in the same way before it can be used within a cloud of any type. Because clusters routinely make use of low-latency, high-bandwidth interconnect fabrics, accelerators and coprocessors, and other specialized resources, each of these presents both a challenge and an opportunity in the case of cloud-based clusters.

Organizations are at the mercy of the cloud provider to support the instantiation of resources beyond storage and compute, said Van Leeuwen. AWS, for example, supports customized IP networks via Amazon VPC as well as an NVIDIA GPU instance. A good practice is to develop standard configurations or leverage third-party cloud management to manage storage, compute, networks and accelerator resources, whether they reside on premises or in concert with AWS.

Latency is critical for clusters

Communications latency is one of the biggest challenges in building scalable cluster applications. A good practice is to intelligently stage data for HPC. On the data side, this involves weighing slower, more cost-effective persistent storage services like AWS S3 and archival services like AWS Glacier against more expensive RAM instances.

But an even bigger networking challenge lies in minimizing the latency of communication between nodes during computation. HPC applications that make use of message passing during processing are the most susceptible to bottlenecks. Applications that make extensive use of interfaces like MPI will flounder unless the developers and operations team ensure the latency between nodes is extremely low. 

If the MPI application runs in clusters enclosed within the confines of a private cloud or a public cloud, this is easier to address. But this can be a bigger issue if there is a lot of MPI traffic between different nodes running on separate public or private infrastructure.
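A rough way to see why inter-node latency dominates is a simple latency-plus-bandwidth cost model, in which each message costs a fixed latency plus its size divided by the link bandwidth. The sketch below applies that model to one computation step. The function names and every number in the example are illustrative assumptions, not measurements from any real cluster or cloud.

```python
def comm_time(messages, latency_s, bytes_per_message, bandwidth_bytes_per_s):
    """Estimated seconds spent exchanging messages in one step."""
    per_message = latency_s + bytes_per_message / bandwidth_bytes_per_s
    return messages * per_message


def step_time(compute_s, messages, latency_s, bytes_per_message, bandwidth_bytes_per_s):
    """Estimated seconds for one step: local compute plus message exchange."""
    return compute_s + comm_time(
        messages, latency_s, bytes_per_message, bandwidth_bytes_per_s
    )


def comm_fraction(compute_s, messages, latency_s, bytes_per_message, bandwidth_bytes_per_s):
    """Share of the step spent communicating rather than computing."""
    comm = comm_time(messages, latency_s, bytes_per_message, bandwidth_bytes_per_s)
    return comm / (compute_s + comm)


# Same workload (10 ms of compute, 100 small messages), two assumed links:
# a microsecond-latency fabric inside one cluster, and a millisecond-latency
# path between separate infrastructures.
same_cluster = comm_fraction(0.01, 100, 1e-6, 1000, 1e9)
split_sites = comm_fraction(0.01, 100, 1e-3, 1000, 1e9)
```

With small messages the bandwidth term barely matters. Under these assumed numbers, raising the per-message latency alone moves the step from mostly computing to mostly waiting on the network.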

The same considerations also apply to running big data analytics in the cloud. It does not make much sense to have a Hadoop Distributed File System (HDFS) instance straddle on-premises and cloud infrastructure. "But HDFS instances entirely on the ground or in the cloud can work out quite well in practice," Van Leeuwen said.

The key to maintaining performance as you scale is distributed architecture, said Ilan Sehayek, the CTO of Jitterbit, an Agile cloud integration solution. "Let users choose where to run the API, and where to run services that support the API."

Also ensure that all communication is enabled by a scalable messaging infrastructure that provides fast, guaranteed delivery of API requests between API gateways and services. Cluster-oriented services also need effective caching techniques to provide fast response to APIs, Sehayek added.
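One common caching technique for this kind of API is a time-to-live (TTL) cache placed in front of a slower backing service. The sketch below is a minimal single-process illustration under that assumption. The class name, its methods and the injectable clock are hypothetical and are not drawn from Jitterbit or any other product.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds before recomputing them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


# Example: answer repeated API requests from cache for up to 30 seconds.
cache = TTLCache(ttl_seconds=30)
status = cache.get_or_compute("cluster-status", lambda: {"nodes": 4})
```

A cache shared across API gateways would need a distributed store and an invalidation strategy, which this sketch leaves out.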


What challenges have you faced when implementing high-performance computing with cloud clusters? Let me know what you think, or find me on Twitter @Potemcam.
