OpenStack is, without doubt, an exciting project and the leading open source Infrastructure-as-a-Service platform. Over the last couple of years, I have had the privilege of architecting and deploying dozens of OpenStack clouds for multiple customers and use cases. One of the use cases I worked on in the last year was High-Performance Computing (HPC) on OpenStack. In this blog post, I am going to cover some of the considerations for hosting high-performance and high-throughput workloads.
First, let's start with three architectures that can be used to host HPC workloads on OpenStack:
- Virtualized HPC on OpenStack
  - In this architecture, all components of the HPC cluster are virtualized in OpenStack.
- Bare-metal HPC on OpenStack
  - In this architecture, all components of the HPC cluster are deployed on bare-metal servers using OpenStack Ironic.
- Virtualized head node and bare-metal compute nodes
  - In this architecture, the head node (scheduler, master, and login node) is virtualized in OpenStack, and the compute nodes are deployed on bare-metal servers using OpenStack Ironic.
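In Nova, both VMs and Ironic bare-metal nodes are requested through flavors, so the three architectures mostly differ in which flavors each role uses. The minimal sketch below assumes a Nova/Ironic release that schedules bare metal by resource class: a bare-metal flavor requests one unit of the node's custom resource class and zeroes out the standard VCPU, MEMORY_MB, and DISK_GB classes. The `baremetal.hpc` resource class and the helper names are hypothetical, chosen only for illustration.

```python
import re

# Which roles run as Nova VMs (False) or Ironic bare-metal nodes (True)
# in each of the three architectures.
ARCHITECTURES = {
    "virtualized": {"head": False, "compute": False},
    "baremetal": {"head": True, "compute": True},
    "hybrid": {"head": False, "compute": True},
}


def custom_resource_class(ironic_resource_class: str) -> str:
    """Normalize an Ironic node resource class into a Placement class name.

    Letters are upper-cased, every other character becomes "_", and the
    result is prefixed with "CUSTOM_" (e.g. "baremetal.hpc" -> "CUSTOM_BAREMETAL_HPC").
    """
    return "CUSTOM_" + re.sub(r"[^A-Z0-9]", "_", ironic_resource_class.upper())


def baremetal_extra_specs(ironic_resource_class: str) -> dict:
    """Flavor extra specs that request exactly one whole bare-metal node."""
    return {
        f"resources:{custom_resource_class(ironic_resource_class)}": "1",
        # Zero the standard classes so scheduling matches on the custom class only.
        "resources:VCPU": "0",
        "resources:MEMORY_MB": "0",
        "resources:DISK_GB": "0",
    }


def flavor_extra_specs(architecture: str, ironic_resource_class: str = "baremetal.hpc") -> dict:
    """Return per-role flavor extra specs; VM roles need no special specs here."""
    return {
        role: baremetal_extra_specs(ironic_resource_class) if is_baremetal else {}
        for role, is_baremetal in ARCHITECTURES[architecture].items()
    }


print(flavor_extra_specs("hybrid"))
```

In the hybrid case only the compute role gets bare-metal specs. The head node keeps an ordinary VM flavor sized by its vCPU and RAM.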
Now that we have covered the three architectures for deploying HPC software on OpenStack, let's look at a few OpenStack best practices for hosting this type of workload.
For the networking aspect of OpenStack, there are two recommended configuration options:
- Provider networks: The OpenStack administrator creates these networks and maps them directly to existing physical (L2) networks in the datacenter. Because they attach directly to the L2 switching infrastructure, provider networks don't need to route L3 traffic through the OpenStack control plane; their L3 gateway is part of the datacenter's physical network topology.
- SR-IOV: SR-IOV (Single Root I/O Virtualization) is recommended for HPC workloads with demanding network performance requirements. SR-IOV enables OpenStack to extend the physical NIC's capabilities directly to the instance by passing through one of the NIC's Virtual Functions (VFs). In addition, support for IEEE 802.1BR allows virtual NICs to integrate with, and be managed by, the physical switch.
  - Tests conducted by various vendors show that SR-IOV can achieve near line-rate performance with low CPU overhead per instance.
  - When implementing SR-IOV, keep two important limitations in mind: instances using VF devices cannot be live migrated, and their traffic bypasses OpenStack security groups.
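To show what these two options look like at the API level, the sketch below builds JSON request bodies for the Neutron networking API: a VLAN provider network, a subnet whose gateway is the existing datacenter router (so no Neutron router is involved), and a port with `binding:vnic_type` set to `direct`, which requests an SR-IOV VF. The network name, the `physnet-hpc` label, VLAN 100, and the addresses are hypothetical. For the `direct` port to bind, the compute hosts also need PCI passthrough configured in Nova and the `sriovnicswitch` mechanism driver enabled in ML2.

```python
import ipaddress
import json


def provider_network_body(name: str, physical_network: str, vlan_id: int) -> dict:
    """POST /v2.0/networks body for a VLAN provider network (admin-only attributes)."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    return {
        "network": {
            "name": name,
            "provider:network_type": "vlan",
            # Must match a physical network label in the ML2/agent mappings.
            "provider:physical_network": physical_network,
            "provider:segmentation_id": vlan_id,
            "shared": True,
        }
    }


def provider_subnet_body(network_id: str, cidr: str, gateway_ip: str) -> dict:
    """POST /v2.0/subnets body whose gateway is the datacenter's own L3 router."""
    net = ipaddress.ip_network(cidr)
    if ipaddress.ip_address(gateway_ip) not in net:
        raise ValueError(f"gateway {gateway_ip} is outside {cidr}")
    return {
        "subnet": {
            "network_id": network_id,
            "ip_version": net.version,
            "cidr": str(net),
            "gateway_ip": gateway_ip,
        }
    }


def sriov_port_body(name: str, network_id: str) -> dict:
    """POST /v2.0/ports body that asks Neutron to bind an SR-IOV Virtual Function."""
    return {
        "port": {
            "name": name,
            "network_id": network_id,
            # "direct" passes a VF straight through to the instance (no vSwitch).
            "binding:vnic_type": "direct",
        }
    }


print(json.dumps(provider_network_body("hpc-data", "physnet-hpc", 100), indent=2))
```

The resulting port's ID would then be handed to Nova when booting the instance, for example with `openstack server create --nic port-id=<port-id> ...`.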
For an HPC architecture, there are two major storage categories to consider:
- OpenStack storage: image (Glance), ephemeral (Nova), and volume (Cinder).
- HPC cluster file-based data storage: the storage the HPC cluster itself uses to read and write its data.
Based on these two categories, here are a couple of recommendations to consider while architecting your cluster:
- Glance and Nova: For Glance and Nova (ephemeral) storage, I recommend Ceph. Besides its tight integration with OpenStack, a significant advantage of Ceph is faster instance creation: with this backend, new instance disks are created as copy-on-write clones of the Glance image instead of full copies. Another advantage for ephemeral workloads (instances not using SR-IOV) is the ability to live migrate instances between members of the compute cluster, since their disks live on shared storage.
- Cinder: For the Cinder backend in this HPC use case, I recommend Ceph (the same benefits apply) as well as NFS/iSCSI backends such as NetApp, EMC VNX, and similar systems with supported Cinder drivers.
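The Ceph integration comes down to a handful of options in each service. The sketch below generates them with Python's `configparser`, following the layout of the Ceph documentation's OpenStack integration guide. The pool names (`images`, `vms`, `volumes`), the Ceph users (`glance`, `cinder`), and the libvirt secret UUID are assumptions, and option names have shifted between releases (newer Glance versions, for example, configure stores through `enabled_backends`), so check them against your release. The `show_image_direct_url = True` setting is what lets Nova and Cinder create copy-on-write clones of Glance images.

```python
import configparser
import io

CEPH_CONF = "/etc/ceph/ceph.conf"


def ceph_backed_configs(secret_uuid: str) -> dict:
    """Build glance-api.conf, nova.conf and cinder.conf fragments for an RBD backend."""
    glance = configparser.ConfigParser()
    # Exposing the image location lets Nova/Cinder clone RBD images copy-on-write.
    glance["DEFAULT"] = {"show_image_direct_url": "True"}
    glance["glance_store"] = {
        "stores": "rbd",
        "default_store": "rbd",
        "rbd_store_pool": "images",
        "rbd_store_user": "glance",
        "rbd_store_ceph_conf": CEPH_CONF,
    }

    nova = configparser.ConfigParser()
    # Ephemeral disks live in the "vms" pool, so they sit on shared storage.
    nova["libvirt"] = {
        "images_type": "rbd",
        "images_rbd_pool": "vms",
        "images_rbd_ceph_conf": CEPH_CONF,
        "rbd_user": "cinder",
        "rbd_secret_uuid": secret_uuid,
    }

    cinder = configparser.ConfigParser()
    cinder["DEFAULT"] = {"enabled_backends": "ceph"}
    cinder["ceph"] = {
        "volume_driver": "cinder.volume.drivers.rbd.RBDDriver",
        "volume_backend_name": "ceph",
        "rbd_pool": "volumes",
        "rbd_ceph_conf": CEPH_CONF,
        "rbd_user": "cinder",
        "rbd_secret_uuid": secret_uuid,
    }
    return {"glance-api.conf": glance, "nova.conf": nova, "cinder.conf": cinder}


def render(cfg: configparser.ConfigParser) -> str:
    """Serialize a ConfigParser to INI text."""
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Placeholder UUID: use the UUID of the libvirt secret holding the Ceph key.
for filename, cfg in ceph_backed_configs("00000000-0000-0000-0000-000000000000").items():
    print(f"# {filename}\n{render(cfg)}")
```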
HPC Cluster file-based data storage:
Commonly used HPC parallel file systems, such as Lustre, GPFS, and OrangeFS, should be accessed from dedicated SR-IOV or provider networks. Ceph is another recommended backend; for better performance, it should likewise be accessed directly from the SR-IOV or provider networks.
In general, Ceph is a very flexible backend. A well-architected Ceph cluster can serve multiple types of workloads in different configurations, for example:
- Ethernet connectivity: higher-throughput NICs (10/25/40/50/100 Gbps) for frontend and backend storage traffic improve performance, and LACP bonding of two links can double the aggregate bandwidth available.
- Storage servers can combine NVMe, SSD, SAS, and SATA drives, tailored to provide the required I/O performance.
- The distributed nature of the technology provides a flexible and resilient platform.
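A quick back-of-the-envelope calculation helps when sizing these links. The sketch encodes two caveats. First, LACP balances traffic per flow by hashing packet headers, so a bond raises aggregate bandwidth while a single TCP flow is still limited to one member link (Ceph clients talk to many OSDs at once, so their traffic usually spreads across the bond). Second, in a replicated pool the primary OSD forwards each write to the other replicas over the backend network, so that network carries (size - 1) times the client write rate. The link speeds and replica count are example values, and the figures ignore protocol overhead.

```python
def bond_bandwidth_gbps(link_gbps: float, links: int) -> dict:
    """Aggregate vs. single-flow bandwidth of an LACP bond.

    LACP hashes each flow onto one member link, so only the aggregate
    scales with the number of links.
    """
    aggregate = link_gbps * links
    return {
        "aggregate_gbps": aggregate,
        "single_flow_gbps": link_gbps,
        # 8 bits per byte; protocol overhead is ignored, so real numbers are lower.
        "aggregate_gbytes_per_s": aggregate / 8,
    }


def replication_traffic_gbps(client_write_gbps: float, pool_size: int) -> float:
    """Backend traffic generated when primaries copy writes to (size - 1) replicas."""
    return client_write_gbps * (pool_size - 1)


# Example: 2 x 25 Gbps bonded frontend, 3-way replicated pool taking 10 Gbps of writes.
print(bond_bandwidth_gbps(25, 2))
print(replication_traffic_gbps(10, 3))
```

With these example numbers, 10 Gbps of client writes already puts 20 Gbps on the backend network, which is why the backend links often need at least as much capacity as the frontend.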
The next step is to automate the deployment of your HPC application on OpenStack. Several tools can be used for this: Heat, Ansible, or API calls from an orchestration system.
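As an illustration of the Heat option, the sketch below builds a HOT template as a Python dict: a head node on the provider network plus `compute_count` compute nodes, each attached through its own Neutron port with the given `vnic_type` (`direct` for SR-IOV VFs; for Ironic compute nodes the port would typically use `baremetal` instead). The parameter names, resource names, and template version are choices made for this example, not requirements. Because JSON is a subset of YAML, the `json.dumps` output should be usable as a template file.

```python
import json


def hpc_cluster_template(compute_count: int, vnic_type: str = "direct") -> dict:
    """Build a HOT template: one head node plus N compute nodes with dedicated ports."""
    resources = {
        "head_node": {
            "type": "OS::Nova::Server",
            "properties": {
                "name": "hpc-head",
                "image": {"get_param": "image"},
                "flavor": {"get_param": "head_flavor"},
                "networks": [{"network": {"get_param": "provider_network"}}],
            },
        }
    }
    for i in range(compute_count):
        port = f"compute_{i}_port"
        # A pre-created port is how the vnic_type (e.g. SR-IOV "direct") is requested.
        resources[port] = {
            "type": "OS::Neutron::Port",
            "properties": {
                "network": {"get_param": "provider_network"},
                "binding:vnic_type": vnic_type,
            },
        }
        resources[f"compute_{i}"] = {
            "type": "OS::Nova::Server",
            "properties": {
                "name": f"hpc-compute-{i}",
                "image": {"get_param": "image"},
                "flavor": {"get_param": "compute_flavor"},
                "networks": [{"port": {"get_resource": port}}],
            },
        }
    return {
        "heat_template_version": "2018-08-31",
        "parameters": {
            "image": {"type": "string"},
            "head_flavor": {"type": "string"},
            "compute_flavor": {"type": "string"},
            "provider_network": {"type": "string"},
        },
        "resources": resources,
    }


with open("hpc.json", "w") as f:
    json.dump(hpc_cluster_template(4), f, indent=2)
```

The stack could then be launched with something like `openstack stack create -t hpc.json --parameter image=<image> --parameter head_flavor=<flavor> --parameter compute_flavor=<flavor> --parameter provider_network=<network> hpc-cluster`.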
Happy HPC on OpenStack hacking!