Multi-Partition Slurm Cluster on Google Cloud Platform
High Performance Computing Cluster on Google Cloud Platform with Multiple Partitions and Preemptible and GPU support
Fluid Numerics has contributed a new feature to SchedMD’s slurm-gcp repository. SchedMD’s repo offers up deployment scripts for Google Cloud Platform to construct an elastic High Performance Computing cluster. This cluster has support for GPU accelerators and preemptible compute nodes that are created on-the-fly as HPC developers submit jobs to the Slurm job queue.
Fluid Numerics has added support for multiple compute partitions. This is useful for development teams that are experimenting with multiple GPU types or have a variety of core/memory requirements across their applications. Additionally, multiple partitions bring multi-zone support to the compute nodes.
Online Forums and Discussion for slurm-gcp
Fluid Numerics participates in the Slurm-GCP Google Discussion Group and has an open issue collector on their slurm-gcp fork on GitHub. Post your questions and issues on either platform and get started with High Performance Computing on Google Cloud Platform!
Multiple Partitions Example
From the user’s perspective, the most notable change to SchedMD’s repository is the slurm-cluster.yaml file, which specifies the cluster configuration. Rather than specifying a single compute machine type, we have expanded the slurm-gcp schema to allow for an array of partitions. A partition is defined by the name of the partition, the machine type, the number of nodes in that partition, the zone, and other optional parameters.
Below is an example slurm-cluster.yaml specification with the new partitions schema
In this example, we have the controller and login nodes available in us-central1-a. Two partitions have been specified.
The first partition is called “octo-v100” and can elastically burst out to 100 compute nodes in us-central1-a. Compute nodes in octo-v100 are preemptible, and each has 32 virtual cores and 8 Nvidia V100 GPUs. This partition alone has a theoretical peak performance of 11.2 petaflops at single precision and 89.6 petaflops at half precision on the Tensor Cores. Of course, your mileage will vary depending on your application’s arithmetic intensity and how well it leverages the GPUs.
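The peak figures quoted above can be reproduced with a short calculation. This is only a sketch: the per-GPU rates (14 TFLOPS single precision, 112 TFLOPS half precision on the Tensor Cores) are assumptions back-computed from the totals in this post, and the function name is hypothetical.

```python
# Per-GPU peak rates in TFLOPS. These are assumptions, back-computed from the
# totals quoted in the post (11.2 PFLOPS and 89.6 PFLOPS across 800 GPUs).
V100_FP32_TFLOPS = 14.0
V100_TENSOR_FP16_TFLOPS = 112.0


def partition_peak_pflops(nodes, gpus_per_node, tflops_per_gpu):
    """Return the theoretical peak of a partition in PFLOPS."""
    return nodes * gpus_per_node * tflops_per_gpu / 1000.0


# octo-v100: up to 100 nodes with 8 V100 GPUs each.
fp32_peak = partition_peak_pflops(100, 8, V100_FP32_TFLOPS)
fp16_peak = partition_peak_pflops(100, 8, V100_TENSOR_FP16_TFLOPS)

print(f"single precision: {fp32_peak:.1f} PFLOPS")
print(f"half precision (Tensor Cores): {fp16_peak:.1f} PFLOPS")
```

These are upper bounds that assume every node in the partition is running and every GPU is saturated.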
The second partition, “big-compute”, consists of 100 nodes, each specified as the n1-highcpu-96 instance. Notice that this partition resides in us-central1-b and is not preemptible, unlike the first partition.
Multi-Partition Slurm Cluster Schema
The schema for the multi-partition deployment is documented in slurm.jinja.schema. Our partition schema has the following required parameters:
Machine type to use for compute node instances (e.g. n1-standard-4).
Maximum number of instances that the cluster can grow to. Consider adding 10% to account for preemptible nodes.
Name of the compute node partition. This is the name that will appear when executing “sinfo” and can be passed to sbatch, salloc, and squeue using the “--partition” flag.
The zone the partition will reside in. For ideal performance and due to networking limitations, the zone must be in the same region as the controller and login nodes.
In addition, users can specify the following optional parameters:
Disk type (pd-ssd or pd-standard) for compute nodes.
Size of the local disks on the compute nodes. The default is 10 GB, and the maximum allowed size is 2000 GB.
Labels to add to compute instances. Labels are provided as a list of key, value pairs (e.g. ‘partition : octo-v100’). Labels can be useful for tracking utilization and performance of each partition in Stackdriver.
Minimum Intel Platform for compute nodes to use. Instances using 64 cores or more must use Intel Skylake. Valid options include Intel Sandy Bridge, Intel Ivy Bridge, Intel Haswell, Intel Broadwell, or Intel Skylake.
The type of GPU to attach to the compute nodes. Valid options are nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, and nvidia-tesla-t4. Keep in mind that GPUs are not available in all GCP zones. Check the GCP documentation for a complete list of GPU types and the zones they are available in.
The number of GPUs to attach to each node. Valid options are 0, 1, 2, 4, and 8.
A boolean flag that determines whether bursted compute nodes are preemptible instances or not. Make sure to choose a zone that has preemptible resources.
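The constraints listed above can be checked before deploying. The following is a minimal sketch, not the validation that slurm-gcp itself performs. All dictionary keys used here (name, machine_type, max_node_count, zone, disk_type, disk_size_gb, gpu_type, gpu_count, preemptible) are hypothetical stand-ins; consult slurm.jinja.schema for the actual field names.

```python
# Hypothetical field names for illustration only; see slurm.jinja.schema
# for the keys the deployment actually expects.
REQUIRED_FIELDS = ("name", "machine_type", "max_node_count", "zone")
DISK_TYPES = {"pd-ssd", "pd-standard"}
GPU_TYPES = {
    "nvidia-tesla-k80", "nvidia-tesla-p100", "nvidia-tesla-v100",
    "nvidia-tesla-p4", "nvidia-tesla-t4",
}
GPU_COUNTS = {0, 1, 2, 4, 8}
MAX_DISK_GB = 2000


def region_of(zone):
    """Strip the zone suffix: 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def validate_partition(part, controller_zone):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in part:
            errors.append(f"missing required field: {field}")
    if "zone" in part and region_of(part["zone"]) != region_of(controller_zone):
        errors.append("zone must be in the same region as the controller")
    if part.get("disk_type", "pd-standard") not in DISK_TYPES:
        errors.append("disk type must be pd-ssd or pd-standard")
    if part.get("disk_size_gb", 10) > MAX_DISK_GB:
        errors.append(f"disk size exceeds {MAX_DISK_GB} GB")
    if "gpu_type" in part and part["gpu_type"] not in GPU_TYPES:
        errors.append(f"unsupported GPU type: {part['gpu_type']}")
    if part.get("gpu_count", 0) not in GPU_COUNTS:
        errors.append("GPU count must be 0, 1, 2, 4, or 8")
    return errors


# The two partitions from the example in this post.
octo_v100 = {
    "name": "octo-v100", "machine_type": "n1-standard-32",
    "max_node_count": 100, "zone": "us-central1-a",
    "gpu_type": "nvidia-tesla-v100", "gpu_count": 8, "preemptible": True,
}
big_compute = {
    "name": "big-compute", "machine_type": "n1-highcpu-96",
    "max_node_count": 100, "zone": "us-central1-b",
}

for partition in (octo_v100, big_compute):
    problems = validate_partition(partition, controller_zone="us-central1-a")
    print(partition["name"], "OK" if not problems else problems)
```

A check like this cannot tell you whether a GPU type or preemptible capacity is actually offered in the chosen zone; that still has to be confirmed against the GCP documentation.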
The instructions here assume you have a GSuite or GMail e-mail account and have already set up a project on Google Cloud Platform. Reach out to us if you’re looking for support to get started with GCP.
Clone the Fluid Numerics fork of slurm-gcp from GitHub. Then, modify the slurm-cluster.yaml file to place your cluster in the desired zone and to provision it with your desired compute partitions.
Requesting sufficient quota for slurm-gcp
Make sure you have requested sufficient quota for your resources. For the slurm-gcp deployment, you will want to make sure you have quota set for the following items:
Compute Engine API : CPUs
Compute Engine API : GPUs
Compute Engine API : Persistent Standard/SSD Disk
The CPU quota should be set high enough to allow you to create the login node, the controller node, and the max number of compute nodes you plan to have active at any given time.
Let's calculate a CPU quota from the above example. The login and controller nodes are n1-standard-8 VMs, so we need 16 cores to launch these two instances. The octo-v100 partition has a max of 100 n1-standard-32 nodes, requiring 3,200 Preemptible CPUs. The big-compute partition has a max of 100 n1-highcpu-96 nodes, requiring 9,600 cores. This gives a total of 9,616 Standard CPUs and 3,200 Preemptible CPUs in us-central1. Note that Compute Engine quotas are set by region; you don't need quota for each zone within a region.
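The same CPU arithmetic can be written as a short script, which is handy when you change node counts or machine types. This is a sketch under one assumption: for predefined N1 machine types, the trailing number in the type name is the vCPU count. The helper name and the tuple layout are hypothetical.

```python
def vcpus(machine_type):
    """vCPU count of a predefined N1 type, e.g. 'n1-highcpu-96' -> 96."""
    return int(machine_type.rsplit("-", 1)[1])


# (role, machine type, max instance count, preemptible?)
instances = [
    ("controller", "n1-standard-8", 1, False),
    ("login", "n1-standard-8", 1, False),
    ("octo-v100", "n1-standard-32", 100, True),
    ("big-compute", "n1-highcpu-96", 100, False),
]

standard_cpus = sum(vcpus(m) * n for _, m, n, pre in instances if not pre)
preemptible_cpus = sum(vcpus(m) * n for _, m, n, pre in instances if pre)

print(f"Standard CPUs needed in us-central1:    {standard_cpus:,}")
print(f"Preemptible CPUs needed in us-central1: {preemptible_cpus:,}")
```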
In this example, we only have GPUs in the octo-v100 partition. There are 8 GPUs per node and at most 100 nodes, requiring a quota of 800 preemptible Nvidia Tesla V100 GPUs in us-central1.
Last, we need to make sure we have sufficient quota for all of the disk space. The default disk size per instance in the slurm-gcp deployment is 10 GB. Since we did not specify a disk size for the controller, login, or any of the compute partitions, each instance will have 10 GB. Additionally, since we didn't specify a disk type, the disk type defaults to Persistent Standard Disks. To support one controller, one login node, and 200 compute nodes, we will need 2,020 GB of Persistent Standard Disk space in us-central1.
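The GPU and disk figures follow from the same node counts. A minimal sketch, using the default 10 GB disk size stated above; the variable names are hypothetical.

```python
DEFAULT_DISK_GB = 10  # default disk size per instance, as stated above

# Maximum number of instances per role in the example cluster.
nodes = {"controller": 1, "login": 1, "octo-v100": 100, "big-compute": 100}

# Only the octo-v100 partition attaches GPUs (8 V100s per node).
gpus_per_node = {"octo-v100": 8}

gpu_quota = sum(nodes[name] * count for name, count in gpus_per_node.items())
disk_quota_gb = sum(nodes.values()) * DEFAULT_DISK_GB

print(f"Preemptible V100 GPUs needed:     {gpu_quota}")
print(f"Persistent Standard Disk needed:  {disk_quota_gb:,} GB")
```

If you set a larger disk size or choose pd-ssd for any partition, compute that partition's disk space separately and request it under the matching disk quota.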
To request the additional quota, follow the Google Cloud documentation on requesting quota.
Deploy the slurm-gcp cluster
Once you’re ready to launch and are sure you have sufficient quota, you can deploy the cluster using Google Cloud Deployment Manager.
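As a rough sketch, the deployment can be launched with the gcloud CLI's `deployment-manager deployments create` command, pointing `--config` at your edited slurm-cluster.yaml. The deployment name "slurm-cluster" below is an assumption chosen for illustration, and the helper function is hypothetical. The script only builds and prints the command; running it is left to you.

```python
import shlex


def deployment_command(deployment_name, config_path="slurm-cluster.yaml"):
    """Build the gcloud argument list for creating the deployment."""
    return [
        "gcloud", "deployment-manager", "deployments", "create",
        deployment_name, "--config", config_path,
    ]


# "slurm-cluster" is a hypothetical deployment name; pick your own.
cmd = deployment_command("slurm-cluster")
print(shlex.join(cmd))  # shlex.join requires Python 3.8+

# To actually launch it from Python (requires an authenticated gcloud):
#   import subprocess
#   subprocess.run(cmd, check=True)
```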
This will create the resources and execute the necessary startup scripts to set up Slurm and other required packages. Credentialing is handled through GCP’s OS Login API. If you’d like to use your own SSH client, be sure to add the appropriate public RSA key to your OS Login profile. Once you have done this, you can SSH into the login node through its external IP address.
Want to make this even easier?
Fluid Numerics is ready to help you get access to burstable compute resources for high performance computing workloads. We can offer you access to a fully managed, multi-partition Slurm cluster; simply reach out to us and we’ll take care of handling the infrastructure and credentialing for you!
If you’d like to be more hands-on, we’re open to providing hourly support to help you get your cluster off the ground for your organization.
We have spent time internally working through integrations with Elastifile, Lustre, and other high performance file systems. Additionally, we have developed automated workflows for Continuous Integration and regression testing for HPC applications. We are ready to help you launch a highly customized HPC system backed by resources on Google Cloud Platform. Let us help you leverage cloud resources and reduce your time to solution!