Testing a Containerized HPC Application
Dr. Joseph Schoonover
How to run a GPU accelerated Singularity Container in the Cloud
In the previous article in this series, CI/CD in the cloud for HPC applications, we showed how to build a Singularity image for a GPU accelerated application on GCP. Now that you have a Singularity container, a natural next question is "How do I run it?" Ultimately, you want to use your HPC container for science, to create products, or to test your application to make sure it's running as expected.
This article covers how to provision a Compute Engine resource on Google Cloud Platform that has all of the tools necessary for running a GPU accelerated Singularity container.
We start by covering what needs to be available on a cloud VM for this purpose. Then, we share with you our Bitbucket repository, which contains Deployment Manager scripts for launching a GPU accelerated VM with all of the necessary tools installed.
Create a Compute Engine Instance with Singularity and CUDA
To start, we create a Google Compute Engine (GCE) instance with the CentOS 7 operating system image and a GPU accelerator. To run a Singularity container with Nvidia GPU support, we need to install both Singularity and the Nvidia drivers on the instance.
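The Deployment Manager scripts described later automate instance creation, but by hand it would look roughly like the following. The zone, machine type, and GPU type here are illustrative choices, not the article's exact configuration:

```shell
# Illustrative values -- adjust zone, machine type, and GPU to your needs
ZONE=us-central1-a
MACHINE_TYPE=n1-standard-8

# Create a CentOS 7 instance with a single GPU attached.
# GPU instances must use --maintenance-policy=TERMINATE.
gcloud compute instances create singularity-runner \
  --zone=${ZONE} \
  --machine-type=${MACHINE_TYPE} \
  --accelerator=type=nvidia-tesla-v100,count=1 \
  --image-family=centos-7 \
  --image-project=centos-cloud \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=50GB
```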
We opted to install Singularity v3.2.1 using Go, following Sylabs' install instructions. This entails installing Go, among other dependencies, and then building Singularity from source.
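The build-from-source steps boil down to a sequence like this. The Singularity version matches the article; the Go version, mirror URLs, and dependency package names are assumptions that should be checked against Sylabs' current docs:

```shell
# Singularity version from the article; Go version is illustrative
GO_VERSION=1.12.5
SINGULARITY_VERSION=3.2.1

# Build dependencies on CentOS 7
sudo yum -y groupinstall 'Development Tools'
sudo yum -y install libseccomp-devel squashfs-tools

# Install Go, which Singularity 3.x requires to build from source
wget https://dl.google.com/go/go${GO_VERSION}.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go${GO_VERSION}.linux-amd64.tar.gz
export PATH=/usr/local/go/bin:${PATH}

# Download the Singularity source release and build it
wget https://github.com/sylabs/singularity/releases/download/v${SINGULARITY_VERSION}/singularity-${SINGULARITY_VERSION}.tar.gz
tar -xzf singularity-${SINGULARITY_VERSION}.tar.gz
cd singularity
./mconfig && make -C builddir && sudo make -C builddir install
```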
To install the Nvidia drivers, we follow the same procedure as in the open source slurm-gcp startup script. We download the cuda-repo rpm from Nvidia's developer download site and install it with rpm and yum. After the drivers are installed, nvidia-persistenced is enabled and started, and modprobe is used to ensure that all /dev/nvidia* device files exist. This last step is critical for ensuring that the device files are available when running a Singularity container. Without it, you may experience issues similar to those described in this issue on Sylabs' GitHub repository.
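The driver-install portion of the startup script follows this pattern. The exact cuda-repo rpm filename and version below are assumptions and should be taken from Nvidia's download site:

```shell
# CUDA repo rpm for RHEL/CentOS 7 (filename and version are illustrative)
CUDA_RPM=cuda-repo-rhel7-10.0.130-1.x86_64.rpm

# Install the repo package, then the CUDA toolkit and drivers
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/${CUDA_RPM}
sudo rpm -i ${CUDA_RPM}
sudo yum -y clean expire-cache
sudo yum -y install cuda

# Keep the driver initialized between jobs
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# Load the kernel modules so the /dev/nvidia* device files exist
sudo modprobe nvidia
sudo modprobe nvidia-uvm
```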
Tools to quickly deploy Singularity and CUDA on GCP
We have conveniently packaged and released (under an MIT License) a set of scripts you can use to deploy the GPU accelerated "Singularity Runner". These scripts use GCP's Deployment Manager to create the Compute Engine instance with the startup script that sets up the Nvidia drivers and Singularity.
To use these scripts, start a Cloud Shell session. Once started, clone the Bitbucket repository and change into the project directory
$ cd singularity-runner
Modify the runner.yaml file to select the region and zone where you wish to deploy, the GPU type, the number of GPUs, and the machine type. Keep in mind that there are a number of different GPU types available, and they are not available in every zone. See the GPUs on Compute Engine documentation for making your selection. Before launching the deployment, check to make sure you have sufficient quota for the CPUs and GPUs listed in runner.yaml.
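You can check which GPU types a zone offers, and your current regional quota, directly from Cloud Shell. The zone and region below are examples, not required values:

```shell
# Example zone and region -- substitute your own
ZONE=us-central1-a
REGION=us-central1

# List the accelerator (GPU) types offered in the zone
gcloud compute accelerator-types list --filter="zone:${ZONE}"

# Inspect current quotas for the region (look for the GPU entries)
gcloud compute regions describe ${REGION} --format="value(quotas)"
```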
Once you're ready, you can launch the deployment with Deployment Manager
$ gcloud deployment-manager deployments create srunner --config=runner.yaml
This will provision a Compute Engine instance with the GPUs you specified and begin running the startup script for you. After about 5 minutes, you can ssh into the instance from Cloud Shell with
$ gcloud compute ssh singularity-runner --zone=ZONE
replacing ZONE with the zone you specified in runner.yaml. You can verify that Singularity is installed with
$ singularity --version
And you can verify that the Nvidia drivers are installed with

$ nvidia-smi
Run a GPU accelerated Singularity Container
While you are on the singularity-runner instance, you can copy in your Singularity image file. In our case, we had built our Singularity image file and saved it in a GCS bucket. Once your image file is available, you can run it with
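If your image lives in a GCS bucket as ours did, gsutil can pull it onto the instance. The bucket and image names here are hypothetical:

```shell
# Hypothetical bucket and image names -- substitute your own
BUCKET=gs://my-container-images
IMAGE=my-image.sif

# Copy the image from GCS onto the instance
gsutil cp ${BUCKET}/${IMAGE} .
```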
$ singularity run --nv my-image.sif
The --nv flag is necessary for Singularity to expose the host OS's GPU drivers to the container.
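A quick sanity check is to run nvidia-smi inside the container with singularity exec; if the --nv flag did its job, the container reports the same GPUs as the host. The image name here is hypothetical:

```shell
# Hypothetical image name -- substitute your own
IMAGE=my-image.sif

# nvidia-smi should report the host's GPUs from inside the container
singularity exec --nv ${IMAGE} nvidia-smi
```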
Check out our video that demos how this works for our application.
So far, we have covered how to create and run Singularity containers for GPU accelerated HPC applications. In the next article in this series, we'll start looking at how to set up infrastructure to automatically test HPC applications for us after the container image has been built. This next step is the remaining essential ingredient in the Continuous Integration loop that will help accelerate our application development.
With these components in place, we'll be able to start discussing how to best make use of this type of system. This will include topics such as
- Optimizing Container Image Size
- Defining and establishing good application tests
- Understanding Code Coverage
- When to test: Build Trigger choices
- Executing Multi-GPU Container Images