Building an HPC app with Docker and Cloud Build

Dr. Joseph Schoonover

Last updated : March 23, 2020

In a previous article, I described how to use Cloud Build and Singularity to build container images for HPC applications. In the end, this process required

  • Creation and management of the custom Singularity build step container
  • Storing Singularity images as artifacts in Google Cloud Storage (GCS) buckets

While this setup provided a reliable build system, there's always more than one way to solve a problem.

Google Container Registry (GCR) is a Google Cloud service that hosts private Docker images within a GCP project. GCR integrates naturally with Cloud Build, removing the need for us to declare artifacts that end up stored in GCS buckets. Further, GCP provides a Docker build step, so we don't have to maintain a custom build step container image as we did with Singularity. With these tools, we can assemble a system that builds our application ("self-fluids") whenever commits or tags are pushed to our repository.

When the build is triggered, Cloud Build executes a docker build using a Dockerfile in our repository. This Dockerfile pulls a Docker image containing our application's dependencies (we have been referring to this as our "dependency image") and builds the application.

The result is a Docker image containing a complete build of our application that is posted to Google Container Registry.

The benefits of this approach include

  • "Hands free" container registry management
  • Less build infrastructure-as-code to maintain (no custom build step for Singularity required)

The potential drawbacks of this strategy include

  • Loss of the singularity %test section in the container images
  • Docker raises security concerns on shared HPC systems, where Singularity is preferred; Docker images likely need to be converted to Singularity images for production, provided the host VM has Singularity installed.

Our HPC Application : SELF-Fluids

SELF-Fluids is a Fortran application that solves the 3-D compressible Navier-Stokes equations using a Discontinuous Galerkin Spectral Element Method (learn more about SELF-Fluids). This code depends on HDF5 and METIS. To build with GPU acceleration, we need to use the PGI compilers.

Building HDF5 can take between 30 minutes and 1 hour, whereas building SELF-Fluids takes less than one minute. Because of this, we opt to build a “dependency image” that contains all of SELF-Fluids’ dependencies. Once the dependency image is built, it is used to build SELF-Fluids.

How to use Cloud Build and Docker with an HPC app

Google Cloud Build is a service on GCP that is used to build container images in a private, secure virtual machine. Our goal is to create a Docker image that has our application installed using Cloud Build and to host this image on Google Container Registry. From a high level, the process for achieving this is as follows :

  1. Install/Setup gcloud SDK
  2. Create a new GCP project for building and testing our code
  3. Create a dependency image with our software’s dependencies installed
  4. Build our application with Cloud Build
  5. Set up build triggers

Creating a dependency container

A container image that contains all of your application's prerequisites can reduce the amount of time it takes to build your application. Many HPC applications have complex dependencies that are usually not readily available in a single public container image. For these reasons, we need to create our own dependency container.

In our article on building an HPC application with Singularity and Cloud Build, I "hand-crafted" Dockerfiles that build our dependency containers. Since then, I've shifted to using Nvidia's HPC container maker (HPCCM). Essentially, the HPCCM tool allows you to specify your application's dependencies in Python in order to generate a Dockerfile. You can see an example of our PGI+OpenMPI+GPU dependency HPCCM file at the self-fluids bitbucket repository. The HPCCM-generated Dockerfile is also hosted on the self-fluids bitbucket repository.
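To give a flavor of HPCCM, a recipe for a PGI+OpenMPI dependency image might look something like the sketch below. This is not the actual self-fluids recipe; the base image, versions, and building-block options are illustrative. The recipe is processed with hpccm --recipe recipe.py --format docker to produce a Dockerfile.

```python
# recipe.py -- a sketch of an HPCCM recipe, not the actual self-fluids recipe.
# Building-block names (baseimage, pgi, openmpi, hdf5) are provided by HPCCM;
# versions and options here are illustrative assumptions.
Stage0 += baseimage(image='nvidia/cuda:10.1-devel-ubuntu18.04')
Stage0 += pgi(eula=True)                       # PGI community edition compilers
Stage0 += openmpi(version='4.0.2', cuda=True)  # CUDA-aware OpenMPI
Stage0 += hdf5(version='1.10.5')               # HDF5 built from source
```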

At this stage in my application's life, I'm interested in understanding the performance and cost differences that arise from the choice of compiler, MPI flavor, and GPU acceleration. Because of this, self-fluids is built with the GNU and PGI compilers and the OpenMPI, MPICH, and MVAPICH MPI flavors. This gives eight total builds to keep track of; six MPI builds of self-fluids plus two single-process builds :

  1. GNU
  2. PGI (+GPU)
  3. GNU+OpenMPI
  4. GNU+MVAPICH
  5. GNU+MPICH
  6. PGI+OpenMPI (+GPU)
  7. PGI+MVAPICH (+GPU)
  8. PGI+MPICH (+GPU)

This implies that there are eight dependency container images that need to be built to support building each flavor of self-fluids.
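These eight flavors follow a simple pattern, so the tag list can be generated with a short shell loop (the tag names here mirror the list above):

```shell
# Enumerate the eight self-fluids build flavors:
# one single-process build per compiler, plus compiler x MPI-flavor combinations
for compiler in gnu pgi; do
  echo "${compiler}"                     # single-process build
  for mpi in openmpi mvapich mpich; do
    echo "${compiler}-${mpi}"            # MPI build
  done
done
```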

To build the dependency images, we use Cloud Build with this cloudbuild.yaml and manually run

gcloud builds submit .

Upon successful completion of this run, dependency images are published in my project's Google Container Registry. These images can be referenced in other Dockerfiles using the path gcr.io/PROJECT_ID/ (where PROJECT_ID is replaced with my GCP project ID).

Looking ahead, self-fluids developers can decide when to begin testing on new releases of compilers and dependencies in a controlled manner by updating the dependency images. Certainly, there is some room for automation here.

Building our application with Cloud Build

With the dependency containers published, I can now use them to build self-fluids using Cloud Build and Docker. To do this, I created a Dockerfile for each flavor of our application. Each flavor starts from one of the dependency containers. For example, the PGI+OpenMPI build uses

FROM gcr.io/self-fluids/pgi-openmpi as devel 

at the start of its Dockerfile. You can check out the docker/build directory on the self-fluids repository to get an idea of how we handled multiple application builds.
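While the full Dockerfiles live in the repository, the general shape of each flavor's Dockerfile is something like the sketch below. The install prefix, configure flags, and directory layout are illustrative assumptions, not the actual self-fluids Dockerfile:

```dockerfile
# Sketch of one flavor's Dockerfile (paths and options are illustrative)
FROM gcr.io/self-fluids/pgi-openmpi as devel

# Flavor-specific autoconf flags passed in from cloudbuild.yaml
ARG BUILD_FLAGS

COPY . /build/self-fluids
WORKDIR /build/self-fluids

# Configure, build, and install the application
RUN ./configure ${BUILD_FLAGS} --prefix=/opt/self-fluids && \
    make && make install
```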

At the root of the repository, I defined a cloudbuild.yaml that can build any of the application flavors. This is made possible by Cloud Build's substitutions and a clean directory structure.

cloudbuild.yaml

steps:

- id: SELF-Fluids Build
  name: 'gcr.io/cloud-builders/docker'
  args: ['build',
         '--build-arg', 
         'BUILD_FLAGS=${_BUILD_FLAGS}',
         '--file',
         './docker/build/${_BUILD_BASE}/Dockerfile',
         '.',
         '-t',
         'gcr.io/${PROJECT_ID}/self-fluids:${_TAG}',
         ]
substitutions:
  _BUILD_BASE: pgi
  _BUILD_FLAGS: '--enable-diagnostics --enable-tecplot --enable-cuda'
  _TAG: 'pgi-gpu'

images: ['gcr.io/${PROJECT_ID}/self-fluids:${_TAG}']

In this cloudbuild.yaml file, the only build step uses the Docker builder. Note that ${PROJECT_ID} is a default substitution provided by Cloud Build. This builder uses the Dockerfile in our repository, ./docker/build/${_BUILD_BASE}/Dockerfile, to build an image named gcr.io/${PROJECT_ID}/self-fluids:${_TAG}, which is published to GCR. When executing the build, either manually or via a build trigger, the variables _BUILD_BASE, _BUILD_FLAGS, and _TAG can be defined. For example,

gcloud builds submit . --substitutions=_BUILD_BASE=gnu,_BUILD_FLAGS='--enable-tecplot',_TAG='gnu'

will use docker/build/gnu/Dockerfile to build gcr.io/${PROJECT_ID}/self-fluids:gnu. The _BUILD_FLAGS values are autoconf flags specific to my application.

Setting up Build Triggers

You can find documentation on setting up build triggers through Google Cloud's Cloud Build documentation. I'll provide a brief summary of the steps I took to configure our cloud build triggers:

  1. Mirror the bitbucket repository to Google Source Repositories
  2. Manually set up triggers for all 8 builds, providing the appropriate substitution values for each build.

While I have handled this process manually to date for this project, I've seen value in using Terraform to manage this build infrastructure. With the cloudbuild.yaml and all of the Dockerfiles already defined in my application's repository, Terraform only needs to define build triggers. By using the cloudbuild_trigger resource, I can craft a simple module to define build triggers based on pushes to git branches or git tags.
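As a sketch of what such a module's resource might look like, the snippet below defines one trigger for the GNU flavor using the google_cloudbuild_trigger resource. The trigger name, repository name, and branch are hypothetical:

```hcl
# Sketch of a Cloud Build trigger for one build flavor
# (trigger name, repo_name, and branch_name are hypothetical)
resource "google_cloudbuild_trigger" "self_fluids_gnu" {
  name     = "self-fluids-gnu"
  filename = "cloudbuild.yaml"

  trigger_template {
    repo_name   = "self-fluids"
    branch_name = "master"
  }

  substitutions = {
    _BUILD_BASE  = "gnu"
    _BUILD_FLAGS = "--enable-tecplot"
    _TAG         = "gnu"
  }
}
```

One such resource per flavor (or a single resource inside a for_each over the eight substitution sets) covers the whole build matrix.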

What’s next

Running applications

With Docker container images published to Google Container Registry, an image's presence in the registry verifies that it completed a successful build, and application users can pull the images for their own use. As an example, I'll illustrate how I'm currently using our self-fluids images.

I've created a cloud-native HPC cluster on Google Cloud Platform using the fluid-slurm-gcp marketplace solution. This solution comes with Singularity already installed. While on the fluid-slurm-gcp cluster, I can pull the docker container using (for example)

module load singularity
singularity pull docker://gcr.io/self-fluids/self-fluids:pgi-openmpi

To run the application, I use a Slurm batch script to launch the application (after installing openmpi)

self-fluids.batch
#!/bin/bash
#
#SBATCH --account=default
#
#SBATCH --partition=gpu-8xv100
#
#SBATCH --ntasks=8
#
#SBATCH --gres=gpu:8
#
#//////////////////////////////////////#

module purge
module load singularity
module load openmpi


mpirun -np 8 singularity exec --nv self-fluids.sif sfluid --param-file=/examples/thermalbubble/runtime.params \
                                                          --equation-file=/examples/thermalbubble/self.equations

Automated testing

Often, HPC developers want to ensure their application runs properly by executing test cases (aka benchmarks) prior to making the application available to users. At a minimum, automated HPC application testing typically requires infrastructure that meets the following criteria

  • High core counts ( > 100 MPI Ranks )
  • GPU and Multi-GPU systems ( 1-100 GPUs )
  • STDERR/STDOUT logging
  • Application test case/benchmark runtime

Additional properties that would be nice to have include

  • Application test case/benchmark profile/hot-spot analysis
  • Application test case/benchmark output archiving

Fluid Numerics is currently working on a system that integrates with Cloud Build and other platforms to enable automated HPC application testing that meets these criteria. Stay tuned!

HPC Container Maker Issues

Of course, nothing in HPC is really that easy. I did encounter some issues along the way, and I've documented them here in the hope that you find them helpful in your containerization journey.

Issue : On all parallel HDF5 installs (GNU and PGI), FC, CC, and CXX were not set to the MPI compilers

Solution : Edit the HPCCM generated Dockerfiles to swap compilers (FC, CC, CXX) for MPI compilers


Issue : PGI builds failed during "make" of HDF5

Solution : Additional compiler flags are needed for HDF5. Add the following in the configure step for HDF5 in the generated Dockerfile

CFLAGS="-fPIC -m64 -tp=px" CXXFLAGS="-fPIC -m64 -tp=px" FCFLAGS="-fPIC -m64 -tp=px"

Issue : Parallel builds of HDF5 with PGI compilers failed during make.

Solution : Add CPP=cpp during the configure stage to use the GNU C-preprocessor during make.
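For a parallel PGI build of HDF5, the fixes above combine into a configure invocation along the lines of the excerpt below. The install prefix and --enable-* options are hypothetical; only the compiler, preprocessor, and flag settings come from the fixes documented here:

```shell
# Illustrative HDF5 configure step combining the fixes above
# (install prefix and --enable-* options are hypothetical)
CC=mpicc CXX=mpicxx FC=mpif90 CPP=cpp \
CFLAGS="-fPIC -m64 -tp=px" CXXFLAGS="-fPIC -m64 -tp=px" FCFLAGS="-fPIC -m64 -tp=px" \
./configure --prefix=/usr/local/hdf5 --enable-parallel --enable-fortran
```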


Issue : OpenMPI and MVAPICH builds with PGI (CUDA enabled) fail to build during make of openmpi or mvapich. Errors indicate that we cannot link -lcuda

Solution : Add the following line to the Dockerfile, just before building the MPI library

ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH