Minimizing GPU Costs on Google Cloud Platform

Dr. Joseph Schoonover

With the variety of architectures available in High Performance Computing (HPC), it's easy to wonder how your software performs across platforms. On cloud platforms, answering that question really matters: choosing the platform that minimizes the cost of each run of your software can help stretch your R&D budget and save your company money on compute expenses.

At Fluid Numerics, we maintain a modest set of R&D codes, including SELF-Fluids. SELF-Fluids solves the compressible Navier-Stokes equations using the Discontinuous Galerkin Spectral Element Method. It is written in modern Fortran and can be run in serial or parallel. On-node parallelism is achieved with CUDA-Fortran (GPU accelerated) or OpenMP (multi-threaded CPU), and we scale to larger problems with MPI, MPI+CUDA, or MPI+OpenMP.

As SELF-Fluids has developed, so have our testing strategies. For this code, we care about compiler and platform portability. We regularly test across compilers (GCC, PGI, and Intel) and across CPU and GPU architectures. With Google Cloud Platform, we are able to assess the code's performance on a variety of GPUs and on moderately sized virtual HPC clusters.

In this article, I'll share our strategy for cross-platform and cross-compiler testing using a customized version of the slurm-gcp deployment that uses environment modules and multiple compute partitions.

Resources and Methods

Compute Resources

On Google Cloud Platform, SchedMD and Google have been working on deployment-manager scripts for launching a bare-metal-style HPC cluster. We have taken these scripts and added the following modifications:

  • Environment Modules
  • Additional HPC Tools
    • PGI 18.10 Compilers (for CUDA-Fortran support)
    • OpenMPI 3.1.2 (built with each compiler)
    • HDF5 (parallel and serial versions)
    • CUDA Toolkit 9.2
  • Multiple Partitions

The addition of multiple partitions was driven by the need to have multiple types of GPUs to conduct this cost comparison study.

With this feature, each partition can have its own VM and GPU types and can exist in any zone in the same region as the login and controller nodes. We also added a feature to "round-robin" through the available zones, which automates zone selection for each partition and, in particular, for the GPU-equipped nodes.
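
To illustrate the idea, here is a minimal sketch of round-robin zone selection in Python. This is not the actual deployment-manager code; the zone and partition names are placeholders.

  # Minimal sketch of round-robin zone assignment for compute partitions.
  # Illustrative only -- not the slurm-gcp deployment-manager implementation.
  from itertools import cycle

  def assign_zones(partitions, zones):
      """Assign each partition a zone by cycling through the candidate zones."""
      zone_cycle = cycle(zones)
      return {name: next(zone_cycle) for name in partitions}

  # Placeholder zones within the cluster's region and placeholder partition names.
  zones = ["us-west1-a", "us-west1-b", "us-west1-c"]
  partitions = ["k80", "p4", "p100", "v100", "octo-v100"]
  print(assign_zones(partitions, zones))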

For our study, we wanted to compare software performance and simulation costs across GPU platforms. To do this, we provisioned the following partitions in our customized Slurm deployment:

  • n1-standard-32
  • n1-highmem-8 + 1x Nvidia Tesla K80
  • n1-highmem-8 + 1x Nvidia Tesla P4
  • n1-highmem-8 + 1x Nvidia Tesla P100
  • n1-highmem-8 + 1x Nvidia Tesla V100
  • n1-highmem-64 + 8x Nvidia Tesla V100

Test Simulation

For the test simulation, we used a boundary-layer separation demo with around 1.7 million degrees of freedom. This is a rather small problem, with a memory footprint of roughly 10 MB. We run the model for 10,000 timesteps to obtain an accurate average runtime.

The testing of the code is managed using a simple Python layer called Fluid-Testing. We developed this tool to allow us to specify build and run procedures in a simple YAML file. As an example, here's a snippet for running our separation demo on the 8x Nvidia Tesla V100 partition:

  - name        : 'PGI 18.10 + OctoV100'
    directory   : 'openmpi/3.1.2/pgi/18.10/8xV100'
    modules     : 'pgi/18.10 openmpi/3.1.2 hdf5/1.10.3/openmpi/3.1.2/pgi/18.10'
    gpu_arch    : cc70
    multithread : false
    build       : |
            autoreconf --install
            make distclean
            ./configure --enable-mpi --enable-cuda --enable-tecplot --enable-diagnostics --enable-timing --prefix="${TEST_PATH}/${DIRECTORY}"
            make
            make install
    runs :
      - name               : 'separation'
        num_nodes          : 1
        num_tasks          : 8
        num_tasks_per_node : 8
        cpus_per_task      : 8
        timeout            : '2:00:00'
        partition          : 'octo-v100'
        run  : |
          mpirun -np 8 ${BIN_DIR}/sfluid --equation-file ${SRC}/examples/separation/self.equations --param-file ${SRC}/examples/separation/runtime.params

This file provides the necessary information for building different flavors of SELF-Fluids and for executing demo runs with each flavor. A configuration file is typically populated with multiple entries like the one shown above. The config file is passed to the Python application along with the git remote URL and the branch or tag name to test.

Upon execution, the Python script builds each flavor of the code and submits all of the runs for each flavor that built successfully. Once the jobs are complete, the script posts a JSON summary of the success of the build and run steps, along with the runtimes for the applications.
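
To make that workflow concrete, here is a heavily simplified sketch of what such a driver could look like. It is not the actual Fluid-Testing source; the field names simply mirror the YAML snippet above, and the sbatch invocation is an assumption about how the runs reach Slurm.

  # Simplified, hypothetical sketch of a config-driven build-and-run driver.
  # It reads a YAML file like the snippet above, builds each flavor, and
  # submits the runs to Slurm with sbatch.
  import json
  import subprocess
  import yaml

  def run_shell(script):
      """Run a multi-line shell script and report whether it succeeded."""
      result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
      return result.returncode == 0

  def main(config_file):
      with open(config_file) as f:
          flavors = yaml.safe_load(f)

      report = []
      for flavor in flavors:
          # Load the requested modules, then run the build recipe.
          build_ok = run_shell(f"module load {flavor['modules']}\n{flavor['build']}")
          runs = []
          if build_ok:
              for run in flavor.get("runs", []):
                  # Wrap each run command in an sbatch submission to its partition.
                  batch = (
                      f"sbatch --partition={run['partition']} "
                      f"--nodes={run['num_nodes']} --ntasks={run['num_tasks']} "
                      f"--time={run['timeout']} --wrap '{run['run']}'"
                  )
                  runs.append({"name": run["name"], "submitted": run_shell(batch)})
          report.append({"name": flavor["name"], "build_ok": build_ok, "runs": runs})

      print(json.dumps(report, indent=2))

In practice the driver would also poll the queue and harvest timing data before writing the JSON summary, but the sketch captures the build, submit, and report structure described above.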

Results

The chart below shows the run times (top) and the speedup relative to the n1-standard-32 run (bottom) for the simulation on each of the compute partitions.

The fully subscribed, MPI-only simulation on the n1-standard-32 partition took the longest to complete. On the single-GPU partitions, the run time improves successively from the older K80 (4070.5 s) to the newer V100 (810.0 s, roughly 5x faster than the K80). The multi-GPU run dropped the run time further, to 345.9 s, but provided only a 2.3x speedup over the single V100.

Ideally, we would see an 8x speedup when increasing the number of GPUs by a factor of 8. If that were the case, an 8-GPU simulation would cost about the same as the single-GPU simulation, and we would have the solution sooner. In practice, however, most codes exhibit less-than-ideal strong scaling, so we should expect to pay more per simulation in exchange for getting the solution sooner.
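
As a rough illustration of that trade-off, the arithmetic below uses only the measured V100 runtimes reported above and assumes, purely for the estimate, that the 8x V100 node costs about eight times as much per hour as the single-V100 node:

  # Back-of-the-envelope strong-scaling check using the measured runtimes above.
  # Assumes (for illustration only) that the 8x V100 node costs ~8x per hour.
  single_v100_runtime_s = 810.0
  octo_v100_runtime_s = 345.9

  speedup = single_v100_runtime_s / octo_v100_runtime_s  # ~2.3x
  cost_multiplier = 8.0 / speedup                         # ~3.4x cost per simulation
  print(f"speedup ~{speedup:.1f}x, cost multiplier ~{cost_multiplier:.1f}x")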

The next chart summarizes the cost estimates for running this simulation on each platform. The cost is estimated by multiplying the hourly price of the compute instance by the measured run time of each simulation.
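
As a sketch of that arithmetic, the helper below multiplies an hourly rate by the runtime in hours. The rate shown is a placeholder (actual prices vary by region and change over time); the runtime is the measured single-V100 figure from above.

  # Cost estimate: (instance price per hour) x (measured runtime in hours).
  def simulation_cost(hourly_rate_usd, runtime_seconds):
      """Estimated cost of one simulation in USD."""
      return hourly_rate_usd * runtime_seconds / 3600.0

  # Placeholder hourly rate for illustration only -- substitute the current
  # on-demand (or preemptible) price for the instance + GPU configuration you use.
  print(f"~${simulation_cost(hourly_rate_usd=4.00, runtime_seconds=810.0):.2f}")  # ~$0.90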

For this SELF-Fluids demo, the highmem-8 + V100 partition is the most cost-effective solution. Of the single-GPU partitions it is the most expensive per hour, but it delivers enough speedup over the other platforms that the overall simulation cost comes out lowest.

Note that this whole study was conducted with less than $20 of Compute Engine resources.

Summary

In this article, we demonstrated our use of a customized Slurm cluster to conduct a cost-analysis study. The procedure relied on a scripting layer that allowed us to easily build different flavors of our code (SELF-Fluids) for each platform, submit jobs to multiple partitions of the custom cluster, and summarize the build and run results.

We found that, for the particular demo executed by our code, the highmem-8 + V100 nodes were the most cost-effective, even though, per hour, V100s are the most expensive of the GPU platforms. We also illustrated that, when codes have less-than-ideal strong scaling, increasing the number of GPUs can bring the time to solution down, but will generally cost more per simulation.

I'll conclude by noting that these results may not generalize to all applications. Each software application has its own unique "fingerprint". Cross-platform benchmarking gives you a glimpse of that fingerprint at a particular point in time, and it allows you to determine how many resources you need, and of what type, to make the most cost-effective decisions for you, your organization, and your customers.

Got Questions or Feedback?

We are interested in talking to you! Feel free to reach out to us using our support request panel.