GPU Acceleration

About GPU Programming

Over the last few decades, general-purpose GPU (GPGPU) computing has moved from fad to necessity for modern compute-intensive applications. GPUs offer considerably higher theoretical peak performance than CPUs, which can reduce both the time to solution and the energy footprint of applications. Historically, to implement algorithms on GPUs, programmers had to reframe them in terms of graphics operations, which was time-consuming and error-prone. The continued interest in and success of GPGPU computing led to the introduction of new languages and specifications, such as CUDA, HIP, and OpenCL, in addition to directive-based programming models, such as OpenACC, and a zoo of GPU-accelerated libraries. Today, the GPGPU ecosystem is evolving rapidly as GPU acceleration brings value across scientific domains.

GPU Acceleration Challenges

Distinct Memory Space

Modern GPU-accelerated platforms employ one or more dedicated GPUs on each compute node. A dedicated GPU has its own memory space, distinct from the host system memory. Data and instructions are sent to the GPU across a bandwidth-limited PCI Express (PCIe) bus. In the early stages of porting, data transfers between the CPU and GPU often slow down the application overall. Developer teams must plan to minimize or hide these transfers to achieve optimal performance.

With any programming language, API, or library, GPU software developers must decide how they will handle GPU and CPU memory spaces.
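
To make the cost of crossing the bus concrete, the following minimal Python sketch estimates how host-to-device copies erode the speedup of an offloaded step. It is a back-of-the-envelope model, not a measurement: the function names (`transfer_seconds`, `offload_speedup`) are hypothetical, and the 16 GB/s bandwidth, 4 GB of data, and timings are assumed values chosen only for illustration.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate the time to move num_bytes across the host-device link."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def offload_speedup(cpu_s, gpu_kernel_s, bytes_moved, bandwidth_bytes_per_s):
    """Speedup of one offloaded step once host<->device copies are included."""
    gpu_total_s = gpu_kernel_s + transfer_seconds(bytes_moved, bandwidth_bytes_per_s)
    return cpu_s / gpu_total_s


if __name__ == "__main__":
    # All values below are illustrative assumptions, not benchmarks.
    bandwidth = 16e9        # assumed link bandwidth in bytes per second
    bytes_per_step = 4e9    # assumed data copied to and from the GPU each step
    cpu_s = 1.0             # assumed CPU time for one step
    gpu_kernel_s = 0.25     # assumed GPU kernel time for the same step

    # Case 1: data is copied across the bus on every step.
    copy_every_step = offload_speedup(cpu_s, gpu_kernel_s, bytes_per_step, bandwidth)

    # Case 2: data stays resident on the GPU, so no bytes move per step.
    data_resident = offload_speedup(cpu_s, gpu_kernel_s, 0, bandwidth)

    print(f"copy every step: {copy_every_step:.2f}x")
    print(f"data resident:   {data_resident:.2f}x")
```

Under these assumed numbers, copying the data on every step halves the benefit of the GPU (2x instead of 4x). This is why keeping data resident on the device, or overlapping copies with computation, is usually an early porting goal.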

Massive Parallelism

In a broad sense, GPUs are distinct from CPUs in two ways:

  1. GPUs have lower clock speeds than CPUs
  2. GPUs can process thousands of threads of execution per clock cycle, while CPUs process tens of threads per clock cycle

GPU architectures offer significantly higher peak performance than CPUs. However, only some applications outperform their CPU implementations after porting to GPUs. Performance gains over existing CPU code depend on the amount of parallelism and the compute intensity of an application's algorithms, and on how well the developer exposes that parallelism on the target GPU hardware.
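
One common way to reason about the dependence on parallelism is Amdahl's law, which is not specific to GPUs and is brought in here only as an illustration. The sketch below assumes that a fraction of the runtime can be accelerated by some factor while the rest stays serial; the function name `amdahl_speedup` and the example fractions and acceleration factor are hypothetical.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the runtime is accelerated.

    parallel_fraction: share of the original runtime that can be offloaded (0 to 1).
    parallel_speedup:  factor by which that share runs faster on the GPU.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Assumed: the offloaded portion runs 100x faster on the GPU.
    for fraction in (0.5, 0.9, 0.99):
        speedup = amdahl_speedup(fraction, 100.0)
        print(f"parallel fraction {fraction:.2f}: overall speedup {speedup:.2f}x")
```

With an assumed 100x acceleration of the offloaded portion, an application that is only half parallel gains just under 2x overall, while one that is 99% parallel gains about 50x. The serial remainder, not the GPU's peak performance, sets the ceiling.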

Profiling and Debugging

Traditional HPC profilers and debuggers are unable to report details of kernels running on GPUs. Profiling and debugging GPU-accelerated applications requires experience with tools such as nvprof and cuda-gdb. Score-P, Vampir, and commercial tools such as Arm Forge (formerly Allinea Forge) are a necessity when developing multi-GPU applications.

Planning to accelerate your HPC application with GPUs

When considering a transition to GPUs for your application, you should be familiar with the following characteristics of the GPGPU ecosystem:

  • Basics of GPU-accelerated platforms and the types of applications that perform well on GPUs
  • Programming language (Fortran, C/C++, and Python) and compiler support for GPU programming languages, specifications, and APIs
  • Profiler and debugger support for GPU and multi-GPU applications
  • GPU API difficulty, maintainability, code readability, and performance trade-offs
  • Community engagement and activities
  • GPU hardware diversity
  • Cloud provider support and pricing
  • On-premises infrastructure and maintenance costs and considerations

Awareness of the GPGPU ecosystem can help you make informed decisions when planning to accelerate your applications with GPUs.

GPU Acceleration Support

Professional Services

Fluid Numerics offers services ranging from training and education to hands-on development alongside your teams! Training and education are delivered through one-on-one training, team training, or larger-scale GPU hackathons.

GPU Programming Curriculum

Fluid Numerics is developing a GPU programming curriculum that we can offer at no cost. We believe in empowering teams to learn new skills in an open and inclusive environment. Simply sign up for access!