GPU Acceleration
About GPU Programming
Over the last few decades, General Purpose GPU (GPGPU) computing has moved from fad to necessity for modern compute-intensive applications. GPUs offer considerably higher theoretical peak performance than CPUs, which can potentially reduce both the time to solution and the energy footprint of applications. Historically, to implement algorithms on GPUs, programmers had to reframe their algorithms in terms of graphics operations, which was time-consuming and error-prone. The continued interest and success of GPGPU computing led to the introduction of new languages and specifications, like CUDA, HIP, and OpenCL, in addition to directive-based programming models, like OpenACC, and a zoo of GPU-accelerated libraries. Currently, the GPGPU ecosystem is rapidly evolving as GPU acceleration brings value to all scientific domains.
GPU Acceleration Challenges
Distinct Memory Space
Modern GPU-accelerated platforms employ one or more dedicated GPUs on each compute node. A dedicated GPU has its own memory space, distinct from the host system memory. Data and instructions are shared with the GPU across a bandwidth-limited PCIe bus. During the early stages of porting, data transfers between the CPU and GPU most often result in overall application slow-downs. Developer teams must plan to minimize or hide these data transfers to achieve optimal performance.
With any programming language, API, or library, GPU software developers must decide how they will handle GPU and CPU memory spaces.
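For example, with the CUDA runtime API, allocating device memory and moving data across the bus are explicit operations. The following is a minimal sketch, assuming a million-element array that must be copied to the GPU before a hypothetical scale kernel runs and copied back afterwards:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Hypothetical kernel: scale each array element in place. */
    __global__ void scale(float *x, int n, float a) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        x[i] *= a;
      }
    }

    int main(void) {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);

      float *x_host = (float *)malloc(bytes);
      for (int i = 0; i < n; i++) x_host[i] = 1.0f;

      float *x_dev;
      cudaMalloc(&x_dev, bytes);   /* separate allocation in GPU memory */

      /* Host-to-device transfer across the bus. */
      cudaMemcpy(x_dev, x_host, bytes, cudaMemcpyHostToDevice);

      scale<<<(n + 255) / 256, 256>>>(x_dev, n, 2.0f);

      /* Device-to-host transfer. Each of these copies costs time,
         so minimizing or overlapping them is central to porting. */
      cudaMemcpy(x_host, x_dev, bytes, cudaMemcpyDeviceToHost);

      cudaFree(x_dev);
      free(x_host);
      return 0;
    }

Alternatives such as CUDA unified memory can hide these copies behind a single address space, but the underlying transfers still occur and still affect performance.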
Massive Parallelism
In a broad sense, GPUs are distinct from CPUs in two ways:
GPUs have lower clock speeds than CPUs
GPUs can process thousands of threads of execution per clock cycle, while CPUs process tens of threads per clock cycle
GPU architecture offers significantly higher peak performance than CPUs. However, only some applications outperform their CPU implementations after porting to GPUs. Performance gains over existing CPU code depend on the amount of parallelism and the compute intensity of an application's algorithms, and on how well the developer exposes that parallelism on the target GPU hardware.
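To make the scale of this parallelism concrete, here is a minimal CUDA sketch of the classic saxpy operation; the launch below creates roughly a million threads, one per array element, which the GPU schedules across its compute units in blocks of 256:

    #include <cuda_runtime.h>

    /* One thread per element: y[i] = a * x[i] + y[i]. */
    __global__ void saxpy(int n, float a, const float *x, float *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x; /* global thread index */
      if (i < n) {
        y[i] = a * x[i] + y[i];
      }
    }

    /* For n = 1,048,576 this launches 4096 blocks of 256 threads,
       versus the tens of threads a CPU schedules at any instant. */
    void launch_saxpy(int n, float a, const float *x_dev, float *y_dev) {
      int threads_per_block = 256;
      int blocks = (n + threads_per_block - 1) / threads_per_block;
      saxpy<<<blocks, threads_per_block>>>(n, a, x_dev, y_dev);
    }

An operation like this is a good fit for GPUs precisely because every element can be updated independently; algorithms with less exposed parallelism see smaller gains.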
Profiling and Debugging
Traditional HPC profilers and debuggers are unable to report details of kernels running on GPUs. Profiling and debugging GPU-accelerated applications requires experience with tools such as nvprof and cuda-gdb. Score-P, Vampir, and commercial tools, like Arm Forge (formerly Allinea Forge), are a necessity when developing multi-GPU applications.
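As one concrete example of working with these tools, host code can be instrumented with NVTX ranges so that profilers such as nvprof or Nsight Systems attribute GPU activity to named phases of the application. A minimal sketch, assuming the CUDA toolkit's nvToolsExt header is available (the function timestep and the range names are hypothetical):

    #include <nvToolsExt.h>  /* NVTX annotations; link with -lnvToolsExt */

    void timestep(void) {
      /* Named ranges appear on the profiler timeline, making it
         easier to see which phase each kernel belongs to. */
      nvtxRangePushA("exchange_halos");
      /* ... host/device transfers ... */
      nvtxRangePop();

      nvtxRangePushA("update_interior");
      /* ... kernel launches ... */
      nvtxRangePop();
    }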
Planning to accelerate your HPC application with GPUs
When considering a transition to GPUs for your application, you should be aware of and familiar with these characteristics of the GPGPU ecosystem:
Basics of GPU accelerated platforms and the types of applications that perform well on GPUs
Programming language (Fortran, C/C++, and Python) and compiler support for GPU programming languages, specifications, and APIs
Profiler and debugger support for GPU and multi-GPU applications
GPU API difficulty, maintainability, code readability, and performance trade-offs
Community engagement and activities
GPU hardware diversity
Cloud provider support and pricing
On-premise infrastructure and maintenance costs and considerations
Awareness of the GPGPU ecosystem can help you make informed decisions when planning to accelerate your applications with GPUs.
GPU Acceleration Support
Professional Services
Fluid Numerics offers services ranging from training and education to hands-on development alongside your teams! Training and education are delivered through one-on-one training, team training, or larger-scale GPU hackathons.
GPU Programming Curriculum
Fluid Numerics is developing a GPU programming curriculum that we can offer at no cost. We believe in empowering teams to learn new skills in an open and inclusive environment. Simply sign up for access!