The Ultimate CUDA Programming Cheatsheet: GPU Acceleration Guide

Introduction: What is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables dramatic performance increases in computing by harnessing the power of NVIDIA GPUs. CUDA allows developers to use C, C++, and Fortran to write code that executes on the GPU, enabling acceleration for data-parallel computations in various domains including scientific computing, machine learning, image processing, and more.

Core Concepts of CUDA Programming

CUDA Architecture Fundamentals

  • Host: The CPU and its memory (host memory)
  • Device: The GPU and its memory (device memory)
  • Kernel: A function executed on the GPU in parallel by many threads
  • Thread: The basic unit of execution in CUDA
  • Thread Block: A group of threads that can cooperate via shared memory
  • Grid: A collection of thread blocks that execute the same kernel
  • Warp: A group of 32 threads that execute instructions in SIMT (Single Instruction, Multiple Thread) fashion
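
To make this hierarchy concrete, here is a minimal sketch of a toy kernel (the name whoAmI is made up for the example) launched as a grid of 2 blocks with 4 threads each:

#include <stdio.h>

__global__ void whoAmI() {
    // Each of the 8 threads reports its position in the block/grid hierarchy
    printf("block %d, thread %d, global id %d\n",
           blockIdx.x, threadIdx.x, blockIdx.x * blockDim.x + threadIdx.x);
}

// whoAmI<<<2, 4>>>();        // grid of 2 blocks, 4 threads per block (8 threads total)
// cudaDeviceSynchronize();   // wait for the kernel and flush device-side printf output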

CUDA Memory Hierarchy

Memory Type     | Scope  | Lifetime    | Speed         | Size
Register        | Thread | Kernel      | Fastest       | Limited per thread
Shared Memory   | Block  | Kernel      | Very Fast     | Limited per block (48 KB by default; more on newer GPUs via opt-in)
Local Memory    | Thread | Kernel      | Slow          | Limited per thread
Global Memory   | Grid   | Application | Slow          | Large (several GB)
Constant Memory | Grid   | Application | Fast (cached) | Small (64 KB total)
Texture Memory  | Grid   | Application | Fast (cached) | Limited by global memory

CUDA Programming Model: Step-by-Step

  1. Allocate memory on both host (CPU) and device (GPU)
  2. Transfer data from host to device
  3. Launch kernel to process data on device
  4. Transfer results back from device to host
  5. Free memory on both host and device

Basic CUDA Program Structure

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// CUDA kernel definition
__global__ void vectorAdd(float* A, float* B, float* C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    // Host memory pointers
    float *h_A, *h_B, *h_C;
    // Device memory pointers
    float *d_A, *d_B, *d_C;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    
    // Allocate host memory
    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);
    
    // Initialize host arrays
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }
    
    // Allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    
    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    
    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    
    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    
    return 0;
}

CUDA Memory Management

Host-Device Memory Operations

// Allocation
float* d_data;
cudaMalloc((void**)&d_data, size);

// Data transfer: Host to Device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

// Data transfer: Device to Host
cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);

// Free memory
cudaFree(d_data);

Unified Memory (CUDA 6.0+)

// Allocate unified memory (accessible from CPU and GPU)
float* data;
cudaMallocManaged(&data, size);

// Access directly from host or device code; no explicit cudaMemcpy calls needed.
// After a kernel writes managed memory, synchronize (e.g. cudaDeviceSynchronize())
// before reading the results on the host.

// Free unified memory
cudaFree(data);
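
A short usage sketch under the assumption of a simple element-wise kernel (scaleKernel is made up for the example); note the synchronization before the host reads results the GPU has written:

__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int n = 1 << 20;
float* data;
cudaMallocManaged(&data, n * sizeof(float));

for (int i = 0; i < n; ++i) data[i] = 1.0f;      // host writes the managed buffer directly

scaleKernel<<<(n + 255) / 256, 256>>>(data, n);  // device uses the same pointer
cudaDeviceSynchronize();                         // required before the host reads GPU results

printf("data[0] = %f\n", data[0]);
cudaFree(data);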

Pinned Memory

// Allocate page-locked (pinned) memory
float* h_data;
cudaMallocHost((void**)&h_data, size);

// Use for faster transfers to/from device

// Free pinned memory
cudaFreeHost(h_data);
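
Pinned memory is what makes host-device copies truly asynchronous. A minimal sketch of an asynchronous transfer in a user-created stream, assuming h_data was allocated with cudaMallocHost and d_data with cudaMalloc:

cudaStream_t stream;
cudaStreamCreate(&stream);

// The asynchronous copy returns immediately and can overlap with host work or
// with work in other streams; this is only reliable because h_data is page-locked
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);

// Kernels queued in the same stream run after the copy finishes, e.g.:
// myKernel<<<blocks, threads, 0, stream>>>(d_data, n);

cudaStreamSynchronize(stream);   // wait before reusing h_data or reading results
cudaStreamDestroy(stream);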

CUDA Kernel Programming

Function Type Qualifiers

  • __global__: Runs on device, called from host
  • __device__: Runs on device, called from device
  • __host__: Runs on host, called from host (default)
  • __host__ __device__: Can run on either host or device
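
A small illustration of these qualifiers (the function names are made up for the sketch):

__device__ float square(float x) { return x * x; }   // callable only from device code

__host__ __device__ float clamp01(float x) {         // compiled for both CPU and GPU
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

__global__ void applyKernel(float* data, int n) {    // launched from the host, runs on the device
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp01(square(data[i]));
}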

Variable Type Qualifiers

  • __shared__: Shared memory (accessible by all threads in a block)
  • __constant__: Constant memory (read-only, cached)
  • __device__: Global memory (accessible by all threads)
  • __managed__: Managed memory (automatically migrated between host and device)
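
A sketch putting the variable qualifiers together (names and sizes are illustrative):

__constant__ float c_coeffs[16];   // read-only, cached constant memory, filled from the host
__device__ int d_counter;          // global-memory variable visible to every thread
__managed__ float m_result;        // automatically migrated between host and device

__global__ void useQualifiers(float* out, int n) {
    __shared__ float tile[256];    // one copy per block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = c_coeffs[threadIdx.x % 16];
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }
}

// Host side: fill constant memory before the launch
// cudaMemcpyToSymbol(c_coeffs, h_coeffs, 16 * sizeof(float));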

Launching Kernels

// Basic kernel launch
myKernel<<<numBlocks, threadsPerBlock>>>(args...);

// With shared memory allocation
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes>>>(args...);

// With stream specification
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args...);
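
For 2D or 3D launches, dim3 carries the per-dimension sizes; a sketch for a width x height image (myImageKernel and the variables are illustrative):

dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
myImageKernel<<<numBlocks, threadsPerBlock>>>(d_image, width, height);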

Thread and Block Indexing

// 1D grid of 1D blocks
int idx = blockIdx.x * blockDim.x + threadIdx.x;

// 2D grid of 2D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx = idx_y * width + idx_x;

// 3D grid of 3D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx_z = blockIdx.z * blockDim.z + threadIdx.z;

Thread Organization and Execution Model

Thread Hierarchy

  • Thread: Individual execution unit
  • Warp: Group of 32 threads that execute in SIMT fashion
  • Block: Group of threads that can synchronize and share memory
  • Grid: Collection of blocks executing the same kernel

Built-in Variables

  • threadIdx: Thread index within block (x, y, z)
  • blockIdx: Block index within grid (x, y, z)
  • blockDim: Block dimensions (threads per block in x, y, z)
  • gridDim: Grid dimensions (blocks per grid in x, y, z)
  • warpSize: Number of threads in a warp (32)
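
These built-ins are often combined into a grid-stride loop, which lets a fixed-size grid process an input of any length; a minimal sketch:

__global__ void gridStrideKernel(float* data, int n) {
    int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= 2.0f;                   // each thread handles every stride-th element
    }
}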

Synchronization

// Synchronize threads within a block
__syncthreads();

// Synchronize threads within a warp (CUDA 9.0+)
__syncwarp();

// Wait for all previously issued device work (kernels and copies) to complete
cudaDeviceSynchronize();

// Wait for operations in a stream to complete
cudaStreamSynchronize(stream);

Key Techniques by Category

Shared Memory Usage

__global__ void sharedMemExample(float* input, float* output, int n) {
    // Declare shared memory (statically sized; assumes blockDim.x <= 256)
    __shared__ float sharedData[256];
    
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Load data into shared memory
    if (idx < n) {
        sharedData[threadIdx.x] = input[idx];
    }
    
    // Synchronize to ensure all data is loaded
    __syncthreads();
    
    // Process data in shared memory
    if (idx < n && threadIdx.x < blockDim.x - 1) {
        output[idx] = sharedData[threadIdx.x] + sharedData[threadIdx.x + 1];
    }
}

Atomic Operations

// Atomic add for integers
atomicAdd(&counter, 1);

// Atomic min
atomicMin(&minValue, value);

// Atomic max
atomicMax(&maxValue, value);

// Compare-and-swap: if data[idx] equals oldValue, store newValue (returns the previous value)
atomicCAS(&data[idx], oldValue, newValue);
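
A typical use case is a histogram, where many threads may increment the same bin concurrently; a minimal sketch with 256 bins for byte-valued input:

__global__ void histogram256(const unsigned char* input, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Several threads may target the same bin; atomicAdd keeps the count correct
        atomicAdd(&bins[input[i]], 1u);
    }
}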

Warp-Level Primitives (CUDA 9.0+)

// Warp vote functions
int allTrue = __all_sync(mask, predicate);
int anyTrue = __any_sync(mask, predicate);

// Warp shuffle: exchange register values between lanes without shared memory
int fromLane  = __shfl_sync(mask, myValue, srcLane);     // read myValue from lane srcLane
int fromBelow = __shfl_up_sync(mask, myValue, delta);    // read from lane (laneId - delta)
int fromAbove = __shfl_down_sync(mask, myValue, delta);  // read from lane (laneId + delta)
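
Shuffles are the building block of warp-level reductions; a minimal sum-reduction sketch, assuming all 32 lanes of the warp are active:

__device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the warp's values onto the lower half
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}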

Performance Optimization Techniques

Memory Coalescing

  • Ensure threads in a warp access contiguous memory addresses
  • Align data structures to 128-byte boundaries
  • Use appropriate data types (prefer float4, int4 for bandwidth-bound kernels)
// Good: Coalesced access pattern
__global__ void goodAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < width && y < height) {
        float value = data[y * width + x]; // Threads in a warp access consecutive elements
    }
}

// Bad: Strided access pattern
__global__ void badAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < width && y < height) {
        float value = data[x * height + y]; // Threads in a warp access elements far apart
    }
}

Occupancy Optimization

  • Balance registers per thread and shared memory per block
  • Choose appropriate block sizes (multiples of 32)
  • Use the CUDA Occupancy Calculator to determine optimal launch configurations
// Let the runtime suggest a block size that maximizes occupancy for myKernel
int blockSize;      // suggested block size
int minGridSize;    // minimum grid size needed to reach full occupancy
int gridSize;

// Arguments: min grid size, block size, kernel, dynamic shared memory bytes, block-size limit
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, numElements);
gridSize = (numElements + blockSize - 1) / blockSize;

myKernel<<<gridSize, blockSize>>>(d_data, numElements);

Loop Unrolling

// Manual loop unrolling: each thread processes four consecutive elements
// (launch with roughly numElements / 4 threads)
__global__ void unrolledKernel(float* input, float* output, int n) {
    int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    
    if (idx + 3 < n) {
        // Fast path: all four elements are in range
        output[idx]   = input[idx]   * 2.0f;
        output[idx+1] = input[idx+1] * 2.0f;
        output[idx+2] = input[idx+2] * 2.0f;
        output[idx+3] = input[idx+3] * 2.0f;
    } else {
        // Tail: handle any remaining elements one at a time
        for (int i = 0; i < 4 && idx + i < n; i++) {
            output[idx+i] = input[idx+i] * 2.0f;
        }
    }
}

// Compiler pragmas for unrolling
#pragma unroll 4
for (int i = 0; i < 16; i++) {
    result += data[i];
}

Comparison: CUDA vs. Other Parallel Programming Models

Feature           | CUDA             | OpenCL            | OpenMP            | OpenACC
Hardware Support  | NVIDIA GPUs only | GPUs, CPUs, FPGAs | CPUs, limited GPU | CPUs, GPUs
Ease of Use       | Moderate         | Complex           | Easy              | Very Easy
Performance       | Excellent        | Good              | Good              | Good
Control Level     | Very High        | High              | Moderate          | Low
Language Support  | C, C++, Fortran  | C, C++            | C, C++, Fortran   | C, C++, Fortran
Development Tools | Extensive        | Limited           | Good              | Limited
Community Support | Excellent        | Good              | Excellent         | Limited

Common Challenges and Solutions

Challenge: Kernel Launch Failure

Problem: Kernel fails to launch with error code.

Solution:

  • Check for prior errors with cudaGetLastError()
  • Verify grid and block dimensions don’t exceed device limits
  • Ensure sufficient resources (registers, shared memory)
// Error checking pattern: cudaGetLastError() reports launch errors;
// errors from kernel execution surface after the next synchronizing call
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}
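
Wrapping this pattern in a macro keeps checks terse; a minimal sketch (CUDA_CHECK is a name chosen for this example):

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_data, size));
// myKernel<<<grid, block>>>(d_data, n);
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // surfaces errors from kernel execution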

Challenge: Memory Bottlenecks

Problem: Performance limited by memory bandwidth.

Solution:

  • Use shared memory for data reuse
  • Optimize global memory access patterns
  • Reduce redundant data transfers
  • Consider using texture memory for read-only data with spatial locality
// Tiling approach with shared memory
#define TILE_SIZE 16   // block dimensions must be TILE_SIZE x TILE_SIZE

// Assumes width is a multiple of TILE_SIZE; add bounds checks for arbitrary sizes
__global__ void matrixMulTiled(float* A, float* B, float* C, int width) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;
    
    for (int t = 0; t < width/TILE_SIZE; t++) {
        // Collaborative loading of tiles
        tileA[threadIdx.y][threadIdx.x] = A[row*width + t*TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t*TILE_SIZE + threadIdx.y)*width + col];
        __syncthreads();
        
        // Compute partial sum using tile data
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();
    }
    
    C[row*width + col] = sum;
}

Challenge: Warp Divergence

Problem: Threads within a warp take different execution paths, causing serialization.

Solution:

  • Minimize conditional code within warps
  • Restructure algorithms to reduce divergence
  • Group similar work to align with warp boundaries
// Bad: Warp divergence
if (threadIdx.x % 2 == 0) {
    // Even threads do this
} else {
    // Odd threads do this
}

// Better: Minimize divergence
if ((threadIdx.x / 32) % 2 == 0) {
    // Entire warps do this
} else {
    // Other entire warps do this
}

Challenge: Bank Conflicts

Problem: Multiple threads access the same shared memory bank, causing serialization.

Solution:

  • Pad shared memory arrays to avoid conflicts
  • Redesign access patterns to minimize conflicts
// Shared memory with padding to avoid bank conflicts
__shared__ float sharedData[TILE_SIZE][TILE_SIZE + 1]; // +1 padding
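
The classic example is a shared-memory matrix transpose: threads read the tile column-wise, and without the extra column an entire warp would hit the same bank. A sketch, assuming 32 x 32 thread blocks (the tile size is illustrative):

#define TRANSPOSE_TILE 32

__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    // The +1 column of padding shifts each row into a different bank pattern
    __shared__ float tile[TRANSPOSE_TILE][TRANSPOSE_TILE + 1];

    int x = blockIdx.x * TRANSPOSE_TILE + threadIdx.x;
    int y = blockIdx.y * TRANSPOSE_TILE + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    }
    __syncthreads();

    // Column-wise reads from the tile below would serialize on bank conflicts
    // without the padding
    x = blockIdx.y * TRANSPOSE_TILE + threadIdx.x;
    y = blockIdx.x * TRANSPOSE_TILE + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}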

Best Practices and Practical Tips

Kernel Design

  • Keep kernels focused: One kernel, one task
  • Minimize divergence: Avoid conditionals that cause threads in a warp to take different paths
  • Balance workload: Ensure even distribution of work across threads
  • Asynchronous operations: Use streams for concurrent execution of kernels and data transfers

Memory Management

  • Minimize host-device transfers: Keep data on the GPU as long as possible
  • Use pinned memory for faster host-device transfers
  • Consider unified memory for simpler code but be aware of performance implications
  • Prefer coalesced access patterns for global memory

Performance Tuning

  • Profile before optimizing: Use NVIDIA Visual Profiler or Nsight to identify bottlenecks
  • Consider arithmetic intensity: Compute-bound vs. memory-bound kernels need different optimizations
  • Balance occupancy and resources: Sometimes fewer threads with more resources per thread is better
  • Minimize thread synchronization: Each sync point has overhead

Development Workflow

  • Start simple: Get a basic version working, then optimize
  • Validate results: Compare GPU results with CPU implementation
  • Incremental optimization: Make one change at a time and measure impact
  • Consider portability: Encapsulate CUDA-specific code if cross-platform support is needed

Advanced CUDA Features

Dynamic Parallelism (CUDA 5.0+)

__global__ void childKernel(float* data) {
    // Child kernel code
}

__global__ void parentKernel(float* data) {
    // Parent kernel code
    
    // Launch a child kernel from within a kernel
    // (requires compiling with relocatable device code: nvcc -rdc=true ... -lcudadevrt)
    childKernel<<<gridSize, blockSize>>>(data);
    
    // Child grids are guaranteed to finish before the parent grid completes.
    // Note: calling cudaDeviceSynchronize() inside a kernel is deprecated since
    // CUDA 11.6 and removed in CUDA 12; synchronize from the host instead.
}

Cooperative Groups (CUDA 9.0+)

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cooperativeKernel(float* data, int n) {
    // Create a thread block group
    cg::thread_block block = cg::this_thread_block();
    
    // Use the group for synchronization
    block.sync();
    
    // Create a grid group (requires device support)
    cg::grid_group grid = cg::this_grid();
    
    // Grid-wide synchronization
    grid.sync();
}

// Launch with cooperative groups (grid.sync() requires a cooperative launch)
void launchCooperativeKernel(float* d_data, int n, dim3 gridSize, dim3 blockSize,
                             size_t sharedMemBytes, cudaStream_t stream) {
    void* args[] = { &d_data, &n };   // pointers to each kernel argument
    cudaLaunchCooperativeKernel((void*)cooperativeKernel,
                                gridSize, blockSize,
                                args, sharedMemBytes, stream);
}

Tensor Cores (Volta+ GPUs)

// Matrix multiplication with Tensor Cores using CUDA libraries
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

// Opt in to Tensor Core math. Note: CUBLAS_TENSOR_OP_MATH is deprecated in cuBLAS 11+;
// on Ampere and newer, FP32 GEMMs can use TF32 Tensor Cores (CUBLAS_TF32_TENSOR_OP_MATH),
// while Volta/Turing Tensor Cores require FP16 inputs (e.g. cublasGemmEx or cublasHgemm)
cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

// Perform matrix multiplication: C = alpha*A*B + beta*C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k, &alpha,
            d_A, lda, d_B, ldb, &beta, d_C, ldc);

cublasDestroy(handle);

Resources for Further Learning

Books

  • “CUDA by Example: An Introduction to General-Purpose GPU Programming” by Jason Sanders and Edward Kandrot
  • “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu
  • “CUDA for Engineers” by Duane Storti and Mete Yurtoglu

By understanding the core concepts, programming patterns, and optimization techniques in this cheatsheet, you’ll be well-equipped to harness the power of NVIDIA GPUs for your parallel computing needs. Remember to profile your applications, identify bottlenecks, and apply targeted optimizations to achieve the best performance.
