Introduction: What is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model for accelerating general-purpose computing on NVIDIA GPUs. It lets developers write C, C++, and Fortran code that runs on the GPU, speeding up data-parallel workloads in domains such as scientific computing, machine learning, and image processing.
Core Concepts of CUDA Programming
CUDA Architecture Fundamentals
- Host: The CPU and its memory (host memory)
- Device: The GPU and its memory (device memory)
- Kernel: A function executed on the GPU in parallel by many threads
- Thread: The basic unit of execution in CUDA
- Thread Block: A group of threads that can cooperate via shared memory
- Grid: A collection of thread blocks that execute the same kernel
- Warp: A group of 32 threads that execute instructions in SIMT (Single Instruction, Multiple Thread) fashion
CUDA Memory Hierarchy
| Memory Type | Scope | Lifetime | Speed | Size |
|---|---|---|---|---|
| Register | Thread | Kernel | Fastest | Limited per thread |
| Shared Memory | Block | Kernel | Very Fast | Limited per block (48 KB by default; more on newer architectures via opt-in) |
| Local Memory | Thread | Kernel | Slow | Limited per thread |
| Global Memory | Grid | Application | Slow | Large (Several GB) |
| Constant Memory | Grid | Application | Fast (cached) | Small (64KB total) |
| Texture Memory | Grid | Application | Fast (cached) | Limited by global memory |
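A minimal sketch of where these memory spaces show up in device code (the kernel, variable names, and the 256-element tile are illustrative and assume blockDim.x <= 256):
// Constant memory: read-only on the device, filled from the host with cudaMemcpyToSymbol
__constant__ float scaleFactor;
// A variable in global memory, visible to every thread
__device__ float globalAccumulator;
__global__ void memorySpacesDemo(const float* in, float* out, int n) {
    __shared__ float tile[256];                      // shared memory: one copy per block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        tile[threadIdx.x] = in[idx];                 // global -> shared
    }
    __syncthreads();                                 // every thread in the block reaches this point
    if (idx < n) {
        float localValue = tile[threadIdx.x] * scaleFactor; // localValue normally lives in a register
        atomicAdd(&globalAccumulator, localValue);          // update a global-memory accumulator
        out[idx] = localValue;                              // result back to global memory
    }
}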
CUDA Programming Model: Step-by-Step
- Allocate memory on both host (CPU) and device (GPU)
- Transfer data from host to device
- Launch kernel to process data on device
- Transfer results back from device to host
- Free memory on both host and device
Basic CUDA Program Structure
#include <cuda_runtime.h>
#include <stdlib.h>

// CUDA kernel definition
__global__ void vectorAdd(float* A, float* B, float* C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    // Host memory pointers
    float *h_A, *h_B, *h_C;
    // Device memory pointers
    float *d_A, *d_B, *d_C;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    // Allocate host memory
    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);
    // Initialize host arrays
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }
    // Allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    // Copy result back to host (cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
CUDA Memory Management
Host-Device Memory Operations
// Allocation
float* d_data;
cudaMalloc((void**)&d_data, size);
// Data transfer: Host to Device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// Data transfer: Device to Host
cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(d_data);
Unified Memory (CUDA 6.0+)
// Allocate unified memory (accessible from CPU and GPU)
float* data;
cudaMallocManaged(&data, size);
// Access directly from host or device code
// No explicit transfers needed
// Free unified memory
cudaFree(data);
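A minimal end-to-end sketch (kernel and sizes are illustrative); note that the host must synchronize before reading values written by the GPU:
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void unifiedMemoryExample() {
    int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // initialize directly on the host
    scale<<<(n + 255) / 256, 256>>>(data, n);     // kernel uses the same pointer
    cudaDeviceSynchronize();                      // required before the host reads the results
    float first = data[0];                        // no cudaMemcpy needed
    (void)first;
    cudaFree(data);
}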
Pinned Memory
// Allocate page-locked (pinned) memory
float* h_data;
cudaMallocHost((void**)&h_data, size);
// Use for faster transfers to/from device
// Free pinned memory
cudaFreeHost(h_data);
CUDA Kernel Programming
Function Type Qualifiers
- __global__: Runs on device, called from host
- __device__: Runs on device, called from device
- __host__: Runs on host, called from host (default)
- __host__ __device__: Can run on either host or device
Variable Type Qualifiers
- __shared__: Shared memory (accessible by all threads in a block)
- __constant__: Constant memory (read-only, cached)
- __device__: Global memory (accessible by all threads)
- __managed__: Managed memory (automatically migrated between host and device)
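A small sketch showing the function qualifiers side by side (the function names are illustrative):
// Usable from both host code and device code
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper, callable from kernels and other __device__ functions
__device__ float plusOne(float x) { return x + 1.0f; }

// Kernel: runs on the device, launched from the host with <<<...>>>
__global__ void applyOps(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}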
Launching Kernels
// Basic kernel launch
myKernel<<<numBlocks, threadsPerBlock>>>(args...);
// With shared memory allocation
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes>>>(args...);
// With stream specification
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args...);
Thread and Block Indexing
// 1D grid of 1D blocks
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D grid of 2D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx = idx_y * width + idx_x;
// 3D grid of 3D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx_z = blockIdx.z * blockDim.z + threadIdx.z;
Thread Organization and Execution Model
Thread Hierarchy
- Thread: Individual execution unit
- Warp: Group of 32 threads that execute in SIMT fashion
- Block: Group of threads that can synchronize and share memory
- Grid: Collection of blocks executing the same kernel
Built-in Variables
- threadIdx: Thread index within block (x, y, z)
- blockIdx: Block index within grid (x, y, z)
- blockDim: Block dimensions (threads per block in x, y, z)
- gridDim: Grid dimensions (blocks per grid in x, y, z)
- warpSize: Number of threads in a warp (32)
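On the host side, multidimensional grids and blocks are described with dim3 values, and the built-ins above expose those dimensions inside the kernel. A small 2D sketch (the image buffer and its dimensions are illustrative):
__global__ void process2D(float* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {
        image[y * width + x] *= 0.5f;
    }
}

// Host-side launch: a 2D grid of 16x16 blocks covering the whole image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
process2D<<<grid, block>>>(d_image, width, height);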
Synchronization
// Synchronize threads within a block
__syncthreads();
// Synchronize threads within a warp (CUDA 9.0+)
__syncwarp();
// Wait for all kernels to complete
cudaDeviceSynchronize();
// Wait for operations in a stream to complete
cudaStreamSynchronize(stream);
Key Techniques by Category
Shared Memory Usage
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Declare shared memory (one array per block; assumes blockDim.x <= 256)
    __shared__ float sharedData[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Load data into shared memory
    if (idx < n) {
        sharedData[threadIdx.x] = input[idx];
    }
    // Synchronize to ensure all data is loaded
    __syncthreads();
    // Process data in shared memory (idx + 1 < n guards the neighbor read)
    if (idx + 1 < n && threadIdx.x < blockDim.x - 1) {
        output[idx] = sharedData[threadIdx.x] + sharedData[threadIdx.x + 1];
    }
}
Atomic Operations
// Atomic add for integers
atomicAdd(&counter, 1);
// Atomic min
atomicMin(&minValue, value);
// Atomic max
atomicMax(&maxValue, value);
// Compare and swap: stores newValue only if the current value equals oldValue,
// and returns the value that was read
atomicCAS(&data[idx], oldValue, newValue);
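As a concrete illustration, a histogram kernel is a typical use of atomicAdd, since many threads may increment the same bin at once (names and the 256-bin layout are illustrative):
__global__ void histogram(const unsigned char* input, unsigned int* bins, int n) {
    // bins must have 256 entries, since input values are bytes
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Concurrent increments of the same bin must be atomic to avoid lost updates
        atomicAdd(&bins[input[i]], 1u);
    }
}
For heavily contended bins, a common refinement is to build per-block histograms in shared memory and merge them into global memory at the end.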
Warp-Level Primitives (CUDA 9.0+)
// Warp vote functions
int allTrue = __all_sync(mask, predicate);
int anyTrue = __any_sync(mask, predicate);
// Warp shuffle
int fromLane   = __shfl_sync(mask, myValue, srcLane);     // read myValue from lane srcLane
int fromLower  = __shfl_up_sync(mask, myValue, delta);    // read from lane (laneId - delta)
int fromHigher = __shfl_down_sync(mask, myValue, delta);  // read from lane (laneId + delta)
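A typical use is a warp-level sum reduction built on __shfl_down_sync; this sketch assumes all 32 lanes of the warp are active:
__device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the warp's partial sums into the lower half
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;   // lane 0 ends up holding the sum for the whole warp
}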
Performance Optimization Techniques
Memory Coalescing
- Ensure threads in a warp access contiguous memory addresses
- Align data structures to 128-byte boundaries
- Use appropriate data types (prefer float4, int4 for bandwidth-bound kernels)
// Good: Coalesced access pattern
__global__ void goodAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float value = data[y * width + x]; // Threads in a warp access consecutive elements
    }
}
// Bad: Strided access pattern
__global__ void badAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float value = data[x * height + y]; // Threads in a warp access elements far apart
    }
}
Occupancy Optimization
- Balance registers per thread and shared memory per block
- Choose appropriate block sizes (multiples of 32)
- Use the CUDA Occupancy Calculator to determine optimal launch configurations
// Launch configuration with occupancy calculator
int blockSize;
int minGridSize;
int gridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, numElements); // last two args: dynamic shared memory bytes, upper limit on block size
gridSize = (numElements + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_data, numElements);
Loop Unrolling
// Manual loop unrolling: each thread processes four consecutive elements
// (launch enough threads so that 4 * total threads >= n)
__global__ void unrolledKernel(float* input, float* output, int n) {
    int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 3 < n) {
        output[idx]     = input[idx]     * 2.0f;
        output[idx + 1] = input[idx + 1] * 2.0f;
        output[idx + 2] = input[idx + 2] * 2.0f;
        output[idx + 3] = input[idx + 3] * 2.0f;
    } else {
        // Tail case: handle the remaining elements one at a time
        for (int i = 0; i < 4 && idx + i < n; i++) {
            output[idx + i] = input[idx + i] * 2.0f;
        }
    }
}
// Compiler pragma for unrolling
#pragma unroll 4
for (int i = 0; i < 16; i++) {
    result += data[i];
}
Comparison: CUDA vs. Other Parallel Programming Models
| Feature | CUDA | OpenCL | OpenMP | OpenACC |
|---|---|---|---|---|
| Hardware Support | NVIDIA GPUs only | GPUs, CPUs, FPGAs | CPUs, limited GPU | CPUs, GPUs |
| Ease of Use | Moderate | Complex | Easy | Very Easy |
| Performance | Excellent | Good | Good | Good |
| Control Level | Very High | High | Moderate | Low |
| Language Support | C, C++, Fortran | C, C++ | C, C++, Fortran | C, C++, Fortran |
| Development Tools | Extensive | Limited | Good | Limited |
| Community Support | Excellent | Good | Excellent | Limited |
Common Challenges and Solutions
Challenge: Kernel Launch Failure
Problem: Kernel fails to launch with error code.
Solution:
- Check for prior errors with cudaGetLastError()
- Verify grid and block dimensions don’t exceed device limits
- Ensure sufficient resources (registers, shared memory)
// Error checking pattern
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}
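Many codebases wrap this pattern in a macro so that every API call and kernel launch gets checked. A common sketch follows; the CUDA_CHECK name is a local convention rather than part of the CUDA API, it assumes <cstdio> and <cstdlib> are included, and the usage lines reuse names from the earlier snippets:
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage
CUDA_CHECK(cudaMalloc((void**)&d_A, size));
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, numElements);
CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs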
Challenge: Memory Bottlenecks
Problem: Performance limited by memory bandwidth.
Solution:
- Use shared memory for data reuse
- Optimize global memory access patterns
- Reduce redundant data transfers
- Consider using texture memory for read-only data with spatial locality
// Tiling approach with shared memory
// (assumes width is a multiple of TILE_SIZE)
#define TILE_SIZE 16
__global__ void matrixMulTiled(float* A, float* B, float* C, int width) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < width / TILE_SIZE; t++) {
        // Collaborative loading of tiles
        tileA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * width + col];
        __syncthreads();
        // Compute partial sum using tile data
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * width + col] = sum;
}
Challenge: Warp Divergence
Problem: Threads within a warp take different execution paths, causing serialization.
Solution:
- Minimize conditional code within warps
- Restructure algorithms to reduce divergence
- Group similar work to align with warp boundaries
// Bad: Warp divergence
if (threadIdx.x % 2 == 0) {
// Even threads do this
} else {
// Odd threads do this
}
// Better: Minimize divergence
if ((threadIdx.x / 32) % 2 == 0) {
// Entire warps do this
} else {
// Other entire warps do this
}
Challenge: Bank Conflicts
Problem: Multiple threads access the same shared memory bank, causing serialization.
Solution:
- Pad shared memory arrays to avoid conflicts
- Redesign access patterns to minimize conflicts
// Shared memory with padding to avoid bank conflicts
__shared__ float sharedData[TILE_SIZE][TILE_SIZE + 1]; // +1 padding
Best Practices and Practical Tips
Kernel Design
- Keep kernels focused: One kernel, one task
- Minimize divergence: Avoid conditionals that cause threads in a warp to take different paths
- Balance workload: Ensure even distribution of work across threads
- Asynchronous operations: Use streams for concurrent execution of kernels and data transfers (see the sketch below)
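A minimal sketch of overlapping transfers and kernels with two streams; it assumes h_pinned was allocated with cudaMallocHost (overlap requires page-locked host memory), that n is even, and that myKernel and d_data are defined elsewhere:
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int half = n / 2;
size_t halfBytes = half * sizeof(float);

// Process the data in two halves so the copy in one stream can overlap
// with the kernel running in the other
for (int s = 0; s < 2; ++s) {
    int offset = s * half;
    cudaMemcpyAsync(d_data + offset, h_pinned + offset, halfBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(half + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, half);
    cudaMemcpyAsync(h_pinned + offset, d_data + offset, halfBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();               // wait for both streams to finish
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);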
Memory Management
- Minimize host-device transfers: Keep data on the GPU as long as possible
- Use pinned memory for faster host-device transfers
- Consider unified memory for simpler code but be aware of performance implications
- Prefer coalesced access patterns for global memory
Performance Tuning
- Profile before optimizing: Use NVIDIA Visual Profiler or Nsight to identify bottlenecks
- Consider arithmetic intensity: Compute-bound vs. memory-bound kernels need different optimizations
- Balance occupancy and resources: Sometimes fewer threads with more resources per thread is better
- Minimize thread synchronization: Each sync point has overhead
Development Workflow
- Start simple: Get a basic version working, then optimize
- Validate results: Compare GPU results with CPU implementation
- Incremental optimization: Make one change at a time and measure impact
- Consider portability: Encapsulate CUDA-specific code if cross-platform support is needed
Advanced CUDA Features
Dynamic Parallelism (CUDA 5.0+)
__global__ void childKernel(float* data) {
    // Child kernel code
}
__global__ void parentKernel(float* data) {
    // Parent kernel code
    // Launch a child kernel from within a kernel
    childKernel<<<gridSize, blockSize>>>(data);
    // Wait for child kernels to complete
    // (note: device-side cudaDeviceSynchronize() is deprecated since CUDA 11.6
    //  and removed in CUDA 12)
    cudaDeviceSynchronize();
}
Cooperative Groups (CUDA 9.0+)
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cooperativeKernel(float* data, int n) {
    // Create a thread block group
    cg::thread_block block = cg::this_thread_block();
    // Use the group for synchronization
    block.sync();
    // Create a grid group (requires a cooperative launch and device support)
    cg::grid_group grid = cg::this_grid();
    // Grid-wide synchronization
    grid.sync();
}
// Launch with cooperative groups (args is a void** array of pointers to the kernel arguments)
void launchCooperativeKernel() {
    cudaLaunchCooperativeKernel((void*)cooperativeKernel,
                                gridSize, blockSize,
                                args, sharedMemBytes, stream);
}
Tensor Cores (Volta+ GPUs)
// Matrix multiplication with Tensor Cores using CUDA libraries
#include <cublas_v2.h>
cublasHandle_t handle;
cublasCreate(&handle);
// Enable Tensor Core operations
// (note: CUBLAS_TENSOR_OP_MATH is deprecated in cuBLAS 11 and later)
cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Perform matrix multiplication: C = alpha*A*B + beta*C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k, &alpha,
d_A, lda, d_B, ldb, &beta, d_C, ldc);
cublasDestroy(handle);
Resources for Further Learning
Official Documentation and Tools
- NVIDIA CUDA Documentation
- CUDA C++ Programming Guide
- CUDA C++ Best Practices Guide
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
Books
- “CUDA by Example: An Introduction to General-Purpose GPU Programming” by Jason Sanders and Edward Kandrot
- “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu
- “CUDA for Engineers” by Duane Storti and Mete Yurtoglu
Online Courses and Tutorials
- NVIDIA Deep Learning Institute (DLI)
- Udacity: Intro to Parallel Programming
- CUDA Crash Course (YouTube)
Community Resources
- NVIDIA Developer Forums
- Stack Overflow (cuda tag)
By understanding the core concepts, programming patterns, and optimization techniques in this cheatsheet, you’ll be well-equipped to harness the power of NVIDIA GPUs for your parallel computing needs. Remember to profile your applications, identify bottlenecks, and apply targeted optimizations to achieve the best performance.
