Introduction: What is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model for accelerating general-purpose computing on NVIDIA GPUs. It lets developers write C, C++, and Fortran code that runs on the GPU, speeding up data-parallel workloads in domains such as scientific computing, machine learning, and image processing.
Core Concepts of CUDA Programming
CUDA Architecture Fundamentals
- Host: The CPU and its memory (host memory)
- Device: The GPU and its memory (device memory)
- Kernel: A function executed on the GPU in parallel by many threads
- Thread: The basic unit of execution in CUDA
- Thread Block: A group of threads that can cooperate via shared memory
- Grid: A collection of thread blocks that execute the same kernel
- Warp: A group of 32 threads that execute instructions in SIMT (Single Instruction, Multiple Thread) fashion
CUDA Memory Hierarchy
| Memory Type | Scope | Lifetime | Speed | Size |
|---|---|---|---|---|
| Register | Thread | Kernel | Fastest | Limited per thread |
| Shared Memory | Block | Kernel | Very Fast | Limited per block (48 KB by default; more on newer architectures via opt-in) |
| Local Memory | Thread | Kernel | Slow | Limited per thread |
| Global Memory | Grid | Application | Slow | Large (Several GB) |
| Constant Memory | Grid | Application | Fast (cached) | Small (64KB total) |
| Texture Memory | Grid | Application | Fast (cached) | Limited by global memory |
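A minimal sketch of where these memory spaces show up in device code (the kernel, variable names, and the 256-element tile are illustrative and assume blockDim.x <= 256):
// Constant memory: read-only on the device, filled from the host with cudaMemcpyToSymbol
__constant__ float scaleFactor;
// A variable in global memory, visible to every thread
__device__ float globalAccumulator;
__global__ void memorySpacesDemo(const float* in, float* out, int n) {
    __shared__ float tile[256];                      // shared memory: one copy per block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        tile[threadIdx.x] = in[idx];                 // global -> shared
    }
    __syncthreads();                                 // every thread in the block reaches this point
    if (idx < n) {
        float localValue = tile[threadIdx.x] * scaleFactor; // localValue normally lives in a register
        atomicAdd(&globalAccumulator, localValue);          // update a global-memory accumulator
        out[idx] = localValue;                              // result back to global memory
    }
}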
CUDA Programming Model: Step-by-Step
- Allocate memory on both host (CPU) and device (GPU)
- Transfer data from host to device
- Launch kernel to process data on device
- Transfer results back from device to host
- Free memory on both host and device
Basic CUDA Program Structure
#include <cuda_runtime.h>
#include <stdlib.h>

// CUDA kernel definition
__global__ void vectorAdd(float* A, float* B, float* C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    // Host memory pointers
    float *h_A, *h_B, *h_C;
    // Device memory pointers
    float *d_A, *d_B, *d_C;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    // Allocate host memory
    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);
    // Initialize host arrays
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }
    // Allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    // Copy result back to host (cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
CUDA Memory Management
Host-Device Memory Operations
// Allocation
float* d_data;
cudaMalloc((void**)&d_data, size);
// Data transfer: Host to Device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// Data transfer: Device to Host
cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(d_data);
Unified Memory (CUDA 6.0+)
// Allocate unified memory (accessible from CPU and GPU)
float* data;
cudaMallocManaged(&data, size);
// Access directly from host or device code
// No explicit transfers needed
// Free unified memory
cudaFree(data);
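A minimal end-to-end sketch (kernel and sizes are illustrative); note that the host must synchronize before reading values written by the GPU:
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void unifiedMemoryExample() {
    int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // initialize directly on the host
    scale<<<(n + 255) / 256, 256>>>(data, n);     // kernel uses the same pointer
    cudaDeviceSynchronize();                      // required before the host reads the results
    float first = data[0];                        // no cudaMemcpy needed
    (void)first;
    cudaFree(data);
}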
Pinned Memory
// Allocate page-locked (pinned) memory
float* h_data;
cudaMallocHost((void**)&h_data, size);
// Use for faster transfers to/from device
// Free pinned memory
cudaFreeHost(h_data);
CUDA Kernel Programming
Function Type Qualifiers
- __global__: Runs on device, called from host
- __device__: Runs on device, called from device
- __host__: Runs on host, called from host (default)
- __host__ __device__: Can run on either host or device
Variable Type Qualifiers
- __shared__: Shared memory (accessible by all threads in a block)
- __constant__: Constant memory (read-only, cached)
- __device__: Global memory (accessible by all threads)
- __managed__: Managed memory (automatically migrated between host and device)
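A small sketch showing the function qualifiers side by side (the function names are illustrative):
// Usable from both host code and device code
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper, callable from kernels and other __device__ functions
__device__ float plusOne(float x) { return x + 1.0f; }

// Kernel: runs on the device, launched from the host with <<<...>>>
__global__ void applyOps(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}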
Launching Kernels
// Basic kernel launch
myKernel<<<numBlocks, threadsPerBlock>>>(args...);
// With shared memory allocation
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes>>>(args...);
// With stream specification
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args...);
Thread and Block Indexing
// 1D grid of 1D blocks
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D grid of 2D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx = idx_y * width + idx_x;
// 3D grid of 3D blocks
int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int idx_z = blockIdx.z * blockDim.z + threadIdx.z;
Thread Organization and Execution Model
Thread Hierarchy
- Thread: Individual execution unit
- Warp: Group of 32 threads that execute in SIMT fashion
- Block: Group of threads that can synchronize and share memory
- Grid: Collection of blocks executing the same kernel
Built-in Variables
- threadIdx: Thread index within block (x, y, z)
- blockIdx: Block index within grid (x, y, z)
- blockDim: Block dimensions (threads per block in x, y, z)
- gridDim: Grid dimensions (blocks per grid in x, y, z)
- warpSize: Number of threads in a warp (32)
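On the host side, multidimensional grids and blocks are described with dim3 values, and the built-ins above expose those dimensions inside the kernel. A small 2D sketch (the image buffer and its dimensions are illustrative):
__global__ void process2D(float* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {
        image[y * width + x] *= 0.5f;
    }
}

// Host-side launch: a 2D grid of 16x16 blocks covering the whole image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
process2D<<<grid, block>>>(d_image, width, height);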
Synchronization
// Synchronize threads within a block
__syncthreads();
// Synchronize threads within a warp (CUDA 9.0+)
__syncwarp();
// Wait for all kernels to complete
cudaDeviceSynchronize();
// Wait for operations in a stream to complete
cudaStreamSynchronize(stream);
Key Techniques by Category
Shared Memory Usage
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Declare shared memory (one array per block; assumes blockDim.x <= 256)
    __shared__ float sharedData[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Load data into shared memory
    if (idx < n) {
        sharedData[threadIdx.x] = input[idx];
    }
    // Synchronize to ensure all data is loaded
    __syncthreads();
    // Process data in shared memory (idx + 1 < n guards the neighbor read)
    if (idx + 1 < n && threadIdx.x < blockDim.x - 1) {
        output[idx] = sharedData[threadIdx.x] + sharedData[threadIdx.x + 1];
    }
}
Atomic Operations
// Atomic add for integers
atomicAdd(&counter, 1);
// Atomic min
atomicMin(&minValue, value);
// Atomic max
atomicMax(&maxValue, value);
// Compare and swap: stores newValue only if the current value equals oldValue,
// and returns the value that was read
atomicCAS(&data[idx], oldValue, newValue);
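As a concrete illustration, a histogram kernel is a typical use of atomicAdd, since many threads may increment the same bin at once (names and the 256-bin layout are illustrative):
__global__ void histogram(const unsigned char* input, unsigned int* bins, int n) {
    // bins must have 256 entries, since input values are bytes
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Concurrent increments of the same bin must be atomic to avoid lost updates
        atomicAdd(&bins[input[i]], 1u);
    }
}
For heavily contended bins, a common refinement is to build per-block histograms in shared memory and merge them into global memory at the end.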
Warp-Level Primitives (CUDA 9.0+)
// Warp vote functions
int allTrue = __all_sync(mask, predicate);
int anyTrue = __any_sync(mask, predicate);
// Warp shuffle
int fromLane   = __shfl_sync(mask, myValue, srcLane);     // read myValue from lane srcLane
int fromLower  = __shfl_up_sync(mask, myValue, delta);    // read from lane (laneId - delta)
int fromHigher = __shfl_down_sync(mask, myValue, delta);  // read from lane (laneId + delta)
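A typical use is a warp-level sum reduction built on __shfl_down_sync; this sketch assumes all 32 lanes of the warp are active:
__device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the warp's partial sums into the lower half
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;   // lane 0 ends up holding the sum for the whole warp
}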
Performance Optimization Techniques
Memory Coalescing
- Ensure threads in a warp access contiguous memory addresses
- Align data structures to 128-byte boundaries
- Use appropriate data types (prefer float4, int4 for bandwidth-bound kernels)
// Good: Coalesced access pattern
__global__ void goodAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float value = data[y * width + x]; // Threads in a warp access consecutive elements
    }
}
// Bad: Strided access pattern
__global__ void badAccess(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float value = data[x * height + y]; // Threads in a warp access elements far apart
    }
}
Occupancy Optimization
- Balance registers per thread and shared memory per block
- Choose appropriate block sizes (multiples of 32)
- Use the CUDA Occupancy Calculator to determine optimal launch configurations
// Launch configuration with occupancy calculator
int blockSize;
int minGridSize;
int gridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, numElements); // last two args: dynamic shared memory bytes, upper limit on block size
gridSize = (numElements + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_data, numElements);
Loop Unrolling
// Manual loop unrolling: each thread processes four consecutive elements
// (launch enough threads so that 4 * total threads >= n)
__global__ void unrolledKernel(float* input, float* output, int n) {
    int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 3 < n) {
        output[idx]     = input[idx]     * 2.0f;
        output[idx + 1] = input[idx + 1] * 2.0f;
        output[idx + 2] = input[idx + 2] * 2.0f;
        output[idx + 3] = input[idx + 3] * 2.0f;
    } else {
        // Tail case: handle the remaining elements one at a time
        for (int i = 0; i < 4 && idx + i < n; i++) {
            output[idx + i] = input[idx + i] * 2.0f;
        }
    }
}
// Compiler pragma for unrolling
#pragma unroll 4
for (int i = 0; i < 16; i++) {
    result += data[i];
}
Comparison: CUDA vs. Other Parallel Programming Models
| Feature | CUDA | OpenCL | OpenMP | OpenACC |
|---|---|---|---|---|
| Hardware Support | NVIDIA GPUs only | GPUs, CPUs, FPGAs | CPUs, limited GPU | CPUs, GPUs |
| Ease of Use | Moderate | Complex | Easy | Very Easy |
| Performance | Excellent | Good | Good | Good |
| Control Level | Very High | High | Moderate | Low |
| Language Support | C, C++, Fortran | C, C++ | C, C++, Fortran | C, C++, Fortran |
| Development Tools | Extensive | Limited | Good | Limited |
| Community Support | Excellent | Good | Excellent | Limited |
Common Challenges and Solutions
Challenge: Kernel Launch Failure
Problem: Kernel fails to launch with error code.
Solution:
- Check for prior errors with cudaGetLastError()
- Verify grid and block dimensions don’t exceed device limits
- Ensure sufficient resources (registers, shared memory)
// Error checking pattern
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}
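Many codebases wrap this pattern in a macro so that every API call and kernel launch gets checked. A common sketch follows; the CUDA_CHECK name is a local convention rather than part of the CUDA API, it assumes <cstdio> and <cstdlib> are included, and the usage lines reuse names from the earlier snippets:
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage
CUDA_CHECK(cudaMalloc((void**)&d_A, size));
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, numElements);
CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs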
Challenge: Memory Bottlenecks
Problem: Performance limited by memory bandwidth.
Solution:
- Use shared memory for data reuse
- Optimize global memory access patterns
- Reduce redundant data transfers
- Consider using texture memory for read-only data with spatial locality
// Tiling approach with shared memory
// (assumes width is a multiple of TILE_SIZE)
#define TILE_SIZE 16
__global__ void matrixMulTiled(float* A, float* B, float* C, int width) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < width / TILE_SIZE; t++) {
        // Collaborative loading of tiles
        tileA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * width + col];
        __syncthreads();
        // Compute partial sum using tile data
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * width + col] = sum;
}
Challenge: Warp Divergence
Problem: Threads within a warp take different execution paths, causing serialization.
Solution:
- Minimize conditional code within warps
- Restructure algorithms to reduce divergence
- Group similar work to align with warp boundaries
// Bad: Warp divergence
if (threadIdx.x % 2 == 0) {
// Even threads do this
} else {
// Odd threads do this
}
// Better: Minimize divergence
if ((threadIdx.x / 32) % 2 == 0) {
// Entire warps do this
} else {
// Other entire warps do this
}
Challenge: Bank Conflicts
Problem: Multiple threads access the same shared memory bank, causing serialization.
Solution:
- Pad shared memory arrays to avoid conflicts
- Redesign access patterns to minimize conflicts
// Shared memory with padding to avoid bank conflicts
__shared__ float sharedData[TILE_SIZE][TILE_SIZE + 1]; // +1 padding
Best Practices and Practical Tips
Kernel Design
- Keep kernels focused: One kernel, one task
- Minimize divergence: Avoid conditionals that cause threads in a warp to take different paths
- Balance workload: Ensure even distribution of work across threads
- Asynchronous operations: Use streams for concurrent execution of kernels and data transfers (see the sketch below)
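A minimal sketch of overlapping transfers and kernels with two streams; it assumes h_pinned was allocated with cudaMallocHost (overlap requires page-locked host memory), that n is even, and that myKernel and d_data are defined elsewhere:
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int half = n / 2;
size_t halfBytes = half * sizeof(float);

// Process the data in two halves so the copy in one stream can overlap
// with the kernel running in the other
for (int s = 0; s < 2; ++s) {
    int offset = s * half;
    cudaMemcpyAsync(d_data + offset, h_pinned + offset, halfBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(half + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, half);
    cudaMemcpyAsync(h_pinned + offset, d_data + offset, halfBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();               // wait for both streams to finish
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);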
Memory Management
- Minimize host-device transfers: Keep data on the GPU as long as possible
- Use pinned memory for faster host-device transfers
- Consider unified memory for simpler code but be aware of performance implications
- Prefer coalesced access patterns for global memory
Performance Tuning
- Profile before optimizing: Use NVIDIA Visual Profiler or Nsight to identify bottlenecks
- Consider arithmetic intensity: Compute-bound vs. memory-bound kernels need different optimizations
- Balance occupancy and resources: Sometimes fewer threads with more resources per thread is better
- Minimize thread synchronization: Each sync point has overhead
Development Workflow
- Start simple: Get a basic version working, then optimize
- Validate results: Compare GPU results with CPU implementation
- Incremental optimization: Make one change at a time and measure impact
- Consider portability: Encapsulate CUDA-specific code if cross-platform support is needed
Advanced CUDA Features
Dynamic Parallelism (CUDA 5.0+)
__global__ void childKernel(float* data) {
    // Child kernel code
}
__global__ void parentKernel(float* data) {
    // Parent kernel code
    // Launch a child kernel from within a kernel
    childKernel<<<gridSize, blockSize>>>(data);
    // Wait for child kernels to complete
    // (note: device-side cudaDeviceSynchronize() is deprecated since CUDA 11.6
    //  and removed in CUDA 12)
    cudaDeviceSynchronize();
}
Cooperative Groups (CUDA 9.0+)
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cooperativeKernel(float* data, int n) {
    // Create a thread block group
    cg::thread_block block = cg::this_thread_block();
    // Use the group for synchronization
    block.sync();
    // Create a grid group (requires a cooperative launch and device support)
    cg::grid_group grid = cg::this_grid();
    // Grid-wide synchronization
    grid.sync();
}
// Launch with cooperative groups (args is a void** array of pointers to the kernel arguments)
void launchCooperativeKernel() {
    cudaLaunchCooperativeKernel((void*)cooperativeKernel,
                                gridSize, blockSize,
                                args, sharedMemBytes, stream);
}
Tensor Cores (Volta+ GPUs)
// Matrix multiplication with Tensor Cores using CUDA libraries
#include <cublas_v2.h>
cublasHandle_t handle;
cublasCreate(&handle);
// Enable Tensor Core operations
// (note: CUBLAS_TENSOR_OP_MATH is deprecated in cuBLAS 11 and later)
cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Perform matrix multiplication: C = alpha*A*B + beta*C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k, &alpha,
d_A, lda, d_B, ldb, &beta, d_C, ldc);
cublasDestroy(handle);
Resources for Further Learning
Official Documentation and Tools
- NVIDIA CUDA Documentation
- CUDA C++ Programming Guide
- CUDA C++ Best Practices Guide
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
Books
- “CUDA by Example: An Introduction to General-Purpose GPU Programming” by Jason Sanders and Edward Kandrot
- “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu
- “CUDA for Engineers” by Duane Storti and Mete Yurtoglu
Online Courses and Tutorials
- NVIDIA Deep Learning Institute (DLI)
- Udacity: Intro to Parallel Programming
- CUDA Crash Course (YouTube)
Community Resources
- NVIDIA Developer Forums
- Stack Overflow (cuda tag)
By understanding the core concepts, programming patterns, and optimization techniques in this cheatsheet, you’ll be well-equipped to harness the power of NVIDIA GPUs for your parallel computing needs. Remember to profile your applications, identify bottlenecks, and apply targeted optimizations to achieve the best performance.
