Comprehensive Biostatistics Cheat Sheet

Introduction to Biostatistics

Biostatistics is the application of statistical methods to biological data and problems in the health sciences. It plays a crucial role in:

Designing rigorous medical studies
Analyzing health and disease patterns in populations
Evaluating the effectiveness of treatments and interventions
Identifying risk factors for diseases
Interpreting and communicating research findings
Supporting evidence-based medicine and public health decisions

Core Concepts & Principles

Types of Variables

Type	Description	Examples
Categorical	Qualitative data that can be sorted into groups	Blood type (A, B, AB, O), Disease status (yes/no)
Numerical	Quantitative data represented by numbers
– Discrete	Countable values with gaps between them	Number of heart attacks, Children per family
– Continuous	Can take any value within a range	Blood pressure, BMI, Temperature
Ordinal	Categorical data with a natural order	Disease severity (mild, moderate, severe), Pain scales (1-10)

Measures of Central Tendency

Mean: The average value (sum of values divided by count)
Median: The middle value when data is arranged in order
Mode: The most frequently occurring value
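
As a minimal sketch of the three measures above, the following uses Python's standard `statistics` module. The blood pressure readings are hypothetical values chosen for illustration, with one outlier included to show how the mean and median can diverge.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); 180 is an outlier.
readings = [118, 120, 120, 124, 128, 180]

mean_bp = statistics.mean(readings)      # sum of values divided by count
median_bp = statistics.median(readings)  # average of the two middle values (even count)
mode_bp = statistics.mode(readings)      # most frequently occurring value

print(f"mean={mean_bp:.1f}, median={median_bp}, mode={mode_bp}")
```

The outlier pulls the mean above the median, which is one reason the median is often reported for skewed data.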

Measures of Dispersion

Range: Difference between maximum and minimum values
Variance: Average of squared deviations from the mean (the sample variance divides by n - 1 rather than n)
Standard Deviation (SD): Square root of variance
Interquartile Range (IQR): Range between 25th and 75th percentiles
Coefficient of Variation: (SD / Mean) × 100%
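
The dispersion measures above can be sketched with the standard `statistics` module. The data values are hypothetical, and the quartiles use the module's default "exclusive" method; other software may use a different quartile convention and give a slightly different IQR.

```python
import statistics

values = [4.0, 8.0, 6.0, 5.0, 3.0, 10.0]  # hypothetical measurements

value_range = max(values) - min(values)
sample_var = statistics.variance(values)       # divides by n - 1
population_var = statistics.pvariance(values)  # divides by n
sample_sd = statistics.stdev(values)           # square root of sample variance

# Quartiles with the default 'exclusive' method (an assumption; conventions vary).
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

cv_percent = sample_sd / statistics.mean(values) * 100

print(value_range, sample_var, population_var, sample_sd, iqr, cv_percent)
```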

Probability Distributions

Distribution	Description	Applications
Normal	Bell-shaped curve defined by mean and SD	Heights, blood pressure, measurement errors
Binomial	Number of successes in n independent trials with a constant success probability	Disease occurrence, treatment success/failure
Poisson	Count of events occurring at a constant average rate in a fixed time or space	Disease incidence, number of mutations
Exponential	Time between independent events	Survival times, waiting times
Chi-square	Sum of squared standard normal variables	Testing independence, goodness of fit
t-distribution	Bell-shaped with heavier tails than the normal; approaches the normal as degrees of freedom increase	Small sample inference about means
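
To make two of the distributions in the table concrete, here is a small sketch of the binomial and Poisson probability mass functions written from their textbook formulas. The function names and the example parameters (10 patients, 20% response probability, a rate of 2 events) are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate of lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical example: 10 patients, each with a 20% chance of responding.
p_two_responders = binomial_pmf(2, n=10, p=0.2)

# Hypothetical example: an average of 2 cases per week; chance of a week with none.
p_no_cases = poisson_pmf(0, lam=2.0)

print(round(p_two_responders, 4), round(p_no_cases, 4))
```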

Study Design Methodology

Types of Research Studies

Study Type	Description	Strengths	Limitations
Randomized Controlled Trial (RCT)	Participants randomly assigned to treatment/control	Gold standard; reduces confounding & bias	Expensive; ethical limitations; may lack external validity
Cohort Study	Follows groups with different exposures over time	Establishes temporal sequence; can study multiple outcomes	Time-consuming; expensive; susceptible to loss to follow-up
Case-Control Study	Compares people with disease to those without	Efficient for rare diseases; requires fewer subjects	Susceptible to recall & selection bias; cannot calculate incidence
Cross-sectional Study	Data collected at one point in time	Quick, inexpensive; good for prevalence	Cannot establish causality; temporal ambiguity
Ecological Study	Compares groups, not individuals	Useful for generating hypotheses; uses existing data	Ecological fallacy; cannot link exposure to outcome at individual level

Key Study Design Elements

Randomization: Random assignment to reduce systematic differences
Blinding: Single (participant) or double (participant and researcher) to reduce bias
Controls: Comparison groups to isolate variable effects
Sample Size Calculation: Ensures adequate statistical power
Inclusion/Exclusion Criteria: Defines study population

Statistical Methods by Purpose

Descriptive Statistics

Frequency Tables: Counts and percentages
Measures of Central Tendency: Mean, median, mode
Measures of Dispersion: SD, variance, range, IQR
Data Visualization: Histograms, box plots, scatter plots, bar charts

Inferential Statistics

Point Estimation: Single value estimate of a parameter
Interval Estimation: Range of plausible values (confidence intervals)
Hypothesis Testing: Process to test claims about populations
- Null hypothesis (H₀): No effect/difference
- Alternative hypothesis (H₁): Effect/difference exists

Comparative Analyses

Test	Purpose	Data Type	Assumptions
t-test (independent)	Compare means of 2 independent groups	Continuous	Normal distribution, equal variances (Welch’s version relaxes the equal-variance assumption)
t-test (paired)	Compare means of matched pairs	Continuous	Normal distribution of differences
ANOVA	Compare means of 3+ groups	Continuous	Normal distribution, equal variances
Chi-square	Compare proportions between groups	Categorical	Expected frequencies ≥5 in each cell
Fisher’s Exact	Compare proportions (small samples)	Categorical	Independent observations; no minimum expected frequency
Mann-Whitney U	Compare 2 groups (non-parametric)	Ordinal/continuous	Does not require normality
Kruskal-Wallis	Compare 3+ groups (non-parametric)	Ordinal/continuous	Does not require normality
Wilcoxon Signed-Rank	Paired data (non-parametric)	Ordinal/continuous	Does not require normality
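
As one worked instance from the table, the sketch below runs a Pearson chi-square test on a 2×2 table using only the standard library. It assumes no continuity correction, and it relies on the fact that with 1 degree of freedom the upper-tail probability equals `erfc(sqrt(x / 2))`. The counts are hypothetical, and the function name is illustrative.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Expected count in each cell = row total * column total / grand total.
    expected = [[r * col / n for col in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected

# Hypothetical counts: rows = treatment/control, columns = improved/not improved.
observed = [[30, 70], [15, 85]]
statistic, p_value, expected = chi_square_2x2(observed)

# Check the table's assumption that every expected frequency is at least 5.
assumption_ok = min(min(row) for row in expected) >= 5
print(round(statistic, 3), round(p_value, 4), assumption_ok)
```

If `assumption_ok` were false, Fisher's exact test would be the usual alternative according to the table above.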

Correlation and Regression Analyses

Analysis	Purpose	Output	Assumptions
Pearson Correlation	Linear relationship between 2 continuous variables	r (-1 to +1)	Linear relationship; approximate bivariate normality for inference
Spearman Correlation	Monotonic relationship between 2 variables	rs (-1 to +1)	Monotonic relationship
Simple Linear Regression	Predict continuous outcome from one predictor	β coefficients, R²	Linearity, normally distributed residuals, homoscedasticity, independence
Multiple Linear Regression	Predict continuous outcome from multiple predictors	β coefficients, R²	Linearity, normally distributed residuals, homoscedasticity, independence, no multicollinearity
Logistic Regression	Predict binary outcome	Odds ratios, log odds	Binary outcome, independence, linearity in the log odds, no multicollinearity
Cox Proportional Hazards	Analyze time-to-event data with censoring	Hazard ratios	Proportional hazards, independent censoring
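
The first and third rows of the table can be illustrated together: the sketch below computes Pearson's r and the least-squares slope and intercept from the usual sums of squares. The age and blood pressure values are hypothetical, and the function name is an illustrative choice rather than a library API.

```python
import math

def pearson_and_ols(x, y):
    """Return (r, slope, intercept, r_squared) for simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation
    slope = sxy / sxx                      # change in y per unit change in x
    intercept = mean_y - slope * mean_x    # fitted line passes through the means
    return r, slope, intercept, r**2

# Hypothetical data: age (years) and systolic blood pressure (mmHg).
age = [30, 40, 50, 60, 70]
sbp = [115, 121, 128, 134, 142]

r, slope, intercept, r_squared = pearson_and_ols(age, sbp)
print(round(r, 4), round(slope, 2), round(intercept, 1), round(r_squared, 4))
```

In simple linear regression, R² equals the square of Pearson's r, which is why a single function can return both.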

Advanced Methods

Survival Analysis: Analyzes time until event occurs
- Kaplan-Meier curves
- Log-rank test
- Cox proportional hazards models
Meta-Analysis: Statistically combines results of multiple studies
Multivariate Analysis: Analyzes multiple dependent variables simultaneously
Cluster Analysis: Groups similar observations together
Principal Component Analysis: Reduces data dimensionality
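
For the survival analysis entry above, a minimal sketch of the Kaplan-Meier product-limit estimator is shown here. It assumes the common convention that subjects censored at a given time are still counted as at risk for events at that same time. The follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored subjects both leave the risk set
    return curve

# Hypothetical follow-up times (months) and event indicators (1 = event, 0 = censored).
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 0]

km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(t, round(s, 3))
```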

Statistical Power and Sample Size

Key Determinants of Statistical Power

Sample Size: Larger samples provide more power
Effect Size: Larger effects are easier to detect
Variability: Less variability gives more power
Significance Level (α): Probability of a Type I error; usually set at 0.05
Type I Error: False positive (rejecting a true null hypothesis)
Type II Error: False negative (failing to reject a false null hypothesis); its probability is β
Power = 1 – β: Probability of detecting a true effect (usually aim for 0.8 or 80%)
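
How these determinants combine can be sketched with a normal-approximation power calculation for comparing two means. This assumes a two-sided z-test, equal group sizes, and a known common SD; the effect size, SD, and sample sizes below are hypothetical.

```python
import math
from statistics import NormalDist

def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    std_normal = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)        # standard error of the mean difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for significance level alpha
    # Ignores the negligible chance of rejecting in the wrong direction.
    return std_normal.cdf(delta / se - z_crit)

# Hypothetical scenario: true difference of 5 units, SD of 10.
power_small = power_two_sample(delta=5, sd=10, n_per_group=20)
power_large = power_two_sample(delta=5, sd=10, n_per_group=63)

print(round(power_small, 3), round(power_large, 3))
```

Holding the effect size and variability fixed, the larger sample reaches roughly the conventional 80% target while the smaller one falls well short.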

Sample Size Calculation Components

Expected effect size
Desired power level (typically 80% or 90%)
Significance level (typically α = 0.05)
Variability estimate
Study design factors (one vs. two-sided tests, paired vs. independent samples)
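
The components above plug into a standard normal-approximation formula for two independent groups compared on a continuous outcome: n per group = 2 × ((z for 1 - α/2 + z for power) × SD / difference)². The sketch below assumes a two-sided test, equal group sizes, and a known common SD; the inputs are hypothetical, and a t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided comparison of two means (z approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical inputs: detect a 5-unit difference when the SD is 10.
n_80 = n_per_group(delta=5, sd=10, power=0.80)
n_90 = n_per_group(delta=5, sd=10, power=0.90)

print(n_80, n_90)
```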

P-values, Confidence Intervals, and Significance

P-value Interpretation

Definition: Probability of obtaining results at least as extreme as observed, if null hypothesis is true
Interpretation:
- p < 0.05: Statistically significant (by convention)
- p ≥ 0.05: Not statistically significant (this is not evidence that no effect exists)
Caution: Statistical significance ≠ clinical significance
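
The caution above can be illustrated with a sketch in which the same small difference yields very different p-values depending only on sample size. It assumes a two-sided z-test with a known common SD and equal group sizes; the 0.5 mmHg difference and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for an observed mean difference (z-test, known common SD)."""
    se = sd * math.sqrt(2 / n_per_group)
    z = diff / se
    # Probability of a result at least this extreme in either direction under H0.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same hypothetical 0.5 mmHg difference in blood pressure, SD of 10.
p_small_study = two_sided_p(diff=0.5, sd=10, n_per_group=200)
p_large_study = two_sided_p(diff=0.5, sd=10, n_per_group=20000)

print(round(p_small_study, 3), p_large_study)
```

A 0.5 mmHg difference is unlikely to matter clinically, yet it becomes highly statistically significant once the study is large enough.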

Confidence Intervals (CI)

Definition: Range of values likely to contain the true population parameter
Interpretation:
- 95% CI: If the study were repeated many times, about 95% of intervals constructed this way would contain the true parameter
- Narrow CI indicates precise estimate
- Wide CI indicates less precision
Advantage: Provides both magnitude and precision of effect
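
As a sketch of how sample size drives CI width, the following computes a Wald (normal-approximation) confidence interval for a proportion. The Wald interval is chosen here for simplicity and is an assumption; it can perform poorly for small samples or proportions near 0 or 1. The counts are hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci(events, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = events / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Same hypothetical observed proportion (20%) at two sample sizes.
low_small, high_small = wald_ci(events=40, n=200)
low_large, high_large = wald_ci(events=400, n=2000)

print(round(low_small, 3), round(high_small, 3))
print(round(low_large, 3), round(high_large, 3))
```

Both intervals are centered on 0.20, but the larger study gives a narrower interval and therefore a more precise estimate.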

Common Challenges and Solutions

Selection Bias

Problem: Study participants not representative of target population
Solutions: Random sampling, clear inclusion/exclusion criteria, reporting participation rates

Confounding

Problem: Extraneous variable associated with both exposure and outcome
Solutions: Randomization, matching, stratification, multivariable analysis, restriction
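
Stratification, one of the solutions listed, can be sketched with the Mantel-Haenszel pooled odds ratio. The two strata below are hypothetical counts constructed so that the exposure has no association with the outcome within either stratum, while the crude (unstratified) odds ratio suggests a strong one. Each stratum is given as (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: a, b = exposed cases/non-cases; c, d = unexposed."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata of (a, b, c, d) counts."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical strata of a confounder (for example, younger and older participants).
strata = [(10, 90, 1, 9), (5, 5, 50, 50)]

# Crude odds ratio: collapse the strata by summing each cell.
crude_or = odds_ratio(*(sum(cell) for cell in zip(*strata)))
adjusted_or = mantel_haenszel_or(strata)

print(round(crude_or, 3), round(adjusted_or, 3))
```

The crude estimate is far from 1 only because the confounder is associated with both exposure and outcome; the stratified estimate removes that distortion.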

Missing Data

Problem: Incomplete datasets leading to bias or reduced power
Solutions:
- Complete case analysis (unbiased only if data are missing completely at random)
- Imputation methods (multiple imputation is generally preferred; mean/median substitution understates variability)
- Sensitivity analyses

Multiple Comparisons

Problem: Increased risk of Type I errors when performing many tests
Solutions: Bonferroni correction, False Discovery Rate, pre-specified primary endpoints
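
Two of the corrections named above can be sketched in a few lines: Bonferroni compares each p-value with α/m, while the Benjamini-Hochberg procedure (one common way to control the False Discovery Rate) rejects the k smallest p-values, where k is the largest rank with p(k) ≤ (k/m)α. The p-values are hypothetical and the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from five tests.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]

bonf_rejected = bonferroni(p_values)
bh_rejected = benjamini_hochberg(p_values)

print(bonf_rejected)
print(bh_rejected)
```

Bonferroni is the more conservative of the two: here it rejects one hypothesis, while Benjamini-Hochberg rejects four.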

Low Statistical Power

Problem: Inability to detect true effects
Solutions: Increase sample size, reduce measurement variability, use more efficient designs

Best Practices and Tips

Study Design

Clearly define research question and hypothesis before starting
Choose appropriate study design for your research question
Conduct proper sample size calculations before beginning
Pre-register study protocols and analysis plans
Use validated measurement tools when possible

Data Analysis

Examine data distribution before choosing statistical tests
Check assumptions of statistical tests
Present effect sizes along with p-values
Report confidence intervals
Conduct sensitivity analyses for important findings
Consider clinical significance, not just statistical significance

Reporting

Follow relevant reporting guidelines (CONSORT, STROBE, PRISMA)
Report all outcomes, not just significant ones
Be transparent about analytical methods and decisions
Avoid overinterpreting results (especially with observational data)
Include appropriate visualizations of data
Present absolute risk differences, not just relative risks
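
The last point can be made concrete with a sketch that computes relative risk, absolute risk reduction, and number needed to treat for two hypothetical trials. Both trials have the same relative risk but very different absolute benefit. The function name and counts are illustrative assumptions.

```python
def risk_summary(events_treated, n_treated, events_control, n_control):
    """Return relative risk, absolute risk reduction, and number needed to treat."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    relative_risk = risk_treated / risk_control
    absolute_risk_reduction = risk_control - risk_treated
    # NNT is the reciprocal of the ARR; in reports it is conventionally rounded up.
    number_needed_to_treat = 1 / absolute_risk_reduction
    return relative_risk, absolute_risk_reduction, number_needed_to_treat

# Hypothetical trial A: rare outcome (1% vs 2%).
rr_a, arr_a, nnt_a = risk_summary(10, 1000, 20, 1000)

# Hypothetical trial B: common outcome (10% vs 20%).
rr_b, arr_b, nnt_b = risk_summary(100, 1000, 200, 1000)

print(rr_a, round(arr_a, 3), round(nnt_a))
print(rr_b, round(arr_b, 3), round(nnt_b))
```

Reporting only "risk halved" would hide the fact that about 100 patients must be treated to prevent one event in trial A, compared with about 10 in trial B.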

Software Tools for Biostatistics

Statistical Packages

R: Free, powerful, versatile; steep learning curve
SPSS: User-friendly interface; limited advanced capabilities
SAS: Industry standard in healthcare; expensive
Stata: Popular in epidemiology; clean syntax
GraphPad Prism: User-friendly; focused on biological research

Key Functions to Know

Data import/cleaning
Descriptive statistics
Basic visualizations
Common statistical tests
Regression analyses
Power calculations

Resources for Further Learning

Books

Fundamentals of Biostatistics by Bernard Rosner
Essential Medical Statistics by Betty Kirkwood and Jonathan Sterne
Statistical Methods in Medical Research by P. Armitage, G. Berry, J.N.S. Matthews
Practical Statistics for Medical Research by Douglas G. Altman

Online Courses and Resources

Coursera: “Statistics for Life Sciences” specialization
EdX: “Statistics and R” by Harvard
StatLearning.com: Free course materials
UCLA Statistical Computing Resources
BMJ Statistics at Square One series

Key Journals

Statistics in Medicine
Biostatistics
Statistical Methods in Medical Research
Journal of the Royal Statistical Society

Professional Organizations

American Statistical Association (ASA)
International Biometric Society
Royal Statistical Society
Society for Clinical Trials