Introduction to Biostatistics
Biostatistics is the application of statistical methods to biological data and problems in the health sciences. It plays a crucial role in:
- Designing rigorous medical studies
- Analyzing health and disease patterns in populations
- Evaluating the effectiveness of treatments and interventions
- Identifying risk factors for diseases
- Interpreting and communicating research findings
- Supporting evidence-based medicine and public health decisions
Core Concepts & Principles
Types of Variables
Type | Description | Examples |
---|---|---|
Categorical | Qualitative data that can be sorted into groups | Blood type (A, B, AB, O), Disease status (yes/no) |
Numerical | Quantitative data represented by numbers | |
– Discrete | Countable values with gaps between them | Number of heart attacks, Children per family |
– Continuous | Can take any value within a range | Blood pressure, BMI, Temperature |
Ordinal | Categorical data with a natural order | Disease severity (mild, moderate, severe), Pain scales (1-10) |
Measures of Central Tendency
- Mean: The average value (sum of values divided by count)
- Median: The middle value when the data are sorted (the average of the two middle values when the count is even)
- Mode: The most frequently occurring value
Measures of Dispersion
- Range: Difference between maximum and minimum values
- Variance: Average of squared deviations from the mean (the sample variance divides by n − 1 rather than n)
- Standard Deviation (SD): Square root of variance
- Interquartile Range (IQR): Difference between the 75th and 25th percentiles
- Coefficient of Variation: (SD / Mean) × 100%
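The measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. The sketch below uses made-up systolic blood pressure readings; the data and variable names are illustrative assumptions, not values from any study. Note that `statistics.variance` and `statistics.stdev` use the n − 1 denominator, and that quartiles depend on the interpolation method chosen.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg), for illustration only
sbp = [118, 122, 125, 130, 130, 135, 142, 150]

# Central tendency
mean = statistics.mean(sbp)
median = statistics.median(sbp)
mode = statistics.mode(sbp)

# Dispersion
value_range = max(sbp) - min(sbp)
sample_var = statistics.variance(sbp)   # sample variance, divides by n - 1
sample_sd = statistics.stdev(sbp)       # square root of the sample variance
q1, q2, q3 = statistics.quantiles(sbp, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1
cv_percent = sample_sd / mean * 100     # coefficient of variation in percent

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={sample_var}, SD={sample_sd:.2f}")
print(f"IQR={iqr}, CV={cv_percent:.1f}%")
```

The coefficient of variation is only meaningful for ratio-scale data with a positive mean, which is why it suits a measurement like blood pressure but not, say, temperature in Celsius.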
Probability Distributions
Distribution | Description | Applications |
---|---|---|
Normal | Bell-shaped curve defined by mean and SD | Heights, blood pressure, measurement errors |
Binomial | Probability of x successes in n independent trials, each with the same success probability | Disease occurrence, treatment success/failure |
Poisson | Count of rare, independent events in a fixed interval of time or space | Disease incidence, number of mutations |
Exponential | Time between independent events that occur at a constant rate | Survival times, waiting times |
Chi-square | Sum of squared independent standard normal variables | Testing independence, goodness of fit |
t-distribution | Bell-shaped with heavier tails than the normal; approaches the normal as degrees of freedom increase | Small sample inference about means |
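To make a few rows of the table above concrete, here is a small sketch that evaluates binomial, Poisson, and normal probabilities using only the standard library. The helper names (`binomial_pmf`, `poisson_pmf`) and all numeric inputs are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events) when the expected count per interval is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical: probability that exactly 2 of 10 patients respond if each
# responds independently with probability 0.3
p_binom = binomial_pmf(2, 10, 0.3)

# Hypothetical: probability of zero cases in a week when 2 are expected
p_pois = poisson_pmf(0, 2.0)

# Hypothetical: share of a Normal(mean=120, SD=15) population below 140
p_norm = NormalDist(mu=120, sigma=15).cdf(140)

print(f"binomial={p_binom:.4f}, poisson={p_pois:.4f}, normal={p_norm:.4f}")
```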
Study Design Methodology
Types of Research Studies
Study Type | Description | Strengths | Limitations |
---|---|---|---|
Randomized Controlled Trial (RCT) | Participants randomly assigned to treatment/control | Gold standard; reduces confounding & bias | Expensive; ethical limitations; may lack external validity |
Cohort Study | Follows groups with different exposures over time | Establishes temporal sequence; can study multiple outcomes | Time-consuming; expensive; susceptible to loss to follow-up |
Case-Control Study | Compares past exposures of people with a disease (cases) to those of people without it (controls) | Efficient for rare diseases; requires fewer subjects | Susceptible to recall & selection bias; cannot directly calculate incidence |
Cross-sectional Study | Data collected at one point in time | Quick, inexpensive; good for prevalence | Cannot establish causality; temporal ambiguity |
Ecological Study | Compares population-level (aggregate) data, not individuals | Useful for generating hypotheses; uses existing data | Ecological fallacy; cannot link exposure to outcome at individual level |
Key Study Design Elements
- Randomization: Random assignment to reduce systematic differences
- Blinding: Concealing group assignment from participants (single-blind) or from both participants and researchers (double-blind) to reduce bias
- Controls: Comparison groups to isolate variable effects
- Sample Size Calculation: Ensures adequate statistical power
- Inclusion/Exclusion Criteria: Defines study population
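As one way to picture the randomization element above, the sketch below implements permuted-block randomization, a common scheme that keeps the two arms balanced after every block. This particular scheme is not prescribed by the text; the function name, block size, and seed are illustrative assumptions, and real trials typically use validated allocation systems with concealed sequences.

```python
import random


def block_randomize(n_participants, block_size=4, seed=2024):
    """Assign participants to two arms in shuffled, balanced blocks.

    A fixed seed makes this sketch reproducible; it is a hypothetical value.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal arms")
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_participants:
        half = block_size // 2
        block = ["treatment"] * half + ["control"] * half
        rng.shuffle(block)  # random order within each block
        allocations.extend(block)
    return allocations[:n_participants]


schedule = block_randomize(12)
print(schedule)
```

With a block size of 4, the arm sizes never differ by more than 2 at any point during enrollment.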
Statistical Methods by Purpose
Descriptive Statistics
- Frequency Tables: Counts and percentages
- Measures of Central Tendency: Mean, median, mode
- Measures of Dispersion: SD, variance, range, IQR
- Data Visualization: Histograms, box plots, scatter plots, bar charts
Inferential Statistics
- Point Estimation: Single value estimate of a parameter
- Interval Estimation: Range of plausible values (confidence intervals)
- Hypothesis Testing: Process to test claims about populations
  - Null hypothesis (H₀): No effect/difference
  - Alternative hypothesis (H₁): Effect/difference exists
Comparative Analyses
Test | Purpose | Data Type | Assumptions |
---|---|---|---|
t-test (independent) | Compare means of 2 independent groups | Continuous | Normal distribution, equal variances (Welch’s t-test relaxes the equal-variance assumption) |
t-test (paired) | Compare means of matched pairs | Continuous | Normal distribution of differences |
ANOVA | Compare means of 3+ groups | Continuous | Normal distribution, equal variances |
Chi-square | Compare proportions between groups | Categorical | Expected frequencies ≥5 in each cell |
Fisher’s Exact | Compare proportions (small samples) | Categorical | Independent observations; preferred when expected cell counts are below 5 |
Mann-Whitney U | Compare 2 independent groups (non-parametric) | Ordinal/continuous | Does not require normality; independent observations |
Kruskal-Wallis | Compare 3+ independent groups (non-parametric) | Ordinal/continuous | Does not require normality; independent observations |
Wilcoxon Signed-Rank | Paired data (non-parametric) | Ordinal/continuous | Does not require normality |
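As one instance of the tests in the table above, the following sketch computes a Pearson chi-square test for a 2×2 table by hand, including the expected frequencies needed to check the "≥5 in each cell" rule. The counts are invented, the function name is an assumption, and no continuity correction is applied. The p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value, expected_counts).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected count under independence: row total * column total / n
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # With 1 degree of freedom, P(X > chi2) = P(|Z| > sqrt(chi2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected


# Hypothetical counts: rows = treatment/control, columns = improved/not improved
observed = [[30, 20], [15, 35]]
chi2, p_value, expected = chi_square_2x2(observed)
print(f"chi-square={chi2:.3f}, p={p_value:.4f}, expected={expected}")
```

If any expected count fell below 5, Fisher’s exact test from the same table would be the more defensible choice.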
Correlation and Regression Analyses
Analysis | Purpose | Output | Assumptions |
---|---|---|---|
Pearson Correlation | Linear relationship between 2 continuous variables | r (-1 to +1) | Normal distribution, linear relationship |
Spearman Correlation | Monotonic relationship between 2 variables | rs (-1 to +1) | Monotonic relationship |
Simple Linear Regression | Predict continuous outcome from one predictor | β coefficients, R² | Linearity, normally distributed residuals, homoscedasticity, independence |
Multiple Linear Regression | Predict continuous outcome from multiple predictors | β coefficients, R² | Linearity, normally distributed residuals, homoscedasticity, independence, no multicollinearity |
Logistic Regression | Predict binary outcome | Odds ratios, log odds | Binary outcome, independence, no multicollinearity, linearity of the log odds in continuous predictors |
Cox Proportional Hazards | Analyze time-to-event data with censoring | Hazard ratios | Proportional hazards, independent censoring |
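The first and third rows of the table above share the same building blocks. The sketch below computes Pearson's r and the least-squares slope and intercept from sums of squares; the function name and the age/blood pressure values are hypothetical, chosen only to show the arithmetic.

```python
import math


def pearson_and_ols(x, y):
    """Return (r, slope, intercept, r_squared) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient
    slope = sxy / sxx                    # least-squares estimate of beta_1
    intercept = mean_y - slope * mean_x  # least-squares estimate of beta_0
    return r, slope, intercept, r ** 2


# Hypothetical ages (years) and systolic blood pressures (mmHg)
age = [35, 42, 50, 58, 63, 70]
sbp = [118, 121, 127, 135, 138, 146]
r, slope, intercept, r_squared = pearson_and_ols(age, sbp)
print(f"r={r:.3f}, slope={slope:.3f}, intercept={intercept:.1f}, R2={r_squared:.3f}")
```

In simple linear regression R² equals r², which is why the two analyses give the same picture for a single predictor.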
Advanced Methods
- Survival Analysis: Analyzes time until event occurs
  - Kaplan-Meier curves
  - Log-rank test
  - Cox proportional hazards models
- Meta-Analysis: Statistically combines results of multiple studies
- Multivariate Analysis: Analyzes multiple dependent variables simultaneously
- Cluster Analysis: Groups similar observations together
- Principal Component Analysis: Reduces data dimensionality
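To show what a Kaplan-Meier curve from the survival analysis item above actually computes, here is a minimal product-limit estimator. The function name and the follow-up data are assumptions for illustration; subjects censored at the same time as an event are treated as still at risk for that event, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each subject
    events: 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
    return curve


# Hypothetical follow-up in months; 0 marks a censored observation
months = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(months, observed):
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects never cause a drop in the curve, but they shrink the risk set, so later events produce larger steps.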
Statistical Power and Sample Size
Key Determinants of Statistical Power
- Sample Size: Larger samples provide more power
- Effect Size: Larger effects are easier to detect
- Variability: Less variability gives more power
- Significance Level (α): Accepted probability of a Type I error; usually set at 0.05
- Type I Error: False positive (rejecting a true null hypothesis)
- Type II Error: False negative (failing to reject a false null hypothesis); its probability is β
- Power = 1 – β: Probability of detecting a true effect (usually aim for 0.8 or 80%)
Sample Size Calculation Components
- Expected effect size
- Desired power level (typically 80% or 90%)
- Significance level (typically α = 0.05)
- Variability estimate
- Study design factors (one vs. two-sided tests, paired vs. independent samples)
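The components above combine in a standard normal-approximation formula for comparing two means with equal group sizes: n per group = 2 × ((z₁₋α/₂ + z₁₋β) × SD / difference)². The sketch below applies it; the function name and the inputs are illustrative assumptions, and an exact t-based calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison of means.

    delta: smallest difference in means worth detecting
    sd:    assumed common standard deviation
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical: detect a 5 mmHg difference when the SD is 10 mmHg
print(n_per_group(delta=5, sd=10))              # 80% power
print(n_per_group(delta=5, sd=10, power=0.90))  # 90% power
print(n_per_group(delta=2.5, sd=10))            # half the effect size
```

Halving the detectable difference roughly quadruples the required sample size, which is why the expected effect size is usually the most influential input.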
P-values, Confidence Intervals, and Significance
P-value Interpretation
- Definition: Probability of obtaining results at least as extreme as observed, if null hypothesis is true
- Interpretation:
  - p < 0.05: Statistically significant (by convention)
  - p ≥ 0.05: Not statistically significant
- Caution: Statistical significance ≠ clinical significance
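The definition above ("at least as extreme as observed, if the null hypothesis is true") can be made literal with an exact permutation test: if group labels do not matter, every way of splitting the pooled data is equally likely, and the p-value is the share of splits whose mean difference is at least as large as the one observed. This is a sketch with a hypothetical function name and toy data; it enumerates all splits, so it is only practical for small samples.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    as_extreme = 0
    n_splits = 0
    # Enumerate every relabelling that keeps the group sizes fixed
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        split_sum = sum(pooled[i] for i in idx)
        if abs_mean_diff(split_sum) >= observed - 1e-12:  # tolerance for float ties
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical recovery times (days) in two small groups
p = exact_permutation_p([1, 2, 3, 4], [5, 6, 7, 8])
print(f"p = {p:.4f}")
```

Here only 2 of the 70 possible splits separate the groups as completely as the observed one, so the smallest achievable p-value for this design is 2/70.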
Confidence Intervals (CI)
- Definition: Range of values, computed from the sample, that is likely to contain the true population parameter
- Interpretation:
  - 95% CI: If the study were repeated many times, about 95% of the intervals constructed this way would contain the true parameter
  - Narrow CI indicates a precise estimate
  - Wide CI indicates less precision
- Advantage: Provides both magnitude and precision of effect
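As a small illustration of how a confidence interval reports both magnitude and precision, the sketch below computes a Wald (normal-approximation) interval for a proportion. The function name and counts are assumptions; the Wald interval is only one of several methods and performs poorly for small samples or proportions near 0 or 1.

```python
import math
from statistics import NormalDist


def wald_ci_proportion(events, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = events / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se


# Hypothetical: the same 30% response rate observed in a small and a large study
low_small, high_small = wald_ci_proportion(60, 200)
low_large, high_large = wald_ci_proportion(240, 800)
print(f"n=200: ({low_small:.3f}, {high_small:.3f})")
print(f"n=800: ({low_large:.3f}, {high_large:.3f})")
```

Both studies estimate the same proportion, but quadrupling the sample size halves the width of the interval.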
Common Challenges and Solutions
Selection Bias
- Problem: Study participants not representative of target population
- Solutions: Random sampling, clear inclusion/exclusion criteria, reporting participation rates
Confounding
- Problem: Extraneous variable associated with both exposure and outcome
- Solutions: Randomization, matching, stratification, multivariable analysis, restriction
Missing Data
- Problem: Incomplete datasets leading to bias or reduced power
- Solutions:
  - Complete case analysis (unbiased only if data are missing completely at random)
  - Imputation methods (mean/median substitution, which understates variability; multiple imputation)
  - Sensitivity analyses
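The difference between the first two solutions above is easy to see on a toy dataset. In this sketch (hypothetical values, with `None` marking a missing measurement), mean substitution leaves the mean unchanged but shrinks the standard deviation, which is one reason multiple imputation is generally preferred over single-value substitution.

```python
import statistics

# Hypothetical lab values; None marks a missing measurement
observed = [4.0, 5.0, None, 7.0, None, 8.0]

# Complete case analysis: drop records with missing values
complete = [v for v in observed if v is not None]

# Mean substitution: replace each missing value with the observed mean
fill_value = statistics.mean(complete)
imputed = [fill_value if v is None else v for v in observed]

sd_complete = statistics.stdev(complete)
sd_imputed = statistics.stdev(imputed)

print(f"complete cases: n={len(complete)}, mean={statistics.mean(complete)}, SD={sd_complete:.3f}")
print(f"mean-imputed:   n={len(imputed)}, mean={statistics.mean(imputed)}, SD={sd_imputed:.3f}")
```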
Multiple Comparisons
- Problem: Increased risk of Type I errors when performing many tests
- Solutions: Bonferroni correction, False Discovery Rate control (e.g., Benjamini-Hochberg), pre-specified primary endpoints
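Two of the corrections named above can be sketched in a few lines. Bonferroni compares each p-value with α/m, while the Benjamini-Hochberg step-up procedure (one common way to control the False Discovery Rate) compares the k-th smallest p-value with (k/m)·α. The function names and the p-values are illustrative assumptions.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for p-values at or below alpha divided by the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns reject flags in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject


# Hypothetical p-values from five secondary endpoints
pvals = [0.20, 0.001, 0.038, 0.012, 0.025]
print("Bonferroni:        ", bonferroni(pvals))
print("Benjamini-Hochberg:", benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for more power.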
Low Statistical Power
- Problem: Inability to detect true effects
- Solutions: Increase sample size, reduce measurement variability, use more efficient designs
Best Practices and Tips
Study Design
- Clearly define research question and hypothesis before starting
- Choose appropriate study design for your research question
- Conduct proper sample size calculations before beginning
- Pre-register study protocols and analysis plans
- Use validated measurement tools when possible
Data Analysis
- Examine data distribution before choosing statistical tests
- Check assumptions of statistical tests
- Present effect sizes along with p-values
- Report confidence intervals
- Conduct sensitivity analyses for important findings
- Consider clinical significance, not just statistical significance
Reporting
- Follow relevant reporting guidelines (CONSORT, STROBE, PRISMA)
- Report all outcomes, not just significant ones
- Be transparent about analytical methods and decisions
- Avoid overinterpreting results (especially with observational data)
- Include appropriate visualizations of data
- Present absolute risk differences, not just relative risks
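The last point above matters because the same relative risk can correspond to very different absolute benefits. The sketch below derives the relative risk, absolute risk reduction, and number needed to treat from event counts; the function name and the two scenarios are hypothetical.

```python
import math


def risk_summary(events_treated, n_treated, events_control, n_control):
    """Relative and absolute effect measures from event counts in two arms."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    relative_risk = risk_treated / risk_control
    absolute_risk_reduction = risk_control - risk_treated
    # Number needed to treat is undefined when there is no absolute benefit
    nnt = 1 / absolute_risk_reduction if absolute_risk_reduction > 0 else math.inf
    return {"RR": relative_risk, "ARR": absolute_risk_reduction, "NNT": nnt}


# Hypothetical trials with the same relative risk but different baseline risks
common = risk_summary(100, 1000, 200, 1000)  # 10% vs 20%
rare = risk_summary(1, 1000, 2, 1000)        # 0.1% vs 0.2%
print(common)
print(rare)
```

Both scenarios halve the risk, yet the number needed to treat differs by a factor of about 100, which a relative risk alone would hide.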
Software Tools for Biostatistics
Statistical Packages
- R: Free, powerful, versatile; steep learning curve
- SPSS: User-friendly interface; limited advanced capabilities
- SAS: Industry standard in healthcare; expensive
- Stata: Popular in epidemiology; clean syntax
- GraphPad Prism: User-friendly; focused on biological research
Key Functions to Know
- Data import/cleaning
- Descriptive statistics
- Basic visualizations
- Common statistical tests
- Regression analyses
- Power calculations
Resources for Further Learning
Books
- Fundamentals of Biostatistics by Bernard Rosner
- Essential Medical Statistics by Betty Kirkwood and Jonathan Sterne
- Statistical Methods in Medical Research by P. Armitage, G. Berry, J.N.S. Matthews
- Practical Statistics for Medical Research by Douglas G. Altman
Online Courses and Resources
- Coursera: “Statistics for Life Sciences” specialization
- edX: “Statistics and R” by Harvard
- StatLearning.com: Free course materials
- UCLA Statistical Computing Resources
- BMJ Statistics at Square One series
Key Journals
- Statistics in Medicine
- Biostatistics
- Statistical Methods in Medical Research
- Journal of the Royal Statistical Society
Professional Organizations
- American Statistical Association (ASA)
- International Biometric Society
- Royal Statistical Society
- Society for Clinical Trials