Introduction: Why Biopython Matters
Biopython is an open-source Python library designed specifically for biological computation. It provides tools for bioinformatics tasks including:
- Processing biological sequences (DNA, RNA, proteins)
- Parsing bioinformatics file formats
- Accessing online biological databases
- Running and analyzing results from common bioinformatics programs
Why Biopython matters: The integration of biological data with computational methods is essential for modern biological research. Biopython streamlines these complex tasks, making computational biology accessible to researchers without specialized programming backgrounds.
Core Concepts and Objects
Key Biopython Objects
| Object | Description | Primary Use Cases |
|---|---|---|
SeqRecord | Container holding sequence, ID, description | Storing and manipulating biological sequences |
Seq | Represents biological sequences | Sequence manipulation and analysis |
SeqFeature | Annotated regions of a sequence | Representing gene structures, annotations |
SeqIO | Module for sequence file I/O | Reading and writing sequence files |
Align | Tools for sequence alignments | Creating and manipulating alignments |
NCBI | Interface to NCBI databases | Retrieving sequences and information |
PDB | Protein Data Bank module | Working with 3D protein structures |
Phylo | Tools for phylogenetic trees | Working with evolutionary relationships |
Installing Biopython
pip install biopython
Basic Import Statements
from Bio import SeqIO # Sequence I/O
from Bio.Seq import Seq # Sequence object
from Bio import Entrez # NCBI database access
from Bio import AlignIO # Alignment I/O
from Bio.Align import MultipleSeqAlignment # MSA object
from Bio import SearchIO # BLAST result parsing
from Bio import pairwise2 # Pairwise alignments
from Bio.PDB import PDBParser # PDB file parsing
from Bio import Phylo # Phylogenetic trees
Fundamental Sequence Operations
Creating and Manipulating Sequences
# Creating a sequence
from Bio.Seq import Seq
my_seq = Seq("ATGCTAGCTAGCTAGC")
# Basic operations
length = len(my_seq) # Sequence length
complement = my_seq.complement() # Complement sequence
reverse_complement = my_seq.reverse_complement() # Reverse complement
transcription = my_seq.transcribe() # DNA to RNA (T→U)
translation = my_seq.translate() # DNA to protein
# Sequence slicing (zero-based indexing)
first_codon = my_seq[0:3] # First 3 bases
last_codon = my_seq[-3:] # Last 3 bases
# Sequence composition
gc_content = 100 * ((my_seq.count("G") + my_seq.count("C")) / len(my_seq))
base_counts = {'A': my_seq.count('A'), 'T': my_seq.count('T'),
'G': my_seq.count('G'), 'C': my_seq.count('C')}
Sequence Record Operations
# Creating a SeqRecord
from Bio.SeqRecord import SeqRecord
record = SeqRecord(my_seq,
id="ABC123",
name="Example",
description="An example sequence")
# Accessing SeqRecord attributes
sequence = record.seq
identifier = record.id
description = record.description
File Operations: Reading and Writing Sequence Data
Reading Sequence Files
# Read a FASTA file
from Bio import SeqIO
records = list(SeqIO.parse("sequences.fasta", "fasta"))
# Read a single sequence from FASTA
single_record = SeqIO.read("single_sequence.fasta", "fasta")
# Other common formats: GenBank, EMBL, FASTQ
genbank_records = list(SeqIO.parse("sequences.gb", "genbank"))
embl_records = list(SeqIO.parse("sequences.embl", "embl"))
fastq_records = list(SeqIO.parse("sequences.fastq", "fastq"))
Writing Sequence Files
# Write a single record
SeqIO.write(record, "output.fasta", "fasta")
# Write multiple records
SeqIO.write(records, "output.fasta", "fasta")
# Convert between formats
sequences = SeqIO.parse("input.gb", "genbank")
SeqIO.write(sequences, "output.fasta", "fasta")
Supported File Formats
| Format | File Extensions | Read | Write | Description |
|---|---|---|---|---|
| FASTA | .fa, .fasta, .fna, .faa | ✓ | ✓ | Simple format with headers and sequences |
| GenBank | .gb, .genbank | ✓ | ✓ | NCBI’s rich annotation format |
| EMBL | .embl | ✓ | ✓ | European equivalent to GenBank |
| FASTQ | .fastq, .fq | ✓ | ✓ | FASTA with quality scores |
| SWISS-PROT | .swiss, .sp | ✓ | ✓ | Curated protein database format |
| Clustal | .aln | ✓ | ✓ | Multiple sequence alignment |
| Stockholm | .sto, .sth | ✓ | ✓ | Multiple sequence alignment with annotation |
Online Database Access
NCBI Entrez Database Access
from Bio import Entrez
# Always provide your email for NCBI's records
Entrez.email = "your.email@example.com"
# Search NCBI databases
search_handle = Entrez.esearch(db="nucleotide",
term="Homo sapiens[Organism] AND COX1[Gene]",
retmax=10)
search_results = Entrez.read(search_handle)
search_handle.close()
# Get IDs from search results
id_list = search_results["IdList"]
# Fetch records using IDs
fetch_handle = Entrez.efetch(db="nucleotide",
id=id_list,
rettype="gb",
retmode="text")
records = list(SeqIO.parse(fetch_handle, "gb"))
fetch_handle.close()
Common NCBI Databases
| Database Name | Description | Common Use |
|---|---|---|
| nucleotide | Nucleotide sequences | DNA/RNA sequences |
| protein | Protein sequences | Amino acid sequences |
| pubmed | Scientific literature | Literature searches |
| genome | Genome sequences | Whole genome data |
| gene | Gene records | Gene information |
| taxonomy | Taxonomic information | Species classification |
Sequence Alignment
Pairwise Sequence Alignment
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
# Define two sequences
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
# Perform global alignment
alignments = pairwise2.align.globalxx(seq1, seq2)
# Display the best alignment
print(format_alignment(*alignments[0]))
# Local alignment
local_alignments = pairwise2.align.localxx(seq1, seq2)
Multiple Sequence Alignment
# Using external tools like Clustal Omega via Biopython
from Bio.Align.Applications import ClustalOmegaCommandline
in_file = "unaligned.fasta"
out_file = "aligned.fasta"
# Path to clustalo executable must be specified
clustalomega_cline = ClustalOmegaCommandline(
cmd="path/to/clustalo",
infile=in_file,
outfile=out_file,
verbose=True,
auto=True)
# Run the command
stdout, stderr = clustalomega_cline()
# Read the alignment
from Bio import AlignIO
alignment = AlignIO.read(out_file, "fasta")
Alignment Analysis
# Calculate alignment summary statistics
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)
# Calculate consensus sequence
consensus = summary_align.dumb_consensus()
# Get position-specific score matrix
pssm = summary_align.pos_specific_score_matrix()
# Calculate conservation by position
conservation = summary_align.information_content()
Sequence Similarity Searching
BLAST Through Biopython
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", "ATGCTAGCTAGCTCGAT")
# Parse BLAST results
blast_records = list(NCBIXML.parse(result_handle))
# Examine results
for record in blast_records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.01: # E-value threshold
print("Sequence:", alignment.title)
print("E-value:", hsp.expect)
print("Score:", hsp.score)
print(hsp.query)
print(hsp.match)
print(hsp.sbjct)
Local BLAST with Biopython
from Bio.Blast.Applications import NcbiblastnCommandline
# Run BLAST locally (requires BLAST+ installed)
blastn_cline = NcbiblastnCommandline(query="query.fasta",
db="nt",
outfmt=5,
out="results.xml")
stdout, stderr = blastn_cline()
# Parse results
result_handle = open("results.xml")
blast_records = list(NCBIXML.parse(result_handle))
Protein Structure Analysis
Working with PDB Files
from Bio.PDB import PDBParser, PDBIO
# Initialize parser
parser = PDBParser()
# Parse PDB structure
structure = parser.get_structure("protein_name", "protein.pdb")
# Access structure components (hierarchical)
model = structure[0] # First model
chain = model['A'] # Chain A
residue = chain[100] # Residue number 100
atom = residue['CA'] # Alpha carbon atom
# Get atom coordinates
coords = atom.get_coord()
# Calculate distances
from Bio.PDB import calc_distance
distance = calc_distance(atom1.get_coord(), atom2.get_coord())
# Write modified structure
io = PDBIO()
io.set_structure(structure)
io.save("modified.pdb")
Structure Analysis
# Secondary structure assignment
from Bio.PDB.DSSP import DSSP
dssp = DSSP(model, "protein.pdb")
# Ramachandran plot data
from Bio.PDB.internal_coords import IC_Chain
ic_chain = IC_Chain(chain)
phi_psi = [(res.get_phi(), res.get_psi()) for res in ic_chain]
Phylogenetic Analysis
Working with Phylogenetic Trees
from Bio import Phylo
# Read a tree file
tree = Phylo.read("tree.newick", "newick")
# Draw the tree
Phylo.draw(tree)
# Other supported formats: nexus, phyloxml, nexml
tree_nexus = Phylo.read("tree.nex", "nexus")
# Tree manipulation
tree.root_with_outgroup("TaxonA") # Root using outgroup
clade = tree.common_ancestor("TaxonA", "TaxonB") # Find MRCA
# Calculate distances
distance = tree.distance("TaxonA", "TaxonB")
# Write trees
Phylo.write(tree, "output.newick", "newick")
Data Analysis and Visualization
Basic Statistical Analysis
# GC content analysis
from Bio.SeqUtils import GC
gc_content = GC(sequence)
# Codon usage table
from Bio.SeqUtils import CodonUsage
table = CodonUsage.CodonsDict
count = CodonUsage.CodonCount(record)
usage = count.get_codons_frequencies(table)
Visualization with Matplotlib
import matplotlib.pyplot as plt
# Sequence length distribution
lengths = [len(record.seq) for record in records]
plt.hist(lengths, bins=20)
plt.xlabel("Sequence Length")
plt.ylabel("Frequency")
plt.title("Distribution of Sequence Lengths")
plt.savefig("length_dist.png")
# GC content across a sequence (sliding window)
def gc_content_window(seq, window_size=100):
gc_content = []
for i in range(0, len(seq) - window_size + 1, window_size):
window = seq[i:i + window_size]
gc_content.append(GC(window))
return gc_content
gc_values = gc_content_window(record.seq)
plt.plot(range(0, len(gc_values) * window_size, window_size), gc_values)
plt.xlabel("Position")
plt.ylabel("GC Content (%)")
plt.title("GC Content Across Sequence")
plt.savefig("gc_content.png")
Common Challenges and Solutions
| Challenge | Solution |
|---|---|
| Memory issues with large files | Use iterative parsing: for record in SeqIO.parse() |
| Slow BLAST searches | Use local BLAST installation instead of web interface |
| File format conversion errors | Check sequence validity before conversion; handle non-standard characters |
| NCBI connection errors | Implement retry mechanism with increasing delays |
| Sequence alignment of diverse sequences | Use more sensitive alignment algorithms (MAFFT, T-Coffee) |
| Working with nested features | Use recursive functions to traverse feature hierarchies |
Best Practices for Bioinformatics with Biopython
Error handling: Always implement try/except blocks around I/O operations and API calls
try: result = Entrez.efetch(db="nucleotide", id="12345", rettype="gb") except (urllib.error.HTTPError, urllib.error.URLError) as e: print(f"Failed to fetch data: {e}") # Implement retry logic hereData validation: Check sequences for validity before analysis
from Bio.Data import IUPACData def is_valid_dna(seq): return set(seq).issubset(set(IUPACData.ambiguous_dna_letters))Reproducibility: Document all parameters and versions
import Bio print(f"Using Biopython version {Bio.__version__}")Resource management: Close file handles and network connections
with Entrez.efetch(...) as handle: # Process data # Handle is automatically closedParallel processing: Use multiprocessing for CPU-intensive tasks
from multiprocessing import Pool def process_record(record): # Process a single record return result with Pool(processes=4) as pool: results = pool.map(process_record, records)
Resources for Further Learning
Official Documentation
Books
- Bioinformatics with Python Cookbook by Tiago Antao
- Building Bioinformatics Solutions by Conrad Bessant et al.
- Python for Biologists by Martin Jones
Online Courses
- Coursera: “Python for Genomic Data Science”
- edX: “Computational Biology”
- DataCamp: “Introduction to Bioinformatics in Python”
Community Resources
- Biopython GitHub Repository
- Biostars Forum (tag: biopython)
- RSeQC – For RNA sequencing QC
