The Ultimate Bioinformatics with Biopython Cheatsheet

Introduction: Why Biopython Matters

Biopython is an open-source Python library designed specifically for biological computation. It provides tools for bioinformatics tasks including:

  • Processing biological sequences (DNA, RNA, proteins)
  • Parsing bioinformatics file formats
  • Accessing online biological databases
  • Running and analyzing results from common bioinformatics programs

Why Biopython matters: The integration of biological data with computational methods is essential for modern biological research. Biopython streamlines these complex tasks, making computational biology accessible to researchers without specialized programming backgrounds.


Core Concepts and Objects

Key Biopython Objects

ObjectDescriptionPrimary Use Cases
SeqRecordContainer holding sequence, ID, descriptionStoring and manipulating biological sequences
SeqRepresents biological sequencesSequence manipulation and analysis
SeqFeatureAnnotated regions of a sequenceRepresenting gene structures, annotations
SeqIOModule for sequence file I/OReading and writing sequence files
AlignTools for sequence alignmentsCreating and manipulating alignments
NCBIInterface to NCBI databasesRetrieving sequences and information
PDBProtein Data Bank moduleWorking with 3D protein structures
PhyloTools for phylogenetic treesWorking with evolutionary relationships

Installing Biopython

pip install biopython

Basic Import Statements

from Bio import SeqIO                  # Sequence I/O
from Bio.Seq import Seq                # Sequence object
from Bio import Entrez                 # NCBI database access
from Bio import AlignIO                # Alignment I/O
from Bio.Align import MultipleSeqAlignment # MSA object
from Bio import SearchIO               # BLAST result parsing
from Bio import pairwise2              # Pairwise alignments
from Bio.PDB import PDBParser         # PDB file parsing
from Bio import Phylo                  # Phylogenetic trees

Fundamental Sequence Operations

Creating and Manipulating Sequences

# Creating a sequence
from Bio.Seq import Seq
my_seq = Seq("ATGCTAGCTAGCTAGC")

# Basic operations
length = len(my_seq)                    # Sequence length
complement = my_seq.complement()        # Complement sequence
reverse_complement = my_seq.reverse_complement()  # Reverse complement
transcription = my_seq.transcribe()     # DNA to RNA (T→U)
translation = my_seq.translate()        # DNA to protein

# Sequence slicing (zero-based indexing)
first_codon = my_seq[0:3]               # First 3 bases
last_codon = my_seq[-3:]                # Last 3 bases

# Sequence composition
gc_content = 100 * ((my_seq.count("G") + my_seq.count("C")) / len(my_seq))
base_counts = {'A': my_seq.count('A'), 'T': my_seq.count('T'),
              'G': my_seq.count('G'), 'C': my_seq.count('C')}

Sequence Record Operations

# Creating a SeqRecord
from Bio.SeqRecord import SeqRecord
record = SeqRecord(my_seq, 
                  id="ABC123",
                  name="Example",
                  description="An example sequence")

# Accessing SeqRecord attributes
sequence = record.seq
identifier = record.id
description = record.description

File Operations: Reading and Writing Sequence Data

Reading Sequence Files

# Read a FASTA file
from Bio import SeqIO
records = list(SeqIO.parse("sequences.fasta", "fasta"))

# Read a single sequence from FASTA
single_record = SeqIO.read("single_sequence.fasta", "fasta")

# Other common formats: GenBank, EMBL, FASTQ
genbank_records = list(SeqIO.parse("sequences.gb", "genbank"))
embl_records = list(SeqIO.parse("sequences.embl", "embl"))
fastq_records = list(SeqIO.parse("sequences.fastq", "fastq"))

Writing Sequence Files

# Write a single record
SeqIO.write(record, "output.fasta", "fasta")

# Write multiple records
SeqIO.write(records, "output.fasta", "fasta")

# Convert between formats
sequences = SeqIO.parse("input.gb", "genbank")
SeqIO.write(sequences, "output.fasta", "fasta")

Supported File Formats

FormatFile ExtensionsReadWriteDescription
FASTA.fa, .fasta, .fna, .faa✓✓Simple format with headers and sequences
GenBank.gb, .genbank✓✓NCBI’s rich annotation format
EMBL.embl✓✓European equivalent to GenBank
FASTQ.fastq, .fq✓✓FASTA with quality scores
SWISS-PROT.swiss, .sp✓✓Curated protein database format
Clustal.aln✓✓Multiple sequence alignment
Stockholm.sto, .sth✓✓Multiple sequence alignment with annotation

Online Database Access

NCBI Entrez Database Access

from Bio import Entrez

# Always provide your email for NCBI's records
Entrez.email = "your.email@example.com"

# Search NCBI databases
search_handle = Entrez.esearch(db="nucleotide", 
                              term="Homo sapiens[Organism] AND COX1[Gene]",
                              retmax=10)
search_results = Entrez.read(search_handle)
search_handle.close()

# Get IDs from search results
id_list = search_results["IdList"]

# Fetch records using IDs
fetch_handle = Entrez.efetch(db="nucleotide", 
                             id=id_list, 
                             rettype="gb", 
                             retmode="text")
records = list(SeqIO.parse(fetch_handle, "gb"))
fetch_handle.close()

Common NCBI Databases

Database NameDescriptionCommon Use
nucleotideNucleotide sequencesDNA/RNA sequences
proteinProtein sequencesAmino acid sequences
pubmedScientific literatureLiterature searches
genomeGenome sequencesWhole genome data
geneGene recordsGene information
taxonomyTaxonomic informationSpecies classification

Sequence Alignment

Pairwise Sequence Alignment

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Define two sequences
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")

# Perform global alignment
alignments = pairwise2.align.globalxx(seq1, seq2)

# Display the best alignment
print(format_alignment(*alignments[0]))

# Local alignment
local_alignments = pairwise2.align.localxx(seq1, seq2)

Multiple Sequence Alignment

# Using external tools like Clustal Omega via Biopython
from Bio.Align.Applications import ClustalOmegaCommandline

in_file = "unaligned.fasta"
out_file = "aligned.fasta"

# Path to clustalo executable must be specified
clustalomega_cline = ClustalOmegaCommandline(
    cmd="path/to/clustalo",
    infile=in_file,
    outfile=out_file,
    verbose=True,
    auto=True)

# Run the command
stdout, stderr = clustalomega_cline()

# Read the alignment
from Bio import AlignIO
alignment = AlignIO.read(out_file, "fasta")

Alignment Analysis

# Calculate alignment summary statistics
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)

# Calculate consensus sequence
consensus = summary_align.dumb_consensus()

# Get position-specific score matrix
pssm = summary_align.pos_specific_score_matrix()

# Calculate conservation by position
conservation = summary_align.information_content()

Sequence Similarity Searching

BLAST Through Biopython

from Bio.Blast import NCBIWWW, NCBIXML

# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", "ATGCTAGCTAGCTCGAT")

# Parse BLAST results
blast_records = list(NCBIXML.parse(result_handle))

# Examine results
for record in blast_records:
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.01:  # E-value threshold
                print("Sequence:", alignment.title)
                print("E-value:", hsp.expect)
                print("Score:", hsp.score)
                print(hsp.query)
                print(hsp.match)
                print(hsp.sbjct)

Local BLAST with Biopython

from Bio.Blast.Applications import NcbiblastnCommandline

# Run BLAST locally (requires BLAST+ installed)
blastn_cline = NcbiblastnCommandline(query="query.fasta", 
                                     db="nt", 
                                     outfmt=5, 
                                     out="results.xml")
stdout, stderr = blastn_cline()

# Parse results
result_handle = open("results.xml")
blast_records = list(NCBIXML.parse(result_handle))

Protein Structure Analysis

Working with PDB Files

from Bio.PDB import PDBParser, PDBIO

# Initialize parser
parser = PDBParser()

# Parse PDB structure
structure = parser.get_structure("protein_name", "protein.pdb")

# Access structure components (hierarchical)
model = structure[0]  # First model
chain = model['A']    # Chain A
residue = chain[100]  # Residue number 100
atom = residue['CA']  # Alpha carbon atom

# Get atom coordinates
coords = atom.get_coord() 

# Calculate distances
from Bio.PDB import calc_distance
distance = calc_distance(atom1.get_coord(), atom2.get_coord())

# Write modified structure
io = PDBIO()
io.set_structure(structure)
io.save("modified.pdb")

Structure Analysis

# Secondary structure assignment
from Bio.PDB.DSSP import DSSP
dssp = DSSP(model, "protein.pdb")

# Ramachandran plot data
from Bio.PDB.internal_coords import IC_Chain
ic_chain = IC_Chain(chain)
phi_psi = [(res.get_phi(), res.get_psi()) for res in ic_chain]

Phylogenetic Analysis

Working with Phylogenetic Trees

from Bio import Phylo

# Read a tree file
tree = Phylo.read("tree.newick", "newick")

# Draw the tree
Phylo.draw(tree)

# Other supported formats: nexus, phyloxml, nexml
tree_nexus = Phylo.read("tree.nex", "nexus")

# Tree manipulation
tree.root_with_outgroup("TaxonA")  # Root using outgroup
clade = tree.common_ancestor("TaxonA", "TaxonB")  # Find MRCA

# Calculate distances
distance = tree.distance("TaxonA", "TaxonB")

# Write trees
Phylo.write(tree, "output.newick", "newick")

Data Analysis and Visualization

Basic Statistical Analysis

# GC content analysis
from Bio.SeqUtils import GC
gc_content = GC(sequence)

# Codon usage table
from Bio.SeqUtils import CodonUsage
table = CodonUsage.CodonsDict
count = CodonUsage.CodonCount(record)
usage = count.get_codons_frequencies(table)

Visualization with Matplotlib

import matplotlib.pyplot as plt

# Sequence length distribution
lengths = [len(record.seq) for record in records]
plt.hist(lengths, bins=20)
plt.xlabel("Sequence Length")
plt.ylabel("Frequency")
plt.title("Distribution of Sequence Lengths")
plt.savefig("length_dist.png")

# GC content across a sequence (sliding window)
def gc_content_window(seq, window_size=100):
    gc_content = []
    for i in range(0, len(seq) - window_size + 1, window_size):
        window = seq[i:i + window_size]
        gc_content.append(GC(window))
    return gc_content

gc_values = gc_content_window(record.seq)
plt.plot(range(0, len(gc_values) * window_size, window_size), gc_values)
plt.xlabel("Position")
plt.ylabel("GC Content (%)")
plt.title("GC Content Across Sequence")
plt.savefig("gc_content.png")

Common Challenges and Solutions

ChallengeSolution
Memory issues with large filesUse iterative parsing: for record in SeqIO.parse()
Slow BLAST searchesUse local BLAST installation instead of web interface
File format conversion errorsCheck sequence validity before conversion; handle non-standard characters
NCBI connection errorsImplement retry mechanism with increasing delays
Sequence alignment of diverse sequencesUse more sensitive alignment algorithms (MAFFT, T-Coffee)
Working with nested featuresUse recursive functions to traverse feature hierarchies

Best Practices for Bioinformatics with Biopython

  1. Error handling: Always implement try/except blocks around I/O operations and API calls

    try:
        result = Entrez.efetch(db="nucleotide", id="12345", rettype="gb")
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print(f"Failed to fetch data: {e}")
        # Implement retry logic here
    
  2. Data validation: Check sequences for validity before analysis

    from Bio.Data import IUPACData
    def is_valid_dna(seq):
        return set(seq).issubset(set(IUPACData.ambiguous_dna_letters))
    
  3. Reproducibility: Document all parameters and versions

    import Bio
    print(f"Using Biopython version {Bio.__version__}")
    
  4. Resource management: Close file handles and network connections

    with Entrez.efetch(...) as handle:
        # Process data
    # Handle is automatically closed
    
  5. Parallel processing: Use multiprocessing for CPU-intensive tasks

    from multiprocessing import Pool
    
    def process_record(record):
        # Process a single record
        return result
    
    with Pool(processes=4) as pool:
        results = pool.map(process_record, records)
    

Resources for Further Learning

Official Documentation

Books

  • Bioinformatics with Python Cookbook by Tiago Antao
  • Building Bioinformatics Solutions by Conrad Bessant et al.
  • Python for Biologists by Martin Jones

Online Courses

  • Coursera: “Python for Genomic Data Science”
  • edX: “Computational Biology”
  • DataCamp: “Introduction to Bioinformatics in Python”

Community Resources

Scroll to Top