The Ultimate Apache Cassandra Cheatsheet: Master NoSQL Database Management

Introduction to Apache Cassandra

Apache Cassandra is a free, open-source, distributed NoSQL database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and high throughput for write-heavy workloads. Originally developed at Facebook to power its Inbox Search feature, Cassandra combines fault tolerance with tunable consistency.

Core Cassandra Concepts

Data Model Architecture

Concept | Description
Keyspace | Container for tables, similar to a schema in relational databases
Table | Collection of rows and columns, similar to tables in relational databases
Partition Key | First part of primary key that determines data distribution across nodes
Clustering Key | Optional second part of primary key that determines sort order within a partition
Column | Name-value pair with a defined data type
Row | Collection of columns identified by a primary key

Cassandra’s CAP Characteristics

  • Consistency: Tunable consistency levels (ONE, QUORUM, ALL, etc.)
  • Availability: No single point of failure, continuous availability
  • Partition Tolerance: Designed to operate across distributed nodes

Cassandra vs. Traditional RDBMS

Feature | Cassandra | Traditional RDBMS
Data Model | Column-family, schema-flexible | Row-based, rigid schema
Scaling | Horizontal (add nodes) | Vertical (bigger servers)
Transactions | Limited (lightweight transactions) | ACID compliant
Joins | Not supported natively | Fully supported
Architecture | Masterless, peer-to-peer | Master-slave
Best For | Write-heavy workloads, time-series data | Complex queries, transactional data
Consistency | Tunable (eventual to strong) | Strong consistency

Cassandra Query Language (CQL)

CQL Data Types

Category | Types
Numeric | int, bigint, smallint, tinyint, float, double, decimal, varint
Text | text, varchar, ascii
Time/Date | timestamp, date, time, duration
Identifiers | uuid, timeuuid
Collections | list, set, map, tuple
Others | boolean, blob, inet, counter

Basic CQL Commands

Database Operations

-- Create a keyspace
CREATE KEYSPACE my_keyspace 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Use a keyspace
USE my_keyspace;

-- Drop a keyspace
DROP KEYSPACE my_keyspace;

Table Operations

-- Create a table
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    email text,
    created_at timestamp
);

-- Alter table (add column)
ALTER TABLE users ADD age int;

-- Drop table
DROP TABLE users;

-- Truncate table (remove all data)
TRUNCATE users;

Data Manipulation

-- Insert data
INSERT INTO users (user_id, first_name, last_name, email, created_at)
VALUES (uuid(), 'John', 'Doe', 'john@example.com', toTimestamp(now()));

-- Update data
UPDATE users 
SET email = 'newemail@example.com' 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Delete data
DELETE FROM users 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Select data
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

Secondary Indexes

-- Create an index
CREATE INDEX ON users (email);

-- Drop an index
DROP INDEX users_email_idx;

Collections

-- Table with collections
CREATE TABLE user_preferences (
    user_id uuid PRIMARY KEY,
    favorite_colors set<text>,
    address_history list<text>,
    phone_numbers map<text, text>
);

-- Insert into collections
INSERT INTO user_preferences (user_id, favorite_colors, address_history, phone_numbers)
VALUES (
    uuid(), 
    {'blue', 'green', 'red'},
    ['123 Main St', '456 Oak Ave'],
    {'home': '555-1234', 'work': '555-5678'}
);

-- Update collections
UPDATE user_preferences 
SET favorite_colors = favorite_colors + {'yellow'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Remove from collections
UPDATE user_preferences 
SET phone_numbers = phone_numbers - {'work'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

Cassandra Data Modeling

Key Principles

  1. Model around queries, not entities
  2. Denormalize for performance
  3. Partition based on access patterns
  4. Avoid hotspots by distributing workload
  5. Minimize partitions read per query

Primary Key Design Patterns

Pattern | Structure | Use Case
Simple Primary Key | PRIMARY KEY (id) | Single record lookup
Compound Key | PRIMARY KEY ((partition_key), clustering_column) | Sorted data within partition
Composite Partition Key | PRIMARY KEY ((key1, key2), clustering_column) | Distributing data evenly
Time Series | PRIMARY KEY ((entity_id), timestamp) | Time-ordered events per entity
Bucketing | PRIMARY KEY ((entity_id, bucket), timestamp) | Managing wide partitions

Common Data Modeling Techniques

One-to-Many Relationship

CREATE TABLE posts (
    user_id uuid,
    post_id timeuuid,
    content text,
    created_at timestamp,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

Many-to-Many Relationship

-- User to groups
CREATE TABLE user_groups (
    user_id uuid,
    group_id uuid,
    joined_at timestamp,
    PRIMARY KEY (user_id, group_id)
);

-- Group to users (duplication for query efficiency)
CREATE TABLE group_users (
    group_id uuid,
    user_id uuid,
    joined_at timestamp,
    PRIMARY KEY (group_id, user_id)
);

Time Series Data

CREATE TABLE temperature_by_sensor (
    sensor_id uuid,
    day date,
    timestamp timestamp,
    temperature float,
    PRIMARY KEY ((sensor_id, day), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
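
With a (sensor_id, day) partition key, the application has to derive the day bucket when it writes, and list the buckets when it reads a time range. The following Python sketch shows one way to do that. The function names are hypothetical, and it assumes buckets are aligned to UTC days and that all datetimes are timezone-aware.

```python
from datetime import date, datetime, timedelta, timezone
from uuid import UUID


def partition_key(sensor_id: UUID, reading_time: datetime) -> tuple:
    """Return the (sensor_id, day) partition for a reading, using the UTC day."""
    return sensor_id, reading_time.astimezone(timezone.utc).date()


def partitions_for_range(sensor_id: UUID, start: datetime, end: datetime) -> list:
    """List every (sensor_id, day) partition a query over [start, end] must read."""
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    day_count = (last - first).days + 1
    return [(sensor_id, first + timedelta(days=offset)) for offset in range(day_count)]


if __name__ == "__main__":
    sensor = UUID("123e4567-e89b-12d3-a456-426614174000")
    start = datetime(2024, 2, 28, 12, 0, tzinfo=timezone.utc)
    end = datetime(2024, 3, 1, 1, 0, tzinfo=timezone.utc)
    for key in partitions_for_range(sensor, start, end):
        print(key)
```

Each returned pair corresponds to one partition, so a long time range turns into one query per day rather than one scan across the cluster.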

Consistency Levels

Read Consistency Levels

Level | Description
ONE | Return data from nearest replica
QUORUM | Return data when majority of replicas respond
LOCAL_QUORUM | Quorum of replicas in same datacenter
EACH_QUORUM | Quorum of replicas in each datacenter
ALL | Return data when all replicas respond
LOCAL_ONE | Return data from nearest replica in local datacenter

Write Consistency Levels

Level | Description
ANY | Write to any node (can be hinted handoff coordinator)
ONE | Write confirmed by at least one replica
QUORUM | Write confirmed by majority of replicas
LOCAL_QUORUM | Quorum of replicas in same datacenter
EACH_QUORUM | Quorum of replicas in each datacenter
ALL | Write confirmed by all replicas
LOCAL_ONE | Write confirmed by at least one replica in local datacenter
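
A useful rule of thumb when combining the levels above: a read is guaranteed to reach at least one replica that acknowledged the latest write when read replicas + write replicas > replication factor. The Python sketch below works through that arithmetic. It is a simplified model that assumes a single datacenter and covers only ONE, QUORUM, and ALL; the function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Acknowledgements needed at a given level in a single-datacenter keyspace."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_overlap_writes(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when every read must touch at least one replica that acknowledged the write."""
    read_replicas = replicas_required(read_level, replication_factor)
    write_replicas = replicas_required(write_level, replication_factor)
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        overlap = reads_overlap_writes(read_level, write_level, replication_factor=3)
        print(f"read={read_level} write={write_level} overlap={overlap}")
```

With a replication factor of 3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 ≤ 3).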

Setting Consistency Levels

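-- Set consistency for the cqlsh session (applies to reads and writes)
CONSISTENCY QUORUM;

-- Per-query consistency is set through the client driver, not in CQL
-- (the USING CONSISTENCY clause was removed in CQL 3)
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;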

Cassandra Architecture

Key Components

Component | Description
Node | Single Cassandra instance
Cluster | Collection of nodes that store your data
Data Center | Group of related nodes (often a physical location)
Rack | Collection of servers (often on same switch)
Ring | The ring structure that represents data distribution
Gossip Protocol | How nodes exchange state information
Snitch | Determines network topology
Partitioner | Determines how data is distributed

Replication Strategies

Strategy | Description | Usage
SimpleStrategy | Places replicas on consecutive nodes around the ring | Testing, single datacenter
NetworkTopologyStrategy | Precise control over replica placement by datacenter | Production, multiple datacenters
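
To make "consecutive nodes around the ring" concrete, here is a small Python sketch of SimpleStrategy-style placement. It is a toy model under stated assumptions: one token per node (no vnodes), SHA-256 as a stand-in hash rather than Cassandra's actual partitioner, and made-up node names and tokens.

```python
import hashlib
from bisect import bisect_left


def token_for(partition_key: str) -> int:
    """Hash a key to a ring position. SHA-256 is a stand-in, not Cassandra's partitioner."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for_token(key_token: int, node_tokens: dict, replication_factor: int) -> list:
    """Return the owner of key_token followed by the next nodes clockwise on the ring."""
    ring = sorted((token, node) for node, token in node_tokens.items())
    tokens = [token for token, _ in ring]
    # The owner is the first node whose token is >= the key's token, wrapping at the end.
    start = bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + step) % len(ring)][1] for step in range(count)]


if __name__ == "__main__":
    # Hypothetical four-node ring with evenly spaced tokens.
    nodes = {"node-a": 0, "node-b": 2**62, "node-c": 2**63, "node-d": 3 * 2**62}
    print(replicas_for_token(token_for("user-42"), nodes, replication_factor=3))
```

NetworkTopologyStrategy follows the same clockwise walk but also takes datacenter and rack placement into account, which this sketch leaves out.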

Write Path

  1. Write to commit log (durability)
  2. Write to memtable (in-memory)
  3. Periodically flush memtable to SSTable (immutable on disk)
  4. Eventually compact SSTables
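
The four steps above can be sketched as a toy storage engine in Python. This is only an illustration of the flow, not how Cassandra is implemented: the class name, the tiny memtable limit, and the use of plain dicts for SSTables are all assumptions made for brevity.

```python
class ToyStorageEngine:
    """Toy model of the write path: commit log, memtable, SSTables, compaction."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory: key -> (timestamp, value)
        self.sstables = []     # flushed tables, oldest first, never edited in place
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))   # step 1: durability
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)        # step 2: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                                   # step 3: flush when full

    def flush(self):
        """Write the memtable out as a new sorted SSTable and reset in-memory state."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()   # flushed writes no longer need replay

    def compact(self):
        """Step 4: merge all SSTables, keeping the newest timestamp per key."""
        merged = {}
        for table in self.sstables:
            for key, (timestamp, value) in table.items():
                if key not in merged or timestamp > merged[key][0]:
                    merged[key] = (timestamp, value)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


if __name__ == "__main__":
    engine = ToyStorageEngine(memtable_limit=2)
    for key, value, ts in [("a", "1", 1), ("b", "2", 2), ("a", "3", 3), ("c", "4", 4)]:
        engine.write(key, value, ts)
    print(len(engine.sstables), "SSTables before compaction")
    engine.compact()
    print(engine.sstables)
```

Writes never modify an existing SSTable; overwritten values simply accumulate until compaction merges them, which is why write throughput stays high.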

Read Path

  1. Check row cache (if enabled)
  2. Check partition key cache (if enabled)
  3. Check memtable
  4. Check SSTables (using Bloom filters and indexes)
  5. Perform read repair if needed
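
Bloom filters are what let a read skip SSTables that cannot contain the requested partition. The Python sketch below shows the idea with a minimal filter. The sizes, the SHA-256 based hashing, and the `read` helper are illustrative assumptions, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, hash_count: int = 3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, key: str):
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> position) & 1 for position in self._positions(key))


def read(key: str, sstables: list):
    """Check SSTables newest first, skipping any whose filter rules the key out."""
    for bloom, rows in reversed(sstables):
        if bloom.might_contain(key) and key in rows:
            return rows[key]
    return None


if __name__ == "__main__":
    older, newer = BloomFilter(), BloomFilter()
    older.add("k1")
    newer.add("k1")
    newer.add("k2")
    tables = [(older, {"k1": "v1"}), (newer, {"k1": "v2", "k2": "v3"})]
    print(read("k1", tables), read("k2", tables), read("missing", tables))
```

A "no" from the filter is definitive, so the SSTable is skipped without touching disk; a "yes" only means the SSTable has to be checked.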

Performance Optimization

Performance Tuning Parameters

Parameter | Description | Recommendation
concurrent_reads | Number of concurrent reads | 16 × number of drives
concurrent_writes | Number of concurrent writes | 8 × number of CPU cores
memtable_flush_writers | Writers for flushing memtables | Number of disks
compaction_throughput_mb_per_sec | Throttle for compaction | Start at 16-32, adjust based on load
read_request_timeout_in_ms | Read timeout | Default 5000ms
write_request_timeout_in_ms | Write timeout | Default 2000ms
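
The sizing rules in the table reduce to simple arithmetic on the hardware. As a rough starting point only, the following Python helper (a hypothetical function, not part of any Cassandra tooling) applies them to a given drive and core count.

```python
def recommended_concurrency(data_drives: int, cpu_cores: int) -> dict:
    """Apply the rule-of-thumb multipliers from the tuning table."""
    return {
        "concurrent_reads": 16 * data_drives,
        "concurrent_writes": 8 * cpu_cores,
        "memtable_flush_writers": data_drives,
    }


if __name__ == "__main__":
    # Example: a node with 2 data drives and 8 CPU cores.
    for name, value in recommended_concurrency(data_drives=2, cpu_cores=8).items():
        print(f"{name}: {value}")
```

Treat the output as initial values to validate under load, not as final settings.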

Compaction Strategies

Strategy | Description | Best For
SizeTieredCompactionStrategy (STCS) | Default, groups similarly sized SSTables | Write-heavy workloads
LeveledCompactionStrategy (LCS) | Organize SSTables in levels | Read-heavy workloads
TimeWindowCompactionStrategy (TWCS) | Optimized for time series data | Time series, TTL data
DateTieredCompactionStrategy (DTCS) | Deprecated, replaced by TWCS | Legacy systems

Caching Options

Cache Type | Description | Use Case
Row Cache | Caches entire rows | Frequently accessed, rarely changing rows
Key Cache | Caches the location of partition keys within SSTables | Default cache, improves read performance
Counter Cache | Caches counters | High-volume counters
Chunk Cache | Caches chunks of data | Improves read performance for wide rows

Backup and Recovery

Backup Strategies

Snapshots

# Create snapshot of all keyspaces
nodetool snapshot

# Create snapshot of specific keyspace
nodetool snapshot my_keyspace

# Create snapshot with a name
nodetool snapshot -t backup_name my_keyspace

Incremental Backups

Enable in cassandra.yaml:

incremental_backups: true

Restore Process

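# Stop Cassandra
service cassandra stop

# Locate the table directory (its name ends with the table ID; assumes a single match)
TABLE_DIR=$(ls -d /var/lib/cassandra/data/my_keyspace/my_table-*)

# Clear SSTable files only, leaving the snapshots/ and backups/ subdirectories
find "$TABLE_DIR" -maxdepth 1 -type f -delete

# Restore from snapshot (snapshots are stored inside the table directory)
cp "$TABLE_DIR"/snapshots/snapshot_name/* "$TABLE_DIR"/

# Restart Cassandra
service cassandra start

# Run repair
nodetool repair my_keyspace my_table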

Monitoring and Maintenance

Essential nodetool Commands

# Check cluster status
nodetool status

# Check node status
nodetool info

# Get statistics
nodetool tablestats my_keyspace.my_table

# Run repair
nodetool repair

# Run cleanup
nodetool cleanup

# Flush memtables to disk
nodetool flush

# Compact SSTables
nodetool compact

Monitoring Metrics

Metric Category | Key Metrics to Monitor
Latency | Read/write latency, request coordinator latency
Throughput | Read/write requests per second
Compaction | Pending compactions, compaction history
Storage | Disk usage, SSTable count
Cache | Cache hit rates, cache size
GC | Garbage collection pauses, frequency
Thread Pools | Pending/blocked tasks

Recommended Monitoring Tools

  • Prometheus with Cassandra exporter
  • Grafana dashboards
  • DataStax OpsCenter
  • Instaclustr Console
  • JMX tools (JConsole, jmxterm)

Common Challenges and Solutions

Challenge: Tombstones

Solution:

  • Set appropriate TTL values
  • Use USING TIMESTAMP for overwrites
  • Schedule regular tombstone GC with nodetool garbagecollect
  • Configure gc_grace_seconds based on repair frequency
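
The link between gc_grace_seconds and repair frequency is a timing constraint: a tombstone can be purged once gc_grace_seconds have passed since the delete, so every replica needs to have been repaired within that window or deleted data can reappear. The Python sketch below checks that constraint. The function names are hypothetical; 864000 seconds (10 days) is Cassandra's default gc_grace_seconds.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864000   # Cassandra's default: 10 days


def tombstone_purgeable_at(deleted_at: datetime,
                           gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Earliest time compaction may drop a tombstone written at deleted_at."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def repair_interval_is_safe(repair_interval: timedelta,
                            gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True when repairs run more often than tombstones become purgeable."""
    return repair_interval.total_seconds() < gc_grace_seconds


if __name__ == "__main__":
    deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print("purgeable at:", tombstone_purgeable_at(deleted))
    print("weekly repair safe:", repair_interval_is_safe(timedelta(days=7)))
    print("biweekly repair safe:", repair_interval_is_safe(timedelta(days=14)))
```

If repairs cannot run more often, raising gc_grace_seconds on the table is the safer direction, at the cost of keeping tombstones around longer.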

Challenge: Wide Partitions

Solution:

  • Implement bucketing in primary key design
  • Split logical entities across multiple tables
  • Use time-based bucketing for time series
  • Monitor partition sizes with nodetool tablehistograms
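
To decide how many buckets a wide partition needs, a back-of-the-envelope estimate is often enough. The Python sketch below divides an estimated partition size by a 100 MB target (the guideline given under Best Practices) and assigns rows to buckets by modulo. The function names are illustrative, and the estimate ignores storage overhead and compression.

```python
import math

TARGET_PARTITION_BYTES = 100 * 1024 * 1024   # the <100MB guideline


def buckets_needed(row_count: int, avg_row_bytes: int,
                   target_bytes: int = TARGET_PARTITION_BYTES) -> int:
    """Number of buckets so each partition stays at or under target_bytes."""
    estimated_bytes = row_count * avg_row_bytes
    return max(1, math.ceil(estimated_bytes / target_bytes))


def bucket_for(sequence_number: int, bucket_count: int) -> int:
    """Spread rows across buckets; the result becomes part of the partition key."""
    return sequence_number % bucket_count


if __name__ == "__main__":
    count = buckets_needed(row_count=10_000_000, avg_row_bytes=200)
    print("buckets:", count)
    print("row 45 goes to bucket", bucket_for(45, count))
```

Reads for the entity then have to query every bucket, so keep the bucket count as small as the size target allows.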

Challenge: Hot Partitions

Solution:

  • Review partition key design
  • Add more granularity to partition key
  • Consider application-level sharding
  • Cache hot data in application

Challenge: Read Before Write

Solution:

  • Use lightweight transactions (IF EXISTS / IF NOT EXISTS conditions)
  • Consider timestamp-based conflict resolution
  • Design for idempotent operations
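
Timestamp-based conflict resolution means the cell with the highest write timestamp wins, regardless of arrival order. That is what makes replayed or reordered writes safe. The Python sketch below models it with (timestamp, value) tuples. The tie-break shown (comparing values when timestamps are equal) is a simplification, and the function names are hypothetical.

```python
from functools import reduce


def last_write_wins(left: tuple, right: tuple) -> tuple:
    """Pick the (timestamp, value) cell with the higher timestamp.

    On equal timestamps the tuple comparison falls back to the value,
    which is a simplified tie-break for this sketch.
    """
    return max(left, right)


def resolve(cells: list) -> tuple:
    """Reduce any number of versions of a cell to the surviving one."""
    return reduce(last_write_wins, cells)


if __name__ == "__main__":
    versions = [(10, "old@example.com"), (20, "new@example.com")]
    print(resolve(versions))
    print(resolve(list(reversed(versions))))   # same result in any order
    print(resolve(versions + versions))        # replaying writes changes nothing
```

Because the result does not depend on order or repetition, a client can retry such a write without first reading the current value.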

Best Practices

Schema Design

  • Design tables based on query patterns
  • Keep related data in same partition
  • Limit partition size (aim for <100MB)
  • Choose appropriate compaction strategy
  • Use TTL for temporary data

Operational

  • Run repairs regularly
  • Monitor tombstone counts
  • Plan capacity in advance
  • Test with realistic data volumes
  • Use vnodes for easier scaling
  • Consider dedicated seed nodes

Application Integration

  • Use prepared statements
  • Implement retry policies
  • Use token-aware load balancing
  • Batch with caution (prefer single-partition batches; unlogged batches skip the batch log but are not atomic across partitions)
  • Consider asynchronous operations for throughput
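
A retry policy is usually some form of bounded retry with backoff, applied only to operations that are safe to repeat. The Python sketch below shows the pattern in isolation. `TransientError` and `execute_with_retry` are hypothetical names standing in for a driver's timeout exception and retry hook; real drivers ship their own retry policy classes.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver timeout or unavailable error."""


def execute_with_retry(operation, max_attempts: int = 3,
                       base_delay: float = 0.1, sleep=time.sleep):
    """Run an idempotent operation, doubling the delay after each transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise   # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_write():
        # Fails twice, then succeeds, to exercise the retry loop.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("simulated timeout")
        return "ok"

    print(execute_with_retry(flaky_write, base_delay=0.01))
```

Only wrap idempotent statements this way; retrying a counter update or a non-idempotent list append can apply the change twice.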

Resources for Further Learning

Books

  • “Cassandra: The Definitive Guide” by Jeff Carpenter and Eben Hewitt
  • “Mastering Apache Cassandra” by Nishant Neeraj
  • “Learning Apache Cassandra” by Sandeep Yarabarla

Online Courses

  • DataStax Academy courses
  • Udemy: “Apache Cassandra for Beginners”
  • Pluralsight: “Getting Started with Apache Cassandra”

Community Resources

  • Stack Overflow Cassandra tag
  • Cassandra mailing lists
  • #cassandra IRC channel
  • CQLSH cheat sheets

Remember: Cassandra excels at handling massive scale with high availability, but requires thinking differently about data modeling than traditional relational databases. Design for your queries, not your entities!