Apache Accumulo Database: Complete Reference Guide

Introduction to Apache Accumulo

Apache Accumulo is a highly scalable, distributed key-value store built on Apache Hadoop, ZooKeeper, and Thrift. Modeled after Google’s BigTable design, Accumulo extends the BigTable architecture with cell-level security and server-side programming mechanisms. It’s designed to handle massive amounts of structured data across clusters of commodity hardware with high availability and fault tolerance.

Key Characteristics:

  • Cell-level security through fine-grained access controls
  • Server-side programming through Iterators
  • High performance for massive datasets (petabytes of data and billions of entries)
  • Designed for parallel processing with MapReduce integration
  • Native support for multi-tenancy
  • Strong consistency model

Core Concepts and Architecture

Data Model

Concept             Description
Key-Value Pair      The basic unit of storage in Accumulo
Row ID              The primary key for a record
Column Family       Used to organize related columns together
Column Qualifier    Identifies specific data within a column family
Column Visibility   Controls access to data using security expressions
Timestamp           Version control for values in a cell
Value               The actual data stored in a cell

Full Key Format: row ID + column family + column qualifier + column visibility + timestamp
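
The ordering this key format implies can be sketched in plain Java. This is an illustrative model (not Accumulo's actual `Key` class): keys sort lexicographically by row ID, column family, column qualifier, and visibility, while timestamps sort in descending order so the newest version of a cell is seen first during a scan.

```java
import java.util.*;

// Toy model of Accumulo's key sort order. The real org.apache.accumulo.core.data.Key
// compares byte arrays; strings are used here for readability.
class KeyOrderSketch {
    static class SimpleKey {
        final String row, cf, cq, cv;
        final long ts;
        SimpleKey(String row, String cf, String cq, String cv, long ts) {
            this.row = row; this.cf = cf; this.cq = cq; this.cv = cv; this.ts = ts;
        }
    }

    // Lexicographic on row, cf, cq, cv; timestamp DESCENDING (newest first).
    static final Comparator<SimpleKey> KEY_ORDER =
        Comparator.comparing((SimpleKey k) -> k.row)
                  .thenComparing(k -> k.cf)
                  .thenComparing(k -> k.cq)
                  .thenComparing(k -> k.cv)
                  .thenComparing(Comparator.comparingLong((SimpleKey k) -> k.ts).reversed());

    public static void main(String[] args) {
        List<SimpleKey> keys = new ArrayList<>(List.of(
            new SimpleKey("row1", "cf", "cq", "", 100L),
            new SimpleKey("row0", "cf", "cq", "", 50L),
            new SimpleKey("row1", "cf", "cq", "", 200L)));
        keys.sort(KEY_ORDER);
        // row0 sorts first; within row1 the newest timestamp (200) precedes 100
        for (SimpleKey k : keys) System.out.println(k.row + " ts=" + k.ts);
    }
}
```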

Accumulo Components

Component           Function
Tablet Server       Handles read/write operations for a subset of tablets
Master Server       Coordinates tablet servers and performs administrative operations
Garbage Collector   Removes files no longer referenced by any table from HDFS
Monitor             Web interface for monitoring the system
Tracer              Tracks timing of operations across the distributed system
ZooKeeper           Coordinates the distributed components and stores metadata
Tablet              A contiguous partition of a table
Minor Compaction    Flushes in-memory data to a new file in HDFS
Major Compaction    Merges multiple files in HDFS into a single file
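
Conceptually, a major compaction is a k-way merge of sorted files. The sketch below is not Accumulo source code, just an illustration of the idea: each "file" is a sorted list of keys, and a priority queue merges them into one sorted output, the same way Accumulo merges sorted RFiles into a single new file.

```java
import java.util.*;

// Illustrative k-way merge, the core operation of a major compaction.
class CompactionMergeSketch {
    static List<String> merge(List<List<String>> sortedFiles) {
        // Each queue entry is {fileIndex, offsetWithinFile}, ordered by the key it points at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedFiles.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (!sortedFiles.get(i).isEmpty()) queue.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            List<String> file = sortedFiles.get(head[0]);
            merged.add(file.get(head[1]));
            // Advance within the same file, if it has more entries.
            if (head[1] + 1 < file.size()) queue.add(new int[] {head[0], head[1] + 1});
        }
        return merged;
    }

    public static void main(String[] args) {
        // prints [a, b, c, d]
        System.out.println(merge(List.of(List.of("a", "d"), List.of("b", "c"))));
    }
}
```

Real compactions additionally apply configured iterators and drop deleted or aged-off entries during this merge.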

Installation and Setup

Prerequisites

  • Java 8 or newer
  • Hadoop Distributed File System (HDFS)
  • Apache ZooKeeper ensemble
  • Sufficient disk space and memory

Basic Installation Steps

  1. Download and extract:

    $ wget https://downloads.apache.org/accumulo/[VERSION]/accumulo-[VERSION]-bin.tar.gz
    $ tar -xzf accumulo-[VERSION]-bin.tar.gz
    $ mv accumulo-[VERSION] /path/to/accumulo
    
  2. Configure environment variables:

    $ export ACCUMULO_HOME=/path/to/accumulo
    $ export PATH=$PATH:$ACCUMULO_HOME/bin
    
  3. Configure core properties in accumulo-site.xml (Accumulo 1.x; Accumulo 2.x uses accumulo.properties with equivalent keys):

    <property>
      <name>instance.volumes</name>
      <value>hdfs://namenode:8020/accumulo</value>
    </property>
    <property>
      <name>instance.zookeeper.host</name>
      <value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
    </property>
    
  4. Initialize Accumulo:

    $ accumulo init --instance-name myinstance --password mypassword
    
  5. Start Accumulo:

    $ accumulo-cluster start
    

Basic Operations

Command Line Interface

Login to shell:

$ accumulo shell -u myuser -p mypassword

Basic commands:

Command          Description                Example
help             Show available commands    help
tables           List tables                tables
createtable      Create a new table         createtable mytable
deletetable      Delete a table             deletetable mytable
scan             Scan records in a table    scan -t mytable
insert           Insert a record            insert row cf cq value
delete           Delete a record            delete row cf cq
flush            Flush a table’s data       flush -t mytable
compact          Compact a table            compact -t mytable
createuser       Create a new user          createuser myuser
userpermissions  Show user permissions      userpermissions -u myuser
exit             Exit the shell             exit

Working with Tables

Create a table:

shell> createtable mytable

Add data:

shell> insert row1 columnFamily columnQualifier value1
shell> insert row2 columnFamily columnQualifier value2

Scan data:

shell> scan -t mytable

Delete data:

shell> delete row1 columnFamily columnQualifier

Delete table:

shell> deletetable mytable

Table Configurations

Set table properties:

shell> config -t mytable -s table.cache.block.enable=true

View table properties (add -f <string> to filter by name):

shell> config -t mytable

Table splits:

shell> addsplits -t mytable -sf splitfile.txt

Advanced Features

Security

Table Permissions

Permission    Description
READ          Can read table data
WRITE         Can write to the table
BULK_IMPORT   Can bulk import files
ALTER_TABLE   Can alter table properties
GRANT         Can grant permissions to others
DROP_TABLE    Can delete the table

Grant permissions:

shell> grant Table.READ -t mytable -u user1
shell> grant Table.WRITE -t mytable -u user1

Revoke permissions:

shell> revoke Table.WRITE -t mytable -u user1

Cell-Level Security

Insert with visibility labels:

shell> insert row1 cf cq value -l "(A&B)|C"

Scan with authorizations:

shell> scan -t mytable -s A,B

Set user authorizations:

shell> setauths -u user1 -s A,B,C
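
The visibility expression above is evaluated against the scanning user's authorizations at read time. The toy evaluator below sketches that idea; it is not Accumulo's parser (the real `ColumnVisibility`/`VisibilityEvaluator` classes also handle quoting, and Accumulo requires parentheses whenever & and | are mixed, whereas this sketch simply gives | lower precedence).

```java
import java.util.*;

// Toy recursive-descent evaluator for visibility expressions like "(A&B)|C".
class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr; this.auths = auths;
    }

    static boolean evaluate(String expr, Set<String> auths) {
        if (expr.isEmpty()) return true;   // empty visibility: readable by anyone
        VisibilitySketch p = new VisibilitySketch(expr, auths);
        boolean result = p.parseOr();
        if (p.pos != expr.length()) throw new IllegalArgumentException("bad expression");
        return result;
    }

    private boolean parseOr() {
        boolean v = parseAnd();
        while (pos < expr.length() && expr.charAt(pos) == '|') { pos++; v |= parseAnd(); }
        return v;
    }

    private boolean parseAnd() {
        boolean v = parseTerm();
        while (pos < expr.length() && expr.charAt(pos) == '&') { pos++; v &= parseTerm(); }
        return v;
    }

    private boolean parseTerm() {
        if (expr.charAt(pos) == '(') {
            pos++;                         // consume '('
            boolean v = parseOr();
            pos++;                         // consume ')'
            return v;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) pos++;
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(A&B)|C", Set.of("A", "B")));  // true
        System.out.println(evaluate("(A&B)|C", Set.of("A")));       // false
        System.out.println(evaluate("(A&B)|C", Set.of("C")));       // true
    }
}
```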

Iterators

Iterators are server-side programming mechanisms that allow data transformation, filtering, and aggregation operations during scans, compactions, or both.

Types of Iterators:

Type               When Applied              Use Case
Scan-time          During scan operations    Query-time filtering/transformations
Minor compaction   During minor compactions  Data reduction during compaction
Major compaction   During major compactions  Permanent data transformations

Iterator Priority Levels (lower priorities are applied first, closest to the data):

Priority Range   Purpose
1-9              User iterators (application-specific)
10-19            Accumulo system iterators
20-29            Accumulo default iterators

Attach iterator to a table:

shell> config -t mytable -s table.iterator.scan.myscanner=10,org.example.MyIterator
shell> config -t mytable -s table.iterator.scan.myscanner.opt.myOption=myValue

Common Built-in Iterators:

Iterator            Purpose
VersioningIterator  Limits entries to the most recent timestamps
RegExFilter         Filters based on regular expressions
AgeOffFilter        Removes entries older than a threshold
WholeRowIterator    Processes entire rows at once
SummingCombiner     Aggregates numeric values
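
As a concrete example of filter-style iterator logic, the sketch below mimics what AgeOffFilter does: during scans or compactions the filter's accept() method is consulted for each key-value pair, and entries older than currentTime - ttl are suppressed. The map-of-timestamps representation is a simplification for illustration, not Accumulo's iterator API.

```java
import java.util.*;
import java.util.stream.*;

// Toy version of AgeOffFilter's accept logic over (timestamp -> value) entries.
class AgeOffSketch {
    // Keep only entries whose timestamp is within ttlMillis of 'now'.
    static Map<Long, String> ageOff(Map<Long, String> entries, long now, long ttlMillis) {
        return entries.entrySet().stream()
            .filter(e -> now - e.getKey() <= ttlMillis)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // Entry at ts=100 is 900ms old and is dropped; entry at ts=900 survives.
        System.out.println(ageOff(Map.of(100L, "old", 900L, "new"), 1000L, 200L));
    }
}
```

When such a filter runs at major-compaction time the aged-off entries are removed from disk permanently; at scan time they are merely hidden from the query.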

Bulk Import

  1. Generate RFiles sorted in Accumulo's key order, typically with a MapReduce job that writes through AccumuloFileOutputFormat (or with the RFile writer in the client API).

  2. Import the files from the shell (the trailing boolean controls whether timestamps are assigned at import time):

    shell> importdirectory /path/to/rfiles /path/to/failures true
    

Accumulo Client API (Java)

Basic connection:

// Accumulo 2.x style
AccumuloClient client = Accumulo.newClient()
    .from("/path/to/client.properties")
    .build();

// Or directly:
AccumuloClient client = Accumulo.newClient()
    .to("myinstance", "zoo1:2181,zoo2:2181")
    .as("user", "password")
    .build();

// client.properties for the first form would contain:
//   instance.name=myinstance
//   instance.zookeepers=zoo1:2181,zoo2:2181
//   auth.type=password
//   auth.principal=user
//   auth.token=password

Write data:

try (BatchWriter writer = client.createBatchWriter("mytable")) {
    Mutation mutation = new Mutation("row1");
    mutation.put("cf", "cq", "value");
    writer.addMutation(mutation);
} // close() flushes any buffered mutations

Read data:

try (Scanner scanner = client.createScanner("mytable", Authorizations.EMPTY)) {
    scanner.setRange(Range.exact("row1"));
    scanner.fetchColumn(new Text("cf"), new Text("cq"));
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}

Performance Tuning

Memory Management

Property                  Description                                          Recommendation
tserver.memory.maps.max   Maximum memory for in-memory maps                    25-30% of available RAM
tserver.cache.data.size   Size of the data block cache                         15-20% of available RAM
tserver.cache.index.size  Size of the index cache                              5-10% of available RAM
tserver.sort.buffer.size  Buffer for sorting write-ahead logs during recovery  10% of in-memory map size

Compaction Settings

Configure compaction thresholds:

shell> config -t mytable -s table.compaction.major.ratio=3
shell> config -t mytable -s table.file.max=15

Write Performance

Strategy                       Description
Increase batch writer threads  Raise BatchWriterConfig.setMaxWriteThreads() on the client for more parallelism
Increase batch writer memory   Raise BatchWriterConfig.setMaxMemory() on the client for larger batches
Balance split points           Ensure even distribution with proper pre-splitting
Tune compaction settings       Adjust ratios to avoid compaction storms

Read Performance

Strategy                  Description
Increase scanner threads  Set tserver.readahead.concurrent.max higher
Tune cache sizes          Balance tserver.cache.data.size and tserver.cache.index.size
Use locality groups       Group related column families for better scan performance
Use bloom filters         Enable for tables with random access patterns

Common Challenges and Solutions

Challenge                    Solution
Slow table scans             Check for appropriate split points, increase scanner threads, use proper scan authorizations
Out-of-memory errors         Reduce the in-memory map size, check for memory leaks in iterators, tune JVM settings
Write performance issues     Increase batch writer threads and memory, check for bottlenecks in preprocessing
ZooKeeper connection issues  Ensure a proper ZooKeeper quorum, check network connectivity, validate configuration
CompactionExecutor warnings  Increase compaction threads, tune compaction ratios, balance tablet distribution
Tablet server failures       Check for balancing issues, monitor resource usage, investigate logs for errors
HDFS latency issues          Validate HDFS health, check DataNode distribution, examine network performance

Best Practices

Design Guidelines

  • Row key design: Create row keys that distribute load evenly
  • Column family usage: Use a small number of column families
  • Iterators: Implement efficient iterators that minimize memory usage
  • Locality groups: Group related column families for faster scans
  • Splits: Pre-split tables based on expected data distribution
  • Write patterns: Avoid “hot-spotting” with well-distributed row keys
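
The hot-spotting guideline above is commonly addressed by sharding: prepending a bounded hash prefix to an otherwise monotonically increasing row key (such as a timestamp) so sequential writes spread across tablets instead of piling onto the last one. The sketch below illustrates the technique; the shard count of 8 and the key format are arbitrary illustrative choices.

```java
// Sketch of hash-sharded row keys to avoid write hot-spotting.
class ShardedRowKeySketch {
    static String shardedKey(String naturalKey, int shards) {
        // Deterministic shard in [0, shards); floorMod guards against negative hashCodes.
        int shard = Math.floorMod(naturalKey.hashCode(), shards);
        // Zero-padded prefix preserves lexicographic ordering within each shard.
        return String.format("%02d_%s", shard, naturalKey);
    }

    public static void main(String[] args) {
        System.out.println(shardedKey("20240101T000000_event42", 8));
    }
}
```

The trade-off is that range scans over the natural key must now fan out across all shards, so shard counts are kept small and are typically paired with pre-splitting the table at the shard boundaries.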

Operational Tips

  • Monitoring: Set up alerts for important metrics like tablet server health
  • Regular maintenance: Schedule compactions during off-peak hours
  • Backups: Regularly export important tables
  • Upgrades: Test upgrades in a staging environment before production
  • Resource planning: Monitor and plan for growth in data size
  • Security updates: Regularly review and update security policies
  • Client connections: Use connection pooling and proper retry logic

Resources for Further Learning

Books

  • “Accumulo: Application Development, Table Design, and Best Practices” by Aaron Cordova, Michael Wall, and Billie Rinaldi
  • “Apache Accumulo for Developers” by Guðmundur Jón Halldórsson
