Databricks Commands Complete Cheat Sheet – Essential CLI, SQL & Notebook Reference

What is Databricks?

Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on a cloud-based Apache Spark environment. It provides collaborative notebooks, automated cluster management, and integrated MLOps capabilities for processing large-scale data workloads.

Why Databricks Commands Matter:

  • Streamline data pipeline development and deployment
  • Enable efficient cluster and resource management
  • Automate data processing and ML workflows
  • Facilitate collaboration between data teams
  • Optimize performance for big data analytics
  • Simplify integration with cloud storage and services

Core Concepts & Principles

Databricks Architecture Components

Workspace

  • Collaborative environment for notebooks and jobs
  • Centralized location for code, data, and models
  • Role-based access control and sharing

Clusters

  • Managed Apache Spark compute resources
  • Auto-scaling and auto-termination capabilities
  • Support for multiple Spark versions and configurations

Notebooks

  • Interactive development environment
  • Support for Python, Scala, SQL, and R
  • Built-in visualizations and collaboration features

Jobs

  • Scheduled and triggered data processing workflows
  • Support for notebook and JAR-based jobs
  • Monitoring and alerting capabilities

Delta Lake

  • ACID transactions for data lakes
  • Schema evolution and time travel
  • Unified batch and streaming data processing

Step-by-Step Setup Process

Phase 1: Environment Setup

  1. Install Databricks CLI (the pip package is the legacy CLI, whose syntax this cheat sheet uses)

    pip install databricks-cli
    
  2. Configure Authentication

    databricks configure --token
    
  3. Verify Connection

    databricks workspace ls
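
For reference, the legacy CLI stores what `databricks configure --token` collects in an INI file at `~/.databrickscfg`, one section per profile. The sketch below only renders that text in memory; the host URL and token are placeholders, and `render_databrickscfg` is a hypothetical helper name, not part of the CLI.

```python
import configparser
import io


def render_databrickscfg(host, token, profile="DEFAULT"):
    """Render the INI text for a single CLI profile (values are placeholders)."""
    # interpolation=None keeps any '%' characters in a token from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config[profile] = {"host": host, "token": token}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical workspace URL and token, shown only to illustrate the layout.
print(render_databrickscfg("https://example.cloud.databricks.com", "dapi-placeholder", "dev"))
```

Keeping one section per environment is what lets `--profile dev` select the matching credentials later.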
    

Phase 2: Workspace Configuration

  1. Set Up Workspace Structure

    • Create folders for different projects
    • Organize notebooks by team or function
    • Set up shared libraries and utilities
  2. Configure Cluster Policies

    • Define resource limits and permissions
    • Set auto-scaling parameters
    • Configure security and network settings
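
As a rough illustration of the resource limits and auto-scaling parameters listed above, a cluster policy is defined as a JSON document that maps cluster attributes to constraints. The sketch below assumes the `range`, `fixed`, and `allowlist` constraint types from the policy definition format; the function name and all values are hypothetical.

```python
import json


def build_cluster_policy(max_workers, idle_minutes, node_types):
    """Build a policy definition dict; every value here is illustrative."""
    return {
        # Cap how far auto-scaling may grow the cluster.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Force a fixed idle timeout and hide the field from users.
        "autotermination_minutes": {"type": "fixed", "value": idle_minutes, "hidden": True},
        # Restrict the instance types users may pick.
        "node_type_id": {"type": "allowlist", "values": list(node_types)},
    }


policy_json = json.dumps(build_cluster_policy(8, 30, ["i3.xlarge", "i3.2xlarge"]), indent=2)
print(policy_json)
```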

Phase 3: Development Workflow

  1. Create and Configure Clusters

    • Choose appropriate instance types
    • Configure Spark settings
    • Install required libraries
  2. Develop and Test Code

    • Use interactive notebooks for exploration
    • Create reusable functions and modules
    • Implement error handling and logging
  3. Deploy and Schedule Jobs

    • Convert notebooks to scheduled jobs
    • Set up monitoring and alerting
    • Implement CI/CD workflows

Essential Databricks CLI Commands

Authentication & Configuration

Command | Description | Example
databricks configure | Set up authentication | databricks configure --token
databricks configure --list | Show current configuration | databricks configure --list
databricks configure --profile | Use named profiles | databricks configure --profile dev

Workspace Management

Command | Description | Example
databricks workspace ls | List workspace items | databricks workspace ls /Users
databricks workspace import | Import notebook/file | databricks workspace import -l PYTHON notebook.py /Users/me/notebook
databricks workspace export | Export notebook/file | databricks workspace export /Users/me/notebook notebook.py
databricks workspace delete | Delete workspace item | databricks workspace delete /Users/me/old_notebook
databricks workspace mkdirs | Create directories | databricks workspace mkdirs /Shared/team_folder

Cluster Operations

Command | Description | Example
databricks clusters list | List all clusters | databricks clusters list
databricks clusters create | Create new cluster | databricks clusters create --json-file cluster.json
databricks clusters start | Start cluster | databricks clusters start --cluster-id 1234-567890-abc123
databricks clusters restart | Restart cluster | databricks clusters restart --cluster-id 1234-567890-abc123
databricks clusters delete | Terminate cluster (permanent-delete removes it entirely) | databricks clusters delete --cluster-id 1234-567890-abc123
databricks clusters get | Get cluster details | databricks clusters get --cluster-id 1234-567890-abc123
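
The `cluster.json` file passed to `databricks clusters create --json-file` could look something like the output of this sketch. The field names follow the Clusters API, but the runtime version, node type, and sizes are placeholder assumptions that depend on your cloud and workspace, and `build_cluster_spec` is a hypothetical helper.

```python
import json


def build_cluster_spec(name, min_workers=1, max_workers=4, idle_minutes=30):
    """Return a cluster spec dict suitable for dumping to cluster.json."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder (AWS) instance type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,
        "spark_conf": {"spark.sql.adaptive.enabled": "true"},
    }


cluster_json = json.dumps(build_cluster_spec("etl-dev"), indent=2)
print(cluster_json)  # write this text to cluster.json before calling the CLI
```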

Job Management

Command | Description | Example
databricks jobs list | List all jobs | databricks jobs list
databricks jobs create | Create new job | databricks jobs create --json-file job.json
databricks jobs run-now | Run job immediately | databricks jobs run-now --job-id 123
databricks jobs delete | Delete job | databricks jobs delete --job-id 123
databricks runs list | List job runs | databricks runs list --job-id 123
databricks runs get | Get run details | databricks runs get --run-id 456
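
A `job.json` for `databricks jobs create --json-file` might be generated along these lines. This is a sketch that assumes the single-task Jobs API 2.0 layout, which the legacy CLI uses by default (newer API versions nest these settings under a `tasks` list); the notebook path, cluster ID, and e-mail address are placeholders, and `build_job_spec` is a hypothetical helper.

```python
import json


def build_job_spec(name, notebook_path, cluster_id, cron, timezone="UTC", notify=None):
    """Return a scheduled notebook job spec (Jobs API 2.0 style)."""
    spec = {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
        # Quartz cron fields: seconds minutes hours day-of-month month day-of-week
        "schedule": {"quartz_cron_expression": cron, "timezone_id": timezone},
        "max_retries": 1,
    }
    if notify:
        spec["email_notifications"] = {"on_failure": list(notify)}
    return spec


job_json = json.dumps(
    build_job_spec(
        "nightly-etl",
        "/Users/me/notebook",
        "1234-567890-abc123",
        "0 0 2 * * ?",  # every day at 02:00
        notify=["data-team@example.com"],
    ),
    indent=2,
)
print(job_json)
```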

File System Operations (DBFS)

Command | Description | Example
databricks fs ls | List files/directories | databricks fs ls dbfs:/mnt/data/
databricks fs cp | Copy files | databricks fs cp local_file.csv dbfs:/tmp/
databricks fs rm | Remove files | databricks fs rm dbfs:/tmp/old_file.csv
databricks fs mkdirs | Create directories | databricks fs mkdirs dbfs:/mnt/project/
databricks fs cat | Display file contents | databricks fs cat dbfs:/tmp/config.json
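
When these commands run from a deployment script rather than a terminal, it can help to build the argument list in code. The sketch below assembles a `databricks fs cp` invocation using the legacy CLI's `--recursive`, `--overwrite`, and `--profile` options and only prints it; `fs_cp_command` and `run_command` are hypothetical helper names, and `run_command` is defined but never called here.

```python
import shlex
import subprocess


def fs_cp_command(src, dst, recursive=False, overwrite=False, profile=None):
    """Build the argv list for a `databricks fs cp` call."""
    cmd = ["databricks", "fs", "cp"]
    if recursive:
        cmd.append("--recursive")
    if overwrite:
        cmd.append("--overwrite")
    if profile:
        cmd += ["--profile", profile]
    return cmd + [src, dst]


def run_command(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


copy_cmd = fs_cp_command("local_file.csv", "dbfs:/tmp/", overwrite=True)
print(shlex.join(copy_cmd))
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.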

Essential Notebook Commands

Magic Commands

Command | Description | Example
%python | Switch to Python | %python print("Hello World")
%scala | Switch to Scala | %scala println("Hello World")
%sql | Execute SQL | %sql SELECT * FROM table LIMIT 10
%r | Switch to R | %r print("Hello World")
%sh | Execute shell commands | %sh ls -la /tmp
%fs | File system operations | %fs ls /mnt/data
%run | Run another notebook | %run ./helper_functions
%md | Markdown cell | %md # This is a header

Display and Visualization

Command | Description | Example
display() | Show DataFrame with formatting | display(df)
displayHTML() | Render HTML content | displayHTML("<h1>Title</h1>")
dbutils.notebook.exit() | Exit notebook with value | dbutils.notebook.exit("Success")

Widget Commands

Command | Description | Example
dbutils.widgets.text() | Create text widget | dbutils.widgets.text("name", "default")
dbutils.widgets.dropdown() | Create dropdown widget | dbutils.widgets.dropdown("env", "prod", ["dev", "prod"])
dbutils.widgets.get() | Get widget value | env = dbutils.widgets.get("env")
dbutils.widgets.remove() | Remove widget | dbutils.widgets.remove("name")
dbutils.widgets.removeAll() | Remove all widgets | dbutils.widgets.removeAll()
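
Widget values always arrive as strings, so a notebook that branches on them may want a small validation step. The following is a minimal sketch in plain Python; `resolve_env` and `ALLOWED_ENVS` are hypothetical names, and inside a notebook the raw value would come from `dbutils.widgets.get("env")`.

```python
ALLOWED_ENVS = ("dev", "prod")


def resolve_env(raw_value, allowed=ALLOWED_ENVS, default="dev"):
    """Normalize a widget value and reject anything outside the allowed set."""
    value = (raw_value or "").strip().lower()
    if not value:
        return default  # empty widget falls back to the default environment
    if value not in allowed:
        raise ValueError(f"Unsupported environment {raw_value!r}; expected one of {allowed}")
    return value


# In a notebook: env = resolve_env(dbutils.widgets.get("env"))
env = resolve_env(" PROD ")
print(env)
```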

Spark SQL Commands

Data Definition Language (DDL)

Command | Description | Example
CREATE TABLE | Create new table | CREATE TABLE users (id INT, name STRING, age INT) USING DELTA
DROP TABLE | Delete table | DROP TABLE IF EXISTS temp_table
ALTER TABLE | Modify table structure | ALTER TABLE users ADD COLUMN email STRING
DESCRIBE | Show table schema | DESCRIBE EXTENDED users
SHOW TABLES | List all tables | SHOW TABLES IN database_name
SHOW DATABASES | List all databases | SHOW DATABASES

Data Manipulation Language (DML)

Command | Description | Example
SELECT | Query data | SELECT * FROM users WHERE age > 25
INSERT | Insert new data | INSERT INTO users (id, name, age) VALUES (1, 'John', 30)
UPDATE | Update existing data | UPDATE users SET age = 31 WHERE id = 1
DELETE | Delete data | DELETE FROM users WHERE age < 18
MERGE | Upsert operation | MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
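
Because a full MERGE statement is long, pipelines often assemble it from a list of key and value columns. The helper below is a sketch of that idea, not a Databricks API: `build_merge_sql` is a hypothetical name, the table and column names are examples, and identifiers are interpolated without escaping, so they should only come from trusted configuration.

```python
def build_merge_sql(target, source, keys, update_cols):
    """Build a Delta MERGE (upsert) statement from key and value column names."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    all_cols = list(keys) + list(update_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


merge_sql = build_merge_sql("users", "users_updates", ["id"], ["name", "age"])
print(merge_sql)  # in a notebook: spark.sql(merge_sql)
```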

Performance & Optimization

Command | Description | Example
CACHE TABLE | Cache table in memory | CACHE TABLE users
UNCACHE TABLE | Remove table from cache | UNCACHE TABLE users
ANALYZE TABLE | Collect table statistics | ANALYZE TABLE users COMPUTE STATISTICS
OPTIMIZE | Compact Delta tables | OPTIMIZE users ZORDER BY (date)
VACUUM | Clean up old files | VACUUM users RETAIN 168 HOURS
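
The 168 hours in the VACUUM example is simply 7 days, which is also Delta's default retention threshold. A scheduled maintenance notebook could generate both statements per table along these lines; `maintenance_statements` is a hypothetical helper, and refusing retention below 7 days is an assumption chosen to mirror that default safety check.

```python
def maintenance_statements(table, zorder_cols=(), retain_days=7):
    """Return the OPTIMIZE and VACUUM statements for one Delta table."""
    if retain_days < 7:
        # Shorter retention can break time travel and concurrent readers.
        raise ValueError("retain_days below 7 is refused in this sketch")
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    vacuum = f"VACUUM {table} RETAIN {retain_days * 24} HOURS"
    return [optimize, vacuum]


statements = maintenance_statements("users", ["date"])
for statement in statements:
    print(statement)  # in a notebook: spark.sql(statement)
```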

Delta Lake Commands

Delta Operations

Command | Description | Example
DESCRIBE HISTORY | Show table history | DESCRIBE HISTORY users
RESTORE TABLE | Time travel restore | RESTORE TABLE users TO TIMESTAMP AS OF '2023-01-01'
CONVERT TO DELTA | Convert Parquet to Delta | CONVERT TO DELTA parquet.`/path/to/table`
FSCK REPAIR TABLE | Repair Delta table | FSCK REPAIR TABLE users

Delta Lake in Python/Scala

# Read Delta table
df = spark.read.format("delta").load("/path/to/delta-table")

# Write Delta table
df.write.format("delta").mode("overwrite").save("/path/to/delta-table")

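# Streaming with Delta (the streaming write needs a checkpoint location)
spark.readStream.format("delta").load("/path") \
    .writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/checkpoints") \
    .start("/path/output")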

# Time travel
df = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("/path")

Common Integration Patterns

Cloud Storage Integration

Storage Type | Mount Command | Access Pattern
AWS S3 | dbutils.fs.mount() with S3 credentials | s3a://bucket/path
Azure Data Lake Storage Gen2 | dbutils.fs.mount() with Azure credentials | abfss://container@account.dfs.core.windows.net/path
Google Cloud Storage | dbutils.fs.mount() with GCS credentials | gs://bucket/path

Database Connections

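# JDBC connection (replace <host>, <port>, and <database> with real values)
jdbc_url = "jdbc:postgresql://<host>:<port>/<database>"

# Read the password from a secret scope instead of hard-coding it;
# the scope and key names here are hypothetical placeholders
jdbc_password = dbutils.secrets.get(scope="jdbc", key="password")

# Read from database
df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "table_name") \
    .option("user", "username") \
    .option("password", jdbc_password) \
    .load()

# Write to database (overwrite mode replaces the contents of target_table)
df.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "target_table") \
    .option("user", "username") \
    .option("password", jdbc_password) \
    .mode("overwrite") \
    .save()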
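
Since the read and the write share the same connection settings, one option is to build them once as a dictionary and pass them with `.options(**opts)`. This sketch runs without Spark; `jdbc_options` is a hypothetical helper, the host, database, and user names are placeholders, and reading the password from a `DB_PASSWORD` environment variable is an assumption (a secret scope is the usual choice inside Databricks).

```python
import os


def jdbc_options(host, port, database, table, user, password):
    """Return the option dict for a PostgreSQL JDBC read or write."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


opts = jdbc_options(
    "db.example.internal", 5432, "analytics", "public.orders",
    "etl_user", os.environ.get("DB_PASSWORD", ""),
)
print(opts["url"])
# In a notebook: df = spark.read.format("jdbc").options(**opts).load()
```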

Common Challenges & Solutions

Performance Issues

Challenge: Slow query execution and poor resource utilization

Solutions:

  • Use appropriate cluster sizing and auto-scaling
  • Implement data partitioning and Z-ordering
  • Cache frequently accessed data
  • Optimize join strategies, for example by broadcasting small lookup tables
  • Use Delta Lake for ACID transactions and optimization

Memory Management

Challenge: Out-of-memory errors and inefficient resource usage

Solutions:

  • Configure executor memory and cores appropriately
  • Use .coalesce() and .repartition() for optimal partitioning
  • Take advantage of lazy evaluation: chain transformations and avoid collect() on large DataFrames
  • Use columnar storage formats (Parquet, Delta)
  • Enable adaptive query execution (AQE)
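
To make the partitioning advice concrete, a common rule of thumb is to aim for partitions of roughly 128 MB, which matches Spark's default maximum bytes per file-read partition. The sketch below turns a dataset size into a partition count under that assumption; `target_partitions` is a hypothetical helper and the 128 MB figure is a starting point, not a guarantee.

```python
import math


def target_partitions(dataset_bytes, target_partition_mb=128, min_partitions=1):
    """Estimate a partition count so each partition holds about target_partition_mb."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(min_partitions, math.ceil(dataset_bytes / target_bytes))


ten_gib = 10 * 1024 ** 3
n = target_partitions(ten_gib)
print(n)
# In a notebook: df.repartition(n) to increase the partition count (full shuffle),
# or df.coalesce(n) to reduce it without a full shuffle.
```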

Data Quality & Consistency

Challenge: Ensuring data reliability and handling schema changes

Solutions:

  • Implement data validation and quality checks
  • Use Delta Lake for schema evolution
  • Set up monitoring and alerting for data pipelines
  • Implement error handling and retry mechanisms
  • Use structured streaming for real-time data processing
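
The error handling and retry point above can be sketched as a small decorator with exponential backoff. This is one possible shape rather than a Databricks feature: `retry`, `flaky_load`, and the injected `sleep` parameter are hypothetical names, and the demo replaces real sleeping with a list that records the delays.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")


def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = base_delay * 2 ** (attempt - 1)
                    logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
                    sleep(delay)
        return wrapper
    return decorator


# Demo: fail twice, then succeed; delays are recorded instead of slept.
recorded_delays = []
attempts = {"count": 0}


@retry(max_attempts=3, base_delay=0.5, sleep=recorded_delays.append)
def flaky_load():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient storage error")
    return "loaded"


result = flaky_load()
print(result, recorded_delays)
```

Restricting `exceptions` to known transient error types keeps genuine bugs from being retried silently.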

Security & Access Control

Challenge: Managing permissions and secure data access

Solutions:

  • Implement fine-grained access controls
  • Use service principals for authentication
  • Encrypt data at rest and in transit
  • Implement data masking and anonymization
  • Run regular security audits and compliance checks
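
As a small illustration of data masking, one approach is to replace the local part of an e-mail address with a salted hash so that records can still be joined on the masked value. This is a sketch in plain Python; `mask_email` is a hypothetical helper, the salt is a placeholder that would normally come from a secret, and salted hashing is pseudonymization rather than full anonymization.

```python
import hashlib


def mask_email(email, salt):
    """Replace the local part of an e-mail address with a short salted SHA-256 digest."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local.lower()).encode("utf-8")).hexdigest()[:12]
    # Keep the domain so aggregate analysis by provider still works.
    return f"{digest}@{domain}" if domain else digest


masked = mask_email("Jane.Doe@example.com", salt="placeholder-salt")
print(masked)
```

In a notebook the same function could be wrapped as a UDF, although Spark's built-in `sha2` function is usually the faster choice on large tables.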

Best Practices & Practical Tips

Cluster Configuration

  • Choose appropriate instance types for your workload
  • Enable auto-scaling to handle variable loads
  • Set appropriate idle timeout to reduce costs
  • Use spot instances for non-critical workloads
  • Configure cluster policies for consistent settings

Code Organization

  • Structure notebooks with clear sections and documentation
  • Use modular code with reusable functions
  • Implement proper error handling and logging
  • Version control notebooks using Git integration
  • Create shared libraries for common functionality

Performance Optimization

  • Use broadcast joins for small lookup tables
  • Implement proper data partitioning strategies
  • Cache intermediate results when appropriate
  • Use columnar file formats (Parquet, Delta)
  • Monitor query plans and optimize accordingly

Data Management

  • Implement proper data lifecycle management
  • Use Delta Lake for ACID transactions
  • Set up regular OPTIMIZE and VACUUM operations
  • Monitor storage costs and usage patterns
  • Implement data retention policies

Monitoring & Debugging

  • Use Spark UI for performance analysis
  • Implement comprehensive logging
  • Set up alerts for job failures and performance issues
  • Monitor cluster utilization and costs
  • Use profiling tools for code optimization

MLflow Integration Commands

Model Management

Command | Description | Example
mlflow.start_run() | Start MLflow run | with mlflow.start_run(): # training code
mlflow.log_metric() | Log metrics | mlflow.log_metric("accuracy", 0.95)
mlflow.log_param() | Log parameters | mlflow.log_param("learning_rate", 0.01)
mlflow.sklearn.log_model() | Log model | mlflow.sklearn.log_model(model, "model")
mlflow.register_model() | Register model | mlflow.register_model("runs:/<run_id>/model", "MyModel")

Troubleshooting Commands

Debug Information

Command | Description | Example
spark.conf.get() | Get Spark configuration | spark.conf.get("spark.sql.adaptive.enabled")
spark.sparkContext.getConf().getAll() | Get all Spark configs | spark.sparkContext.getConf().getAll()
df.explain() | Show query execution plan | df.explain(True)
df.cache().count() | Force computation and cache | df.cache().count()
spark.catalog.listTables() | List available tables | spark.catalog.listTables()

Resources for Further Learning

Official Documentation & Guides

  • Databricks Documentation: Comprehensive platform documentation
  • Apache Spark Documentation: Core Spark functionality and APIs
  • Delta Lake Documentation: Advanced data lake operations
  • MLflow Documentation: Machine learning lifecycle management

Training & Certification

  • Databricks Academy: Official training courses and certifications
  • Databricks Certified Associate Developer: Entry-level certification
  • Databricks Certified Professional Data Scientist: Advanced ML certification
  • Apache Spark Certification: Industry-recognized Spark expertise

Books & Publications

  • “Learning Spark: Lightning-Fast Data Analytics” by Jules Damji
  • “High Performance Spark” by Holden Karau
  • “Spark: The Definitive Guide” by Bill Chambers
  • “Delta Lake: The Definitive Guide” by Denny Lee

Community Resources

  • Databricks Community Forums: User discussions and Q&A
  • Stack Overflow: Technical questions and solutions
  • GitHub: Open source projects and examples
  • Medium: Technical articles and use cases
  • YouTube: Video tutorials and conference talks

Practice Environments

  • Databricks Community Edition: Free tier for learning
  • Try Databricks: Platform trials and demos
  • Azure Databricks: Microsoft cloud integration
  • AWS Databricks: Amazon cloud integration
  • Google Cloud Databricks: Google cloud integration

This cheatsheet serves as a comprehensive reference for Databricks commands and operations. Bookmark this guide for quick access during your data engineering and analytics projects.
