Databricks Commands Complete Cheat Sheet – Essential CLI, SQL & Notebook Reference

What is Databricks?

Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on a cloud-based Apache Spark environment. It provides collaborative notebooks, automated cluster management, and integrated MLOps capabilities for processing large-scale data workloads.

Why Databricks Commands Matter:

Streamline data pipeline development and deployment
Enable efficient cluster and resource management
Automate data processing and ML workflows
Facilitate collaboration between data teams
Optimize performance for big data analytics
Simplify integration with cloud storage and services

Core Concepts & Principles

Databricks Architecture Components

Workspace

Collaborative environment for notebooks and jobs
Centralized location for code, data, and models
Role-based access control and sharing

Clusters

Managed Apache Spark compute resources
Auto-scaling and auto-termination capabilities
Support for multiple Spark versions and configurations

Notebooks

Interactive development environment
Support for Python, Scala, SQL, and R
Built-in visualizations and collaboration features

Jobs

Scheduled and triggered data processing workflows
Support for notebook and JAR-based jobs
Monitoring and alerting capabilities

Delta Lake

ACID transactions for data lakes
Schema evolution and time travel
Unified batch and streaming data processing

Step-by-Step Setup Process

Phase 1: Environment Setup

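Install Databricks CLI (the `databricks-cli` pip package is the legacy CLI; the commands in this cheat sheet use its syntax)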
```
pip install databricks-cli
```
Configure Authentication
```
databricks configure --token
```
Verify Connection
```
databricks workspace ls
```
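
The token setup stores connection profiles in `~/.databrickscfg`, an INI-style file. The following is a minimal sketch, assuming the usual `host` and `token` keys, of writing and reading such a profile with Python's `configparser`. The helper names, host URL, and token are hypothetical placeholders, and the demo writes to a temporary file rather than the real config.

```python
import configparser
import os
import tempfile


def write_profile(cfg_path, profile, host, token):
    """Add or update one profile section in an INI-style config file."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)  # a missing file is silently ignored
    parser[profile] = {"host": host, "token": token}
    with open(cfg_path, "w") as handle:
        parser.write(handle)


def read_profile(cfg_path, profile):
    """Return the keys stored for a profile as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    return dict(parser[profile])


# Demo against a temporary file so a real ~/.databrickscfg is never touched.
demo_path = os.path.join(tempfile.mkdtemp(), "databrickscfg")
write_profile(demo_path, "dev", "https://example.cloud.databricks.com", "dapi-placeholder")
dev_profile = read_profile(demo_path, "dev")
print(dev_profile["host"])
```

A profile written this way would then be selected on the command line with `--profile dev`.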

Phase 2: Workspace Configuration

Set Up Workspace Structure
- Create folders for different projects
- Organize notebooks by team or function
- Set up shared libraries and utilities
Configure Cluster Policies
- Define resource limits and permissions
- Set auto-scaling parameters
- Configure security and network settings
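
Resource limits and auto-scaling bounds are typically expressed as a cluster policy definition, a JSON document that maps cluster attribute paths to rules. The sketch below assumes the `range` and `allowlist` rule types and uses illustrative limits and instance types. The `violations` helper is hypothetical and only mimics the check locally; Databricks enforces policies on the server side.

```python
import json

# Hypothetical policy: each key is a cluster attribute path, each value a rule.
policy_definition = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}


def violations(policy, requested):
    """Return readable rule violations for a flat dict of requested attributes."""
    problems = []
    for attribute, rule in policy.items():
        if attribute not in requested:
            continue
        value = requested[attribute]
        if rule["type"] == "range":
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{attribute}={value} is below {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{attribute}={value} is above {rule['maxValue']}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{attribute}={value} is not in the allowlist")
    return problems


request = {"autotermination_minutes": 120, "autoscale.max_workers": 4, "node_type_id": "i3.xlarge"}
policy_json = json.dumps(policy_definition, indent=2)  # paste into the policy editor
found = violations(policy_definition, request)
print(found)
```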

Phase 3: Development Workflow

Create and Configure Clusters
- Choose appropriate instance types
- Configure Spark settings
- Install required libraries
Develop and Test Code
- Use interactive notebooks for exploration
- Create reusable functions and modules
- Implement error handling and logging
Deploy and Schedule Jobs
- Convert notebooks to scheduled jobs
- Set up monitoring and alerting
- Implement CI/CD workflows

Essential Databricks CLI Commands

Authentication & Configuration

Command	Description	Example
`databricks configure`	Set up authentication	`databricks configure --token`
`cat ~/.databrickscfg`	Show current configuration (profiles are stored in this file)	`cat ~/.databrickscfg`
`databricks configure --profile`	Create or update a named profile	`databricks configure --token --profile dev`

Workspace Management

Command	Description	Example
`databricks workspace ls`	List workspace items	`databricks workspace ls /Users`
`databricks workspace import`	Import notebook/file (source files need a language)	`databricks workspace import -l PYTHON notebook.py /Users/me/notebook`
`databricks workspace export`	Export notebook/file	`databricks workspace export /Users/me/notebook notebook.py`
`databricks workspace delete`	Delete workspace item	`databricks workspace delete /Users/me/old_notebook`
`databricks workspace mkdirs`	Create directories	`databricks workspace mkdirs /Shared/team_folder`
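
When imports are scripted, for example in a deployment step, it can help to build the argument list in code rather than concatenating strings. This is a sketch only: `workspace_import_command` is a hypothetical helper, and it assumes the legacy CLI's `--language`, `--overwrite`, and `--profile` options.

```python
import shlex


def workspace_import_command(local_path, workspace_path, language="PYTHON",
                             overwrite=False, profile=None):
    """Build the argument list for a legacy-CLI `workspace import` call."""
    command = ["databricks", "workspace", "import", "--language", language]
    if overwrite:
        command.append("--overwrite")
    if profile:
        command += ["--profile", profile]
    command += [local_path, workspace_path]
    return command


cmd = workspace_import_command("notebook.py", "/Users/me/notebook",
                               overwrite=True, profile="dev")
print(shlex.join(cmd))
# To execute it for real: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.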

Cluster Operations

Command	Description	Example
`databricks clusters list`	List all clusters	`databricks clusters list`
`databricks clusters create`	Create new cluster	`databricks clusters create --json-file cluster.json`
`databricks clusters start`	Start cluster	`databricks clusters start --cluster-id 1234-567890-abc123`
`databricks clusters restart`	Restart cluster	`databricks clusters restart --cluster-id 1234-567890-abc123`
`databricks clusters delete`	Terminate cluster (use `permanent-delete` to remove it entirely)	`databricks clusters delete --cluster-id 1234-567890-abc123`
`databricks clusters get`	Get cluster details	`databricks clusters get --cluster-id 1234-567890-abc123`
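
The `cluster.json` file passed to `clusters create` is a cluster specification. A minimal sketch of generating one from Python follows; the field names follow the Clusters API, while the cluster name, runtime version, instance type, and sizes are illustrative assumptions to replace with values valid in your workspace.

```python
import json

# Field names follow the Clusters API; every value here is a placeholder.
cluster_spec = {
    "cluster_name": "etl-dev",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
    "spark_conf": {"spark.sql.adaptive.enabled": "true"},
}

cluster_json = json.dumps(cluster_spec, indent=2)
print(cluster_json)  # save this output as cluster.json
```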

Job Management

Command	Description	Example
`databricks jobs list`	List all jobs	`databricks jobs list`
`databricks jobs create`	Create new job	`databricks jobs create --json-file job.json`
`databricks jobs run-now`	Run job immediately	`databricks jobs run-now --job-id 123`
`databricks jobs delete`	Delete job	`databricks jobs delete --job-id 123`
`databricks runs list`	List job runs	`databricks runs list --job-id 123`
`databricks runs get`	Get run details	`databricks runs get --run-id 456`
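
The `job.json` file for `jobs create` can be generated the same way. The sketch below assumes the single-task Jobs API 2.0 shape, which the legacy CLI uses unless it is configured otherwise; workspaces on Jobs API 2.1 expect a `tasks` array instead. The helper name, job name, and parameter values are hypothetical.

```python
import json


def notebook_job_spec(name, notebook_path, cluster_id, cron=None, parameters=None):
    """Build a single-task notebook job definition (Jobs API 2.0 shape)."""
    spec = {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": parameters or {},
        },
        "max_retries": 1,
    }
    if cron:
        spec["schedule"] = {"quartz_cron_expression": cron, "timezone_id": "UTC"}
    return spec


job_spec = notebook_job_spec(
    name="nightly-etl",
    notebook_path="/Users/me/notebook",
    cluster_id="1234-567890-abc123",
    cron="0 0 2 * * ?",  # Quartz syntax: every day at 02:00
    parameters={"env": "prod"},
)
job_json = json.dumps(job_spec, indent=2)
print(job_json)  # save this output as job.json
```

The `base_parameters` values reach the notebook through widgets, so `dbutils.widgets.get("env")` would return `prod` in this example.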

File System Operations (DBFS)

Command	Description	Example
`databricks fs ls`	List files/directories	`databricks fs ls dbfs:/mnt/data/`
`databricks fs cp`	Copy files	`databricks fs cp local_file.csv dbfs:/tmp/`
`databricks fs rm`	Remove files	`databricks fs rm dbfs:/tmp/old_file.csv`
`databricks fs mkdirs`	Create directories	`databricks fs mkdirs dbfs:/mnt/project/`
`databricks fs cat`	Display file contents	`databricks fs cat dbfs:/tmp/config.json`

Essential Notebook Commands

Magic Commands

Command	Description	Example
`%python`	Switch to Python	`%python print("Hello World")`
`%scala`	Switch to Scala	`%scala println("Hello World")`
`%sql`	Execute SQL	`%sql SELECT * FROM table LIMIT 10`
`%r`	Switch to R	`%r print("Hello World")`
`%sh`	Execute shell commands	`%sh ls -la /tmp`
`%fs`	File system operations	`%fs ls /mnt/data`
`%run`	Run another notebook	`%run ./helper_functions`
`%md`	Markdown cell	`%md # This is a header`

Display and Visualization

Command	Description	Example
`display()`	Show DataFrame with formatting	`display(df)`
`displayHTML()`	Render HTML content	`displayHTML("<h1>Title</h1>")`
`dbutils.notebook.exit()`	Exit notebook with value	`dbutils.notebook.exit("Success")`

Widget Commands

Command	Description	Example
`dbutils.widgets.text()`	Create text widget	`dbutils.widgets.text("name", "default")`
`dbutils.widgets.dropdown()`	Create dropdown widget	`dbutils.widgets.dropdown("env", "prod", ["dev", "prod"])`
`dbutils.widgets.get()`	Get widget value	`env = dbutils.widgets.get("env")`
`dbutils.widgets.remove()`	Remove widget	`dbutils.widgets.remove("name")`
`dbutils.widgets.removeAll()`	Remove all widgets	`dbutils.widgets.removeAll()`

Spark SQL Commands

Data Definition Language (DDL)

Command	Description	Example
`CREATE TABLE`	Create new table	`CREATE TABLE users (id INT, name STRING, age INT) USING DELTA`
`DROP TABLE`	Delete table	`DROP TABLE IF EXISTS temp_table`
`ALTER TABLE`	Modify table structure	`ALTER TABLE users ADD COLUMN email STRING`
`DESCRIBE`	Show table schema	`DESCRIBE EXTENDED users`
`SHOW TABLES`	List all tables	`SHOW TABLES IN database_name`
`SHOW DATABASES`	List all databases	`SHOW DATABASES`

Data Manipulation Language (DML)

Command	Description	Example
`SELECT`	Query data	`SELECT * FROM users WHERE age > 25`
`INSERT`	Insert new data	`INSERT INTO users VALUES (1, 'John', 30)`
`UPDATE`	Update existing data	`UPDATE users SET age = 31 WHERE id = 1`
`DELETE`	Delete data	`DELETE FROM users WHERE age < 18`
`MERGE`	Upsert operation	`MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *`
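
For upserts that run from a notebook or job, the MERGE statement is often assembled from a list of key columns. The following is a sketch under stated assumptions: `build_merge_sql` is a hypothetical helper, the table names are placeholders, and identifiers are inserted without escaping, so it should only be used with trusted names.

```python
def build_merge_sql(target, source, key_columns):
    """Return a Delta Lake MERGE statement that upserts source rows into target."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    condition = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {condition}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


merge_sql = build_merge_sql("users", "users_updates", ["id"])
print(merge_sql)
# In a notebook: spark.sql(merge_sql)
```

`UPDATE SET *` and `INSERT *` copy every column by name, so they assume the source and target share the same column names.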

Performance & Optimization

Command	Description	Example
`CACHE TABLE`	Cache table in memory	`CACHE TABLE users`
`UNCACHE TABLE`	Remove table from cache	`UNCACHE TABLE users`
`ANALYZE TABLE`	Collect table statistics	`ANALYZE TABLE users COMPUTE STATISTICS`
`OPTIMIZE`	Compact Delta tables	`OPTIMIZE users ZORDER BY (date)`
`VACUUM`	Clean up old files	`VACUUM users RETAIN 168 HOURS`
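
`VACUUM` takes its retention window in hours, so the 168 hours above correspond to 7 days. A small sketch, with hypothetical helper and table names, that generates OPTIMIZE and VACUUM statements for several Delta tables:

```python
def maintenance_statements(tables, retain_days=7, zorder_by=None):
    """Build OPTIMIZE and VACUUM statements for each Delta table name."""
    zorder_by = zorder_by or {}
    retain_hours = retain_days * 24  # VACUUM expects hours
    statements = []
    for table in tables:
        optimize = f"OPTIMIZE {table}"
        if table in zorder_by:
            optimize += f" ZORDER BY ({', '.join(zorder_by[table])})"
        statements.append(optimize)
        statements.append(f"VACUUM {table} RETAIN {retain_hours} HOURS")
    return statements


stmts = maintenance_statements(["users", "events"], zorder_by={"events": ["date"]})
for stmt in stmts:
    print(stmt)
# In a scheduled notebook: for stmt in stmts: spark.sql(stmt)
```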

Delta Lake Commands

Delta Operations

Command	Description	Example
`DESCRIBE HISTORY`	Show table history	`DESCRIBE HISTORY users`
`RESTORE TABLE`	Time travel restore	`RESTORE TABLE users TO TIMESTAMP AS OF '2023-01-01'`
`CONVERT TO DELTA`	Convert a Parquet table to Delta	`CONVERT TO DELTA database_name.table_name`
`FSCK REPAIR TABLE`	Remove transaction log entries for files that no longer exist	`FSCK REPAIR TABLE users`

Delta Lake in Python/Scala

```
# Read Delta table
df = spark.read.format("delta").load("/path/to/delta-table")

# Write Delta table
df.write.format("delta").mode("overwrite").save("/path/to/delta-table")

# Streaming with Delta (a streaming write needs a checkpoint location;
# the checkpoint path below is a placeholder)
(spark.readStream.format("delta").load("/path")
    .writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/output/_checkpoints")
    .start("/path/output"))

# Time travel
df = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("/path")
```

Common Integration Patterns

Cloud Storage Integration

Storage Type	Mount Command	Access Pattern
AWS S3	`dbutils.fs.mount()` with S3 credentials	`s3a://bucket/path`
Azure Data Lake Storage Gen2	`dbutils.fs.mount()` with Azure credentials	`abfss://container@account.dfs.core.windows.net/path`
Google Cloud	`dbutils.fs.mount()` with GCS credentials	`gs://bucket/path`

Database Connections

```
# JDBC Connection
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/database") \
    .option("dbtable", "table_name") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

# Write to database
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/database") \
    .option("dbtable", "target_table") \
    .option("user", "username") \
    .option("password", "password") \
    .mode("overwrite") \
    .save()
```
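
Hard-coded passwords are best kept out of notebooks. One possible approach, sketched here with a hypothetical `jdbc_options` helper, is to build the connection options once and read the password from an environment variable; the host, database, and variable name are placeholders. On Databricks, `dbutils.secrets.get(scope, key)` is the usual source for such values.

```python
import os


def jdbc_options(host, port, database, table, user, password_env="DB_PASSWORD"):
    """Build the option dict for a PostgreSQL JDBC read or write."""
    password = os.environ.get(password_env)
    if password is None:
        raise KeyError(f"environment variable {password_env} is not set")
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


os.environ.setdefault("DB_PASSWORD", "placeholder")  # demo value only
options = jdbc_options("db.example.com", 5432, "analytics", "table_name", "username")
print(options["url"])
# In a notebook: df = spark.read.format("jdbc").options(**options).load()
```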

Common Challenges & Solutions

Performance Issues

Challenge: Slow query execution and inefficient resource utilization

Solutions:

Use appropriate cluster sizing and auto-scaling
Implement data partitioning and Z-ordering
Cache frequently accessed data
Optimize join strategies and broadcast variables
Use Delta Lake for ACID transactions and optimization

Memory Management

Challenge: Out-of-memory errors and inefficient resource usage

Solutions:

Configure executor memory and cores appropriately
Use .coalesce() and .repartition() for optimal partitioning
Rely on Spark's lazy evaluation and avoid pulling large results to the driver with collect()
Use columnar storage formats (Parquet, Delta)
Enable adaptive query execution (AQE)
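
A starting point for choosing a partition count is to divide the data size by a target partition size. The sketch below assumes a target of 128 MB per partition, which is a common rule of thumb rather than a Databricks requirement; the helper name and the 10 GiB input are illustrative.

```python
import math


def target_partitions(total_bytes, target_partition_mb=128, min_partitions=1):
    """Estimate a partition count so each partition is near the target size."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(min_partitions, math.ceil(total_bytes / target_bytes))


ten_gib = 10 * 1024 ** 3
num_partitions = target_partitions(ten_gib)
print(num_partitions)  # 10 GiB / 128 MiB = 80
# In a notebook:
#   df = df.repartition(num_partitions)  # full shuffle, can raise or lower the count
#   df = df.coalesce(num_partitions)     # merges partitions, can only lower the count
```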

Data Quality & Consistency

Challenge: Ensuring data reliability and handling schema changes

Solutions:

Implement data validation and quality checks
Use Delta Lake for schema evolution
Set up monitoring and alerting for data pipelines
Implement error handling and retry mechanisms
Use structured streaming for real-time data processing
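
A retry mechanism for transient failures can be as small as a wrapper with exponential backoff. This is a generic sketch, not a Databricks API: `with_retries` and `flaky_load` are hypothetical names, and the demo records the delays instead of sleeping.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}
delays = []


def flaky_load():
    """Stand-in for a pipeline step that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"


# Record the delays instead of actually sleeping, to keep the demo fast.
result = with_retries(flaky_load, sleep=delays.append)
print(result, delays)
```

Databricks jobs also have a built-in retry setting per task, so a wrapper like this is mainly useful for individual calls inside a notebook.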

Security & Access Control

Challenge: Managing permissions and secure data access

Solutions:

Implement fine-grained access controls
Use service principals for authentication
Encrypt data at rest and in transit
Implement data masking and anonymization
Run regular security audits and compliance checks
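
As one illustration of data masking, the local part of an email address can be replaced with a salted hash so that rows stay joinable without exposing the original value. This is a sketch with a hypothetical helper and a placeholder salt; it is pseudonymization rather than full anonymization, and in Spark the built-in `sha2` column function could play the same role.

```python
import hashlib


def mask_email(email, salt="change-me"):
    """Replace the local part of an email with a salted SHA-256 digest prefix."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local).encode("utf-8")).hexdigest()[:12]
    return f"{digest}@{domain}"


masked = mask_email("john.doe@example.com")
print(masked)
```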

Best Practices & Practical Tips

Cluster Configuration

Choose appropriate instance types for your workload
Enable auto-scaling to handle variable loads
Set appropriate idle timeout to reduce costs
Use spot instances for non-critical workloads
Configure cluster policies for consistent settings

Code Organization

Structure notebooks with clear sections and documentation
Use modular code with reusable functions
Implement proper error handling and logging
Version control notebooks using Git integration
Create shared libraries for common functionality

Performance Optimization

Use broadcast joins for small lookup tables
Implement proper data partitioning strategies
Cache intermediate results when appropriate
Use columnar file formats (Parquet, Delta)
Monitor query plans and optimize accordingly

Data Management

Implement proper data lifecycle management
Use Delta Lake for ACID transactions
Set up regular OPTIMIZE and VACUUM operations
Monitor storage costs and usage patterns
Implement data retention policies

Monitoring & Debugging

Use Spark UI for performance analysis
Implement comprehensive logging
Set up alerts for job failures and performance issues
Monitor cluster utilization and costs
Use profiling tools for code optimization

MLflow Integration Commands

Model Management

Command	Description	Example
`mlflow.start_run()`	Start MLflow run	`with mlflow.start_run(): # training code`
`mlflow.log_metric()`	Log metrics	`mlflow.log_metric("accuracy", 0.95)`
`mlflow.log_param()`	Log parameters	`mlflow.log_param("learning_rate", 0.01)`
`mlflow.sklearn.log_model()`	Log model (use the module for your ML framework)	`mlflow.sklearn.log_model(model, "model")`
`mlflow.register_model()`	Register model	`mlflow.register_model("runs:/<run_id>/model", "MyModel")`

Troubleshooting Commands

Debug Information

Command	Description	Example
`spark.conf.get()`	Get Spark configuration	`spark.conf.get("spark.sql.adaptive.enabled")`
`spark.sparkContext.getConf().getAll()`	Get all Spark configs	`spark.sparkContext.getConf().getAll()`
`df.explain()`	Show query execution plan	`df.explain(True)`
`df.cache().count()`	Cache the DataFrame and materialize it with an action	`df.cache().count()`
`spark.catalog.listTables()`	List available tables	`spark.catalog.listTables()`

Resources for Further Learning

Official Documentation & Guides

Databricks Documentation: Comprehensive platform documentation
Apache Spark Documentation: Core Spark functionality and APIs
Delta Lake Documentation: Advanced data lake operations
MLflow Documentation: Machine learning lifecycle management

Training & Certification

Databricks Academy: Official training courses and certifications
Databricks Certified Associate Developer: Entry-level certification
Databricks Certified Professional Data Scientist: Advanced ML certification
Apache Spark Certification: Industry-recognized Spark expertise

Books & Publications

“Learning Spark: Lightning-Fast Data Analytics” by Jules S. Damji et al.
“High Performance Spark” by Holden Karau and Rachel Warren
“Spark: The Definitive Guide” by Bill Chambers and Matei Zaharia
“Delta Lake: The Definitive Guide” by Denny Lee et al.

Community Resources

Databricks Community Forums: User discussions and Q&A
Stack Overflow: Technical questions and solutions
GitHub: Open source projects and examples
Medium: Technical articles and use cases
YouTube: Video tutorials and conference talks

Practice Environments

Databricks Community Edition: Free tier for learning
Try Databricks: Platform trials and demos
Azure Databricks: Microsoft cloud integration
AWS Databricks: Amazon cloud integration
Google Cloud Databricks: Google cloud integration

This cheat sheet is a quick reference for Databricks CLI, notebook, SQL, and Delta Lake commands used in data engineering and analytics projects.