Introduction to BigQuery SQL
BigQuery is Google Cloud’s fully managed, serverless data warehouse for analyzing large datasets with SQL. It combines the flexibility of SQL with the scalability of cloud computing, letting you analyze terabytes or petabytes of data without managing infrastructure. BigQuery uses the GoogleSQL dialect (compliant with the SQL 2011 standard), making it accessible to anyone familiar with other SQL implementations while offering powerful features for big data analytics.
Core Concepts and Fundamentals
Basic BigQuery Architecture
- Projects: Containers for datasets, tables, and other resources
- Datasets: Collections of tables and views within a project
- Tables: Structured data storage following a schema
- Views: Virtual tables created by a SQL query
- Materialized Views: Precomputed views that periodically cache the results of a query (a short DDL sketch of how these resources fit together follows this list)
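A minimal DDL sketch of the resource hierarchy, using the hypothetical names `my-project` and `analytics`:
-- Datasets are created inside a project (CREATE SCHEMA is the dataset DDL)
CREATE SCHEMA IF NOT EXISTS `my-project.analytics`;
-- Tables live inside a dataset and follow a schema
CREATE TABLE IF NOT EXISTS `my-project.analytics.events` (
  id STRING,
  created_at TIMESTAMP
);
-- Views are saved queries over tables in a dataset
CREATE VIEW IF NOT EXISTS `my-project.analytics.recent_events` AS
SELECT id, created_at
FROM `my-project.analytics.events`
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);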
Data Types
Type | Description | Example |
---|---|---|
STRING | UTF-8 encoded string | 'Hello World' |
INT64 | 64-bit signed integer | 42 |
FLOAT64 | Double precision floating point | 3.14159 |
BOOL | Boolean value | TRUE, FALSE |
BYTES | Binary data | b'Hello' |
DATE | Calendar date | DATE '2025-05-10' |
DATETIME | Date and time | DATETIME '2025-05-10 12:30:00' |
TIME | Time of day | TIME '12:30:00' |
TIMESTAMP | Absolute point in time | TIMESTAMP '2025-05-10 12:30:00 UTC' |
ARRAY | Ordered list of values | [1, 2, 3] |
STRUCT | Container of ordered fields | STRUCT('John' AS name, 42 AS age) |
GEOGRAPHY | Points, lines, and polygons on the Earth's surface | ST_GEOGPOINT(-122.4, 37.8) |
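As a quick check of these literals, the following standalone query builds one value of each common type (no table required; aliases are arbitrary):
SELECT
  'Hello World' AS string_value,
  42 AS int64_value,
  3.14159 AS float64_value,
  TRUE AS bool_value,
  DATE '2025-05-10' AS date_value,
  TIMESTAMP '2025-05-10 12:30:00 UTC' AS timestamp_value,
  [1, 2, 3] AS array_value,
  STRUCT('John' AS name, 42 AS age) AS struct_value,
  ST_GEOGPOINT(-122.4, 37.8) AS geography_value;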
Essential BigQuery SQL Operations
Basic Queries
-- Simple SELECT query with filtering
SELECT
column1,
column2,
column3
FROM
`project.dataset.table`
WHERE
condition
LIMIT 1000;
-- Query with aggregation
SELECT
category,
COUNT(*) AS count,
SUM(value) AS total_value
FROM
`project.dataset.table`
GROUP BY
category
HAVING
COUNT(*) > 10
ORDER BY
total_value DESC;
Working with Multiple Tables
-- INNER JOIN
SELECT
a.id,
a.name,
b.value
FROM
`project.dataset.tableA` AS a
INNER JOIN
`project.dataset.tableB` AS b
ON
a.id = b.id;
-- LEFT JOIN
SELECT
a.id,
a.name,
b.value
FROM
`project.dataset.tableA` AS a
LEFT JOIN
`project.dataset.tableB` AS b
ON
a.id = b.id;
-- UNION ALL
SELECT * FROM `project.dataset.table1`
UNION ALL
SELECT * FROM `project.dataset.table2`;
Working with Nested and Repeated Data
-- Query with UNNEST to flatten arrays
SELECT
id,
item
FROM
`project.dataset.table`,
UNNEST(array_column) AS item;
-- Access nested fields
SELECT
id,
struct_column.field1,
struct_column.field2
FROM
`project.dataset.table`;
Advanced BigQuery Features
Window Functions
-- Rank values within partitions
SELECT
category,
value,
RANK() OVER (PARTITION BY category ORDER BY value DESC) AS rank
FROM
`project.dataset.table`;
-- Calculate running totals
SELECT
date,
value,
SUM(value) OVER (ORDER BY date) AS running_total
FROM
`project.dataset.table`;
-- Calculate moving averages
SELECT
date,
value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_average
FROM
`project.dataset.table`;
Advanced Analytical Functions
-- Calculate percentiles
SELECT
APPROX_QUANTILES(value, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(value, 100)[OFFSET(90)] AS percentile_90
FROM
`project.dataset.table`;
-- Clustering analysis: assign each row to its nearest centroid
-- using a trained k-means model via ML.PREDICT
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `project.dataset.kmeans_model`,
    TABLE `project.dataset.features`
  );
User-Defined Functions (UDFs)
-- JavaScript UDF
CREATE TEMP FUNCTION multiplyByTwo(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x * 2;
""";
SELECT
value,
multiplyByTwo(value) AS doubled_value
FROM
`project.dataset.table`;
-- SQL UDF
CREATE TEMP FUNCTION categorize(value FLOAT64)
RETURNS STRING
AS (
CASE
WHEN value < 0 THEN 'Negative'
WHEN value = 0 THEN 'Zero'
ELSE 'Positive'
END
);
SELECT
value,
categorize(value) AS category
FROM
`project.dataset.table`;
Data Manipulation and Governance
Data Manipulation Language (DML)
-- Insert new rows
INSERT INTO `project.dataset.table` (column1, column2)
VALUES ('value1', 'value2'), ('value3', 'value4');
-- Insert using a query
INSERT INTO `project.dataset.target_table` (column1, column2)
SELECT column1, column2
FROM `project.dataset.source_table`
WHERE condition;
-- Update values
UPDATE `project.dataset.table`
SET column1 = 'new_value'
WHERE condition;
-- Delete rows
DELETE FROM `project.dataset.table`
WHERE condition;
-- Merge (upsert) operation
MERGE INTO `project.dataset.target_table` T
USING `project.dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
UPDATE SET T.column1 = S.column1, T.column2 = S.column2
WHEN NOT MATCHED THEN
INSERT (id, column1, column2)
VALUES (S.id, S.column1, S.column2);
Data Definition Language (DDL)
-- Create a table
CREATE TABLE `project.dataset.table` (
id STRING,
name STRING,
value FLOAT64,
ts TIMESTAMP,
tags ARRAY<STRING>
)
PARTITION BY DATE(ts)
CLUSTER BY id;
-- Create a view
CREATE VIEW `project.dataset.view` AS
SELECT id, name, value
FROM `project.dataset.table`
WHERE condition;
-- Create a materialized view
CREATE MATERIALIZED VIEW `project.dataset.mat_view` AS
SELECT
date,
category,
SUM(value) AS total_value
FROM
`project.dataset.table`
GROUP BY
date, category;
Performance Optimization Techniques
Partitioning and Clustering
-- Create a partitioned table
CREATE TABLE `project.dataset.partitioned_table`
PARTITION BY
DATE(timestamp_column) -- Time-unit based partitioning
-- OR
-- RANGE_BUCKET(integer_column, GENERATE_ARRAY(0, 100, 10)) -- Integer range partitioning
AS SELECT * FROM `project.dataset.source_table`;
-- Create a clustered table
CREATE TABLE `project.dataset.clustered_table`
CLUSTER BY
category, region
AS SELECT * FROM `project.dataset.source_table`;
-- Query a partitioned table (with pruning)
SELECT *
FROM `project.dataset.partitioned_table`
WHERE DATE(timestamp_column) BETWEEN '2025-01-01' AND '2025-01-31';
Cost Optimization Techniques
Technique | Impact | Implementation |
---|---|---|
Column Selection | Reduces bytes scanned | Select only needed columns, avoid SELECT * |
Partition Pruning | Reduces data scanned | Query with filters on partitioning columns |
Cluster Pruning | Improves filter efficiency | Filter on clustering columns |
Materialized Views | Reuses computation | Create for frequently used aggregations |
Result Caching | Avoids repeated work | Identical queries within 24 hours use cached results |
Query Approximation | Trades accuracy for speed | Use APPROX_ functions for aggregations (example below) |
Dry Runs | Estimate cost without running | Use the bq --dry_run flag or the console query validator (LIMIT does not reduce bytes scanned) |
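As an example of query approximation, APPROX_COUNT_DISTINCT trades a small, bounded error for far less memory and compute than an exact distinct count (a sketch with placeholder table and column names):
-- Exact distinct count: can be memory-intensive on very large tables
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM `project.dataset.events`;
-- Approximate distinct count: typically around 1% error, much cheaper at scale
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`;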
Common Query Anti-Patterns
Anti-Pattern | Better Approach |
---|---|
SELECT * | Specify only needed columns |
UNION DISTINCT instead of UNION ALL | Use UNION ALL when duplicates are acceptable (avoids deduplicating every row) |
Joining on transformed columns | Pre-transform in subqueries, use original columns in joins |
Filtering after aggregation | Push filters into subqueries before aggregation |
Querying all partitions | Limit date ranges when possible |
Complex nested subqueries | Use CTEs (WITH clauses) for readability and optimization (see the sketch below) |
Repeated calculation | Use CTEs or window functions to calculate once |
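For instance, an aggregation whose result is needed in several places can be named once in a CTE and reused instead of being repeated in nested subqueries (a sketch with placeholder names):
-- Name the aggregation once, then reuse it
WITH daily_totals AS (
  SELECT
    DATE(ts) AS day,
    SUM(value) AS total_value
  FROM `project.dataset.table`
  GROUP BY day
)
SELECT
  day,
  total_value,
  total_value - LAG(total_value) OVER (ORDER BY day) AS day_over_day_change
FROM daily_totals;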
Common Task Examples
Time Series Analysis
-- Time-based aggregation with proper handling of time zones
SELECT
TIMESTAMP_TRUNC(timestamp_column, DAY, 'US/Pacific') AS day,
COUNT(*) AS event_count
FROM
`project.dataset.events`
GROUP BY
day
ORDER BY
day;
-- Time-based window functions
-- Note: BigQuery RANGE frames require a numeric ORDER BY key,
-- so convert the timestamp to epoch seconds for an interval-based frame
SELECT
  timestamp_column AS time,
  value,
  AVG(value) OVER (
    ORDER BY UNIX_SECONDS(timestamp_column)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW  -- trailing 1-hour window
  ) AS rolling_average
FROM
  `project.dataset.measurements`;
Geographic Analysis
-- Calculate distances between points
SELECT
id,
ST_DISTANCE(
ST_GEOGPOINT(longitude1, latitude1),
ST_GEOGPOINT(longitude2, latitude2)
) AS distance_meters
FROM
`project.dataset.locations`;
-- Find points within a radius
SELECT
id, name
FROM
`project.dataset.places`
WHERE
ST_DWITHIN(
ST_GEOGPOINT(longitude, latitude),
ST_GEOGPOINT(-122.4194, 37.7749), -- San Francisco
5000 -- 5km radius
);
Regular Expressions
-- Extract patterns
SELECT
text,
-- REGEXP_EXTRACT allows at most one capturing group, so match the whole pattern
REGEXP_EXTRACT(text, r'\d{3}-\d{3}-\d{4}') AS phone_number
FROM
`project.dataset.documents`;
-- Replace patterns
SELECT
email,
REGEXP_REPLACE(email, r'@.*$', '@example.com') AS anonymized_email
FROM
`project.dataset.users`;
-- Check if pattern exists
SELECT
url,
REGEXP_CONTAINS(url, r'^https://') AS is_secure
FROM
`project.dataset.links`;
BigQuery ML
Creating Models
-- Create a linear regression model
CREATE OR REPLACE MODEL `project.dataset.linear_model`
OPTIONS(
model_type='LINEAR_REG',
input_label_cols=['target']
) AS
SELECT
feature1,
feature2,
feature3,
target
FROM
`project.dataset.training_data`;
-- Create a classification model
CREATE OR REPLACE MODEL `project.dataset.classifier`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['is_converted']
) AS
SELECT
feature1,
feature2,
feature3,
is_converted
FROM
`project.dataset.training_data`;
Using Models
-- Make predictions
SELECT
*
FROM
ML.PREDICT(
MODEL `project.dataset.model`,
TABLE `project.dataset.prediction_input`
);
-- Evaluate model
SELECT
*
FROM
ML.EVALUATE(
MODEL `project.dataset.model`,
TABLE `project.dataset.test_data`
);
-- Extract feature importance (tree-based models; use ML.WEIGHTS for linear models)
SELECT
*
FROM
ML.FEATURE_IMPORTANCE(
MODEL `project.dataset.model`
);
Best Practices
Query Optimization
- Write queries to minimize data processed (filter early, select specific columns)
- Use partitioning and clustering aligned with your query patterns
- Materialize intermediate results for complex multi-stage queries
- Use appropriate JOIN types and join on columns with similar distributions
- Leverage approximate aggregation functions for large datasets
- Use query parameterization to improve caching and security (see the sketch after this list)
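A minimal parameterized-query sketch; `@start_date` and `@min_value` are placeholders bound by the client (for example, bq query --parameter or the client libraries) rather than interpolated into the SQL string:
-- Parameters block SQL injection and keep the query text stable, which helps caching
SELECT
  id,
  value
FROM
  `project.dataset.table`
WHERE
  DATE(ts) >= @start_date
  AND value > @min_value;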
Data Organization
- Partition tables by date for time-series data
- Cluster large tables on frequently filtered columns (max 4 columns)
- Use nested and repeated fields appropriately for hierarchical data (see the schema sketch after this list)
- Consider column ordering for better compression (similar data together)
- Use appropriate data types (e.g., INT64 instead of STRING for numbers)
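A sketch of hierarchical data modeled with nested and repeated fields instead of a separate child table (all names hypothetical):
-- Each order carries its line items inline as a repeated STRUCT
CREATE TABLE `project.dataset.orders` (
  order_id STRING,
  order_date DATE,
  line_items ARRAY<STRUCT<
    sku STRING,
    quantity INT64,
    unit_price FLOAT64
  >>
)
PARTITION BY order_date;
-- Flatten the line items back out when needed
SELECT
  order_id,
  item.sku,
  item.quantity * item.unit_price AS line_total
FROM
  `project.dataset.orders`,
  UNNEST(line_items) AS item;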
Cost Control
- Set up project-level query quotas and user-level limits
- Monitor with BigQuery Monitoring in Cloud Monitoring
- Use flat-rate pricing for predictable workloads
- Implement row-level access policies instead of creating filtered views
- Use INFORMATION_SCHEMA views to monitor usage patterns (example after this list)
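For example, this sketch surfaces the most expensive recent queries via INFORMATION_SCHEMA (adjust the `region-us` qualifier to match your data's region):
-- Top queries by bytes processed over the last 7 days
SELECT
  user_email,
  query,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY
  total_bytes_processed DESC
LIMIT 20;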
Troubleshooting Common Issues
Issue | Possible Causes | Solutions |
---|---|---|
Query timeout | Query too complex, insufficient slots | Break into smaller queries, optimize joins, increase slots |
Out of memory | JOIN producing too many rows | Filter before JOIN, use approximation functions |
Slow query performance | Missing partitioning, inefficient JOINs | Optimize table structure, rewrite the query, review the query execution details (BigQuery has no EXPLAIN statement) |
“Resources exceeded” error | Query processing too much data | Reduce data scanned, filter earlier, use partitioning |
High costs | Inefficient queries, scanning too much data | Optimize queries, use caching, implement cost controls |
Quota limits reached | Too many concurrent queries | Implement query queue, optimize query frequency |