The Ultimate BigQuery SQL Cheat Sheet: From Basics to Advanced Techniques

Introduction to BigQuery SQL

BigQuery is Google Cloud’s fully managed, serverless data warehouse for analyzing large datasets with SQL. It combines the familiarity of SQL with the scalability of cloud computing, letting you query terabytes or even petabytes of data without managing infrastructure. Its GoogleSQL dialect (formerly called standard SQL) follows the SQL:2011 standard, making it approachable for anyone familiar with other SQL implementations while offering powerful features for big data analytics.

Core Concepts and Fundamentals

Basic BigQuery Architecture

  • Projects: Containers for datasets, tables, and other resources
  • Datasets: Collections of tables and views within a project
  • Tables: Structured data storage following a schema
  • Views: Virtual tables created by a SQL query
  • Materialized Views: Precomputed views that periodically cache results of a query
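
This hierarchy shows up directly in how resources are referenced: every table is addressed as `project.dataset.table`. A minimal sketch of the layers, using hypothetical names (in BigQuery DDL, CREATE SCHEMA creates a dataset):

-- Create a dataset, a table inside it, and a view over that table
CREATE SCHEMA `my-project.analytics`;

CREATE TABLE `my-project.analytics.events` (
  event_id STRING,
  event_ts TIMESTAMP
);

CREATE VIEW `my-project.analytics.recent_events` AS
SELECT event_id, event_ts
FROM `my-project.analytics.events`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);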

Data Types

Type      | Description                                        | Example
STRING    | UTF-8 encoded string                               | 'Hello World'
INT64     | 64-bit signed integer                              | 42
FLOAT64   | Double-precision floating point                    | 3.14159
BOOL      | Boolean value                                      | TRUE, FALSE
BYTES     | Binary data                                        | b'Hello'
DATE      | Calendar date                                      | DATE '2025-05-10'
DATETIME  | Date and time, without a time zone                 | DATETIME '2025-05-10 12:30:00'
TIME      | Time of day                                        | TIME '12:30:00'
TIMESTAMP | Absolute point in time                             | TIMESTAMP '2025-05-10 12:30:00 UTC'
ARRAY     | Ordered list of values                             | [1, 2, 3]
STRUCT    | Container of ordered fields                        | STRUCT('John' AS name, 42 AS age)
GEOGRAPHY | Points, lines, and polygons on the Earth's surface | ST_GEOGPOINT(-122.4, 37.8)
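
All of these types can be written as literals, so a quick way to explore them is a table-free SELECT such as the sketch below (no dataset required):

-- Literals and constructors for several BigQuery types
SELECT
  CAST('42' AS INT64) AS answer,
  3.14159 AS pi,
  DATE '2025-05-10' AS d,
  [1, 2, 3] AS numbers,
  STRUCT('John' AS name, 42 AS age) AS person,
  ST_GEOGPOINT(-122.4, 37.8) AS point;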

Essential BigQuery SQL Operations

Basic Queries

-- Simple SELECT query with filtering
SELECT
  column1,
  column2,
  column3
FROM
  `project.dataset.table`
WHERE
  condition
LIMIT 1000;

-- Query with aggregation
SELECT
  category,
  COUNT(*) AS count,
  SUM(value) AS total_value
FROM
  `project.dataset.table`
GROUP BY
  category
HAVING
  COUNT(*) > 10
ORDER BY
  total_value DESC;

Working with Multiple Tables

-- INNER JOIN
SELECT
  a.id,
  a.name,
  b.value
FROM
  `project.dataset.tableA` AS a
INNER JOIN
  `project.dataset.tableB` AS b
ON
  a.id = b.id;

-- LEFT JOIN
SELECT
  a.id,
  a.name,
  b.value
FROM
  `project.dataset.tableA` AS a
LEFT JOIN
  `project.dataset.tableB` AS b
ON
  a.id = b.id;

-- UNION ALL
SELECT * FROM `project.dataset.table1`
UNION ALL
SELECT * FROM `project.dataset.table2`;

Working with Nested and Repeated Data

-- Query with UNNEST to flatten arrays
SELECT
  id,
  item
FROM
  `project.dataset.table`,
  UNNEST(array_column) AS item;

-- Access nested fields
SELECT
  id,
  struct_column.field1,
  struct_column.field2
FROM
  `project.dataset.table`;

Advanced BigQuery Features

Window Functions

-- Rank values within partitions
SELECT
  category,
  value,
  RANK() OVER (PARTITION BY category ORDER BY value DESC) AS rank
FROM
  `project.dataset.table`;

-- Calculate running totals
SELECT
  date,
  value,
  SUM(value) OVER (ORDER BY date) AS running_total
FROM
  `project.dataset.table`;

-- Calculate moving averages
SELECT
  date,
  value,
  AVG(value) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS seven_day_average
FROM
  `project.dataset.table`;

Advanced Analytical Functions

-- Calculate percentiles (APPROX_QUANTILES(value, 100) returns 101
-- boundaries, so OFFSET(50) is the median and OFFSET(90) the 90th percentile)
SELECT
  APPROX_QUANTILES(value, 100)[OFFSET(50)] AS median,
  APPROX_QUANTILES(value, 100)[OFFSET(90)] AS percentile_90
FROM
  `project.dataset.table`;

-- Clustering analysis: assign rows to clusters with a trained k-means
-- model via ML.PREDICT (the output includes a CENTROID_ID column)
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `project.dataset.kmeans_model`,
    TABLE `project.dataset.features`
  );

User-Defined Functions (UDFs)

-- JavaScript UDF
CREATE TEMP FUNCTION multiplyByTwo(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return x * 2;
""";

SELECT
  value,
  multiplyByTwo(value) AS doubled_value
FROM
  `project.dataset.table`;

-- SQL UDF
CREATE TEMP FUNCTION categorize(value FLOAT64)
RETURNS STRING
AS (
  CASE
    WHEN value < 0 THEN 'Negative'
    WHEN value = 0 THEN 'Zero'
    ELSE 'Positive'
  END
);

SELECT
  value,
  categorize(value) AS category
FROM
  `project.dataset.table`;

Data Manipulation and Governance

Data Manipulation Language (DML)

-- Insert new rows
INSERT INTO `project.dataset.table` (column1, column2)
VALUES ('value1', 'value2'), ('value3', 'value4');

-- Insert using a query
INSERT INTO `project.dataset.target_table` (column1, column2)
SELECT column1, column2
FROM `project.dataset.source_table`
WHERE condition;

-- Update values
UPDATE `project.dataset.table`
SET column1 = 'new_value'
WHERE condition;

-- Delete rows
DELETE FROM `project.dataset.table`
WHERE condition;

-- Merge (upsert) operation
MERGE INTO `project.dataset.target_table` T
USING `project.dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET T.column1 = S.column1, T.column2 = S.column2
WHEN NOT MATCHED THEN
  INSERT (id, column1, column2)
  VALUES (S.id, S.column1, S.column2);

Data Definition Language (DDL)

-- Create a table
CREATE TABLE `project.dataset.table` (
  id STRING,
  name STRING,
  value FLOAT64,
  ts TIMESTAMP,
  tags ARRAY<STRING>
)
PARTITION BY DATE(ts)
CLUSTER BY id;

-- Create a view
CREATE VIEW `project.dataset.view` AS
SELECT id, name, value
FROM `project.dataset.table`
WHERE condition;

-- Create a materialized view
CREATE MATERIALIZED VIEW `project.dataset.mat_view` AS
SELECT
  date,
  category,
  SUM(value) AS total_value
FROM
  `project.dataset.table`
GROUP BY
  date, category;

Performance Optimization Techniques

Partitioning and Clustering

-- Create a partitioned table
CREATE TABLE `project.dataset.partitioned_table`
PARTITION BY
  DATE(timestamp_column) -- Time-unit based partitioning
  -- OR
  -- RANGE_BUCKET(integer_column, GENERATE_ARRAY(0, 100, 10)) -- Integer range partitioning
AS SELECT * FROM `project.dataset.source_table`;

-- Create a clustered table
CREATE TABLE `project.dataset.clustered_table`
CLUSTER BY
  category, region
AS SELECT * FROM `project.dataset.source_table`;

-- Query a partitioned table (with pruning)
SELECT *
FROM `project.dataset.partitioned_table`
WHERE DATE(timestamp_column) BETWEEN '2025-01-01' AND '2025-01-31';

Cost Optimization Techniques

Technique               | Impact                     | Implementation
Column selection        | Reduces bytes scanned      | Select only needed columns; avoid SELECT *
Partition pruning       | Reduces data scanned       | Filter on the partitioning column
Cluster pruning         | Improves filter efficiency | Filter on clustering columns
Materialized views      | Reuses computation         | Create for frequently used aggregations
Result caching          | Avoids repeated work       | Identical deterministic queries within 24 hours reuse cached results
Query approximation     | Trades accuracy for speed  | Use APPROX_ functions for large aggregations
Dry runs                | Estimate cost before running | Use a dry run (bq --dry_run or the console's query validator); note that LIMIT does not reduce bytes scanned
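
As an example of query approximation, APPROX_COUNT_DISTINCT scans the same bytes as an exact count but is far cheaper in memory and time on high-cardinality columns (table and column names below are hypothetical):

-- Exact distinct count: precise but expensive at scale
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM `project.dataset.events`;

-- Approximate distinct count: small, bounded error, much faster
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`;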

Common Query Anti-Patterns

Anti-Pattern                        | Better Approach
SELECT *                            | Specify only the columns you need
UNION DISTINCT instead of UNION ALL | Use UNION ALL when duplicates are acceptable (skips deduplication)
Joining on transformed columns      | Pre-transform in subqueries; join on the original columns
Filtering after aggregation         | Push filters into subqueries before aggregating
Querying all partitions             | Limit date ranges whenever possible
Deeply nested subqueries            | Use CTEs (WITH clauses) for readability and optimization
Repeated calculation                | Use CTEs or window functions to compute once
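
A sketch combining several of these fixes: the filter is pushed into a CTE before aggregation, and the aggregate is computed once and reused (table and column names are hypothetical):

-- Filter early, aggregate once, reuse the result
WITH recent_orders AS (
  SELECT customer_id, order_value
  FROM `project.dataset.orders`
  WHERE order_date >= '2025-01-01'  -- prunes partitions if partitioned by order_date
),
customer_totals AS (
  SELECT customer_id, SUM(order_value) AS total_value
  FROM recent_orders
  GROUP BY customer_id
)
SELECT customer_id, total_value
FROM customer_totals
WHERE total_value > 1000;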

Common Task Examples

Time Series Analysis

-- Time-based aggregation with proper handling of time zones
SELECT
  TIMESTAMP_TRUNC(timestamp_column, DAY, 'US/Pacific') AS day,
  COUNT(*) AS event_count
FROM
  `project.dataset.events`
GROUP BY
  day
ORDER BY
  day;

-- Time-based window functions
SELECT
  timestamp_column AS time,
  value,
  AVG(value) OVER (
    ORDER BY timestamp_column
    RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
  ) AS rolling_average
FROM
  `project.dataset.measurements`;

Geographic Analysis

-- Calculate distances between points
SELECT
  id,
  ST_DISTANCE(
    ST_GEOGPOINT(longitude1, latitude1),
    ST_GEOGPOINT(longitude2, latitude2)
  ) AS distance_meters
FROM
  `project.dataset.locations`;

-- Find points within a radius
SELECT
  id, name
FROM
  `project.dataset.places`
WHERE
  ST_DWITHIN(
    ST_GEOGPOINT(longitude, latitude),
    ST_GEOGPOINT(-122.4194, 37.7749),  -- San Francisco
    5000  -- 5km radius
  );

Regular Expressions

-- Extract patterns (REGEXP_EXTRACT allows at most one capturing group;
-- with no groups it returns the entire match)
SELECT
  text,
  REGEXP_EXTRACT(text, r'\d{3}-\d{3}-\d{4}') AS phone_number
FROM
  `project.dataset.documents`;

-- Replace patterns
SELECT
  email,
  REGEXP_REPLACE(email, r'@.*$', '@example.com') AS anonymized_email
FROM
  `project.dataset.users`;

-- Check if pattern exists
SELECT
  url,
  REGEXP_CONTAINS(url, r'^https://') AS is_secure
FROM
  `project.dataset.links`;

BigQuery ML

Creating Models

-- Create a linear regression model
CREATE OR REPLACE MODEL `project.dataset.linear_model`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['target']
) AS
SELECT
  feature1,
  feature2,
  feature3,
  target
FROM
  `project.dataset.training_data`;

-- Create a classification model
CREATE OR REPLACE MODEL `project.dataset.classifier`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['is_converted']
) AS
SELECT
  feature1,
  feature2,
  feature3,
  is_converted
FROM
  `project.dataset.training_data`;

Using Models

-- Make predictions
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `project.dataset.model`,
    TABLE `project.dataset.prediction_input`
  );

-- Evaluate model
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `project.dataset.model`,
    TABLE `project.dataset.test_data`
  );

-- Extract feature importance (supported for tree-based models such as
-- boosted trees; for linear models, use ML.WEIGHTS)
SELECT
  *
FROM
  ML.FEATURE_IMPORTANCE(
    MODEL `project.dataset.model`
  );

Best Practices

Query Optimization

  • Write queries to minimize data processed (filter early, select specific columns)
  • Use partitioning and clustering aligned with your query patterns
  • Materialize intermediate results for complex multi-stage queries
  • Use appropriate JOIN types and join on columns with similar distributions
  • Leverage approximate aggregation functions for large datasets
  • Use query parameterization to improve caching and security (see the sketch below)
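
Parameters are referenced as @name in the query text and bound by the client at run time; a minimal sketch (table, column, and parameter names are hypothetical):

-- Run with, e.g.:
--   bq query --use_legacy_sql=false \
--     --parameter=start_date:DATE:2025-01-01 \
--     'SELECT ... WHERE order_date >= @start_date'
SELECT
  order_id,
  order_value
FROM
  `project.dataset.orders`
WHERE
  order_date >= @start_date;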

Data Organization

  • Partition tables by date for time-series data
  • Cluster large tables on frequently filtered columns (max 4 columns)
  • Use nested and repeated fields appropriately for hierarchical data (see the sketch after this list)
  • Consider column ordering for better compression (similar data together)
  • Use appropriate data types (e.g., INT64 instead of STRING for numbers)
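
A sketch of a schema that applies several of these points: nested and repeated fields hold an order's line items in the same row, and the table is partitioned by date and clustered on a frequently filtered column (names are hypothetical):

-- One row per order; line items stored as a repeated nested record
CREATE TABLE `project.dataset.orders` (
  order_id STRING,
  order_date DATE,
  customer STRUCT<id STRING, name STRING>,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, price FLOAT64>>
)
PARTITION BY order_date
CLUSTER BY order_id;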

Cost Control

  • Set up project-level query quotas and user-level limits
  • Monitor with BigQuery Monitoring in Cloud Monitoring
  • Use flat-rate pricing for predictable workloads
  • Implement row-level access policies instead of creating filtered views
  • Use INFORMATION_SCHEMA views to monitor usage patterns (example below)
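
For example, the INFORMATION_SCHEMA.JOBS_BY_PROJECT view records bytes billed per job; a sketch that surfaces the most expensive queries of the past week (the region qualifier must match where your jobs run):

-- Top 20 queries by bytes billed over the last 7 days
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  creation_time
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY
  total_bytes_billed DESC
LIMIT 20;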

Troubleshooting Common Issues

Issue                      | Possible Causes                            | Solutions
Query timeout              | Query too complex; insufficient slots      | Break into smaller queries, optimize joins, add slot capacity
Out of memory              | A JOIN producing too many rows             | Filter before the JOIN; use approximation functions
Slow query performance     | Missing partitioning; inefficient JOINs    | Optimize table structure, rewrite the query, inspect the execution plan (Execution details)
“Resources exceeded” error | Query processing too much data             | Reduce data scanned, filter earlier, use partitioning
High costs                 | Inefficient queries scanning too much data | Optimize queries, use caching, implement cost controls
Quota limits reached       | Too many concurrent queries                | Queue or stagger queries; reduce query frequency
