Introduction to BigQuery SQL
BigQuery is Google Cloud’s fully managed, serverless data warehouse for analyzing large datasets with SQL. It combines the flexibility of SQL with the scalability of cloud computing, letting you analyze terabytes or petabytes of data without managing infrastructure. BigQuery uses the GoogleSQL dialect (compliant with the SQL 2011 standard), making it accessible to anyone familiar with other SQL implementations while offering powerful features for big data analytics.
Core Concepts and Fundamentals
Basic BigQuery Architecture
- Projects: Containers for datasets, tables, and other resources
- Datasets: Collections of tables and views within a project
- Tables: Structured data storage following a schema
- Views: Virtual tables created by a SQL query
- Materialized Views: Precomputed views that periodically cache the results of a query (a short DDL sketch of how these resources fit together follows this list)
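A minimal DDL sketch of the resource hierarchy, using the hypothetical names `my-project` and `analytics`:
-- Datasets are created inside a project (CREATE SCHEMA is the dataset DDL)
CREATE SCHEMA IF NOT EXISTS `my-project.analytics`;
-- Tables live inside a dataset and follow a schema
CREATE TABLE IF NOT EXISTS `my-project.analytics.events` (
  id STRING,
  created_at TIMESTAMP
);
-- Views are saved queries over tables in a dataset
CREATE VIEW IF NOT EXISTS `my-project.analytics.recent_events` AS
SELECT id, created_at
FROM `my-project.analytics.events`
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);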
Data Types
Type | Description | Example |
---|---|---|
STRING | UTF-8 encoded string | 'Hello World' |
INT64 | 64-bit signed integer | 42 |
FLOAT64 | Double precision floating point | 3.14159 |
BOOL | Boolean value | TRUE, FALSE |
BYTES | Binary data | b'Hello' |
DATE | Calendar date | DATE '2025-05-10' |
DATETIME | Date and time | DATETIME '2025-05-10 12:30:00' |
TIME | Time of day | TIME '12:30:00' |
TIMESTAMP | Absolute point in time | TIMESTAMP '2025-05-10 12:30:00 UTC' |
ARRAY | Ordered list of values | [1, 2, 3] |
STRUCT | Container of ordered fields | STRUCT('John' AS name, 42 AS age) |
GEOGRAPHY | Points, lines, and polygons on the Earth's surface | ST_GEOGPOINT(-122.4, 37.8) |
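As a quick check of these literals, the following standalone query builds one value of each common type (no table required; aliases are arbitrary):
SELECT
  'Hello World' AS string_value,
  42 AS int64_value,
  3.14159 AS float64_value,
  TRUE AS bool_value,
  DATE '2025-05-10' AS date_value,
  TIMESTAMP '2025-05-10 12:30:00 UTC' AS timestamp_value,
  [1, 2, 3] AS array_value,
  STRUCT('John' AS name, 42 AS age) AS struct_value,
  ST_GEOGPOINT(-122.4, 37.8) AS geography_value;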
Essential BigQuery SQL Operations
Basic Queries
-- Simple SELECT query with filtering
SELECT
column1,
column2,
column3
FROM
`project.dataset.table`
WHERE
condition
LIMIT 1000;
-- Query with aggregation
SELECT
category,
COUNT(*) AS count,
SUM(value) AS total_value
FROM
`project.dataset.table`
GROUP BY
category
HAVING
COUNT(*) > 10
ORDER BY
total_value DESC;
Working with Multiple Tables
-- INNER JOIN
SELECT
a.id,
a.name,
b.value
FROM
`project.dataset.tableA` AS a
INNER JOIN
`project.dataset.tableB` AS b
ON
a.id = b.id;
-- LEFT JOIN
SELECT
a.id,
a.name,
b.value
FROM
`project.dataset.tableA` AS a
LEFT JOIN
`project.dataset.tableB` AS b
ON
a.id = b.id;
-- UNION ALL
SELECT * FROM `project.dataset.table1`
UNION ALL
SELECT * FROM `project.dataset.table2`;
Working with Nested and Repeated Data
-- Query with UNNEST to flatten arrays
SELECT
id,
item
FROM
`project.dataset.table`,
UNNEST(array_column) AS item;
-- Access nested fields
SELECT
id,
struct_column.field1,
struct_column.field2
FROM
`project.dataset.table`;
Advanced BigQuery Features
Window Functions
-- Rank values within partitions
SELECT
category,
value,
RANK() OVER (PARTITION BY category ORDER BY value DESC) AS rank
FROM
`project.dataset.table`;
-- Calculate running totals
SELECT
date,
value,
SUM(value) OVER (ORDER BY date) AS running_total
FROM
`project.dataset.table`;
-- Calculate moving averages
SELECT
date,
value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_average
FROM
`project.dataset.table`;
Advanced Analytical Functions
-- Calculate percentiles
SELECT
APPROX_QUANTILES(value, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(value, 100)[OFFSET(90)] AS percentile_90
FROM
`project.dataset.table`;
-- Clustering analysis: assign each row to its nearest centroid
-- using a trained k-means model via ML.PREDICT
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `project.dataset.kmeans_model`,
    TABLE `project.dataset.features`
  );
User-Defined Functions (UDFs)
-- JavaScript UDF
CREATE TEMP FUNCTION multiplyByTwo(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x * 2;
""";
SELECT
value,
multiplyByTwo(value) AS doubled_value
FROM
`project.dataset.table`;
-- SQL UDF
CREATE TEMP FUNCTION categorize(value FLOAT64)
RETURNS STRING
AS (
CASE
WHEN value < 0 THEN 'Negative'
WHEN value = 0 THEN 'Zero'
ELSE 'Positive'
END
);
SELECT
value,
categorize(value) AS category
FROM
`project.dataset.table`;
Data Manipulation and Governance
Data Manipulation Language (DML)
-- Insert new rows
INSERT INTO `project.dataset.table` (column1, column2)
VALUES ('value1', 'value2'), ('value3', 'value4');
-- Insert using a query
INSERT INTO `project.dataset.target_table` (column1, column2)
SELECT column1, column2
FROM `project.dataset.source_table`
WHERE condition;
-- Update values
UPDATE `project.dataset.table`
SET column1 = 'new_value'
WHERE condition;
-- Delete rows
DELETE FROM `project.dataset.table`
WHERE condition;
-- Merge (upsert) operation
MERGE INTO `project.dataset.target_table` T
USING `project.dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
UPDATE SET T.column1 = S.column1, T.column2 = S.column2
WHEN NOT MATCHED THEN
INSERT (id, column1, column2)
VALUES (S.id, S.column1, S.column2);
Data Definition Language (DDL)
-- Create a table
CREATE TABLE `project.dataset.table` (
id STRING,
name STRING,
value FLOAT64,
ts TIMESTAMP,
tags ARRAY<STRING>
)
PARTITION BY DATE(ts)
CLUSTER BY id;
-- Create a view
CREATE VIEW `project.dataset.view` AS
SELECT id, name, value
FROM `project.dataset.table`
WHERE condition;
-- Create a materialized view
CREATE MATERIALIZED VIEW `project.dataset.mat_view` AS
SELECT
date,
category,
SUM(value) AS total_value
FROM
`project.dataset.table`
GROUP BY
date, category;
Performance Optimization Techniques
Partitioning and Clustering
-- Create a partitioned table
CREATE TABLE `project.dataset.partitioned_table`
PARTITION BY
DATE(timestamp_column) -- Time-unit based partitioning
-- OR
-- RANGE_BUCKET(integer_column, GENERATE_ARRAY(0, 100, 10)) -- Integer range partitioning
AS SELECT * FROM `project.dataset.source_table`;
-- Create a clustered table
CREATE TABLE `project.dataset.clustered_table`
CLUSTER BY
category, region
AS SELECT * FROM `project.dataset.source_table`;
-- Query a partitioned table (with pruning)
SELECT *
FROM `project.dataset.partitioned_table`
WHERE DATE(timestamp_column) BETWEEN '2025-01-01' AND '2025-01-31';
Cost Optimization Techniques
Technique | Impact | Implementation |
---|---|---|
Column Selection | Reduces bytes scanned | Select only needed columns, avoid SELECT * |
Partition Pruning | Reduces data scanned | Query with filters on partitioning columns |
Cluster Pruning | Improves filter efficiency | Filter on clustering columns |
Materialized Views | Reuses computation | Create for frequently used aggregations |
Result Caching | Avoids repeated work | Identical queries within 24 hours use cached results |
Query Approximation | Trades accuracy for speed | Use APPROX_ functions for aggregations (example below) |
Dry Runs | Estimate cost without running | Use the bq --dry_run flag or the console query validator (LIMIT does not reduce bytes scanned) |
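As an example of query approximation, APPROX_COUNT_DISTINCT trades a small, bounded error for far less memory and compute than an exact distinct count (a sketch with placeholder table and column names):
-- Exact distinct count: can be memory-intensive on very large tables
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM `project.dataset.events`;
-- Approximate distinct count: typically around 1% error, much cheaper at scale
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`;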
Common Query Anti-Patterns
Anti-Pattern | Better Approach |
---|---|
SELECT * | Specify only needed columns |
UNION DISTINCT instead of UNION ALL | Use UNION ALL when duplicates are acceptable (avoids deduplicating every row) |
Joining on transformed columns | Pre-transform in subqueries, use original columns in joins |
Filtering after aggregation | Push filters into subqueries before aggregation |
Querying all partitions | Limit date ranges when possible |
Complex nested subqueries | Use CTEs (WITH clauses) for readability and optimization (see the sketch below) |
Repeated calculation | Use CTEs or window functions to calculate once |
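For instance, an aggregation whose result is needed in several places can be named once in a CTE and reused instead of being repeated in nested subqueries (a sketch with placeholder names):
-- Name the aggregation once, then reuse it
WITH daily_totals AS (
  SELECT
    DATE(ts) AS day,
    SUM(value) AS total_value
  FROM `project.dataset.table`
  GROUP BY day
)
SELECT
  day,
  total_value,
  total_value - LAG(total_value) OVER (ORDER BY day) AS day_over_day_change
FROM daily_totals;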
Common Task Examples
Time Series Analysis
-- Time-based aggregation with proper handling of time zones
SELECT
TIMESTAMP_TRUNC(timestamp_column, DAY, 'US/Pacific') AS day,
COUNT(*) AS event_count
FROM
`project.dataset.events`
GROUP BY
day
ORDER BY
day;
-- Time-based window functions
-- Note: BigQuery RANGE frames require a numeric ORDER BY key,
-- so convert the timestamp to epoch seconds for an interval-based frame
SELECT
  timestamp_column AS time,
  value,
  AVG(value) OVER (
    ORDER BY UNIX_SECONDS(timestamp_column)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW  -- trailing 1-hour window
  ) AS rolling_average
FROM
  `project.dataset.measurements`;
Geographic Analysis
-- Calculate distances between points
SELECT
id,
ST_DISTANCE(
ST_GEOGPOINT(longitude1, latitude1),
ST_GEOGPOINT(longitude2, latitude2)
) AS distance_meters
FROM
`project.dataset.locations`;
-- Find points within a radius
SELECT
id, name
FROM
`project.dataset.places`
WHERE
ST_DWITHIN(
ST_GEOGPOINT(longitude, latitude),
ST_GEOGPOINT(-122.4194, 37.7749), -- San Francisco
5000 -- 5km radius
);
Regular Expressions
-- Extract patterns
SELECT
text,
-- REGEXP_EXTRACT allows at most one capturing group, so match the whole pattern
REGEXP_EXTRACT(text, r'\d{3}-\d{3}-\d{4}') AS phone_number
FROM
`project.dataset.documents`;
-- Replace patterns
SELECT
email,
REGEXP_REPLACE(email, r'@.*$', '@example.com') AS anonymized_email
FROM
`project.dataset.users`;
-- Check if pattern exists
SELECT
url,
REGEXP_CONTAINS(url, r'^https://') AS is_secure
FROM
`project.dataset.links`;
BigQuery ML
Creating Models
-- Create a linear regression model
CREATE OR REPLACE MODEL `project.dataset.linear_model`
OPTIONS(
model_type='LINEAR_REG',
input_label_cols=['target']
) AS
SELECT
feature1,
feature2,
feature3,
target
FROM
`project.dataset.training_data`;
-- Create a classification model
CREATE OR REPLACE MODEL `project.dataset.classifier`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['is_converted']
) AS
SELECT
feature1,
feature2,
feature3,
is_converted
FROM
`project.dataset.training_data`;
Using Models
-- Make predictions
SELECT
*
FROM
ML.PREDICT(
MODEL `project.dataset.model`,
TABLE `project.dataset.prediction_input`
);
-- Evaluate model
SELECT
*
FROM
ML.EVALUATE(
MODEL `project.dataset.model`,
TABLE `project.dataset.test_data`
);
-- Extract feature importance (tree-based models; use ML.WEIGHTS for linear models)
SELECT
*
FROM
ML.FEATURE_IMPORTANCE(
MODEL `project.dataset.model`
);
Best Practices
Query Optimization
- Write queries to minimize data processed (filter early, select specific columns)
- Use partitioning and clustering aligned with your query patterns
- Materialize intermediate results for complex multi-stage queries
- Use appropriate JOIN types and join on columns with similar distributions
- Leverage approximate aggregation functions for large datasets
- Use query parameterization to improve caching and security (see the sketch after this list)
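A minimal parameterized-query sketch; `@start_date` and `@min_value` are placeholders bound by the client (for example, bq query --parameter or the client libraries) rather than interpolated into the SQL string:
-- Parameters block SQL injection and keep the query text stable, which helps caching
SELECT
  id,
  value
FROM
  `project.dataset.table`
WHERE
  DATE(ts) >= @start_date
  AND value > @min_value;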
Data Organization
- Partition tables by date for time-series data
- Cluster large tables on frequently filtered columns (max 4 columns)
- Use nested and repeated fields appropriately for hierarchical data (see the schema sketch after this list)
- Consider column ordering for better compression (similar data together)
- Use appropriate data types (e.g., INT64 instead of STRING for numbers)
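A sketch of hierarchical data modeled with nested and repeated fields instead of a separate child table (all names hypothetical):
-- Each order carries its line items inline as a repeated STRUCT
CREATE TABLE `project.dataset.orders` (
  order_id STRING,
  order_date DATE,
  line_items ARRAY<STRUCT<
    sku STRING,
    quantity INT64,
    unit_price FLOAT64
  >>
)
PARTITION BY order_date;
-- Flatten the line items back out when needed
SELECT
  order_id,
  item.sku,
  item.quantity * item.unit_price AS line_total
FROM
  `project.dataset.orders`,
  UNNEST(line_items) AS item;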
Cost Control
- Set up project-level query quotas and user-level limits
- Monitor with BigQuery Monitoring in Cloud Monitoring
- Use flat-rate pricing for predictable workloads
- Implement row-level access policies instead of creating filtered views
- Use INFORMATION_SCHEMA views to monitor usage patterns (example after this list)
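For example, this sketch surfaces the most expensive recent queries via INFORMATION_SCHEMA (adjust the `region-us` qualifier to match your data's region):
-- Top queries by bytes processed over the last 7 days
SELECT
  user_email,
  query,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY
  total_bytes_processed DESC
LIMIT 20;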
Troubleshooting Common Issues
Issue | Possible Causes | Solutions |
---|---|---|
Query timeout | Query too complex, insufficient slots | Break into smaller queries, optimize joins, increase slots |
Out of memory | JOIN producing too many rows | Filter before JOIN, use approximation functions |
Slow query performance | Missing partitioning, inefficient JOINs | Optimize table structure, rewrite the query, review the query execution details (BigQuery has no EXPLAIN statement) |
“Resources exceeded” error | Query processing too much data | Reduce data scanned, filter earlier, use partitioning |
High costs | Inefficient queries, scanning too much data | Optimize queries, use caching, implement cost controls |
Quota limits reached | Too many concurrent queries | Implement query queue, optimize query frequency |