Mastering Snowflake’s Query Profile: A Guide to Performance Tuning
When it comes to optimizing queries in Snowflake, the Query Profile is your secret weapon. This powerful tool provides a detailed breakdown of how your query is executed, allowing you to identify bottlenecks and opportunities for optimization. In this guide, we’ll walk through a real-world example of how to use the Query Profile to dramatically improve query performance.
How to Access the Query Profile
In the left-hand navigation menu, select Monitoring > Query History, then click your query and open the Query Profile tab.
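If you’d rather pull the same information with SQL, Snowflake also exposes operator-level profile statistics through the GET_QUERY_OPERATOR_STATS table function. A minimal sketch, run right after the query you want to inspect:
-- Operator-level profile stats for the most recent query in this session
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));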
Understanding the Query Profile
Before we dive into our example, let’s quickly review the key components of a Snowflake Query Profile:
- Query Plan: A visual representation of how Snowflake executes your query, with each operation shown as a node.
- Most Expensive Nodes: Highlights the operations consuming the most resources.
- Profile Overview: Breaks down execution time across different phases (Processing, I/O, etc.).
- Statistics: Provides detailed metrics on data scanned, partitions accessed, and more.
A Real-World Optimization Example
Let’s look at a complex query involving multiple CTEs and joins:
WITH lineitem_details AS (
SELECT
l_orderkey AS orderkey
, FIRST_VALUE(l_shipdate) OVER (PARTITION BY l_orderkey ORDER BY l_shipdate ASC) AS first_ship_date
, l_linestatus AS line_status
, l_returnflag AS return_flag
, l_discount AS discount
, l_shipmode AS ship_mode
FROM lineitem
),
order_details AS (
-- All orders with first item shipped within 90 days
-- Sum discount where ship mode is AIR
SELECT
o.o_custkey
, o.o_orderdate
, l.first_ship_date
, SUM(l.discount) AS total_discount
FROM lineitem_details AS l
JOIN orders AS o
ON l.orderkey = o.o_orderkey
AND o.o_orderdate > l.first_ship_date - INTERVAL '90 days'
WHERE
l.ship_mode = 'AIR'
GROUP BY 1,2,3
)
-- We only want to see this for BUILDING customers
SELECT
c.c_name
, c.c_mktsegment
, o.o_orderdate
, o.first_ship_date
, o.total_discount
FROM customer AS c
LEFT JOIN order_details AS o
ON c.c_custkey = o.o_custkey
WHERE
c.c_mktsegment = 'BUILDING'
Step 1: Identify the Bottleneck
Looking at the Query Profile for this query, we can immediately spot two bottlenecks:
1. Window Function Performance
The WindowFunction operation is consuming a whopping 79.7% of the total execution time. It processes 600 million rows, slowing the query down significantly.
2. Data Spilling
Looking at the Statistics section, we see “Bytes spilled to local storage: 15.61GB”. This indicates that the query is using more memory than allocated to the warehouse, forcing it to write temporary data to disk. This process, known as data spilling, can significantly slow down query execution.
Let’s break down why these bottlenecks are occurring:
- Window Function Overhead: The window function is applied to the entire LINEITEM table, all 600 million rows, even though we only need this calculation for a subset of customers (those in the ‘BUILDING’ market segment).
- Memory Constraints: The volume of data being processed, combined with memory-intensive operations like window functions and joins, exceeds the memory of our current warehouse. This forces Snowflake to spill data to disk, which is much slower than in-memory processing.
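Spilling is also easy to spot across many queries at once, without opening each profile. Here’s a sketch against the ACCOUNT_USAGE.QUERY_HISTORY view (note it can lag real-time activity by up to 45 minutes):
-- Recent queries that spilled to local or remote storage
SELECT
    query_id
  , warehouse_name
  , bytes_spilled_to_local_storage
  , bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
    OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_local_storage DESC;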
Step 2: Analyze and Optimize
To address these bottlenecks, we need to focus on two main areas:
- Reducing the amount of data processed by the window function
- Minimizing memory usage to reduce or eliminate data spilling
Let’s optimize the query by:
- Filtering customers earlier in the query
- Reducing the number of rows processed by the window function
- (Optional) Upsizing our warehouse (e.g., from XS to S) to address memory issues
Let’s try query optimization first. Here’s the optimized version:
WITH base_customers AS (
-- Set up a base_customers CTE to pre-filter lineitem
SELECT
c_custkey
, c_name
, c_mktsegment
FROM customer
WHERE
c_mktsegment = 'BUILDING'
),
lineitem_details AS (
SELECT
l_orderkey AS orderkey
, FIRST_VALUE(l_shipdate) OVER (PARTITION BY l_orderkey ORDER BY l_shipdate ASC) AS first_ship_date
, l_linestatus AS line_status
, l_returnflag AS return_flag
, l_discount AS discount
, l_shipmode AS ship_mode
FROM lineitem AS l
WHERE
EXISTS (
-- Keep all line items where there is a match on order key
SELECT 1
FROM orders AS o
JOIN base_customers AS c
ON o.o_custkey = c.c_custkey
WHERE
o.o_orderkey = l.l_orderkey
)
),
order_details AS (
-- All orders with first item shipped within 90 days
-- Total discount on ship mode AIR
SELECT
o.o_custkey
, o.o_orderdate
, l.first_ship_date
, SUM(l.discount) AS total_discount
FROM lineitem_details AS l
JOIN orders AS o
ON l.orderkey = o.o_orderkey
AND o.o_orderdate > l.first_ship_date - INTERVAL '90 days'
WHERE
l.ship_mode = 'AIR'
GROUP BY 1,2,3
)
SELECT
c.c_name
, c.c_mktsegment
, o.o_orderdate
, o.first_ship_date
, o.total_discount
FROM base_customers AS c
LEFT JOIN order_details AS o
ON c.c_custkey = o.o_custkey
Step 3: Compare Results
After running the optimized query, we see a dramatic improvement in the Query Profile: execution time drops from 2m 12s to 40s.
The WindowFunction operation now processes 480 million fewer rows, significantly reducing its execution time and overall query duration.
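To verify an improvement like this without eyeballing the profile, you can compare operator-level stats for the two runs. A sketch using GET_QUERY_OPERATOR_STATS (substitute each run’s query ID for the placeholder):
-- Rows produced by each WindowFunction operator in a given run
SELECT
    operator_id
  , operator_type
  , operator_statistics:output_rows::NUMBER AS output_rows
FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'))
WHERE operator_type = 'WindowFunction';
Run it once with each query ID and compare the output_rows values.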
Key Optimization Techniques
- Push filters down: By filtering customers earlier, we reduced the amount of data processed in subsequent steps.
- Minimize data movement: The EXISTS clause in the lineitem_details CTE ensures we only process relevant line items.
- Optimize window functions: By reducing the input data to the window function, we dramatically improved its performance.
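One more window-function tweak worth testing here: since the query orders by l_shipdate ascending and takes only the first value, FIRST_VALUE is equivalent to MIN over the same partition, which drops the ORDER BY (and its per-partition sort). A sketch of the rewrite inside lineitem_details:
SELECT
    l_orderkey AS orderkey
    -- Same result as FIRST_VALUE(l_shipdate) ... ORDER BY l_shipdate ASC
  , MIN(l_shipdate) OVER (PARTITION BY l_orderkey) AS first_ship_date
FROM lineitem;
Whether this helps depends on the plan Snowflake picks, so check the profile after making the change.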
Addressing Remaining Data Spilling
Even after our optimizations, some data spilling remains. While we’ve significantly reduced it from 15.61GB to 1.02GB, there’s still room for improvement. Let’s discuss how to address this:
Option 1: Upsizing the Warehouse
Increasing the size of your Snowflake warehouse is often an effective way to eliminate data spilling. Larger warehouses provide more memory and compute power, reducing the likelihood of running out of memory during query execution.
Pros of upsizing:
- Simple to implement
- Can resolve spilling issues immediately
- May improve overall query performance
Cons of upsizing:
- Increases cost per second of warehouse runtime
- May be overkill if the spilling is minimal or infrequent
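If you do decide to upsize, it’s a one-line change; just remember to size back down once the heavy workload is done (the warehouse name here is illustrative):
-- Upsize one step (XS -> S) to relieve memory pressure
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';
-- ...run the workload and re-check the profile for spilling...
-- Size back down when finished
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';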
Option 2: Further Query Optimization
Before resorting to upsizing, consider whether there are other optimizations to make to the query:
- Break down complex operations: If possible, split the query into smaller, more manageable parts.
- Optimize JOIN operations: Ensure all JOINs are necessary and use appropriate join types.
- Review GROUP BY clauses: Large GROUP BY operations can cause spilling. Consider pre-aggregating data if possible, as in the sketch below.
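For instance, the per-order values this query needs could be computed in a single pass over LINEITEM, replacing the window function with a plain aggregation so later joins touch one row per order. A hypothetical sketch (lineitem_agg is an illustrative name, and the logic assumes the semantics above: first ship date across all line items, discount summed only for AIR):
-- Pre-aggregate to one row per order before any joins
CREATE OR REPLACE TRANSIENT TABLE lineitem_agg AS
SELECT
    l_orderkey
  , MIN(l_shipdate) AS first_ship_date
    -- Orders with no AIR line items get 0 here; filter them out
    -- downstream if you need the original WHERE semantics
  , SUM(IFF(l_shipmode = 'AIR', l_discount, 0)) AS air_discount
FROM lineitem
GROUP BY l_orderkey;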
Conclusion
The Snowflake Query Profile is an invaluable tool for identifying and resolving performance bottlenecks. By systematically analyzing and optimizing your queries, you can achieve significant improvements in execution time and resource utilization.
Remember, query optimization is an iterative process. Always test your optimizations and use the Query Profile to validate your improvements.
Struggling with Snowflake costs?
All usage-based cloud platforms can get expensive when not used carefully. There are a ton of controls teams can fiddle with to get a handle on their Snowflake costs. At Greybeam, we’ve built a query performance and observability platform that automagically optimizes SQL queries sent to Snowflake, saving you thousands in compute costs. Reach out to kyle@corktownanalytics.com to learn more about how we can optimize your Snowflake environment.