
How to Optimize Databricks Performance: Complete Query Tuning Guide (2025)

September 4, 2025 · e6data Team
Databricks · Query Optimization · Advanced
[Image: Redditors discussing performance issues with Databricks SQL]

Databricks cost optimization and performance tuning are critical for enterprise data teams managing large-scale analytics workloads. According to Databricks' 2024 State of Data & AI Report, organizations that implement comprehensive performance optimization strategies achieve significant cost reductions and query performance improvements.

Modern Databricks clusters face mounting performance challenges as data volumes grow. Enterprise data engineers working with Delta Lake architectures report that poorly optimized Spark queries can consume substantially more DBUs than necessary, directly impacting both operational costs and user experience.

Databricks Performance Optimization Metrics and Thresholds

Based on Databricks Runtime Performance Benchmarks, these cluster performance indicators help identify optimization opportunities across workloads (a monitoring sketch follows the table):

| Performance Metric | Optimization Threshold | Required Action |
| --- | --- | --- |
| Databricks SQL query latency | >30s for BI dashboards | Implement liquid clustering or Delta Lake Z-ordering |
| Data skipping efficiency | <70% of files eliminated | Optimize Delta table layout and statistics collection |
| Delta cache hit ratio | <60% for repeated queries | Configure cluster cache settings and Unity Catalog caching |
| Cluster CPU utilization | >85% sustained load | Scale cluster resources or optimize Spark SQL parallelism |
| Shuffle operation volume | >1GB per query | Review Spark join strategies and broadcast hints |
| Concurrent query queue time | >10s during peak hours | Enable serverless SQL or implement workload isolation |
| DBU consumption variance | >150% of cost baseline | Audit Spark execution plans and tighten cost controls |
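
To track these thresholds in practice, Unity Catalog system tables (where enabled) expose billing and query telemetry. The sketch below is a minimal monitoring query assuming access to the system.billing.usage table; table and column names may differ by workspace and release.

-- Hedged sketch: monitor daily DBU consumption against a cost baseline
-- Assumes Unity Catalog system tables are enabled in this workspace
SELECT
    usage_date,
    sku_name,
    SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC, total_dbus DESC;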

BI Dashboard Optimization Tactics

1. Implement Z-Ordering for Sub-Second Query Response

When to apply: Deploy Z-ordering when tables exceed 1GB and Databricks SQL dashboards filter on multiple dimensions such as customer_id, date_range, and product_category. According to the Databricks Delta Lake Performance Guide, traditional partitioning breaks down when users query across varying column combinations. Z-ordering excels on high-cardinality columns that appear frequently in WHERE clauses, particularly for dashboard queries that regularly filter on two or more columns.

How to implement: Z-ordering co-locates related data using space-filling curves, dramatically improving data skipping efficiency for multi-dimensional queries. Enterprise implementations can achieve significant performance improvements on point lookups and range scans when Z-ordering is implemented correctly.

-- Optimize sales table for common BI query patterns
OPTIMIZE sales_fact 
ZORDER BY (customer_id, transaction_date, product_category);

-- Enable auto-optimize for ongoing maintenance
ALTER TABLE sales_fact 
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

-- Example dashboard query that benefits from Z-ordering
SELECT 
    product_category,
    COUNT(*) as transaction_count,
    SUM(amount) as total_revenue
FROM sales_fact 
WHERE customer_id IN (12345, 67890, 11111)
    AND transaction_date >= '2024-01-01'
    AND product_category = 'Electronics'
GROUP BY product_category;

Alternatives: Delta Lake Bloom filters for high-cardinality string columns (note: effectiveness is limited and feature availability may vary), liquid clustering for evolving access patterns (note: liquid clustering replaces Z-ordering and requires Unity Catalog managed tables; see the sketch below), or migrating performance-critical queries to e6data for guaranteed sub-second latency without maintenance overhead.
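
For teams weighing the liquid clustering alternative, the sketch below shows the general shape of the DDL, assuming a runtime that supports CLUSTER BY; the table and column names are illustrative rather than part of the examples above.

-- Hedged sketch: liquid clustering as an alternative to ZORDER BY
CREATE TABLE sales_fact_clustered (
    customer_id BIGINT,
    transaction_date DATE,
    product_category STRING,
    amount DECIMAL(10,2)
) USING DELTA
CLUSTER BY (customer_id, transaction_date);

-- Clustering keys can be changed as access patterns evolve
ALTER TABLE sales_fact_clustered CLUSTER BY (customer_id, product_category);

-- Incremental clustering runs on OPTIMIZE (no ZORDER BY clause)
OPTIMIZE sales_fact_clustered;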

2. Implement Delta Cache for Repetitive Dashboard Queries

When to apply: Implement caching for tables <10GB that are accessed >5 times per hour. When dashboard users repeatedly hit the same data throughout the day, caching is crucial for maintaining sub-second response times.

How to implement: Databricks IO cache (formerly Delta Cache) stores frequently accessed data on local SSD storage, reducing network I/O and significantly improving query performance for repeated access patterns. IO cache works transparently at the file level and automatically manages cache eviction based on access patterns.

-- Enable Databricks IO cache on cluster
-- Cluster configuration: Advanced Options > Spark Config
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g

-- Optional: Use Spark in-memory caching for specific tables (separate from IO cache)
-- CACHE TABLE customer_dim;
-- CACHE TABLE product_dim;

-- Dashboard query leveraging cached dimensions
SELECT 
    p.product_name,
    c.customer_segment,
    COUNT(*) as order_count,
    SUM(o.order_value) as total_revenue
FROM orders_fact o
JOIN customer_dim c ON o.customer_id = c.customer_id  -- Benefits from IO cache
JOIN product_dim p ON o.product_id = p.product_id    -- Benefits from IO cache
WHERE o.order_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY p.product_name, c.customer_segment
HAVING COUNT(*) >= 10;

Alternatives: Result caching for identical queries, or adaptive query execution for dynamic optimization.

3. Optimize Partition Pruning with Predicate Pushdown

When to apply: Partition tables >1GB with predictable access patterns where effective partition pruning can dramatically reduce data scanning for time-based BI reports. Target file sizes of 128MB-1GB within partitions, and ensure partition cardinality is balanced (avoid too many small partitions or too few large ones).

How to implement: The key insight here is avoiding the common mistake where users partition by date but then query across customer segments or regions, resulting in full partition scans. Once you've set up partition elimination correctly, query execution times drop dramatically because Spark only reads relevant partitions rather than scanning the entire table.

-- Create properly partitioned table for BI workloads
CREATE TABLE sales_monthly_partitioned (
    transaction_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    sales_amount DECIMAL(10,2),
    region STRING,
    transaction_timestamp TIMESTAMP
) USING DELTA
PARTITIONED BY (
    transaction_month STRING  -- YYYY-MM format for monthly reports
);

-- Insert with proper partition values
INSERT INTO sales_monthly_partitioned
SELECT 
    transaction_id,
    customer_id,
    product_id,
    sales_amount,
    region,
    transaction_timestamp,
    DATE_FORMAT(transaction_timestamp, 'yyyy-MM') as transaction_month
FROM raw_sales_data;

-- BI query with effective partition pruning
SELECT 
    region,
    COUNT(*) as transaction_count,
    AVG(sales_amount) as avg_sales
FROM sales_monthly_partitioned
WHERE transaction_month = '2024-12'  -- Partition filter eliminates majority of data
    AND region IN ('North', 'South')
GROUP BY region;

Alternatives: Dynamic partition pruning for complex joins, or liquid clustering for evolving patterns.

4. Enable Serverless SQL for Consistent BI Performance

When to apply: Use serverless for BI workloads with unpredictable usage patterns where response time consistency matters more than absolute performance. Serverless scales well beyond 50 concurrent users and particularly shines for user-facing dashboards that need to eliminate the cold start problem.

How to implement: Serverless SQL eliminates the cold-start problem that plagues traditional cluster-based BI deployments: instead of waiting several minutes for cluster startup during peak usage, queries execute immediately with automatic scaling based on demand. This approach also removes the operational overhead of cluster sizing and management while providing predictable query latency.

-- Serverless SQL endpoint configuration (via UI or API)
-- No cluster management required - automatic scaling

-- BI dashboard query on serverless endpoint
WITH monthly_metrics AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        customer_segment,
        SUM(order_value) as revenue,
        COUNT(DISTINCT o.customer_id) as active_customers,
        AVG(order_value) as avg_order_size
    FROM orders_fact o
    JOIN customer_dim c ON o.customer_id = c.customer_id
    WHERE order_date >= '2024-01-01'
    GROUP BY DATE_TRUNC('month', order_date), customer_segment
)
SELECT 
    month,
    customer_segment,
    revenue,
    active_customers,
    avg_order_size,
    LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month) as prev_month_revenue,
    (revenue - LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month)) / 
        LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month) * 100 as growth_rate
FROM monthly_metrics
ORDER BY month DESC, customer_segment;

Alternatives: Right-sized clusters with auto-scaling, Databricks SQL Pro for guaranteed performance, or e6data for sub-second latency with 1000+ concurrent users without migrating from Databricks.

Ad-hoc Analytics Optimization Tactics

1. Implement Broadcast Joins for Dimension Table Performance

When to apply: Use broadcast joins when tables <200MB join with fact tables >1GB. Without broadcasting, these large fact-to-dimension joins create massive shuffle operations that can consume significantly more resources than necessary.

How to implement: Broadcast joins copy small tables to all executors, eliminating network shuffle and dramatically reducing query time for typical star schema queries. Spark's adaptive query execution automatically identifies broadcast opportunities, but you can manually control this behavior for predictable performance. What makes this particularly effective is that broadcast joins work exceptionally well with cached dimension tables, creating a powerful combination for analytical workloads.

-- Configure broadcast join thresholds
SET spark.sql.adaptive.enabled = true;
SET spark.sql.autoBroadcastJoinThreshold = 100MB;

-- Manual broadcast hint for guaranteed behavior (hint table aliases individually)
SELECT /*+ BROADCAST(c), BROADCAST(p) */
    c.customer_segment,
    p.product_category,
    COUNT(*) as order_count,
    SUM(o.order_value) as total_revenue,
    AVG(o.order_value) as avg_order_value
FROM orders_fact o
JOIN customer_dim c ON o.customer_id = c.customer_id
JOIN product_dim p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_segment, p.product_category
HAVING SUM(o.order_value) > 100000;

Alternatives: Bucketed joins for predictable data distribution, or sort-merge joins for large-to-large table joins.

2. Optimize Window Functions with Proper Partitioning

When to apply: Partition window functions when processing >10M rows. Unpartitioned window functions in analytical queries trigger expensive global sorts across the entire dataset, creating memory pressure and long execution times.

How to implement: When you implement proper window partitioning strategies, you transform these operations from cluster-wide sorts to manageable partition-level operations. Partitioning window functions by logical business dimensions like customer_id or region substantially reduces memory usage while maintaining result accuracy. Here's where it gets interesting: combining window partitioning with Z-ordering creates significant performance synergies for ranking and aggregation operations.

-- Inefficient: Global sorting across entire dataset
SELECT 
    customer_id,
    order_date,
    order_value,
    ROW_NUMBER() OVER (ORDER BY order_value DESC) as global_rank  -- Expensive!
FROM orders_fact;

-- Optimized: Partition-aware window functions
SELECT 
    customer_id,
    order_date,
    order_value,
    SUM(order_value) OVER (
        PARTITION BY customer_id 
        ORDER BY order_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_7day_total,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_value DESC
    ) as customer_order_rank,
    DENSE_RANK() OVER (
        PARTITION BY DATE_TRUNC('month', order_date)
        ORDER BY order_value DESC
    ) as monthly_rank
FROM orders_fact
WHERE order_date >= '2024-01-01';

-- Advanced: Optimize table layout for window operations
OPTIMIZE orders_fact 
ZORDER BY (customer_id, order_date);

Alternatives: Pre-aggregated materialized views for repeated calculations, or approximate functions like percentile_approx (see the sketch below).
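
As a quick illustration of the approximate-function alternative, the sketch below uses Spark SQL's approx_percentile on the same orders_fact table; the grouping column is an assumption for the example.

-- Hedged sketch: approximate percentiles instead of exact, sort-heavy calculations
SELECT
    customer_id,
    approx_percentile(order_value, 0.5) AS median_order_value,
    approx_percentile(order_value, 0.9) AS p90_order_value
FROM orders_fact
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;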

3. Leverage Adaptive Query Execution for Dynamic Optimization

When to apply: Enable AQE for all analytical workloads where complex analytical queries with multiple joins and aggregations need automatic optimization without manual intervention.

How to implement: Adaptive Query Execution (AQE) accelerates analytical query performance by making runtime decisions based on actual data statistics rather than pre-execution estimates. What makes this particularly effective is that AQE automatically handles skewed joins, optimizes shuffle partitions, and converts sort-merge joins to broadcast joins when beneficial.

-- Enable comprehensive AQE features
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.localShuffleReader.enabled = true;

-- Complex analytical query that benefits from AQE
WITH customer_metrics AS (
    SELECT 
        c.customer_id,
        c.customer_segment,
        COUNT(*) as order_count,
        SUM(o.order_value) as total_spent,
        MAX(o.order_date) as last_order_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.registration_date >= '2023-01-01'
    GROUP BY c.customer_id, c.customer_segment
),
segment_analysis AS (
    SELECT 
        customer_segment,
        AVG(total_spent) as avg_spent,
        PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_spent) as p90_spent,
        COUNT(*) as segment_size
    FROM customer_metrics
    GROUP BY customer_segment
)
SELECT 
    cm.customer_segment,
    COUNT(*) as high_value_customers,
    AVG(cm.total_spent) as avg_high_value_spent,
    sa.avg_spent as segment_avg
FROM customer_metrics cm
JOIN segment_analysis sa ON cm.customer_segment = sa.customer_segment
WHERE cm.total_spent > sa.p90_spent
GROUP BY cm.customer_segment, sa.avg_spent
ORDER BY high_value_customers DESC;

Alternatives: Manual join hints and partition tuning, cost-based optimizer statistics collection, or e6data's lakehouse query engine, which eliminates manual tuning entirely.

4. Implement Columnar Statistics for Intelligent Data Skipping

When to apply: Collect statistics for tables >1GB with selective filter patterns. Delta Lake's column-level statistics enable sophisticated data skipping that goes beyond basic partition pruning.

How to implement: When you collect statistics on frequently filtered columns, the query optimizer can skip entire files without reading them, substantially reducing I/O for selective analytical queries. Here's what happens next: the Delta Log maintains min/max statistics for each data file, allowing the query engine to eliminate files that don't contain relevant data before any actual data reading occurs. What makes this particularly effective is combining statistics with Z-ordering for maximum data skipping efficiency.

-- Delta Lake maintains file-level min/max statistics automatically
-- Focus on collecting table-level statistics for cost-based optimization

-- Analyze table to compute statistics
ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR ALL COLUMNS;

-- Query that benefits from data skipping
SELECT 
    customer_segment,
    product_category,
    SUM(sales_amount) as total_sales,
    COUNT(DISTINCT s.customer_id) as unique_customers
FROM sales_fact s
JOIN customer_dim c ON s.customer_id = c.customer_id
WHERE s.sales_amount > 1000  -- Statistics enable file skipping
    AND s.transaction_date BETWEEN '2024-11-01' AND '2024-11-30'
    AND c.customer_segment = 'Enterprise'
GROUP BY customer_segment, product_category;
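
If a wide table collects statistics on columns that are never filtered, the delta.dataSkippingNumIndexedCols table property can restrict file-level statistics to the leading columns. This is a sketch; the value of 8 is an assumption and should match how many leading columns you actually filter on.

-- Hedged sketch: limit file-level statistics to the first N columns
-- (keep frequently filtered columns early in the schema so they stay covered)
ALTER TABLE sales_fact
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');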

Alternatives: Bloom filters for high-cardinality columns (see the sketch below), manual file organization strategies, or e6data's automatic data skipping optimization.
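
For the Bloom filter alternative, Databricks exposes a CREATE BLOOMFILTER INDEX statement; the sketch below is illustrative only (the column name and option values are assumptions, and availability varies by runtime).

-- Hedged sketch: Bloom filter index on a high-cardinality string column
CREATE BLOOMFILTER INDEX
ON TABLE sales_fact
FOR COLUMNS (order_reference OPTIONS (fpp = 0.1, numItems = 50000000));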

5. Deploy Predictive I/O for Large Scan Operations

When to apply: Enable predictive I/O for sequential scan patterns >10GB. Many analytical queries follow predictable access patterns, allowing the storage layer to anticipate data needs and reduce wait times.

How to implement: Predictive I/O pre-fetches data that's likely to be accessed based on query patterns, reducing latency for large analytical scans. You'll find that predictive I/O works exceptionally well with time-series analysis and sequential data processing where queries typically access adjacent data ranges. Once you've enabled predictive optimization, scan-heavy analytical queries see meaningful latency reduction due to reduced I/O wait times.

-- Enable IO cache for improved scan performance
SET spark.databricks.io.cache.enabled = true;
-- Note: Read-ahead and prefetch optimizations are handled automatically

-- Time-series analysis that benefits from predictive I/O
WITH daily_metrics AS (
    SELECT 
        transaction_date,
        COUNT(*) as transaction_count,
        SUM(amount) as daily_revenue,
        AVG(amount) as avg_transaction_size,
        COUNT(DISTINCT customer_id) as active_customers
    FROM transactions
    WHERE transaction_date >= '2024-01-01'
    GROUP BY transaction_date
),
moving_averages AS (
    SELECT 
        transaction_date,
        daily_revenue,
        AVG(daily_revenue) OVER (
            ORDER BY transaction_date 
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as ma_7day,
        AVG(daily_revenue) OVER (
            ORDER BY transaction_date 
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
        ) as ma_30day
    FROM daily_metrics
)
SELECT 
    transaction_date,
    daily_revenue,
    ma_7day,
    ma_30day,
    (daily_revenue - ma_7day) / ma_7day * 100 as deviation_from_7day,
    CASE 
        WHEN daily_revenue > ma_30day * 1.2 THEN 'High Performance'
        WHEN daily_revenue < ma_30day * 0.8 THEN 'Low Performance'
        ELSE 'Normal'
    END as performance_category
FROM moving_averages
ORDER BY transaction_date DESC;
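
Enabling predictive optimization is typically done at the schema or catalog level rather than per query. The sketch below assumes Unity Catalog managed tables and a supported runtime; the schema and catalog names are illustrative.

-- Hedged sketch: let Databricks schedule OPTIMIZE and VACUUM automatically
ALTER SCHEMA analytics ENABLE PREDICTIVE OPTIMIZATION;

-- It can also be enabled for an entire catalog
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;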

Alternatives: Manual data pre-loading strategies (a pre-warming sketch follows), or result caching for repeated queries.
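
For the manual pre-loading alternative, Databricks supports pre-warming the disk cache with CACHE SELECT. The sketch below assumes the IO cache is enabled on the cluster; the column list and filter are illustrative.

-- Hedged sketch: pre-warm the disk (IO) cache for hot columns before peak hours
CACHE SELECT transaction_date, amount, customer_id
FROM transactions
WHERE transaction_date >= '2024-01-01';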

ETL/Streaming Optimization Tactics

1. Optimize Auto Loader for High-Throughput Ingestion

When to apply: Optimize Auto Loader when ingestion rates exceed 1GB/hour. Default configurations often underperform for high-volume ETL workloads that need efficient incremental data ingestion.

How to implement: When you optimize Auto Loader settings for your specific data patterns, you can achieve significant throughput improvements while maintaining exactly-once processing semantics. The key insight here is that Auto Loader performance depends heavily on file size, arrival patterns, and parallelism configuration. What makes this particularly effective is combining Auto Loader with Delta Lake's merge operations for upsert-heavy scenarios common in enterprise ETL pipelines.

-- Optimized Auto Loader configuration for high-throughput ingestion
CREATE OR REFRESH STREAMING LIVE TABLE raw_events_optimized
AS SELECT *
FROM cloud_files(
  "s3://your-bucket/events/",
  "json",
  map(
    "cloudFiles.format", "json",
    "cloudFiles.schemaLocation", "s3://your-bucket/schemas/events",
    "cloudFiles.inferColumnTypes", "true",
    "cloudFiles.schemaEvolutionMode", "addNewColumns",
    "cloudFiles.maxFilesPerTrigger", "1000",        -- Optimize for throughput
    "cloudFiles.maxBytesPerTrigger", "1GB",         -- Control batch size
    "cloudFiles.useNotifications", "true",          -- Enable event notifications
    "cloudFiles.validateOptions", "false"           -- Skip validation for speed
  )
);

-- DLT CDC pattern for upsert operations
APPLY CHANGES INTO LIVE.events_processed
FROM STREAM(LIVE.raw_events_optimized)
KEYS (event_id)
SEQUENCE BY event_timestamp
COLUMNS * EXCEPT (event_timestamp)
STORED AS SCD TYPE 1;

Alternatives: Delta Live Tables for complex pipelines, Kafka integration for real-time streams, or e6data's real-time streaming ingest with guaranteed throughput SLAs.

2. Implement Z-Ordering for ETL Output Tables

When to apply: Apply Z-ordering to ETL output tables >1GB that are accessed by multiple downstream consumers. When ETL processes create tables with multiple access patterns, traditional partitioning alone is insufficient for downstream analytical queries.

How to implement: Z-ordering optimizes data layout for multiple columns simultaneously, improving query performance for various analytical access patterns without sacrificing ETL throughput. You'll find that implementing Z-ordering during ETL write operations eliminates the need for separate optimization jobs while ensuring optimal performance for downstream consumers. Here's where it gets interesting: combining Z-ordering with Delta Lake's write optimization features creates a powerful foundation for both ETL efficiency and query performance.

-- ETL process with integrated Z-ordering
CREATE OR REPLACE TABLE customer_transactions_optimized (
    transaction_id BIGINT,
    customer_id BIGINT,
    customer_segment STRING,
    transaction_date DATE,
    amount DECIMAL(10,2),
    product_category STRING,
    etl_processed_time TIMESTAMP
)
USING DELTA
LOCATION 's3://your-bucket/optimized/customer_transactions'
TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
);

-- ETL transformation loading the optimized table
INSERT INTO customer_transactions_optimized
SELECT 
    t.transaction_id,
    t.customer_id,
    c.customer_segment,
    t.transaction_date,
    t.amount,
    t.product_category,
    CURRENT_TIMESTAMP() as etl_processed_time
FROM raw_transactions t
JOIN customer_master c ON t.customer_id = c.customer_id
WHERE t.transaction_date >= '2024-12-01';

-- Apply Z-ordering after ETL batch completion
-- Note: OPTIMIZE consumes DBUs and should be scheduled, not run on streaming tables
-- Important: Z-Order and Liquid Clustering cannot be used together
OPTIMIZE customer_transactions_optimized 
ZORDER BY (customer_id, transaction_date, product_category);

-- Vacuum old files to reclaim storage
-- Warning: Setting retention < 7 days requires disabling retention check and has data loss risk
VACUUM customer_transactions_optimized RETAIN 168 HOURS;

Alternatives: Liquid clustering for evolving access patterns, or partition-based organization.

3. Leverage Delta Lake Change Data Feed for Incremental Processing

When to apply: Implement CDF for tables with <50% daily change rates and downstream dependency chains; tracking only changed records enables efficient incremental ETL instead of reprocessing entire datasets.

How to implement: Change Data Feed (CDF) enables you to dramatically reduce ETL runtime while maintaining data consistency across downstream systems. The beauty of this approach is that CDF automatically captures insert, update, and delete operations with versioning information, allowing downstream processes to apply changes incrementally. What makes this particularly effective is combining CDF with streaming processing for near real-time data pipeline updates.

-- Enable Change Data Feed on source table
ALTER TABLE customer_master 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- For streaming CDF consumption, use DataFrame API with readChangeData
-- Note: table_changes() is for batch processing only
-- Example in Python/PySpark:
-- df = spark.readStream.format("delta") \
--   .option("readChangeData", "true") \
--   .table("customer_master") \
--   .where("_change_type in ('insert','update_postimage')")

-- Alternative: Use DLT APPLY CHANGES for CDC patterns
APPLY CHANGES INTO LIVE.customer_analytics_summary
FROM STREAM(LIVE.customer_master)
KEYS (customer_id)
SEQUENCE BY _commit_timestamp
COLUMNS * EXCEPT (_change_type, _commit_version, _commit_timestamp)
STORED AS SCD TYPE 1;

-- Apply incremental changes to downstream analytics table
MERGE INTO customer_analytics_summary t
USING customer_changes_stream s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s._change_type = 'update_postimage' THEN
    UPDATE SET 
        t.customer_segment = s.customer_segment,
        t.last_transaction_date = s.last_transaction_date,
        t.lifetime_value = s.lifetime_value,
        t.updated_timestamp = CURRENT_TIMESTAMP()
WHEN NOT MATCHED AND s._change_type = 'insert' THEN
    INSERT (customer_id, customer_segment, last_transaction_date, lifetime_value, created_timestamp)
    VALUES (s.customer_id, s.customer_segment, s.last_transaction_date, s.lifetime_value, CURRENT_TIMESTAMP());
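
For batch-oriented incremental processing, the table_changes function reads the change feed between versions. The sketch below is illustrative; version 5 stands in for whatever version or timestamp your pipeline last processed.

-- Hedged sketch: batch read of the change feed since a known version
SELECT *
FROM table_changes('customer_master', 5)
WHERE _change_type IN ('insert', 'update_postimage');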

Alternatives: Timestamp-based incremental processing, Delta Live Tables for complex dependencies.

4. Optimize Small File Handling with Auto Compaction

When to apply: Enable auto compaction for ETL processes writing >100 files per hour. Small file proliferation is a common ETL bottleneck that degrades query performance and increases storage costs.

How to implement: Auto compaction automatically merges small files during write operations, maintaining optimal file sizes without requiring separate maintenance jobs. Here's what happens next: Delta Lake monitors file sizes during write operations and automatically triggers compaction when files fall below optimal thresholds. Once you've enabled auto compaction, ETL processes maintain consistent performance without manual intervention while preventing the small file problem that plagues many data lake implementations.

-- Configure auto compaction for ETL tables
ALTER TABLE transaction_staging 
SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',    -- Optimize during writes
    'delta.autoOptimize.autoCompact' = 'true',      -- Auto compact small files
    'delta.tuneFileSizesForRewrites' = 'true'       -- Optimize for compaction
);

-- ETL process with optimized write patterns
CREATE OR REPLACE TEMPORARY VIEW transaction_batch AS
SELECT 
    transaction_id,
    customer_id,
    amount,
    transaction_date,
    product_category,
    CURRENT_TIMESTAMP() as processed_timestamp
FROM raw_transaction_stream
WHERE processing_date = CURRENT_DATE();

-- Batch insert with automatic optimization
INSERT INTO transaction_staging
SELECT * FROM transaction_batch;

5. Implement Streaming Aggregations for Real-Time ETL

When to apply: Implement streaming aggregations when latency requirements are <15 minutes and data rates exceed 1,000 events/second. Computing metrics in real time during ETL processing eliminates the need for separate batch aggregation jobs.

How to implement: When you implement streaming aggregations with proper windowing and watermarking, you achieve near real-time analytics while maintaining exactly-once processing guarantees. You'll find that streaming aggregations work exceptionally well for time-based metrics like hourly sales totals or running customer lifetime value calculations. The key insight here is that proper watermarking configuration ensures accurate results while managing memory usage for long-running streaming jobs.

-- Streaming aggregation for real-time metrics
CREATE OR REFRESH STREAMING LIVE TABLE hourly_sales_metrics
AS SELECT 
    window.start as hour_start,
    window.end as hour_end,
    product_category,
    COUNT(*) as transaction_count,
    SUM(amount) as hourly_revenue,
    AVG(amount) as avg_transaction_size,
    COUNT(DISTINCT customer_id) as unique_customers,
    CURRENT_TIMESTAMP() as computed_timestamp
FROM stream(LIVE.transaction_stream)
GROUP BY 
    window(transaction_timestamp, '1 hour'),
    product_category;

-- Advanced streaming aggregation with watermarking
CREATE OR REFRESH STREAMING LIVE TABLE customer_lifetime_value_stream
COMMENT "Real-time customer lifetime value computation"
AS SELECT 
    customer_id,
    SUM(amount) as lifetime_value,
    COUNT(*) as total_transactions,
    MAX(transaction_timestamp) as last_transaction_time,
    CURRENT_TIMESTAMP() as last_updated
FROM stream(LIVE.transaction_stream)
-- Note: Proper watermarking requires the DataFrame API
-- Python example: df.withWatermark("transaction_timestamp", "1 hour")
-- A WHERE clause on the timestamp is only a filter, not a watermark, and will not bound state
GROUP BY customer_id;

Alternatives: Micro-batch processing with Delta Live Tables, Kafka Streams for complex event processing.

When Databricks optimization reaches its limits: The e6data alternative

e6data is a decentralized, Kubernetes-native lakehouse compute engine delivering 10x faster query performance with 60% lower compute costs through per-vCPU billing and zero data movement. It runs directly on existing data formats (Delta/Iceberg/Hudi, Parquet, CSV, JSON), requiring no migration or rewrites. Teams often keep their existing Databricks platform for development workflows while offloading performance-critical queries to e6data for sub-second latency and 1000+ QPS concurrency.

Key benefits of the e6data approach:

  • Superior performance architecture: Decentralized vs. legacy centralized systems eliminates coordinator bottlenecks, delivers sub-second latency, and handles 1000+ concurrent users without SLA degradation through Kubernetes-native stateless services
  • Zero vendor lock-in: Point directly at current lakehouse data with no movement, migrations, or architectural changes required. Full compatibility with existing governance, catalogs, and BI tools
  • Predictable scaling & costs: Granular 1-vCPU increment scaling with per-vCPU billing eliminates cluster waste and surprise cost spikes. Instant performance with no cluster spin-up time or manual tuning overhead

Start a free trial of e6data and see performance comparison on your own workloads. Use our cost calculator to estimate potential gains.
