Databricks cost optimization and performance tuning are critical for enterprise data teams managing large-scale analytics workloads. According to Databricks' 2024 State of Data & AI Report, organizations that implement comprehensive performance optimization strategies can achieve significant cost reductions and query performance improvements.
Modern Databricks clusters face performance challenges as data volumes grow exponentially. Enterprise data engineers working with Delta Lake architectures report that poorly optimized Spark queries can consume substantially more DBUs than necessary, directly impacting both operational costs and user experience.
Based on Databricks Runtime Performance Benchmarks, several cluster performance indicators help identify optimization opportunities across workloads.
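One practical way to surface the cost-side indicators is to query the Unity Catalog billing system table for DBU consumption trends. The sketch below is a minimal example, assuming system tables are enabled in your workspace; the system.billing.usage schema shown follows Databricks documentation but may vary by account configuration.

-- Hedged sketch: DBU consumption by SKU over the last 30 days (assumes system tables are enabled)
SELECT
  sku_name,
  usage_date,
  SUM(usage_quantity) as dbus_consumed
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY sku_name, usage_date
ORDER BY dbus_consumed DESC;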
When to apply: Deploy Z-ordering when tables exceed 1GB and Databricks SQL dashboards filter on multiple dimensions such as customer_id, date_range, and product_category. According to the Databricks Delta Lake Performance Guide, traditional partitioning breaks down when users query across varied column combinations. Z-ordering excels on high-cardinality columns that appear frequently in WHERE clauses, particularly for dashboard queries that regularly filter on two or more columns.
How to implement: Z-ordering co-locates related data using space-filling curves, dramatically improving data skipping efficiency for multi-dimensional queries. When implemented correctly, it can deliver significant gains on point lookups and range scans in enterprise deployments.
-- Optimize sales table for common BI query patterns
OPTIMIZE sales_fact
ZORDER BY (customer_id, transaction_date, product_category);

-- Enable auto-optimize for ongoing maintenance
ALTER TABLE sales_fact
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

-- Example dashboard query that benefits from Z-ordering
SELECT
  product_category,
  COUNT(*) as transaction_count,
  SUM(amount) as total_revenue
FROM sales_fact
WHERE customer_id IN (12345, 67890, 11111)
  AND transaction_date >= '2024-01-01'
  AND product_category = 'Electronics'
GROUP BY product_category;
Alternatives: Delta Lake Bloom filters for high-cardinality string columns (note: effectiveness is limited and feature availability may vary), databricks liquid clustering for evolving access patterns (note: Liquid Clustering replaces Z-Order and requires Unity Catalog managed tables), or migrating performance-critical queries to e6data for guaranteed sub-second latency without maintenance overhead.
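If liquid clustering fits your tables better than Z-ordering, a minimal sketch looks like the following, assuming a Databricks Runtime that supports CLUSTER BY on Unity Catalog managed tables; the table and column names are illustrative.

-- Hedged sketch: liquid clustering as an alternative to Z-ordering
CREATE TABLE sales_fact_clustered (
  customer_id BIGINT,
  transaction_date DATE,
  product_category STRING,
  amount DECIMAL(10,2)
) CLUSTER BY (customer_id, transaction_date);

-- Clustering keys can be changed as access patterns evolve
ALTER TABLE sales_fact_clustered CLUSTER BY (customer_id, product_category);

-- OPTIMIZE triggers incremental clustering (no ZORDER BY clause)
OPTIMIZE sales_fact_clustered;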
When to apply: Implement caching for tables <10GB that are accessed more than 5 times per hour. When dashboard users repeatedly request the same data throughout the day, caching is crucial for maintaining sub-second response times.
How to implement: Databricks IO cache (formerly Delta Cache) stores frequently accessed data on local SSD storage, reducing network I/O and significantly improving query performance for repeated access patterns. IO cache works transparently at the file level and automatically manages cache eviction based on access patterns.
-- Enable Databricks IO cache on cluster
-- Cluster configuration: Advanced Options > Spark Config
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g
-- Optional: Use Spark in-memory caching for specific tables (separate from IO cache)
-- CACHE TABLE customer_dim;
-- CACHE TABLE product_dim;
-- Dashboard query leveraging cached dimensions
SELECT
p.product_name,
c.customer_segment,
COUNT(*) as order_count,
SUM(o.order_value) as total_revenue
FROM orders_fact o
JOIN customer_dim c ON o.customer_id = c.customer_id -- Benefits from IO cache
JOIN product_dim p ON o.product_id = p.product_id -- Benefits from IO cache
WHERE o.order_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY p.product_name, c.customer_segment
HAVING COUNT(*) >= 10;
Alternatives: Result caching for identical queries, or adaptive query execution for dynamic optimization.
When to apply: Partition tables >1GB with predictable access patterns where effective partition pruning can dramatically reduce data scanning for time-based BI reports. Target file sizes of 128MB-1GB within partitions, and ensure partition cardinality is balanced (avoid too many small partitions or too few large ones).
How to implement: Avoid the common mistake of partitioning by date while users query across customer segments or regions, which results in full partition scans. Once partition elimination is set up correctly, query execution times drop dramatically because Spark reads only the relevant partitions rather than scanning the entire table.
-- Create properly partitioned table for BI workloads
CREATE TABLE sales_monthly_partitioned (
  transaction_id BIGINT,
  customer_id BIGINT,
  product_id BIGINT,
  sales_amount DECIMAL(10,2),
  region STRING,
  transaction_timestamp TIMESTAMP
) USING DELTA
PARTITIONED BY (
  transaction_month STRING -- YYYY-MM format for monthly reports
);

-- Insert with proper partition values
INSERT INTO sales_monthly_partitioned
SELECT
  transaction_id,
  customer_id,
  product_id,
  sales_amount,
  region,
  transaction_timestamp,
  DATE_FORMAT(transaction_timestamp, 'yyyy-MM') as transaction_month
FROM raw_sales_data;

-- BI query with effective partition pruning
SELECT
  region,
  COUNT(*) as transaction_count,
  AVG(sales_amount) as avg_sales
FROM sales_monthly_partitioned
WHERE transaction_month = '2024-12' -- Partition filter eliminates majority of data
  AND region IN ('North', 'South')
GROUP BY region;
Alternatives: Dynamic partition pruning for complex joins, or liquid clustering for evolving patterns.
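Dynamic partition pruning is enabled by default in Spark 3.x, so it usually needs no setup. The sketch below shows the kind of join it helps with, where the partition filter is only known at runtime; the date_dim table, month_key, and is_current_quarter columns are hypothetical.

-- Hedged sketch: dynamic partition pruning on the partitioned fact table above
SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true; -- default in Spark 3.x

SELECT
  s.region,
  SUM(s.sales_amount) as total_sales
FROM sales_monthly_partitioned s
JOIN date_dim d ON s.transaction_month = d.month_key -- hypothetical dimension table
WHERE d.is_current_quarter = true -- dimension filter prunes fact partitions at runtime
GROUP BY s.region;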
When to apply: Use serverless for BI workloads with unpredictable usage patterns where response time consistency matters more than absolute performance. Serverless scales well beyond 50 concurrent users and particularly shines for user-facing dashboards that need to eliminate the cold start problem.
How to implement: Serverless SQL eliminates the cold start problem that plagues traditional cluster-based BI deployments: instead of waiting several minutes for cluster startup during peak usage, queries execute immediately with automatic scaling based on demand. This also removes the operational overhead of cluster sizing and management while providing predictable query latency.
-- Serverless SQL endpoint configuration (via UI or API)
-- No cluster management required - automatic scaling

-- BI dashboard query on serverless endpoint
WITH monthly_metrics AS (
  SELECT
    DATE_TRUNC('month', order_date) as month,
    customer_segment,
    SUM(order_value) as revenue,
    COUNT(DISTINCT customer_id) as active_customers,
    AVG(order_value) as avg_order_size
  FROM orders_fact o
  JOIN customer_dim c ON o.customer_id = c.customer_id
  WHERE order_date >= '2024-01-01'
  GROUP BY DATE_TRUNC('month', order_date), customer_segment
)
SELECT
  month,
  customer_segment,
  revenue,
  active_customers,
  avg_order_size,
  LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month) as prev_month_revenue,
  (revenue - LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month)) /
    LAG(revenue) OVER (PARTITION BY customer_segment ORDER BY month) * 100 as growth_rate
FROM monthly_metrics
ORDER BY month DESC, customer_segment;
Alternatives: Right-sized clusters with auto-scaling, Databricks SQL Pro for guaranteed performance, or e6data for sub-second latency with 1000+ concurrent users without migrating from Databricks.
When to apply: Use broadcast joins when tables <200MB join with fact tables >1GB. Left unoptimized, joins between large fact tables and smaller dimension tables create massive shuffle operations that can consume significantly more resources than necessary.
How to implement: Broadcast joins copy small tables to all executors, eliminating network shuffle and dramatically reducing query time for typical star schema queries. Spark's adaptive query execution automatically identifies broadcast opportunities, but you can control this behavior manually for predictable performance. Broadcast joins also pair well with cached dimension tables, a powerful combination for analytical workloads.
-- Configure broadcast join thresholds
SET spark.sql.adaptive.enabled = true;
SET spark.sql.autoBroadcastJoinThreshold = 100MB;

-- Manual broadcast hint for guaranteed behavior (hint table aliases individually)
SELECT /*+ BROADCAST(c), BROADCAST(p) */
  c.customer_segment,
  p.product_category,
  COUNT(*) as order_count,
  SUM(o.order_value) as total_revenue,
  AVG(o.order_value) as avg_order_value
FROM orders_fact o
JOIN customer_dim c ON o.customer_id = c.customer_id
JOIN product_dim p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_segment, p.product_category
HAVING SUM(o.order_value) > 100000;
Alternatives: Bucketed joins for predictable data distribution, or sort-merge joins for large-to-large table joins.
When to apply: Partition window functions when processing >10M rows. Unpartitioned window functions in analytical queries trigger expensive global sorts across the entire dataset, creating memory pressure and long execution times.
How to implement: Proper window partitioning turns these operations from cluster-wide sorts into manageable partition-level operations. Partitioning window functions by logical business dimensions such as customer_id or region substantially reduces memory usage while maintaining result accuracy, and combining window partitioning with Z-ordering yields further gains for ranking and aggregation operations.
-- Inefficient: Global sorting across entire dataset
SELECT
  customer_id,
  order_date,
  order_value,
  ROW_NUMBER() OVER (ORDER BY order_value DESC) as global_rank -- Expensive!
FROM orders_fact;

-- Optimized: Partition-aware window functions
SELECT
  customer_id,
  order_date,
  order_value,
  SUM(order_value) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) as rolling_7day_total,
  ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_value DESC
  ) as customer_order_rank,
  DENSE_RANK() OVER (
    PARTITION BY DATE_TRUNC('month', order_date)
    ORDER BY order_value DESC
  ) as monthly_rank
FROM orders_fact
WHERE order_date >= '2024-01-01';

-- Advanced: Optimize table layout for window operations
OPTIMIZE orders_fact
ZORDER BY (customer_id, order_date);
Alternatives: Pre-aggregated materialized views for repeated calculations, or approximate functions such as percentile_approx.
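Where exact results are not required, approximate aggregates can replace expensive exact computations. A minimal sketch reusing the orders_fact example above; approx_percentile and approx_count_distinct are built-in Spark SQL functions.

-- Hedged sketch: approximate aggregates as a cheaper alternative to exact ranking
SELECT
  customer_id,
  approx_percentile(order_value, 0.9) as p90_order_value,
  approx_count_distinct(order_date) as approx_active_days
FROM orders_fact
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;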
When to apply: Enable AQE for all analytical workloads where complex analytical queries with multiple joins and aggregations need automatic optimization without manual intervention.
How to implement: Adaptive Query Execution (AQE) accelerates analytical queries by making runtime decisions based on actual data statistics rather than pre-execution estimates. AQE automatically handles skewed joins, coalesces shuffle partitions, and converts sort-merge joins to broadcast joins when beneficial.
-- Enable comprehensive AQE features
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.localShuffleReader.enabled = true;

-- Complex analytical query that benefits from AQE
WITH customer_metrics AS (
  SELECT
    c.customer_id,
    c.customer_segment,
    COUNT(*) as order_count,
    SUM(o.order_value) as total_spent,
    MAX(o.order_date) as last_order_date
  FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
  WHERE c.registration_date >= '2023-01-01'
  GROUP BY c.customer_id, c.customer_segment
),
segment_analysis AS (
  SELECT
    customer_segment,
    AVG(total_spent) as avg_spent,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_spent) as p90_spent,
    COUNT(*) as segment_size
  FROM customer_metrics
  GROUP BY customer_segment
)
SELECT
  cm.customer_segment,
  COUNT(*) as high_value_customers,
  AVG(cm.total_spent) as avg_high_value_spent,
  sa.avg_spent as segment_avg
FROM customer_metrics cm
JOIN segment_analysis sa ON cm.customer_segment = sa.customer_segment
WHERE cm.total_spent > sa.p90_spent
GROUP BY cm.customer_segment, sa.avg_spent
ORDER BY high_value_customers DESC;
Alternatives: Manual join hints and partition tuning, cost-based optimizer statistics collection, or e6data's lakehouse query engine, which eliminates manual tuning entirely.
When to apply: Collect statistics for tables >1GB with selective filter patterns. Delta Lake's column-level statistics enable sophisticated data skipping that goes well beyond basic partition pruning.
How to implement: When you collect statistics on frequently filtered columns, the query optimizer can skip entire files without reading them, substantially reducing I/O for selective analytical queries. The Delta log maintains min/max statistics for each data file, so the query engine can eliminate files that contain no relevant data before any reading occurs. Combining statistics with Z-ordering maximizes data skipping efficiency.
-- Delta Lake maintains file-level min/max statistics automatically
-- Focus on collecting table-level statistics for cost-based optimization

-- Analyze table to compute statistics
ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR ALL COLUMNS;

-- Query that benefits from data skipping
SELECT
  customer_segment,
  product_category,
  SUM(sales_amount) as total_sales,
  COUNT(DISTINCT customer_id) as unique_customers
FROM sales_fact s
JOIN customer_dim c ON s.customer_id = c.customer_id
WHERE s.sales_amount > 1000 -- Statistics enable file skipping
  AND s.transaction_date BETWEEN '2024-11-01' AND '2024-11-30'
  AND c.customer_segment = 'Enterprise'
GROUP BY customer_segment, product_category;
Alternatives: Bloom filters for high-cardinality columns, manual file organization strategies, or e6data's automatic data skipping optimization.
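For the Bloom filter alternative, Databricks exposes a CREATE BLOOMFILTER INDEX statement. This is a sketch only: the order_reference column is hypothetical, feature availability varies by Databricks Runtime, and the option values are illustrative.

-- Hedged sketch: Bloom filter index on a high-cardinality string column
CREATE BLOOMFILTER INDEX
ON TABLE sales_fact
FOR COLUMNS(order_reference OPTIONS (fpp = 0.1, numItems = 50000000));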
When to apply: Enable predictive I/O for sequential scan patterns >10GB where many analytical queries follow predictable access patterns, allowing the storage layer to anticipate data needs and reduce wait times.
How to implement: Predictive I/O pre-fetches data that is likely to be accessed based on query patterns, reducing latency for large analytical scans. It works especially well with time-series analysis and sequential data processing, where queries typically access adjacent data ranges; once predictive optimization is enabled, scan-heavy analytical queries see meaningful reductions in I/O wait time.
-- Enable IO cache for improved scan performance
SET spark.databricks.io.cache.enabled = true;
-- Note: Read-ahead and prefetch optimizations are handled automatically

-- Time-series analysis that benefits from predictive I/O
WITH daily_metrics AS (
  SELECT
    transaction_date,
    COUNT(*) as transaction_count,
    SUM(amount) as daily_revenue,
    AVG(amount) as avg_transaction_size,
    COUNT(DISTINCT customer_id) as active_customers
  FROM transactions
  WHERE transaction_date >= '2024-01-01'
  GROUP BY transaction_date
),
moving_averages AS (
  SELECT
    transaction_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
      ORDER BY transaction_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as ma_7day,
    AVG(daily_revenue) OVER (
      ORDER BY transaction_date
      ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) as ma_30day
  FROM daily_metrics
)
SELECT
  transaction_date,
  daily_revenue,
  ma_7day,
  ma_30day,
  (daily_revenue - ma_7day) / ma_7day * 100 as deviation_from_7day,
  CASE
    WHEN daily_revenue > ma_30day * 1.2 THEN 'High Performance'
    WHEN daily_revenue < ma_30day * 0.8 THEN 'Low Performance'
    ELSE 'Normal'
  END as performance_category
FROM moving_averages
ORDER BY transaction_date DESC;
Alternatives: Manual data pre-loading strategies, or result caching for repeated queries.
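For manual pre-loading, Databricks also offers a CACHE SELECT statement that warms the disk cache for a chosen slice of data. A minimal sketch against the transactions table used above; exact behavior depends on the runtime and the cluster's local storage.

-- Hedged sketch: pre-warm the IO cache for a hot date range
CACHE SELECT * FROM transactions
WHERE transaction_date >= '2024-01-01';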
When to apply: Optimize Auto Loader when ingestion rates exceed 1GB/hour and you need efficient incremental data ingestion; default configurations often underperform for high-volume ETL workloads.
How to implement: Tuning Auto Loader settings for your specific data patterns can deliver significant throughput improvements while maintaining exactly-once processing semantics. Auto Loader performance depends heavily on file size, arrival patterns, and parallelism configuration, and it pairs well with Delta Lake's merge operations for the upsert-heavy scenarios common in enterprise ETL pipelines.
-- Optimized Auto Loader configuration for high-throughput ingestion
CREATE OR REFRESH STREAMING LIVE TABLE raw_events_optimized
AS SELECT *
FROM cloud_files(
  "s3://your-bucket/events/",
  "json",
  map(
    "cloudFiles.format", "json",
    "cloudFiles.schemaLocation", "s3://your-bucket/schemas/events",
    "cloudFiles.inferColumnTypes", "true",
    "cloudFiles.schemaEvolutionMode", "addNewColumns",
    "cloudFiles.maxFilesPerTrigger", "1000", -- Optimize for throughput
    "cloudFiles.maxBytesPerTrigger", "1GB", -- Control batch size
    "cloudFiles.useNotifications", "true", -- Enable event notifications
    "cloudFiles.validateOptions", "false" -- Skip validation for speed
  )
);

-- DLT CDC pattern for upsert operations
APPLY CHANGES INTO LIVE.events_processed
FROM STREAM(LIVE.raw_events_optimized)
KEYS (event_id)
SEQUENCE BY event_timestamp
COLUMNS * EXCEPT (event_timestamp)
STORED AS SCD TYPE 1;
Alternatives: Delta Live Tables for complex pipelines, Kafka integration for real-time streams, or e6data's real-time streaming ingest with guaranteed throughput SLAs.
When to apply: Apply Z-ordering to ETL output tables >1GB that are accessed by multiple downstream consumers. When ETL processes produce tables with multiple access patterns, traditional partitioning alone is insufficient for downstream analytical queries.
How to implement: Z-ordering optimizes data layout for multiple columns simultaneously, improving query performance across varied analytical access patterns without sacrificing ETL throughput. Applying Z-ordering as part of the ETL write workflow eliminates the need for separate optimization jobs, and combining it with Delta Lake's write optimization features gives a solid foundation for both ETL efficiency and query performance.
-- ETL process with integrated Z-ordering
-- Column types are illustrative; match them to the source data
CREATE OR REPLACE TABLE customer_transactions_optimized (
  transaction_id BIGINT,
  customer_id BIGINT,
  customer_segment STRING,
  transaction_date DATE,
  amount DECIMAL(10,2),
  product_category STRING,
  etl_processed_time TIMESTAMP
)
USING DELTA
LOCATION 's3://your-bucket/optimized/customer_transactions'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

-- ETL transformation feeding the optimized table
INSERT INTO customer_transactions_optimized
SELECT
  t.transaction_id,
  t.customer_id,
  c.customer_segment,
  t.transaction_date,
  t.amount,
  t.product_category,
  CURRENT_TIMESTAMP() as etl_processed_time
FROM raw_transactions t
JOIN customer_master c ON t.customer_id = c.customer_id
WHERE t.transaction_date >= '2024-12-01';

-- Apply Z-ordering after ETL batch completion
-- Note: OPTIMIZE consumes DBUs and should be scheduled, not run on streaming tables
-- Important: Z-Order and Liquid Clustering cannot be used together
OPTIMIZE customer_transactions_optimized
ZORDER BY (customer_id, transaction_date, product_category);

-- Vacuum old files to reclaim storage
-- Warning: Setting retention < 7 days requires disabling the retention check and risks data loss
VACUUM customer_transactions_optimized RETAIN 168 HOURS;
Alternatives: Liquid clustering for evolving access patterns, or partition-based organization.
When to apply: Implement CDF for tables with <50% daily change rates and downstream dependency chains that need efficient incremental ETL, tracking only changed records rather than reprocessing entire datasets.
How to implement: Change Data Feed (CDF) dramatically reduces ETL runtime while maintaining data consistency across downstream systems. CDF automatically captures insert, update, and delete operations with versioning information, allowing downstream processes to apply changes incrementally, and it combines well with streaming processing for near real-time pipeline updates.
-- Enable Change Data Feed on source table
ALTER TABLE customer_master
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- For streaming CDF consumption, use the DataFrame API with the readChangeFeed option
-- Note: table_changes() is for batch processing only
-- Example in Python/PySpark:
-- df = spark.readStream.format("delta") \
--        .option("readChangeFeed", "true") \
--        .table("customer_master") \
--        .where("_change_type in ('insert','update_postimage')")

-- Alternative: Use DLT APPLY CHANGES for CDC patterns
APPLY CHANGES INTO LIVE.customer_analytics_summary
FROM STREAM(LIVE.customer_master)
KEYS (customer_id)
SEQUENCE BY _commit_timestamp
COLUMNS * EXCEPT (_change_type, _commit_version, _commit_timestamp)
STORED AS SCD TYPE 1;

-- Apply incremental changes to downstream analytics table
MERGE INTO customer_analytics_summary t
USING customer_changes_stream s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s._change_type = 'update_postimage' THEN
  UPDATE SET
    t.customer_segment = s.customer_segment,
    t.last_transaction_date = s.last_transaction_date,
    t.lifetime_value = s.lifetime_value,
    t.updated_timestamp = CURRENT_TIMESTAMP()
WHEN NOT MATCHED AND s._change_type = 'insert' THEN
  INSERT (customer_id, customer_segment, last_transaction_date, lifetime_value, created_timestamp)
  VALUES (s.customer_id, s.customer_segment, s.last_transaction_date, s.lifetime_value, CURRENT_TIMESTAMP());
Alternatives: Timestamp-based incremental processing, Delta Live Tables for complex dependencies.
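As a sketch of the timestamp-based alternative, assuming the source table carries a reliable updated_timestamp column (column names here are illustrative), an incremental upsert without CDF could look like this:

-- Hedged sketch: timestamp-based incremental upsert without CDF
MERGE INTO customer_analytics_summary t
USING (
  SELECT *
  FROM customer_master
  WHERE updated_timestamp >
    (SELECT MAX(updated_timestamp) FROM customer_analytics_summary)
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;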
When to apply: Enable auto compaction for ETL processes writing >100 files per hour. Small file proliferation is a common ETL bottleneck that degrades query performance and increases storage costs.
How to implement: Auto compaction automatically merges small files during write operations, maintaining optimal file sizes without requiring separate maintenance jobs. Delta Lake monitors file sizes during writes and triggers compaction when files fall below optimal thresholds, so ETL processes maintain consistent performance without manual intervention and avoid the small file problem that plagues many data lake implementations.
-- Configure auto compaction for ETL tables
ALTER TABLE transaction_staging
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true', -- Optimize during writes
  'delta.autoOptimize.autoCompact' = 'true', -- Auto compact small files
  'delta.tuneFileSizesForRewrites' = 'true' -- Optimize for compaction
);

-- ETL process with optimized write patterns
CREATE OR REPLACE TEMPORARY VIEW transaction_batch AS
SELECT
  transaction_id,
  customer_id,
  amount,
  transaction_date,
  product_category,
  CURRENT_TIMESTAMP() as processed_timestamp
FROM raw_transaction_stream
WHERE processing_date = CURRENT_DATE();

-- Batch insert with automatic optimization
INSERT INTO transaction_staging
SELECT * FROM transaction_batch;
When to apply: Implement streaming aggregations when latency requirements are under 15 minutes and data rates exceed 1,000 events/second. Computing metrics in real time during ETL processing eliminates the need for separate batch aggregation jobs.
How to implement: Streaming aggregations with proper windowing and watermarking deliver near real-time analytics while maintaining exactly-once processing guarantees. They work especially well for time-based metrics such as hourly sales totals or running customer lifetime value calculations, and correct watermarking configuration ensures accurate results while bounding memory usage for long-running streaming jobs.
-- Streaming aggregation for real-time metrics
CREATE OR REPLACE STREAMING LIVE TABLE hourly_sales_metrics
AS SELECT
  window.start as hour_start,
  window.end as hour_end,
  product_category,
  COUNT(*) as transaction_count,
  SUM(amount) as hourly_revenue,
  AVG(amount) as avg_transaction_size,
  COUNT(DISTINCT customer_id) as unique_customers,
  CURRENT_TIMESTAMP() as computed_timestamp
FROM stream(LIVE.transaction_stream)
GROUP BY
  window(transaction_timestamp, '1 hour'),
  product_category;

-- Advanced streaming aggregation with watermarking
CREATE OR REPLACE STREAMING LIVE TABLE customer_lifetime_value_stream
COMMENT "Real-time customer lifetime value computation"
AS SELECT
  customer_id,
  SUM(amount) as lifetime_value,
  COUNT(*) as total_transactions,
  MAX(transaction_timestamp) as last_transaction_time,
  CURRENT_TIMESTAMP() as last_updated
FROM stream(LIVE.transaction_stream)
-- Note: Proper watermarking requires the DataFrame API
-- Python example: df.withWatermark("transaction_timestamp", "1 hour")
-- A WHERE clause here would act as a filter, not a watermark, and would not bound state
GROUP BY customer_id;
Alternatives: Micro-batch processing with Delta Live Tables, Kafka Streams for complex event processing.
e6data is a decentralized, Kubernetes-native lakehouse compute engine delivering 10x faster query performance with 60% lower compute costs through per-vCPU billing and zero data movement. It runs directly on existing data formats (Delta/Iceberg/Hudi, Parquet, CSV, JSON), requiring no migration or rewrites. Teams often keep their existing Databricks platform for development workflows while offloading performance-critical queries to e6data for sub-second latency and 1000+ QPS concurrency.
Key benefits of the e6data approach:
Start a free trial of e6data and see performance comparison on your own workloads. Use our cost calculator to estimate potential gains.