
How to Optimize Microsoft Fabric Costs? {Updated 2025 Guide}

August 14, 2025
/
e6data Team
Microsoft Fabric
Cost Optimization
Advanced

Microsoft Fabric’s unified analytics platform can simplify your data estate, but without strong cost governance, that flexibility can quickly turn into budget overruns. Enterprises adopting Fabric’s integrated approach to data engineering, analytics, and AI are discovering that its consumption-based pricing introduces complex cost dynamics that require specialized optimization strategies.

Common pain points include runaway capacity utilization, always-on compute for idle environments, overly aggressive refresh schedules, and storage layouts that inflate scan costs. Unlike traditional analytics platforms with fixed licensing, Fabric’s pay-as-you-go model means every query, every data movement, and every storage decision directly hits your bottom line.

This playbook offers data engineering teams tactical, measurable approaches to optimize Fabric costs across three critical workload categories. Each tactic includes code examples, real-world scenarios (with metrics), and proven alternatives that have delivered substantial cost reductions in enterprise environments. Use it as a practical cost-optimization handbook for your Fabric workloads.

High-Cost Yardsticks for Microsoft Fabric

Before diving into optimizations, it’s important to monitor the key cost indicators that signal inefficiencies. The table below outlines cost “yardsticks”: a warning threshold, an investigation trigger, and what to monitor for each indicator. A minimal sketch of how these thresholds can be wired into an automated check follows the table.

| Cost Indicator | Warning Threshold | Investigation Trigger | What to Monitor |
| --- | --- | --- | --- |
| OneLake Storage | >$50 per TB/month | >$100 per TB/month | Monitor delta table file count and size distribution |
| Power BI Premium Capacity | >80% CPU utilization | >90% CPU for 24+ hours | Check Get-PowerBICapacityWorkload metrics |
| Data Factory Pipeline CUs | >10,000 CUs/day baseline | 50% month-over-month increase | Review pipeline execution logs and duration |
| Spark Compute Sessions | >50 concurrent sessions | >72 hours average session time | Track session lifecycle and resource allocation |
| SQL Endpoint Queries | >$200/million queries | >5 second average query time | Analyze query performance and caching efficiency |
| KQL Database Ingestion | >1GB ingestion per second | >90% cache miss ratio | Monitor ingestion patterns and retention policies |
| Semantic Model Refresh | >4 refreshes per hour | >2GB memory per refresh | Track refresh frequency and dataset optimization |
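
As a starting point, these thresholds can be encoded in a small script that runs after your daily metrics export. The sketch below is illustrative only: the metric names, the export mechanism (e.g. the Fabric Capacity Metrics app or your own logging), and the numeric values are assumptions to adapt to your environment.

# Minimal sketch: flag Fabric cost indicators that cross the thresholds above.
# `observed` is assumed to come from your own monitoring export; names and
# threshold values here are illustrative, not authoritative.
THRESHOLDS = {
    "onelake_storage_usd_per_tb_month": {"warning": 50, "investigate": 100},
    "premium_capacity_cpu_pct":         {"warning": 80, "investigate": 90},
    "data_factory_cus_per_day":         {"warning": 10_000, "investigate": 15_000},
    "concurrent_spark_sessions":        {"warning": 50, "investigate": 75},
}

def check_cost_indicators(observed: dict) -> list:
    """Return human-readable alerts for any indicator above its thresholds."""
    alerts = []
    for name, value in observed.items():
        limits = THRESHOLDS.get(name)
        if limits is None:
            continue
        if value >= limits["investigate"]:
            alerts.append(f"INVESTIGATE {name}: {value} >= {limits['investigate']}")
        elif value >= limits["warning"]:
            alerts.append(f"WARN {name}: {value} >= {limits['warning']}")
    return alerts

# Example usage with made-up numbers
print(check_cost_indicators({
    "premium_capacity_cpu_pct": 93,
    "data_factory_cus_per_day": 8_200,
}))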

Workload Taxonomy

Optimizations must be tailored to your Fabric workload patterns. Each workload type has distinct cost drivers and opportunities:

BI Dashboards

Characteristics: High-frequency, predictable analytical queries (e.g. executive dashboards).
Primary Cost Drivers: Power BI Premium capacity usage, semantic model size, query processing times.
Optimization Focus: Caching strategies, efficient model design, incremental refresh patterns.

Ad-hoc Analytics

Characteristics: Exploratory queries with unpredictable resource needs and varying complexity (e.g. data science notebooks).
Primary Cost Drivers: Spark compute session runtime, OneLake data scan volume, SQL endpoint query costs.
Optimization Focus: Query tuning (pruning data scans), data partitioning/clustering, session management (auto-pause/terminate).

ETL/Streaming

Characteristics: High-volume data movement and real-time processing (e.g. pipelines and event streams).
Primary Cost Drivers: Data Factory pipeline execution CUs, Spark cluster utilization (idle vs peak), streaming ingestion and transformation costs.
Optimization Focus: Pipeline efficiency (batching, parallelism), cluster autoscaling, data compression and format optimization.

BI Dashboard Optimization Tactics

For BI dashboards (typically Power BI-driven), the biggest savings come from optimizing datasets and refreshes to reduce load on Premium capacities. The tactics below focus on semantic model efficiency and smarter use of compute for frequent dashboard queries.

1. Implement semantic model aggregations to reduce Premium capacity utilization

A major healthcare network’s patient analytics dashboard was overloading their Power BI Premium capacity (CPU ~95% at peak). The dashboard queried a 500 GB fact table (2.8 billion rows) on every load, causing executive reports to time out and slicers to take 15–30 seconds.

The team addressed this by implementing user-defined aggregations in the Power BI semantic model to pre-summarize common query results. The DAX snippet below creates a monthly aggregation table (PatientMetrics_Monthly) with total visits, average length of stay, and total revenue by Year, Month, and Department. This pre-aggregated table is added to the model and will answer high-level queries without scanning the entire fact table:

-- Define an aggregation table at Month level
DEFINE TABLE PatientMetrics_Monthly =
SUMMARIZECOLUMNS(
    'Calendar'[Year],
    'Calendar'[Month],
    'Department'[DepartmentName],
    "TotalVisits",      SUM('FactPatientVisits'[VisitCount]),
    "AvgLengthOfStay",  AVERAGE('FactPatientVisits'[LengthOfStay]),
    "TotalRevenue",     SUM('FactPatientVisits'[Revenue])
)

-- Evaluate and add the aggregation table to the model
EVALUATE PatientMetrics_Monthly

Alternatives:

  • Composite models – Use DirectQuery for granular detail and Import mode for aggregated summaries. This hybrid approach offloads detailed data and can cut in-memory dataset size by a good margin.
  • Automatic aggregations – Enable Power BI’s dual storage mode so the service automatically creates and manages aggregation tables based on usage patterns (reducing manual effort).
  • Incremental refresh with aggregations – For very large fact tables, combine incremental refresh (to handle new data) with pre-calculated aggregation partitions for historical data.

2. Optimize semantic model refresh patterns to reduce refresh costs

A retail analytics team was performing full dataset refreshes every hour for their sales dashboards, refreshing 12 large Power BI datasets, each consuming ~850 capacity units (CUs) per refresh. During Black Friday weekend, this aggressive schedule drove refresh costs to about $2,400 per day, and many refresh attempts timed out under the load.

They implemented an incremental refresh policy to only process new or changed data instead of fully reloading everything. The JSON snippet below illustrates an example incremental refresh configuration: it refreshes data at an hourly grain for the last 10 hours, maintains a rolling 90-day window of historical data, and uses detectDataChanges to only load partitions where source data has changed:

{
  "refreshPolicy": {
    "type": "incremental",
    "incrementalGranularity": "Hour",
    "incrementalPeriods": 10,
    "rollingWindowGranularity": "Day",
    "rollingWindowPeriods": 90,
    "detectDataChanges": true,
    "sourceExpressionFilter": "DateTime[Date] >= DateTime.From(RangeStart) and DateTime[Date] < DateTime.From(RangeEnd)"
  }
}

Alternatives:

  • Partition-based refresh: Manually partition large tables by date ranges and refresh only modified partitions
  • Real-time streaming: Use Power BI streaming datasets for high-frequency metrics, reducing the need for frequent full refreshes
  • Hybrid refresh strategy: Combine incremental refresh for transactional data with full refresh for slowly changing dimensions

3. Implement row-level security caching to reduce query processing costs

A financial services firm had a compliance dashboard serving ~2,500 users across different regions and clearance levels. Power BI Row-Level Security (RLS) ensured each user only saw permitted data (filtered by geography and role). However, without optimization, every query had to re-evaluate the RLS filters for each user, leading to 8–12 second dashboard load times and high Premium CPU utilization.

They optimized RLS by caching the user-to-region mapping and leveraging it in the filter logic. Specifically, they created a table-valued function in the data source that returns the allowed regions for a given user, imported its output into the model as a security table, and referenced that table in the RLS rule. This way, the expensive security lookup is done once and reused.

-- SQL: Table-valued function to get allowed regions for a user
CREATE FUNCTION Security.GetUserRegions(@UserPrincipalName NVARCHAR(256))
RETURNS TABLE
AS
RETURN
(
    SELECT DISTINCT ur.RegionCode
    FROM dbo.UserRegions ur
    JOIN dbo.Users u ON ur.UserId = u.UserId
    WHERE u.UserPrincipalName = @UserPrincipalName
      AND ur.IsActive = 1
      AND ur.ExpiryDate > GETDATE()
);

-- Power BI RLS DAX rule: the function's output is loaded into the model as a
-- 'UserRegions' security table, so the filter resolves against the cached mapping
-- instead of querying the source database on every visual
[RegionCode] IN CALCULATETABLE(
    VALUES('UserRegions'[RegionCode]),
    'UserRegions'[UserPrincipalName] = USERNAME()
)

Alternatives:

  • Dynamic RLS optimization – Refine DAX RLS filters using more efficient patterns (e.g. using TREATAS or pre-aggregated lookup tables) to improve the query plan and reduce overhead.
  • Static RLS partitioning – In extreme cases, segregate data by user groups (e.g. create separate datasets or partitions per region/role) to eliminate per-query RLS filtering. This increases maintenance but can remove the RLS runtime cost entirely.
  • External permission cache – Cache user permissions in a fast store (e.g. Azure Cache for Redis) and have the BI solution query that cache instead of hitting the underlying database for each user lookup.

4. Configure automatic pausing for development workspaces to reduce idle capacity costs

One data engineering team maintained ~15 development and testing workspaces in Fabric on a dedicated Premium capacity, running 24/7. During off-hours (nights and weekends), these environments sat idle with zero user activity, yet kept consuming paid capacity.

The solution was to implement automatic workspace pausing for non-production workspaces during idle periods. They scheduled a PowerShell script (using Azure Automation) that checks each dev/test workspace’s last activity, and if a workspace has been idle for over 2 hours, it suspends the workspace.

# PowerShell: Auto-pause idle workspaces
# Note: cmdlet names below are illustrative; in practice, derive last activity from the
# Power BI activity log and pause the underlying capacity via the supported APIs.
$workspaces = @("Dev-Analytics", "Test-Reports", "Staging-Dashboards")
foreach ($ws in $workspaces) {
    $usage = Get-PowerBIWorkspaceUsage -WorkspaceId $ws -Days 1
    if ($usage.LastActivity -lt (Get-Date).AddHours(-2)) {
        Suspend-PowerBIWorkspace -WorkspaceId $ws
        Write-Host "Paused workspace: $ws due to inactivity"
    }
}

Alternatives:

  • Scheduled capacity scaling – During known off-hours, automatically scale down Premium capacity (or switch to a smaller SKU) using an Azure Automation runbook or the Fabric REST API, then scale it back up in the morning (a Python sketch of this pattern follows this list).
  • Workspace-specific capacity allocation – Instead of using expensive Premium capacity for dev/test, assign those workspaces to a lower-cost shared capacity (or Fabric’s Pro licenses) except when higher performance is needed.
  • Usage-based hibernation – Implement a smarter hibernation service that learns usage patterns (e.g. no one uses dev on weekends) and proactively hibernates or downsizes environments during expected idle windows.
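
To illustrate the scheduled capacity scaling alternative above, here is a hedged Python sketch that pauses, resumes, or resizes a Fabric capacity through Azure Resource Manager. The subscription, resource group, and capacity names are placeholders, and the api-version and the exact Microsoft.Fabric/capacities suspend/resume/update actions should be verified against the current ARM reference before use.

# Hedged sketch: pause/resume/resize a Fabric capacity on a schedule via Azure Resource Manager.
# Resource names and the api-version are placeholders -- verify against the current
# Microsoft.Fabric/capacities ARM documentation before relying on this.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
CAPACITY = "<fabric-capacity-name>"
API_VERSION = "2023-11-01"  # assumed; confirm in the ARM reference

BASE = (
    "https://management.azure.com/subscriptions/"
    f"{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Fabric/capacities/{CAPACITY}"
)

def _headers() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    return {"Authorization": f"Bearer {token.token}", "Content-Type": "application/json"}

def suspend_capacity() -> None:
    # Pausing the capacity stops compute billing while it is suspended
    requests.post(f"{BASE}/suspend?api-version={API_VERSION}", headers=_headers()).raise_for_status()

def resume_capacity() -> None:
    requests.post(f"{BASE}/resume?api-version={API_VERSION}", headers=_headers()).raise_for_status()

def scale_capacity(sku_name: str) -> None:
    # Switch SKUs on a schedule, e.g. "F64" during business hours, "F16" overnight
    body = {"sku": {"name": sku_name, "tier": "Fabric"}}
    requests.patch(f"{BASE}?api-version={API_VERSION}", headers=_headers(), json=body).raise_for_status()

Run this from an Azure Automation runbook or a scheduled pipeline: suspend or scale down in the evening, and resume (or scale back up) before business hours.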

5. Optimize Power BI dataflow refresh scheduling to reduce compute overlap costs

A team had 8 interdependent Power BI dataflows feeding their supply chain dashboards. Originally, all 8 dataflows were set to refresh simultaneously at 6:00 AM daily. This caused a huge spike in load (Premium capacity hit ~98% CPU) and frequent refresh failures (3–4 failures per week) due to contention when all dataflows ran at once.

The team redesigned the refresh schedule to stagger these dataflows and chain them by dependencies. Critical dataflows now refresh first, and downstream dataflows refresh sequentially after their prerequisites complete.

{
  "dataflowRefreshSchedule": {
    "sourceDataflows": [
      {
        "name": "Suppliers_Master",
        "schedule": "0 5 * * *",
        "priority": 1,
        "dependencies": []
      },
      {
        "name": "Inventory_Transactions",
        "schedule": "0 6 * * *",
        "priority": 2,
        "dependencies": ["Suppliers_Master"]
      },
      {
        "name": "Production_Metrics",
        "schedule": "0 7 * * *",
        "priority": 3,
        "dependencies": ["Inventory_Transactions"]
      }
    ]
  }
}

Alternatives:

  • Controlled parallelism with limits – If some parallelism is desired, configure the Premium capacity or dataflow settings to limit the number of concurrent refreshes (queuing the rest). This prevents resource exhaustion while still overlapping some processing.
  • Incremental dataflow refresh – Similar to datasets, use change detection in dataflows so that each refresh only processes new or changed data. This shortens refresh times and reduces the compute hit, making scheduling less of a concern.
  • Shared reference dataflows – If multiple dataflows were pulling the same large reference data, consider consolidating that into one dataflow and letting others reference it. This avoids redundant refresh of the same data across many dataflows.

6. Implement composite model optimization

An analytics team was ingesting a high volume of sales data (≈18 GB/day) into Power BI models in Import mode. This led to large dataset sizes (and Premium storage costs around $3,200/month) and long refresh times (~45 minutes). The model served dual purposes: real-time inventory lookups and historical sales trend analysis, all in one dataset, causing both performance and cost issues due to the mixed workload.

They re-architected the Power BI dataset as a composite model, splitting data by freshness: current data stays in DirectQuery (no import), while historical data remains imported for fast analysis. In practice, they set up the fact table with partitions or logic by date.

-- Pseudo-code: determine storage mode by year (current year = DirectQuery, others = Import)
EVALUATE
ADDCOLUMNS(
    SUMMARIZE('Sales', 'Date'[Year], 'Date'[Month]),
    "StorageMode",
    IF(
        'Date'[Year] = YEAR(TODAY()),
        "DirectQuery",   -- Current year data via DirectQuery for real-time
        "Import"         -- Historical data cached in memory
    )
)

-- Define relationships appropriate for DirectQuery (one cross-filter, one single to control join behavior)
DEFINE
RELATIONSHIP('Sales'[ProductID], 'Products'[ProductID], MANY_TO_ONE, CROSSFILTER_BOTH)
RELATIONSHIP('Sales'[DateKey], 'Calendar'[DateKey], MANY_TO_ONE, CROSSFILTER_SINGLE)

Alternatives:

  • Aggregation tables in composite models – Create imported aggregation tables for commonly used rollups (monthly totals, etc.) and leave the detailed fact data in DirectQuery. This way, summary queries hit fast cache, and only drill-downs hit the source.
  • Incremental refresh + DirectQuery mash-up – Use incremental refresh for recent partitions and switch very old partitions to DirectQuery on archival data. This reduces imports further while keeping moderately recent data cached.
  • Dual storage mode – Power BI’s dual mode allows a table to act as both Import and DirectQuery depending on the query. Leverage this so that frequently accessed data is automatically cached by Power BI, and less frequent detail queries go to the source.

Ad-hoc Analytics Optimization Tactics

Ad-hoc analytics workloads (like interactive SQL queries, notebooks, or experiments on large datasets) benefit from optimizing data layout and managing compute resources on-the-fly. The following tactics aim to minimize data scanned and eliminate idle compute time, making exploratory analysis in Fabric faster and cheaper.

1. Implement Delta Lake Z-ordering to reduce scan costs

A financial analytics team was querying a 2.5 TB OneLake Delta table of transactions for exploratory analysis. Because the data wasn’t organized for their query patterns, each query ended up scanning almost the entire dataset, consuming ~450–600 CUs and taking 15–25 minutes. The heavy cost and slow turnaround made analysts hesitant to run complex queries frequently.

They tackled this by optimizing the physical layout of the Delta Lake data using Z-order clustering and partitioning. In practice, this meant clustering data files on the most commonly filtered fields and partitioning the data by date to prune unnecessary data.

-- Physically cluster the table files by frequently filtered columns
OPTIMIZE financial.transactions
ZORDER BY (transaction_date, account_type, amount_range, region_code);

-- Compute fresh statistics to aid the query optimizer
ANALYZE TABLE financial.transactions COMPUTE STATISTICS;

-- (Optional) Repartition data by date for pruning
CREATE TABLE financial.transactions_optimized
USING DELTA
PARTITIONED BY (year, month)
LOCATION 'abfss://[email protected]/transactions'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
AS SELECT *
FROM financial.transactions;

Alternatives:

  • Liquid clustering (auto-clustering) – Use Fabric’s or Delta Lake’s automatic clustering/ordering features if available. This lets the system continuously organize data based on query patterns without manual Z-order commands, which is useful if filter patterns shift over time.
  • Materialized views for analytics – For extremely frequent aggregations or joins in ad-hoc queries, consider creating materialized views that pre-aggregate or pre-join data. Queries hitting these materialized results can be sub-second, avoiding large scans altogether.
  • Refined partitioning strategy – Revisit the partition scheme of the dataset. For instance, a multi-level partition (by region and date) or bucketing on certain keys might further reduce the amount of data each query needs to scan, depending on query patterns.

2. Optimize OneLake file layout to reduce storage scan costs

An analytics team stored 18 months of customer behavior data as 847,000 small JSON files in OneLake, causing analytical queries to scan excessive file metadata and consume 320-480 CUs per query. Query planning alone took 2-5 minutes before execution started.

The team implemented a file consolidation and format optimization strategy. They converted the scattered JSON files into larger columnar files (Parquet/Delta) with appropriate partitioning to improve query efficiency.

# Consolidate small JSON files into an optimized Delta/Parquet layout
import pyspark.sql.functions as F

def optimize_file_layout(source_path, target_path, partition_columns):
    # Read the fragmented JSON data
    df = spark.read.option("multiline", "true").json(source_path)

    # Add partitioning columns for efficient filtering
    df_with_partitions = (df
        .withColumn("year", F.year(F.col("timestamp")))
        .withColumn("month", F.month(F.col("timestamp")))
        .withColumn("customer_segment",
                    F.when(F.col("total_purchases") > 1000, "high_value")
                     .when(F.col("total_purchases") > 100, "medium_value")
                     .otherwise("low_value")))

    # Write as an optimized Delta table with proper file sizing
    (df_with_partitions.write
        .format("delta")
        .mode("overwrite")
        .partitionBy(*partition_columns)
        .option("delta.autoOptimize.optimizeWrite", "true")
        .option("delta.targetFileSize", "134217728")  # 128 MB target file size
        .save(target_path))

# Execute optimization
optimize_file_layout(
    "abfss://[email protected]/customer_behavior/json/",
    "abfss://[email protected]/customer_behavior_optimized/",
    ["year", "month", "customer_segment"]
)

Alternatives:

  • Auto-compaction scheduling: Enable Delta Lake auto-compaction to continuously optimize file layout based on write patterns
  • Columnar storage optimization: Use column-oriented storage formats with compression for analytical workloads requiring specific column access
  • Data lake tiering: Implement hot/cold storage tiering based on data access patterns to optimize both performance and storage costs

3. Implement intelligent data caching strategies to reduce repeated scan costs

A business intelligence team was repeatedly analyzing the same customer segments and time periods, with each analyst re-scanning 800GB of source data for overlapping analyses. Daily analytical workloads consumed 2,800 CUs with 65% of queries accessing previously computed intermediate results.

To address this, the team built an intelligent caching layer on top of Spark. Instead of relying solely on manual caching, they created a mechanism to automatically cache and reuse intermediate results for frequent query patterns.

# Implement intelligent caching with cache lifecycle management
from pyspark.storagelevel import StorageLevel
import hashlib

class AnalyticsCacheManager:
    def __init__(self):
        self.cache_registry = {}
        self.cache_hits = 0
        self.cache_misses = 0

    def get_cache_key(self, query: str, parameters: dict) -> str:
        # Create a deterministic cache key from the query text and its parameters
        cache_input = f"{query}_{sorted(parameters.items())}"
        return hashlib.md5(cache_input.encode()).hexdigest()

    def get_or_compute(self, cache_key: str, compute_func, cache_level=StorageLevel.MEMORY_AND_DISK):
        if cache_key in self.cache_registry:
            self.cache_hits += 1
            print(f"Cache HIT for {cache_key[:8]}...")
            return self.cache_registry[cache_key]

        self.cache_misses += 1
        print(f"Cache MISS for {cache_key[:8]}... Computing...")

        # Compute and persist the result at the requested storage level
        result = compute_func()
        result.persist(cache_level)

        self.cache_registry[cache_key] = result
        return result

    def get_cache_stats(self):
        total = self.cache_hits + self.cache_misses
        hit_rate = (self.cache_hits / total * 100) if total else 0.0
        return f"Cache hit rate: {hit_rate:.1f}% ({self.cache_hits} hits, {self.cache_misses} misses)"

# Usage example for customer segmentation analysis
cache_manager = AnalyticsCacheManager()

def compute_customer_segments(date_range, region):
    return spark.sql(f"""
        SELECT customer_id,
               SUM(purchase_amount) as total_spent,
               COUNT(*) as transaction_count,
               CASE
                 WHEN SUM(purchase_amount) > 5000 THEN 'Premium'
                 WHEN SUM(purchase_amount) > 1000 THEN 'Standard'
                 ELSE 'Basic'
               END as segment
        FROM transactions
        WHERE transaction_date BETWEEN '{date_range[0]}' AND '{date_range[1]}'
          AND region = '{region}'
        GROUP BY customer_id
    """)

In this approach:

  • We generate a cache key from the query and its parameters (so identical analyses produce the same key).
  • The first time a given query runs (cache miss), the result DataFrame is persisted (cached in memory and possibly disk).
  • Subsequent calls with the same key (cache hit) return the cached DataFrame instantly instead of recomputing.

Alternatives:

  • Delta Lake caching: Utilize Delta Lake's built-in caching mechanisms for automatic optimization of frequently accessed data
  • Distributed cache coordination: Implement cluster-wide cache coordination to share cached results across multiple analytical sessions

4. Configure dynamic resource allocation for variable workloads

A financial modeling team had unpredictable analytical workloads ranging from simple aggregations requiring 2 cores to complex ML training needing 200+ cores. Static cluster provisioning for peak capacity resulted in $6,200 monthly costs with 25% average utilization and frequent resource waste.

The team enabled dynamic cluster scaling (Spark Dynamic Allocation) so that compute resources would automatically scale up or down based on current workload. They set a broad range for executors, from a minimum of 2 up to a maximum of 200, and tuned Spark’s thresholds to make scaling decisions quickly. The configuration snippet below highlights this:

# Configure adaptive Spark cluster scaling
# Note: dynamic-allocation settings are normally applied at session or pool startup;
# setting them mid-session as shown here is illustrative and may be rejected by some runtimes.
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "2")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "200")
spark.conf.set("spark.dynamicAllocation.initialExecutors", "10")

# Advanced scaling configuration
spark.conf.set("spark.dynamicAllocation.executorIdleTimeout", "60s")
spark.conf.set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s")
spark.conf.set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
spark.conf.set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s")

# Custom resource allocation based on workload patterns
class WorkloadBasedScaling:
    def __init__(self):
        self.workload_profiles = {
            "simple_aggregation": {"min_executors": 2, "max_executors": 10, "target_executor_cores": 4},
            "ml_training": {"min_executors": 20, "max_executors": 200, "target_executor_cores": 8},
            "data_exploration": {"min_executors": 5, "max_executors": 50, "target_executor_cores": 4},
            "report_generation": {"min_executors": 8, "max_executors": 40, "target_executor_cores": 4}
        }

    def configure_for_workload(self, workload_type: str):
        if workload_type not in self.workload_profiles:
            print(f"Unknown workload type: {workload_type}")
            return

        profile = self.workload_profiles[workload_type]

        spark.conf.set("spark.dynamicAllocation.minExecutors", str(profile["min_executors"]))
        spark.conf.set("spark.dynamicAllocation.maxExecutors", str(profile["max_executors"]))
        spark.conf.set("spark.executor.cores", str(profile["target_executor_cores"]))

        print(f"Configured cluster for {workload_type}: {profile}")

# Usage with automatic workload detection
scaler = WorkloadBasedScaling()

def execute_with_optimal_scaling(query: str, estimated_data_size_gb: float):
    # Simple heuristic for workload classification
    if "train" in query.lower() or "ml" in query.lower():
        workload = "ml_training"
    elif estimated_data_size_gb < 10:
        workload = "simple_aggregation"
    elif "GROUP BY" in query and "JOIN" in query:
        workload = "data_exploration"
    else:
        workload = "report_generation"

    scaler.configure_for_workload(workload)
    return spark.sql(query)

Alternatives:

  • Serverless analytics: Migrate appropriate workloads to serverless SQL or Spark for automatic resource management and pay-per-query billing
  • Multi-cluster workload isolation: Use separate optimized clusters for different workload types to maximize resource efficiency
  • Predictive scaling: Implement ML-based resource prediction to pre-scale clusters based on historical usage patterns and scheduled workloads

5. Implement query result materialization patterns for redundant computation costs

An analytics team was repeatedly computing the same customer lifetime value calculations and segmentation logic across multiple reports and dashboards. Each calculation required scanning 1.2TB of historical data, with 85% of queries containing overlapping analytical logic consuming redundant compute resources.

To eliminate this duplication, they introduced query result materialization for common expensive computations. In practice, this means creating materialized views or cached tables that store the results of these computations, so that downstream queries can simply read the precomputed results. First, they created materialized views in their Lakehouse (on Delta tables) for the heavy queries.

-- Create materialized views for common analytical patterns
CREATE MATERIALIZED VIEW customer_lifetime_analytics
USING DELTA
LOCATION 'abfss://[email protected]/materialized/clv_analytics'
AS
SELECT
    customer_id,
    registration_date,
    DATEDIFF(CURRENT_DATE, registration_date) as customer_age_days,
    SUM(purchase_amount) as total_lifetime_value,
    COUNT(DISTINCT transaction_date) as purchase_frequency,
    AVG(purchase_amount) as avg_order_value,
    MAX(transaction_date) as last_purchase_date,
    DATEDIFF(CURRENT_DATE, MAX(transaction_date)) as days_since_last_purchase,
    CASE
        WHEN SUM(purchase_amount) > 10000 AND COUNT(*) > 50 THEN 'VIP'
        WHEN SUM(purchase_amount) > 5000 AND COUNT(*) > 20 THEN 'High Value'
        WHEN SUM(purchase_amount) > 1000 AND COUNT(*) > 5 THEN 'Regular'
        ELSE 'Occasional'
    END as customer_segment,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY purchase_amount) as median_order_value
FROM customer_transactions
WHERE transaction_date >= '2022-01-01'
GROUP BY customer_id, registration_date;

-- Refresh the materialized view on a schedule; engines that support incremental
-- refresh apply only the changes from the source table instead of a full rebuild
REFRESH MATERIALIZED VIEW customer_lifetime_analytics;

Next, they developed a lightweight Query Materialization Manager in Python to keep track of materialized queries and their dependencies. The idea was to programmatically register common query patterns, so they could ensure those are materialized and refreshed when source data changes.
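
The article doesn’t show that manager, but a minimal sketch of the idea might look like the following. The class, method, and table names are illustrative (not the team’s actual code), and it assumes a notebook-provided spark session with access to the Lakehouse tables.

# Illustrative sketch of a query materialization manager: register expensive query patterns,
# materialize them as Delta tables, and re-materialize a view when any of its source tables changes.
from dataclasses import dataclass, field

@dataclass
class MaterializedQuery:
    name: str                                    # target table, e.g. "materialized.customer_lifetime_analytics"
    sql: str                                     # the expensive query to materialize
    sources: list = field(default_factory=list)  # upstream tables the query depends on

class QueryMaterializationManager:
    def __init__(self, spark):
        self.spark = spark
        self.registry = {}

    def register(self, query: MaterializedQuery) -> None:
        self.registry[query.name] = query

    def materialize(self, name: str) -> None:
        q = self.registry[name]
        # Overwrite the target Delta table with fresh results (schema assumed to exist)
        self.spark.sql(q.sql).write.format("delta").mode("overwrite").saveAsTable(q.name)

    def refresh_dependents(self, changed_table: str) -> None:
        # Re-materialize every registered query that reads from the changed table
        for q in self.registry.values():
            if changed_table in q.sources:
                self.materialize(q.name)

# Usage: register the CLV pattern, then refresh whenever new transactions land
manager = QueryMaterializationManager(spark)
manager.register(MaterializedQuery(
    name="materialized.customer_lifetime_analytics",
    sql="SELECT customer_id, SUM(purchase_amount) AS total_lifetime_value "
        "FROM customer_transactions GROUP BY customer_id",
    sources=["customer_transactions"],
))
manager.refresh_dependents("customer_transactions")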

Alternatives:

  • Automatic query pattern detection: Use ML algorithms to identify common query patterns and automatically suggest materialization candidates
  • Federated materialization: Distribute materialized views across multiple storage tiers based on access frequency and performance requirements
  • Incremental materialization with change data capture: Implement real-time incremental updates using CDC patterns for near-real-time analytical results

ETL/Streaming Optimization Tactics

ETL (Extract-Transform-Load) and streaming workloads often involve constant data movement and long-running processes. Cost optimization here focuses on improving batch efficiency, avoiding unnecessary compute during low loads, and leveraging scaling to handle throughput spikes.

1. Optimize Data Factory pipeline batch sizing to reduce execution costs

A data engineering team was running 240 separate Data Factory pipelines for daily shipment processing, with each pipeline handling 50-200 records individually. Pipeline overhead costs reached $4,100 monthly, with 85% of execution time spent on pipeline initialization rather than data processing.

The team refactored their pipelines to batch process data in larger chunks and drastically reduce the number of pipeline invocations. They combined those 240 daily runs into about a dozen runs by increasing batch sizes to ~10,000 records and processing multiple shipments in one go.

{
  "pipeline": {
    "name": "optimized_shipment_processing",
    "parameters": {
      "batchSize": "10000",
      "maxParallelism": "8",
      "retryPolicy": {
        "count": 3,
        "intervalInSeconds": 30
      }
    },
    "activities": [
      {
        "name": "BatchShipmentData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "shipment_source",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT * FROM shipments WHERE processed_date = '@{formatDateTime(utcnow(), 'yyyy-MM-dd')}'"
          },
          "sink": {
            "type": "DeltaLakeSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "00:30:00"
          },
          "enableStaging": true,
          "stagingSettings": {
            "linkedServiceName": "AzureBlobStorage",
            "path": "staging/shipments"
          }
        }
      }
    ]
  }
}

Alternatives:

  • Event-driven pipeline triggers: Implement Azure Event Grid triggers to batch multiple events before pipeline execution
  • Parallel pipeline execution: Configure multiple parallel copy activities with optimized degree of parallelism for large datasets
  • Incremental pipeline patterns: Use watermark-based incremental processing to reduce data movement and processing overhead

2. Implement Delta Lake merge optimization for streaming ingestion

A data engineering team was ingesting 2.8 million transaction records daily through streaming pipelines, with each micro-batch performing individual MERGE operations causing excessive file fragmentation and compute overhead. Streaming costs reached $3,600 monthly with increasing latency issues.

The team optimized the streaming pipeline by batching up merges and introducing periodic compaction. Instead of merging every single micro-batch (which might be a few hundred records), they accumulated a few micro-batches and merged them in one go, and regularly ran OPTIMIZE to combine small files.

# Optimized streaming Delta merge with batching and compaction
from delta.tables import DeltaTable
import pyspark.sql.functions as F

def optimized_streaming_merge(spark, source_stream, target_table_path):
    # Configure streaming batch parameters for efficiency; foreachBatch handles the Delta write
    streaming_query = (source_stream
        .writeStream
        .trigger(processingTime='2 minutes')  # Accumulate micro-batches for efficiency
        .option("checkpointLocation", f"{target_table_path}/_checkpoints/streaming")
        .foreachBatch(lambda batch_df, epoch_id: process_streaming_batch(batch_df, epoch_id, target_table_path))
        .start()
    )

    return streaming_query

def process_streaming_batch(batch_df, epoch_id, target_table_path):
    print(f"Processing batch {epoch_id} with {batch_df.count()} records")

    # Load target Delta table
    target_table = DeltaTable.forPath(spark, target_table_path)

    # Optimize merge conditions for performance
    merge_condition = """
        target.transaction_id = source.transaction_id AND
        target.account_id = source.account_id
    """

    # Execute optimized merge with batch grouping
    (target_table.alias("target")
        .merge(batch_df.alias("source"), merge_condition)
        .whenMatchedUpdate(set={
            "amount": F.col("source.amount"),
            "status": F.col("source.status"),
            "updated_timestamp": F.current_timestamp()
        })
        .whenNotMatchedInsert(values={
            "transaction_id": F.col("source.transaction_id"),
            "account_id": F.col("source.account_id"),
            "amount": F.col("source.amount"),
            "status": F.col("source.status"),
            "created_timestamp": F.current_timestamp(),
            "updated_timestamp": F.current_timestamp()
        })
        .execute()
    )

    # Trigger compaction for every 10th batch to manage file size
    if epoch_id % 10 == 0:
        spark.sql(f"OPTIMIZE '{target_table_path}' ZORDER BY (account_id, transaction_date)")
        print(f"Triggered optimization for batch {epoch_id}")

# Streaming source configuration with optimized batching
streaming_source = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "financial_transactions")
    .option("startingOffsets", "latest")
    .option("maxOffsetsPerTrigger", "50000")  # Control batch size
    .load()
)

Alternatives:

  • Change data capture patterns: Implement CDC-based streaming to reduce merge overhead by processing only changed records
  • Multi-hop architecture: Use Bronze-Silver-Gold pattern with incremental processing at each layer to optimize compute distribution
  • Liquid clustering for streaming: Enable auto-clustering on streaming tables to optimize file layout without manual ZORDER operations

3. Configure autoscaling for Spark compute pools to reduce idle cluster costs

A data engineering team was running 6 dedicated Spark clusters 24/7 for varying ETL workloads, resulting in $8,900 monthly costs with 35% average utilization. Peak processing periods required maximum capacity while off-hours left clusters mostly idle.

The team switched to using Spark compute pools with autoscaling in Fabric. Spark pools allow specifying a min and max number of nodes, and Fabric can auto-scale the pool based on load. They configured each pool according to its workload pattern (e.g., a “batch ETL” pool that can scale to many nodes during nightly jobs, a smaller “ad-hoc” pool with lower max, etc.).
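
For illustration, the pool definitions might be created programmatically rather than through the UI. The sketch below assumes the Fabric REST API’s workspace custom-pool endpoint; the endpoint path and payload field names mirror the pool settings exposed in the Fabric UI (node size, autoscale range, dynamic executor allocation), but treat them as assumptions to verify against the current API reference, and the pool names and sizes as placeholders.

# Hedged sketch: define workload-specific Spark pools with autoscale via the Fabric REST API.
# Verify the endpoint and payload shape against the current Fabric API docs before use.
import requests

WORKSPACE_ID = "<workspace-id>"
FABRIC_API = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/spark/pools"

POOL_PROFILES = [
    # Nightly batch ETL: allowed to scale wide during the load window
    {"name": "batch-etl", "nodeFamily": "MemoryOptimized", "nodeSize": "Medium",
     "autoScale": {"enabled": True, "minNodeCount": 1, "maxNodeCount": 32},
     "dynamicExecutorAllocation": {"enabled": True, "minExecutors": 1, "maxExecutors": 30}},
    # Ad-hoc exploration: small floor, modest ceiling
    {"name": "ad-hoc", "nodeFamily": "MemoryOptimized", "nodeSize": "Small",
     "autoScale": {"enabled": True, "minNodeCount": 1, "maxNodeCount": 8},
     "dynamicExecutorAllocation": {"enabled": True, "minExecutors": 1, "maxExecutors": 6}},
]

def create_pools(bearer_token: str) -> None:
    headers = {"Authorization": f"Bearer {bearer_token}", "Content-Type": "application/json"}
    for profile in POOL_PROFILES:
        resp = requests.post(FABRIC_API, headers=headers, json=profile)
        resp.raise_for_status()
        print(f"Created pool {profile['name']}: scales "
              f"{profile['autoScale']['minNodeCount']}-{profile['autoScale']['maxNodeCount']} nodes")

Each pool idles at its minimum node count and only scales wide during its workload’s window, which is what recovers the idle-cluster cost.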

Alternatives:

  • Spot instance integration: Use Azure Spot VMs for non-critical batch workloads to achieve additional 60-80% cost savings
  • Workload-based cluster pools: Create specialized pools optimized for specific workload characteristics (CPU-intensive, memory-intensive, IO-intensive)
  • Scheduled autoscaling: Implement time-based scaling policies aligned with known business patterns and peak usage windows

4. Optimize Data Factory copy activity parallelism to reduce data movement costs

A retail data engineering team ran a daily pipeline to copy 850 GB of sales data from various sources into their lakehouse. Originally, the copy was largely sequential: the pipeline read one source table at a time and then moved to the next, because of dependencies in the pipeline. This resulted in an 8-hour copy process each day and about $2,800/month in Data Factory data movement costs. During peak season, the long sequential copy became a bottleneck that delayed downstream processing.

They reworked the pipeline to maximize parallelism in the copy operations. The new approach involved: (a) partitioning the source data so it could be read in parallel chunks, and (b) increasing the Data Integration Units (DIUs) and using the parallelCopies setting of the Copy activity to perform simultaneous transfers.
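
For illustration, the Copy activity settings described above might look like the fragment below, expressed as a Python dict that maps onto the activity’s typeProperties JSON. The property names (parallelCopies, dataIntegrationUnits, partitionOption) follow the Data Factory copy-activity schema, while the specific values and the sink type are placeholders to tune for your workload.

# Hedged sketch of the tuned Copy activity typeProperties (values are placeholders)
copy_type_properties = {
    "source": {
        "type": "SqlServerSource",
        # Read each physical partition of the source table in parallel
        "partitionOption": "PhysicalPartitionsOfTable",
    },
    "sink": {
        "type": "DeltaLakeSink",
        "writeBatchSize": 10000,
    },
    # Up to 16 simultaneous copy threads against the partitioned source
    "parallelCopies": 16,
    # Raise DIUs so throughput, not allocated compute, is the limiting factor
    "dataIntegrationUnits": 64,
}

With partitioned reads plus higher parallelCopies and DIUs, the nightly copy runs as concurrent chunks instead of one long sequential transfer.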

In addition to configuring a single Copy activity for high throughput, they sometimes needed to orchestrate multiple copies in parallel when data came from multiple source tables or partitions. They leveraged a parameterized pipeline that could spawn copies for each partition.

Data Partitioning Strategy: On the source side, they also ensured the source database was partitioned appropriately to help with parallel extraction. For instance, if the source was SQL Server, they partitioned the sales_data table by date (or some key) so that reading different partitions in parallel wouldn’t contend on the same I/O. For example:

-- SQL: Partition the source table by month (for example)
CREATE PARTITION FUNCTION sales_data_partition_func (DATE)
AS RANGE RIGHT FOR VALUES
    ('2024-01-01', '2024-02-01', ..., '2024-12-01');
CREATE PARTITION SCHEME sales_data_partition_scheme
AS PARTITION sales_data_partition_func
TO (sales_data_fg1, sales_data_fg2, ..., sales_data_fg12);
-- (Where sales_data_fg1...fg12 are filegroups or storage for each partition range)

-- Verify partition row distribution (to ensure even workload spread)
SELECT partition_number, record_count, page_count, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('sales_data'), NULL, NULL, 'DETAILED');

Alternatives:

  • PolyBase integration: Use PolyBase for high-performance data loading into SQL-based targets with native parallelism
  • Azure Synapse Pipelines with COPY statement: Leverage optimized COPY commands for bulk data loading with automatic parallelization
  • Event-driven copy patterns: Implement file-based triggers to enable real-time incremental copying instead of large batch operations

5. Implement compression and file format optimization

A team was storing 12TB of clinical data monthly in uncompressed JSON format, resulting in $4,200 storage costs and slow query performance. Data transfer between systems consumed additional $1,800 monthly in egress charges.

The team undertook a comprehensive format and compression optimization project. The goal was to convert the raw data into an optimized format (like Parquet or Delta), apply compression to shrink size, and prune any unnecessary data at the source. They also looked at optimizing data types to reduce footprint (for example, storing integers in smaller types when possible).

They used Snappy compression for Parquet (which is splittable and fast to decompress). They also set a large Parquet block size (256 MB) to ensure each Parquet file is relatively large and sequential, which is good for scanning performance, and a page size of 1 MB to improve compression ratio on highly repetitive data.

Additionally, they implemented logic to optimize data types: for example, if an integer column only contains values that fit in 2 bytes, cast it to SMALLINT (which cuts that column’s storage by 75%). For string columns, they looked at max length to decide if any optimizations (like using dictionary encoding) would naturally happen.
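
A minimal PySpark sketch of that conversion is shown below, assuming a notebook-provided spark session. The paths and column names (visit_count, ingest_date) are placeholders, and the Parquet block and page sizes are set through the Hadoop configuration as described above.

# Hedged sketch: JSON -> Snappy-compressed Parquet with larger row groups, plus narrowing
# an integer column that fits in 2 bytes. Paths and column names are illustrative.
from pyspark.sql import functions as F
from pyspark.sql.types import ShortType

# Larger Parquet row groups (256 MB) and 1 MB pages, via the Hadoop configuration
spark.sparkContext._jsc.hadoopConfiguration().setInt("parquet.block.size", 256 * 1024 * 1024)
spark.sparkContext._jsc.hadoopConfiguration().setInt("parquet.page.size", 1024 * 1024)

df = spark.read.json("Files/clinical/raw_json/")

# Narrow integer columns whose values fit in SMALLINT (roughly 75% less storage for that column)
stats = df.agg(F.max("visit_count").alias("max_visits")).first()
if stats["max_visits"] is not None and stats["max_visits"] < 32767:
    df = df.withColumn("visit_count", F.col("visit_count").cast(ShortType()))

# Write Snappy-compressed Parquet, partitioned by ingestion date
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .partitionBy("ingest_date")
   .parquet("Files/clinical/optimized_parquet/"))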

They didn’t stop at one setting; different datasets got different treatments. For example, they defined compression strategies for different data types:

  • Time-series sensor data: use Parquet with Snappy (good balance of speed and size), partition by date, sort by time (to compress time deltas well).
  • Transactional data: use Delta with Gzip for maximum compression, since queries might filter on many columns.
  • Analytical fact tables: use Parquet with LZ4 for faster read but still some compression, partition by broader categories.

When Fabric optimization reaches its limits: The e6data alternative

Even after you’ve squeezed the waste out of Fabric, right-sized Spark, tuned refreshes, and compacted files, some BI/SQL workloads still pay the “cluster tax.” That’s where e6data comes in.

What it is: e6data is a lakehouse query engine with per-vCPU pricing. It atomically scales compute to each query’s needs (by the second), so you don’t size or babysit clusters. It runs directly on your existing Delta, Parquet, CSV, and JSON data, so there’s no data migration and no model rewrites. Many teams keep the Fabric platform for collaborative development and ML, and offload cost-sensitive BI/SQL to e6data for extra savings.

Key benefits of the e6data approach

  • Atomic scaling & pricing: Instead of paying for an 8-node cluster while using ~2 nodes worth, e6data allocates exactly the CPU your query needs. You pay only for the vCPU-seconds you consume.
  • Works on your existing data: Point it at Delta/Parquet/CSV/JSON in your lakehouse. No copy jobs, no re-architecture.
  • Zero infra overhead: No clusters to manage. No idle nodes. For spiky or mixed BI workloads, teams often see another 40–60% cost reduction on top of their Fabric optimizations.

➡️ Interested in exploring this? Start a free trial of e6data and see how it compares on your own workloads.
