How to Optimize BigQuery Costs? {Updated 2025 Guide}

August 14, 2025 / e6data Team
BigQuery
Cost Optimization
Beginner

BigQuery's serverless architecture and automatic scaling deliver unmatched analytical performance, but that same flexibility can lead to unexpected cost spikes when teams optimize for speed without understanding the underlying slot-based pricing model. We've worked with dozens of data engineering teams running petabyte-scale analytics on BigQuery, and the pattern is consistent: costs often spiral when teams treat BigQuery as "pay-per-query" without considering slot utilization, data scanning patterns, and storage lifecycle management.

The reality? BigQuery's pricing complexity stems from its hybrid model: on-demand queries, reserved slots, storage tiers, and streaming inserts all follow different cost structures. Unlike traditional databases where you pay for fixed infrastructure, BigQuery charges for actual resource consumption, but optimizing that consumption requires understanding query execution patterns, data layout strategies, and workload classification.

This playbook focuses on practical optimizations that data engineering teams can implement immediately, organized by workload patterns that drive the highest costs. Each tactic includes runnable SQL configurations and BigQuery-specific monitoring queries, so you can start reducing costs without sacrificing the sub-second query performance your business users expect.

When does BigQuery spending become "high cost"?

Before diving into optimizations, it's crucial to establish what qualifies as "high cost" in BigQuery. These benchmarks help identify inefficient usage patterns:

Metric | Threshold | Impact
Query cost per TB scanned | >$6.00 per TB for routine analytics | Indicates poor partitioning, excessive SELECT *, or missing materialized views
Slot utilization efficiency | >85% sustained usage or <20% average usage | Over-provisioning wastes money; under-provisioning creates queue delays
Storage growth rate | >25% monthly without proportional query benefits | Poor lifecycle management or unnecessary data retention
Streaming insert costs | >$0.02 per 1,000 rows for real-time ingestion | Inefficient batch sizes or unnecessary streaming frequency
Data egress charges | >$0.10/GB for cross-region transfers | Multi-region datasets with poor data locality
Query queue wait times | >30 seconds average during peak hours | Under-provisioned slots or inefficient concurrent query patterns
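
To check where you stand against these thresholds, a query like the one below (a sketch against the standard `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT` view; adjust the region qualifier and the ~$6.25/TiB on-demand rate for your setup) computes effective spend per TB scanned, cache-hit rate, and slot hours over the last 30 days:

-- Rough benchmark check: effective $/TB scanned, cache hit rate, and slot hours
SELECT 
    DATE(creation_time) as query_date,
    SUM(total_bytes_billed) / 1e12 as tb_billed,
    SUM(total_bytes_billed) / 1e12 * 6.25 as approx_on_demand_cost_usd,  -- adjust rate for your region/pricing model
    COUNTIF(cache_hit) / COUNT(*) as cache_hit_rate,
    SUM(total_slot_ms) / 1000 / 3600 as slot_hours
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
GROUP BY query_date
ORDER BY query_date DESC;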

Workload Taxonomy

BigQuery costs primarily spike when one of three usage patterns dominates your workload. Understanding which pattern drives your highest expenses is crucial for targeted optimization:

BI Dashboards - High-frequency analytical queries

Cost drivers: Repeated full-table scans, inefficient result caching, poor partitioning that doesn't match filter patterns, excessive slot allocation for peak concurrency, and materialized view maintenance overhead.

Optimization opportunities: Result caching eliminates duplicate computation, materialized views pre-aggregate frequent patterns, intelligent partitioning and clustering reduce scan volumes, and BI Engine acceleration delivers sub-second response times for small result sets.

Ad-hoc Analytics - Exploratory queries with unpredictable needs

Cost drivers: Full-table scans for hypothesis testing, lack of sampling strategies, inefficient JOIN operations, poor query complexity governance, and unnecessary precision in exploratory analysis that wastes compute resources.

Optimization opportunities: Intelligent data sampling reduces scan costs by 90%+ while maintaining statistical validity, approximate algorithms provide faster exploration, query complexity scoring prevents runaway costs, and federated queries minimize data movement expenses.

ETL/Streaming - High-volume data processing

Cost drivers: Inefficient batch loading patterns, streaming insert frequency that exceeds business requirements, poor compression and storage layout choices, unnecessary cross-region data replication, and suboptimal scheduling that creates slot contention.

Optimization opportunities: Optimized batch sizing and compression reduce storage and compute costs, streaming insert consolidation eliminates unnecessary micro-batching, data lifecycle policies automatically manage storage tiers, and intelligent scheduling spreads resource usage across lower-cost time windows.

BI Dashboards

Typical cost pitfalls: Oversized slot allocation for peak concurrency, repeated full-table scans for dashboard refreshes, and poor result caching strategies that miss obvious optimization opportunities.

1. Implement materialized views for dashboard acceleration to reduce query costs

A retail analytics team was refreshing executive dashboards against a large transactions table every 15 minutes. Each refresh scanned substantial amounts of data, consuming significant query costs daily for identical aggregations across product categories, regions, and time periods. Dashboard queries repeatedly aggregate the same dimensions, but BigQuery rescans and re-aggregates unless you persist the results in materialized views that incrementally update.

Fix: Materialized view pipeline for dashboard queries (an example scenario). This turns heavy queries that scanned billions of bytes into lightweight queries scanning just a few megabytes of pre-computed data.

-- Create base materialized view for daily sales aggregations
-- Note: materialized views can't use non-deterministic functions (e.g., CURRENT_DATE())
-- or exact COUNT(DISTINCT); use partition expiration and APPROX_COUNT_DISTINCT instead.
CREATE MATERIALIZED VIEW `analytics.daily_sales_summary`
PARTITION BY DATE(order_date)
CLUSTER BY region, category
AS
SELECT 
    DATE(order_date) as order_date,
    region,
    category,
    COUNT(*) as order_count,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value,
    APPROX_COUNT_DISTINCT(customer_id) as unique_customers
FROM `sales.transactions`
GROUP BY 1, 2, 3;

-- Dashboard query now scans MB instead of TB
SELECT 
    region,
    SUM(total_revenue) as revenue,
    SUM(order_count) as orders,
    AVG(avg_order_value) as avg_order_value
FROM `analytics.daily_sales_summary`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY region
ORDER BY revenue DESC;

-- Monitor materialized view refresh behavior
SELECT 
    table_name,
    last_refresh_time,
    refresh_watermark
FROM `analytics.INFORMATION_SCHEMA.MATERIALIZED_VIEWS`
WHERE table_name = 'daily_sales_summary';

Alternatives

  • Use scheduled queries to manually refresh aggregated tables with more control over timing and costs
  • Implement query result caching for static time periods to eliminate repeated computation
  • Consider BI Engine for sub-second acceleration of small result sets under 100GB
  • Use table snapshots for point-in-time dashboard views that don't require real-time data (a snapshot sketch follows this list)
  • Leverage e6data for variable dashboard workloads that benefit from per-vCPU autoscaling up to 1,000+ QPS instead of reserved slots
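
If you take the table-snapshot route above, a minimal sketch looks like the following (assuming a regular summary table such as a hypothetical `analytics.dashboard_metrics`; snapshots apply to tables, not materialized views, and snapshot storage is billed only for changes relative to the base table):

-- Point-in-time snapshot for month-end dashboard views
CREATE SNAPSHOT TABLE `analytics.dashboard_metrics_snapshot_month_end`
CLONE `analytics.dashboard_metrics`
OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
);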

2. Optimize partitioning and clustering for dashboard filter patterns to improve query execution

A financial services dashboard was filtering a large customer events table by date ranges and account types, but the original partitioning strategy didn't match the dashboard filter patterns, forcing full partition scans across many date ranges.

Fix: Redesign partitioning strategy based on actual dashboard query patterns (an example scenario). In this case, partition by date (since every query uses a date range) and cluster by account_type (and perhaps event_type) so that queries which filter those fields will scan a much smaller subset of data.

-- Analyze existing query patterns to optimize partitioning
-- (RE2 has no lookahead, so capture the WHERE clause with a lazy group instead)
SELECT 
    REGEXP_EXTRACT(query, r'(?is)WHERE(.*?)(?:GROUP BY|ORDER BY|LIMIT|$)') as where_patterns,
    COUNT(*) as frequency,
    AVG(total_bytes_billed / 1e12) as avg_tb_scanned
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    AND query LIKE '%customer_events%'
    AND statement_type = 'SELECT'
GROUP BY where_patterns
ORDER BY frequency DESC
LIMIT 10;

-- Create optimally partitioned table
CREATE TABLE `analytics.customer_events_optimized` (
    event_date DATE,
    account_type STRING,
    customer_id STRING,
    event_type STRING,
    event_value NUMERIC
)
PARTITION BY event_date
CLUSTER BY account_type, event_type
OPTIONS(
    partition_expiration_days = 730,
    require_partition_filter = true
);

-- Migrate data with optimal layout
INSERT INTO `analytics.customer_events_optimized`
SELECT * FROM `analytics.customer_events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 730 DAY);

-- Dashboard queries now skip irrelevant partitions
SELECT 
    account_type,
    COUNT(*) as events,
    SUM(event_value) as total_value
FROM `analytics.customer_events_optimized`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
    AND account_type IN ('premium', 'business')
GROUP BY account_type;

Alternatives

  • Use partition decorators for time-based queries to ensure partition pruning effectiveness
  • Implement time-unit column partitioning (hourly vs daily) based on dashboard refresh frequency
  • Consider range partitioning for non-date columns that have natural boundaries (sketched after this list)
  • Use clustering on multiple columns to optimize complex WHERE clauses
  • Apply query optimization techniques like predicate pushdown for nested and repeated fields
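
For the integer-range partitioning option above, a minimal sketch (the `account_bucket` column and bucket boundaries are illustrative):

-- Integer-range partitioning for a non-date key with natural boundaries
CREATE TABLE `analytics.customer_events_by_bucket` (
    account_bucket INT64,
    account_type STRING,
    event_type STRING,
    event_value NUMERIC
)
PARTITION BY RANGE_BUCKET(account_bucket, GENERATE_ARRAY(0, 1000000, 10000))
CLUSTER BY account_type, event_type;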

3. Implement intelligent result caching and query optimization to reduce duplicate computation

A marketing analytics platform ran the same complex cohort analysis queries many times daily across different dashboard users, with each query scanning large amounts of user behavior data and consuming significant costs per execution. There was no caching or reuse of results. Even though 90% of the logic was identical across these queries, BigQuery treated them as separate and scanned all the data each time. Small differences in the query SQL (like different literal values or slightly different SQL structure) meant the built-in cache wasn’t being hit.

Fix: Multi-level caching and query optimization strategy (an example scenario)

-- Enable query result caching and optimize cache hit rates
-- First, standardize query patterns to improve cache efficiency
CREATE OR REPLACE VIEW `analytics.cohort_analysis_base` AS
SELECT 
    DATE_TRUNC(first_purchase_date, WEEK) as cohort_week,
    customer_id,
    DATE_DIFF(purchase_date, first_purchase_date, WEEK) as weeks_since_first_purchase,
    purchase_amount
FROM `sales.customer_purchases`
WHERE first_purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 52 WEEK);

-- Create cacheable cohort queries with consistent structure
WITH cohort_base AS (
    SELECT 
        cohort_week,
        weeks_since_first_purchase,
        COUNT(DISTINCT customer_id) as customers,
        SUM(purchase_amount) as revenue
    FROM `analytics.cohort_analysis_base`
    WHERE cohort_week >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 WEEK)
    GROUP BY 1, 2
),
cohort_sizes AS (
    SELECT 
        cohort_week,
        COUNT(DISTINCT customer_id) as cohort_size
    FROM `analytics.cohort_analysis_base`
    WHERE weeks_since_first_purchase = 0
    GROUP BY 1
)
SELECT 
    cb.cohort_week,
    cb.weeks_since_first_purchase,
    cb.customers,
    cb.revenue,
    cb.customers / cs.cohort_size as retention_rate
FROM cohort_base cb
JOIN cohort_sizes cs ON cb.cohort_week = cs.cohort_week
ORDER BY cb.cohort_week, cb.weeks_since_first_purchase;

-- Monitor cache hit rates and optimize queries for caching
SELECT 
    job_id,
    query,
    cache_hit,
    total_bytes_billed / 1e9 as gb_billed,
    total_slot_ms / 1000 as slot_seconds
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    AND statement_type = 'SELECT'
ORDER BY total_bytes_billed DESC;

Alternatives

  • Use scheduled queries to pre-compute dashboard results and store them in dedicated tables
  • Create parameterized dashboard views that enable cache reuse across different filter combinations
  • Use BigQuery BI Engine for automatic caching of small result sets with sub-second access times

4. Configure slot autoscaling and workload isolation to reduce idle slot costs

A business intelligence team was paying for substantial reserved slots to handle peak dashboard loads during business hours, but utilization dropped significantly outside business hours and on weekends, wasting considerable monthly costs on idle capacity. If you allocate, say, 500 slots to guarantee low latency at peak, those slots cost the same 24/7. Without autoscaling or workload isolation, you either waste money or, if you try to go on-demand to save cost, you risk slowdowns at peak.

Fix: Implement autoscaling reservations with workload isolation (an example scenario)

-- Create separate reservations for different workload classes
-- (reservation DDL runs in the reservation admin project; the region is part of the path,
-- and edition/option availability varies by plan)
-- Production dashboards: guaranteed baseline capacity
CREATE RESERVATION `your-project.region-us.dashboard_production`
OPTIONS (
    slot_capacity = 500,
    edition = 'ENTERPRISE'
);

-- Analytical workloads: small baseline with autoscaling for variable demand
CREATE RESERVATION `your-project.region-us.analytics_flex`
OPTIONS (
    slot_capacity = 100,
    edition = 'ENTERPRISE',
    autoscale_max_slots = 1000
);

-- Assign projects to the appropriate reservations
CREATE ASSIGNMENT `your-project.region-us.dashboard_production.dashboard_assignment`
OPTIONS (
    assignee = 'projects/dashboard-prod-project',
    job_type = 'QUERY'
);

-- Monitor slot utilization to right-size reservations
-- (sums per-second slot usage from the jobs timeline into average slots per hour)
SELECT 
    reservation_id,
    TIMESTAMP_TRUNC(period_start, HOUR) as usage_hour,
    SUM(period_slot_ms) / (1000 * 3600) as avg_slots_used
FROM `region-us.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT`
WHERE period_start >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY reservation_id, usage_hour
ORDER BY usage_hour;

-- Implement time-based slot scaling
CREATE OR REPLACE PROCEDURE `analytics.scale_slots`(target_slots INT64)
BEGIN
    EXECUTE IMMEDIATE FORMAT("""
        ALTER RESERVATION `your-project.region-us.analytics_flex`
        SET OPTIONS (slot_capacity = %d)
    """, target_slots);
END;

Alternatives

  • Use on-demand pricing for truly variable workloads instead of maintaining reservations
  • Implement query queuing and priority systems to better utilize available slot capacity
  • Schedule heavy analytical workloads during off-peak hours to reduce slot contention
  • Use BigQuery's job timeout settings to prevent runaway queries from consuming excessive slots
  • Consider hybrid deployment with e6data for complex workloads as it can autoscale up to 1000 QPS.

5. Leverage BI Engine for sub-second dashboard acceleration to improve user experience metrics

A customer support dashboard displaying real-time metrics was experiencing 5-15 second load times despite optimized queries, causing user frustration and reducing dashboard adoption across the support organization.

Fix: BI Engine configuration for accelerated dashboard performance (an example scenario)

-- BI Engine capacity is reserved per project and region through the console or the
-- Reservation API (optionally with preferred tables); it isn't enabled with table DDL.
-- Once a reservation exists, check which jobs were accelerated and why others weren't:
SELECT 
    job_id,
    bi_engine_statistics.bi_engine_mode,
    bi_engine_statistics.bi_engine_reasons,
    total_bytes_billed / 1e9 as gb_billed
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    AND bi_engine_statistics IS NOT NULL
ORDER BY creation_time DESC;

-- Optimize queries for BI Engine acceleration
-- Keep result sets small and filters selective for optimal acceleration
SELECT 
    support_agent,
    ticket_status,
    COUNT(*) as ticket_count,
    AVG(resolution_time_hours) as avg_resolution_time
FROM `support.tickets_summary`
WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND created_date <= CURRENT_DATE()
GROUP BY support_agent, ticket_status
HAVING ticket_count >= 5
ORDER BY ticket_count DESC
LIMIT 100;

-- Create BI Engine friendly views for common dashboard patterns
CREATE OR REPLACE VIEW `analytics.realtime_kpis` AS
SELECT 
    DATE(event_timestamp) as event_date,
    EXTRACT(HOUR FROM event_timestamp) as event_hour,
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users
FROM `events.user_interactions`
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3;

Alternatives

  • Use materialized views with frequent refresh schedules for near real-time performance
  • Implement application-level caching layers with Redis for sub-second response times
  • Create pre-aggregated summary tables that update incrementally throughout the day
  • Use streaming inserts with minute-level aggregations for real-time dashboard updates
  • Consider e6data's lakehouse query engine for consistent sub-second performance across variable workloads

Ad-hoc Analytics

Typical cost pitfalls: Full-table scans for exploratory queries, lack of intelligent sampling strategies, and poor query complexity governance that allows runaway costs.

1. Use TABLESAMPLE and intelligent sampling for exploration to reduce scan costs for hypothesis testing

A data science team was running full-table scans on their large customer behavior dataset to test simple hypotheses like "conversion rates by traffic source" or "seasonal patterns in user engagement." Each exploratory query was scanning substantial amounts of data and generating significant costs per execution.

Full-table exploration wastes compute when representative samples provide equally valid statistical insights for hypothesis testing and pattern discovery.

Fix: Multi-level sampling strategy for cost-effective exploration (an example scenario)

-- Systematic sampling for representative analysis
SELECT 
    traffic_source,
    COUNT(*) as total_users,
    COUNT(CASE WHEN converted = true THEN 1 END) as conversions,
    COUNT(CASE WHEN converted = true THEN 1 END) / COUNT(*) as conversion_rate
FROM `analytics.customer_journey` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE visit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY traffic_source
ORDER BY conversion_rate DESC;

-- Fixed-size random sample regardless of table growth
-- (BigQuery's TABLESAMPLE supports only SYSTEM ... PERCENT, so use RAND() for an exact
-- row count; note this approach still scans the filtered data)
SELECT 
    device_type,
    session_duration_minutes,
    pages_viewed,
    purchase_amount
FROM `analytics.user_sessions`
WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY RAND()
LIMIT 10000;

-- Stratified sampling for balanced representation across segments
WITH segment_samples AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_segment ORDER BY RAND()) as rn
    FROM `analytics.user_profiles`
    WHERE last_active_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)
)
SELECT 
    user_segment,
    avg_monthly_spend,
    total_sessions,
    account_tenure_months
FROM segment_samples
WHERE rn <= 1000  -- 1000 users per segment
ORDER BY user_segment, avg_monthly_spend DESC;

-- Monitor sampling accuracy vs full table results
SELECT 
    'full_table' as method,
    COUNT(*) as total_rows,
    AVG(session_duration) as avg_duration,
    STDDEV(session_duration) as duration_stddev
FROM `analytics.user_sessions`
WHERE session_date = CURRENT_DATE()

UNION ALL

SELECT 
    'sampled_1pct' as method,
    COUNT(*) * 100 as extrapolated_total,  -- Scale up sample
    AVG(session_duration) as avg_duration,
    STDDEV(session_duration) as duration_stddev
FROM `analytics.user_sessions` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE session_date = CURRENT_DATE();

Alternatives

  • Create dedicated sample tables with automated refresh for repeated exploration patterns (sketched after this list)
  • Use approximate aggregate functions like HLL_COUNT.INIT for rapid cardinality estimation
  • Implement time-based sampling to focus analysis on recent, more relevant data periods
  • Build stratified sample views that maintain representation across key business dimensions
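
For the dedicated-sample-table alternative, a sketch (assuming `analytics.user_sessions` has a DATE column `session_date`); schedule the statement as a BigQuery scheduled query so the sample refreshes automatically:

-- Refresh a 1% sample table that analysts query instead of the raw table
CREATE OR REPLACE TABLE `analytics.user_sessions_sample_1pct`
PARTITION BY session_date AS
SELECT *
FROM `analytics.user_sessions` TABLESAMPLE SYSTEM (1 PERCENT);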

2. Leverage approximate functions for faster statistical analysis to improve performance on large datasets

A marketing analytics team needed quick approximations of unique users, funnel conversion rates, and customer lifetime value across large datasets, but exact calculations were taking substantial time per query and consuming significant slot resources.

Exact computations on very large data (especially COUNT(DISTINCT), medians, 95th percentiles, top-k elements, etc.) require scanning and shuffling a lot of data. For example, counting distinct users over a big dataset can’t be easily broken down and usually triggers a large shuffle in BigQuery. Similarly, calculating precise percentiles can be heavy. Doing this frequently (say daily or ad-hoc) incurs big costs and might not be timely.

Fix: Approximate algorithm optimization for exploratory analytics (an example scenario)

-- Replace exact unique counts with HyperLogLog approximation
SELECT 
    campaign_id,
    event_date,
    HLL_COUNT.MERGE(user_hll) as approx_unique_users,
    SUM(events) as total_events,
    SUM(event_value) as total_value
FROM (
    SELECT 
        campaign_id,
        event_date,
        HLL_COUNT.INIT(user_id) as user_hll,
        COUNT(*) as events,
        SUM(purchase_amount) as event_value
    FROM `marketing.campaign_events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY 1, 2
)
GROUP BY campaign_id, event_date
ORDER BY approx_unique_users DESC;

-- Use approximate quantiles for percentile analysis
SELECT 
    product_category,
    APPROX_QUANTILES(price, 100)[OFFSET(25)] as p25_price,
    APPROX_QUANTILES(price, 100)[OFFSET(50)] as median_price,
    APPROX_QUANTILES(price, 100)[OFFSET(75)] as p75_price,
    APPROX_QUANTILES(price, 100)[OFFSET(95)] as p95_price
FROM `sales.product_purchases`
WHERE purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY product_category;

-- Approximate top-K analysis for trending insights
SELECT 
    APPROX_TOP_COUNT(search_query, 20) as top_searches,
    APPROX_TOP_SUM(search_query, clicks, 15) as top_searches_by_clicks
FROM `search.query_logs`
WHERE search_date = CURRENT_DATE();

-- Compare approximate vs exact for accuracy validation
WITH exact_results AS (
    SELECT 
        COUNT(DISTINCT user_id) as exact_users,
        ANY_VALUE(exact_median) as exact_median  -- PERCENTILE_CONT is analytic, so compute it per row first
    FROM (
        SELECT 
            user_id,
            PERCENTILE_CONT(session_duration, 0.5) OVER () as exact_median
        FROM `analytics.sessions`
        WHERE DATE(session_start) = CURRENT_DATE()
    )
),
approx_results AS (
    SELECT 
        HLL_COUNT.INIT(user_id) as approx_users_hll,
        APPROX_QUANTILES(session_duration, 100)[OFFSET(50)] as approx_median
    FROM `analytics.sessions`
    WHERE DATE(session_start) = CURRENT_DATE()
)
SELECT 
    HLL_COUNT.EXTRACT(approx_users_hll) as approx_users,
    exact_users,
    ABS(HLL_COUNT.EXTRACT(approx_users_hll) - exact_users) / exact_users as user_error_pct,
    approx_median,
    exact_median,
    ABS(approx_median - exact_median) / exact_median as median_error_pct
FROM exact_results, approx_results;

Alternatives

  • Pre-compute exact statistics in scheduled batch jobs for frequently analyzed dimensions
  • Use BigQuery ML for approximate pattern recognition in large datasets
  • Implement hybrid approaches that use exact calculations for small result sets and approximations for large ones
  • Create approximate summary tables that update incrementally with streaming data

3. Implement query complexity scoring to prevent runaway costs and reduce expensive query incidents

A data science team was hit with several expensive surprise bills when analysts accidentally created Cartesian products or inefficient window functions across large tables; the queries ran for extended periods before anyone intervened manually.

There was no proactive check on query complexity or cost before execution. BigQuery will happily run a 100 TB join if you have the quota, and while project-level quotas can be set, they were too broad to catch individual runaway queries. The team needed a way to flag obviously complex or potentially costly queries before they ran too long, or at least to stop them quickly and inform the user.

Fix: Automated query complexity analysis and cost prevention (an example scenario)

-- Create query complexity analysis function
-- (JavaScript UDFs don't support INT64 inputs/outputs, so the score is FLOAT64)
CREATE OR REPLACE FUNCTION `analytics.estimate_query_cost`(query_text STRING)
RETURNS STRUCT<
    estimated_tb_scanned FLOAT64,
    estimated_cost_usd FLOAT64,
    complexity_score FLOAT64,
    risk_level STRING
>
LANGUAGE js AS """
    // Simple heuristic-based cost estimation
    var cost_per_tb = 6.25;  // approximate US on-demand list price; adjust for your region/pricing
    var complexity_score = 0;
    var estimated_tb = 0.1;  // Base estimate
    
    // Increase estimates based on query patterns
    if (query_text.includes('SELECT *')) complexity_score += 3;
    if (query_text.includes('CROSS JOIN')) complexity_score += 5;
    if (query_text.includes('WINDOW')) complexity_score += 2;
    if ((query_text.match(/JOIN/g) || []).length > 3) complexity_score += 2;
    
    // Estimate TB scanned based on complexity
    estimated_tb = Math.pow(1.5, complexity_score) * 0.1;
    
    var risk_level = complexity_score < 3 ? 'LOW' : 
                    complexity_score < 6 ? 'MEDIUM' : 'HIGH';
    
    return {
        estimated_tb_scanned: estimated_tb,
        estimated_cost_usd: estimated_tb * cost_per_tb,
        complexity_score: complexity_score,
        risk_level: risk_level
    };
""";

-- Monitor and analyze expensive queries
CREATE OR REPLACE VIEW `analytics.query_cost_analysis` AS
WITH query_stats AS (
    SELECT 
        job_id,
        user_email,
        query,
        total_bytes_billed / 1e12 as tb_billed,
        (total_bytes_billed / 1e12) * 6.25 as estimated_cost_usd,  -- on-demand rate; adjust as needed
        total_slot_ms / 1000 as slot_seconds,
        TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds,
        creation_time
    FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        AND state = 'DONE'
        AND statement_type = 'SELECT'
)
SELECT 
    *,
    analytics.estimate_query_cost(query) as complexity_analysis,
    CASE 
        WHEN estimated_cost_usd > 50 THEN 'EXPENSIVE'
        WHEN estimated_cost_usd > 10 THEN 'MODERATE'
        ELSE 'CHEAP'
    END as cost_category
FROM query_stats
WHERE tb_billed > 0.1  -- Focus on queries that scan >100GB
ORDER BY estimated_cost_usd DESC;

-- Create cost alerts and query governance
CREATE OR REPLACE PROCEDURE `analytics.check_query_cost_alerts`()
BEGIN
    DECLARE alert_threshold FLOAT64 DEFAULT 25.0;  -- $25 threshold
    
    CREATE TEMP TABLE recent_expensive_queries AS
    SELECT 
        job_id,
        user_email,
        estimated_cost_usd,
        query,
        creation_time
    FROM `analytics.query_cost_analysis`
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
        AND estimated_cost_usd > alert_threshold;
    
    -- Alert logic would integrate with notification systems
    SELECT 
        CONCAT('COST ALERT: Query ', job_id, ' by ', user_email, 
               ' cost $', CAST(ROUND(estimated_cost_usd, 2) AS STRING)) as alert_message,
        query,
        creation_time
    FROM recent_expensive_queries;
END;

Alternatives

  • Implement query approval workflows for complex queries above cost thresholds
  • Use BigQuery's maximum bytes billed setting to hard-cap individual query costs
  • Create query templates and best practice guidelines for common analytical patterns
  • Set up automated query optimization suggestions based on execution patterns
  • Use e6data's built-in cost controls and automatic query optimization for variable workloads

4. Optimize JOIN strategies and query structure for large analytical queries to reduce execution time

A business intelligence team was running multi-table analytical queries across large fact tables and dimension tables, consuming substantial slot time per query and taking considerable time to complete.

Joining a huge fact table with multiple dimensions at full granularity can be very expensive if all you ultimately need is an aggregated result. Similarly, if you join two huge tables and then filter, it’s often cheaper to filter first or pre-aggregate to reduce the data. The default join order or type isn't optimal for every scenario, though the query planner does a decent job.

Fix: Advanced JOIN optimization and query restructuring (example scenario)

-- Optimize join order and use appropriate join types
-- First, check table sizes to inform join strategy (TABLE_STORAGE is region-scoped)
SELECT 
    table_name,
    total_rows,
    total_logical_bytes / 1e9 as size_gb
FROM `region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_name IN ('customer_orders', 'products', 'customers', 'stores')
ORDER BY total_rows DESC;

-- Small dimension tables are broadcast automatically; keep the large fact table first
SELECT 
    o.order_id,
    c.customer_segment,
    p.product_category,
    s.store_region,
    o.order_amount
FROM `sales.customer_orders` o
LEFT JOIN `reference.customers` c ON o.customer_id = c.customer_id
LEFT JOIN `reference.products` p ON o.product_id = p.product_id  
LEFT JOIN `reference.stores` s ON o.store_id = s.store_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Pre-aggregate before expensive joins
WITH daily_sales AS (
    SELECT 
        order_date,
        customer_id,
        store_id,
        COUNT(*) as order_count,
        SUM(order_amount) as daily_amount,
        AVG(order_amount) as avg_order_value
    FROM `sales.customer_orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY 1, 2, 3
),
customer_metrics AS (
    SELECT 
        customer_id,
        COUNT(DISTINCT order_date) as active_days,
        SUM(daily_amount) as total_spent,
        AVG(avg_order_value) as avg_order_value
    FROM daily_sales
    GROUP BY customer_id
)
SELECT 
    cm.customer_id,
    c.customer_segment,
    cm.active_days,
    cm.total_spent,
    cm.avg_order_value
FROM customer_metrics cm
JOIN `reference.customers` c ON cm.customer_id = c.customer_id
WHERE cm.total_spent >= 1000
ORDER BY cm.total_spent DESC;

-- Use window functions efficiently to avoid self-joins
SELECT 
    customer_id,
    order_date,
    order_amount,
    -- Calculate running totals and rankings in a single pass
    SUM(order_amount) OVER (
        PARTITION BY customer_id 
        ORDER BY order_date 
        ROWS UNBOUNDED PRECEDING
    ) as running_total,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_amount DESC
    ) as amount_rank,
    LAG(order_amount) OVER (
        PARTITION BY customer_id 
        ORDER BY order_date
    ) as previous_order_amount
FROM `sales.customer_orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
QUALIFY amount_rank <= 5;  -- Top 5 orders per customer

-- Monitor join performance and optimization opportunities
SELECT 
    job_id,
    query,
    total_slot_ms / 1000 as slot_seconds,
    total_bytes_processed / 1e9 as gb_processed,
    TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    AND query LIKE '%JOIN%'
    AND total_slot_ms > 60000  -- Queries using >60 slot-seconds
ORDER BY total_slot_ms DESC;

Alternatives

  • Create pre-joined tables for frequently accessed dimension combinations
  • Use federated queries to join data across different cloud platforms without data movement
  • Implement materialized views that maintain pre-computed join results
  • Use BigQuery's query optimization suggestions to identify inefficient join patterns
  • Consider e6data's query engine, which automatically selects optimal join strategies and execution plans

5. Create cost-optimized sandbox environments for data exploration to reduce development costs

A data science team was running experimental queries and model development directly in production BigQuery, leading to unpredictable costs and resource contention with business-critical dashboards.

Solution: Create cost-optimized sandbox environments – for example, a separate BigQuery project (or projects per team) specifically for exploratory analysis. In these sandbox projects, implement stricter cost controls: smaller slot reservations or purely on-demand budgeted usage, lower or custom query quotas, and perhaps limited accessible data (maybe only views that sample the main data). By giving data scientists a “playground” that has guardrails, they can explore freely without the risk of runaway costs impacting production. Additionally, using separate projects allows clear attribution: you can see exactly how much is spent on ad-hoc exploration vs production dashboards vs ETL, etc.

Fix: Isolate and govern ad-hoc exploration with separate project & quotas

  1. Create a separate BigQuery project for sandboxing.
  2. Apply cost controls to the sandbox project. For instance, set a custom daily bytes scanned quota for the project. Also set a per-user quota (QueryUsagePerUserPerDay) if multiple people use it, to avoid one person using the entire quota.
  3. Use on-demand (pay-per-query) model in the sandbox, not a flat-rate reservation. This way, if the sandbox is idle, it costs nothing. If it’s heavily used, it’ll only use what it needs and you get charged accordingly. You might combine this with a budget alert – e.g., if sandbox spend exceeds $X in a month, you get an alert.
  4. Tag and label everything so sandbox spend is attributable in billing exports (a sketch of a labeled sandbox dataset and a trimmed production view follows this list).
  5. Monitor and gamify efficiency.
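
A minimal sketch of steps 1 and 4, assuming a `sandbox-analytics` project and a production project `prod-analytics` (names are illustrative): a labeled sandbox dataset with default expiration, plus a trimmed view over production data so exploratory queries are cheap by default and attributable in billing exports.

-- In the sandbox project: labeled dataset with automatic table expiration
CREATE SCHEMA IF NOT EXISTS `sandbox-analytics.exploration`
OPTIONS (
    default_table_expiration_days = 30,
    labels = [('env', 'sandbox'), ('team', 'data-science')]
);

-- Expose only a recent, trimmed slice of production data to the sandbox
CREATE OR REPLACE VIEW `sandbox-analytics.exploration.recent_user_sessions` AS
SELECT session_date, user_id, device_type, session_duration_minutes, pages_viewed
FROM `prod-analytics.analytics.user_sessions`
WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);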

Alternatives

  • Use BigQuery's maximum bytes billed setting at the user level to prevent runaway costs
  • Implement query templates and guided analytics for common exploration patterns
  • Create synthetic datasets that preserve statistical properties while reducing size and cost
  • Use BigQuery Omni for cross-cloud analytics without data movement costs
  • Consider e6data's instant sandbox environments with automatic cost controls and per-vCPU billing

ETL/Streaming

Typical cost pitfalls: Inefficient batch loading patterns, excessive streaming insert costs, and poor data lifecycle management that accumulates storage costs over time.

1. Optimize batch loading with clustered tables and compression to reduce storage and query costs

A financial data pipeline was loading substantial transaction data daily using basic INSERT statements, resulting in poor compression, inefficient clustering, and high query costs due to frequent full table scans.

Inefficient batch loading creates a suboptimal data layout that impacts both storage costs and query performance throughout the table's lifetime. If you append data without partitions, queries pay to scan everything. If you don’t cluster on common query fields, BigQuery can’t skip data internally, and compression might not be as effective if data isn’t sorted. Also, loading data without the recommended methods (LOAD DATA for files in Cloud Storage, or the Storage Write API for high-throughput ingestion) can be slower and potentially costlier.

Fix: Optimized batch loading pipeline with intelligent clustering (example scenario)

-- Create optimally structured table for batch loading
CREATE OR REPLACE TABLE `finance.transactions_optimized` (
    transaction_date DATE,
    account_id STRING,
    transaction_type STRING,
    amount NUMERIC(15,2),
    merchant_category STRING,
    region STRING,
    transaction_id STRING
)
PARTITION BY transaction_date
CLUSTER BY account_id, transaction_type, merchant_category
OPTIONS(
    partition_expiration_days = 2555,  -- ~7 years retention
    require_partition_filter = true
);

-- Batch load from staging (BigQuery re-sorts clustered tables automatically on write)
INSERT INTO `finance.transactions_optimized`
SELECT 
    DATE(transaction_timestamp) as transaction_date,
    account_id,
    transaction_type,
    ROUND(amount, 2) as amount,
    merchant_category,
    region,
    transaction_id
FROM `staging.raw_transactions`
WHERE processing_date = CURRENT_DATE();

-- Use LOAD DATA for large CSV files (batch loading doesn't incur streaming insert charges)
LOAD DATA INTO `finance.transactions_optimized`
FROM FILES (
    format = 'CSV',
    uris = ['gs://your-bucket/transactions/*.csv'],
    skip_leading_rows = 1,
    field_delimiter = ',',
    null_marker = '',
    allow_quoted_newlines = false
);

-- Monitor loading efficiency and partition health
SELECT 
    table_name,
    partition_id,
    total_rows,
    total_logical_bytes / 1e9 as logical_gb
FROM `finance.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'transactions_optimized'
    AND partition_id >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
ORDER BY partition_id DESC;

-- Recluster an existing table by recreating it with partitioning and clustering
CREATE OR REPLACE TABLE `finance.transactions_reclustered`
PARTITION BY transaction_date
CLUSTER BY account_id, transaction_type, merchant_category AS
SELECT * FROM `finance.transactions_original`;

Alternatives

  • Use BigQuery Storage Write API for high-throughput streaming with automatic optimization
  • Implement incremental loading with MERGE statements for upsert patterns
  • Use external tables with automatic schema detection for schema evolution
  • Create staging tables with different compression settings for optimal cost vs performance tradeoffs

2. Implement efficient streaming inserts and real-time processing to optimize streaming costs

A real-time analytics platform was streaming substantial individual events per minute to BigQuery, generating significant daily streaming insert costs due to inefficient batching and unnecessary insertion frequency. BigQuery charges for streaming by rows, so sending one row at a time is the priciest way to do it. Plus, not partitioning the target table meant queries scanning it grew more expensive over time. And without clustering or architecture for real-time, query performance on live data was suboptimal.

Solution:

  • Batch the streaming inserts and structure the target table for streaming. Introduce a small delay or buffer (even a few seconds) to accumulate multiple events, then send them in one insert call; this dramatically reduces the number of API calls.
  • Additionally, create the target table partitioned (say, by day or hour of event_timestamp) so analytical queries can focus on recent data, and add clustering (e.g., by user_id or event_type) to help with common query filters (a table sketch follows this list).
  • In some cases, a tool like Pub/Sub with a BigQuery subscription can batch events automatically.
  • Another improvement: move some real-time aggregation into materialized views so that dashboards querying real-time data hit a continuously updated summary instead of raw streaming data.
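
A sketch of the table layout this implies (names are illustrative; the Pub/Sub or Storage Write API batching itself is configured outside SQL):

-- Streaming target: partitioned by event time, clustered on common filters
CREATE TABLE IF NOT EXISTS `events.user_interactions_stream` (
    event_timestamp TIMESTAMP,
    user_id STRING,
    event_type STRING,
    payload JSON
)
PARTITION BY TIMESTAMP_TRUNC(event_timestamp, HOUR)
CLUSTER BY event_type, user_id;

-- Materialized view so real-time dashboards hit a continuously maintained summary, not raw events
CREATE MATERIALIZED VIEW `events.user_interactions_by_minute` AS
SELECT 
    TIMESTAMP_TRUNC(event_timestamp, MINUTE) as event_minute,
    event_type,
    COUNT(*) as event_count,
    APPROX_COUNT_DISTINCT(user_id) as approx_unique_users
FROM `events.user_interactions_stream`
GROUP BY event_minute, event_type;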

Alternatives

  • Use Cloud Pub/Sub with BigQuery for automatic batching and delivery
  • Implement Apache Beam pipelines with windowing for complex streaming transformations
  • Use BigQuery Storage Write API for high-throughput streaming with lower latency
  • Create materialized views that automatically aggregate streaming data
  • Consider e6data's real-time ingest engine for sub-minute data freshness

3. Apply data lifecycle management and automated archival to reduce storage costs

An e-commerce analytics platform was storing substantial granular event data with most queries only accessing recent data, while retaining everything at the same storage tier for regulatory compliance.

BigQuery does automatically drop storage costs by ~50% for data that’s not edited for 90 days (long-term storage pricing), but even that may not justify keeping huge volumes of stale data online. There’s also a performance overhead to extremely large tables (metadata load, partition count limits, etc.). Additionally, if compliance requires keeping data, it might be better stored in aggregated form or an external archive rather than raw detail in BigQuery.

Solution: Implement a tiered data architecture within BigQuery and automate the movement/expiration of data between tiers. Example scenario:

-- Hot tier table: holds recent 30 days of detailed events for fast access
CREATE OR REPLACE TABLE `analytics.events_hot` (
    event_date DATE,
    user_id STRING,
    event_type STRING,
    event_data JSON,
    revenue NUMERIC
)
PARTITION BY event_date
CLUSTER BY user_id, event_type
OPTIONS(
    partition_expiration_days = 30,  -- keep 30 days in this table
    description = 'Hot tier: Last 30 days of detailed events'
);

-- Warm tier table: holds days 31-90 of events for less frequent analysis
CREATE OR REPLACE TABLE `analytics.events_warm` (
    event_date DATE,
    user_id STRING,
    event_type STRING,
    event_data JSON,
    revenue NUMERIC
)
PARTITION BY event_date
CLUSTER BY event_type, event_date
OPTIONS(
    partition_expiration_days = 90,  -- expiration is relative to the partition date, so 90 days keeps data until day 90 from the original event
    description = 'Warm tier: Days 31-90 of events for occasional analysis'
);

-- Cold tier summary: after 90 days, keep only aggregated info (no user-level detail)
CREATE OR REPLACE TABLE `analytics.events_cold_summary` (
    event_date DATE,
    event_type STRING,
    user_segment STRING,
    event_count INT64,
    total_revenue NUMERIC,
    unique_users INT64
)
PARTITION BY event_date
OPTIONS(
    partition_expiration_days = 2555,  -- ~7 years retention for compliance
    description = 'Cold tier: Aggregated historical data (post-90 days)'
);

-- Stored procedure, run daily, that moves data from Hot -> Warm -> Cold and cleans up
CREATE OR REPLACE PROCEDURE `analytics.manage_data_lifecycle`()
BEGIN
  -- Copy the partition that is about to expire from the hot tier (data 30 days old)
  INSERT INTO `analytics.events_warm`
  SELECT * FROM `analytics.events_hot`
  WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

  -- Aggregate and archive data that is about to expire from the warm tier (data 90 days old)
  INSERT INTO `analytics.events_cold_summary`
  SELECT 
      event_date,
      event_type,
      CASE 
         WHEN revenue > 100 THEN 'high_value'
         WHEN revenue > 10 THEN 'medium_value'
         ELSE 'low_value'
      END AS user_segment,
      COUNT(*) AS event_count,
      SUM(revenue) AS total_revenue,
      COUNT(DISTINCT user_id) AS unique_users
  FROM `analytics.events_warm`
  WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  GROUP BY 1, 2, 3;

  -- (Above, rows are bucketed into revenue segments as an example of summarization)

  -- The events_hot and events_warm tables drop old partitions automatically via
  -- partition_expiration_days; the cold summary retains aggregated data for ~7 years.
END;

-- (Schedule this procedure daily, e.g., with scheduled queries or Cloud Scheduler)

-- Monitor storage across tiers to verify distribution
SELECT 
    table_name,
    SUM(total_logical_bytes) / 1e9 AS size_gb,
    COUNT(DISTINCT partition_id) AS partition_count,
    MIN(PARSE_DATE('%Y%m%d', partition_id)) AS oldest_data,
    MAX(PARSE_DATE('%Y%m%d', partition_id)) AS newest_data
FROM `analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE 'events_%'
    AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
GROUP BY table_name
ORDER BY size_gb DESC;

Alternatives

  • Use BigQuery's automatic table expiration for simple time-based retention
  • Implement external tables with Cloud Storage lifecycle policies for cost-effective archival (an EXPORT DATA sketch follows this list)
  • Create federated tables that span multiple storage tiers seamlessly
  • Use BigQuery's time travel feature for point-in-time recovery instead of full retention
  • Consider e6data's automated data tiering for cross-cloud storage optimization
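
For the Cloud Storage archival alternative, a sketch using EXPORT DATA (bucket path, date range, and format are illustrative); pair it with a bucket lifecycle rule that moves objects to Nearline or Coldline:

-- Export aged partitions to Parquet in Cloud Storage before they expire in BigQuery
EXPORT DATA OPTIONS (
    uri = 'gs://your-archive-bucket/events_warm/2025-05/*.parquet',
    format = 'PARQUET',
    compression = 'SNAPPY',
    overwrite = true
) AS
SELECT *
FROM `analytics.events_warm`
WHERE event_date BETWEEN '2025-05-01' AND '2025-05-31';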

4. Optimize cross-region replication and data transfer to reduce data egress costs

A global analytics platform was replicating substantial data daily across multiple regions for disaster recovery and local access, generating significant monthly cross-region transfer costs that could be optimized.

Unneeded cross-region data transfer is expensive (BigQuery egress to other continents can be $0.12/GB or more). If you're duplicating petabytes without carefully evaluating need, costs skyrocket.

Fix: Use a smarter data localization strategy. Identify what data truly needs to be in each region and replicate only that; keep full detail in one primary region and push aggregated or filtered subsets to other regions. Use federated access (for example, BigQuery Omni or external connections) for occasional cross-region reads instead of maintaining full copies. Where compliance applies (for example, EU user data must stay in the EU), separate the data by region at ingestion, storing EU user data in an EU region from the start. Finally, use scheduled queries or the Data Transfer Service to copy only incremental changes or the slices that are actually needed, instead of full daily snapshots. In short: minimize egress by localizing storage of new data and by avoiding unnecessary copies of large datasets.
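
As one concrete piece of this, a sketch (reusing the `analytics.events_hot` table from the lifecycle example above): maintain a much smaller rollup in the primary region and replicate only that, with the actual copy handled by a scheduled cross-region dataset copy or the Data Transfer Service rather than SQL.

-- Small regional rollup maintained in the primary region; replicate this instead of raw events
CREATE OR REPLACE TABLE `analytics.events_daily_rollup_for_replication`
PARTITION BY event_date AS
SELECT 
    event_date,
    event_type,
    COUNT(*) as event_count,
    SUM(revenue) as total_revenue,
    APPROX_COUNT_DISTINCT(user_id) as approx_unique_users
FROM `analytics.events_hot`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY event_date, event_type;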

Alternatives

  • Use BigQuery Omni for querying data in place across multiple clouds without transfer
  • Implement data mesh architecture with domain-specific datasets in appropriate regions
  • Use Cloud Storage as intermediate staging for cost-effective cross-region replication
  • Create snapshot-based replication for disaster recovery with lower frequency updates
  • Consider e6data's hybrid data lakehouse for multi-cloud federation and cross-region analytics without data movement

5. Implement intelligent job scheduling and resource optimization to reduce compute costs during peak hours

A data engineering team was running batch ETL jobs during business hours, competing with interactive dashboards for slot capacity and paying premium rates for peak-time compute resources.

Fix: Optimized scheduling with resource-aware job management

-- Create resource usage analysis for optimal scheduling
CREATE OR REPLACE VIEW `scheduling.resource_utilization` AS
SELECT 
    EXTRACT(HOUR FROM creation_time) as hour_of_day,
    EXTRACT(DAYOFWEEK FROM creation_time) as day_of_week,
    COUNT(*) as job_count,
    SUM(total_slot_ms) / 1000 / 3600 as total_slot_hours,
    AVG(total_slot_ms) / 1000 as avg_slot_seconds,
    SUM(total_bytes_billed) / 1e12 as total_tb_billed
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    AND statement_type IN ('INSERT', 'UPDATE', 'DELETE', 'CREATE_TABLE_AS_SELECT')
GROUP BY 1, 2
ORDER BY 1, 2;

-- Implement priority-based job scheduling
CREATE OR REPLACE TABLE `scheduling.job_queue` (
    job_id STRING,
    job_type STRING,
    priority INT64,  -- 1=highest, 5=lowest
    estimated_slot_hours FLOAT64,
    scheduled_time TIMESTAMP,
    dependencies ARRAY<STRING>,
    status STRING DEFAULT 'QUEUED'
);

-- Create adaptive scheduling procedure
CREATE OR REPLACE PROCEDURE `scheduling.schedule_etl_jobs`()
BEGIN
    DECLARE current_hour INT64;
    DECLARE is_business_hours BOOL;
    DECLARE available_slots INT64;
    
    SET current_hour = EXTRACT(HOUR FROM CURRENT_TIMESTAMP());
    SET is_business_hours = current_hour BETWEEN 8 AND 18;
    
    -- Adjust available slots based on time of day
    SET available_slots = CASE 
        WHEN is_business_hours THEN 200  -- Reserve capacity for dashboards
        ELSE 1000  -- Full capacity during off-hours
    END;
    
    -- Schedule high-priority jobs first
    -- (BigQuery DML doesn't support ORDER BY/LIMIT, so the batch is bounded by the WHERE conditions)
    UPDATE `scheduling.job_queue`
    SET 
        scheduled_time = CURRENT_TIMESTAMP(),
        status = 'SCHEDULED'
    WHERE status = 'QUEUED'
        AND priority <= 2  -- High priority jobs
        AND estimated_slot_hours <= available_slots;
    
    -- Schedule lower-priority batch jobs now if off-peak, otherwise defer them to the evening
    UPDATE `scheduling.job_queue`
    SET 
        scheduled_time = CASE 
            WHEN is_business_hours THEN 
                TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL (20 - current_hour) HOUR)
            ELSE CURRENT_TIMESTAMP()
        END,
        status = 'SCHEDULED'
    WHERE status = 'QUEUED'
        AND priority >= 3;  -- Lower priority batch jobs
END;

-- Create cost-optimized ETL patterns
-- Use incremental processing to reduce resource consumption
CREATE OR REPLACE PROCEDURE `etl.process_daily_increments`(processing_date DATE)
BEGIN
    -- Process only new/changed data
    CREATE OR REPLACE TABLE `analytics.customer_metrics_temp` AS
    SELECT 
        customer_id,
        processing_date AS processing_date,
        COUNT(*) as daily_transactions,
        SUM(amount) as daily_spend,
        MAX(transaction_timestamp) as last_transaction
    FROM `raw.transactions`
    WHERE DATE(transaction_timestamp) = processing_date
    GROUP BY customer_id;
    
    -- Merge with existing data
    MERGE `analytics.customer_metrics` T
    USING `analytics.customer_metrics_temp` S
    ON T.customer_id = S.customer_id AND T.processing_date = S.processing_date
    WHEN MATCHED THEN
        UPDATE SET 
            daily_transactions = S.daily_transactions,
            daily_spend = S.daily_spend,
            last_transaction = S.last_transaction
    WHEN NOT MATCHED THEN
        INSERT (customer_id, processing_date, daily_transactions, daily_spend, last_transaction)
        VALUES (S.customer_id, S.processing_date, S.daily_transactions, S.daily_spend, S.last_transaction);
    
    DROP TABLE `analytics.customer_metrics_temp`;
END;

-- Monitor scheduling efficiency and cost optimization
SELECT 
    DATE(scheduled_time) as schedule_date,
    EXTRACT(HOUR FROM scheduled_time) as schedule_hour,
    job_type,
    COUNT(*) as jobs_scheduled,
    SUM(estimated_slot_hours) as total_slot_hours,
    AVG(estimated_slot_hours) as avg_slot_hours_per_job
FROM `scheduling.job_queue`
WHERE scheduled_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND status = 'COMPLETED'
GROUP BY 1, 2, 3
ORDER BY 1, 2;

Alternatives

  • Use BigQuery's job timeout and retry settings to optimize resource usage and reliability
  • Implement Apache Airflow or Cloud Composer for sophisticated ETL orchestration and dependency management
  • Use Cloud Functions or Cloud Run for event-driven ETL triggered by data arrival
  • Create separate slot reservations for ETL workloads with autoscaling based on queue depth
  • Consider e6data's automatic job scheduling and resource optimization for variable ETL workloads

When BigQuery Optimization isn't Enough: An e6data Alternative

While BigQuery optimization dramatically reduces costs, some analytical workloads encounter efficiency limits due to slot-based resource allocation and single-engine constraints. e6data's compute engine provides complementary capabilities that address remaining cost inefficiencies:

Granular resource control: e6data's per-vCPU billing eliminates the slot allocation overhead that creates cost inefficiencies for variable analytical workloads, particularly during off-peak periods and exploratory data science work.

Cross-format query optimization: Unlike BigQuery's columnar focus, e6data's vectorized engine delivers comparable performance across Parquet, Delta, Iceberg, and CSV formats without requiring data conversion or format-specific optimization.

Automatic cost optimization: e6data's adaptive query engine automatically implements sampling, caching, and execution plan optimization without manual configuration, reducing the operational overhead of maintaining cost-efficient BigQuery deployments.

Hybrid deployment efficiency: Teams use e6data for cost-sensitive analytical workloads while maintaining BigQuery for real-time dashboards and streaming analytics, optimizing each workload with the most suitable engine rather than forcing all analytics through a single platform.

Estimated compute costs on e6data vs BigQuery: 50%+ savings

Implement BigQuery optimization tactics to capture immediate substantial cost savings, then evaluate e6data for variable analytical workloads where per-vCPU billing and automated optimization provide additional cost reductions. Leading data teams use this hybrid approach to maximize both performance and cost efficiency across their entire analytical stack.

➡️ Interested in exploring this? Start a free trial of e6data and see how it compares on your own workloads.
