AgentSkillsCN

fabric-lakehouse-perf-remediate

诊断并解决 Microsoft Fabric Lakehouse 的性能问题,包括 Spark 查询缓慢、小文件问题、Delta 表碎片化、V-Order 配置、表维护(OPTIMIZE、VACUUM、Z-Order)、SQL 分析端点调优、Direct Lake 性能、资源配置选择、自动调优配置、容量限流,以及流式摄取优化。适用于在被要求排查 Fabric Lakehouse 运行缓慢、优化 Delta 表、修复小文件问题、配置 Spark 设置、执行表维护,或在笔记本或管道中提升查询性能时使用。

SKILL.md
--- frontmatter
name: fabric-lakehouse-perf-remediate
description: Diagnose and resolve Microsoft Fabric Lakehouse performance issues including slow Spark queries, small file problems, Delta table fragmentation, V-Order configuration, table maintenance (OPTIMIZE, VACUUM, Z-Order), SQL analytics endpoint tuning, Direct Lake performance, resource profile selection, autotune configuration, capacity throttling, and streaming ingestion optimization. Use when asked to troubleshoot Fabric Lakehouse slowness, optimize Delta tables, fix small file problems, configure Spark settings, run table maintenance, or improve query performance in notebooks or pipelines.
license: Complete terms in LICENSE.txt

Fabric Lakehouse Performance remediate

Systematic toolkit for diagnosing and resolving performance issues in Microsoft Fabric Lakehouse environments. Covers Delta table health, Spark compute tuning, query optimization, and automated maintenance workflows.

When to Use This Skill

  • Lakehouse queries are running slowly or timing out
  • Delta tables have accumulated many small files (small file problem)
  • Spark notebooks or jobs are underperforming
  • Direct Lake semantic models have cold-start or transcoding delays
  • SQL analytics endpoint queries are slow
  • Table maintenance (OPTIMIZE, VACUUM) needs to be scheduled or automated
  • V-Order, Z-Order, or resource profile configuration is needed
  • Capacity throttling or concurrency issues are suspected
  • Streaming ingestion is creating fragmented Delta tables

Prerequisites

  • Microsoft Fabric workspace with Lakehouse items
  • Contributor or higher workspace role
  • Fabric capacity (F2 or above) or Trial capacity
  • For REST API automation: Microsoft Entra token for Fabric service
  • For Spark commands: Access to Fabric notebooks or Spark Job Definitions

Quick Diagnosis Checklist

When a user reports Lakehouse performance issues, work through these areas in order:

  1. Identify the symptom — Slow reads, slow writes, capacity throttling, or query timeouts
  2. Check Delta table health — File count, file sizes, V-Order status, partition layout
  3. Review Spark configuration — Resource profile, autotune, shuffle partitions
  4. Inspect capacity utilization — Concurrency limits, burst capacity, throttling
  5. Evaluate maintenance history — When was OPTIMIZE/VACUUM last run?
  6. Assess data patterns — Streaming vs batch, read-heavy vs write-heavy

Symptom-to-Action Map

SymptomRoot CauseAction
Slow reads across all enginesSmall files, no V-OrderRun OPTIMIZE VORDER, switch to readHeavy profile
Slow Spark queries onlyWrong shuffle partitionsEnable autotune or tune manually
Slow Power BI Direct LakeToo many Parquet files/row groupsRun OPTIMIZE, check guardrail limits
Slow SQL analytics endpointFiles under 400 MB, too many small filesOPTIMIZE with maxRecordsPerFile=2M
Write performance degradedV-Order enabled on write-heavy workloadSwitch to writeHeavy resource profile
Capacity throttledToo many concurrent Spark jobsReview concurrency limits, enable optimistic admission
Storage growing unexpectedlyVACUUM not runningSchedule VACUUM with 7-day retention
Streaming creates tiny filesNo batching or trigger intervalAdd processingTime trigger, run periodic OPTIMIZE

Core Optimization Operations

1. Table Maintenance Commands

Run in a Fabric notebook (Spark SQL):

sql
-- Basic OPTIMIZE (bin-compaction)
OPTIMIZE lakehouse_name.schema_name.table_name;

-- OPTIMIZE with V-Order
OPTIMIZE lakehouse_name.schema_name.table_name VORDER;

-- OPTIMIZE with Z-Order on frequently filtered columns
OPTIMIZE lakehouse_name.schema_name.table_name ZORDER BY (column_name);

-- OPTIMIZE with both Z-Order and V-Order
OPTIMIZE lakehouse_name.schema_name.table_name ZORDER BY (column_name) VORDER;

-- OPTIMIZE specific partitions only
OPTIMIZE lakehouse_name.schema_name.table_name WHERE date_key >= '2025-01-01' VORDER;

-- VACUUM with default 7-day retention
VACUUM lakehouse_name.schema_name.table_name;

-- VACUUM with custom retention (requires safety check disabled)
VACUUM lakehouse_name.schema_name.table_name RETAIN 168 HOURS;

2. Resource Profile Configuration

Set at environment level or runtime. See resource-profiles.md for details.

python
# Check current profile
spark.conf.get('spark.fabric.resourceProfile')

# Switch to read-heavy for Spark queries
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")

# Switch to read-heavy for Power BI Direct Lake
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForPBI")

# Switch to write-heavy for ETL/ingestion
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")

3. V-Order Control

python
# Check current V-Order setting
spark.conf.get('spark.sql.parquet.vorder.default')

# Enable V-Order for read-heavy workloads
spark.conf.set('spark.sql.parquet.vorder.default', 'true')

# Disable V-Order for write-heavy ingestion
spark.conf.set('spark.sql.parquet.vorder.default', 'false')

4. Autotune Configuration

sql
-- Enable autotune for automatic Spark SQL tuning
SET spark.ms.autotune.enabled=TRUE;

-- Disable autotune
SET spark.ms.autotune.enabled=FALSE;

Autotune adjusts three key settings per query: spark.sql.shuffle.partitions, spark.sql.autoBroadcastJoinThreshold, and spark.sql.files.maxPartitionBytes. Requires 20-25 iterations to learn optimal settings. Only works on Runtime 1.1 and 1.2. Not compatible with high concurrency mode or private endpoints.

5. SQL Analytics Endpoint Optimization

For best SQL analytics endpoint performance, target ~2 million rows and ~400 MB per Parquet file:

python
# Before data changes
spark.conf.set("spark.sql.files.maxRecordsPerFile", 2000000)

# Perform data operations (inserts, updates, deletes)
# ...

# After data changes, set max file size and optimize
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 4294967296)

Then run OPTIMIZE on the affected tables.

REST API Automation

Automate table maintenance via the Fabric REST API. See rest-api-maintenance.md for full details.

Endpoint:

code
POST https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{lakehouseId}/jobs/instances?jobType=TableMaintenance

Request body:

json
{
  "executionData": {
    "tableName": "my_table",
    "schemaName": "dbo",
    "optimizeSettings": {
      "vOrder": "true",
      "zOrderBy": ["frequently_filtered_column"]
    },
    "vacuumSettings": {
      "retentionPeriod": "7.01:00:00"
    }
  }
}

Monitor status:

code
GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{lakehouseId}/jobs/instances/{operationId}

Streaming Ingestion Best Practices

When streaming data into a Lakehouse, prevent small file proliferation:

  1. Set processing time triggers to batch events into larger writes
  2. Use Optimized Write (spark.microsoft.delta.optimizeWrite.enabled = true)
  3. Partition wisely — low-cardinality columns only (< 100-200 distinct values)
  4. Schedule periodic OPTIMIZE — daily or more often for high-frequency streams
  5. Combine repartition() with partitionBy() for optimal in-memory and on-disk layout
python
# Example: Streaming with batching and partitioning
rawData = df \
  .writeStream \
  .format("delta") \
  .option("checkpointLocation", "Files/checkpoint") \
  .outputMode("append") \
  .partitionBy("date_key") \
  .trigger(processingTime="1 minute") \
  .toTable("my_streaming_table")

Available Scripts

Available Templates

Detailed References

  • Resource Profiles — Complete guide to Fabric Spark resource profile selection and configuration
  • REST API Maintenance — Automating table maintenance with Fabric REST API and PowerShell
  • Delta Table Health — Assessing and monitoring Delta table file layout, V-Order status, and partition health
  • Concurrency and Capacity — Understanding Spark job admission, throttling, burst capacity, and autoscale billing

remediate

IssueCauseFix
OPTIMIZE command not recognizedRunning in SQL analytics endpointUse Fabric notebook with Spark runtime
VACUUM fails with retention errorRetention < 7 days without safety overrideSet spark.databricks.delta.retentionDurationCheck.enabled=false
Autotune not activatingQuery too short (< 15 seconds) or wrong runtimeUse Runtime 1.1/1.2, ensure queries exceed 15s
Table maintenance API returns 409Another maintenance job running on same tableWait for completion, jobs on different tables run in parallel
V-Order not applied after OPTIMIZESession/table property mismatchUse OPTIMIZE table VORDER explicitly
Capacity throttled during maintenanceMaintenance consuming too many coresSchedule during off-peak, reduce concurrent jobs

Key Decision Framework

code
Is the workload primarily reads or writes?
├── Read-heavy (dashboards, queries, analytics)
│   ├── Power BI Direct Lake → readHeavyForPBI profile + OPTIMIZE VORDER
│   └── Spark analytics → readHeavyForSpark profile + enable autotune
├── Write-heavy (ETL, ingestion, streaming)
│   └── writeHeavy profile + periodic OPTIMIZE + VACUUM on schedule
└── Mixed workload
    ├── Separate environments for read vs write paths
    └── Use runtime spark.conf.set() to switch profiles per notebook