Transition from Data Analyst to Data Engineer: Upgrading Skills from Dataflow Gen2 to Notebooks

Description

This session shows an analyst how to use notebooks to perform the same actions as Dataflow Gen2. Covering four areas (source, transformation, cleansing, and destination), examples include append/merge queries as DataFrame joins, text replacement, splitting and concatenating columns, pivot/unpivot, data type changes, and removing duplicates. We finish with engine optimizations.

Key Takeaways

My Notes

Action Items

Slides

Transition from Data
Analyst to Data Engineer
Thomas Leblanc, Fabric Architect
Nikola Ilic, Data Mozart, Microsoft Data Platform MVP
Your Transition Assistants
• About Me…
o Data Platform MVP (8 years)
o Database Normalization “Nut”
▪ Retired Developer
▪ Crazy about Dimensions
▪ Data Vault 2.0 Certified
o Bluesky – @PowerBIDude
o Volunteer – Local User Group(s)
▪ Data, Dev and Analytics UGs
Thomas LeBlanc
Nikola Ilic
Consultant & Trainer
➢ I'm making music from the data!
➢ Power BI and Fabric addict, blogger, speaker...
➢ Father of 2, Barca & Leo Messi fan...
data-mozart.com
Agenda
• The bridge
• Why notebooks? Mental model shift
• Connections
• Azure SQL DB and Mirroring
• Spark engine
• Common transformations
• Data cleansing operations
• Incremental loading
• Destinations
• Optimization strategies
• Takeaways and Q&A
The Bridge: When to Reach for Notebooks
Dataflow Gen2 — What it does & where it stops
✓ Low-code ingestion & Power Query transforms
✓ Incremental refresh (built-in, UI-configured)
✓ Reuse existing M / Power Query skills
✗ IR = bucket replace, not row-level upsert
✗ Source must fully support query folding
✗ OPTIMIZE / VACUUM blocked on IR tables
✗ Max 50 buckets/query, 150/dataflow
✗ No custom error handling or retry logic
✗ No visibility into the execution plan
Fabric Notebooks — What you unlock
✓ True row-level MERGE / upsert via Delta Lake
✓ No folding requirement - any data source
✓ Full OPTIMIZE + V-Order after every load
✓ Unlimited scale: partitions, cache, broadcast
✓ Custom retry logic, logging, error branches
✓ Full Spark plan visibility (.explain())
✓ Version-controlled, testable helper functions
✓ Orchestrate with pipelines or job scheduler
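The row-level MERGE called out above is the biggest functional jump over bucket replace. A minimal sketch of the Delta Lake upsert pattern, assuming a `sales_orders` table keyed on `order_id` (table and key names are illustrative); the helper only builds the statement a notebook would pass to `spark.sql`:

```python
# Hypothetical helper: builds a Delta Lake MERGE (upsert) statement.
# In a Fabric notebook you would run it with spark.sql(merge_sql(...)).
def merge_sql(target: str, source: str, key: str) -> str:
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )

print(merge_sql("sales_orders", "staged_updates", "order_id"))
```

Existing rows with a matching key are updated; new keys are inserted, which is exactly what Dataflow Gen2 incremental refresh cannot express.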
Connections Available
• Mirroring – Azure SQL DB
o Mirrors each table into a Delta table in the Lakehouse
o Dataflow - 'Enable Staging'
• Copy Job
o GUI interface
o More flexible
o Dataflow - 'Enable Staging'
• Other options
o Pipeline with Copy activity
o Real-time streaming
o Notebook code to connect to data source
• Why? More flexible than Dataflows
o Monitoring
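For the "notebook code to connect to data source" option, a common pattern is a Spark JDBC read against Azure SQL DB. A sketch under assumed names (server, database, credentials are placeholders; in practice pull secrets from Key Vault or use a workspace identity rather than literals):

```python
# Illustrative: build the JDBC URL for an Azure SQL DB connection.
def jdbc_url(server: str, database: str) -> str:
    return (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;loginTimeout=30"
    )

# In a Fabric notebook the read itself would look like (not run here):
# df = (spark.read.format("jdbc")
#       .option("url", jdbc_url("myserver", "sales"))
#       .option("dbtable", "dbo.orders")
#       .option("user", user).option("password", pwd)
#       .load())
print(jdbc_url("myserver", "sales"))
```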
Demo - Connecting
Spark Engine
• Spark Pool (default)
o Usually seconds to start up
o A default pool per workspace
• Environment changes
o Minutes to start
o Workspace Managed Identity (even if not used)
o New environment (libraries, change size)
• Fabric Capacity Metrics App
o Monitor usage
o Fewer CUs than Dataflows
▪ Capacity Units
Demo - Stage Lakehouse
Demo - Start engine
Medallion Architecture
• Bronze
o Like Staging
• Silver
o Transformations
o ODS – Operational Data Store
o Lakehouse
• Gold
o Warehouse
o Dimensions and Facts
▪ Bridge, plus other tables
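The Silver transformation step typically includes de-duplication on a business key. In PySpark that is `dropDuplicates(["order_id"])`, which keeps one row per key; the equivalent logic in plain Python, with illustrative data:

```python
# Plain-Python equivalent of PySpark's dropDuplicates(["order_id"]):
# keep the first row seen for each key value.
rows = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 20},
    {"order_id": 1, "amount": 10},  # duplicate key
]
seen, deduped = set(), []
for r in rows:
    if r["order_id"] not in seen:
        seen.add(r["order_id"])
        deduped.append(r)
print(len(deduped))  # → 2
```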
Common Data Transformations
➢Split columns
➢Extract before/after delimiter
➢Replace values
➢Merge/Append queries
➢Pivot and unpivot
Splitting Columns – M vs. PySpark
M (Power Query)
// Power Query - Split Column
= Table.SplitColumn(
    Source,
    "product_info",
    Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv),
    {"category", "sub_category", "product_name"}
)
// One column at a time
// No reuse across queries
// GUI step - not version-controlled
PySpark (Fabric Notebook)

PySpark - split() + getItem()

from pyspark.sql.functions import split, col

# split() takes a regex, so the pipe delimiter must be escaped
parts = split(col("product_info"), r"\|")
df = (df
    .withColumn("category", parts.getItem(0))
    .withColumn("sub_category", parts.getItem(1))
    .withColumn("product_name", parts.getItem(2))
    .drop("product_info")
)

Reusable helper function:

def split_col(df, src, delim, names):
    p = split(col(src), delim)
    for k, name in enumerate(names):
        df = df.withColumn(name, p.getItem(k))
    return df.drop(src)
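The helper needs a Spark session to run, but the per-row string logic it relies on can be sanity-checked with plain Python (illustrative values):

```python
# What splitting "product_info" does per row, in plain Python terms.
row = "Electronics|Audio|Headphones"
category, sub_category, product_name = row.split("|")
print(category, sub_category, product_name)  # → Electronics Audio Headphones
```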
Replacing Values – M vs. PySpark
M (Power Query)
// Power Query - Replace Value
= Table.ReplaceValue(
    Source,
    "Done", "Completed",
    Replacer.ReplaceText,
    {"status"}
)
// One value per step
// Case-sensitive by default
// Multiple steps for multiple values
// No regex support in basic Replace
PySpark (Fabric Notebook)

Approach A: regexp_replace (case-insensitive)

from pyspark.sql.functions import regexp_replace

# Anchored so partial matches (e.g. inside "Completed") are left alone
df = df.withColumn("status",
    regexp_replace("status", r"(?i)^(done|complete)$", "Completed"))

Approach B: when/otherwise (analyst-friendly)

from pyspark.sql.functions import when, col

df = df.withColumn("status",
    when(col("status").isin("Done", "complete", "COMPLETED"), "Completed")
    .otherwise(col("status")))

Approach C: mapping dictionary

mapping = {"Done": "Completed", "complete": "Completed"}
df = df.replace(mapping, subset=["status"])
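Spark's regexp_replace uses Java regex, but the (?i) flag and anchors behave the same as in Python's re module, so an anchored variant of the Approach A pattern can be checked locally before running it in a notebook (the anchors keep "Completed" itself from re-matching on repeated runs):

```python
import re

pattern = r"(?i)^(done|complete)$"
print(re.sub(pattern, "Completed", "DONE"))       # → Completed
print(re.sub(pattern, "Completed", "done"))       # → Completed
print(re.sub(pattern, "Completed", "Completed"))  # unchanged: no full match
```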
Engine Optimizations
➢V-Order
➢OPTIMIZE
➢Caching
➢Broadcast joins
Understanding V-Order
Write-time optimisation for Parquet files, built into Fabric's Spark runtime (disabled by default). It applies VertiPaq-compatible sorting, encoding, and compression so that Power BI, the SQL analytics endpoint, and Spark all read your Delta tables faster.
Power BI Direct Lake
40–60%
faster cold-cache
queries
SQL Analytics Endpoint
~10%
read speed
improvement
Spark
No direct read gain
but OPTIMIZE still
consolidates small
files
V-Order in Fabric Notebooks

Option 1: Enable V-Order for the whole session

spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

Option 2: Set permanently on a table

spark.sql("""
ALTER TABLE sales_orders
SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
""")

Option 3: OPTIMIZE + V-Order after every significant load

%%sql
OPTIMIZE sales_orders VORDER

• Merges small Parquet files into larger ones (~128 MB–1 GB target)
• Reduces file-count overhead on reads
Incremental Loading
Dataflows Gen2 Incremental Refresh
// DataFlows Gen2 - Incremental Refresh
// Configured in UI (no code):
// • DateTime column: updated_at
// • Bucket size: 1 day
// • Window: last 30 days
// • Update method: REPLACE bucket
// What happens each run:
// 1. Check max(updated_at) per bucket
// 2. If changed: DELETE entire bucket
// 3. Re-INSERT all rows from source
// Limitations:
// • Source MUST fully fold
// • Cannot run OPTIMIZE on this table
// • No row-level upsert - bucket replace only
Fabric Notebook - Full control

Notebook - Watermark + Append (new rows only)

last_ts = spark.sql(
    "SELECT COALESCE(MAX(updated_at), '1900-01-01')"
    " FROM sales_orders"
).collect()[0][0]

new_rows = (
    spark.read.csv("Files/new.csv", header=True)
    .filter(col("updated_at") > last_ts)
)

(new_rows.write.format("delta")
    .mode("append")
    .saveAsTable("sales_orders"))

Compact + V-Order - always safe here:

%%sql
OPTIMIZE sales_orders VORDER

Full audit trail, zero extra work:

spark.sql("DESCRIBE HISTORY sales_orders")
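The watermark filter is ordinary comparison logic; with ISO-8601 timestamp strings it can be checked in plain Python (illustrative rows and watermark):

```python
# Watermark logic from the notebook pattern, in plain Python.
# ISO-8601 date strings compare correctly as strings.
rows = [
    {"order_id": 1, "updated_at": "2024-01-02"},
    {"order_id": 2, "updated_at": "2024-01-05"},
    {"order_id": 3, "updated_at": "2024-01-07"},
]
last_ts = "2024-01-04"  # max(updated_at) already landed in sales_orders
new_rows = [r for r in rows if r["updated_at"] > last_ts]
print([r["order_id"] for r in new_rows])  # → [2, 3]
```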
Key Takeaways

  1. Your M skills translate directly
    split() ≈ Splitter · when/otherwise ≈ Conditional Column · Same logic, different syntax
  2. Reusable functions beat repeated steps
    One Python def replaces a dozen Dataflow steps - version-controlled, testable, shareable across any notebook
  3. Dataflows Gen2 IR is real but limited
    Bucket replace requires folding, blocks OPTIMIZE, and can't do row-level upsert — Delta MERGE gives you full control
  4. V-Order is your first optimisation, always
    Enable at session level or per table. Run OPTIMIZE … VORDER after every significant load. Power BI Direct Lake especially benefits.
  5. The transition is gradual - not all-or-nothing
    Keep Dataflows for simple low-code ingestion. Reach for notebooks for incrementals, MERGE, performance control, or code reuse
    Rate the Session
    Thank You!
    Thomas LeBlanc
    Nikola Ilic
    @PowerBIDude/BlueSky
    Thomaslleblanc/LinkedIn
    www.data-mozart.com
    TheSmilingDBA@gmail.com
    learn.data-mozart.com
    nikola@data-mozart.com
    Sound off.
    The mic is all yours.
    Influence the product roadmap.
    Join the Fabric User Panel
    Join the SQL User Panel
    Share your feedback directly with our
    Fabric product group and researchers.
    Influence our SQL roadmap and ensure
    it meets your real-life needs
    https://aka.ms/JoinFabricUserPanel
    https://aka.ms/JoinSQLUserPanel