Transition from Data Analyst to Data Engineer: Upgrading Skills from DataFlows Gen2 to Notebooks
Description
The session shows an analyst how to use notebooks to perform the same actions as Dataflow Gen2, looking at four areas: source, transformation, cleansing, and destination. Examples include converting append/merge queries to dataframe joins, text replace functionality, splitting and concatenating columns, pivot/unpivot, data type changes, and removing duplicates. We finish with engine optimizations.
Key Takeaways
- Why notebooks? Mental model shift
- Azure SQL DB and Mirroring
- Common transformations
- Data cleansing operations
- Optimization strategies
My Notes
Action Items
- [ ]
Resources & Links
Slides
Transition from Data
Analyst to Data Engineer
Thomas Leblanc, Fabric Architect
Nikola Ilic, Data Mozart, Microsoft Data Platform MVP
Your Transition Assistants
• About Me…
  o Data Platform MVP (8 years)
  o Database Normalization “Nut”
    ▪ Retired Developer
    ▪ Crazy about Dimensions
    ▪ Data Vault 2.0 Certified
  o Bluesky – @PowerBIDude
  o Volunteer – Local User Group(s)
    ▪ Data, Dev and Analytics UGs
Thomas LeBlanc
Nikola Ilic
Consultant & Trainer
➢ I'm making music from the data!
➢ Power BI and Fabric addict, blogger, speaker...
➢ Father of 2, Barca & Leo Messi fan...
data-mozart.com
Agenda
• The bridge
• Why notebooks? Mental model shift
• Connections
• Azure SQL DB and Mirroring
• Spark engine
• Common transformations
• Data cleansing operations
• Incremental loading
• Destinations
• Optimization strategies
• Takeaways and Q&A
The Bridge: When to Reach for Notebooks
Dataflow Gen2 — What it does & where it stops
✓ Low-code ingestion & Power Query transforms
✓ Incremental refresh (built-in, UI-configured)
✓ Reuse existing M / Power Query skills
✗ IR (incremental refresh) = bucket replace, not row-level upsert
✗ Source must fully support query folding
✗ OPTIMIZE / VACUUM blocked on IR tables
✗ Max 50 buckets/query, 150/dataflow
✗ No custom error handling or retry logic
✗ No visibility into the execution plan
Fabric Notebooks — What you unlock
✓ True row-level MERGE / upsert via Delta Lake
✓ No folding requirement - any data source
✓ Full OPTIMIZE + V-Order after every load
✓ Unlimited scale: partitions, cache, broadcast
✓ Custom retry logic, logging, error branches
✓ Full Spark plan visibility (.explain())
✓ Version-controlled, testable helper functions
✓ Orchestrate with pipelines or job scheduler
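One unlock worth dwelling on is the retry/error-branch point: in a notebook you own the control flow around every load. A minimal sketch in plain Python (`with_retry` and `flaky_load` are illustrative names, not a Fabric API); in practice `load_fn` would wrap your spark.read / df.write calls:

```python
import time

def with_retry(load_fn, max_attempts=3, base_delay_s=2):
    """Retry a load function with exponential backoff.

    load_fn is any callable that raises on transient failure,
    e.g. a function wrapping spark.read / df.write calls.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of retries: surface the error
            wait = base_delay_s * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Example: a flaky load that succeeds on the third try
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient source error")
    return "loaded"

result = with_retry(flaky_load, max_attempts=3, base_delay_s=0)
print(result)  # → loaded
```

In a real notebook you would log to a Lakehouse table or monitoring sink instead of print, and catch only the transient exception types your source actually throws.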
Connections Available
• Mirroring – Azure SQL DB
o Replicates the source table as a Delta table, readable from a Lakehouse
o Dataflow - 'Enable Staging'
• Copy Job
o GUI interface
o More flexible
o Dataflow - 'Enable Staging'
• Other options
o Pipeline with Copy activity
o Real-time streaming
o Notebook code to connect to data source
• Why? More flexible than Dataflows
o Monitoring
Demo - Connecting
Spark Engine
• Spark Pool – default
  o Usually seconds to start up
  o A default pool is provided
• Environment changes
o Minutes to start
o Workspace Managed Identity (even if not used)
o New environment (libraries, change size)
• Fabric Capacity Metrics App
o Monitor usage
o Fewer CUs (Capacity Units) than Dataflows
Demo - Stage Lakehouse
Demo - Start engine
Medallion Architecture
• Bronze
o Like Staging
• Silver
o Transformations
o ODS – Operational Data Store
o Lakehouse
• Gold
o Warehouse
o Dimensions and Facts
▪ Bridge, plus other tables
Common Data Transformations
➢Split columns
➢Extract before/after delimiter
➢Replace values
➢Merge/Append queries
➢Pivot and unpivot
Splitting Columns – M vs. PySpark
M (Power Query)
// Power Query - Split Column
= Table.SplitColumn(
      Source,
      "product_info",
      Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv),
      {"category", "sub_category", "product_name"}
  )
// One column at a time
// No reuse across queries
// GUI step - not version-controlled
PySpark (Fabric Notebook)
PySpark - split() + getItem()
from pyspark.sql.functions import split, col
# "|" is a regex metacharacter - it must be escaped
parts = split(col("product_info"), r"\|")
df = (df
    .withColumn("category", parts.getItem(0))
    .withColumn("sub_category", parts.getItem(1))
    .withColumn("product_name", parts.getItem(2))
    .drop("product_info")
)
Reusable helper function:
def split_col(df, src, delim, names):
p = split(col(src), delim)
for k, name in enumerate(names):
df = df.withColumn(name, p.getItem(k))
return df.drop(src)
Replacing Values – M vs. PySpark
M (Power Query)
// Power Query - Replace Value
= Table.ReplaceValue(
      Source,
      "Done", "Completed",
      Replacer.ReplaceText,
      {"status"}
  )
// One value per step
// Case-sensitive by default
// Multiple steps for multiple values
// No regex support in basic Replace
PySpark (Fabric Notebook)
Approach A: regexp_replace (case-insensitive)
from pyspark.sql.functions import regexp_replace
# Anchored so an already-"Completed" value isn't matched again
df = df.withColumn("status",
    regexp_replace("status",
        r"(?i)^(done|complete)$", "Completed")
)
Approach B: when/otherwise (analyst-friendly)
from pyspark.sql.functions import when
df = df.withColumn("status",
when(col("status").isin(
"Done","complete","COMPLETED"),
"Completed")
.otherwise(col("status"))
)
Approach C: mapping dictionary
mapping = {"Done":"Completed","complete":"Completed"}
df = df.replace(mapping, subset=["status"])
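A subtlety across the three approaches: regexp_replace (Approach A) rewrites matching substrings inside a value, while df.replace with a mapping (Approach C) only swaps whole cell values on an exact match. A plain-Python sketch of the two semantics (illustration only, not Spark code):

```python
import re

statuses = ["Done", "done deal", "complete", "Open"]

# Approach A semantics: case-insensitive regex, matches substrings
regex_result = [re.sub(r"(?i)done|complete", "Completed", s) for s in statuses]

# Approach C semantics: exact whole-value match only
mapping = {"Done": "Completed", "complete": "Completed"}
exact_result = [mapping.get(s, s) for s in statuses]

print(regex_result)  # "done deal" becomes "Completed deal"
print(exact_result)  # "done deal" untouched: no exact key match
```

Pick the approach by intent: regex for pattern cleanup, when/otherwise for readable business rules, mapping for controlled value swaps.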
Engine Optimizations
➢V-Order
➢OPTIMIZE
➢Caching
➢Broadcast joins
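Broadcast joins deserve a quick mental model: when one side of a join is small (a dimension table), Spark can ship it whole to every executor, so the large fact table never shuffles. In a notebook you would wrap the small side with pyspark.sql.functions.broadcast(); the core idea, sketched in plain Python as a local hash lookup:

```python
# Small dimension table "broadcast" as an in-memory lookup dict
dim_product = {1: "Widget", 2: "Gadget"}

# The large fact table streams through; each row probes the
# dict locally - no data movement for the big side
fact_rows = [
    {"product_id": 1, "qty": 5},
    {"product_id": 2, "qty": 3},
    {"product_id": 1, "qty": 7},
]

joined = [
    {**row, "product_name": dim_product[row["product_id"]]}
    for row in fact_rows
    if row["product_id"] in dim_product
]
print(joined[0])  # → {'product_id': 1, 'qty': 5, 'product_name': 'Widget'}
```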
Understanding V-Order
Write-time optimisation for Parquet files, built into Fabric's Spark runtime (disabled by default). It applies VertiPaq-compatible sorting,
encoding, and compression so that Power BI, SQL endpoint and Spark all read your Delta tables faster.
• Power BI Direct Lake: 40–60% faster cold-cache queries
• SQL Analytics Endpoint: ~10% read speed improvement
• Spark: no direct read gain, but OPTIMIZE still consolidates small files
V-Order in Fabric Notebooks
Option 1: Enable V-Order for the whole session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
Option 2: Set permanently on a table
spark.sql("""
ALTER TABLE sales_orders
SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
""")
Option 3: OPTIMIZE + V-Order after every significant load
%%sql
OPTIMIZE sales_orders VORDER
OPTIMIZE
• Merges small Parquet files into larger ones (~128 MB–1 GB target)
• Reduces file-count overhead on reads
%%sql
OPTIMIZE sales_orders VORDER
Incremental Loading
Dataflows Gen2 Incremental Refresh
// Dataflows Gen2 - Incremental Refresh
// Configured in UI (no code):
// • DateTime column: updated_at
// • Bucket size: 1 day
// • Window: last 30 days
// • Update method: REPLACE bucket
// What happens each run:
// 1. Check max(updated_at) per bucket
// 2. If changed: DELETE entire bucket
// 3. Re-INSERT all rows from source
✗ Source MUST fully fold
✗ Cannot run OPTIMIZE on this table
✗ No row-level upsert - bucket replace only
Fabric Notebook - Full control
Notebook - Watermark + Append (new rows only)
last_ts = spark.sql(
    "SELECT COALESCE(MAX(updated_at), '1900-01-01')"
    " FROM sales_orders"
).collect()[0][0]
new_rows = (
    spark.read.csv("Files/new.csv", header=True)
    .filter(col("updated_at") > last_ts)
)
(new_rows.write.format("delta")
    .mode("append")
    .saveAsTable("sales_orders"))
Compact + V-Order - always safe here:
%%sql
OPTIMIZE sales_orders VORDER
Full audit trail, zero extra work:
spark.sql("DESCRIBE HISTORY sales_orders")
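The watermark pattern shown above reduces to: remember the highest updated_at already loaded, then append only rows newer than it. A plain-Python sketch of that filter (timestamps are made-up examples; ISO-8601 date strings compare correctly as text):

```python
# Target table's current high-water mark (max updated_at already loaded)
last_ts = "2024-06-01"

incoming = [
    {"id": 1, "updated_at": "2024-05-30"},  # already loaded: skipped
    {"id": 2, "updated_at": "2024-06-02"},  # new: appended
    {"id": 3, "updated_at": "2024-06-05"},  # new: appended
]

# Keep only rows strictly newer than the watermark
new_rows = [r for r in incoming if r["updated_at"] > last_ts]
print([r["id"] for r in new_rows])  # → [2, 3]
```

This append-only pattern handles inserts; for updates to existing rows you would switch to a Delta MERGE keyed on id, matching on the business key and updating on match.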
Key Takeaways
- Your M skills translate directly
  split() ≈ Splitter · when/otherwise ≈ Conditional Column · Same logic, different syntax
- Reusable functions beat repeated steps
  One Python def replaces a dozen Dataflow steps - version-controlled, testable, shareable across any notebook
- Dataflows Gen2 IR is real but limited
  Bucket replace requires folding, blocks OPTIMIZE, and can't do row-level upsert; Delta MERGE gives you full control
- V-Order is your first optimisation, always
  Enable at session level or per table. Run OPTIMIZE … VORDER after every significant load. Power BI Direct Lake especially benefits.
- The transition is gradual - not all-or-nothing
  Keep Dataflows for simple low-code ingestion. Reach for notebooks for incrementals, MERGE, performance control, or code reuse
Rate the Session
Thank You!
Thomas LeBlanc
• @PowerBIDude / Bluesky
• Thomaslleblanc / LinkedIn
• TheSmilingDBA@gmail.com
Nikola Ilic
• www.data-mozart.com
• learn.data-mozart.com
• nikola@data-mozart.com
Sound off.
The mic is all yours.
Influence the product roadmap.
Join the Fabric User Panel
Join the SQL User Panel
Share your feedback directly with our
Fabric product group and researchers.
Influence our SQL roadmap and ensure
it meets your real-life needs
https://aka.ms/JoinFabricUserPanel
https://aka.ms/JoinSQLUserPanel