Config-Driven Data Engineering in Microsoft Fabric
Description
Data-engineering often results in hundreds of one-offs with duplicate code and inconsistency, making change slow, brittle, and costly. We will introduce and demo new metadata and config-driven patterns for PySpark integration in Fabric. You will learn to implement reusable patterns that simplify development, scale reliably, cut duplication, and improve governance across Fabric environments.
Key Takeaways
- A developer can always just write a notebook cell
- Key Generation & Hashing
- Dependent threads wait, always in the right order
- Surrogate Key Generation: hash-based surrogate keys from business key columns
- Keys are always non-null, always consistent
My Notes
Action Items
- [ ]
Resources & Links
Slides
ATLANTA · MARCH 16-20, 2026
Config-driven Data Engineering in Microsoft Fabric
Pierre LaFromboise
Chief Data Officer
Covenant Technology Partners
25+ years implementing Microsoft data solutions
Organizations of all sizes · Every industry
A Familiar Journey
Dataflows + Warehouse
- Power BI-familiar, low-code
- Great starting point
- Gaps in transformation capabilities
- SQL-familiar, approachable
- Rigid for bronze/silver layers

Notebooks + Lakehouse
- Full PySpark flexibility
- Natural next step from Dataflows
- Medallion-native pivot
- Delta Lake out of the box
- Steeper learning curve
- Discipline required to scale
- Ecosystem still maturing
It Starts Well
The first notebook works.
The second notebook works.
The thirtieth notebook is… complicated.
The Abstraction Treadmill
Copy-Paste → Templates → Shared Snippets → Helper Modules
Five Problems, One Root Cause
No Repeatable Structure
Copy-Paste Proliferation
Brittle Orchestration
Opaque Execution
Environment Drift
Notebooks have no contract. There is no enforceable agreement between what was intended and what was built.
The Structural Problem
PySpark is pro-code by design. Any governance layer is optional.
A developer can always just write a notebook cell.
You cannot enforce a standard you cannot intercept.
This isn’t a people problem. It’s an architecture problem.
What if the pipeline
IS the config?
Declare what you want.
Let the engine handle how.
weevr
A configuration-driven execution framework for PySpark in Microsoft Fabric.
Open source · ardent-data.github.io/ · Apache 2.0 licensed · Production ready
weevr
The Core Promise
Same config.
Same inputs.
Same outputs.
Every time.
The Configuration IS the Contract
« The config is the specification. Not a comment in a notebook.
« The config is the standard. Not a wiki page.
« The config is the audit trail. Not a convention.
« The config is the truth. The config itself.
The Object Model
Thread: the smallest unit of work. One or more sources → transforms → target.
Weave: a dependency graph over threads. Parallel execution. Automatic ordering.
Loom: the deployable unit. One or more weaves. Shared defaults. Typed parameters.
Configuration flows down. Most specific wins.
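As a sketch of how that hierarchy might look in a single config file (purely illustrative: the key names here are assumptions, not weevr's documented schema):

```yaml
# Hypothetical loom file showing the loom → weave → thread hierarchy.
# Structure and key names are illustrative only.
loom: daily
defaults:                             # shared defaults, inherited downward
  write_mode: merge
parameters:
  env: {type: string, default: dev}   # typed parameter
weaves:
  - weave: customer_pipeline
    threads:
      - stg_customers                 # ordering inferred from the DAG
      - dim_customer:
          write_mode: overwrite       # most specific setting wins
```

The inheritance rule is the point: a setting declared at the loom level applies everywhere unless a weave or thread overrides it.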
Architectural Foundation
Spark-native
No new execution engine. weevr runs on the Spark session you already have.

No Code Generation
Configuration is interpreted at runtime. Nothing is generated, compiled, or stored between runs.

Fabric & Delta Aligned
Built for OneLake, lakehouses, and Delta tables. No abstraction leaks. No workarounds. No surprises.

Deterministic Execution
Same config plus same inputs equals same outputs. The engine is stateless. The behavior is guaranteed.
What weevr Is Not
X Not a replacement for Spark
Spark is the execution engine. weevr is the configuration layer that sits on top of it.

X Not opinionated about data modeling
Kimball, Data Vault, wide tables — your modeling patterns are supported, not mandated.

X Not a scheduler
Orchestration remains external — Fabric Pipelines, Airflow, whatever your team already uses.

X Not trying to solve every Fabric problem
Deliberately scoped: PySpark, Fabric, Delta Lake. That's the lane. It's a wide lane, but it's one lane.

X Not a code generator
Configuration is interpreted at runtime. No PySpark is generated, stored, or compiled from YAML.
Features at a Glance
- Declarative YAML Pipelines
- DAG Orchestration
- Key Generation & Hashing
- Structured Telemetry & Observability
- 19 Transform Types
- Incremental Loading
- Idempotent Outputs
- Validations & Assertions
Each feature is a direct answer to a named problem.
Declarative YAML Pipelines
Problems addressed: No Repeatable Structure · Copy-Paste Proliferation
Define sources, transforms, and targets in YAML.
One canonical definition per pipeline.
Reviewable, diffable, version-controlled in Git.
The config is the PR.
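A sketch of what such a declaration could look like (the field names are assumptions for illustration, not weevr's documented schema):

```yaml
# Hypothetical thread definition: source, transforms, and target in YAML.
# Key names are illustrative; consult the weevr docs for the real schema.
thread: stg_customers
source:
  type: lakehouse_table
  table: raw_customers
transforms:
  - type: rename_columns
    mapping:
      CustID: customer_id
  - type: filter
    condition: "customer_id IS NOT NULL"
target:
  type: delta_table
  table: stg_customers
  write_mode: overwrite
```

Because the whole pipeline lives in one file, a pull request diff shows exactly what changed in the pipeline's behavior.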
19 Transform Types
Problems addressed: No Repeatable Structure · Copy-Paste Proliferation
The full range of data engineering needs, declared in config — not coded from scratch.
Every transform follows the same declarative pattern.
No custom code. No drift. No variation.
DAG Orchestration
Problems addressed: Brittle Orchestration
Dependencies inferred automatically from source and target relationships.
Independent threads execute in parallel. Dependent threads wait. Always in the right order.
Execution order is declared, not improvised.
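The inference step can be sketched in a few lines of plain Python: treat each thread's target as a produced table, then make a thread depend on whichever threads produce the tables it reads. This is an illustrative sketch of the idea, not weevr's implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical thread definitions: each declares the tables it reads
# and the table it writes. weevr's internal structures will differ.
threads = {
    "stg_customers": {"reads": ["raw_customers"], "writes": "stg_customers"},
    "stg_orders":    {"reads": ["raw_orders"], "writes": "stg_orders"},
    "dim_customer":  {"reads": ["stg_customers"], "writes": "dim_customer"},
    "fact_orders":   {"reads": ["stg_orders", "dim_customer"], "writes": "fact_orders"},
}

# Map each output table back to the thread that produces it.
producer = {t["writes"]: name for name, t in threads.items()}

# A thread depends on the threads that produce the tables it reads.
deps = {
    name: {producer[tbl] for tbl in t["reads"] if tbl in producer}
    for name, t in threads.items()
}

# Topological order: staging threads first, fact_orders last.
order = list(TopologicalSorter(deps).static_order())
```

Threads with no unmet dependencies (the two staging threads here) are exactly the ones that can run in parallel.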
Incremental Loading
Read only what’s changed. Not everything. Every time.
Watermark tracking, CDC support, and incremental state management — declared in config, handled by the engine.
You define the strategy. weevr manages the state.
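A minimal sketch of the watermark idea in plain Python (weevr persists this state for you; the in-memory store and field names here are assumptions for illustration):

```python
# Stand-in for a persisted watermark table: thread name -> last high-water mark.
state = {}

def load_incremental(thread_name, rows, ts_field="updated_at"):
    """Process only rows newer than the stored watermark, then advance it."""
    watermark = state.get(thread_name)  # None on the first run => full load
    new_rows = [r for r in rows if watermark is None or r[ts_field] > watermark]
    if new_rows:
        state[thread_name] = max(r[ts_field] for r in new_rows)
    return new_rows

rows = [
    {"id": 1, "updated_at": "2026-03-01"},
    {"id": 2, "updated_at": "2026-03-02"},
]
first = load_incremental("fact_transactions", rows)   # full initial load
second = load_incremental("fact_transactions", rows)  # nothing new => empty
```

The second run returns nothing because the watermark already sits at the newest timestamp: the same source data processed twice yields no duplicate work.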
Key Generation & Hashing
Surrogate Key Generation
Hash-based surrogate keys from business key columns.
Eight hash algorithms: xxhash64, sha256, md5, and more.
Null inputs replaced with deterministic sentinels.
Keys are always non-null. Always consistent.

Change Detection Hashing
Row-level change detection without CDC.
Hash a set of columns — if the hash changes, the row changed.
Pairs naturally with merge write mode for SCD patterns.
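The null-sentinel trick can be sketched in plain Python. The slides name xxhash64 as one option (a Spark built-in); sha256 is used here only because it ships in Python's standard library, and the sentinel value and separator are assumptions, not weevr's actual defaults.

```python
import hashlib

NULL_SENTINEL = "~NULL~"  # assumed sentinel; weevr's real value may differ

def surrogate_key(*business_keys, sep="||"):
    """Deterministic hash key from business key columns; never null."""
    parts = [NULL_SENTINEL if k is None else str(k) for k in business_keys]
    return hashlib.sha256(sep.join(parts).encode("utf-8")).hexdigest()

k1 = surrogate_key("ACME", 42)
k2 = surrogate_key("ACME", 42)    # same inputs => same key, every run
k3 = surrogate_key("ACME", None)  # null input => still a non-null, stable key
```

Replacing nulls with a fixed sentinel before hashing is what makes the key both non-null and consistent: two rows with the same (possibly null) business keys always hash to the same surrogate.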
Idempotent Outputs
Run it once. Run it ten times. The result is the same.

Overwrite: naturally idempotent. Target is fully replaced. Same data in, same data out.
Append: pair with incremental load to prevent duplicate rows. Same window in, same rows out.
Merge: match keys ensure the same upsert result every time. Same source in, same target out.

Determinism isn’t just a property of the config. It’s a guarantee of the output.
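The merge case can be illustrated with a toy upsert in plain Python: because writes match on a key, replaying the same batch updates rows in place instead of duplicating them. This is a sketch of the principle, not Delta's MERGE implementation.

```python
def merge(target, source_rows, match_key="id"):
    """Upsert source_rows into target, a dict keyed by match_key."""
    for row in source_rows:
        target[row[match_key]] = row  # update if present, insert if not
    return target

target = {}
batch = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
merge(target, batch)
merge(target, batch)  # run it again: the result is unchanged
```

An append without a match key, by contrast, would have produced four rows on the second run, which is why append is only idempotent when paired with an incremental window.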
Structured Telemetry & Observability
Problems addressed: Opaque Execution
OTel-compatible execution spans.
Structured JSON logging at every milestone.
Row counts, timing, and status per thread.
Full trace hierarchy — loom → weave → thread.
Route to Azure Monitor, Elasticsearch, Splunk — any JSON-capable log sink. No custom parsing required.
When something goes wrong, you know exactly what happened, where, and what the data looked like.
Validations & Assertions
Pre-Write Validations
Rules evaluated against the DataFrame before a single row is written.
Failing rows are quarantined — not dropped, not written, not silently lost.
Severity levels: info · warn · error · fatal

Post-Write Assertions
Checks evaluated against the target after every successful write.
Row count minimums, null checks, uniqueness constraints, custom expressions.
The contract is verified, not assumed.
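A toy sketch of the quarantine behavior in plain Python (the rule shape and field names are assumptions for illustration, not weevr's API):

```python
def validate(rows, rules):
    """Split rows into passing rows and quarantined (row, failed_rules) pairs."""
    passed, quarantined = [], []
    for row in rows:
        failures = [r["name"] for r in rules if not r["check"](row)]
        if failures:
            quarantined.append((row, failures))  # set aside, never written
        else:
            passed.append(row)
    return passed, quarantined

rules = [
    {"name": "customer_id_not_null", "severity": "error",
     "check": lambda r: r.get("customer_id") is not None},
]
rows = [{"customer_id": 1}, {"customer_id": None}]
good, bad = validate(rows, rules)
```

The key property is that a failing row still exists somewhere inspectable, tagged with the rules it broke, rather than silently vanishing from the target.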
Same Pipeline. Two Ways.
The difference isn't the length. It's what you must know to write it correctly.
The YAML declares intent. The Python implements infrastructure.
None of the extra code has anything to do with your data.
Seeing is believing
1. A basic pipeline from config
2. Validations and quarantine
3a. Merge with soft delete and surrogate keys
3b. Incremental watermark loading
4. A full pipeline weave — 5 threads, one DAG
5. A production loom — inheritance, params, conditions
Demo 1: The Basic Thread
Problems addressed: No Repeatable Structure · Copy-Paste Proliferation
ctx.run('stg_customers.thread')
The YAML file is the pipeline. Nothing else required.
Demo 2: Validations & Quarantine
Problems addressed: Opaque Execution
ctx.run('stg_customers_validated.thread')
Pre-write validation rules.
Failing rows quarantined — not dropped, not written.
Post-write assertions on the target.
Quality is part of the contract.
Demo 3a: Merge with Soft Delete
Problems addressed: No Repeatable Structure
ctx.run('dim_product.thread')
Merge write mode — update, insert, soft delete.
xxhash64 surrogate key generation. Null-safe by default.
Same config. Different inputs. Correct output.
Demo 3b: Incremental Watermark
Problems addressed: Brittle Orchestration
ctx.run('fact_transactions.thread')
Watermark-based incremental loading.
State persisted automatically — zero configuration.
Merge on transaction_id.
Run 1: full initial load
Run 2: only new transactions processed
Demo 4: Full Pipeline Weave
Problems addressed: Brittle Orchestration · Opaque Execution
ctx.run('customer_pipeline.weave')
5 threads · Automatic DAG · Parallel execution
Narrow lookups · Hooks · Quality gates
The weave is the dependency contract.
Demo 5: A Production Loom
Problems addressed: Environment Drift
ctx.run('daily.loom')
Two weaves · Config inheritance · Typed parameters
Conditional execution · Environment agnostic
One config. Any environment.
The contract doesn't change. Only the parameters do.
Declarative Pattern is the Answer
SQL has dbt. PySpark on Fabric deserves the same.
dbt didn’t invent SQL transformation. It gave the community a standard way to do it.
weevr is that bet for config-driven PySpark.
Whether it wins depends on whether practitioners like you decide it's worth building together.
Join the Community
① Try it
pip install weevr
20-minute quickstart
Full docs at ardent-data.github.io/

② Star it
ardent-data/weevr
Every star signals the ecosystem is ready
Watch for releases
Apache 2.0 · Production ready

③ Shape it
Open issues · Read the roadmap
GitHub Discussions
Share your thread patterns · Request transform types
Contribute to docs · Open a PR
Fabric Runtime 1.3
Pierre LaFromboise
CDO · Covenant Technology Partners
linkedin.com/in/pierrelafromboise
Sound off.
The mic is all yours.
Influence the product roadmap.
Join the Fabric User Panel: share your feedback directly with the Fabric product group and researchers. https://aka.ms/JoinFabricUserPanel
Join the SQL User Panel: influence the SQL roadmap and ensure it meets your real-life needs. https://aka.ms/JoinSQLUserPanel