Building an AutoML Pipeline for Vector Data in Azure SQL
Description
This session demonstrates how to build an AutoML-driven pipeline inside Azure SQL to process and optimize vector data. Learn how to automatically profile datasets, select the best embedding model and parameters, and improve retrieval accuracy using adaptive evaluation and feedback loops.
Key Takeaways
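- RAG evaluation results & model recommendations
- We can use DiskANN and search incredibly fast.
- One of the most important features of the vector data type is its dimension.
- SQL Server 2025 has a limit of 1998 dimensions for the vector type, because of the 8 KB page size.
- The DiskANN vector index is read-only, so what should we do for changed data?
- SQL integration is always custom per project
- SQL Server connection / PDF import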
My Notes
Action Items
- [ ]
Resources & Links
Slides
Building an AutoML Pipeline for Vector Data in Azure SQL
Ömer Çolakoğlu,
Microsoft Data Platform MVP
Founder, FEYOTECH
omer@feyotech.com
https://www.linkedin.com/in/omercolakoglu/
What We Will Cover Today
The Problem
Why is choosing the right embedding model so hard?
Architecture
How AutoEmbedding works end-to-end
SQL Server 2025
VECTOR type & sp_invoke_external_rest_endpoint
Data Profiling
10 domain types: review, ecommerce, book, knowledge_base...
Model Evaluation
Accuracy x Latency x Cost x RAG Quality weighted composite score
New Features
Smart Chunking, RAG Pipeline & PDF Import
RAG & Results
RAG evaluation results & model recommendations
We are happy!
• SQL Server 2025 has been released.
• The native vector type is built in.
• We have an AI-ready enterprise database platform.
• We can embed data with any LLM embedding model using sp_invoke_external_rest_endpoint.
• We can use DiskANN and search incredibly fast.
• We can search data semantically, in any language, without translation.
A vector stores data semantically: it is a format that represents data as a series of float values. These values are not meaningful to humans but are very meaningful to computers.
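To make the "float values that are meaningful to computers" idea concrete, here is a minimal sketch of how two vectors are usually compared. The three 4-dimensional vectors are made up for illustration (real embedding models return hundreds or thousands of dimensions), and cosine similarity is only one common distance measure, not necessarily the one used in the demo.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: invented numbers, not output of a real model.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

sim_close = cosine_similarity(cat, kitten)    # semantically related texts
sim_far = cosine_similarity(cat, invoice)     # unrelated texts

print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The related pair scores much closer to 1 than the unrelated pair, which is the property semantic search relies on.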
Questions in our brains???
There are many data types and many LLM embedding models. So which model is the best for our scenario?
One of the most important features of the vector data type is its dimension. Larger dimensions are more expensive. But does a larger dimension mean more accurate results?
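The "more expensive" half of that question can be estimated with simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float and ignores row, page, and index overhead, so it is a lower bound rather than an exact figure. The function name and the row count are illustrative. It says nothing about accuracy, which is what the experiments are for.

```python
BYTES_PER_FLOAT32 = 4  # assumption: single-precision values

def vector_storage_bytes(dimensions, row_count):
    """Raw payload size of row_count vectors, without any storage overhead."""
    return dimensions * BYTES_PER_FLOAT32 * row_count

ROWS = 1_000_000  # hypothetical table size
for dims in (768, 1536, 3072):
    gib = vector_storage_bytes(dims, ROWS) / 1024 ** 3
    print(f"{dims:>5} dimensions x {ROWS:,} rows = {gib:.2f} GiB")
```

Storage grows linearly with the dimension, so a 3072-dimension model needs four times the raw space of a 768-dimension one for the same rows.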
A vector is a universal format for data, like body language. So we can search data as vectors in any language and get the same result. Does it really work?
We also have large text data, and we have to chunk it. We also have data that is only slightly larger than our maximum chunk size. Chunking or truncating?
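The trade-off can be seen in a small sketch. It measures size in characters for simplicity (real limits are usually in tokens), and the 1000/100 size and overlap values are arbitrary assumptions, not settings from the session.

```python
def truncate(text, max_chars):
    """Keep only the first max_chars characters; the tail is lost."""
    return text[:max_chars]

def chunk_with_overlap(text, max_chars, overlap):
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

# A text only slightly longer than the limit (1100 vs 1000 characters).
text = "".join(str(i % 10) for i in range(1100))
truncated = truncate(text, 1000)
chunks = chunk_with_overlap(text, 1000, 100)

print(len(truncated), [len(c) for c in chunks])
```

Truncating keeps a single vector but drops the last 100 characters. Chunking keeps everything but produces a second, short chunk, which means one more embedding call and one more row to search.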
SQL Server 2025 has a limit of 1998 dimensions for the vector type, because of the 8 KB page size.
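A quick back-of-the-envelope check shows why a limit near 2000 follows from the page size. The sketch assumes 4-byte float values. The exact per-row and per-vector overhead is not given in the session, so only the payload is computed. The helper name is hypothetical.

```python
MAX_VECTOR_DIMENSIONS = 1998   # limit stated in the session
BYTES_PER_FLOAT32 = 4          # assumption: single-precision values
PAGE_SIZE_BYTES = 8 * 1024     # 8 KB data page

def fits_in_vector_column(dimensions):
    """True if a model's output dimension is within the stated limit."""
    return dimensions <= MAX_VECTOR_DIMENSIONS

payload = MAX_VECTOR_DIMENSIONS * BYTES_PER_FLOAT32
print(f"Largest vector payload: {payload} bytes of a {PAGE_SIZE_BYTES}-byte page")

for dims in (768, 1536, 3072):
    status = "fits" if fits_in_vector_column(dims) else "exceeds the limit"
    print(f"{dims} dimensions: {status}")
```

A 3072-dimension model therefore cannot be stored as-is in a single vector column under this limit, which is one more constraint the model selection has to respect.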
The DiskANN vector index is really fast, but it is read-only. So what should we do for changed data?
Why Is Embedding Model Selection So Hard?
Real-World Challenges
• 30+ available models with different dimensions (768-3072)
• Performance varies dramatically per domain
• Cost: $0.00002 to $0.001 per 1K tokens
• Latency: 50ms to 2000ms per request
• Azure vs OpenAI vs Voyage vs Jina vs Ollama...
• SQL Server integration requires custom code each time
• Customer reviews != Product catalog
How It's Done Today
• Manual benchmarking takes days
• Repeat for every new model release
• Excel spreadsheets for comparison
• Which metric matters most? How to weight?
• SQL integration is always custom per project
So, we need an intelligent pipeline.
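The price range on the slide translates into very different budgets. The following is a minimal sketch of that arithmetic. The row count and average token count are hypothetical, and the function is an illustration, not the pipeline's actual "Estimated Budget" calculation.

```python
def estimate_embedding_cost(row_count, avg_tokens_per_row, price_per_1k_tokens):
    """Estimated cost in dollars of embedding every row once."""
    total_tokens = row_count * avg_tokens_per_row
    return total_tokens / 1000 * price_per_1k_tokens

ROWS = 1_000_000        # hypothetical dataset size
AVG_TOKENS = 200        # hypothetical average text length in tokens

cheapest = estimate_embedding_cost(ROWS, AVG_TOKENS, 0.00002)
priciest = estimate_embedding_cost(ROWS, AVG_TOKENS, 0.001)

print(f"Cheapest model: ${cheapest:,.2f}")
print(f"Priciest model: ${priciest:,.2f}")
```

With these assumed numbers the same job costs roughly 50 times more at the top of the range than at the bottom, before latency or re-embedding after a model change is considered.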
How Does AutoEmbedding Work?
End-to-End Pipeline
1. Datasource Management
• SQL Server connection / PDF import
• Datasource CRUD (add, edit, test)
2. Data Profile
• Language detection and distribution
• Type classification (10 domains, score-based)
• Text statistics (length, duplicate rate)
• Category extraction
• Column profiling (Text, Filter, Date, ID)
• Noise detection and cleaning (HTML, noise)
• Recommended models and strategy
• Text structure optimization (field detection, LLM suggestion)
• VectorStr formula definition
3. Question Management
• Multilingual question pool (TR, EN, DE, FR)
• Automatic question generation via LLM
• Expected keywords / ground truth
4. Experiment
• Model selection (multi-select)
• Strategy selection (recommended by data type)
• Weight settings (accuracy, latency, cost, RAG quality)
• Sampling count
• Chunk settings (size, overlap, semantic/fixed)
• RAG settings (enable/disable, top-K, max queries)
• Live progress tracking + stop control
5. Analysis
• Model metrics (overall score, accuracy, latency, cost, memory)
• Consistency test results (similar/different pair analysis)
• Multilingual test (language pair similarity)
• Retrieval test (precision, top-K ranking)
• RAG quality details (context relevance, faithfulness, completeness)
• Latency performance indicator
• Full model comparison table
• Strategy simulation (change weights → see score impact)
• Live Query Test (ad-hoc search, LLM evaluation)
6. Results and Evaluation
• Winning model and score summary
• AI recommendation (LLM explanation)
• Key decision factors
• Model ranking table
• Vectorizing all data with the selected model
7. Tools
• A/B Experiment comparison
• Experiment history
• Activity log (audit trail)
• SQL Server status panel
• Multilingual query test (cross-language analysis)
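The weight settings and the strategy simulation can be illustrated with a small weighted composite score. This is a sketch under assumptions: the session does not show the actual formula, so the min-max normalization, the metric names, the two models, and all numbers below are hypothetical.

```python
def normalize_lower_is_better(values):
    """Map values to [0, 1] so that the smallest value scores 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(hi - v) / (hi - lo) for v in values]

def composite_scores(models, weights):
    """Weighted average of accuracy, RAG quality, latency, and cost scores."""
    names = list(models)
    latency = normalize_lower_is_better([models[n]["latency_ms"] for n in names])
    cost = normalize_lower_is_better([models[n]["cost_per_1k"] for n in names])
    total_weight = sum(weights.values())
    scores = {}
    for i, name in enumerate(names):
        m = models[name]
        raw = (weights["accuracy"] * m["accuracy"]
               + weights["rag_quality"] * m["rag_quality"]
               + weights["latency"] * latency[i]
               + weights["cost"] * cost[i])
        scores[name] = raw / total_weight
    return scores

# Hypothetical measurements for two candidate models.
models = {
    "model_a": {"accuracy": 0.90, "rag_quality": 0.85, "latency_ms": 800, "cost_per_1k": 0.001},
    "model_b": {"accuracy": 0.70, "rag_quality": 0.70, "latency_ms": 100, "cost_per_1k": 0.00002},
}
accuracy_first = {"accuracy": 0.6, "rag_quality": 0.3, "latency": 0.05, "cost": 0.05}
budget_first = {"accuracy": 0.2, "rag_quality": 0.1, "latency": 0.35, "cost": 0.35}

for label, weights in (("accuracy-first", accuracy_first), ("budget-first", budget_first)):
    scores = composite_scores(models, weights)
    winner = max(scores, key=scores.get)
    print(label, {k: round(v, 3) for k, v in scores.items()}, "->", winner)
```

Changing only the weights flips the winner from the accurate but slow model to the cheap and fast one, which is the effect the strategy simulation lets you explore before committing to a model.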
Data Source Management
Data Profiling-Automatic Domain Detection
[Figure: one chart per domain type, with axes Accuracy, RAG Quality, Cost, and Latency]
Reviews
• Amazon Camera Reviews (used in this project)
• Amazon Book Reviews
• Yelp Restaurant Reviews
E Commerce
• Walmart Products (used in this project)
• Shopify Store Inventory
• eBay Listings
Book
• Big Data in Practice (used in this project)
• SQL Server 2025 Unveiled (used in this project)
Knowledge Base
• StackOverflow Q&A (used in this project)
• Wikipedia Articles
Code
• GitHub Repository Docs
• API Documentation
• Jupyter Notebook Cells
Medical
• PubMed Abstracts
• Clinical Trial Reports
• Drug Interaction Database
Scientific
• Nature Publications
• Research Datasets (UCI)
• Physics Textbook
Multilingual
• Multilingual Wikipedia
• Subtitle Databases
Data Profiling-Automatic Domain Detection | Create Data Profile
• Type Classification, Category Distribution
• Recommended Models, Query Pool
• Data Profile & Search Strategy
• Recommended Search Strategy
• Column to Embed Analysis
Here we go. Experiment time.
The pipeline will find the best options for us.
Experiment
1. Model Selection
Raw Data
2. Sampling, Strategy, Rerank Parameters, Estimated Budget
3. Live Results
4. Detail Analysis
Experiment | Detail Analysis - Consistency Test - Similar Pairs
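A consistency test of this kind can be sketched as follows: a good embedding model should give similar text pairs a clearly higher similarity than different pairs. How the tool actually scores this is not shown in the slides, so the "gap" metric, the `embed` callback, and the toy 2-dimensional vectors are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def consistency_gap(embed, similar_pairs, different_pairs):
    """Mean similarity of similar pairs minus mean similarity of different pairs."""
    def mean_similarity(pairs):
        return sum(cosine(embed(x), embed(y)) for x, y in pairs) / len(pairs)
    return mean_similarity(similar_pairs) - mean_similarity(different_pairs)

# Stand-in for a real embedding call: a lookup of invented vectors.
toy_vectors = {
    "great camera": [0.9, 0.1],
    "excellent camera": [0.8, 0.2],
    "late delivery": [0.1, 0.9],
}
embed = toy_vectors.__getitem__

gap = consistency_gap(
    embed,
    similar_pairs=[("great camera", "excellent camera")],
    different_pairs=[("great camera", "late delivery")],
)
print(f"Consistency gap: {gap:.3f}")
```

A larger positive gap means the model separates related and unrelated texts more cleanly. A gap near zero suggests the model is a poor fit for that domain.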
Experiment | Detail Analysis - Multilingual Test - EN/TR Translation Pairs
Experiment | Detail Analysis - Retrieval Test - Query and Top-K Results
Experiment | Detail Analysis - Query Detail (Top-K Results)
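The retrieval test reports precision over the top-K results. A minimal sketch of precision@K is shown below. It assumes ground truth is available as a set of relevant row IDs, while the pipeline itself mentions expected keywords / ground truth, so the exact matching rule may differ. The IDs are made up.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the first k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Hypothetical result of one vector search, best match first.
retrieved = [3, 7, 1, 9, 4]
relevant = {1, 3, 8}

p_at_2 = precision_at_k(retrieved, relevant, 2)
p_at_5 = precision_at_k(retrieved, relevant, 5)
print(f"precision@2 = {p_at_2:.2f}, precision@5 = {p_at_5:.2f}")
```

Averaging this value over the whole question pool gives one accuracy number per model that can feed into the comparison table.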
Experiment | Detail Analysis - All Models - Comparison
Experiment | Detail Analysis - Strategy Simulation
Experiment | Detail Analysis - Live Query Test
Experiment | Summary & Result
Vectorize All Data
1. Source Selection
2. Model Selection
Vectorize All Data - Embedding Progress
Vectorize All Data - Embedded Data on SQL Server
Vectorize All Data - Vector Search on SQL Server
What about other datasets?
[Figure: strategy chart per dataset, with axes Accuracy, RAG Quality, Cost, and Latency]
• Reviews: Amazon Camera Reviews
• E Commerce: Walmart Products
• Book: SQL Server 2025 Unveiled (Book PDF)
• Knowledge Base: StackOverflow Q&A
Honestly, I have more than you might expect.
Ladies and gentlemen! Here is...
On Prem Notebook LM Based on SQL Server 2025 and AutoML Embedding Pipeline
• Chat With Your Embedded PDF
• Chat With 40 Million Rows of StackOverflow Data
• Add Source
• Summary, Mind Map, Info Cards
Thank you.
It's an honor to be here and talk to you.