This content originally appeared on DEV Community and was authored by Avthar Sewrathan
Learn how to automate AI embedding creation using the PostgreSQL you know and love.
Managing embedding workflows for AI systems like RAG, search and AI agents can be a hassle: juggling multiple tools, setting up complex pipelines, and spending hours syncing data, especially if you aren't an ML or AI expert. But it doesn’t have to be that way.
With pgai Vectorizer, now in Early Access, you can automate vector embedding creation, keep them automatically synced as your data changes, and experiment with different AI models -- all with a simple SQL command. No extra tools, no complex setups -- just PostgreSQL doing the heavy lifting.
-- Create a vectorizer to embed data in the blogs table
-- Use Open AI text-embedding-3-small model
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content')
);
What pgai Vectorizer does:
- Embedding creation with SQL: generate vector embeddings from multiple text columns with just one command, streamlining a key part of your AI workflow.
- Automatic sync: embeddings update as your data changes—no manual intervention needed.
- Quick model switching: test different AI models instantly using SQL—no data reprocessing required.
- Test and roll out: compare models and chunking techniques, A/B test, and roll out updates with confidence and without downtime.
Here's an example of testing the RAG output of two different embedding models using pgai Vectorizer:
-- Vectorizer using OpenAI text-embedding-3-small
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_embedding_small',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content'),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
-- Vectorizer using OpenAI text-embedding-3-large
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_embedding_large',
embedding => ai.embedding_openai('text-embedding-3-large', 1536), -- Note different dimensions
chunking => ai.chunking_recursive_character_text_splitter('content'),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
-- Compare results from the two vectorizers on the same RAG query
SELECT
'text-embedding-3-small' as model,
generate_rag_response(
'What is AI?',
'public.blogs_embedding_small'
) as response
UNION ALL
SELECT
'text-embedding-3-large' as model,
generate_rag_response(
'What is AI?',
'public.blogs_embedding_large'
) as response;
Built to Scale
As your datasets grow, pgai Vectorizer scales with you. It automatically optimizes search performance with vector indexes (like HNSW and StreamingDiskANN) once you exceed 100,000 vectors. You’re in control—define chunking and formatting rules to tailor your embeddings to your needs.
Here's an example of an advanced vectorizer configuration, with an ANN index created after 100k rows added, and custom chunking for HTML files:
-- Advanced vectorizer configuration
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_embedding_recursive',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
-- automatically create a StreamingDiskANN index when table has 100k rows
indexing => ai.indexing_diskann(min_rows => 100000, storage_layout => 'memory_optimized'),
-- apply recursive chunking with specified settings for HTML content
chunking => ai.chunking_recursive_character_text_splitter(
'content',
chunk_size => 800,
chunk_overlap => 400,
-- HTML-aware separators, ordered from highest to lowest precedence
separator => array[
E'</article>', -- Split on major document sections
E'</div>', -- Split on div boundaries
E'</section>',
E'</p>', -- Split on paragraphs
E'<br>', -- Split on line breaks
E'</li>', -- Split on list items
E'. ', -- Fall back to sentence boundaries
' ' -- Last resort: split on spaces
]
),
formatting => ai.formatting_python_template('title: $title url: $url $chunk')
);
Try pgai Vectorizer Today (Early Access)
For companies like MarketReader, pgai Vectorizer has already made AI development faster and more efficient:
“pgai Vectorizer streamlines our AI workflow, from embedding creation to real-time syncing, making AI development faster and simpler -- all in PostgreSQL.” — Web Begole, CTO at MarketReader, an AI Financial Insights Company
If you're ready to start building, we are hosting a Dev Challenge with our partners at Ollama all about building AI apps with Open Source Software. We're really excited to see what the community builds with PostgreSQL and pgai Vectorizer!
Save time and effort. Focus less on embeddings. Spend more time building your next killer AI app. Try pgai Vectorizer free today: get it on GitHub or fully managed on Timescale Cloud (free for a limited time during Early Access).
This content originally appeared on DEV Community and was authored by Avthar Sewrathan
Avthar Sewrathan | Sciencx (2024-10-29T13:31:18+00:00) 🚀 pgai Vectorizer: Automate AI Embeddings With One SQL Command in PostgreSQL. Retrieved from https://www.scien.cx/2024/10/29/%f0%9f%9a%80-pgai-vectorizer-automate-ai-embeddings-with-one-sql-command-in-postgresql/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.