How to Integrate Weaviate with Your Headless CMS
Add semantic and hybrid search over published content, with updates reaching Weaviate as editors publish.
What is Weaviate?
Weaviate is an open-source vector database with managed hosting through Weaviate Cloud. It indexes objects as vectors, supports hybrid search with BM25 and vector similarity, and can connect to embedding and generative model providers for retrieval-augmented generation. Teams use it for semantic search, product discovery, recommendations, agent memory, and RAG systems.
Why integrate Weaviate with a headless CMS?
Vector search works best when the source content is clean, typed, and consistent. If your articles, products, docs, FAQs, and landing pages are already structured, you can send Weaviate the exact fields it needs, such as title, summary, body text, category, slug, locale, and publish date. That gives your search or RAG layer useful context without asking an embedding job to guess what matters from rendered HTML.
Any headless CMS can expose content through APIs, but the details matter. With Sanity as the AI Content Operating System, content lives as typed JSON in the Content Lake, GROQ selects only the fields Weaviate should index, and webhooks or Functions can run as soon as a document is published, updated, or deleted. That means your vector index follows editorial changes without polling every 10 minutes or running a nightly batch job.
The alternative is usually messier. Someone exports content to CSV, a script scrapes pages, or a separate worker tries to diff stale API responses. Those setups can work for a small docs site, but they break down when you have 20 locales, referenced product data, scheduled releases, and content that changes throughout the day.
Architecture overview
A typical Sanity and Weaviate integration starts with a publish event. When an editor publishes an article, product page, or FAQ in Sanity Studio, a GROQ-powered webhook or Sanity Function receives the mutation event. The handler uses the document ID from the event, queries the Content Lake with GROQ, and projects a retrieval-ready payload, for example title, slug, excerpt, plain text from Portable Text, referenced topics, locale, and last published time.
The sync layer then calls Weaviate. You can use the Weaviate SDK or REST API to upsert the object into a collection such as SanityArticle, SanityProduct, or SanityDoc. If the collection uses a Weaviate vectorizer module, Weaviate creates the embedding when the object is written. If you bring your own embeddings, the handler can generate a vector first and send it with the object. Deletes should call Weaviate's delete endpoint using a deterministic ID mapped from the Sanity document ID.
At query time, your app or agent sends a search request to Weaviate, often hybrid search with a text query and a limit such as 5 or 10 results. Weaviate returns matching objects with scores and metadata. The frontend can render title, excerpt, and URL, while an AI agent can use the returned text chunks as grounded context before generating an answer.
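The last step, handing search results to an agent as grounded context, might look roughly like the sketch below. The `SearchHit` shape and the `buildGroundedContext` helper are hypothetical names chosen for illustration; the fields mirror the payload described above, not a fixed Weaviate response format.

```typescript
// Hypothetical shape of one search hit after you unpack Weaviate's response.
interface SearchHit {
  title: string;
  slug: string;
  excerpt: string;
  score: number;
}

// Keep the highest-scoring hits and format them as a numbered context block
// that an agent prompt can cite by index.
function buildGroundedContext(hits: SearchHit[], maxHits = 5): string {
  return [...hits]
    .sort((a, b) => b.score - a.score)
    .slice(0, maxHits)
    .map((hit, i) => `[${i + 1}] ${hit.title} (/${hit.slug})\n${hit.excerpt}`)
    .join('\n\n');
}
```

Numbering the passages lets the agent point back to a specific source, and the slug gives the frontend a link to render next to the answer.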
Common use cases
Hybrid site search
Index Sanity articles, product docs, and FAQs in Weaviate so visitors can search by exact keywords and semantic meaning in the same query.
RAG for support agents
Send approved help content from Sanity to Weaviate, then retrieve the top passages for grounded support responses.
Semantic product discovery
Vectorize product descriptions, specs, categories, and buying guides so shoppers can search for intent, such as “waterproof jacket for spring hikes.”
Locale-aware retrieval
Sync language, region, and market fields from Sanity so Weaviate queries can filter results before vector ranking.
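As a rough illustration of hybrid and locale-aware retrieval together, the sketch below builds a request body for Weaviate's GraphQL endpoint (`/v1/graphql`). The `buildHybridSearchBody` helper, the `alpha` value of 0.5, and the 1 to 10 limit clamp are assumptions made for this example; the collection and property names follow the ones used elsewhere in this guide.

```typescript
// Build the JSON body for POST {WEAVIATE_URL}/v1/graphql.
// JSON.stringify is used to escape user input into a quoted string literal.
function buildHybridSearchBody(query: string, locale: string, limit = 5): {query: string} {
  // Clamp to a small result set; fall back to 5 when the input is not a number.
  const safeLimit = Number.isFinite(limit)
    ? Math.min(Math.max(Math.trunc(limit), 1), 10)
    : 5;
  const gql = `{
  Get {
    SanityArticle(
      hybrid: {query: ${JSON.stringify(query)}, alpha: 0.5}
      where: {path: ["locale"], operator: Equal, valueText: ${JSON.stringify(locale)}}
      limit: ${safeLimit}
    ) {
      title
      slug
      excerpt
      _additional { score }
    }
  }
}`;
  return {query: gql};
}
```

An `alpha` of 0 leans fully on keyword (BM25) scoring and 1 fully on vector similarity, so 0.5 is only a starting point to tune against real queries.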
Step-by-step integration
1. Create your Weaviate project
Create a Weaviate Cloud cluster or run Weaviate locally with Docker. Create an API key, note the cluster URL, and create a collection such as SanityArticle with properties like sanityId, title, slug, excerpt, body, locale, and topics. Choose a vectorizer, such as text2vec-openai, or plan to send your own vectors.
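One way to create that collection is through Weaviate's REST schema endpoint. The sketch below is a minimal example, not a complete configuration: the data types are assumptions about how you want each field indexed, `updatedAt` is included because the sync handler later in this guide writes it, and `createCollection` is a hypothetical helper name.

```typescript
interface WeaviateProperty {
  name: string;
  dataType: string[];
}

interface WeaviateClassDefinition {
  class: string;
  vectorizer: string;
  properties: WeaviateProperty[];
}

// Collection definition matching the properties listed in this step.
const sanityArticleClass: WeaviateClassDefinition = {
  class: 'SanityArticle',
  vectorizer: 'text2vec-openai', // set to 'none' if you send your own vectors
  properties: [
    {name: 'sanityId', dataType: ['text']},
    {name: 'title', dataType: ['text']},
    {name: 'slug', dataType: ['text']},
    {name: 'excerpt', dataType: ['text']},
    {name: 'body', dataType: ['text']},
    {name: 'locale', dataType: ['text']},
    {name: 'topics', dataType: ['text[]']},
    {name: 'updatedAt', dataType: ['date']},
  ],
};

// POST the definition to /v1/schema; run this once during setup.
async function createCollection(baseUrl: string, apiKey: string): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/schema`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(sanityArticleClass),
  });
  if (!res.ok) throw new Error(`Schema creation failed: ${await res.text()}`);
}
```

Keeping the definition in code makes it easy to recreate the same collection in a local Docker instance and in Weaviate Cloud.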
2. Install the client packages
In your sync service, install the packages you need: npm install @sanity/client uuid. If you prefer the Weaviate TypeScript SDK for queries and collection setup, also install npm install weaviate-client. Keep SANITY_PROJECT_ID, SANITY_DATASET, SANITY_READ_TOKEN, WEAVIATE_URL, and WEAVIATE_API_KEY in environment variables.
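A small startup check can catch a missing variable before the first webhook arrives. This is a sketch under the assumption that all five variables are required; `loadConfig` is a hypothetical helper, not part of either SDK.

```typescript
const REQUIRED_ENV = [
  'SANITY_PROJECT_ID',
  'SANITY_DATASET',
  'SANITY_READ_TOKEN',
  'WEAVIATE_URL',
  'WEAVIATE_API_KEY',
] as const;

type EnvName = (typeof REQUIRED_ENV)[number];
type SyncConfig = Record<EnvName, string>;

// Fail fast with one error that names every missing variable.
function loadConfig(env: Record<string, string | undefined>): SyncConfig {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  const config = {} as SyncConfig;
  for (const name of REQUIRED_ENV) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node service you would typically call `loadConfig(process.env)` once at startup and pass the result to the handler.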
3. Model retrieval-ready content in Sanity Studio
Define schema fields that map cleanly to search objects: title, slug, excerpt, body as Portable Text, locale, topics as references, and publish metadata. Use schema validation to require fields that Weaviate search depends on, such as title and slug.
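A document type along these lines could be sketched as a plain schema object. The local `Rule` and `Field` types exist only so the example stands alone; in a Studio project you would more likely wrap the definitions with `defineType` and `defineField` from the `sanity` package. The `topic` reference target and the `publishedAt` field are assumptions for illustration.

```typescript
// Minimal local types so the sketch compiles without the Studio packages.
interface Rule {
  required(): Rule;
}

interface Field {
  name: string;
  type: string;
  of?: Array<{type: string; to?: Array<{type: string}>}>;
  options?: {source: string};
  validation?: (rule: Rule) => Rule;
}

const articleFields: Field[] = [
  // Search depends on these two, so they are required.
  {name: 'title', type: 'string', validation: (rule) => rule.required()},
  {name: 'slug', type: 'slug', options: {source: 'title'}, validation: (rule) => rule.required()},
  {name: 'excerpt', type: 'text'},
  // Portable Text body, flattened to plain text at sync time.
  {name: 'body', type: 'array', of: [{type: 'block'}]},
  {name: 'locale', type: 'string'},
  // References to a hypothetical 'topic' document type.
  {name: 'topics', type: 'array', of: [{type: 'reference', to: [{type: 'topic'}]}]},
  {name: 'publishedAt', type: 'datetime'},
];

const article = {
  name: 'article',
  type: 'document',
  title: 'Article',
  fields: articleFields,
};
```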
4. Create the sync trigger
Use a Sanity Function for server-side sync logic without external infrastructure, or create a webhook that calls your own API route. Filter the trigger to published document types, such as article, product, or faq, and include the document ID and operation in the webhook payload.
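For the webhook route, the filter and projection might look like the constants below, paired with a defensive parser on the receiving side. The `parseSyncEvent` helper and the `SyncEvent` shape are hypothetical; the extra check for a `drafts.` ID prefix is a precaution in case draft events are enabled for the webhook.

```typescript
// GROQ filter and projection as they could be entered in the webhook settings.
const WEBHOOK_FILTER = '_type in ["article", "product", "faq"]';
const WEBHOOK_PROJECTION = '{_id, _type, "operation": delta::operation()}';

interface SyncEvent {
  _id: string;
  _type: string;
  operation: 'create' | 'update' | 'delete';
}

// Validate the incoming payload before touching Weaviate; return null to skip.
function parseSyncEvent(payload: unknown): SyncEvent | null {
  if (typeof payload !== 'object' || payload === null) return null;
  const {_id, _type, operation} = payload as {
    _id?: unknown;
    _type?: unknown;
    operation?: unknown;
  };
  if (typeof _id !== 'string' || typeof _type !== 'string') return null;
  // Only published documents should reach the index.
  if (_id.startsWith('drafts.')) return null;
  if (operation !== 'create' && operation !== 'update' && operation !== 'delete') return null;
  return {_id, _type, operation};
}
```

Rejecting malformed payloads early keeps a misconfigured webhook from writing partial objects into the index.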
5. Fetch with GROQ, then upsert into Weaviate
In the handler, use @sanity/client to fetch the current published document. Use GROQ to flatten Portable Text with pt::text(body) and join references like topics[]->{title}. Send the result to Weaviate with PUT /v1/objects/{className}/{id}, falling back to POST /v1/objects when the object does not exist yet, or use the SDK data API.
6. Test search and deletion paths
Publish a test document, confirm it appears in Weaviate, update the title, confirm the object changes, then delete or unpublish it and confirm Weaviate removes it. Build your frontend search endpoint with Weaviate hybrid search and return a small result set, usually 5 to 10 items.
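The publish and delete checks can be scripted with a small existence probe. This sketch assumes the class-qualified REST path `/v1/objects/{className}/{id}` and treats a 200 as "indexed" and a 404 as "removed"; `objectUrl`, `objectExists`, and the `FetchLike` type are hypothetical names, and the fetch function is injected so the helper can be exercised without a live cluster.

```typescript
// Build the REST URL for a single object, tolerating a trailing slash on the base URL.
function objectUrl(baseUrl: string, collection: string, id: string): string {
  const base = baseUrl.replace(/\/+$/, '');
  return `${base}/v1/objects/${encodeURIComponent(collection)}/${encodeURIComponent(id)}`;
}

// Narrow view of fetch so a stub can be passed in during tests.
type FetchLike = (
  url: string,
  init: {method: string; headers: Record<string, string>},
) => Promise<{status: number}>;

// Returns true when the object is present and false when Weaviate reports 404.
async function objectExists(url: string, apiKey: string, fetchImpl: FetchLike): Promise<boolean> {
  const res = await fetchImpl(url, {
    method: 'GET',
    headers: {Authorization: `Bearer ${apiKey}`},
  });
  if (res.status === 200) return true;
  if (res.status === 404) return false;
  throw new Error(`Unexpected status ${res.status} for ${url}`);
}
```

After publishing, `objectExists` should resolve to true for the document's mapped UUID; after a delete or unpublish, it should resolve to false.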
Code example
```typescript
import {createClient} from '@sanity/client';
import {v5 as uuidv5} from 'uuid';

// Read-only client; the CDN is bypassed so the handler sees the just-published document.
const sanity = createClient({
  projectId: process.env.SANITY_PROJECT_ID!,
  dataset: process.env.SANITY_DATASET!,
  token: process.env.SANITY_READ_TOKEN!,
  apiVersion: '2025-01-01',
  useCdn: false,
});

// Fixed namespace so each Sanity _id always maps to the same Weaviate UUID.
const UUID_NS = '7f6b6b5a-3a5d-4f5d-9c7b-1b2c3d4e5f60';
const COLLECTION = 'SanityArticle';

export async function POST(req: Request) {
  const event = await req.json();
  const id = uuidv5(event._id, UUID_NS);
  const objectUrl = `${process.env.WEAVIATE_URL}/v1/objects/${COLLECTION}/${id}`;
  const headers = {
    Authorization: `Bearer ${process.env.WEAVIATE_API_KEY}`,
    'Content-Type': 'application/json',
  };

  if (event.operation === 'delete') {
    const del = await fetch(objectUrl, {method: 'DELETE', headers});
    // A 404 means the object was never indexed, which is fine for a delete.
    if (!del.ok && del.status !== 404) throw new Error(await del.text());
    return Response.json({ok: true, deleted: id});
  }

  // Project a flat, retrieval-ready payload: plain-text body and topic titles.
  const doc = await sanity.fetch(`
    *[_id == $id][0]{
      _id,
      title,
      "slug": slug.current,
      excerpt,
      "body": pt::text(body),
      "topics": topics[]->title,
      locale,
      _updatedAt
    }
  `, {id: event._id});
  if (!doc) return Response.json({ok: true, skipped: event._id});

  const body = JSON.stringify({
    class: COLLECTION,
    id,
    properties: {
      sanityId: doc._id,
      title: doc.title,
      slug: doc.slug,
      excerpt: doc.excerpt,
      body: doc.body,
      topics: doc.topics || [],
      locale: doc.locale || 'en-US',
      updatedAt: doc._updatedAt,
    },
  });

  // PUT replaces an existing object; Weaviate answers 404 for an unknown ID,
  // so fall back to POST /v1/objects to create it.
  let res = await fetch(objectUrl, {method: 'PUT', headers, body});
  if (res.status === 404) {
    res = await fetch(`${process.env.WEAVIATE_URL}/v1/objects`, {method: 'POST', headers, body});
  }
  if (!res.ok) throw new Error(await res.text());
  return Response.json({ok: true, indexed: id});
}
```
How Sanity + Weaviate works
Build your Weaviate integration on Sanity
Sanity gives you the structured content foundation, real-time event system, and flexible APIs to connect editorial workflows with Weaviate search and retrieval.
CMS approaches to Weaviate
| Capability | Traditional CMS | Sanity |
|---|---|---|
| Content shape for vector indexing | Content often lives as rendered pages or large HTML fields, so indexing needs cleanup rules before embeddings are useful. | Typed JSON in the Content Lake and GROQ joins return a retrieval-ready object in one query. |
| Sync timing | Teams often use exports, plugins, or scheduled jobs, which can leave vector results behind published content. | Webhooks or Functions can run on publish, update, and delete events, with GROQ filters controlling exactly what triggers a sync. |
| Field-level control | Indexing may include navigation text, layout copy, or hidden markup unless you write cleanup code. | GROQ selects the exact fields Weaviate needs, including plain text from Portable Text and fields from referenced documents. |
| Editorial workflow safety | Draft and published states can be hard to separate in export-based syncs. | Drafts, releases, and published content are distinct, so you can index only approved content and test preview search separately. |
| Server-side integration logic | You usually run a separate worker, plugin, or cron service for vector sync. | Functions can handle event-based sync without extra infrastructure, though high-volume historical backfills still belong in a dedicated script. |
| AI agent access | Agents often need a separate content copy or scraped index. | Agent Context lets production agents query scoped, schema-aware content while Weaviate handles vector retrieval use cases. |
Keep building
Explore related integrations to complete your content stack.
Sanity + Pinecone
Send structured Sanity content to Pinecone for managed vector search, RAG, and agent retrieval.
Sanity + Qdrant
Connect Sanity content to Qdrant for filtered vector search with payload metadata like locale, category, and publish date.
Sanity + Chroma
Use Chroma with Sanity for local RAG prototypes, internal knowledge search, and evaluation workflows.