How to Schema-Migrate a Headless CMS Without Downtime

A schema migration on a live headless CMS usually goes wrong in the same place: a field rename ships inside a `defineType` schema, the new field is required, and every document written under the old shape starts failing validation in production while editors stare at red error states. Or worse, a frontend query hard-codes the old field name and a release goes out with empty hero sections nobody caught in preview. The blast radius is rarely the code. It is the thousands of documents already sitting in the store under the old contract.

Most teams treat a content model change like a database migration and reach for a maintenance window. On a headless stack feeding web, mobile, and downstream services, that window is expensive and often politically impossible. Sanity is a headless content platform built around Content Lake, a queryable, schema-aware content store, and that architecture is what makes a zero-downtime schema migration tractable rather than aspirational. As a Content Operating System, it lets you run the old shape and the new shape against the same data while you migrate.

This guide walks the expand-and-contract pattern as it actually plays out on a headless CMS: additive schema changes first, a scripted backfill, dual-read queries during the transition, then a clean contraction once nothing reads the old field. The goal is a migration your editors never notice.

Why the maintenance window is a lie on a headless stack

The maintenance-window instinct comes from relational databases, where a schema and its data are coupled and an `ALTER TABLE` can lock writes. A headless CMS does not work that way, and pretending it does causes most of the pain. Your content is consumed by many clients at once: a Next.js site, a native app, a search indexer, an email service, partner APIs. There is no single moment when all of them are quiescent enough to swap a contract. Taking the system down to migrate one field means taking down every channel that reads any field.

The deeper problem is that a content model change is not atomic across documents. You can deploy a new schema in seconds, but you cannot retroactively rewrite ten thousand documents in that same instant. Between deploy and backfill there is always a window where some documents carry the old shape and some carry the new one. A migration plan that assumes both states never coexist is a plan that breaks the moment it meets real data.

The correct mental model is versioned contracts, not a flipped switch. The schema describes what a document can be, the data describes what each document currently is, and the two drift apart on purpose during a migration. Your job is to keep every reader and writer tolerant of both shapes until the data has fully caught up. Sanity's Content Lake is schema-aware at query time rather than enforcing a single rigid table definition, so old and new document shapes can live in the same dataset and be queried side by side. That property, not a clever script, is what removes the need for downtime.

Expand and contract: the only pattern that survives production

Expand and contract (sometimes called parallel change) is the discipline that makes zero-downtime possible. You never rename or retype a field in place. Instead you split the change into phases that each preserve a working system. Phase one, expand: add the new field alongside the old one, both optional, and deploy. Nothing reads the new field yet, so the deploy is inert. Phase two, migrate: backfill the new field from the old across every existing document. Phase three, transition: switch readers to prefer the new field, falling back to the old. Phase four, contract: once nothing reads the old field, stop writing it, then delete it from the schema.

The rule that makes this safe is that no single deploy ever requires data and code to change together. Each phase is independently reversible. If the backfill is wrong, you have not yet pointed readers at the new field, so production is unaffected. If a reader regresses, the old field is still populated. You are trading one scary atomic change for four boring incremental ones, which is exactly the trade you want under load.

A concrete example: splitting a single `author` string into a reference to a dedicated `person` document. The naive version (retyping `author` from string to reference) orphans every existing document the instant it deploys. The expand-and-contract version adds an `authorRef` field, backfills it by matching names to person documents, updates queries to read `authorRef` with `author` as fallback, and only then removes `author`. In Sanity you can declare both fields side by side in the same `defineType` schema, and because GROQ projects exactly the shape each client asks for, a transitional query can return a normalized author regardless of which underlying field is populated.
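As a sketch of the expand phase, the `post` type might declare the legacy and the new field together. The field names follow the example above; the `title` field, the display titles, and the file path are assumptions for illustration.

```typescript
// schemas/post.ts (hypothetical path). Expand phase: both fields coexist.
import {defineField, defineType} from 'sanity'

export const post = defineType({
  name: 'post',
  title: 'Post',
  type: 'document',
  fields: [
    // Assumed field, included only for context.
    defineField({name: 'title', type: 'string'}),
    // Legacy field: kept and still written until the contract phase.
    defineField({
      name: 'author',
      title: 'Author (legacy)',
      type: 'string',
    }),
    // New field: deliberately carries no required rule yet, so documents
    // written under the old shape keep passing validation.
    defineField({
      name: 'authorRef',
      title: 'Author',
      type: 'reference',
      to: [{type: 'person'}],
    }),
  ],
})
```

Leaving both fields optional is what makes this deploy inert: no existing document becomes invalid, and no reader is forced to change.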

Writing the backfill: idempotent, batched, and resumable

The backfill is where most zero-downtime migrations quietly fail, not because the transform is hard but because the script is fragile. Three properties separate a backfill you can trust from one you babysit. It must be idempotent, so running it twice produces the same result and a half-finished run can simply be re-run. It must be batched, so you are not holding the entire dataset in memory or hammering the API with one giant transaction. And it must be resumable, so a network blip at document 8,000 of 10,000 does not force you to start over or, worse, double-apply the transform.

Idempotency usually comes from making the transform a pure function of the current document and writing only when the target field is absent or stale. Query for documents that still need migrating rather than all documents, so each pass shrinks the remaining set and the script naturally converges. Patch with explicit, conditional mutations rather than blind overwrites, so a re-run on an already-migrated document is a no-op.
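A minimal sketch of such a transform, assuming the `author` to `authorRef` example from earlier. The `PostDoc` and `AuthorPatch` shapes and the name-keyed lookup map are hypothetical; the patch shape is modeled on a `setIfMissing` patch operation and should be checked against the mutation format you actually send.

```typescript
// Minimal shapes for the sketch; a real project would use generated types.
interface PostDoc {
  _id: string
  author?: string
  authorRef?: {_type: 'reference'; _ref: string}
}

interface AuthorPatch {
  id: string
  // Conditional write: only takes effect if the field is still missing.
  setIfMissing: {authorRef: {_type: 'reference'; _ref: string}}
}

// Pure function of the current document: no I/O, no clock, no randomness.
// Returning null means "nothing to do", which makes a re-run a no-op.
function planAuthorPatch(
  doc: PostDoc,
  personIdByName: ReadonlyMap<string, string>, // key: lowercased, trimmed name
): AuthorPatch | null {
  if (doc.authorRef) return null // already migrated
  if (typeof doc.author !== 'string') return null // nothing to migrate from
  const personId = personIdByName.get(doc.author.trim().toLowerCase())
  if (personId === undefined) return null // no match: leave for manual review
  return {
    id: doc._id,
    setIfMissing: {authorRef: {_type: 'reference', _ref: personId}},
  }
}
```

Because the function never touches the network, it can be unit-tested against awkward sample documents long before any write is attempted.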

On Sanity, the practical tooling is the client's mutation API driving patches, with GROQ doing the selection: a query like `*[_type == "post" && !defined(authorRef) && defined(author)]` returns exactly the un-migrated slice in one round trip, including any projections you need to compute the new value. You batch those into transactions, commit, and re-query until the set is empty. For migrations that should run as part of the platform rather than from a laptop, Functions let the backfill execute server-side on a schedule or trigger, which removes the long-lived local script from the critical path entirely. Either way the selection query is the safety rail: if it returns zero documents, the migration is complete by definition.
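The select, batch, commit, re-query loop could be sketched as follows. `BackfillClient` is a hypothetical two-method interface, not the real client API; in practice it would be a thin adapter over your Sanity client. The `_id` cursor, the batch size, and the name lookup map are likewise assumptions.

```typescript
interface PostToMigrate {
  _id: string
  author: string
}

interface AuthorMutation {
  patch: {
    id: string
    setIfMissing: {authorRef: {_type: 'reference'; _ref: string}}
  }
}

// Hypothetical minimal client surface, to be adapted to a real client.
interface BackfillClient {
  fetch(query: string, params: {lastId: string}): Promise<PostToMigrate[]>
  commit(mutations: AuthorMutation[]): Promise<void> // one transaction per call
}

// Selects only un-migrated posts, in _id order, after the cursor.
function unmigratedQuery(batchSize: number): string {
  return `*[_type == "post" && !defined(authorRef) && defined(author) && _id > $lastId]
    | order(_id) [0...${batchSize}] {_id, author}`
}

async function runBackfill(
  client: BackfillClient,
  personIdByName: ReadonlyMap<string, string>,
  batchSize = 100,
): Promise<{patched: number; unmatched: string[]}> {
  let lastId = '' // cursor: every _id sorts after the empty string
  let patched = 0
  const unmatched: string[] = []

  for (;;) {
    const batch = await client.fetch(unmigratedQuery(batchSize), {lastId})
    if (batch.length === 0) break // nothing left after the cursor

    const mutations: AuthorMutation[] = []
    for (const doc of batch) {
      lastId = doc._id // advance even past documents we cannot migrate
      const personId = personIdByName.get(doc.author.trim().toLowerCase())
      if (personId === undefined) {
        unmatched.push(doc._id) // report for review instead of guessing
        continue
      }
      mutations.push({
        patch: {
          id: doc._id,
          setIfMissing: {authorRef: {_type: 'reference', _ref: personId}},
        },
      })
    }
    if (mutations.length > 0) await client.commit(mutations)
    patched += mutations.length
  }
  return {patched, unmatched}
}
```

A restart after a crash begins again from an empty cursor, but the selection already excludes migrated documents, so earlier work is skipped rather than repeated. The cursor also stops unmatched documents from being fetched forever within a single run.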

Dual-read queries: serving both shapes without a flag day

Between backfill start and contraction, your content store is genuinely heterogeneous: some documents have the new field, some still only have the old. Every read path has to tolerate that, and the cleanest place to absorb the difference is the query itself rather than scattering conditionals through application code. A dual-read query normalizes the two shapes into one response contract, so the frontend neither knows nor cares which underlying field a given document used.

This is where query expressiveness pays for itself. A query language that can only fetch flat fields forces the fallback logic up into the client, where it gets duplicated across every consumer and every platform. The migration then drags on because you cannot retire the old field until every client has independently learned to handle both shapes. Centralizing the fallback in the query collapses that coordination problem to a single change.

GROQ makes the dual read a projection concern. A coalescing projection such as `*[_type == "post"]{ ..., "author": coalesce(authorRef->name, author) }` returns a single `author` value whether the document was migrated or not, resolving the reference with `->` when present and falling back to the legacy string otherwise, all in one round trip. Every consumer (web, native, and search index) reads the same normalized shape and never branches on migration state. Because the contract the clients depend on never changes, the underlying field swap becomes invisible to them, which is the entire point. When you later contract, you simplify the projection in one place and nothing downstream notices.
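One way to keep that contract in a single place is a shared query module, sketched below. The `PostSummary` type, the constant names, and the `GroqRunner` signature are assumptions standing in for whatever client call executes GROQ in your stack; the projection is narrowed to `_id` and `author` for brevity.

```typescript
// The response contract every consumer sees before, during, and after migration.
interface PostSummary {
  _id: string
  author: string | null // null when neither field yields a value
}

// Transition phase: prefer the resolved reference, fall back to the legacy string.
const POSTS_TRANSITIONAL = `*[_type == "post"]{
  _id,
  "author": coalesce(authorRef->name, author)
}`

// After contraction: same output shape, simpler projection.
const POSTS_CONTRACTED = `*[_type == "post"]{
  _id,
  "author": authorRef->name
}`

// Hypothetical stand-in for the client method that runs a GROQ query.
type GroqRunner = (query: string) => Promise<unknown>

// Consumers call this and never learn which underlying field was populated.
async function listPosts(
  runQuery: GroqRunner,
  query: string = POSTS_TRANSITIONAL,
): Promise<PostSummary[]> {
  return (await runQuery(query)) as PostSummary[]
}
```

Contracting later means changing the default from `POSTS_TRANSITIONAL` to `POSTS_CONTRACTED`; callers of `listPosts` keep the same return type.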

Previewing the migration before it touches production data

A backfill is a bulk write, and bulk writes are the operations you least want to run blind. The failure mode that hurts is a transform that looked right against three sample documents and corrupts a long-tail edge case across hundreds of them: an author name with a comma, a legacy field that was sometimes an array, a null where you assumed a string. By the time validation surfaces it, you have already written bad data, and now you need a second migration to clean up the first.

The governance answer is to make the migration reviewable as a unit before it goes live, the same way you review code. You want a staged set of changes you can inspect, diff against current state, approve, and schedule, rather than a script someone runs from a terminal at 6pm on a Friday. Treating bulk content change as a release, not an ad-hoc operation, is what separates a controlled migration from an incident.

Sanity supports this directly. You can run the backfill against a cloned dataset first, a full copy of production content, so the transform meets real edge cases before it meets real users. Content Releases let you bundle the resulting changes as a single named, schedulable unit that editors and engineers can review together and roll out atomically, and Content Source Maps plus Visual Editing let you see exactly which rendered fields a change touches. Audit logs record who ran what and when, so the migration leaves a paper trail. The combination turns a backfill from a hopeful script into a governed, reviewable change with a rollback story.
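Before any write, even against a cloned dataset, it can help to classify what the transform would do to each document. The sketch below is one possible dry-run report; the outcome names, the `RawPost` shape, and the name lookup map are hypothetical.

```typescript
type Outcome = 'wouldPatch' | 'alreadyMigrated' | 'unmatched' | 'malformed'

// Deliberately loose: the point is to meet data that breaks assumptions.
interface RawPost {
  _id: string
  author?: unknown
  authorRef?: unknown
}

function classify(
  doc: RawPost,
  personIdByName: ReadonlyMap<string, string>,
): Outcome {
  if (doc.authorRef !== undefined && doc.authorRef !== null) return 'alreadyMigrated'
  // Arrays, nulls, numbers, and blank strings all land here.
  if (typeof doc.author !== 'string' || doc.author.trim() === '') return 'malformed'
  return personIdByName.has(doc.author.trim().toLowerCase()) ? 'wouldPatch' : 'unmatched'
}

// Groups document ids by outcome without writing anything.
function dryRun(
  docs: readonly RawPost[],
  personIdByName: ReadonlyMap<string, string>,
): Record<Outcome, string[]> {
  const report: Record<Outcome, string[]> = {
    wouldPatch: [],
    alreadyMigrated: [],
    unmatched: [],
    malformed: [],
  }
  for (const doc of docs) report[classify(doc, personIdByName)].push(doc._id)
  return report
}
```

A non-empty `malformed` or `unmatched` list is the signal to fix the transform or the source data first, while a second cleanup migration is still avoidable.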

Contracting safely: proving nothing reads the old field

Contraction is the phase teams rush, and rushing it reintroduces exactly the downtime risk the whole exercise was meant to avoid. Removing the old field while something still reads it is no safer than the naive rename you started by rejecting. The discipline here is evidence: you do not delete the old field because you believe nothing reads it, you delete it because you have proven nothing does.

The proof has two halves. First, code: search every consumer (frontend, indexer, downstream service) for references to the old field, and confirm the dual-read version has been deployed long enough that no in-flight cache or client build still depends on it. Second, data: confirm the backfill actually reached completeness, which is just your selection query returning zero documents. Only when both are true do you stop writing the old field, let that settle, and then remove it from the schema in a final inert deploy.
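The data half of the proof can be turned into an automated gate. The sketch below wraps the selection filter in GROQ's `count()` and fails loudly if anything remains; `CountRunner` is a hypothetical stand-in for your client's query call, and where the check runs (CI, a pre-deploy script) is left open.

```typescript
// Same filter as the backfill selection, reduced to a number.
const REMAINING_POSTS = `count(*[_type == "post" && !defined(authorRef) && defined(author)])`

// Hypothetical stand-in for the client method that runs a GROQ query.
type CountRunner = (query: string) => Promise<unknown>

// Resolves when the backfill is complete; rejects otherwise.
async function assertBackfillComplete(runQuery: CountRunner): Promise<void> {
  const remaining = await runQuery(REMAINING_POSTS)
  if (typeof remaining !== 'number') {
    throw new Error('expected a numeric count from the completeness query')
  }
  if (remaining > 0) {
    throw new Error(`${remaining} posts still carry only the legacy author field`)
  }
}
```

Running this immediately before the contraction deploy means the decision to remove the field rests on a measured zero rather than on memory of a past run.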

The order matters. Stop writing the old field first and observe for a cycle, because a field that is no longer written but is still present is recoverable, whereas a field removed from the schema is gone. In Sanity the schema change to drop the field is a `defineType` edit shipped like any other, and TypeGen regenerates your TypeScript types from the new schema so the compiler flags any straggler code still referencing the removed field before it ever reaches runtime. That last point is the quiet payoff of a code-defined, typed content model: the contraction step that is purely faith-based on a loosely typed CMS becomes a compile error you cannot ignore.
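To illustrate the compile-time safety net, the sketch below uses a hand-written `Post` interface as a stand-in for a generated type after `author` has been dropped. The real generated output will look different; only the mechanism is the point.

```typescript
// Stand-in for a generated type once `author` is removed from the schema.
// The shape is an assumption, not actual generator output.
interface Post {
  _id: string
  authorRef?: {_type: 'reference'; _ref: string}
}

// Migrated code reads only the new field and type-checks cleanly.
function authorRefId(post: Post): string | undefined {
  return post.authorRef?._ref
}

// A straggler still reading the legacy field. Without the directive below,
// the compiler rejects `post.author` because `Post` no longer declares it.
// The directive is present only so that this sketch compiles.
// @ts-expect-error: Property 'author' does not exist on type 'Post'.
const legacyByline = (post: Post) => post.author
```

In a real codebase you would not add the directive: the failing build is the signal that a consumer still depends on the removed field.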

Zero-downtime schema migration: capabilities that matter

Feature	Sanity	Contentful	Strapi	Hygraph
Old and new field shapes coexisting in one dataset	Content Lake is schema-aware at query time, so documents under old and new shapes live together and are queried side by side during the transition.	Field-level migrations supported, but the CMA-driven content type changes assume a single active content type version per environment.	Relational/SQL-backed; schema changes run as DB migrations, so mixed-shape rows require manual nullable columns and app-side handling.	GraphQL schema is generated from the model; mixed shapes require keeping both fields optional and resolving them in client queries.
Dual-read fallback in a single query	GROQ projection coalesces both shapes: coalesce(authorRef->name, author) resolves the reference or falls back to the legacy field in one round trip.	GraphQL/REST return flat fields; coalescing across old and new fields is typically handled in client code, duplicated per consumer.	REST and GraphQL fetch declared fields; fallback logic generally lives in application code rather than the query.	GraphQL fetches requested fields; @skip/@include help, but cross-field coalescing usually moves up into the client.
Selecting only un-migrated documents	One GROQ query, *[_type=='post' && !defined(authorRef) && defined(author)], returns exactly the remaining slice, so the set shrinks to zero as a completeness proof.	Filter via CMA queries on field presence, paginated through the API; selection is workable but less expressive for derived conditions.	Query by null columns via REST filters or raw SQL; expressive enough but split across DB and API layers.	Filter on null fields in GraphQL where-clauses; works, though derived projections are limited compared to GROQ.
Backfill running as governed platform code	Functions run the backfill server-side on a trigger or schedule, removing the long-lived laptop script from the critical path.	Backfills typically run as external scripts against the CMA; scheduling and execution live outside the platform.	Custom controllers, lifecycle hooks, or external scripts; self-hosted, so you own the runtime and its reliability.	Backfills run as external scripts against the API; no first-class in-platform serverless function for content mutations.
Reviewing the bulk change before it ships	Content Releases bundle the migration as a named, schedulable unit; clone the dataset to dry-run against real edge cases, with Audit logs recording who ran what.	Environment aliases let you test changes in a sandbox environment and promote, giving a strong staged-rollout story.	Stage on a separate environment/database copy; review depends on your own deployment and approval pipeline.	Environments and migration tooling allow testing model changes before promoting to production.
Proving stragglers are gone after contraction	TypeGen regenerates TypeScript from the new schema, so any code still referencing the removed field becomes a compile error rather than a runtime surprise.	TypeScript types can be generated from content types; catching stragglers depends on regenerating and rebuilding consumers.	Types generated from schema; straggler detection depends on your typegen setup and CI discipline.	GraphQL codegen yields typed operations, so removed fields surface at build time when types are regenerated.