---
name: postgres-migration-reviewer
description: Use this skill when you need to review a proposed PostgreSQL schema migration, data migration, backfill, index or constraint rollout, Rails migration, raw SQL DDL change, partitioned-table change, expand-contract deployment plan, or rollback plan for production safety. Trigger this for questions about lock risk, lock duration, table rewrites, full scans, CREATE INDEX CONCURRENTLY, NOT VALID / VALIDATE CONSTRAINT, adding UNIQUE or PRIMARY KEY safely, large-table backfills, replica lag, WAL volume, logical replication caveats, failover implications, or rollback vs roll-forward. Review the migration, flag dangerous operations, explain why they are risky, ask for missing production context, and propose safer rollout steps.
---
Review PostgreSQL migrations for operational safety, not just SQL validity.[1][5][2]
Separate these risk shapes in every review: lock acquisition, lock hold time, rewrite or validation scan cost, WAL and replica impact, and mixed-version application compatibility.[1][2][3][4]
Get the migration SQL or framework migration code.[5][6]
Ask for PostgreSQL major version, affected tables, row counts or table sizes, read and write QPS, partitioning, replicas, replication mode, and whether any standby is synchronous.[9][3][2][6]
Ask for statement_timeout, lock_timeout, whether the migration framework wraps statements in a transaction, deployment order, observability, and the failure plan.[6][3][8]
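A minimal sketch of the session guardrails to confirm before any DDL runs; the values are illustrative and should be tuned to the system under review:

```sql
-- Illustrative session guardrails for a migration; tune to the workload.
SET lock_timeout = '5s';         -- abort instead of queueing behind long lock waits,
                                 -- which would block everything queued behind us
SET statement_timeout = '15min'; -- bound the worst-case runtime of any single statement

-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so confirm whether the migration framework wraps statements in one.
```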
Parse the migration into concrete operations and split bundled migrations mentally: a single ALTER TABLE acquires the strictest lock required by any of its subcommands.[1]
For each operation, classify the main risk as one or more of: blocking lock, long scan or validation, table rewrite, write amplification or WAL load, app-compatibility risk, or replication and failover risk.[1][2][3][4]
Default to safer online patterns for populated hot tables unless the table is confirmed tiny or the system can tolerate blocking.[2][6][7]
Be conditional when context is missing. Do not guess about blast radius from SQL alone.[1][3][6][2]
CREATE INDEX on a populated table: high blocking risk because a standard build locks out writes for the duration. Prefer CREATE INDEX CONCURRENTLY; note that it must run outside a transaction block and still does two table scans with waiting phases.[7]
Failed CREATE INDEX CONCURRENTLY: check for leftover INVALID indexes. They are ignored for planning but still incur maintenance overhead. Cleanup usually means dropping the invalid index or rebuilding it with REINDEX INDEX CONCURRENTLY.[7][14][2]
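A sketch of the concurrent build plus the cleanup path for a failed attempt; index and table names are illustrative:

```sql
-- Must run outside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- If the build fails, an INVALID index is left behind. Find leftovers:
SELECT indexrelid::regclass
FROM pg_index
WHERE NOT indisvalid;

-- Clean up by dropping and retrying, or by rebuilding in place:
DROP INDEX CONCURRENTLY idx_orders_customer_id;
-- or: REINDEX INDEX CONCURRENTLY idx_orders_customer_id;
```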
Validated foreign keys or check constraints on existing large tables: prefer NOT VALID first, then VALIDATE CONSTRAINT later. Adding a foreign key is lighter than most constraints, but it still takes SHARE ROW EXCLUSIVE on both tables; validation is safer, not free.[1][13][2]
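The two-phase pattern, sketched with illustrative table and constraint names:

```sql
-- Step 1: add the constraint without scanning existing rows.
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_fkey
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Step 2: validate in a separate step. This scans existing rows but
-- holds only SHARE UPDATE EXCLUSIVE on the table being validated,
-- so writes continue during the scan.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;
```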
Adding UNIQUE or PRIMARY KEY directly to a hot table: usually build the unique index first, then attach it with ADD CONSTRAINT ... USING INDEX. Caveat: this works only with a plain eligible unique b-tree index, and attaching a primary key may trigger SET NOT NULL scans if the columns are not already provably not null.[1][11][2]
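A sketch of the two-step attach, with illustrative names:

```sql
-- Build the unique index online first (outside a transaction block).
CREATE UNIQUE INDEX CONCURRENTLY users_email_key ON users (email);

-- Then attach it as a constraint: a short, metadata-level step.
-- Only a plain unique b-tree index (no expression columns, no WHERE
-- clause) is eligible for USING INDEX.
ALTER TABLE users
  ADD CONSTRAINT users_email_key UNIQUE USING INDEX users_email_key;
```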
SET NOT NULL on an existing populated column: expect a full scan unless a validated proving check such as CHECK (col IS NOT NULL) already exists.[1][2]
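A sketch of the proving-check pattern, assuming PostgreSQL 12 or later (which is when SET NOT NULL learned to use a validated check as proof); names are illustrative:

```sql
-- Step 1: add the check without scanning.
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID;

-- Step 2: validate with a weaker lock; this is where the scan happens.
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- Step 3: SET NOT NULL can now skip its own full scan, then the
-- helper check can be dropped.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;
```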
Adding a column with a default: ask for PostgreSQL version. On PostgreSQL 11 and later, a non-volatile default avoids a table rewrite. Volatile defaults still rewrite the table.[9][1][2]
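The distinction, sketched with illustrative columns:

```sql
-- Non-volatile default: metadata-only on PostgreSQL 11+, no rewrite.
ALTER TABLE orders ADD COLUMN source text DEFAULT 'web';

-- Volatile default: still rewrites the whole table, even on new versions.
-- Prefer adding the column without a default and backfilling in batches.
ALTER TABLE orders ADD COLUMN external_id uuid DEFAULT gen_random_uuid();
```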
ALTER COLUMN ... TYPE: treat as high risk until proven otherwise. Many type changes rewrite the table; some binary-compatible or widening changes can avoid the rewrite, though indexes may still need rebuilding.[1][2]
Renames, type swaps, and drops in mixed-version deployments: review them as application-compatibility hazards. Prefer expand-contract: add the new shape, dual-write if needed, backfill, shift reads, then remove the old path later.[2][8]
Large backfills: run them outside the DDL transaction, batch them, throttle them, and make them idempotent and resumable. A single giant UPDATE can create long locks, huge WAL bursts, replica lag, and vacuum debt.[2][12]
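One batch of such a backfill might look like the sketch below; an external driver commits, sleeps to throttle, and repeats until zero rows are updated. Column names and the batch size are illustrative:

```sql
-- One idempotent, resumable batch. NULL new_total marks unprocessed rows,
-- so a crashed run can simply restart.
WITH batch AS (
  SELECT id
  FROM orders
  WHERE new_total IS NULL
  ORDER BY id
  LIMIT 5000
  FOR UPDATE SKIP LOCKED   -- avoid stalling behind concurrent writers
)
UPDATE orders o
SET new_total = o.legacy_total
FROM batch
WHERE o.id = batch.id;
-- Commit here, throttle, check replica lag, then run the next batch.
```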
Partitioned tables: do not assume parent indexes can be built concurrently. PostgreSQL does not support CREATE INDEX CONCURRENTLY on the partitioned parent. The usual workaround is an invalid parent index plus concurrent child indexes and attachment.[10][7][11]
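The workaround, sketched for a single illustrative partition:

```sql
-- 1. Create an invalid shell index on the parent only (fast, metadata-level).
CREATE INDEX measurements_ts_idx ON ONLY measurements (recorded_at);

-- 2. Build each child's index concurrently.
CREATE INDEX CONCURRENTLY measurements_2024_ts_idx
  ON measurements_2024 (recorded_at);

-- 3. Attach each child index; the parent index becomes valid once every
-- partition has an attached index.
ALTER INDEX measurements_ts_idx
  ATTACH PARTITION measurements_2024_ts_idx;
```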
Identity and sequence changes: distinguish rewrite risk from cutover risk. Adding an identity column rewrites the table; changing sequence behavior on an existing column does not rewrite existing rows. In logical replication, sequence state is not replicated.[1][4]
Treat large scans, rewrites, index builds, and backfills as replica events, not just primary events. Ask how much WAL they will generate and how much lag the topology can tolerate.[3][18]
If replicas serve reads, ask whether long-running standby queries can be canceled by replay conflicts during the migration.[15]
If the system uses logical replication, ask whether schema changes are being applied on both sides in the right order and how sequences will be synchronized. DDL and sequence state are not replicated automatically.[4][16][17]
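A sketch of an explicit sequence fix at cutover; the sequence name and the headroom of 1000 are illustrative:

```sql
-- Sequence state is not replicated logically. At cutover, bump the
-- subscriber's sequence safely past the maximum value seen on the
-- old primary, with headroom for in-flight writes.
SELECT setval('orders_id_seq',
              (SELECT COALESCE(MAX(id), 0) + 1000 FROM orders));
```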
Prefer roll-forward over naive rollback for migrations that backfill data, build indexes, or run across mixed-version application deploys. Demand a concrete failure plan either way.[6][8]
Always ask: what PostgreSQL version is this, and are any version-specific assumptions in the plan actually true?[9][1][11]
How large are the affected tables and indexes, and what are their read and write rates?[6][12]
Will the app tolerate both old and new schema states during deploy, or is this a breaking change that needs expand-contract?[2][8][6]
What are the timeout settings, monitoring plan, abort thresholds, and operator steps if lock waits, lag, or WAL growth exceed expectations?[6][3][15][16]
ALWAYS use this exact template.[5]
## Summary verdict
- APPROVE | APPROVE WITH CONDITIONS | NEEDS CHANGES | DO NOT RUN AS WRITTEN
## Severity
- Low | Medium | High | Critical
## Blocking risks
- List each dangerous operation and the concrete failure mode.
## Why these are risky
- Explain lock level, scan or rewrite behavior, replica impact, and app-compatibility impact.
## Safer rollout plan
- Give the safer sequence step by step.
- Split schema change, backfill, validation, and cleanup into separate phases when needed.
## Missing context / questions
- List unanswered questions that could change the recommendation.
## Rollback or roll-forward plan
- Say which is safer here and why.
- Name the exact point of no easy rollback if one exists.
## Monitoring during execution
- Name what to watch: lock waits, migration runtime, replica lag, WAL growth, query cancellations, invalid indexes, error rate.
Example 1:[5]
Input:
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id) REFERENCES customers(id);
Context: orders has 400M rows, high write traffic, streaming replicas serve reads.
Output:
## Summary verdict
- NEEDS CHANGES
## Severity
- High
## Blocking risks
- Adding the foreign key directly still takes SHARE ROW EXCLUSIVE locks on both tables.
- Existing rows must be checked, so the operation can run long on a 400M-row table.
- If orders(customer_id) is not indexed, constraint validation and future deletes or key updates on customers become much more expensive.
- Replica lag or standby query cancellations may increase during validation.
## Why these are risky
- This is lighter than many ADD CONSTRAINT forms, but it is not metadata-only.
- Validation scans existing rows; on a hot table that can create prolonged load and replica pressure.
## Safer rollout plan
- Create an index on orders(customer_id) with CREATE INDEX CONCURRENTLY if it does not already exist.
- Add the foreign key as NOT VALID.
- Validate the constraint in a separate step.
- Run with known lock_timeout and statement_timeout settings and monitor replica lag.
## Missing context / questions
- Is orders(customer_id) already indexed?
- What is the acceptable replica lag budget?
- Do replicas serve user-facing reads?
- What are the timeout settings and abort thresholds?
## Rollback or roll-forward plan
- Roll-forward is usually safer after the supporting index is built.
- If validation causes issues, stop before validation or defer it; do not assume a clean full rollback is safer.
## Monitoring during execution
- Watch lock waits, validation runtime, replica lag, standby query cancellations, and application write latency.
The foreign-key rollout above uses the standard safer sequence for large existing tables: add the supporting index concurrently, add the constraint NOT VALID, then validate later with lighter validation locks.[13][1][2]
Example 2:
Input:
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;
Context: id is the primary key, 1.2B rows, many child foreign keys, logical subscriber for migration cutover.
Output:
## Summary verdict
- DO NOT RUN AS WRITTEN
## Severity
- Critical
## Blocking risks
- A one-shot type change will usually rewrite the table and indexes.
- Rewriting a 1.2B-row primary-key table can take a long time, needs extra disk, and can create large WAL volume.
- Child foreign keys and dependent indexes make the blast radius larger.
- Logical replication will not copy DDL or sequence state automatically.
## Why these are risky
- This plan compresses schema change, data movement, and cutover into one step.
- It is hostile to replicas, failover, and rollback.
## Safer rollout plan
- Add a new bigint column.
- Dual-write old and new values with a trigger or equivalent application path.
- Backfill in batches with commits, throttling, and resumable progress tracking.
- Build the new unique index concurrently.
- Migrate child references with staged columns and NOT VALID / VALIDATE CONSTRAINT.
- Cut over in a short final metadata step.
- Synchronize sequences explicitly if logical replication or subscriber promotion is involved.
## Missing context / questions
- What Postgres version is this?
- How much free disk is available for rewrite-like operations?
- What lag is acceptable on replicas or subscribers?
- How will sequences be synchronized at cutover?
- What is the tested cutover and abort plan?
## Rollback or roll-forward plan
- Design this as an expand-contract rollout and prefer roll-forward.
- After dual writes and backfill begin, a naive down migration is not a real safety plan.
## Monitoring during execution
- Watch batch runtime, dead tuples, replica or subscriber lag, WAL retention, trigger errors, and index build state.
A one-shot ALTER COLUMN ... TYPE on a huge primary-key table is usually the wrong plan. The safer pattern is expand-contract in database form: new bigint columns, dual writes, batched resumable backfill, concurrent index build, staged child constraints, then a short final cutover.[12][8]
CREATE INDEX CONCURRENTLY is safer for availability, not free. It still scans twice, waits on writers and old snapshots, and can fail leaving an invalid index behind.[7][19][14]
NOT VALID plus VALIDATE CONSTRAINT is the usual safe pattern for foreign keys and checks. It still scans data during validation.[1]
PostgreSQL 11 and later changed the answer for adding a column with a non-volatile default. Do not cargo-cult pre-11 rewrite warnings onto newer clusters.[9][1]
Partitioned-table indexing has parent-specific limits. A parent CREATE INDEX CONCURRENTLY is not supported.[10][7]
Logical replication does not copy schema changes or sequence state. Subscriber cutovers need explicit schema choreography and sequence fixes.[4][17]
For destructive or partially executed migrations, a down migration is often theater. Say when roll-forward is safer.[6][8]