Skip to main content

Command Palette

Search for a command to run...

When Your Fix Creates a New Problem: Debugging the Production Partition Migration

Part 2 of 2 — Two Swap Cycles, One Rollback, and Why the Safety Net Made All the Difference

Updated
9 min read
When Your Fix Creates a New Problem: Debugging the Production Partition Migration

The migration went live just after midnight. The tables were renamed. The application restarted. Everything looked fine.

By 1 AM, the sync job — the background process that marks rows as processed and hands them to downstream pipelines — was taking 25 seconds per batch where it had taken milliseconds before. Not fast enough to cause immediate application failures, but slow enough to cascade: batches backed up, queues grew, processing latencies climbed.

We had just eliminated the bloat crisis. Now we had a different one.

What happened next is the part of this story that matters most, not because we found the answer quickly, but because of how we moved. We rolled back to the old system. Twice. We took planned downtime windows to re-run experiments on the new system. We changed application code, cut experimental release tags, and reverted them. We went back and forth between the old and new architecture in a controlled, deliberate way until we were certain.

No data was lost across any of it.


Hypothesis 1: Lock Contention

The first instinct was obvious. Partitioned tables involve more complex locking. Perhaps a maintenance operation or a concurrent write was holding a lock that the sync job was waiting on.

We queried for active lock waits:

SELECT pid, wait_event_type, wait_event, state, now() - query_start AS duration
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY duration DESC;

No results. Ran it again during an active sync job batch. Still nothing.

Wrong hypothesis. And importantly — this one we could test without changing anything. No rollback needed. We were still on the partitioned system.

The next hypothesis was different. It required a code change.


The First Rollback

The sync job UPDATE looked like this:

UPDATE telemetry_table
SET processed = true
WHERE id = ANY('{...50 ids...}')

On a partitioned table, when the query has no filter on the partition key (last_modified), PostgreSQL cannot prune — it has to inspect every partition. With 8 active partitions each holding over a million rows, that is a lot of work to find 50 rows.

The theory: if we added a last_modified filter, PostgreSQL could skip partitions outside the time range and narrow the scan dramatically.

But testing this theory required changing application code. We could not just run a query in a console, the sync job was managed code with a release cycle. We needed to cut an experimental tag, deploy it, and measure the result properly.

And that meant making a decision: do we stay on the partitioned system while we prepare the code change, or do we rollback?

We rolled back.

The reasoning was straightforward. The partitioned system was running 25-second sync batches. Every minute we stayed on it, queues were backing up. The old unpartitioned system was proven, stable, and sitting right there — four table renames away. The right thing to do was to restore production to a known-good state, work on the fix without time pressure, and come back to the partitioned system when we were ready to test properly.

BEGIN;
  ALTER TABLE live_table          RENAME TO live_table_new;
  ALTER TABLE archive_table       RENAME TO archive_table_new;
  ALTER TABLE live_table_old      RENAME TO live_table;
  ALTER TABLE archive_table_old   RENAME TO archive_table;
COMMIT;

The same four-table atomic rename from the swap script, in reverse. Production was back on the unpartitioned tables within seconds.


Swap Cycle 2: Testing the last_modified Filter

With production stable and no time pressure, we prepared the code change. The sync job's UPDATE query was modified to include a last_modified filter:

UPDATE telemetry_table
SET processed = true
WHERE id = ANY('{...50 ids...}')
  AND last_modified >= now() - interval '8 days'

An experimental release tag was cut. We were ready to test.

We took another downtime, ran the delta sync to pull all rows that had arrived on the old system since the previous swap, and brought the partitioned system live with the new application code. Matter of seconds because the swap script was idempotent.

8-day window result: still 20+ seconds.

The filter covered the full retention window — all 8 partitions — so PostgreSQL still had to check everything. No pruning benefit. We narrowed the window further without taking another swap cycle:

UPDATE telemetry_table
SET processed = true
WHERE id = ANY('{...50 ids...}')
  AND last_modified >= now() - interval '1 day'

1-day window result: ~5 seconds.

Better. The planner was now pruning to 1-2 partitions instead of all 8. But 5 seconds is still 10× slower than the 450ms baseline. And the approach felt wrong in a different way — a time filter on an UPDATE that is fundamentally id-based is fragile. If a row arrived slightly outside the window, it would be silently skipped. That is a data correctness risk, not just a performance concern.

We kept the partitioned system live and went looking for the real answer.


The Real Answer: EXPLAIN ANALYZE

We pulled EXPLAIN ANALYZE on the original query — no last_modified filter — and read the output carefully:

Update on telemetry_table
  ->  Seq Scan on telemetry_table_partition_jun06
       Filter: (id = ANY ('{422666688,422666687,...}'::bigint[]))
       Rows Removed by Filter: 969,950
       Actual Rows: 12
       Actual Time: 3847.221 ms

The planner was doing a full sequential scan on the June 6 partition — scanning 969,950 rows to find 12 matching ones. And doing this on every partition.

The reason: there was no index on id.

The old unpartitioned table had a plain (id) index — implicitly, via its primary key. When we built the new partitioned table, we created six indexes covering device lookups, time-range queries, and two partial indexes for the sync job's SELECT side. But we had NOT created a plain (id) index, because on the old table the primary key was the id index. On the new partitioned table, with no primary key, that index was simply missing.

The fix did not need another swap cycle. We added the index directly to the partitioned parent table under a short downtime to raw data ingestion:

CREATE INDEX ON telemetry_table (id);

PostgreSQL propagated it implicitly to all child partitions. Then we rolled the application back to the stable release tag — removing the last_modified filter entirely, returning the sync job to its original form:

UPDATE telemetry_table
SET processed = true
WHERE id = ANY('{...50 ids...}')

Execution time: 43 milliseconds. Back to baseline.

The last_modified filter experiment was discarded completely. The query is simpler, safer, and faster than the experimental version ever was.


What the Back-and-Forth Made Possible

Two swap cycles after the initial migration. One deliberate rollback to production. Three downtime windows in total. Multiple application experimental release tags cut and reverted.

That sounds like a lot of churn. It was not. Each step was controlled and intentional.

The rollback after Hypothesis 1 was not a failure — it was the right call. Keeping a degraded system running while you prepare a code change is the wrong way to operate. The old system was there. Using it was the correct decision.

The second swap cycle with the last_modified filter let us test the hypothesis properly — with real production data, real query patterns, real load — rather than guessing in a staging environment. The result told us the filter was a partial improvement but not a real solution. That was valuable information.

Staying on the partitioned system for the index addition meant we did not need a third full swap cycle. The index creation was a targeted, low-risk change. The rollback to the stable application tag was clean.

At every point, the decision of whether to rollback or continue was made based on one question: does the current state of the system allow us to work safely, or do we need to restore production first?

When the answer was "restore production first," we did it without hesitation. When the answer was "we can continue from here," we continued.


The Numbers

The exec time data from that night makes the progression clear.

Before migration: ~450ms, stable.

Post-migration, no changes: 25–27 seconds.

With last_modified 8-day filter: still 20+ seconds.

With last_modified 1-day filter: ~5 seconds.

After CREATE INDEX (id), stable release tag: 43ms.

Each data point was a real measurement from a real production system under real load, not a synthetic benchmark.


The Broader Philosophy

We started this migration because table bloat was a structural problem that no amount of tuning could outrun. The only real fix was a structural one and a structural fix requires confidence that you can move back if something goes wrong.

The blue/green architecture with the delta sync pattern gave us that confidence. Not in a theoretical sense, in a practical one. We used the rollback. We used the delta sync. We ran the swap script multiple times. Every one of those operations worked exactly as designed because we had built them to be repeatable, idempotent, and fast.

The engineering lesson is not "always rollback when things go wrong." It is more specific than that

Build your migration so that rollback is as easy as forward. If rollback is hard, you will avoid it even when it is the right call. If rollback is a 30-second atomic rename, you will use it exactly when you should — which means you can afford to be thorough, methodical, and unhurried even at 1 AM when queues are backing up.

The most important infrastructure you can build is the kind that lets you move fast without fear. That is not a principle for migrations alone. It is how reliable systems get built.


This was a real problem, real debugging, and real numbers from a production system. If problems like this interest you and you want to work on systems at industrial IoT scale, we would like to hear from you.

9 views
R

Excellent breakdown. The missing (id) index lesson is a reminder that fundamentals often matter more than complex optimizations. Also, having a safe rollback strategy made the migration far less risky.

At https://www.remotestate.com/ we see the same principle in scalable system design—move fast, but always make rollback easy.