Skip to main content

Command Palette

Search for a command to run...

When Your Database Becomes the Bottleneck: How We Rebuilt Our IoT Telemetry Pipeline Without Taking It Down

Part 1 of 2 — The Problem, The Design, The Migration

Updated
11 min read
When Your Database Becomes the Bottleneck: How We Rebuilt Our IoT Telemetry Pipeline Without Taking It Down

There is a particular kind of engineering problem that does not announce itself. It sneaks up on you. One quarter your system handles the load fine. The next quarter you are running manual interventions every week. The quarter after that, your on-call rotation is getting paged at 2 AM because a status check timed out.

This is the story of one of those problems and how we designed our way out of it.


The System

We run an IoT platform that collects real-time telemetry from industrial air compressors deployed across customer sites. Every compressor in the field sends data continuously, ranging from operational parameters, performance metrics, fault signals.

That data first lands in a PostgreSQL table that we call our raw telemetry layer. Think of it as the intake queue for everything downstream: rule engines, analytics pipelines, alerting systems, data downloads, AI models, customer dashboards.

The table had a simple job. Accept writes fast. Hold data for a few days. Hand it off to an archive layer. Keep moving.

For a while, it did that job well.


The Slow Deterioration

The first signs were subtle. WAL replication to our read replicas started showing backpressure. We run a read-from-replica, write-to-primary pattern — standard practice — but we started seeing reads fail because the replica was lagging behind the primary. Queries that expected fresh data were hitting stale rows. In some cases the application development gave up and fell back to reading from the primary, which compounded the write load.

Then the data download requests started timing out. Operations teams, field teams requesting exports of raw historical telemetry, a routine operation were getting failures or waiting far longer than they should.

Then the analytics queues started throwing statement cancelled errors. These are the jobs that pull raw telemetry into our data platform. When they fail, data gaps appear in dashboards and model training pipelines.

The system status checks which query this table to compute processing latencies started timing out periodically. Which meant our alerting was unreliable precisely when we needed it most.

And underneath all of this, the root cause was something we understood but had not yet solved: table bloat.


Why The Table Was Bloating

PostgreSQL's MVCC model means that when you update or delete a row, the old version is not immediately removed. It becomes a dead tuple — invisible to new queries but still occupying space on disk. A background process called autovacuum is supposed to reclaim that space. But autovacuum has limits, and we were pushing past them in two ways.

First: the write pattern. This table is not just an append log. Every row gets updated multiple times as it moves through the processing pipeline — once when it is acknowledged, once when it is synced to the IoT platform, once when it is handed to the analytics layer. With hundreds of thousands of rows per day from a growing device fleet, the update churn is enormous. Every update creates a dead tuple.

Second: the daily archival job. Each night, we move rows older than a week from the main table to an archive table. The mechanism was a CTE — DELETE ... RETURNING ... INSERT — a single atomic operation that moves rows cleanly. But from PostgreSQL's perspective, deleting a million rows in one transaction creates a million dead tuples in one shot. Autovacuum then has to clean all of them up before the next night's job runs.

The default autovacuum configuration triggers based on a percentage of table size. On a table with 8 million rows, it only kicks in after roughly 1.6 million dead tuples accumulate. We were hitting that threshold every few days, autovacuum was running reactively rather than proactively, and between runs the dead tuples were piling up and bloating the table.

We had tried the obvious mitigations. We tuned autovacuum thresholds. We added indexes. We reduced the retention window from 30 days to 15 days to 7 days. Each change bought us time but not a solution. The problem gets worse every time we add a customer or expand a customer's device fleet — and we are doing both, constantly.

The real issue was structural: we were using row-level operations to solve a problem that needed partition-level operations.


The Design

The solution we landed on was declarative range partitioning on last_modified, with daily partitions and a two-tier retention pipeline.

Instead of running a DELETE job every night, the maintenance script does this:

No rows are moved. No rows are deleted. No dead tuples are created. No VACUUM is needed. The entire operation is essentially free.

Some decisions we made deliberately:

No primary keys on either table. PostgreSQL requires a primary key on a partitioned table to include the partition key — which would mean PRIMARY KEY (id, last_modified). That does not enforce uniqueness on id alone. Since id comes from an identity sequence and the tables are append-only, the constraint adds no real value and we omitted it.

Archive starts empty after migration. We had roughly 60 days of historical data in the old archive table. Rather than spend time backfilling a complex partitioned structure, we preserved the old table as a rollback safety net and let the new archive fill naturally from day one of the maintenance cron. A trade-off we made consciously.

Calendar-day partition boundaries, not rolling hours. The T-8 move happens on calendar day alignment — the partition covering May 29 moves on June 6, regardless of whether some rows in it are 7 days and 23 hours old vs 8 days and 1 hour old. The precision loss is acceptable for this use case.

Indexes on the main table, lighter set on archive. The main table carries six indexes per partition — covering device lookups, time-range queries, and two partial indexes that serve the sync jobs scanning for unprocessed rows. When a partition moves to archive, the partial indexes are dropped (they are never queried on archive) and the partition carries four indexes forward.


The Migration

We chose a blue/green approach with two separate scripts: a setup script that runs well before any downtime window, and a swap script that runs during a short planned downtime. The goal was to make the actual downtime window as small as possible — seconds, not minutes.

This separation matters. The most time-consuming parts of any migration are the backfill and the index builds. On a table with millions of rows, those operations take time. If you do them during the downtime window, your application is down for all of it. If you do them beforehand on a shadow table while the old system is still running, your downtime window shrinks to just the final synchronisation and the rename.

The Setup Script — Runs Before Downtime

The setup script creates the new partitioned tables alongside the existing ones. The old tables are untouched and continue serving production traffic throughout.

The sequence:

  1. Create both partitioned tables (live table and archive table) as shadow tables

  2. Create the initial set of partitions on the live table covering the retention window

  3. Backfill the last 8 days of data from old to new. This is the bulk of the data movement, takes time but it happens while production runs normally

  4. Build all 6 indexes after the backfill, not during — building an index on a pre-populated table via a single scan is significantly faster than maintaining it incrementally row by row

  5. Run ANALYZE to generate fresh planner statistics on the new table — skipping this step would leave the query planner flying blind post-swap

  6. Verify row counts against the source - a gate check before declaring setup complete

At the end of the setup script, you have a fully populated, fully indexed, fully analysed shadow table sitting alongside the live one. The live system has not noticed anything.

The Swap Script — The Actual Downtime Window

The downtime window starts when writes are halted and the swap script begins. By this point most of the work is already done. What remains is minimal:

  1. Delta sync — pull any rows written to the old table since the backfill completed. This is typically a small number — minutes or at max hours of writes, not days.

  2. Row count gate — the execution confirms the counts look right before committing.

  3. Atomic four-table rename — the heart of the swap:

BEGIN;
  ALTER TABLE live_table          RENAME TO live_table_old;
  ALTER TABLE archive_table       RENAME TO archive_table_old;
  ALTER TABLE live_table_new      RENAME TO live_table;
  ALTER TABLE archive_table_new   RENAME TO archive_table;
COMMIT;

Four tables renamed in one transaction. Either all four succeed or none do. From the application's perspective, the table names never change.

  1. Identity sequence correction — the new table's id sequence is advanced past the highest id in the delta sync. This is done after the rename, not before, because the delta sync used explicit id values and did not consume the sequence.

The application restarts pointing at the same table names it always used. It has no knowledge that the underlying structure changed.

The total downtime window is the delta sync time plus the rename transaction. The rename itself completes in milliseconds. The delta sync depends on how much time elapsed since the backfill — but with a busy table, scheduling the swap within a few hours of the backfill keeps it small.

The Confidence Period

After the swap, the old tables are preserved for 24-48 hours. Rollback is a single transaction — the same four-table rename in reverse. We drop the old tables only after we are certain the new system is behaving correctly.

This confidence period turned out to be essential. We will cover exactly why in Part 2.


What We Were Trading

Honesty about trade-offs matters. Here is what we gave up:

The sync job UPDATE pattern became more expensive. The sync jobs that mark rows as processed run UPDATE WHERE id = ANY(<list of 50 ids>). On a partitioned table with no last_modified filter, PostgreSQL has to check every partition — it cannot prune because id is not the partition key. We discovered this post-migration and it led to a significant debugging episode. We will cover that in Part 2.

The system is operationally more complex. A missed cron run means an extra partition sits in the main table. Two missed runs risks an insert failure if the pre-create also failed. The failure modes are more varied than a simple DELETE job. We mitigated this with idempotent partition creation and guard logic in the maintenance script, but the complexity is real.

Replication lag from write volume is not solved. The migration eliminates the WAL pressure caused by bulk DELETEs and VACUUM cycles. But the underlying write throughput from a growing device fleet is unchanged. We addressed the structural bloat problem. The scale problem is a separate challenge.


The Result

The daily maintenance cron now runs in seconds. No dead tuples. No VACUUM. No bloat accumulation.

WAL replication backpressure from the maintenance cycle is gone. The replica lag that was causing read failures has stabilised. Data download timeouts have not recurred. The analytics queue failures dropped to zero on the day after migration.

The status check queries that were timing out periodically now: no issues.

The retention window is holding at 8 days in the main table and 31 days in archive. When we need to change those numbers, it is a one-line argument change in the cron invocation — no schema changes, no script changes.

And the fleet keeps growing. Which is exactly the point.


In Part 2, we will cover what happened after the migration went live — a sync job that went from milliseconds to 25 seconds, three wrong hypotheses, and what EXPLAIN ANALYZE finally told us. More importantly, we will talk about why keeping the old system running throughout was not just a safety net — it was the engineering strategy that let us move fast and fix things without ever losing a row of data.


We build software for industrial IoT at scale. If problems like this interest you, we are always looking for engineers who think carefully about systems under pressure.