Part 1: When Your Time-Series Data Turns Against You...

Migrating 3.9 Billion Rows to TWCS Without Downtime on a Constrained IoT Cluster

The number that started everything was 7.99 ms. That was the average read latency on our Cassandra keyspace — a cluster seeing 181 writes for every single read.

Most teams only notice a number like that when dashboards start timing out. We noticed it because we were looking for it. What followed was a month of compaction runs, tombstone archaeology, device forensics, and one very honest conversation about what it means for a cluster to be fixed versus merely managed.

This is the first part of that story.

Note: Replace iot_platform, ts_data, and ts_partitions with your own keyspace and table names throughout this post. All commands are copy-paste ready.

The Cluster We Were Running

Our infrastructure was quite modest for the volume it handled:

Hardware: Three Cassandra 5.0.4 nodes, RF=2. Each node had 2 CPU cores and 7.7 GB RAM.
JVM Heap: Allocated at 2 GB.
The Workload: An IoT telemetry platform storing sensor readings from industrial machines.

By the time we started this investigation, the cluster had absorbed approximately 3.9 billion writes and 21.5 million reads — a precise write-to-read ratio of 181:1.

That ratio matters more than it might seem. Cassandra is designed for write-heavy workloads, but the way data ages and expires is a read-time problem. You write fast, but you pay at query time.

Our compaction strategy was SizeTieredCompactionStrategy (STCS) — the default, the one you get if you never thought about it. We had never thought about it

What `nodetool tablestats` Was Telling Us

Running nodetool tablestats on the primary telemetry table gave us this revealing output:

A 29.97% droppable tombstone ratio means nearly a third of everything Cassandra reads on this table is already dead. It is expired TTL data sitting around waiting to be cleaned up, slowing every query that touches it.

The speculative retries told the same story from the read side: 218,555 reads timed out and had to be retried on another replica. The cluster was working incredibly hard to serve data through a graveyard of expired rows.

What a Tombstone Actually Is

Every row written with a TTL leaves a tombstone when it expires. Cassandra does not delete data in place; it marks it deleted and cleans it up during compaction, but only after gc_grace_seconds has passed. Until then, reads have to scan through the dead rows to find the live ones.

At a 30% droppable ratio, a read for 1,000 live rows might scan through 300 or more dead ones first. Multiply that across every dashboard query, every API call, and every device history lookup — and you get 7.99 ms where you should be seeing 1–2 ms.

Why STCS is the Wrong Strategy for Time-Series Data

SizeTieredCompactionStrategy works by merging SSTables of similar size. This is fine for workloads with uniform data distribution — user profiles, product catalogues, or anything where data is written roughly evenly and queried by key.

Time-series IoT data is not that. It is written in strict time order: every sensor reading is newer than the one before it. Monthly partition boundaries create natural expiry cliffs — all of January expires at roughly the same time, all of February a month later.

STCS does not know or care about time. When it merges SSTables, it mixes data from different months into the same output file.

When January's rows expire and become tombstones, those tombstones are now spread across multiple SSTables that also contain live data from February, March, and April. To clean them up, Cassandra has to completely rewrite those files — scanning every row to separate the dead from the living.

The TWCS Alternative

TimeWindowCompactionStrategy (TWCS) keeps each time window in its own dedicated SSTable. When an entire window expires, the whole file can be dropped without scanning a single row.

No rewrite. No tombstone scan. Just straight disk deletion. That is the difference between managed tombstone cleanup and structural tombstone cleanup.

Choosing the Migration Parameters

1. The MONTHS Unit Failure

Our telemetry platform uses monthly partition buckets. All sensor readings from January 2026 land in a partition with a bucket timestamp, all of February in the next, and so on. The natural TWCS window size would theoretically be one calendar month.

Cassandra documentation describes compaction_window_unit as accepting MINUTES, HOURS, and DAYS. Some versions also imply MONTHS. However, we found out the hard way on Cassandra 5.0.4 that MONTHS throws an explicit error:

ConfigurationException: MONTHS is not valid for compaction_window_unit

Our Fallback: 30 DAYS. This is a close approximation but not a perfect match. TWCS windows are calculated from epoch 0 in fixed 30-day increments, while calendar months vary from 28 to 31 days.

The Consequence: A ThingsBoard-style monthly partition may occasionally straddle two TWCS windows. This is minor; tombstone cleanup simply happens in two passes instead of one for affected partitions. We accepted this trade-off.

2. Window Size: 30 Days vs. Documentation Advice

The Cassandra documentation recommends selecting a window size that produces approximately 20–30 active windows.

90 days TTL / 3 day window  = 30 windows  ← Docs recommendation
90 days TTL / 30 day window = 3 windows   ← Our deliberate choice

We chose 30 days instead of 3 as a deliberate, hardware-constrained decision. At 3-day windows, our steady-state SSTable count would sit around 30 files. Each compaction job would be proportionally smaller — roughly 3 GB rather than 30 GB. On a well-resourced cluster, that is absolutely the right call.

But on a 2-core node with 7.7 GB RAM and a JVM heap of 2 GB, we had a major concern: compaction jobs that exceed available memory abort silently. We had already seen evidence of this via 49 aborted compactions on one of our nodes. While smaller jobs are safer on constrained hardware, the overhead of managing 30 SSTables instead of 3 adds steep CPU cost at every compaction cycle. We chose fewer, larger SSTables.

If your hardware is less constrained, the 3-day window is worth reconsidering. The documentation recommendation exists for good reason.

3. `unsafe_aggressive_sstable_expiration`

This flag allows TWCS to drop entire SSTables when all their rows have expired TTLs — without checking whether any of those rows contain tombstones that might shadow data on other nodes.

The documentation calls this "potentially dangerous" because in certain multi-node repair scenarios, dropping a tombstone before all replicas see it can cause deleted data to reappear.

For our specific workload, it is entirely safe because our IoT telemetry is append-only. Rows are written once with a TTL and never updated or explicitly deleted. There are no cross-SSTable shadowing relationships. Furthermore, our daily incremental repair runs complete in under 5 minutes for the entire cluster, well within our 2-day gc_grace_seconds window. All nodes are consistently in sync before any tombstone becomes eligible for aggressive expiration.

The flag must be set as a JVM argument at startup; it cannot be enabled inline at query time. In our Docker Compose setup, we added it like this:

environment:
  - JVM_OPTS=-Dcassandra.allow_unsafe_aggressive_sstable_expiration=true

Warning: The ALTER TABLE statement succeeds silently even if this JVM flag is missing. You will only discover it's missing by checking ps aux in the process list, or when you notice tombstones aren't being cleaned up aggressively.

The Migration Strategy

No Snapshot: A Deliberate Choice

Standard migration advice dictates taking a snapshot before any major schema change. We explicitly chose not to.

Our cluster has an RF=2 layout, meaning every row exists on exactly two nodes. With the cluster completely healthy and all three nodes live, each node's data was already replicated elsewhere. A snapshot would have consumed approximately 52 GB per node. With a 51.8 GB ts_data table across three nodes at RF=2, that required 155 GB of disk space we simply didn't have available.

Additionally, the revert path for a compaction strategy change is instantaneous: ALTER TABLE back to STCS, and you're done. The schema change is immediately reversible. While the compaction work itself is not reversible (SSTables rewritten under TWCS keep their new layout), the data remains perfectly intact. There is nothing to recover.

Executing the ALTER TABLE

The actual schema change is a single CQL statement. It is instantaneous, updating schema metadata and gossiping the change to all nodes. No SSTables are rewritten immediately; existing STCS SSTables migrate to TWCS windows gradually in the background.

ALTER TABLE iot_platform.ts_data WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '30',
  'unsafe_aggressive_sstable_expiration': 'true'
};

If you are running this via Docker Compose natively in an interactive terminal, you might run:

docker exec -it node-1 cqlsh -e "ALTER TABLE iot_platform.ts_data WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '30', 'unsafe_aggressive_sstable_expiration': 'true'};"

Throttling Compaction I/O

The migration compaction was initially throttled to 50 MB/s per node to avoid saturating disk I/O during core business hours. Note that setcompactionthroughput is a per-node command and does not propagate across the cluster automatically. You must set it on each node individually:

docker exec node-1 nodetool setcompactionthroughput 50
docker exec node-2 nodetool setcompactionthroughput 50
docker exec node-3 nodetool setcompactionthroughput 50

Once the migration successfully completes, you can restore throughput to unlimited (0):

docker exec node-1 nodetool setcompactionthroughput 0

Stopping Writes: An Unplanned Accelerant

Mid-migration, we made the call to stop all incoming writes to the table. This wasn't in our original playbook.

With writes paused, compaction transformed into pure consolidation. No new memtable flushes competed for disk I/O, and no temporary SSTables were created while old ones were being merged. The 31-minute estimate we had under the 50 MB/s throttle dropped significantly once the throttle was lifted and writes were paused. More importantly, it completely saved our struggling node.

Troubleshooting the `node-1` Abort Problem

One of our three nodes had a history of 49 aborted compactions prior to starting this migration. During the TWCS migration, it kicked off a massive 93 GB compaction job, reached 77% completion, and then vanished entirely from compactionstats and compactionhistory.

It was a clean abort with zero errors in the system logs.

The Diagnosis: The node was running at the absolute edge of its memory envelope (2 CPU cores, a mere 125 MB of free system RAM during the job, and 66% CPU usage).

The Culprit: TWCS compaction requires merge buffers proportional to the number of SSTables being merged. On a 93 GB job touching all 16 SSTables simultaneously, those buffers exceeded what our tight 2 GB JVM heap could cleanly allocate.

What finally worked: Running the compaction overnight with writes completely paused. This resulted in much lower memory pressure from zero concurrent write activity, lower CPU contention, and a slightly smaller 82 GB job file (as some data had naturally expired and been cleaned up between our attempts).

The job successfully ran to completion overnight, clearing our droppable tombstones down to 0%. Compaction abort thresholds on constrained hardware are real structural constraints, not minor edge cases. If your nodes run close to their memory limits, a compaction job that succeeds seamlessly at 2 AM may abort every single afternoon.

The Results

Here is how our primary metrics looked before and after the TWCS migration:

Metric	Before	After	Change
Read Latency (Keyspace Avg)	7.99 ms	1.90 ms	-76%
Droppable Tombstone Ratio	29.97%	0.00000%	-100%
SSTable Count	16	3–6	Consolidated
Speculative Retries / Day	~1,000+	~150	-85%
Compaction Strategy	STCS	TWCS	Migrated
Tombstone Cleanup Method	Row-by-row	Whole SSTable drop	Structural

The read latency improvement from 7.99 ms to 1.90 ms is the win that mattered most to our users. Every dashboard query, every device history lookup, and every API call touching this table instantly became 76% faster.

The best part? This optimization came entirely from eliminating tombstone scanning overhead — no hardware upgrades, no complex schema redesigns, and zero application code changes.

Coming Up Next...

In Part 2, we will cover what happened next: why hit hitting 0.00000% tombstones wasn't actually the finish line, how unexpected gc_grace_seconds timing made all of this progress appear to completely reverse four days later, and what it actually looks like to build an IoT cluster that cleanly manages its own tombstone lifecycle automatically. Stay tuned!

Part 1: When Your Time-Series Data Turns Against You...

Migrating 3.9 Billion Rows to TWCS Without Downtime on a Constrained IoT Cluster

The Cluster We Were Running

What `nodetool tablestats` Was Telling Us

What a Tombstone Actually Is

Why STCS is the Wrong Strategy for Time-Series Data

The TWCS Alternative

Choosing the Migration Parameters

1. The MONTHS Unit Failure

2. Window Size: 30 Days vs. Documentation Advice

3. `unsafe_aggressive_sstable_expiration`

The Migration Strategy

No Snapshot: A Deliberate Choice

Executing the ALTER TABLE

Throttling Compaction I/O

Stopping Writes: An Unplanned Accelerant

Troubleshooting the `node-1` Abort Problem

The Results

Coming Up Next...

Comments

More from this blog

When Your Fix Creates a New Problem: Debugging the Production Partition Migration

When Your Database Becomes the Bottleneck: How We Rebuilt Our IoT Telemetry Pipeline Without Taking It Down

Beyond the Model: How Interaction Design Makes AI Useful

What 365 days of telemetry data taught us about building in the real world

Command Palette

Migrating 3.9 Billion Rows to TWCS Without Downtime on a Constrained IoT Cluster

The Cluster We Were Running

What nodetool tablestats Was Telling Us

What a Tombstone Actually Is

Why STCS is the Wrong Strategy for Time-Series Data

The TWCS Alternative

Choosing the Migration Parameters

1. The MONTHS Unit Failure

2. Window Size: 30 Days vs. Documentation Advice

3. unsafe_aggressive_sstable_expiration

The Migration Strategy

No Snapshot: A Deliberate Choice

Executing the ALTER TABLE

Throttling Compaction I/O

Stopping Writes: An Unplanned Accelerant

Troubleshooting the node-1 Abort Problem

The Results

Coming Up Next...

Comments

More from this blog

What `nodetool tablestats` Was Telling Us

3. `unsafe_aggressive_sstable_expiration`

Troubleshooting the `node-1` Abort Problem