Part 1: When Your Time-Series Data Turns Against You...

Migrating 3.9 Billion Rows to TWCS Without Downtime on a Constrained IoT Cluster
The number that started everything was 7.99 ms. That was the average read latency on our Cassandra keyspace — a cluster seeing 181 writes for every single read.
Most teams only notice a number like that when dashboards start timing out. We noticed it because we were looking for it. What followed was a month of compaction runs, tombstone archaeology, device forensics, and one very honest conversation about what it means for a cluster to be fixed versus merely managed.
This is the first part of that story.
Note: Replace
iot_platform,ts_data, andts_partitionswith your own keyspace and table names throughout this post. All commands are copy-paste ready.
The Cluster We Were Running
Our infrastructure was quite modest for the volume it handled:
Hardware: Three Cassandra 5.0.4 nodes, RF=2. Each node had 2 CPU cores and 7.7 GB RAM.
JVM Heap: Allocated at 2 GB.
The Workload: An IoT telemetry platform storing sensor readings from industrial machines.
By the time we started this investigation, the cluster had absorbed approximately 3.9 billion writes and 21.5 million reads — a precise write-to-read ratio of 181:1.
That ratio matters more than it might seem. Cassandra is designed for write-heavy workloads, but the way data ages and expires is a read-time problem. You write fast, but you pay at query time.
Our compaction strategy was SizeTieredCompactionStrategy (STCS) — the default, the one you get if you never thought about it. We had never thought about it
What nodetool tablestats Was Telling Us
Running nodetool tablestats on the primary telemetry table gave us this revealing output:
A 29.97% droppable tombstone ratio means nearly a third of everything Cassandra reads on this table is already dead. It is expired TTL data sitting around waiting to be cleaned up, slowing every query that touches it.
The speculative retries told the same story from the read side: 218,555 reads timed out and had to be retried on another replica. The cluster was working incredibly hard to serve data through a graveyard of expired rows.
What a Tombstone Actually Is
Every row written with a TTL leaves a tombstone when it expires. Cassandra does not delete data in place; it marks it deleted and cleans it up during compaction, but only after gc_grace_seconds has passed. Until then, reads have to scan through the dead rows to find the live ones.
At a 30% droppable ratio, a read for 1,000 live rows might scan through 300 or more dead ones first. Multiply that across every dashboard query, every API call, and every device history lookup — and you get 7.99 ms where you should be seeing 1–2 ms.
Why STCS is the Wrong Strategy for Time-Series Data
SizeTieredCompactionStrategy works by merging SSTables of similar size. This is fine for workloads with uniform data distribution — user profiles, product catalogues, or anything where data is written roughly evenly and queried by key.
Time-series IoT data is not that. It is written in strict time order: every sensor reading is newer than the one before it. Monthly partition boundaries create natural expiry cliffs — all of January expires at roughly the same time, all of February a month later.
STCS does not know or care about time. When it merges SSTables, it mixes data from different months into the same output file.
When January's rows expire and become tombstones, those tombstones are now spread across multiple SSTables that also contain live data from February, March, and April. To clean them up, Cassandra has to completely rewrite those files — scanning every row to separate the dead from the living.
The TWCS Alternative
TimeWindowCompactionStrategy (TWCS) keeps each time window in its own dedicated SSTable. When an entire window expires, the whole file can be dropped without scanning a single row.
No rewrite. No tombstone scan. Just straight disk deletion. That is the difference between managed tombstone cleanup and structural tombstone cleanup.
Choosing the Migration Parameters
1. The MONTHS Unit Failure
Our telemetry platform uses monthly partition buckets. All sensor readings from January 2026 land in a partition with a bucket timestamp, all of February in the next, and so on. The natural TWCS window size would theoretically be one calendar month.
Cassandra documentation describes compaction_window_unit as accepting MINUTES, HOURS, and DAYS. Some versions also imply MONTHS. However, we found out the hard way on Cassandra 5.0.4 that MONTHS throws an explicit error:
ConfigurationException: MONTHS is not valid for compaction_window_unit
Our Fallback: 30 DAYS. This is a close approximation but not a perfect match. TWCS windows are calculated from epoch 0 in fixed 30-day increments, while calendar months vary from 28 to 31 days.
The Consequence: A ThingsBoard-style monthly partition may occasionally straddle two TWCS windows. This is minor; tombstone cleanup simply happens in two passes instead of one for affected partitions. We accepted this trade-off.
2. Window Size: 30 Days vs. Documentation Advice
The Cassandra documentation recommends selecting a window size that produces approximately 20–30 active windows.
90 days TTL / 3 day window = 30 windows ← Docs recommendation
90 days TTL / 30 day window = 3 windows ← Our deliberate choice
We chose 30 days instead of 3 as a deliberate, hardware-constrained decision. At 3-day windows, our steady-state SSTable count would sit around 30 files. Each compaction job would be proportionally smaller — roughly 3 GB rather than 30 GB. On a well-resourced cluster, that is absolutely the right call.
But on a 2-core node with 7.7 GB RAM and a JVM heap of 2 GB, we had a major concern: compaction jobs that exceed available memory abort silently. We had already seen evidence of this via 49 aborted compactions on one of our nodes. While smaller jobs are safer on constrained hardware, the overhead of managing 30 SSTables instead of 3 adds steep CPU cost at every compaction cycle. We chose fewer, larger SSTables.
If your hardware is less constrained, the 3-day window is worth reconsidering. The documentation recommendation exists for good reason.
3. unsafe_aggressive_sstable_expiration
This flag allows TWCS to drop entire SSTables when all their rows have expired TTLs — without checking whether any of those rows contain tombstones that might shadow data on other nodes.
The documentation calls this "potentially dangerous" because in certain multi-node repair scenarios, dropping a tombstone before all replicas see it can cause deleted data to reappear.
For our specific workload, it is entirely safe because our IoT telemetry is append-only. Rows are written once with a TTL and never updated or explicitly deleted. There are no cross-SSTable shadowing relationships. Furthermore, our daily incremental repair runs complete in under 5 minutes for the entire cluster, well within our 2-day gc_grace_seconds window. All nodes are consistently in sync before any tombstone becomes eligible for aggressive expiration.
The flag must be set as a JVM argument at startup; it cannot be enabled inline at query time. In our Docker Compose setup, we added it like this:
environment:
- JVM_OPTS=-Dcassandra.allow_unsafe_aggressive_sstable_expiration=true
Warning: The
ALTER TABLEstatement succeeds silently even if this JVM flag is missing. You will only discover it's missing by checkingps auxin the process list, or when you notice tombstones aren't being cleaned up aggressively.
The Migration Strategy
No Snapshot: A Deliberate Choice
Standard migration advice dictates taking a snapshot before any major schema change. We explicitly chose not to.
Our cluster has an RF=2 layout, meaning every row exists on exactly two nodes. With the cluster completely healthy and all three nodes live, each node's data was already replicated elsewhere. A snapshot would have consumed approximately 52 GB per node. With a 51.8 GB ts_data table across three nodes at RF=2, that required 155 GB of disk space we simply didn't have available.
Additionally, the revert path for a compaction strategy change is instantaneous: ALTER TABLE back to STCS, and you're done. The schema change is immediately reversible. While the compaction work itself is not reversible (SSTables rewritten under TWCS keep their new layout), the data remains perfectly intact. There is nothing to recover.
Executing the ALTER TABLE
The actual schema change is a single CQL statement. It is instantaneous, updating schema metadata and gossiping the change to all nodes. No SSTables are rewritten immediately; existing STCS SSTables migrate to TWCS windows gradually in the background.
ALTER TABLE iot_platform.ts_data WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': '30',
'unsafe_aggressive_sstable_expiration': 'true'
};
If you are running this via Docker Compose natively in an interactive terminal, you might run:
docker exec -it node-1 cqlsh -e "ALTER TABLE iot_platform.ts_data WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '30', 'unsafe_aggressive_sstable_expiration': 'true'};"
Throttling Compaction I/O
The migration compaction was initially throttled to 50 MB/s per node to avoid saturating disk I/O during core business hours. Note that setcompactionthroughput is a per-node command and does not propagate across the cluster automatically. You must set it on each node individually:
docker exec node-1 nodetool setcompactionthroughput 50
docker exec node-2 nodetool setcompactionthroughput 50
docker exec node-3 nodetool setcompactionthroughput 50
Once the migration successfully completes, you can restore throughput to unlimited (0):
docker exec node-1 nodetool setcompactionthroughput 0
Stopping Writes: An Unplanned Accelerant
Mid-migration, we made the call to stop all incoming writes to the table. This wasn't in our original playbook.
With writes paused, compaction transformed into pure consolidation. No new memtable flushes competed for disk I/O, and no temporary SSTables were created while old ones were being merged. The 31-minute estimate we had under the 50 MB/s throttle dropped significantly once the throttle was lifted and writes were paused. More importantly, it completely saved our struggling node.
Troubleshooting the node-1 Abort Problem
One of our three nodes had a history of 49 aborted compactions prior to starting this migration. During the TWCS migration, it kicked off a massive 93 GB compaction job, reached 77% completion, and then vanished entirely from compactionstats and compactionhistory.
It was a clean abort with zero errors in the system logs.
The Diagnosis: The node was running at the absolute edge of its memory envelope (2 CPU cores, a mere 125 MB of free system RAM during the job, and 66% CPU usage).
The Culprit: TWCS compaction requires merge buffers proportional to the number of SSTables being merged. On a 93 GB job touching all 16 SSTables simultaneously, those buffers exceeded what our tight 2 GB JVM heap could cleanly allocate.
What finally worked: Running the compaction overnight with writes completely paused. This resulted in much lower memory pressure from zero concurrent write activity, lower CPU contention, and a slightly smaller 82 GB job file (as some data had naturally expired and been cleaned up between our attempts).
The job successfully ran to completion overnight, clearing our droppable tombstones down to 0%. Compaction abort thresholds on constrained hardware are real structural constraints, not minor edge cases. If your nodes run close to their memory limits, a compaction job that succeeds seamlessly at 2 AM may abort every single afternoon.
The Results
Here is how our primary metrics looked before and after the TWCS migration:
| Metric | Before | After | Change |
|---|---|---|---|
| Read Latency (Keyspace Avg) | 7.99 ms | 1.90 ms | -76% |
| Droppable Tombstone Ratio | 29.97% | 0.00000% | -100% |
| SSTable Count | 16 | 3–6 | Consolidated |
| Speculative Retries / Day | ~1,000+ | ~150 | -85% |
| Compaction Strategy | STCS | TWCS | Migrated |
| Tombstone Cleanup Method | Row-by-row | Whole SSTable drop | Structural |
The read latency improvement from 7.99 ms to 1.90 ms is the win that mattered most to our users. Every dashboard query, every device history lookup, and every API call touching this table instantly became 76% faster.
The best part? This optimization came entirely from eliminating tombstone scanning overhead — no hardware upgrades, no complex schema redesigns, and zero application code changes.
Coming Up Next...
In Part 2, we will cover what happened next: why hit hitting 0.00000% tombstones wasn't actually the finish line, how unexpected gc_grace_seconds timing made all of this progress appear to completely reverse four days later, and what it actually looks like to build an IoT cluster that cleanly manages its own tombstone lifecycle automatically. Stay tuned!
