Part 2: When 0% Tombstones Is Not the Finish Line

Four days after hitting 0.00000% droppable tombstones on all three nodes simultaneously, the cleanest state the cluster had ever been in, the Slack report showed 30% again.


The compaction had worked. The timing had not.

This is the part of the story that rarely gets written about, because it happens after the blog post that covers the migration. The migration was the right call. What we learned in the weeks after it is what separates a cluster that's been cleaned from one that's actually healthy.


The gc_grace_seconds Timing Effect

After the TWCS migration compaction passes completed on all three nodes, tombstone ratios hit zero. Five days later they were climbing back toward 30%.

[Figure: The gc_grace_seconds timing gap. Compaction runs on day 0 and cleans the tombstones that are already droppable. Rows still inside the 2-day gc_grace_seconds window are invisible to it; they cross the window over the following days and push the tombstone ratio back toward 30%. The fix is a second compaction pass once those tombstones have cleared the window.]

Here's what happened.

When the compaction ran, it cleaned up tombstones that were already droppable — rows that had expired their TTL and aged past gc_grace_seconds (set to 2 days on our cluster). But device-A, a high-frequency sensor writing at 1 row/second, had rows expiring continuously. Rows still within the 2-day grace window at compaction time could not be cleaned. They finished aging through the grace window in the days after compaction and appeared as droppable tombstones.

The compaction did not fail. The timing meant it could not see everything that would eventually become droppable.
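The timing can be made concrete with a small sketch. It follows the simplified model described above, in which an expired row can be purged only once its write time plus TTL plus gc_grace_seconds has passed. The function name and the compaction date are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

GC_GRACE = timedelta(days=2)   # gc_grace_seconds on the cluster described above
TTL = timedelta(days=90)       # retention applied at write time


def droppable_at(written: datetime, ttl: timedelta = TTL,
                 gc_grace: timedelta = GC_GRACE) -> datetime:
    """Earliest time a compaction can purge the row, under the simplified model."""
    return written + ttl + gc_grace


# Hypothetical compaction date, for illustration only.
compaction = datetime(2024, 6, 10)

# A row whose TTL expired 3 days before compaction is already past gc_grace.
old_row = compaction - TTL - timedelta(days=3)
# A row whose TTL expired 1 day before compaction is still inside the grace window.
recent_row = compaction - TTL - timedelta(days=1)

for name, written in [("old_row", old_row), ("recent_row", recent_row)]:
    ready = droppable_at(written)
    if ready <= compaction:
        print(name, "cleaned by this compaction")
    else:
        print(name, "becomes droppable", ready - compaction, "after compaction")
```

With a writer producing rows continuously, there is always a two-day slice of expired rows in the second category, which is why a single pass cannot leave the ratio at zero for long.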

The fix was a second compaction pass a few days later, once those tombstones had fully cleared the grace window. It hit zero again — and this time durably, because the root cause had also been addressed.

Operational note: Before concluding that a compaction aborted, always check nodetool compactionhistory. Aborted jobs do not appear in the history. Completed jobs do. This distinction is how we separated a clean completion from a silent abort on the node with 49 historical aborts.


Finding the Offending Device

nodetool toppartitions: A Syntax Trap

The tool to sample active write traffic and surface hot partitions is nodetool toppartitions. The syntax changed between Cassandra versions, and a failed invocation gives no hint about what went wrong:

# Wrong — flag syntax, fails silently in 5.x
nodetool toppartitions iot_platform ts_data -d 60000 -k 10

# Correct — duration is a positional argument
nodetool toppartitions iot_platform ts_data 60000

Duration in milliseconds is positional, not a flag. 60000 samples for 60 seconds.

What We Found

Device-A dominated the write frequency list across all three nodes with the same 10+ key names: sensor_freq, sensor_pressure, sensor_temp, and several others — all writing at approximately 1 write/second across every key simultaneously.

The math is blunt

The original tablestats top-partitions-by-tombstone-count list had already shown device-A with 456,528 tombstones per key in a single monthly partition. That's what 1-per-second writes look like after just 5 days of TTL expiry.
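As a rough check of that figure, the arithmetic can be sketched as follows. It assumes a steady 1 write/second per key, so that each second of expiry yields one tombstone.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

writes_per_second = 1            # device-A's rate per key
observed_tombstones = 456_528    # per key, from the tablestats list

# Expired rows per key after exactly 5 days of expiry at this rate.
expected_after_5_days = writes_per_second * SECONDS_PER_DAY * 5

# Days of expiry that the observed count corresponds to.
days_of_expiry = observed_tombstones / (writes_per_second * SECONDS_PER_DAY)

print(expected_after_5_days)       # 432000
print(round(days_of_expiry, 2))    # 5.28
```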

Checking the TTL

The first instinct was misconfiguration — maybe the TTL wasn't being set at write time. The verification query:

SELECT ts, ttl(dbl_v), ttl(long_v)
FROM iot_platform.ts_data
WHERE entity_type = 'DEVICE'
AND entity_id = <device-A-uuid>
AND key = 'sensor_freq'
AND partition = <current-partition>
LIMIT 5;

Result: ttl(long_v) ≈ 7,600,000 seconds — about 88 days remaining, consistent with a 90-day retention policy applied correctly at write time.

The device was not misconfigured. It was generating tombstones at the rate that follows naturally from 1 write/second with a 90-day TTL. The infrastructure was working exactly as designed and producing exactly the tombstone volume that math predicts.
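Reading a ttl() value is easier in days. The sketch below converts the reading and, assuming the row was written with the full 90-day TTL, estimates how long ago it was written. The helper name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 90  # retention policy stated above


def ttl_summary(ttl_remaining_s: int, retention_days: int = RETENTION_DAYS):
    """Return (days remaining, estimated days since write) for a ttl() reading."""
    remaining = ttl_remaining_s / SECONDS_PER_DAY
    return remaining, retention_days - remaining


remaining, age = ttl_summary(7_600_000)
print(f"{remaining:.1f} days remaining, written about {age:.1f} days ago")
```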


The Business Requirement

Device-A writing every 10 seconds and device-B writing every 15 seconds are non-negotiable. These are industrial monitoring devices where sampling rate reflects real physical measurement requirements. Slower writes mean lost resolution on the sensors that matter most to the operations team.

This is the moment where an infrastructure problem becomes an architecture problem. The cluster has to absorb this write frequency. The question is how to manage the tombstone consequences sustainably.

The Numbers at Fixed Write Rates

Device    Interval  Rows/day/key  Keys  Rows/day (all keys)  Rows per 90-day cycle
device-A  10 sec    8,640         11    95,040               ~8.5M
device-B  15 sec    5,760         11    63,360               ~5.7M

Combined: approximately 14 million tombstones per 90-day cycle from two devices.
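The table values follow directly from the write intervals. A minimal sketch of the calculation, assuming every row written eventually expires into one tombstone and each device writes 11 keys:

```python
SECONDS_PER_DAY = 86_400
CYCLE_DAYS = 90
KEYS_PER_DEVICE = 11

intervals_s = {"device-A": 10, "device-B": 15}  # seconds between writes, per key


def rows_per_cycle(interval_s: int) -> int:
    """Rows written across all keys of one device in a 90-day cycle."""
    rows_per_day_per_key = SECONDS_PER_DAY // interval_s
    return rows_per_day_per_key * KEYS_PER_DEVICE * CYCLE_DAYS


total = 0
for device, interval in intervals_s.items():
    per_cycle = rows_per_cycle(interval)
    total += per_cycle
    print(device, per_cycle)

print("combined", total)
```

The combined figure comes out slightly above 14 million, matching the rounded number above.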

TWCS with unsafe_aggressive_sstable_expiration manages this sustainably — but it requires understanding that the tombstone ratio will cycle rather than stay flat. Accumulate, auto-drop, accumulate, auto-drop. This is not a problem to eliminate. It is the expected steady state of a time-series cluster handling high-frequency writes.

A cluster requiring a scheduled compaction every 6–8 weeks is not broken. It is managed. Knowing the difference is what determines whether you're responding to incidents or running a maintenance schedule.

The One Lever That Was Negotiable: TTL

Device-B's write frequency was fixed. Its retention period was not.

After a conversation with the operations team, it turned out that 30 days of 15-second-resolution data covered every real use case — dashboards, incident investigation, trend analysis. Nobody had ever queried 90-day-old data at 15-second granularity. The default 90-day TTL was inherited, not intentional.

Reducing device-B's TTL to 30 days in the platform cut its tombstone generation by 3× with zero visible product impact. Rows written after the change carried 30-day TTLs. Existing rows continued aging under 90-day TTLs and expired naturally.

Verifying the change took effect

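Query rows written before and after the TTL change and compare their ttl() values. Rows written after the change should report at most 30 days (2,592,000 seconds) remaining, while rows written before it can still report more.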
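One way to make that comparison less ambiguous is to recover the TTL each row was originally written with: the remaining ttl() plus the row's age. The sketch below assumes the query also selects writetime() for the same column and that write timestamps are in the default microseconds since the Unix epoch. The function name and sample readings are hypothetical.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def original_ttl_days(ttl_remaining_s: int, writetime_us: int, now: datetime) -> float:
    """Estimate the TTL a cell was written with, in days.

    ttl_remaining_s: value of ttl(col), in seconds
    writetime_us:    value of writetime(col), microseconds since the Unix epoch
    now:             time the query ran (timezone-aware)
    """
    age_s = now.timestamp() - writetime_us / 1_000_000
    return (ttl_remaining_s + age_s) / SECONDS_PER_DAY


# Hypothetical readings, for illustration only.
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
written = datetime(2024, 6, 21, tzinfo=timezone.utc)   # 10 days before `now`
writetime_us = int(written.timestamp() * 1_000_000)

print(round(original_ttl_days(20 * SECONDS_PER_DAY, writetime_us, now)))  # 30
print(round(original_ttl_days(80 * SECONDS_PER_DAY, writetime_us, now)))  # 90
```

A result near 30 indicates a row written under the new policy; a result near 90 indicates a row still aging under the old one.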


The Historic Data Dump Pattern

Device-C appeared in toppartitions writing 68 times per 60-second sample — initially alarming, appearing to be another high-frequency offender in the same class as device-A.

TTL check on its rows:

ttl(long_v) = 5,214,294 seconds ≈ 60.3 days remaining

60 days remaining on a 90-day TTL means these rows were written approximately 30 days ago. At the time of the sample, device-C was writing normally at 1 message/minute. The 68-writes-per-60-seconds burst was a one-time event from a month earlier: the device had been offline, buffered its telemetry locally, and on reconnect pushed weeks of stored data in a rapid burst.

This is a distinct failure mode from a sustained high-frequency writer:

Pattern                  toppartitions signature    Cause                      Duration
High-frequency writer    High count, every sample   Write rate by design       Permanent
Buffered reconnect dump  High count, single sample  Offline → reconnect flush  One-time

The tombstone impact of a reconnect dump is real but time-bounded. All those buffered rows expire around the same time — 90 days after the dump date. The implication: note the dump date and schedule a targeted compaction for 90 days later. In our case, a burst from May 29 meant a concentrated tombstone expiry in late August. We added this to the operations calendar.
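The calendar entry is simple date arithmetic. In the sketch below the year is an assumption, since only the month and day are given, and adding gc_grace_seconds on top of the TTL is our reading of the timing effect described earlier rather than a required step.

```python
from datetime import date, timedelta

TTL_DAYS = 90
GC_GRACE_DAYS = 2  # gc_grace_seconds on this cluster, in days


def dump_schedule(dump_date: date):
    """Return (TTL expiry date, earliest date the tombstones are droppable)."""
    expiry = dump_date + timedelta(days=TTL_DAYS)
    return expiry, expiry + timedelta(days=GC_GRACE_DAYS)


# The year is assumed for illustration; the month and day come from the text.
expiry, droppable = dump_schedule(date(2024, 5, 29))
print(expiry)      # 2024-08-27
print(droppable)   # 2024-08-29
```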

The pattern to watch for: a device appearing in toppartitions with a high write count during a single 60-second sample, absent from subsequent samples, with TTL queries showing that its rows were written well before the sample date. That profile is a reconnect dump, not a firmware problem.
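The two signatures can be told apart mechanically once several samples have been collected. The following is a minimal sketch, not part of any Cassandra tooling: the function, its labels, and the threshold of 30 writes per sample are all hypothetical, and the counts are assumed to have been copied by hand from repeated toppartitions runs.

```python
def classify_writer(counts_per_sample, high_threshold=30):
    """Classify a device from its write counts across several 60-second samples.

    counts_per_sample: writes seen for the device in each sample, in order
    high_threshold:    hypothetical cutoff for a "high" count in one sample
    """
    if len(counts_per_sample) < 2:
        return "need more samples"
    high = [count >= high_threshold for count in counts_per_sample]
    if all(high):
        return "high-frequency writer"
    if sum(high) == 1:
        return "reconnect dump candidate"
    if any(high):
        return "inconclusive"
    return "normal"


print(classify_writer([60, 61, 59]))   # high in every sample
print(classify_writer([68, 1, 1]))     # high in a single sample
```

A "reconnect dump candidate" result still needs the TTL check described above before it goes on the calendar.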


Where This Leaves Us

By the end of this investigation:

  • The gc_grace_seconds timing effect is understood — a second compaction pass resolved the 30% rebound

  • Device-A's write frequency is fixed by business requirement; its tombstone volume is a known, manageable quantity

  • Device-B's TTL has been reduced from 90 to 30 days — a 3× reduction in future tombstone generation

  • Device-C's reconnect dump is a known one-time event with a tracked expiry date

The cluster is not at zero tombstones. It is on a known cycle, with the variables understood and the levers documented.

Part 3 covers what we built to run that cycle: the ts_partitions subplot that surfaced a second tombstone problem we hadn't been watching, the Slack monitoring system with its delta column and intervention thresholds, and what the first successful TWCS auto-drop looked like in practice, at the moment the cluster started managing itself.
