Part 3: Building a Cluster That Manages Itself

By the end of Part 2, the tombstone math was understood. Device-A's write frequency was a fixed cost. Device-B's TTL had been cut 3×. Device-C's reconnect dump had a calendared expiry date.

None of that made the cluster static. It made the cluster cyclical — and the next few weeks were about building the tooling to read that cycle, and discovering a second tombstone problem we hadn't even been watching.

The ts_partitions Subplot

The partition index table, which our platform uses to track which monthly time buckets exist for each device and metric key, is tiny compared to the main telemetry table. About 4 MB per node. We hadn't been monitoring it at all.

It surfaced when node-2 silently accumulated 33% droppable tombstones in ts_partitions, caught only during a manual tablestats inspection. Not by any alert, not by any dashboard. We simply happened to look.

The tombstones came from device keys being added and removed over each device's lifetime. Every key removal leaves a tombstone in the partition index. Device-B, active since October 2023, had accumulated 1,889 tombstones in this table from 2.5 years of key lifecycle changes: sensors added, sensors retired, keys renamed.

ts_partitions uses LeveledCompactionStrategy, appropriate for a small, read-heavy lookup table. LCS cleans up efficiently once tombstones become droppable, but only after they age past gc_grace_seconds, same as everywhere else.

We added ts_partitions to the monitoring script immediately after this. It now reports alongside ts_data in every Slack message, six-hourly, no exceptions.
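For anyone adding a second table to a similar script, the core of it is pulling one ratio per table out of the tablestats text. The following is a minimal sketch, not our production script: it assumes the output contains a "Table:" line followed by an "Estimated droppable tombstones:" line for each table, which is what we see on our version but may differ on yours. The sample text is illustrative, apart from the 33% figure mentioned above.

```python
import re

def droppable_ratios(tablestats_text):
    """Map table name -> estimated droppable tombstone ratio (a fraction).

    Assumes each table section starts with a 'Table: <name>' line and
    contains an 'Estimated droppable tombstones: <float>' line.
    """
    ratios = {}
    current_table = None
    for raw_line in tablestats_text.splitlines():
        line = raw_line.strip()
        table_match = re.match(r"Table: (\S+)", line)
        if table_match:
            current_table = table_match.group(1)
            continue
        ratio_match = re.match(r"Estimated droppable tombstones: (\S+)", line)
        if ratio_match and current_table is not None:
            ratios[current_table] = float(ratio_match.group(1))
    return ratios

# Illustrative sample only; real tablestats output has many more fields.
SAMPLE = """\
Keyspace : iot_platform
    Table: ts_data
    SSTable count: 5
    Estimated droppable tombstones: 0.02
    Table: ts_partitions
    SSTable count: 2
    Estimated droppable tombstones: 0.33
"""

if __name__ == "__main__":
    for table, ratio in droppable_ratios(SAMPLE).items():
        print(f"{table}: {ratio:.1%}")
```

Keeping the parser keyed by table name means a third supporting table costs nothing extra to report.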

The lesson here isn't about this specific table. It's that "the main table is healthy" and "the cluster is healthy" are different claims. Every supporting table — indexes, lookup tables, materialized views — has its own tombstone lifecycle and deserves the same monitoring discipline as the table everyone's actually querying.

The Nightly Repair Interaction

After we started compacting ts_partitions regularly, we noticed something stranger: a synchronized spike across all three nodes — the ratio jumping from 0% to roughly 19% within 6 hours, identically on every node.

The synchronization was the tell. Random accumulation from independent device activity does not produce identical numbers across three separate machines at the same moment. Something coordinated was happening.

The cause: our nightly incremental repair runs at 23:30. Repair triggers LCS to recompact the repaired token ranges in ts_partitions. When it does, it consolidates tombstone metadata that had been spread across multiple SSTables — making tombstones that were previously invisible to tablestats suddenly visible, all at once, on all three nodes, because repair touches all three nodes together.

This is compaction surfacing tombstones, not creating them. The spike resolves naturally within 24–48 hours as LCS finishes its cleanup pass. It isn't a problem. But it looked alarming the first few times, before the pattern was understood — which is exactly why it's worth writing down.

A Single-Node Experiment

We tested lowering gc_grace_seconds on ts_partitions specifically to 86400 (1 day), reasoning that LCS keeps this table almost continuously repaired, making the cluster's 2-day window unnecessarily conservative for this table, even though we'd decided against the same change for the main telemetry table.

One operational detail matters here: ALTER TABLE propagates cluster-wide via gossip. There's no per-node schema. To run a genuine single-node experiment, you apply the change and then immediately revert the other nodes to hold them as a control group:

With node-2 and node-3 holding the old setting, the Slack reports made the comparison directly visible.

node-1 showed lower, more stable ts_partitions ratios following each nightly repair spike, while the other two nodes continued the original pattern. A controlled experiment, run entirely through schema changes and a monitoring dashboard that already existed.

Building the Monitor

A Slack webhook script running every 6 hours turned out to be more useful than we expected going in. The piece that made it actionable rather than merely informational was the delta column — the change in tombstone ratio since the previous report.

NODE SSTABLES TS_DATA TS_PARTITIONS CHANGE
node-1 5 0.00000 0.00000 —
node-2 4 0.00000 0.00000 ▼-11.030%
node-3 4 0.00000 0.00000 ▼-10.775%

The ▼ symbols are TWCS auto-drop events — entire expired SSTables discarded wholesale. A large ▼ means the cluster cleaned itself up without anyone touching a terminal. A sustained ▲ above 1.5% across multiple consecutive reports is the signal to go looking for a new high-frequency device, the same way we found device-A and device-C in Part 2.

The mechanism behind the delta column is simple: a state file storing the previous run's tombstone ratios per node, diffed against the current run. Without it, the six-hourly numbers are just numbers. With it, they have direction, and direction is what turns a report into a decision.
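The state-file idea fits in a few lines. This is a sketch under assumptions rather than the script we run: the JSON layout, the function names, and the single ratio per node are all simplifications chosen for illustration, and a plain hyphen stands in for the "no change" marker.

```python
import json
from pathlib import Path

def load_previous(state_path):
    """Read the previous run's ratios, or return {} on the first run."""
    path = Path(state_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())

def format_delta(delta):
    """Render a change in percentage points the way the report column does."""
    if delta is None or delta == 0:
        return "-"  # first run for this node, or no change
    arrow = "▲" if delta > 0 else "▼"
    return f"{arrow}{delta:+.3f}%"

def run_report(state_path, current):
    """Diff current ratios (fractions) against the state file, then update it."""
    previous = load_previous(state_path)
    changes = {}
    for node, ratio in current.items():
        if node in previous:
            changes[node] = format_delta((ratio - previous[node]) * 100.0)
        else:
            changes[node] = format_delta(None)
    # Overwrite the state so the next run diffs against this one.
    Path(state_path).write_text(json.dumps(current))
    return changes
```

Calling run_report twice with the same path, first with a ratio of 0.11030 and then with 0.0, would render the second change as a ▼ of 11.030 points. The state file is deliberately dumb: it holds only the last run, because the report only needs direction, not history.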

Intervention Thresholds That Emerged From Operation

These weren't designed up front. They came from watching the reports for several weeks and noticing which patterns actually warranted action versus which were noise:

Signal	Action
`ts_data` > 20% on any node	Schedule manual compact within 48 hours
CHANGE > 1.5% for 3+ consecutive reports	Investigate new high-frequency device
`ts_partitions` > 5% on any node	`nodetool compact iot_platform ts_partitions`
Any node shows ▼ > 5%	TWCS auto-drop fired — healthy, no action
Tombstone warnings in logs > 1,000/day for one device	Check write frequency and TTL

The last row is the one that's easy to skip building, and the most valuable once it exists. Log-based alerting catches things the ratio-based monitor structurally can't see until hours later.
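The ratio-based rows of that table translate almost directly into code. The sketch below is one way to encode them, assuming ratios arrive as fractions and CHANGE values as percentage points, oldest first. The function name and return format are hypothetical, and the log-based row is left out because it needs a separate log parser.

```python
def evaluate(ts_data, ts_partitions, changes):
    """Return the actions suggested by the threshold table for one node.

    ts_data, ts_partitions: droppable tombstone ratios as fractions (0.20 = 20%).
    changes: recent CHANGE values in percentage points, oldest first,
             with the current report last.
    """
    actions = []
    if ts_data > 0.20:
        actions.append("schedule manual compact of ts_data within 48 hours")
    if ts_partitions > 0.05:
        actions.append("nodetool compact iot_platform ts_partitions")
    # A sustained climb: the last three reports all rose by more than 1.5 points.
    recent = changes[-3:]
    if len(recent) == 3 and all(change > 1.5 for change in recent):
        actions.append("investigate new high-frequency device")
    # A large drop in the latest report is a TWCS auto-drop, not a problem.
    if changes and changes[-1] < -5.0:
        actions.append("TWCS auto-drop fired: healthy, no action")
    return actions

if __name__ == "__main__":
    print(evaluate(0.0, 0.0, [-9.517]))
    print(evaluate(0.25, 0.07, [1.6, 2.0, 1.7]))
```

Returning a list rather than a single verdict matters here, because the rows are independent: a node can need a ts_partitions compact in the same report that shows a healthy auto-drop on ts_data.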

The Self-Managing Cluster

The first time TWCS dropped an entire expired SSTable automatically (no manual compact, no intervention, nobody watching), the Slack report showed this:

node-1 9 0.00000 0.00000 ▼-9.517%

Four SSTables gone. 9.5 percentage points of droppable tombstones eliminated in a single compaction cycle. The unsafe_aggressive_sstable_expiration flag doing precisely what it was configured to do back in Part 1: identifying an entire 30-day window where every row had passed both its TTL and gc_grace_seconds, and deleting the file wholesale rather than scanning it row by row.

Node-2 and node-3 followed within 24 hours, each on its own schedule, each without anyone running a command.

The self-managing cycle now reads like this: tombstones accumulate from device writes, TWCS recognizes when a time window has fully expired, the whole SSTable drops, the ratio falls sharply, accumulation begins again. In the Slack reports this shows up as a sawtooth — a slow climb, a sharp drop, repeat, on a timescale measured in weeks.

This is the part worth sitting with: the sawtooth is not a problem to eliminate. It's the expected behavior of a correctly configured time-series compaction strategy carrying a write load that includes fixed-frequency industrial sensors. A permanently flat 0% line would require either no writes (not a useful database) or infinite compaction resources (not a real cluster). The goal from the start was never a flat line. It was making the cleanup automatic, predictable, and bounded, and watching that first -9.517% drop happen with nobody at the keyboard was the moment that goal was confirmed met.

The Repair Log as Evidence

At one point during this process we seriously considered reducing gc_grace_seconds cluster-wide from 2 days to 1, reasoning that it would shrink the post-compaction rebound window described in Part 2. The repair log turned out to be the decisive piece of evidence against it:

node-1: Repair command #25 finished in 2 minutes 34 seconds

node-2: Repair command #24 finished in 1 minute 18 seconds

node-3: Repair command #24 finished in 9 seconds

Overall Duration: 00:04:12

Four minutes twelve seconds for a full incremental repair across the entire cluster. Node-3 finished in 9 seconds because it was already 99.84% repaired going in. This told us our cluster consistently repairs within a tiny fraction of the existing 2-day grace window — call it a 47-hour safety buffer between "repair completes" and "grace window closes."

Reducing gc_grace_seconds to 1 day would cut that buffer to roughly 23 hours. On a cluster where one node has a documented history of 49 aborted compactions tied to memory pressure, any extended maintenance event, disk pressure incident, or OOM kill that keeps a node offline for more than 23 hours becomes a genuine data-consistency risk: deleted rows reappearing because a node missed the repair that would have propagated their tombstones. The rebound pattern from Part 2 is visually annoying but costs nothing operationally. A narrower safety margin would cost something real if it ever got tested.

Measure your repair duration before touching gc_grace_seconds in either direction. The number it gives you is the floor your grace period needs to clear comfortably — not just on average, but on your worst realistic day.
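The buffer arithmetic is worth scripting so it gets rerun whenever repair times drift. This is a minimal sketch, assuming the repair duration is available as an HH:MM:SS string like the "Overall Duration" line above. The function names are ours for illustration, and the model only measures the gap between repair completion and the end of the grace window.

```python
from datetime import timedelta

def parse_duration(hms):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def safety_buffer(gc_grace_seconds, repair_duration):
    """Time left in the grace window once a repair of this length completes."""
    return timedelta(seconds=gc_grace_seconds) - repair_duration

if __name__ == "__main__":
    repair = parse_duration("00:04:12")
    # Compare the current 2-day setting with the proposed 1-day setting.
    for grace in (172800, 86400):
        buffer_hours = safety_buffer(grace, repair).total_seconds() / 3600
        print(f"gc_grace_seconds={grace}: buffer of {buffer_hours:.1f} hours")
```

Feed it your slowest observed repair rather than a typical one; the point of the exercise is the worst realistic day.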

What Remains Open

Two changes stayed on the list, deliberately deferred rather than forgotten:

TWCS window size. Cassandra's own documentation recommends roughly 20–30 active compaction windows for a given TTL. At our 90-day TTL, that works out to windows of 3 to 4.5 days. A 7-day window is the compromise we'd actually aim for: about 13 windows, smaller compaction jobs, less memory pressure on nodes that are already tight on RAM, and more frequent auto-drops instead of fewer larger ones. We kept 30-day windows because the cluster is stable today and the change means another full round of compaction work to re-bucket existing data. The documentation's idealized 3-day window is very likely too fine-grained for 2-core hardware regardless; 7 days is the more realistic target for this cluster, not the textbook one.
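The window arithmetic in that paragraph is easy to check for any TTL. A small sketch, with a helper name of our own choosing, that counts how many windows a TTL spans at a given window size:

```python
import math

def window_count(ttl_days, window_days):
    """Number of time windows needed to cover a TTL, rounding up."""
    return math.ceil(ttl_days / window_days)

if __name__ == "__main__":
    ttl_days = 90
    # Current setting, the realistic target, and the documentation's ideal.
    for window_days in (30, 7, 3):
        count = window_count(ttl_days, window_days)
        print(f"{window_days}-day windows over a {ttl_days}-day TTL: {count} windows")
```

At 90 days this gives 3 windows for the current 30-day setting, 13 for 7-day windows, and 30 for 3-day windows, which is where the gap between what we run and what the documentation suggests becomes concrete.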

Heap increase. Running Cassandra on a 2 GB JVM heap with 7.7 GB of total RAM, on a node managing 36 GB of SSTable data, is tight by any standard. The 49 compaction aborts on node-1 — the ones that drove a meaningful share of the diagnostic work across both this post and the last — trace directly to memory pressure during large merge operations. Going from 2G to 4G would very likely eliminate that failure mode outright. We deferred it because it requires another rolling restart and we'd already done several in close succession. It remains, honestly, the single highest-leverage infrastructure change still on the table.

Where the Series Lands

Three nodes. 3.9 billion writes. 2 cores and 7.7 GB of RAM each. A read latency that went from 7.99 ms to 1.90 ms and stayed there. A tombstone ratio that no longer sits at a fixed number, because the right number for a cluster like this was never zero — it's a bounded, self-correcting cycle with a monitoring system reading it every six hours and a documented set of thresholds for when a human needs to step in.

None of the individual pieces here were exotic. TWCS over STCS for time-series data is well-documented. gc_grace_seconds math is in the Cassandra docs. A Slack webhook is a cron job and a curl command. What made this work was less any single decision and more the discipline of checking compactionhistory before assuming an abort, querying TTLs before assuming misconfiguration, and measuring repair duration before touching a safety setting.

The principle that carried the whole series: make the change you understand before the change you're still learning.

The cluster runs itself now. Understanding why it runs itself is what made that possible.
