Engineering

A firehose in a document store: moving OCPP command storage to Cassandra

Every charging station on the network talks to us constantly. Heartbeats, meter values, transaction events, a steady drumbeat of OCPP commands flowing in around the clock. It is append-only, it is rarely read back, and it never stops. For a long time all of it landed in MongoDB, and for a long time that was fine. Then it was not.

This is the story of moving that firehose into Cassandra, and of the mindset shift that the move demanded.

The problem: a time-series firehose in a document store

The access pattern for command data is narrow. In practice we want “the recent commands for this station”, plus the occasional lookup by message id. That is it. But the volume is anything but narrow:

  • circa 2.7 billion records accumulating every year
  • 4.4 TB of data, plus 2.9 TB of indexes
  • 57 million commands a day, roughly 1,400 writes a second (one per command, one per result)

We had already scaled the OCPP side of this. The servers that terminate station connections scale out horizontally, commands are distributed over a message stream, and a pool of multi-writer consumers handles them in parallel. The ingestion path was not the problem. The problem was the destination: every one of those writers ultimately funnelled into the same MongoDB, and that single store became the central bottleneck. You can add all the consumers you like, but if they all queue behind one write target, you have just moved the contention downstream.

MongoDB is a fine operational database, and it still runs the platform. But this workload is the wrong shape for it: an unbounded, write-heavy, time-ordered stream. We were paying for it in index bloat and write contention, and we were bumping into the ceiling of vertical scaling. Worst of all, this firehose was competing with the operational database for the very same resources. To go further we needed a store that could parallelise the writes themselves, spreading them across many nodes instead of one.

Why Cassandra

Cassandra is built for exactly this shape of data. Its LSM-tree storage turns writes into cheap, uniform appends to immutable files (SSTables), so high write throughput is the happy path rather than the exception. It scales horizontally and roughly linearly: add nodes, get capacity. A partition-per-station layout gives natural data locality and read/write parallelism. And TTL combined with time-window compaction makes retention and expiry almost free.

It is durable and reliable enough to be the single source of truth for this dataset, not just a cache in front of something else.

The mindset shift: model the query, not the entity

This is the part that trips up everyone arriving from a document database, and it is worth saying plainly.

In MongoDB you model the entity, store it, and add indexes later when the queries show up. In Cassandra you model the query first. You build one table per access pattern, you denormalise heavily, and you accept that there are no joins and no ad-hoc queries. If a new question needs asking, you often build a new table to answer it.

So we have a commands table holding the main data, with the filterable fields (connector id, evse id, trigger reason, and so on) extracted out of the JSON into real columns at write time. For message-id lookups we built a second, TTL-bound command_ids table rather than an index. That is deliberate: an index would fan a lookup out across every node, whereas a purpose-built table routes it to a single one. We only need those lookups for 24 hours, so the table simply expires them.

The primary key is the design

In Cassandra the primary key does two jobs at once, distribution and on-disk order, and getting it right is most of the work:

PRIMARY KEY ((station_id, day_bucket), method, created_at, command_id)

The partition key, (station_id, day_bucket), is the unit of distribution. It hashes to a token that selects the replica nodes, and it co-locates one station’s day together. The day_bucket is what keeps a partition bounded in size. Every read must pin both parts, so a time-window search walks one partition per day.

The clustering columns, method ASC, created_at DESC, command_id DESC, define the sort order inside the partition. Putting method first lets the server do a method IN (...) slice for us. created_at DESC means rows come back newest-first with no sort step, and command_id (a timeuuid) is the unique tiebreaker.

One station, one day

That partitioning choice pays off in how the data ages. A partition is exactly one station for one day, a few thousand rows even on the busiest stations. Because the rows are already sorted by method then time, reads return newest-first with no sort step. Different stations and days hash to different nodes, so load spreads naturally across the cluster. And a time-window search just walks the day buckets backwards, one bounded partition at a time.

Compaction is where the ageing story lands. Cassandra’s SSTables are immutable, unlike Mongo’s in-place B-tree updates. Time-window compaction groups data by time window, so when a TTL expires, whole files drop away rather than rows being picked out one by one. The 24-hour command_ids lookup table, for instance, ages out for free.

A small, deliberate query surface

You do not get to query Cassandra however you like, and that is the point. We query by partition key and clustering order only. Server-side pushdowns are limited (a method IN and a bounded created_at), and anything finer is filtered in the application. Sort order is fixed by the clustering columns. Two consequences of that constraint were interesting enough to build carefully.

Cursor pagination, because offsets do not exist

Cassandra has no skip/offset. Paging by offset would re-scan everything up to the current page every time, which gets worse the deeper you go. So paging is forward-only and watermark-based. The cursor is an opaque base64 of a tiny JSON watermark of the last row we sent:

{ "b": "<time_bucket>", "t": "<created_at>", "id": "<command_id>" }

To resume, we decode it, skip the buckets newer than the watermark (already sent), keep only the rows strictly after the watermark within its bucket, then carry on into older buckets. When the walk is exhausted, the next cursor comes back empty. The nice property is that it is insert-stable: a positional watermark does not shift the pages when new rows arrive, the way an offset would. And there is no slowdown as you page deeper, even across millions of rows.

Counting honestly

There is no cheap COUNT(*) here either: an exact count is a full range scan. So the count rides along on the same traversal as the data fetch rather than being a separate query. Once a page is full we simply keep tallying matches, bounded two ways: a cap (10,000 by default, 100,000 maximum) and a hard scan-safety ceiling (200,000 physical rows). That yields three honest outcomes for the UI:

  • Walk finishes before the cap: an exact total (“N of 4,213 results”).
  • Cap hit first: a lower bound (“N of more than 10,000 results”).
  • Scan budget tripped: unavailable (“N of many results”).

No pretending we counted more than we did.

Living without a primary

Operationally, Cassandra asks more of you. It is masterless and peer-to-peer, so there is no single point of failure, and consistency is tunable per query. Our production shape uses a replication factor of three with one replica in each of three availability zones, so the cluster survives losing a whole zone. Reading and writing at QUORUM means two of three replicas must agree, which gives read-your-writes and survives a node going down. Every write hits the commit log and memtable before it is acknowledged, and a token-aware client routes each request straight to a node that owns the partition, with no extra hop.

In exchange you take on new duties: repair, compaction tuning, per-node capacity planning, and a new failure model built on eventual consistency, hinted handoff, and read repair. The team has to actually learn Cassandra, not just use it.

Is it worth it?

Being honest about the trade matters more than selling the win.

What we give up: ad-hoc queries, joins, arbitrary sorting, offset paging, exact counts, and multi-document transactions. Denormalisation means duplicated data and more logic at write time. And there is real operational overhead.

What we get back: huge, uniform write throughput and linear scale. Bounded partitions, cheap retention, and no index bloat. No single-primary bottleneck, with high availability and multi-datacentre support if we ever need it. Lower storage cost at scale, predictable latency for the patterns we designed for, and a firehose that no longer competes with the operational database.

The lesson is the same one the deck closed on: this is the right tool for an append-only, time-ordered, station-scoped stream. The cost is rigidity and operational effort. The return is scale, cost, availability, and predictability. Pick the tool for the shape of the data, and go in with your eyes open on what it asks of you.