Cassandra lessons from desert deployment

[April 18, 2021] – Was working on a story on industrial analytics, and took a short sidebar to attend an interesting webcast created by DataStax. "Time Series Data Management at Scale with Apache Cassandra" concerned itself with an Edge IoT experiment in the desert. During the relatively brief session, a lot of ground was covered. Some aspects struck me, and I thought them blog-worthy. The moral of the story, for me, anyway, concerned the momentum that drives individuals to pursue innovative projects, and the extent to which they have to moderate that momentum and first set the project up for success.

Here goes…

There’s a natural drive to get projects moving – that’s true in IoT work as much as in any other kind of software development. The abundance of new tools for data development owes its existence to a willingness to code first and ask questions later.

This showed especially as innovative developers built new Hadoop and NoSQL applications that took down the sign on the glass door that read: “Only SQL spoken here.”

But the result was often a running system that was hard to query, whether the locus was Hadoop, Cassandra, MongoDB or what have you. An email blast system, for example, could scream like a banshee, and blow the top off relational alternatives. But when the marketing department wanted to figure out how the email contacts reacted, it required developer time.

“Code first and model later” may always be with us, but it may recede for a while as distributed data methods for IoT mature. That is seen with time-series-oriented IoT computing now happening on the edge of the Internet.

This came to mind as I recently sat in on a webcast sponsored by DataStax, the prominent Cassandra distributed database maker. It featured new team member Frank Sepulveda, data architect, who discussed the PhD work he undertook prior to joining DataStax.

In 2019 in Nevada, not far from the usual site of the Burning Man festival, Sepulveda deployed an array of 144 Raspberry Pi modules. The purpose was to examine the grounds for seismic activity. Over the course of seven days, he and a support team gathered 622 sample readings (or 1.73 Gbytes) per day. The work was woven together with the DataStax Enterprise (DSE) Cassandra-based data management system.

What the modules found about the movement of the earth was not disclosed in the webcast. That is okay, because the focus was meant to be on the mechanics of building an IoT sensor array.

One can think of this as an edge system – it points to what IoT and data processing on the edge may look like in coming years. That brings us to lessons learned – a main one being on upfront data modeling.

It’s critical to start with a data model, particularly when working with time series data, Sepulveda said.

Here’s more:

“How Cassandra writes data to disk is very special, using what essentially is a distributed hash table, which is formed with a partitioning key and a clustering column.
Essentially what Cassandra is able to do is to read and write time series data sequentially from disk, with the ability to write up to 2 billion timestamp value pairs to disk sequentially.”

That timestamp-value pair estimate is a theoretical ceiling. In practice, you wouldn’t want a partition to get that big. “Taking time to learn how to do proper data modeling with Apache Cassandra is extremely important, particularly when you’re working with time series data,” Sepulveda emphasized.
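
To make the partition-size point concrete, here is a minimal sketch of one common way to keep time series partitions bounded: bucketing by sensor and day. This was not shown in the webcast. The sensor IDs, the table layout in the comment, and the function names are all hypothetical, and the Python only simulates how rows would group and sort; it does not talk to a Cassandra cluster.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical CQL table this sketch mimics (an assumption, not the
# schema used in the desert deployment):
#
#   CREATE TABLE readings (
#       sensor_id text, day date, ts timestamp, value double,
#       PRIMARY KEY ((sensor_id, day), ts)
#   ) WITH CLUSTERING ORDER BY (ts ASC);
#
# (sensor_id, day) is the partition key; ts is the clustering column.


def partition_key(sensor_id: str, ts: datetime) -> Tuple[str, str]:
    """Bucket a reading into one partition per sensor per UTC day."""
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


def group_readings(readings):
    """Group (sensor_id, ts, value) tuples the way the table above would:
    one list per partition, sorted by timestamp within the partition."""
    partitions = defaultdict(list)
    for sensor_id, ts, value in readings:
        partitions[partition_key(sensor_id, ts)].append((ts, value))
    for rows in partitions.values():
        rows.sort(key=lambda row: row[0])  # clustering order: ts ascending
    return dict(partitions)


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        ("pi-001", datetime(2019, 8, 1, 23, 59, tzinfo=utc), 0.1),
        ("pi-001", datetime(2019, 8, 2, 0, 1, tzinfo=utc), 0.2),
        ("pi-001", datetime(2019, 8, 1, 12, 0, tzinfo=utc), 0.3),
    ]
    for key, rows in group_readings(sample).items():
        print(key, len(rows))
```

With daily buckets, no single partition grows without limit, and a query for one sensor on one day reads a single partition in timestamp order. The right bucket size (hour, day, week) depends on the sampling rate, which is exactly the kind of question that upfront modeling is meant to settle.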

Otherwise, he continued, “it’s like having a sports car and driving it around in first gear.”
He shared some more advice, which is to say, ‘measure twice and cut once.’ The rush to start coding should be tempered by some planning.

“You know, oftentimes there’s this kind of [rush to] just get something into production,” Sepulveda said. A better course is to “take the time and understand the resources that are available from a diagnostics perspective.”

Among the tools he recommended in this regard is NoSQLBench, which more or less instruments and monitors a system to show you tradeoffs that affect a system’s ability to query. Kind of a CAP Theorem in a box?

Speaking of the Vaunted Theorem … Sepulveda discussed it as he outlined database decisions that led him (before joining DataStax, and as a student in the desert) to build his sensor station on DSE Cassandra. What was missing from the discussion, in my humble opinion, was some mention of how Cassandra compares with the growing ranks of distributed databases explicitly dedicated to time series data processing. – Jack Vaughan