Building Applications With Cassandra: Experience And Gotchas

Recently I’ve summarized some experience on quickly getting started with Cassandra. In this post I’d like to continue with some of our experience using and operating Cassandra. Hopefully it is useful to you and helps you avoid unwanted surprises.

Election and Paxos

Cassandra is usually considered to favor the “AP” side of the “CAP” theorem: it gives up strong consistency in exchange for availability and performance, and guarantees only eventual consistency. But when really necessary, you can still leverage Cassandra’s built-in “Light-weight Transaction” (LWT) to run an election that determines a leader node in the cluster.

Basically, it works by writing to a table with your own lease:

INSERT INTO leases (name, owner) VALUES ('lease_master', 'server_1') IF NOT EXISTS;

The IF NOT EXISTS clause triggers Cassandra’s built-in Light-weight Transaction, which is backed by Paxos, and can be used to reach consensus across the cluster. Combined with a default TTL on the table, this can be used for lease control or master election. For example:

-- Rows expire 16 seconds after the last write unless refreshed.
CREATE TABLE leases (
    name text PRIMARY KEY,
    owner text
) WITH default_time_to_live = 16;

The lease owner then needs to keep writing to the lease row as a heartbeat; if it stops, the row expires and another node can claim the lease.
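To make the acquire, heartbeat, and expire cycle concrete, here is a minimal in-memory sketch in Python. It is not a Cassandra client and only imitates the semantics described above: the `LeaseTable` class and its `acquire` and `renew` methods are hypothetical names, and the renewal step assumes the owner heartbeats with a conditional update along the lines of `UPDATE leases SET owner = 'server_1' WHERE name = 'lease_master' IF owner = 'server_1'`, which is my assumption rather than something the setup above prescribes. Time is passed in explicitly so the behavior is easy to follow.

```python
class LeaseTable:
    """In-memory stand-in for the leases table (NOT a Cassandra client)."""

    def __init__(self, ttl_seconds=16):
        self.ttl = ttl_seconds
        self.rows = {}  # lease name -> (owner, expires_at)

    def _live_owner(self, name, now):
        """Return the current owner, dropping the row if its TTL has passed."""
        row = self.rows.get(name)
        if row is None or row[1] <= now:
            self.rows.pop(name, None)  # expired rows vanish, like a TTL
            return None
        return row[0]

    def acquire(self, name, owner, now):
        """Imitates INSERT ... IF NOT EXISTS: succeeds only if no live row."""
        if self._live_owner(name, now) is not None:
            return False
        self.rows[name] = (owner, now + self.ttl)
        return True

    def renew(self, name, owner, now):
        """Imitates a conditional update: only the current holder can extend."""
        if self._live_owner(name, now) != owner:
            return False
        self.rows[name] = (owner, now + self.ttl)
        return True


table = LeaseTable(ttl_seconds=16)
print(table.acquire("lease_master", "server_1", now=0))   # server_1 takes the lease
print(table.acquire("lease_master", "server_2", now=5))   # rejected, lease is held
print(table.renew("lease_master", "server_1", now=10))    # heartbeat extends the lease
print(table.acquire("lease_master", "server_2", now=30))  # heartbeats stopped, lease expired
```

In a real cluster the clock and the expiry are handled by Cassandra itself, so the owner only has to heartbeat comfortably more often than the TTL.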

I’m not sure how the performance of Cassandra-based election compares with dedicated coordination services (etcd, ZooKeeper, …), and it would be interesting to see a study. But since those are more full-featured and better understood for maintaining consensus, I’d recommend delegating this behavior to them unless you’re stuck with Cassandra for your application.

Read More

Building Applications With Cassandra: A Very Quick Guide

Cassandra Overview

Cassandra, an open-source NoSQL database, has gained popularity in cloud and big data applications. Inspired by Amazon’s Dynamo, it offers good latency, tunable consistency, easy scalability, and high availability in a cluster setup.

Our team has been using Cassandra as the backend for an application we’ve been shipping to customers. We chose it for its high-availability setup and good performance. We used it to store time-series data and some simple configuration data as key-value pairs, so it felt like a natural choice. In our experience over time, it has proven highly capable of serving our purposes.

Alongside its impressive availability, scalability, and read/write performance, Cassandra also comes with limitations. We cannot design data models the same way we did with traditional relational databases and their SQL interface. And it doesn’t come with many of the guarantees of traditional databases, such as strong consistency, transactions, and cascading deletion. Like other NoSQL databases, Cassandra was designed to optimize batch write operations with good read and write latency. It fits applications without many update/delete operations, especially ones that do not need a high volume of transactions.

So the best use cases for Cassandra can be:

  • You have a high volume of data with availability concerns.
  • Most data is sequential read/write or append, e.g.: logs, time-series, IoT applications, track records, messages, etc.
  • You don’t have complex relations between data entities that require a high volume of transactions.

Read More

Paper Reading: Ray: A Distributed Framework for Emerging AI Applications

Paper link: https://www.usenix.org/system/files/osdi18-moritz.pdf

Overview

Ray is a new and growing distributed programming framework, with an ambitious plan to be the foundation of emerging AI/ML applications. In its own words, it aims to “provide a universal API for distributed computing”. This means it needs to provide a programming interface that’s flexible enough for new applications, and a backend system designed to scale for elastic computing needs with good performance. This paper (OSDI ’18) explains the API and architecture design that serve this goal, and I’ve found some very interesting points in it.

Read More

Golang Channel Idioms

While learning Golang, I was fascinated by the power of Golang’s goroutines and channels. A channel is a powerful tool for tackling synchronization problems in asynchronous programs. It acts as a bridge between async goroutines and can describe complicated logic expressively. Together they can be powerful weapons in building async applications. On the other hand, when misused, they can be a nightmare to debug.

Here I’ve summarized a few valuable idioms for using goroutines and channels, drawn from multiple references as well as my own experience. They can serve as a toolbox that comes in handy for similar problems, so that you don’t have to design solutions from scratch, which might help you avoid synchronization errors.

Read More

Paper Reading: 150 Successful Machine Learning Models Deployed: 6 Lessons Learned At Booking.com

Paper link: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com

Or download.

Published at KDD 2019 by Booking.com, the paper describes lessons from deploying Machine Learning models in their production service. It provides some intriguing insights, many of which I believe are valuable for understanding how Machine Learning is applied in real-world scenarios.

Here are some of my takeaways.

Read More

Book Review: The Death of Expertise

The Death of Expertise: The Campaign Against Established Knowledge and Why It Matters

https://en.wikipedia.org/wiki/The_Death_of_Expertise

“The Death of Expertise” by Tom Nichols is a timely response to the ongoing information epidemic, especially in America. Quoting Isaac Asimov:

There is a cult of ignorance in the United States, and there always has been. The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that “my ignorance is just as good as your knowledge.”

The book describes the author’s view of why experts are so important in a democracy and of the relationship between experts and the public. It also goes on to decry the ongoing decay in this relationship, where citizens are increasingly losing trust in experts, and experts are finding it increasingly difficult to communicate with their audience.

Read More

Reading Summary: 11/06/2020

Technology

Edsger Dijkstra: The Man Who Carried Computer Science on His Shoulders

The often untold story behind a mastermind of Computer Science: Dijkstra, whose name is attached to an important algorithm widely used in GPS navigation.

The blog describes a wise, hard thinker, a great mind who made unparalleled contributions both to Computer Science, viewed as a mathematical and logical discipline, and to Software Engineering, which focuses on building software and hardware components.

He’s most famous for his private reports, named “EWD”, which he kept writing for more than forty years. They described his views on Computer Science and Software Engineering in general, and sometimes served as reviews of others’ work. One of the most influential “EWD” reports was “Notes on Structured Programming,” which argued that programming is a serious skill that demands intellectual rigor.

In 1972, Dijkstra received the ACM Turing Award. He was recognized for:

contributions to programming as a high, intellectual challenge; for eloquent insistence and practical demonstration that programs should be composed correctly, not just debugged into correctness; for illuminating perception of problems at the foundations of program design.

He had great passion for his art, and his strong personality sometimes sparked controversies. One of the most famous was his critique of “GOTO” statements as harmful. It started a widespread, heated debate, yet Dijkstra’s view finally prevailed, and his insistence brought a monumental change to programming paradigms.

There are many more interesting details about his personal and academic life in the original post, too many to summarize here. For example, he had a mini-van in Austin, which he often drove to national parks with his wife, and it was named the “Touring Machine.” If you are passionate about computers and software and have a long weekend afternoon, it’s well worth a read.

Read More

Paper Reading: Julia: Dynamism and Performance Reconciled by Design

Link: https://dl.acm.org/doi/pdf/10.1145/3276490

The paper outlines some of the Julia programming language’s most important design choices, and explains how they build a bridge between user-friendliness and performance.

The paper provides a few benchmarks that compare Julia’s performance with a C baseline, along with other dynamic languages like Python, MATLAB, and JavaScript. While other dynamic programming languages suffer a great performance loss due to their dynamism, Julia competes relatively closely with the C/C++ baseline. It reaches native performance in a few cases, and most of the benchmarks are within 2x of C or C++, while Python can be more than 70x slower than C++.

This is significant, as it may eliminate the “prototype in a dynamic language, then reimplement in a static language for faster performance” cycle, saving the extra coding time without sacrificing much performance.

Some key takeaways from this paper:

Read More

Reading Summary: 07/20/2020

Social

A Sino-American bond, forged by Chinese students, is in peril $

How the Chinese-American relationship is affecting the lives of many who are “stuck in between.”

How social media took us from Tahrir Square to Donald Trump

The author foresaw the dangerous impact social media would have on society, and he was right.

He also proposes that the cure cannot be a purely technological one: it requires fixing the vulnerabilities inside economic, political, and social systems.

Technology

Testifying at the Senate about A.I.‑Selected Content on the Internet, from Stephen Wolfram

Stephen Wolfram’s testimony at the Senate on A.I.-selected content: his ideas on why algorithmic bias is dangerous, and how we can address it with proper regulation, transparency, and user choice.

He basically proposed that users should have an idea of what algorithm is feeding them data, and the capability to choose. This requires some open benchmarks on recommendation algorithms, and frameworks for users to choose.

Programming

The Rise of Embarrassingly Parallel Serverless Compute

What serverless computing is, why it is on the rise, and why it is useful for parallel workloads (data processing, CI/CD, compilation, ML, visualization, …, you name it).

NoSQL Data Modeling Techniques

A detailed guide to modeling your NoSQL data schemas.

Paper Reading: Aurora: Distributed Relational Database

The following is my overly simplified summary of the paper.

Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions, with a distributed design built around replicating the database’s write-ahead log (WAL).

References

Read More