Reading Summary: 07/20/2020

Social

A Sino-American bond, forged by Chinese students, is in peril $

How the Chinese-American relationship is affecting the lives of the many people “stuck in between.”

How social media took us from Tahrir Square to Donald Trump

The author foresaw the dangerous impact social media has on society, and he was right.

He also argues that the cure cannot be purely technological; it requires fixing the vulnerabilities in our economic, political, and social systems.

Technology

Testifying at the Senate about A.I.‑Selected Content on the Internet, from Stephen Wolfram

Stephen Wolfram’s Senate testimony on A.I.-selected content: why algorithmic bias is dangerous, and how it can be addressed with proper regulation, transparency, and user choice.

In essence, he proposes that users should know which algorithm is feeding them content and should be able to choose among algorithms. This requires open benchmarks for recommendation algorithms and frameworks that let users make that choice.

Programming

The Rise of Embarrassingly Parallel Serverless Compute

What serverless computing is, why it is on the rise, and why it is useful for embarrassingly parallel workloads (data processing, CI/CD, compilation, ML, visualization, …, you name it).
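To make "embarrassingly parallel" concrete, here is a minimal local sketch, not taken from the article. A thread pool stands in for a fleet of serverless function invocations, and `handle` and `fan_out` are hypothetical names chosen for illustration. The point is that each chunk is processed with no shared state, so the work can fan out to as many workers as are available.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(chunk):
    """Hypothetical stand-in for one serverless function invocation.

    It sees only its own chunk and shares no state with other invocations.
    """
    return sum(x * x for x in chunk)


def fan_out(data, chunk_size, max_workers=4):
    """Split data into independent chunks, process them in parallel, combine."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk could just as well be sent to a separate function instance.
        partials = list(pool.map(handle, chunks))
    return sum(partials)


total = fan_out(list(range(1000)), chunk_size=100)
```

Because the chunks never talk to each other, the only coordination cost is the final combine step, which is what makes this shape of work a natural fit for serverless platforms.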

NoSQL Data Modeling Techniques

A detailed guide to modeling your NoSQL data schemas.

Paper Reading: Aurora: Distributed Relational Database

The following is my overly simplified summary of the paper.

Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions; its distributed design centers on replicating the database’s write-ahead log (WAL).
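As a rough illustration of what "replicating the log" means, here is a toy sketch. It is not Aurora's actual protocol: the `Primary` and `Replica` classes, the three-replica setup, and the quorum size are all assumptions made for the example. The idea is that the primary ships ordered log records instead of data pages, and any replica that replays the log in order reaches the same state.

```python
class Replica:
    """Toy storage node that rebuilds state by applying log records in order."""

    def __init__(self):
        self.state = {}
        self.applied_lsn = 0  # log sequence number of the last applied record

    def apply(self, record):
        lsn, key, value = record
        if lsn != self.applied_lsn + 1:
            return False  # reject gaps and duplicates
        self.state[key] = value
        self.applied_lsn = lsn
        return True


class Primary:
    """Toy writer that appends to a log and ships each record to replicas."""

    def __init__(self, replicas, write_quorum):
        self.replicas = replicas
        self.write_quorum = write_quorum  # illustrative value, not Aurora's
        self.log = []

    def write(self, key, value):
        record = (len(self.log) + 1, key, value)
        self.log.append(record)
        acks = sum(1 for replica in self.replicas if replica.apply(record))
        return acks >= self.write_quorum

    def catch_up(self, replica):
        # A lagging replica only needs the log suffix it has not applied yet.
        for record in self.log[replica.applied_lsn:]:
            replica.apply(record)


primary = Primary([Replica(), Replica(), Replica()], write_quorum=2)
primary.write("a", 1)
primary.write("b", 2)
```

A replica that joins late can call `catch_up` and converge without copying any data pages, which is the property the log-centric design relies on.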



Reading: Cassandra Data Modeling

Reading from the DataStax white paper on Cassandra data modeling: https://www.datastax.com/sites/default/files/content/whitepaper/files/2019-10/CM2019236 - Data Modeling in Apache Cassandra ™ White Paper-4.pdf

Cassandra is an exemplary implementation of a NoSQL database, and it has gained popularity in various web, big data, and ML applications. I recently stumbled upon a good Cassandra white paper that includes a decent introduction to its data modeling techniques, which can in turn be applied to other NoSQL databases.

Here are my notes and summaries:

Data Modeling Concepts

Cassandra differs from a traditional RDBMS in a great many ways: it is a wide-column database, it offers BASE (eventual consistency) guarantees, and it has looser relationships between tables. To run efficiently, an application therefore needs to model its data very differently than it would in a traditional RDBMS.

Specifically, NoSQL differs in the following ways:

  • No Joins: tables have loose relationships with each other, and the database does not join them for you.
  • No Referential Integrity: an RDBMS uses foreign keys to refer to the primary key of another table. NoSQL doesn’t enforce this.
  • Denormalization: contrary to RDBMS normalization techniques, denormalization is a first-class citizen in NoSQL. Many NoSQL databases support aggregating fields in the same table to achieve row-level atomicity.
  • Query First: SQL data modeling starts with entities and relations, while NoSQL data modeling starts with application queries.
  • Sorting: sort order is an important design decision in Cassandra and many other NoSQL databases, because it is fixed when the table is defined.
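The points above can be sketched with a small in-memory model. This is not Cassandra's API: `WideTable` and the `comments_by_video` example are hypothetical names for illustration. The table is designed around a single query ("latest comments for a video"), rows are grouped by partition key and kept sorted by clustering key at write time, and the author name is copied into every row so that no join is needed.

```python
import bisect
from collections import defaultdict


class WideTable:
    """Toy query-specific table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        self._keys = defaultdict(list)  # clustering keys, kept sorted
        self._rows = defaultdict(list)  # rows, in the same order as the keys

    def insert(self, partition_key, clustering_key, row):
        # The sort order is decided at write time, not at query time.
        keys = self._keys[partition_key]
        pos = bisect.bisect_right(keys, clustering_key)
        keys.insert(pos, clustering_key)
        self._rows[partition_key].insert(pos, row)

    def read(self, partition_key, limit=None, newest_first=False):
        # A read touches exactly one partition and returns rows already sorted.
        rows = self._rows.get(partition_key, [])
        ordered = rows[::-1] if newest_first else rows[:]
        return ordered if limit is None else ordered[:limit]


# Query first: this table exists only to answer "latest comments for a video".
comments_by_video = WideTable()
# Denormalization: author_name is duplicated into each row instead of joined in.
comments_by_video.insert("video-1", 20200102, {"author_name": "alice", "text": "second"})
comments_by_video.insert("video-1", 20200101, {"author_name": "bob", "text": "first"})
comments_by_video.insert("video-2", 20200103, {"author_name": "alice", "text": "other"})

latest = comments_by_video.read("video-1", limit=1, newest_first=True)
```

A second query, such as "comments by author", would get its own table with a different partition key, at the cost of writing each comment twice.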


Reading Summary: Ultralearning

Ultralearning is quite an interesting book from one of my favorite bloggers, Scott Young. He is famous for his “MIT Challenge,” in which he completed four years of MIT coursework in a single year entirely through self-study, and he now blogs regularly on study methods, student cognition, and everything related.

The book summarizes his research on and experience with studying. He argues that there is a way to learn and improve yourself through intensive training and exercise. As with training muscles, you can adopt an extraordinary, unorthodox training plan for your brain and pick up a new skill in a short amount of time, be it a foreign language, programming, sketching, or even public speaking. He calls this “ultralearning.” For the book, he studied the literature and interviewed like-minded people who had similar experiences of acquiring or improving a skill intensively. He then distills the essential principles into a guide to a successful “ultralearning” project.


Book Review: Data and Goliath

https://play.google.com/store/books/details/Bruce_Schneier_Data_and_Goliath_The_Hidden_Battles?id=MwF-BAAAQBAJ https://www.amazon.com/dp/039335217X/

“Data and Goliath” is an excellent book that a friend recommended. It surveys the dangerous and negative ways in which data and “Big Data” technology can shape our societies. The author, Bruce Schneier, is a prominent cryptography expert who has published influential work on cryptography and privacy. He is also on the board of directors of the Electronic Frontier Foundation.


Reading Summary 2019-08

Cassandra Time Series Bucketing

How to model time-series data with Cassandra.
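A minimal sketch of the bucketing idea, under my own assumptions rather than the article's exact scheme: the partition key combines a series id with a fixed-size time bucket, so no single partition grows without bound. The names `bucket_for`, `partition_key`, and `buckets_between`, as well as the one-day bucket size, are hypothetical.

```python
from datetime import datetime, timezone

DAY_SECONDS = 86400  # assumed bucket size: one day


def bucket_for(ts, bucket_seconds=DAY_SECONDS):
    """Return the start (epoch seconds) of the bucket that contains ts."""
    epoch = int(ts.timestamp())
    return epoch - epoch % bucket_seconds


def partition_key(series_id, ts, bucket_seconds=DAY_SECONDS):
    """Readings of one series within one bucket share a partition."""
    return (series_id, bucket_for(ts, bucket_seconds))


def buckets_between(start, end, bucket_seconds=DAY_SECONDS):
    """List every bucket a range query from start to end has to read."""
    first = bucket_for(start, bucket_seconds)
    last = bucket_for(end, bucket_seconds)
    return list(range(first, last + 1, bucket_seconds))


before_midnight = datetime(2019, 8, 1, 23, 59, tzinfo=timezone.utc)
after_midnight = datetime(2019, 8, 2, 0, 1, tzinfo=timezone.utc)
keys = [partition_key("sensor-1", before_midnight),
        partition_key("sensor-1", after_midnight)]
```

The trade-off is in the bucket size: small buckets keep partitions small but force range queries to read more partitions, as `buckets_between` makes visible.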

Simple GoRPC

The best way to understand something is to build it yourself. This tutorial covers basic network programming in Go, struct design, and the use of the reflect package.

Optimizing M3: How Uber Halved Our Metrics Ingestion Latency by Forking the Go Compiler

A great experience-sharing post on debugging a performance issue in Uber’s services. Using profiling and analysis tools, the Uber team pinpointed the issue in the worker pool and goroutine stack allocation, then forked the Go compiler to prove that it was a regression in the compiler. A very nice read with a thorough analysis process.

Book: Programming Models for Distributed Computation

A book on topics in distributed computation, drawn from the experience of teaching a distributed systems course at Northeastern University.

Spotify Engineering Culture

A very nice engineering blog post from 2014. It is an excellent overview of Spotify’s culture and an introduction to building an “agile” team.

How We Helped Our Reporters Learn to Love Spreadsheets

The NYTimes has released its in-house course for teaching journalists data skills. Journalism can also benefit from a little coding and data analytics.

Reading Summary 2019-04

An Overview of Go’s Tooling

If Go is one of your favorite languages as well, this is a must-read: it introduces all the basic tooling that comes with Go’s ecosystem, which might save you a great deal of time.

HackerNews thread on TLA+:

A thread from Hacker News discussing the importance of formal verification for distributed systems.

TLA+ and formal verification are notorious for their complexity and steep learning curve. Learning them might be one of my future goals.

Are You a Software Architect?

What it takes to be a software architect, a great blog post from InfoQ.

InfluxData is Building a Fast Implementation of Apache Arrow in Go Using c2goasm and SIMD

TIL that it is possible to convert assembly generated from C/C++ into Go assembly and call it from Go code. InfluxData uses this tooling to bring AVX/SSE instructions into Go assembly, boosting the performance of Go code, sometimes by orders of magnitude.

More information on this tool, c2goasm, which comes from Minio.

Org-Mode Is One of the Most Reasonable Markup Languages to Use for Text

I think so, too. But it’ll require a community and proper tooling to see it really prosper. Hope to see that some day.

Why and How Capitalism Needs to Be Reformed

A great piece from Ray Dalio, a seasoned investor and the founder of the investment firm Bridgewater Associates. In this long post he discusses why American capitalism is failing at distributing resources, especially educational resources, and why it needs to be reformed to stay healthy.