08 02 2021

Paper Reading: Ray: A Distributed Framework for Emerging AI Applications

Paper link: https://www.usenix.org/system/files/osdi18-moritz.pdf

Overview

Ray is a new and growing distributed programming framework with an ambitious plan to become the foundation of emerging AI/ML applications. In its own words, it aims to “provide a universal API for distributed computing”. That means it needs a programming interface flexible enough for new applications, and a backend system that scales to elastic computing needs with good performance. This paper (OSDI ’18) explains the API and architecture design behind that goal, and I found some very interesting points in it.

11 05 2020

PaperReading

Paper Reading: Julia: Dynamism and Performance Reconciled by Design

Link: https://dl.acm.org/doi/pdf/10.1145/3276490

The paper outlines some of the most important design choices of the Julia programming language, and explains how they bridge user-friendliness and performance.

The paper provides a few benchmarks comparing Julia’s performance with a C baseline and with other dynamic languages such as Python, MATLAB, and JavaScript. While other dynamic languages suffer great performance loss due to their dynamism, Julia stays relatively close to the C/C++ baseline: it reaches native performance in a few cases, and most of the benchmarks are within 2x of C or C++, while Python can be more than 70x slower than C++.

This is significant, as it may eliminate the “prototype in a dynamic language, then reimplement in a static language for faster performance” cycle, saving the extra time spent on reimplementation without sacrificing much performance.

Some key takeaways from this paper:

07 04 2020

PaperReading

Paper Reading: Aurora: Distributed Relational Database

The following is my overly simplified summary of the paper.

Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions, and whose distributed design is built around replicating the database’s write-ahead log (WAL).
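
To make the idea of "replicating the log" concrete, here is a minimal sketch of my own. It is not Aurora's actual protocol: the `Primary` and `Replica` classes, the record layout, and the single-key writes are all hypothetical, and quorums, failures, and storage details are left out. It only shows that shipping log records is enough for every replica to rebuild the same state.

```python
class Replica:
    """Hypothetical replica that rebuilds its state purely from log records."""

    def __init__(self):
        self.applied_lsn = 0   # log sequence number of the last applied record
        self.pages = {}        # key -> value, standing in for database pages

    def apply(self, record):
        lsn, key, value = record
        # Apply records strictly in log order; reject duplicates and gaps.
        if lsn != self.applied_lsn + 1:
            return False
        self.pages[key] = value
        self.applied_lsn = lsn
        return True


class Primary:
    """Hypothetical writer that appends to its log, then ships the record."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas

    def write(self, key, value):
        record = (len(self.log) + 1, key, value)
        self.log.append(record)  # the log entry is the source of truth
        # Only the log record is sent; each replica applies it on its own.
        return sum(r.apply(record) for r in self.replicas)


replicas = [Replica(), Replica(), Replica()]
primary = Primary(replicas)
primary.write("a", 1)
acks = primary.write("a", 2)

# A replica that joins late can catch up by replaying the same log.
late = Replica()
for rec in primary.log:
    late.apply(rec)
```

The late replica ends up with the same state as the others because the log, not the data pages, is what gets replicated.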

References

Course Syllabus: https://pdos.csail.mit.edu/6.824/schedule.html
Video Lectures: https://www.youtube.com/channel/UC_7WrbZTCODu1o_kfUMq88g/videos
Lecture: https://www.youtube.com/watch?v=jJSh54J1s5o

07 03 2020

PaperReading

Reading: Cassandra Data Modeling

Reading from Cassandra official website: https://www.datastax.com/sites/default/files/content/whitepaper/files/2019-10/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf

Cassandra is an exemplary implementation of a NoSQL database, and it has gained popularity in various web, big data, and ML applications. Recently I stumbled upon a good summary of the Cassandra handbook, which includes a decent introduction to its data modeling techniques; these can in turn be used with other NoSQL databases.

Here are my notes and summaries:

Data Modeling Concepts

Cassandra and traditional RDBMSs differ in a great many ways: Cassandra is a wide-column database with BASE (eventual consistency) guarantees and looser relationships between tables. Therefore one needs to model data very differently than in a traditional RDBMS for the application to run efficiently.

Namely, NoSQL has the following differences:

No Joins: tables have loose relationships with each other, without database-level joins.
No Referential Integrity: an RDBMS requires foreign keys to refer to a primary key in another table. NoSQL doesn’t enforce this.
Denormalization: contrary to RDBMS normalization techniques, denormalization is a first-class citizen in NoSQL. Many NoSQL databases support aggregating fields in the same table to achieve row-level atomicity.
Query First: SQL data modeling starts with entities and relations, while NoSQL data modeling starts with application queries.
Sorting: sort order is an important design decision for Cassandra and many NoSQL databases.
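
The points above can be sketched in plain Python. This is a toy illustration of my own, not Cassandra's API: the `PostsByUser` class, its field names, and the sample query ("latest posts by a given user") are all hypothetical. The table is designed around one query, copies the user name into every row instead of joining, and fixes the sort order at write time.

```python
from collections import defaultdict


class PostsByUser:
    """Toy table designed for one query: "latest posts by a given user"."""

    def __init__(self):
        # partition key (user_id) -> rows sorted by clustering key
        # (posted_at, newest first)
        self._partitions = defaultdict(list)

    def insert(self, user_id, user_name, posted_at, title):
        # Denormalization: user_name is copied into every row,
        # so reads never need a join against a users table.
        row = {"user_id": user_id, "user_name": user_name,
               "posted_at": posted_at, "title": title}
        rows = self._partitions[user_id]
        rows.append(row)
        # Sort order is decided when the table is designed, not at query time.
        rows.sort(key=lambda r: r["posted_at"], reverse=True)

    def latest(self, user_id, limit):
        # The query touches a single partition that is already in order.
        return self._partitions.get(user_id, [])[:limit]


table = PostsByUser()
table.insert("u1", "Ada", 20200701, "Cassandra notes")
table.insert("u1", "Ada", 20200703, "Aurora notes")
table.insert("u2", "Bob", 20200702, "Mesos notes")
latest_titles = [r["title"] for r in table.latest("u1", 1)]
```

A second query, say "posts by date across all users", would get its own table with a different partition key rather than a join or a secondary sort.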

01 20 2020

PaperReading

Paper Reading: Zookeeper

Paper: https://www.usenix.org/legacy/events/atc10/tech/full_papers/Hunt.pdf

Presentation: https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems

03 10 2019

PaperReading

Paper Reading: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Link to paper: https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf

Presentation: https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center

Mesos is cluster resource management software from UC Berkeley. Unlike many existing frameworks, Mesos is designed to support heterogeneous frameworks (Hadoop, MPI, etc.) in the same cluster and share resources between them. It provides a thin layer that makes resource offers to the framework schedulers and delegates the scheduling decisions to the frameworks themselves.

With this design, Mesos can achieve pretty good elasticity between frameworks, and letting frameworks choose their own resources results in better data locality.
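
The resource-offer idea can be sketched as below. This is a simplified model of my own, not Mesos code: `Offer`, `LocalityFramework`, and `run_offers` are hypothetical names, only CPUs are modeled, and the master offers to frameworks in a fixed order rather than using a real allocation policy. The point is that the master only offers, and each framework decides whether to accept.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: int


class LocalityFramework:
    """Hypothetical framework scheduler that only accepts nodes holding its data."""

    def __init__(self, name, data_nodes, cpus_per_task, pending_tasks):
        self.name = name
        self.data_nodes = set(data_nodes)
        self.cpus_per_task = cpus_per_task
        self.pending_tasks = pending_tasks

    def on_offer(self, offer):
        # The framework, not the master, decides whether to use the offer.
        if self.pending_tasks == 0 or offer.node not in self.data_nodes:
            return 0  # decline: nothing to run, or the data is not local
        tasks = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.pending_tasks -= tasks
        return tasks * self.cpus_per_task  # number of CPUs accepted


def run_offers(free_cpus, frameworks):
    """Master loop: offer each node's free CPUs to the frameworks in turn."""
    placements = []
    for node in sorted(free_cpus):
        for fw in frameworks:
            if free_cpus[node] == 0:
                break
            used = fw.on_offer(Offer(node, free_cpus[node]))
            if used:
                free_cpus[node] -= used
                placements.append((fw.name, node, used))
    return placements


frameworks = [
    LocalityFramework("hadoop", data_nodes=["n1"], cpus_per_task=2, pending_tasks=2),
    LocalityFramework("mpi", data_nodes=["n1", "n2"], cpus_per_task=1, pending_tasks=3),
]
placements = run_offers({"n1": 4, "n2": 4}, frameworks)
```

Because the locality check lives inside the framework, the master never needs to know where any framework's data is stored.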

03 04 2019

Paper

Paper Reading: Understanding Real-World Concurrency Bugs in Go

Link: https://golangweekly.com/link/59972/b208593eda

A team from Penn State University and Purdue published a study on concurrency bugs found in Go projects, namely six large projects from GitHub: Docker and Kubernetes, two datacenter container systems; etcd, a distributed key-value store; gRPC, an RPC library; and CockroachDB and BoltDB. The authors searched the commit history of each repository for concurrency bug fixes, then categorized and studied them.

TL;DR:

Go’s message-passing concurrency mechanism, something Go is proud of, isn’t as easy to use as generally perceived. It creates just as many bugs as the shared-memory concurrency model, if not more.
Shared-memory synchronization is still used more often in Go projects.
Go’s built-in race and deadlock detectors still cannot catch all the bugs. There is room for improvement.

02 27 2019

Paper

Paper Reading: Large-scale cluster management at Google with Borg

Link: https://ai.google/research/pubs/pub43438

About: Borg is Google’s large-scale cluster workload scheduling and management system. It handles most of Google’s service and batch job workloads on clusters on the scale of thousands of machines. It shields users from the burdens of cluster management, and provides high-availability features that handle failures.

The now famous and popular open-source container orchestration tool Kubernetes is a successor to Borg, and keeps borrowing ideas from it (see kubernetes).

01 03 2015

ProgrammingLanguage

Paper Reading - Fundamental Concepts In Programming Languages

This is a holiday reading summary. I recently came across two interesting blog posts on fundamental concepts in computer science, both titled “10 Papers Every Programmer Should Read (At Least Twice)”. One can be found here, and the other on Fogus’ blog. The topics of these papers range from programming language theory and functional programming to Lamport’s distributed systems theory. I will read and summarize some of them on my blog. That makes 20 papers, and 40 paper readings if I do read each one twice, so it might be a long time before I finish them all.

Kevin Hu's Blog

A Hungry Fool
