Paper Reading: A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

https://www.microsoft.com/en-us/research/publication/a-reconfigurable-fabric-for-accelerating-large-scale-datacenter-services/

This is one of a series of papers from Microsoft’s Project Catapult, which studies how reconfigurable devices (FPGAs, etc.) can accelerate data center workloads, ranging from specific algorithms such as page ranking for the Bing search engine to more sophisticated machine learning frameworks such as DNNs.

This is one of their early publications, and it introduces the basic design and implementation of an FPGA-accelerated datacenter. It covers the fundamentals of every aspect of the server design: hardware, network topology, FPGA core design, fault-tolerant cluster management software, workload scheduling algorithms, and so on.


Reading-Summary 2018-10-14

Posts I find interesting around the web:

Miscellaneous Posts

Augmenting Long-term Memory

A very interesting post on augmenting long-term memory, based on Ebbinghaus’ forgetting curve: use flashcards to memorize everything you’ve learned, even trivia like your friends’ birthdays. The author uses the Anki flashcard software to review the cards.

The author also reasons about the benefits of memorizing all the details, concepts, and “everything”: the details are the building blocks of a field of knowledge, and memorizing them dramatically helps in understanding the field.

It’s a long read but a deep discussion, and I find it a joyful read.

How To Get Rich

A talk from Jared Diamond, the author of Guns, Germs, and Steel. Despite the somewhat misleading title, it’s an interesting take on history and the progress of human civilizations, and on how competition between civilizations influences their prosperity.

Systems Design and Distributed Systems

SoftwareArch: You are going to need it — Using Interfaces and Dependency Injection to future proof your designs

An introduction to interfaces in Golang, and how dependency injection can help you design large projects.

System Design Primer

The basic concepts of system design, web design, basic principles, and distributed systems design. A collaborative effort on GitHub.

Distributed Periodic Scheduling with Cron

A chapter from Google’s new Site Reliability Engineering book, on how to design a distributed cron daemon and handle problems including fault tolerance, repeatedly scheduled jobs, overloading the cluster, etc. The whole book is a very valuable summary of Google’s experience with automation and distributed systems design at Google scale. I will definitely read through the other chapters.

Go hits the concurrency nail right on the head

Eli Bendersky’s blog post on why Golang gracefully handles, at the language level, the concurrency problems that other major languages handle rather awkwardly:

  • Use goroutines to unify the interface to coroutines and threads.
  • Use channels to enforce the ‘share memory by communicating’ pattern.

Together these greatly reduce the programmer’s mental burden when designing highly concurrent systems.

Getting started with Python in HPC

An introduction to using Python in HPC, from the basics of the Python language to distributed HPC frameworks for Python.

A Whirlwind Tour of Distributed Systems

A list of concepts, papers, and interesting blog posts on distributed systems design.

Weekly paper reading: C++ and the Perils of Double-Checked Locking

An interesting paper on the perils of mixing C++, design patterns, and multi-threading:

C++ and the Perils of Double-Checked Locking

The DCLP (Double-Checked Locking Pattern) is often used with the singleton design pattern. To initialize the shared singleton object exactly once, you follow these steps:

  • check, without taking the lock, whether the resource is already initialized
  • if not, lock the mutex
  • inside the mutex-protected area, check again whether the resource is initialized
  • if it still is not, initialize the object

See C++ example:

    Singleton* Singleton::instance() {
        if (pInstance == 0) { // 1st check, to avoid locking every time
            Lock lock;

            if (pInstance == 0) { // 2nd check, a safe check to guarantee correctness
                pInstance = new Singleton;
            }
        }

        return pInstance;
    }

This pattern, however, introduces subtle bugs when implemented in C++ with multi-threading.

The issue is with this statement:

    pInstance = new Singleton;

The following steps happen:

  1. Allocate memory for the object
  2. Construct an object in the allocated memory.
  3. Make pInstance point to the allocated memory.

But the C++ specification doesn’t force these steps to happen in this order, so compilers are free to reorder them for the sake of optimization. As long as the observable outcome of the instructions is correct for a single thread, compilers may place instructions in whatever order keeps the CPU best utilized. Consider the following case with DCLP:

  • Thread A executes the DCLP code for the first time: it performs the 1st check, locks the mutex, performs the 2nd check, allocates memory for the Singleton object, and points pInstance to the allocated memory. But before the Singleton object is constructed, thread A is suspended, or another thread runs at the same time.
  • Thread B enters the DCLP code, determines that pInstance is non-null, and starts accessing the Singleton object before it is fully constructed.

Oops. This is a very subtle and hard-to-detect bug in code that only tries to initialize a shared resource once.
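
To make the interleaving concrete, here is a minimal single-threaded sketch of one ordering the compiler was allowed to produce for pInstance = new Singleton. The Singleton struct, its value field, and the function name init_reordered are made up for illustration and are not taken from the paper’s listing.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for the singleton class, for illustration only.
struct Singleton {
    int value;
    Singleton() : value(42) {}
};

Singleton* pInstance = nullptr;

// Hand-written equivalent of "pInstance = new Singleton" with steps 2 and 3
// swapped. The object is intentionally never freed, as with a real singleton.
void init_reordered() {
    assert(pInstance == nullptr);  // only meant for the first initialization

    // Steps 1 and 3: allocate raw memory and publish the pointer.
    pInstance = static_cast<Singleton*>(operator new(sizeof(Singleton)));

    // A second thread running the 1st check at this point would see a
    // non-null pInstance that refers to unconstructed memory.

    // Step 2: only now construct the object in the allocated memory.
    new (pInstance) Singleton;
}
```

In a single thread the end result is identical to the original statement, which is exactly why the reordering is legal for the compiler.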

The paper digs into the details of how compilers can use all sorts of optimizations to spoil your efforts to fix the DCLP code, and why even the volatile keyword does not reliably rescue it. It concludes that a correct implementation needs memory barriers, which the C++ standard of that time did not provide.

It’s a very interesting paper on algorithms, C++, and programming. It makes you stand in awe of the difficulty and intricacies of C++ and multi-threaded programming.
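
For reference, C++11 (which postdates the paper) added atomics with defined memory ordering to the language itself. Below is a minimal sketch of DCLP written with std::atomic and std::mutex, assuming a C++11 compiler. It is not the paper’s code, and the value_ member is hypothetical, only there so the example has something to read.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
    int value() const { return value_; }

private:
    Singleton() : value_(42) {}
    int value_;  // hypothetical payload, for illustration only

    static std::atomic<Singleton*> pInstance;
    static std::mutex mutex_;
};

std::atomic<Singleton*> Singleton::pInstance{nullptr};
std::mutex Singleton::mutex_;

Singleton* Singleton::instance() {
    // 1st check: the acquire load pairs with the release store below, so a
    // thread that sees a non-null pointer also sees the constructed object.
    Singleton* p = pInstance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        // 2nd check: the mutex already orders this against other initializers.
        p = pInstance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Singleton;
            // Publish the pointer only after construction has finished.
            pInstance.store(p, std::memory_order_release);
        }
    }
    assert(p != nullptr);
    return p;
}
```

Since C++11 a function-local static object is also initialized in a thread-safe way, which is usually the simpler choice when a singleton is really needed.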

Reading-Summary 2018-06

Posts I found interesting around the web:

man7 Linux cgroups

The Linux manual page for the kernel’s cgroups feature, which restricts the resources of Linux processes: CPU, maximum number of processes, memory usage, network setup, etc.

man7 Linux namespaces

The Linux manual page for the kernel’s namespaces feature. Namespaces can be requested through flags to the clone syscall, and they isolate the child process’s view of cgroups, IPC, network, mount points, host and domain names, etc.

GOTO 2018 Containers From Scratch

When all these ingredients come together, they form the foundation that Docker is built upon. This very interesting talk from GOTO 2018 demonstrates how you can use the following technologies, already built into the Linux kernel, to create your own very small proof-of-concept Docker:

  • chroot
  • namespace
  • cgroups

It also includes very interesting details including (but not limited to):

  • You’ll need to mount the /proc virtual file system for your ‘containerized’ child process.
  • You’ll need to provide the ‘UnshareFlag’ CLONE_NEWNS when creating the child process, to ‘unshare’ the child’s mount points from the parent process, so that the parent doesn’t see the child’s mount points (which could be many and messy).

A Classical Math Problem Gets Pulled Into the Modern World

An optimization problem from classical math is being applied in AI and its applications, including self-driving cars. Math is magical.

Wikipedia is fixing one of Internet’s biggest flaws

As it actually encourages collaborations, discussions, and exposure to opposing views.

Golang Patterns - Part 2

Technical Writing: Learning from Kernighan

Learning technical writing from the author of your favorite C programming book, ‘The C Programming Language’.

A Note on Linux Hugepages

The page table is where Linux stores virtual-to-physical page address translations, and it can get huge when memory usage is high. One way to reduce the size of the page tables, and the number of page faults, is to use huge pages. I’ve been digging up information on hugepages out of curiosity, and it looks like Linux has pretty good support for them. This post serves as a quick note on my reading.
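
As a rough illustration of the size argument, the sketch below counts how many page mappings (last-level page table entries) are needed to cover 1 GiB, assuming 4 KiB base pages and 2 MiB huge pages, the common sizes on x86-64. The helper name pages_needed is made up for this note.

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Number of pages (and therefore last-level mappings) needed to cover
// `bytes` of memory with pages of `page_size` bytes, rounded up.
constexpr std::uint64_t pages_needed(std::uint64_t bytes, std::uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

// 1 GiB with 4 KiB pages: 2^30 / 2^12 = 2^18 mappings.
static_assert(pages_needed(1 * GiB, 4 * KiB) == 262144, "4 KiB pages");
// 1 GiB with 2 MiB huge pages: 2^30 / 2^21 = 2^9 mappings.
static_assert(pages_needed(1 * GiB, 2 * MiB) == 512, "2 MiB pages");
```

Under these assumptions a huge page replaces 512 base-page mappings, and the same ratio applies to the number of faults needed to touch the whole region for the first time.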


Reading-Summary 2018-05

Posts I found interesting during my reading:

Writing a Time Series Database from Scratch

The author’s experience writing a time-series database from the ground up, for Prometheus.

Introducing Thanos: Prometheus at Scale

An effort to scale Prometheus with a new project, Thanos, which uses the Kubernetes sidecar pattern to read data from individual nodes, pre-process it (e.g. sampling), and submit it to centralized data storage and display.

A Beginner’s Guide To Scaling To 11 Million+ Users On Amazon’s AWS

What kind of machine or cluster you’ll need for different sizes of user base (from 1 to billions).

Netflix FlameScope

A display of CPU traces as GitHub-style tiles.

A Usable C++ Dialect that is Safe Against Memory Corruption

IT-‘No Bug’-Hare is an interesting blog I found recently, focused on systems, the C++ language, and game design. A good read for C++ fanatics and system designers.


I’m feeling guilty for not updating for so long. But on the bright side: I’m back.

As part of my work I’m taking on Golang and some small distributed system design jobs. It’s an interesting language for this kind of task: networking, systems, infrastructure, etc. I have mixed but mostly positive feelings about the language, and may share my experience when I get a chance.

Reading Summary 2017-06

It’s been a while since I last posted a reading summary, not to mention a new blog post. Writing is a time-demanding job.

Society and Technology

Why do we manage academia so badly?

“Managers want metrics that are easy to calculate, easy to understand, and quick to yield a value … metrics with these desirable properties are almost always worse than useless.”

Easy metrics are also easily “hacked”: people game the metrics to make the statistics look good, while deviating from the original purpose of academia, which is to produce good-quality research.

See also:

Every attempt to manage academia makes it worse

Did Reddit’s April Fool’s gag solve the issue of online hate speech?

An interesting, anarchic experiment on Reddit: let thousands of Redditors draw a picture all at the same time. What could possibly happen? It turned out to be surprisingly good.

Tim Berners-Lee: I invented the web. Here are three things we need to change to save it

Tim Berners-Lee, the father of the World Wide Web and a Turing Award winner, believes that today’s web has serious flaws, namely the loss of control over personal privacy, the rampant spread of misinformation, and manipulation by online political campaigns. It took everyone to build the web we have today, and it takes everyone to fix it now.

More reports and readings on Tim Berners-Lee:

Read More

Reading Summary 2016-12

C/C++

How to find size of an array in C without sizeof

The difference between arr and &arr: basically, arr decays to type int *, and &arr is of type int (*)[size].

An excellent article on the fundamentals of C/C++!
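
The two types can be checked at compile time. The sketch below is my own C++ illustration, not the article’s code: it verifies the types of arr and &arr and recovers the element count through template argument deduction instead of sizeof. The names array_length and arr are made up for the example.

```cpp
#include <cstddef>
#include <type_traits>

// The reference parameter keeps the full array type T[N], so the element
// count N can be deduced by the compiler without using sizeof.
template <typename T, std::size_t N>
constexpr std::size_t array_length(T (&)[N]) {
    return N;
}

int arr[5] = {1, 2, 3, 4, 5};

// In most expressions arr decays to a pointer to its first element...
static_assert(std::is_same<std::decay<decltype(arr)>::type, int*>::value,
              "arr decays to int*");
// ...while &arr is a pointer to the whole array of 5 ints.
static_assert(std::is_same<decltype(&arr), int (*)[5]>::value,
              "&arr has type int (*)[5]");
// The length is part of the array type itself.
static_assert(array_length(arr) == 5, "length deduced from the type");
```

Because &arr points to the whole array, &arr + 1 steps over all five elements at once, while arr + 1 steps over a single int.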

What Every C Programmer Should Know About Undefined Behavior

Some “gotchas” and pitfalls in the C programming language, and how compiler optimizations can sometimes make them worse. Long story short: steer away from undefined behavior.

This post is from Chris Lattner himself. Really nice article.

Python

Python Has Big Impact At Red Hat

Why Python is such a cool language and how Python is used at Red Hat. Most of Red Hat’s important infrastructure is written in Python, including but not limited to firewalld, yum and its successor dnf, and many cloud PaaS tools for OpenShift.


Reading Summary 2016-11

C/C++

“Effective C++” and “C++ In A Nutshell”

Finished most of “C++ In A Nutshell” and Scott Meyers’ “Effective C++”, and started to learn the basics of the C++ language. Both are great books for learning the basics of C++ and some of the fundamental problems in the language.
