I recently came across a few of Andrej Karpathy’s video tutorial series on Machine Learning, and I found them immensely fun and educational. I highly appreciate his hands-on approach to teaching the basic concepts of Machine Learning.
Richard Feynman once famously stated, “What I cannot create, I do not understand.” So here’s my attempt to create a toy implementation of gradient descent, to better understand the core algorithm that powers Deep Learning, after learning from Karpathy’s video tutorial of micrograd:
https://github.com/hxy9243/toygrad
Even though there’s a plethora of books, blogs, and references that explain the gradient descent algorithm, it’s a totally different experience when you build it yourself from scratch. During this exercise I found quite a few knowledge gaps of my own: things I’d taken for granted and never fully understood.
This blog post is my notes from that experience. Even writing it helped my understanding in many ways.
In Calculus, the chain rule gives the basic rule for finding the derivatives of composite functions.
As we learned in Calculus class, the chain rule states that for a composite function F(x) = f(g(x)):

F'(x) = f'(g(x)) · g'(x)

For example (using sin(x²) as an illustration), the derivative of f(x) = sin(x²) is f'(x) = cos(x²) · 2x.
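To make this concrete, here’s a quick numeric sanity check of the chain rule (the function sin(x²) is my own illustrative choice, not a specific example from the course):

```python
import math

def f(x):
    # composite function: f(x) = sin(x^2)
    return math.sin(x * x)

def f_prime(x):
    # chain rule: f'(x) = cos(x^2) * 2x
    return math.cos(x * x) * 2 * x

# compare against a central finite-difference approximation
x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(abs(numeric - f_prime(x)) < 1e-6)  # True
```

If the chain-rule derivative were wrong, the finite-difference estimate would disagree with it; matching to within 1e-6 is a cheap way to check your hand-derived gradients.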
With the chain rule, we can derive the autograd algorithm to implement backpropagation on complex compute graphs.
Backpropagation, or backward propagation of errors, is the algorithm for finding the derivative of the loss function (a function that computes the difference between the prediction and the actual output data).

It calculates the gradients backwards through the feed-forward network, from the last layer to the first.
There are 4 steps to the Backpropagation algorithm:

1. Forward: feed the input through the network to compute the prediction.
2. Loss: compute the loss between the prediction and the expected output.
3. Backward: propagate the gradients of the loss back through the network.
4. Update: adjust each parameter against its gradient, scaled by the learning rate.

And we repeat this 4-step process until the loss is close to the minimal value.
Conceptually, the gradients represent the slope of the loss at the current parameter values. So we subtract from each parameter a small step in the direction of its gradient (the step size decided by the learning rate). It’s a process of moving the loss closer to its minimal value, at a speed defined by the learning rate.
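The update rule can be sketched in a few lines. The quadratic loss here is my own toy example, just to show the parameters sliding down the slope:

```python
def grad(w):
    # gradient of the toy loss L(w) = (w - 3)^2, which has its minimum at w = 3
    return 2 * (w - 3)

w = 0.0    # initial parameter
lr = 0.1   # learning rate

for _ in range(100):
    w -= lr * grad(w)   # step against the gradient, scaled by the learning rate

print(round(w, 4))  # converges toward the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor, which is why a too-large learning rate overshoots and a too-small one crawls.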
There are a lot of tricks and optimizations for adjusting the learning rate, but those are outside the scope of this discussion.
AutoGrad, or Automatic Differentiation, is the core of the backward step in the backpropagation algorithm. It computes the gradients of all parameters in reverse, calculating the derivative of each parameter by applying the chain rule backwards through the compute graph.
Here’s why the AutoGrad algorithm makes sense.
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#differentiation-in-autograd
The AutoGrad algorithm can be derived from the chain rule. Assume in this simple case that w3 is the result of a computation on w2, i.e. the successor of w2 in the compute graph. For the final loss L, the chain rule gives:

∂L/∂w2 = ∂L/∂w3 · ∂w3/∂w2
When we need the gradients of all the parameters, we just apply this chain rule backwards through the computational graph. The gradient of each intermediate variable, ∂L/∂w_i, is the sum of the gradients contributed by all of its successors in the graph.
With this chain rule in mind, we start with the final result of the equation and work backwards through the compute graph. The final result has a seed gradient value of 1 (since ∂L/∂L = 1), and we back-propagate the gradients to all intermediate parameters.
See more at: https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation
The core of this autograd engine design is the `Value` class, which can express both the feed-forward computation and the backward propagation of gradients. The key design idea is that each `Value` records its operands and the local gradients of its operation. For example, in the case of multiplication:
```python
class Value:
    ...
```
Calling `backward()` on the final result of the compute graph first assigns a gradient of 1 to the final value, then computes the gradients of all its operands. We can do it in two steps: topologically sort all the values in the compute graph, then call the `backward()` function on each value in that order, updating the gradients of all parameters.

With the `Value` engine designed, we can now chain these value computations to form a feed-forward neural network with dense layers, where all Values are connected to the next layer’s values.
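The steps above can be condensed into a sketch in the spirit of micrograd. This is my own minimal version (only `*` and `+`), not Karpathy’s exact code:

```python
class Value:
    """A scalar value that records its compute graph for backprop."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # fills in local gradients when called
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # local derivatives: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # addition passes the gradient through unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # step 1: topological sort of the compute graph
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)

        # step 2: seed gradient of 1, then propagate in reverse order
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a       # c = a*b + a
c.backward()
print(a.grad, b.grad)  # 4.0 2.0  (dc/da = b + 1, dc/db = a)
```

Note how the gradients accumulate with `+=`: `a` appears twice in the graph, so it receives contributions from both of its successors, exactly the “sum over successors” rule described above.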
We’ll also need code for running the `forward()` and `backward()` steps on the neural network after feeding it the training input and output data. Here’s an outline of the code that describes the steps of model definition and training:
```python
class Model(Module):
    ...
```
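Only the first line of the original outline survives here, so below is a hedged, self-contained sketch of the same idea with manual gradients for a single linear neuron. The `Neuron` class and toy dataset are my own illustration, not the original code:

```python
import random

class Neuron:
    """A single linear neuron trained with plain gradient descent."""

    def __init__(self):
        self.w = random.uniform(-1, 1)
        self.b = 0.0

    def forward(self, x):
        return self.w * x + self.b

# toy dataset for the target function y = 2x + 1
data = [(x, 2 * x + 1) for x in [-2, -1, 0, 1, 2]]

model = Neuron()
lr = 0.05
for epoch in range(500):
    grad_w = grad_b = 0.0
    for x, y in data:
        # forward pass: prediction error for mean squared error loss
        err = model.forward(x) - y
        # backward pass: accumulate gradients of the loss
        grad_w += 2 * err * x / len(data)
        grad_b += 2 * err / len(data)
    # update step: move parameters against the gradient
    model.w -= lr * grad_w
    model.b -= lr * grad_b

print(round(model.w, 2), round(model.b, 2))  # ≈ 2.0 1.0
```

The same forward / loss / backward / update loop is what a real model runs, just with the gradients computed by the autograd engine instead of by hand.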
If you’ve read this far, I highly recommend the video tutorial of micrograd from Andrej Karpathy himself, along with his implementation. He explains it far better than I could.
Also, let me know if you find any problems in this blog post or in my implementation:
https://github.com/hxy9243/toygrad
Previously: Paper Readings on LLM Task Performing
I’m still combing through the papers on LLMs and I’ll summarize my readings here on this blog. There’s an increasing amount of interest and attention in this field, and hence there will be many more papers in the foreseeable future. Hopefully I can squeeze out more time to read and share my experience with all of you.
https://arxiv.org/abs/2301.00303
This early paper (early as in the LLM universe: Jan 2023) laid the foundation of the RAG (Retrieval-Augmented Generation) design for LLM architectures. The idea: since LLMs are weak in factual reasoning, we can insert related information as context in the prompt, so the generation draws from the given context instead of the model’s own memory, which is prone to errors.
The paper uses CoT (Chain-of-Thought) to break down the input question, and retrieves supporting information from the data source (Wikipedia and Wikidata in this case) for each step of the reasoning.
The paper does not use chunking and embedding-based document preprocessing and retrieval, but a more traditional and well-tested information retrieval algorithm, BM25.
https://arxiv.org/abs/2203.11171
Self-Consistency is like an ensemble method: it uses CoT (Chain-of-Thought) reasoning to generate a diverse set of reasoning paths, then votes to find the most consistent answer.

This requires the problem to have an answer that is clear to define and easy to compare, as with most mathematical and arithmetic questions.
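The voting step itself is just a majority count over the sampled final answers. A toy sketch (the sampled answers are made up for illustration):

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most frequent final answer among sampled reasoning paths."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes

# pretend we sampled 5 CoT reasoning paths for an arithmetic question;
# these are the final answers they produced (made up for illustration)
sampled = ["18", "18", "26", "18", "26"]
print(self_consistency(sampled))  # ('18', 3)
```

This is why the answers need to be easy to compare: voting only works when two reasoning paths that reach the same conclusion produce literally identical final answers.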
https://arxiv.org/abs/2305.10601
Tree-of-Thought is an extension of the ideas of Chain-of-Thought and Self-Consistency. Instead of a single, linear chain of reasoning, each step of ToT branches out to multiple probable answers.
This is helpful for problems that require some exploration, e.g. the Game of 24, creative writing, or mini-crosswords.

This approach requires the problem to be explorative in nature, with clear definitions of success to filter out bad outputs.
It reminds me of dynamic programming in traditional programming, and the problem sets are even similar.
https://arxiv.org/abs/2205.10625
As I understand it, this is a prompt-engineering way of saying:

Least-to-most decomposition can be described as guiding the LLM to think: “To solve this problem, we’ll need to know XXX first.”
https://arxiv.org/abs/2210.02406
Similarly, Decomposed Prompting is the idea of dividing a problem into more structured subproblems, possibly even using external tools (like retrieval) to complete the answer.
What’s interesting about this paper is how it divides the problem into sub-problems more methodically (e.g. with iterative, recursive, splitting, or external information-retrieval decompositions). Sub-problems can be more clearly defined, and merging them is cleaner.

The paper uses few-shot example prompting to guide the LLM to decompose the problem.
https://arxiv.org/abs/2210.03493
The observation this paper makes: with more diverse examples for few-shot learning, the LLM is more likely to be correct. So the paper optimizes the example-creation stage to provide more diverse examples.

Auto-CoT creates examples from a dataset of existing questions without known-correct QA pairs. It’s designed to reduce the impact of wrong answers in the examples or zero-shot results.
Auto CoT creates examples for in-context learning. It selects from a pool of diverse questions to create examples to maximize correctness. It achieves it by:
https://arxiv.org/abs/2206.02336
This paper has a very similar idea to the “Self-Consistency” paper: it breaks the problem into steps, each step has diverse outputs, and it performs voting to decide the most correct output for the next step.

This paper uses a voting verifier to vote for the most plausible next step. The voting verifier is trained on a dataset of multiple reasoning paths.
Recently I looked into the LangChain project, and I was surprised that such a powerful and mature project could be built in such a short span of time. It covers many essential tools for creating your own LLM-driven projects, abstracting cumbersome steps into only a few lines of code.
I like where the project direction is going, and the development team has been proactively including and introducing new ideas of the latest LLM features in the project.
The path to understanding this new project wasn’t really smooth. It has its own opinions on code organization, and it can be unintuitive to figure out how to hack up your own projects beyond the tutorials. Many of the tutorials out there explain how to create a small application with LangChain, but don’t cover how to intuitively comprehend the abstractions and design choices.
Hence I’ve documented my personal cognitive process throughout this journey. By doing so, I aim to clarify my own understanding while also providing assistance to y’all who are interested in hacking LangChain for fun and profit.
This blog post is dedicated to an overall understanding of the concepts. I found it really helpful to start with the concepts that directly interact with the LLM, especially the core API interfaces. Once you have a mindmap of all the LangChain abstractions, it’s much more intuitive to hack and extend your own implementation.
I’ll be covering the very basic concepts around Chain and Agents:
What I’m not covering in this blog post (they’ll be for another blog/discussion):
Pick the right ones, and programming will flow naturally from design;
Pick the wrong ones, and programming will be a series of nasty surprises.
– MIT Professor Daniel Jackson on Abstraction in Software, in his book “Software Abstractions”
Chains are the basic way of organizing actions, extending LLM capabilities, and integrating different Chain actions together. You can think of one as a “chain” of actions grouped together.
The interface of the base Chain class:
```python
class Chain:
    ...
```
Once you understand this, it’s pretty clear how to extend the Chain. You’ll need to define the `_call()` method (or `_acall()` for asynchronous calling, but I’d like to skip those for now). The `__call__()` and `run()` methods are really just wrappers around this core method that preprocess input parameters.
Sometimes it can be confusing that there are so many different ways of calling the same Chain. But think of `_call()` as the core functionality you define as the developer (it receives nice, preprocessed input parameters), and `__call__()` or `run()` as the interface for the users of your project, which takes more flexible input.

With this interface, you can extend the functionality by “chaining” Chains together in a link. The output of the previous chain becomes the input keys to the next.
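To make the calling conventions concrete, here’s a toy sketch of the pattern in plain Python. This is not LangChain’s actual code, just the shape of the interface; the class and key names are my own:

```python
class Chain:
    """Toy version of the Chain interface: _call() is the core logic,
    __call__()/run() are thin wrappers that normalize inputs."""

    input_keys: list = []
    output_keys: list = []

    def _call(self, inputs: dict) -> dict:
        raise NotImplementedError

    def __call__(self, inputs: dict) -> dict:
        # the real library also validates keys and fires callbacks here
        return self._call(inputs)

    def run(self, text: str) -> str:
        # convenience wrapper for single-input, single-output chains
        out = self({self.input_keys[0]: text})
        return out[self.output_keys[0]]

class UpperChain(Chain):
    input_keys, output_keys = ["text"], ["upper"]
    def _call(self, inputs):
        return {"upper": inputs["text"].upper()}

class ExclaimChain(Chain):
    input_keys, output_keys = ["upper"], ["result"]
    def _call(self, inputs):
        return {"result": inputs["upper"] + "!"}

# "chaining": the output keys of one chain feed the input keys of the next
step1 = UpperChain()({"text": "hello"})
step2 = ExclaimChain()(step1)
print(step2["result"])  # HELLO!
```

The key observation is that a chain composes with the next one purely through dictionary keys, which is why matching `output_keys` to the next chain’s `input_keys` is all the “linking” there is.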
See examples in: https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/02-langchain-chains.ipynb
LLMChain is a special type of chain that wraps around the underlying LLM generative engine, and it’s the most commonly used Chain for direct use and extensibility. You can extend it with any special features you want, or even “chain” instances together to perform a pipeline of actions with the LLM.
The interface for LLMChain is simple. See more at source code.
```python
class LLMChain:
    ...
```
LLMChain extends the original Chain by defining:
You can either directly call it, or use it to build more specialized Chains.
```python
chain = LLMChain(llm=llm, prompt=prompt)
```
See more at the “Template” section to understand prompt Template.
You can extend the Chain to accomplish anything that takes inputs and produces an output. Think of it as a task: you can use it for text preprocessing, or even parsing.
A Chain doesn’t necessarily have to involve interacting with LLMs. It can be any task you find useful when implementing the whole task pipeline.
See examples in section “Generic Chains”:
In the example, a TransformChain does just a regex transformation to remove white spaces. You can use it together with other Chains to create a pipeline of transformation -> rewrite, using `SequentialChain` to link them together.
```python
sequential_chain = SequentialChain(
    ...
)
```
Once you understand Chains, you can build powerful pipelines of chains in LangChain (hence the name). There are chains that:
See more Chain examples on Github: https://github.com/hwchase17/langchain/tree/master/langchain/chains
I was quite baffled by the idea of prompts and templates when I was first exposed to LangChain. But the idea is actually quite simple. It’s the same as any templating: you define a template text, and interpolate it with text variables.
The most common use case of a prompt template is to create the outline of the input to the LLM, which you can customize through variables. That’s it. That’s how simple it is.

One common use case for Templates, as mentioned above, is to format the final LLM prompt. This is very useful in Agents, where you make multiple queries to the LLM and want to define the prompt with different intermediate steps at each iteration.
And let’s take a look at the concept of Agents.
One of the most powerful applications of LLMs is tool use. The Agent provides an abstraction for choosing from a toolbox to solve more open-ended and complicated questions.
According to LangChain’s official documentation:
Some applications require a flexible chain of calls to LLMs and other tools based on user input. The Agent interface provides the flexibility for such applications. An agent has access to a suite of tools, and determines which ones to use depending on the user input. Agents can use multiple tools, and use the output of one tool as the input to the next.
See more at:
```python
class Agent:
    ...
```
That’s it. That’s the interface for Agent.
The first step to understanding the Agent is to strip away the complicated tool-use features and look at the interface.

An Agent is an automatic actor that makes a “plan” based on each step of the LLM output. You can add more features to create a full-featured, complete Agent that runs actions for you: Tools for tool use, PromptTemplates to build prompts, Parsers to parse output.
To create your own LangChain Agents, you just need to worry about making the plans (e.g. handling input, creating prompts, parsing outputs, and returning results).
To illustrate the interface for Agents, I’ve created a very simple implementation of a dummy agent that executes whatever tool you define exactly 3 times.

In this example, the plan is: return an `AgentAction` 3 times with whatever tool is given, then return `AgentFinish`.
```python
class DummyAgent(BaseSingleActionAgent):
    ...
```
(See the snippet on Gist. Also, I’ve just started a small side project that hacks on Agents. See more on Github.)
A Tool is an interface that interacts with other environments. The interface is really simple too, with `run` (or the asynchronous `arun`).
Tools can be any action external to the LLM, e.g. calculators, search engines, SQL execution, document or data loaders, or anything with an API. A Tool can even be another Chain!
Its interface is also simple. Similarly, you’ll just need to define the inputs, outputs, and what to run.
```python
from langchain.tools.base import BaseTool
```
Sometimes it can be confusing, as Tools can be initialized in different ways:
```python
# initializing by setting the name, description, and a callable function
```
But the idea is the same. Remember, it’s just syntactic sugar for creating a `Tool` with a name and description, whose `_run` step calls the func.
A Tool’s function can be an API call (e.g. calculator, search, load text, …), or it can invoke other Chains. It’s flexible like that, and you can reuse Chains or even Agents as Tool functions. In this way, one `Agent` can invoke other `Agent`s.
To get an idea of how Agents come about and some of the fundamental ideas on LLM task performing, see my other blog on a list of papers I found useful in understanding LLM reasoning.
AgentExecutor is also a Chain: it has exactly the simple interface of a Chain (input, output, and the action), and its job is to wrap everything about Agents together.
There are quite a few syntactic sugars provided by the LangChain library, like `initialize_agent`. But remember, it doesn’t return an Agent, but an AgentExecutor, which has the interface of a Chain.
```python
from langchain.agents import initialize_agent
```
(Example from: https://archive.pinecone.io/learn/langchain-agents/)
The `Agent` class abstracts the most essential part of the agent behavior: how it “plans” each step based on input and intermediate results, and how it decides what actions to take or whether to finalize the execution.
The `_call()` implementation of an AgentExecutor Chain wraps it all up:
All this grunt work is implemented in AgentExecutor, so that we can ignore it and focus on the interesting part, the actual planning, like creating a ReAct Agent that performs tasks based on the tools given.
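The executor loop can be sketched in plain Python. This is my own simplification, not LangChain’s implementation; the `plan` callback and the dict-based step format are made up for illustration:

```python
def agent_executor(plan, tools, max_steps=5):
    """Toy version of the AgentExecutor loop: ask the agent to plan,
    run the chosen tool, feed the observation back, until it finishes."""
    intermediate = []   # (action, observation) pairs so far
    for _ in range(max_steps):
        step = plan(intermediate)           # the agent decides the next step
        if step["finish"]:
            return step["output"]
        tool = tools[step["tool"]]          # look up the chosen tool
        observation = tool(step["input"])   # run it
        intermediate.append((step, observation))
    return "stopped: max steps reached"

# a dummy agent that calls "echo" once, then finishes with the result
def dummy_plan(intermediate):
    if not intermediate:
        return {"finish": False, "tool": "echo", "input": "hi"}
    return {"finish": True, "output": intermediate[-1][1]}

print(agent_executor(dummy_plan, {"echo": lambda s: s + "!"}))  # hi!
```

Everything except `plan` here is the grunt work the executor owns: tool lookup, execution, bookkeeping of intermediate steps, and the step limit.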
And yes, `AgentExecutor` is a `Chain`, so it can be used with other Chains, or as a Tool for other agents.
See another example in the same Pinecone tutorial mentioned above:
```python
# initializing by setting the name, description, and a callable function
```
`llm_math` is an AgentExecutor that wraps the “llm_math” Agent; it’s a Chain whose `run()` interface invokes the Agent.
Clear enough?
LangChain already provides a rich library of Agents that can perform interesting work, like reading CSV data, managing files, calling APIs, etc.
See: https://github.com/hwchase17/langchain/tree/master/langchain/agents/agent_toolkits
Once you understand all these pieces, you can assemble everything together to make your own Agent.
There are two papers behind the implementations. I’ve also mentioned them in my previous blog:
LangChain has its own implementation, `ZeroShotAgent`.
If you think this is helpful, I’ll keep exploring and write what I found about LangChain and NLP + LLM in general. Hope this helps your understanding!
I’ve spent the last couple of months reading about developments in AI and NLP in general ever since the release of ChatGPT. Here are some of my personal findings, specifically on task-performing capabilities.
The field of AI has been advancing rapidly, and the results have exceeded expectations for many users and researchers. One particularly impressive development is the Large Language Model (LLM), which has demonstrated a remarkable ability to generate natural, human-like text. Another exciting example is ChatGPT from OpenAI, which has shown impressive task-performing and logical reasoning capabilities.
Looking ahead, I am optimistic that LLM will continue to be incredibly effective at performing more complex tasks with the help of plugins, prompt engineering, and some human input/interactions. The potential applications for LLM are vast and promising.
I’ve compiled a list of papers on extending the task-performing capabilities in this field. I’m quite enthusiastic about the longer-term potential of these capabilities to bring LLMs like ChatGPT into more powerful applications.

Here’s my first list of papers, the ones I consider more fundamental, along with my very quick summaries.
From OpenAI: one of the papers that gave ChatGPT its capability to be super useful by following human orders. Instead of simply generating copycat-like text, LLMs now have the capability to be genuinely useful.
GPT-3 generation:
```
Explain the theory of gravity to a 6 year old.
```
InstructGPT generation:
```
People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.
```
Later, OpenAI leveraged RLHF to teach LLMs to perform various tasks (summarization, QA, rewriting, generation, ideation, brainstorming, etc.), opening the door to new opportunities.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The paper introduced the power of the phrase “Let’s think about it step by step” for LLMs. While LLMs are generally good at solving small, straightforward math problems, they still struggle to reason through longer or more convoluted logic. By breaking the original problem into small pieces, LLMs achieve much better performance at each step and on the final answer.
This creates the opportunity to break user problems into smaller steps and solve more complicated problems.
https://arxiv.org/abs/2302.04761
A ground-breaking paper from Meta. Instead of following limited text-based tasks, the Toolformer paper created a dataset of external tool use and fine-tuned LLMs on it.

Now the LLM has APIs and input parameters: it can invoke APIs, perform more complicated tasks, and interact with the rest of the world, like search, calculation, posting to Twitter, or potentially anything with an API.
https://arxiv.org/abs/2205.00445
MRKL (pronounced “miracle”) is an interesting paper on leveraging external tools by generating a tool name and tool inputs as operands.

MRKL shares similar ideas with Toolformer and focuses more on generating the correct operands for the LLMs.
Users can automate a plethora of tasks by combining LLM and calling external tools.
https://arxiv.org/abs/2112.00114
This paper extends the idea of Chain-of-Thought by writing down each step of reasoning in a clearer, more verbose fashion, as if providing the LLM a piece of scratch paper for intermediate results. Unsurprisingly, this improves math-solving performance significantly.

Combined with the ReAct paper: create iterative QA loops to solve complex problems with LLMs by breaking them down into small problems.
https://arxiv.org/abs/2210.03629
The ReAct paper combines Chain-of-Thought reasoning and task-performing or text-generation actions into a single loop (REasoning + ACTions). It guides the LLM to iteratively solve a given problem by breaking it into a loop of:
```
Thought: the thought process with information from previous loops.
Action: the action/tool to take in this step.
Action Input: the input to the action.
Observation: the result from running the action.
```
With this loop, the paper achieves much higher accuracy than previous simple Chain-of-Thought efforts.
https://arxiv.org/abs/2210.03350
The paper introduces what can best be summarized as the “Self-Ask” method. At each loop of problem solving, the LLM is prompted to ask itself: “Are there any follow-up questions?” The LLM examines its own logic and decides if the solution is satisfactory. If not, it keeps iterating until a good answer is reached.

The paper also examines how best to compose all the sub-problems into a single answer and what the gaps are in achieving that.
And you can put them all together with the magical LangChain project. LangChain integrates all the above-mentioned tool-performing ideas into a single project, as can be seen from its source code and documentation:
LangChain provides abstractions for many essential LLM concepts like `Embedding`, `Vector Search`, `Chain`, and `Agents`. Combined together, these individual tools can make powerful applications that perform actions.
There are a number of good resources on LangChain out there:
As we look to the future, the potential of LLMs is limitless. I’ll keep watching for the most interesting and exciting research advances that bring more intelligence into AI systems. Stay tuned.
Using a generator might save you much time kick-starting your web app project. Most importantly, I found that a good, well-defined, consistent API definition is crucial to your development, testing, and communication among teams and customers. I highly recommend that for any sizable project, you spend some quality time writing a good API spec. It’ll become essential to your development workflow. I used to highly doubt this, and now I don’t think I can live without it.
If you manually keep documentation or API specifications in sync with your Go code, you’ll have a hard time reviewing, checking, and testing between code and specs. The best way, IMHO, is to automate the process by either generating the API code from the spec, or the other way around. Many tools support one of these approaches, and `openapi-generator` is one of the really nice tools that I’m going to introduce in this blog post.
OpenAPI Generator supports many languages on the server as well as on the client side, and it has generators for different Go frameworks. Here I’m going to use the `go-server` generator as an example. It uses the Gorilla framework for the server-side code.
For this blog post I’ve also made an example of code generation in my Github repo. I’ve generated the code and implemented only one endpoint, `/books`, with example data:
https://github.com/hxy9243/go-examples/tree/main/src/openapi
The OpenAPI format (previously named Swagger) is a way of documenting your REST API endpoints, yet it goes way beyond simple documentation. You can use it to generate pretty HTML documentation, use it for interactive debugging or automatic testing, and, in this case, for code generation.
See openAPI official documentations and tutorials: https://swagger.io/specification/.
To create a new openAPI definition, you can start with a YAML or JSON file for the data, starting with metadata information for your APIs, including title, version, etc.
An example openAPI 3.0 API definition may look like this:
```yaml
openapi: 3.0.0
```
The most important field is `paths`, where you define the endpoints for the different paths of your API, along with request parameters and response data models.
For example, the following path defines the `/users` endpoint, with the status-200 response format being an array of the “User” data model, which is defined in a separate file `user.yaml` under `components/schemas/users`. OpenAPI allows referencing other files, which makes dividing and organizing specifications a lot easier.
```yaml
/users/:
```
And the example `user.yaml` file contains the specification of the user format:
```yaml
components:
```
And you can fill in all the other endpoints, methods, and data models to define the complete API for your application.
Visual Studio Code has extensions to help you better write, lint, and format your openAPI spec, e.g.: https://marketplace.visualstudio.com/items?itemName=42Crunch.vscode-openapi.
Also see the official OpenAPI webpage https://spec.openapis.org/oas/latest.html# for the full specification to learn the finer details, including but not limited to data types, security, model formats, etc.
With a correctly defined OpenAPI specification, you can now generate server code with `openapi-generator`.
Example command to generate Go code:
```shell
openapi-generator-cli generate \
```
The code will be in the output directory `server`, and the source code folder will be `server/openapi`.
Generated Go code will contain the following files:

- `api_default_service.go`: the default service implementation, where all endpoints return an `Unimplemented` error. The service defined here might be the most important struct: it’s the one you’ll extend to fill in your own implementation.
- `api_default.go`: the default API routing and controller that wraps around the service implementation mentioned above.
- `api.go`: defines the router and servicer interfaces.
- `error.go`, `helpers.go`, `impl.go`, `logger.go`: helper files for error definitions, responses, logging, etc.
- `model_*.go`: data model struct definitions from your model definitions, e.g. the `model_user.go` file generated from the above example, with the `User` Go struct.

Voila! Now that you’ve generated the service code as a library, all you need to do is start filling in your own implementation, and then start your own HTTP server in Go.
After the code is generated, you can create your own library implementing the service by extending the `DefaultApiService`.
Your endpoints turn into functions like this, with parameters (if any) in the function parameter list, returning a response and an error.
```go
// GETUsers -
```
Although you can start editing the generated file directly (as suggested in the generated code), my personal favorite way is to inherit and override the functions with your own.
You can import the generated `openapi` library into your own implementation, and embed it in your own service with the exact same function signatures:
```go
import "github.com/xxx/xxx/server/gen/openapi"
```
The data models will be generated as structs in Go. For example, the user example we had will be generated as follows, with the correct data type mapping:
```go
type User struct {
```
The type “string” maps to a Go string, and Birthdate, with type “string” and format “date-time”, is turned into Go’s `time.Time`.
You can see more details about the supported types for this generator at: https://openapi-generator.tech/docs/generators/go-server/.
After that, you can start your own HTTP server by passing the Router as the generated HTTP handler, with the default controller and router from the generated openapi library code:
```go
service := NewMyWebAppService()
```
Overriding the default controller can also give you more flexibility if necessary, e.g. adding your own logging, tracing, or other middlewares.
The generator tool can also generate HTML pages, for your users, customers, or other teammates. e.g.:
```makefile
doc: api/*.yaml
```
It’ll give you a well-rendered HTML project that documents all your API endpoints:
One other documentation tool I found really handy for OpenAPI is the redoc tool. It can generate a very pretty HTML page for your API specifications.
Redoc-cli: https://redocly.com/docs/redoc/deployment/cli/
There are also some gotchas I bumped into while researching and using the openapi-generator:
You can reuse definitions with the OpenAPI reference symbol `$ref` and divide your definitions into different files, which saves much repetitive work and makes organization easy.
Customize the variable name of the output with `title`: you can define your own Go struct name by adding the `title` field in the data model definition.
The generated code is not immediately usable; you have to call `goimports -w` to automatically fix the formatting. Fortunately, it’s easy to do by adding a few lines to the Makefile:
```shell
for f in gen/openapi/*.go; do goimports -w $f; done
```
You can define different integer types in Go by specifying the format of an integer field, e.g. a field with type “integer” and format “uint64” will be generated as `uint64` in Go.
The following projects also support OpenAPI for web API development in other languages, with some generating code just like this one.

Or, if you want to keep your current development workflow, some projects allow generating OpenAPI specs from your own code, e.g. FastAPI, the encore project, etc.
Below are my notes from a very quick reading of its source code, hopefully providing anyone interested in data systems a quick glimpse of what’s under the hood. It’s not meant as an exhaustive, deep analysis.
BBoltDB is a local, embedded database backed by a single file `mmap`ped into memory, meaning it’s backed by only one file, and it’s meant to be built into other data systems as the storage engine. The data file is locked by the application, with only one process accessing it at a time.
There’s no write-ahead log. The storage engine uses a B+ tree for managing storage, and all operations are written to the database file.
Reading source code from v1.3.5 release:
https://github.com/etcd-io/bbolt/tree/v1.3.5
BBoltDB is a key-value store, so its data format is as simple as key-value pairs, organized into "Buckets". Buckets provide namespaces for ranges of key-value pairs, and they can be nested.
Using BBoltDB in your application is as simple as importing its library in your Golang project and opening a database. See more at BBoltDB.
```go
import (
	bolt "go.etcd.io/bbolt"
)
```
DB operations are carried out in transactions. For example, a Read-write transaction:
```go
err := db.Update(func(tx *bolt.Tx) error {
	// ...
	return nil
})
```
Read-only transactions:
```go
err := db.View(func(tx *bolt.Tx) error {
	// ...
	return nil
})
```
See more at:
https://github.com/etcd-io/bbolt#read-write-transactions
And for more features like range query, auto-incrementing:
https://github.com/etcd-io/bbolt#autoincrementing-integer-for-the-bucket
BBoltDB organizes `mmap`ped memory as pages. The first two pages save metadata for the database, like version info, configurations, etc. The third page saves the freelist of pages, which holds the addresses of memory that's not allocated yet. The rest of the pages store the actual data.
See source code from https://github.com/etcd-io/bbolt/blob/v1.3.5/db.go#L178.
When opening the database, BBoltDB carries out the following major steps:

It locks the file (`flock`) to restrict access to a single process, then `mmap`s the file into memory. BBoltDB manages the mapped memory in pages of 4KB size, and it keeps all the pages besides the meta pages in a freelist for allocation. The freelist can be of the default array type, or a map type by option.

It initializes the meta pages at the time of the first start. They contain metadata about the database, including the freelist info.
See at https://github.com/etcd-io/bbolt/blob/v1.3.5/db.go#L426.
The freelist is initialized either by loading the metadata from its page, or by scanning through all the free pages in the DB and collecting their ids.
Allocation uses `sync.Pool` for single pages, or allocates memory for longer contiguous runs of pages. All allocation is backed by a page (possibly a multiple of the page size) with a page id.
By default, the freelist of pages uses an array implementation, and multi-page allocations are guaranteed to be contiguous.
The implementation is in `arrayAllocate()`: https://github.com/etcd-io/bbolt/blob/v1.3.5/freelist.go#L107
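The contiguous-run search of the array freelist can be re-created in a few lines. This is a simplified sketch of the idea, not bbolt's exact code:

```go
package main

import "fmt"

// freelist is a simplified array-based freelist: a sorted list of free
// page ids.
type freelist struct {
	ids []uint64
}

// allocate scans for n contiguous page ids, removes them from the
// freelist, and returns the starting id (0 means no fit was found),
// mirroring the idea of bbolt's arrayAllocate.
func (f *freelist) allocate(n int) uint64 {
	if len(f.ids) == 0 {
		return 0
	}
	var initial, previd uint64
	for i, id := range f.ids {
		if previd == 0 || id-previd != 1 {
			initial = id // start of a new contiguous run
		}
		if (id-initial)+1 == uint64(n) {
			// Found a run of n contiguous pages: remove it from the list.
			start := i + 1 - n
			f.ids = append(f.ids[:start], f.ids[i+1:]...)
			return initial
		}
		previd = id
	}
	return 0
}

func main() {
	f := &freelist{ids: []uint64{3, 4, 5, 6, 9, 10}}
	fmt.Println(f.allocate(3)) // allocates the run starting at page 3
	fmt.Println(f.ids)
}
```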
Once a page is allocated, it's materialized in memory as a node, which represents a node in the B+tree.
The B+tree is the higher-level abstraction over the memory pages in BBoltDB. The database manages key spaces via the B+tree and its nodes. The nodes are initialized from memory pages from the freelist, and organized as a B+tree. Each node has its own index key, with a list of inodes (for internal nodes) that save the actual keys and values as bytes.
See https://github.com/etcd-io/bbolt/blob/v1.3.5/node.go
```go
// node represents an in-memory, deserialized page.
```
The `node.go` file contains the whole B+tree implementation, from indexing, reading/writing, spilling, splitting nodes, and rebalancing, to deletion.
Each transaction writes the node data to the actual memory backed by the page. And if there is more data than the configured size of the node, the node will spill data to a new node, recursively splitting the parents if necessary.
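The split behavior can be sketched with a toy `node` type. bbolt's real logic in `node.go` splits on byte sizes and a fill percent rather than entry counts; this count-based version only shows the shape of the idea:

```go
package main

import "fmt"

// inode holds one key/value entry inside a node.
type inode struct {
	key, value []byte
}

// node is a toy B+tree node holding a list of inodes.
type node struct {
	inodes []inode
}

// maxInodes is an arbitrary per-node capacity for this sketch.
const maxInodes = 4

// split returns the node itself and, if it overflowed, a new sibling
// holding the second half of its entries.
func (n *node) split() []*node {
	if len(n.inodes) <= maxInodes {
		return []*node{n}
	}
	mid := len(n.inodes) / 2
	sibling := &node{inodes: n.inodes[mid:]}
	n.inodes = n.inodes[:mid]
	return []*node{n, sibling}
}

func main() {
	n := &node{}
	for i := 0; i < 6; i++ {
		n.inodes = append(n.inodes, inode{key: []byte{byte('a' + i)}})
	}
	parts := n.split() // overflowed: splits into two halves
	fmt.Println(len(parts), len(parts[0].inodes), len(parts[1].inodes))
}
```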
All R/W operations in BBoltDB are managed by transactions:

Begin a transaction with `tx = db.Begin()`. Perform writes with `tx.write()`; see https://github.com/etcd-io/bbolt/blob/v1.3.5/tx.go#L514. Commit with `tx.Commit()`.

An example from the user's perspective, from the BBoltDB README:
```go
// Start a writable transaction.
tx, err := db.Begin(true)
if err != nil {
	return err
}
defer tx.Rollback()
```
BBoltDB has a capability that is simple (compared to other database systems) yet powerful, supporting operations on huge volumes of data in many well-tested online systems at many companies. Reading the source code gives a hint of how a simple B+tree-based key-value store works, though I've only scratched the surface. If you're willing to maintain or extend BBoltDB, this could be a first step toward understanding its detailed design.
Cassandra is always considered to be favoring the "AP" in the "CAP" theorem, where it guarantees eventual consistency for availability and performance. But when really necessary, you can still leverage Cassandra's built-in "Lightweight Transaction" for elections to determine a leader node in the cluster.
Basically, it works by writing to a table with your own lease:
```sql
INSERT INTO leases (name, owner) VALUES ('lease_master', 'server_1') IF NOT EXISTS;
```
The `IF NOT EXISTS` triggers Cassandra's built-in Lightweight Transaction and can be used to reach consensus among a cluster. With a default TTL in the table, this can be used for lease control, or master election. For example:
```sql
CREATE TABLE leases (
```
So that the lease owner needs to keep writing to the lease row for heartbeats.
I'm not sure about the performance characteristics of Cassandra's election behavior compared with other systems (etcd, ZooKeeper, …), and it'd be interesting to see a study. But since those are already more full-featured and better understood at keeping consensus, I'd recommend delegating this behavior to them unless you're stuck with Cassandra for your application.
One great use case of Cassandra is saving logs and timeseries data. But what if you want to automatically drop stale data without populating tombstones in Cassandra? Removing and updating data frequently may actually cause problems in Cassandra.
Cassandra team developed a very useful strategy to just handle this situation. It’s called TWCS (Time Window Compaction Strategy). And it works by grouping your timeseries data into chunks (in the same SSTable) and directly dropping them when their TTL is reached, instead of generating new tombstones. Check out this blog for use cases and details.
You can then create a table with these flags enabled:
```sql
-- creating table compacting data every day, with 7 days TTL and TWCS
```
These are some neat optimizations you can do when saving time-series data with a deadline in mind.
Interestingly enough, Cassandra can get grumpy when you try to man-handle its membership. For example, during our development and testing, we encountered an issue where the Cassandra cluster was reluctant to accept a new node when another node was already down. The logs from the node show:
```
CassandraDaemon.java:465 - Exception encountered during startup
```
It turns out that Cassandra needs to move the data consistently to the new node. And when one node is down and Cassandra cannot form a quorum for the data with one node missing, it’ll be reluctant to hand the potentially broken data to the newcomer.
Here's also an interesting blog about replacing a dead Cassandra node and all the surprises along the way. The lesson: managing Cassandra membership can be harder than you think, so it might be a good idea to read the manual.
In short, if you don’t understand Cassandra, it’ll give you surprises.
When we started building our application, we had it automatically create new tables on demand. It worked well for a while, and then we kept hitting this weird error:
```
Caused by: org.apache.cassandra.exceptions.ConfigurationException:
```
It turns out Cassandra has long had this problem of race conditions when creating Column Families (a.k.a. Cassandra's tables).
After searching the Internet, our conclusion was simple: do not attempt to dynamically create tables in a distributed system in the first place. We redesigned our application and schema, and the problem went away.
It’s not from our own experience, but I still feel like it’s worth sharing. When not careful, Cassandra’s Quorum read/write can still result in dirty data in very special cases. Due to its design, Cassandra can have some pretty complex steps to delete data!
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Rule of thumb from this experience: repair time <= gc_grace_seconds, so that repair propagates tombstones before GC cleans them up. Still, it's recommended that Cassandra clusters be constantly repaired.
https://docs.datastax.com/en/archived/cassandra/2.1/cassandra/operations/opsRepairNodesWhen.html
Here’s another interesting case for deletion in Cassandra causing headaches and surprises, due to tombstone hurting performance. It’s from Discord’s Experience:
We noticed Cassandra was running 10 second "stop-the-world" GC constantly but we had no idea why. We started digging and found a Discord channel that was taking 20 seconds to load. … To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel. If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
Basically if you are not careful, deletion in Cassandra could actually be a part of the query burden due to the tombstones. Understanding Cassandra’s behavior is essential to operation at its best performance.
Cassandra, as an open-source NoSQL database, has gained popularity in cloud and big data applications. Inspired by DynamoDB, it has good latency, tunable consistency, easily achievable scalability, and high availability with a cluster setup.
Our team's been using Cassandra as the backend for an application we've been shipping to customers. We chose it for its high-availability setup and good performance. We use it to store time-series data and some simple configuration data as key-value pairs, so it felt like a natural choice. And in our experience over time, it has proven to be highly capable at serving our purposes.
With impressive availability, scalability, and read/write performance, Cassandra also comes with its limitations. We cannot design data models the same way we did with traditional relational databases with a SQL interface. And it doesn't come with many of the guarantees of traditional databases, like strong consistency, transactions, cascading deletion, etc. Like other NoSQL databases, Cassandra was designed to optimize batch write operations with good read and write latency. It fits applications without too many update/delete operations, especially ones without high volumes of transactions.
So the best use cases for Cassandra can be:
I hope this blog is useful if you're starting off a new application or module and evaluating databases of choice. It starts with an overview of Cassandra and its architecture, then covers how to evaluate Cassandra for your project and how to design your data models, with examples. It should give you a better picture of whether and how Cassandra can fit in your project, so that you can start thinking about your application and data modeling from a high level. From there, you can go on to learn more about this database's details from the references in this blog and other resources.
With the prevalence of Machine Learning and Big Data applications, I strongly believe Cassandra can play an important role and it’s definitely worth learning about its ideas.
As an alternative, ScyllaDB could be a very neat open-source replacement for Cassandra, with a compatible CQL and driver interface. See more at: https://www.scylladb.com/ Its blog also provides some use case studies.
Cassandra has some interesting architectural design ideas to achieve its availability as well as performance. The tradeoff is its own limitations.
A Cassandra cluster consists of one or many decentralized nodes that share the client query load for scalability, with replication for high availability. A Cassandra cluster has no master node; it maintains its membership information with the Gossip Protocol.
Cassandra cluster partitions its data among nodes as a token ring. All data in Cassandra is partitioned to its nodes in the ring based on the key hash and replication configurations.
Node membership and sharding are decided by the Consistent Hashing algorithm, for load balancing and minimal data movement during membership changes. For each table, the partition key decides which replicas a row is written to. Therefore it's important to include the partition key in the design of your schema (as discussed below).
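A minimal consistent-hashing sketch shows how keys land on a token ring. Cassandra actually uses the Murmur3 partitioner and virtual nodes; FNV-1a and one token per node keep this illustration short:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy token ring: each node owns one token, and a key belongs
// to the first node whose token is clockwise from the key's hash.
type ring struct {
	tokens []uint32
	nodes  map[uint32]string
}

func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

func newRing(nodes ...string) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		t := hashKey(n)
		r.tokens = append(r.tokens, t)
		r.nodes[t] = n
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// locate finds the first token >= hash(key), wrapping around the ring.
func (r *ring) locate(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around to the first token
	}
	return r.nodes[r.tokens[i]]
}

func main() {
	r := newRing("node-a", "node-b", "node-c")
	// The same partition key always lands on the same node.
	fmt.Println(r.locate("user:42"), r.locate("user:43"))
}
```

Adding a node only moves the keys between its token and its predecessor's, which is the "minimal data movement" property mentioned above.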
Client queries can go to multiple nodes in the cluster based on your replication and query configuration. This is Cassandra’s “Tunable Consistency”: higher number of nodes for each query would sacrifice response time and availability, but maintains higher consistency. And vice versa. (See more at CAP theorem.) Tunable Consistency allows users to decide what consistent level is for data read/write. For Eventual Consistency, Cassandra responds with confirmation after writing to any one of the replicated nodes for low latency and high availability. While for Quorum Consistency, Cassandra reads/writes from a quorum of the replicated nodes, and takes the latest write as the final result (LWW Last Write Wins strategy). You can choose the consistency level based on your application needs.
Same as BigTable and DynamoDB design, Cassandra uses MemTable as in memory storage, and SSTable (Sorted Strings Table) as storage backend. MemTables are periodically flushed to disk as SSTables, which are immutable, sorted by key, and gives impressive batch read/write performance. But updating/deletion in SSTable uses new records and tombstones. It appends the new records to new SSTables instead of overwriting the existing ones. So huge amounts of update/delete operations will be inefficient in Cassandra.
So Cassandra’s architecture decides that:
For more information, I found the book “Cassandra: The Definitive Guide” a very helpful reference.
Update: though it’s a blog about ScyllaDB, it’s closely modeled after Cassandra and DynamoDB: https://docs.scylladb.com/architecture/. And I’ve found it a very helpful source as well.
To start off building applications with Cassandra, first thing to come to mind is your data schema, and whether they can fit nicely into the Cassandra paradigm.
In the book “Cassandra: The Definitive Guide”, I found a paragraph that best summarizes schema design with NoSQL databases:
By contrast, in Cassandra you don’t start with the data model; you start with the query model. Instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. Think of the most common query paths your application will use, and then create the tables that you need to support them.
Users need to keep in mind not to design their application schema as an Entity-Relation model as with traditional SQL databases, but to start by thinking of all the queries you'll need from the database.
For example, from ScyllaDB blog, here’s an example of a typical Cassandra time-series table schema design.
Cassandra requires a composite primary key for each table: one as row key (or partition key) used to locate the row data for writing as well as reading; another as sorting key (or clustering key).
```sql
-- Example from https://www.scylladb.com/2020/02/20/nauto-achieving-consistency-in-an-eventually-consistent-environment/
```
The Primary Key is used to identify and locate the rows in the Cassandra cluster, and it has two parts: the partition key and the clustering key.
The first part (`version`, `id`, `bucket`), the partition key, is used to locate the right partition in the cluster and the nodes where the data resides. It is required for writing, and for efficient reading as well.
The second part (`end_ms`, `start_ms`), the clustering key, is for sorting rows. Choosing the right clustering key gives you better performance for batch reading.
The Cassandra book includes an example of designing data schema with Cassandra.
https://www.oreilly.com/library/view/cassandra-the-definitive/9781449399764/ch04.html
Think of a hotel booking application, where you’ll need to look up what hotels are in the city, what hotels are close to site-seeing locations, book reservations, look up detailed amenities for hotel rooms, and so on.
A query-first approach to schema design starts by thinking about all the queries you'll make to the backend service:
And for reservations, you’ll probably need to find the right reservation by different types of query info, for example:
If you link all your queries based on the keys and results of each query, you’ll end up with a diagram of all the supported queries from the client workflow. And based on this diagram, we can come up with the tables necessary for our application, along with the best keys for partition and sorting.
All the queries will form a workflow diagram linking all the tables together. This can be represented as a "Chebotko Diagram" for schema design: a way to visualize the relationships between tables and queries, including the table keys and field design in the schema. With this diagram, it's much clearer how to finally implement your tables and the application workflow.
A part of the final diagram would look like this, an example from the book.
For the reservation procedure, the denormalized schema design is based on queries, and data might be replicated across different tables, contrary to the E-R model for SQL schemas:
Here’s also a fun exercise you can do with the example in Scylla’s User Stories section. Imagine a Pet Care service where each user tags their pets with remote monitor sensors (e.g. for heart rate, Geolocation, and other metrics). And you’ll need to design an application to log and query the metrics for each pet.
https://www.scylladb.com/2020/09/09/carepet-an-example-iot-use-case-for-hands-on-app-developers/
So you're working on a service for pet health monitoring. You'll need to design the backend service that saves and queries each pet's health metrics over time, for all customers and their pets. Each pet monitor gathers data and saves it in the backend database, which is Cassandra in this case. And you'll need to design the schema to fit the Cassandra model.
From the database, you'll need to provide the following queries:
Try designing the tables and see if your solution is close to the example's.
https://github.com/scylladb/care-pet/blob/master/docs/design_and_data_model.md
To summarize, Cassandra’s architectural design decides that:
It’s a powerful tool with its limitations. When used in the right scenarios, it can be a formidable weapon in your arsenal.
If you feel like Cassandra can be a good fit for your application, you can then go on to learn more about:
Ray is a new and growing distributed programming framework, with an ambitious plan to be the foundation of emerging AI/ML applications. In its own words, it aims to "provide a universal API for distributed computing". This means it needs to provide a programming interface that's flexible enough for new applications, and a backend system designed to scale for elastic computing needs with good performance. This paper (OSDI '18) explains its API and architecture design to fulfill this goal, and I've found some very interesting points.
At the programming interface level, Ray provides the "Actor Pattern". A Python function invocation or a user-defined class object can work as an "Actor" in Ray. Simply annotate the function or class with `@ray.remote`, and call with `f.remote(args)` or instantiate with `Class.remote(args)`.
Ray’s computing results are always returned as Future for asynchronous computations. In this way, Ray’s Actors can spawn more Actors, and submit the workload in parallel.
These two tools combined can be really powerful in expressing complicated distributed computations and the dependencies between them, which often form a DAG (Directed Acyclic Graph) dynamically, like the dynamic task graph the paper describes for training Reinforcement Learning models.
The programming API can also emulate computing patterns like the MapReduce design pattern. In this example, map functions are defined as Ray actors and called to get results as Futures. The reduce function can also be called remotely, gathering all the actual results from the Futures. Granted, this level of abstraction is not really equivalent to other common MapReduce frameworks. But still, it demos Ray's flexibility.
Ray allows you to save large memory objects in the cluster for Actors to access (known as the Plasma store). The location is decided by scheduler based on task and data affinity. They can be used for saving intermediate results to speed up computation.
Ray’s architecture follows a straightforward client-server model, where client is the Ray program and the client library, which communicates with servers that schedules the actual workloads and data to worker nodes.
Ray servers use a primary-follower pattern. The primary node is responsible for the global scheduler and the Global Control Store (GCS).
Each worker node has its own object store and local scheduler.
The GCS is a sharded storage for metadata (backed by Redis by default), keeping track of task and object details, including:
It provides a pub-sub infrastructure to enable efficient communications.
Ray's scheduler is unique and interesting. It takes a bottom-up approach, with a local scheduler on each node as well as a global scheduler for scalability.
Each node runs a worker that periodically reports its load back, for offloading or centralized global scheduling. Tasks are submitted bottom-up, to local schedulers, and only forwarded when the local scheduler is under heavy load.
The object store is a distributed in-memory storage. It uses immutable data, which simplifies the system design (e.g. avoids consistency issues). It keeps data in memory and supports spilling to disk under memory pressure, with an LRU policy.
It saves data lineage information (as in Spark) in the GCS, so as to tolerate failures: once a result is lost, it is re-computed from the parent data and the function.
Object store uses Apache Arrow library for serialization.
In summary, Ray provides a distributed programming framework for a diversity of tasks, with an easy programming interface and good performance. It also has a strong backend: a job scheduler and a remote memory object cache. It's everything a distributed computing framework ever needs.
Also, it's getting support from a variety of Machine Learning frameworks and integrations (e.g. scikit-learn, Spark, TensorFlow, PyTorch, hyper-parameter tuning, and future distributed applications). The Ray project itself has focused on the RayTune and RayServe projects. With Ray's flexibility, it's totally possible that it could be a "glue" framework for all other frameworks.
There's a case study from Burger King with a very interesting use case of Ray and Spark, with Ray deeply integrated with Spark to access its memory. According to the article:
So they choose MXNet as their deep learning framework, and before cooperating with us,they would allocate a separate GPU cluster dedicated for distributed MXNet trainingbut they find that such a solution is not quite efficient, since in the entire pipeline,a large portion of the total time is spent on copying data from the big data clusters to the GPU cluster.
After deploying RayOnSpark, Ray can now access Spark's memory. And with a wrapper around MXNet, Ray can combine these two procedures and run the applications in the same cluster. It has better efficiency and is easier to maintain.
This is where I see it can shine: not just as the foundation for emerging frameworks, but as the missing link between ML procedures and applications, speeding up the ML pipeline as the glue layer. In this way, it truly has lots of potential.
Here I've summarized a few valuable idioms for using Golang goroutines, from multiple references as well as my own experience. They can serve as a helpful toolbox that comes in handy for similar problems, so that you don't have to design them from scratch, which might help you avoid synchronization errors.
For more information on Golang channels, see the links in the references. They should give you a good introduction to channels and their basic behaviors.
If you find any problems, or idioms you think this summary should include, please feel free to message me and let me know. Many thanks in advance!
I suggest you read the description and implement a toy version of each of these idioms first.Building your own implementation serves as a good exercise.
This is a very common usage of goroutines: getting results back from multiple async goroutines. A child goroutine can return a channel that eventually delivers the result, akin to the Future/Promise idiom in some other languages.
Exercise: How to create an example where you spawn multiple goroutines that work asynchronously (emulated with `time.Sleep()`), and then collect and print the results in the main thread?
See my implementation here.
You can also emulate Future/Promise. Functions can return channels wrapped in Futures, so as to process and return results asynchronously.
See my implementation here.
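A minimal sketch of the future idiom (the `futureSquare` name is mine):

```go
package main

import (
	"fmt"
	"time"
)

// futureSquare starts the work in a goroutine and immediately returns a
// channel that will deliver the result: a poor man's Future.
func futureSquare(x int) <-chan int {
	out := make(chan int, 1) // buffered, so the worker never blocks on send
	go func() {
		time.Sleep(10 * time.Millisecond) // emulate slow work
		out <- x * x
	}()
	return out
}

func main() {
	// Kick off several computations concurrently...
	futures := []<-chan int{futureSquare(2), futureSquare(3), futureSquare(4)}
	// ...then block on each result only when we actually need it.
	for _, f := range futures {
		fmt.Println(<-f)
	}
}
```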
A processor can be put in a goroutine and keep receiving results until the channel is closed.
```go
for r := range myChannel {
	// process r
}
```
The idiom is for the sender to close the channel, to avoid race conditions where a sender sends on a closed channel. Sending on a closed channel will cause a panic!
Exercise: how to create an example where you use the “for-range” idiom to get all results from the worker threads?
Hint: you'll need a proper way to make sure all worker threads are done sending to the channel, and then close it properly. Consider `sync.WaitGroup`.
See my implementation here.
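One possible solution sketch, using `sync.WaitGroup` so that the channel is closed only after every sender finishes:

```go
package main

import (
	"fmt"
	"sync"
)

// collect spawns nWorkers senders on one channel; a WaitGroup tracks
// when they're all done, and a separate goroutine then closes the
// channel so the for-range below can terminate.
func collect(nWorkers int) []int {
	results := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- id * 10 // emulated work result
		}(i)
	}

	// Close the channel only after every sender has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []int
	for r := range results { // exits when results is closed
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(collect(4))
}
```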
For network requests and exec commands, Golang has a great mechanism for setting timeouts and cancellation: `context`.
See more at: https://golang.org/pkg/context/
It is actually my recommended way of handling cancellation and timeouts in requests. It has a neat and clean interface, is easier to use, and is less error-prone than home-brew solutions with channels.
For example, Golang's `os/exec.CommandContext` can be cancelled either by the user or by a timeout. The underlying implementation listens to `context.Done()`.
See: https://golang.org/src/os/exec/exec.go?s=6768:6841#L394
What if you need to implement your own request handling that takes in a context and respects the cancellation signal? You can learn from the example above in the Golang library.
My example of handling context.
You can use the `time` package to handle timeouts and periodic events. Golang's `time` package uses channels for callbacks, which makes the logic easier to read and understand.
Exercise: how to create an example that uses timer/ticker to trigger/cancel an action?
My example here.
You can use channels for notifications. One example is OS signals. Instead of installing signal handlers like some other languages, Golang's signal package returns a channel, so users can handle signals like any other channel.
Notice:
From Go documentation (https://golang.org/pkg/os/signal/#Notify)
Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate. For a channel used for notification of just one signal value, a buffer of size 1 is sufficient.
```go
// Set up channel on which to send signal notifications.
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt)
```
Exercise: how to create an example of using OS signals to cancel current work?
My example here.
Channels can be used as a message broker when you need to send messages to multiple child goroutines.
It's kind of like the reverse of handling multiple message sources: the parent can send a signal to each of the channels accepted by the child goroutines.
Exercise: how to create an example where child workers start after parent broadcasts a ready signal?
My implementation here.
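One common way to implement the ready broadcast is closing a channel, since a close releases every receiver at once (a sketch; the per-child-channel approach described above works too):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// startWorkers blocks every worker on the same `ready` channel; a
// single close() then wakes all of them at once. It returns how many
// workers observed the broadcast.
func startWorkers(n int) int32 {
	ready := make(chan struct{})
	var wg sync.WaitGroup
	var started int32

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ready // receive on a closed channel returns immediately
			atomic.AddInt32(&started, 1)
		}()
	}

	close(ready) // the broadcast: releases every waiting worker
	wg.Wait()
	return started
}

func main() {
	fmt.Println(startWorkers(3), "workers started")
}
```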
```go
// init semaphore
```
Exercise: how to use channel as a semaphore that limits parallel processing to a given rate? i.e. Limit to at most N workers processing at the same time?
My example here.
(How is this implementation better than evenly dividing the work among multiple goroutines before the work starts? e.g. dividing 16 work items among 4 goroutines, each processing 4?)
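A possible solution sketch for the semaphore exercise; it also records the peak observed concurrency to show that the limit holds:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited uses a buffered channel as a counting semaphore: at most
// `limit` goroutines hold a slot at once. It returns the peak observed
// concurrency, which can never exceed the limit.
func runLimited(jobs, limit int) int64 {
	sem := make(chan struct{}, limit)
	var running, peak int64
	var wg sync.WaitGroup

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire: blocks once `limit` slots are taken
			n := atomic.AddInt64(&running, 1)
			// Record the maximum concurrency observed so far.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // emulate work
			atomic.AddInt64(&running, -1)
			<-sem // release the slot
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrency:", runLimited(16, 3))
}
```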
Similarly, you can solve the above examples with competing consumers: start N consumers up front, all taking from the same channel.
My example here.
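A sketch of competing consumers (function names are mine): N workers all receive from the same jobs channel, and the runtime hands each job to exactly one of them.

```go
package main

import (
	"fmt"
	"sync"
)

// process fans jobs out to nWorkers competing consumers, doubles each
// job, and returns the sum of all results.
func process(jobs []int, nWorkers int) int {
	in := make(chan int)
	out := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in { // each job is consumed by exactly one worker
				out <- j * 2
			}
		}()
	}

	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in) // sender closes; workers' range loops exit
		wg.Wait()
		close(out)
	}()

	sum := 0
	for r := range out {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(process([]int{1, 2, 3, 4}, 3)) // 2+4+6+8 = 20
}
```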
First published in KDD by booking.com, the paper describes lessons from deploying Machine Learning models in their production service. It provides some intriguing insights, many of which I believe are very valuable for understanding how to apply Machine Learning in real-world scenarios.
Here are some of my takeaways.
Booking.com uses an abundance of Machine Learning models in its website service. Examples include:
From the experience in the paper, all model families can provide value in the real world. Though some of the value is hard to quantify, the multiplying effect is clear.
What struck me most from this section is the number of models deployed for a single service, and how many aspects of the user experience can be modeled, optimized, and fine-tuned with the power of Machine Learning.
Everything we see now on modern websites can be decided by a Machine Learning model behind it, which learns from our interactions to feed back into our experience.
The one-sentence takeaway: offline model performance cannot be directly translated to business value, i.e., a model that excels in the lab doesn't necessarily do well in the real world!
There are some possible explanations for this effect. For example, the saturation of a model: a model cannot drive business value to infinity after you optimize it past a threshold. Or the uncanny valley effect (which I found interesting): when you optimize a model too well, it can scare the users and bring negative value.
As the paper summarizes, the offline model performance can be entirely uncorrelated to business outcomes!
This section introduces the experience of designing Machine Learning models and the challenges.
For example, when modeling very subjective concepts, target variables are not given as ground truth; they are constructed. Therefore some setups are harder than others from a learning perspective: for some setups, the data is closer to the concept we want to model.
Another common problem is the Selection Bias issue. For example, if you gather data by the user filling in the questionnaire, the data collected is strongly biased toward those who fill in.
The paper provided a mouthful to explain the detection of Selection Bias:
Diagnosing selection bias is straightforward: consider a sample of the natural observation space (users or sessions in the dates flexibility case), we can then construct a classification problem that classifies each observation into the class of the observations for which a target variable can be computed and the class of the observations for which a target variable cannot be computed. If this classification problem is easy (in the sense that a simple algorithm performs significantly better than random), then the bias is severe and must be addressed.
The way I understand it: construct a classification problem to classify if the target variable can be computed (e.g., those who fill in the questionnaire) and those who cannot. And if the classification is easy (better than random), it means the selection process (in this case, the questionnaire filter) is not random!
Coming up with better models to unlock business values can require many iterations.
This section introduces an important observation that website response speed correlates to user conversion.
Visual inspection shows a clear trend, in which an increase of about 30% in latency costs more than 0.5% in conversion rate (a relevant cost for our business).
The paper introduces some infrastructure optimization to amortize latency (redundancy, caching, etc.) and also emphasizes how some simple models (e.g., linear models) can actually outperform more accurate but slower models.
This section provides a great experience in monitoring model output based on output distribution and how it can be an indicator of model health in production.
For example, quoted from the original paper:
This section introduces the company's experience with experimentation through Randomized Controlled Trials (RCTs). Careful experiment design and analysis give a huge boost to understanding the data collected from the website and to the development of Machine Learning models.
The large majority of the successful use cases of machine learning studied in this work have been enabled by sophisticated experiment designs, either to guide the development process or to detect their impact.
The paper introduces a couple of methods for designing experiments for Machine Learning products.
Terminology for understanding trigger-based experiment process:
Experiment design:
In this way, we limit the experiment to the subjects that are most relevant to model performance, and limit noise in the dataset.
This paper shares great experiences in understanding and tackling Machine Learning challenges in real-world services. And it provides guidance in designing and analyzing Machine Learning models in production.
https://en.wikipedia.org/wiki/The_Death_of_Expertise
“The Death of Expertise” by Tom Nichols is a timely piece on the ongoing misinformation epidemic, especially in America. Quoting Isaac Asimov:
There is a cult of ignorance in the United States, and there always has been. The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that “my ignorance is just as good as your knowledge.”
The book describes the author's view of why experts are so important in a democracy, and the relationship between expertise and the public. It goes on to decry the ongoing decay in this relationship: citizens are increasingly losing trust in experts, and experts are increasingly finding it difficult to communicate with their audience.
The author explains his own view of the many reasons behind this divide. Most significantly: our innate incapacity to think rationally. The world is complex and dramatic, yet our brains tend to think in a more intuitive, direct, and emotional fashion. We love simple facts and jump to premature conclusions. We're naturally not good at discerning our own ignorance and stupidity, a phenomenon dubbed the “Dunning-Kruger effect.” We're often over-confident in ourselves and hold dear our world views, to the point of denying facts we find inconvenient.
It has undoubtedly always been this way. But these ways of thinking have been fundamental to why we reject other opinions, and they are exacerbated by recent trends.
Higher education, for instance, is one of the author's targets. He argues that colleges are slowly developing into a commercial product more than a sanctuary for passing on knowledge and critical thinking. Colleges have become an expensive “experience” that caters to its “customers,” so much so that they avoid provoking students with uncomfortable ideas or even unsatisfactory GPAs. This creates a generation of youngsters who cannot deal with real-life situations, including accepting facts or opinions they find “offensive.”
The spread of the Internet has brought the masses not just information, but also overconfidence and arrogance. “Let Me Google That For You” has become a catchphrase for Internet users who equate hours of “research” online with years of expert training and work experience. Worse, the Internet is rife with misinformation and deliberately fake information. Conspiracy theories are so prevalent online that you can almost always find some rabbit hole of misleading or completely fake theories, and keep reinforcing them.
Modern-day journalism, in its drive to attract customers, is also increasingly biased and divisive. Journalists have always sometimes reported misleading information for lack of expertise, but commercial journalism now suffers the same fate as the Internet: it caters to the audience's worst desires, creating a feedback loop. The industry is undermining its own professionalism and turning journalism into entertainment.
Experts, too, are sometimes wrong. Experts are human and are not immune to human mistakes; there is negligence, prejudice, and conflict of interest in every industry. Experts should be responsible for their own words and actions, but the misconception that “experts should always be right, or else they can never be trusted” is detrimental to our relationship with experts as well.
The author argues that expertise is crucial in a democracy, but we're now in an epistemological crisis. We urgently need to reconcile the relationships between citizens, experts, and decision-makers.
The book was a good read, and a wake-up call to the anti-intellectualism prevalent in American society. But the author did not give a (in my opinion) satisfactory answer to how to solve the problems of “the death of expertise.” He admits that experts make mistakes and that the public should keep experts in check, but also concludes that “experts are more often right than the public.” I'm not satisfied with the notion that we should simply accept the facts passed down by experts.
As some comments on Goodreads note, the book (somewhat ironically) falls short of providing concrete, authoritative sources to confirm that some of these trends are actually happening. It lacks the intellectual rigor to back the claims with research, statistics, and convincing sources, making it more of a long rant than a careful analysis of the problems.
Also, some reviewers point out that the author may have missed some of the underlying structural problems in society, like “the corporatization of media and neoliberalism in general.” Instead, the book focuses most of its fire on the public for being increasingly partisan and prejudiced.
In all, I believe this book serves as an interesting read and a great wake-up call about these problems, with excellent anecdotes and critiques. But it does not serve as a rigorous analysis of the issues at hand, nor does it do well in suggesting what can be done about them.
The often untold story behind a mastermind of Computer Science: Dijkstra, whose name is attached to an important algorithm widely used in GPS navigation.
The blog describes a wise, hard-thinking, great mind who made unparalleled contributions both to Computer Science as a mathematical and logical discipline, and to Software Engineering, which focuses on building software and hardware components.
He is most famous for his private reports, known as “EWDs,” which he continued writing for more than forty years. They describe his views on Computer Science and Software Engineering in general, and sometimes served as reviews of others' work. One of the most influential EWDs was “Notes on Structured Programming,” which argued for programming as a serious skill that demands intellectual rigor.
In 1972, Dijkstra received the ACM Turing Award; he was recognized for:
contributions to programming as a high, intellectual challenge; for eloquent insistence and practical demonstration that programs should be composed correctly, not just debugged into correctness; for illuminating perception of problems at the foundations of program design.
He had great passion for his art, and his strong personality sometimes sparked controversy. One of the most famous episodes was his critique of the “GOTO” statement as harmful. It brought widespread, heated debate, yet Dijkstra's view finally prevailed, and his insistence made a monumental change to programming paradigms.
There are many more interesting details about his personal and academic life in the original post, too many to summarize here. For example, he had a mini-van in Austin, which he often drove to national parks with his wife, and which he named the “Touring Machine.” If you are passionate about computers and software and have a long weekend afternoon, it's worth a good read.
The quirks of the Python programming language. The use cases described in the document are usually not the recommended way of using Python, as they might trigger unexpected behaviors. They expose underlying implementation details, most of which exist for optimization, and can have counter-intuitive side effects.
No programming language, tool, or framework is perfect. If you'd like to use something fluently, you'll also need to understand its weird corner cases.
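Two illustrative quirks of this kind (my own examples, not necessarily from the linked document): small-integer caching is a CPython implementation detail, while the mutable-default-argument pitfall is core Python semantics.

```python
# CPython caches small ints in [-5, 256]; going through int("...") avoids
# the compiler's constant folding, so we observe the runtime cache itself.
a, b = int("256"), int("256")
c, d = int("257"), int("257")
print(a is b)  # True: both names point at the cached 256 object
print(c is d)  # False: 257 is outside the cache, two distinct objects

# Another classic: a mutable default argument is created once, at
# function definition time, and shared across all calls.
def append_to(item, bucket=[]):
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2]  (the "empty" default list persists)
```

Relying on `is` for integer comparison, or on a fresh default list per call, are exactly the kinds of assumptions that break in surprising ways.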
An interesting note-taking method I've recently bumped into. I've been using it with my Notion notebook.
The idea: you read and learn many small facts and concepts, but forget them quickly. The Zettelkasten method records them and connects them into much more powerful ideas and concepts, since innovation often arrives when ideas clash. Used correctly, this personal note-taking method can significantly boost your creativity.
An interesting idea from studying the spread of conspiracy theories: most followers and spreaders are drawn to conspiracy theories because they find comfort and a sense of community in them.
Fighting disinformation and conspiracy theories can be hard, but this idea could provide a way of keeping people from spreading them.
When someone feels unfit for, or even abandoned by, mainstream society, they seek comfort in outlandish ideas. The conclusion? One important aspect of fighting conspiracy theories is building community value.
Scientism is the idea that we should believe science and scientists no matter what. It is the wrong way to pursue the true spirit of science, is actually harmful, and might backfire into denial.
The paper outlines some of the Julia programming language's most important design choices, and explains how they build a bridge between user-friendliness and performance.
The paper provides a few benchmarks comparing Julia's performance with a C baseline, along with other dynamic languages like Python, MATLAB, and JavaScript. While other dynamic languages suffer great performance loss due to their dynamism, Julia competes relatively closely with the C/C++ baseline: it reaches native performance in a few cases, and most benchmarks land within 2x of C or C++, while Python can be more than 70x slower than C++.
This is significant, as it may eliminate the “prototype in a dynamic language, then reimplement in a static language for performance” cycle, saving the extra coding time usually spent to achieve efficiency, without sacrificing much performance.
Some key takeaways from this paper:
Unlike Python, Julia incorporates optional typing in its runtime, which helps check correctness at runtime as well as enabling optimizations.
Type stability is an important concept in Julia code. It means that, in a given type context, an expression always returns a value of the same type. It is the key to performant Julia code, as the compiler can then use a specialized low-level method for that type.
For a simple, type-stable summation loop like the following, the Julia compiler can emit x86 assembly almost identical to what a C/C++ compiler would produce:

```julia
function vsum(x)
    s = zero(eltype(x))     # accumulator of the element type: type-stable
    for i in eachindex(x)
        @inbounds s += x[i] # no bounds checks in the hot loop
    end
    return s
end
```
The Julia compiler also infers type information, based on user annotations as well as input types at runtime, to better drive JIT optimization.
Multiple dispatch is similar to operator overloading in other programming languages: function behavior is overloaded based on input types. The difference is that the dispatch happens on the runtime types of all of the arguments.
For example, the + function can consist of 180 underlying methods, depending on the input types. Each method declares what types it can handle, and Julia “dispatches” a call to the correct method when it's invoked.
At runtime, Julia decides at function-invocation time what the input types are, and the method is “specialized” to those types, which provides the JIT with the argument-type information needed for “devirtualization.” The dispatched method thus becomes a specialized compiled method, which enables further optimizations like inlining.
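Python has no built-in multiple dispatch, but `functools.singledispatch` gives a limited, single-argument analogue that conveys the idea of selecting an implementation by the runtime type of the input:

```python
from functools import singledispatch

# Single dispatch: the implementation is chosen by the runtime type of
# the first argument only (Julia generalizes this to all arguments).
@singledispatch
def describe(x):
    return f"some object: {x!r}"

@describe.register
def _(x: int):
    return f"an integer: {x}"

@describe.register
def _(x: list):
    return f"a list of {len(x)} items"

print(describe(42))      # an integer: 42
print(describe([1, 2]))  # a list of 2 items
print(describe("hi"))    # some object: 'hi'
```

Julia goes further than this sketch: it dispatches on every argument position and compiles a specialized native method per concrete type combination.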
The Julia compiler parses program input into a Julia AST, which is then lowered to Julia IR. This enables Julia-language-level optimizations, before translating to LLVM IR, which enables a large number of optimizations critical to Julia's performance.
Based on skimming the paper, Julia is a very interesting language, inspired by many earlier dynamic languages while providing more innovative solutions to issues that used to trouble programmers. It's definitely worth attention in areas like mathematical modeling, HPC, and AI/ML.
How the Chinese-American relationship is impacting the lives of the many “stuck in between.”
The author foresaw the dangerous impact social media would have on society, and he was right.
He also proposes that the cure cannot be purely technological: it requires fixing the vulnerabilities inside our economic, political, and social systems.
Stephen Wolfram's testimony before the Senate on A.I.-selected content: his ideas on why algorithmic bias is dangerous, and how we can address it with proper regulation, transparency, and user choice.
He basically proposed that users should have an idea of what algorithm is feeding them data, and the ability to choose. This requires open benchmarks for recommendation algorithms, and frameworks that let users choose.
What serverless computing is, why it is on the rise, and why it is useful for parallel data processing (data processing, CI/CD, compilation, ML, visualization, …, you name it).
A detailed guide to modeling your NoSQL data schemas.
Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions, with its distributed design built around replicating the database's WAL (write-ahead log).
The original design, mirroring data on EBS, was slow, unreliable, and incurred expensive network overhead.
The log is the database: write only the redo log to disk and across the network; back up disks to S3 in the background.
Partition the database volume into small fixed-size segments called PGs (Protection Groups); each PG is replicated 6 ways across 3 AZs.
PGs are implemented as storage nodes with EC2 VMs, and attached SSDs.
Partitioning into small segments also helps reduce the MTTR (Mean Time to Repair), which reduces the probability that overlapping failures cost the quorum.
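The quorum arithmetic behind this layout can be sketched as follows (the 4-of-6 write quorum and 3-of-6 read quorum are from the Aurora paper, not from the notes above):

```python
# Quorum sanity check: with V copies, write quorum Vw and read quorum Vr
# must satisfy two rules: every read overlaps the latest write
# (Vr + Vw > V), and two writes cannot both succeed on disjoint copies
# (2 * Vw > V), which prevents divergent committed writes.
def quorum_ok(v: int, vw: int, vr: int) -> bool:
    return vr + vw > v and 2 * vw > v

print(quorum_ok(6, 4, 3))  # True: Aurora's 4-of-6 writes, 3-of-6 reads
print(quorum_ok(6, 3, 3))  # False: 3-of-6 writes would allow divergence

# Losing a whole AZ (2 copies) leaves 4: the write quorum is still
# reachable; losing an AZ plus one more node leaves 3: reads (and hence
# repair) still work, which is Aurora's "AZ+1" fault-tolerance goal.
```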
Problem: how to implement consistency on logs, without expensive 2PC, and how to handle recovery process.
Terminology:
Writes:
Commits:
Reads:
Replicas:
Recovery:
Cassandra is an exemplary implementation of a NoSQL database, and has gained popularity in various web, big data, and ML applications. Recently I stumbled upon a good summary of a Cassandra handbook, which includes a decent introduction to its data-modeling techniques, which can in turn be used in other NoSQL databases.
Here are my notes and summaries:
There are a great many ways Cassandra differs from a traditional RDBMS: Cassandra is a wide-column database with BASE (eventual consistency) guarantees and looser relationships between tables. Therefore one needs to model data very differently from a traditional RDBMS for the application to run efficiently.
Namely, NoSQL has the following differences:
(?) What are the major differences between NoSQL and SQL data modeling?
A Cassandra table uses a composite key as its primary key: a partition key (K) and a clustering key (C).
The primary key is crucially important in Cassandra data modeling, as:
(?) What is Cassandra key and why is it important?
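A toy sketch of why the two parts of the key matter (illustrative Python, not Cassandra's implementation): the partition key decides which partition (and hence node) a row lands on, while the clustering key keeps rows sorted within that partition.

```python
from bisect import insort
from collections import defaultdict

# Toy model of a Cassandra-style wide row: the partition key groups
# rows into one partition, the clustering key sorts rows within it.
table = defaultdict(list)   # partition key -> rows sorted by clustering key

def insert(partition_key, clustering_key, value):
    insort(table[partition_key], (clustering_key, value))

# e.g. sensor readings partitioned by sensor id, clustered by timestamp
insert("sensor-1", 1002, "20.1C")
insert("sensor-1", 1000, "19.8C")
insert("sensor-2", 1001, "21.4C")

# A query must name the partition; the clustering order comes for free.
print(table["sensor-1"])  # [(1000, '19.8C'), (1002, '20.1C')]
```

This is why "query-first" design matters: efficient queries must name a partition key, and range scans are only cheap along the clustering order.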
Logical data modeling starts with the overall application workflow, known as “query-first design.”
Entity-Relationship (E-R) diagrams are most often used for SQL data modeling, but they are also helpful for thinking through the entities and relationships involved in NoSQL modeling.
Iterate between Application Query Workflow and E-R Diagram.
A Chebotko diagram is a good tool to model the queries and tables required by a NoSQL application.
(?) In NoSQL data modeling, what is a Chebotko diagram and how does it help with modeling?
Some items to consider when creating the tables:
Humans tend to think and live in Mediocristan, where quantities tend to follow a normal distribution, and that is indeed how most things behave: human height, weight, and so on.
Black Swan incidents are ones that people can barely predict and sometimes grossly overlook. Examples include the 9/11 attacks, the 2008 stock market crash, etc.
But many other quantities are best described by a power-law distribution, and that realm is referred to as Extremistan, where extreme cases dominate. Human wealth is one example.
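The contrast is easy to see in a quick simulation (illustrative numbers only): in a normal, Mediocristan-style sample no single observation dominates, while in a power-law, Extremistan-style sample the largest observation can dwarf the average.

```python
import random

random.seed(1)
n = 100_000

# Mediocristan: a normal distribution, e.g. human height in cm.
heights = [random.gauss(170, 10) for _ in range(n)]
# Extremistan: a power-law (Pareto) distribution, e.g. wealth (toy units).
wealth = [random.paretovariate(1.5) for _ in range(n)]

ratio_h = max(heights) / (sum(heights) / n)
ratio_w = max(wealth) / (sum(wealth) / n)
print(f"tallest vs. average height: {ratio_h:.1f}x")  # a small multiple
print(f"richest vs. average wealth: {ratio_w:.1f}x")  # dwarfs the mean
```

In Mediocristan, no single sample changes the average much; in Extremistan, one outlier can dominate the whole sum, which is exactly the terrain where Black Swans live.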
It's human nature to draw conclusions, find correlations, and assume everything is close to what we have observed and that extreme cases are extremely unlikely. That is the basic recipe for Black Swan incidents.
Think of a turkey well-fed by its owner. It quickly concludes that the owner is a friend, until the day before Thanksgiving. The author advises in the book: don’t be a turkey.
The author discussed a few cognitive biases we’re vulnerable to:
Human beings are particularly bad at making predictions. One phenomenon is the more information we have, the more confident we are, but not more accurate. It’s called “toxicity of information,” where noise is mistaken for signal.
The author argues that human technological advances are particularly unpredictable: “if you expect to expect something tomorrow, you should expect it today.” This is especially true of new technologies: if we understood a new technology in enough detail to predict it, we would already know how it works and would have it today.
In the book, the author slams the so-called economists, social scientists, and the like, who build complicated mathematical models and beautiful charts to “forecast” economic trends, the stock market, etc., without taking into account the role chance plays in the outcomes. It makes them utterly vulnerable to Black Swans.
The author points out, however, that we should not try to predict Black Swans. Instead, build robustness against negative Black Swans, and shoot for positive Black Swans.
In the final part of the book, the author argues that the foundation of Black Swans is the power-law distribution. It appears everywhere in the world: economies, companies, national power, wherever winners take all. This has several implications:
Many book reviews have already gone through what they dislike about the author’s arrogant tone in this book, dismissing all social science as pseudo-science. Also, the author loved to paint himself as the lone wise oracle shunned by ordinary people, but that’s not the truth: many people have similar or close ideas of impending dangers and what we should do about them.
Nevertheless, the ideas in the book are still worth a read and close attention, especially in a fast-changing world as it is today.
One of the best examples might be the coronavirus that's sweeping across the world right now, as I sit in my own house, unable to visit the restaurants and coffee shops I love. In retrospect, when the news first broke, I never expected it could have such a drastic impact. Many people, myself included, along with popular news anchors, technologists, the president of the US, and so many more on social media, regarded the virus as “something just like the flu” that was “just going to go away when the season passes.” Media today love to dig up those old comments (especially from people with a different political agenda) and use them to mock how ignorant and short-sighted their authors were, even though the media are not so innocent themselves. I see this as a common flaw in human prediction, just as the book describes: as humans, we're particularly bad at predictions.
There are also voices pointing out that it didn't need to be a Black Swan. Nassim Taleb, the author of this book, stated in a recent interview that the coronavirus shouldn't have been a Black Swan to governments, medical professionals, and epidemiologists who have dealt with situations like this before. He was not alone: Bill Gates once warned us about the dangers of a pending pandemic. We didn't take the advice seriously, and the pandemic still broke out as a Black Swan to the rest of us.
Now, instead of engaging in bitter political bickering, it would be wiser for all of humanity to learn from this lesson and work together to make the next Black Swan grayer.
This book is the author's summary of his research and experience of studying. He argues that there is a reliable way to learn and improve yourself: intensive training and exercise. Like training muscles, you can adopt an extraordinary, unorthodox training plan for your brain and pick up a new skill in a short amount of time, be it a foreign language, programming, sketching, or even public speaking. He calls it “ultralearning.” For the book, he researched many references and interviewed like-minded friends who had similar experiences of acquiring or improving a skill intensively, and he summarizes the essential principles as a guide to a successful “ultralearning” project.
I finished this book in less than a week, and it was a pretty fun read. It includes many anecdotes from the author's friends and from high-achieving intellectual celebrities (Feynman, Ramanujan, Van Gogh, etc.). It also works as a practical guidebook for your own learning projects. Although the book is named “ultralearning,” it provides principles and tricks for learning in general, in or outside of school. In many places it resonated with me as a student.
If I had to pick bones: as a guidebook to learning, the book feels a little verbose with its stories, and as research on the psychology of learning, many of the stories don't feel formal or convincing enough. But in all, it was a fun read for the guidelines it provides. I'd recommend it to anyone who believes learning is an essential part of their life.
Presentation: https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems
Zookeeper's data model is much like a Unix tree-structured file system. Every node, called a znode, has a key name and a value, and may have children (except for ephemeral nodes).
Each znode contains metadata like timestamps and a data version number.
Nodes may be regular nodes or ephemeral nodes, which clients keep alive by sending heartbeats to the server, and which the server removes after a timeout. Handy for keeping membership information.
Provides a basic client API: create, get, set, delete, getChildren, and sync, the last for clients to read the most up-to-date information.
Keeps two consistency guarantees: linearizable writes, and FIFO ordering of each client's requests.
Writes are asynchronously linearizable, meaning client requests are non-blocking (or wait-free) but are processed in a serialized fashion. Zookeeper processes writes at the leader, while reads are served by all nodes for better scalability and performance, so reads are not strongly consistent. Zookeeper provides a sync() API for clients that need to read up-to-date data.
With Zookeeper's consistency model in mind, we can build powerful primitives on top of it for key cluster-configuration management: watch primitives let clients watch for value changes, and the SEQUENTIAL flag provides unique name assignment.
The request processor in Zookeeper is idempotent.
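As a sketch of how the SEQUENTIAL flag enables a leader-election recipe, here is a purely simulated version in Python (no real Zookeeper client; the names and helper are made up): each client creates a sequential znode, the lowest sequence number becomes leader, and each other client watches only its immediate predecessor to avoid a herd effect.

```python
# Simulated znode namespace (name -> data). A real system would use a
# Zookeeper client library; here we only mimic the SEQUENTIAL counter.
znodes = {}
counter = 0

def create_sequential(prefix, data):
    """Mimic create(..., SEQUENTIAL): the server appends a monotonically
    increasing, zero-padded counter to the requested name."""
    global counter
    name = f"{prefix}{counter:010d}"
    counter += 1
    znodes[name] = data
    return name

# Three clients race to create their election znodes.
for client in ("a", "b", "c"):
    create_sequential("/election/guid-n_", client)

# The client owning the znode with the smallest sequence number leads;
# every other client watches only the znode right before its own, so a
# leader's death wakes exactly one successor (no thundering herd).
ordering = sorted(znodes)
leader = znodes[ordering[0]]
watches = {znodes[name]: ordering[i - 1]
           for i, name in enumerate(ordering) if i > 0}
print("leader:", leader)
print("watches:", watches)
```

In a real deployment the election znodes would also be ephemeral, so a crashed leader's node disappears automatically and its watcher takes over.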
All write requests are processed as transactions: each either generates a new version number for the data when the request's version number matches, or generates an error when it doesn't.
All servers process reads, while writes are forwarded to the leader. Zookeeper uses a protocol named Zab to keep consensus among the cluster. Like Paxos, it requires a quorum to reach consensus.
Zookeeper replicates data in all followers’ database. It takes snapshots to compact data.
Zookeeper snapshots are called fuzzy snapshots, as a snapshot is not necessarily a valid point-in-time state of the Zookeeper data tree. During recovery, though, the data can be reconstructed from a fuzzy snapshot plus the operation logs.
Clients read from any server for performance, while writes are transactions. To read the latest data, a client syncs before reading.
Followers process a sync by appending it to the queue of pending writes to the leader. If there are no new writes ahead of the sync, the leader issues a dummy transaction, which also serves to guarantee that the leader is still the leader.
When logic and proportion
Have fallen sloppy dead
And the White Knight is talking backwards
And the Red Queen’s off with her head
Remember what the dormouse said
Feed your head
Feed your head
I recently came across this book on how the 1960s counter-culture and anti-war movements entangled with the personal computer movement in California. Much is known about how the pirates of Silicon Valley, Bill Gates and Steve Jobs, built personal-computing enterprises, but this book records some fascinating details of the stories before their age, and of the people who inspired the generation of Gates and Jobs by first putting forward the extraordinary idea of personal computing.
The story dates back to 1945, when Doug Engelbart started musing on a device that could extend the human mind, inspired by the “Memex,” a device conceived by Vannevar Bush: a machine that could track and retrieve vast volumes of information.
After school, a year of teaching, and several failed attempts to find a job where he could pursue his digital-computer dream, he landed at the Stanford Research Institute, where he began his research into digital computer systems.
At the same time in California, Myron Stolaroff first came in touch with the power of LSD, and later devoted his entire life to researching and promoting it. LSD was popular among the engineers described in the book; many, including Engelbart himself, used it as a mind-expanding tool.
In 1959, a young man named Fred Moore arrived on the Berkeley campus. With some radically progressive ideas in mind, he quickly rose to fame in the anti-ROTC student protests and became one of the leaders of the student movements of the 1960s and 1970s.
Three major threads led to the birth of personal computing. Engelbart had a vision of creating an augmenting device with the power of machines. Stolaroff was experimenting with a substance that could expand human creativity as well as human spirituality. And Fred Moore set out on a crusade to spread freedom and peace. All three contributed to the creation of personal computing.
With funding from the military, Engelbart continued his endeavor toward Intelligence Augmentation.
On Dec 9, 1968, at the annual Fall Joint Computer Conference, Engelbart introduced his system, working on a terminal remotely connected back to SRI. Dubbed “the mother of all demos,” it was the first time Engelbart and his team demonstrated to the world the power of computers to empower humans, and it inspired a generation of young engineers to join his team or pursue similar goals.
The book also introduces many interesting and important figures who influenced that age, e.g.
With visionary, persistent figures like Engelbart, genius engineers like Bill English, and student-movement activists like Fred Moore, 1960s-1970s America, and California especially, saw the engineer stereotype shift from the uptight traditionalist to the LSD-sipping hippie who valued freedom and liberal ideas most and pursued personal empowerment and individualism. The engineers in this story were influenced by the radical Californian shifts in ideology and activism, as well as by the MIT hacker spirit. These people were not just geniuses but individualists who believed personal computers were the key to individual empowerment. And maybe that, in turn, pushed forward the development of the most personally empowering device of the last century: the personal computer.
Though their efforts and visions were not immediately celebrated in their time, their influence, from SRI to Xerox PARC, was felt throughout the world when the young Steve Jobs and Steve Wozniak started from the Homebrew Computer Club and brought research ideas like the GUI, the mouse, and personal computing to the whole world.
In all, it is a very interesting book, worth a read if you're interested in computing's development in that age and the tremendous stories behind how personal computing came into being.
“Data and Goliath” is an excellent book a friend recommended. It's a summary of all the dangerous and negative ways data, and “Big Data” technology, can shape our societies. The author, Bruce Schneier, is a prominent cryptography expert who has published impactful work on cryptography and privacy issues. He is also on the board of directors of the Electronic Frontier Foundation.
The book provides an abundance of cases and examples of big-data misuse, along with the author's careful, in-depth analysis of the different impacts data has on our societies, and pragmatic recommendations to the different parts of society on solving the “Big Data” problem.
The book mostly discusses how governments and corporations can abuse data to profit from, surveil, or control citizens at the cost of our privacy, freedom, and even democracy. Without proper protection, regulation, and activism, we are unknowingly giving up our rights to our data.
Governments can abuse Big Data, corrupting our political liberty and justice systems: mass surveillance of citizens yields data that can be leveraged to accuse dissidents and silence political opponents. Government censorship can thwart free thinking and social progress, and make way for an oppressive regime.
The author revisits an interesting thought experiment, originally from the English philosopher Jeremy Bentham: the panopticon, a prison where all inmates can be watched by the guard at any time, even when the guard is not actively watching them. In such a system, inmates become far more compliant out of constant fear of criticism, judgment, and punishment. A society becomes a panopticon under mass surveillance and censorship.
Other examples include the political witch-hunts of the 1950s led by Senator Joseph McCarthy, and the harassment Dr. Martin Luther King Jr. received from then FBI director J. Edgar Hoover. The book describes the chilling effect surveillance and abuse of power can have on political movements.
From a commercial perspective, misuse of Big Data can have dangerous effects on society as well. Surveillance-based discrimination essentially revives “redlining” for the internet age, where discrimination can be far more pervasive, intrusive, and effective, and thus more damaging. Data collected by large corporations can be used for massive online manipulation. A good example is how Facebook can nudge its users to vote, shifting turnout by around 0.4%. Imagine if it displayed that nudge selectively.
(The book was finished around 2015, before the Cambridge Analytica scandal, proving the author's foresight.)
Finally, the book stresses the importance of software and network security and of privacy to our society, and analyzes why privacy doesn't contradict governments' role of keeping society secure, or corporations' role of leveraging data for profit. It closes with pragmatic recommendations for solving the “Big Data” mess, addressed to governments, corporations, and the rest of us. In all, it was a good read.
How to model time-series data with Cassandra.
The best way to understand something is to build one yourself. This tutorial covers basic network programming in Go, struct design, and the usage of the reflect package.
A great experience-sharing blog post on debugging a performance issue in Uber's services. With profiling and analysis tools, the Uber team pinpointed the issue to worker-pool and goroutine stack allocation, then forked the Go compiler to prove it was a regression in the compiler. A very nice read and analysis process.
A programming book on topics in distributed computation, based on teaching experience from a distributed-systems course at Northeastern University.
A very nice engineering blog from 2014: an excellent overview of Spotify's culture and an introduction to how it builds “agile” teams.
The NYTimes has released its in-house course for teaching journalists data science. Journalism can also benefit from a little coding and data-analytics skill.
If Go is one of your favorite languages as well, this is a must-read: it introduces all the basic tooling that comes with Go's ecosystem, which might greatly save you time.
A thread from HackerNews, discussing the importance of formal verification for distributed systems.
TLA+ and formal verification are notorious for their complexity and steep learning curve. This might be one of my long-term goals.
What it takes to be a software architect, a great blog post from InfoQ.
TIL that it is possible to convert your C/C++ assembly into Go’s assembly and call it from Go code. InfluxData leverages this tooling to embed AVX/SSE instructions into Go assembly, boosting Go code’s performance, sometimes by orders of magnitude.
More information on this tool, c2goasm, work from Minio.
I think so, too. But it’ll require a community and proper tooling to see it really prosper. Hope to see that some day.
A great piece from Ray Dalio, the founder of the investment firm Bridgewater and a seasoned investor, who discusses in his recent long post why American capitalism is sick at distributing resources, especially educational resources, and needs to be reformed to stay healthy.
Kafka, which is at once a message queue, a pub-sub system, an event-sourcing tool, and a stream-processing infrastructure, is a key part of many distributed systems that require streaming data. Its underlying idea is to aggregate data from distributed sources into one unifying, linear log structure.
The blog post is from Kafka’s creator Jay Kreps, written when he was at LinkedIn, contemplating the log abstraction as a key part of any distributed system. This is not Kafka’s design paper, implementation guide, or a tutorial, but rather the process of brewing the idea that led to its birth, and I found it equally interesting. The following are my notes.
The link to Kafka paper: https://www.semanticscholar.org/paper/Kafka-%3A-a-Distributed-Messaging-System-for-Log-Kreps/9f948448e7a5f0cc94cd53656410face8b31b18a
The log is the simplest storage abstraction, similar to what we see in application logs: records are appended to the end of a log data structure, and reads proceed left-to-right. This simple abstraction is powerful, in that:
The log-centric approach arises from a simple observation that the author named the “State Machine Replication Principle”:
If two identical, deterministic processes begin in the same state and getthe same inputs in the same order, they will produce the same output and endin the same state.
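The principle is easy to demonstrate in a few lines. Below is a toy sketch (the op type and counter state are illustrative, not from the post): two replicas that apply the same deterministic operations in the same order always end in the same state.

```go
package main

import "fmt"

// op is a deterministic state transition: here, an integer delta
// applied to a counter. Any deterministic function works.
type op int

// replica applies ops from a shared log, in order, to local state.
type replica struct{ state int }

func (r *replica) apply(o op) { r.state += int(o) }

func main() {
	log := []op{5, -2, 7, 1} // the shared, ordered input log

	var a, b replica
	for _, o := range log { // identical inputs, identical order...
		a.apply(o)
		b.apply(o)
	}
	// ...therefore identical final states.
	fmt.Println(a.state == b.state, a.state) // true 11
}
```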
And there are two major ways of leveraging logs in distributed processing and replication:
Make all of an organization’s data easily available in all its storage and processingsystems.
An organization may have multiple data inputs that gather events and data from many places, and different consumers to digest that data. A log structure can serve as a buffer as well as a central pipeline for all the different producers and consumers. In this way, the log serves as an asynchronous messaging system. All producers and consumers can read buffered data from the log at their own pace; e.g. a real-time system may need to read instantly, while an analytics platform may read only hourly or even daily.
Also, in a system with M inputs and N outputs, you’d need M × N pipelines to make sure each consumer can read from all data producers. But with a single unified data pipeline, every producer and consumer can write to and read from one single log. That’s the idea behind Kafka.
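The idea fits in a small sketch. Below is a toy in-memory log, not Kafka's implementation: producers append to one shared log, and each consumer tracks its own offset and reads at its own pace.

```go
package main

import "fmt"

// logT is a minimal append-only log: producers append records, and
// each consumer reads from its own offset, at its own pace.
type logT struct{ records []string }

func (l *logT) append(rec string) { l.records = append(l.records, rec) }

// read returns records from offset onward, plus the consumer's new offset.
func (l *logT) read(offset int) ([]string, int) {
	return l.records[offset:], len(l.records)
}

func main() {
	var l logT
	// Two producers write to the single shared log.
	l.append("click:home")
	l.append("click:cart")

	// A real-time consumer reads immediately...
	fast, fastOff := l.read(0)
	fmt.Println(len(fast), fastOff) // 2 2

	// ...while a batch consumer reads later and still sees everything.
	l.append("click:checkout")
	slow, _ := l.read(0)
	fmt.Println(len(slow)) // 3
}
```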
Kafka’s log structure also enables high-performance optimizations, e.g.:
Computing derived data streams.
Logs also make real-time stream processing easier. They enable real-time data collection from events or different data inputs, at different speeds, that consumers can read from at scale.
Logs also enable more complicated data flows, e.g. when the output of one log in a stream-processing system becomes the input of another, constructing complicated dataflow graphs. And the log has benefits:
Practical systems can be simplified with a log-centric design.
Because the log enables high performance and easy integration of data producers and consumers, distributed systems are more likely to move away from monolithic relational databases and toward more diverse data sources and consumers. Building distributed systems would feel more like playing with Lego bricks of open-source data components.
And a log system can play the following roles in a system architecture.
The author built these powerful ideas of the log into Kafka, one of the most influential data-streaming platforms. This long blog post might bring some insights on incorporating Kafka into a distributed system, as well as provide insight into building new system infrastructures.
Take a look at what Mr. Gates thinks are the greatest technology breakthroughs right now. The list might surprise you.
How Netflix leverages AWS technologies to build a world-scale, highly available, fault-tolerant distributed video streaming system.
Lyft architecture evolution on AWS.
From Farnam Street – an interesting blog site I found recently.
Also on Farnam Street and its “mental models”: The Mental Model Fallacy. TL;DR: the so-called “mental models” from Farnam Street are not of much value when they come from non-practitioners. And to learn business, like basketball or swimming, you’ll need to actually practice to pick up the intricate knowledge that is not easily translated into writing.
Unfortunately I didn’t have time to finish reading this paper. But it’s good to learn the concept of branchless algorithms, which keep the CPU pipeline full and achieve amazing performance.
Presentation: https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center
Mesos is cluster resource-management software from UC Berkeley. Unlike many frameworks that already existed, Mesos is designed to support heterogeneous frameworks (Hadoop, MPI, etc.) in the same cluster and share resources between them, by providing a thin layer that makes resource offers to the framework schedulers and delegates the scheduling decisions to the frameworks themselves.
With this design, Mesos can achieve pretty good elasticity between frameworks, and letting frameworks choose their own resources results in better data locality.
A full cycle of resource offer works as follows:
The Mesos core is designed to be small, and per the paper it could scale to 50,000 nodes under emulated load.
A team from Penn State University and Purdue published their latest study on concurrency bugs found in large open-source Golang projects on GitHub: Docker and Kubernetes (two datacenter container systems), etcd (a distributed key-value store), gRPC (an RPC library), plus CockroachDB and BoltDB. The authors searched each repository’s commit history to understand concurrency bug fixes for categorization and study.
TL;DR:
Abstract: the authors analyzed 171 bugs in the 6 aforementioned open-source Go projects for a systematic study of Go concurrency bugs, providing a better understanding of Go bugs and of concurrency bug detection tools.
The authors categorized the bugs into blocking and non-blocking bugs. Blocking bugs are misuses of synchronization primitives that cause the program, or a subset of its goroutines, to hang. Non-blocking bugs happen when shared memory is left unprotected, causing data races, or on erroneous message passing, e.g. when goroutines don’t quit properly, causing resource leaks.
The paper further divided blocking bugs into traditional shared-memory bugs and bugs caused by misuse of message passing or of messaging-related libraries.
This led to an interesting observation from this paper: contrary to common belief, message passing is potentially more likely to cause blocking bugs than shared memory.
An example of a blocking bug related to message passing, with its fix, similar to one I had run into before:
```go
// goroutine 1

// goroutine 2
```
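The original snippet was mostly lost in extraction, but the pattern the paper describes can be sketched as follows (a hypothetical reconstruction, not the paper's exact code): a child goroutine sends on an unbuffered channel after the parent has taken the timeout path, so the send blocks forever; the typical fix is a buffered channel.

```go
package main

import (
	"fmt"
	"time"
)

// buggy leaks its child goroutine: if the timeout fires first,
// nobody ever receives on ch, and the send blocks forever.
func buggy() string {
	ch := make(chan string) // unbuffered
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated slow work
		ch <- "result"                    // blocks forever on the timeout path
	}()
	select {
	case r := <-ch:
		return r
	case <-time.After(1 * time.Millisecond):
		return "timeout"
	}
}

// fixed uses a buffered channel, so the send always completes and
// the child goroutine can exit even when the caller has moved on.
func fixed() string {
	ch := make(chan string, 1) // the one-slot buffer is the whole fix
	go func() {
		time.Sleep(10 * time.Millisecond)
		ch <- "result"
	}()
	select {
	case r := <-ch:
		return r
	case <-time.After(1 * time.Millisecond):
		return "timeout"
	}
}

func main() {
	// Both return "timeout" here, but only buggy leaks its goroutine.
	fmt.Println(buggy(), fixed())
}
```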
An example in the paper of a blocking bug related to a messaging library is the Pipe library.
The paper also noticed that for blocking bugs (shared memory as well as message passing) there’s a high correlation between the bugs and their fixes, indicating high potential for automated tools that help fix such bugs.
For non-blocking bugs, the paper also divided them into traditional bugs (e.g. unprotected shared memory causing data races), misuse of channels, and shared data in special libraries.
An interesting example related to non-blocking bug caused by message passing, mentioned in the paper:
```go
// when multiple goroutines execute the following code, default
```
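Only the first comment of that example survives above; a plausible sketch of the pattern it describes (hypothetical code, not the paper's exact example) is a select with a default branch that silently drops a notification when no receiver is ready:

```go
package main

import "fmt"

// notify tries to signal on ch without blocking. When multiple
// goroutines (or repeated calls) run this and no receiver is ready,
// the default branch silently drops the notification: a lost signal,
// the non-blocking message-passing bug pattern described above.
func notify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default: // no receiver ready and buffer full: signal is lost
		return false
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(notify(ch)) // true: the buffer has room
	fmt.Println(notify(ch)) // false: buffer full, signal dropped
}
```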
As an example of a non-blocking bug related to a library, the paper mentioned the context library, whose context object type is designed to be accessed by multiple goroutines. Accessing the string type in the context library could potentially lead to data races.
The paper observes that some traditional data race detectors cannot detect all bug types, calling for future research on this topic.
More discussions from HackerNews: https://news.ycombinator.com/item?id=19280927
About: Borg is Google’s large-scale cluster workload scheduling and management system, which handles most of Google’s service and batch-job workloads on clusters of thousands of machines. It hides the burdens of cluster management from users and provides high-availability features that handle failures.
The now very famous and popular open-source container orchestration tool Kubernetes is an open-source successor to Borg, and keeps borrowing ideas from it (see kubernetes).
There are heterogeneous workloads on the cluster, which can mainly be categorized as:
A cell is a collection of machines in a datacenter. A cluster hosts one large cell or several smaller cells for testing.
A job is made of one or more tasks. Tasks can:
Users operate on jobs via RPCs to Borg.
Every job has a priority, and the scheduler schedules jobs ranked by priority.
Quota is assigned to or purchased by the user. It’s defined as a set of resources at a certain priority. Quota is enforced by admission control: if a job/user is over quota, the job is immediately rejected.
Borg names and monitors tasks with:
Borg master records all the job status and manages state machines to all the objects in the system (machines, tasks, allocs, etc). And the data is saved in a Paxos-enabled Chubby store.
Borglet is a local Borg agent that resides on every machine in a cell, which manages tasks on a single machine, and sends heartbeats to the master.
The Borgmaster records jobs to the Paxos store and a pending queue, which is picked up by the scheduler. For scoring, the scheduler uses either an algorithm called “E-PVM” (sometimes called “worst fit”) or an algorithm that packs tasks onto a minimal number of machines (sometimes called “best fit”).
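The difference between the two scoring policies can be sketched side by side (a toy model, not Borg's actual scoring functions): worst fit spreads tasks across the emptiest machines, while best fit packs them onto the fullest machine that still has room.

```go
package main

import "fmt"

// pick returns the index of the machine where a task needing `need`
// resource units would be placed, or -1 if it fits nowhere.
// worst=true  -> "worst fit": prefer the machine with the most free space.
// worst=false -> "best fit":  prefer the tightest machine that still fits.
func pick(free []int, need int, worst bool) int {
	best := -1
	for i, f := range free {
		if f < need {
			continue // doesn't fit on this machine
		}
		if best == -1 ||
			(worst && f > free[best]) || // most headroom wins
			(!worst && f < free[best]) { // least headroom wins
			best = i
		}
	}
	return best
}

func main() {
	free := []int{10, 4, 7} // free units per machine
	fmt.Println(pick(free, 3, true))  // 0: spread onto the emptiest machine
	fmt.Println(pick(free, 3, false)) // 1: pack onto the tightest fit
}
```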
Borg uses the following techniques for scalability:
Failures are normal, and applications running on Borg are expected to handle them; tasks are automatically rescheduled when evicted due to failure, eviction, preemption, etc.
Borg serves as an important example for the design of other large-scale distributed scheduling systems, balancing the challenges of functionality, scalability, availability, and high utilization of cluster resources.
The root cause of this vexing issue is the combined use of mutex locks and blocking channels. In Golang, channels are also often used as a powerful synchronization mechanism. They’re often used to protect the inner state of a structure, or to distribute workloads, making sure different actions are not taken at the same time.
```go
go func() {
```
By using a big select statement as a mux for all incoming read and write requests, channels protect shared state just like mutexes, and sometimes with more flexibility (e.g. when you include a timer or ticker in the code). However, it can be dangerous when people don’t realize that, as a synchronization mechanism, channels are just as prone to misuse, especially when mixed with mutexes.
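A minimal sketch of this mux pattern (illustrative code, not from the original post): one goroutine owns the state and serializes all access through a select loop.

```go
package main

import "fmt"

// counter protects its state with a single goroutine that muxes all
// reads and writes over channels: no mutex needed, because only the
// mux goroutine ever touches the state.
type counter struct {
	incr chan int
	read chan chan int
}

func newCounter() *counter {
	c := &counter{incr: make(chan int), read: make(chan chan int)}
	go func() {
		state := 0 // owned exclusively by this goroutine
		for {
			select {
			case n := <-c.incr:
				state += n
			case reply := <-c.read:
				reply <- state
			}
		}
	}()
	return c
}

func (c *counter) Get() int {
	reply := make(chan int)
	c.read <- reply
	return <-reply
}

func main() {
	c := newCounter()
	c.incr <- 2
	c.incr <- 3
	fmt.Println(c.Get()) // 5
}
```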
Here’s an example of misusing channels to cause a deadlock. See if you can spot it:
```go
func foo() {
```
It could be easy to reason about deadlocks when you’re using mutexes only, or when you’re using channels only, but perhaps not so easy when you’re mixing both.
Below is a simplified version of the deadlock bug, demonstrating how mutexes and channels used together can cause interesting issues. Without reading further, can you spot the issue?
```go
type A struct {
```
The issue lies where Action() and Foo() can be called simultaneously, or in very close succession, and both enter the critical section of A's mutex locks. Since B's mux uses blocking channels to coordinate different actions, the b.clear <- true statement will block if the code in the previous case has not completed. Therefore a.Action() and a.Foo() can both be locked, and b.clear is blocked waiting for a.Action() to finish, which is not going to happen while a.Action() is waiting for a.Foo() to unlock!
In my debugging experience, I haven’t run into a very good tool that analyzes this type of deadlock. There are several tools that deal with mutex locks only. There’s even one built into Golang’s runtime, but that’s not enough, as it only detects whether all goroutines are blocked.
I’ve used gdb and Golang’s pprof library. The convenience of the pprof library is that, if you’re writing a server application, you can directly register an HTTP endpoint with all the useful debug output on /debug/pprof. The one I used dumped all the running goroutines in the application:
```shell
curl localhost:10000/debug/pprof/goroutine?debug=1
```
And when examining all the outputs when a deadlock happens, you need to pay attention to the following details:
On a side note, pprof can be really useful if you’re trying to understand how the program is behaving. I even identified a resource leak in the code using pprof (maybe I’ll write another blog post to discuss it). See more at:
Example goroutine output from pprof, from the blog post mentioned above:
```
goroutine 149 [chan send]:
```
It’s easy to overlook that channels are a powerful synchronization tool in Golang, and bad consequences may follow. Instead of expecting deadlock-detection tools to come and save the day, it might be more efficient to reason about the code prudently, with the following lessons in mind:
If you want to become a ‘magician’, one of those with intricate moves and skills that amaze the audience, you’ll need to adopt a growth mindset:
You cannot become a ‘magician’ at your current rate of progress, or by simply imagining a better self: sometimes change involves a fundamental shift in how you see the world. To achieve that you’ll need to observe fellow ‘magicians’, learn the differences, and make non-linear progress.
Some interesting takeaways:
I don’t usually like the “success stories” or “how to become rich” genre of books/blogs/articles, and I keep my suspicions about this one, too. Nevertheless, I find most of the principles described in this blog post reasonable, and the author sounds sincere: build skills, build trust, build networks, build leverage, and finally, build your own brand.
There are quite a few books out there on how to be “successful”; some time I’d like to do some research on those, with more caution than I approach other books.
Extraordinary similarities can be observed between right now and the late 19th to early 20th century, when technology brought human society unrivaled fortune and wealth, distributed unevenly. Society underwent serious transformation, and paved the way to modern liberalism. The same might be expected now, or not. History never follows scripts.
Another great piece from Michael Nielsen, on how Anki systems help improve not just memory, but the whole process of understanding itself.
Aleksandr Solzhenitsyn - the man who told the truth. He spread the knowledge of the gulag system and how it’s used to suppress and mistreat people, and undermined the credibility of the Soviet Union Iron Curtain empire, one of the many factors that brought it to its knees.
The new digital age problems require new solutions. In the article the author proposed the following ‘Bill of Rights’ for the new digital age:
Sir TBL recounts coming up with the idea of a universal “information aggregator” that unifies access to all the world’s knowledge and information online while working at CERN, how he cooperated with similarly brilliant minds to build the first tools for the web, how he pushed the web into momentum, and finally, his own reflections on the impact of the web on society, both positive and negative.
The “Internet” was already a widespread concept before Sir TBL started working on the web. And Sir TBL came up with this simple yet powerful concept: all the world’s documents on the Internet, addressed by a “Universal Resource Locator” and linked together via “hyperlinks”. In this way, you can start your research from any document, and find all relevant resources by simply clicking on these “links”. All the world’s online knowledge is thus weaved together and made accessible to you. This abstraction helped make the Internet much more accessible to the public, and opened doors to waves of innovations and business opportunities. I think this is one of the reasons why TBL and his invention were great: he pondered long and hard on the complicated problem of organizing the Internet’s information, and came up with the most essential but powerful abstraction, which benefited the whole world.
Thanks to CERN, Sir TBL was able to work on this side project, and finally made it completely free and open to the world. Also, when he left CERN to cofound the World Wide Web Consortium (W3C) at MIT, he wanted to make sure the web would be kept running free and open to all. Without his spirit of openness and efforts to keep the web on this track, the web would be a much more dismal place. For this, he should be truly respected.
In the book, he also discussed his philosophy of keeping the web open, including topics on privacy, net neutrality, censorship, etc. It’s striking to see some of these ideas are still so relevant, if not more important, today. In 2019 we are experiencing woes from abuses of the web’s power, from the very Internet conglomerates the web helped to nurture, and from governments who use it to strip away the freedom it was designed to give people. That’s why I find this book still relevant and interesting today: the founder expressed his concerns about the web long ago. Had we listened to his ideas more carefully, we would be more aware and prepared to save it.
There are more interesting nuggets in the book: the whole thought process when he designed the web, the anecdotes from when he first demoed it, the stories of the first browsers, and his musings on the semantic web and his ultimate goal to “link the world’s information”. In all, it’s recommended reading.
This is a 2010 paper that presents Dapper, a tracing infrastructure from Google, built to solve problems at Google scale: in its massive distributed systems, a service can invoke very deep RPC call chains across different nodes in the cluster, which makes tracing quite challenging.
Highlights and takeaways:
The paper introduces the following concepts to describe the system: tree, span, and annotation.
tree
A simple service call could span a few different nodes in the system, forming a calling tree between different services, as shown above in figure 1.
span
In a Dapper trace tree, the tree nodes are basic units of work, referred to as spans. An edge indicates a causal relationship between a span and its parent. See figure 2.
Each trace has a single trace id shared across all its child spans. Each span has its own id, and records the relationship between parent and child (see figure 2). A parent span always starts before its children and ends after they finish.
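The model fits in a small sketch (field names here are illustrative, not Dapper's actual schema): each span carries the trace-wide id plus its own id and its parent's id, which is enough to rebuild the calling tree.

```go
package main

import "fmt"

// Span is a basic unit of work in a trace tree.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // "" for the root span
	Name     string
}

// children indexes span names by parent id, reconstructing the tree
// edges (the causal parent/child relationships) from a flat list.
func children(spans []Span) map[string][]string {
	tree := map[string][]string{}
	for _, s := range spans {
		tree[s.ParentID] = append(tree[s.ParentID], s.Name)
	}
	return tree
}

func main() {
	trace := []Span{
		{TraceID: "t1", SpanID: "1", ParentID: "", Name: "frontend"},
		{TraceID: "t1", SpanID: "2", ParentID: "1", Name: "auth"},
		{TraceID: "t1", SpanID: "3", ParentID: "1", Name: "search"},
	}
	fmt.Println(children(trace)["1"]) // [auth search]
}
```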
Dapper is designed to follow distributed control paths with near-zero intervention from application developers, by instrumenting the following libraries:
annotation
The instrumentation above is sufficient to derive traces of complex distributed systems transparently to users, but Dapper also provides capabilities for users to annotate important sections of their applications.
To improve performance, one of Dapper’s design decisions is sampling. The Dapper team noticed that sampling at a relatively small rate still gets pretty good results, with insights into critical performance issues.
Trace collection is divided into the following steps:
This is one of a series of papers from Microsoft’s Project Catapult, which studies leveraging reconfigurable devices (FPGAs, etc.) to accelerate the data center, from very specific acceleration algorithms like page ranking for the Bing search engine, to more sophisticated machine learning workloads like DNNs.
This is one of their early publications, which introduces the basic design and implementation of the FPGA-accelerated datacenter. It covers the fundamental details of all aspects of server design: hardware, network topology, FPGA core design, fault-tolerant cluster management software, workload scheduling algorithms, etc.
Some highlights and takeaways:
The Catapult hardware is integrated with existing server-grade blades, taking a PCIe slot on the motherboard through a daughter board carrying a single high-end FPGA card. The daughter boards connect to each other over a fast secondary network, independent of the CPU network. This secondary network forms a 6x8 torus topology (see more details in the paper), which gives fast inter-FPGA communication and good routability without too much cabling complexity. The CPU network connects to a 48-port switch per pod.
The FPGA space is divided into the Shell and the Role. The Shell manages common libraries and functionality like memory management, serial links, PCIe, and reconfiguration logic. The Role space is responsible for the actual acceleration algorithm, and is reloaded on each reconfiguration when the FPGA’s functionality needs to be updated.
FPGAs will be reconfigured from time to time, and the software must be designed to take the FPGA completely offline, ignored by its neighbors, to ensure correct operation.
Debugging is hard to achieve through typical JTAG hardware debugging facilities, considering the scale of the datacenter. The paper presents an ‘always-on’ data collector that captures the key components and saves them to a circular-buffer log.
The software interface divides the algorithm into 7 stages and distributes them across an 8-node FPGA group, with one node for redundancy. The paper describes how the network accelerates ‘Feature Extraction’, which produces a single score at the last stage, indicating how close the document is to the search keyword.
All queries are queued in DRAM. The Queue Manager takes documents from each queue and sends them down the pipeline. It also manages model reloads in the pipeline, which calculate different feature ‘scores’ for queries.
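As a purely software analogy for the dataflow just described (illustrative only: the real pipeline runs on FPGAs), a queue manager feeding documents through a chain of scoring stages looks like this:

```go
package main

import "fmt"

// stage refines a partial score and passes it downstream; the last
// stage's output is the final score. Each stage adding its own id
// stands in for a per-stage feature computation.
func stage(id int, in <-chan float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		for score := range in {
			out <- score + float64(id)
		}
		close(out)
	}()
	return out
}

func main() {
	// Queue manager: enqueue documents (initial score 0).
	docs := make(chan float64)
	go func() {
		for i := 0; i < 3; i++ {
			docs <- 0
		}
		close(docs)
	}()

	// Chain 7 stages, mirroring the paper's 7-stage split.
	var out <-chan float64 = docs
	for id := 1; id <= 7; id++ {
		out = stage(id, out)
	}
	for final := range out {
		fmt.Println(final) // 28 per document (1+2+...+7)
	}
}
```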
The Catapult project, according to the paper, ‘reduces the worst-case latency by 29% in the 95 percentile distribution’in their evaluation environment, and provides 95% gain in throughput relative to software.
A very interesting post on augmenting long-term memory, based on Ebbinghaus’ forgetting-curve theory: use flashcards to memorize everything you’ve learned, even trivia like your friends’ birthdays. It uses the Anki flashcard software to go through the list.
The author also reasons about the benefits of memorizing all the details, concepts, and “everything”: the details are the building blocks of a field of knowledge, and memorizing them dramatically helps with understanding the field.
It’s a long read but a deep discussion, and I find it a joyful read.
An interesting talk from Jared Diamond, the author of Guns, Germs, and Steel. Despite the kind of misleading title, it’s an interesting take on history and the progress of human civilizations, and how competitions between civilizations influence their prosperity.
An introduction to interfaces in Golang, and how dependency injection can help you design large projects.
The basic concepts of system design, web design, fundamental principles, and distributed systems design. A collaborative effort on GitHub.
A chapter from Google’s new Site Reliability Engineering book, on how to design a distributed cron daemon and handle problems including fault tolerance, repeatedly scheduled jobs, overloading the cluster, etc. The whole book is a very valuable summary of the experience of automation and distributed systems design at Google, at Google scale. I’ll definitely read through the other chapters.
Eli Bendersky’s blog post on why Golang gracefully handles the problems of concurrency at the language level that other major languages handle rather awkwardly, which greatly reduces the programmer’s mental burden of designing highly concurrent systems.
An introduction to learning Python in HPC, from introduction to Python language, to distributed HPC frameworks for Python.
A list of concepts, papers, and interesting blog posts on distributed systems design.
C++ and the Perils of Double-Checked Locking
DCLP (the Double-Checked Locking Pattern) is often used in the singleton design pattern: when you’d like to initialize a shared object exactly once, you follow these steps:
See C++ example:
```cpp
Singleton* Singleton::instance() {
```
This pattern, however, introduces subtle bugs when expressed in C++ with multi-threading.
The issue is with this statement:
```cpp
pInstance = new Singleton;
```
The following steps happen:

1. Allocate memory for the Singleton object.
2. Construct the Singleton object in the allocated memory.
3. Assign pInstance to the allocated memory.

But the C++ specification doesn’t enforce that these steps happen in this order, and compilers are therefore free to reorder them for the sake of optimization. As long as the observable outcome of the instructions is correct, compilers may place instructions in whatever order best utilizes the CPU. Consider the following case with DCLP:
If the assignment to pInstance is reordered before the construction, another thread can observe that pInstance is non-null and start using the object even before it’s fully constructed, accessing a half-built Singleton object. Oops. This is a very subtle bug, and a hard-to-detect issue, when we’re trying to initialize a shared resource once.
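As an aside, this reordering hazard is exactly what Go's sync.Once is designed to rule out, which is why Go code never needs hand-rolled double-checked locking. A sketch of the Go equivalent (the Singleton type here is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Singleton is an illustrative stand-in for the C++ example above.
type Singleton struct{ value int }

var (
	instance *Singleton
	once     sync.Once
)

// Instance is the safe equivalent of DCLP: sync.Once guarantees the
// constructor runs exactly once AND that its writes are visible to
// every caller before the pointer is, the ordering the naive C++
// version fails to enforce.
func Instance() *Singleton {
	once.Do(func() { instance = &Singleton{value: 42} })
	return instance
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = Instance().value // never observes a half-built object
		}()
	}
	wg.Wait()
	fmt.Println(Instance().value) // 42
}
```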
The paper digs into the details of how a compiler can leverage all sorts of optimizations to spoil your effort to correct the DCLP code, and how to actually implement it correctly with the volatile keyword.
It’s a very interesting paper on algorithms, C++, and programming. It makes you stand in awe of the difficulty and intricacy of C++ and multi-threaded programming.
Linux manual page for the cgroups feature in the kernel, which restricts a Linux process’s CPU, max process count, memory usage, network setup, etc.
Linux manual page for the namespaces feature in the kernel. Namespaces can be specified via the clone syscall, and isolate the child process’s cgroup, IPC, network, mount points, domain name, etc.
When all the ingredients come together, it’s the foundation Docker is built upon. This very interesting talk from GOTO 2018 demonstrates how you can use the following technologies, already built into the Linux kernel, to create your own very small proof-of-concept Docker:
- chroot
- namespaces
- cgroups
It also includes very interesting details including (but not limited to):
- Mounting the /proc virtual file system for your ‘containerized’ child process.
- Passing CLONE_NEWNS to the clone system call, to ‘unshare’ the child’s mount points from the parent process, so that the parent doesn’t see the child’s mount points (which could be many and messy).

How an optimization problem is used in AI, and therefore in all AI applications, including self-driving, etc. Math is magical.
As it actually encourages collaborations, discussions, and exposure to opposing views.
Learning technical writing from the author of your favorite C programming book, ‘The C Programming Language’.
The sysctl directory /sys/kernel/mm/hugepages/hugepages-{pagesize}kB/ contains control files and information on hugepages, where pagesize can be 1048576 or 2048, corresponding to a 1GB or 2MB hugepage size.
To get information on hugepages on your Linux system, the hugepages directory contains the control files:

- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- free_hugepages
- resv_hugepages
- surplus_hugepages

You can also get hugepage-related information from /proc/meminfo:
```
HugePages_Total:    2048
```
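As a sketch of what those fields look like programmatically (illustrative code, using a hard-coded sample instead of reading the real /proc/meminfo):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hugepageInfo pulls the HugePages_* fields out of /proc/meminfo-style
// text. On a real Linux system you would read "/proc/meminfo" instead.
func hugepageInfo(meminfo string) map[string]int {
	fields := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) >= 2 && strings.HasPrefix(parts[0], "HugePages_") {
			n, _ := strconv.Atoi(parts[1])
			fields[strings.TrimSuffix(parts[0], ":")] = n
		}
	}
	return fields
}

func main() {
	sample := `MemTotal:       16000000 kB
HugePages_Total:    2048
HugePages_Free:     2048
Hugepagesize:       2048 kB`
	info := hugepageInfo(sample)
	fmt.Println(info["HugePages_Total"], info["HugePages_Free"]) // 2048 2048
}
```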
The kernel documentation for Hugetlbpage contains detailed information and explanation of the purpose and usage of the hugepage files, as well as of the meminfo fields.
The most convenient way to reserve hugepages on x86_64 Linux is to echo into the sysctl file /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages, e.g.:
```shell
# echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```
And for users to access and use the huge pages, Linux provides a quite convenient interface: the hugetlbfs file system, e.g.:
```shell
mount -t hugetlbfs nodev /mnt/huge
```
This mounts a pseudo filesystem of type hugetlbfs on /mnt/huge, using the default huge page size specified by the system; all files created inside the directory use huge pages.
And after that, you can use hugepage-backed memory by creating files inside the /mnt/huge directory. See the example in the Linux source tree: hugepage-mmap.c. The author takes the following steps:

- Open a file under /mnt with read-write permission.
- Map the file with mmap, with proper protection and flags set (PROT_READ | PROT_WRITE and MAP_SHARED in this case).

According to the Linux kernel documentation:
```
System administrators may want to put this command in one of the local rc
```
And to quote the LWN article:
```
If the huge page pool is statically allocated at boot-time, then this section
```
So to avoid external fragmentation and make sure that hugepage allocation always succeeds, we may want to reserve hugepage memory regions at boot time. This Debian wiki page provides a way to do that. To reserve a number of hugepages, add the following line to /etc/sysctl.conf:
```
vm.nr_hugepages = 1024
```
And to mount it automatically on system start, simply add to /etc/fstab:
```
hugetlbfs /hugepages hugetlbfs mode=1770,gid=2021 0 0
```
There are other advanced topics, which probably will not be covered in the scope of this quick note:
- libhugetlbfs provides programmer APIs to manage and access hugepage memory as well. Its utility hugectl can overload Linux’s standard shmget() functions to allow huge pages to be used when allocating shared memory.
- hugectl also has options to run an application with its text and data sections mapped by hugepages, which gives potential performance benefits.

These are ideas worth digging into in the future, for applications where hugepages can potentially give a good performance boost.
The author’s experience writing a time-series database from the ground up, for Prometheus.
The effort to scale Prometheus with a new project, Thanos, using the Kubernetes sidecar pattern to read data from individual nodes, pre-process it (e.g. sampling), and submit it to a centralized data store for storage and display.
What kind of machine/cluster you’ll need for different size of user base (from 1 to billions).
A display of CPU trace as a Github-style texture tiles.
IT-‘No Bug’-Hare is an interesting blog I found recently, focused on system, C++ language and game design. A good read for C++ fanatics and system designers.
I’m feeling guilty for not updating for so long. But on the bright side: I’m back.
As part of my work requirements, I’m taking on Golang and some small distributed-system design jobs. It’s an interesting language for these tasks: network, systems, infrastructure, etc. I’m having mixed but mostly positive feelings about the language, and maybe will share my experience when I get a chance.
"Managers want metrics that are easy to calculate, easy to understand, and quick to yield a value… metrics with these desirable properties are almost always worse than useless."
Easy metrics are also easily “hacked”: people game the metrics to make the statistics look good, while deviating from academia’s original purpose: good-quality research.
See also:
Every attempt to manage academia makes it worse
An interesting, anarchic-style experiment on Reddit: let thousands of Redditors draw a picture all at the same time, and what would happen? It turned out surprisingly well.
Tim Berners-Lee, the Father of the World Wide Web and Turing Award winner, believes the web nowadays has serious flaws, namely the loss of control over personal privacy, the rampant spread of misinformation on the web, and manipulation by political campaigns online. It took everyone to build the web we have today, and it will take everyone to fix it now.
More reports and readings on Tim Berners-Lee:
Posted here again. He has done tons of reading to arrive at his insights on science, Computer Science, technology, and society. It’s gonna be a long but joyful road.
Lesson 101 for a netizen: handling viewpoints that contradict your own. A good read.
How Ethereum and BlockChain technology may bring us a truly open, distributed Internet. Maybe.
A curated list of video courses/podcasts for entrepreneurs.
Sysadmin Casts is back again, and this time with more stuff: how to turn an idea into an MVP.
Artificial Intelligence for better conversational interfaces, Elon Musk’s companies, and biological technologies are back in people’s attention again.
YouTube channels for inspirations.
Why FPGAs still have a shot.
An interesting tool for better writing, including Flesch–Kincaid readability tests and pointing out Weasel Words.
The difference between arr and &arr: given an array int arr[size], arr decays to a pointer of type int *, while &arr has type int (*)[size], a pointer to the whole array.
An excellent article on the fundamentals of C/C++!
Some “gotchas” and pitfalls in the C programming language and how sometimes compiler optimizations can make it worse. Long story short is, steer away from undefined behaviors.
This post is from Chris Lattner himeself. Really nice article.
Why Python is such a cool language and how Python is used at Red Hat. Much of Red Hat’s important infrastructure is written in Python, including but not limited to firewalld, yum and its successor dnf, and many cloud PaaS tools for OpenShift.
How to write a few lines of Python to simulate a statistics problem that would otherwise be onerous to solve with math theorems and formulas.
How to write clean, well-structured and “Pythonic” code.
By no one but Guido van Rossum himself, on the status of Python 3, and how he created Python.
(And Death to Python 2!!)
Stop writing classes - when, and when not to use classes. Stop thinking in Java (no offense), and learn to be “Pythonic”, for a smaller, cleaner, and more well-structured project code base.
Conscientiousness - “A Personality trait marked by diligence, perseverance and self-discipline”.
Daniel Kahneman - I’ve been recently reading his book “Thinking, Fast and Slow”. It’s often listed as a work in economics, but from what I’ve read it’s also an amazing book on psychology and human cognition.
Reading more books is definitely one of my New Year resolutions. Just started this book, will finish.
Finished most of “C++ In A Nutshell” and Scott Meyers’s “Effective C++”, and started to pick up the basics of the C++ language. Really great books for learning the fundamentals, and some of the fundamental problems in the language.
A very interesting guide to scalable system design and how you should deal with it in an interview. It’s very interesting to learn the basics, though doing them properly might require years of experience.
A guide to scalability, a series of interesting and concise introduction to the same problem.
A very interesting guide on how to use Unix’s core utilities (grep, find, bash, awk, sort, gcc, gdb, git, vim/emacs, …) to arm yourself for code editing/maintenance tasks.
JavaScript…
A Python hacker’s guide to Python, from the author of “The Hacker’s Guide To Python”.
In light of the recent election…
Some security pitfalls in the Python language. Very interesting read, from Red Hat.
A very beautifully crafted GDB init file. Worth taking a look.
From the author of ‘The Hacker’s Guide to Python’.
If this site is reliable, this is Alan Kay’s reading list for all his students. He’s a great thinker, not just in Computer Science, but in human intelligence in general. His list is a constant reminder of how much I’m trailing the great minds of this generation, and how much I should pick up the pace in my reading.
Tips on being an efficient programmer.
Interesting how big companies like Facebook and Google use techniques to entice you to stay on their pages longer, or click on more of their links. I think it’s an interesting read that raises our awareness of such tricks and helps us defend ourselves from such exploitation.
How a slew of new startups are using the latest technologies such as “Blockchain” and “Ethereum” to decentralize the key web infrastructure and the World Wide Web it supports, to compete against giant corporations like Google and Facebook. It’s an interesting trend to keep an eye on, but so far I don’t know if I share the optimism that they’ll succeed.
Eli Bendersky’s blog has always been a must-read to me. He never fails to regularly come up with posts of interesting and insightful ideas, or detailed tutorials.
He also actively participates in the LLVM-dev mailing list and, based on his blog, has broad interests in programming languages, computer systems, etc.
Philip Guo is another one of my favorite bloggers. This time he wrote an intro to HCI research.
I haven’t read extensively from books or blogs recently, which is a shame. I shall definitely invest more time in reading and expanding knowledge.
The very recent GCC 6.0 version in trunk, however, will produce a bad binary for a relatively stable version of V8 with the -O3 flag enabled. The output binary segfaults on some of the most basic tests. At first we immediately assumed it was a bug in the bleeding-edge GCC, and submitted a bug report to the community, which responded promptly (within half an hour, that’s incredible speed; kudos to GCC) that the problem resulted from undefined behavior in V8. The problem is rooted in the fact that some V8 code dereferences null object pointers to access member functions. You can even see their C++ code comparing this to NULL in class member functions:

if (this == NULL) {
  // some logic
}

And the new GCC decided to optimize it away, because in a well-defined C++ program, this can never be NULL.
Undefined behavior is also referred to as Nasal Demons. The “dereferencing NULL pointer” code has also been discussed in this well-written post: Still Comparing “this” Pointer to Null?, about the hazards of using it. Somehow, from the M$ MFC library to the widely used V8 JS engine, they all use this trick for a happy hacking experience. This tech debt is a time bomb planted in their code, and no one knows when it will go off. For V8 it went off around Oct. 2015, when mainline trunk GCC decided to use this undefined behavior for optimization, causing crashes in the produced V8 binary.
Theoretically it could be worse: this can cause a security vulnerability. And the problematic code will work just fine with one revision of the GCC compiler, but not with the very next commit. It’s a nightmare for anyone to debug.
Guys in the Chromium project seem to have been aware of this problem for some time. I quote: “Fundamentally this is fixable by making the functions static and explicitly passing the entity as parameter, but that’s a tremendous amount of work.” See this bug:
https://bugs.chromium.org/p/v8/issues/detail?id=3782
All the coders who touched V8 code should be much smarter than I am. But somehow they let this code slip in, and now the bad code has piled up and is too hard to fix. The moral of this story is: C/C++ is a very hard language to use right, and it takes much patience to learn, understand, and write correct, clean code. Without the patience to learn correct code, fall to the dark side of the source one easily will.
Looks like this code has bitten other people as well. And they are from quite a while ago:
https://jira.mongodb.org/browse/SERVER-15182
https://jira.mongodb.org/browse/SERVER-15306
Attached is a pretty good presentation on undefined C/C++ code:
A good review of, as well as critique of, the original “How to C in 2016”, debunking some myths and making suggestions on how to really code in C.
The minimal-fuss setup for frontend development, from Philip, one of my favorite professors, programmers, and bloggers.
Or rather, an intro to assembly. I’ve just taken a quick glance at the lite version, which covers x86/x86_64 MSVC assembly only. A quick review to polish my memory of x86 assembly.
The full version also contains ARM version of assembly, which is my next target.
The PEP8 Style Guide for Python Code. A good guide to writing consistently readable and beautiful Python code.
A good intro to OpenPGP if you’re a beginner or haven’t heard of it before.
A list of New Year resolution ideas for programmers. I really like the ‘Embrace the uncomfortable’ part. Comfort is what kills you: it makes you lazy and dull, and makes your brain decay. It’s a good idea to stimulate it once in a while.
I do want to learn at least one more new programming language (or maybe pick up Haskell and/or Scheme again?), learn more about security, learn how to use vim, and learn more about non-programming subjects (economics, philosophy, sociology, etc.).
Scott Young explains why we actually do not understand what we think we understand, and how to really understand by using the ‘Feynman Technique’.
I’ve read the Chinese version of this book. Very interesting insight into Israel and Jewish culture. It basically explains how Israel managed to build such a powerful nation and exert influence on global economics, politics, and technology, with limited resources and a hostile environment.
Here I list several observations the author provides in this book, which I find very interesting.
The Internet provides most people the ability to access information from everyone else, which makes everyone a media outlet. It has always been the trend that new technologies lower the barriers of professions and cause mass amateurization. Just like ancient scribes were replaced by Gutenberg’s printing technology, the technological barriers of printing, editing, and distributing news have been lowered by the invention of the Internet, made accessible to the public instead of an elite few, blurring the line between amateurs and professionals.
One outcome of mass amateurization is that the content provided by the general public is often not of as good quality as professionals’. However, the accessibility of the Internet has drastically lowered the cost of publishing, and the new form of media has adapted to the ‘publish, then filter’ pattern.
– “Fewer than two percent of the Wikipedia users ever contribute, yet that is enough to create profound value for millions of users.”
The distribution of participation in large projects always follows a power law: the most active contributors contribute tens to hundreds of times more than the average contributor, and the effect grows with the size of the project. This is true for almost all online participation. Most Wikipedia pages are contributed by a handful of editors, but maintained by many users who contribute a few lines or fix some typos. Most large open source projects are maintained by a few core developers, yet receive small contributions from everywhere. Interestingly, I quote the book: “most large social experiments are engines for harnessing inequality rather than limiting it.”
Before Wikipedia, the founders started off their idea of an open online encyclopedia by creating a site called Nupedia, with content contributed by experts only. That experiment failed, but the succeeding non-profit, volunteer-only Wikipedia soon gained popularity. One of the many interesting questions about Wikipedia is: what gave people the motivation to contribute?
The author’s answer is: love of Wikipedia. “When people care enough, they can come together and accomplish things of a scope and longevity that were previously impossible; they can do big things for love.”
Wikipedia provides a powerful engine (the wiki engine) to channel the love from contributors. Wikis allow revisions and histories, thus making iterative improvements possible, while at the same time the preserved history versions protect wiki pages from catastrophic damage by evil-minded people. Together these are indispensable ingredients of Wikipedia’s success.
“The order of promise, tool, and bargain is also the order in which they matter most to the success of any given group.”
The promise of a group provides the ideology for the group and is the main reason why people are willing to participate. It sets the tone for the group’s activity. “Let’s try to see if we can come up with something together” is actually the very first promise Torvalds put in the email introducing his toy OS Linux. It was not as sweeping as a promise like “Let’s make a world-changing Operating System together” (although it turned out to be one at last), but it provided just enough interest to draw people to this small infant project.
Tools define how interactions happen among the groups, setting tones for interactions. A wiki is good for shared knowledge and judgment, while a mailing list is more convenient for open discussions.
The bargain is more like the adjustment to the culture inside a group. “We expect politeness of one another, and we rebuke the impolite” is a bargain most likely to create a culture that is friendly and respectful.
This is an interesting book on how large groups, especially groups on the Internet, work, and how the “wisdom of the crowd” is, and should be, collected. As my energy is limited, I can’t list all the important ideas in it; this post is my best effort. Anyone who’s interested in building an online community might benefit from this book. In all, it’s worth taking a look at.