I recently came across a few of Andrej Karpathy’s video tutorial series on Machine Learning, and I found them immensely fun and educational. I highly appreciate his hands-on approach to teaching the basic concepts of Machine Learning.
Richard Feynman once famously stated, “What I cannot create, I do not understand.” So here’s my attempt to create a toy implementation of gradient descent, to better understand the core algorithm that powers Deep Learning, after learning from Karpathy’s video tutorial of micrograd:
https://github.com/hxy9243/toygrad
Even though there’s a plethora of books, blogs, and references that explain the gradient descent algorithm, it’s a totally different experience when you build it yourself from scratch. During this exercise I found quite a few knowledge gaps of my own: things I’d taken for granted and never fully understood.
This blog post is my notes from that experience. Even writing it helped my understanding in many ways.
In Calculus, the chain rule gives the basic rule for finding the derivatives of composite functions.
As we learned in Calculus class, the chain rule states that for a composite function F(x) = f(g(x)):

F'(x) = f'(g(x)) · g'(x)

For example (using sin(x²) as an illustration), the derivative of f(x) = sin(x²) is f'(x) = cos(x²) · 2x.
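To make this concrete, here’s a quick numeric sanity check of the chain rule (the function sin(x²) is my own illustrative choice, not a specific example from the course):

```python
import math

def f(x):
    # composite function: f(x) = sin(x^2)
    return math.sin(x * x)

def f_prime(x):
    # chain rule: f'(x) = cos(x^2) * 2x
    return math.cos(x * x) * 2 * x

# compare against a central finite-difference approximation
x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(abs(numeric - f_prime(x)) < 1e-6)  # True
```

If the chain-rule derivative were wrong, the finite-difference estimate would disagree with it; matching to within 1e-6 is a cheap way to check your hand-derived gradients.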
With the chain rule, we can derive the autograd algorithm to implement backpropagation on complex compute graphs.
Backpropagation, or backward propagation of errors, is the algorithm for finding the derivative of the loss function (a function that computes the difference between the prediction and the actual output data).

It calculates the gradients backwards through the feed-forward network, from the last layer to the first.
There are 4 steps to the Backpropagation algorithm:

1. Forward: feed the input through the network to compute the prediction.
2. Loss: compute the loss between the prediction and the expected output.
3. Backward: propagate the gradients of the loss back through the network.
4. Update: adjust each parameter against its gradient, scaled by the learning rate.

And we repeat this 4-step process until the loss is close to the minimal value.
Conceptually, the gradients represent the slope of the loss at the current parameter values. So we subtract from each parameter a small step in the direction of its gradient (the step size decided by the learning rate). It’s a process of moving the loss closer to its minimal value, at a speed defined by the learning rate.
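The update rule can be sketched in a few lines. The quadratic loss here is my own toy example, just to show the parameters sliding down the slope:

```python
def grad(w):
    # gradient of the toy loss L(w) = (w - 3)^2, which has its minimum at w = 3
    return 2 * (w - 3)

w = 0.0    # initial parameter
lr = 0.1   # learning rate

for _ in range(100):
    w -= lr * grad(w)   # step against the gradient, scaled by the learning rate

print(round(w, 4))  # converges toward the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor, which is why a too-large learning rate overshoots and a too-small one crawls.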
There are a lot of tricks and optimizations for adjusting the learning rate, but those are outside the scope of this discussion.
AutoGrad, or Automatic Differentiation, is the core of the backward step in the backpropagation algorithm. It computes the gradients of all parameters in reverse, calculating the derivative of each parameter by applying the chain rule backwards through the compute graph.
Here’s why the AutoGrad algorithm makes sense.
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#differentiation-in-autograd
The AutoGrad algorithm can be derived from the chain rule. Assume in this simple case that w3 is the result of a computation on w2, i.e. the successor of w2 in the compute graph. For the final loss L, the chain rule gives:

∂L/∂w2 = ∂L/∂w3 · ∂w3/∂w2
When we need the gradients of all the parameters, we just apply this chain rule backwards through the computational graph. The gradient of each intermediate variable, ∂L/∂w_i, is the sum of the gradients contributed by all of its successors in the graph.
With this chain rule in mind, we start with the final result of the equation and work backwards through the compute graph. The final result has a seed gradient value of 1 (since ∂L/∂L = 1), and we back-propagate the gradients to all intermediate parameters.
See more at: https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation
The core of this autograd engine design is the `Value` class, which can express both the feed-forward computation and the backward propagation of gradients. The key design idea is that each `Value` records its operands and the local gradients of its operation. For example, in the case of multiplication:
```python
class Value:
    ...
```
Calling `backward()` on the final result of the compute graph first assigns a gradient of 1 to the final value, then computes the gradients of all its operands. We can do it in two steps: topologically sort all the values in the compute graph, then call the `backward()` function on each value in that order, updating the gradients of all parameters.

With the `Value` engine designed, we can now chain these value computations to form a feed-forward neural network with dense layers, where all Values are connected to the next layer’s values.
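The steps above can be condensed into a sketch in the spirit of micrograd. This is my own minimal version (only `*` and `+`), not Karpathy’s exact code:

```python
class Value:
    """A scalar value that records its compute graph for backprop."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # fills in local gradients when called
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # local derivatives: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # addition passes the gradient through unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # step 1: topological sort of the compute graph
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)

        # step 2: seed gradient of 1, then propagate in reverse order
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a       # c = a*b + a
c.backward()
print(a.grad, b.grad)  # 4.0 2.0  (dc/da = b + 1, dc/db = a)
```

Note how the gradients accumulate with `+=`: `a` appears twice in the graph, so it receives contributions from both of its successors, exactly the “sum over successors” rule described above.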
We’ll also need code for running the `forward()` and `backward()` steps on the neural network after feeding it the training input and output data. Here’s an outline of the code that describes the steps of model definition and training:
```python
class Model(Module):
    ...
```
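Only the first line of the original outline survives here, so below is a hedged, self-contained sketch of the same idea with manual gradients for a single linear neuron. The `Neuron` class and toy dataset are my own illustration, not the original code:

```python
import random

class Neuron:
    """A single linear neuron trained with plain gradient descent."""

    def __init__(self):
        self.w = random.uniform(-1, 1)
        self.b = 0.0

    def forward(self, x):
        return self.w * x + self.b

# toy dataset for the target function y = 2x + 1
data = [(x, 2 * x + 1) for x in [-2, -1, 0, 1, 2]]

model = Neuron()
lr = 0.05
for epoch in range(500):
    grad_w = grad_b = 0.0
    for x, y in data:
        # forward pass: prediction error for mean squared error loss
        err = model.forward(x) - y
        # backward pass: accumulate gradients of the loss
        grad_w += 2 * err * x / len(data)
        grad_b += 2 * err / len(data)
    # update step: move parameters against the gradient
    model.w -= lr * grad_w
    model.b -= lr * grad_b

print(round(model.w, 2), round(model.b, 2))  # ≈ 2.0 1.0
```

The same forward / loss / backward / update loop is what a real model runs, just with the gradients computed by the autograd engine instead of by hand.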
If you’ve read this far, I highly recommend the video tutorial of micrograd from Andrej Karpathy himself, along with his implementation. He explains it far better than I could.
Also, let me know if you find any problems in this blog post or in my implementation:
https://github.com/hxy9243/toygrad
Previously: Paper Readings on LLM Task Performing
I’m still combing through the papers on LLMs and I’ll summarize my readings here on this blog. There’s an increasing amount of interest and attention in this field, and hence there will be many more papers in the foreseeable future. Hopefully I can squeeze out more time to read and share my experience with all of you.
https://arxiv.org/abs/2301.00303
This early paper (early as in the LLM universe: Jan 2023) laid the foundation of the RAG (Retrieval-Augmented Generation) design for LLM architectures. The idea: since LLMs are weak in factual reasoning, we can insert related information as context in the prompt, so the generation draws from the given context instead of the model’s own memory, which is prone to errors.
The paper uses CoT (Chain-of-Thought) to break down the input question, and retrieves supporting information from the data source (Wikipedia and Wikidata in this case) for each step of the reasoning.
The paper does not use chunking and embedding-based document preprocessing and retrieval, but a more traditional and well-tested information retrieval algorithm, BM25.
https://arxiv.org/abs/2203.11171
Self-Consistency is like an ensemble method: it uses CoT (Chain-of-Thought) reasoning to generate a diverse set of reasoning paths, then votes to find the most consistent answer.

This requires the problem to have an answer that is clear to define and easy to compare, as with most mathematical and arithmetic questions.
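The voting step itself is just a majority count over the sampled final answers. A toy sketch (the sampled answers are made up for illustration):

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most frequent final answer among sampled reasoning paths."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes

# pretend we sampled 5 CoT reasoning paths for an arithmetic question;
# these are the final answers they produced (made up for illustration)
sampled = ["18", "18", "26", "18", "26"]
print(self_consistency(sampled))  # ('18', 3)
```

This is why the answers need to be easy to compare: voting only works when two reasoning paths that reach the same conclusion produce literally identical final answers.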
https://arxiv.org/abs/2305.10601
Tree-of-Thought is an extension of the ideas of Chain-of-Thought and Self-Consistency. Instead of a single, linear chain of reasoning, each step of ToT branches out to multiple probable answers.
This is helpful for problems that require some exploration, e.g. the Game of 24, creative writing, or mini-crosswords.

This approach requires the problem to be explorative in nature, with clear definitions of success to filter out bad outputs.
It reminds me of dynamic programming in traditional programming, and the problem sets are even similar.
https://arxiv.org/abs/2205.10625
As I understand it, this is a prompt-engineering way of saying:

Least-to-most decomposition can be described as guiding the LLM to think: “To solve this problem, we’ll need to know XXX first.”
https://arxiv.org/abs/2210.02406
Similarly, Decomposed Prompting is the idea of dividing a problem into more structured subproblems, possibly even using external tools (like retrieval) to complete the answer.
What’s interesting about this paper is how it divides the problem into sub-problems more methodically (e.g. with iterative, recursive, splitting, or external information-retrieval decompositions). Sub-problems can be more clearly defined, and merging them is cleaner.

The paper uses few-shot example prompting to guide the LLM to decompose the problem.
https://arxiv.org/abs/2210.03493
The observation this paper makes: with more diverse examples for few-shot learning, the LLM is more likely to be correct. So the paper optimizes the example-creation stage to provide more diverse examples.

Auto-CoT creates examples from a dataset of existing questions without known-correct QA pairs. It’s designed to reduce the impact of wrong answers in the examples or zero-shot results.
Auto CoT creates examples for in-context learning. It selects from a pool of diverse questions to create examples to maximize correctness. It achieves it by:
https://arxiv.org/abs/2206.02336
This paper has a very similar idea to the “Self-Consistency” paper: it breaks the problem into steps, each step has diverse outputs, and it performs voting to decide the most correct output for the next step.

This paper uses a voting verifier to vote for the most plausible next step. The voting verifier is trained on a dataset of multiple reasoning paths.
Recently I looked into the LangChain project, and I was surprised that such a powerful and mature project could be built in such a short span of time. It covers many essential tools for creating your own LLM-driven projects, abstracting cumbersome steps into only a few lines of code.
I like where the project direction is going, and the development team has been proactively including and introducing new ideas of the latest LLM features in the project.
The path to understanding this new project wasn’t really smooth. It has its own opinions on code organization, and it can be unintuitive to figure out how to hack up your own projects beyond the tutorials. Many of the tutorials out there explain how to create a small application with LangChain, but don’t cover how to intuitively comprehend the abstractions and design choices.
Hence I’ve documented my personal cognitive process throughout this journey. By doing so, I aim to clarify my own understanding while also providing assistance to y’all who are interested in hacking LangChain for fun and profit.
This blog post is dedicated to an overall understanding of the concepts. I found it really helpful to start with the concepts that directly interact with the LLM, especially the core API interfaces. Once you have a mindmap of all the LangChain abstractions, it’s much more intuitive to hack and extend your own implementation.
I’ll be covering the very basic concepts around Chain and Agents:
What I’m not covering in this blog post (they’ll be for another blog/discussion):
Pick the right ones, and programming will flow naturally from design;
Pick the wrong ones, and programming will be a series of nasty surprises.
– MIT Professor Daniel Jackson on Abstraction in Software, in his book “Software Abstractions”
Chains are the basic way of organizing actions, extending LLM capabilities, and integrating different Chain actions together. You can think of one as a “chain” of actions grouped together.
The interface of the base Chain class:
```python
class Chain:
    ...
```
Once you understand this, it’s pretty clear how to extend the Chain. You’ll need to define the `_call()` method (or `_acall()` for asynchronous calling, but I’d like to skip those for now). The `__call__()` and `run()` methods are really just wrappers around this core method that preprocess input parameters.
Sometimes it can be confusing that there are so many different ways of calling the same Chain. But think of `_call()` as the core functionality you define as the developer (it receives nice, preprocessed input parameters), and `__call__()` or `run()` as the interface for the users of your project, which takes more flexible input.

With this interface, you can extend the functionality by “chaining” Chains together in a link. The output of the previous chain becomes the input keys to the next.
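To make the calling conventions concrete, here’s a toy sketch of the pattern in plain Python. This is not LangChain’s actual code, just the shape of the interface; the class and key names are my own:

```python
class Chain:
    """Toy version of the Chain interface: _call() is the core logic,
    __call__()/run() are thin wrappers that normalize inputs."""

    input_keys: list = []
    output_keys: list = []

    def _call(self, inputs: dict) -> dict:
        raise NotImplementedError

    def __call__(self, inputs: dict) -> dict:
        # the real library also validates keys and fires callbacks here
        return self._call(inputs)

    def run(self, text: str) -> str:
        # convenience wrapper for single-input, single-output chains
        out = self({self.input_keys[0]: text})
        return out[self.output_keys[0]]

class UpperChain(Chain):
    input_keys, output_keys = ["text"], ["upper"]
    def _call(self, inputs):
        return {"upper": inputs["text"].upper()}

class ExclaimChain(Chain):
    input_keys, output_keys = ["upper"], ["result"]
    def _call(self, inputs):
        return {"result": inputs["upper"] + "!"}

# "chaining": the output keys of one chain feed the input keys of the next
step1 = UpperChain()({"text": "hello"})
step2 = ExclaimChain()(step1)
print(step2["result"])  # HELLO!
```

The key observation is that a chain composes with the next one purely through dictionary keys, which is why matching `output_keys` to the next chain’s `input_keys` is all the “linking” there is.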
See examples in: https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/02-langchain-chains.ipynb
LLMChain is a special type of chain that wraps around the underlying LLM generative engine, and it’s the most commonly used Chain for direct use and extensibility. You can extend it with any special features you want, or even “chain” instances together to perform a pipeline of actions with the LLM.
The interface for LLMChain is simple. See more at source code.
```python
class LLMChain:
    ...
```
LLMChain extends the original Chain by defining:
You can either directly call it, or use it to build more specialized Chains.
```python
chain = LLMChain(llm=llm, prompt=prompt)
```
See more at the “Template” section to understand prompt Template.
You can extend the Chain to accomplish anything that takes inputs and produces an output. Think of it as a task: you can use it for text preprocessing, or even parsing.
A Chain doesn’t necessarily have to involve interacting with LLMs. It can be any task you find useful when implementing the whole task pipeline.
See examples in section “Generic Chains”:
In the example, a TransformChain does just a regex transformation to remove white spaces. You can use it together with other Chains to create a pipeline of transformation -> rewrite, using `SequentialChain` to link them together.
```python
sequential_chain = SequentialChain(
    ...
)
```
Once you understand Chains, you can build powerful pipelines of chains in LangChain (hence the name). There are chains that:
See more Chain examples on Github: https://github.com/hwchase17/langchain/tree/master/langchain/chains
I was quite baffled by the idea of prompts and templates when I was first exposed to LangChain. But the idea is actually quite simple. It’s the same as any templating: you define a template text, and interpolate it with text variables.
The most common use case of a prompt template is to create the outline of the input to the LLM, which you can customize through variables. That’s it. That’s how simple it is.

One common use case for Templates, as mentioned above, is to format the final LLM prompt. This is very useful in Agents, where you make multiple queries to the LLM and want to define the prompt with different intermediate steps at each iteration.
And let’s take a look at the concept of Agents.
One of the most powerful applications of LLMs is tool use. The Agent provides an abstraction for choosing from a toolbox to solve more open-ended and complicated questions.
According to LangChain’s official documentation:
Some applications require a flexible chain of calls to LLMs and other tools based on user input. The Agent interface provides the flexibility for such applications. An agent has access to a suite of tools, and determines which ones to use depending on the user input. Agents can use multiple tools, and use the output of one tool as the input to the next.
See more at:
```python
class Agent:
    ...
```
That’s it. That’s the interface for Agent.
The first step to understanding the Agent is to strip away the complicated tool-use features and look at the interface.

An Agent is an automatic actor that makes a “plan” based on each step of the LLM output. You can add more features to create a full-featured, complete Agent that runs actions for you: Tools for tool use, PromptTemplates to build prompts, Parsers to parse output.
To create your own LangChain Agents, you just need to worry about making the plans (e.g. handling input, creating prompts, parsing outputs, and returning results).
To illustrate the interface for Agents, I’ve created a very simple implementation of a dummy agent that executes whatever tool you define exactly 3 times.

In this example, the plan is: return an `AgentAction` 3 times with whatever tool is given, then return `AgentFinish`.
```python
class DummyAgent(BaseSingleActionAgent):
    ...
```
(See the snippet on Gist. Also, I’ve just started a small side project that hacks on Agents. See more on Github.)
A Tool is an interface that interacts with other environments. The interface is really simple too, with `run` (or the asynchronous `arun`).
Tools can be any action external to the LLM, e.g. calculators, search engines, SQL execution, document or data loaders, or anything with an API. A Tool can even be another Chain!
Its interface is also simple. Similarly, you’ll just need to define the inputs, outputs, and what to run.
```python
from langchain.tools.base import BaseTool
```
Sometimes it can be confusing, as Tools can be initialized in different ways:
```python
# initializing by setting the name, description, and a callable function
```
But the idea is the same. Remember, it’s just syntactic sugar for creating a `Tool` with a name and description, whose `_run` step calls the func.
A Tool’s function can be an API call (e.g. calculator, search, load text, …), or it can invoke other Chains. It’s flexible like that, and you can reuse Chains or even Agents as Tool functions. In this way, one `Agent` can invoke other `Agent`s.
To get an idea of how Agents come about and some of the fundamental ideas on LLM task performing, see my other blog on a list of papers I found useful in understanding LLM reasoning.
AgentExecutor is also a Chain: it has exactly the simple interface of a Chain (input, output, and the action), and its job is to wrap everything about Agents together.
There are quite a few syntactic sugars provided by the LangChain library, like `initialize_agent`. But remember, it doesn’t return an Agent, but an AgentExecutor, which has the interface of a Chain.
```python
from langchain.agents import initialize_agent
```
(Example from: https://archive.pinecone.io/learn/langchain-agents/)
The `Agent` class abstracts the most essential part of the agent behavior: how it “plans” each step based on input and intermediate results, and how it decides what actions to take or whether to finalize the execution.
The `_call()` implementation of an AgentExecutor Chain wraps it all up:
All this grunt work is implemented in AgentExecutor, so that we can ignore it and focus on the interesting part, the actual planning, like creating a ReAct Agent that performs tasks based on the tools given.
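The executor loop can be sketched in plain Python. This is my own simplification, not LangChain’s implementation; the `plan` callback and the dict-based step format are made up for illustration:

```python
def agent_executor(plan, tools, max_steps=5):
    """Toy version of the AgentExecutor loop: ask the agent to plan,
    run the chosen tool, feed the observation back, until it finishes."""
    intermediate = []   # (action, observation) pairs so far
    for _ in range(max_steps):
        step = plan(intermediate)           # the agent decides the next step
        if step["finish"]:
            return step["output"]
        tool = tools[step["tool"]]          # look up the chosen tool
        observation = tool(step["input"])   # run it
        intermediate.append((step, observation))
    return "stopped: max steps reached"

# a dummy agent that calls "echo" once, then finishes with the result
def dummy_plan(intermediate):
    if not intermediate:
        return {"finish": False, "tool": "echo", "input": "hi"}
    return {"finish": True, "output": intermediate[-1][1]}

print(agent_executor(dummy_plan, {"echo": lambda s: s + "!"}))  # hi!
```

Everything except `plan` here is the grunt work the executor owns: tool lookup, execution, bookkeeping of intermediate steps, and the step limit.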
And yes, `AgentExecutor` is a `Chain`, so it can be used with other Chains, or as a Tool for other agents.
See another example in the same Pinecone tutorial mentioned above:
```python
# initializing by setting the name, description, and a callable function
```
`llm_math` is an AgentExecutor that wraps the “llm_math” Agent; it’s a Chain whose `run()` interface invokes the Agent.
Clear enough?
LangChain already provides a rich library of Agents that can perform interesting work, like reading CSV data, managing files, calling APIs, etc.
See: https://github.com/hwchase17/langchain/tree/master/langchain/agents/agent_toolkits
Once you understand all these pieces, you can assemble everything together to make your own Agent.
There are two papers behind the implementations. I’ve also mentioned them in my previous blog:
LangChain has its own implementation, `ZeroShotAgent`.
If you think this is helpful, I’ll keep exploring and write what I found about LangChain and NLP + LLM in general. Hope this helps your understanding!
I’ve spent the last couple of months reading about developments in AI and NLP in general ever since the release of ChatGPT. Here are some of my personal findings, specifically on task-performing capabilities.
The field of AI has been advancing rapidly, and the results have exceeded expectations for many users and researchers. One particularly impressive development is the Large Language Model (LLM), which has demonstrated a remarkable ability to generate natural, human-like text. Another exciting example is ChatGPT from OpenAI, which has shown impressive task-performing and logical reasoning capabilities.
Looking ahead, I am optimistic that LLM will continue to be incredibly effective at performing more complex tasks with the help of plugins, prompt engineering, and some human input/interactions. The potential applications for LLM are vast and promising.
I’ve compiled a list of papers on extending the task-performing capabilities in this field. I’m quite enthusiastic about the longer-term potential of these capabilities to bring LLMs like ChatGPT into more powerful applications.

Here’s my first list of papers, the ones I consider more fundamental, along with my very quick summaries.
From OpenAI: one of the papers that gave ChatGPT its capability to be super useful by following human orders. Instead of simply generating copycat-like text, LLMs now have the capability to be genuinely useful.
GPT-3 generation:
```
Explain the theory of gravity to a 6 year old.
```
InstructGPT generation:
```
People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.
```
Later, OpenAI leveraged RLHF to teach LLMs to perform various tasks (summarization, QA, rewriting, generation, ideation, brainstorming, etc.), opening the door to new opportunities.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The paper introduced the power of the phrase “Let’s think about it step by step” for LLMs. While LLMs are generally good at solving small, straightforward math problems, they still struggle to reason through longer or more convoluted logic. By breaking the original problem into small pieces, LLMs achieve much better performance at each step and on the final answer.
This creates the opportunity to break user problems into smaller steps and solve more complicated problems.
https://arxiv.org/abs/2302.04761
A ground-breaking paper from Meta. Instead of following limited text-based tasks, the Toolformer paper created a dataset of external tool use and fine-tuned LLMs on it.

Now the LLM has APIs and input parameters: it can invoke APIs, perform more complicated tasks, and interact with the rest of the world, like search, calculation, posting to Twitter, or potentially anything with an API.
https://arxiv.org/abs/2205.00445
MRKL (pronounced “miracle”) is an interesting paper on leveraging external tools by generating a tool name and tool inputs as operands.

MRKL shares similar ideas with Toolformer and focuses more on generating the correct operands for the LLMs.
Users can automate a plethora of tasks by combining LLM and calling external tools.
https://arxiv.org/abs/2112.00114
This paper extends the idea of Chain-of-Thought by writing down each step of reasoning in a clearer, more verbose fashion, as if providing the LLM a piece of scratch paper for intermediate results. Unsurprisingly, this improves math-solving performance significantly.

Combined with the ReAct paper: create iterative QA loops to solve complex problems with LLMs by breaking them down into small problems.
https://arxiv.org/abs/2210.03629
The ReAct paper combines Chain-of-Thought reasoning and task-performing or text-generation actions into a single loop (REasoning + ACTions). It guides the LLM to iteratively solve a given problem by breaking it into a loop of:
```
Thought: the thought process with information from previous loops.
Action: the action/tool to take in this step.
Action Input: the input to the action.
Observation: the result from running the action.
```
With this loop, the paper achieves much higher accuracy than previous simple Chain-of-Thought efforts.
https://arxiv.org/abs/2210.03350
The paper introduces what can best be summarized as the “Self-Ask” method. At each loop of problem solving, the LLM is prompted to ask itself: “Are there any follow-up questions?” The LLM examines its own logic and decides if the solution is satisfactory. If not, it keeps iterating until a good answer is reached.

The paper also examines how best to compose all the sub-problems into a single answer and what the gaps are in achieving that.
And you can put them all together with the magical LangChain project. LangChain integrates all the above-mentioned tool-performing ideas into a single project, as can be seen from its source code and documentation:
LangChain provides abstractions for many essential LLM concepts like `Embedding`, `Vector Search`, `Chain`, and `Agents`. Combined together, these individual tools can make powerful applications that perform actions.
There are a number of good resources on LangChain out there:
As we look to the future, the potential of LLMs is limitless. I’ll keep watching for the most interesting and exciting research advances that bring more intelligence into AI systems. Stay tuned.
Using a generator might save you much time kick-starting your web app project. Most importantly, I found that a good, well-defined, consistent API definition is crucial to your development, testing, and communication among teams and customers. I highly recommend that for any sizable project, you spend some quality time writing a good API spec. It’ll become essential to your development workflow. I used to highly doubt this, and now I don’t think I can live without it.
If you manually keep documentation or API specifications in sync with your Go code, you’ll have a hard time reviewing, checking, and testing between code and specs. The best way, IMHO, is to automate the process by either generating the API code from the spec, or the other way around. Many tools support one of these approaches, and `openapi-generator` is one of the really nice tools that I’m going to introduce in this blog post.
OpenAPI Generator supports many languages on the server as well as on the client side, and it has generators for different Go frameworks. Here I’m going to use the `go-server` generator as an example. It uses the Gorilla framework for the server-side code.
For this blog post I’ve also made an example of code generation in my Github repo. I’ve generated the code and implemented only one endpoint, `/books`, with example data:
https://github.com/hxy9243/go-examples/tree/main/src/openapi
The OpenAPI format (previously named Swagger) is a way of documenting your REST API endpoints, yet it goes way beyond simple documentation. You can use it to generate pretty HTML documentation, use it for interactive debugging or automatic testing, and, in this case, for code generation.
See openAPI official documentations and tutorials: https://swagger.io/specification/.
To create a new openAPI definition, you can start with a YAML or JSON file for the data, starting with metadata information for your APIs, including title, version, etc.
An example openAPI 3.0 API definition may look like this:
```yaml
openapi: 3.0.0
```
The most important field is `paths`, where you define the endpoints for the different paths of your API, along with request parameters and response data models.
For example, the following path defines the `/users` endpoint, with the status-200 response format being an array of the “User” data model, which is defined in a separate file `user.yaml` under `components/schemas/users`. OpenAPI allows referencing other files, which makes dividing and organizing specifications a lot easier.
```yaml
/users/:
```
And the example `user.yaml` file contains the specification of the user format:
```yaml
components:
```
And you can fill in all the other endpoints, methods, and data models to define the complete API for your application.
Visual Studio Code has extensions to help you better write, lint, and format your openAPI spec, e.g.: https://marketplace.visualstudio.com/items?itemName=42Crunch.vscode-openapi.
Also see the official OpenAPI webpage https://spec.openapis.org/oas/latest.html# for the full specification to learn the finer details, including but not limited to data types, security, model formats, etc.
With a correctly defined OpenAPI specification, you can now generate server code with `openapi-generator`.
Example command to generate Go code:
```shell
openapi-generator-cli generate \
```
The code will be in the output directory `server`, and the source code folder will be `server/openapi`.
Generated Go code will contain the following files:

- `api_default_service.go`: the default service implementation, where all endpoints return an `Unimplemented` error. The service defined here might be the most important struct: it’s the one you’ll extend to fill in your own implementation.
- `api_default.go`: the default API routing and controller that wraps around the service implementation mentioned above.
- `api.go`: defines the router and servicer interfaces.
- `error.go`, `helpers.go`, `impl.go`, `logger.go`: helper files for error definitions, responses, logging, etc.
- `model_*.go`: data model struct definitions from your model definitions, e.g. the `model_user.go` file generated from the above example, with the `User` Go struct.

Voila! Now that you’ve generated the service code as a library, all you need to do is start filling in your own implementation, and then start your own HTTP server in Go.
After the code is generated, you can create your own library implementing the service by extending the `DefaultApiService`.
Your endpoints turn into functions like this, with parameters (if any) in the function parameter list, returning a response and an error.
```go
// GETUsers -
```
Although you can start editing the generated file directly (as suggested in the generated code), my personal favorite way is to inherit and override the functions with your own.
You can import the generated `openapi` library into your own implementation, and embed it in your own service with the exact same function signatures:
```go
import "github.com/xxx/xxx/server/gen/openapi"
```
The data models will be generated as structs in Go. For example, the user example we had will be generated as follows, with the correct data type mapping:
```go
type User struct {
```
The type “string” maps to a Go string, and Birthdate, with type “string” and format “date-time”, is turned into Go’s `time.Time`.
You can see more details about the supported types for this generator at: https://openapi-generator.tech/docs/generators/go-server/.
After that, you can start your own HTTP server by passing the Router as the generated HTTP handler, with the default controller and router from the generated openapi library code:
```go
service := NewMyWebAppService()
```
Overriding the default controller can also give you more flexibility if necessary, e.g. adding your own logging, tracing, or other middlewares.
The generator tool can also generate HTML pages, for your users, customers, or other teammates. e.g.:
```makefile
doc: api/*.yaml
```
It’ll give you a well-rendered HTML project that documents all your API endpoints:
One other documentation tool I found really handy for OpenAPI is the redoc tool. It can generate a very pretty HTML page for your API specifications.
Redoc-cli: https://redocly.com/docs/redoc/deployment/cli/
There are also some gotchas I bumped into while researching and using the openapi-generator:
You can reuse definitions with the OpenAPI reference symbol `$ref` and divide your definitions into different files, which saves much repetitive work and makes organization easy.
Customize the variable name of the output with `title`: you can define your own Go struct name by adding the `title` field in the data model definition.
The generated code is not immediately usable; you have to call `goimports -w` to automatically fix the formatting. Fortunately, it’s easy to do by adding a few lines to the Makefile:
```shell
for f in gen/openapi/*.go; do goimports -w $f; done
```
You can define different integer types in Go by specifying the format of an integer field, e.g. a field with type “integer” and format “uint64” will be generated as `uint64` in Go.
The following projects also support OpenAPI for web API development in other languages, with some generating code just like this one.

Or, if you want to keep your current development workflow, some projects allow generating OpenAPI specs from your own code, e.g. FastAPI, the encore project, etc.
Below are my notes from a very quick reading of its source code, hopefully providing anyone interested in data systems a quick glimpse of what’s under the hood. It’s not meant as an exhaustive, deep analysis.
BBoltDB is a local, embedded database backed by a single file `mmap`ped into memory, meaning it’s backed by only one file, and it’s meant to be built into other data systems as the storage engine. The data file is locked by the application, with only one process accessing it at a time.
There’s no write-ahead log. The storage engine uses a B+ tree for managing storage, and all operations are written to the database file.
Reading source code from v1.3.5 release:
https://github.com/etcd-io/bbolt/tree/v1.3.5
BBoltDB is a key-value store, so its data format is as simple as key-value pairs, organized into "Buckets". Buckets provide namespaces for ranges of key-value pairs, and they can be nested.
Using BBoltDB in your application is as simple as importing its library in your Golang project and opening a database. See more at BBoltDB.
```go
import (
	bolt "go.etcd.io/bbolt"
)
```
DB operations are carried out in transactions. For example, a Read-write transaction:
```go
err := db.Update(func(tx *bolt.Tx) error {
	// ...
	return nil
})
```
Read-only transactions:
```go
err := db.View(func(tx *bolt.Tx) error {
	// ...
	return nil
})
```
See more at:
https://github.com/etcd-io/bbolt#read-write-transactions
And for more features like range query, auto-incrementing:
https://github.com/etcd-io/bbolt#autoincrementing-integer-for-the-bucket
BBoltDB organizes `mmap`ped memory as pages. The first two pages save metadata for the database, like version info, configurations, etc. The third page saves the freelist of pages, which holds the addresses of memory that's not allocated yet. The rest of the pages store the actual data.
See source code from https://github.com/etcd-io/bbolt/blob/v1.3.5/db.go#L178.
When opening the database, BBoltDB carries out the following major steps:

It locks the file (`flock`) to restrict access to a single process, then `mmap`s the file into memory. BBoltDB manages the mapped memory in pages of 4KB size, and it keeps all the pages besides the meta pages in a freelist for allocation. The freelist can be of the default array type, or a map type by option.

It initializes the meta pages at the time of the first start. They contain metadata about the database, including the freelist info.
See at https://github.com/etcd-io/bbolt/blob/v1.3.5/db.go#L426.
The freelist is initialized either by loading the metadata from its page, or by scanning through all the free pages in the DB and collecting their ids.
Allocation uses `sync.Pool` for single pages, or allocates memory for longer contiguous runs of pages. All allocation is backed by a page (possibly a multiple of the page size) with a page id.
By default, the freelist of pages uses an array implementation, and multi-page allocations are guaranteed to be contiguous.
The implementation is in `arrayAllocate()`: https://github.com/etcd-io/bbolt/blob/v1.3.5/freelist.go#L107
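The contiguous-run search of the array freelist can be re-created in a few lines. This is a simplified sketch of the idea, not bbolt's exact code:

```go
package main

import "fmt"

// freelist is a simplified array-based freelist: a sorted list of free
// page ids.
type freelist struct {
	ids []uint64
}

// allocate scans for n contiguous page ids, removes them from the
// freelist, and returns the starting id (0 means no fit was found),
// mirroring the idea of bbolt's arrayAllocate.
func (f *freelist) allocate(n int) uint64 {
	if len(f.ids) == 0 {
		return 0
	}
	var initial, previd uint64
	for i, id := range f.ids {
		if previd == 0 || id-previd != 1 {
			initial = id // start of a new contiguous run
		}
		if (id-initial)+1 == uint64(n) {
			// Found a run of n contiguous pages: remove it from the list.
			start := i + 1 - n
			f.ids = append(f.ids[:start], f.ids[i+1:]...)
			return initial
		}
		previd = id
	}
	return 0
}

func main() {
	f := &freelist{ids: []uint64{3, 4, 5, 6, 9, 10}}
	fmt.Println(f.allocate(3)) // allocates the run starting at page 3
	fmt.Println(f.ids)
}
```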
Once a page is allocated, it's materialized in memory as a node, which represents a node in the B+tree.
The B+tree is the higher-level abstraction over the memory pages in BBoltDB. The database manages key spaces via the B+tree and its nodes. The nodes are initialized from memory pages from the freelist, and organized as a B+tree. Each node has its own index key, with a list of inodes (for internal nodes) that save the actual keys and values as bytes.
See https://github.com/etcd-io/bbolt/blob/v1.3.5/node.go
```go
// node represents an in-memory, deserialized page.
```
The `node.go` file contains the whole B+tree implementation, from indexing, reading/writing, spilling, splitting nodes, and rebalancing, to deletion.
Each transaction writes the node data to the actual memory backed by the page. And if there is more data than the configured size of the node, the node will spill data to a new node, recursively splitting the parents if necessary.
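The split behavior can be sketched with a toy `node` type. bbolt's real logic in `node.go` splits on byte sizes and a fill percent rather than entry counts; this count-based version only shows the shape of the idea:

```go
package main

import "fmt"

// inode holds one key/value entry inside a node.
type inode struct {
	key, value []byte
}

// node is a toy B+tree node holding a list of inodes.
type node struct {
	inodes []inode
}

// maxInodes is an arbitrary per-node capacity for this sketch.
const maxInodes = 4

// split returns the node itself and, if it overflowed, a new sibling
// holding the second half of its entries.
func (n *node) split() []*node {
	if len(n.inodes) <= maxInodes {
		return []*node{n}
	}
	mid := len(n.inodes) / 2
	sibling := &node{inodes: n.inodes[mid:]}
	n.inodes = n.inodes[:mid]
	return []*node{n, sibling}
}

func main() {
	n := &node{}
	for i := 0; i < 6; i++ {
		n.inodes = append(n.inodes, inode{key: []byte{byte('a' + i)}})
	}
	parts := n.split() // overflowed: splits into two halves
	fmt.Println(len(parts), len(parts[0].inodes), len(parts[1].inodes))
}
```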
All R/W operations in BBoltDB are managed by transactions:

Begin a transaction with `tx = db.Begin()`. Perform writes with `tx.write()`; see https://github.com/etcd-io/bbolt/blob/v1.3.5/tx.go#L514. Commit with `tx.Commit()`.

An example from the user's perspective, from the BBoltDB README:
```go
// Start a writable transaction.
tx, err := db.Begin(true)
if err != nil {
	return err
}
defer tx.Rollback()
```
BBoltDB has a capability that is simple (compared to other database systems) yet powerful, supporting operations on huge volumes of data in many well-tested online systems at many companies. Reading the source code gives a hint of how a simple B+tree-based key-value store works, though I've only scratched the surface. If you're willing to maintain or extend BBoltDB, this could be a first step toward understanding its detailed design.
Cassandra is always considered to be favoring the "AP" in the "CAP" theorem, where it guarantees eventual consistency for availability and performance. But when really necessary, you can still leverage Cassandra's built-in "Lightweight Transaction" for elections to determine a leader node in the cluster.
Basically, it works by writing to a table with your own lease:
```sql
INSERT INTO leases (name, owner) VALUES ('lease_master', 'server_1') IF NOT EXISTS;
```
The `IF NOT EXISTS` triggers Cassandra's built-in Lightweight Transaction and can be used to reach consensus among a cluster. With a default TTL in the table, this can be used for lease control, or master election. For example:
```sql
CREATE TABLE leases (
```
So that the lease owner needs to keep writing to the lease row for heartbeats.
I'm not sure about the performance characteristics of Cassandra's election behavior compared with other systems (etcd, ZooKeeper, …), and it'd be interesting to see a study. But since those are already more full-featured and better understood at keeping consensus, I'd recommend delegating this behavior to them unless you're stuck with Cassandra for your application.
One great use case of Cassandra is saving logs and timeseries data. But what if you want to automatically drop stale data without populating tombstones in Cassandra? Removing and updating data frequently may actually cause problems in Cassandra.
Cassandra team developed a very useful strategy to just handle this situation. It’s called TWCS (Time Window Compaction Strategy). And it works by grouping your timeseries data into chunks (in the same SSTable) and directly dropping them when their TTL is reached, instead of generating new tombstones. Check out this blog for use cases and details.
You can then create a table with these flags enabled:
```sql
-- creating table compacting data every day, with 7 days TTL and TWCS
```
These are some neat optimizations you can do when saving time-series data with a deadline in mind.
Interestingly enough, Cassandra can get grumpy when you try to man-handle its membership. For example, during our development and testing, we encountered an issue where the Cassandra cluster was reluctant to accept a new node when another node was already down. The logs from the node show:
```
CassandraDaemon.java:465 - Exception encountered during startup
```
It turns out that Cassandra needs to move the data consistently to the new node. And when one node is down and Cassandra cannot form a quorum for the data with one node missing, it’ll be reluctant to hand the potentially broken data to the newcomer.
Here's also an interesting blog about replacing a dead Cassandra node and all the surprises along the way. The lesson: managing Cassandra membership can be harder than you think, so it might be a good idea to read the manual.
In short, if you don’t understand Cassandra, it’ll give you surprises.
When we started building our application, we had it automatically create new tables on demand. It worked well for a while, and then we kept hitting this weird error:
```
Caused by: org.apache.cassandra.exceptions.ConfigurationException:
```
It turns out Cassandra has long had this problem of race conditions when creating Column Families (a.k.a. Cassandra's tables).
After searching the Internet, our conclusion was simple: do not attempt to dynamically create tables in a distributed system in the first place. We redesigned our application and schema, and the problem went away.
It’s not from our own experience, but I still feel like it’s worth sharing. When not careful, Cassandra’s Quorum read/write can still result in dirty data in very special cases. Due to its design, Cassandra can have some pretty complex steps to delete data!
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Rule of thumb from this experience: repair time <= gc_grace_seconds, so that repair propagates tombstones before GC cleans them up. Still, it's recommended that Cassandra clusters be constantly repaired.
https://docs.datastax.com/en/archived/cassandra/2.1/cassandra/operations/opsRepairNodesWhen.html
Here’s another interesting case for deletion in Cassandra causing headaches and surprises, due to tombstone hurting performance. It’s from Discord’s Experience:
We noticed Cassandra was running 10 second "stop-the-world" GC constantly but we had no idea why. We started digging and found a Discord channel that was taking 20 seconds to load. … To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel. If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
Basically if you are not careful, deletion in Cassandra could actually be a part of the query burden due to the tombstones. Understanding Cassandra’s behavior is essential to operation at its best performance.
Cassandra, as an open-source NoSQL database, has gained popularity in cloud and big data applications. Inspired by DynamoDB, it has good latency, tunable consistency, easily achievable scalability, and high availability with a cluster setup.
Our team's been using Cassandra as the backend for an application we've been shipping to customers. We chose it for its high-availability setup and good performance. We use it to store time-series data and some simple configuration data as key-value pairs, so it felt like a natural choice. And in our experience over time, it has proven to be highly capable at serving our purposes.
With impressive availability, scalability, and read/write performance, Cassandra also comes with its limitations. We cannot design data models the same way we did with traditional relational databases with a SQL interface. And it doesn't come with many of the guarantees of traditional databases, like strong consistency, transactions, cascading deletion, etc. Like other NoSQL databases, Cassandra was designed to optimize batch write operations with good read and write latency. It fits applications without too many update/delete operations, especially ones without high volumes of transactions.
So the best use cases for Cassandra can be:
I hope this blog is useful if you're starting off a new application or module and evaluating databases of choice. It starts with an overview of Cassandra and its architecture, then covers how to evaluate Cassandra for your project and how to design your data models, with examples. It should give you a better picture of whether and how Cassandra can fit in your project, so that you can start thinking about your application and data modeling from a high level. From there, you can go on to learn more about this database's details from the references in this blog and other resources.
With the prevalence of Machine Learning and Big Data applications, I strongly believe Cassandra can play an important role and it’s definitely worth learning about its ideas.
As an alternative, ScyllaDB could be a very neat open-source replacement for Cassandra, with a compatible CQL and driver interface. See more at: https://www.scylladb.com/ Its blog also provides some use case studies.
Cassandra has some interesting architectural design ideas to achieve its availability as well as performance. The tradeoff is its own limitations.
A Cassandra cluster consists of one or many decentralized nodes that share the client query load for scalability, with replication for high availability. A Cassandra cluster has no master node; it maintains its membership information with the Gossip Protocol.
Cassandra cluster partitions its data among nodes as a token ring. All data in Cassandra is partitioned to its nodes in the ring based on the key hash and replication configurations.
Node membership and sharding are decided by the Consistent Hashing algorithm, for load balancing and minimal data movement during membership changes. For each table, the partition key decides which replicas a row is written to. Therefore it's important to include the partition key in the design of your schema (as discussed below).
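A minimal consistent-hashing sketch shows how keys land on a token ring. Cassandra actually uses the Murmur3 partitioner and virtual nodes; FNV-1a and one token per node keep this illustration short:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy token ring: each node owns one token, and a key belongs
// to the first node whose token is clockwise from the key's hash.
type ring struct {
	tokens []uint32
	nodes  map[uint32]string
}

func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

func newRing(nodes ...string) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		t := hashKey(n)
		r.tokens = append(r.tokens, t)
		r.nodes[t] = n
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// locate finds the first token >= hash(key), wrapping around the ring.
func (r *ring) locate(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around to the first token
	}
	return r.nodes[r.tokens[i]]
}

func main() {
	r := newRing("node-a", "node-b", "node-c")
	// The same partition key always lands on the same node.
	fmt.Println(r.locate("user:42"), r.locate("user:43"))
}
```

Adding a node only moves the keys between its token and its predecessor's, which is the "minimal data movement" property mentioned above.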
Client queries can go to multiple nodes in the cluster based on your replication and query configuration. This is Cassandra’s “Tunable Consistency”: higher number of nodes for each query would sacrifice response time and availability, but maintains higher consistency. And vice versa. (See more at CAP theorem.) Tunable Consistency allows users to decide what consistent level is for data read/write. For Eventual Consistency, Cassandra responds with confirmation after writing to any one of the replicated nodes for low latency and high availability. While for Quorum Consistency, Cassandra reads/writes from a quorum of the replicated nodes, and takes the latest write as the final result (LWW Last Write Wins strategy). You can choose the consistency level based on your application needs.
Same as BigTable and DynamoDB design, Cassandra uses MemTable as in memory storage, and SSTable (Sorted Strings Table) as storage backend. MemTables are periodically flushed to disk as SSTables, which are immutable, sorted by key, and gives impressive batch read/write performance. But updating/deletion in SSTable uses new records and tombstones. It appends the new records to new SSTables instead of overwriting the existing ones. So huge amounts of update/delete operations will be inefficient in Cassandra.
So Cassandra’s architecture decides that:
For more information, I found the book “Cassandra: The Definitive Guide” a very helpful reference.
Update: though it’s a blog about ScyllaDB, it’s closely modeled after Cassandra and DynamoDB: https://docs.scylladb.com/architecture/. And I’ve found it a very helpful source as well.
To start off building applications with Cassandra, first thing to come to mind is your data schema, and whether they can fit nicely into the Cassandra paradigm.
In the book “Cassandra: The Definitive Guide”, I found a paragraph that best summarizes schema design with NoSQL databases:
By contrast, in Cassandra you don’t start with the data model; you start with the query model. Instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. Think of the most common query paths your application will use, and then create the tables that you need to support them.
Users need to keep in mind not to design their application schema as an Entity-Relation model as with traditional SQL databases, but to start by thinking of all the queries you'll need from the database.
For example, from ScyllaDB blog, here’s an example of a typical Cassandra time-series table schema design.
Cassandra requires a composite primary key for each table: one as row key (or partition key) used to locate the row data for writing as well as reading; another as sorting key (or clustering key).
```sql
-- Example from https://www.scylladb.com/2020/02/20/nauto-achieving-consistency-in-an-eventually-consistent-environment/
```
The Primary Key is used to identify and locate the rows in the Cassandra cluster, and it has two parts: the partition key and the clustering key.
The first part (`version`, `id`, `bucket`), the partition key, is used to locate the right partition in the cluster and the nodes where the data resides. It is required for writing, and for efficient reading as well.
The second part (`end_ms`, `start_ms`), the clustering key, is for sorting rows. Choosing the right clustering key gives you better performance for batch reading.
The Cassandra book includes an example of designing data schema with Cassandra.
https://www.oreilly.com/library/view/cassandra-the-definitive/9781449399764/ch04.html
Think of a hotel booking application, where you’ll need to look up what hotels are in the city, what hotels are close to site-seeing locations, book reservations, look up detailed amenities for hotel rooms, and so on.
A query-first approach to schema design starts by thinking about all the queries you'll make to the backend service:
And for reservations, you’ll probably need to find the right reservation by different types of query info, for example:
If you link all your queries based on the keys and results of each query, you’ll end up with a diagram of all the supported queries from the client workflow. And based on this diagram, we can come up with the tables necessary for our application, along with the best keys for partition and sorting.
All the queries will form a workflow diagram linking all the tables together. This can be represented as a "Chebotko Diagram" for schema design: a way to visualize the relationships between tables and queries, including the table keys and field design in the schema. With this diagram, it's much clearer how to finally implement your tables and the application workflow.
A part of the final diagram would look like this, an example from the book.
For the reservation procedure, the denormalized schema design is based on queries, and data might be replicated across different tables, contrary to the E-R model for SQL schemas:
Here’s also a fun exercise you can do with the example in Scylla’s User Stories section. Imagine a Pet Care service where each user tags their pets with remote monitor sensors (e.g. for heart rate, Geolocation, and other metrics). And you’ll need to design an application to log and query the metrics for each pet.
https://www.scylladb.com/2020/09/09/carepet-an-example-iot-use-case-for-hands-on-app-developers/
So you're working on a service for pet health monitoring. You'll need to design the backend service that saves and queries each pet's health metrics over time, for all customers and their pets. Each pet monitor gathers data and saves it in the backend database, which is Cassandra in this case. And you'll need to design the schema to fit the Cassandra model.
From the database, you'll need to provide the following queries:
Try designing the tables and see if your solution is close to the example's.
https://github.com/scylladb/care-pet/blob/master/docs/design_and_data_model.md
To summarize, Cassandra’s architectural design decides that:
It’s a powerful tool with its limitations. When used in the right scenarios, it can be a formidable weapon in your arsenal.
If you feel like Cassandra can be a good fit for your application, you can then go on to learn more about:
Ray is a new and growing distributed programming framework, with an ambitious plan to be the foundation of emerging AI/ML applications. In its own words, it aims to "provide a universal API for distributed computing". This means it needs to provide a programming interface that's flexible enough for new applications, and a backend system designed to scale for elastic computing needs with good performance. This paper (OSDI '18) explains its API and architecture design to fulfill this goal, and I've found some very interesting points.
At the programming interface level, Ray provides the "Actor Pattern". A Python function invocation or a user-defined class object can work as an "Actor" in Ray. Simply annotate the function or class with `@ray.remote`, and call with `f.remote(args)` or instantiate with `Class.remote(args)`.
Ray’s computing results are always returned as Future for asynchronous computations. In this way, Ray’s Actors can spawn more Actors, and submit the workload in parallel.
These two tools combined can be really powerful in expressing complicated distributed computations and the dependencies between them, which often form a DAG (Directed Acyclic Graph) dynamically, like the dynamic task graph the paper describes for training Reinforcement Learning models.
The programming API can also emulate computing patterns like the MapReduce design pattern. In this example, map functions are defined as Ray actors and called to get results as Futures. The reduce function can also be called remotely, gathering all the actual results from the Futures. Granted, this level of abstraction is not really equivalent to other common MapReduce frameworks. But still, it demos Ray's flexibility.
Ray allows you to save large memory objects in the cluster for Actors to access (known as the Plasma store). The location is decided by scheduler based on task and data affinity. They can be used for saving intermediate results to speed up computation.
Ray’s architecture follows a straightforward client-server model, where client is the Ray program and the client library, which communicates with servers that schedules the actual workloads and data to worker nodes.
Ray servers use a primary-follower pattern. The primary node is responsible for the global scheduler and the Global Control Store (GCS).
Each worker node has its own object store and local scheduler.
The GCS is a sharded storage for metadata (backed by Redis by default), keeping track of task and object details, including:
It provides a pub-sub infrastructure to enable efficient communications.
Ray's scheduler is unique and interesting. It takes a bottom-up approach, with a local scheduler on each node as well as a global scheduler for scalability.
Each node runs a worker that periodically reports its load back, for offloading or centralized global scheduling. Tasks are submitted bottom-up, to local schedulers, and only forwarded when the local scheduler is under heavy load.
The object store is a distributed in-memory storage. It uses immutable data, which simplifies the system design (e.g. avoids consistency issues). It keeps data in memory and supports spilling to disk under memory pressure, with an LRU policy.
It saves data lineage information (as in Spark) in the GCS, so as to tolerate failures: once a result is lost, it is re-computed from the parent data and the function.
Object store uses Apache Arrow library for serialization.
In summary, Ray provides a distributed programming framework for a diversity of tasks, with an easy programming interface and good performance. It also has a strong backend: a job scheduler and a remote memory object cache. It's everything a distributed computing framework ever needs.
Also, it's getting support from a variety of Machine Learning frameworks and integrations (e.g. scikit-learn, Spark, TensorFlow, PyTorch, hyper-parameter tuning, and future distributed applications). The Ray project itself has focused on the RayTune and RayServe projects. With Ray's flexibility, it's totally possible that it could be a "glue" framework for all other frameworks.
There's a case study from Burger King with a very interesting use case of Ray and Spark, with Ray deeply integrated with Spark to access its memory. According to the article:
So they choose MXNet as their deep learning framework, and before cooperating with us,they would allocate a separate GPU cluster dedicated for distributed MXNet trainingbut they find that such a solution is not quite efficient, since in the entire pipeline,a large portion of the total time is spent on copying data from the big data clusters to the GPU cluster.
After deploying RayOnSpark, Ray can now access Spark's memory. And with a wrapper around MXNet, Ray can combine these two procedures and run the applications in the same cluster. It has better efficiency and is easier to maintain.
This is where I see it can shine: not just as the foundation for emerging frameworks, but as the missing link between ML procedures and applications, speeding up the ML pipeline as the glue layer. In this way, it truly has lots of potential.
Here I've summarized a few valuable idioms for using Golang goroutines, from multiple references as well as my own experience. They can serve as a helpful toolbox that comes in handy for similar problems, so that you don't have to design them from scratch, which might help you avoid synchronization errors.
For more information on Golang channels, see the links in the references. They should give you a good introduction to channels and their basic behaviors.
If you find any problems, or idioms you think this summary should include, please feel free to message me and let me know. Many thanks in advance!
I suggest you read the description and implement a toy version of each of these idioms first.Building your own implementation serves as a good exercise.
This is a very common usage of goroutines: getting results back from multiple async goroutines. A child goroutine can return a channel that eventually delivers the result, akin to the Future/Promise idiom in some other languages.
Exercise: How to create an example where you spawn multiple goroutines that work asynchronously (emulated with `time.Sleep()`), and then collect and print the results in the main thread?
See my implementation here.
You can also emulate Future/Promise. Functions can return channels wrapped in Futures, so as to process and return results asynchronously.
See my implementation here.
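A minimal sketch of the future idiom (the `futureSquare` name is mine):

```go
package main

import (
	"fmt"
	"time"
)

// futureSquare starts the work in a goroutine and immediately returns a
// channel that will deliver the result: a poor man's Future.
func futureSquare(x int) <-chan int {
	out := make(chan int, 1) // buffered, so the worker never blocks on send
	go func() {
		time.Sleep(10 * time.Millisecond) // emulate slow work
		out <- x * x
	}()
	return out
}

func main() {
	// Kick off several computations concurrently...
	futures := []<-chan int{futureSquare(2), futureSquare(3), futureSquare(4)}
	// ...then block on each result only when we actually need it.
	for _, f := range futures {
		fmt.Println(<-f)
	}
}
```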
A processor can be put in a goroutine and keep receiving results until the channel is closed.
```go
for r := range myChannel {
	// process r
}
```
The idiom is for the sender to close the channel, to avoid race conditions where a sender sends on a closed channel. Sending on a closed channel will cause a panic!
Exercise: how to create an example where you use the “for-range” idiom to get all results from the worker threads?
Hint: you'll need a proper way to make sure all worker threads are done sending to the channel, and then close it properly. Consider `sync.WaitGroup`.
See my implementation here.
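One possible solution sketch, using `sync.WaitGroup` so that the channel is closed only after every sender finishes:

```go
package main

import (
	"fmt"
	"sync"
)

// collect spawns nWorkers senders on one channel; a WaitGroup tracks
// when they're all done, and a separate goroutine then closes the
// channel so the for-range below can terminate.
func collect(nWorkers int) []int {
	results := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- id * 10 // emulated work result
		}(i)
	}

	// Close the channel only after every sender has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []int
	for r := range results { // exits when results is closed
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(collect(4))
}
```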
For network requests and exec commands, Golang has a great mechanism for setting timeouts and cancellation: `context`.
See more at: https://golang.org/pkg/context/
It is actually my recommended way of handling cancellation and timeouts in requests. It has a neat and clean interface, is easier to use, and is less error-prone than home-brew solutions with channels.
For example, Golang's `os/exec.CommandContext` can be cancelled either by the user or by a timeout. The underlying implementation listens to `context.Done()`.
See: https://golang.org/src/os/exec/exec.go?s=6768:6841#L394
What if you need to implement your own request handling that takes in a context and respects the cancellation signal? You can learn from the example above in the Golang library.
My example of handling context.
You can use the `time` package to handle timeouts and periodic events. Golang's `time` package uses channels for callbacks, which makes the logic easier to read and understand.
Exercise: how to create an example that uses timer/ticker to trigger/cancel an action?
My example here.
You can use channels for notifications. One example is OS signals. Instead of installing signal handlers like some other languages, Golang's signal package returns a channel, so users can handle signals like any other channel.
Notice:
From Go documentation (https://golang.org/pkg/os/signal/#Notify)
Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate. For a channel used for notification of just one signal value, a buffer of size 1 is sufficient.
```go
// Set up channel on which to send signal notifications.
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt)
```
Exercise: how to create an example of using OS signals to cancel current work?
My example here.
Channels can be used as a message broker when you need to send messages to multiple child goroutines.
It's kind of like the reverse of handling multiple message sources: the parent can send a signal to each of the channels accepted by the child goroutines.
Exercise: how to create an example where child workers start after parent broadcasts a ready signal?
My implementation here.
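One common way to implement the ready broadcast is closing a channel, since a close releases every receiver at once (a sketch; the per-child-channel approach described above works too):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// startWorkers blocks every worker on the same `ready` channel; a
// single close() then wakes all of them at once. It returns how many
// workers observed the broadcast.
func startWorkers(n int) int32 {
	ready := make(chan struct{})
	var wg sync.WaitGroup
	var started int32

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ready // receive on a closed channel returns immediately
			atomic.AddInt32(&started, 1)
		}()
	}

	close(ready) // the broadcast: releases every waiting worker
	wg.Wait()
	return started
}

func main() {
	fmt.Println(startWorkers(3), "workers started")
}
```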
```go
// init semaphore
```
Exercise: how to use channel as a semaphore that limits parallel processing to a given rate? i.e. Limit to at most N workers processing at the same time?
My example here.
(How is this implementation better than evenly dividing the work among multiple goroutines before the work starts? e.g. dividing 16 work items among 4 goroutines, each processing 4?)
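A possible solution sketch for the semaphore exercise; it also records the peak observed concurrency to show that the limit holds:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited uses a buffered channel as a counting semaphore: at most
// `limit` goroutines hold a slot at once. It returns the peak observed
// concurrency, which can never exceed the limit.
func runLimited(jobs, limit int) int64 {
	sem := make(chan struct{}, limit)
	var running, peak int64
	var wg sync.WaitGroup

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire: blocks once `limit` slots are taken
			n := atomic.AddInt64(&running, 1)
			// Record the maximum concurrency observed so far.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // emulate work
			atomic.AddInt64(&running, -1)
			<-sem // release the slot
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrency:", runLimited(16, 3))
}
```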
Similarly, you can solve the above examples with competing consumers: start N consumers up front, all taking from the same channel.
My example here.
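A sketch of competing consumers (function names are mine): N workers all receive from the same jobs channel, and the runtime hands each job to exactly one of them.

```go
package main

import (
	"fmt"
	"sync"
)

// process fans jobs out to nWorkers competing consumers, doubles each
// job, and returns the sum of all results.
func process(jobs []int, nWorkers int) int {
	in := make(chan int)
	out := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in { // each job is consumed by exactly one worker
				out <- j * 2
			}
		}()
	}

	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in) // sender closes; workers' range loops exit
		wg.Wait()
		close(out)
	}()

	sum := 0
	for r := range out {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(process([]int{1, 2, 3, 4}, 3)) // 2+4+6+8 = 20
}
```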
First published in KDD by booking.com, the paper describes lessons from deploying Machine Learning models in their production service. It provides some intriguing insights, many of which I believe are very valuable for understanding how to apply Machine Learning in real-world scenarios.
Here are some of my takeaways.
Booking.com uses an abundance of Machine Learning models in its website service. Examples include:
From the experience in the paper, all model families can provide value in the real world. Though some of the value is hard to quantify, the multiplying effect is clear.
What struck me most from this section is the number of models deployed for a single service, and how many aspects of the user experience can be modeled, optimized, and fine-tuned with the power of Machine Learning.
Everything we see now on modern websites can be decided by a Machine Learning model behind it, which learns from our interactions to feed back into our experience.
The one-sentence takeaway: offline model performance cannot be directly translated to business value, i.e., a model that excels in the lab doesn't necessarily do well in the real world!
There are some possible explanations for this effect. For example, the saturation of a model: a model cannot drive business value to infinity after you optimize it past a threshold. Or the uncanny valley effect (which I found interesting): when you optimize a model too well, it can scare the users and bring negative value.
As the paper summarizes, the offline model performance can be entirely uncorrelated to business outcomes!
This section introduces the experience of designing Machine Learning models and the challenges.
For example, when modeling very subjective concepts, target variables are not given as ground truth; they are constructed. Therefore some setups are harder than others from a learning perspective: for some setups, the data is closer to the concept we want to model.
Another common problem is the Selection Bias issue. For example, if you gather data by the user filling in the questionnaire, the data collected is strongly biased toward those who fill in.
The paper provided a mouthful to explain the detection of Selection Bias:
Diagnosing selection bias is straightforward: consider a sample of the natural observation space (users or sessions in the dates flexibility case), we can then construct a classification problem that classifies each observation into the class of the observations for which a target variable can be computed and the class of the observations for which a target variable cannot be computed. If this classification problem is easy (in the sense that a simple algorithm performs significantly better than random), then the bias is severe and must be addressed.
The way I understand it: construct a classification problem to classify if the target variable can be computed (e.g., those who fill in the questionnaire) and those who cannot. And if the classification is easy (better than random), it means the selection process (in this case, the questionnaire filter) is not random!
Coming up with better models to unlock business values can require many iterations.
This section introduces an important observation that website response speed correlates to user conversion.
Visual inspection shows a clear trend, in which an increase of about 30% in latency costs more than 0.5% in conversion rate (a relevant cost for our business).
The paper introduces some infrastructure optimization to amortize latency (redundancy, caching, etc.) and also emphasizes how some simple models (e.g., linear models) can actually outperform more accurate but slower models.
This section provides a great experience in monitoring model output based on output distribution and how it can be an indicator of model health in production.
For example, quoted from the original paper:
This section introduces the company's experience with experimentation through Randomized Controlled Trials (RCTs). Careful experiment design and analysis give a huge boost to understanding the data collected from the website and to the development of Machine Learning models.
The large majority of the successful use cases of machine learning studied in this work have been enabled by sophisticated experiment designs, either to guide the development process or to detect their impact.
The paper introduces a couple of methods for designing experiments for Machine Learning products.
Terminology for understanding trigger-based experiment process:
Experiment design:
In this way, we limit the experiment to the subjects that are most relevant to model performance, and limit noise in the dataset.
This paper shares great experiences in understanding and tackling Machine Learning challenges in real-world services. And it provides guidance in designing and analyzing Machine Learning models in production.
https://en.wikipedia.org/wiki/The_Death_of_Expertise
“The Death of Expertise” by Tom Nichols is a timely piece on the ongoing misinformation epidemic, especially in America. Quoting Isaac Asimov:
There is a cult of ignorance in the United States, and there always has been. The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that “my ignorance is just as good as your knowledge.”
The book describes the author's view of why experts are so important in a democracy, and the relationship between expertise and the public. It goes on to decry the ongoing decay in this relationship: citizens are increasingly losing trust in experts, and experts are increasingly finding it difficult to communicate with their audience.
The author explains his own view of the many reasons behind this divide. Most significantly: our innate incapacity to think rationally. The world is complex and dramatic, yet our brains tend to think in a more intuitive, direct, and emotional fashion. We love simple facts and jump to premature conclusions. We're naturally not good at discerning our own ignorance and stupidity, a phenomenon dubbed the “Dunning-Kruger effect.” We're often over-confident in ourselves and hold dear our world views, to the point of denying facts we find inconvenient.
It has undoubtedly always been this way. But these ways of thinking have been fundamental to why we reject other opinions, and they are exacerbated by recent trends.
Higher education, for instance, is one of the author's targets. He argues that colleges are slowly developing into a commercial product more than a sanctuary for passing on knowledge and critical thinking. Colleges have become an expensive “experience” that caters to its “customers,” so much so that they avoid provoking students with uncomfortable ideas or even unsatisfactory GPAs. This creates a generation of youngsters who cannot deal with real-life situations, including accepting facts or opinions they find “offensive.”
The spread of the Internet has brought the masses not just information, but also overconfidence and arrogance. “Let Me Google That For You” has become a catchphrase for Internet users who equate hours of “research” online with years of expert training and work experience. Worse, the Internet is rife with misinformation and deliberately fake information. Conspiracy theories are so prevalent online that you can almost always find some rabbit hole of misleading or completely fake theories, and keep reinforcing them.
Modern-day journalism, in its drive to attract customers, is also increasingly biased and divisive. Journalists have always sometimes reported misleading information for lack of expertise, but commercial journalism now suffers the same fate as the Internet: it caters to the audience's worst desires, creating a feedback loop. The industry is undermining its own professionalism and turning journalism into entertainment.
Experts, too, are sometimes wrong. Experts are human and are not immune to human mistakes; there is negligence, prejudice, and conflict of interest in every industry. Experts should be responsible for their own words and actions, but the misconception that “experts should always be right, or else they can never be trusted” is detrimental to our relationship with experts as well.
The author argues that expertise is crucial in a democracy, but we're now in an epistemological crisis. We urgently need to reconcile the relationships between citizens, experts, and decision-makers.
The book was a good read, and a wake-up call to the anti-intellectualism prevalent in American society. But the author did not give a (in my opinion) satisfactory answer to how to solve the problems of “the death of expertise.” He admits that experts make mistakes and that the public should keep experts in check, but also concludes that “experts are more often right than the public.” I'm not satisfied with the notion that we should simply accept the facts passed down by experts.
As some comments on Goodreads note, the book (somewhat ironically) falls short of providing concrete, authoritative sources to confirm that some of these trends are actually happening. It lacks the intellectual rigor to back the claims with research, statistics, and convincing sources, making it more of a long rant than a careful analysis of the problems.
Also, some reviewers point out that the author may have missed some of the underlying structural problems in society, like “the corporatization of media and neoliberalism in general.” Instead, the book focuses most of its fire on the public for being increasingly partisan and prejudiced.
In all, I believe this book serves as an interesting read and a great wake-up call about these problems, with excellent anecdotes and critiques. But it does not serve as a rigorous analysis of the issues at hand, nor does it do well in suggesting what can be done about them.
The often untold story behind a mastermind of Computer Science: Dijkstra, whose name is attached to an important algorithm widely used in GPS navigation.
The blog describes a wise, hard-thinking, great mind who made unparalleled contributions both to Computer Science as a mathematical and logical discipline, and to Software Engineering, which focuses on building software and hardware components.
He is most famous for his private reports, known as “EWDs,” which he continued writing for more than forty years. They describe his views on Computer Science and Software Engineering in general, and sometimes served as reviews of others' work. One of the most influential EWDs was “Notes on Structured Programming,” which argued for programming as a serious skill that demands intellectual rigor.
In 1972, Dijkstra received the ACM Turing Award; he was recognized for:
contributions to programming as a high, intellectual challenge; for eloquent insistence and practical demonstration that programs should be composed correctly, not just debugged into correctness; for illuminating perception of problems at the foundations of program design.
He had great passion for his art, and his strong personality sometimes sparked controversy. One of the most famous episodes was his critique of the “GOTO” statement as harmful. It brought widespread, heated debate, yet Dijkstra's view finally prevailed, and his insistence made a monumental change to programming paradigms.
There are many more interesting details about his personal and academic life in the original post, too many to summarize here. For example, he had a mini-van in Austin, which he often drove to national parks with his wife, and which he named the “Touring Machine.” If you are passionate about computers and software and have a long weekend afternoon, it's worth a good read.
The quirks of the Python programming language. The use cases described in the document are usually not the recommended way of using Python, as they might trigger unexpected behaviors. They expose underlying implementation details, most of which exist for optimization, and can have counter-intuitive side effects.
No programming language, tool, or framework is perfect. If you'd like to use something fluently, you'll also need to understand its weird corner cases.
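Two illustrative quirks of this kind (my own examples, not necessarily from the linked document): small-integer caching is a CPython implementation detail, while the mutable-default-argument pitfall is core Python semantics.

```python
# CPython caches small ints in [-5, 256]; going through int("...") avoids
# the compiler's constant folding, so we observe the runtime cache itself.
a, b = int("256"), int("256")
c, d = int("257"), int("257")
print(a is b)  # True: both names point at the cached 256 object
print(c is d)  # False: 257 is outside the cache, two distinct objects

# Another classic: a mutable default argument is created once, at
# function definition time, and shared across all calls.
def append_to(item, bucket=[]):
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2]  (the "empty" default list persists)
```

Relying on `is` for integer comparison, or on a fresh default list per call, are exactly the kinds of assumptions that break in surprising ways.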
An interesting note-taking method I've recently bumped into. I've been using it with my Notion notebook.
The idea: you read and learn many small facts and concepts, but forget them quickly. The Zettelkasten method records them and connects them into much more powerful ideas and concepts, since innovation often arrives when ideas clash. Used correctly, this personal note-taking method can significantly boost your creativity.
An interesting idea from studying the spread of conspiracy theories: most followers and spreaders are drawn to conspiracy theories because they find comfort and a sense of community in them.
Fighting disinformation and conspiracy theories can be hard, but this idea could provide a way of keeping people from spreading them.
When someone feels unfit for, or even abandoned by, mainstream society, they seek comfort in outlandish ideas. The conclusion? One important aspect of fighting conspiracy theories is building community value.
Scientism is the idea that we should believe science and scientists no matter what. It is the wrong way to pursue the true spirit of science, is actually harmful, and might backfire into denial.
The paper outlines some of the Julia programming language's most important design choices, and explains how they build a bridge between user-friendliness and performance.
The paper provides a few benchmarks comparing Julia's performance with a C baseline, along with other dynamic languages like Python, MATLAB, and JavaScript. While other dynamic languages suffer great performance loss due to their dynamism, Julia competes relatively closely with the C/C++ baseline: it reaches native performance in a few cases, and most benchmarks land within 2x of C or C++, while Python can be more than 70x slower than C++.
This is significant, as it may eliminate the “prototype in a dynamic language, then reimplement in a static language for performance” cycle, saving the extra coding time usually spent to achieve efficiency, without sacrificing much performance.
Some key takeaways from this paper:
Unlike Python, Julia incorporates optional typing in its runtime, which helps check correctness at runtime as well as enabling optimizations.
Type stability is an important concept in Julia code. It means that, in a given type context, an expression always returns a value of the same type. It is the key to performant Julia code, as the compiler can then use a specialized low-level method for that type.
For a simple, type-stable summation loop like the following, the Julia compiler can emit x86 assembly almost identical to what a C/C++ compiler would produce:

```julia
function vsum(x)
    s = zero(eltype(x))     # accumulator of the element type: type-stable
    for i in eachindex(x)
        @inbounds s += x[i] # no bounds checks in the hot loop
    end
    return s
end
```
The Julia compiler also infers type information, based on user annotations as well as input types at runtime, to better drive JIT optimization.
Multiple dispatch is similar to operator overloading in other programming languages: function behavior is overloaded based on input types. The difference is that the dispatch happens on the runtime types of all of the arguments.
For example, the + function can consist of 180 underlying methods, depending on the input types. Each method declares what types it can handle, and Julia “dispatches” a call to the correct method when it's invoked.
At runtime, Julia decides at function-invocation time what the input types are, and the method is “specialized” to those types, which provides the JIT with the argument-type information needed for “devirtualization.” The dispatched method thus becomes a specialized compiled method, which enables further optimizations like inlining.
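Python has no built-in multiple dispatch, but `functools.singledispatch` gives a limited, single-argument analogue that conveys the idea of selecting an implementation by the runtime type of the input:

```python
from functools import singledispatch

# Single dispatch: the implementation is chosen by the runtime type of
# the first argument only (Julia generalizes this to all arguments).
@singledispatch
def describe(x):
    return f"some object: {x!r}"

@describe.register
def _(x: int):
    return f"an integer: {x}"

@describe.register
def _(x: list):
    return f"a list of {len(x)} items"

print(describe(42))      # an integer: 42
print(describe([1, 2]))  # a list of 2 items
print(describe("hi"))    # some object: 'hi'
```

Julia goes further than this sketch: it dispatches on every argument position and compiles a specialized native method per concrete type combination.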
The Julia compiler parses program input into a Julia AST, which is then lowered to Julia IR. This enables Julia-language-level optimizations, before translating to LLVM IR, which enables a large number of optimizations critical to Julia's performance.
Based on skimming the paper, Julia is a very interesting language, inspired by many earlier dynamic languages while providing more innovative solutions to issues that used to trouble programmers. It's definitely worth attention in areas like mathematical modeling, HPC, and AI/ML.
How the Chinese-American relationship is impacting the lives of the many “stuck in between.”
The author foresaw the dangerous impact social media would have on society, and he was right.
He also proposes that the cure cannot be purely technological: it requires fixing the vulnerabilities inside our economic, political, and social systems.
Stephen Wolfram's testimony before the Senate on A.I.-selected content: his ideas on why algorithmic bias is dangerous, and how we can address it with proper regulation, transparency, and user choice.
He basically proposed that users should have an idea of what algorithm is feeding them data, and the ability to choose. This requires open benchmarks for recommendation algorithms, and frameworks that let users choose.
What serverless computing is, why it is on the rise, and why it is useful for parallel data processing (data processing, CI/CD, compilation, ML, visualization, …, you name it).
A detailed guide to modeling your NoSQL data schemas.
Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions, with its distributed design built around replicating the database's WAL (write-ahead log).
The original design, mirroring data on EBS, was slow, unreliable, and incurred expensive network overhead.
The log is the database: write only the redo log to disk and across the network; back up disks to S3 in the background.
Partition the database volume into small fixed-size segments called PGs (Protection Groups); each PG is replicated 6 ways across 3 AZs.
PGs are implemented as storage nodes with EC2 VMs, and attached SSDs.
Partitioning into small segments also helps reduce the MTTR (Mean Time to Repair), which reduces the probability that overlapping failures cost the quorum.
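The quorum arithmetic behind this layout can be sketched as follows (the 4-of-6 write quorum and 3-of-6 read quorum are from the Aurora paper, not from the notes above):

```python
# Quorum sanity check: with V copies, write quorum Vw and read quorum Vr
# must satisfy two rules: every read overlaps the latest write
# (Vr + Vw > V), and two writes cannot both succeed on disjoint copies
# (2 * Vw > V), which prevents divergent committed writes.
def quorum_ok(v: int, vw: int, vr: int) -> bool:
    return vr + vw > v and 2 * vw > v

print(quorum_ok(6, 4, 3))  # True: Aurora's 4-of-6 writes, 3-of-6 reads
print(quorum_ok(6, 3, 3))  # False: 3-of-6 writes would allow divergence

# Losing a whole AZ (2 copies) leaves 4: the write quorum is still
# reachable; losing an AZ plus one more node leaves 3: reads (and hence
# repair) still work, which is Aurora's "AZ+1" fault-tolerance goal.
```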
Problem: how to implement consistency on logs, without expensive 2PC, and how to handle recovery process.
Terminology:
Writes:
Commits:
Reads:
Replicas:
Recovery:
Cassandra is an exemplary implementation of a NoSQL database, and has gained popularity in various web, big data, and ML applications. Recently I stumbled upon a good summary of a Cassandra handbook, which includes a decent introduction to its data-modeling techniques, which can in turn be used in other NoSQL databases.
Here are my notes and summaries:
There are a great many ways Cassandra differs from a traditional RDBMS: Cassandra is a wide-column database with BASE (eventual consistency) guarantees and looser relationships between tables. Therefore one needs to model data very differently from a traditional RDBMS for the application to run efficiently.
Namely, NoSQL has the following differences:
(?) What are the major differences between NoSQL and SQL data modeling?
A Cassandra table uses a composite key as its primary key: a partition key (K) and a clustering key (C).
The primary key is crucially important in Cassandra data modeling, as:
(?) What is Cassandra key and why is it important?
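A toy sketch of why the two parts of the key matter (illustrative Python, not Cassandra's implementation): the partition key decides which partition (and hence node) a row lands on, while the clustering key keeps rows sorted within that partition.

```python
from bisect import insort
from collections import defaultdict

# Toy model of a Cassandra-style wide row: the partition key groups
# rows into one partition, the clustering key sorts rows within it.
table = defaultdict(list)   # partition key -> rows sorted by clustering key

def insert(partition_key, clustering_key, value):
    insort(table[partition_key], (clustering_key, value))

# e.g. sensor readings partitioned by sensor id, clustered by timestamp
insert("sensor-1", 1002, "20.1C")
insert("sensor-1", 1000, "19.8C")
insert("sensor-2", 1001, "21.4C")

# A query must name the partition; the clustering order comes for free.
print(table["sensor-1"])  # [(1000, '19.8C'), (1002, '20.1C')]
```

This is why "query-first" design matters: efficient queries must name a partition key, and range scans are only cheap along the clustering order.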
Logical data modeling starts with the overall application workflow, known as “query-first design.”
Entity-Relationship (E-R) diagrams are most often used for SQL data modeling, but they are also helpful for thinking through the entities and relationships involved in NoSQL modeling.
Iterate between Application Query Workflow and E-R Diagram.
A Chebotko diagram is a good tool to model the queries and tables required by a NoSQL application.
(?) In NoSQL data modeling, what is a Chebotko diagram and how does it help with modeling?
Some items to consider when creating the tables:
Humans tend to think and live in Mediocristan, where quantities tend to follow a normal distribution, and that is indeed how most things behave: human height, weight, and so on.
Black Swan incidents are ones that people can barely predict and sometimes grossly overlook. Examples include the 9/11 attacks, the 2008 stock market crash, etc.
But many other quantities are best described by a power-law distribution, and that realm is referred to as Extremistan, where extreme cases dominate. Human wealth is one example.
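The contrast is easy to see in a quick simulation (illustrative numbers only): in a normal, Mediocristan-style sample no single observation dominates, while in a power-law, Extremistan-style sample the largest observation can dwarf the average.

```python
import random

random.seed(1)
n = 100_000

# Mediocristan: a normal distribution, e.g. human height in cm.
heights = [random.gauss(170, 10) for _ in range(n)]
# Extremistan: a power-law (Pareto) distribution, e.g. wealth (toy units).
wealth = [random.paretovariate(1.5) for _ in range(n)]

ratio_h = max(heights) / (sum(heights) / n)
ratio_w = max(wealth) / (sum(wealth) / n)
print(f"tallest vs. average height: {ratio_h:.1f}x")  # a small multiple
print(f"richest vs. average wealth: {ratio_w:.1f}x")  # dwarfs the mean
```

In Mediocristan, no single sample changes the average much; in Extremistan, one outlier can dominate the whole sum, which is exactly the terrain where Black Swans live.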
It's human nature to draw conclusions, find correlations, and assume everything is close to what we have observed and that extreme cases are extremely unlikely. That is the basic recipe for Black Swan incidents.
Think of a turkey well-fed by its owner. It quickly concludes that the owner is a friend, until the day before Thanksgiving. The author advises in the book: don’t be a turkey.
The author discussed a few cognitive biases we’re vulnerable to:
Human beings are particularly bad at making predictions. One phenomenon is the more information we have, the more confident we are, but not more accurate. It’s called “toxicity of information,” where noise is mistaken for signal.
The author argues that human technological advances are particularly unpredictable: “if you expect to expect something tomorrow, you should expect it today.” This is especially true of new technologies: if we understood a new technology in enough detail to predict it, we would already know how it works and would have it today.
In the book, the author slams the so-called economists, social scientists, and the like, who build complicated mathematical models and beautiful charts to “forecast” economic trends, the stock market, etc., without taking into account the role chance plays in the outcomes. It makes them utterly vulnerable to Black Swans.
The author points out, however, that we should not try to predict Black Swans. Instead, build robustness against negative Black Swans, and shoot for positive Black Swans.
In the final part of the book, the author argues that the foundation of Black Swans is the power-law distribution. It appears everywhere in the world: economies, companies, national power, wherever winners take all. This has several implications:
Many book reviews have already gone through what they dislike about the author’s arrogant tone in this book, dismissing all social science as pseudo-science. Also, the author loved to paint himself as the lone wise oracle shunned by ordinary people, but that’s not the truth: many people have similar or close ideas of impending dangers and what we should do about them.
Nevertheless, the ideas in the book are still worth a read and close attention, especially in a fast-changing world as it is today.
One of the best examples might be the coronavirus that's sweeping across the world right now, as I sit in my own house, unable to visit the restaurants and coffee shops I love. In retrospect, when the news first broke, I never expected it could have such a drastic impact. Many people, myself included, along with popular news anchors, technologists, the president of the US, and so many more on social media, regarded the virus as “something just like the flu” that was “just going to go away when the season passes.” Media today love to dig up those old comments (especially from people with a different political agenda) and use them to mock how ignorant and short-sighted their authors were, even though the media are not so innocent themselves. I see this as a common flaw in human prediction, just as the book describes: as humans, we're particularly bad at predictions.
There are also voices pointing out that it didn't need to be a Black Swan. Nassim Taleb, the author of this book, stated in a recent interview that the coronavirus shouldn't have been a Black Swan to governments, medical professionals, and epidemiologists who have dealt with situations like this before. He was not alone: Bill Gates once warned us about the dangers of a pending pandemic. We didn't take the advice seriously, and the pandemic still broke out as a Black Swan to the rest of us.
Now, instead of engaging in bitter political bickering, it would be wiser for all of humanity to learn from this lesson and work together to make the next Black Swan grayer.
This book is the author's summary of his research and experience of studying. He argues that there is a reliable way to learn and improve yourself: intensive training and exercise. Like training muscles, you can adopt an extraordinary, unorthodox training plan for your brain and pick up a new skill in a short amount of time, be it a foreign language, programming, sketching, or even public speaking. He calls it “ultralearning.” For the book, he researched many references and interviewed like-minded friends who had similar experiences of acquiring or improving a skill intensively, and he summarizes the essential principles as a guide to a successful “ultralearning” project.
I finished this book in less than a week, and it was a pretty fun read. It includes many anecdotes from the author's friends and from high-achieving intellectual celebrities (Feynman, Ramanujan, Van Gogh, etc.). It also works as a practical guidebook for your own learning projects. Although the book is named “ultralearning,” it provides principles and tricks for learning in general, in or outside of school. In many places it resonated with me as a student.
If I had to pick bones: as a guidebook to learning, the book feels a little verbose with its stories, and as research on the psychology of learning, many of the stories don't feel formal or convincing enough. But in all, it was a fun read for the guidelines it provides. I'd recommend it to anyone who believes learning is an essential part of their life.
Presentation: https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems
Zookeeper's data model is much like a Unix tree-structured file system. Every node, called a znode, has a key name and a value, and may have children (except for ephemeral nodes).
Each znode contains metadata like timestamps and a data version number.
Nodes may be regular nodes or ephemeral nodes, which clients keep alive by sending heartbeats to the server, and which the server removes after a timeout. Handy for keeping membership information.
Provides a basic client API: create, get, set, delete, getChildren, and sync, the last for clients to read the most up-to-date information.
Keeps two consistency guarantees: linearizable writes, and FIFO ordering of each client's requests.
Writes are asynchronously linearizable, meaning client requests are non-blocking (or wait-free) but are processed in a serialized fashion. Zookeeper processes writes at the leader, while reads are served by all nodes for better scalability and performance, so reads are not strongly consistent. Zookeeper provides a sync() API for clients that need to read up-to-date data.
With Zookeeper's consistency model in mind, we can build powerful primitives on top of it for key cluster-configuration management: watch primitives let clients watch for value changes, and the SEQUENTIAL flag provides unique name assignment.
The request processor in Zookeeper is idempotent.
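As a sketch of how the SEQUENTIAL flag enables a leader-election recipe, here is a purely simulated version in Python (no real Zookeeper client; the names and helper are made up): each client creates a sequential znode, the lowest sequence number becomes leader, and each other client watches only its immediate predecessor to avoid a herd effect.

```python
# Simulated znode namespace (name -> data). A real system would use a
# Zookeeper client library; here we only mimic the SEQUENTIAL counter.
znodes = {}
counter = 0

def create_sequential(prefix, data):
    """Mimic create(..., SEQUENTIAL): the server appends a monotonically
    increasing, zero-padded counter to the requested name."""
    global counter
    name = f"{prefix}{counter:010d}"
    counter += 1
    znodes[name] = data
    return name

# Three clients race to create their election znodes.
for client in ("a", "b", "c"):
    create_sequential("/election/guid-n_", client)

# The client owning the znode with the smallest sequence number leads;
# every other client watches only the znode right before its own, so a
# leader's death wakes exactly one successor (no thundering herd).
ordering = sorted(znodes)
leader = znodes[ordering[0]]
watches = {znodes[name]: ordering[i - 1]
           for i, name in enumerate(ordering) if i > 0}
print("leader:", leader)
print("watches:", watches)
```

In a real deployment the election znodes would also be ephemeral, so a crashed leader's node disappears automatically and its watcher takes over.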
All write requests are processed as transactions: each either generates a new version number for the data when the request's version number matches, or generates an error when it doesn't.
All servers process reads, while writes are forwarded to the leader. Zookeeper uses a protocol named Zab to keep consensus among the cluster. Like Paxos, it requires a quorum to reach consensus.
Zookeeper replicates data in all followers’ database. It takes snapshots to compact data.
Zookeeper snapshots are called fuzzy snapshots, as a snapshot is not necessarily a valid point-in-time state of the Zookeeper data tree. During recovery, though, the data can be reconstructed from a fuzzy snapshot plus the operation logs.
Clients read from any server for performance, while writes are transactions. To read the latest data, a client syncs before reading.
Followers process a sync by appending it to the queue of pending writes to the leader. If there are no new writes ahead of the sync, the leader issues a dummy transaction, which also serves to guarantee that the leader is still the leader.
When logic and proportion
Have fallen sloppy dead
And the White Knight is talking backwards
And the Red Queen’s off with her head
Remember what the dormouse said
Feed your head
Feed your head
I recently came across this book on how the 1960s counter-culture and anti-war movements entangled with the personal computer movement in California. Much is known about how the pirates of Silicon Valley, Bill Gates and Steve Jobs, built personal-computing enterprises, but this book records some fascinating details of the stories before their age, and of the people who inspired the generation of Gates and Jobs by first putting forward the extraordinary idea of personal computing.
The story dates back to 1945, when Doug Engelbart started musing on a device that could extend the human mind, inspired by the “Memex,” a device conceived by Vannevar Bush: a machine that could track and retrieve vast volumes of information.
After school, a year of teaching, and several failed attempts to find a job where he could pursue his digital-computer dream, he landed at the Stanford Research Institute, where he began his research into digital computer systems.
At the same time in California, Myron Stolaroff first came in touch with the power of LSD, and later devoted his entire life to researching and promoting it. LSD was popular among the engineers described in the book; many, including Engelbart himself, used it as a mind-expanding tool.
In 1959, a young man named Fred Moore arrived on the Berkeley campus. With some radically progressive ideas in mind, he quickly rose to fame in the anti-ROTC student protests and became one of the leaders of the student movements of the 1960s and 1970s.
Three major threads led to the birth of personal computing. Engelbart had a vision of creating an augmenting device with the power of machines. Stolaroff was experimenting with a substance that could expand human creativity as well as human spirituality. And Fred Moore set out on a crusade to spread freedom and peace. All three contributed to the creation of personal computing.
With funding from the military, Engelbart continued his endeavor toward Intelligence Augmentation.
On Dec 9, 1968, at the annual Fall Joint Computer Conference, Engelbart introduced his system, working on a terminal remotely connected back to SRI. Dubbed “the mother of all demos,” it was the first time Engelbart and his team demonstrated to the world the power of computers to empower humans, and it inspired a generation of young engineers to join his team or pursue similar goals.
The book also introduces many interesting and important figures who influenced that age, e.g.
With visionary, persistent figures like Engelbart, genius engineers like Bill English, and student-movement activists like Fred Moore, 1960s-1970s America, and California especially, saw the engineer stereotype shift from the uptight traditionalist to the LSD-sipping hippie who valued freedom and liberal ideas most and pursued personal empowerment and individualism. The engineers in this story were influenced by the radical Californian shifts in ideology and activism, as well as by the MIT hacker spirit. These people were not just geniuses but individualists who believed personal computers were the key to individual empowerment. And maybe that, in turn, pushed forward the development of the most personally empowering device of the last century: the personal computer.
Though their efforts and visions were not immediately celebrated in their time, their influence, from SRI to Xerox PARC, was felt throughout the world when the young Steve Jobs and Steve Wozniak started from the Homebrew Computer Club and brought research ideas like the GUI, the mouse, and personal computing to the whole world.
In all, it is a very interesting book, worth a read if you're interested in computing's development in that age and the tremendous stories behind how personal computing came into being.
“Data and Goliath” is an excellent book a friend recommended. It's a summary of all the dangerous and negative ways data, and “Big Data” technology, can shape our societies. The author, Bruce Schneier, is a prominent cryptography expert who has published impactful work on cryptography and privacy issues. He is also on the board of directors of the Electronic Frontier Foundation.
The book provides an abundance of cases and examples of big-data misuse, along with the author's careful, in-depth analysis of the different impacts data has on our societies, and pragmatic recommendations to the different parts of society on solving the “Big Data” problem.
The book mostly discusses how governments and corporations can abuse data to profit from, surveil, or control citizens at the cost of our privacy, freedom, and even democracy. Without proper protection, regulation, and activism, we are unknowingly giving up our rights to our data.
Governments can abuse Big Data, corrupting our political liberty and justice systems: mass surveillance of citizens yields data that can be leveraged to accuse dissidents and silence political opponents. Government censorship can thwart free thinking and social progress, and make way for an oppressive regime.
The author revisits an interesting thought experiment, originally from the English philosopher Jeremy Bentham: the panopticon, a prison where all inmates can be watched by the guard at any time, even when the guard is not actively watching them. In such a system, inmates become far more compliant out of constant fear of criticism, judgment, and punishment. A society becomes a panopticon under mass surveillance and censorship.
Other examples include the political witch-hunts of the 1950s led by Senator Joseph McCarthy, and the harassment Dr. Martin Luther King Jr. received from then FBI director J. Edgar Hoover. The book describes the chilling effect surveillance and abuse of power can have on political movements.
From a commercial perspective, misuse of Big Data can have dangerous effects on society as well. Surveillance-based discrimination essentially revives “redlining” for the internet age, where discrimination can be far more pervasive, intrusive, and effective, and thus more damaging. Data collected by large corporations can be used for massive online manipulation. A good example is how Facebook can nudge its users to vote, shifting turnout by around 0.4%. Imagine if it displayed that nudge selectively.
(The book was finished around 2015, before the Cambridge Analytica scandal, proving the author's foresight.)
Finally, the book stresses the importance of software and network security and of privacy to our society, and analyzes why privacy doesn't contradict governments' role of keeping society secure, or corporations' role of leveraging data for profit. It closes with pragmatic recommendations for solving the “Big Data” mess, addressed to governments, corporations, and the rest of us. In all, it was a good read.
How to model time-series data with Cassandra.
The best way to understand something is to build one yourself. This tutorial covers basic network programming in Go, struct design, and the usage of the reflect package.
A great experience-sharing blog post on debugging a performance issue in Uber's services. With profiling and analysis tools, the Uber team pinpointed the issue to worker-pool and goroutine stack allocation, then forked the Go compiler to prove it was a regression in the compiler. A very nice read and analysis process.
A programming book on topics in distributed computation, based on teaching experience from a distributed-systems course at Northeastern University.
A very nice engineering blog from 2014: an excellent overview of Spotify's culture and an introduction to how it builds “agile” teams.
The NYTimes has released its in-house course for teaching journalists data science. Journalism can also benefit from a little coding and data-analytics skill.
If Go is one of your favorite languages as well, this is a must-read: it introduces all the basic tooling that comes with Go's ecosystem, which might greatly save you time.
A thread from HackerNews, discussing the importance of formal verification for distributed systems.
TLA+ and formal verification are notorious for their complexity and steep learning curve. This might be one of my long-term goals.
What it takes to be a software architect, a great blog post from InfoQ.
TIL that it is possible to convert your C/C++ assembly into Go’s assembly and call it from Go code. InfluxData leverages this tooling to embed AVX/SSE instructions into Go assembly, boosting Go code’s performance, sometimes by orders of magnitude.
More information on this tool, c2goasm, work from Minio.
I think so, too. But it’ll require a community and proper tooling to see it really prosper. Hope to see that some day.
A great piece from Ray Dalio, the founder of the investment firm Bridgewater and a seasoned investor, who discusses in his recent long post why American capitalism is sick at distributing resources, especially educational resources, and needs to be reformed to stay healthy.
Kafka, which is at once a message queue, a pub-sub system, an event-sourcing tool, and a stream-processing infrastructure, is a key part of many distributed systems that require streaming data. Its underlying idea is to aggregate data from distributed sources into one unifying, linear log structure.
The blog post is from Kafka’s creator Jay Kreps, written when he was at LinkedIn, contemplating the log abstraction as a key part of any distributed system. This is not Kafka’s design paper, implementation guide, or a tutorial, but rather the process of brewing the idea that led to its birth, and I found it equally interesting. The following are my notes.
The link to Kafka paper: https://www.semanticscholar.org/paper/Kafka-%3A-a-Distributed-Messaging-System-for-Log-Kreps/9f948448e7a5f0cc94cd53656410face8b31b18a
The log is the simplest storage abstraction, similar to what we see in application logs: records are appended to the end of a log data structure, and reads proceed left-to-right. This simple abstraction is powerful, in that:
The log-centric approach arises from a simple observation that the author named the “State Machine Replication Principle”:
If two identical, deterministic processes begin in the same state and getthe same inputs in the same order, they will produce the same output and endin the same state.
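The principle is easy to demonstrate in a few lines. Below is a toy sketch (the op type and counter state are illustrative, not from the post): two replicas that apply the same deterministic operations in the same order always end in the same state.

```go
package main

import "fmt"

// op is a deterministic state transition: here, an integer delta
// applied to a counter. Any deterministic function works.
type op int

// replica applies ops from a shared log, in order, to local state.
type replica struct{ state int }

func (r *replica) apply(o op) { r.state += int(o) }

func main() {
	log := []op{5, -2, 7, 1} // the shared, ordered input log

	var a, b replica
	for _, o := range log { // identical inputs, identical order...
		a.apply(o)
		b.apply(o)
	}
	// ...therefore identical final states.
	fmt.Println(a.state == b.state, a.state) // true 11
}
```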
And there are two major ways of leveraging logs in distributed processing and replication:
Make all of an organization’s data easily available in all its storage and processingsystems.
An organization may have multiple data inputs that gather events and data from many places, and different consumers to digest that data. A log structure can serve as a buffer as well as a central pipeline for all the different producers and consumers. In this way, the log serves as an asynchronous messaging system. All producers and consumers can read buffered data from the log at their own pace; e.g. a real-time system may need to read instantly, while an analytics platform may read only hourly or even daily.
Also, in a system with M inputs and N outputs, you’d need M × N pipelines to make sure each consumer can read from all data producers. But with a single unified data pipeline, every producer and consumer can write to and read from one single log. That’s the idea behind Kafka.
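The idea fits in a small sketch. Below is a toy in-memory log, not Kafka's implementation: producers append to one shared log, and each consumer tracks its own offset and reads at its own pace.

```go
package main

import "fmt"

// logT is a minimal append-only log: producers append records, and
// each consumer reads from its own offset, at its own pace.
type logT struct{ records []string }

func (l *logT) append(rec string) { l.records = append(l.records, rec) }

// read returns records from offset onward, plus the consumer's new offset.
func (l *logT) read(offset int) ([]string, int) {
	return l.records[offset:], len(l.records)
}

func main() {
	var l logT
	// Two producers write to the single shared log.
	l.append("click:home")
	l.append("click:cart")

	// A real-time consumer reads immediately...
	fast, fastOff := l.read(0)
	fmt.Println(len(fast), fastOff) // 2 2

	// ...while a batch consumer reads later and still sees everything.
	l.append("click:checkout")
	slow, _ := l.read(0)
	fmt.Println(len(slow)) // 3
}
```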
Kafka’s log structure also enables high-performance optimizations, e.g.:
Computing derived data streams.
Logs also make real-time stream processing easier. They enable real-time data collection from events or different data inputs, at different speeds, that consumers can read from at scale.
Logs also enable more complicated data flows, e.g. when the output of one log in a stream-processing system becomes the input of another, constructing complicated dataflow graphs. And the log has benefits:
Practical systems can be simplified with a log-centric design.
Because the log enables high performance and easy integration of data producers and consumers, distributed systems are more likely to move away from monolithic relational databases and toward more diverse data sources and consumers. Building distributed systems would feel more like playing with Lego bricks of open-source data components.
And a log system can play the following roles in a system architecture.
The author built these powerful ideas of the log into Kafka, one of the most influential data-streaming platforms. This long blog post might bring some insights on incorporating Kafka into a distributed system, as well as provide insight into building new system infrastructures.
Take a look at what Mr. Gates thinks are the greatest technology breakthroughs right now. The list might surprise you.
How Netflix leverages AWS technologies to build a world-scale, highly available, fault-tolerant distributed video streaming system.
Lyft architecture evolution on AWS.
From Farnam Street – an interesting blog site I found recently.
Also on Farnam Street and its “mental models”: The Mental Model Fallacy. TL;DR: the so-called “mental models” from Farnam Street are not of much value when they come from non-practitioners. And to learn business, like basketball or swimming, you’ll need to actually practice to pick up the intricate knowledge that is not easily translated into writing.
Unfortunately I didn’t have time to finish reading this paper. But it’s good to learn the concept of branchless algorithms, which keep the CPU pipeline full and achieve amazing performance.
Presentation: https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center
Mesos is cluster resource-management software from UC Berkeley. Unlike many frameworks that already existed, Mesos is designed to support heterogeneous frameworks (Hadoop, MPI, etc.) in the same cluster and share resources between them, by providing a thin layer that makes resource offers to the framework schedulers and delegates the scheduling decisions to the frameworks themselves.
With this design, Mesos can achieve pretty good elasticity between frameworks, and letting frameworks choose their own resources results in better data locality.
A full cycle of resource offer works as follows:
The Mesos core is designed to be small, and per the paper it could scale to 50,000 nodes under emulated load.
A team from Penn State University and Purdue published their latest study on concurrency bugs found in large open-source Golang projects on GitHub: Docker and Kubernetes (two datacenter container systems), etcd (a distributed key-value store), gRPC (an RPC library), plus CockroachDB and BoltDB. The authors searched each repository’s commit history to understand concurrency bug fixes for categorization and study.
TL;DR:
Abstract: the authors analyzed 171 bugs in the 6 aforementioned open-source Go projects for a systematic study of Go concurrency bugs, providing a better understanding of Go bugs and of concurrency bug detection tools.
The authors categorized the bugs into blocking and non-blocking bugs. Blocking bugs are misuses of synchronization primitives that cause the program, or a subset of its goroutines, to hang. Non-blocking bugs happen when shared memory is left unprotected, causing data races, or on erroneous message passing, e.g. when goroutines don’t quit properly, causing resource leaks.
The paper further divided blocking bugs into traditional shared-memory bugs and bugs caused by misuse of message passing or of messaging-related libraries.
This led to an interesting observation from this paper: contrary to common belief, message passing is potentially more likely to cause blocking bugs than shared memory.
An example of a blocking bug related to message passing, with its fix, similar to one I had run into before:
```go
// goroutine 1

// goroutine 2
```
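The original snippet was mostly lost in extraction, but the pattern the paper describes can be sketched as follows (a hypothetical reconstruction, not the paper's exact code): a child goroutine sends on an unbuffered channel after the parent has taken the timeout path, so the send blocks forever; the typical fix is a buffered channel.

```go
package main

import (
	"fmt"
	"time"
)

// buggy leaks its child goroutine: if the timeout fires first,
// nobody ever receives on ch, and the send blocks forever.
func buggy() string {
	ch := make(chan string) // unbuffered
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated slow work
		ch <- "result"                    // blocks forever on the timeout path
	}()
	select {
	case r := <-ch:
		return r
	case <-time.After(1 * time.Millisecond):
		return "timeout"
	}
}

// fixed uses a buffered channel, so the send always completes and
// the child goroutine can exit even when the caller has moved on.
func fixed() string {
	ch := make(chan string, 1) // the one-slot buffer is the whole fix
	go func() {
		time.Sleep(10 * time.Millisecond)
		ch <- "result"
	}()
	select {
	case r := <-ch:
		return r
	case <-time.After(1 * time.Millisecond):
		return "timeout"
	}
}

func main() {
	// Both return "timeout" here, but only buggy leaks its goroutine.
	fmt.Println(buggy(), fixed())
}
```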
An example in the paper of a blocking bug related to a messaging library is the Pipe library.
The paper also noticed that for blocking bugs (shared memory as well as message passing) there’s a high correlation between the bugs and their fixes, indicating high potential for automated tools that help fix such bugs.
For non-blocking bugs, the paper also divided them into traditional bugs (e.g. unprotected shared memory causing data races), misuse of channels, and shared data in special libraries.
An interesting example related to non-blocking bug caused by message passing, mentioned in the paper:
```go
// when multiple goroutines execute the following code, default
```
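Only the first comment of that example survives above; a plausible sketch of the pattern it describes (hypothetical code, not the paper's exact example) is a select with a default branch that silently drops a notification when no receiver is ready:

```go
package main

import "fmt"

// notify tries to signal on ch without blocking. When multiple
// goroutines (or repeated calls) run this and no receiver is ready,
// the default branch silently drops the notification: a lost signal,
// the non-blocking message-passing bug pattern described above.
func notify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default: // no receiver ready and buffer full: signal is lost
		return false
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(notify(ch)) // true: the buffer has room
	fmt.Println(notify(ch)) // false: buffer full, signal dropped
}
```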
As an example of a non-blocking bug related to a library, the paper mentioned the context library, whose context object type is designed to be accessed by multiple goroutines. Accessing the string type in the context library could potentially lead to data races.
The paper observes that some traditional data race detectors cannot detect all bug types, calling for future research on this topic.
More discussions from HackerNews: https://news.ycombinator.com/item?id=19280927
About: Borg is Google’s large-scale cluster workload scheduling and management system, which handles most of Google’s service and batch-job workloads on clusters of thousands of machines. It hides the burdens of cluster management from users and provides high-availability features that handle failures.
The now very famous and popular open-source container orchestration tool Kubernetes is an open-source successor to Borg, and keeps borrowing ideas from it (see kubernetes).
There are heterogeneous workloads on the cluster, which can mainly be categorized as:
A cell is a collection of machines in a datacenter. A cluster hosts one large cell or several smaller cells for testing.
A job is made of one or more tasks. Tasks can:
Users operate on jobs via RPCs to Borg.
Every job has a priority, and the scheduler schedules jobs ranked by priority.
Quota is assigned to or purchased by the user. It’s defined as a set of resources at a certain priority. Quota is enforced by admission control: if a job/user is over quota, the job is immediately rejected.
Borg names and monitors tasks with:
Borg master records all the job status and manages state machines to all the objects in the system (machines, tasks, allocs, etc). And the data is saved in a Paxos-enabled Chubby store.
Borglet is a local Borg agent that resides on every machine in a cell, which manages tasks on a single machine, and sends heartbeats to the master.
The Borgmaster records jobs to the Paxos store and a pending queue, which is picked up by the scheduler. For scoring, the scheduler uses either an algorithm called “E-PVM” (sometimes called “worst fit”) or an algorithm that packs tasks onto a minimal number of machines (sometimes called “best fit”).
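The difference between the two scoring policies can be sketched side by side (a toy model, not Borg's actual scoring functions): worst fit spreads tasks across the emptiest machines, while best fit packs them onto the fullest machine that still has room.

```go
package main

import "fmt"

// pick returns the index of the machine where a task needing `need`
// resource units would be placed, or -1 if it fits nowhere.
// worst=true  -> "worst fit": prefer the machine with the most free space.
// worst=false -> "best fit":  prefer the tightest machine that still fits.
func pick(free []int, need int, worst bool) int {
	best := -1
	for i, f := range free {
		if f < need {
			continue // doesn't fit on this machine
		}
		if best == -1 ||
			(worst && f > free[best]) || // most headroom wins
			(!worst && f < free[best]) { // least headroom wins
			best = i
		}
	}
	return best
}

func main() {
	free := []int{10, 4, 7} // free units per machine
	fmt.Println(pick(free, 3, true))  // 0: spread onto the emptiest machine
	fmt.Println(pick(free, 3, false)) // 1: pack onto the tightest fit
}
```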
Borg uses the following techniques for scalability:
Failures are normal, and applications running on Borg are expected to handle them; tasks are automatically rescheduled when evicted due to failure, eviction, preemption, etc.
Borg serves as an important example for the design of other large-scale distributed scheduling systems, balancing the challenges of functionality, scalability, availability, and high utilization of cluster resources.
The root cause of this vexing issue is the combined use of mutex locks and blocking channels. In Golang, channels are also often used as a powerful synchronization mechanism. They’re often used to protect the inner state of a structure, or to distribute workloads, making sure different actions are not taken at the same time.
```go
go func() {
```
By using a big select statement as a mux for all incoming read and write requests, channels protect shared state just like mutexes, and sometimes with more flexibility (e.g. when you include a timer or ticker in the code). However, it can be dangerous when people don’t realize that, as a synchronization mechanism, channels are just as prone to misuse, especially when mixed with mutexes.
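A minimal sketch of this mux pattern (illustrative code, not from the original post): one goroutine owns the state and serializes all access through a select loop.

```go
package main

import "fmt"

// counter protects its state with a single goroutine that muxes all
// reads and writes over channels: no mutex needed, because only the
// mux goroutine ever touches the state.
type counter struct {
	incr chan int
	read chan chan int
}

func newCounter() *counter {
	c := &counter{incr: make(chan int), read: make(chan chan int)}
	go func() {
		state := 0 // owned exclusively by this goroutine
		for {
			select {
			case n := <-c.incr:
				state += n
			case reply := <-c.read:
				reply <- state
			}
		}
	}()
	return c
}

func (c *counter) Get() int {
	reply := make(chan int)
	c.read <- reply
	return <-reply
}

func main() {
	c := newCounter()
	c.incr <- 2
	c.incr <- 3
	fmt.Println(c.Get()) // 5
}
```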
Here’s an example of misusing channels to cause a deadlock. See if you can spot it:
```go
func foo() {
```
It could be easy to reason about deadlocks when you’re using mutexes only, or when you’re using channels only, but perhaps not so easy when you’re mixing both.
Below is a simplified version of the deadlock bug, demonstrating how mutexes and channels used together can cause interesting issues. Without reading further, can you spot the issue?
```go
type A struct {
```
The issue lies where Action() and Foo() can be called simultaneously, or in very close succession, and both enter the critical section of A's mutex locks. Since B's mux uses blocking channels to coordinate different actions, the b.clear <- true statement will block if the code in the previous case has not completed. Therefore a.Action() and a.Foo() can both be locked, and b.clear is blocked waiting for a.Action() to finish, which is not going to happen while a.Action() is waiting for a.Foo() to unlock!
In my debugging experience, I haven’t run into a very good tool that analyzes this type of deadlock. There are several tools that deal with mutex locks only. There’s even one built into Golang’s runtime, but that’s not enough, as it only detects whether all goroutines are blocked.
I’ve used gdb and Golang’s pprof library. The convenience of the pprof library is that, if you’re writing a server application, you can directly register an HTTP endpoint with all the useful debug output on /debug/pprof. The one I used dumped all the running goroutines in the application:
```shell
curl localhost:10000/debug/pprof/goroutine?debug=1
```
And when examining all the outputs when a deadlock happens, you need to pay attention to the following details:
On a side note, pprof can be really useful if you’re trying to understand how the program is behaving. I even identified a resource leak in the code using pprof (maybe I’ll write another blog post to discuss it). See more at:
Example goroutine output from pprof, from the blog post mentioned above:
```
goroutine 149 [chan send]:
```
It’s easy to overlook that channels are a powerful synchronization tool in Golang, and bad consequences may follow. Instead of expecting deadlock-detection tools to come and save the day, it might be more efficient to reason about the code prudently, with the following lessons in mind:
If you want to become a ‘magician’, one of those with intricate moves and skills that amaze the audience, you’ll need to adopt a growth mindset:
You cannot become a ‘magician’ at your current rate of progress, or by simply imagining a better self: sometimes change involves a fundamental shift in how you see the world. To achieve that you’ll need to observe fellow ‘magicians’, learn the differences, and make non-linear progress.
Some interesting takeaways:
I don’t usually like the “success stories” or “how to become rich” genre of books/blogs/articles, and I keep my suspicions about this one, too. Nevertheless, I find most of the principles described in this blog post reasonable, and the author sounds sincere: build skills, build trust, build networks, build leverage, and finally, build your own brand.
There are quite a few books out there on how to be “successful”; some time I’d like to do some research on those, with more caution than I approach other books.
Extraordinary similarities can be observed between right now and the late 19th to early 20th century, when technology brought human society unrivaled fortune and wealth, distributed unevenly. Society underwent serious transformation, and paved the way to modern liberalism. The same might be expected now, or not. History never follows scripts.
Another great piece from Michael Nielsen, on how Anki systems help improve not just memory, but the whole process of understanding itself.
Aleksandr Solzhenitsyn - the man who told the truth. He spread the knowledge of the gulag system and how it’s used to suppress and mistreat people, and undermined the credibility of the Soviet Union Iron Curtain empire, one of the many factors that brought it to its knees.
The new digital age problems require new solutions. In the article the author proposed the following ‘Bill of Rights’ for the new digital age:
Sir TBL recounts coming up with the idea of a universal “information aggregator” that unifies access to all the world’s knowledge and information online while working at CERN, how he cooperated with similarly brilliant minds to build the first tools for the web, how he pushed the web into momentum, and finally, his own reflections on the impact of the web on society, both positive and negative.
The “Internet” was already a widespread concept before Sir TBL started working on the web. And Sir TBL came up with this simple yet powerful concept: all the world’s documents on the Internet, addressed by a “Universal Resource Locator” and linked together via “hyperlinks”. In this way, you can start your research from any document, and find all relevant resources by simply clicking on these “links”. All the world’s online knowledge is thus weaved together and made accessible to you. This abstraction helped make the Internet much more accessible to the public, and opened doors to waves of innovations and business opportunities. I think this is one of the reasons why TBL and his invention were great: he pondered long and hard on the complicated problem of organizing the Internet’s information, and came up with the most essential but powerful abstraction, which benefited the whole world.
Thanks to CERN, Sir TBL was able to work on this side project, and finally made it completely free and open to the world. Also, when he left CERN to cofound the World Wide Web Consortium (W3C) at MIT, he wanted to make sure the web would be kept running free and open to all. Without his spirit of openness and efforts to keep the web on this track, the web would be a much more dismal place. For this, he should be truly respected.
In the book, he also discussed his philosophy of keeping the web open, including topics on privacy, net neutrality, censorship, etc. It’s striking to see some of these ideas are still so relevant, if not more important, today. In 2019 we are experiencing woes from abuses of the web’s power, from the very Internet conglomerates the web helped to nurture, and from governments who use it to strip away the freedom it was designed to give people. That’s why I find this book still relevant and interesting today: the founder expressed his concerns about the web long ago. Had we listened to his ideas more carefully, we would be more aware and prepared to save it.
There are more interesting nuggets in the book: the whole thought process when he designed the web, the anecdotes from when he first demoed it, the stories of the first browsers, and his musings on the semantic web and his ultimate goal to “link the world’s information”. In all, it’s recommended reading.
This is a 2010 paper that presents Dapper, a tracing infrastructure from Google, built to solve problems at Google scale: in its massive distributed systems, a service can invoke very deep RPC call chains across different nodes in the cluster, which makes tracing quite challenging.
Highlights and takeaways:
The paper introduces the following concepts to describe the system: tree, span, and annotation.
tree
A simple service call could span a few different nodes in the system, forming a calling tree between different services, as shown above in figure 1.
span
In a Dapper trace tree, the tree nodes are basic units of work, referred to as spans. An edge indicates a causal relationship between a span and its parent. See figure 2.
Each trace has a single trace id shared across all its child spans. Each span has its own id, and records the relationship between parent and child (see figure 2). A parent span always starts before its children and ends after they finish.
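The model fits in a small sketch (field names here are illustrative, not Dapper's actual schema): each span carries the trace-wide id plus its own id and its parent's id, which is enough to rebuild the calling tree.

```go
package main

import "fmt"

// Span is a basic unit of work in a trace tree.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // "" for the root span
	Name     string
}

// children indexes span names by parent id, reconstructing the tree
// edges (the causal parent/child relationships) from a flat list.
func children(spans []Span) map[string][]string {
	tree := map[string][]string{}
	for _, s := range spans {
		tree[s.ParentID] = append(tree[s.ParentID], s.Name)
	}
	return tree
}

func main() {
	trace := []Span{
		{TraceID: "t1", SpanID: "1", ParentID: "", Name: "frontend"},
		{TraceID: "t1", SpanID: "2", ParentID: "1", Name: "auth"},
		{TraceID: "t1", SpanID: "3", ParentID: "1", Name: "search"},
	}
	fmt.Println(children(trace)["1"]) // [auth search]
}
```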
Dapper is designed to follow distributed control paths with near-zero intervention from application developers, by instrumenting the following libraries:
annotation
The instrumentation above is sufficient to derive traces of complex distributed systems transparently to users, but Dapper also provides capabilities for users to annotate important sections of their applications.
To improve performance, one of Dapper’s design decisions is sampling. The Dapper team noticed that sampling at a relatively small rate still gets pretty good results, with insights into critical performance issues.
Trace collection is divided into the following steps:
This is one of a series of papers from Microsoft’s Project Catapult, which studies leveraging reconfigurable devices (FPGAs, etc.) to accelerate the data center, from very specific acceleration algorithms like page ranking for the Bing search engine, to more sophisticated machine learning workloads like DNNs.
This is one of their early publications, which introduces the basic design and implementation of the FPGA-accelerated datacenter. It covers the fundamental details of all aspects of server design: hardware, network topology, FPGA core design, fault-tolerant cluster management software, workload scheduling algorithms, etc.
Some highlights and takeaways:
The Catapult hardware is integrated with existing server-grade blades, taking a PCIe slot on the motherboard through a daughter board carrying a single high-end FPGA card. The daughter boards connect to each other over a fast secondary network, independent of the CPU network. This secondary network forms a 6x8 torus topology (see more details in the paper), which gives fast inter-FPGA communication and good routability without too much cabling complexity. The CPU network connects to a 48-port switch per pod.
The FPGA space is divided into the Shell and the Role. The Shell manages common libraries and functionality like memory management, serial links, PCIe, and reconfiguration logic. The Role space is responsible for the actual acceleration algorithm, and is reloaded on each reconfiguration when the FPGA’s functionality needs to be updated.
FPGAs will be reconfigured from time to time, and the software must be designed to take the FPGA completely offline, ignored by its neighbors, to ensure correct operation.
Debugging is hard to achieve through typical JTAG hardware debugging facilities, considering the scale of the datacenter. The paper presents an ‘always-on’ data collector that captures the key components and saves them to a circular-buffer log.
The software interface divides the algorithm into 7 stages and distributes them across an 8-node FPGA group, with one node for redundancy. The paper describes how the network accelerates ‘Feature Extraction’, which produces a single score at the last stage, indicating how close the document is to the search keyword.
All queries are queued in DRAM. The Queue Manager takes documents from each queue and sends them down the pipeline. It also manages model reloads in the pipeline, which calculate different feature ‘scores’ for queries.
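As a purely software analogy for the dataflow just described (illustrative only: the real pipeline runs on FPGAs), a queue manager feeding documents through a chain of scoring stages looks like this:

```go
package main

import "fmt"

// stage refines a partial score and passes it downstream; the last
// stage's output is the final score. Each stage adding its own id
// stands in for a per-stage feature computation.
func stage(id int, in <-chan float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		for score := range in {
			out <- score + float64(id)
		}
		close(out)
	}()
	return out
}

func main() {
	// Queue manager: enqueue documents (initial score 0).
	docs := make(chan float64)
	go func() {
		for i := 0; i < 3; i++ {
			docs <- 0
		}
		close(docs)
	}()

	// Chain 7 stages, mirroring the paper's 7-stage split.
	var out <-chan float64 = docs
	for id := 1; id <= 7; id++ {
		out = stage(id, out)
	}
	for final := range out {
		fmt.Println(final) // 28 per document (1+2+...+7)
	}
}
```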
The Catapult project, according to the paper, ‘reduces the worst-case latency by 29% in the 95 percentile distribution’in their evaluation environment, and provides 95% gain in throughput relative to software.
A very interesting post on augmenting long-term memory, based on Ebbinghaus’ forgetting-curve theory: use flashcards to memorize everything you’ve learned, even trivia like your friends’ birthdays. It uses the Anki flashcard software to go through the list.
The author also reasons about the benefits of memorizing all the details, concepts, and “everything”: the details are the building blocks of a field of knowledge, and memorizing them dramatically helps with understanding the field.
It’s a long read but a deep discussion, and I find it a joyful read.
An interesting talk from Jared Diamond, the author of Guns, Germs, and Steel. Despite the kind of misleading title, it’s an interesting take on history and the progress of human civilizations, and how competitions between civilizations influence their prosperity.
An introduction to interfaces in Golang, and how dependency injection can help you design large projects.
The basic concepts of system design, web design, fundamental principles, and distributed systems design. A collaborative effort on GitHub.
A chapter from Google’s new Site Reliability Engineering book, on how to design a distributed cron daemon and handle problems including fault tolerance, repeatedly scheduled jobs, overloading the cluster, etc. The whole book is a very valuable summary of the experience of automation and distributed systems design at Google, at Google scale. I’ll definitely read through the other chapters.
Eli Bendersky’s blog post on why Golang gracefully handles the problems of concurrency at the language level that other major languages handle rather awkwardly, which greatly reduces the programmer’s mental burden of designing highly concurrent systems.
An introduction to learning Python in HPC, from introduction to Python language, to distributed HPC frameworks for Python.
A list of concepts, papers, and interesting blog posts on distributed systems design.
C++ and the Perils of Double-Checked Locking
DCLP (the Double-Checked Locking Pattern) is often used in the singleton design pattern: when you’d like to initialize a shared object exactly once, you follow these steps:
See C++ example:
```cpp
Singleton* Singleton::instance() {
```
This pattern, however, introduces subtle bugs when expressed in C++ with multi-threading.
The issue is with this statement:
```cpp
pInstance = new Singleton;
```
The following steps happen:

1. Allocate memory for the Singleton object.
2. Construct the Singleton object in the allocated memory.
3. Assign pInstance to the allocated memory.

But the C++ specification doesn’t enforce that these steps happen in this order, and compilers are therefore free to reorder them for the sake of optimization. As long as the observable outcome of the instructions is correct, compilers may place instructions in whatever order best utilizes the CPU. Consider the following case with DCLP:
If the assignment to pInstance is reordered before the construction, another thread can observe that pInstance is non-null and start using the object even before it’s fully constructed, accessing a half-built Singleton object. Oops. This is a very subtle bug, and a hard-to-detect issue, when we’re trying to initialize a shared resource once.
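As an aside, this reordering hazard is exactly what Go's sync.Once is designed to rule out, which is why Go code never needs hand-rolled double-checked locking. A sketch of the Go equivalent (the Singleton type here is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Singleton is an illustrative stand-in for the C++ example above.
type Singleton struct{ value int }

var (
	instance *Singleton
	once     sync.Once
)

// Instance is the safe equivalent of DCLP: sync.Once guarantees the
// constructor runs exactly once AND that its writes are visible to
// every caller before the pointer is, the ordering the naive C++
// version fails to enforce.
func Instance() *Singleton {
	once.Do(func() { instance = &Singleton{value: 42} })
	return instance
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = Instance().value // never observes a half-built object
		}()
	}
	wg.Wait()
	fmt.Println(Instance().value) // 42
}
```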
The paper digs into the details of how a compiler can leverage all sorts of optimizations to spoil your effort to correct the DCLP code, and how to actually implement it correctly with the volatile keyword.
It’s a very interesting paper on algorithms, C++, and programming. It makes you stand in awe of the difficulty and intricacy of C++ and multi-threaded programming.
Linux manual page for the cgroups feature in the kernel, which restricts a Linux process’s CPU, max process count, memory usage, network setup, etc.
Linux manual page for the namespaces feature in the kernel. Namespaces can be specified via the clone syscall, and isolate the child process’s cgroup, IPC, network, mount points, domain name, etc.
When all the ingredients come together, it’s the foundation Docker is built upon. This very interesting talk from GOTO 2018 demonstrates how you can use the following technologies, already built into the Linux kernel, to create your own very small proof-of-concept Docker:
- chroot
- namespaces
- cgroups
It also includes very interesting details including (but not limited to):
- Mounting the /proc virtual file system for your ‘containerized’ child process.
- Passing CLONE_NEWNS to the clone system call, to ‘unshare’ the child’s mount points from the parent process, so that the parent doesn’t see the child’s mount points (which could be many and messy).

How an optimization problem is used in AI, and therefore in all AI applications, including self-driving, etc. Math is magical.
As it actually encourages collaborations, discussions, and exposure to opposing views.
Learning technical writing from the author of your favorite C programming book, ‘The C Programming Language’.
The sysctl directory /sys/kernel/mm/hugepages/hugepages-{pagesize}kB/ contains control files and information on hugepages, where pagesize can be 1048576 or 2048, corresponding to a 1GB or 2MB hugepage size.
To get information on hugepages on your Linux system, the hugepages directory contains the control files:

- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- free_hugepages
- resv_hugepages
- surplus_hugepages

You can also get hugepage-related information from /proc/meminfo:
```
HugePages_Total:    2048
```
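As a sketch of what those fields look like programmatically (illustrative code, using a hard-coded sample instead of reading the real /proc/meminfo):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hugepageInfo pulls the HugePages_* fields out of /proc/meminfo-style
// text. On a real Linux system you would read "/proc/meminfo" instead.
func hugepageInfo(meminfo string) map[string]int {
	fields := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) >= 2 && strings.HasPrefix(parts[0], "HugePages_") {
			n, _ := strconv.Atoi(parts[1])
			fields[strings.TrimSuffix(parts[0], ":")] = n
		}
	}
	return fields
}

func main() {
	sample := `MemTotal:       16000000 kB
HugePages_Total:    2048
HugePages_Free:     2048
Hugepagesize:       2048 kB`
	info := hugepageInfo(sample)
	fmt.Println(info["HugePages_Total"], info["HugePages_Free"]) // 2048 2048
}
```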
The kernel documentation for Hugetlbpage contains detailed information and explanation of the purpose and usage of the hugepage files, as well as of the meminfo fields.
The most convenient way to reserve hugepages on x86_64 Linux is to echo into the sysctl file /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages, e.g.:
```shell
# echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```
And for users to access and use the huge pages, Linux provides a quite convenient interface: the hugetlbfs file system, e.g.:
```shell
mount -t hugetlbfs nodev /mnt/huge
```
This mounts a pseudo filesystem of type hugetlbfs on /mnt/huge, using the default huge page size specified by the system; all files created inside the directory use huge pages.
And after that, you can use hugepage-backed memory by creating files inside the /mnt/huge directory. See the example in the Linux source tree: hugepage-mmap.c. The author takes the following steps:

- Open a file under /mnt with read-write permission.
- Map the file with mmap, with proper protection and flags set (PROT_READ | PROT_WRITE and MAP_SHARED in this case).

According to the Linux kernel documentation:
```
System administrators may want to put this command in one of the local rc
```
And to quote the LWN article:
```
If the huge page pool is statically allocated at boot-time, then this section
```
So to avoid external fragmentation and make sure that hugepage allocation always succeeds, we may want to reserve hugepage memory regions at boot time. This Debian wiki page provides a way to do that. To reserve a number of hugepages, add the following line to /etc/sysctl.conf:
```
vm.nr_hugepages = 1024
```
And to mount it automatically on system start, simply add to /etc/fstab:
```
hugetlbfs /hugepages hugetlbfs mode=1770,gid=2021 0 0
```
There are other advanced topics, which probably will not be covered in the scope of this quick note:
- libhugetlbfs provides programmer APIs to manage and access hugepage memory as well. Its utility hugectl can overload Linux’s standard shmget() functions to allow huge pages to be used when allocating shared memory.
- hugectl also has options to run an application with its text and data sections mapped by hugepages, which gives potential performance benefits.

These are ideas worth digging into in the future, for applications where hugepages can potentially give a good performance boost.
The author’s experience writing a time-series database from the ground up, for Prometheus.
The effort to scale Prometheus with a new project, Thanos, using the Kubernetes sidecar pattern to read data from individual nodes, pre-process it (e.g. sampling), and submit it to a centralized data store for storage and display.
What kind of machine/cluster you’ll need for different size of user base (from 1 to billions).
A display of CPU trace as a Github-style texture tiles.
IT-‘No Bug’-Hare is an interesting blog I found recently, focused on system, C++ language and game design. A good read for C++ fanatics and system designers.
I’m feeling guilty for not updating for so long. But on the bright side: I’m back.
As part of my work requirements, I’m taking on Golang and some small distributed-system design jobs. It’s an interesting language for these tasks: network, systems, infrastructure, etc. I’m having mixed but mostly positive feelings about the language, and maybe will share my experience when I get a chance.
"Managers want metrics that are easy to calculate, easy to understand, and quick to yield a value… metrics with these desirable properties are almost always worse than useless."
Easy metrics are also easily “hacked”: people game the metrics to make the statistics look good, while deviating from academia’s original purpose: good-quality research.
See also:
Every attempt to manage academia makes it worse
An interesting, anarchic-style experiment on Reddit: let thousands of Redditors draw a picture all at the same time, and what would happen? It turned out surprisingly well.
Tim Berners-Lee, the Father of the World Wide Web and Turing Award winner, believes the web nowadays has serious flaws, namely the loss of control over personal privacy, the rampant spread of misinformation on the web, and manipulation by political campaigns online. It took everyone to build the web we have today, and it will take everyone to fix it now.
More reports and readings on Tim Berners-Lee:
Posted here again. He has done tons of reading to arrive at his insights on science, Computer Science, technology, and society. It’s gonna be a long but joyful road.
Lesson 101 for a netizen: handling viewpoints that contradict your own. A good read.
How Ethereum and BlockChain technology may bring us a truly open, distributed Internet. Maybe.
A curated list of video courses/podcasts for entrepreneurs.
Sysadmin Casts is back again, and this time with more stuff: how to turn an idea into an MVP.
Artificial Intelligence for better conversational interfaces, Elon Musk’s companies, and biological technologies are back in people’s attention again.
YouTube channels for inspirations.
Why FPGAs still have a shot.
An interesting tool for better writing, including Flesch–Kincaid readability tests and pointing out Weasel Words.
The difference between arr and &arr: given an array int arr[size], arr decays to a pointer of type int *, while &arr has type int (*)[size], a pointer to the whole array.
An excellent article on the fundamentals of C/C++!
Some “gotchas” and pitfalls in the C programming language and how sometimes compiler optimizations can make it worse. Long story short is, steer away from undefined behaviors.
This post is from Chris Lattner himeself. Really nice article.
Why Python is such a cool language and how Python is used at Red Hat. Much of Red Hat’s important infrastructure is written in Python, including but not limited to firewalld, yum and its successor dnf, and many cloud PaaS tools for OpenShift.
How to write a few lines of Python to simulate a statistics problem that would otherwise be onerous to solve with math theorems and formulas.
How to write clean, well-structured and “Pythonic” code.
By no one but Guido van Rossum himself, on the status of Python 3, and how he created Python.
(And Death to Python 2!!)
Stop writing classes - when, and when not to use classes. Stop thinking in Java (no offense), and learn to be “Pythonic”, for a smaller, cleaner, and more well-structured project code base.
Conscientiousness - “A Personality trait marked by diligence, perseverance and self-discipline”.
Daniel Kahneman - I’ve been recently reading his book “Thinking, Fast and Slow”. It’s often listed as a work in economics, but from what I’ve read it’s also an amazing book on psychology and human cognition.
Reading more books is definitely one of my New Year resolutions. Just started this book, will finish.
Finished most of “C++ In A Nutshell” and Scott Meyers’s “Effective C++”, and started to pick up the basics of the C++ language. Really great books for learning the fundamentals, and some of the fundamental problems in the language.
A very interesting guide to scalable system design and how you should deal with it in an interview. It’s very interesting to learn the basics, though doing them properly might require years of experience.
A guide to scalability, a series of interesting and concise introduction to the same problem.
A very interesting guide on how to use Unix’s core utilities (grep, find, bash, awk, sort, gcc, gdb, git, vim/emacs, …) to arm yourself for code editing/maintenance tasks.
JavaScript…
A Python hacker’s guide to Python, from the author of “The Hacker’s Guide To Python”.
In light of the recent election…
Some security pitfalls in the Python language. Very interesting read, from Red Hat.
A very beautifully crafted GDB init file. Worth taking a look.
From the author of ‘The Hacker’s Guide to Python’.
If this site is reliable, this is Alan Kay’s reading list for all his students. He’s a great thinker, not just in Computer Science, but in human intelligence in general. His list is a constant reminder of how much I’m trailing the great minds of this generation, and how much I should pick up the pace in my reading.
Tips on being an efficient programmer.
Interesting how big companies like Facebook and Google use techniques to entice you to stay on their pages longer, or click on more of their links. I think it’s an interesting read that raises our awareness of such tricks and helps us defend ourselves from such exploitation.
How a slew of new startups are using the latest technologies such as “Blockchain” and “Ethereum” to decentralize the key web infrastructure and the World Wide Web it supports, to compete against giant corporations like Google and Facebook. It’s an interesting trend to keep an eye on, but so far I don’t know if I share the optimism that they’ll succeed.
Eli Bendersky’s blog has always been a must-read to me. He never fails to regularly come up with posts of interesting and insightful ideas, or detailed tutorials.
He also actively participates in the LLVM-dev mailing list and, based on his blog, has broad interests in programming languages, computer systems, etc.
Philip Guo is another one of my favorite bloggers. This time he wrote an intro to HCI research.
I haven’t read extensively from books or blogs recently, which is a shame. I shall definitely invest more time in reading and expanding knowledge.
The very recent GCC 6.0 version in trunk, however, will produce a bad binary for a relatively stable version of V8 with the -O3 flag enabled. The output binary segfaults on some of the most basic tests. At first we immediately assumed it was a bug in the bleeding-edge GCC, and submitted a bug report to the community, which responded promptly (within half an hour, that’s incredible speed; kudos to GCC) that the problem resulted from undefined behavior in V8. The problem is rooted in the fact that some V8 code dereferences null object pointers to access member functions. You can even see their C++ code comparing this to NULL in class member functions:

if (this == NULL) {
  // some logic
}

And the new GCC decided to optimize it away, because in a well-defined C++ program, this can never be NULL.
Undefined behavior is also referred to as Nasal Demons. The “dereferencing NULL pointer” code has also been discussed in this well-written post: Still Comparing “this” Pointer to Null?, about the hazards of using it. Somehow, from the M$ MFC library to the widely used V8 JS engine, they all use this trick for a happy hacking experience. This tech debt is a time bomb planted in their code, and no one knows when it will go off. For V8 it went off around Oct. 2015, when mainline trunk GCC decided to use this undefined behavior for optimization, causing crashes in the produced V8 binary.
Theoretically it could be worse: this can cause a security vulnerability. And the problematic code will work just fine with one revision of the GCC compiler, but not with the very next commit. It’s a nightmare for anyone to debug.
Guys in the Chromium project seem to have been aware of this problem for some time. I quote: “Fundamentally this is fixable by making the functions static and explicitly passing the entity as parameter, but that’s a tremendous amount of work.” See this bug:
https://bugs.chromium.org/p/v8/issues/detail?id=3782
All the coders who touched V8 code should be much smarter than I am. But somehow they let this code slip in, and now the bad code has piled up and is too hard to fix. The moral of this story is: C/C++ is a very hard language to use right, and it takes much patience to learn, understand, and write correct, clean code. Without the patience to learn correct code, fall to the dark side of the source one easily will.
Looks like this code has bitten other people as well. And they are from quite a while ago:
https://jira.mongodb.org/browse/SERVER-15182
https://jira.mongodb.org/browse/SERVER-15306
Attached is a pretty good presentation on undefined C/C++ code:
A good review of, as well as critique of, the original “How to C in 2016”, debunking some myths and making suggestions on how to really code in C.
The minimal-fuss setup for frontend development, from Philip, one of my favorite professors, programmers, and bloggers.
Or rather, an intro to assembly. I’ve just taken a quick glance at the lite version, which covers x86/x86_64 MSVC assembly only. A quick review to polish my memory of x86 assembly.
The full version also contains ARM version of assembly, which is my next target.
The PEP8 Style Guide for Python Code. A good guide to writing consistently readable and beautiful Python code.
A good intro to OpenPGP if you’re a beginner or haven’t heard of it before.
A list of New Year resolution ideas for programmers. I really like the ‘Embrace the uncomfortable’ part. Comfort is what kills you: it makes you lazy and dull, and makes your brain decay. It’s a good idea to stimulate it once in a while.
I do want to learn at least one more new programming language (or maybe pick up Haskell and/or Scheme again?), learn more about security, learn how to use vim, and learn more about non-programming subjects (economics, philosophy, sociology, etc.).
Scott Young explains why we actually do not understand what we think we understand, and how to really understand by using the ‘Feynman Technique’.
I’ve read the Chinese version of this book. Very interesting insight into Israel and Jewish culture. It basically explains how Israel managed to build such a powerful nation and exert influence on global economics, politics, and technology, with limited resources and a hostile environment.
Here I list several observations the author provides in this book, which I find very interesting.
The Internet provides most people the ability to access information from everyone else, which makes everyone a media outlet. It has always been the trend that new technologies lower the barriers of professions and cause mass amateurization. Just like ancient scribes were replaced by Gutenberg’s printing technology, the technological barriers of printing, editing, and distributing news have been lowered by the invention of the Internet, made accessible to the public instead of an elite few, blurring the line between amateurs and professionals.
One outcome of mass amateurization is that the content provided by the general public is often not of as good quality as professionals’. However, the accessibility of the Internet has drastically lowered the cost of publishing, and the new form of media has adapted to the ‘publish, then filter’ pattern.
– “Fewer than two percent of the Wikipedia users ever contribute, yet that is enough to create profound value for millions of users.”
The distribution of participation in large projects always follows a power law: the most active contributors contribute tens to hundreds of times more than the average contributor, and the effect grows with the size of the project. This is true for almost all online participation. Most Wikipedia pages are contributed by a handful of editors, but maintained by many users who contribute a few lines or fix some typos. Most large open source projects are maintained by a few core developers, yet receive small contributions from everywhere. Interestingly, I quote the book: “most large social experiments are engines for harnessing inequality rather than limiting it.”
Before Wikipedia, the founders started off their idea of an open online encyclopedia by creating a site called Nupedia, with content contributed by experts only. That experiment failed, but the succeeding non-profit, volunteer-only Wikipedia soon gained popularity. One of the many interesting questions about Wikipedia is: what gave people the motivation to contribute?
The author’s answer is: love of Wikipedia. “When people care enough, they can come together and accomplish things of a scope and longevity that were previously impossible; they can do big things for love.”
Wikipedia provides a powerful engine (the wiki engine) to channel the love from contributors. Wikis allow revisions and histories, thus making iterative improvements possible, while at the same time the preserved history versions protect wiki pages from catastrophic damage by evil-minded people. Together these are indispensable ingredients of Wikipedia’s success.
“The order of promise, tool, and bargain is also the order in which they matter most to the success of any given group.”
The promise of a group provides the ideology for the group and is the main reason why people are willing to participate. It sets the tone for the group’s activity. “Let’s try to see if we can come up with something together” is actually the very first promise Torvalds put in the email introducing his toy OS Linux. It was not as sweeping as a promise like “Let’s make a world-changing Operating System together” (although it turned out to be one at last), but it provided just enough interest to draw people to this small infant project.
Tools define how interactions happen among the groups, setting tones for interactions. A wiki is good for shared knowledge and judgment, while a mailing list is more convenient for open discussions.
The bargain is more like the adjustment to the culture inside a group. “We expect politeness of one another, and we rebuke the impolite” is a bargain most likely to create a culture that is friendly and respectful.
This is an interesting book on how large groups, especially groups on the Internet, work, and how the “wisdom of the crowd” is, and should be, collected. As my energy is limited, I can’t list all the important ideas in it; this post is my best effort. Anyone who’s interested in building an online community might benefit from this book. In all, it’s worth taking a look at.