04 25 2021

Paper Reading: 150 Successful Machine Learning Models Deployed: 6 Lessons Learned At Booking.com

Paper link: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com

Or download.

Published at KDD 2019 by Booking.com, the paper describes lessons learned from deploying machine learning models in the company's production services. It offers some intriguing insights, many of which I believe are valuable for understanding how machine learning is applied in real-world scenarios.

Here are some of my takeaways.

11 14 2020

BookReview

Book Review: The Death of Expertise

The Death of Expertise: The Campaign Against Established Knowledge and Why It Matters

https://en.wikipedia.org/wiki/The_Death_of_Expertise

“The Death of Expertise” by Tom Nichols is a timely response to the ongoing information epidemic, especially in America. Quoting Isaac Asimov:

There is a cult of ignorance in the United States, and there always has been. The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that “my ignorance is just as good as your knowledge.”

The book lays out the author’s view of why experts matter so much in a democracy, and of the relationship between experts and the public. It goes on to decry the ongoing decay of that relationship: citizens are increasingly losing trust in experts, and experts increasingly find it difficult to communicate with their audience.

11 06 2020

Reading

Reading Summary: 11/06/2020

Technology

Edsger Dijkstra: The Man Who Carried Computer Science on His Shoulders

The often untold story of a mastermind of Computer Science: Dijkstra, whose name is attached to the shortest-path algorithm widely used in GPS navigation.

The post describes a wise, hard-thinking man, a great mind who made unparalleled contributions both to Computer Science, viewed as a mathematical and logical discipline, and to Software Engineering, which focuses on building software and hardware components.

He is most famous for his private reports, the “EWDs”, which he kept writing for more than forty years. They set out his views on Computer Science and Software Engineering in general, and sometimes served as reviews of others’ work. One of the most influential EWD reports was “Notes on Structured Programming,” which argued that programming is a serious skill that demands intellectual rigor.

In 1972, Dijkstra received the ACM Turing Award. He was recognized for:

contributions to programming as a high, intellectual challenge; for eloquent insistence and practical demonstration that programs should be composed correctly, not just debugged into correctness; for illuminating perception of problems at the foundations of program design.

He had great passion for his art, and his strong personality sometimes sparked controversy. One of the most famous cases was his critique of “GOTO” statements as harmful. It set off a widespread, heated debate, yet Dijkstra’s view finally prevailed, and his insistence brought a monumental change to the programming paradigm.

There are many more interesting details about his personal and academic life in the original post, too many to summarize here. For example, he had a minivan in Austin, which he often drove to national parks with his wife, and which was named the “Touring Machine.” If you are passionate about computers and software and have a long weekend afternoon, it is well worth a read.

11 05 2020

PaperReading

Paper Reading: Julia: Dynamism and Performance Reconciled by Design

Link: https://dl.acm.org/doi/pdf/10.1145/3276490

The paper outlines some of the most important design choices of the Julia programming language, and explains how they bridge user-friendliness and performance.

The paper provides a few benchmarks that compare Julia’s performance against a C baseline and against other dynamic languages such as Python, MATLAB, and JavaScript. While other dynamic languages suffer great performance loss because of their dynamism, Julia stays relatively close to the C/C++ baseline: it reaches native performance in a few cases, and most of the benchmarks are within 2x of C or C++, whereas Python can be more than 70x slower than C++.

This is significant, as it may eliminate the “prototype in a dynamic language, then reimplement in a static language for performance” cycle, saving the extra coding time without sacrificing much performance.

Some key takeaways from this paper:

07 20 2020

Reading

Reading Summary: 07/20/2020

A Sino-American bond, forged by Chinese students, is in peril $

How the Chinese-American relationship is affecting the lives of the many people “stuck in between.”

The author foresaw the dangerous impact social media would have on society, and he was right.

He also argues that the cure cannot be purely technological: it requires fixing the vulnerabilities in our economic, political, and social systems.

Technology

Testifying at the Senate about A.I.‑Selected Content on the Internet, from Stephen Wolfram

Stephen Wolfram’s testimony at the Senate on A.I.-selected content: his ideas on why algorithmic bias is dangerous, and how we can address it with proper regulation, transparency, and user choice.

In essence, he proposes that users should know which algorithm is feeding them content, and should be able to choose among algorithms. This requires open benchmarks for recommendation algorithms, and frameworks that let users choose.

Programming

The Rise of Embarrassingly Parallel Serverless Compute

What serverless computing is, why it is on the rise, and why it is useful for parallel workloads (data processing, CI/CD, compilation, ML, visualization, …, you name it).
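To make "embarrassingly parallel" concrete, here is a minimal sketch in Python. It is not taken from the linked post: a local thread pool stands in for a fleet of serverless function invocations, and the names `count_words` and `parallel_word_count` are hypothetical. The point it illustrates is that each task needs only its own input and shares no state, so the tasks can run in any order and the results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """One independent task: it reads only its own chunk and shares no state."""
    return len(chunk.split())


def parallel_word_count(chunks, max_workers=4):
    """Fan the chunks out to workers, then merge the partial results.

    Assumption: the thread pool is a local stand-in for serverless
    invocations; a real platform would run each task in its own function.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(count_words, chunks))


chunks = ["the quick brown fox", "jumps over", "the lazy dog"]
total = parallel_word_count(chunks)
print(total)
```

Because no task depends on another, adding workers only changes how quickly the partial counts come back, not the result.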

NoSQL Data Modeling Techniques

A detailed guide to modeling your NoSQL data schemas.

07 04 2020

PaperReading

Paper Reading: Aurora: Distributed Relational Database

The following is my overly simplified summary of the paper.

Aurora is a geo-distributed SQL database that supports replication, high availability, and transactions. Its distributed design is built around replicating the database's write-ahead log (WAL).
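As a rough illustration of what "replicating the log" means, here is a toy Python sketch. It is not Aurora's actual protocol: the `LogRecord`, `Replica`, and `Writer` names, the three replicas, and the write quorum of two are all assumptions made for the example. The writer ships each log record to every replica, each replica persists and replays it, and a write counts as durable only once enough replicas acknowledge it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    lsn: int  # log sequence number, assigned by the writer
    key: str
    value: str


@dataclass
class Replica:
    available: bool = True
    log: list = field(default_factory=list)
    state: dict = field(default_factory=dict)

    def append(self, record):
        """Persist the record and replay it; return True as an acknowledgement."""
        if not self.available:
            return False
        self.log.append(record)
        self.state[record.key] = record.value
        return True


class Writer:
    def __init__(self, replicas, write_quorum):
        self.replicas = replicas
        self.write_quorum = write_quorum  # hypothetical value chosen by the caller
        self.next_lsn = 1

    def write(self, key, value):
        """Ship one log record to all replicas; durable once a quorum acknowledges."""
        record = LogRecord(self.next_lsn, key, value)
        self.next_lsn += 1
        acks = sum(replica.append(record) for replica in self.replicas)
        return acks >= self.write_quorum


replicas = [Replica(), Replica(), Replica(available=False)]
writer = Writer(replicas, write_quorum=2)
first_write_durable = writer.write("balance", "100")   # two of three replicas acknowledge
replicas[1].available = False
second_write_durable = writer.write("balance", "90")   # only one replica acknowledges
print(first_write_durable, second_write_durable)
```

The sketch leaves out everything that makes the real system hard, such as letting a recovered replica catch up on records it missed, or rolling back a record that reached some replicas without achieving quorum.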

References

Course Syllabus: https://pdos.csail.mit.edu/6.824/schedule.html
Video Lectures: https://www.youtube.com/channel/UC_7WrbZTCODu1o_kfUMq88g/videos
Lecture: https://www.youtube.com/watch?v=jJSh54J1s5o

07 03 2020

PaperReading

Reading: Cassandra Data Modeling

Reading from Cassandra official website: https://www.datastax.com/sites/default/files/content/whitepaper/files/2019-10/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf

Cassandra is an exemplary implementation of a NoSQL database, and it has gained popularity in various web, big data, and ML applications. Recently I stumbled upon a good summary handbook for Cassandra, which includes a decent introduction to its data modeling techniques; these can in turn be applied to other NoSQL databases.

Here are my notes and summaries:

Data Modeling Concepts

Cassandra differs from a traditional RDBMS in a great many ways: it is a wide-column database, it offers BASE eventual-consistency guarantees, and it has looser relationships between tables. Therefore, one needs to model data very differently than in a traditional RDBMS for the application to run efficiently.

Namely, NoSQL has the following differences:

No Joins: tables have loose relationships with each other, without database-level joins.
No Referential Integrity: an RDBMS uses foreign keys to refer to the primary key of another table and enforces those references. NoSQL doesn’t enforce this.
Denormalization: contrary to RDBMS normalization techniques, denormalization is a first-class citizen in NoSQL. Many NoSQL databases support aggregating fields in the same table to achieve row-level atomicity.
Query First: SQL data modeling starts with entities and relations, while NoSQL data modeling starts with application queries.
Sorting: sort order is an important design decision for Cassandra and many other NoSQL databases.
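The points above can be sketched with a toy in-memory model in Python. This is not Cassandra's API: `WideColumnTable`, `record_order`, and the order fields are hypothetical names invented for the example. Each "table" is designed for exactly one query, the same order is written to both tables instead of being joined at read time, and rows inside a partition are kept sorted by a clustering key chosen up front.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy stand-in for one table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        rows.append((clustering_key, dict(row)))
        # Sort order is fixed at write time, so reads never sort or join.
        rows.sort(key=lambda item: item[0])

    def query(self, partition_key, newest_first=False, limit=None):
        rows = [row for _, row in self._partitions.get(partition_key, [])]
        if newest_first:
            rows.reverse()
        return rows if limit is None else rows[:limit]


# Query first: one table per application query (both queries are assumptions).
orders_by_customer = WideColumnTable()  # "orders of a customer, by time"
orders_by_day = WideColumnTable()       # "all orders placed on a given day"


def record_order(order):
    """Denormalization: write the same order once per query it must serve."""
    orders_by_customer.insert(order["customer"], order["placed_at"], order)
    orders_by_day.insert(order["placed_at"][:10], order["order_id"], order)


record_order({"order_id": "o2", "customer": "alice", "placed_at": "2020-07-03T09:00", "total": 30})
record_order({"order_id": "o1", "customer": "alice", "placed_at": "2020-07-02T18:30", "total": 12})
record_order({"order_id": "o3", "customer": "bob", "placed_at": "2020-07-03T11:15", "total": 7})

latest_for_alice = orders_by_customer.query("alice", newest_first=True, limit=1)
orders_on_july_3 = orders_by_day.query("2020-07-03")
print(latest_for_alice, orders_on_july_3)
```

The cost of this design is the one the list implies: nothing enforces that the two copies of an order stay in sync, so the application has to write to every table that serves a query.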

04 05 2020

Reading

Book Review: Black Swan - The Impact of the Highly Improbable

I’ve just finished the main part (without the postscript essays) of the famous and oft-discussed former best seller, The Black Swan. The author is knowledgeable, and the book is insightful and well crafted, with his unique style of discussing serious topics through occasional anecdotes and vivid storytelling. It was a fantastic ride.

01 25 2020

Reading

Reading Summary: Ultralearning

Ultralearning is quite an interesting book from one of my favorite bloggers, Scott Young. Famous for his “MIT Challenge,” in which he completed four years of MIT coursework in a single year entirely through self-study, he now blogs regularly on study methods, student cognition, and everything related.

This book summarizes his research and experience of studying. He argues that there is a way to learn and improve yourself through intensive training and exercise. Like training muscles, you can adopt an extraordinary, unorthodox training plan for your brain, and pick up a new skill in a short amount of time, be it a foreign language, programming, sketching, or even public speaking. He calls it “ultralearning.” For the book, he consulted many references and interviewed like-minded friends who had similar experiences of acquiring or improving a skill intensively. He then distills the essential principles into a guide for a successful “ultralearning” project.

01 20 2020

PaperReading

Paper Reading: Zookeeper

Paper: https://www.usenix.org/legacy/events/atc10/tech/full_papers/Hunt.pdf

Presentation: https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems

Kevin Hu's Blog

A Hungry Fool
