From Borg to Kubernetes

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What does it mean to orchestrate?

What is the difference between traditional job management system and container management system?

Objectives

Understand the rise in abstraction levels as computing task moves from executing a job to running a container

Understand the relationship between container engine and container orchestration system

Be familiar with common open-source orchestration systems

1. What does orchestrate mean

Dictionary definition: to arrange or combine so as to achieve a desired or maximum effect

Recall Kubernetes documentation: We tell Kubernetes what the desired state of our system is like, and Kubernetes will work to maintain that

Before containerization/virtualization, we have cluster of computers running jobs.

Jobs = applications running on single or multiple computing nodes

Applications’ dependencies are tied in to the supporting operating system on these nodes.

Cluster management system only need to manage applications.

Container is more than an application.

A lightweight virtualization of an operating system and its components that help an application to run, including external libraries.

A running container does not depending on a host computer’s libraries.

Is the management process the same as a cluster management system?

2. Borg, a cluster management system

Google’s Cluster Management System

First developed in 2003.

Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. “Large-scale cluster management at Google with Borg.” In Proceedings of the Tenth European Conference on Computer Systems, p. 18. ACM, 2015.

Manages hundreds of thousands of jobs, from many thousands of different applications, across clusters up to tens of thousands machines.

3. Why Borg and Kubernetes

Borg is the predecessor of Kubernetes. Understand Borg helps understand the design decision in creating Kubernetes.

Kubernetes is perhaps the most popular open-source container orchestration system today, for both academic and industry.

Other container orchestration systems are either

Deprecating (Docker Swarm)

Integrates container management as part of the existing framework rather than developing a new management system (UC Berkeley’s Mesos and Twitter’s Aurora)

We will briefly discuss them at the end of this episode.

4. Benefits of Borg

Hides the details of resource management and failure handling so its users can focus on application development.

Operates with very high reliability and availability, and supports applications that have similar requirements.

Runs workloads across tens of thousands of machines efficiently.

Is not the first system that can do these, but is one of the very few that can do it at such scale.

5. User’s perspective

Work is submitted to Borg as jobs, which can have one or more tasks (binary).

Each job runs in one Borg cell, consisting of multiple machines that are managed as a single unit.

Job types:

Long running services that should never goes down and have short-lived latency-sensitive requests: Gmail, Google Docs, Web Search …

Batch jobs that take a few seconds to a few days to complete.

Borg cells allow for not just applications, but applications frameworks

One master job and one or more worker jobs.

The framework can execute parallel applications itself.

Examples of frameworks running on top of Borg:

MapReduce

FlumeJava: Data-Parallel Pipelines

Millwheel: Fault-tolerant Stream Processing at Internet Scale

Pregel: Large-scale graph processing

6. Clusters and cells in Borg

Machines in cells belong to a single cluster, defined by the high-performance datacenter-scale network fabric connecting them.

How is this different that the traditional cluster model?

A Borg’s alloc defines a reserved set of resources on a machine in which one or more tasks can be run.

7. Jobs and tasks

A job consists of multiple tasks

Jobs have constraints that allow them to map to machines with satisfactory attributes

Tasks:

Each task maps to a set of Linux processes.

Authors’ notes: Borg was not designed for virtualization (2003).

Also has resource requirements (CPU cores, RAM, disk space, port available …)

All Borgs’ programs are statically linked.

What does this mean?

Why?

8. Borg’s architecture

Borg Master

Borglet

Sound familiar? (Kuber Master and Kubelet)

9. Borg Master

Consists of two process:

The main Borgmaster process

The scheduler

Borgmaster:

Replicated five times

Contains in-memory copy of most of the state of the cell

Handles client RPCs that either mutate state (create jobs) or provide read-only access to data.

Manages state machines for all the objects in the system (machines, tasks, allocs …)

Scheduler:

Perform feasibility check to map tasks’ constraints to available resources.

Picks one of the feasible machines to run the tasks.

10. Borglet

Local Borg agent that is present on every machine in a cell.

Starts and stops tasks, restarts if failed.

Manages local resources through OS kernel manipulations

Reports state of the machine to the Borgmaster.

11. Scalability of Borg Master

Reported in the 2015 paper:

Unsure of the ultimate scalability limit (flex anyone?)

A single master can

manage many thousands machines in a cell

several cells have arrival rates of more than 10,000 tasks per minute.

2020 Borg analysis report:

(Muhamad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. “Borg: the next generation.” In Proceedings of the fifteenth European conference on computer systems)[https://dl.acm.org/doi/pdf/10.1145/3342195.3387517]

2011 log data: 1 cell, 12000 machines (40 GB compressed)

2020 log data: 8 cells, 96000 machines (350 GB compressed)

The below graph show fraction of CPU and memory allocation of each category of priority queue **relative to cell’s capacity”.

What is special about this?

Keyword: overcommit since 2011.

12. Isolation

Sharing machines between tasks help improving utilization

Security Isolation:

Need good security isolation mechanism among multiple tasks on the same machine.

chroot to jail processes. SSH-connection is used for communication.

VMs are utilized to sandbox external software (Google App Engine and Google Compute Engine). A VM is run as a single task.

Performance isolation

Application’s class: latency-sensitive and batch (batch can be allowed to starved)

Resources:

Compressible: rate-based and can be reclaimed without killing the tasks (CPU cycles, I/O bandwidth)

Incompressible: cannot be reclaimed (memory, disk space)

13. Kubernetes: where does it come from

Developed from lessons learned via Borg

Become available with the initial release of Docker in March 2013

14. Kubernetes: applications versus services

A service is a process that:

is designed to do a small number of things (often just one).

has no user interface and is invoked solely via some kind of API.

An application is a process that:

has a user interface (even if it’s just a command line) and

often performs lots of different tasks. It can also expose an API,

It is common for applications to call several service behind the scenes

15. Kubernetes: what does it have?

Kubelet: a special background process responsible for create, destroy, and monitor containers on a host.

Proxy: a simple network proxy used to separate IP address of the container from the service it provides.

cAdvisor: collects, aggregates, processes, and exports information about running containers.

Pods

A collection of containers and volumes that are bundled and scheduled together because they share a common resource (same file system or IP address).

Docker: Each container gets its own IP

Kubernetes: Containers of a pod share the same address.

A pod emulates a logical host (like a VM) to the containers.

Important:

Kubernetes schedules and orchestrates things at the pod level, not at the container level.

Containers running in the same pod have to be managed together (shared fate).

Management transparency: You don’t have to micromanage processes within a pod.

16. What Kubernetes learned from Borg

Rejection of the job concept and organize around the concept of pods.

labels are used to described the objects (jobs, services, …) and their desired states.

IP addresses are mapped to pods and services and not physical computers.

Optimizations for high-demand jobs.

The perception of Kubernetes’ kernel as an operation system kernel for a distributed system.

17. Borg, Oemga, and Kubernetes

Burns, Brendan, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. “Borg, Omega, and Kubernetes.” Queue 14, no. 1 (2016): 10.

Borg:

Isolation through the root file system (chroot, cgroups).

A modern container is more than just an isolation mechanism: It is also an image, files that make up the applications that runs inside the container.

Application-oriented infrastructure

Containerization transform the data center from being machine-oriented to being application-oriented.

Containers encapsulate the application environment, abstracting away many details of machines and OS from the application developer and the deployment infrastructure.

Managing containers means managing applications rather than machines.

Application environment

Decoupling of image and OS.

Hermetic image:

What is hermetic?

Encapsulation almost all dependencies except Linux kernel system-call interface.

Containers as the unit of management

Relieves application developers and operations teams from worrying about specific details of machines and OS.

Provides the infrastructure team flexibility to roll out new hardware and upgrade the OS with minimal impact on running applications and their developers.

Ties telemetry collected by the management system to applications rather than to machines.

18. Orchestration is only the beginning …

Many new systems have been built around Borg to improve its container-management services

Naming and service discovery

Master election

Application-aware load balancing

Horizontal and vertical scaling

…

Kubernetes attempts to avoid escalating complexity through a consistent approach in its API.

19. Other container management system

Recalling Hadoop YARN (Yet Another Resource Negotiator)

Second generation scheduler for Hadoop (Open-source implementation of Google File System)

Deployment of software frameworks as jobs

Apache Mesos is more similar to YARN and Borg than Kubernetes

Cluster management system

Containers are executed as jobs.

Twitter’s Aurora is a scheduler running on top of Mesos.

Configurations are more complex, but is still a cluster management system.

Key Points

Container orchestration systems grow from traditional

previous episode

Introduction to Cloud Computing

next episode

From Borg to Kubernetes

Overview

1. What does orchestrate mean

2. Borg, a cluster management system

3. Why Borg and Kubernetes

4. Benefits of Borg

5. User’s perspective

6. Clusters and cells in Borg

7. Jobs and tasks

8. Borg’s architecture

9. Borg Master

10. Borglet

11. Scalability of Borg Master

12. Isolation

13. Kubernetes: where does it come from

14. Kubernetes: applications versus services

15. Kubernetes: what does it have?

16. What Kubernetes learned from Borg

17. Borg, Oemga, and Kubernetes

18. Orchestration is only the beginning …

19. Other container management system

Key Points

previous episode

next episode