class: title

Deep Dive into
Kubernetes Internals
for Builders and Operators
(LISA2019 talk)
.footnote[![QR Code to the slides](images/qrcode-lisa.png)☝🏻 Slides!]

.debug[These slides have been built from commit: ae9780f

[lisa/begin.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/begin.md)]

---

## Outline

- Introductions
- Kubernetes anatomy
- Building a 1-node cluster
- Connecting to services
- Adding more nodes
- What's missing

---

class: title

Introductions

---

class: tutorial-only

## Viewer advisory

- Have you attended my talk on Monday?

--

- Then you may experience *déjà-vu* during the next few minutes
  (Sorry!)

--

- But I promise we'll soon build (and break) some clusters!

---

## Hi!

- Jérôme Petazzoni ([@jpetazzo](https://twitter.com/jpetazzo))
- 🇫🇷🇺🇸🇩🇪
- 📦🧔🏻
- 🐋(📅 📅 📅 📅 📅 📅 📅)
- 🔥🧠😢💊 ([1], [2], [3])
- 👨🏻‍🏫✨☸️💰
- 😄👍🏻

[1]: http://jpetazzo.github.io/2018/09/06/the-depression-gnomes/
[2]: http://jpetazzo.github.io/2018/02/17/seven-years-at-docker/
[3]: http://jpetazzo.github.io/2017/12/24/productivity-depression-kanban-emoji/

???

I'm French, living in the US, with one foot in Berlin (Germany) as well.

I'm a container hipster: I was running containers in production before it was cool.

I worked 7 years at Docker, which according to Corey Quinn, is "long enough to be legally declared dead".

I also struggled for a few years with depression and burnout. It's not what I'll discuss today, but it's a topic that matters a lot to me, and I wrote a bit about it; check my blog if you'd like.

After a break, I decided to do something I love: teaching witchcraft. I deliver Kubernetes training.

As you can see, I love emojis, but if you don't, it's OK.
(There will be far fewer emojis on the following slides.)

---

## Why this talk?

- One of my goals in 2018: pass the CKA exam

--

- Things I knew:
  - kubeadm
  - kubectl run, expose, YAML, Helm
  - ancient container lore

--

- Things I didn't:
  - how Kubernetes *really* works
  - deploy Kubernetes The Hard Way

---

## Scope

- Goals:
  - learn enough about Kubernetes to ace that exam
  - learn enough to teach that stuff
- Non-goals:
  - set up a *production* cluster from scratch
  - build everything from source

---

## Why are *you* here?

--

- Need/want/must build Kubernetes clusters

--

- Just curious about Kubernetes internals

--

- The Zelda theme

--

- (Other, please specify)

--

class: tutorial-only

.footnote[*Damn. Jérôme is even using the same jokes for his talk and his tutorial!
This guy really has no shame. Tsk.*]

---

class: title

TL;DR

---

class: title

*The easiest way to install Kubernetes is to get someone else to do it for you.*

(Me, after extensive research.)

???

Which means that if at any point you decide to leave, I will not take it personally, but assume that you eventually saw the light, and that you would like to hire me or some of my colleagues to build your Kubernetes clusters. It's all good.

---

class: talk-only

## This talk is also available as a tutorial

- Wednesday, October 30, 2019 - 11:00 am–12:30 pm
- Salon ABCD
- Same content
- Everyone will get a cluster of VMs
- Everyone will be able to do the stuff that I'll demo today!

---

class: title

The Truth¹ About Kubernetes

.footnote[¹Some of it]

---

## What we want to do

```bash
kubectl run web --image=nginx --replicas=3
```

*or*

```bash
kubectl create deployment web --image=nginx
kubectl scale deployment web --replicas=3
```

*then*

```bash
kubectl expose deployment web --port=80
curl http://...
```

???

Kubernetes might feel like an imperative system, because we can say "run this; do that."

---

## What really happens

- `kubectl` generates a manifest describing a Deployment
- That manifest is sent to the Kubernetes API server
- The Kubernetes API server validates the manifest
- ...
then persists it to etcd
- Some *controllers* wake up and do a bunch of stuff

.footnote[*The amazing diagram on the next slide is courtesy of [Lucas Käldström](https://twitter.com/kubernetesonarm).*]

???

In reality, it is a declarative system. We write manifests, descriptions of what we want, and Kubernetes tries to make it happen.

---

class: pic

![Diagram showing Kubernetes architecture](images/k8s-arch4-thanks-luxas.png)

???

What we're really doing is storing a bunch of objects in etcd. But etcd, unlike a SQL database, doesn't have schemas or types. So to prevent us from dumping any kind of trash data in etcd, we have to read/write to it through the API server. The API server will enforce typing and consistency.

Etcd doesn't have schemas or types, but it has the ability to watch a key or set of keys, meaning that it's possible to subscribe to updates of objects.

The controller manager is a process that has a bunch of loops, each one responsible for a specific type of object. So there is one that will watch the deployments, and as soon as we create, update, or delete a deployment, it will wake up and do something about it.

---

## 19,000 words

They say, "a picture is worth one thousand words."
The following 19 slides show what really happens when we run:

```bash
kubectl run web --image=nginx --replicas=3
```

.debug[[k8s/deploymentslideshow.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/k8s/deploymentslideshow.md)]

---

class: pic

![](images/kubectl-run-slideshow/01.svg)

---

class: pic

![](images/kubectl-run-slideshow/02.svg)

---

class: pic

![](images/kubectl-run-slideshow/03.svg)

---

class: pic

![](images/kubectl-run-slideshow/04.svg)

---

class: pic

![](images/kubectl-run-slideshow/05.svg)

---

class: pic

![](images/kubectl-run-slideshow/06.svg)

---

class: pic

![](images/kubectl-run-slideshow/07.svg)

---

class: pic

![](images/kubectl-run-slideshow/08.svg)

---

class: pic

![](images/kubectl-run-slideshow/09.svg)

---
class: pic

![](images/kubectl-run-slideshow/10.svg)

---

class: pic

![](images/kubectl-run-slideshow/11.svg)

---

class: pic

![](images/kubectl-run-slideshow/12.svg)

---

class: pic

![](images/kubectl-run-slideshow/13.svg)

---

class: pic

![](images/kubectl-run-slideshow/14.svg)

---

class: pic

![](images/kubectl-run-slideshow/15.svg)

---

class: pic

![](images/kubectl-run-slideshow/16.svg)

---

class: pic

![](images/kubectl-run-slideshow/17.svg)

---

class: pic

![](images/kubectl-run-slideshow/18.svg)

---

class: pic

![](images/kubectl-run-slideshow/19.svg)

---

class: title

Building a 1-node cluster
.debug[[lisa/dmuc.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/dmuc.md)]

---

## Requirements

- Linux machine (x86_64)
  2 GB RAM, 1 CPU is OK
- Root (for Docker and kubelet)
- Binaries:
  - etcd
  - Kubernetes
  - Docker

---

## What we will do

- Create a deployment
  (with `kubectl create deployment`)
- Look for our pods
- If pods are created: victory
- Else: troubleshoot, try again

.footnote[*Note: the exact commands that I run will be available in the slides of the tutorial.*]

---

class: pic

![Demo time!](images/demo-with-kht.png)

---

## What have we done?

- Started a basic Kubernetes control plane
  (no authentication; many features are missing)
- Deployed a few pods

---

class: title

Pod-to-service networking

.debug[[lisa/kubeproxy.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/kubeproxy.md)]

---

## What we will do

- Create a service to connect to our pods
  (with `kubectl expose deployment`)
- Try to connect to the service's ClusterIP
- If it works: victory
- Else: troubleshoot, try again

.footnote[*Note: the exact commands that I run will be available in the slides of the tutorial.*]

---

class: pic

![Demo time!](images/demo-with-kht.png)

---

## What have we done?

- Started kube-proxy
- ...
which created a bunch of iptables rules

---

class: title

Adding more nodes

.debug[[lisa/kubenet.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/kubenet.md)]

---

## What do we need to do?

- More machines!
- Can we "just" start kubelet on these machines?

--

- We need to update the kubeconfig file used by kubelet
- It currently uses `localhost:8080` for the API server
- We need to change that!

---

## What we will do

- Get more nodes
- Generate a new kubeconfig file
  (pointing to the node running the API server)
- Start more kubelets
- Scale up our deployment

---

class: pic

![Demo time!](images/demo-with-kht.png)

---

class: title

Beyond kubenet

.debug[[lisa/cni.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/cni.md)]

---

## When kubenet is not enough (1/2)

- IP address allocation is rigid
  (one subnet per node)
- What about DHCP?
- What about e.g. ENI on AWS?
  (allocating Elastic Network Interfaces to containers)

---

## When kubenet is not enough (2/2)

- Containers are connected to a Linux bridge
- What about:
  - Open vSwitch
  - VXLAN
  - skipping layer 2
  - directly using a network interface (macvlan, SR-IOV...)
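
---

## One subnet per node: a sketch

Here is a minimal sketch of what "one subnet per node" means in practice. It assumes, purely for illustration, a cluster-wide range of `10.244.0.0/16` carved into one `/24` per node; the helper name `node_pod_cidr` and the address range are hypothetical and not taken from the demo.

```bash
#!/usr/bin/env bash

# Hypothetical helper: derive the /24 pod subnet of a node from its index,
# assuming the cluster range is a /16 (e.g. 10.244.0.0/16).
node_pod_cidr() {
  local cluster_prefix="$1"   # first two octets, e.g. "10.244"
  local node_index="$2"       # which /24 inside the /16 (0..255)

  if [ "$node_index" -lt 0 ] || [ "$node_index" -gt 255 ]; then
    echo "node index out of range: $node_index" >&2
    return 1
  fi

  echo "${cluster_prefix}.${node_index}.0/24"
}

# Each node owns exactly one fixed block of addresses.
for i in 0 1 2; do
  echo "node${i} -> $(node_pod_cidr 10.244 "$i")"
done
```

With this scheme, a pod's address always comes from the block of the node it runs on, which is why per-pod DHCP leases or per-container ENIs do not fit the model.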
---

## The Container Network Interface

- Allows us to decouple network configuration from Kubernetes
- Implemented by plugins
- Plugins are executables that will be invoked by kubelet
- Plugins are responsible for:
  - allocating IP addresses for containers
  - configuring the network for containers
- Plugins can be combined and chained when it makes sense

---

## Combining plugins

- Interface could be created by e.g. `vlan` or `bridge` plugin
- IP address could be allocated by e.g. `dhcp` or `host-local` plugin
- Interface parameters (MTU, sysctls) could be tweaked by the `tuning` plugin

The reference plugins are available [here].

Look in each plugin's directory for its documentation.

[here]: https://github.com/containernetworking/plugins/tree/master/plugins

---

## How plugins are invoked

- Parameters are given through environment variables, including:
  - CNI_COMMAND: desired operation (ADD, DEL, CHECK, or VERSION)
  - CNI_CONTAINERID: container ID
  - CNI_NETNS: path to network namespace file
  - CNI_IFNAME: what the network interface should be named
- The network configuration must be provided to the plugin on stdin
  (this avoids race conditions that could happen by passing a file path)

---

## Setting up CNI

- We are going to use kube-router
- kube-router will provide the "pod network"
  (connectivity with pods)
- kube-router will also provide internal service connectivity
  (replacing kube-proxy)
- kube-router can also function as a Network Policy Controller
  (implementing firewalling between pods)
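
---

## A toy plugin invocation

The calling convention from "How plugins are invoked" (parameters in `CNI_*` environment variables, network configuration on stdin) can be sketched with a shell function standing in for a plugin. This is only an illustration: `toy_cni_plugin` is a hypothetical name, a real plugin is a separate executable, and on ADD it prints a JSON result rather than the plain-text messages used here. The container ID, namespace path, and configuration values are made up.

```bash
#!/usr/bin/env bash

# Toy stand-in for a CNI plugin (hypothetical; it configures nothing).
toy_cni_plugin() {
  local config
  config="$(cat)"   # the network configuration (JSON) arrives on stdin

  case "${CNI_COMMAND:-}" in
    ADD)
      echo "ADD: would create ${CNI_IFNAME:-unset} in ${CNI_NETNS:-unset}" \
           "for container ${CNI_CONTAINERID:-unset}"
      echo "received ${#config} characters of configuration"
      ;;
    DEL)
      echo "DEL: would remove ${CNI_IFNAME:-unset} of container ${CNI_CONTAINERID:-unset}"
      ;;
    CHECK)
      echo "CHECK: would verify ${CNI_IFNAME:-unset} in ${CNI_NETNS:-unset}"
      ;;
    VERSION)
      echo '{"cniVersion": "0.4.0", "supportedVersions": ["0.3.1", "0.4.0"]}'
      ;;
    *)
      echo "unsupported CNI_COMMAND: '${CNI_COMMAND:-}'" >&2
      return 1
      ;;
  esac
}

# What the caller (kubelet, in the slides above) does, in spirit:
# set the environment, pipe the configuration, run the plugin.
CNI_COMMAND=ADD \
CNI_CONTAINERID=abc123 \
CNI_NETNS=/var/run/netns/demo \
CNI_IFNAME=eth0 \
  toy_cni_plugin <<'EOF'
{"cniVersion": "0.4.0", "name": "demo", "type": "bridge"}
EOF
```

Because the configuration travels on stdin, the plugin never has to open a file that might change between the moment the caller writes it and the moment the plugin reads it.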
---

## How kube-router works

- Very simple architecture
- Does not introduce new CNI plugins
  (uses the `bridge` plugin, with `host-local` for IPAM)
- Pod traffic is routed between nodes
  (no tunnel, no new protocol)
- Internal service connectivity is implemented with IPVS
- kube-router daemon runs on every node

---

## What kube-router does

- Connect to the API server
- Obtain the local node's `podCIDR`
- Inject it into the CNI configuration file
  (we'll use `/etc/cni/net.d/10-kuberouter.conflist`)
- Obtain the addresses of all nodes
- Establish a *full mesh* BGP peering with the other nodes
- Exchange routes over BGP
- Add routes to the Linux kernel

---

## What's BGP?

- BGP (Border Gateway Protocol) is the protocol used between internet routers
- It [scales](https://www.cidr-report.org/as2.0/) pretty [well](https://www.cidr-report.org/cgi-bin/plota?file=%2fvar%2fdata%2fbgp%2fas2.0%2fbgp-active%2etxt&descr=Active%20BGP%20entries%20%28FIB%29&ylabel=Active%20BGP%20entries%20%28FIB%29&with=step)
  (it is used to announce the 700k CIDR prefixes of the internet)
- It is spoken by many hardware routers from many vendors
- It also has many software implementations (Quagga, Bird, FRR...)
- Experienced network folks generally know it (and appreciate it)
- It is also used by Calico (another popular network system for Kubernetes)
- Using BGP allows us to interconnect our "pod network" with other systems

---

class: pic

![Demo time!](images/demo-with-kht.png)

---

class: title, talk-only

What's missing?

.debug[[lisa/end.md](https://github.com/jpetazzo/container.training/tree/lisa-2019-10/slides/lisa/end.md)]

---

## What's missing?

- Mostly: security
- Notably: RBAC
- Also: availability

---

## TLS! TLS everywhere!

- Create certs for the control plane:
  - etcd
  - API server
  - controller manager
  - scheduler
- Create individual certs for nodes
- Create the service account key pair

---

## Service accounts

- The controller manager will generate tokens for service accounts
  (these tokens are JWT, JSON Web Tokens, signed with a specific key)
- The API server will validate these tokens (with the matching key)

---

## Nodes

- Enable NodeRestriction admission controller
  - limits each kubelet to updating its own node and pod data
- Enable Node Authorizer
  - prevents kubelets from accessing data that they shouldn't
  - only authorizes access to e.g. a configmap if a pod is using it
- Bootstrap tokens
  - add nodes to the cluster safely and dynamically

---

## Consequences of API server outage

- What happens if the API server goes down?
  - kubelet will try to reconnect (as long as necessary)
  - our apps will be just fine (but autoscaling will be broken)
- How can we improve the API server availability?
  - redundancy (the API server is stateless)
  - achieve a low MTTR

---

## Improving API server availability

- Redundancy requires adding one layer
  (between API clients and servers)
- Multiple options available:
  - external load balancer
  - local load balancer (NGINX, HAProxy... on each node)
  - DNS Round-Robin

---

## Achieving a low MTTR

- Run the control plane in highly available VMs
  (e.g. many hypervisors can do that, with shared or mirrored storage)
- Run the control plane in highly available containers
  (e.g. on another Kubernetes cluster)

---

class: title

Thank you!

---

## A word from my sponsor

- If you liked this presentation and would like me to train your team ...
  Contact me: jerome.petazzoni@gmail.com
- Thank you! ♥️
- Also, slides👇🏻

![QR code to the slides](images/qrcode-lisa.png)