Egress
This is the third time I've tried to write this: a random panic on my Hackintosh wiped out half a blog post, and the other attempt read like garbage 🫠. So here we go! I recently added Egress IPv6 to kube-vip, but egress isn't something that I have really advertised as functionality in kube-vip, so I thought it best to actually write about it. The concept of Egress within Kubernetes is an interesting one because it also exposes a bit of confusion when it comes to Kubernetes networking, namely what actually comes out of the box with a Kubernetes cluster.
Kubernetes Networking
When we think about a simple Kubernetes cluster (and deploying it), we need to consider all of the additional components that are required in order for the cluster to actually work. These are commonly thought of as the `CxI` plugins, where the `x` is usually `R` for the runtime, `N` for the networking plugin and finally `S` for storage. Without a `CRI`, there is simply no functionality to stop and start containers, effectively making your cluster pointless, and without a `CNI` your containers won't have any networking capabilities! (Well, it turns out that's not strictly true.)
The CNI in its most basic form is largely doing:
- Pod networking with IPAM
- Multi-node networking, allowing pods to speak to one another as if they were on a big flat network (often an overlay)
- Network Policy
- Ingress, typically through `hostPort`
Above is a gross oversimplification of what CNIs accomplish, and most now do far more than just basic networking. However, there is a huge amount of networking functionality that this list doesn't explain. Who or what creates a Kubernetes `service` and manages the forwarding to a pod that is selected? And when a pod wishes to send traffic outside the cluster, who or what handles that?
Kube-Proxy
So if you do a `kubectl get pods -A` on a cluster you may encounter `kube-proxy` pods, which typically run on all nodes as a `daemonset` and are largely there to manipulate the masses of iptables rules required to make the internals of a Kubernetes cluster work. Whenever a service is created, `kube-proxy` will create rules that watch for ingress traffic to that service address, and it also keeps track of the `endpoints` that match that particular service. These rules, regardless of the technology (currently iptables but soon nftables in upstream), will modify traffic that is destined for a service IP and change that destination to one of the endpoint IP addresses. There is a great overview (in epic detail) available at learnk8s.
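To make that destination rewrite a bit more concrete, here is a minimal Python sketch of the idea. It is a toy model and not how `kube-proxy` is implemented: the `Packet` type, the `SERVICES` table, the addresses and the `dnat` function are all made up for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Hypothetical table: service IP -> IPs of the endpoints that match the service.
SERVICES = {"10.96.0.10": ["10.0.0.3", "10.0.1.7"]}


def dnat(packet, services, rng=random):
    """Rewrite a service destination to one of its endpoint addresses."""
    endpoints = services.get(packet.dst)
    if not endpoints:
        return packet  # not a service address, so leave the packet alone
    return replace(packet, dst=rng.choice(endpoints))


print(dnat(Packet(src="10.0.2.5", dst="10.96.0.10"), SERVICES))
```

The source address is untouched here; only the destination changes, which is the opposite of what happens for egress.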
Where things get interesting is egress traffic. If you had a pod with busybox and curl and used it to run `curl https://google.com`, it will typically just work, but the thing about the pod network is that it isn't meant to be directly accessible! So how does return traffic get to the pod?
The answer is pretty simple, and it is pretty much the same logic that applies to most household internet connections: NAT (Network Address Translation). Like the pod in a Kubernetes cluster, your home laptop isn't directly accessible on the internet, but the ADSL/Fibre router is. When an internal device (or pod) wants to connect to the outside world, the traffic traverses either a physical router or routing logic in the kernel, and the source address is changed to the address of the router. So in the diagram below `google.com` will receive a connection from the external address, which it can communicate with. When the router receives return traffic on this connection, it changes the destination address back to the internal address and the traffic is then passed back internally.
[Diagram: traffic from a pod leaving via its node, so that google.com sees the node's external address]
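The NAT round trip described above can be sketched in a few lines of Python. This is a toy model of source NAT with a connection table, not the kernel's conntrack; the `Masquerade` class, the port numbers and the `203.0.113.10` stand-in for the external server are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int


class Masquerade:
    """Toy source NAT: rewrite the source on the way out, undo it on the way back."""

    def __init__(self, external_ip, first_port=40000):
        self.external_ip = external_ip
        self.next_port = first_port
        self.connections = {}  # external port -> (internal IP, internal port)

    def outbound(self, pkt):
        port = self.next_port
        self.next_port += 1
        self.connections[port] = (pkt.src, pkt.sport)
        return Packet(self.external_ip, port, pkt.dst, pkt.dport)

    def inbound(self, pkt):
        internal = self.connections.get(pkt.dport)
        if internal is None:
            return None  # no tracked connection, so nothing internal is reachable
        return Packet(pkt.src, pkt.sport, internal[0], internal[1])


node = Masquerade("192.168.0.22")
request = node.outbound(Packet("10.0.0.3", 51000, "203.0.113.10", 443))
reply = node.inbound(Packet("203.0.113.10", 443, request.src, request.sport))
print(request)
print(reply)
```

The external server only ever sees the node address, and the reply only finds its way back to the pod because the node remembered the connection.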
So this is fine: effectively each node in the cluster takes care of handling the traffic originating from pods by making it look like it's originating from the node where the pod resides, and in most cases this works. However, if your cluster has 1000 nodes and you have a pod that wants to access something external protected by a firewall, that firewall will now need 1000 rules allowing access, because who knows where that pod will be scheduled to run in the future. Or perhaps one of those modern app transformation programs actually works (I've been part of a myriad of failed attempts to create these programs) and you've finally migrated an application that has some static IP requirements. Again, there is no way to have anything static in this scenario.
Egress Magic
The most common solution to implement "controlled" Egress within a Kubernetes cluster is the concept of an Egress gateway, and in most cases this is another small pod (or pods) that sits within the cluster and that traffic is forwarded to. This gateway will then do the same manipulation, however it will typically have a series of IP addresses that it can use dependent on where the originating traffic is coming from. So in theory we could tie our legacy application to its original IP address, and the gateway will then take care of manipulating the traffic to make it appear as though this original IP address is still in use.
[Diagram: egress gateway; the only surviving label reads Node-01 192.168.0.22]
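The "pick an address depending on who is sending" part of a gateway can be sketched as a simple lookup. This is only an illustration of the idea and not any particular gateway's configuration; the `EGRESS_POLICY` table, the CIDRs, the second address `192.168.0.101` and the `egress_address` function are all hypothetical.

```python
import ipaddress

# Hypothetical policy: originating pod CIDR -> address the gateway should use.
EGRESS_POLICY = {
    "10.0.0.0/24": "192.168.0.100",
    "10.0.1.0/24": "192.168.0.101",
}


def egress_address(pod_ip, policy, default):
    """Return the source address the gateway would use for this pod."""
    addr = ipaddress.ip_address(pod_ip)
    for cidr, egress_ip in policy.items():
        if addr in ipaddress.ip_network(cidr):
            return egress_ip
    return default  # no policy matched, fall back to the gateway's own address


print(egress_address("10.0.0.3", EGRESS_POLICY, default="192.168.0.22"))
```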
Egress with kube-vip
A few years ago I popped into the #sig-network channel (excellent place to discuss ideas btw) to discuss egress within Kubernetes, as I wanted to implement it within kube-vip. There are a few projects that have implemented it through different mechanisms in the past, but largely if you wanted to implement it with a stable architecture the obvious solution was using a gateway (as discussed above). After fiddling around with the rules that kube-proxy implements I decided that perhaps there was an alternative solution!
The kube-vip project is everyone's (some people's) go-to project when wanting to deploy a load balancer solution within a Kubernetes cluster, and it is pretty devastatingly simple. Apply a load balancer address to a node, and then when traffic is sent to it the traffic makes its way to a pod via kube-proxy (easy)!
[Diagram: traffic sent to a load balancer address on a node, reaching a pod via kube-proxy]
The same, but backwards
For Egress, could we potentially combine some of the behaviour of both `kube-proxy` and kube-vip to produce a stable Egress solution? It turns out that yes, we can, and again it is devastatingly simple. To accomplish this we effectively overload the behaviour of a Kubernetes service by using it to define not only the ingress (as a load balancer) but also the egress, with additional iptables rules. This creates a 1:1 relationship between a service and a pod and uses the service IP as the pod's ingress and egress address.
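[Diagram: a service load balancer address attached to the node running Pod-01, with Pod-01 reaching Server from that address]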
If we look at the above diagram, we've created a service that selects `Pod-01` and the following will happen:
- The service load balancer is attached to the node where `Pod-01` currently resides.
- The kernel is updated so that traffic leaving from `Pod-01` is rewritten to look like it is coming from the load balancer, so effectively `10.0.0.3` is rewritten to become `192.168.0.100`.
- As the process sends traffic or initiates connections to the `Server`, they are now all coming from the externally facing load balancer address.
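The rewrite in the second step is a source NAT rule. As a rough sketch (and not the exact rule set that kube-vip programs), the following Python builds the arguments for a plain `iptables` SNAT rule using the two addresses from the list above. The `snat_rule` helper is hypothetical, and a real rule set would most likely also need to exclude traffic that stays inside the cluster.

```python
import shlex


def snat_rule(pod_ip, vip, action="-A"):
    """Build iptables arguments that rewrite pod_ip to vip as traffic leaves the node."""
    return [
        "iptables", "-t", "nat", action, "POSTROUTING",
        "-s", pod_ip,
        "-j", "SNAT", "--to-source", vip,
    ]


# Print the command rather than running it, since applying it needs root.
print(shlex.join(snat_rule("10.0.0.3", "192.168.0.100")))
```

Passing `action="-D"` builds the matching delete, which is what a clean-up would need when the pod goes away.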
In the event `Pod-01` is killed or crashes, kube-vip is notified through a Kubernetes watcher, where the pod delete event results in the egress rules being cleaned up. When the pod reappears on another node, the egress rules are re-applied and the traffic continues to appear to come from the same IP address, so we get a level of High Availability for free.
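That clean-up and re-apply cycle can be modelled with a small event handler. This is a simplified sketch rather than kube-vip's code: the `EgressRules` class, the node names and the second pod IP are invented for illustration, and the event names simply mirror the `ADDED`, `MODIFIED` and `DELETED` types that a Kubernetes watch delivers.

```python
class EgressRules:
    """Toy per-node rule table; a real implementation would program the kernel."""

    def __init__(self, node):
        self.node = node
        self.rules = {}  # pod name -> (pod IP, egress address)

    def handle(self, event_type, pod, vip):
        if event_type == "DELETED":
            # The pod is gone, so remove any egress rule that was created for it.
            self.rules.pop(pod["name"], None)
        elif event_type in ("ADDED", "MODIFIED") and pod["node"] == self.node:
            # The pod runs on this node, so (re)apply the rule with its current IP.
            self.rules[pod["name"]] = (pod["ip"], vip)


VIP = "192.168.0.100"
node1, node2 = EgressRules("node-01"), EgressRules("node-02")

events = [
    ("ADDED", {"name": "pod-01", "ip": "10.0.0.3", "node": "node-01"}),
    ("DELETED", {"name": "pod-01", "ip": "10.0.0.3", "node": "node-01"}),
    ("ADDED", {"name": "pod-01", "ip": "10.0.1.9", "node": "node-02"}),
]
for event_type, pod in events:
    for table in (node1, node2):
        table.handle(event_type, pod, VIP)

print(node1.rules, node2.rules)
```

The pod IP changes when the pod lands on the second node, but the egress address it maps to stays the same.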
Where is the eBPF
I've wanted to re-implement this functionality in eBPF for quite a long time, however it simply isn't as easy as I hoped.
"But isn't there an egress hook for TC?"
Indeed there is, however this hook is the last step in the chain before the packet goes back onto the NIC and leaves the machine. What this effectively means is that all of the netfilter (iptables & `kube-proxy`) magic will have already modified the packet before our eBPF program can see it.
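A tiny model of that ordering shows why it is a problem. This is plain Python and nothing like a real eBPF program: `netfilter_postrouting`, `send` and the addresses are stand-ins for illustration, and the only thing modelled is that the netfilter rewrite runs before the TC egress hook gets to look at the packet.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def netfilter_postrouting(pkt, node_ip):
    """Stand-in for the existing NAT rules: the pod source becomes the node address."""
    return replace(pkt, src=node_ip)


def send(pkt, node_ip, tc_egress):
    """Run the hooks in the order described above: netfilter first, TC egress last."""
    pkt = netfilter_postrouting(pkt, node_ip)
    tc_egress(pkt)
    return pkt


seen = []  # what the TC egress hook observed
send(Packet("10.0.0.3", "203.0.113.10"), "192.168.0.22", seen.append)
print(seen[0].src)
```

By the time the hook runs, the pod address it would want to match on is no longer in the packet.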
So for now the plan is to migrate kube-vip to use Google's nftables Go library, and we'll see what happens next.
Thanks
- Lars Ekman
- Antonio Ojea
- The other folks that listened to my daft ideas in #SIG-NETWORK