Egress
This is the third time I've tried to write this; a random panic on my Hackintosh wiped out half a blog post, and the other attempt read like garbage. So here we go! I recently added IPv6 Egress to kube-vip, but it isn't something that I really advertised as functionality, so I thought it best to actually write about it. The concept of Egress within Kubernetes is an interesting one, because it exposes a bit of confusion around Kubernetes networking: namely, what actually comes out of the box with a Kubernetes cluster.
Kubernetes Networking
When we think about a simple Kubernetes cluster (and deploying it), we need to consider all of the additional components that are actually required for the cluster to work. These are commonly thought of as the `CxI` plugins, where the `x` is usually `R` for the runtime, `N` for the networking plugin and finally `S` for storage. Without a `CRI` there is simply no functionality to start and stop containers, effectively making your cluster pointless, and without a `CNI` your containers won't have any networking capabilities! (Well, it turns out that's not strictly true.)
In its most basic form, the CNI is largely responsible for:
- Pod networking with IPAM
- Multi-node networking, allowing pods to speak to one another as if they were on one big, flat network (often an overlay)
- Network Policy
- Ingress, typically through `hostPort`
The above is a gross oversimplification of what CNIs accomplish, and most now do far more than just basic networking. However, there is still a huge amount of networking functionality left unexplained: who or what creates a Kubernetes `service` and manages the forwarding to a selected pod, and when a pod wants to send traffic outside the cluster, who or what makes that happen?
Kube-Proxy
If you do a `kubectl get pods -A` on a cluster you may encounter `kube-proxy` pods, which typically run on every node as a `daemonset` and are largely there to manipulate the masses of iptables rules required to make the internals of a Kubernetes cluster work. Whenever a service is created, `kube-proxy` creates rules that watch for ingress traffic to that service address; it also keeps track of the `endpoints` that match that particular service. These rules, regardless of the technology (currently iptables, but soon nftables upstream), modify traffic that is destined for a service IP and change the destination to one of the endpoint IP addresses. There is a great overview (in epic detail) available at learnk8s.
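To make that concrete, the rules kube-proxy generates look roughly like the hand-written sketch below. The chain names, addresses and ports here are invented for illustration; real chains carry hashed suffixes and considerably more matching logic:

```bash
# Traffic hitting the (example) ClusterIP 10.96.0.50:80 is intercepted...
iptables -t nat -A KUBE-SERVICES -d 10.96.0.50/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE

# ...spread across the service's endpoints...
iptables -t nat -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 -j KUBE-SEP-POD01
iptables -t nat -A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD02

# ...and DNATed to the chosen pod's real address.
iptables -t nat -A KUBE-SEP-POD01 -p tcp -j DNAT --to-destination 10.0.0.3:8080
iptables -t nat -A KUBE-SEP-POD02 -p tcp -j DNAT --to-destination 10.0.0.4:8080
```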
Where things get interesting is egress traffic. If you had a pod with busybox and curl in it and ran `curl https://google.com`, it would typically just work, but the pod network isn't meant to be directly accessible from outside the cluster! So how does return traffic get back to the pod?
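(If you want to try this yourself, a throwaway pod will do; the image name below is just one example of something that ships curl.)

```bash
# Run a disposable pod and curl out from inside the cluster
kubectl run egress-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sI https://google.com
```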
The answer is pretty simple, and it is much the same logic that applies to most household internet connections: NAT (Network Address Translation). Simply put, like a pod in a Kubernetes cluster, your laptop at home isn't directly accessible from the internet, but the ADSL/fibre router is. When an internal device (or pod) wants to connect to the outside world, the traffic traverses either a physical router or routing logic in the kernel, and the source address is changed to the address of the router. So in the diagram below, google.com sees a connection from the external address, which it can communicate with; when the router receives return traffic on that connection it changes the destination back to the internal address, and the traffic is passed back internally.
*(Diagram: pod traffic NATed to the node/router address on its way out to google.com, with return traffic translated back.)*
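On a node, that translation usually boils down to a masquerade (or SNAT) rule in the nat table's POSTROUTING chain. A minimal hand-rolled equivalent might look like the following, assuming an example pod CIDR of 10.0.0.0/24 on the node; real CNIs and kube-proxy generate their own, more selective chains:

```bash
# Rewrite the source of pod traffic leaving the node to the node's own address,
# except for traffic that stays inside the cluster network.
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 ! -d 10.0.0.0/8 -j MASQUERADE
```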
So this is fine: each node in the cluster takes care of traffic originating from its pods by making it look like it originates from the node itself, and in most cases this works. However, if your cluster has 1000 nodes and a pod needs to access something external that sits behind a firewall, that firewall now needs 1000 rules allowing access, because who knows which node that pod will be scheduled onto in the future. Or perhaps one of those modern app transformation programmes actually works (I've been part of a myriad of failed attempts to create these) and you've finally migrated an application that has some static IP requirements; again, there is no way to have anything static in this scenario.
Egress Magic
The most common solution for implementing "controlled" egress within a Kubernetes cluster is an egress gateway: in most cases another small pod (or pods) sitting within the cluster that traffic is forwarded to. This gateway then does the same manipulation, but it typically has a series of IP addresses it can use depending on where the originating traffic is coming from. So, in theory, we could tie our legacy application to its original IP address and the gateway would take care of manipulating the traffic so that it appears as though this original IP address is still in use.
*(Diagram: pods forwarding traffic through an egress gateway, which maps each source to a dedicated external IP.)*
Egress with kube-vip
A few years ago I popped into the #sig-networking channel (an excellent place to discuss ideas, by the way) to discuss egress within Kubernetes, as I wanted to implement it within kube-vip. A few projects have implemented it through different mechanisms in the past, but if you wanted a stable architecture the obvious solution was a gateway (as discussed above). After fiddling around with the rules that kube-proxy implements, I decided that perhaps there was an alternative solution!
The kube-vip project is everyone's (well, some people's) go-to project for deploying a load-balancer solution within a Kubernetes cluster, and it is devastatingly simple: apply a load-balancer address to a node, and when traffic is sent to that address it makes its way to a pod via kube-proxy (easy)!
*(Diagram: external traffic arriving at the kube-vip load-balancer address on a node and being forwarded to a pod by kube-proxy.)*
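On the ingress side that really is the whole user experience; something like the following (the deployment and service names are examples) is enough for kube-vip to advertise an address and for kube-proxy to steer the traffic to a pod:

```bash
# Expose an existing deployment through a LoadBalancer service
kubectl expose deployment my-app --port=80 --type=LoadBalancer

# kube-vip advertises the EXTERNAL-IP shown here
kubectl get service my-app
```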
The same, but backwards
For egress, could we combine some of the behaviour of both `kube-proxy` and kube-vip to produce a stable egress solution? It turns out yes, we can, and again it is devastatingly simple. To accomplish this we effectively overload the behaviour of a Kubernetes service, using it to define not only the ingress (as a load balancer) but also the egress with additional iptables rules. This creates a 1:1 relationship between a service and a pod and uses the service IP as the pod's ingress and egress address.
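In practice this is driven from the Service itself: kube-vip looks for an egress annotation on the load-balancer service. The annotation key below is from memory, so treat it as an assumption and check the kube-vip documentation for the current form:

```bash
# Mark an existing LoadBalancer service so kube-vip also handles egress
# for the pod it selects (annotation key assumed; verify against the docs).
kubectl annotate service my-app kube-vip.io/egress="true"
```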
*(Diagram: a service whose load-balancer address 192.168.0.100 serves as both the ingress and egress address for Pod-01 at 10.0.0.3.)*
If we look at the above diagram, we've created a service that selects `Pod-01`, and the following happens:

- The service load balancer is attached to the node where `Pod-01` currently resides.
- The kernel is updated so that traffic leaving `Pod-01` looks like it is coming from the load balancer; effectively `10.0.0.3` is rewritten to become `192.168.0.100` (see the sketch after this list).
- As the process sends traffic or initiates connections to the `Server`, they now all come from the externally facing load-balancer address.
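The second step boils down to a source-NAT rule keyed on the pod's address. A simplified, hand-written equivalent of what ends up on the node (kube-vip keeps its rules in its own chain and also excludes cluster-internal destinations) would be:

```bash
# Anything leaving Pod-01 (10.0.0.3) for outside the cluster network is
# rewritten to appear to come from the service's load-balancer address.
iptables -t nat -I POSTROUTING -s 10.0.0.3 ! -d 10.0.0.0/8 -j SNAT --to-source 192.168.0.100
```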
In the event that `Pod-01` is killed or crashes, kube-vip is notified through a Kubernetes watcher, and the pod-delete event results in the egress rules being cleaned up. When the pod reappears on another node, the egress rules are re-applied and its traffic continues to appear to come from the same IP address, so we get a level of high availability for free.
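A quick, informal way to see this in action (pod and service names are examples) is to delete the pod and check that the outside world still sees the same source address once it has been rescheduled:

```bash
# In one terminal, watch the pod get rescheduled onto another node
kubectl get pods -o wide -w

# In another, delete the pod backing the egress service
kubectl delete pod pod-01

# Once the replacement is running, outbound traffic should still show the VIP
kubectl exec <replacement-pod> -- curl -s https://ifconfig.me   # expect 192.168.0.100
```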
Where is the eBPF?
I've wanted to re-implement this functionality in eBPF for quite a long time, however it simply isn't as easy as I hoped.
"But isn't there an egress hook for TC?"
Indeed there is; however, this hook is the last step in the chain before the packet goes back onto the NIC and leaves the machine. What this effectively means is that all of the netfilter (iptables and `kube-proxy`) magic will have already modified the packet before our eBPF program can see it.
So for now the plan is to migrate kube-vip to use Google's nftables Go library, and we'll see what happens next.
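For the curious, the nft CLI equivalent of the SNAT rule shown earlier, which is roughly what an nftables-based implementation would end up programming (the table and chain names here are invented), looks like this:

```bash
# Create a dedicated table and a NAT postrouting chain, then pin the pod to the VIP
nft add table ip kube-vip-egress
nft add chain ip kube-vip-egress postrouting '{ type nat hook postrouting priority srcnat ; }'
nft add rule ip kube-vip-egress postrouting ip saddr 10.0.0.3 snat to 192.168.0.100
```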
Thanks
- Lars Ekman
- Antonio Ojea
- The other folks that listened to my daft ideas in #SIG-NETWORK