Egress v2 in kube-vip

I’m not sure if it was my time working for [Docker](docker.io) or just the general confusion around what various companies and startups are doing with container images, but I’ve always opted to go FROM scratch where possible. Several years ago we (me) wanted to implement egress control within kube-vip, and although up to that point I hadn’t really wanted to be involved in the datapath, I decided to go for it. My main concern about the datapath is that people get very upset when their packets arrive backwards or get lost 😱. However, after a bit of research and seeing what the rest of the world was doing, I decided to opt for a slightly different (if not weird) approach, which was to tie the service to a pod and map the external address 1:1 with a pod. There is a ton of detail about that in this blog post 👉 https://thebsdbox.co.uk/2024/11/03/Egress/.

iptables & nftables

In order to implement the egress source NAT I decided to largely duplicate the behaviour of kube-proxy, which is to SNAT the address of the pod to the address of the load balancer service, and job done 🎉. However, in order to do that we needed to start adding iptables binaries into the kube-vip images. This suddenly means pulling in other dependencies, and as we all know, more dependencies means a larger attack surface. Under most circumstances a lot of people wouldn’t mind and/or care, but unfortunately kube-vip serves as the control plane HA VIP and sits in a critical position, so exploiting it would be a big issue. So now we introduce two kube-vip images: one from scratch and another with iptables and all of its dependencies.
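
To make that concrete, here’s a rough sketch of the kind of SNAT rule involved, written against the coreos/go-iptables wrapper (this isn’t kube-vip’s actual code, and the pod/VIP addresses are made up):

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	// Hypothetical addresses: the pod we want to give a fixed egress
	// identity, and the load balancer VIP it should appear to come from.
	podIP := "10.0.0.5"
	vip := "192.168.0.100"

	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("creating iptables handle: %v", err)
	}

	// SNAT traffic leaving the pod so that it egresses with the VIP as
	// its source address, pinned 1:1 to the service address.
	if err := ipt.AppendUnique("nat", "POSTROUTING",
		"-s", podIP, "-j", "SNAT", "--to-source", vip); err != nil {
		log.Fatalf("adding SNAT rule: %v", err)
	}
}
```

Note that a wrapper like this still shells out to the iptables binary, which is exactly why those binaries (and their dependencies) end up in the image.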

tech-debt = tech-debt + 1;

The state of iptables & nftables

What, you want some details… fine.

So we’ve been using iptables for years and it largely works; it has its issues with scale, but so does everything at one point or another. One of the main issues has been programming rules through lower-level abstractions: given that we’re ultimately programming the kernel, can we get rid of all the binaries? Well, the answer to that is:

The answer unfortunately is: No.

Additionally:

We don’t guarantee a stable interface

… awesome, and it has been this way for years and years.

When is iptables not iptables?

When it is nftables, of course 🤷🏼‍♂️. It turns out that for the last few years all of your legacy iptables code hasn’t actually been churning out iptables rules; you’ve been living a lie! Feel free to read this madness at your earliest convenience. But the tl;dr is that in most places the iptables binary is actually a new program that takes your iptables syntax and pumps out nftables rules.

So why is no-one writing nftables rules?

Well, because you didn’t have to: if the iptables rules are being rewritten as nftables rules on the fly, you may as well leave it to that. Additionally, the syntax is dramatically different from iptables, which means it comes with its own learning curve. This, along with a few buggy releases of the userland tooling, has unfortunately slowed its adoption. Recently, however, there have been quite a few things happening in the nftables space! Namely, Kubernetes 1.31 introduced the capability to run kube-proxy in an nftables mode, which is detailed here.

The (finally) growing ecosystem for nftables

The Kubernetes project has finally started to embrace nftables and has opted to write a wrapper around it called knftables. This is a wrapper around the nftables binaries, which unfortunately still means bringing along the nftables binaries, the libraries and any other dependencies. But what is there if you’re a slightly mad purist who demands scratch?
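
For the curious, driving knftables looks roughly like the sketch below (assuming the sigs.k8s.io/knftables API, with an invented table name and addresses); the rule itself is still plain nftables syntax, and the library still execs the nft binary under the hood:

```go
package main

import (
	"context"
	"log"

	"sigs.k8s.io/knftables"
)

func main() {
	// Hypothetical table name; knftables owns everything inside it.
	nft, err := knftables.New(knftables.IPv4Family, "kube-vip-egress")
	if err != nil {
		log.Fatalf("creating knftables interface: %v", err)
	}

	tx := nft.NewTransaction()
	tx.Add(&knftables.Table{})
	tx.Add(&knftables.Chain{
		Name:     "postrouting",
		Type:     knftables.PtrTo(knftables.NATType),
		Hook:     knftables.PtrTo(knftables.PostroutingHook),
		Priority: knftables.PtrTo(knftables.SNATPriority),
	})
	// The rule is written in native nftables syntax (made-up addresses).
	tx.Add(&knftables.Rule{
		Chain: "postrouting",
		Rule:  "ip saddr 10.0.0.5 snat to 192.168.0.100",
	})

	// Run() shells out to the nft binary, which is why the binaries and
	// their dependencies still have to come along for the ride.
	if err := nft.Run(context.TODO(), tx); err != nil {
		log.Fatalf("applying transaction: %v", err)
	}
}
```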

A pure Go package for manipulating nftables 🎉

I’ve been watching https://github.com/google/nftables for quite some time, since November 2023 in fact. The thing that was missing was the capability to implement SNAT rules in this pure Go implementation. Luckily that feature was implemented not too long after, but then life gets in the way and you’re too busy climbing a mountain or doing something else.
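
To show what pure Go actually means here, below is a minimal sketch of a 1:1 SNAT rule built with google/nftables, which speaks netlink directly instead of shelling out to any binary (the table name and addresses are invented, and kube-vip’s real rules are more involved than this):

```go
package main

import (
	"log"
	"net"

	"github.com/google/nftables"
	"github.com/google/nftables/expr"
	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical addresses, purely for illustration.
	podIP := net.ParseIP("10.0.0.5").To4()
	vip := net.ParseIP("192.168.0.100").To4()

	// Conn talks netlink directly; no nft or iptables binaries required.
	conn, err := nftables.New()
	if err != nil {
		log.Fatalf("opening netlink connection: %v", err)
	}

	table := conn.AddTable(&nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "kube-vip-egress",
	})

	chain := conn.AddChain(&nftables.Chain{
		Name:     "postrouting",
		Table:    table,
		Type:     nftables.ChainTypeNAT,
		Hooknum:  nftables.ChainHookPostrouting,
		Priority: nftables.ChainPriorityNATSource,
	})

	// Equivalent of: ip saddr 10.0.0.5 snat to 192.168.0.100
	conn.AddRule(&nftables.Rule{
		Table: table,
		Chain: chain,
		Exprs: []expr.Any{
			// Load the IPv4 source address (offset 12, length 4) into register 1.
			&expr.Payload{
				DestRegister: 1,
				Base:         expr.PayloadBaseNetworkHeader,
				Offset:       12,
				Len:          4,
			},
			// Match it against the pod address.
			&expr.Cmp{Op: expr.CmpOpEq, Register: 1, Data: podIP},
			// Load the VIP into register 1 and SNAT to it.
			&expr.Immediate{Register: 1, Data: vip},
			&expr.NAT{
				Type:       expr.NATTypeSourceNAT,
				Family:     unix.NFPROTO_IPV4,
				RegAddrMin: 1,
			},
		},
	})

	// Nothing hits the kernel until Flush() sends the batch over netlink.
	if err := conn.Flush(); err != nil {
		log.Fatalf("applying nftables rules: %v", err)
	}
}
```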