Egress v2 in kube-vip

I’m not sure if it was my time working for [Docker](docker.io) or just the general confusion around what various companies and startups are doing with container images, but I’ve always opted to go FROM scratch where possible. Several years ago we (me) wanted to implement egress control within kube-vip, and although up to that point I hadn’t really wanted to be involved in the datapath, I decided to go for it. My main concern about the datapath is that people get very upset when their packets arrive backwards or get lost 😱. However, after a bit of research and seeing what the rest of the world was doing, I decided to opt for a slightly different (if not weird) approach, which was to tie the service to a pod and map the external address 1:1 with a pod. There is a ton of detail about that in this blog post 👉 https://thebsdbox.co.uk/2024/11/03/Egress/.

iptables & nftables

In order to implement the egress source NAT I decided to largely duplicate the behaviour of kube-proxy, which is to SNAT the address of the pod to the address of the load balancer service, and job done 🎉. However, in order to do that we needed to start adding iptables binaries into the kube-vip images. This suddenly means pulling in other dependencies, and as we all know, more dependencies means a larger attack surface. Under most circumstances a lot of people wouldn’t mind and/or care, but unfortunately kube-vip serves as the control plane HA VIP and sits in a critical position, so exploiting it would be a big issue. So now we introduce two kube-vip images: one from scratch and another with iptables and all of its dependencies.
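
To make that concrete, here’s a rough sketch of the kind of SNAT rule involved, written against the coreos/go-iptables wrapper (this isn’t kube-vip’s actual code, and the pod/VIP addresses are made up):

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	// Hypothetical addresses: the pod we want to give a fixed egress
	// identity, and the load balancer VIP it should appear to come from.
	podIP := "10.0.0.5"
	vip := "192.168.0.100"

	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("creating iptables handle: %v", err)
	}

	// SNAT traffic leaving the pod so that it egresses with the VIP as
	// its source address, pinned 1:1 to the service address.
	if err := ipt.AppendUnique("nat", "POSTROUTING",
		"-s", podIP, "-j", "SNAT", "--to-source", vip); err != nil {
		log.Fatalf("adding SNAT rule: %v", err)
	}
}
```

Note that a wrapper like this still shells out to the iptables binary, which is exactly why those binaries (and their dependencies) end up in the image.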

tech-debt = tech-debt + 1;

The state of iptables & nftables

What, you want some details… fine.

So we’ve been using iptables for years and it largely works; it has its issues with scale, but so does everything at one point or another. One of the main issues has been programming rules through lower-level abstractions: given that we’re ultimately programming the kernel, can we get rid of all the binaries? Well, the answer to that is:

The answer unfortunately is: No.

Additionally:

We don’t guarantee a stable interface

… awesome, and it has been this way for years and years.

When is iptables not iptables?

When it is nftables, of course 🤷🏼‍♂️. It turns out that for the last few years all of your legacy iptables code hasn’t actually been churning out iptables rules; you’ve been living a lie! Feel free to read this madness at your earliest convenience. But the tl;dr is that in most places the iptables binary is actually a new program that takes your iptables syntax and pumps out nftables rules.

So why is no-one writing nftables rules?

Well, because you didn’t have to: if the iptables rules are being rewritten as nftables rules on the fly, you may as well leave it to that. Additionally, the syntax is dramatically different from iptables, which means it comes with its own learning curve. This, along with a few buggy releases of the userland tooling, has unfortunately slowed its adoption. Recently, however, there have been quite a few things happening in the nftables space! Namely, Kubernetes 1.31 introduced the capability to run kube-proxy in an nftables mode, which is detailed here.

The (finally) growing ecosystem for nftables

The Kubernetes project has finally started to embrace nftables and has opted to write a wrapper around it called knftables. This is a wrapper around the nftables binaries, which unfortunately still means bringing along the nftables binaries, the libraries and any other dependencies. But what is there if you’re a slightly mad purist who demands scratch?
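
For the curious, driving knftables looks roughly like the sketch below (assuming the sigs.k8s.io/knftables API, with an invented table name and addresses); the rule itself is still plain nftables syntax, and the library still execs the nft binary under the hood:

```go
package main

import (
	"context"
	"log"

	"sigs.k8s.io/knftables"
)

func main() {
	// Hypothetical table name; knftables owns everything inside it.
	nft, err := knftables.New(knftables.IPv4Family, "kube-vip-egress")
	if err != nil {
		log.Fatalf("creating knftables interface: %v", err)
	}

	tx := nft.NewTransaction()
	tx.Add(&knftables.Table{})
	tx.Add(&knftables.Chain{
		Name:     "postrouting",
		Type:     knftables.PtrTo(knftables.NATType),
		Hook:     knftables.PtrTo(knftables.PostroutingHook),
		Priority: knftables.PtrTo(knftables.SNATPriority),
	})
	// The rule is written in native nftables syntax (made-up addresses).
	tx.Add(&knftables.Rule{
		Chain: "postrouting",
		Rule:  "ip saddr 10.0.0.5 snat to 192.168.0.100",
	})

	// Run() shells out to the nft binary, which is why the binaries and
	// their dependencies still have to come along for the ride.
	if err := nft.Run(context.TODO(), tx); err != nil {
		log.Fatalf("applying transaction: %v", err)
	}
}
```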

A pure Go package for manipulating nftables 🎉

I’ve been watching https://github.com/google/nftables for quite some time, since November 2023 in fact. The thing that was missing was the capability to implement SNAT rules in this pure Go implementation. Luckily that feature was implemented not too long after, but then life gets in the way and you’re too busy climbing a mountain or doing something else.
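
To show what pure Go actually means here, below is a minimal sketch of a 1:1 SNAT rule built with google/nftables, which speaks netlink directly instead of shelling out to any binary (the table name and addresses are invented, and kube-vip’s real rules are more involved than this):

```go
package main

import (
	"log"
	"net"

	"github.com/google/nftables"
	"github.com/google/nftables/expr"
	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical addresses, purely for illustration.
	podIP := net.ParseIP("10.0.0.5").To4()
	vip := net.ParseIP("192.168.0.100").To4()

	// Conn talks netlink directly; no nft or iptables binaries required.
	conn, err := nftables.New()
	if err != nil {
		log.Fatalf("opening netlink connection: %v", err)
	}

	table := conn.AddTable(&nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "kube-vip-egress",
	})

	chain := conn.AddChain(&nftables.Chain{
		Name:     "postrouting",
		Table:    table,
		Type:     nftables.ChainTypeNAT,
		Hooknum:  nftables.ChainHookPostrouting,
		Priority: nftables.ChainPriorityNATSource,
	})

	// Equivalent of: ip saddr 10.0.0.5 snat to 192.168.0.100
	conn.AddRule(&nftables.Rule{
		Table: table,
		Chain: chain,
		Exprs: []expr.Any{
			// Load the IPv4 source address (offset 12, length 4) into register 1.
			&expr.Payload{
				DestRegister: 1,
				Base:         expr.PayloadBaseNetworkHeader,
				Offset:       12,
				Len:          4,
			},
			// Match it against the pod address.
			&expr.Cmp{Op: expr.CmpOpEq, Register: 1, Data: podIP},
			// Load the VIP into register 1 and SNAT to it.
			&expr.Immediate{Register: 1, Data: vip},
			&expr.NAT{
				Type:       expr.NATTypeSourceNAT,
				Family:     unix.NFPROTO_IPV4,
				RegAddrMin: 1,
			},
		},
	})

	// Nothing hits the kernel until Flush() sends the batch over netlink.
	if err := conn.Flush(); err != nil {
		log.Fatalf("applying nftables rules: %v", err)
	}
}
```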