thebsdbox

Adding a gateway to kube-vip

Posted on 2026-01-14 Edited on 2026-01-15 In blog

So …

That was over a year ago, and well, I didn’t write a service mesh (in the traditional sense), although I did write some interesting bits and finally did some MutatingWebhook stuff. My initial idea was to use sidecars to implement all of the magic proxy stuff, but I largely ended up getting stuck on some issues around Kubernetes v1.33 and EphemeralContainers, which have largely been fixed at this point. So given the Christmas break I decided to have another crack at it!

tl;dr where’s the code -> https://github.com/kube-vip/kube-gateway

kube-gateway (v0.0001)

The architecture for the kube-gateway (today) consists of:

The gateway watcher
The proxy

The gateway watcher

This is currently just a pod that has the permissions to watch for pod updates, such as an IP address being assigned or an annotation being applied. When the correct annotations are applied to a pod, the watcher will modify the pod.Spec, which I hear you cry 😿 is READ-ONLY! This is true (partially), but the read-only sections typically mean I can’t add a volume/configmap/secret, and that can make it impossible to add things like secrets or sidecars to a pod at run-time. However, it turns out that pod.Spec.EphemeralContainers is read-write, so this does give us an opportunity to add an additional container, although we still can’t add any volumes as they are immutable.

So to get around this the watcher has this snippet of code:

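// Create a certificate for the pod, based upon its name and IP address
i.c.createCertificate(pod.Name, pod.Status.PodIP)

// Load the certificate into a Kubernetes secret
err := i.c.loadSecret(pod.Name, i.clientset)
if err != nil {
	// slog.Error takes a message followed by key/value pairs
	slog.Error("unable to create secret", "pod", pod.Name, "err", err)
}

// Expose the secret to the ephemeral container (ec) as environment variables
ec.EnvFrom = append(ec.EnvFrom, v1.EnvFromSource{
	SecretRef: &v1.SecretEnvSource{
		LocalObjectReference: v1.LocalObjectReference{
			Name: secret, // map to the secret we created earlier
		},
		Optional: nil,
	},
})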

The watcher itself has its own Certificate Authority certs and will use them to create new certificates for the pod (based upon its IP address), which are then loaded into a Kubernetes secret. Finally we can reference the secret inside the EphemeralContainer as environment variables, as this is allowed in the Kubernetes spec!

The annotations we apply are then typically passed to the EphemeralContainer as additional environment variables that the kube-gateway uses!
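To make the certificate step a little more concrete, here is a minimal sketch of what "create a certificate for the pod based upon its IP address" could look like using only the Go standard library. This is not the watcher's actual code: newCA and newPodCert are hypothetical helpers, the CA is a throwaway created on the spot, and the fixed serial numbers and 24 hour lifetime are just placeholders.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newCA creates a throwaway self-signed CA (hypothetical helper).
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1), // a real CA would use random serials
		Subject:               pkix.Name{CommonName: "example-gateway-ca"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	// Self-signed: the template is its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newPodCert signs a leaf certificate whose IP SAN is the pod IP, so the
// peer can verify it is talking to the address it dialled.
func newPodCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, podName, podIP string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	ip := net.ParseIP(podIP)
	if ip == nil {
		return nil, nil, fmt.Errorf("invalid pod IP %q", podIP)
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: podName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		// mTLS means the same certificate acts as both server and client.
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		IPAddresses: []net.IP{ip},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	cert, _, err := newPodCert(ca, caKey, "pod-01", "10.0.0.10")
	if err != nil {
		panic(err)
	}
	fmt.Println("signed by CA:", cert.CheckSignatureFrom(ca) == nil)
	fmt.Println("valid for 10.0.0.10:", cert.VerifyHostname("10.0.0.10") == nil)
}
```

The important detail is the IPAddresses field: because the pod IP is only known once the pod is running, the certificate has to be minted after the fact, which is exactly why the secret is created by the watcher rather than up front.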

Annotating your pod for mTLS 🔐

To enable mTLS between two pods we will need to apply a gateway to each pod (to handle the encryption and decryption on either side).
BONUS: If you want to offload the TLS encryption the kube-gateway supports kTLS (in-kernel TLS)!

To enable this, each pod will need the following annotation before encryption is enabled:

kubectl annotate pod <pod name> kube-gateway.io/ktls="true"

Encryption itself is then enabled with the following annotations.

This will apply the gateway to pod-01:

kubectl annotate pod pod-01 kube-gateway.io/encrypt="true"

This will then apply the other gateway to pod-02:

kubectl annotate pod pod-02 kube-gateway.io/encrypt="true"

Annotating your pod for AI 🤖

Annotate your LLM

This is highly experimental at the moment, but the feature set will expand as it is worked on 😀

As of today (15th January 2026) the implementation expects your LLM to be running somewhere within the cluster. This is a self-imposed limitation of the kube-gateway (and will be changed soon).

A kube-gateway endpoint is required (today) in order to facilitate end-to-end connectivity, so we will need to annotate our LLM with a gateway endpoint annotation:

kubectl annotate pod -n ollama ollama-bbbc54cbc-8mqt5 kube-gateway.io/endpoint="true"

Now anything that is interacting with it can be modified with our gateway annotations!

Annotate your AI workloads

Imagine we have some workloads created a long time ago (3 weeks in the world of AI 🙄), and these workloads are using an old/outdated/expensive model. Although the workloads perform OK, we want to improve them. With the kube-gateway we can modify this on the fly.

Depending on the workload type, the language, or the SDK used for the connection to the LLM, an HTTP keep-alive may be in use. If this is the case then the existing TCP session will be used for all API calls (and we won’t be able to pass it to the gateway). The netflush annotation instructs the kube-gateway pod on startup to use some eBPF 🐝 magic to knock the TCP session into being recreated without disturbing the connection. (This annotation is optional and will need investigating; the Python ollama SDK doesn’t require it, for example.)

kubectl annotate pod aiworkload kube-gateway.io/netflush="true"

The next annotation for our aiworkload changes the model:

kubectl annotate pod aiworkload kube-gateway.io/ai-model="gemma2:2b"

Now whenever there is a request our kube-gateway will modify the traffic transparently to use the model specified!
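As a rough idea of what "modify the traffic transparently" means for the model annotation, here is a minimal sketch that swaps the top-level "model" field of a JSON request body. It assumes an Ollama-style payload where the model name lives in a top-level "model" key; rewriteModel is a hypothetical helper and not the gateway's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rewriteModel replaces the top-level "model" field of a JSON request body
// and leaves every other field untouched. Bodies without a "model" field
// are returned as-is.
func rewriteModel(body []byte, model string) ([]byte, error) {
	// RawMessage keeps the other fields byte-for-byte, so we never have to
	// understand the rest of the payload.
	var payload map[string]json.RawMessage
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil, err
	}
	if _, ok := payload["model"]; !ok {
		return body, nil
	}
	quoted, err := json.Marshal(model)
	if err != nil {
		return nil, err
	}
	payload["model"] = quoted
	return json.Marshal(payload)
}

func main() {
	in := []byte(`{"model":"llama3","prompt":"hi"}`)
	out, err := rewriteModel(in, "gemma2:2b")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Anything sitting in the middle of an HTTP stream like this would also need to fix up the Content-Length header, because the body changes size when the model name changes.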

Finally enable the kube-gateway on the workload:

kubectl annotate pod aiworkload kube-gateway.io/ai="true"

eBPF, mTLS and AI oh my..

So under the covers what is actually occurring in order to facilitate this?

Well simply put there are three parts to it:

The watcher takes input and adds an ephemeral container to a specified workload
Some eBPF code is injected to ensure that traffic from the correct process has its original destination captured and is now sent to a new destination, which is our gateway!
Our gateway in the ephemeral container receives the connection and will look up the original destination. We can make changes to the TCP stream/data (such as in the AI use-case), or we can encrypt it and send it to the other pod, which now also has a gateway attached as an ephemeral container to decrypt the traffic and send it to the original application.

What’s next?

At this point we can easily sit in the middle of any traffic and transparently mutate it, which gives us a huge opportunity to add in all sorts of guardrails and monitoring.

mTLS

The main goal in the future would be to redirect to a per-node gateway, which should be doable and drastically reduce the need for a gateway per pod architecture.

AI workloads

Remove the need for a gateway endpoint, which means that only the “client” side would need to have a gateway attached, and the “client” could then connect to anything transparently. Additionally, extend the functionality of what we may want to mutate in the AI payload, such as limiting tokens or having rules about prompts. I guess we shall see!

Either head to kube-gateway.io or the GitHub repository https://github.com/kube-vip/kube-gateway

Egress v2 in kube-vip

Posted on 2025-07-17 Edited on 2025-07-29

I’m not sure if it was my time working for Docker or just the general confusion around what various companies and startups are doing around container images, but I’ve always opted to go FROM scratch where possible. Several years ago we (me) wanted to implement egress control within kube-vip, and although up to that point I didn’t really want to be involved in the datapath, I decided to go for it. My main concern about the datapath is that people get very upset when their packets arrive backwards or get lost 😱. However, after a bit of research and seeing what the rest of the world was doing, I decided to opt for a slightly different (if not weird) approach, which was to tie the service to a pod and map the external address 1:1 with a pod. There is a ton of detail about that in this blog post 👉 https://thebsdbox.co.uk/2024/11/03/Egress/.

iptables & nftables

In order to implement the egress source NAT I decided to largely duplicate the behaviour of kube-proxy, which is to SNAT the address of the pod to the address of the load balancer service and job done 🎉. However, in order to do that we needed to start adding iptables binaries into the kube-vip images. This suddenly means adding in other dependencies, and as we all know more dependencies means a larger attack surface. Under most circumstances a lot of people wouldn’t mind and/or care, but unfortunately kube-vip serves as the control plane HA VIP and is in a critical position, so exploiting it would be a big issue. So now we introduce two kube-vip images, one from scratch and another with iptables and all of its dependencies.

tech-debt = tech-debt + 1;

The state of iptables & nftables

What, you want some details.. fine

So we’ve been using iptables for years and it largely works; it has its issues with scale, but so does everything at one point or another. One of the main issues has been programming rules or lower level abstractions: given we’re ultimately programming the kernel, can we get rid of all the binaries? Well the answer to that is:

The answer unfortunately is: No.

Additionally:

We don’t guarantee a stable interface

… awesome, and it has been this way for years and years.

When is iptables not iptables

When it is nftables, of course 🤷🏼‍♂️. It turns out that for the last few years all of our legacy iptables code hasn’t been churning out iptables rules, you’ve been living a lie! Feel free to read this madness at your earliest convenience. But the tl;dr is that in most places the iptables binary is actually a new program that will take your iptables syntax and pump out nftables rules.

So why is no-one writing nftables rules?

Well, because you didn’t have to: if the iptables rules are being rewritten on the fly as nftables rules then just leave it to that. Additionally, the syntax is dramatically different to iptables rules, which means that it comes with its own learning curve. This and a few buggy releases of the userland tooling have unfortunately slowed its adoption. However, recently there have been quite a few things happening in the nftables space! Namely, kube-proxy gained an nftables mode (alpha in Kubernetes 1.29, beta in 1.31), which is detailed here.

The (finally) growing ecosystem for nftables

The Kubernetes project has finally started to embrace nftables and has opted to write a wrapper around nftables, which is called knftables. This is a wrapper around the nftables binaries, which unfortunately still requires bringing along the nftables binaries, the libraries and any other dependencies. But what is there if you’re a slightly mad purist who demands scratch?

A pure Go package for manipulating nftables 🎉

I’ve been watching https://github.com/google/nftables for quite some time, actually since November 2023. The thing that was missing from this was the capability to implement SNAT rules in this pure Go implementation. Luckily this feature was implemented not too long after, but then life gets in the way and you’re too busy climbing a mountain 🧗🏻 or doing something else 🚵🏻.

However, several weeks ago I finally managed to find some time to investigate what the code would look like in order to rip out all the external dependencies and program the kernel directly!

Egress v2 finally 🥹!

We’re calling this Egress v2 as it’s a complete re-write of how egress is handled within kube-vip, although to make things as simple as possible we’re just adding an additional annotation to the service that will be used to control egress :-)

kube-vip.io/egress: true - will enable the service to be used as the egress for the pod it is selecting
kube-vip.io/egress-internal: true - will use the internal/Egress v2 mechanism for configuring the source NATing

This move also simplifies the rules that need writing within the kernel. Before, we would have to write multiple rules to ensure that we didn’t egress when a pod tried to talk to another pod or a service (this would break everything). Additionally, we added the capability to add additional subnets that shouldn’t have the egress address changed, or to only egress when sending traffic to a specific destination port. All of this functionality created additional rules that were hard to order or could be accidentally left behind when cleaning rules (on service deletion).

Egress v2 tables and chains

To make it super easy to ensure that kube-vip orders and cleans its egress configuration correctly it will do the following:

Create a table for IPv4 (table ip kube_vip_v4) and a table for IPv6 (table ip6 kube_vip_v6)
Each service creates its own chain, for example chain kube_vip_snat_v4_abc. The chain name contains the IP version (v4/v6) and, more importantly, the UUID of the service (abc in reality would be much longer)
In each chain is the rule for a particular service

Example rule: ip saddr { 10.0.0.1 } ip daddr != 10.0.0.0-10.0.255.255 ip daddr != 10.96.0.0-10.96.255.255 snat to 11.0.0.2
The above rule is typical of the rules created for egress, where we ensure that we don’t egress for the podCIDR or the serviceCIDR. We can see that the pod IP 10.0.0.1 is source NAT’d to 11.0.0.2 when not going to those networks.

When a service is deleted we retrieve its UUID along with the IP address version and simply delete the chain, and everything is completely tidy.

Additional details: if kube-vip is forcefully restarted then it won’t garbage collect old rules. However, we can restart kube-vip with the environment variable egress_clean set, which will clean all rules from the kube-vip tables, ensuring a clean slate as kube-vip goes back to hosting load balancing services.
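To show how the pieces of the example rule fit together, here is a small sketch that derives the chain name and the rule text from a pod IP, the CIDRs to exclude and the egress address. It only renders strings in nft syntax for illustration; it does not use the github.com/google/nftables API and it is not how kube-vip builds its rules internally. chainName, rangeOf and snatRule are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/netip"
)

// lastAddr returns the highest address in a prefix by setting every host bit.
func lastAddr(p netip.Prefix) netip.Addr {
	b := p.Masked().Addr().AsSlice()
	bits := p.Bits()
	for i := range b {
		for j := 0; j < 8; j++ {
			if i*8+j >= bits {
				b[i] |= 1 << (7 - j)
			}
		}
	}
	a, _ := netip.AddrFromSlice(b)
	return a
}

// rangeOf turns a CIDR such as 10.0.0.0/16 into the first-last form used in
// the example rule (10.0.0.0-10.0.255.255).
func rangeOf(cidr string) (string, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", err
	}
	p = p.Masked()
	return fmt.Sprintf("%s-%s", p.Addr(), lastAddr(p)), nil
}

// chainName follows the kube_vip_snat_<version>_<service UID> pattern.
func chainName(ipVersion, serviceUID string) string {
	return fmt.Sprintf("kube_vip_snat_%s_%s", ipVersion, serviceUID)
}

// snatRule renders an IPv4 rule: SNAT the pod to the egress address unless
// the destination is inside one of the excluded CIDRs.
func snatRule(podIP, egressIP string, excluded ...string) (string, error) {
	rule := fmt.Sprintf("ip saddr { %s }", podIP)
	for _, cidr := range excluded {
		r, err := rangeOf(cidr)
		if err != nil {
			return "", err
		}
		rule += " ip daddr != " + r
	}
	return rule + " snat to " + egressIP, nil
}

func main() {
	rule, err := snatRule("10.0.0.1", "11.0.0.2", "10.0.0.0/16", "10.96.0.0/16")
	if err != nil {
		panic(err)
	}
	fmt.Println("chain", chainName("v4", "abc"))
	fmt.Println(rule)
}
```

Because each service owns exactly one chain, cleanup never needs to diff individual rules: recompute the chain name from the service UID and IP version, then delete that chain.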

The TL;DR

We’re going to be removing all external dependencies from kube-vip, so no more programs or other tooling to manage the networking (great for image size and security)! Kube-vip can now speak directly to the kernel and configure your egress rules, in a much better way than before! 🤩

Coming soon!

All of the code has been written and is ready to merge at this time, and we’re hoping to get kube-vip v1.0 into your hands soon! 🎉

Continuing building your service mesh

Posted on 2024-12-02

In a previous post I detailed (in my head at least) the “shopping list” of bits needed to implement a simple service mesh, you can read that here https://thebsdbox.co.uk/2024/11/30/Building-your-own-service-mesh/. Whilst that post covered some of the theoretical bits and the eBPF 🐝 magic, my aim is to wrap up all of the other pieces needed here.

The proxy

In this “build your own” service mesh, the proxy will live next to the application that we care about.

OMG a sidecar 😱

How that sidecar gets there can be an interesting discussion, so let’s look at the choices.

You maintain some yaml, basically your deployments will need to have the sidecar added to them. This has to be before you deploy as you can’t add a sidecar to an existing pod.
But what about ephemeral containers! (I hear you ask), well they’re pretty good and yes you can add them to an existing pod. BUT, if you need to mount files (like 🔐 certificates), then Volumes need adding to the pod.spec and you can’t do that. AH HA! I hear you think, use secrets and environment variables! That way it doesn’t modify the main body of the pod.spec, just the pod.spec.ephemeralContainers[x]. Great idea, but it doesn’t work and there has been an issue open about it for nearly (checks notes) 2 years.
Ye olde sidecar injector! The de facto method of modifying a pod.spec before it’s actually committed to the Kubernetes API. (I won’t go into detail as people have written about this for some time)

So regardless of how the proxy gets there, it needs to be there. There will be a proxy in every application that we care about. When one pod wants to talk to another pod it will actually be the proxies that are doing the talking!

What is in a proxy

Our eBPF code
Code that will create connections and TCP listeners for sending and receiving traffic
The required certificates in order for traffic to be encrypted

Proxy startup

On startup our proxy will determine its own pid and, along with the pod CIDR range, add that to an eBPF map. It will then attach our eBPF programs to the kernel. Once that has occurred it will start the proxy listener, which is where our proxy will receive all traffic forwarded by our eBPF program. It will then read the certificates (from the filesystem or environment variables) and once they’re loaded it will start another listener for incoming TLS connections! That’s it.

Proxy running lifecycle

The proxy is listening on its internal proxy port
A new connection, hijacked by eBPF 🐝, is received on this port, and we do a getsockopt syscall with the option SO_ORIGINAL_DST. This returns to us the original destination address and port that this connection was heading to before we hijacked it with eBPF.
We create a new outbound connection to the original destination address, however we substitute the port with that of the proxy’s TLS listening port. This initiates a new TLS connection between both proxies!
The source proxy will send to the destination proxy the port that we were originally accessing.
The destination proxy will connect to that port and begin receiving traffic from the source.
At this point we have full end-to-end connectivity from one application to another, without the applications realising that we’re in the middle!

Creating Certificates 📝

In order for the TLS to work the certificates will need to be created with the correct details, namely the IP addresses of the pods, to ensure that the identification works correctly. This raises a chicken and egg scenario as ideally we require these details asap, however the pod is only allocated its IP address once it has been created by the Kubernetes API. As we can’t modify the Volume section of a pod once it has been created, we can instead refer to secrets as environment variables before they have been created.

We then write some code using Kubernetes informers (there is excellent detail here). These informers will “inform” us when a pod has been both created and updated. We care more about the update, as this is the operation where the pod.status.podIP will be populated with the address we care about. Once we have this we can create the required certificates and upload them as a secret to be used by the proxy container.

The final piece is the injector 💉

This is relatively straightforward: on startup this piece of code will register through the AdmissionController that certain resources (pods in our case) should be sent to our code when they are created. Our code will then patch the pod.spec to include our container as an initContainer, and the pod is sent back to the Kubernetes API server to be scheduled.
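A mutating webhook answers with a JSON Patch (RFC 6902) describing the change it wants, so the core of an injector boils down to building that patch. Below is a minimal sketch using only encoding/json. The container type is a cut-down stand-in for the real Kubernetes type, the image name is made up, and initContainerPatch is a hypothetical helper rather than the code in the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value any    `json:"value"`
}

// container is a cut-down stand-in for a Kubernetes container spec.
type container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

// initContainerPatch builds the patch that adds c as an initContainer.
// JSON Patch cannot append to a list that does not exist yet, so the
// whole list is created when the pod has no initContainers.
func initContainerPatch(hasInitContainers bool, c container) ([]byte, error) {
	op := patchOp{Op: "add"}
	if hasInitContainers {
		op.Path = "/spec/initContainers/-" // "-" means append to the list
		op.Value = c
	} else {
		op.Path = "/spec/initContainers"
		op.Value = []container{c}
	}
	return json.Marshal([]patchOp{op})
}

func main() {
	proxy := container{Name: "smesh-proxy", Image: "example.invalid/smesh-proxy:dev"}
	patch, err := initContainerPatch(false, proxy)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

In a real webhook these bytes would be placed in the admission response alongside a patch type of JSONPatch, and the API server applies them before the pod is persisted.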

In Summary

We “kind of” have the makings of a service mesh at this point (in my mind at least), we transparently move traffic from the application through our proxy, where we can apply what we wish. In this Proof of concept we newly mint certificates and then establish end to end mTLS, where traffic is encrypted between source and destination. Although that doesn’t mean we have to end there 😀

What next..

All the source code for this experiment is available at https://github.com/thebsdbox/smesh so feel free to go and have a look around. It isn’t the tidiest, but it does work 😂

Building your own service mesh

Posted on 2024-11-30 Edited on 2024-12-05

I saw a few mentions about “service mesh” and mTLS amongst other things during the KubeCon US week and given some of the messing around I’d been doing with eBPF recently I asked myself “how hard could it be to write one from scratch”?

The service mesh shopping list

There are a bunch of components that we will need to implement in order for us to implement the “service mesh” type behaviour. Most service meshes implement a heck of a lot more, we’re exploring the basics needed to implement it.

Traffic redirector 🚦

We need a way of taking traffic from an application and sending it elsewhere, typically to our proxy where we will potentially modify the traffic. The traffic needs to be redirected in a way where the application doesn’t need to know about it occurring, however we need to ensure that the traffic will reach its destination and traffic is returned in a way that makes sense to the application. In most circumstances this is handled by iptables rules that will change the source and destination of the packets as they navigate the kernel. As a pod initiates a connection to another pod within the cluster we will need to redirect it to our program, which we will call the proxy.

The Proxy

Our proxy will need to be listening somewhere that is accessible on the network, and as outbound connections are created their destination will be modified to that of the proxy (we also need to keep a copy of that destination somewhere). At this point we will start receiving data from the source, and it is here where we have the opportunity to change the original traffic, or to parse the traffic and then make decisions based upon what we learn.

The Injector 💉

The injector is code that will modify the behaviour of Kubernetes so that when new workloads are scheduled an additional container could be added, or something could run before the workload starts that will write iptables/nftables rules into the kernel.

Certificates 📝

If we want to use mTLS between pods then we will need to create certificates, and these certs will need things like the pod IPs or pod hostnames in order to work. Given that we won’t know these details until the pod starts, we will need to capture this information by watching Kubernetes and creating the certificates when we see a pod being created.

Let’s get started 🐝

If I can’t control the traffic then I can’t do anything, so first things first, I’m going to use eBPF in order to manipulate the traffic and make sure that it is sent to where I need it to go. Why eBPF? well because!

So let’s walk this through…

There are a bunch of methods for manipulating traffic: XDP, TC, sockets etc.. so what’s the choice?

XDP? Nope, no egress and if we’re wanting to capture traffic being initiated out to somewhere else, then that’s egress.
TC? It has egress, BUT it’s already gone through the kernel, iptables, sockets etc.. changing the traffic to send back into the kernel is a bit of a pain.
Sockets, seems like the best option for what we’re aiming for.

The eBPF 🐝 magic 🪄

Our eBPF code is going to manipulate the L3 & L4 behaviour of packets as they traverse the kernel and in some-cases user-land (i.e. the proxy).

The life of our packet is the following!

For this walkthrough:

pod-01 is 10.0.0.10
pod-02 is 10.10.0.20

Our eBPF program is started and is passed the CIDR range of pods in our Kubernetes cluster and the pid of the proxy, this is done through an eBPF map.
The application within the pod (pod-01) wants to create an outbound connection with connect(), in this case to pod-02. This would typically come from a high internal port, 32305 (for example), attempting to connect outbound.
The eBPF program will change the destination from 10.10.0.20 to the proxy that is listening on localhost, so 10.10.0.20:<port> would become 127.0.0.1:18000.
We also stuff the original destination address and port into a map, which uses the socket “cookie” as its key.
The proxy on 127.0.0.1:18000 will receive all the TCP magic from the application that started the connection and once the socket has been established we hook in with eBPF.
Here we will add to another map the source port 32305 and the unique socket “cookie”.
The proxy has an established connection from the application, however it needs to know the original destination. We find this by calling the getsockopt syscall with a specific option, SO_ORIGINAL_DST. This is captured by eBPF, which does a lookup on the src port 32305 to find the cookie, and then uses the cookie to look up the original destination 10.10.0.20:<port> in the other map.
The proxy can now establish a connection outbound to the destination pod or another proxy (this will be covered later).
As traffic is read() from the proxy it is then forwarded to the internal connection and the application in pod-01 processes it as if there was no proxy in the middle.

Why do we pass the `pid` of the proxy into the eBPF program? (I hear you ask)

Well, we would end up in a loop if the proxy had its outbound connections looped back to itself. So if we see a connection from the proxy then we don’t redirect it.

Abridged logs

$ kubectl logs  pod-01 -c smesh-proxy
[2024/12/02T10:17:58.618] [application] [INFO] [main.go:66,main] Starting the SMESH 🐝
[2024/12/02T10:17:58.618] [application] [INFO] [main.go:94,main] detected Kernel 6.8.x
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:23,startInternalListener] internal proxy [pid: 7] 127.0.0.1:18000
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:33,startExternalListener] external proxy [pid: 7] 0.0.0.0:18001
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:62,startExternalTLSListener] external TLS proxy [pid: 7] 0.0.0.0:18443

< Proxy is up and running>
< We receive a forwarded connection from eBPF 🐝 >

[2024/12/02T10:18:14.080] [application] [INFO] [connection.go:75,start] internal proxy connection from 127.0.0.1:33804 -> 127.0.0.1:18000

< We've looked up the connection through eBPF to find the original destination >
< The proxy connects to pod-02 (on its local proxy port, where it takes care of forwarding to the application in the same pod) and we can now start sending traffic from pod-01 through the proxy >

[2024/12/02T10:18:14.087] [application] [INFO] [connection.go:156,internalProxy] Connected to remote endpoint 10.10.0.20:18443, original dest 10.10.0.20:9000

< The application in pod-02 has established a new connection in the opposite direction >

[2024/12/02T10:18:16.081] [application] [INFO] [connection.go:95,startTLS] external TLS proxy connection from 10.10.0.20:47292 -> 10.0.0.10:18443

Summary

This post steps through the bits needed in order to form a service mesh and how we use eBPF in order to redirect traffic to another process listening within the same pod. We know that this is achievable, but we now need to understand how to architect these pieces and get traffic across to the other pod! (which I’ll cover in the next post)

UPDATE: That post is now available here

Egress

Posted on 2024-11-03 Edited on 2024-11-07

This is the third time I’ve tried to write this, a random panic on my Hackintosh wiped out half a blog post and the other attempt read like garbage 🫠. So here we go! I recently added Egress IPv6 to kube-vip, but it isn’t something that I really advertised as functionality in kube-vip, so I thought it best to actually write about it. The concept of Egress within Kubernetes is a bit of an interesting one because it also exposes a bit of confusion when it comes to Kubernetes networking. Namely what actually comes out of the box with a Kubernetes cluster.

Kubernetes Networking

When we think about a simple Kubernetes cluster (and deploying it), then we need to consider all of the additional components that are actually required in order for the cluster to actually work. These are commonly thought of as the CxI plugins, where the x is usually R for the runtime, N for the networking plugin and finally S for storage. Without a CRI, there is simply no functionality to stop and start containers, effectively making your cluster pointless, and additionally without a CNI your containers won’t have any networking capabilities! (well it turns out that’s not strictly true).

The CNI in its most basic form is largely doing:

Pod networking with IPAM
Multi node networking, allowing pods to speak to one another as if they were on a big flat network (often an overlay)
Network Policy
Ingress typically through hostPort

Above is a gross oversimplification of what CNIs accomplish, and most now do far more than just basic networking; however, there is a huge amount of networking functionality that isn’t explained. Who or what creates a Kubernetes service and manages the forwarding to a pod that is selected? And when a pod wishes to send traffic externally to the cluster, who or what handles that?

Kube-Proxy

So if you do a kubectl get pods -A on a cluster you may encounter kube-proxy pods, which typically run on all nodes as a daemonset and are largely there to manipulate the masses of iptables rules required in order to make the internals of a Kubernetes cluster work. Whenever a service is created kube-proxy will create rules that will watch for ingress traffic to that service address, additionally it also keeps track of the endpoints that match that particular service. These rules, regardless of the technology (currently iptables but soon nftables in upstream) will modify traffic that is destined for a service IP and change that destination to one of the endpoint IP addresses. There is a great overview (in epic detail) available at learnk8s.

Where things get interesting is egress traffic. If you had a pod that had busybox+curl and you were using that to curl https://google.com it will typically just work, but the thing about the pod network is that it isn’t meant to be directly accessible! So how does return traffic get to the pod?

So the answer is pretty simple, and it is pretty much the same logic that applies to most household internet connections. This logic is the idea of NAT (Network Address Translation). Simply put, like the pod in a Kubernetes cluster, your home laptop isn’t directly accessible on the internet, however the ADSL/Fibre router is. When an internal device (or pod) wants to connect to the outside world, the traffic will traverse either a physical router or routing logic in the kernel, and the source address is changed to the address of the router. So in the diagram below google.com will receive a connection from the external address, which it can communicate with. When the router (or node) receives return traffic for this connection it will change the address back to the internal address, and the traffic is then passed back internally.

┌────────────────────────────────────┐                                                 
│┌────────────────┐                  │                                                 
││Pod-01 10.0.0.3 ┼─────────┐        │            ┌───────────────────────────┐
│└────────────────┘         │        │            │ google.com                │
│                           │        │            └───────────────────────────┘
└───────────────────────────▼────────┘                          ▲              
                 Node-01 192.168.0.22───────────────────────────┘

So this is fine, effectively each node in the cluster will take care of handling the traffic originating from pods by making it look like it’s originating from the node where the pod resides and in most cases this works. However if your cluster has 1000 nodes and you have a pod that wants to access something external protected by a firewall, well that firewall will now need 1000 rules allowing access because who knows where that pod will be scheduled to run in the future. Or perhaps one of those modern app transformation programs actually works (I’ve been part of a myriad of failed attempts to create these programs) and you’ve finally migrated an application that has some static IP requirements, well again there is no way to have anything static in this scenario.

Egress Magic 🪄

The most common solution to implement “controlled” Egress within a Kubernetes cluster is the concept of an Egress gateway, and in most cases this is another small pod (or pods) that sits within the cluster and that traffic is forwarded to. This gateway will then do the same manipulation, however it will typically have a series of IP addresses that it can use dependent on who the originating traffic is coming from. So in theory we could tie our legacy application to its original IP address and the gateway will then take care of manipulating the traffic to make it appear as though this original IP address is still in use.

Node-01 192.168.0.22                                      
┌──────────────────────────────────────────┐                                    
│┌────────────────┐                        │                                    
││Pod-01 10.0.0.3 │                        │                                    
│└───────┬────────┘                        │                                    
│        │                                 │                                    
└────────┼─────────────────────────────────┘                                    
         │                                                                      
         │                                         ┌───────────────────────────┐
         │                      ┌─────────────────►│ Google.com                │
┌────────┼──────────────────────┼──────────┐       └───────────────────────────┘
│        │              ┌───────┼────────┐ │                                    
│        └─────────────►│Egress Gateway  │ │                                    
│                       └────────────────┘ │                                    
│                                          │                                    
└──────────────────────────────────────────┘                                    
                      Node-02 192.168.0.23

Egress with kube-vip 🐙

A few years ago I popped into the #sig-networking channel (excellent place to discuss ideas btw) to discuss egress within Kubernetes as I wanted to implement it within kube-vip. There are a few projects that have implemented it through different mechanisms in the past, but largely if you wanted to implement it with a stable architecture the obvious solution was using a gateway (as discussed above). After fiddling around with the rules that kube-proxy implements I decided that perhaps there was an alternative solution!

The kube-vip project is everyone’s (some people’s) go-to project when wanting to deploy a load balancer solution within a Kubernetes cluster, and it is pretty devastatingly simple. Apply a load balancer address to a node, and then when traffic is sent to it the traffic makes its way to a pod via kube-proxy (easy)!

┌─────────┐                                                                 
│Client🧑‍💻 ┼────────────────────► 192.168.0.100───────►Node-01 192.168.0.22  
└─────────┘                     ┌──────▲──────────────────┬────────────────┐
                                │      │                  │                │
                                │┌─────┼──────┐         ┌─▼──────────────┐ │
                                ││kube-vip🐙  │         │Pod-01 10.0.0.3 │ │
                                │└────────────┘         └────────────────┘ │
                                └──────────────────────────────────────────┘

The same, but backwards

For Egress, could we potentially combine some of the behaviour of both kube-proxy and kube-vip to produce a stable Egress solution? It turns out yes, we can, and again it is devastatingly simple. To accomplish this we effectively overload the behaviour of a Kubernetes service by using it to define not only the ingress (as a loadbalancer) but also the egress with additional iptables rules. This creates a 1:1 relationship between a service and a pod and uses the service IP as the pod’s ingress and egress address.

┌────────────┐                                                                    
│Server      │◄──────────────────────┐                                            
└────────────┘                       │192.168.0.100           Node-01 192.168.0.22
                                      ┌──▲────▲──────────────────────────────────┐
                                      │  │    │               ┌──────────┐       │
                                      │  │    └───────────────┤Kernel 🧠 │◄─┐    │
                                      │  │                    │          │  │    │
                                      │  │        ┌──────────►│(iptables)│  │    │
                                      │  │        │           └──────────┘  │    │
                                      │┌─┼────────┼─┐         ┌─────────────┼──┐ │
                                      ││kube-vip 🐙 │         │Pod-01 10.0.0.3 │ │
                                      │└────────────┘         └────────────────┘ │
                                      └──────────────────────────────────────────┘

If we look at the above diagram we’ve created a service that selects Pod-01 and the following will happen:

The service load balancer is attached to the node where Pod-01 currently resides
The Kernel is updated so that traffic leaving Pod-01 is rewritten to look like it is coming from the load balancer, so effectively 10.0.0.3 is re-written to become 192.168.0.100.
As the process sends traffic or initiates connections to the Server they are now all coming from the externally facing load balancer address.
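As a rough sketch (this is not kube-vip’s actual code), the rewrite described in the steps above boils down to a source NAT rule. The helper name egressSNATArgs is made up for illustration and simply builds the arguments you would hand to iptables; a real rule set would also need to leave in-cluster traffic alone.

```go
package main

import (
	"fmt"
	"net"
)

// egressSNATArgs builds the arguments for an iptables rule that rewrites the
// source address of traffic leaving a pod so that it appears to come from
// the load balancer address. Hypothetical helper for illustration only.
func egressSNATArgs(podIP, vip string) ([]string, error) {
	if net.ParseIP(podIP) == nil {
		return nil, fmt.Errorf("invalid pod IP %q", podIP)
	}
	if net.ParseIP(vip) == nil {
		return nil, fmt.Errorf("invalid load balancer IP %q", vip)
	}
	return []string{
		"-t", "nat", // NAT table
		"-A", "POSTROUTING", // applied as the packet leaves the node
		"-s", podIP, // only traffic originating from this pod
		"-j", "SNAT", // rewrite the source address...
		"--to-source", vip, // ...to the load balancer address
	}, nil
}
```

Calling egressSNATArgs("10.0.0.3", "192.168.0.100") gives the arguments for the Pod-01 example in the diagram; removing the rule again (when the pod goes away) would be the same arguments with -D in place of -A.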

In the event Pod-01 is killed or crashes then kube-vip is notified through a Kubernetes watcher, where the pod delete event will result in the egress rules being cleaned up. When the Pod reappears on another node, the egress rules will be re-applied and the traffic will continue to appear to come from the same IP address, so we get a level of High Availability for free.

Where is the eBPF 🐝

I’ve wanted to re-implement this functionality in eBPF for quite a long time, however it simply isn’t as easy as I hoped.

“But isn’t there an egress hook for TC?”

Indeed there is, however this hook is the last step in the chain before the packet goes back onto the NIC and leaves the machine. What this effectively means is that all of the netfilter (iptables & kube-proxy) magic will have already modified the packet before our eBPF program can see it.

So for now the plan is to migrate kube-vip to use Google’s nftables Go library and we’ll see what happens next.

Thanks

Lars Ekman
Antonio Ojea
The other folks that listened to my daft ideas in #SIG-NETWORK

Highly Available workloads across Multiple Kubernetes Clusters

Posted on 2024-05-11 Edited on 2024-05-14

Our Journey begins in 2003 where I somehow blagged a Unix role, largely due to experience with FreeBSD and sparse access to a few Unix systems over the years. This first role consisted of four days and twelve hours per shift, where we were expected to watch for alerts and fix a myriad of different systems ranging from:

Every alert in HP Openview (which you can see in the screenshot below) would lead down a rabbit hole of navigating various multi-homed jump boxes in order to track down the host that had generated the alert.

OpenView

Once we finally logged onto the machine we were often presented with an obscure version of Unix, a completely random shell or a user-land set of tooling that was incomprehensible.. it was ace.

Typically in this environment the architecture was 1:1, as in one server would host one big application. Although there were some applications that had strict demands and in some cases penalties in the event of downtime, it was these applications that would often make use of various clustering technologies of the time in order to provide high availability. Given the infrastructure was a complete hodgepodge of varying systems it would stand to reason that the clustering software would follow suit, this meant that we were presented with systems such as:

High Availability

As mentioned above in this place of work some applications simply weren’t allowed to fail as there would be a penalty (typically per minute of downtime charges) so for these applications a highly available solution was required. These solutions exist to keep an application as available to end users as possible, so in the event an application crashes then it’s the clustering software’s job to restart it. Although you could create your own with a bash script:

#!/bin/bash
while true
do
  echo "Starting program"
  /usr/bin/exciting_program
  sleep 1
done

So restarting the program to ensure availability is one thing, but what about things such as OS upgrades or hardware failures? In those cases the application will cease to run as the system itself will be unavailable. This is where multiple systems are clustered together in order to provide a highly available solution to any downtime: when a system becomes unavailable it’s the clustering software’s role to make a decision about how and where to restart the application. A lot of my time was spent fighting with Sun Cluster and how it would implement high availability 🫠

Implementing High Availability (back in the day)

In these systems there were a number of pre-requisites in order for high availability to work

Shared storage

If the application had persistent data, and they pretty much all were based upon Oracle databases back then, then this underlying storage needed to be shared. This is so that in the event of a failure the storage can be mounted on the node that is selected to take over the workload.

Re-architecting the application

This doesn’t technically mean re-writing the application, it means writing various startup and shutdown scripts with logic in them in order to ensure that the clustering software can successfully complete them without them ending in a failed state.

Networking

If the application was moving around during a failover then external programs or end users still needed to access it with the same address/hostname etc.. so in the event of a failover a virtual IP (VIP) and hostname will typically be added to the node where the application is failing over to.

Quorum

In order for a node to become the chosen one in the cluster a quorum device should be present in order to ensure that a decision can be made about who will take over the running of applications, and in the case of a network connection failure between nodes that a “split brain” scenario can’t occur.

Split Brain will occur when cluster nodes can no longer communicate, leading them to believe that they’re the only nodes in the cluster and thus voting for themselves in order to become the leader. This would lead to multiple nodes all accessing the same data or advertising conflicting IP addresses, and generally causing chaos.
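To make the voting a bit more concrete, here is a minimal sketch of a strict-majority check (my own illustration, not how Sun Cluster or any particular product implements it). The function name hasQuorum is made up for the example.

```go
package main

// hasQuorum reports whether a partition that can reach `reachable` voters
// (counting itself) out of `total` configured voters holds a strict majority.
// Only a partition with a strict majority may run the applications, so two
// isolated halves can never both win.
func hasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}
```

With two nodes and no quorum device, a network split leaves each node with 1 vote out of 2, so hasQuorum(1, 2) is false for both and nobody can safely take over. Adding a quorum device as a third vote means the node that can still reach it has hasQuorum(2, 3) and wins, while the other node does not.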

Did HA work?

Sometimes

One evening I was asked to lead a large planned change to upgrade the hardware on a mighty Sun e10K that was being used to host arguably one of the most important applications within the entire company. The plan was relatively straightforward:

1. Log into node1, check it was primary and running the application.
2. Log into node2, ensure its health (hardware/disk space etc..)
3. Initiate a failover on node1
4. Watch and ensure services came up on node2 and validate the application was healthy and externally accessible
5. Upgrade hardware on node1
6. Check node1 is healthy again (after upgrade)
7. Initiate a failover on node2, back to node1
8. Realise something is wrong
9. Panic
10. Really panic
11. Consider fleeing to a far flung country

So what went wrong?

The application failover from node1->node2 went fine, we saw the application shutdown followed by the shared storage being unmounted and finally the network information removed from node1. Then on node2 we witnessed the storage being mounted, the networking details being applied followed by the application being restarted. We even had the applications/database people log in and watch all the logs to ensure that everything came up correctly.

When we failed back things went very wrong, the first application/storage/network all moved back however the second application stopped and everything just hung. Eventually the process exited with an error about the storage being half remounted. The app/database people jumped onto the second node to see what was happening with the first application whilst we tried to work out what was happening. Eventually we tried to bring everything back to node2 where everything was last running successfully and again the application stopped and the process timed out about the storage 🤯

At this point we had a broken application spread across two machines trying to head in opposite directions but stuck in a failed state, at this point various incident teams were being woken up and various people started prodding and poking things to fix it.. this went on for a few hours before we worked out what was going on. (Spoiler it was step 4)

So this change was happening in the middle of the night, meaning that ideally no-one should really be using it or noticing it not working for the “momentary” downtime. One of the applications team had opened a terminal and had changed directory to where the application logs were (on the shared storage) in order to watch and make sure the application came up correctly. However, this person then went to watch TV or get a nap (who knows) leaving their session logged on and living within the directory on the shared storage. When it came to failing the application back the system refused to unmount the shared storage as something was still accessing the filesystem 🤬 .. even better when we tried to bring the other half of the application back it failed because someone was looking at the database logs when it attempted to unmount the shared storage for that 🫠

I think it was this somewhat stressful experience that weirdly made me want to learn more about Sun cluster and other HA solutions, and here we are today :-D

High Availability in Kubernetes

Much like that script I embedded above, HA is endlessly performed by Kubernetes in what is typically referred to as the “reconciliation loop”. The reconciliation loop’s role is largely to compare expected state and actual state and reconcile the difference, so if the expected state is 3 pods and there is only 1 then schedule 2 more etc. Additionally within the Kubernetes cluster (actually it comes from etcd but 🤷🏼‍♂️) is the concept of leader election, which allows things running within the cluster to use this mechanism to elect a leader amongst all participants. This mechanism allows you to have multiple copies of an application running and with a bit of simple logic ensure that only the active/leader instance is the one that is actually doing the processing or accepting connections etc.
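A toy version of one pass of that comparison might look like the sketch below. The cluster type and reconcileOnce are names I’ve invented for illustration; real Kubernetes controllers work against the API server rather than a counter.

```go
package main

// cluster is a stand-in for "actual state": how many pods are running.
type cluster struct {
	running int
}

// reconcileOnce compares the desired pod count with what is running and
// closes the gap, returning how many pods were started or stopped.
func (c *cluster) reconcileOnce(desired int) (started, stopped int) {
	switch {
	case c.running < desired:
		started = desired - c.running
	case c.running > desired:
		stopped = c.running - desired
	}
	c.running += started - stopped
	return started, stopped
}
```

Starting from 1 running pod with a desired state of 3, the first pass starts 2 pods and every later pass does nothing until the actual state drifts again.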

High Availability across Kubernetes

Perfecting Protocol Parsing (Probably) with eBPF

Posted on 2024-01-15 Edited on 2024-01-16

I recently had a little bit of time to kill and decided to see if I could actually do some parsing of other protocols with eBPF. The previous post that I created was about http and whilst it’s an important application protocol to be able to read and potentially manipulate, it feels like there was only so much that could be done. Webpages are highly dynamic and can contain large amounts of data, which are qualities that aren’t always the best to try and parse with eBPF.

So my next attempt was to see how difficult it would be to parse something a bit spicier 🌶️! So I recently wrote a basic parser for BGP messages, which originally was designed to just parse the first bit of data to understand the different message types and give some insight into what was occurring when BGP peers are sending info back and forth. It evolved over the weekend a little bit and now understands peering information, and before I decided to write this it gained the ability to manipulate the data between peers (without the BGP software being aware).

Code is available here

So to begin we will need to do what we always do when we have some network data (the socket buffer skb) in eBPF, which is to check it’s Ethernet->IP->TCP/UDP and strip off the headers once we are looking at the correct traffic. This is covered in the previous two eBPF posts, and is in all of the example code so I won’t ~~duplicate~~ triplicate the code here. With all of these headers removed (I say removed, we just move the pointer past them, a bit like the needle on a record player) we’re left with the data portion. With our raw data remaining we now need to convert this into a format that matches the protocol itself, so let’s start there!

Protocols

A lot of these protocols are pretty old, and are detailed in documents called Request for Comments or an rfc. These documents put together by experts in the field largely define the architecture of a protocol and a good example, which I used in order to parse HTTP is this one and you can see that this was originally authored in 1999.

So let’s get to the crux of it, if you’ve been working with JSON/YAML/XML etc. or anything else that is obviously structured then abandon hope all ye who enter 😂 Almost every protocol has its own unique way of how it structures data, some are cleaner than others. To begin with BGP seemed pretty straightforward…

To begin with we’ll need to use the rfc document for the BGP standards, quickly reading through this we can understand that every BGP message starts with the same “fixed size” header:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Marker                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Length               |      Type     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

(the diagrams in rfcs are a tad confusing, however the descriptions are a bit clearer)

Simply put the marker should be 16 octets! (aka 16 bytes), the length should be 2 octets (2 bytes or 16 bits) and the type is 1 octet (1 byte or 8 bits) and with this information we can create a structure to put the raw data in that will allow us to shape it into the BGP Message header.

struct bgp_message {
	__u8 marker[16];
	__u16 length;
	__u8 type;
};

The joys of padding

If we look at our struct above we can see marker is 16 bytes, length is 2 and type is 1 giving us a grand total of (drum roll 🥁) … 19 bytes. So why oh why, when we do a sizeof(bgp_message) do we end up with 20 bytes 🤯 This was specifically an issue with the BGP keep alive messages that consist of just a BGP message (Type set to 4), where I would attempt to read the BGP message header (expecting it to be 19 bytes) and the compiler was trying to read 20, which was obviously 1 byte too many causing the load_bytes function to fail.

So after some annoying failed attempts to copy 20 bytes out of 19, I realised that my bgp_message struct was probably being padded; this is the process of adding some additional bytes to make it more efficient for the CPU to load and store the data. More detail is available here. In most cases it’s not a problem, however we need everything to align perfectly, so to set packing per byte we can add #pragma pack(1) (which effectively disables padding). Now our struct is the correct size and we will be able to retrieve data from the skb without causing any errors.
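The same 19 versus 20 surprise is easy to reproduce from user-land. The sketch below is in Go rather than the C used for the eBPF program, and bgpHeader and headerSizes are names I’ve made up for it; it simply compares the in-memory size of an equivalent struct with the number of bytes the fields occupy on the wire.

```go
package main

import (
	"encoding/binary"
	"unsafe"
)

// bgpHeader mirrors the fixed BGP message header: 16 + 2 + 1 bytes.
type bgpHeader struct {
	Marker [16]byte
	Length uint16
	Type   uint8
}

// headerSizes returns the size of the struct as laid out in memory (which
// includes a trailing padding byte so the uint16 stays aligned) and the size
// of the fields packed back to back, which is what arrives in the packet.
func headerSizes() (inMemory, onWire int) {
	var h bgpHeader
	return int(unsafe.Sizeof(h)), binary.Size(h)
}
```

On the usual platforms this returns 20 and 19, which is exactly the one byte difference that made the load fail for a bare KEEPALIVE.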

Getting data from the `skb`

So we should have a variable that points to the location in the skb where the data lives, that is, after the frame/IP/TCP headers have ended; in my code it’s usually poffset. We will create a variable called bgpm that we will now populate with the bytes from the skb using the bpf_skb_load_bytes function.

struct bgp_message bgpm;
int ret = bpf_skb_load_bytes(skb, poffset, &bgpm, sizeof(bgpm));
if (ret != 0) {
	bpf_printk("error %d",ret); // if we can't load the data print the error message
	return 0;
}

(We can see that the sizeof(bgpm), with the padding enabled was causing this to fail as there were only 19 bytes left in the skb and we were trying to load 20 🙄)

Once we have the header, we need to move our poffset so that we point to whatever exists after the header

poffset += sizeof(bgpm); // remove header

Understanding application data

We have successfully parsed the header, so we can now use this information to start to understand what the additional data remaining is and with BGP the message type and the length of the remaining data are key. The bgpm.type will be one of the following values:

1 - OPEN
2 - UPDATE
3 - NOTIFICATION
4 - KEEPALIVE

Where bgpm.length represents how much data exists (including the header), so to determine how much “remaining” data is left we would calculate remaining = bpf_ntohs(bgpm.length) - sizeof(bgpm) (the length arrives in network byte order, so it needs converting first). Given a KEEPALIVE message is just the header, this should return 0. However other message types often come with additional data!
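Here is the same header arithmetic as a small user-land sketch in Go (handy for testing against captured bytes). parseBGPHeader and bgpHeaderLen are hypothetical names for the example and are not part of the eBPF code.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// bgpHeaderLen is the fixed header size: 16 byte marker, 2 byte length, 1 byte type.
const bgpHeaderLen = 19

// parseBGPHeader reads the message type and works out how many bytes of
// message body follow the header. The length field is big endian and counts
// the header itself, so a KEEPALIVE (length 19) has nothing remaining.
func parseBGPHeader(b []byte) (msgType uint8, remaining int, err error) {
	if len(b) < bgpHeaderLen {
		return 0, 0, errors.New("not enough data for a BGP header")
	}
	length := int(binary.BigEndian.Uint16(b[16:18]))
	if length < bgpHeaderLen {
		return 0, 0, errors.New("BGP length is smaller than the header")
	}
	return b[18], length - bgpHeaderLen, nil
}
```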

So let’s parse the header, and we’ll look at the UPDATE message in further detail!

#define BGP_OPEN		    1
#define BGP_UPDATE		    2
#define BGP_NOTIFICATION	3
#define BGP_KEEPALIVE		4

...

switch (bgpm.type)
{
	case BGP_OPEN:
		break;
	case BGP_UPDATE:
		// Parse the UPDATE data :-)
		break;
	case BGP_NOTIFICATION:
	case BGP_KEEPALIVE:
	default:
		break; // nothing further to parse for these (yet)
}

(As every message comes through the kernel we parse the header and then process the remainder of the data)

The UPDATE message is (personally) pretty bonkers:

+-----------------------------------------------------+
|   Withdrawn Routes Length (2 octets)                |
+-----------------------------------------------------+
|   Withdrawn Routes (variable)                       |
+-----------------------------------------------------+
|   Total Path Attribute Length (2 octets)            |
+-----------------------------------------------------+
|   Path Attributes (variable)                        |
+-----------------------------------------------------+
|   Network Layer Reachability Information (variable) |
+-----------------------------------------------------+

As mentioned, if you’ve been writing/parsing JSON or higher level data structures then arrays etc. are pretty simple. With these older structures we will need to do various bits of logic to determine how many pieces of information are in each field marked as variable.

The Withdrawn Routes are straightforward enough, the Path Attributes are mind boggling…

Without screaming into the void too much, we’re given {x} amount of bytes as the Path Attributes and we would need to do the following:

Read the first 3 bytes to get the flags/type/len.
Then dependent on the type read a differently sized number of bytes, as each path attribute contains a different amount of data
We can load that len into another struct specific to that path attribute and read that particular data
Move the data pointer forward by the size of the Path Attribute “header” plus the length of the remaining data (len).
Once we’ve done all that, move the poffset forward by the Total Path Attribute Length so we can read the NLRI data
Sip a large glass of whisky
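The walk described in the steps above can be sketched in user-land Go as below. This is my own illustration (pathAttr and walkPathAttrs are invented names), and it also handles one wrinkle from the RFC: when the Extended Length bit (0x10) is set in the flags, the attribute length is two bytes rather than one.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// pathAttr is one decoded path attribute: flags, type code and raw value.
type pathAttr struct {
	Flags uint8
	Type  uint8
	Value []byte
}

// flagExtendedLength marks attributes whose length field is 2 bytes, not 1.
const flagExtendedLength = 0x10

// walkPathAttrs steps through the Path Attributes block, reading each
// flags/type/len header and skipping forward by header size plus value length.
func walkPathAttrs(b []byte) ([]pathAttr, error) {
	var attrs []pathAttr
	for off := 0; off < len(b); {
		if len(b)-off < 3 {
			return nil, errors.New("truncated path attribute header")
		}
		flags, typ := b[off], b[off+1]
		hdr, vlen := 3, int(b[off+2])
		if flags&flagExtendedLength != 0 {
			if len(b)-off < 4 {
				return nil, errors.New("truncated extended length")
			}
			hdr, vlen = 4, int(binary.BigEndian.Uint16(b[off+2:off+4]))
		}
		if len(b)-off-hdr < vlen {
			return nil, errors.New("truncated path attribute value")
		}
		attrs = append(attrs, pathAttr{flags, typ, b[off+hdr : off+hdr+vlen]})
		off += hdr + vlen
	}
	return attrs, nil
}
```

In eBPF the same walk needs a fixed upper bound on the number of iterations to keep the verifier happy, which is a large part of why it is more painful there than it looks here.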

Modify the BGP data

So whilst everything detailed above is great for gaining insight into what is happening from a BGP perspective, perhaps we may want to impose some changes to the BGP data as it’s flowing through! For this example we will change the AS number of a new route as it’s being pushed out to a ToR switch. In order to do this we will need to look for the Path Attribute with the type of 2 known as the AS_PATH detailed here.

struct bgp_path_as {
	__u8 type;
	__u8 lenth;
	__u32 as; 
};

(Here is the format defined as a C struct)

At this point we’ve gone through each of the Path Attributes, found type 2/AS_PATH and pulled it from the skb, and we want to change it to a different AS number.

bgp_as.as = bpf_htonl(65002);
ret = bpf_skb_store_bytes(skb, pathOffset, &bgp_as, sizeof(bgp_as), BPF_F_RECOMPUTE_CSUM);
if (ret != 0) {
    bpf_printk("error %d",ret);
    return 0;
}

(NOTE: pathOffset points to just after the path attribute header of the AS_PATH entry)

Here we can use bpf_skb_store_bytes to write an updated bgp_as that has our changed AS number; this helper also takes the flag BPF_F_RECOMPUTE_CSUM, which takes care of fixing up the checksum for the changed underlying data.

NOTE: You should notice that where we’re assigning the new AS 65002 we’re wrapping it with the function bpf_htonl, which converts a long from host to network byte order. Simply put, numbers that are used for networking are big “endian” (on most hosts the bytes are stored the other way around), you can read more about that here.
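To see what that conversion actually does to the bytes, here is a tiny Go sketch (asBytes is a made-up helper for the example) that lays out the same AS number in network order and in the little endian order used by most hosts.

```go
package main

import "encoding/binary"

// asBytes returns the 4 byte AS number as it appears on the wire (network
// order, big endian) and as a little endian host would store it in memory.
func asBytes(as uint32) (network, little [4]byte) {
	binary.BigEndian.PutUint32(network[:], as)
	binary.LittleEndian.PutUint32(little[:], as)
	return network, little
}
```

For 65002 (0xFDEA) the network order bytes are 00 00 FD EA and the little endian bytes are EA FD 00 00, so writing the value without the conversion would advertise a very different AS.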

The user land BGP program that is peering to the ToR is blissfully unaware that the route it is advertising is using a different AS number 😂 at this point.

Outro

The RFC docs are a great way to begin to understand this seemingly opaque block of data that follows the various headers when processing network data with eBPF. The lack of unbounded loops and other freely expressible ways of manipulating data means that extra thought has to be given when looking at and parsing application data. But with a thoughtful approach I don’t see why most protocols can’t be processed by eBPF. Today we need to bind programs to TC (Traffic Control), but once XDP has egress support we can offload so much application processing that the network layer will become incredibly powerful. I’m excited to parse more protocols :-) (DNS next).

Application traffic with eBPF

Posted on 2023-12-08 Edited on 2023-12-11

In a previous post I talked a little bit about building up the knowledge with eBPF to start to understand a little bit more about what is going in and out of a network adapter. Basically taking your ethernet frame and stripping off the headers (Ethernet + IP Header + TCP/UDP Header) you are finally left with what remains within the packet from an application or data sense.

All of the code lives within the “learning eBPF” repository, specifically the eBPF code is here. The plan for this post is to step through the bits that I think are useful or could be important…

Note: This code does some Ingress/Egress packet modification, so it uses some eBPF helpers that require version 6.1+ of the Linux Kernel to work.

The maps!

Presumably you’ve come across these before? If not never fear!! Simply put an eBPF map is the mechanism for communicating between user-land and the in-kernel eBPF program. What is exceptionally cool (in my mind at least) is that these maps use keys and values.. so I don’t have to loop around data comparing and looking for what matches whatever it is I’m looking for, I pass a key and if something matches I get the corresponding data :D

Below is the map that I will use, which is called url_map. The key is 20 characters long (a bounded “string” some might say), and the value that is assigned to that key is a struct (url_path) that I’ve defined alongside it.

// Defines a different URL associated with a key
struct url_path {
  __u8 path_len;
  __u8 path[max_path_len]; // This should be a char but code generation between here and Go..
};

// Defines my URL map
struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 1024);
  __type(key, char[max_path_len]);
  __type(value, struct url_path);
}
url_map SEC(".maps");

The eBPF programs!

There are two eBPF programs defined in the code tc_egress and tc_ingress, bonus points if you can guess how they are attached! For this post, we will only concern ourselves with the tc_ingress program.

So as we would see in the myriad of examples that already exist we need to go through the header identification dance.

1. Do the sanity checks, and cast the data to the type of ethhdr (Ethernet header)
2. Find the protocol within the ethernet frame by reading the h_proto within the ethernet header (also called Ethertype).
3. Cast the data after the ethernet header as a iphdr (IP header)
4. Find the protocol within the IP Header, we also will need to determine the size of the IP header (turns out they can be different sizes! ¯\_(ツ)_/¯)
5. To determine the size of the header we multiply its value (ihl) by four. Why, I hear you ask? Well this value counts 32-bit words, so if the value was 6 then the header would be 192 bits (or 24 bytes). So to simply determine the IP header size in bytes we can multiply this value by 4!
6. Cast the data after the IP Header as a tcphdr (TCP Header)
7. Like step (5) we will need to determine the size of the TCP Header (it again can be dynamic) and it’s the same step here, we simply need to multiply the value doff by four to determine the header size in bytes.
8. With all of this calculated we can now infer that the data lives at the end of the Ethernet Header size, the IP Header size and the TCP Header size.
9. Finally we can determine how big the application data is by taking the tot_len (total length) from the IP Header and subtracting the IP and TCP Header sizes.
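The arithmetic in the steps above is easier to follow with real numbers, so here is a user-land sketch in Go that does the same sums on a raw Ethernet frame. payloadBounds and ethHeaderLen are names I’ve invented for the example, and it only handles untagged IPv4 + TCP frames.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// ethHeaderLen is the size of an untagged Ethernet header (ETH_HLEN in C).
const ethHeaderLen = 14

// payloadBounds returns where the application data starts in the frame and
// how long it is, using the IP header length (ihl), TCP data offset (doff)
// and IP total length (tot_len) fields.
func payloadBounds(frame []byte) (offset, length int, err error) {
	if len(frame) < ethHeaderLen+20 {
		return 0, 0, errors.New("frame too short for an IPv4 header")
	}
	if binary.BigEndian.Uint16(frame[12:14]) != 0x0800 {
		return 0, 0, errors.New("not an IPv4 frame")
	}
	ip := frame[ethHeaderLen:]
	if ip[9] != 6 {
		return 0, 0, errors.New("not a TCP packet")
	}
	ipHdrLen := int(ip[0]&0x0f) * 4 // ihl counts 32-bit words
	if ipHdrLen < 20 || len(ip) < ipHdrLen+20 {
		return 0, 0, errors.New("bad IP header length")
	}
	totLen := int(binary.BigEndian.Uint16(ip[2:4]))
	tcpHdrLen := int(ip[ipHdrLen+12]>>4) * 4 // doff counts 32-bit words
	if tcpHdrLen < 20 || totLen < ipHdrLen+tcpHdrLen {
		return 0, 0, errors.New("bad TCP header length")
	}
	return ethHeaderLen + ipHdrLen + tcpHdrLen, totLen - ipHdrLen - tcpHdrLen, nil
}
```

For a frame with a 24 byte IP header (ihl of 6), a 20 byte TCP header (doff of 5) and tot_len of 49, the data starts at offset 58 and is 5 bytes long.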

Application Data !!

In order to read this data we will need a few things that were mentioned above!

First, we will need the data offset (where the data starts) and that is found after the Ethernet header + the IP Header size (once calculated) and the TCP Header (again, once calculated). We will also need a buffer in order to store the data we will be reading from the socket buffer.

// A data buffer to store our application data
char pdata[60];

// Calculate the offset to where our data actually lives
poffset = ETH_HLEN + ip_hlen + tcp_hlen;


// Load data from the socket buffer, poffset starts at the end of the TCP Header
int ret = bpf_skb_load_bytes(skb, poffset, pdata, 60);
if (ret != 0) {
   return 0;
}

We use bpf_skb_load_bytes to read a set amount of data (60 bytes) into our buffer (pdata) from the socket buffer (skb) starting from the offset where we know the data is (poffset)!

At this point we have 60 bytes of data, should be enough for us to write some code to understand it.

HTTP Data :-)

Let’s look at what happens when we try an HTTP request!

 ~ curl code/test -vvv
*   Trying 192.168.0.22:80...
* Connected to code (192.168.0.22) port 80 (#0)
> GET /test HTTP/1.1
> Host: code
> User-Agent: curl/7.87.0
> Accept: */*

...

I’m using curl to request the URL /test from the host code (code is my development VM, that runs code-server). We can see the data that is sent to the server (each line begins with > to determine the direction of communication). The first line of data in an HTTP request is typically a verb followed by the resource we would like to interact with and this request ends with the HTTP version and a carriage return as defined in the HTTP standards. So we can see the line that we care about is GET /test (we/I don’t really care about the HTTP version at this point :D).

Find the HTTP method

The first step is to read the first three characters of pdata and check if pdata[0] == 'G', pdata[1] == 'E' and pdata[2] == 'T'. This will effectively allow us to find if this is both an HTTP request in the first place and specifically if it is a GET request!

Once we’ve validated those first 3 bytes we will want to read more data starting from offset 4 (three bytes for the method and one for the space that follows)!

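char path[max_path_len];
memset(&path, 0, sizeof(path));

int path_len = 0;

// Find the request URI (starts at offset 4), ends with a space.
// Stop one byte short of the end of path so we never write outside it
// and the result always stays NUL terminated.
for (int i = 4; i < sizeof(pdata) && i - 4 < max_path_len - 1; i++)
{
    if (pdata[i] != ' ') {
        path[i-4] = pdata[i];
    } else {
        path[i-4] = '\0';
        path_len = i-4;
        break;
    }
}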

The above loop will read through the rest of the HTTP data (from offset 4) until it encounters a space, leaving us with the URL we are trying to GET! We can validate this with a debug print statement:

bpf_printk("<- incoming path [%s], length [%d]", path, path_len);

Which will look like the following in your logs:

<idle>-0 [001] dNs3. 2252901.017812: bpf_trace_printk: <- incoming path [/test], length [5]

Acting on the HTTP application request

The above explanations detail what and how we’re reading the data, but if we want to “dynamically” look up the HTTP requests we will need to make use of eBPF maps.

In our Go userland code we do the following:

path := flag.String("path", "", "The URL Path to watch for")
flag.Parse()

// ... 

// Create a uint8 array
var urlPath [20]uint8
// copy our bytes into the uint8 array (copy accepts a string as the source)
copy(urlPath[:], *path)

// place our urlPath as the key
err = objs.UrlMap.Put(urlPath,
  bpfUrlPath{
    Path:    urlPath,
    PathLen: uint8(len(*path)), // length of the path itself, not of the 20 byte array
  })
if err != nil {
  panic(err)
}

As we can see in the code above our Go program when started will read from the flag -path and that will be used as a key in our eBPF map, the value can be ignored for now.

struct url_path *found_path = bpf_map_lookup_elem(&url_map, path);
if (found_path) { // a NULL pointer means the key isn't in the map
    bpf_printk("Looks like we've found your path [%s]", path);
    // perhaps do more, block traffic or redirect?
}

In our eBPF program we will do a map lookup on the HTTP request, if that request as a char array exists as a key then we can operate on it!
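One thing that is easy to trip over here: assuming url_map is a hash map, the lookup compares every byte of the fixed-size key, not just the characters up to the end of the string. So both sides need to zero-fill the unused part of the key, which is what the Go array and the memset in the eBPF code take care of. A small userspace sketch of the idea (make_key, keys_match and KEY_LEN are hypothetical names for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 20 /* assumed: same size as the [20]uint8 key built in Go */

/* Copy s into a zero-padded, fixed-size key (truncating if it is too long). */
void make_key(uint8_t key[KEY_LEN], const char *s)
{
    size_t n = strlen(s);
    if (n > KEY_LEN)
        n = KEY_LEN;
    memset(key, 0, KEY_LEN);
    memcpy(key, s, n);
}

/* Compare two keys byte-for-byte over the full key size. */
bool keys_match(const uint8_t a[KEY_LEN], const uint8_t b[KEY_LEN])
{
    return memcmp(a, b, KEY_LEN) == 0;
}
```

Two keys built from /test match, but if one of them has stray bytes left after the path (from a buffer that wasn't cleared, say) the comparison fails even though both "look" like /test when printed.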

Starting our Go program now sudo ./http -interface ens160 -path /test will yield the following:

INFO[0000] Starting 🐝 the eBPF HTTP watcher, on interface [ens160] for path [/test] 
INFO[0000] Loaded TC QDisc                              
INFO[0000] Press Ctrl-C to exit and remove the program  
          <idle>-0       [001] d.s3. 2252901.015575: bpf_trace_printk: <- 0.0.0.0:56345 -> 0.0.0.0:80
          <idle>-0       [001] D.s3. 2252901.015642: bpf_trace_printk: -> 192.168.0.22:80 -> 192.168.0.180:56345
          <idle>-0       [001] d.s3. 2252901.017552: bpf_trace_printk: <- 0.0.0.0:56345 -> 0.0.0.0:80
          <idle>-0       [001] d.s3. 2252901.017793: bpf_trace_printk: <- 0.0.0.0:56345 -> 0.0.0.0:80
          <idle>-0       [001] dNs3. 2252901.017812: bpf_trace_printk: <- incoming path [/test], length [5]
          <idle>-0       [001] dNs3. 2252901.017814: bpf_trace_printk: Looks like we've found your path [/test]

Conclusion

Parsing HTTP isn’t too bad as it is a relatively simple protocol; it uses easy verbs and simple methods for structure, with spaces and carriage returns to differentiate. This methodology would potentially work OK with other plain-text protocols like POP3 or SMTP. When things are encrypted we would need some way of decrypting before we can parse the data (that’s beyond me…). However, I hope that this sparks some ideas into playing more with eBPF and attempting to parse and operate on applications with eBPF!

eBPF adventures in networking

Posted on 2023-11-18 Edited on 2023-12-12 Disqus:

I’ve been wanting to write some hopefully useful posts around eBPF for some time, although usually by the time I’ve come up with something I thought may be useful someone has already beaten me to the punch. Given that I’ve been focussing on networking one way or another for a while, this has largely been the area that I’ve focussed on, although I did manage to put something together for the recent eBPF summit 2023 that I thought was quite fun. As mentioned there are a lot of people that are starting to write eBPF content, so I’ll potentially refer to their posts instead of duplicating content.

`XDP` vs `TC`, or even `sysprobes`

I’ll start with a few acronyms or even technologies in the Linux Kernel that you may or may not have come across. But basically from my perspective at least these are your main options for modifying a running system to interact with networking data.

XDP

There already exists a lot of information about the eXpress Data Path, so I’ll not delve into too much detail. The tl;dr is that an eBPF program that hooks into XDP will have access to an incoming network frame before it is processed by the rest of the kernel. In some cases the eBPF program will be loaded into the NIC driver itself, and on some supported cards it can even be offloaded to the NIC hardware.

PROs

The best performance
Excellent for use cases such as firewalls, DDoS protection or load balancing
Sees incoming traffic before anything else can make any modifications

CONs

Ingress only, any traffic that you see with an XDP program is only incoming and there is currently no way of seeing traffic that is outbound
Uses the XDP data structure, which is a little different from the SKB that is the default for most socket programming.

TC (or Traffic Control)

Traffic Control is an integral part of the kernel networking structure, largely consisting of the capability to add things such as qdiscs and filters to an interface. The qdisc largely focuses on providing a TBD and a filter can then be attached to this qdisc; often a filter will actually be an eBPF program under the covers.

A common workflow is:

Create a qdisc or replace an existing one that concerns itself with either ingress or egress. The qdisc is attached to an interface.
Load your eBPF program
Create a filter that attaches itself to either ingress or egress, now exposed through the qdisc on an interface. That filter has the eBPF program attached to it, meaning all traffic either incoming or outgoing will now run through a program (if connected)
Profit 💰

PROs

Provides hooks for ingress and egress
Uses the traditional SKB data structure

CONs

It’s slightly more complicated to attach a TC program to either the ingress or egress queues. The user will need to make use of qdiscs in order to do this, and some eBPF SDKs don’t support TC program usage natively.
The traffic a TC eBPF program sees may have already been modified by an earlier XDP program or even the kernel itself.

Syscalls

This might seem a little weird compared to the other two, which are specifically designed to handle networking. An alternative is to attach some eBPF code to a function within the kernel that sits behind a syscall, specifically functions such as tcp_v4_connect() / tcp_v6_connect(). This is a little bit further down the stack as at this point an incoming packet has already been through a lot of the kernel logic and the eBPF introspection point is as the traffic is about to interact with an application itself.

Programming a network!

So at this point we (hopefully) realise that we’ve a number of different entry points that will allow us to inject our code on the “conveyor belt” that a packet will traverse starting from the NIC all the way to the application (and back, in the case of egress).

Recap

At the beginning of our so called “conveyor belt” we can attach our XDP program and get the raw untouched network data. In the middle of the “conveyor belt” our TC program will become part of the path through the kernel and receive potentially modified network data. At the end of the conveyor belt we can attach code to functions that the application will call in order to get the network data just before it is ingested by the running application.

Data representation

Where you attach your program determines two main things: the relative level of potential modification of the traffic, and how the traffic is represented.

The XDP struct

I’d write about it but DataDog already have done, you can read that here.

The SKB (Socket buffer)

The SKB is a data type that has existed within the kernel long before eBPF was added to the kernel, and it already comes with a number of helper functions that make interacting with an SKB object a little easier. For more deep dive into SKB you can read this -> http://vger.kernel.org/~davem/skb_data.html

Parsing the data

Regardless of which struct you interact with, they share some commonality and that is largely that there are two variables that are identical across both data types.

These are:

*data, which is a pointer to the data received by the eBPF program
data_len, which is an integer that specifies how much data there is (to help make sure you never access *data beyond data_len (obvious really 🤓))

So that all seems simple enough, but wait… what is actually in *data?? (Well that is for you to discover)

Well we do that through continually “casting” the *data and moving along it to strip off the various headers in order to understand and find the underlying data!
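Before every one of those casts we have to prove that the bytes we are about to read actually exist. In the real kernel structs (struct xdp_md and struct __sk_buff) the limit is exposed as a data_end pointer rather than a length, but the check is the same idea either way. A minimal userspace sketch, where in_bounds is a name I've made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if the len bytes starting at data + offset all sit before data_end.
 * Written so that the arithmetic can't overflow.
 */
bool in_bounds(const uint8_t *data, const uint8_t *data_end,
               size_t offset, size_t len)
{
    size_t avail = (size_t)(data_end - data);
    return offset <= avail && len <= avail - offset;
}
```

So with a 64 byte buffer, asking for 14 bytes at offset 50 is fine, but 14 bytes at offset 51 would run one byte past the end and has to be rejected. In an eBPF program the verifier insists on an equivalent comparison against data_end before it will let you dereference anything.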

casting?

You can skip this if you like, but this is a quick (and terrible) example of how we typically take some raw data and turn it into something that makes sense. At the moment *data will just be a stream of seemingly random data that won’t make any sense, and we will need to effectively add “formatting” to it so that we can understand what it looks like.

Consider the following random line of data Bobby0004500100.503 Harvard Drive90210. Some of it makes sense to the naked eye but some of it is unclear.

Imagine the data structure called “person”:

Name: string
Age: number
Balance: float
Street: string
ZipCode: number

If we were to “cast” our random data to the “person” structure above it would suddenly become:

Name: Bobby
Age: 45
Balance: 100.50
Street: 3 Harvard Drive
ZipCode: 90210

Now all of a sudden I’m able to both understand and access the underlying variables in the structure as they now make sense, i.e. person->Name, and find out that this particular object of type person has the name variable “Bobby”!

This is exactly what we will do to our *data !
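For the curious, the “person” example can be written out in C. This is a sketch under my own assumptions: the field widths (5, 5, 8, 15 and 5 characters) are simply what makes the example line split up correctly, and person_raw, person and parse_person are made-up names.

```c
#include <stdlib.h>
#include <string.h>

/* Fixed-width layout assumed for the example record (38 bytes in total). */
struct person_raw {
    char name[5];     /* "Bobby" */
    char age[5];      /* "00045" */
    char balance[8];  /* "00100.50" */
    char street[15];  /* "3 Harvard Drive" */
    char zip[5];      /* "90210" */
};

/* The same record with proper types. */
struct person {
    char   name[6];
    long   age;
    double balance;
    char   street[16];
    long   zip;
};

/* Copy a fixed-width field into a NUL-terminated buffer. */
static void field(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* "Cast" 38 raw bytes into something with named, typed fields. */
struct person parse_person(const char *raw)
{
    struct person_raw r;
    struct person p;
    char tmp[16];

    memcpy(&r, raw, sizeof(r)); /* lay the struct over the raw bytes */

    field(p.name, r.name, sizeof(r.name));
    field(tmp, r.age, sizeof(r.age));
    p.age = strtol(tmp, NULL, 10);
    field(tmp, r.balance, sizeof(r.balance));
    p.balance = strtod(tmp, NULL);
    field(p.street, r.street, sizeof(r.street));
    field(tmp, r.zip, sizeof(r.zip));
    p.zip = strtol(tmp, NULL, 10);
    return p;
}
```

The interesting line is the memcpy into person_raw: nothing about the bytes changes, we’ve just told the compiler where each field starts and ends. Network headers work the same way, except the fields are binary numbers instead of text.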

What’s in the data?

So the first step is to determine if the data starts with an Ethernet frame! Pretty much all of the data that travels around starts with an Ethernet frame, which is pretty simplistic, but its role is to carry a source and destination hardware address (regardless of virtualisation/containerisation/cabled network or WiFi). So our first step is to cast our *data to the type ETHHDR; if this is successful we will now be able to understand the variables that make up the Ethernet header data type. These would include the source and destination MAC addresses, but more importantly what the contents of the remaining data are. Again, in most circumstances the contents of the *data after the Ethernet header is typically an IP header, but we will validate by checking the Ethernet frame’s TBD variable.
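As a rough userspace sketch of that first step, here is the same check done on raw bytes rather than with the kernel’s struct (where, for reference, the type field is called h_proto in struct ethhdr). The names eth_type, ETH_HDR_LEN and ETHERTYPE_IPV4 are my own, and I’m assuming a plain frame with no VLAN tag.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14
#define ETHERTYPE_IPV4 0x0800

/*
 * Returns the EtherType of the frame in host byte order,
 * or -1 if the buffer is too short to hold an Ethernet header.
 */
int eth_type(const uint8_t *data, size_t len)
{
    if (len < ETH_HDR_LEN)
        return -1;
    /* bytes 0-5: destination MAC, 6-11: source MAC, 12-13: EtherType */
    return (data[12] << 8) | data[13];
}
```

If eth_type() gives back ETHERTYPE_IPV4 then the next header starts at data + ETH_HDR_LEN and can be treated as an IPv4 header.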

Once we validate that the next set of data is the IP header we will need to cast the data after the Ethernet header to the type IPHDR. Once we do this we will have access to the IP specific data such as source IP (saddr) or destination address (daddr). Again, importantly, the IP header contains a variable that details what the data is after the end of the IP header. This is usually a TCP header or UDP header, but there are other alternatives such as SCTP etc.
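Again as a hedged userspace sketch, this is roughly what pulling those fields out of an IPv4 header looks like when done byte by byte. ipv4_info, parse_ipv4 and the constant names are hypothetical, and IPv6 is ignored entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define IPPROTO_TCP_NUM 6
#define IPPROTO_UDP_NUM 17

struct ipv4_info {
    uint8_t  header_len; /* in bytes (the IHL field * 4) */
    uint8_t  protocol;   /* 6 = TCP, 17 = UDP, 132 = SCTP */
    uint32_t saddr;      /* source address, host byte order */
    uint32_t daddr;      /* destination address, host byte order */
};

/* Read a big-endian (network order) 32-bit value. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the IPv4 header starting at ip; returns 0 on success, -1 on error. */
int parse_ipv4(const uint8_t *ip, size_t len, struct ipv4_info *out)
{
    if (len < 20 || (ip[0] >> 4) != 4)       /* too short, or not IPv4 */
        return -1;
    out->header_len = (uint8_t)((ip[0] & 0x0f) * 4);
    if (out->header_len < 20 || out->header_len > len)
        return -1;
    out->protocol = ip[9];
    out->saddr = read_be32(ip + 12);
    out->daddr = read_be32(ip + 16);
    return 0;
}
```

Note that the IP header length isn’t fixed either (options can make it longer than 20 bytes), which is why header_len is worked out from the header itself rather than hard-coded.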

Once we’ve looked inside the IP header and determined that the data type is TCP (could be UDP or something else), we will cast the data after both the Ethernet header and the IP header to the type TCP header! (Almost there). With access to the contents of the TCP header we have the TCP specific data, such as source port or destination port, the checksum to ensure validity of the data, amongst other useful variables.

We now have almost everything, however the TCP header can be variable length so we will need to determine this by looking at the data offset variable (doff in struct tcphdr), which counts 32-bit words so we need to multiply it by 4. We now have everything we need to get to the final data!
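Here is a small userspace sketch of reading the ports and the header length from raw TCP header bytes. tcp_info and parse_tcp are made-up names for illustration; an eBPF program would normally go through struct tcphdr instead.

```c
#include <stddef.h>
#include <stdint.h>

struct tcp_info {
    uint16_t sport;      /* source port, host byte order */
    uint16_t dport;      /* destination port, host byte order */
    uint8_t  header_len; /* in bytes (data offset * 4), between 20 and 60 */
};

/* Parse the TCP header starting at tcp; returns 0 on success, -1 on error. */
int parse_tcp(const uint8_t *tcp, size_t len, struct tcp_info *out)
{
    if (len < 20)
        return -1;
    out->sport = (uint16_t)((tcp[0] << 8) | tcp[1]);
    out->dport = (uint16_t)((tcp[2] << 8) | tcp[3]);
    /* The data offset is the top 4 bits of byte 12, counted in 32-bit words */
    out->header_len = (uint8_t)((tcp[12] >> 4) * 4);
    if (out->header_len < 20 || out->header_len > len)
        return -1;
    return 0;
}
```

A data offset of 5 means a bare 20 byte header, while a value of 8 means there are 12 bytes of TCP options to skip over before the application data starts.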

So, the *data points to the beginning of the data! We have determined that there is an Ethernet header followed by an IP header and finally a TCP header, which means *data + Ethernet header + IP header + TCP header = Actual application data !

What can we do with this information ?

As we parse through the various headers, we effectively unlock more and more information at different layers of the OSI model!

[layer 2] The Ethernet Header provides us with the source and destination hardware addresses, we could use this information to potentially stop frames being processed from source MAC addresses that we know to be dangerous.

[layer 3] The IP Header contains the source and destination IP addresses, again we can act like a firewall by having an eBPF program drop all traffic for a specific IP address. Alternatively we could have logic that will potentially redirect traffic based upon the IP addresses, or we could even implement load balancing logic at this layer that will redirect to an underlying set of other IP addresses

[layer 4] The TCP or UDP Headers define the destination port numbers, which we can use to determine what the application protocol is (i.e. port 80 typically means that the remaining *data is likely to be HTTP data). More often than not we would perform actions such as load balancing at this layer, based upon the destination (i.e. balance across multiple other load balancer addresses)

[layer 7] As mentioned the data at the end of the collection of various headers is the actual application data, which we can also parse (as long as we know the format). So for instance if an external web browser were to try and access /index.html on my machine with an eBPF program attached, I’d parse all the way to TCP to determine that it was port 80 and then the application data should be in the HTTP format. I could validate this by looking at the first three characters of application data (after all the headers), with some pseudo code like below:

ApplicationData = EthernetHDR + IPHDR + TCPHDR // Add all header lengths together to find the data
if (data[ApplicationData] == 'G' && data[ApplicationData+1] == 'E' && data[ApplicationData+2] == 'T') {
	// It's a HTTP GET request
	// do something exciting
}
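To make the pseudo code a bit more concrete, here is a userspace C sketch that walks a raw packet from the Ethernet header to the application data and checks for GET. The function name is_http_get is made up, it only handles Ethernet + IPv4 + TCP with no VLAN tag, and it doesn’t check the destination port, so treat it as an illustration of the offsets rather than a finished parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if pkt is Ethernet + IPv4 + TCP and the payload starts with "GET". */
bool is_http_get(const uint8_t *pkt, size_t len)
{
    size_t off = 14;                          /* Ethernet header length */
    if (len < off + 20)
        return false;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)   /* EtherType must be IPv4 */
        return false;

    size_t ip_len = (size_t)(pkt[off] & 0x0f) * 4;
    if (ip_len < 20 || pkt[off + 9] != 6)     /* protocol 6 = TCP */
        return false;
    off += ip_len;                            /* start of the TCP header */
    if (len < off + 20)
        return false;

    size_t tcp_len = (size_t)(pkt[off + 12] >> 4) * 4;
    if (tcp_len < 20)
        return false;
    off += tcp_len;                           /* start of application data */
    if (len < off + 3)
        return false;

    return pkt[off] == 'G' && pkt[off + 1] == 'E' && pkt[off + 2] == 'T';
}
```

With the minimum header sizes (14 + 20 + 20) the application data starts at offset 54, which is where the G, E and T are expected to be.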

Wrap up

Now we “kind of” understand the logic we should probably look at implementing some code to do all this .. that’s for another day though.

Finding and fiddling with Slacks APIs

Posted on 2023-10-18 Edited on 2023-10-19 Disqus:

It’s starting to feel a little bit as though the noose is starting to tighten a lot in the IT industry at the moment, whether it be Open Source projects spuriously changing their licenses and pulling the rug from their users to companies that expanded too far and too quickly during the pandemic suddenly contracting. This tightening of the belt has also impacted a bunch of tooling that people have come to depend on as part of their day to day life, or their workflow.

Slack isn’t slack when it comes to the community usage

So! What’s this rambling collection of words about?

Like all of the wonderful platforms as a service out there ~~IRC for money~~ Slack has various tiers, each unlocking more and more functionality whilst ramping up the various costs associated. A lot of communities (especially in the Open Source world) have been built on the free and open communication that takes place on a Slack instance/workspace devoted to a project or community. But the problem I’m trying to address in this post is the hoops people need to jump through in order to join these communities hosted on Slack.

When the community first starts out, it makes perfect sense that there could be a need for simple restraints and/or an approval process for joining a community. This is usually handled through an invitation process: if you want to join the community then you need to somehow signal that intent, allowing the owner(s) of the community to then invite you to join. However once you start to hit any sort of scale then this simply becomes an unmanageable task, ultimately impacting the growth and health of a hopefully growing community, and it is at this point where you would potentially want to open the floodgates as the community (hopefully) exponentially grows.

Note On the side of security, we’re moving to a reactive instead of proactive approach for maintaining community membership (again coming with its own challenges).

So what are your options?

The legitimate option

Your main option without trying to automate anything is to simply select the + Add coworkers button and create an invite link; this can be set to never expire and is meant to work for 400 people (YMMV). It doesn’t really look like there are many options available to automate this procedure. You can open Slack in a web browser with “developer mode” enabled to capture the API endpoint, but I’ve been unable to determine what token can be used.

The endpoint for this API call is https://<workspace>.slack.com/api/users.admin.createSharedInvite, which should be posted with a FORM with fields such as token and set_active. However, no token I could produce would seem to get this to work (even the token taken from the browser).

Example code:

token := os.Getenv("SLACK_TOKEN")
instance := os.Getenv("SLACK_INSTANCE")

slack_url := fmt.Sprintf("https://%s/api/users.admin.createSharedInvite", instance)

values := url.Values{
	"expiration":  {"99999"},
	"token":       {token},
	"set_active":  {"true"},
	"max_signups": {"200"},
}
client := &http.Client{}

req, err := http.NewRequest("POST", slack_url, strings.NewReader(values.Encode()))
if err != nil {
	fmt.Println(err)
	return
}
req.Header.Add("Content-Type", "application/x-www-form-urlencoded")

resp, err := client.Do(req)
if err != nil {
	fmt.Println(err)
	return
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(body[:]))

The “other” option

With trying to replicate the Slack client behaviour not working, we’re left to try less legitimate routes to make it easier for people to join the Slack workspace. There are a number of Open Source projects that are quite old and unmaintained that make use of an old (undocumented) API endpoint /api/users.admin.invite (not to be confused with /api/admin.users.invite, api docs). This endpoint works with a legacy token that you’re no longer able to create (as of 2020). However, it turns out that invoking the correct sequence will allow you to create a token that can still use this endpoint!

Create an App https://api.slack.com/apps
Create the App from scratch, give it a name (any will do)
Scroll down and make a copy of the Client ID and the Client Secret
Scroll further up and find the Permissions box
In the Redirect URLs box we will need to enter a bogus URL (make sure it doesn’t actually resolve) such as https://0.0.0.0/, ensure you click Save URLs
Under the User Token Scopes, select the Add an OAuth Scope and add admin
Finally scroll back up to OAuth Tokens for Your Workspace and select Install to Workspace.
Select the Allow button to create our first token!

At this point we will have a User OAuth Token, however this token even with its permissions won’t work with the users.admin.invite endpoint 🙄

To generate our actual token we need to do some additional OAuth steps!

Build our specific url! https://<workspace>.slack.com/oauth?client_id=<Client ID>&scope=admin%2Cclient
Replace the <workspace> with the name of your particular workspace, and change <Client ID> to the ID from your application.
Paste this URL into a new tab on the logged in browser, you’ll be redirected to a webpage that won’t work (remember the 0.0.0.0). However you should see the new URL in the address bar! (Including a code= part of the URL). Grab this code for the next part!
Build our new URL! https://<workspace>.slack.com/api/oauth.access?client_id=<Client ID>&client_secret=<Client Secret>&code=<code>
Either open this in a web browser again, or in curl (remember to put the URL in quotes ")
The output from this will be some json {"ok":true,"access_token":"xoxp-..., the access_token is what we’re after !!
Verify that you can invite people with your new token!

curl -X POST 'https://<workspace>.slack.com/api/users.admin.invite' \
--data 'email=dan@thebsdbox.co.uk&token=<token>&set_active=true' \
--compressed

A few seconds later…

{"ok":true}

Just to test, you can run the invite command again and if it’s all working as expected you should receive an error!

{"ok":false,"error":"already_in_team_invited_user"}

In conclusion

That seems like a lot of effort, but in the end it’s now possible again to build a simple web form allowing people to register themselves to join your Slack workspace!
