Building your own service mesh

I saw a few mentions of “service mesh” and mTLS, amongst other things, during the KubeCon US week, and given some of the messing around I’d been doing with eBPF recently, I asked myself: “how hard could it be to write one from scratch?”

The service mesh shopping list

There are a bunch of components that we will need to build in order to get “service mesh” type behaviour. Most service meshes do a heck of a lot more; we’re exploring just the basics needed to get there.

Traffic redirector 🚦

We need a way of taking traffic from an application and sending it elsewhere, typically to our proxy, where we will potentially modify it. The traffic needs to be redirected in a way where the application doesn’t need to know it is happening, but we need to ensure that the traffic still reaches its destination and that it is returned in a way that makes sense to the application. In most circumstances this is handled by iptables rules that change the source and destination of packets as they navigate the kernel. When a pod initiates a connection to another pod within the cluster, we will need to redirect it to our program, which we will call the proxy.

The Proxy

Our proxy will need to listen somewhere accessible on the network, and as outbound connections are created their destination will be changed to that of the proxy (we also need to keep a copy of the original destination somewhere). At this point we will start receiving data from the source, and it is here where we have the opportunity to change the original traffic, or parse it and make decisions based upon what we learn.
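To make that a bit more concrete, here is a minimal sketch of what the internal listener could look like in Go. The 127.0.0.1:18000 address matches the logs later in the post; everything else, including the originalDestination placeholder, is illustrative and only becomes real once the eBPF pieces are in place.

// Minimal sketch of the internal proxy loop. The listener address mirrors
// the logs later in this post; originalDestination is a placeholder for the
// eBPF-backed lookup described further down.
package main

import (
	"errors"
	"io"
	"log"
	"net"
)

func main() {
	// Internal listener; outbound pod traffic gets redirected here.
	ln, err := net.Listen("tcp", "127.0.0.1:18000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go handleInternal(conn.(*net.TCPConn))
	}
}

func handleInternal(src *net.TCPConn) {
	defer src.Close()

	// Ask the kernel (really our eBPF program) where this connection was
	// originally headed before it was redirected to us.
	dst, err := originalDestination(src)
	if err != nil {
		log.Println(err)
		return
	}

	// Dial the real destination (or the remote pod's proxy) and pipe bytes both ways.
	upstream, err := net.Dial("tcp", dst)
	if err != nil {
		log.Println(err)
		return
	}
	defer upstream.Close()

	go io.Copy(upstream, src)
	io.Copy(src, upstream)
}

// originalDestination is a placeholder here; the SO_ORIGINAL_DST section
// later in the post shows one way it could be implemented.
func originalDestination(c *net.TCPConn) (string, error) {
	return "", errors.New("not implemented yet")
}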

The Injector 💉

The injector is code that modifies the behaviour of Kubernetes so that when new workloads are scheduled an additional container can be added, or something can run before the workload starts that writes iptables/nftables rules into the kernel.
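As a rough illustration, this is the kind of container an injector (for example a mutating admission webhook) might append to each pod spec. The image name and capabilities here are made up; in the eBPF approach used in this post the extra container would load eBPF programs rather than writing iptables rules.

// Sketch of the extra container an injector could add to a pod spec.
// The image name and capabilities are illustrative only.
package injector

import corev1 "k8s.io/api/core/v1"

func meshContainers() []corev1.Container {
	return []corev1.Container{{
		Name:  "smesh-proxy",
		Image: "example.org/smesh-proxy:latest", // hypothetical image
		SecurityContext: &corev1.SecurityContext{
			Capabilities: &corev1.Capabilities{
				// Needed so the container can load eBPF programs (or, in the
				// iptables approach, write redirect rules).
				Add: []corev1.Capability{"NET_ADMIN", "SYS_ADMIN"},
			},
		},
	}}
}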

Certificates 📝

If we want to use mTLS between pods then we will need to create certificates, and these certs will need things like the pod IPs or pod hostnames in order to be valid. Given that we won’t know these details until the pod starts, we will need to capture this information by watching Kubernetes and creating the certificates when we see a pod being created.
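A minimal sketch of the certificate-minting side, using just the Go standard library and assuming the mesh already has a CA to sign with; the key point is that the pod IP we learned from watching Kubernetes ends up in the certificate’s SANs.

// Sketch of minting a per-pod certificate once the pod IP is known.
// The CA certificate and key are assumed to come from the mesh's own issuer.
package certs

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"net"
	"time"
)

func podCertificate(caCert *x509.Certificate, caKey *ecdsa.PrivateKey, podName string, podIP net.IP) ([]byte, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: podName},
		// The pod IP (and any DNS names) go into the SANs so peers can verify it.
		IPAddresses: []net.IP{podIP},
		NotBefore:   time.Now(),
		NotAfter:    time.Now().Add(24 * time.Hour),
		KeyUsage:    x509.KeyUsageDigitalSignature,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	return der, key, nil
}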

Let’s get started 🐝

If I can’t control the traffic then I can’t do anything, so first things first: I’m going to use eBPF to manipulate the traffic and make sure that it is sent where I need it to go. Why eBPF? Well, because!

So let’s walk this through…

There are a bunch of methods for manipulating traffic (XDP, TC, sockets, etc.), so what’s the choice?

  • XDP? Nope, no egress, and if we’re wanting to capture traffic being initiated out to somewhere else, then that’s egress.
  • TC? It has egress, BUT by that point the traffic has already gone through the kernel (iptables, sockets, etc.), and changing it to send back into the kernel is a bit of a pain.
  • Sockets? Seems like the best option for what we’re aiming for.

The eBPF 🐝 magic 🪄

Our eBPF code is going to manipulate the L3 & L4 behaviour of packets as they traverse the kernel and, in some cases, user land (i.e. the proxy).
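The eBPF programs themselves are written in C and compiled into an object file; the proxy then loads and attaches them from user land. A rough sketch of that user-space side, assuming the cilium/ebpf library and made-up object and program names:

// Sketch of loading a compiled eBPF object and attaching its connect4
// program to the cgroup hierarchy so it sees outbound IPv4 connect() calls.
// Assumes github.com/cilium/ebpf; "smesh_bpf.o" and "connect4" are made up.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	coll, err := ebpf.LoadCollection("smesh_bpf.o") // hypothetical object file
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach the cgroup/connect4 program so every IPv4 connect() made inside
	// the cgroup hierarchy passes through it before the kernel routes it.
	cg, err := link.AttachCgroup(link.CgroupOptions{
		Path:    "/sys/fs/cgroup",
		Attach:  ebpf.AttachCGroupInet4Connect,
		Program: coll.Programs["connect4"], // hypothetical program name
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cg.Close()

	log.Println("connect4 attached; waiting for traffic")
	select {}
}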

The life of our packet is the following!

For this walkthrough:

  • pod-01 is 10.0.0.10
  • pod-02 is 10.10.0.20
  1. Our eBPF program is started and is passed the CIDR range of pods in our Kubernetes cluster and the pid of the proxy; this is done through an eBPF map.
  2. The application within the pod (pod-01) wants to create an outbound connection with connect(), in this case to pod-02. This would typically come from a high ephemeral port, 32305 for example, attempting to connect outbound.
  3. The eBPF program will change the destination from 10.10.0.20 to the proxy that is listening on localhost, so 10.10.0.20:<port> would become 127.0.0.1:18000.
  4. We also stuff the original destination address and port into a map, which uses the socket “cookie” as its key.
  5. The proxy on 127.0.0.1:18000 will receive all the TCP magic from the application that started the connection and once the socket has been established we hook in with eBPF.
  6. Here we will add to another map the source port 32305 and the unique socket “cookie”.
  7. The proxy has an established connection from the application, however it needs to know the original destination. We get this by calling the getsockopt syscall with a specific option, SO_ORIGINAL_DST. That call is captured by eBPF, which looks up the src port 32305 to find the cookie, then uses the cookie to look up the original destination 10.10.0.20:<port> in another map (there’s a sketch of the proxy side of this just after the list).
  8. The proxy can now establish a connection outbound to the destination pod or another proxy (this will be covered later).
  9. As traffic is read() by the proxy it is forwarded to the internal connection, and the application in pod-01 processes it as if there were no proxy in the middle.
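Here is a sketch of the proxy side of step 7, filling in the originalDestination placeholder from the proxy sketch earlier. It assumes golang.org/x/sys/unix, and that our eBPF getsockopt hook answers the SO_ORIGINAL_DST request from the maps described above.

// Sketch of recovering the original destination from an accepted connection
// via getsockopt(SOL_IP, SO_ORIGINAL_DST). In our case the eBPF getsockopt
// program intercepts this call and returns the address it stashed earlier.
package proxy

import (
	"encoding/binary"
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

func originalDestination(conn *net.TCPConn) (string, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return "", err
	}

	var addr *unix.IPv6Mreq // 16-byte Multiaddr field is big enough for a sockaddr_in
	var sockErr error
	err = raw.Control(func(fd uintptr) {
		addr, sockErr = unix.GetsockoptIPv6Mreq(int(fd), unix.SOL_IP, unix.SO_ORIGINAL_DST)
	})
	if err != nil {
		return "", err
	}
	if sockErr != nil {
		return "", sockErr
	}

	// sockaddr_in layout: family (2 bytes) | port (2 bytes, big endian) | IPv4 address (4 bytes)
	port := binary.BigEndian.Uint16(addr.Multiaddr[2:4])
	ip := net.IPv4(addr.Multiaddr[4], addr.Multiaddr[5], addr.Multiaddr[6], addr.Multiaddr[7])
	return fmt.Sprintf("%s:%d", ip, port), nil
}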

Why do we pass the pid of the proxy into the eBPF program? (I hear you ask)

Well, we would end up in a loop if the proxy had its outbound connections redirected back to itself. So if we see a connection coming from the proxy, we don’t redirect it.
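On the proxy side, that just means writing our own pid (and the pod CIDR from step 1) into a config map before the programs start redirecting anything. A sketch, again assuming cilium/ebpf and a completely made-up map layout:

// Sketch of handing the proxy's pid and the pod CIDR to the eBPF program
// via a map, so its own connections (and non-pod traffic) are left alone.
// The map name, keys and packing are made up for illustration.
package proxy

import (
	"os"

	"github.com/cilium/ebpf"
)

func configureRedirector(cfg *ebpf.Map) error {
	// Key 0: the proxy's own pid, so its outbound connections are not
	// redirected back to itself (avoiding the loop described above).
	if err := cfg.Put(uint32(0), uint32(os.Getpid())); err != nil {
		return err
	}
	// Keys 1 and 2: pod CIDR as address + prefix length (10.0.0.0/8 here),
	// so only pod-to-pod traffic is redirected.
	if err := cfg.Put(uint32(1), uint32(0x0A000000)); err != nil {
		return err
	}
	return cfg.Put(uint32(2), uint32(8))
}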

Abridged logs

$ kubectl logs  pod-01 -c smesh-proxy
[2024/12/02T10:17:58.618] [application] [INFO] [main.go:66,main] Starting the SMESH 🐝
[2024/12/02T10:17:58.618] [application] [INFO] [main.go:94,main] detected Kernel 6.8.x
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:23,startInternalListener] internal proxy [pid: 7] 127.0.0.1:18000
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:33,startExternalListener] external proxy [pid: 7] 0.0.0.0:18001
[2024/12/02T10:17:58.682] [application] [INFO] [connection.go:62,startExternalTLSListener] external TLS proxy [pid: 7] 0.0.0.0:18443

< Proxy is up and running>
< We receive a forwarded connection from eBPF 🐝 >

[2024/12/02T10:18:14.080] [application] [INFO] [connection.go:75,start] internal proxy connection from 127.0.0.1:33804 -> 127.0.0.1:18000

< We've looked up the connection through eBPF to find the original destination >
< The proxy connects to pod-02 (on its local proxy port, where it takes care of forwarding to the application in the same pod) and we can now start sending traffic from pod-01 through the proxy >

[2024/12/02T10:18:14.087] [application] [INFO] [connection.go:156,internalProxy] Connected to remote endpoint 10.10.0.20:18443, original dest 10.10.0.20:9000

< The application in pod-02 has established a new connection in the opposite direction >

[2024/12/02T10:18:16.081] [application] [INFO] [connection.go:95,startTLS] external TLS proxy connection from 10.10.0.20:47292 -> 10.0.0.10:18443

Summary

This post steps through the bits needed to form a service mesh and how we use eBPF to redirect traffic to another process listening within the same pod. We know that this is achievable, but we now need to work out how to architect these pieces and get traffic across to the other pod! (which I’ll cover in the next post)

UPDATE: That post is now available here