eBPF adventures in networking
I’ve been wanting to write some hopefully useful posts about eBPF for some time, although usually by the time I’ve come up with something I thought might be useful someone has already beaten me to the punch. I’ve been focussing on networking one way or another for a while, so that is largely the area I’ve explored, although I did manage to put something together for the recent eBPF Summit 2023 that I thought was quite fun. As mentioned, there are a lot of people starting to write eBPF content, so I’ll potentially refer to their posts instead of duplicating content.
XDP vs TC, or even sysprobes
I’ll start with a few acronyms and technologies in the Linux kernel that you may or may not have come across. From my perspective, at least, these are your main options for modifying a running system to interact with networking data.
XDP
There already exists a lot of information about the eXpress Data Path, so I’ll not delve into too much detail. The tl;dr is that an eBPF program that hooks into XDP will have access to an incoming network frame before it is processed by the kernel itself. In native mode the program runs inside the NIC driver, and on some NICs it can be offloaded to the hardware itself.
PROs
- The best performance
- Excellent for use cases such as firewalls, DDoS protection or load balancing
- Sees incoming traffic before anything else can make any modifications
CONs
- Ingress only: any traffic that you see with an XDP program is incoming, and there is currently no way of seeing outbound traffic
- Uses the XDP data structure, which is a little different from the SKB that is the default for most socket programming.
TC (or Traffic Control)
Traffic Control is an integral part of the kernel networking structure, largely consisting of the capability to add things such as qdiscs and filters to an interface. The qdisc largely focuses on providing a TBD, and a filter can then be attached to this qdisc; often a filter will actually be an eBPF program under the covers.
A common workflow is:
- Create a qdisc or replace an existing one that concerns itself with either ingress or egress. The qdisc is attached to an interface.
- Load your eBPF program
- Create a filter that attaches itself to either the ingress or egress hook now exposed through the qdisc on an interface. That filter has the eBPF program attached to it, meaning all traffic either incoming or outgoing will now run through the program (if connected)
- Profit 💰
PROs
- Provides hooks for ingress and egress
- Uses the traditional SKB data structure
CONs
- It’s slightly more complicated to attach a TC program to either the ingress or egress queue. The user will need to make use of qdiscs in order to do this, and some eBPF SDKs don’t support TC program usage natively.
- The traffic a TC eBPF program sees may have already been modified by an earlier XDP program or even the kernel itself.
Syscalls
This might seem a little weird compared to the other two, which are specifically designed to handle networking. An alternative is to attach some eBPF code to a syscall or function within the kernel, specifically calls such as tcp_v4_connect() / tcp_v6_connect(). This is a little further down the stack: at this point a packet has already been through a lot of the kernel logic, and the eBPF introspection point is where the traffic is about to interact with the application itself.
Programming a network!
So at this point we (hopefully) realise that we’ve a number of different entry points that will allow us to inject our code on the “conveyor belt” that a packet will traverse starting from the NIC all the way to the application (and back, in the case of egress).
Recap
At the beginning of our so-called “conveyor belt” we can attach our XDP program and get the raw, untouched network data. In the middle of the “conveyor belt” our TC program will become part of the path through the kernel and receive potentially modified network data. At the end of the “conveyor belt” we can attach code to functions that the application will call in order to get the network data just before it is ingested by the running application.
Data representation
Where you attach your program determines two main things: how much the traffic may already have been modified, and how the traffic is represented.
The XDP struct
I’d write about it, but Datadog have already done so; you can read that here.
The SKB (Socket buffer)
The SKB is a data type that existed within the kernel long before eBPF was added, and it already comes with a number of helper functions that make interacting with an SKB object a little easier. For a deeper dive into the SKB you can read this -> http://vger.kernel.org/~davem/skb_data.html
Parsing the data
Regardless of which struct you interact with, they share some commonality: there are two variables that are identical across both data types.
These are:
- *data, which is a pointer to the data received by the eBPF program
- data_len, which is an integer that specifies how much data there is (to help make sure you never read further into *data than data_len; obvious really 🤓)
So that all seems simple enough, but wait… what is actually in *data?? (Well that is for you to discover)
Well we do that through continually “casting” the *data and moving along it to strip off the various headers in order to understand and find the underlying data!
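The "never read past data_len" rule from the list above can be sketched in plain user-space C. This is only an illustration: header_fits and header_at are hypothetical helper names, not kernel or libbpf APIs. Inside a real eBPF program the context structures expose the end of the packet as a data_end pointer rather than a length, but the comparison is the same idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when `size` bytes starting at `offset`
 * lie entirely inside a buffer of `data_len` bytes. */
static bool header_fits(size_t data_len, size_t offset, size_t size)
{
    /* Written this way so that offset + size can never overflow. */
    return offset <= data_len && size <= data_len - offset;
}

/* Hypothetical helper: pointer to `size` bytes at `offset`, or NULL
 * when that range would run past the end of the data. */
static const void *header_at(const uint8_t *data, size_t data_len,
                             size_t offset, size_t size)
{
    assert(data != NULL);
    if (!header_fits(data_len, offset, size))
        return NULL;
    return data + offset;
}
```

Every later step (Ethernet, IP, TCP) would go through a check like this before touching the bytes; the eBPF verifier rejects programs that skip it.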
casting?
You can skip this if you like, but this is a quick (and terrible) example of how we typically take some raw data and turn it into something that makes sense. At the moment *data is just a stream of bytes that won’t make any sense, and we need to effectively add “formatting” to it so that we can understand what it looks like.
Consider the following random line of data: Bobby0004500100.503 Harvard Drive90210. Some of it makes sense to the naked eye, but some of it is unclear.
Imagine the data structure called “person”:
    Name: string
If we were to “cast” our random data to the “person” structure above it would suddenly become:
    Name: Bobby
Now all of a sudden I’m able to both understand and access the underlying variables in the structure as they now make sense, i.e. I can read person->Name and find out that this particular object of type person has the name “Bobby”!
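Here is a rough C sketch of that cast using the sample line from above. Only the Name field comes from the example; the other fields and all of the field widths are my guess at how the line splits up, and person_from_bytes / person_name_is are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-width layout. Only `name` is from the post; the
 * remaining fields are an assumption about the sample data. */
struct person {
    char name[5];     /* "Bobby" */
    char age[5];      /* "00045" */
    char balance[8];  /* "00100.50" */
    char street[15];  /* "3 Harvard Drive" */
    char zip[5];      /* "90210" */
};

/* Every member is a char array, so no padding is expected. */
static_assert(sizeof(struct person) == 38, "person must be 38 bytes");

/* "Cast" the raw bytes to a person, refusing input that is too short. */
static const struct person *person_from_bytes(const char *raw, size_t len)
{
    if (raw == NULL || len < sizeof(struct person))
        return NULL;
    return (const struct person *)raw;
}

/* Fields are not NUL-terminated, so compare with an explicit length. */
static int person_name_is(const struct person *p, const char *name)
{
    return strlen(name) == sizeof(p->name) &&
           memcmp(p->name, name, sizeof(p->name)) == 0;
}
```

Nothing is copied or converted here: the struct is just a different way of looking at the same bytes, which is exactly what happens with the packet headers later on.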
This is exactly what we will do to our *data !
What’s in the data?
So the first step is to determine if the data starts with an Ethernet frame! Pretty much all of the data that travels around starts with an Ethernet header, which is pretty simplistic, but its role is to carry a source and destination hardware address (regardless of virtualisation/containerisation/cabled network or WiFi). So our first step is to cast our *data to the type ETHHDR; if this is successful we will be able to understand the variables that make up the Ethernet header data type. These include the source and destination MAC addresses but also, more importantly, what the contents of the remaining data are. Again, in most circumstances the contents of *data after the Ethernet header is an IP header, but we will validate this by checking the Ethernet frame’s TBD variable.
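A minimal user-space sketch of this first cast might look like the following. The struct mirrors the byte layout of an Ethernet header but is redefined here so the example compiles without kernel headers; eth_hdr, parse_eth, eth_type and ETH_TYPE_IPV4 are names I made up for the illustration. The value 0x0800 is the EtherType that identifies IPv4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_TYPE_IPV4 0x0800  /* EtherType value for IPv4 */

/* Byte layout of an Ethernet header (illustrative redefinition). */
struct eth_hdr {
    uint8_t dst[6];    /* destination MAC address */
    uint8_t src[6];    /* source MAC address */
    uint8_t proto[2];  /* EtherType, big-endian on the wire */
};
static_assert(sizeof(struct eth_hdr) == 14, "Ethernet header is 14 bytes");

/* Cast the start of the data to an Ethernet header, or NULL if too short. */
static const struct eth_hdr *parse_eth(const uint8_t *data, size_t data_len)
{
    if (data == NULL || data_len < sizeof(struct eth_hdr))
        return NULL;
    return (const struct eth_hdr *)data;
}

/* Read the EtherType in host byte order, whatever the CPU endianness. */
static uint16_t eth_type(const struct eth_hdr *eth)
{
    return (uint16_t)((eth->proto[0] << 8) | eth->proto[1]);
}
```

If eth_type() returns ETH_TYPE_IPV4, the bytes immediately after the 14-byte header can be treated as an IPv4 header.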
Once we validate that the next set of data is the IP header, we will need to cast the data after the Ethernet header to the type IPHDR. Once we do this we will have access to the IP-specific data such as the source address (saddr) or destination address (daddr). Again, importantly, the IP header contains a variable that details what the data is after the end of the IP header. This is usually a TCP header or UDP header, but there are other alternatives such as SCTP.
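The IP step could be sketched like this, again in user-space C with an illustrative struct rather than the kernel’s own definition. ipv4_hdr, parse_ipv4 and the IP_PROTO_* constants are hypothetical names; the protocol numbers themselves (6 for TCP, 17 for UDP, 132 for SCTP) are the standard assigned values. Note that the sketch also checks the header length field, since an IPv4 header can carry options and be longer than 20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP_PROTO_TCP  6
#define IP_PROTO_UDP  17
#define IP_PROTO_SCTP 132

/* Byte layout of an IPv4 header without options (illustrative). */
struct ipv4_hdr {
    uint8_t ver_ihl;      /* version (high 4 bits), length in 32-bit words (low 4 bits) */
    uint8_t tos;
    uint8_t tot_len[2];
    uint8_t id[2];
    uint8_t frag_off[2];
    uint8_t ttl;
    uint8_t protocol;     /* what follows the IP header */
    uint8_t check[2];
    uint8_t saddr[4];     /* source address */
    uint8_t daddr[4];     /* destination address */
};
static_assert(sizeof(struct ipv4_hdr) == 20, "IPv4 header is at least 20 bytes");

/* Cast the bytes at `offset` to an IPv4 header, or NULL if they cannot be one. */
static const struct ipv4_hdr *parse_ipv4(const uint8_t *data, size_t data_len,
                                         size_t offset)
{
    if (offset > data_len || data_len - offset < sizeof(struct ipv4_hdr))
        return NULL;
    const struct ipv4_hdr *ip = (const struct ipv4_hdr *)(data + offset);
    size_t hdr_len = (size_t)(ip->ver_ihl & 0x0f) * 4;
    if ((ip->ver_ihl >> 4) != 4 || hdr_len < sizeof(struct ipv4_hdr) ||
        data_len - offset < hdr_len)
        return NULL;
    return ip;
}
```

With an Ethernet header in front, the offset passed in would be 14, and ip->protocol then says which cast comes next.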
Once we’ve looked inside the IP header and determined that the data type is TCP (it could be UDP or something else), we will cast the data after both the Ethernet header and the IP header to the type TCP header! (Almost there.) With access to the contents of the TCP header we have the TCP-specific data, such as the source port or destination port and the checksum to ensure validity of the data, amongst other useful variables.
We now have almost everything, however the TCP header can be variable length, so we need to determine its length by looking at the data offset variable (doff in the kernel’s struct tcphdr), which counts 32-bit words and so needs multiplying by 4. We now have everything we need to get to the final data!
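As a sketch of the TCP step, the struct below mirrors the fixed 20-byte part of a TCP header. The names tcp_hdr, tcp_hdr_len and tcp_dest_port are made up for this example. The header length lives in the top four bits of the thirteenth byte and is expressed in 32-bit words, hence the multiply by 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte layout of the fixed part of a TCP header (illustrative). */
struct tcp_hdr {
    uint8_t source[2];   /* source port, big-endian */
    uint8_t dest[2];     /* destination port, big-endian */
    uint8_t seq[4];
    uint8_t ack_seq[4];
    uint8_t doff_res;    /* data offset in the high 4 bits */
    uint8_t flags;
    uint8_t window[2];
    uint8_t check[2];    /* checksum */
    uint8_t urg_ptr[2];
};
static_assert(sizeof(struct tcp_hdr) == 20, "fixed TCP header is 20 bytes");

/* Header length in bytes: the data offset counts 32-bit words. */
static size_t tcp_hdr_len(const struct tcp_hdr *tcp)
{
    return (size_t)(tcp->doff_res >> 4) * 4;
}

/* Destination port in host byte order. */
static uint16_t tcp_dest_port(const struct tcp_hdr *tcp)
{
    return (uint16_t)((tcp->dest[0] << 8) | tcp->dest[1]);
}
```

A data offset of 5 gives the minimum 20-byte header; anything larger means TCP options are present and the application data starts later.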
So, *data points to the beginning of the data! We have determined that there is an Ethernet header followed by an IP header and finally a TCP header, which means *data + Ethernet header + IP header + TCP header = actual application data!
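Putting the whole sum together, a compact user-space sketch could work directly on byte offsets. payload_offset is a hypothetical function, and it only handles the Ethernet + IPv4 + TCP case described above (no VLAN tags, no IPv6); anything else returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN   14
#define ETH_TYPE_IPV4 0x0800
#define IP_PROTO_TCP  6

/* Offset of the application data inside `data`, or -1 when the packet is
 * not Ethernet + IPv4 + TCP or is too short for its own headers. */
static long payload_offset(const uint8_t *data, size_t data_len)
{
    assert(data != NULL);

    /* Ethernet: fixed 14 bytes, EtherType in bytes 12 and 13. */
    if (data_len < ETH_HDR_LEN)
        return -1;
    if (((data[12] << 8) | data[13]) != ETH_TYPE_IPV4)
        return -1;

    /* IPv4: length is the low nibble of byte 0 times 4, protocol is byte 9. */
    size_t ip_off = ETH_HDR_LEN;
    if (data_len - ip_off < 20)
        return -1;
    size_t ip_len = (size_t)(data[ip_off] & 0x0f) * 4;
    if ((data[ip_off] >> 4) != 4 || ip_len < 20 || data_len - ip_off < ip_len)
        return -1;
    if (data[ip_off + 9] != IP_PROTO_TCP)
        return -1;

    /* TCP: length is the high nibble of byte 12 times 4. */
    size_t tcp_off = ip_off + ip_len;
    if (data_len - tcp_off < 20)
        return -1;
    size_t tcp_len = (size_t)(data[tcp_off + 12] >> 4) * 4;
    if (tcp_len < 20 || data_len - tcp_off < tcp_len)
        return -1;

    return (long)(tcp_off + tcp_len);
}
```

For a packet with no IP or TCP options this returns 14 + 20 + 20 = 54, so the application data begins at data + 54.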
What can we do with this information ?
As we parse through the various headers, we effectively unlock more and more information at different layers of the OSI model!
[layer 2] The Ethernet header provides us with the source and destination hardware addresses. We could use this information to stop frames being processed from source MAC addresses that we know to be dangerous.
[layer 3] The IP header contains the source and destination IP addresses. Again, we can act like a firewall by having an eBPF program drop all traffic for a specific IP address. Alternatively, we could have logic that redirects traffic based upon the IP addresses, or we could even implement load balancing logic at this layer that redirects to an underlying set of other IP addresses.
[layer 4] The TCP or UDP headers define the destination port numbers, which we can use to determine what the application protocol is (i.e. port 80 typically means that the remaining *data is likely to be HTTP data). More often than not we would perform actions such as load balancing at this layer, based upon the destination (i.e. balance across multiple other load balancer addresses).
[layer 7] As mentioned the data at the end of the collection of various headers is the actual application data, which we can also parse (as long as we know the format). So for instance if an external web browser were to try and access /index.html on my machine with an eBPF program attached, I’d parse all the way to TCP to determine that it was port 80 and then the application data should be in the HTTP format. I could validate this by looking at the first three characters of application data (after all the headers), with some pseudo code like below:
    ApplicationData = EthernetHDR + IPHDR + TCPHDR // Add all header lengths together to find the data
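The rest of that check might look something like the sketch below. It assumes the three characters being tested are "GET", which is my reading rather than something the pseudo code states, and looks_like_http_get and HTTP_PORT are hypothetical names. The function takes the destination port and the application data that the header parsing has already located.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HTTP_PORT 80

/* True when traffic to port 80 starts with the three bytes "GET".
 * Assumption: only the GET method is of interest in this sketch. */
static bool looks_like_http_get(uint16_t dest_port, const uint8_t *app_data,
                                size_t app_len)
{
    assert(app_data != NULL || app_len == 0);
    if (dest_port != HTTP_PORT || app_len < 3)
        return false;
    return memcmp(app_data, "GET", 3) == 0;
}
```

A request for /index.html would begin "GET /index.html", so the first three bytes are enough to tell it apart from, say, a POST.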
Wrap up
Now that we “kind of” understand the logic, we should probably look at implementing some code to do all this... that’s for another day though.