It’s no secret that the code for Docker for Mac ultimately comes from bhyve, the FreeBSD hypervisor, and the people who have taken the time to bring it to the Darwin (Mac) platform have done a great job of tweaking the code to handle the design decisions that underpin Apple’s operating system.
Recently I noticed that the bhyve project had released code for the E1000 network card, so I decided to take the hyperkit code and see what was required to add in the PCI code. What follows is a (rambling and somewhat incoherent) overview of what was changed to move from bhyve to hyperkit, and some observations to be aware of when porting further PCI devices to hyperkit. Again, please be aware I’m not an OS developer or a hardware designer, so some of this is based upon a possibly flawed understanding… feel free to correct or teach me 🙂
Update: I’ve already heard from @justincormack that Docker for Mac uses vpnkit, not vmnet.
One of the key factors that made bhyve portable to OSX is that the Darwin kernel is loosely based upon the original kernel that powers FreeBSD (family tree from Wikipedia here), which means that much of the kernel structure and many of the API calls aren’t too different. However, OSX is aimed at the consumer market rather than the server market, so as OSX has matured Apple has stripped away some of the kernel functionality that ships by default. The obvious example is the removal of TUN/TAP devices from the kernel (they can still be exposed by loading a kext (kernel extension)), which, although problematic, hyperkit has a solution for.
When bhyve starts a virtual machine it creates the structure of the VM as requested (allocates vCPUs, allocates memory, constructs PCI devices etc.); these are then attached to device nodes under /dev/vmm, and the bhyve kernel module handles the VM execution. Being able to examine /dev/vmm/ also gives administrators a place to see which virtual machines are currently running, and allows those VMs to continue running unattended.
Internally, the bhyve userland tools make use of virtual machine contexts that link the VM name to the internal kernel structures running that VM instance. This allows a single tool to run multiple virtual machines, as you typically see with VMMs such as Xen, KVM or ESXi.
Finally, the networking configuration that takes place inside of bhyve… Unlike the OSX kernel, FreeBSD typically comes with support for TAP devices built in (if not, the command kldload if_tap is needed). Simply put, using a TAP device greatly simplifies handling guest network interfaces. When an interface is created with bhyve, a PCI network device is created inside the VM and a TAP device is created on the physical host. When network frames are written to the PCI device inside the VM, bhyve actually writes those frames onto the TAP device on the physical host (using standard read() and write() functions on file descriptors), and the packets are then broadcast out on the physical interface of the host. If you are familiar with VMware ESXi, the concept is almost identical to the way a vSwitch functions.
The first observation about the architecture of hyperkit is that all of the /dev/vmm device node code has been removed, which has the effect of making virtual machines process-based. This means that when hyperkit starts a VM it will malloc() all of the requested memory etc. and become the singular owner of the virtual machine; essentially, killing the hyperkit process will kill the VM. Internally, all of the virtual machine context code has been removed, because hyperkit process to VM is now a 1:1 association.
The decision to remove all of the context code (instead of, say, always tagging everything to a single VM context) requires noticeable changes to every PCI module that is added or ported from bhyve, as bhyve’s code is all based on creating emulated devices and applying them to a particular VM context.
To manage VM execution hyperkit makes use of the hypervisor.framework which is a simplified framework for creating vCPUs, passing in mapped memory and creating an execution loop.
Finally, there are the changes around network interfaces. From inside the virtual machine the same virtio devices are created as would be on bhyve; the difference is in linking these virtual interfaces to a physical interface, as on OSX there is no TAP device to link virtual and physical. So there currently exist two methods to pass traffic between virtual and physical hosts: the virtio to vmnet (virtio-vmnet) PCI device and the virtio to vpnkit (virtio-vpnkit) PCI device. Both use the virtio drivers (specifically the network driver) that are part of any modern Linux kernel, and then hand over to the backend of your choice on the physical system.
It’s worth pointing out here that the vmnet backend was the default networking method for xhyve. It makes use of the vmnet.framework, which, as others have mentioned, is rather poorly documented. Its design also slightly complicates things: it doesn’t create a file descriptor that the existing simple code could read() and write() from, and it requires elevated privileges to use.
With the work that has been done by the developers at Docker, a new alternative method for communicating from virtual network interfaces to the outside world has been created. The solution from Docker has two parts: the virtio-vpnkit device inside hyperkit, and the vpnkit process itself on the host.
(I will add more details around vpnkit when I’ve learnt more… or learnt OCaml, whichever comes first.)
All of the emulated PCI devices adhere to a defined set of function calls, along with a structure that defines pointers to those functions and a string that identifies the name of the PCI device (memory dump below). pci_emul_finddev(emul) will look for a PCI device (e.g. E1000, virtio-blk, virtio-net) and then manage the calling of its pe_init function, which initialises the PCI device and adds it to the virtual machine’s PCI bus as a device that the operating system can use.
Things to be aware of when porting PCI devices are:

- memory-mapping functions such as paddr_guest2host(), which maps guest physical addresses into the host’s address space
- moving the networking code from simple read() and write() calls on a file descriptor to making use of the vmnet framework
With regards to the E1000 PCI code, I’ve now managed to tie up the code so that the PCI device is created correctly and added to the PCI bus; I’m just struggling to fix the vmnet code (so feel free to take my poor attempt and fix it successfully 🙂 https://github.com/thebsdbox/hyperkit)