A peek inside Docker for Mac (Hyperkit, wait xhyve, no bhyve …)

It’s no secret that the code for Docker for Mac ultimately comes from the FreeBSD hypervisor (bhyve), and the people who have taken the time to bring it to the Darwin (Mac) platform have done a great job in tweaking the code to handle the design decisions that underpin the Apple operating system.

Recently I noticed that the bhyve project had released code for the E1000 network card, so I decided to take the hyperkit code and see what was required in order to add in the PCI code. What follows is a (rambling and somewhat incoherent) overview of what was changed to move from bhyve to hyperkit and some observations to be aware of when porting further PCI devices to hyperkit. Again, please be aware I’m not an OS developer or a hardware designer so some of this is based upon a possibly flawed understanding… feel free to correct or teach me 🙂

Update: Already heard from @justincormack about Docker for Mac, in that it uses vpnkit not vmnet.

VMM Differences

One of the key factors behind the portability of bhyve to OSX is that the Darwin kernel is loosely based upon the original kernel that powers FreeBSD (family tree from wikipedia here), which means that a lot of the kernel structure and API calls aren’t too different. However, OSX is aimed at the consumer market rather than the server market, so as OSX has matured Apple has stripped away some of the kernel functionality that ships by default. The obvious example is the removal of TUN/TAP devices from the kernel (they can still be exposed by loading a kext (kernel extension)). This is problematic, although hyperkit has a solution for it.

VM structure with bhyve

When bhyve starts a virtual machine it creates the structure of the VM as requested (allocates vCPUs, allocates the memory, constructs PCI devices etc.). These are then attached to device nodes under /dev/vmm and the bhyve kernel module handles the VM execution. Being able to examine /dev/vmm/ also gives administrators a place to see which virtual machines are currently running, and allows those machines to continue running unattended.

Internally the bhyve userland tools make use of virtual machine contexts that link the VM name to the internal kernel structures that are running the VM instance. This allows a single tool to manage multiple virtual machines, as you typically see with VMMs such as Xen, KVM or ESXi.

Finally, the networking configuration that takes place inside of bhyve… Unlike the OSX kernel, FreeBSD typically comes prebuilt with support for TAP devices (if not, the command kldload if_tap is needed). Simply put, a TAP device greatly simplifies the usage of guest network interfaces. When an interface is created with bhyve, a PCI network device is created inside the VM and a TAP device is created on the physical host. When network frames are written to the PCI device inside the VM, bhyve actually writes these frames onto the TAP device on the physical host (using the standard write() and read() functions on file descriptors) and those packets are then broadcast out on the physical interface on the host. If you are familiar with VMware ESXi then the concept is almost identical to the way a vSwitch functions.
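
To make the read()/write() relay a bit more concrete, here is a rough sketch in C. It is not the bhyve code: relay_frame is a hypothetical helper name, and the two file descriptors simply stand in for "where the guest's frame comes from" and "the TAP device on the host" (on FreeBSD the latter would typically come from opening a /dev/tapN node).

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FRAME 1518 /* classic Ethernet maximum frame size, no VLAN tag */

/*
 * Hypothetical helper: read one frame from src_fd and write it to dst_fd.
 * Returns the number of bytes relayed, 0 on end of file, or -1 on error.
 */
ssize_t
relay_frame(int src_fd, int dst_fd)
{
	unsigned char frame[MAX_FRAME];
	ssize_t n, written;

	assert(src_fd >= 0 && dst_fd >= 0);

	/* Retry the read if a signal interrupts it. */
	do {
		n = read(src_fd, frame, sizeof(frame));
	} while (n < 0 && errno == EINTR);
	if (n <= 0)
		return n;

	/* One write() per frame: a short write is treated as an error. */
	do {
		written = write(dst_fd, frame, (size_t)n);
	} while (written < 0 && errno == EINTR);

	return (written == n) ? n : -1;
}
```

The point of the sketch is only that, with a TAP device, the host side of a guest NIC is an ordinary file descriptor, so the device emulation needs nothing more exotic than read() and write().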

[Figure: bhyve network]

VM Structure with Docker for Mac (hyperkit)

The first observation about the hyperkit architecture is that all of the device node code (/dev/vmm/) has been removed, which has the effect of making virtual machines process based. This means that when hyperkit starts a VM it will malloc() all of the requested memory etc. and it becomes the sole owner of the virtual machine; essentially, killing the hyperkit process will kill the VM. Internally all of the virtual machine context code has been removed because the hyperkit process to VM relationship is now a 1:1 association.

The design decision to remove all of the context code (instead of, say, always tagging it to a single VM context) requires noticeable changes to every PCI module that is added/ported from bhyve, as that code is all based on creating emulated devices and applying them to a particular VM context.

To manage VM execution hyperkit makes use of Hypervisor.framework, which is a simplified framework for creating vCPUs, passing in mapped memory and creating an execution loop.

Finally, there are the changes around network interfaces. Inside the virtual machine the same virtio devices are created as would be created on bhyve. The difference is in linking these virtual interfaces to a physical interface, as with OSX there is no TAP device that can be created to link virtual and physical. There currently exist two methods to pass traffic between virtual and physical hosts: the virtio to vmnet (virtio-vmnet) and the virtio to vpnkit (virtio-vpnkit) PCI devices. Both use the virtio drivers (specifically the network driver) that are part of any modern Linux kernel and then hand over to the backend of your choice on the physical system.

It’s worth pointing out here that the vmnet backend was the default networking method for xhyve and it makes use of the vmnet.framework, which as mentioned by other people is rather poorly documented. Its design also slightly complicates things, as it doesn’t create a file descriptor that would allow the existing simple code to read() and write(), and it requires elevated privileges to use.

With the work that has been done by the developers at Docker, a new alternative method for communicating from virtual network interfaces to the outside world has been created. The solution from Docker has two parts:

  • The virtio-vpnkit device inside hyperkit that handles the reading and writing of network data from the virtual machine
  • The vpnkit component that has a full TCP/IP stack for communication with the outside world.

(I will add more details around vpnkit when I’ve learnt more … or learnt OCaml, whichever comes first)

Networking overviews

bhyve overview (TAP devices)

[Figure: bhyve_traffic]

xhyve/hyperkit overview (VMNet devices)

[Figure: hyperkit_traffic]

Docker for Mac / hyperkit overview (vpnkit)

[Figure: docker_traffic]

Porting (PCI devices) from bhyve to hyperkit

The emulated PCI devices all adhere to a defined set of function calls, along with a structure that holds pointers to those functions and a string that identifies the name of the PCI device (memory dump below).

[Image: memory dump of the PCI device function structure (pci_functions)]

pci_emul_finddev(emul) will look for a PCI device by name (e.g. E1000, virtio-blk, virtio-net); its pe_init function is then called to initialise the PCI device and add it to the virtual machine’s PCI bus as a device that the operating system can use.
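
As a rough illustration of that "name plus function pointers" pattern, here is a heavily simplified sketch in C. Only the field names pe_emu and pe_init are borrowed from the real code; the structure name, the stub init functions, the static table and the reduced pe_init signature are all my own assumptions for the sake of a small example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the per-device structure: a name plus callbacks. */
struct pci_devemu_sketch {
	const char *pe_emu;               /* device name used for lookup */
	int (*pe_init)(const char *opts); /* simplified signature (assumption) */
};

/* Hypothetical init routines; a real device would set up its registers here. */
static int e1000_init_stub(const char *opts) { (void)opts; return 0; }
static int virtio_blk_init_stub(const char *opts) { (void)opts; return 0; }

static const struct pci_devemu_sketch devices[] = {
	{ "e1000", e1000_init_stub },
	{ "virtio-blk", virtio_blk_init_stub },
};

/* Look a device emulation up by name, returning NULL if it is unknown. */
const struct pci_devemu_sketch *
finddev_sketch(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++) {
		if (strcmp(devices[i].pe_emu, name) == 0)
			return &devices[i];
	}
	return NULL;
}

/* Find the device and run its init callback; -1 means "no such device". */
int
attach_device_sketch(const char *name, const char *opts)
{
	const struct pci_devemu_sketch *dev = finddev_sketch(name);

	return (dev != NULL) ? dev->pe_init(opts) : -1;
}
```

Porting a new device then largely comes down to supplying one more entry: a name and the callbacks that go with it.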

Things to be aware of when porting PCI devices are:

  • Removing VM context aware code; as mentioned, it is a 1:1 relationship between hyperkit and the VM.
    • This also includes tying up paddr_guest2host(), which maps guest physical addresses to addresses the host process can use.
  • Moving networking code from using TAP devices with read() and write() to making use of the vmnet framework.
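
To show why dropping the VM context changes every call site, here is a minimal sketch of a context-free guest-to-host address translation in C. It assumes a single contiguous block of guest memory owned by the process, which is a simplification; guest_mem_alloc and guest2host_sketch are hypothetical names, not the hyperkit functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One VM per process, so guest memory can simply be process-wide state. */
static uint8_t *guest_base;
static size_t guest_size;

/* Allocate zeroed guest memory; returns 0 on success, -1 on failure. */
int
guest_mem_alloc(size_t size)
{
	assert(size > 0);
	guest_base = calloc(1, size);
	if (guest_base == NULL)
		return -1;
	guest_size = size;
	return 0;
}

/*
 * Translate a guest physical address range into a host pointer.
 * Note there is no VM context argument: the process is the VM.
 * Returns NULL if the range falls outside guest memory.
 */
void *
guest2host_sketch(uintptr_t gaddr, size_t len)
{
	if (guest_base == NULL || gaddr > guest_size ||
	    len > guest_size - gaddr)
		return NULL;
	return guest_base + gaddr;
}
```

In bhyve-style code the equivalent call takes the VM context as its first argument, so each ported device needs those arguments stripped (or the context faked) wherever guest memory is touched.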

With regards to the E1000 PCI code, I’ve now managed to tie up the code so that the PCI device is created correctly and added to the PCI bus. I’m just struggling to fix the vmnet code (so feel free to take my poor attempt and fix it successfully 🙂 https://github.com/thebsdbox/hyperkit)

Further reading

http://bhyve.org/bhyve-fosdem2013.pdf

https://wiki.freebsd.org/bhyve

https://github.com/docker/hyperkit

EVO:RAIL – LoudMouth aka Zeroconf

What is Zeroconf?

Zeroconf was first proposed in November 1999 and finalised in 2003. It has found its largest adoption in Mac OS products, nearly all networked printers and devices from other network device vendors. The most obvious and recognisable implementation of zeroconf is Bonjour, which has been part of Mac OS since version 9 and is used to provide a number of shared network services. The basics of Zeroconf are explained quite simply on zeroconf.org with the following (abbreviated) statement: “making it possible to take two laptop computers, and connect them … without needing a man in a white lab coat to set it all up for you”.

Basically zeroconf allows a server/appliance and a client device to discover one another without any networking configuration. It is comparable to DHCP in some regards, in that a computer with no network configuration can send out a DHCP request (essentially asking to be configured by the DHCP server) and the response will be an assigned address and further configuration allowing communication on the network. Where it differs is that zeroconf also allows for advertisement of services (Time Capsule, printer services, iTunes shared libraries etc.), and it can also advertise small amounts of data to identify itself as a type of device.

A Time Machine advertisement over zeroconf (MAC address removed):

[dan@mgmt ~]$ avahi-browse -r -a -p -t | grep TimeMachine
+;eth0;IPv4;WDMyCloud;Apple TimeMachine;local
=;eth0;IPv4;WDMyCloud;Apple TimeMachine;local;WDMyCloud.local;192.168.0.249;9;"dk0=adVN=TimeMachineBackup,adVF=0x83" "sys=waMA=00:xx:xx:xx:xx:xx,adVF=0x100"

How the EVO:RAIL team are using Zeroconf

From recollection of the deep-dive sessions, I may have mistaken the point (corrections welcome).

Zeroconf has found the largest adoption in networked printers and Apple Bonjour services; in the server deployment area, however, a combination of DHCP and MAC address matching is more commonly used (Auto Deploy or kickstart from PXE boot).

The EVO:RAIL team have implemented a Zeroconf daemon that lives inside every vSphere instance and inside the VCSA instance. The daemon inside the VCSA wasn’t really explained however the vSphere daemon instances allow the EVO:RAIL engine to discover them and take the necessary steps to automate their configuration.

Implementing Zeroconf inside vSphere (ESXi)

The EVO:RAIL team had to develop their own zeroconf daemon, named loudmouth, that is coded entirely in Python. The reason behind this was explained in one of the technical deep dives: the majority of pre-existing zeroconf implementations have dependencies on various Linux shared libraries.

/lib # ls *so | wc -l
86
/lib # uname -a
VMkernel esxi02.fnnrn.me 5.5.0 #1 SMP Release build-1331820 Sep 18 2013 23:08:31 x86_64 GNU/Linux
....
[dan@mgmt lib]$ ls *so | wc -l
541
[dan@mgmt lib]$ uname -a
Linux mgmt.fnnrn.me 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 09:01:47 CEST 2013 x86_64 GNU/Linux

As the quick example above shows (32bit libs), a vSphere instance contains only a few ELF based libraries, providing a limited subset of shared functionality. This means that whilst ELF based binaries can be moved from a Linux distribution over to a vSphere instance, the chances are that a requirement on a shared library won’t be met. Furthermore, building a static binary possibly won’t help, as the VMkernel (VMware’s kernel implementation) doesn’t implement the full set of Linux syscalls. This makes sense as it’s not an OS implementation; the userland area of vSphere is purely for management of the hypervisor. The biggest issue for an implementation of zeroconf, which relies on UDP datagrams, is the lack of an implementation of IP_PKTINFO.
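
For anyone wanting to check a platform for themselves, a small probe along these lines should do. This is a sketch assuming a POSIX-style sockets API with the Linux spelling of the option; udp_supports_pktinfo is a hypothetical name, and it makes no claim about what any particular VMkernel build returns.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Probe whether a UDP socket accepts the IP_PKTINFO option.
 * Returns 1 if the option can be enabled, 0 if it is rejected or not
 * defined at build time, and -1 if no UDP socket could be created.
 */
int
udp_supports_pktinfo(void)
{
	int on = 1;
	int rc;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
#ifdef IP_PKTINFO
	rc = setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
#else
	(void)on;
	rc = -1; /* the headers do not even define the option */
#endif
	close(fd);
	return (rc == 0) ? 1 : 0;
}
```

mDNS responders typically want this option so they can tell which interface and destination address a multicast datagram arrived on, which is why its absence is such a blocker.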

This rules out avahi, Zero Conf IP (zcif), and Linux implementations of mDNSResponder.

What about loudmouth?

Unfortunately it is yet to be said if any components of EVO:RAIL will be open sourced or back ported to vSphere, so whilst VMware have a zeroconf implementation for vSphere it is likely it will remain proprietary.

What next…

I’ve improved on where I’ve been with my daemon, and I’m hoping to upload it to GitHub sooner rather than later. Unfortunately work has occupied most of the weekend and most evenings so far. That, together with catching up on episodes of Elementary and dealing with endless segfaults as I add any simple functionality, has slowed the progress more than I was expecting.

Also I decided to finish writing up this post, which took most of this evening 😐

Debugging on vSphere

A summary of what to expect inside vSphere can be read here (http://www.v-front.de/2013/08/a-myth-busted-and-faq-esxi-is-not-based.html), and there is no point duplicating existing information. More importantly, when dealing with the vSphere userland libraries (or, more accurately, the lack of them), strace is hugely valuable. More details on strace can be found here (http://dansco.de/doku.php?id=technical_documentation:system_debugging).

Objective-C graphing and plotting with little-plot

As development has continued on a personal project it became obvious that I would need to implement UI elements that simply weren’t part of the Cocoa UI-kit. Essentially the main goal is presenting the user with a graph interface allowing them to quickly see a data set without having to read through line after line of figures. I looked at Core Plot (http://code.google.com/p/core-plot/), which whilst providing some great functionality looks like a HUGE amount of overkill when wanting a simple UI element.

So after a few days of tinkering I’ve created a couple of NSView subclasses; manually created views can be handed arrays and will display the data accordingly.
I present Little-plot:

The above screenshot consists of three NSViews (LineView, PieView and LabelView), which each display a line graph, a pie chart and graph labels (or legends).

The project is available on GitHub here.

Updates will appear soon, along with some real documentation.

Face Detection in OSX

After reading this (http://maniacdev.com/2011/11/tutorial-easy-face-detection-with-core-image-in-ios-5/) tutorial for iOS 5 face detection I decided to try it for plain olde OSX. I do find it somewhat annoying that the Objective-C community is only focused on development using the iOS SDKs. The main changes involved turning the UI classes into NS classes and finding ways around missing methods. I’ve uploaded my Source Code, which you can download and play with.

Window handling from the dock icon (Objective-C / Xcode)

I should probably post this stuff to stackoverflow, however I find that most of the people on there are rude and spend far too much time just berating anyone who asks for the slightest help. I apologise to anyone who had to waste time looking a little bit longer for this tip.

The default behaviour for a Cocoa window when it’s closed with cmd+w or by pressing the red X button is for the window to be closed but not destroyed (this applies to the main window; others may be created to release on close etc.). This means that the window simply needs to be sent the makeKeyAndOrderFront: message to be made visible again when its dock icon is clicked:

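- (BOOL)applicationShouldHandleReopen:(NSApplication *)theApplication hasVisibleWindows:(BOOL)flag
{
    if (flag) {
        // At least one window is still visible, so there is nothing to do.
        NSLog(@"Window already open");
    } else {
        // The window was closed but not destroyed, so bring it back.
        [_window makeKeyAndOrderFront:self];
    }
    return YES;
}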

Objective-C modal Window using sheets and Panels

Adding a modal sheet to a window in Objective-C isn’t highly complicated, however there are a number of issues to watch for that can leave you scratching your head. Most of the examples I’ve found on the internet point to an older useModal: (*window) function which is deprecated. From what I’ve read, the correct manner for using a modal dialog is to display a sheet that scrolls down from the menu bar and takes modal control. There are numerous examples of this in System Preferences:

Implementing this in an application coded with Objective-C isn’t particularly complicated, however missing a particular setting can leave you with numerous errors or cause the application to fall back to the debugger.

Cocoa libssh2 wrapper

I’ve modified a simple wrapper for the libssh2 library that now has the following functionality:

  • Code moved to separate classes to allow reusability
  • Multiple sessions to different servers can be achieved with a few lines of code
  • A Session can be passed to the operator class allowing operations (commands sent to it), more will be added
At the current time it connects fine to OSX and Linux sshd, however I can’t connect to ESXi: even with the correct password it reports it as incorrect. I think I can resolve this shortly.
Original wrapper (designed for iOS) can be found from http://lukehagan.com/ in his Git Repo.
Download here: SSH Wrapper

SSH with Cocoa (Xcode and libssh2)

I fought with this about a year ago, and for some strange reason never managed to get things to compile or link. I now chalk this up to my lack of understanding of Objective-C/linking concepts. However it turns out that it is relatively simple (ensure you have Xcode 4 installed before trying).

  1. Point browser to http://www.libssh2.org/ and download the latest snapshot to a temporary location.
  2. Open a terminal window and navigate to the directory containing the source files and run the following:
    dan$ ./configure
  3. This will output a lot of text to the terminal window, present a summary of the configuration options and create a header file needed for compilation. (Running make / make install is NOT required).
  4. Open Xcode and create a new Xcode project, which should be a (Mac OS X -> Framework & Library -> C/C++ Library) and give it a Product Name (e.g. libssh2) and ensure that the type is Static then click create.
  5. Xcode will open with an empty project displaying the Build Settings. At this point we can start adding the files that are part of the libssh2 source tree.

Further Xcode – HelloWorld

The inevitable HelloWorld application is a staple in learning a programming language, and provides the learner with the feeling of accomplishment as their first program speaks back to them… or something. Either way, this example will present us with a basic framework which we can use to build upon.

To break it down this example consists of,

– Creating a blank project in Xcode

– Using the default Delegate class and adding our own method (interface)

– Linking the GUI to our class

– Adding code to our method (implementation)

– Drinking tea