The alchemy of turning (bare) metal into a cloud

Posted on 2019-12-08 Edited on 2019-12-11

This is a bit of a retrospective based upon lessons learnt over the last few months as I’ve been working on a spare-time project to manage bare-metal provisioning. The hope of all of this was to make the experience more “cloud like”.

The retrospective

I will separate out a lot of the near misses that I’ve had during the development of a provisioning tool into a separate post, as I think some of the daft mistakes pretty much warrant something separate (apologies to the hotel in Germany, and a few corporate networks).

During a conversation with a colleague I was asked about my “takeaway” from the whole process, and my feedback was that it’s still as painful now as it ever was, because the technologies haven’t improved in roughly 20 years.

The current technologies

This was written in the later stages of 2019, and if Blade Runner was anything to go by then we should all be travelling in flying cars at this point. Sadly the cars don’t fly, and DHCP, TFTP and PXE booting are still the order of the day when it comes to provisioning bare-metal servers. ¯\_(ツ)_/¯

So what are these technologies and what do they do?

DHCP - Defined in 1993, its role is to give networking configuration to a device that requests it.
TFTP - A simple technology to transmit data (or files), usually used in conjunction with DHCP. Typically the DHCP configuration for provisioning will include configuration information that will point to a TFTP server and files for the machine to then download.
PXE - Originally standardised in 1998 this is a loadable environment that a server will execute in order to hand over to something that may load an OS or install something (such as an OS).

We can see pretty quickly that a lot of the tooling we still use today is pretty long in the tooth.

How does all this hang together?

(step 1) The server powers on and the NIC (network card) will request DHCP config.
(step 2) A DHCP server will offer a DHCP lease (an IP address, DNS, gateway and perhaps other configuration information … such as a TFTP boot path!)
(step 3) The powered-on server will examine the lease and usually decide to accept it; it will then inform the DHCP server that it’s accepted the lease offer. The DHCP server will then add the lease to its leasing tables so it won’t give that config to another server.
(step 4) The powered-on server will then apply the network configuration to the interface, and it will also examine the DHCP options and act upon those. In a provisioning environment there will be DHCP options such as option 67 (also known as Bootfile-Name), which is typically a (TFTP) path to a loadable PXE environment. This will then be fetched from the TFTP server and executed, at which point the PXE environment will start an OS or a deployment process.
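As a rough illustration of the boot options mentioned in the last step, DHCP options travel as simple code/length/value triples. The sketch below (Python, standard library only) encodes option 66 (TFTP server name) and option 67 (Bootfile-Name) and decodes them again. The server address and boot file name are made-up values, and a real lease carries many more fields than this.

```python
import struct

OPT_TFTP_SERVER = 66   # "TFTP server name"
OPT_BOOTFILE = 67      # "Bootfile-Name"
OPT_END = 255          # marks the end of the options field


def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCP option as code (1 byte), length (1 byte), value."""
    if len(value) > 255:
        raise ValueError("option value too long for a single option")
    return struct.pack("BB", code, len(value)) + value


def decode_options(raw: bytes) -> dict:
    """Walk the option list until the end marker and return {code: value}.

    Simplification: the single-byte pad option (code 0) is not handled,
    because this sketch never produces it.
    """
    options = {}
    i = 0
    while i < len(raw) and raw[i] != OPT_END:
        code, length = raw[i], raw[i + 1]
        options[code] = raw[i + 2 : i + 2 + length]
        i += 2 + length
    return options


# Hypothetical values, for illustration only.
raw = (
    encode_option(OPT_TFTP_SERVER, b"192.168.0.1")
    + encode_option(OPT_BOOTFILE, b"pxelinux.0")
    + bytes([OPT_END])
)
print(decode_options(raw)[OPT_BOOTFILE].decode())
```

The client side of step 4 is essentially `decode_options` followed by a TFTP fetch of whatever option 67 names.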

Well that all seems straightforward.. what are you complaining about?

Under most circumstances most people don’t need to care about anything that DHCP does: go to a location, add your iPhone to a network and, like magic, you’re on the network streaming ~~cat~~ goose memes. (In the above example, that only needs steps 1-3.)

The problems start to arise when we look at step 4 onwards… especially when I want a “cloud-like-experience”.

What is a cloud-like-experience?

Ideally, regardless of the environment I’d like to simply take my credit card out (or I’d rather not actually… but nothing tends to be free) click a button or two * some magic occurs * and ultimately I get some compute resource to use.

I shouldn’t have to care about:

What the physical server’s hardware address is
The physical server’s location
The physical server’s favourite movie or music
Building configuration or configuration files to define the server’s deployment
Deleting and clearing the deployment when I’ve finished (or my credit card is declined 😬)

Unfortunately that just isn’t the case today with bare-metal provisioning or lifecycle management, at least with the tooling that exists today.

You’re still complaining .. what’s the problem today?

The (Big)MAC 🍔 is king

The physical server’s hardware address is probably the most important thing required in order to provision a server. This address is called the MAC address and is a unique address that every network device has. It is part of the IEEE 802.3 standards, and the MAC address comes from work done in 1980/1982, making it older than me :-D

It is this MAC address that we use to define a physical server on a network, before it has any other defining characteristics such as an IP address. The problem that this creates is that we need to be aware of these hardware addresses before we can do any provisioning work (not very cloud like).

Configuration files … perhaps yaml isn’t so bad after all?

(Spoiler: yaml is still bad)

With a correctly configured DHCP server a newly powered-on bare-metal machine will request a network address, where it will typically be given a lease and off we go… but wait … what if a server needs to be provisioned with a specific identity?

Under most circumstances a brand-new server, once booted, will be given a network address and nothing else, at which point the server will reboot as it has nothing else to do. So how do we get to the point where the server knows to install something?

(Spoiler: more configuration)

It is at this point where we need to create specific configuration files that tie the above MAC address to some level of configuration. The PXE spec http://www.pix.net/software/pxeboot/archive/pxespec.pdf, first documented in 1998, covers the basics for this, but for the most part the following will happen:

(step 4-ish) DHCP magic has occurred and a new bare-metal server has been given an IP address and a bootrom path to a PXE boot image.
(step 5) The PXE boot image will then attempt to download a configuration file that matches the MAC address of the server currently being booted. So for the server 00:01:02:03:04:05 the PXE boot image will attempt to pull a file from the TFTP server with the name 01-00-01-02-03-04-05
(step 6) This configuration file contains all of the information (kernel, init ramdisk and other useful files) and the machine will then boot.
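The file name lookup in step 5 can be sketched in a few lines of Python. The function name `pxe_config_name` is made up for this example; the naming rule itself (a leading 01 for Ethernet, then the MAC in lowercase hex separated by dashes) is the one described above.

```python
import re


def pxe_config_name(mac: str) -> str:
    """Return the per-machine config file name for a MAC address.

    The leading "01" is the hardware type for Ethernet; the rest is the MAC
    in lowercase hex with dashes instead of colons.
    """
    octets = re.split(r"[:-]", mac.strip().lower())
    if len(octets) != 6 or not all(re.fullmatch(r"[0-9a-f]{2}", o) for o in octets):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return "01-" + "-".join(octets)


print(pxe_config_name("00:01:02:03:04:05"))  # 01-00-01-02-03-04-05
```

Even this tiny helper hints at the problem: every machine needs its own file, and a single mistyped octet produces a file that nothing will ever request.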

From this we can start to see that the MAC address (or unique identifier) of the physical machine first has to be known. Then we need to craft some artisanal PXE configuration that is specific for this server before it attempts to boot.

Furthermore, imagine larger environments of anything more than a few dozen servers.. suddenly we have a TFTP root directory filled with various PXE files that we’re “hand-crafting”.

Example

In the event server 00:11:AA:DD:FF:45 isn’t booting correctly…

This is more than likely because you keep making the wrong choice editing the following two files:

/tftpboot/00-11-44-dd-ff-45
/tftpboot/00-11-aa-dd-ff-45

I’ve managed to get my server deployed! .. hurrah .. now what?

I’m skipping over things like the Ubuntu/Debian preseed and the RedHat/CentOS kickstart as these are such large systems, yet so poorly documented, that I’ll probably have to split them out.. BUT at this point our server should have an OS installed, hurrah!

It’s at this point where we typically would need yet another system or set of automation tooling. This tooling would be required to perform another set of steps to provision things like applications or cluster platforms, or even just to finish customising the Operating System installation where the previously mentioned systems can’t automate.

I want to do something different with this server now

In a cloud environment, when we’re done with a resource we will typically delete it.

However this operation doesn’t lend itself particularly well to bare-metal infrastructure, typically because there isn’t really a fool-proof or standardised way to automate the wiping and freeing of physical infrastructure.

At the moment, the most fool-proof way of accomplishing this would be to log into the OOB (out-of-band) management of a server and instruct the disk controller (RAID controller) to wipe the disks, and then reboot the server, returning it to its blank state. This is still typically a manual task for the following reasons:

Every OOB system is different (APIs/CLIs etc.)
A lot of OOB systems require licenses
No standardised API (Redfish attempted it…)
Not every server even has OOB

So I understand all the steps I need to make this work, how do I automate it?

Cue “A-Team music” and a lot of cups of tea

Modernising Bare-Metal Deployments

Unfortunately it’s not as simple as plucking PXEv2 out of thin air :-( these technologies are “literally” burnt into the hardware and can’t simply be changed. So what can we do?

In order to modernise and make the existing tooling behave more cloud like we can consider the following additions:

Phase 1: Simplify the tooling

Currently before anything can even be provisioned various services need to be enabled (DHCP, TFTP, HTTP, PXE, Configuration files, etc…). All of these are separate installations and configurations.

To simplify this we can make use of modern languages to create a new single server that encompasses all of these functions.

Note: I know micro-services are new hot-ness, but given these technologies are so old they are unlikely to change anytime soon I think it might be relatively safe to bundle them as a monolith.

Bundling these services together into a single binary allows a single configuration to stretch between all three services allowing:

Single binary
Single configuration
Shared state between the services
Simple deployment
Auto-detect of the environment, simplifying the configuration design

This would allow us the capability of doing something like:

./voldemort config detect > config.json
./voldemort start services --config ./config.json

In the event we want to autodetect the configuration from a different interface we could extend the auto-detection:

./voldemort config detect --interface eth0 > config.json

The above example will create a single configuration based upon the configuration of the deployment server and then start the needed services using that configuration. This vastly simplifies a lot of the required steps.
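To give a feel for what a “detect” step might produce, here is a minimal Python sketch that derives one shared configuration from a single interface address. Everything in it is an assumption made for illustration: the function name, the JSON layout and the choice to lease addresses from the top of the subnet are not taken from the real tool.

```python
import ipaddress
import json


def detect_config(interface_cidr: str, pool_size: int = 50) -> dict:
    """Derive one shared config from an address such as '192.168.0.1/24'.

    Building the full host list is fine for small subnets; a very large
    network would want arithmetic on the addresses instead.
    """
    iface = ipaddress.ip_interface(interface_cidr)
    # Every usable address in the subnet except the deployment server itself.
    hosts = [h for h in iface.network.hosts() if h != iface.ip]
    pool = hosts[-pool_size:]  # lease from the top of the subnet
    server = str(iface.ip)
    return {
        "dhcp": {
            "listen": server,
            "netmask": str(iface.network.netmask),
            "lease_start": str(pool[0]),
            "lease_end": str(pool[-1]),
        },
        # TFTP and HTTP share the same address, so one value configures both.
        "tftp": {"address": server},
        "http": {"address": server},
    }


print(json.dumps(detect_config("192.168.0.1/24"), indent=2))
```

The point of the sketch is that a single input (the interface address) is enough to fill in every service’s settings, which is what makes a single configuration file practical.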

Phase 2: Enable via an API

To simplify provisioning we really need to wrap all of the management and configuration steps that were manual, with an API that will allow higher level systems and integrations to ease interaction and automation.

Ideally a basic API should provide the following:

Manage server/services configuration
Create/Delete deployment configurations (OS type/packages etc.)
Create/Delete machine deployment using a MAC address and a deployment type

Additionally, to make deployments more * Cloud-like * I don’t want to have to be aware of infrastructure before I want to consume it. To alleviate that we can cobble together some functionality to “register” hardware and abstract it away from the end user.

When a server first starts we can register its MAC address as “free” and then either force the server to reboot until we need it, or we can power the server off and use something like WoL (Wake On LAN) when it’s time to provision. Doing this means that the end-user only needs to request resource, and if available our API can provision on available bare-metal.
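A minimal sketch of that registration idea in Python follows. The class and method names are invented for this example. The Wake-on-LAN packet layout (six 0xFF bytes followed by the MAC address repeated sixteen times) is the standard “magic packet”; actually sending it as a UDP broadcast is left out.

```python
from typing import Dict, Optional


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


class HardwareRegistry:
    """Tracks which MAC addresses are free and which are in use."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}  # MAC -> "free" or a deployment name

    def register(self, mac: str) -> None:
        """Called when an unknown machine is first seen on the network."""
        self._state.setdefault(mac.lower(), "free")

    def allocate(self, deployment: str) -> Optional[str]:
        """Hand out any free machine, or None if nothing is available."""
        for mac, state in self._state.items():
            if state == "free":
                self._state[mac] = deployment
                return mac
        return None

    def release(self, mac: str) -> None:
        """Mark a machine as free again once its deployment is deleted."""
        self._state[mac.lower()] = "free"


registry = HardwareRegistry()
registry.register("00:11:22:33:44:55")
mac = registry.allocate("Harry")
print(mac, len(magic_packet(mac)))
```

With something like this behind the API, the end user asks for “a server” and never has to know which MAC address they were given.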

An API server makes it a much better experience for other communities to build additional tooling and automation that can sit on top of a deployment engine and consume its functionality.

./voldemort start services --config ./config.json
[00001] API Server listening on port 12345

# We can now have a separate CLI tool that can speak to the deployment engine over the network

./voldemortctl get deployments
Deployment	MAC			Address
Harry 		00:11:22:33:44:55	192.168.0.100
Potter		11:22:33:44:55:66	192.168.0.101

Phase 3: Remove the need for lots of configuration files

We’ve simplified the deployment of the various services at this point, and we can interact with these services and their configuration through an API. What’s next…

As mentioned above, currently the main method for bare-metal deployments (following PXE standards) typically involves creating a file for EACH piece of physical hardware we want to provision that matches the hardware MAC address. This is error prone, hard to update, easy to break and over time will result in a build up of old configs.

To make life easier, move quicker and ensure consistency we can hold these configurations in memory. This allows us to not care about things like underlying filesystems, old persisting configuration etc… and have a single “source of truth” of configuration.

Moving to in-memory configurations removes a lot of additional tasks and greatly simplifies the management of configurations. We can now hold the configurations as data in memory and point URLs or TFTP paths to these blocks of data. If a new configuration is created or updated then this block of in-memory data is modified, and the request to access it via URL/TFTP will just point to the updated block of data. If a deployment is deleted then we either delete the block of data or point the deployment to something else.

./voldemortctl create deployment --MAC 00:11:22:33:44:55 \
--address 192.168.0.100


         ----------------
        | Deployment data| (192.168.0.100)
         ----------------
            |      |
 -----------       --------- 
| TFTP path |     | HTTP URL|   
 -----------       ---------

./voldemortctl update deployment --MAC 00:11:22:33:44:55 \
--address 192.168.0.200


         ----------------
        | Deployment data| (192.168.0.200)
         ----------------
            |      |
 -----------       --------- 
| TFTP path |     | HTTP URL|   
 -----------       ---------
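The two diagrams above can be sketched as a small in-memory store in Python, where the TFTP path and the HTTP URL are just two lookups onto the same block of data. The class name and the URL layout are assumptions for this example only.

```python
from typing import Dict


class DeploymentStore:
    """Holds rendered configs in memory, keyed by MAC address."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, mac: str, config: str) -> None:
        """Create or update the single copy of a machine's config."""
        self._data[mac.lower()] = config.encode()

    def delete(self, mac: str) -> None:
        self._data.pop(mac.lower(), None)

    def by_tftp_path(self, path: str) -> bytes:
        # TFTP clients ask for a name such as "01-00-11-22-33-44-55".
        name = path.rsplit("/", 1)[-1]
        if not name.startswith("01-"):
            raise KeyError(path)
        return self._data[name[3:].replace("-", ":")]

    def by_http_url(self, url_path: str) -> bytes:
        # Hypothetical URL layout: "/config/00:11:22:33:44:55"
        return self._data[url_path.rsplit("/", 1)[-1].lower()]


store = DeploymentStore()
store.put("00:11:22:33:44:55", "address=192.168.0.100")
store.put("00:11:22:33:44:55", "address=192.168.0.200")  # the update case
print(store.by_tftp_path("01-00-11-22-33-44-55").decode())
print(store.by_http_url("/config/00:11:22:33:44:55").decode())
```

Because both lookups resolve to the same entry, an update is visible over TFTP and HTTP at once and there is never a stale file left behind on disk.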

Phase 4: Additional things to make the end-user’s life easier

The above three phases have wrapped a lot of the older technologies with modern functionality to greatly ease deployment, automation and management… there are still a lot of other areas that can also make things easier.

Below are a few additions that when added should make things easier for end-users.

Templating of preseed/kickstart

All of the above technology/architecture options will provide a much better platform from which to begin the provisioning. However the technologies that automate the deployment of an Operating System are usually tied very closely to that particular OS. In my experience, at least from a Linux perspective, these tend to be preseed for Ubuntu/Debian and kickstart for CentOS and RedHat. Both suffer from slightly complex documentation and a myriad of out-of-date results on the internets that can be confusing.

We can alleviate a lot of these issues by templating a lot of the standard OS deployment options and presenting a cleaner and simplified set of deployment options, allowing the end user to go FULL deployment if wanted.

Below is an example of how this could look:

./voldemortctl deploy --MAC 00:11:22:33:44:55 \
--address 192.168.0.100 \
--packages openssh,mariadb,docker \
--deploymentType kickstart
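As a rough sketch of what the templating behind a command like this might look like, the Python below fills a tiny kickstart fragment from the same options. It is deliberately incomplete: a working kickstart file needs far more than a network line and a package list, and the function name and template are assumptions for illustration.

```python
from string import Template
from typing import List

# Only a fragment of a kickstart file: one network line and a package section.
KICKSTART = Template(
    "network --bootproto=static --ip=$address\n"
    "%packages\n"
    "$packages\n"
    "%end\n"
)


def render_kickstart(address: str, packages: List[str]) -> str:
    """Fill the template from the simplified deployment options."""
    return KICKSTART.substitute(address=address, packages="\n".join(packages))


print(render_kickstart("192.168.0.100", ["openssh", "mariadb", "docker"]))
```

The end user only ever sees the short list of options; the OS-specific syntax stays inside the template.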

Installation Media

The majority of Linux distributions release their Operating Systems in the form of ISOs (CD-ROM/DVD images) … when was the last time anyone saw a DVD drive?

Regardless, in order to read the contents of the installation media, elevated privileges are usually required to mount it. We can simplify this by having our deployment engine read content directly from the ISO.

./voldemortctl create boot \
--name preseed \
--isoPath "/nfs/Operating Systems/Ubuntu/ubuntu-16.04.5-server-amd64.iso" \
--isoPrefix ubuntu \
--kernel ubuntu/install/netboot/ubuntu-installer/amd64/linux \
--initrd ubuntu/install/netboot/ubuntu-installer/amd64/initrd.gz

By doing something like the above we can link our templated deployments to the contents of Operating System images.
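Reading from an ISO without mounting it is mostly a matter of parsing the image as an ordinary file. As a small taste of that, the Python sketch below reads the volume label from the ISO 9660 primary volume descriptor, which sits at sector 16 of the image. It assumes the primary descriptor is the first one in the descriptor set (the usual layout), and it uses a fake in-memory image so that it runs without a real ISO. Walking the directory tree to reach the kernel and initrd is a larger job and is not shown.

```python
import io
from typing import BinaryIO

SECTOR = 2048     # ISO 9660 logical sector size
PVD_SECTOR = 16   # volume descriptors start at sector 16


def iso_volume_id(image: BinaryIO) -> str:
    """Read the volume label straight from the image, no mount needed."""
    image.seek(PVD_SECTOR * SECTOR)
    descriptor = image.read(SECTOR)
    # Byte 0 is the descriptor type (1 = primary), bytes 1-5 the magic "CD001".
    if len(descriptor) < SECTOR or descriptor[0] != 1 or descriptor[1:6] != b"CD001":
        raise ValueError("no ISO 9660 primary volume descriptor found")
    # The volume identifier occupies bytes 40-71, padded with spaces.
    return descriptor[40:72].decode("ascii").strip()


def fake_iso(volume_id: str) -> bytes:
    """Build just enough of an image for the demo: 16 empty sectors + a PVD."""
    descriptor = b"\x01CD001\x01\x00" + b" " * 32 + volume_id.encode("ascii").ljust(32)
    return bytes(PVD_SECTOR * SECTOR) + descriptor.ljust(SECTOR, b"\x00")


print(iso_volume_id(io.BytesIO(fake_iso("UBUNTU_SERVER"))))
```

With a real image the call would be `iso_volume_id(open(path, "rb"))`, which needs read access to the file but no elevated privileges.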

Higher level abstractions

Finally we have a fully automated platform that can be configured through an API and automatically provisions onto available hardware. We can now start to build on top of that..

Terraform A provider could easily leverage these APIs to find available hardware and automate the provisioning of bare metal using the Terraform DSL.

Kubernetes Through the use of Cluster-API, we can extend Kubernetes with the logic of machines and clusters. Extending that further to interact with the API server would allow Kubernetes to deploy new or extend existing clusters.

Alternative deployment architectures

For a long time the most common way to provision or deploy a new operating system was to make use of technologies such as preseed or kickstart. These technologies will step through scripted installers and install every package that is a dependency. Whilst this is a tried and tested technology, it is a mutable method and errors can creep in: network delays/failures or bugs in installation scripts can at points lead to failed installs… however there are alternatives.

Image deployments

There has been a growing trend in the infrastructure space to focus on using a prebuilt image as the method for deployment. This approach can find its roots in technologies such as Norton Ghost and probably in other tooling prior. The methodology is relatively straightforward: a “golden image” is created, hopefully through an automated process, and that image is then written to the bare-metal server during provisioning. This ensures that the installations are always identical and skips any additional steps such as installation scripts or other unknowns.
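The core of an image deployment is little more than a streamed copy. The Python sketch below writes an image to a target in chunks and returns a SHA-256 digest so the result can be compared against the golden image. The function name and chunk size are assumptions for this example, and in-memory buffers stand in for the image file and the disk.

```python
import hashlib
import io
from typing import BinaryIO


def write_image(image: BinaryIO, target: BinaryIO, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Stream an image onto a target and return the SHA-256 of what was written."""
    digest = hashlib.sha256()
    while True:
        chunk = image.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-ins; on a real machine the target would be a block device.
golden = io.BytesIO(b"golden image contents" * 1000)
disk = io.BytesIO()
checksum = write_image(golden, disk, chunk_size=4096)
print(checksum == hashlib.sha256(disk.getvalue()).hexdigest())
```

Because every server receives the same bytes, a matching checksum is enough to know that two installations are identical.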

Alternatives (or things to be aware of)

NFS root disk(s)

Another alternative is for the deployment server to provide a kernel to boot over the network, along with the path to an NFS share where the root file system lives. Typically this involves a standard kernel and an init ramdisk that has NFS client utils present. Then, as part of the deployment configuration, we present the bare-metal server with the kernel and ramdisk, and finally additional flags that tell the ramdisk to mount the NFS share and use it as a root disk.

Example PXELINUX-style config:

kernel vmlinuz
append initrd=initramfs.img root=nfs:server-ip:/exported/root rw

Example Deployment engine:

./voldemortctl create boot \
--name preseed \
--cmdline "root=nfs:server-ip:/exported/root rw" \
--kernel ubuntu/install/netboot/ubuntu-installer/amd64/linux \
--initrd ubuntu/install/netboot/ubuntu-installer/amd64/initrd.gz

iSCSI snapshot root disk(s)

A final option could be to utilise an iSCSI share and snapshots, where we’d create a master iSCSI LUN. We could then utilise snapshots to allow clients to boot from the underlying master LUN while keeping any of their changes in the snapshot that we’ve created for them. This can be achieved through the functionality that PXE boot loaders like iPXE currently support.

Conclusion

This conclusion is relatively short, but hopefully covers the points this post has been trying to make. The technologies and tooling that we have today are old; however, they are for the most part stable and lend themselves well to higher abstractions that can seek to modernise bare-metal deployments. This will allow us, for the time being, to build systems on top that can ease automation and link bare-metal deployments into new platforms.

My hope is that we get to a point where hardware vendors can agree to work on something new, something with automation and an API at its core. Also something that is open, because there are some technologies that can help today but not without a specific license for a specific vendor.

Voldemort ?

Currently nearly everything that is discussed in this post already exists (to a degree, even the mocked-up CLI commands). However, as this project currently isn’t really public, I’ve not used its name …

hence:
“The deployment engine who must not be named”