Create the project with kubebuilder

kubebuilder init --domain fnnrn.me --owner "Dan Finneran"

kubebuilder create api --group katnav --kind Directions --version v1

Edit the Resources

The path is api/v1/directions_types.go

Edit the Directions Specification DirectionsSpec

The scaffolded example is the usual Foo placeholder:

// Foo is an example field of Directions. Edit directions_types.go to remove/update
Foo string `json:"foo,omitempty"`

We will change this to what we need for the directions to actually work:

// Source is where the beginning of our journey is
Source string `json:"source"`

// Destination is the end of our journey
Destination string `json:"destination"`

Edit the Directions Status DirectionsStatus

The Status reflects what has happened to our resource, and the expected status will be the directions.

// Directions is a list of directions to our destination
Directions string `json:"directions"`

// RouteSummary gives a simple overview of the route
RouteSummary string `json:"routeSummary"`

// StartLocation is the start point from the directions API
StartLocation string `json:"startLocation"`

// EndLocation is the end point from the directions API
EndLocation string `json:"endLocation"`

// Distance is the total distance of the journey
Distance string `json:"distance"`

// Duration is the amount of time the journey will take
Duration string `json:"duration"`

// Error captures an error message if the route isn't possible
Error string `json:"error,omitempty"`

Configure Google Account

To get an API key:

  1. Visit Google APIs Console and log in with a Google Account.
  2. Select one of your existing projects, or create a new project.
  3. Enable the API(s) you want to use. The Go Client for Google Maps Services accesses the following APIs:
  • Directions API

Create a new Server key from the credentials menu.

Billing may need to be enabled so do be careful with usage!

Apply API key to secret

This will create our secret (called katnav) within the “default” namespace:

kubectl create secret generic katnav --from-literal=directionsKey=<API TOKEN>

Add our code to the controller

The code that we’re going to add to the controller will do a few things:

  1. Connect to the Google Maps API
  2. Watch for directions resources being created within the Kubernetes API
  3. Update these resources (the status) with the directions and additional journey information

Add the Google Maps client

Add "googlemaps.github.io/maps" to the imports and a pointer to the client in the DirectionsReconciler struct:

// DirectionsReconciler reconciles a Directions object
type DirectionsReconciler struct {
	client.Client
	Scheme  *runtime.Scheme
	mClient *maps.Client
}

Initialise the Google Maps client

At the beginning of the SetupWithManager() function we will add in code that will do the following:

  1. Create our own Kubernetes client
  2. Retrieve our secret in the default namespace
  3. Create our maps client and update our pointer to the client so we can use it in the Reconcile() function later:

// So we have a Kubernetes client in r.Client, however we can't use it until the caches
// start otherwise it will just return an error. So in order to get things ready we will
// use our own client in order to get the key and set up the Google Maps client in advance
config, err := restClient.InClusterConfig()
if err != nil {
	kubeConfig := cmdClient.NewDefaultClientConfigLoadingRules().GetDefaultFilename()
	config, err = cmdClient.BuildConfigFromFlags("", kubeConfig)
	if err != nil {
		return err
	}
}
// create the clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	return err
}

secret, err := clientset.CoreV1().Secrets("default").Get(context.TODO(), "katnav", v1.GetOptions{})
if err != nil {
	return err
}

token := string(secret.Data["directionsKey"])
if token == "" {
	return fmt.Errorf("no Token found within API Key")
}
r.mClient, err = maps.NewClient(maps.WithAPIKey(token))
if err != nil {
	return err
}

Reconcile !

Below is our full Reconcile() function that does the following:

  1. Retrieves our resource that needs reconciling
  2. Gets the source/destination from the spec
  3. Creates a Google Maps Directions request to the API
  4. Updates the status of the resource with the direction information (that we format, such as removing the HTML tags etc..)
  5. Applies the updated Directions object to the Kubernetes API!
func (r *DirectionsReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Set the log to an actual value so we can create log messages
	log := log.FromContext(ctx)
	log.Info("Reconciling direction resources")

	var directions katnavv1.Directions
	if err := r.Get(ctx, req.NamespacedName, &directions); err != nil {
		if errors.IsNotFound(err) {
			// object not found, could have been deleted after
			// reconcile request, hence don't requeue
			return ctrl.Result{}, nil
		}
		log.Error(err, "unable to fetch Directions object")
		// we'll ignore not-found errors, since they can't be fixed by an immediate
		// requeue (we'll need to wait for a new notification), and we can get them
		// on deleted requests.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	log.Info("Determining journey", "Source", directions.Spec.Source, "Destination", directions.Spec.Destination)

	request := &maps.DirectionsRequest{
		Origin:      directions.Spec.Source,
		Destination: directions.Spec.Destination,
		Mode:        maps.TravelModeDriving,
	}

	route, _, err := r.mClient.Directions(context.Background(), request)
	if err != nil {
		log.Error(err, "unable to fetch Directions")
	}
	var directionsString string
	if len(route) == 0 {
		return ctrl.Result{}, nil
	}

	log.Info("New Route", "Summary", route[0].Summary)
	for x := range route[0].Legs {
		for y := range route[0].Legs[x].Steps {
			stripped := strip.StripTags(route[0].Legs[x].Steps[y].HTMLInstructions)
			directionsString += stripped + "\n"
		}
		directions.Status.Distance = route[0].Legs[x].Distance.HumanReadable
		directions.Status.Duration = fmt.Sprintf("Total Minutes: %f", route[0].Legs[x].Duration.Minutes())
		directions.Status.StartLocation = route[0].Legs[x].StartAddress
		directions.Status.EndLocation = route[0].Legs[x].EndAddress
	}

	directions.Status.RouteSummary = route[0].Summary
	directions.Status.Directions = directionsString

	err = r.Client.Status().Update(context.TODO(), &directions, &client.UpdateOptions{})
	if err != nil {
		log.Error(err, "unable to update journey")
	}
	return ctrl.Result{}, nil
}
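The strip.StripTags call above comes from a third-party HTML-stripping package. To see what that step is doing, here's a crude stand-in built on the standard library's regexp package — a sketch only, as a real HTML stripper copes with malformed markup far better:

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern matches anything that looks like an HTML tag, e.g. <b> or </div>.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripTags removes HTML tags from the Google Directions HTMLInstructions
// field, leaving only the human-readable text.
func stripTags(html string) string {
	return tagPattern.ReplaceAllString(html, "")
}

func main() {
	fmt.Println(stripTags(`Turn <b>right</b> onto <wbr/>Burgess St`))
	// → Turn right onto Burgess St
}
```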

Using !

Sheffield.yaml

This is our manifest for directions from York -> Sheffield!

apiVersion: katnav.fnnrn.me/v1
kind: Directions
metadata:
  name: directions-to-sheffield
spec:
  source: York, uk
  destination: Sheffield, uk

We can apply this to Kubernetes with:

kubectl apply -f ./sheffield.yaml

We should see the following logs from the Controller:

2021-07-28T15:56:47.901+0100    INFO    controller-runtime.manager.controller.directions        New Route       {"reconciler group": "katnav.fnnrn.me", "reconciler kind": "Directions", "name": "directions-to-sheffield", "namespace": "default", "Summary": "M1"}
2021-07-28T15:56:47.906+0100    INFO    controller-runtime.manager.controller.directions        Reconciling direction resources {"reconciler group": "katnav.fnnrn.me", "reconciler kind": "Directions", "name": "directions-to-sheffield", "namespace": "default"}
2021-07-28T15:56:47.906+0100    INFO    controller-runtime.manager.controller.directions        Determining journey     {"reconciler group": "katnav.fnnrn.me", "reconciler kind": "Directions", "name": "directions-to-sheffield", "namespace": "default", "Source": "York, uk", "Destination": "Sheffield, uk"}

Looking at our Directions 👀

If we examine the newly created directions object we should see that its status has been updated accordingly!

$ kubectl get directions directions-to-sheffield -o yaml
apiVersion: katnav.fnnrn.me/v1
kind: Directions
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"katnav.fnnrn.me/v1","kind":"Directions","metadata":{"annotations":{},"name":"directions-to-sheffield","namespace":"default"},"spec":{"destination":"Sheffield, uk","source":"York, uk"}}
  creationTimestamp: "2021-07-28T14:34:56Z"
  generation: 2
  name: directions-to-sheffield
  namespace: default
  resourceVersion: "6986253"
  uid: c4077b0b-8f2f-4fc3-82c0-2d690a24d98a
spec:
  destination: Sheffield, uk
  source: York, uk
status:
  directions: |
    Head southwest on Lendal Bridge/Station Rd/A1036 toward Rougier St/B1227Continue to follow A1036
    Keep left to continue toward Station Rd/A1036
    Continue onto Station Rd/A1036Continue to follow A1036
    Turn right onto Blossom St/A1036Continue to follow A1036
    At the roundabout, take the 2nd exit onto Tadcaster Rd Dringhouses/A1036
    Take the ramp to Leeds
    Merge onto A64
    Merge onto A1(M) via the ramp to Leeds/M1/Manchester/M62
    Keep right at the fork to continue on M1
    At junction 34, take the A6109 exit to Sheffield(E)/Rotherham(C)/Meadowhall
    At Meadowhall Roundabout, take the 4th exit onto Meadowhall Rd/A6109
    Keep right to stay on Meadowhall Rd/A6109Continue to follow A6109
    At the roundabout, take the 1st exit onto Brightside Ln/A6109Continue to follow A6109
    Slight right onto Savile St/A6109
    Turn right onto Derek Dooley Way/A61Continue to follow A61
    Slight left onto Corporation St/A61
    Slight left onto Corporation St/B6539
    At the roundabout, take the 2nd exit onto W Bar Green/B6539Continue to follow B6539
    At the roundabout, take the 3rd exit onto Broad Ln/B6539Continue to follow B6539
    At the roundabout, take the 1st exit onto Upper Hanover St
    Continue onto Hanover Way
    At the roundabout, take the 1st exit onto Moore St
    Continue onto Charter Row
    Continue onto Furnival Gate
    Furnival Gate turns left and becomes Pinstone St
    Turn right onto Burgess St
    Burgess St turns right and becomes Barker's Pool
    Barker's Pool turns left and becomes Leopold St
  distance: 93.8 km
  duration: 'Total Minutes: 77.816667'
  endLocation: Sheffield, UK
  routeSummary: M1
  startLocation: York, UK



The “perfect” virtual Tinkerbell environment on Equinix Metal

Pre-requisites

This is a rough shopping list of skills/accounts that will be a benefit for this guide.

  • Equinix Metal portal account
  • Go experience (basic)
  • iptables usage (basic)
  • qemu usage (basic)

Our Tinkerbell server considerations

Some “finger in the air” mathematics are generally required when selecting an appropriately sized physical host on Equinix Metal, but if we take a quick look at the expected requirements:

CONTAINER ID   NAME                   CPU %    MEM USAGE / LIMIT    MEM %   NET I/O           BLOCK I/O        PIDS
582ce8ba4cdf   deploy_tink-cli_1      0.00%    832KiB / 7.79GiB     0.01%   0B / 0B           13.1MB / 0B      1
4aeb11684865   deploy_tink-server_1   0.00%    7.113MiB / 7.79GiB   0.09%   75.7MB / 76.6MB   4.94MB / 0B      11
24c7c5961a2b   deploy_boots_1         0.00%    8.652MiB / 7.79GiB   0.11%   0B / 0B           5.66MB / 0B      7
37d2e840bd54   deploy_registry_1      0.02%    8.344MiB / 7.79GiB   0.10%   0B / 0B           1.32MB / 0B      11
a70a54baff08   deploy_hegel_1         0.00%    2.984MiB / 7.79GiB   0.04%   0B / 0B           246kB / 0B       5
33d953c9b057   deploy_db_1            13.66%   6.363MiB / 7.79GiB   0.08%   24.7MB / 22.2MB   1.5MB / 1.34GB   8
2fa8a6064036   deploy_nginx_1         0.00%    2.867MiB / 7.79GiB   0.04%   341kB / 0B        0B / 0B          3

We can see that the components of the Tinkerbell stack are particularly light. With this in mind we can be very confident that all of our userland components (Tinkerbell/docker/bash etc..) will fit within 1GB of RAM, leaving all remaining memory for the virtual machines.

That brings us onto the next part, which is how big should the virtual machines be?

In-memory OS (OSIE)

Every machine that is booted by Tinkerbell will be passed the in-memory Operating System called OSIE, which is an Alpine-based Linux OS that ultimately will run the workflows. As this is in-memory we will need to account for a few things (before we even install our Operating System through a workflow):

  • OSIE kernel
  • OSIE RAM Disk (Includes Alpine userland and the docker engine)
  • Action images (at rest)
  • Action containers (running)

The OSIE RAM disk, whilst it looks like a normal filesystem, is actually held in the memory of the host itself, so it will immediately withhold that memory from other usage.

The Action images will be pulled from a repository and again written to disk; however, the disk that these images are written to is a RAM disk, so these images will also withhold available memory.

Finally, these images when run (Action containers) will have binaries in them that will require available memory in order to run.

The majority of this memory usage, as seen above, is for the in-memory filesystem hosting the userland tools and the images listed in the workflow. From testing we’ve normally seen that >2GB is required; however, if your workflow consists of large action images then this will need adjusting accordingly.

With all this in consideration, it is quite possible to run Tinkerbell on Equinix Metal’s smallest offering, the t1.small.x86. However, if you’re looking at deploying multiple machines with Tinkerbell then a machine with 32GB of RAM will comfortably allow plenty of headroom.
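The rough arithmetic above can be captured in a trivial sizing helper. The 1GB userland figure and the per-VM footprint are the observations from this section, not hard limits:

```go
package main

import "fmt"

// maxVMs estimates how many OSIE virtual machines fit on a host, given
// total RAM, the RAM reserved for the userland (Tinkerbell/docker/bash),
// and the per-VM in-memory footprint (OSIE RAM disk + action images).
func maxVMs(totalGB, userlandGB, perVMGB int) int {
	free := totalGB - userlandGB
	if free < 0 {
		return 0
	}
	return free / perVMGB
}

func main() {
	// A 32GB host with 1GB userland and a generous 3GB-per-VM allowance:
	fmt.Println(maxVMs(32, 1, 3)) // → 10
}
```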

Recommended instances/OS

Check the inventory of your desired facility, but the recommended instances are below:

  • c1.small.x86
  • c3.small.x86
  • x1.small.x86

For speed of deployment and modernity of the Operating System, either Ubuntu 18.04 or Ubuntu 20.04 is recommended.

Deploying Tinkerbell on Equinix Metal

In this example I’ll be deploying a c3.small.x86 in the Amsterdam facility ams6 with Ubuntu 20.04. Once our machine is up and running, we’ll need to install the required packages for running Tinkerbell and our virtual machines.

Update the packages

apt-get update -y

Install git (to clone the sandbox)

apt-get install -y git

Install required dependencies

apt-get install -y apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common \
qemu-kvm \
libguestfs-tools \
libosinfo-bin

Grab shack (qemu wrapper)

wget https://github.com/plunder-app/shack/releases/download/v0.0.0/shack-0.0.0.tar.gz; \
tar -xvzf shack-0.0.0.tar.gz; \
mv shack /usr/local/bin

Create our internal tinkerbell network (not needed)

sudo ip link add tinkerbell type bridge

Create shack configuration

shack example > shack.yaml

Edit and apply configuration

Change the bridgeName: from plunder to tinkerbell, then run shack network create. This will create a new interface on our tinkerbell bridge.

Test virtual machine creation

shack vm start --id f0cb3c -v
<...>
shack VM configuration
Network Device: plndrVM-f0cb3c
VM MAC Address: c0:ff:ee:f0:cb:3c
VM UUID: f0cb3c
VNC Port: 6671

We can also verify that this has worked by examining ip addr:

11: plunder: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 2a:27:61:44:d2:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 brd 192.168.1.255 scope global plunder
       valid_lft forever preferred_lft forever
    inet6 fe80::bcc7:caff:fe63:8016/64 scope link
       valid_lft forever preferred_lft forever
12: plndrVM-f0cb3c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master plunder state UP group default qlen 1000
    link/ether 2a:27:61:44:d2:07 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::2827:61ff:fe44:d207/64 scope link
       valid_lft forever preferred_lft forever

Connect to the VNC port with a client (the random port generated in this example is 6671); it will be exposed on the public address of our Equinix Metal host.

Kill the VM:

shack vm stop --id f0cb3c -d

Install sandbox dependencies

Docker

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

Docker compose

sudo curl -L \
"https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" \
-o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

Clone the sandbox

git clone https://github.com/tinkerbell/sandbox.git
cd sandbox

Configure the sandbox

./generate-envrc.sh plunder > .env
./setup.sh

Start Tinkerbell

# Add Nginx address to Tinkerbell
sudo ip addr add 192.168.1.2/24 dev plunder
cd deploy
source ../.env; docker-compose up -d

At this point we now have a server with available resources; we can create virtual machines and Tinkerbell is listening on the correct internal network!

Create a workflow (debian example)

Clone the debian repository

cd $HOME
git clone https://github.com/fransvanberckel/debian-workflow
cd debian-workflow/debian

Build the debian content

./verify_json_tweaks.sh 
# The JSON syntax is valid
./build_and_push_images.sh

Edit configuration

Modify create_tink_workflow.sh so that the MAC address is c0:ff:ee:f0:cb:3c; this is the MAC address we will be using as part of our demonstration.

For using VNC, modify the facility.facility_code from "onprem" to "onprem console=ttyS0 vga=normal". This will ensure all output is printed to the VNC window that we connect to.

Create the workflow

Here we will be asked for some password credentials for our new machine:

./create_tink_workflow.sh

Start our virtual host to install on!

shack vm start --id f0cb3c -v
<...>
shack VM configuration
Network Device: plndrVM-f0cb3c
VM MAC Address: c0:ff:ee:f0:cb:3c
VM UUID: f0cb3c
VNC Port: 6671

We can now watch the install on the VNC port 6671.

Troubleshooting

http://192.168.1.1/undionly.pxe could not be found

If a machine boots and has this error, it means that its workflow has already been completed; in order to boot this server again, a new workflow will be needed.

could not configure /dev/net/tun (plndrVM-f0cb3c): Device or resource busy 

This means that an old qemu session has left an old adapter behind; we can remove it with the command below:

ip link delete plndrVM-f0cb3c

Is another process using the image [f0cb3c.qcow2]? 

We’ve left an old disk image lying around; we can remove this with rm.


In this post I’ve got a bunch of things I want to cover, all about Type:LoadBalancer (or in most cases a VIP (Virtual IP address)). In most Kubernetes environments a user will fire in some yaml defining a LoadBalancer service or do a kubectl expose and “magic” will occur. As far as the end-user is concerned their new service will have a brand new IP address attached to it, and when an end-user hits that IP address their traffic will hit a pod that is attached to that service. But what’s actually occurring in most cases? Who can provide this address? How does it all hang together?

That’s what we’re going to cover in this post.

The venerable Tim Hockin covers this in a bunch of slide decks here -> https://speakerdeck.com/thockin/bringing-traffic-into-your-kubernetes-cluster, so feel free to start there and then I’ll cover some bits in more detail or from a different angle :-)

Kubernetes Services

A lot of this is already mentioned above, but put simply a service is a method of providing access to a pod or number of pods, either externally or internally. A common example is exposing a web server front end, where we may have a deployment of 10 nginx pods and need to allow end users to access them. Within Kubernetes we can define a service that is attached to this deployment, and thanks to the logic within Kubernetes we don’t need to concern ourselves too much with these pods: we can scale up, scale down, kill pods etc., and as long as we’re coming through the service it will always have an up-to-date list of the pods underneath it.
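The “always has an up-to-date list of pods” behaviour can be pictured as a toy endpoint set with round-robin selection — purely illustrative, since the real mechanism is Endpoints/EndpointSlice objects plus kube-proxy:

```go
package main

import "fmt"

// Service holds the current set of pod endpoints; callers always go
// through Pick() and never track individual pods themselves.
type Service struct {
	endpoints []string
	next      int
}

// SetEndpoints replaces the endpoint list, e.g. after a scale up/down.
func (s *Service) SetEndpoints(eps []string) {
	s.endpoints = eps
	s.next = 0
}

// Pick returns the next endpoint round-robin, or false if none exist.
func (s *Service) Pick() (string, bool) {
	if len(s.endpoints) == 0 {
		return "", false
	}
	ep := s.endpoints[s.next%len(s.endpoints)]
	s.next++
	return ep, true
}

func main() {
	svc := &Service{}
	svc.SetEndpoints([]string{"nginx-pod01:80", "nginx-pod02:80"})
	for i := 0; i < 3; i++ {
		ep, _ := svc.Pick()
		fmt.Println(ep)
	}
	// Scaling down doesn't break callers; they simply see fewer endpoints.
	svc.SetEndpoints([]string{"nginx-pod01:80"})
}
```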

Types of service

ClusterIP

This is an internal-only service, typically used for internal communication between two services, such as some middleware-level connectivity.

NodePort

A NodePort is a port created on every node in the cluster. An external user can connect to a node address on this port to access the service. If we were to use a NodePort with nginx then we’d be given a high port (usually 30000+) that will route traffic to the nginx ports e.g. worker0xx:36123 --> [nginx-pod0x:80]

LoadBalancer

The LoadBalancer service is used to allow external access into a service; it usually requires something external (a Cloud Controller Manager) to inform the Kubernetes API what external address traffic should be accepted on.

ExternalName

This allows a service to be exposed on an external name to point to something else. The main use-case is being able to define a service name and having it point to an existing external service…

The Kubernetes Load Balancer service

So how does all this hang together…

-> also I’ll be walking through how you can implement this yourself, NOT discussing how the big cloud providers do it :-)

If you take a “fresh” Kubernetes cluster and create a simple nginx deployment and try to expose it as a LoadBalancer you’ll find that it doesn’t work (or sits in pending).

kubectl expose deployment hello-world --type=LoadBalancer --name=my-service

[some time later]

NAME         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
my-service   LoadBalancer   10.3.245.137   <pending>     8080/TCP   54s

Why??? Well, the Type:LoadBalancer isn’t the responsibility of Kubernetes; it effectively doesn’t come out of the box. As we can see, the internal IP (CLUSTER-IP) has been created but no external address exists. This EXTERNAL-IP is typically a piece of information that is infrastructure specific: in the cloud we need addresses to come from their IP address management systems, and on-premises who knows how addresses are managed ¯\_(ツ)_/¯

Cloud Controller Manager

The name can be slightly confusing, as anyone can write a CCM and their roles aren’t necessarily cloud specific; I presume it’s more that their main use-case is to extend a Kubernetes cluster to be aware of a cloud provider’s functionality.

I’ve covered Cloud Controllers before, but to save the dear reader a few mouse-clicks I’ll cover it briefly again… The role of a CCM is usually to plug into the Kubernetes API and watch and act upon certain resources/objects. For objects of type Node, the CCM can speak with the infrastructure to verify that the nodes are deployed correctly and enable them to handle workloads. Objects of type LoadBalancer would require the CCM to speak with an API to request an IP address that is available to be an EXTERNAL-IP; alternatively a CCM may be configured with an IP address range it can use in order to configure the EXTERNAL-IP.
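For the on-premises case, “an IP address range it can use” is essentially a pool allocator. A minimal sketch of what such a CCM might keep internally — illustrative only, as real implementations also handle release, conflict detection and persistence:

```go
package main

import (
	"fmt"
	"net/netip"
)

// AddressPool hands out EXTERNAL-IPs from a configured range, the way
// an on-prem CCM without a cloud IPAM API might.
type AddressPool struct {
	first netip.Addr
	last  netip.Addr
	inUse map[netip.Addr]string // address -> service that owns it
}

func NewAddressPool(first, last string) (*AddressPool, error) {
	f, err := netip.ParseAddr(first)
	if err != nil {
		return nil, err
	}
	l, err := netip.ParseAddr(last)
	if err != nil {
		return nil, err
	}
	return &AddressPool{first: f, last: l, inUse: map[netip.Addr]string{}}, nil
}

// Allocate returns the lowest free address in the range for a service.
func (p *AddressPool) Allocate(service string) (netip.Addr, error) {
	for a := p.first; a.Compare(p.last) <= 0; a = a.Next() {
		if _, taken := p.inUse[a]; !taken {
			p.inUse[a] = service
			return a, nil
		}
	}
	return netip.Addr{}, fmt.Errorf("address pool exhausted")
}

func main() {
	pool, _ := NewAddressPool("147.75.100.235", "147.75.100.240")
	a, _ := pool.Allocate("default/my-service")
	fmt.Println(a) // → 147.75.100.235
}
```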

LoadBalancer watcher

Within the CCM there will be code watching the Kubernetes API, specifically Kubernetes services… When one has spec.type = LoadBalancer, the CCM will act!

Before the CCM has a chance to act, this is what the service will look like:

kubectl get svc my-service -o yaml

(some stuff omitted…)

- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2021-01-02T17:46:16Z"
    name: my-service
    namespace: default
    resourceVersion: "5281940"
    selfLink: /api/v1/namespaces/default/services/hello-world
    uid: 7c14dbbd-f9fd-4279-8524-7418ff62d08f
  spec:
    clusterIP: 10.107.2.227
    externalTrafficPolicy: Cluster
    ports:
    - nodePort: 30497
      port: 80
      protocol: TCP
      targetPort: 80
    selector:
      app: hello-world
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer: {}

We can see that the status is blank and the spec.loadBalancerIP doesn’t exist… well, whilst we’ve been reading this the CCM has acted. If we’re in the cloud it’s done some API calls, or if we’re running our own it may have looked at its pool of addresses and found a free address. The CCM will take this address and modify the object by updating the spec of the service.

kubectl get svc my-service -o yaml

(more stuff omitted…)

- apiVersion: v1
  kind: Service
  [....]
  spec:
    clusterIP: 10.107.2.227
    externalTrafficPolicy: Cluster
    loadBalancerIP: 147.75.100.235
    ports:
    - nodePort: 30497
      port: 80
      protocol: TCP
      targetPort: 80
    selector:
      app: hello-world
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer: {}

… and that is it.

That is all the CCM needs to do, although your service will still show <pending> as the status is still blank :-(

Two other things need to happen at this point!

Kubernetes Services proxy

What is this “proxy” I speak of? … well, if you’ve noticed the kube-proxy pod on your cluster and wondered why it’s there then read on!

There is plenty more detail => here, but I’ll break it down mainly for LoadBalancers.

When we expose a service within Kubernetes, the API server will instruct kube-proxy to inject a bunch of rules into iptables/ipvs that will effectively capture traffic and ensure that this traffic goes to the pods defined under the service. By default any service we create will have its clusterIP and a port written into these rules, so that regardless of which node inside the cluster we use to access this clusterIP:port, the services proxy will handle the traffic and distribute it accordingly. ClusterIPs are virtual IP addresses managed by the Kubernetes API; however, with LoadBalancers the traffic is external and we don’t know what the IP address will be ¯\_(ツ)_/¯!

Well, as shown above the CCM will eventually populate spec.loadBalancerIP with an address from the environment. Once this spec is updated, the API will instruct kube-proxy to ensure that any traffic for this externalIP:port is also captured and proxied to the pods underneath the service.

We can see that these rules now exist by looking at the output of iptables-save; all traffic for the address of the LoadBalancer is now forwarded on…

root@k8s:~# iptables-save | grep 147
-A KUBE-SERVICES -d 147.75.100.235/32 -p tcp -m comment --comment "default/hello-world loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-2BRRD7ARGUSQ7MRP

The final piece of the puzzle is getting traffic to the machines themselves …

External traffic !

So between a CCM and the Kubernetes service proxy, we have been given an external IP address for our service, and the Kubernetes service proxy will ensure that any traffic in the cluster for that external IP address is distributed to the various pods. We now need to get traffic to the nodes themselves…

Hypothetically, if we had a second network adapter in one of the nodes then we could configure this network adapter with the externalIP, and as long as we can route traffic to that IP address then the Kubernetes service proxy will capture and distribute that traffic. Unfortunately that is a very manual operation, so what options/technologies could we adopt in order to manage this?

We would usually need a final piece of running software that “watches” a service; once the service is updated with a spec.loadBalancerIP from the CCM, we know that it’s good to advertise to the outside world! Also, once we’re exposing this to the outside world, we can modify the status of the service with the address that we’re exposing on, so that clients and end-users know they can now use this address!

status:
  loadBalancer:
    ingress:
    - ip: 147.75.100.235

There are two main technologies that we can use to tell an existing environment about our new loadBalancer address, and when someone accesses this address where to then send the traffic!

ARP

ARP (Address Resolution Protocol) is a Layer 2 protocol whose main task is working out which hardware (MAC address) an IP address belongs to. When an IP address first appears on the network it will typically broadcast to the network that it exists, along with the MAC address of the adapter that it’s bound to. This informs the network of this IP <==> MAC binding, meaning that on a simple network, when packets need to go to an IP address, the switching infrastructure knows which machine to send the traffic to.

We can see this mapping by using the arp command:

$ arp -a
_gateway (192.168.0.1) at b4:fb:e4:cc:d3:80 [ether] on ens160
? (10.0.73.65) at e6:ca:eb:b8:a0:f3 [ether] on cali43893380087
? (192.168.0.170) at f4:fe:fb:54:89:16 [ether] on ens160
? (10.0.73.67) at 52:5c:1b:5f:e1:50 [ether] on cali0e915999b8d
? (192.168.0.44) at 00:50:56:a5:13:11 [ether] on ens160
? (192.168.0.45) at 00:50:56:a5:c1:86 [ether] on ens160
? (192.168.0.40) at 00:50:56:a5:5f:1d [ether] on ens160

Load Balancing NOTE: In order for this to work with our Kubernetes cluster, we would need to select a single node to be in charge of hosting this externalIP and use ARP to inform the network that traffic for this address should be sent to that machine. This is because of two reasons:

  • If another machine broadcasts an ARP update then existing connections will be disrupted
  • Multiple machines can’t have the same IP address exposed on the same network

The flow of operation is:

  • A leader is selected
  • spec.loadBalancerIP is updated to 147.75.100.235
  • 147.75.100.235 is added as an additional address to interface ens160
  • ARP broadcasts that traffic to 147.75.100.235 is available at 00:50:56:a5:4f:05
  • Update the service status that the service is being advertised

At this point the externalIP address is known to the larger network, and any traffic will be sent to the node that is elected leader. Once the traffic is captured by the rules in the kernel, the traffic is then sent to the pods that are part of the service.

BGP

BGP (Border Gateway Protocol) is a Layer 3 protocol whose main task is to update routers with the path to new routes (a new route can be a single address or a range of addresses). For our use-case in a Kubernetes cluster, we would use BGP to announce to the routers in the infrastructure that traffic for our spec.loadBalancerIP should be sent to one or more machines.

Load Balancing NOTE: One additional benefit over ARP is that multiple nodes can advertise the same address; this actually provides both HA and load-balancing across all nodes advertising in the cluster. In order to do this, we can bind the externalIP to the localhost adapter (so it’s not present on the actual network) and leave it to the kernel to allow traffic from the routers in and have it proxied by kube-proxy.

The flow of operation is:

  • spec.loadBalancerIP is updated to 147.75.100.235
  • 147.75.100.235 is added as an additional address to interface lo
  • A BGP peer advertisement updates the other peers (routers) that the loadBalancerIP is available by routing to the machine’s address on the network (all machines can/will do this)
  • Update the service status that the service is being advertised

Any client/end-user that tries to access 147.75.100.235 will have their traffic go through the router, where a route will exist to send that traffic to one of the nodes in the cluster, where it will be passed to the kubernetes service proxy.
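The whole BGP flow boils down to every node advertising the same /32 host route with itself as the next hop, leaving the router with equal-cost paths. A small sketch of the data involved — the names here are illustrative, not any real BGP library's API:

```go
package main

import (
	"fmt"
	"net/netip"
)

// Advertisement is what each node effectively tells its BGP peers:
// "this /32 prefix is reachable via me".
type Advertisement struct {
	Prefix  netip.Prefix // the /32 host route, e.g. 147.75.100.235/32
	NextHop netip.Addr   // the advertising node's own address
}

// buildAdvertisements produces one advertisement per node for the
// service's loadBalancerIP; the router ends up with equal-cost routes.
func buildAdvertisements(lbIP string, nodeIPs []string) ([]Advertisement, error) {
	addr, err := netip.ParseAddr(lbIP)
	if err != nil {
		return nil, err
	}
	prefix := netip.PrefixFrom(addr, addr.BitLen()) // /32 host route
	ads := make([]Advertisement, 0, len(nodeIPs))
	for _, n := range nodeIPs {
		hop, err := netip.ParseAddr(n)
		if err != nil {
			return nil, err
		}
		ads = append(ads, Advertisement{Prefix: prefix, NextHop: hop})
	}
	return ads, nil
}

func main() {
	ads, _ := buildAdvertisements("147.75.100.235", []string{"192.168.0.44", "192.168.0.45"})
	for _, ad := range ads {
		fmt.Println(ad.Prefix, "via", ad.NextHop)
	}
}
```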

Overview

At this point we can see there are a number of key pieces and technologies that can all be harnessed to put together a load-balancing solution for Kubernetes. However the CCM is arguably the most important, as the CCM has the role of being “aware” of the topology or architecture of the infrastructure. It may need to prep nodes with configuration details (BGP configuration settings etc.) and speak with other systems to request valid addresses that can be used for the loadBalancer addresses.

Comment and share

I've just gone through the process of updating my current home machine to the latest release of macOS Big Sur, which was a much more painful process than I've had over the years. It also made me reflect on the fact that I've been doing this for over a decade, and that quite possibly this will be one of the last times that I can run macOS on standard x86 hardware.

The Why?

I’ve had a workstation for personal use for a long time, and

Comment and share

Since starting at Packet (now Equinix Metal) I've had to spend a bit of time getting my head around a technology called BGP. This is widely used at Equinix Metal; the main use case is to allow an Elastic IP address to route traffic externally (i.e. from the internet) into one or more physical machines in Equinix's facilities. An external address will usually be routed into one of the Equinix datacenters; we can configure physical infrastructure to use this external IP and communicate to the networking equipment in that facility (using BGP) that traffic should be routed to these physical servers.

I've since been working to add BGP functionality to a Kubernetes load-balancing solution that I've been working on in my spare time, kube-vip.io. This typically works in exactly the same way as described above, where an EIP is advertised to the outside world and the BGP functionality routes external traffic to worker nodes, where the service traffic is then handled by the service "mesh" (usually iptables rules capturing service traffic) within the cluster.

I’m hoping to take this functionality further and decided it would be nice to try and emulate the Equinix Metal environment as much as possible, i.e. having BGP at home. There are two methods that I could go down in order to have this sort of functionality within my house:

  • Create a local Linux router, with its own network, and use bgpd so I can advertise routes
  • Utilise my Unifi router, which after some light googling turns out to support bgp

Given that:
a. It's Sunday
b. I'm lazy
c. The instructions for setting up a local Linux router looked like a pita
d. The USG will enable me to advertise addresses on my existing network and will route traffic to the advertising hosts without me doing anything (exacerbating the laziness here)

I opted to just use my USG.

Configuring the Unifi Security Gateway

These instructions are for the Unifi Security Gateway, however I suspect that the security gateway that is part of the “dream machine” should support the same level of functionality.

Before we begin we'll need to ensure that we understand the topology of the network, so that we can configure bgp to function as expected.

Device        Address
gateway       192.168.0.1
k8sworker01   192.168.0.70
k8sworker02   192.168.0.71
k8sworker03   192.168.0.72

In the above table we can see a subset of hosts on my network. The first important host is the gateway address on my network (which is the address of the USG). All hosts on my network have this set as the default gateway, which means that when a machine needs to access an IP address that isn't in the 192.168.0.1-254 range it will send the traffic to the gateway for it to be routed to that specific address.

Your gateway can be found by doing something like the following:

$  ip route | grep default
default via 192.168.0.1 dev ens160 proto static
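If you wanted to look the gateway up programmatically rather than with grep, a small parser over the same `ip route` output might look like the sketch below (an illustration only; the `defaultGateway` helper is not from the original post):

```go
package main

import (
	"fmt"
	"strings"
)

// defaultGateway extracts the gateway address from `ip route` style
// output by finding the line that starts "default via <address>".
func defaultGateway(routes string) string {
	for _, line := range strings.Split(routes, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 3 && fields[0] == "default" && fields[1] == "via" {
			return fields[2]
		}
	}
	return ""
}

func main() {
	out := "default via 192.168.0.1 dev ens160 proto static"
	fmt.Println(defaultGateway(out)) // prints 192.168.0.1
}
```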

The other addresses that are worth noting are the three worker hosts in my Kubernetes cluster that will advertise their bgp routes to the gateway.

NOTE:
All of the next steps require ssh access into the appliance in order to enable and configure the USG so that bgp is enabled on the network.

If you don't know the ssh password for your gateway then help is at hand, because for some bizarre reason you can get the ssh username/password straight from the web UI. Navigate to Network Settings -> Device Authentication and here you'll find the ssh username and password.

NOTE 2:
Whilst it is possible to enable bgp functionality through the cli, the actual web client isn’t aware of the bgp configuration. This will result in the bgp configuration being wiped when you do things like modify firewall rules or port forwarding in the web UI.

Enabling BGP

To begin with we’ll need to ssh to the Unifi USG with the credentials that can be found from the web portal.

Below is the configuration that we'll add; I'll break it down afterwards:

configure
set protocols bgp 64501 parameters router-id 192.168.0.1
set protocols bgp 64501 neighbor 192.168.0.70 remote-as 64500
set protocols bgp 64501 neighbor 192.168.0.71 remote-as 64500
set protocols bgp 64501 neighbor 192.168.0.72 remote-as 64500
commit
save
exit

configure - will enable the configuration mode for the USG.

set protocols bgp 64501 parameters router-id 192.168.0.1 - will enable bgp on the USG with the router-id set to its IP address 192.168.0.1; the AS number 64501 identifies our particular bgp network

set protocols bgp 64501 neighbor x.x.x.x remote-as 64500 - will allow our bgp instance to take advertised routes from x.x.x.x with the AS identifier 64500

commit | save - will apply and then persist the configuration

That's it.. the USG is now configured to accept routes from our Kubernetes workers!!!

Configuring Kube-Vip

I've recently started on a release of kube-vip that provides additional functionality to automate some of the bgp configuration, namely the capability to determine the host IP address used as the bgp server-id when advertising from a pod.

We can follow the guide for setting up kube-vip with bgp, with some additional modifications shown below!

Ensure that the version of the kube-vip image is 0.2.2 or higher

env:
- name: vip_interface
value: "lo"
- name: vip_configmap
value: "plndr"
- name: bgp_enable
value: "true"
- name: vip_loglevel
value: "5"
- name: bgp_routerinterface
value: "ens160"
- name: bgp_as
value: "64500"
- name: bgp_peeraddress
value: "192.168.0.1"
- name: bgp_peeras
value: "64501"

The bgp_routerinterface option will autodetect the host IP address of the specified interface, and this will become our server-id for each pod's bgp peering configuration. The bgp_peer<as|address> options are the remote configuration for our USG as specified above.
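kube-vip's actual detection code isn't shown here, but the idea of picking a usable host address from an interface's address list can be sketched as a pure function over CIDR strings (the sample addresses below are made up for illustration):

```go
package main

import (
	"fmt"
	"net"
)

// firstIPv4 returns the first non-loopback IPv4 address from a list of
// CIDR strings, the kind of list net.Interface.Addrs() would yield.
func firstIPv4(cidrs []string) (string, error) {
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if v4 := ip.To4(); v4 != nil && !v4.IsLoopback() {
			return v4.String(), nil
		}
	}
	return "", fmt.Errorf("no usable IPv4 address found")
}

func main() {
	addrs := []string{"fe80::1/64", "127.0.0.1/8", "192.168.0.70/24"}
	ip, _ := firstIPv4(addrs)
	fmt.Println(ip) // prints 192.168.0.70
}
```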

Testing it out !

Create an easy deployment; we can use the nginx demo for this:

kubectl apply -f https://k8s.io/examples/application/deployment.yaml

Given that the USG is our default gateway we can actually advertise any address we like as a type:LoadBalancer service, and when we try to access it from inside the network the USG will route us back to the Kubernetes servers!

For example we can create a load-balancer with the address 10.0.0.1

kubectl expose deployment nginx-deployment --port=80 --type=LoadBalancer --name=nginx-bgp --load-balancer-ip=10.0.0.1

Given that the home network is in the 192.168.0.0/24 range, anything outside of it will need to go through our USG router, where we've advertised (through BGP) that the traffic needs to go to any of the Kubernetes workers.

We can see this below on the USG with show ip bgp

admin@Gateway:~$ show ip bgp
BGP table version is 0, local router ID is 192.168.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* i10.0.0.1/32 192.168.0.45 100 0 i
*>i 192.168.0.44 100 0 i
* i 192.168.0.46 100 0 i

Comment and share

This is a post about:

  • Operating Systems
  • The Linux Boot process
  • Installation methods
  • Doing weird hacky things 😃

In this post I'm going to cover the options that are typically available to an end user when looking to install an Operating System (mainly Linux, but I may touch on others 🤷). We will touch on their pros/cons and look at what the alternatives are. We will then discuss concepts of immutable Operating System deployments and how we may go about doing so, including the existing eco-system. Finally we'll look at what is required (using Go) to implement Image based Operating System creation and deployment, and how they've been implemented into plunder.

Getting an Operating System deployed

Most Operating Systems are deployed in relatively the same manner:

  1. A machine boots and reads from installation media (presented locally or over the network)
  2. The target disks are prepared: typically partitions are created (possibly using HA technologies such as disk mirroring) and then these partitions are "formatted" so that they contain a file system.
  3. Either a minimal set of packages or a custom selection of packages will be installed to the new file system. Most Operating Systems or distributions have their own concept of "packages", but ultimately under the covers a package contains the binaries and required libraries for an application, along with some logic that dictates where the files should be written to and some versioning information that the package manager can use.
  4. There may be some final customisation such as setting users, network configuration etc..
  5. A Boot loader is written to the target disk so that when the machine is next powered on it can boot the newly provisioned Operating System.

The order of the steps may differ but pretty much all of the major Operating Systems (Linux, Windows, MacOS) follow the same pattern to deploy on target hardware.

Options for automating a deployment

There are usually two trains of thought when thinking about deploying an Operating System: scripted, which goes through the steps listed above but requires no user interaction, or image based, which takes a copy of an existing deployment and uses it as a "rubber stamp" for other installs.

Scripted

Operating Systems were originally designed to run on hardware of a predetermined configuration, which meant that there was no need to customise the installation. However, as time passed, a few things happened that suddenly required Operating Systems to become more flexible:

  • Usage of computers skyrocketed
  • The number of hardware vendors producing compute parts increased
  • A lot of traditional types of work became digitised.

All of these factors suddenly required an Operating System to support more and more types of hardware and their required configuration(s); furthermore, end-users required the capability to tailor an Operating System to behave as needed. To provide this functionality, Operating System vendors built rudimentary user interfaces that would ask questions or let the user installing the OS set various configuration options. This worked for a period of time, but as more and more computer systems were deployed it became an administrative nightmare: it was impossible to automate multiple installations as they required interaction to proceed, and the need for humans to interact brings about the possibility of human error (pesky humans) during the deployment.

In order for large scale IT system installations to take place, operations needed a method for unattended installations, where installations could happen without any involvement. The technique for this was to modify the Operating System installation code so that it could take a file answering all of the questions that would previously have required user input to progress the installation. These technologies are all named in a way that reflects that:

  • preseed
  • answer file(s)
  • kickstart
  • jumpstart

Once a member of the operations team has “designed” the set of responses for their chosen Operating System then this single configuration can be re-used as many times as required. This removes the human element from accidentally entering the wrong data or clicking the wrong button during an Operating System installation and ensures that the installation is standardised and “documented”.

However, one thing to consider is that although a scripted installation is a repeatable procedure that requires no human interaction, it is not always 100% reliable. The installation still runs through a lot of steps: every package has to be installed (along with the prerequisite package management), devices have to be configured, and various other things happen during that initial installation phase. Whilst this may work perfectly on the initial machine, undiscovered errors can appear when moving this installation method to different hardware. There have been numerous issues caused by packages relying on sleep during an installation step; the problem here is usually that the package was developed on a laptop and then moved to much larger hardware. Suddenly this sleep is no longer in step with the behaviour of the hardware, as the task completes much quicker than it had on slower hardware. This has typically led to numerous installation failures and can be thought of as a race condition.

Image

Creating an image of an existing Operating System has existed for a long time, we can see it referenced in this 1960s IBM manual for their mainframes.

Shoutout for Norton Ghost !!

I have covered image based deployments before, but my (google) research shows Ghost (1998) pre-dates preseed/DebianInstaller (2004), Kickstart (RHEL in 2000) and even Sun Microsystems' JumpStart (the earliest mention I can find is Solaris 2.8, which is 1999).

Anatomy of a disk

We can usually think of a disk as being like a long strip of paper, starting at position 0 and ending at the length of the strip of paper (or its capacity). The positions are vitally important as they're used by a computer when it starts in order to find things on disk.

The boot sector is the first place a machine will attempt to boot from. Once the machine has completed its hardware initialisation and checks, the code in this location is used to instruct the computer where to look for the rest of the boot sequence and code. In the majority of cases a computer will boot the first phase from this boot sector and then be told where to look for the subsequent phases; more than likely the remaining code will live within a partition.

A partition defines some ring-fenced capacity on an underlying device that can then be presented to the Operating System as usable storage. Partitions will then be "formatted" so that they have a structure that understands concepts such as folders/directories and files, along with additional functionality such as permissions.

Now that we know the makeup of a disk we can see that there are lots of different things that we may need to be aware of, such as type of boot loader, size or number of partitions, type of file systems and then the files and packages that need installing within those partitions.

We can safely move away from all of this by taking a full copy of the disk! Starting from position 0 we can read every byte until we’ve reached the end of the disk (EOF) and we have a full copy of everything from boot loaders to partitions and the underlying files.

The steps for creating and using a machine image are usually:

  • Install an Operating System once (correctly)
  • Create an image of this deployed Operating System
  • Deploy the “golden image” to all other hosts

Concepts for managing OS images

The concept for reading/writing OS images consists of a number of basic actions:

  • Reading or Writing to a storage device e.g. /dev/sda
  • A source image (either copied from a deployed system or generated elsewhere)
  • A destination for the data from the cloned system; when reading from a deployed system we need to store a copy of the data somewhere

In relation to the final point, there have been numerous options to capture the contents of an existing system over time:

  • Mainframes could clone the system to tape
  • Earlier “modern” systems would/could clone to a series of Floppy disks and later on CD-ROM media
  • Stream the contents of a storage device over the network to a remote listening server.

Reading (cloning) data from a storage device

When capturing from an existing system, especially when this OS image will be used to deploy elsewhere, there are a few things that need to be considered about the state of the source system.

Firstly, it is ideal that the source is as "clean" as possible, which typically means the OS is deployed with any required post-configuration and nothing else. If a system has been up for some time or has been used, the filesystem could already be dirty with excess logs or generated files that will clutter the source image and end up being deployed to all destination servers. We may also want to consider removing or zeroing the contents of swap partitions or files, to ensure nothing is taken from the source and passed to destination machines that doesn't need to be copied.

Secondly, the disk shouldn't actually be in use when we clone or image its contents. Whilst technically possible, if the Operating System is busy reading and writing whilst we attempt to copy the underlying storage, we will end up with files that are half-written or ultimately corrupt. We can boot from alternative media (USB/CD-ROM/alternate disk/network) in order to leave our source disk unused, which will allow us to copy its contents in a completely stable state.

Once we've started up our cloning tooling we only need to read all data from the underlying storage device to a secondary data store. We need to be aware that even if we've only installed a small amount of data as part of the Operating System, our cloned copy will be the exact size of the underlying storage device. E.g. a 100GB disk that is installed with a basic Ubuntu 18.04 package set (900MB) will still result in a 100GB OS Image.

Writing an OS Image to a destination device

As with reading the contents of an underlying storage device, we can't (or I certainly wouldn't recommend trying to) use the same storage device at the same time as writing to it. If you had an existing system running an Operating System such as Win10 and we used some tooling to re-write our Ubuntu disk image to the same storage then, well…

To circumvent this issue we need to start our destination machine from another medium, such as another storage device, leaving our destination device untouched until we're ready to write the disk image to it.

Finally we also need to consider the sizes of the disk image and the destination storage:

  • If our destination storage is smaller than the disk image then writing the image will fail. However, the system may still start afterwards, as the main contents of the Operating System may have fit on the storage device. This can lead to an unstable state, as the filesystem and partitions have logical boundaries that exist well beyond the capacity of the destination disk.
  • On the other hand if we write a 10GB OS image to a 100GB physical storage device then the remaining 90GB is left unused without some additional steps, typically involving first growing the partition to occupy the remaining 90GB of space and then growing the filesystem so it is aware of this additional capacity.
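The two size cases above come down to simple arithmetic; a minimal sketch (the `fitsDevice` helper is illustrative, with sizes in bytes):

```go
package main

import "fmt"

// fitsDevice reports whether an OS image can be written to a device of
// the given capacity, and how much space would be left over (the space
// that the later partition/filesystem grow steps reclaim).
func fitsDevice(imageBytes, deviceBytes int64) (fits bool, spare int64) {
	return imageBytes <= deviceBytes, deviceBytes - imageBytes
}

func main() {
	ok, spare := fitsDevice(10<<30, 100<<30) // 10GB image, 100GB disk
	fmt.Println(ok, spare>>30)               // prints: true 90
}
```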

Writing your own OS imaging software

This section will detail the design around software I wrote to accomplish all of the things I’ve discussed above. All of the code samples are written in Go and the source code for the project can be found on the BOOTy repository.

It will be in three parts:

  • First, creating a boot environment that allows us to safely read/write from underlying storage
  • Second, reading and writing from underlying storage
  • Finally, post steps.

A solution for a clean environment

To ensure we can make a clean read or write to a storage device we need to ensure that it’s not being used at the time, so we will need to boot from a different environment that has our tooling in it.

Simplest option

We can simply use a live-CD image that allows us to boot a fully fledged Operating System; from here we can use tooling like dd to take a copy of the contents of the storage. The problems with this solution are that we can't really automate the procedure, and we need a writeable location for the contents of the source disk.

Writing a custom Operating System

Phase one: LinuxKit

My initial plan was to do this with LinuxKit, where the code to handle the reading and writing of the underlying storage would live within a container and I would then use LinuxKit to bake this into a custom Linux distribution that I could then start on remote hosts.

  1. Start remote host and point to our LinuxKit OS
  2. Start Linux Kernel
  3. Kernel boots, finds all hardware (including our storage device(s))
  4. LinuxKit starts /init which in turn will start our application container
  5. The container will then interact with the underlying storage device (e.g. write an image to the disk)

This solution was quick and easy to write, however it had a few issues:

  • Ensuring console output of what is transpiring
  • Stable rebooting or restarting of the host
  • Dropping to a console if a write fails

So whilst it did work, I decided I could probably do something a little smaller.

Phase two: initramfs

This design uses two pieces of technology: a Linux kernel and an initramfs, which is where the second stage of Linux booting occurs. If we use the same chronology as above, this design looks like the following:

  1. Start remote host and point to our OS
  2. Start Linux Kernel
  3. Kernel boots, finds all hardware (including our storage device(s))
  4. Kernel starts /init within our initramfs , this init is our code to manage underlying storage.

At this point our init is the only process started by the kernel and all work is performed within our init binary. Within our initramfs we can add in any additional binaries, such as a shell (busybox) or Logical Volume Manager tooling to manipulate disks (more on this later).

In order for this tooling to work, our custom init will need to set up the environment so that various things exist within the Linux filesystem, including special device nodes such as /dev/tty or /dev/random, which are required for things like writing console output or generating UUIDs for filesystems.

Below is the example code that we will need to generate our environment:

package main

import (
	"log"
	"syscall"
)

// makedev combines major and minor numbers into a Linux device number
func makedev(major, minor uint32) int {
	return int(major)<<8 | int(minor)
}

func main() {
	// Mount /dev so that device nodes can be created
	if err := syscall.Mount("devtmpfs", "/dev", "devtmpfs", syscall.MS_MGC_VAL, ""); err != nil {
		log.Printf("mount /dev failed: %v", err)
	}

	// Create /dev/random (character device, major 1, minor 3)
	if err := syscall.Mknod("/dev/random", syscall.S_IFCHR|0666, makedev(1, 3)); err != nil {
		log.Printf("mknod /dev/random failed: %v", err)
	}
}

We will need to mount various additional file systems in order for additional functionality to work as expected, mainly pseudo-filesystems such as /proc (proc) and /sys (sysfs).

Once our environment is up and running, we will need to use dhcp in order to get an address on the network. There are numerous libraries available that provide DHCP client functionality, and we can run one in a goroutine to ensure we have an address and renew it if the lease gets close to expiring before we finish our work.

At this point we would do our disk image work, this will be covered next !

With an init we need to handle failure carefully: if our init process just ends, either successfully (0) or with failure (-1), our kernel will simply panic. This means we need to handle all errors carefully and, in the event something fails, ensure it is captured on screen (with a delay long enough for the error to be read) before issuing a command to either start a shell or reboot the host.

If on error we want to drop to a shell, the following code will ensure that the interrupt is passed from our init process to its children.

// TTY hack to support ctrl+c
cmd := exec.Command("/usr/bin/setsid", "cttyhack", "/bin/sh")
cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

err := cmd.Start()
if err != nil {
	log.Errorf("Shell error [%v]", err)
}
err = cmd.Wait()
if err != nil {
	log.Errorf("Shell error [%v]", err)
}

Finally, instead of simply ending our init process, we will need to issue a reboot or shutdown, as simply ending init will result in a kernel panic and a hung system.

Note: Before doing this ensure that disk operations are complete and that anything mounted or open has been closed and unmounted, as this syscall will immediately reboot.

err := syscall.Reboot(syscall.LINUX_REBOOT_CMD_RESTART)
if err != nil {
	log.Errorf("reboot failed: %v", err)
	Shell()
}

// Alternatively we can opt to panic the kernel by simply exiting
os.Exit(1)

We now have all of the required bits in place:

  1. Our custom /init starts up
  2. The relevant paths are created or mounted
  3. Any device nodes are created, allowing us to interact with the system as a whole
  4. We use a DHCP Client to retrieve an address on the network
  5. In the event of an error we can start a shell process allowing us to examine further as to what the error may be
  6. Finally we will issue a reboot syscall when we’ve finished

Managing images, and storage devices

With the above /init in place we can now boot a system that has all the underlying devices ready, is present on the network and in event of errors can either restart or drop to a point where we can look into debugging.

We are now in a position to put together some code that will read the storage and send it across the network to a remote server, OR pull that same data and write it to the underlying storage.

Identifying the server and behaviour

In order to know what action a server should take when it is booted from our custom init, we need a unique identifier; luckily we have one built into every machine: its MAC/hardware address.

// Get the mac address of the interface ifname (in our case "eth0")
iface, err := net.InterfaceByName(ifname)
if err != nil {
	return "", err
}

// Convert into something that is RFC 3986 compliant
return strings.Replace(iface.HardwareAddr.String(), ":", "-", -1), nil

The above piece of code finds the MAC address of the interface passed as ifname and converts it into a compliant URI string: 00:11:22:33:44:55 -> 00-11-22-33-44-55. We can now use this to build a URL that we can request to find out what action we should be performing; once init is ready it will build this configuration and perform a GET on http://<serverURL>/<MAC address>.

You may be wondering where we specify the <serverURL> address. We could hard-code it into our init; alternatively we can pass it on boot as a flag to the kernel (MYURL=http://192.168.0.1), which will appear as an environment variable within our newly started Operating System.

serverURL := os.Getenv("MYURL")

Reading and Writing

We can now call back to a server that can inform the init process whether it should be reading from the storage device and creating an OS image, or pulling an image and writing it to a device.

Writing an image to disk

This is arguably the easier task, as we can use a number of pre-existing features within Go to make it very straightforward. Once we've been told by the server that we're writing an image, we should also be given a URL that points to an OS image location. We can pass this imageURL to our writing function and use io.Copy() to stream the image directly to the underlying storage.

// New request to GET the actual image
req, err := http.NewRequest(http.MethodGet, imageURL, nil)
if err != nil {
	return err
}

resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()

// Open the underlying storage (destinationDevice, e.g. "/dev/sda")
diskIn, err := os.OpenFile(destinationDevice, os.O_CREATE|os.O_WRONLY, 0644)
if err != nil {
	return err
}
defer diskIn.Close()

// Read from resp.Body and Write to the underlying Storage
count, err := io.Copy(diskIn, resp.Body)
if err != nil {
	return fmt.Errorf("error writing %d bytes to disk [%s] -> %v", count, destinationDevice, err)
}

Reading from Disk to an Image

The main issue we hit with reading from a disk to a remote location is the sheer amount of data: the underlying storage could easily hold many GBs that need transmitting to a remote server. In order to send such large amounts of data over HTTP we can use a multipart writer that will break up the file and rebuild it on the server side. To do this we create a multipart writer, and as we read chunks of data from /dev/sda we send them as multiple parts over the network.

body, writer := io.Pipe()

// New request to POST the image (contents of the disk)
req, err := http.NewRequest(http.MethodPost, uri, body)
if err != nil {
	return nil, err
}

// Create our multipart Writer
mwriter := multipart.NewWriter(writer)
req.Header.Add("Content-Type", mwriter.FormDataContentType())

errchan := make(chan error)

// Go routine for the copy operation
go func() {
	defer close(errchan)
	defer writer.Close()
	defer mwriter.Close()

	// "imageKey" is the form key that the server will look for and
	// key is what the file should be called; we can set this to the MAC address of the host
	w, err := mwriter.CreateFormFile("imageKey", key)
	if err != nil {
		errchan <- err
		return
	}

	// Open the underlying storage
	diskIn, err := os.Open("/dev/sda")
	if err != nil {
		errchan <- err
		return
	}
	defer diskIn.Close()

	// Copy from the disk into the multipart writer (which in turn sends data over the network)
	if written, err := io.Copy(w, diskIn); err != nil {
		errchan <- fmt.Errorf("error copying /dev/sda (%d bytes written): %v", written, err)
		return
	}
}()

resp, err := client.Do(req)
merr := <-errchan

if err != nil || merr != nil {
	return resp, fmt.Errorf("http error: %v, multipart error: %v", err, merr)
}

Shrinking Images

Using disk images can be incredibly wasteful in terms of network traffic: a 1GB Operating System on a 100GB disk requires sending 100GB of data (even though most of the disk is probably zeros). To save a lot of space we can extend our read/write functions so that we pass our data through a compressor (or an expander in the other direction); we can do this very easily by modifying the above code examples.

// Reading raw data from Disk, compressing it and sending it over the network

// Wrap the multipart form writer in a gzip writer (gzip.NewWriter never returns an error)
zipWriter := gzip.NewWriter(w)

// run an io.Copy on the disk into the zipWriter
if written, err := io.Copy(zipWriter, diskIn); err != nil {
	errchan <- fmt.Errorf("error copying /dev/sda (%d bytes written): %v", written, err)
	return
}

// Ensure we close our zipWriter (otherwise we will get "unexpected EOF")
err = zipWriter.Close()



// Expanding compressed data and writing it to Disk

// Create a gzip reader that takes the compressed data over the network
zipOUT, err := gzip.NewReader(resp.Body)
if err != nil {
	fmt.Println("[ERROR] New gzip reader:", err)
}
defer zipOUT.Close()

// Read uncompressed data from gzip Reader and Write to the underlying Storage
count, err := io.Copy(fileOut, zipOUT)
if err != nil {
	return fmt.Errorf("error writing %d bytes to disk [%s] -> %v", count, destinationDevice, err)
}

We can see that we simply "man-in-the-middle" a compression step that slots straight into the existing workflow shown earlier. The results of adding compression are clear to see: with a standard Ubuntu 18.04 install on any sized disk (4GB or 100GB), the compressed OS image is always around 900MB.

Tidying up

We could simply stop here: we have all of the components in place to both create and write Operating System images to various hardware over the network. However, in a lot of cases we may want to perform some post configuration once we've deployed an image.

Grow the storage

To perform this we (in most circumstances) require some external tooling; if our base operating system image uses technologies like LVM, then we'll need additional tooling to interact with them. So within our initramfs we may want to create a static build of LVM2, so that we can use this tooling without requiring a large number of additional libraries within our RAM disk. Another technique is to make use of tooling that may already exist within the operating system image we've just written to disk; below is a sample workflow:

  1. Write the Operating System image to /dev/sda
  2. Exec out to /usr/sbin/partprobe within our ram-disk to scan /dev/sda and find our newly written disk contents
  3. Exec out to /sbin/lvm to enable any logical volumes
    TODO: look at BLKRRPART to replace partprobe

At this point our /init will have an updated /dev containing any partitions (/dev/sda1) or logical volumes (/dev/ubuntu-vg/root), at which point we can act on these partitions to grow them and their filesystems.
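A sketch of steps 2 and 3 in Go, as the ram-disk /init would run them (the fallback logic and error handling are illustrative). It also previews the TODO above: BLKRRPART is the ioctl that partprobe ultimately issues, so we can ask the kernel to re-read the partition table directly without shelling out at all.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// BLKRRPART asks the kernel to re-read a block device's partition
// table; it is what partprobe issues under the hood (from <linux/fs.h>).
const BLKRRPART = 0x125F

// rereadPartitions is the ioctl alternative to exec'ing partprobe.
func rereadPartitions(device string) error {
	f, err := os.Open(device)
	if err != nil {
		return err
	}
	defer f.Close()
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), BLKRRPART, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Step 2: scan /dev/sda for the newly written disk contents
	if err := rereadPartitions("/dev/sda"); err != nil {
		// Fall back to the partprobe binary inside the ram-disk
		if err := exec.Command("/usr/sbin/partprobe", "/dev/sda").Run(); err != nil {
			fmt.Println("partprobe failed:", err)
		}
	}
	// Step 3: enable any logical volumes
	if err := exec.Command("/sbin/lvm", "vgchange", "-ay").Run(); err != nil {
		fmt.Println("lvm vgchange failed:", err)
	}
}
```

Both paths need the device nodes present in the ram-disk's /dev, and the ioctl path needs the appropriate privileges; outside the ram-disk this simply reports errors.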

Some cheeky chroot manoeuvres

As mentioned we can use a workflow that means we can use the tooling that exists within our newly deployed Operating System instead of filling our ramdisk with additional tooling and the associated dependencies.

The following workflow will grow a 900MB image written to a 20GB disk that uses LVM:

  1. Mount the root volume inside the ram-disk /dev/ubuntu-vg/root -> /mnt
  2. Use chroot to pretend we're inside the newly deployed Operating System and grow the underlying partition chroot /mnt /usr/bin/growpart /dev/sda 1
  3. Again chroot and update LVM to see the newly grown disk chroot /mnt /sbin/pvresize /dev/sda1
  4. Grow the root logical volume chroot /mnt /sbin/lvresize -l +100%FREE /dev/ubuntu-vg/root
  5. Grow the filesystem within the logical volume chroot /mnt /sbin/resize2fs /dev/ubuntu-vg/root
  6. Finally, unmount the logical volume, ensuring that any writes are flushed!
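From Go inside the ram-disk, the sequence above is just a series of exec calls. A possible sketch (device, partition and volume names are the ones from the example and would differ per image):

```go
package main

import (
	"fmt"
	"os/exec"
)

// growRoot runs the chroot-based grow sequence described above,
// stopping at the first step that fails.
func growRoot() error {
	steps := [][]string{
		{"mount", "/dev/ubuntu-vg/root", "/mnt"},
		{"chroot", "/mnt", "/usr/bin/growpart", "/dev/sda", "1"},
		{"chroot", "/mnt", "/sbin/pvresize", "/dev/sda1"},
		{"chroot", "/mnt", "/sbin/lvresize", "-l", "+100%FREE", "/dev/ubuntu-vg/root"},
		{"chroot", "/mnt", "/sbin/resize2fs", "/dev/ubuntu-vg/root"},
		{"umount", "/mnt"}, // unmount flushes any outstanding writes
	}
	for _, s := range steps {
		if out, err := exec.Command(s[0], s[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %v (%s)", s, err, out)
		}
	}
	return nil
}

func main() {
	if err := growRoot(); err != nil {
		fmt.Println(err)
	}
}
```

The nice part of the chroot trick is that growpart, pvresize, lvresize and resize2fs all come from the freshly written image, so the ram-disk itself stays tiny.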

Configure networking and other things

The above workflow for growing the storage mounts our newly provisioned disk image; once we've finished growing the disk/partitions and filesystem, we have the opportunity to do additional post-deployment steps. For networking this could include writing a static networking configuration, given to us by the provisioning server, to the underlying filesystem before we unmount it and boot from it.

Overview

In this post we have detailed some of the deployment technologies that exist to provision operating systems, and we can appreciate that there are a number of options available as to which approach works best. We've also stepped through various code snippets detailing functionality that has recently been added to plndr, giving it the capability to create and use OS images to quickly and efficiently deploy operating systems to bare-metal (and virtualised) servers.

All of the source code samples came from BOOTy, which is a project to build an initramfs that can perform a callback to a plndr server to find out its course of action.

Any questions, mistakes or corrections either add in the comments or hit me up on twitter -> thebsdbox


This post will detail a number of (I think at least) awesome use-cases for client-go. Some of these use-cases are similar, or almost identical, to the examples that already exist, but with some additional text explaining what some of these terms actually mean and when it makes sense to use them.

Why Client-Go?

Kubernetes exposes everything through an API (all managed by the active API server) from the control-plane. The API is REST based and is the sole way of controlling a cluster. This means that things like CI/CD pipelines, various dashboards and even kubectl all use the API through an endpoint (network address) with credentials (key) in order to communicate with a cluster.

As this is just standard REST over HTTP(S), there are a myriad of methods that can be used to communicate with the Kubernetes API. We can demonstrate this with a quick example:

Check health of Kubernetes API

$ curl -k https://control-plane01:6443/healthz
ok

The above example is an endpoint that requires no authentication, however if we try to use another endpoint without the correct authentication we’ll receive something like the following:

$ curl -k https://control-plane01:6443/api
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/api\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

Note: Under most circumstances the authentication a user needs to speak with a cluster will live within a $HOME/.kube/config file. Tooling like kubectl will automatically look for this file in order to communicate with the API endpoint.

I wouldn't recommend interacting with the raw endpoints as shown above; they're mainly shown as an example of what is possible. To make life far simpler for developers interacting with the Kubernetes API, there are a number of wrappers/SDKs that provide:

  • Control and management of both the SDK <-> API versions
  • Language specific objects and methods to provide sane interfaces to Kubernetes objects
  • Helper functions for logging in and managing cluster access
  • (A plethora of additional features to make your life easier)

As mentioned this post will cover client-go, but there are numerous SDKs in various languages that are covered in varying levels of detail here.

Accessing a cluster, either In-Cluster or Outside cluster

This can be confusing the first time an end-user attempts to authenticate with a cluster using client-go.


Whilst working on a number of Kubernetes control-plane deployments, and fighting with kubeadm and components like haproxy, nginx and keepalived, I decided to try to create my own load-balancer. In the end it proved not too complicated to replicate both a virtual IP and load-balancing over the backends (master nodes) with the standard Go packages. That project is now pretty stable and can easily be used to create an HA control-plane; all of it can be found at https://kube-vip.io. The next thing I wanted to try (and I've been considering learning about this for a while) was creating a load-balancer "within" Kubernetes, as in providing the capability and functionality behind kubectl expose <...> --type=LoadBalancer. It turns out that in order to provide this functionality within Kubernetes you need to write a Cloud Provider that the Kubernetes Cloud Controller Manager can interface with.

This post will “chronicle” the process for doing that… :-D

Kubernetes Cloud Providers

We will start with the obvious question.

What are Cloud providers?

Out of the box Kubernetes can't really do a great deal; it needs a lot of components to sit on top of, or to interface with, in order to provide the capability to run workloads. For example, even a basic Kubernetes cluster requires a container runtime (CRI, Container Runtime Interface) in order to execute containers, and a networking plugin (CNI, Container Network Interface) in order to provide networking within the cluster.

On the flip side, a typical cloud company (AWS, GCloud, Azure etc…) offers a plethora of cool features and functionality that it would be awesome to consume through the Kubernetes cluster:

  • Load Balancers
  • Cloud instances (VMs, in some places bare-metal)
  • Areas/zones
  • Deep API integrations into the infrastructure

So how do we marry up these two platforms to share that functionality …

.. Kubernetes Cloud Providers ..

Using Cloud providers

In most circumstances, you won't even know that you're using a cloud provider (which I suppose is kind of the point); only when you try to create an object that the cloud provider can create/manage/delete will it actually be invoked.

The most common use-case (and the one this post is focussing on) is the creation of a load balancer within Kubernetes, with its "counterpart" being provided by the cloud vendor. In the case of Amazon Web Services (AWS), creating a Service of type: LoadBalancer will create an Elastic Load Balancer (ELB) that will then load balance traffic over the selected pods. All of this functionality from the Cloud Provider Interface abstracts away the underlying technology, and regardless of where a cluster is running a LoadBalancer just becomes a LoadBalancer.
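For reference, this manifest is the entire request from the user's point of view (the service name, selector and ports below are hypothetical); the active cloud provider does everything else.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer   # fulfilled by whichever cloud provider is running
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

On AWS this yields an ELB, on other platforms something else entirely, yet the Service object looks identical everywhere.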

Creating a Cloud Provider!

So now, on to the actual steps for creating your own cloud provider. This is all going to be written in Go, and I'll do my best to be as descriptive as possible.

Wait, what is the cloud-controller-manager?

In Kubernetes v1.6 the original design was that all the cloud providers would have their vendor-specific code live in the same place. This ultimately led to a point where every Kubernetes cluster came with a large cloud-controller-manager that at startup would be told which vendor code path to run down.

These were originally called In Tree cloud providers and there has been a push over the last few years to move to Out of Tree providers. When deploying a Kubernetes cluster the only change is that instead of starting the cloud-controller-manager with a specific vendor path (e.g. vsphere or aws), the operator would deploy the vendor specific cloud-provider such as cloud-provider-aws.

A Note about the “why” of In Tree / Out of Tree

There has been a shift towards stripping code and "vendor specific" functionality out of the main Kubernetes source repositories and into their own repositories. The main reasons for this:

  • Removes a tight-coupling between external/vendor code and Kubernetes proper
  • Allows these projects to move at a different release cadence from the main project
  • Slims the Kubernetes code base and allows these things to become optional
  • Reduces the vulnerability footprint of vendor code within the Kubernetes project
  • The interfaces ensure ongoing compatibility for these Out Of Tree projects

So for someone to create their own cloud provider they will need to follow a standard that was set by the original cloud-controller-manager, this standard is exposed through method sets and interfaces which can be read about more here.

tl;dr: simply put, the cloud-controller-manager sets a standard that means if I want to expose a Load Balancer service, my provider needs to expose a number of methods (with matching signatures). We can further see in the LoadBalancer interface here all of the functions that my LoadBalancer must expose in order to work.

The interface

The interface for a cloud-provider can be viewed here; we can see that this interface provides a number of functions that will return the interface for a specific type of functionality.

The more common interfaces I’ve summarised below:

  • Instances controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
  • LoadBalancers controller - responsible for load balancers on your cloud against services of type: LoadBalancer.
  • Routes controller - responsible for setting up network routes on your cloud

Example provider (code) cloud-provider-thebsdbox

This section will cover in Go code all of the basics for building a cloud-provider that will handle all of the services requests (that are type: LoadBalancer). When implementing your own, ensure you use correct paths and package names!

Our cloud-provider will use the following files:

pkg/thebsdbox/cloud.go
pkg/thebsdbox/loadbalancer.go
pkg/thebsdbox/disabled.go
main.go

The next sections will have the source code and then an overview at the end of each file.

cloud.go

package thebsdbox

import (
	"io"

	"k8s.io/client-go/informers"
	cloudprovider "k8s.io/cloud-provider"
)

const ProviderName = "thebsdbox"

// ThebsdboxCloudProvider - contains all of the interfaces for our cloud provider
type ThebsdboxCloudProvider struct {
	lb cloudprovider.LoadBalancer
}

var _ cloudprovider.Interface = &ThebsdboxCloudProvider{}

func init() {
	cloudprovider.RegisterCloudProvider(ProviderName, newThebsdboxCloudProvider)
}

func newThebsdboxCloudProvider(io.Reader) (cloudprovider.Interface, error) {
	return &ThebsdboxCloudProvider{
		lb: newLoadBalancer(),
	}, nil
}

// Initialize - starts the cloud-provider controller
func (t *ThebsdboxCloudProvider) Initialize(clientBuilder cloudprovider.ControllerClientBuilder, stop <-chan struct{}) {
	clientset := clientBuilder.ClientOrDie("do-shared-informers")
	sharedInformer := informers.NewSharedInformerFactory(clientset, 0)

	sharedInformer.Start(nil)
	sharedInformer.WaitForCacheSync(nil)
}

// ProviderName returns the cloud provider ID.
func (t *ThebsdboxCloudProvider) ProviderName() string {
	return ProviderName
}

// ENABLED Services

// LoadBalancer returns a loadbalancer interface. Also returns true if the interface is supported, false otherwise.
func (t *ThebsdboxCloudProvider) LoadBalancer() (cloudprovider.LoadBalancer, bool) {
	return t.lb, true
}

Our Cloud-Provider ThebsdboxCloudProvider struct{}

This struct{} contains our vendor-specific implementations of functionality, such as load-balancers, instances etc.

type ThebsdboxCloudProvider struct {
	lb cloudprovider.LoadBalancer
}

We’re only defining a loadbalancer lb variable as part of our cloud-provider instances as this is the only functionality our provider will expose.

init()

This function will mean that before our cloud-provider actually starts (before the main() function is called) we will register our vendor specific cloud-provider. It will also ensure that our newly registered cloud-provider will be instantiated with the newThebsdboxCloudProvider function.

Instantiating our cloud-provider newThebsdboxCloudProvider()

When our cloud-provider has actually started (the main() function has been called), the cloud-provider controller will look at all registered providers and find ours, registered in the init() function. It will then call our instantiation function newThebsdboxCloudProvider(), which in turn calls newLoadBalancer() to perform any pre-tasks for setting up our load balancer before assigning it to lb.

Enabling our Load-Balancer func (t *ThebsdboxCloudProvider) LoadBalancer() (cloudprovider.LoadBalancer, bool)

This function is pretty much the crux of enabling the load balancer functionality, and as part of the cloud-controller-manager spec it defines what functionality our cloud-provider will expose. These functions return two things:

  1. Our instantiated functionality (in this case our loadbalancer object, returned as lb)
  2. If this is enabled or not (true/false)

Everything that we’re not exposing from our cloud-provider will return false and can be seen in the disabled.go source.

disabled.go

All of these functions disable these bits of functionality within our cloud-provider

package thebsdbox

import cloudprovider "k8s.io/cloud-provider"

// Instances returns an instances interface. Also returns true if the interface is supported, false otherwise.
func (t *ThebsdboxCloudProvider) Instances() (cloudprovider.Instances, bool) {
	return nil, false
}

// Zones returns a zones interface. Also returns true if the interface is supported, false otherwise.
func (t *ThebsdboxCloudProvider) Zones() (cloudprovider.Zones, bool) {
	return nil, false
}

// Clusters returns a clusters interface. Also returns true if the interface is supported, false otherwise.
func (t *ThebsdboxCloudProvider) Clusters() (cloudprovider.Clusters, bool) {
	return nil, false
}

// Routes returns a routes interface along with whether the interface is supported.
func (t *ThebsdboxCloudProvider) Routes() (cloudprovider.Routes, bool) {
	return nil, false
}

// HasClusterID returns true if a ClusterID is required and set.
func (t *ThebsdboxCloudProvider) HasClusterID() bool {
	return false
}

loadbalancer.go

Our LoadBalancer source code again has to match the interface as expressed here; we can see those functions defined below and exposed as methods on our thebsdboxLBManager struct.

package thebsdbox

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	cloudprovider "k8s.io/cloud-provider"
	"k8s.io/klog"
)

// thebsdboxLBManager -
type thebsdboxLBManager struct {
	kubeClient *kubernetes.Clientset
	namespace  string
}

func newLoadBalancer() cloudprovider.LoadBalancer {
	// Needs code to get a kubeclient => client
	// Needs code to get a namespace to operate in => ns

	return &thebsdboxLBManager{
		kubeClient: client,
		namespace:  ns,
	}
}

func (tlb *thebsdboxLBManager) EnsureLoadBalancer(ctx context.Context, clusterName string, service *v1.Service, nodes []*v1.Node) (lbs *v1.LoadBalancerStatus, err error) {
	return tlb.syncLoadBalancer(service)
}

func (tlb *thebsdboxLBManager) UpdateLoadBalancer(ctx context.Context, clusterName string, service *v1.Service, nodes []*v1.Node) (err error) {
	_, err = tlb.syncLoadBalancer(service)
	return err
}

func (tlb *thebsdboxLBManager) EnsureLoadBalancerDeleted(ctx context.Context, clusterName string, service *v1.Service) error {
	return tlb.deleteLoadBalancer(service)
}

func (tlb *thebsdboxLBManager) GetLoadBalancer(ctx context.Context, clusterName string, service *v1.Service) (status *v1.LoadBalancerStatus, exists bool, err error) {

	// RETRIEVE EXISTING LOAD BALANCER STATUS (sets vip)

	return &v1.LoadBalancerStatus{
		Ingress: []v1.LoadBalancerIngress{
			{
				IP: vip,
			},
		},
	}, true, nil
}

// GetLoadBalancerName returns the name of the load balancer. Implementations must treat the
// *v1.Service parameter as read-only and not modify it.
func (tlb *thebsdboxLBManager) GetLoadBalancerName(_ context.Context, clusterName string, service *v1.Service) string {
	return getDefaultLoadBalancerName(service)
}

func getDefaultLoadBalancerName(service *v1.Service) string {
	return cloudprovider.DefaultLoadBalancerName(service)
}

func (tlb *thebsdboxLBManager) deleteLoadBalancer(service *v1.Service) error {
	klog.Infof("deleting service '%s' (%s)", service.Name, service.UID)

	// DELETE LOAD BALANCER LOGIC (sets err)

	return err
}

func (tlb *thebsdboxLBManager) syncLoadBalancer(service *v1.Service) (*v1.LoadBalancerStatus, error) {

	// CREATE / UPDATE LOAD BALANCER LOGIC (and return the updated load balancer IP as vip)

	return &v1.LoadBalancerStatus{
		Ingress: []v1.LoadBalancerIngress{
			{
				IP: vip,
			},
		},
	}, nil
}

Instantiating the LoadBalancer newLoadBalancer()

This function is called when the cloud-provider itself is initialised, and can be seen in cloud.go as part of the newThebsdboxCloudProvider() function. Once created, the load balancer object is added to the cloud-provider's main object for use when needed.

Interface methods

EnsureLoadBalancer

Creates a LoadBalancer if one didn't exist to begin with, and then returns its status (with the load balancer address).

UpdateLoadBalancer

Updates an existing LoadBalancer, or will create one if it didn't exist, and then returns its status (with the load balancer address).

EnsureLoadBalancerDeleted

Calls GetLoadBalancer first to ensure that the load balancer exists, and if so it will delete the vendor specific load balancer. If this completes successfully then the service of type: LoadBalancer is removed as an object within Kubernetes.

GetLoadBalancer

This will speak natively to the vendor specific load balancer to make sure that it has been provisioned correctly.

GetLoadBalancerName

Returns the name of the load balancer instance.

main.go

This is the standard main.go as given by the actual cloud-controller-manager example. The one change is the addition of our // OUR CLOUD PROVIDER import, which adds all of our vendor-specific cloud-provider methods.

  1. init() in cloud.go is called and registers our cloud-provider and the call back to our newCloudProvider() method.
  2. The command.Execute() in main.go starts the cloud-controller-manager
  3. The cloud-controller-manager method will look at all of the registered cloud-providers and find our registered provider.
  4. Our provider will have its newCloudProvider() method called, which sets up everything that is needed for it to be able to complete its tasks.
  5. Our cloud provider is now running, when a user tries to create a resource that we’ve registered for (Load Balancers) our vendor code will be called to provide this functionality.
package main

import (
	"math/rand"
	"os"
	"time"

	"k8s.io/component-base/logs"
	"k8s.io/kubernetes/cmd/cloud-controller-manager/app"

	_ "k8s.io/component-base/metrics/prometheus/version" // for version metric registration
	// NOTE: Importing all in-tree cloud-providers is not required when
	// implementing an out-of-tree cloud-provider.
	_ "k8s.io/component-base/metrics/prometheus/clientgo" // load all the prometheus client-go plugins
	_ "k8s.io/kubernetes/pkg/cloudprovider/providers"

	// OUR CLOUD PROVIDER
	_ "github.com/user/pkg/thebsdbox"
)

func main() {
	rand.Seed(time.Now().UnixNano())
	command := app.NewCloudControllerManagerCommand()

	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}

Summary

Hopefully this is of some use in understanding how a Kubernetes cloud-provider is architected. To explore a few more examples, I've included some other providers:

Good luck and any feedback feel free to message me


This post is relatively delayed due to laziness "business reasons"; also, the last post about bare-metal Kubernetes deployments took too long.

A lot of people on the twitter sphere published a decade and/or a year in review.. so I’ve raided my iCloud photo library and will attempt to put something together.

Pre-2009

Added this in because it’s relevant to where I found myself ending up in 2009..

So prior to 2009 I had been through a couple of support roles, being a DB admin and running a support team of Linux/UNIX, Windows, DBs and backup engineers inside a (B)ritish (T)elecommunications company. Following this I became a consolidation architect with Sun Microsystems (/me pours one out for SUN), with the project goal of shrinking five creaking old data centres into a shiny brand new one for the same (B)ritish (T)elecommunications company. The main goal was taking their big old servers (E10Ks, V440s etc..) and migrating their applications into Solaris Zones.

E10K with human for comparison

You might be wondering why this matters; well, somehow I've ended up doing the same thing over ten years later ¯\_(ツ)_/¯

Back to the timeline: towards the end of 2008 things took a pretty grim turn, and the financial crash caused a 50-person team (in September) to be down to 4 people by November, at which point it became a 3-person team (me being the 4th person 😔).

2009

In mid-January I had been out of work for three months; the financial crash combined with the Christmas period meant that the job market was pretty much non-existent. I think we had enough money for a month or two of rent where we were living, and then that would be it 😐…

Then finally a random recruitment email appeared (which I was surprised to find I still have in my Gmail 😀)

Hello Daniel,

Would you be interested in a permanent role with the
European Space Agency based in Frankfurt?

Regards

A number of interviews later I accepted the role, packed my bags, booked a flight to Frankfurt… and then got the bus to Darmstadt.


So not actually Frankfurt after all

The European Space Agency (ESOC - Space Operations Centre) 2009-2012

I spent four crazy years living in Germany where I made a ton of friends from all over the world (Australia, Indonesia, Bulgaria, Italy, America, Spain, Greece and of course Germany). From a social perspective Germany has a fantastic outdoor lifestyle:

  • Endless wine and beer festivals
  • Dedicated cycling lanes and walking routes throughout the countryside
  • Excellent weather for lazy evenings
  • Cafes and restaurants have a focus on chilling outdoors
  • Towards the end of the year it’s the autumn festivals and then the Christmas markets
  • Did I mention the beer festivals?

The work itself was pretty varied, from writing reams of paperwork (a 100+ page guide to installing SuSE 🙄) to architecting and building platforms and systems for actual space missions. There was also the opportunity to support some of the actual satellite missions, which was always both exciting and pretty terrifying to be part of.


Everyone, terrified, witnessing the ATV (Automated Transfer Vehicle) dock un-assisted with the ISS


Before we modernised the command centre


All brand new, (still powered by some of my hacky kernel modules)

Back to UK, SKY 2012-2014

With Kim unable to really find a role that made sense for her, we made the decision to head back to the UK as the market had started to improve. I took a role with Sky that involved some of the heaviest infrastructure work I've ever been part of, but a fantastic experience nonetheless. Deploying a disaster recovery site almost single-handedly was a highlight! (Ah, site B.) I got to play with and learn about everything from servers (HPE/Cisco with vSphere) to networking (Cisco Nexus switches) and storage (EMC/IBM XiV/Brocade). I also had the opportunity to decommission the original site, which was also fun and also a little bit on the gross side… when your cabling is "organically" grown, it ultimately becomes an unfathomable under-the-floor mess for someone else (me) to sort out.


This took days to sort out

Suits, travel and presenting, HPE 2014 - 2017

By now Kim's career had taken off. I'd spent a fantastic few years getting pretty low-level with various infrastructure vendors when one reached out to me. Approximately 48 interviews later I'd bought a suit (mandatory) and was part of a team in HPE focussed on the merits of "converged infrastructure"!


I certainly did not choose the title on this 🙄

Over the three to four years at HPE I created a lot of PowerPoint, learnt a great deal about PCI-E switching ¯\_(ツ)_/¯, memristors and a variety of other technologies. I was more than a little insecure about being the only person on the team without several CCIE certifications amongst various other industry accolades; however, the entire team was 110% focussed on knowledge sharing and ensuring we all learnt from one another as much as possible (something I'm passionate about). It was here that I was asked to present in front of an audience for the first time, which I still have nightmares about to this day… standing on stage shaking and stumbling over my words whilst trying to remember all my content about "sustainable data centre energy use" 😬. Although somehow, after getting through that presentation, I ended up being drafted to present more and more, to the point where I was regularly travelling around doing roadshows about various new HPE products. At some point I was asked to work on a new HPE partnership with a startup called Docker 🐳…

As part of some of the work I'd been doing at HPE around automating bare-metal, I'd stumbled across a project called Docker InfraKit and had written a plugin so that it could be extended to automate HPE's converged infrastructure. This led to a chance email from an employee at Docker asking if I'd be interested in participating in another project they were developing. I immediately said "yes" as it sounded super exciting to be part of! I received an email invite to a private GitHub repository and discovered a new project for building Linux distributions… to my horror the entire project was written in Go (I'd never written a single line of Go code at this point) 😱

Too embarrassed to say anything, I decided to try to quickly get up to speed; I quietly worked away on simple fixes to command-line flags and updating documentation. Finally I managed to write some code that "worked" and allowed the tool to build Linux images as VMDK files and run them in VMware Fusion and vSphere clusters.

A random start to a startup, Docker 2017 - 2018

Whilst still at HPE I went to DockerCon US 2017 in Austin, which turned out to be a very bonkers experience…

I had the privilege of joining the Docker Captains program, which is a group of people that are ambassadors around all things Docker.

I got to see the project I’d become part of finally be renamed (Moby -> LinuxKit) and be released to the wider community!

I made some amazing friends at Docker including Betty and Scott, amongst others :-)

Also getting to hang around with Solomon was great fun :-)

Whilst also in Austin I was asked to join the Docker team 🐳😀

It was a fantastic 18 months which involved:

  • Continuing working on InfraKit/LinuxKit
  • Working on Docker Swarm (swarmkit)
  • Presenting at DockerCon EU 17 and DockerCon US 18
  • Helping customers around EMEA successfully deploy Containers and clusters, also how to migrate their applications correctly
  • Building a modern app migration strategy for Docker Enterprise Edition

36 hours in Seattle, vHeptio 2018 - now

Toward the end of the 18 months, changes in strategy had started to push me to consider other options, and after a "chance" meeting with Scott Lowe in a coffee shop in San Francisco I decided to go and have a real conversation with Heptio (I'd also had a conversation with $current_employer, but that's another story). I flew from the UK to spend a day and a half in Seattle speaking with Heptio; I pretty much spent the same amount of time in airports and planes as I did in Seattle itself, but it was worth it.


The Seattle Space Needle


Seattle has some amazing buildings

After accepting I became the sole “field engineer” in EMEA and ended up fiddling with, breaking and fixing Kubernetes clusters for a living :-)

Heptio was acquired by VMware in early 2019, and we’ve largely been continuing the same work.. The only change is that I’m now no longer the only person in EMEA 😀


Dan Finneran



Sheffield (UK)