Cluster API Provider RKE2


What is Cluster API Provider RKE2

The Cluster API brings declarative, Kubernetes-style APIs to cluster creation, configuration and management.

Cluster API Provider RKE2 (CAPRKE2) combines two provider types: a Cluster API Control Plane Provider for provisioning Kubernetes control plane nodes, and a Cluster API Bootstrap Provider for bootstrapping Kubernetes on a machine where RKE2 is used as the Kubernetes distribution.


Getting Started

Follow our getting started guide to start creating RKE2 clusters with CAPI.

Developer Guide

Check our developer guide for instructions on how to set up your dev environment in order to contribute to this project.

Get in contact

You can get in contact with us via the #capbr channel on the Rancher Users Slack.

User guide

This section contains a getting started guide to help new users utilise CAPRKE2.

Getting Started

Cluster API Provider RKE2 is compliant with the clusterctl contract, which means that clusterctl simplifies its deployment to the CAPI Management Cluster. In this Getting Started guide, we will be using the RKE2 Provider with the docker provider (also called CAPD).

Prerequisites

  • clusterctl to handle the lifecycle of a Cluster API management cluster
  • kubectl to apply the workload cluster manifests that clusterctl generates
  • kind and docker to create a local Cluster API management cluster
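
Before continuing, it can help to confirm that the tools listed above are installed. The following is a minimal sketch, not part of the provider: the function name missing_tools is hypothetical, and it only checks that each tool is found on the PATH, not that its version is suitable.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# (missing_tools is a hypothetical helper name used for this sketch.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report anything missing from the prerequisites listed above.
missing=$(missing_tools clusterctl kubectl kind docker)
if [ -n "$missing" ]; then
  printf 'Missing prerequisites:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all four tools were found.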

Management Cluster

In order to use this provider, you need to have a management cluster available to you and have your current KUBECONFIG context set to talk to that cluster. If you do not have a cluster available to you, you can create a kind cluster. These are the steps needed to achieve that:

  1. Ensure kind is installed (https://kind.sigs.k8s.io/docs/user/quick-start/#installation)
  2. Create a special kind configuration file if you intend to use the Docker infrastructure provider:
cat > kind-cluster-with-extramounts.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: capi-test
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
EOF
  3. Run the following command to create a local kind cluster:
kind create cluster --config kind-cluster-with-extramounts.yaml
  4. Check your newly created kind cluster:
kubectl cluster-info

You should get a result similar to this:

Kubernetes control plane is running at https://127.0.0.1:40819
CoreDNS is running at https://127.0.0.1:40819/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Setting up clusterctl

CAPI >= v1.6.0

No additional steps are required and you can install the RKE2 provider with clusterctl directly:

clusterctl init --core cluster-api:v1.7.6 --bootstrap rke2:v0.7.0 --control-plane rke2:v0.7.0 --infrastructure docker:v1.7.6

Next, you can proceed to creating a workload cluster.

CAPI < v1.6.0

With CAPI & clusterctl versions less than v1.6.0 you need a specific configuration. To do this create a file called clusterctl.yaml in the $HOME/.cluster-api folder with the following content (substitute ${VERSION} with a valid semver specification - e.g. v0.5.0 - from releases):

providers:
  - name: "rke2"
    url: "https://github.com/rancher/cluster-api-provider-rke2/releases/${VERSION}/bootstrap-components.yaml"
    type: "BootstrapProvider"
  - name: "rke2"
    url: "https://github.com/rancher/cluster-api-provider-rke2/releases/${VERSION}/control-plane-components.yaml"
    type: "ControlPlaneProvider"

NOTE: Due to an issue in how CAPD creates Load Balancer health checks, you need to use a fork of CAPD. To do so, add the following entry to the configuration file above:

  - name: "docker"
    url: "https://github.com/belgaied2/cluster-api/releases/v1.3.3-cabpr-fix/infrastructure-components.yaml"
    type: "InfrastructureProvider"

This configuration tells clusterctl where to look for provider manifests in order to deploy provider components in the management cluster.
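
As a convenience, the file can be generated from a shell instead of being edited by hand. This is a sketch under stated assumptions: the function name write_clusterctl_config is hypothetical, it writes only the two RKE2 provider entries shown above (the docker entry from the NOTE would have to be appended separately), and it overwrites any existing clusterctl.yaml in the target directory.

```shell
# Write a clusterctl configuration that registers the RKE2 bootstrap and
# control plane providers for the given release version.
#   $1: provider release version, e.g. v0.5.0
#   $2: target directory (defaults to $HOME/.cluster-api)
# NOTE: hypothetical helper; it overwrites an existing clusterctl.yaml.
write_clusterctl_config() {
  version=$1
  config_dir=${2:-$HOME/.cluster-api}
  if [ -z "$version" ]; then
    echo "usage: write_clusterctl_config VERSION [DIR]" >&2
    return 1
  fi
  base_url="https://github.com/rancher/cluster-api-provider-rke2/releases/${version}"
  mkdir -p "$config_dir" || return 1
  cat > "$config_dir/clusterctl.yaml" <<EOF
providers:
  - name: "rke2"
    url: "${base_url}/bootstrap-components.yaml"
    type: "BootstrapProvider"
  - name: "rke2"
    url: "${base_url}/control-plane-components.yaml"
    type: "ControlPlaneProvider"
EOF
}
```

For example, write_clusterctl_config v0.5.0 would write the file to $HOME/.cluster-api/clusterctl.yaml.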

The next step is to run the clusterctl init command:

clusterctl init --bootstrap rke2 --control-plane rke2 --infrastructure docker:v1.3.3-cabpr-fix

This should output something similar to the following:

Fetching providers
Installing cert-manager Version="v1.10.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v1.3.3" TargetNamespace="capi-system"
Installing Provider="bootstrap-rke2" Version="v0.1.0-alpha.1" TargetNamespace="rke2-bootstrap-system"
Installing Provider="control-plane-rke2" Version="v0.1.0-alpha.1" TargetNamespace="rke2-control-plane-system"

Your management cluster has been initialized successfully!

You can now create your first workload cluster by running the following:

  clusterctl generate cluster [name] --kubernetes-version [version] | kubectl apply -f -

Create a workload cluster

There are some sample cluster templates available under the samples folder. This section assumes you are using CAPI v1.6.0 or higher.

For this Getting Started section, we will be using the docker samples available under samples/docker/online-default folder. This folder contains a YAML template file called rke2-sample.yaml which contains environment variable placeholders which can be substituted using the envsubst tool. We will use clusterctl to generate the manifests from these template files. Set the following environment variables:

  • CABPR_NAMESPACE
  • CLUSTER_NAME
  • CABPR_CP_REPLICAS
  • CABPR_WK_REPLICAS
  • KUBERNETES_VERSION
  • KIND_IMAGE_VERSION

for example:

export CABPR_NAMESPACE=example
export CLUSTER_NAME=capd-rke2-test
export CABPR_CP_REPLICAS=3
export CABPR_WK_REPLICAS=2
export KUBERNETES_VERSION=v1.27.3
export KIND_IMAGE_VERSION=v1.27.3
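
Because a forgotten variable leaves a placeholder unresolved, it may be worth checking the environment before generating the manifests. The sketch below is an illustration only: unset_vars is a hypothetical helper name, and it expects plain variable names as arguments.

```shell
# Print the names of the given environment variables that are unset or empty.
# (unset_vars is a hypothetical helper; pass trusted variable names only,
# since the names are expanded with eval.)
unset_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Check the variables used by the sample template.
unset_vars CABPR_NAMESPACE CLUSTER_NAME CABPR_CP_REPLICAS \
  CABPR_WK_REPLICAS KUBERNETES_VERSION KIND_IMAGE_VERSION
```

No output means every listed variable has a value.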

The next step is to substitute the values in the YAML using the following commands:

cd samples/docker/online-default
cat rke2-sample.yaml | clusterctl generate yaml > rke2-docker-example.yaml

At this moment, you can take some time to study the resulting YAML, then you can apply it to the management cluster:

kubectl apply -f rke2-docker-example.yaml

and see the following output:

namespace/example created
cluster.cluster.x-k8s.io/capd-rke2-test created
dockercluster.infrastructure.cluster.x-k8s.io/capd-rke2-test created
rke2controlplane.controlplane.cluster.x-k8s.io/capd-rke2-test-control-plane created
dockermachinetemplate.infrastructure.cluster.x-k8s.io/controlplane created
machinedeployment.cluster.x-k8s.io/worker-md-0 created
dockermachinetemplate.infrastructure.cluster.x-k8s.io/worker created
rke2configtemplate.bootstrap.cluster.x-k8s.io/capd-rke2-test-agent created
configmap/capd-rke2-test-lb-config created

Checking the workload cluster

After waiting several minutes, you can check the state of CAPI machines, by running the following command:

kubectl get machine -n example

and you should see output similar to the following:

NAME                                 CLUSTER          NODENAME                                 PROVIDERID                                          PHASE     AGE   VERSION
capd-rke2-test-control-plane-9fw9t   capd-rke2-test   capd-rke2-test-control-plane-9fw9t       docker:////capd-rke2-test-control-plane-9fw9t       Running   35m   v1.27.3+rke2r1
capd-rke2-test-control-plane-m2sdk   capd-rke2-test   capd-rke2-test-control-plane-m2sdk       docker:////capd-rke2-test-control-plane-m2sdk       Running   12m   v1.27.3+rke2r1
capd-rke2-test-control-plane-zk2xb   capd-rke2-test   capd-rke2-test-control-plane-zk2xb       docker:////capd-rke2-test-control-plane-zk2xb       Running   27m   v1.27.3+rke2r1
worker-md-0-fhxrw-crn5g              capd-rke2-test   capd-rke2-test-worker-md-0-fhxrw-crn5g   docker:////capd-rke2-test-worker-md-0-fhxrw-crn5g   Running   36m   v1.27.3+rke2r1
worker-md-0-fhxrw-qsk7n              capd-rke2-test   capd-rke2-test-worker-md-0-fhxrw-qsk7n   docker:////capd-rke2-test-worker-md-0-fhxrw-qsk7n   Running   36m   v1.27.3+rke2r1
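
If you prefer a scripted check over reading the table, the phases can be filtered with a small helper. This is a sketch, not provider tooling: the function names are hypothetical, and it assumes the Machine phase is exposed at .status.phase, as the PHASE column above suggests.

```shell
# Read "name<TAB>phase" lines on stdin and print the machines whose phase
# is not Running. The exit status is 0 only when every machine is Running.
machines_not_running() {
  awk -F '\t' 'NF && $2 != "Running" { print $1; pending = 1 } END { exit pending + 0 }'
}

# Emit one "name<TAB>phase" line per Machine in the given namespace.
list_machine_phases() {
  kubectl get machine -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Usage, with the namespace from this guide:
#   list_machine_phases example | machines_not_running
```

Keeping the filter separate from the kubectl call makes it possible to try the filter on saved output.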

Accessing the workload cluster

Once the cluster is fully provisioned, you can check its status with:

kubectl get cluster -n example

and see output similar to this:

NAMESPACE   NAME             CLUSTERCLASS   PHASE         AGE   VERSION
example     capd-rke2-test                  Provisioned   31m

You can also get an “at a glance” view of the cluster and its resources by running:

clusterctl describe cluster capd-rke2-test -n example

This should produce output similar to this:

NAME                                                            READY  SEVERITY  REASON  SINCE  MESSAGE                                                                         
Cluster/capd-rke2-test                                          True                     2m56s                                                                                   
├─ClusterInfrastructure - DockerCluster/capd-rke2-test          True                     31m                                                                                     
├─ControlPlane - RKE2ControlPlane/capd-rke2-test-control-plane  True                     2m56s                                                                                   
│ └─3 Machines...                                               True                     28m    See capd-rke2-test-control-plane-9fw9t, capd-rke2-test-control-plane-m2sdk, ...  
└─Workers                                                                                                                                                                        
  └─MachineDeployment/worker-md-0                               True                     15m                                                                                     
    └─2 Machines...                                             True                     25m    See worker-md-0-fhxrw-crn5g, worker-md-0-fhxrw-qsk7n

🎉 CONGRATULATIONS! 🎉 You created your first RKE2 cluster with CAPD as an infrastructure provider.
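
To talk to the new workload cluster itself, you need its kubeconfig, which clusterctl can retrieve with its get kubeconfig command. The helper below is a minimal sketch: the function name and the default output file name are assumptions made for illustration.

```shell
# Save the workload cluster's kubeconfig to a file and print the file path.
#   $1: cluster name, $2: namespace, $3: output file (optional)
# (save_workload_kubeconfig and the default file name are hypothetical.)
save_workload_kubeconfig() {
  cluster=$1
  namespace=$2
  out=${3:-./$cluster.kubeconfig}
  clusterctl get kubeconfig "$cluster" -n "$namespace" > "$out" || return 1
  printf '%s\n' "$out"
}

# Usage, with the names from this guide:
#   kubeconfig=$(save_workload_kubeconfig capd-rke2-test example)
#   kubectl --kubeconfig "$kubeconfig" get nodes
```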

Using ClusterClass for cluster creation

This provider supports using ClusterClass, a Cluster API feature that implements an extra level of abstraction on top of the existing Cluster API functionality. The ClusterClass object is used to define a collection of template resources (control plane and machine deployment) which are used to generate one or more clusters of the same flavor.

If you are interested in leveraging this functionality, you can refer to the examples here:

As with other sample templates, you will need to set a number of environment variables:

  • CLUSTER_NAME
  • CABPR_CP_REPLICAS
  • CABPR_WK_REPLICAS
  • KUBERNETES_VERSION
  • KIND_IP

for example:

export CLUSTER_NAME=capd-rke2-clusterclass
export CABPR_CP_REPLICAS=3
export CABPR_WK_REPLICAS=2
export KUBERNETES_VERSION=v1.25.11
export KIND_IP=192.168.20.20

Remember that, since we are using Kind, the value of KIND_IP must be an IP address in the range of the kind network. You can check the range Docker assigns to this network by inspecting it:

docker network inspect kind
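
To check a candidate KIND_IP against that range without doing the arithmetic by hand, something like the following can be used. It is a sketch for IPv4 only: the function names are hypothetical, and the subnet values used in the usage comment are placeholders, since the real range depends on your Docker setup.

```shell
# Convert a dotted-quad IPv4 address to its 32-bit integer value.
ipv4_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when IPv4 address $1 lies inside CIDR range $2 (e.g. 172.18.0.0/16).
ipv4_in_cidr() {
  prefix_len=${2#*/}
  mask=$(( (4294967295 << (32 - prefix_len)) & 4294967295 ))
  [ $(( $(ipv4_to_int "$1") & mask )) -eq $(( $(ipv4_to_int "${2%/*}") & mask )) ]
}

# List the subnets Docker assigned to the kind network (may include IPv6,
# which ipv4_in_cidr does not handle).
kind_subnets() {
  docker network inspect kind -f '{{range .IPAM.Config}}{{.Subnet}}{{"\n"}}{{end}}'
}

# Usage (the subnet is a placeholder, take the real one from kind_subnets):
#   ipv4_in_cidr "$KIND_IP" 192.168.20.0/24 && echo "KIND_IP is in range"
```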

The next step is to substitute the values in the YAML using the following commands:

cat clusterclass-quick-start.yaml | clusterctl generate yaml > clusterclass-example.yaml

At this moment, you can take some time to study the resulting YAML, then you can apply it to the management cluster:

kubectl apply -f clusterclass-example.yaml

This will create a new ClusterClass template that can be used to provision one or multiple workload clusters of the same flavor. To do so, you can follow the same procedure and substitute the values in the YAML for the cluster definition:

cat rke2-sample.yaml | clusterctl generate yaml > rke2-clusterclass-example.yaml

And then apply the resulting YAML file to create a cluster from the existing ClusterClass.

kubectl apply -f rke2-clusterclass-example.yaml

Known Issues

When using CAPD < v1.6.0 unmodified, Cluster creation is stuck after first node and API is not reachable

If you use docker as your infrastructure provider without any modification, cluster creation will stall after provisioning the first node, and the API will not be available using the LB address. This is caused by the load balancer configuration used in CAPD, which is not compatible with RKE2. Therefore, you need to use our fork of CAPD v1.3.3, which is selected through a specific clusterctl configuration.

Topics

This section contains more detailed information about the features that CAPRKE2 offers and how to use them.

Air-Gapped Cluster Deployment

Introduction

By default, this provider deploys RKE2 using the online installation method. This method needs access to Rancher servers and the Docker.io registry to download the scripts, RKE2 packages and container images necessary for installing RKE2.

Some users might prefer an Air-Gapped installation for reasons such as deployment in particularly secure environments, sporadic connectivity (for example, deployment to edge locations) or bandwidth preservation.

RKE2 supports Air-Gapped installation using:

  • 2 methods for node preparation: Tarball on the node, Container Image Registry

  • 2 methods for actual RKE2 installation after the node is prepared: manual deployment, and using install.sh from https://get.rke2.io.

Methods supported by CABPR (Cluster API Provider RKE2)

In choosing between the RKE2 Air-Gapped cluster creation modes above, CABPR has chosen the best tradeoff in terms of simplicity, usability and limitation of dependencies.

Node preparation

The method that is supported by CABPR is the Tarball on the node using custom images. The reasons behind this choice include:

  • No dependency on the environment's network infrastructure or image registry. The registry approach would not remove the need for a custom image anyway.

  • CAPI's philosophy is to accept custom-defined base images for infrastructure providers, which makes it easy to build the RKE2 pre-requisites (for a specific RKE2 version) into a custom image to be used for all deployments.

RKE2 deployment

The method that is supported by CABPR for RKE2 deployment is the install.sh approach, described here. This approach is used because it automates a number of tasks needed for RKE2 to be deployed, such as creating the file hierarchy, unpacking the tarball, and creating systemd service units.

Since these tasks might change in the future, we prefer to rely on the upstream script from RKE2, available in the latest valid version at: https://get.rke2.io .

Pre-requisites on base image

Considering the above tradeoffs, base images used for Air-Gapped deployments need to comply with some pre-requisites in order to work with CABPR. This section lists these pre-requisites:

  • Support and presence of cloud-init (ignition bootstrapping is also on the roadmap)

  • Presence of systemd (because RKE2's installation relies on systemd to start RKE2)

  • Presence of the folders /opt and /opt/rke2-artifacts with the following files inside these folders:

    • install.sh in /opt (this file has the content of the script available at https://get.rke2.io ). One way to create it at build time is by using curl -sfL https://get.rke2.io > /opt/install.sh using a linux user with write permissions to the /opt folder.

    • rke2-images.linux-amd64.tar.zst, rke2.linux-amd64.tar.gz and sha256sum-amd64.txt in the /opt/rke2-artifacts folder. These files can be downloaded for a specific version of RKE2 from its release page on GitHub (for instance, the release page for v1.23.16+rke2r1 in the rancher/rke2 repository). The files can be found under the Assets section of the page.

  • The pre-requisites above should be built into a machine image, for instance a container image for CAPD or an AMI for AWS EC2. Each infrastructure provider has its own way of defining machine images.
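
The file-related pre-requisites can be verified at image build time with a short check. The following is a sketch under assumptions: missing_airgap_files is a hypothetical helper, it only tests that the four files exist (it does not validate checksums, cloud-init or systemd), and the optional root argument is meant for inspecting an unpacked image.

```shell
# Report which files required for air-gapped mode are missing under the
# given root (defaults to /; pass another root to inspect an unpacked image).
# (missing_airgap_files is a hypothetical helper name.)
missing_airgap_files() {
  root=${1:-}
  for path in \
    /opt/install.sh \
    /opt/rke2-artifacts/rke2-images.linux-amd64.tar.zst \
    /opt/rke2-artifacts/rke2.linux-amd64.tar.gz \
    /opt/rke2-artifacts/sha256sum-amd64.txt
  do
    [ -f "$root$path" ] || printf '%s\n' "$path"
  done
}

# Usage inside the image build:
#   missing_airgap_files
```

An empty result means all four files are in place.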

Configuration of CABPR for Air-Gapped use

In order to deploy RKE2 Clusters in Air-Gapped mode using CABPR, you need to set the fields spec.agentConfig.airGapped for the RKE2ControlPlane object and spec.template.spec.agentConfig.airGapped for RKE2ConfigTemplate object to true.

You can check a reference implementation for CAPD here including configuration for CAPD custom image.
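
Once the objects are applied, the two fields named above can be read back to confirm the setting. This is a sketch only: the function names are hypothetical, and the kubectl resource names rke2controlplane and rke2configtemplate are assumed from the object kinds.

```shell
# Print the airGapped setting of an RKE2ControlPlane ("true" when enabled).
#   $1: object name, $2: namespace
controlplane_airgapped() {
  kubectl get rke2controlplane "$1" -n "$2" \
    -o jsonpath='{.spec.agentConfig.airGapped}'
}

# Print the airGapped setting of an RKE2ConfigTemplate.
#   $1: object name, $2: namespace
configtemplate_airgapped() {
  kubectl get rke2configtemplate "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.agentConfig.airGapped}'
}
```

Empty output would mean the field is not set on that object.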

Node Registration Methods

The provider supports multiple methods for registering a new node into the cluster.

Usage

The method to use is specified in the spec of the RKE2ControlPlane. If no method is supplied, the default method, internal-first, is used.

You cannot change the registration method after creation.

An example of using a different method:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: RKE2ControlPlane
metadata:
  name: test1-control-plane
  namespace: default
spec:
  agentConfig:
    version: v1.26.4+rke2r1
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerMachineTemplate
    name: controlplane
  nodeDrainTimeout: 2m
  replicas: 3
  serverConfig:
    cni: calico
  registrationMethod: "address"
  registrationAddress: "172.19.0.3"

Registration Methods

internal-first

For each CAPI Machine that is used for the control plane, we take the internal IP address from Machine.status.addresses if it exists. If a machine has no internal IP, we use an external address instead. The IP address found for each machine is added to RKE2ControlPlane.status.availableServerIPs.

The first IP address listed in RKE2ControlPlane.status.availableServerIPs is then used for the join.

internal-only-ips

For each CAPI Machine that is used for the control plane, we take the internal IP address from Machine.status.addresses if it exists and add it to RKE2ControlPlane.status.availableServerIPs.

The first IP address listed in RKE2ControlPlane.status.availableServerIPs is then used for the join.

external-only-ips

For each CAPI Machine that is used for the control plane, we take the external IP address from Machine.status.addresses if it exists and add it to RKE2ControlPlane.status.availableServerIPs.

The first IP address listed in RKE2ControlPlane.status.availableServerIPs is then used for the join.

address

For this method you must supply an address in the control plane spec (i.e. RKE2ControlPlane.spec.registrationAddress). This address is then used for the join.

With this method, it is expected that you have a load balancer / VIP solution in front of all the control plane machines, and that all join requests are routed through it.
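
The four methods can be compared side by side with a small model of the selection rules described above. This is an illustration of the documented behaviour, not the provider's implementation: the function names and the input format (one "INTERNAL_IP EXTERNAL_IP" pair per machine, with "-" for a missing address) are assumptions made for the sketch.

```shell
# Read one "INTERNAL_IP EXTERNAL_IP" pair per control plane machine on stdin
# ("-" marks a missing address) and print the candidate server IPs for the
# given registration method, in machine order.
candidate_server_ips() {
  method=$1
  while read -r internal external; do
    case $method in
      internal-first)
        if [ "$internal" != "-" ]; then ip=$internal; else ip=$external; fi ;;
      internal-only-ips) ip=$internal ;;
      external-only-ips) ip=$external ;;
      *) echo "unknown registration method: $method" >&2; return 2 ;;
    esac
    if [ -n "$ip" ] && [ "$ip" != "-" ]; then printf '%s\n' "$ip"; fi
  done
}

# Print the address a joining node would use.
#   $1: registration method, $2: registrationAddress (only for "address")
join_address() {
  if [ "$1" = "address" ]; then
    printf '%s\n' "$2"
  else
    candidate_server_ips "$1" | head -n 1
  fi
}

# Usage:
#   printf '%s\n' '10.0.0.2 203.0.113.6' | join_address internal-first
```

The model makes one difference visible: with internal-first, a machine that only has an external address still contributes a candidate, whereas internal-only-ips skips it.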

Developer Guide

This section describes the workflow for regular developer tasks, such as:

  • Development guide
  • Releasing a new version of CAPRKE2

Development

The following instructions are for development purposes.

  1. Clone the Cluster API Repo into the GOPATH

Why clone into the GOPATH? There have been historic issues with code generation tools when they are run outside the GOPATH.

  2. Fork the Cluster API Provider RKE2 repo
  3. Clone your new repo into the GOPATH (i.e. ~/go/src/github.com/yourname/cluster-api-provider-rke2)
  4. Ensure Tilt and kind are installed
  5. Create a tilt-settings.json file in the root of your forked/cloned cluster-api directory.
  6. Add the following contents to the file (replace "yourname" with your GitHub account name):
{
    "default_registry": "ghcr.io/yourname",
    "provider_repos": ["../../github.com/yourname/cluster-api-provider-rke2"],
    "enable_providers": ["docker", "rke2-bootstrap", "rke2-control-plane"],
    "kustomize_substitutions": {
        "EXP_MACHINE_POOL": "true",
        "EXP_CLUSTER_RESOURCE_SET": "true"
    },
    "extra_args": {
        "rke2-bootstrap": ["--v=4"],
        "rke2-control-plane": ["--v=4"],
        "core": ["--v=4"]
    },
    "debug": {
        "rke2-bootstrap": {
            "continue": true,
            "port": 30001
        },
        "rke2-control-plane": {
            "continue": true,
            "port": 30002
        }
    }
}

NOTE: Until the fix for this bug is merged in CAPI, you will have to make the changes locally in your clone of CAPI.

  7. Open another terminal (or pane) and go to the cluster-api directory.
  8. Run the following to create a configuration for kind:
cat > kind-cluster-with-extramounts.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: capi-test
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
EOF

NOTE: If you are using Docker Desktop v4.13 or above, you will encounter issues from here. Until a permanent solution is found, it is recommended that you use v4.12.

  9. Run the following command to create a local kind cluster:
kind create cluster --config kind-cluster-with-extramounts.yaml
  10. Now start tilt by running the following:
tilt up
  11. Press the space key to see the Tilt web UI and check that everything goes green.

CAPRKE2 Releases

Release Cadence

  • CAPRKE2 minor versions (v0.2.0 versus v0.1.0) are released every 1-2 months.
  • CAPRKE2 patch versions (v0.2.2 versus v0.2.1) are released as often as weekly or bi-weekly.

Release Process

  1. Clone the repository locally:
git clone git@github.com:rancher/cluster-api-provider-rke2.git
  2. Depending on whether you are cutting a minor or patch release, the process varies.

    • If you are cutting a new minor release:

      Create a new release branch (i.e. release-X) and push it to the upstream repository.

          # Note: `upstream` must be the remote pointing to `github.com:rancher/cluster-api-provider-rke2`.
          git checkout -b release-0.2
          git push -u upstream release-0.2
          # Export the tag of the minor release to be cut, e.g.:
          export RELEASE_TAG=v0.2.0
      
    • If you are cutting a patch release from an existing release branch:

      Use the existing release branch.

          git checkout upstream/release-0.2
          # Export the tag of the patch release to be cut, e.g.:
          export RELEASE_TAG=v0.2.1
      
  3. Create a signed/annotated tag and push it:

# Create tags locally
git tag -s -a ${RELEASE_TAG} -m ${RELEASE_TAG}

# Push tags
git push upstream ${RELEASE_TAG}

This will trigger a release GitHub action that creates a release with RKE2 provider components.

  4. Mark release as ready.

Published releases are initially marked as draft. If the published version is supposed to be the latest, mark it so on the release page while editing the release. Note that we use semantic versioning to determine which version is the latest.

  5. Perform mandatory post-release activities, which will ensure contract metadata.yaml file is up-to-date in case of a future minor/major version change.
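
The branch names used in the steps above follow directly from the tag, so the mapping can be scripted to avoid typos. This is a minimal sketch: release_branch_for_tag is a hypothetical helper, and its tag check is deliberately loose (it only looks for a leading v and three dot-separated parts that start with a digit).

```shell
# Derive the release branch (release-MAJOR.MINOR) for a tag such as v0.2.1.
# Fails for tags that do not look like vMAJOR.MINOR.PATCH[-PRERELEASE].
# (release_branch_for_tag is a hypothetical helper name.)
release_branch_for_tag() {
  case $1 in
    v[0-9]*.[0-9]*.[0-9]*) ;;
    *) echo "invalid release tag: $1" >&2; return 1 ;;
  esac
  version=${1#v}
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  printf 'release-%s.%s\n' "$major" "$minor"
}

# Usage:
#   export RELEASE_TAG=v0.2.1
#   git checkout "upstream/$(release_branch_for_tag "$RELEASE_TAG")"
```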

Prepare main branch for development of the new release

The goal of this task is to bump the versions on the main branch so that the upcoming release version is used for e.g. local development and e2e tests. We also modify tests so that they are testing the previous release.

This comes down to changing occurrences of the old version to the new version, e.g. v1.5 to v1.6, and preparing metadata.yaml for a future release version:

1. Update E2E tests

Existing E2E tests that point to a specific version need to be updated to use the new version instead.

  1. Add a future release to the list of providers in test/e2e/config/e2e_conf.yaml following the format used for previous versions. This will be used as a fake provider version for testing the current state of the repository instead of the actual GitHub release.
  2. Update bootstrap/control plane versions* inside function initUpgradableBootstrapCluster in test/e2e/e2e_suite_test.go.
  3. Edit upgrade test* in test/e2e/e2e_upgrade_test.go.

*To keep the upgrade test concise and clean, and to avoid a growing list of versions, the starting version must always be the N-1 minor (e.g. if releasing version v4.x, the starting version is v3.x and the upgrade is as follows: v3.x -> v4.x).

2. Add future version to metadata.yaml. For example, if v0.5 was just released, we add v0.6 to the list of releaseSeries:

apiVersion: clusterctl.cluster.x-k8s.io/v1alpha3
kind: Metadata
releaseSeries:
  - major: 0
    minor: 1
    contract: v1beta1
  - major: 0
    minor: 2
    contract: v1beta1
  ...
  ...
  ...
  - major: x
    minor: x
    contract: x
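
A new entry can be produced in the same layout with a one-line helper. This is a sketch: release_series_entry is a hypothetical name, the contract defaults to v1beta1 as in the entries shown above, and appending the output to the file is only correct when releaseSeries is the last key in metadata.yaml.

```shell
# Print a releaseSeries entry in the format used by metadata.yaml.
#   $1: major, $2: minor, $3: contract (defaults to v1beta1)
# (release_series_entry is a hypothetical helper name.)
release_series_entry() {
  printf '  - major: %s\n    minor: %s\n    contract: %s\n' "$1" "$2" "${3:-v1beta1}"
}

# Example: after releasing v0.5, append an entry for v0.6.
#   release_series_entry 0 6 >> metadata.yaml
```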

Versioning

Cluster API Provider RKE2 follows the semantic versioning specification.

Example versions:

  • Pre-release: v0.2.0-alpha.1
  • Minor release: v0.2.0
  • Patch release: v0.2.1
  • Major release: v2.0.0
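
The naming pattern in these examples can be expressed as a small classifier. This is an illustrative sketch only: classify_version is a hypothetical helper, it assumes a well-formed vMAJOR.MINOR.PATCH string with an optional pre-release suffix, and it ignores build metadata.

```shell
# Classify a version string following the example versions listed above.
# (classify_version is a hypothetical helper name.)
classify_version() {
  version=${1#v}
  case $version in
    *-*) echo "pre-release"; return 0 ;;
  esac
  rest=${version#*.}
  minor=${rest%%.*}
  patch=${rest#*.}
  if [ "$patch" != "0" ]; then
    echo "patch"
  elif [ "$minor" != "0" ]; then
    echo "minor"
  else
    echo "major"
  fi
}

# Usage:
#   classify_version "$RELEASE_TAG"
```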

With the v0 release of our codebase, we provide the following guarantees:

  • A (minor) release CAN include:

    • Introduction of new API versions, or new Kinds.
    • Compatible API changes like field additions, deprecation notices, etc.
    • Breaking API changes for deprecated APIs, fields, or code.
    • Features, promotion or removal of feature gates.
    • And more!
  • A (patch) release SHOULD only include a backwards compatible set of bug fixes.

Backporting

A backport MUST NOT introduce breaking API or behavioral changes.

It is generally not accepted to submit pull requests directly against release branches (release-X). However, backports of fixes or changes that have already been merged into the main branch may be accepted to all supported branches:

  • Critical bug fixes, security issue fixes, or fixes for bugs without easy workarounds.
  • Dependency bumps for CVE (usually limited to CVE resolution; backports of non-CVE related version bumps are considered exceptions to be evaluated case by case)
  • Cert-manager version bumps (to avoid having releases with cert-manager versions that are out of support, when possible)
  • Changes required to support new Kubernetes versions, when possible. See supported Kubernetes versions for more details.
  • Changes to use the latest Go patch version to build controller images.
  • Improvements to existing docs (the latest supported branch hosts the current version of the book)

Note: We generally do not accept backports to Cluster API Provider RKE2 release branches that are out of support.

Branches

Cluster API Provider RKE2 has two types of branches: the main branch and release-X branches.

The main branch is where development happens. All the latest and greatest code, including breaking changes, happens on main.

The release-X branches contain stable, backwards compatible code. On every major or minor release, a new branch is created. It is from these branches that minor and patch releases are tagged. In some cases, it may be necessary to open PRs for bugfixes directly against stable branches, but this should generally not be the case.

Support and guarantees

Cluster API Provider RKE2 maintains the most recent release/releases for all supported APIs. Support in this section refers to the ability to backport and release patch versions; the backport policy is defined above.

  • The API version is determined from the GroupVersion defined in the top-level bootstrap/api/ and controlplane/api/ packages.
  • For the current stable API version (v1beta1) we support the two most recent minor releases; older minor releases are immediately unsupported when a new major/minor release is available.

Reference

This section contains reference documentation for CAPRKE2 API types.