E2E Test Failure Investigation Guide

This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.

Understanding E2E Tests

Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:

  • Cluster Class Import Tests: Validate the cluster class import functionality
  • Import Tests: Validate the general import functionality
  • Import RKE2 Tests: Validate import functionality specific to RKE2 clusters

Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.

Accessing Test Artifacts

When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using crust-gather, a tool that captures the state of Kubernetes clusters.

Finding the Artifact URL

  1. Navigate to the failed GitHub Actions workflow run
  2. Scroll down to the "Artifacts" section
  3. Find the artifact corresponding to the failed test (e.g., artifacts-cluster-class-import-stable)
  4. Copy the artifact URL (right-click on the artifact link and copy the URL)
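
Alternatively, the GitHub CLI can list a run's artifacts directly. A minimal sketch, assuming gh is installed and authenticated, with <run-id> as a placeholder:

# List artifact names and IDs for the failed workflow run
gh api repos/rancher/cluster-api-addon-provider-fleet/actions/runs/<run-id>/artifacts \
  --jq '.artifacts[] | {name, id}'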

Using the serve-artifact.sh Script

The serve-artifact.sh script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment.

Prerequisites

  • A GitHub token with repo read permissions (set as the GITHUB_TOKEN environment variable)
  • kubectl and krew installed
  • crust-gather installed (the setup can also be replicated with nix, if available); a setup sketch follows this list
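
A minimal setup sketch, assuming crust-gather is published in the krew index (see the crust-gather README for authoritative installation steps):

# Token used to download artifacts from GitHub
export GITHUB_TOKEN=<your-token>

# Install crust-gather as a kubectl plugin via krew (assumption: available in the krew index)
kubectl krew install crust-gather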

Serving Artifacts

Fetch the serve-artifact.sh script from the crust-gather GitHub repository:

curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
# Using the full artifact URL
./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095

# OR using individual components
./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095

This will:

  1. Download the artifact from GitHub
  2. Extract its contents
  3. Start a local server that provides access to the Kubernetes contexts from the test environment
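
Once the server is running, you can verify that the test environment's contexts are reachable. A sketch, assuming the served kubeconfig is active in your shell:

# List the contexts exposed by the artifact server
kubectl config get-contexts

# Inspect the main e2e cluster (the e2e tests use the dev context)
kubectl --context dev get nodes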

Investigating Failures

Once the artifact server is running, you can use various tools to investigate the failure:

Using k9s

k9s provides a terminal UI to interact with Kubernetes clusters:

  1. Open a new terminal
  2. Run k9s
  3. Press : to open the command prompt
  4. Type ctx and press Enter
  5. Select the context from the test environment (there may be multiple contexts; use dev for the e2e tests)
  6. Navigate through resources to identify issues:
    • Check pods for crash loops or errors
    • Examine events for warnings or errors
    • Review logs from relevant components (kubectl equivalents are sketched after this list)
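
If you prefer plain kubectl over k9s, roughly equivalent checks look like this (a sketch; adjust the context name if your environment differs):

# Pods that are not Running or Completed, across all namespaces
kubectl --context dev get pods -A | grep -Ev 'Running|Completed'

# Warning events, most recent last
kubectl --context dev get events -A --field-selector type=Warning --sort-by=.lastTimestamp

# Logs from a suspect pod
kubectl --context dev logs -n <namespace> <pod-name>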

Common Investigation Paths

  1. Check Fleet Resources:

    • FleetAddonConfig resources
    • Fleet Cluster resource
    • CAPI ClusterGroup resources
    • Ensure all relevant labels are present on the above resources.
    • Check that the created Fleet namespace cluster-<ns>-<cluster name>-<random-prefix> is consistent with the namespace in the Cluster's .status.namespace field.
    • Check for ClusterRegistrationToken in the cluster namespace.
    • Check for BundleNamespaceMapping in the ClusterClass namespace if a cluster references a ClusterClass in a different namespace
  2. Check CAPI Resources:

    • Cluster resource
    • Check that the ControlPlaneInitialized condition is true
    • ClusterClass resources: verify they are present and that status.observedGeneration matches metadata.generation
    • Continue on a per-cluster basis
  3. Check Controller Logs:

    • Look for error messages or warnings in the controller logs in the caapf-system namespace.
    • Check for reconciliation failures in the manager container. For upstream installations, check the helm-manager container logs as well.
  4. Check Kubernetes Events:

    • Events often contain information about failures. CAAPF also publishes events for each resource it applies from a CAPI Cluster, including the Fleet Cluster in the cluster namespace, and the ClusterGroup and BundleNamespaceMapping in the ClusterClass namespace. These events are created by the caapf-controller component. A kubectl sketch of these checks follows this list.
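
The checks above can be approximated with kubectl. A hedged sketch, assuming the Fleet and CAPI CRDs are present in the captured cluster; <cluster-ns>, <cluster-name>, and <caapf-deployment> are placeholders:

# 1. Fleet resources and their labels
kubectl --context dev get fleetaddonconfigs
kubectl --context dev get clusters.fleet.cattle.io -n <cluster-ns> --show-labels
kubectl --context dev get clustergroups.fleet.cattle.io -n <cluster-ns> --show-labels
kubectl --context dev get clusterregistrationtokens.fleet.cattle.io -n <cluster-ns>
kubectl --context dev get bundlenamespacemappings.fleet.cattle.io -A

# 2. CAPI resources
kubectl --context dev get clusters.cluster.x-k8s.io -n <cluster-ns> <cluster-name> -o yaml
kubectl --context dev get clusterclasses.cluster.x-k8s.io -A

# 3. Controller logs from the manager container
kubectl --context dev logs -n caapf-system deploy/<caapf-deployment> -c manager

# 4. Events in the cluster namespace, most recent last
kubectl --context dev get events -n <cluster-ns> --sort-by=.lastTimestamp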

Common Failure Patterns

Import Failures

  • Symptom: Fleet Cluster not created or in error state
  • Investigation: Check the controller logs in the cattle-fleet-system namespace for errors during import processing. Check the CAAPF logs for errors about a missing cluster definition.
  • Common causes:
    • The Fleet cluster import process is serial, so a hot loop in one cluster's import blocks further cluster imports (a Fleet issue).
    • The CAPI Cluster is not ready and lacks the ControlPlaneInitialized condition (a CAPI issue, or the cluster needs more time to become ready); a quick check is sketched after this list.
    • Otherwise, the failure is likely a CAAPF issue.
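
A quick condition check, as a sketch (<cluster-ns> and <cluster-name> are placeholders):

# Print the ControlPlaneInitialized condition of the CAPI Cluster
kubectl --context dev get clusters.cluster.x-k8s.io -n <cluster-ns> <cluster-name> \
  -o jsonpath='{.status.conditions[?(@.type=="ControlPlaneInitialized")]}'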

Cluster Class Failures

  • Symptom: ClusterClass is not properly imported or is not evaluated as a target.
  • Investigation: Check for the BundleNamespaceMapping in the ClusterClass namespace named after the Cluster resource (see the sketch after this list). Check the controller logs in the caapf-system namespace for errors during ClusterClass processing. Check the ClusterGroup resource in the Cluster namespace.
  • Common causes:
    • A Cluster references a ClusterClass in a different namespace.
    • Missing resources typically indicate a CAAPF-related error.
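
To verify the mapping exists, a sketch with placeholder names:

# BundleNamespaceMapping named after the Cluster, in the ClusterClass namespace
kubectl --context dev get bundlenamespacemappings.fleet.cattle.io \
  -n <clusterclass-ns> <cluster-name> -o yaml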

Reference