E2E Test Failure Investigation Guide
This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.
Understanding E2E Tests
Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:
- Cluster Class Import Tests: Validate the cluster class import functionality
- Import Tests: Validate the general import functionality
- Import RKE2 Tests: Validate import functionality specific to RKE2 clusters
Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.
Accessing Test Artifacts
When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using crust-gather, a tool that captures the state of Kubernetes clusters.
Finding the Artifact URL
- Navigate to the failed GitHub Actions workflow run
- Scroll down to the "Artifacts" section
- Find the artifact corresponding to the failed test (e.g.,
artifacts-cluster-class-import-stable
) - Copy the artifact URL (right-click on the artifact link and copy the URL)
Using the serve-artifact.sh Script
The serve-artifact.sh
script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment.
Prerequisites
- A GitHub token with
repo
read permissions (set asGITHUB_TOKEN
environment variable) kubectl
installed,krew
installed.- crust-gather installed. Can be replicated with nix, if available.
Serving Artifacts
Fetch the serve-artifact.sh
script from the crust-gather GitHub repository:
curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
# Using the full artifact URL
./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095
# OR using individual components
./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095
This will:
- Download the artifact from GitHub
- Extract its contents
- Start a local server that provides access to the Kubernetes contexts from the test environment
Investigating Failures
Once the artifact server is running, you can use various tools to investigate the failure:
Using k9s
k9s provides a terminal UI to interact with Kubernetes clusters:
- Open a new terminal
- Run
k9s
- Press
:
to open the command prompt - Type
ctx
and press Enter - Select the context from the test environment (there may be multiple contexts).
dev
for the e2e tests. - Navigate through resources to identify issues:
- Check pods for crash loops or errors
- Examine events for warnings or errors
- Review logs from relevant components
Common Investigation Paths
-
Check Fleet Resources:
FleetAddonConfig
resources- Fleet
Cluster
resource - CAPI
ClusterGroup
resources - Ensure all relevant labels are present on above.
- Check for created
Fleet
namespacecluster-<ns>-<cluster name>-<random-prefix>
that it is consitent with the NS in the Cluster.status.namespace
. - Check for
ClusterRegistrationToken
in the cluster namespace. - Check for
BundleNamespaceMapping
in theClusterClass
namespace if a cluster references aClusterClass
in a different namespace
-
Check CAPI Resources:
- Cluster resource
- Check for
ControlPlaneInitialized
condition to betrue
- ClusterClass resources, these are present and have
status.observedGeneration
consistent with themetadata.generation
- Continue on a per-cluster basis
-
Check Controller Logs:
- Look for error messages or warnings in the controller logs in the
caapf-system
namespace. - Check for reconciliation failures in
manager
container. In case of upstream installation, check forhelm-manager
container logs.
- Look for error messages or warnings in the controller logs in the
-
Check Kubernetes Events:
- Events often contain information about failures, otherwise
CAAPF
publishes events for each resource apply from CAPICluster
, including FleetCluster
in the cluster namespace,ClusterGroup
andBundleNamespaceMapping
in theClusterClass
namespace. These events are created bycaapf-controller
component.
- Events often contain information about failures, otherwise
Common Failure Patterns
Import Failures
- Symptom: Fleet
Cluster
not created or in error state - Investigation: Check the controller logs in the
cattle-fleet-system
namespace for errors during import processing. Check for errors in theCAAPF
logs for missing cluster definition. - Common causes:
- Fleet cluster import process is serial, and hot loop in other cluster import blocks further cluster imports. Fleet issue.
- CAPI
Cluster
is not ready and does not haveControlPlaneInitialized
condition. Issue with CAPI or requires more time to be ready. - Otherwise
CAAPF
issue.
Cluster Class Failures
- Symptom: ClusterClass not properly imported or is not evaluated as a target.
- Investigation: Check for the
BundleNamespaceMapping
in theClusterClass
namespace named after theCluster
resource. Check the controller logs in thecaapf-system
namespace for errors during ClusterClass processing. CheckClusterGroup
resource in theCluster
namespace. - Common causes:
- Check for
Cluster
referencingClusterClass
in a different namespace. - In the event of missing resources,
CAAPF
related error.
- Check for