# E2E Test Failure Investigation Guide
This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.
## Understanding E2E Tests
Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:
- **Cluster Class Import Tests**: Validate the cluster class import functionality
- **Import Tests**: Validate the general import functionality
- **Import RKE2 Tests**: Validate import functionality specific to RKE2 clusters
Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.
## Accessing Test Artifacts
When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using `crust-gather`, a tool that captures the state of Kubernetes clusters.
### Finding the Artifact URL

1. Navigate to the failed GitHub Actions workflow run.
2. Scroll down to the "Artifacts" section.
3. Find the artifact corresponding to the failed test (e.g., `artifacts-cluster-class-import-stable`).
4. Copy the artifact URL (right-click on the artifact link and copy the URL).
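Artifacts can also be fetched from the command line. This is a sketch assuming an installed and authenticated GitHub CLI (`gh`); the run ID and artifact name in the example are placeholders:

```shell
# Hypothetical helper: download a CI artifact with the GitHub CLI.
download_artifact() {
  # $1 = workflow run ID, $2 = artifact name
  gh run download "$1" -n "$2" -R rancher/cluster-api-addon-provider-fleet
}

# Example (placeholder values):
# download_artifact 15737662078 artifacts-cluster-class-import-stable
```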
## Using the serve-artifact.sh Script

The `serve-artifact.sh` script downloads and serves the test artifacts locally, providing access to the Kubernetes contexts from the test environment.
### Prerequisites

- A GitHub token with `repo` read permissions (set as the `GITHUB_TOKEN` environment variable)
- `kubectl` installed
- `krew` installed
- `crust-gather` installed (the installation can be replicated with nix, if available)
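A quick sanity check for the prerequisites above can look like the following sketch; note that `krew` is invoked as a kubectl plugin, and the exact `crust-gather` binary name may differ depending on how it was installed:

```shell
# Report any missing prerequisite; prints nothing if everything is in place.
check_prereqs() {
  for tool in kubectl crust-gather; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
  # krew is installed as a kubectl plugin, not a standalone binary
  kubectl krew version >/dev/null 2>&1 || echo "missing: krew"
  [ -n "$GITHUB_TOKEN" ] || echo "missing: GITHUB_TOKEN"
}
```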
### Serving Artifacts

Fetch the `serve-artifact.sh` script from the crust-gather GitHub repository:

```shell
curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
```

Then serve the artifact:

```shell
# Using the full artifact URL
./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095

# OR using individual components
./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095
```
This will:
- Download the artifact from GitHub
- Extract its contents
- Start a local server that provides access to the Kubernetes contexts from the test environment
## Investigating Failures
Once the artifact server is running, you can use various tools to investigate the failure:
### Using k9s

`k9s` provides a terminal UI to interact with Kubernetes clusters:
1. Open a new terminal
2. Run `k9s`
3. Press `:` to open the command prompt
4. Type `ctx` and press Enter
5. Select the context from the test environment (there may be multiple contexts; `dev` is used for the e2e tests)
6. Navigate through resources to identify issues:
   - Check pods for crash loops or errors
   - Examine events for warnings or errors
   - Review logs from relevant components
### Common Investigation Paths
1. **Check Fleet Resources**:
   - `FleetAddonConfig` resources
   - Fleet `Cluster` resource
   - CAPI `ClusterGroup` resources
   - Ensure all relevant labels are present on the above.
   - Check that the created Fleet namespace `cluster-<ns>-<cluster name>-<random-prefix>` is consistent with the namespace in `Cluster.status.namespace`.
   - Check for a `ClusterRegistrationToken` in the cluster namespace.
   - Check for a `BundleNamespaceMapping` in the `ClusterClass` namespace if a cluster references a `ClusterClass` in a different namespace.
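The Fleet checks above can be sketched as shell helpers. The namespace pattern follows the `cluster-<ns>-<cluster name>-<random-prefix>` convention described above; the resource plural names (`clusters.fleet.cattle.io`, `clustergroups.fleet.cattle.io`) are assumptions — verify them with `kubectl api-resources`:

```shell
# Expected Fleet namespace prefix for a CAPI cluster.
fleet_ns_prefix() {
  # $1 = cluster namespace, $2 = cluster name
  echo "cluster-$1-$2-"
}

# Inspect the Fleet-side resources for a CAPI cluster.
check_fleet_resources() {
  # $1 = cluster namespace, $2 = cluster name
  kubectl get fleetaddonconfig                          # global CAAPF config
  kubectl get clusters.fleet.cattle.io -n "$1" "$2"     # Fleet Cluster mirror
  kubectl get clustergroups.fleet.cattle.io -n "$1"     # ClusterGroup resources
  kubectl get ns | grep "$(fleet_ns_prefix "$1" "$2")"  # registration namespace
}
```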
2. **Check CAPI Resources**:
   - `Cluster` resource: check that the `ControlPlaneInitialized` condition is `true`
   - `ClusterClass` resources: verify they are present and have `status.observedGeneration` consistent with `metadata.generation`
   - Continue on a per-cluster basis
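The CAPI checks can be expressed with `jsonpath` queries; a sketch assuming standard CAPI resource names:

```shell
# Status of the ControlPlaneInitialized condition on a CAPI Cluster ("True"/"False").
control_plane_initialized() {
  # $1 = cluster name, $2 = namespace
  kubectl get cluster.cluster.x-k8s.io "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="ControlPlaneInitialized")].status}'
}

# Print metadata.generation and status.observedGeneration for a ClusterClass;
# the two numbers should match when the resource has been fully reconciled.
clusterclass_in_sync() {
  # $1 = ClusterClass name, $2 = namespace
  kubectl get clusterclass "$1" -n "$2" \
    -o jsonpath='{.metadata.generation} {.status.observedGeneration}'
}
```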
3. **Check Controller Logs**:
   - Look for error messages or warnings in the controller logs in the `caapf-system` namespace.
   - Check for reconciliation failures in the `manager` container. In case of an upstream installation, also check the `helm-manager` container logs.
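The controller logs can be pulled with `kubectl logs`; the deployment name is not fixed here, so look it up first with `kubectl get deploy -n caapf-system`:

```shell
# Fetch CAAPF controller logs from the caapf-system namespace.
caapf_logs() {
  # $1 = CAAPF deployment name (check with: kubectl get deploy -n caapf-system)
  # $2 = container: "manager", or "helm-manager" for upstream installations
  kubectl logs -n caapf-system "deploy/$1" -c "$2" --tail=200
}
```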
4. **Check Kubernetes Events**:
   - Events often contain information about failures. In addition, `CAAPF` publishes events for each resource it applies from the CAPI `Cluster`, including the Fleet `Cluster` in the cluster namespace, and the `ClusterGroup` and `BundleNamespaceMapping` in the `ClusterClass` namespace. These events are created by the `caapf-controller` component.
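To scan those events without k9s, a minimal sketch:

```shell
# Recent events in a namespace, oldest first; $1 = namespace.
recent_events() {
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}

# Example: check both the cluster namespace and the ClusterClass namespace
# recent_events my-cluster-ns
# recent_events my-clusterclass-ns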
## Common Failure Patterns
### Import Failures
- **Symptom**: Fleet `Cluster` not created or in an error state
- **Investigation**: Check the controller logs in the `cattle-fleet-system` namespace for errors during import processing. Check the `CAAPF` logs for errors about a missing cluster definition.
- **Common causes**:
  - The Fleet cluster import process is serial, so a hot loop in another cluster's import blocks further cluster imports. This is a Fleet issue.
  - The CAPI `Cluster` is not ready and does not have the `ControlPlaneInitialized` condition. This is an issue with CAPI, or the cluster requires more time to become ready.
  - Otherwise, a `CAAPF` issue.
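For import failures it usually helps to read both controllers' logs side by side. A sketch, assuming the Fleet controller runs as the `fleet-controller` deployment (verify with `kubectl get deploy -n cattle-fleet-system`):

```shell
# Grep both controllers' logs for mentions of a cluster.
import_errors() {
  # $1 = CAAPF deployment name, $2 = cluster name
  kubectl logs -n cattle-fleet-system deploy/fleet-controller --tail=500 | grep -i "$2"
  kubectl logs -n caapf-system "deploy/$1" -c manager --tail=500 | grep -i "$2"
}
```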
### Cluster Class Failures
- **Symptom**: `ClusterClass` not properly imported or not evaluated as a target.
- **Investigation**: Check for the `BundleNamespaceMapping` in the `ClusterClass` namespace named after the `Cluster` resource. Check the controller logs in the `caapf-system` namespace for errors during `ClusterClass` processing. Check the `ClusterGroup` resource in the `Cluster` namespace.
- **Common causes**:
  - A `Cluster` references a `ClusterClass` in a different namespace.
  - In the event of missing resources, a `CAAPF`-related error.
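The ClusterClass checks above can be scripted similarly; the Fleet plural resource names (`bundlenamespacemappings.fleet.cattle.io`, `clustergroups.fleet.cattle.io`) are assumptions — confirm them with `kubectl api-resources`:

```shell
# Verify cross-namespace ClusterClass wiring for a cluster.
check_clusterclass_wiring() {
  # $1 = Cluster name, $2 = Cluster namespace, $3 = ClusterClass namespace
  # BundleNamespaceMapping named after the Cluster, in the ClusterClass namespace:
  kubectl get bundlenamespacemappings.fleet.cattle.io -n "$3" "$1"
  # ClusterGroup in the Cluster namespace:
  kubectl get clustergroups.fleet.cattle.io -n "$2"
}
```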