SUSE Rancher Prime Issues

Guest cluster log collection

You can collect guest cluster logs and configuration files. Perform the following steps on each guest cluster node:

Log in to the node.
Download the Rancher v2.x Linux log collector script and generate a log bundle using the following commands:
```
curl -OLs https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh
sudo bash rancher2_logs_collector.sh
```
The output of the script indicates the location of the generated tarball.

For more information, see The Rancher v2.x Linux log collector script.

Importing of SUSE Virtualization clusters into Rancher

After the cluster-registration-url setting is configured on SUSE Virtualization, a deployment named cattle-system/cattle-cluster-agent is created to import the SUSE Virtualization cluster into Rancher.

Import pending due to `unable to read CA file` error

The following error messages in the cattle-cluster-agent-* pod logs indicate that the SUSE Virtualization cluster cannot be imported into Rancher.

2025-02-13T17:25:22.520593546Z time="2025-02-13T17:25:22Z" level=info msg="Rancher agent version v2.10.2 is starting"
2025-02-13T17:25:22.529886868Z time="2025-02-13T17:25:22Z" level=error msg="unable to read CA file from /etc/kubernetes/ssl/certs/serverca: open /etc/kubernetes/ssl/certs/serverca: no such file or directory"
2025-02-13T17:25:22.529924542Z time="2025-02-13T17:25:22Z" level=error msg="Strict CA verification is enabled but encountered error finding root CA"

The root cause is ineffective configuration of Rancher’s agent-tls-mode setting, which controls how Rancher’s agents (cluster-agent, fleet-agent, and system-agent) validate Rancher’s certificate when establishing a connection. The default value of this setting depends on the Rancher version and installation type.

Type	Versions	Default Value
New installation	v2.8	`system-store`
New installation	v2.9 and later	`strict`
Upgrade	v2.8 to v2.9	`system-store`

You can configure this setting to match your requirements by performing the following steps:

Log in to the Rancher UI.
Go to Global Settings → Settings.
Select agent-tls-mode, and then select ⋮ → Edit Setting to access the configuration options.
Select one of the following values:
- Strict: Rancher’s agents only trust certificates generated by the Certificate Authority (CA) specified in the cacerts setting. This is the recommended default TLS setting.
  
  The Strict option enables a higher level of security by requiring Rancher to have access to the CA that generated the certificate visible to the agents. In the case of certain certificate configurations (notably, external certificates), this is not automatic, and extra configuration is required. For more information about scenarios that require extra configuration, see Choose your SSL Configuration in the Rancher documentation.
- System Store: Rancher’s agents trust any certificate generated by a public CA specified in the operating system’s trust store. Use this setting if your setup uses an external trust authority and you don’t have ownership over the Certificate Authority.
  
  Using the System Store setting implies that the agent trusts all external authorities found in the operating system’s trust store including those outside of the user’s control.
Click Save.

Related issues:

SUSE Virtualization: #7105 and #7284
Rancher: #45628 (See this comment.)

Guest cluster load balancer IP is not reachable

Issue description

The load balancer service successfully obtains an IP address from the DHCP server or IP pool but remains inaccessible.

To check if the issue exists in your environment, perform the following steps:

Create a new guest cluster with the following settings:
- Container Network: "Calico"
- Cloud Provider: "Harvester"

Deploy NGINX on the new guest cluster.

kubectl apply -f https://k8s.io/examples/application/deployment.yaml

Create a load balancer that uses NGINX.

Root cause

In the following example, a guest cluster node uses the IP address 10.115.1.46 and a new load balancer is assigned the IP address 10.115.6.200. When this load balancer’s IP address is later added to a new interface, such as vip-fd8c28ce (attached to @enp1s0), the Calico controller takes over the load balancer IP address. This conflict causes the load balancer IP address to become unreachable from outside the cluster.

To verify the cause, run the following command on the affected guest cluster node:

ip -d link show dev vxlan.calico
44: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535
info: Using default fan map value (33)
    vxlan id 4096 local 10.115.6.200 dev vip-8a928fa0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536

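In this output, the local address 10.115.6.200 is the load balancer IP address assigned to the vip-* interface, not the node IP address 10.115.1.46.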
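The same check can be scripted. The following is a minimal sketch and not part of the official procedure: the helper names `vxlan_source` and `uses_vip` are hypothetical, and the parsing assumes the `ip -d link show` output format shown above, where the `local` address and the `dev` name appear on the line that starts with `vxlan id`.

```shell
#!/bin/sh
# vxlan_source (hypothetical helper): reads `ip -d link show dev vxlan.calico`
# output on stdin and prints "<local address> <underlying device>".
vxlan_source() {
  awk '$1 == "vxlan" {
         for (i = 2; i < NF; i++) {
           if ($i == "local") addr = $(i + 1)
           if ($i == "dev")   dev  = $(i + 1)
         }
       }
       END { if (addr != "") print addr, dev }'
}

# uses_vip (hypothetical helper): succeeds when the underlying device name
# starts with "vip", which is the symptom described in this section.
uses_vip() {
  case "$(vxlan_source | awk '{ print $2 }')" in
    vip*) return 0 ;;
    *)    return 1 ;;
  esac
}
```

On a guest cluster node, `ip -d link show dev vxlan.calico | vxlan_source` would print `10.115.6.200 vip-8a928fa0` for the sample output above. Piping the same output into `uses_vip` succeeds only while the VXLAN is bound to a vip-* interface.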

Affected Versions

The IP auto-detection feature is available in Calico v3.22 and other versions, with first-found as the default value.

SUSE® Rancher Prime: RKE2 v1.29 uses Calico v3.29.2, while v1.35 uses Calico v3.31.2.

Consequently, most RKE2 clusters using Calico as the default CNI alongside the Harvester Cloud Provider to deliver load balancer services are susceptible to this issue.

Workaround

Newly created clusters

When creating a new cluster in Rancher, select Add-on: Calico to open the YAML configuration window. Add nodeAddressAutodetectionV4 and skipInterface: vip.* to the spec.installation.calicoNetwork field.

installation:
  backend: VXLAN
  calicoNetwork:
    bgp: Disabled
    nodeAddressAutodetectionV4:  # Configures IPv4 address auto-detection filtering.
      skipInterface: vip.*  # Instructs the Calico controller to skip any interfaces matching the vip.* naming pattern.

These additional lines ensure that the Calico controller does not inadvertently intercept the assigned load balancer IP addresses.

Existing clusters

Run the command kubectl edit installation.
Go to the spec.calicoNetwork.nodeAddressAutodetectionV4 block and apply the following changes:
- Remove the firstFound: true entry if it is present.
- Add the skipInterface: vip.* parameter.
Save the changes.
Monitor the cluster for approximately two minutes while the calico-system/calico-node DaemonSet undergoes a rolling update.

The newly initialized pods automatically use the node IP for the VXLAN.

Check if the vxlan.calico interface uses the node IP (such as 10.115.1.46) instead of the VIP.

ip -d link show dev vxlan.calico

45: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535
info: Using default fan map value (33)
    vxlan id 4096 local 10.115.1.46 dev enp1s0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536

If the vxlan.calico interface still uses the VIP, check the tigera-operator pod logs for the key phrase failed calling webhook.

kubectl -n tigera-operator logs tigera-operator-8566d6db5c-wfjkt
...
{"level":"error","ts":"2025-12-18T09:06:37Z","msg":"Reconciler error","controller":"tigera-installation-controller","object":{"name":"periodic-5m0s-reconcile-event"},"namespace":"","name":"periodic-5m0s-reconcile-event","reconcileID":"bae9d2da-a4bf-4d8b-89b8-c8a23a96f351","error":"Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"...}

If this error occurs, add the following environment variable directly to the calico-node container in the calico-system/calico-node DaemonSet.

            - name: IP_AUTODETECTION_METHOD
              value: skip-interface=vip.*

Check the vxlan.calico interface again after a few minutes.

Once the interface stops using the VIP, the VIP will become reachable.

Related issues

#8072 and #9767

Guest cluster provisioning is stuck at `waiting for cluster agent to connect`

Affected versions

Rancher: v2.14.0
RKE2: v1.34.x, v1.35.1, v1.35.2, v1.35.3
SUSE Virtualization: v1.8.0

This issue frequently affects environments running Rancher v2.14.0, RKE2 v1.35.2 or v1.35.3, and SUSE Virtualization v1.8.0.

Although these specific releases are most affected, the bug can also affect other version combinations. Check if the symptoms exist in your cluster, and view the related GitHub issue for more details.

Issue description

Import the SUSE Virtualization cluster into Rancher.
Create a new guest cluster using the Harvester Node Driver with the default Harvester Cloud Provider.
In the Rancher UI, the cluster state switches to Updating and shows a message such as `…waiting for cluster agent to connect`.

Check the running pods.

kubectl get pods -A

NAMESPACE         NAME                                                     READY   STATUS             RESTARTS         AGE
calico-system     calico-kube-controllers-76456b5c58-xgzmn                 0/1     Pending            0                29m
calico-system     calico-node-n6ptz                                        0/1     Running            0                29m
calico-system     calico-typha-5b6744bd87-vfzpm                            0/1     Pending            0                29m
cattle-system     cattle-cluster-agent-66f658c49f-mm4xr                    0/1     Pending            0                29m
kube-system       etcd-rke2-v1352-7-pool1-wf5n2-8ng58                      1/1     Running            0                30m
kube-system       helm-install-harvester-csi-driver-h8tqb                  0/1     CrashLoopBackOff   10 (3m24s ago)   29m
kube-system       helm-install-rke2-calico-5t6hj                           0/1     Completed          2                29m
kube-system       helm-install-rke2-calico-crd-spg62                       0/1     Completed          0                29m
kube-system       helm-install-rke2-coredns-nx5rl                          0/1     Completed          0                29m
kube-system       helm-install-rke2-metrics-server-tw8xj                   0/1     Pending            0                29m
kube-system       helm-install-rke2-runtimeclasses-csqgq                   0/1     Pending            0                29m
kube-system       helm-install-rke2-snapshot-controller-crd-dwkbt          0/1     Pending            0                29m
kube-system       helm-install-rke2-snapshot-controller-d2crg              0/1     Pending            0                29m
kube-system       helm-install-rke2-traefik-crd-bqb6g                      0/1     Pending            0                29m
kube-system       helm-install-rke2-traefik-fnjwg                          0/1     Pending            0                29m
kube-system       kube-apiserver-rke2-v1352-7-pool1-wf5n2-8ng58            1/1     Running            0                30m
kube-system       kube-controller-manager-rke2-v1352-7-pool1-wf5n2-8ng58   1/1     Running            0                30m
kube-system       kube-proxy-rke2-v1352-7-pool1-wf5n2-8ng58                1/1     Running            0                30m
kube-system       kube-scheduler-rke2-v1352-7-pool1-wf5n2-8ng58            1/1     Running            0                30m
kube-system       rke2-coredns-rke2-coredns-5d4dd4bdd9-5tcdm               0/1     Pending            0                29m
kube-system       rke2-coredns-rke2-coredns-autoscaler-67b9856946-m44cd    0/1     Pending            0                29m
tigera-operator   tigera-operator-6db5b6cfd8-gxhfv                         1/1     Running            0                29m

kubectl describe pod -n calico-system calico-typha-5b6744bd87-vfzpm

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  34m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m12s (x5 over 29m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

Treat the failing and stuck pods as symptoms. The root cause is that the cloud provider has not yet updated the corresponding node object.

Check the node object.

kubectl get node <node-name> -o yaml

      ...
      spec:
        ...
        taints:
        - effect: NoSchedule
          key: node.cloudprovider.kubernetes.io/uninitialized
          value: "true"

If the node has a node.cloudprovider.kubernetes.io/uninitialized:NoSchedule taint, this indicates that the cloud provider has not initialized the node.
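When a cluster has several nodes, the taint check can be done in one pass. The following is a minimal sketch under stated assumptions: the helper name `uninitialized_nodes` is hypothetical, and it expects one line per node in the form `<node name><TAB><taint keys separated by spaces>`.

```shell
#!/bin/sh
# uninitialized_nodes (hypothetical helper): reads lines of the form
# "<node name><TAB><space-separated taint keys>" on stdin and prints the
# names of nodes that still carry the cloud provider's uninitialized taint.
uninitialized_nodes() {
  awk -F '\t' '{
    n = split($2, keys, " ")
    for (i = 1; i <= n; i++) {
      if (keys[i] == "node.cloudprovider.kubernetes.io/uninitialized") {
        print $1
        break
      }
    }
  }'
}
```

One way to produce that input is `kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' | uninitialized_nodes`. An empty result suggests that no node is waiting for the cloud provider.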

Retrieve the description of the helm-install-harvester-cloud-provider job in the kube-system namespace.

kubectl describe job -n kube-system helm-install-harvester-cloud-provider
...
Name:             helm-install-harvester-cloud-provider
Namespace:        kube-system
...
Completion Mode:  NonIndexed
Suspend:          false
Backoff Limit:    1000
Start Time:       Tue, 31 Mar 2026 12:49:15 +0000
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 0 Failed
...

The helm-install-harvester-cloud-provider job appears stalled. The Pods Statuses field remains at 0 Active (0 Ready) / 0 Succeeded / 0 Failed even though the job has a start time.
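The stalled state can also be detected from the command output. The following is a minimal sketch: the helper name `job_is_stalled` is hypothetical, and it assumes the `Pods Statuses` line format shown in the output above.

```shell
#!/bin/sh
# job_is_stalled (hypothetical helper): reads `kubectl describe job` output on
# stdin and succeeds when the "Pods Statuses" line reports zero active,
# succeeded, and failed pods.
job_is_stalled() {
  awk '/^Pods Statuses:/ {
         found = 1
         # Sum the counts that precede the Active, Succeeded, and Failed labels.
         for (i = 2; i <= NF; i++) {
           if ($i == "Active" || $i == "Succeeded" || $i == "Failed") total += $(i - 1)
         }
       }
       END {
         if (found && total == 0) exit 0
         exit 1
       }'
}
```

For example, `kubectl describe job -n kube-system helm-install-harvester-cloud-provider | job_is_stalled && echo stalled`. A job that was created only seconds ago can briefly show the same zero counts, so the result is meaningful only when the start time is well in the past.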

Root cause

A race condition during the RKE2 bootstrap stage causes the helm-install-harvester-cloud-provider pod to encounter a rapid create, delete, create loop. This cycling desynchronizes the job controller, stalling the job and preventing it from updating the node object.

kube-system-kube-controller-manager log:
...
[job_controller.go:659] "Unhandled Error" err="syncing job: tracking status: adding uncounted pods to status:
Operation cannot be fulfilled on jobs.batch \"helm-install-harvester-cloud-provider\":
StorageError: invalid object, Code: 4, Key: /registry/jobs/kube-system/helm-install-harvester-cloud-provider,
ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 6ae0b0b2-34c6-42e2-afc9-d7616ec93f2e,
  UID in object meta: 8515b052-3a07-43e8-bf01-d898e2b9c27c" logger="UnhandledError"

The UID mismatch in the StorageError log indicates that the job object was recreated so rapidly that the controller’s internal state became inconsistent with the actual object stored in etcd.
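To confirm the mismatch without reading the log by eye, the two UIDs can be extracted and compared. The following is a minimal sketch: the helper name `uid_mismatch` is hypothetical, and it assumes the `UID in precondition` and `UID in object meta` wording shown in the log excerpt above.

```shell
#!/bin/sh
# uid_mismatch (hypothetical helper): reads kube-controller-manager log text on
# stdin, prints "<precondition UID> <object UID>", and succeeds only when both
# UIDs are present and differ.
uid_mismatch() {
  awk '{
         # 21 and 20 are the lengths of the two label prefixes, including the trailing space.
         if (match($0, /UID in precondition: [0-9a-f-]+/)) pre = substr($0, RSTART + 21, RLENGTH - 21)
         if (match($0, /UID in object meta: [0-9a-f-]+/))  obj = substr($0, RSTART + 20, RLENGTH - 20)
       }
       END {
         if (pre != "" && obj != "" && pre != obj) { print pre, obj; exit 0 }
         exit 1
       }'
}
```

The helper can be fed from the controller manager logs, for example `kubectl -n kube-system logs <kube-controller-manager-pod> | uid_mismatch`, where the pod name is a placeholder. If the log contains several such errors, only the last pair of UIDs is reported.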

An upstream fix for this race condition is under development and will be included in upcoming releases.

Resolution

This issue is resolved in RKE2 v1.35.4 and all subsequent releases.

RKE2 v1.35.4 is available by default starting with Rancher v2.14.1. Refer to the Rancher documentation for information about upgrading Rancher and its guest clusters. SUSE Virtualization requires no changes or upgrades to resolve this issue.

Workaround

Manually delete the stalled job to force the controller to recreate it.

kubectl delete job -n kube-system helm-install-harvester-cloud-provider
job.batch "helm-install-harvester-cloud-provider" deleted from kube-system namespace

A new job will be created within a few minutes.

Check if the uninitialized taint has been dropped.
```
kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
```
Once the job is recreated successfully, the query returns an empty response, which means that the taint has been removed and the pending pods can be scheduled. The guest cluster then recovers automatically.
Reprovision the failed cluster.

Related issue

#10188
