This is unreleased documentation for SUSE® Virtualization v1.8 (Dev).

SUSE Rancher Prime Issues

Guest cluster log collection

You can collect guest cluster logs and configuration files. Perform the following steps on each guest cluster node:

  1. Log in to the node.

  2. Download the Rancher v2.x Linux log collector script and generate a log bundle using the following commands:

    curl -OLs https://raw.githubusercontent.com/rancherlabs/support-tools/master/collection/rancher/v2.x/logs-collector/rancher2_logs_collector.sh
    sudo bash rancher2_logs_collector.sh

    The output of the script indicates the location of the generated tarball.

For more information, see The Rancher v2.x Linux log collector script.

Importing of SUSE Virtualization clusters into Rancher

After the cluster-registration-url setting is configured on SUSE Virtualization, a deployment named cattle-system/cattle-cluster-agent is created to import the SUSE Virtualization cluster into Rancher.

Import pending due to unable to read CA file error

The following error messages in the cattle-cluster-agent-* pod logs indicate that the SUSE Virtualization cluster cannot be imported into Rancher.

2025-02-13T17:25:22.520593546Z time="2025-02-13T17:25:22Z" level=info msg="Rancher agent version v2.10.2 is starting"
2025-02-13T17:25:22.529886868Z time="2025-02-13T17:25:22Z" level=error msg="unable to read CA file from /etc/kubernetes/ssl/certs/serverca: open /etc/kubernetes/ssl/certs/serverca: no such file or directory"
2025-02-13T17:25:22.529924542Z time="2025-02-13T17:25:22Z" level=error msg="Strict CA verification is enabled but encountered error finding root CA"
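As a quick check, the agent logs can be filtered for these messages instead of being read in full. The following is a minimal sketch, not an official tool: the function name has_ca_read_error is hypothetical, and the usage line assumes that the agent runs as the cattle-system/cattle-cluster-agent deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads cattle-cluster-agent log lines on stdin and
# succeeds (exit code 0) when either CA-related error message is present.
has_ca_read_error() {
  grep -qE 'unable to read CA file|Strict CA verification is enabled but encountered error finding root CA'
}

# Example usage (assumes kubectl access to the SUSE Virtualization cluster):
#   kubectl -n cattle-system logs deployment/cattle-cluster-agent | has_ca_read_error \
#     && echo "CA error found: review the agent-tls-mode setting in Rancher"
```

The function only reports whether the messages exist. It does not change any configuration.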

The root cause is a misconfigured agent-tls-mode setting in Rancher. This setting controls how Rancher’s agents (cluster-agent, fleet-agent, and system-agent) validate Rancher’s certificate when establishing a connection. The default value depends on the Rancher version and installation type.

Type               Versions          Default Value
New installation   v2.8              system-store
New installation   v2.9 and later    strict
Upgrade            v2.8 to v2.9      system-store

You can configure this setting to match your requirements by performing the following steps:

  1. Log in to the Rancher UI.

  2. Go to Global Settings → Settings.

  3. Select agent-tls-mode, and then select ⋮ → Edit Setting to access the configuration options.

  4. Select one of the following values:

    • Strict: Rancher’s agents only trust certificates generated by the Certificate Authority (CA) specified in the cacerts setting. This is the recommended default TLS setting.

      The Strict option enables a higher level of security by requiring Rancher to have access to the CA that generated the certificate visible to the agents. In the case of certain certificate configurations (notably, external certificates), this is not automatic, and extra configuration is required. For more information about scenarios that require extra configuration, see Choose your SSL Configuration in the Rancher documentation.

    • System Store: Rancher’s agents trust any certificate generated by a public CA specified in the operating system’s trust store. Use this setting if your setup uses an external trust authority and you don’t have ownership over the Certificate Authority.

      Using the System Store setting implies that the agent trusts all external authorities found in the operating system’s trust store including those outside of the user’s control.

  5. Click Save.
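The current value can also be inspected from the command line before or after changing it in the UI. The sketch below rests on assumptions to verify in your environment: that Rancher exposes the setting as a settings.management.cattle.io resource named agent-tls-mode on the Rancher management cluster, that the resource has value and default fields, and that an empty value means the default applies. The function name effective_setting is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the effective value of a Rancher setting,
# given its explicit value and its default. An empty explicit value is
# assumed to mean that the default applies.
effective_setting() {
  local value="$1" default="$2"
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
  else
    printf '%s\n' "$default"
  fi
}

# Example usage (resource and field names are assumptions):
#   value=$(kubectl get settings.management.cattle.io agent-tls-mode -o jsonpath='{.value}')
#   default=$(kubectl get settings.management.cattle.io agent-tls-mode -o jsonpath='{.default}')
#   effective_setting "$value" "$default"
```

The expected results are strict or system-store, matching the options described above.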

Related issues:

Guest cluster load balancer IP is not reachable

Issue description

The load balancer service successfully obtains an IP address from the DHCP server or IP pool but remains inaccessible.

To check if the issue exists in your environment, perform the following steps:

  1. Create a new guest cluster with the following settings:

    • Container Network: "Calico"

    • Cloud Provider: "Harvester"

  2. Deploy NGINX on the new guest cluster.

    kubectl apply -f https://k8s.io/examples/application/deployment.yaml
  3. Create a load balancer that uses NGINX.

Root cause

In the following example, a guest cluster node uses the IP address 10.115.1.46 and a new load balancer is assigned the IP address 10.115.6.200. When this load balancer’s IP address is later added to a new interface, such as vip-fd8c28ce (attached to @enp1s0), the Calico controller takes over the load balancer IP address. This conflict causes the load balancer IP address to become unreachable from outside the cluster.

To verify the cause, run the following command on the affected guest cluster node:

ip -d link show dev vxlan.calico
44: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535
info: Using default fan map value (33)
    vxlan id 4096 local 10.115.6.200 dev vip-8a928fa0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536

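The local address 10.115.6.200 belongs to the vip-* interface (vip-8a928fa0) instead of the node’s primary interface, which confirms that Calico selected the load balancer IP address for the VXLAN.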
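When checking several nodes, it may help to extract only the relevant fields from this output. The following is a minimal sketch: the function names vxlan_local_endpoint and vxlan_uses_vip are hypothetical, and the parsing assumes the output layout shown above, with the local and dev keywords on the vxlan line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ip -d link show dev vxlan.calico` output on
# stdin and prints "<local-address> <underlying-device>" from the vxlan line.
vxlan_local_endpoint() {
  awk '
    {
      addr = ""; dev = ""
      for (i = 1; i < NF; i++) {
        if ($i == "local") addr = $(i + 1)
        if ($i == "dev")   dev  = $(i + 1)
      }
      # Only the vxlan detail line carries a "local" keyword.
      if (addr != "") { print addr, dev; exit }
    }'
}

# Hypothetical helper: succeeds when the underlying device is a vip-* interface.
vxlan_uses_vip() {
  case "$(vxlan_local_endpoint)" in
    *" vip-"*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Example usage on a guest cluster node:
#   ip -d link show dev vxlan.calico | vxlan_uses_vip \
#     && echo "vxlan.calico is bound to a load balancer VIP"
```

For the output above, vxlan_local_endpoint prints 10.115.6.200 vip-8a928fa0, so vxlan_uses_vip succeeds.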

Affected Versions

The IP auto-detection feature is available in Calico v3.22 and other versions, with first-found as the default value.

SUSE® Rancher Prime: RKE2 v1.29 uses Calico v3.29.2, while v1.35 uses Calico v3.31.2.

Consequently, most RKE2 clusters using Calico as the default CNI alongside the Harvester Cloud Provider to deliver load balancer services are susceptible to this issue.

Workaround

Newly created clusters

When creating a new cluster in Rancher, select Add-on: Calico to open the YAML configuration window. Add nodeAddressAutodetectionV4 and skipInterface: vip.* to the spec.installation.calicoNetwork field.

installation:
  backend: VXLAN
  calicoNetwork:
    bgp: Disabled
    # Configures IPv4 address auto-detection filtering.
    nodeAddressAutodetectionV4:
      # Instructs the Calico controller to skip any interfaces matching the vip.* naming pattern.
      skipInterface: vip.*

These additional lines ensure that the Calico controller does not inadvertently intercept the assigned load balancer IP addresses.

Existing clusters

  1. Run the command kubectl edit installation.

  2. Go to the spec.calicoNetwork.nodeAddressAutodetectionV4 block and apply the following changes:

    • Remove the firstFound: true entry if it is present.

    • Add the skipInterface: vip.* parameter.

  3. Save the changes.

  4. Monitor the cluster for approximately two minutes while the calico-system/calico-node DaemonSet undergoes a rolling update.

    The newly initialized pods automatically use the node IP for the VXLAN.

  5. Check if the vxlan.calico interface uses the node IP (such as 10.115.1.46) instead of the VIP.

    ip -d link show dev vxlan.calico
    
    45: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535
    info: Using default fan map value (33)
        vxlan id 4096 local 10.115.1.46 dev enp1s0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
  6. If the vxlan.calico interface still uses the VIP, check the tigera-operator pod logs for the key phrase failed calling webhook.

    kubectl -n tigera-operator logs tigera-operator-8566d6db5c-wfjkt
    ...
    {"level":"error","ts":"2025-12-18T09:06:37Z","msg":"Reconciler error","controller":"tigera-installation-controller","object":{"name":"periodic-5m0s-reconcile-event"},"namespace":"","name":"periodic-5m0s-reconcile-event","reconcileID":"bae9d2da-a4bf-4d8b-89b8-c8a23a96f351","error":"Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"...}

    If this error occurs, add the following environment variable directly to the env section of the calico-node container in the calico-system/calico-node DaemonSet.

                - name: IP_AUTODETECTION_METHOD
                  value: skip-interface=vip.*
  7. Check the vxlan.calico interface again after a few minutes.

    Once the interface stops using the VIP, the VIP will become reachable.
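Steps 1 to 3 of this procedure can also be applied without an interactive editor. The following is a sketch under stated assumptions: the Installation resource is assumed to be named default, and the function name calico_skip_vip_patch is hypothetical. The patch is a JSON merge patch, in which a null value removes the firstFound key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a JSON merge patch that removes firstFound and
# adds skipInterface under spec.calicoNetwork.nodeAddressAutodetectionV4.
calico_skip_vip_patch() {
  printf '%s\n' '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"skipInterface":"vip.*"}}}}'
}

# Example usage (assumes the Installation resource is named "default"):
#   kubectl patch installation default --type merge -p "$(calico_skip_vip_patch)"
```

After the patch is applied, continue with the monitoring and verification steps above.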

Related issues: #8072 and #9767

Guest cluster provisioning is stuck at waiting for cluster agent to connect

Affected versions

  • Rancher: v2.14.0

  • RKE2: v1.34.x, v1.35.1, v1.35.2, v1.35.3

  • SUSE Virtualization: v1.8.0

This issue frequently affects environments running Rancher v2.14.0, RKE2 v1.35.2 or v1.35.3, and SUSE Virtualization v1.8.0.

Although these specific releases are most affected, the bug can also affect other version combinations. Check if the symptoms exist in your cluster, and view the related GitHub issue for more details.

Issue description

  1. Import the SUSE Virtualization cluster into Rancher.

  2. Create a new guest cluster using the Harvester Node Driver with the default Harvester Cloud Provider.

  3. In the Rancher UI, the cluster state switches to Updating with a message such as "…waiting for cluster agent to connect".

  4. Check the running pods.

    kubectl get pods -A
    
    NAMESPACE         NAME                                                     READY   STATUS             RESTARTS         AGE
    calico-system     calico-kube-controllers-76456b5c58-xgzmn                 0/1     Pending            0                29m
    calico-system     calico-node-n6ptz                                        0/1     Running            0                29m
    calico-system     calico-typha-5b6744bd87-vfzpm                            0/1     Pending            0                29m
    cattle-system     cattle-cluster-agent-66f658c49f-mm4xr                    0/1     Pending            0                29m
    kube-system       etcd-rke2-v1352-7-pool1-wf5n2-8ng58                      1/1     Running            0                30m
    kube-system       helm-install-harvester-csi-driver-h8tqb                  0/1     CrashLoopBackOff   10 (3m24s ago)   29m
    kube-system       helm-install-rke2-calico-5t6hj                           0/1     Completed          2                29m
    kube-system       helm-install-rke2-calico-crd-spg62                       0/1     Completed          0                29m
    kube-system       helm-install-rke2-coredns-nx5rl                          0/1     Completed          0                29m
    kube-system       helm-install-rke2-metrics-server-tw8xj                   0/1     Pending            0                29m
    kube-system       helm-install-rke2-runtimeclasses-csqgq                   0/1     Pending            0                29m
    kube-system       helm-install-rke2-snapshot-controller-crd-dwkbt          0/1     Pending            0                29m
    kube-system       helm-install-rke2-snapshot-controller-d2crg              0/1     Pending            0                29m
    kube-system       helm-install-rke2-traefik-crd-bqb6g                      0/1     Pending            0                29m
    kube-system       helm-install-rke2-traefik-fnjwg                          0/1     Pending            0                29m
    kube-system       kube-apiserver-rke2-v1352-7-pool1-wf5n2-8ng58            1/1     Running            0                30m
    kube-system       kube-controller-manager-rke2-v1352-7-pool1-wf5n2-8ng58   1/1     Running            0                30m
    kube-system       kube-proxy-rke2-v1352-7-pool1-wf5n2-8ng58                1/1     Running            0                30m
    kube-system       kube-scheduler-rke2-v1352-7-pool1-wf5n2-8ng58            1/1     Running            0                30m
    kube-system       rke2-coredns-rke2-coredns-5d4dd4bdd9-5tcdm               0/1     Pending            0                29m
    kube-system       rke2-coredns-rke2-coredns-autoscaler-67b9856946-m44cd    0/1     Pending            0                29m
    tigera-operator   tigera-operator-6db5b6cfd8-gxhfv                         1/1     Running            0                29m
    kubectl describe pod -n calico-system calico-typha-5b6744bd87-vfzpm
    
    Events:
      Type     Reason            Age                  From               Message
      ----     ------            ----                 ----               -------
      Warning  FailedScheduling  34m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
      Warning  FailedScheduling  9m12s (x5 over 29m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

    Treat the failing and pending pods as symptoms. The root cause is that the cloud provider has not yet initialized the corresponding node object.

  5. Check the node object.

    kubectl get node <node-name> -o yaml
          ...
          spec:
            ...
            taints:
            - effect: NoSchedule
              key: node.cloudprovider.kubernetes.io/uninitialized
              value: "true"

    If the node has a node.cloudprovider.kubernetes.io/uninitialized:NoSchedule taint, this indicates that the cloud provider has not initialized the node.

  6. Retrieve the description of the helm-install-harvester-cloud-provider job in the kube-system namespace.

    kubectl describe job -n kube-system helm-install-harvester-cloud-provider
    ...
    Name:             helm-install-harvester-cloud-provider
    Namespace:        kube-system
    ...
    Completion Mode:  NonIndexed
    Suspend:          false
    Backoff Limit:    1000
    Start Time:       Tue, 31 Mar 2026 12:49:15 +0000
    Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 0 Failed
    ...

    The helm-install-harvester-cloud-provider job appears stalled. The Pods Statuses field remains at 0 Active (0 Ready) / 0 Succeeded / 0 Failed despite the job having started.
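The stalled state from step 6 can be detected from the kubectl describe output. The following is a minimal sketch: the function name job_is_stalled is hypothetical, and the parsing assumes the Pods Statuses line format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl describe job` output on stdin and
# succeeds when the "Pods Statuses" line reports no active, succeeded, or
# failed pods, which is the stalled state described above.
job_is_stalled() {
  awk '
    /^[[:space:]]*Pods Statuses:/ {
      found = 1
      # Each count is the field immediately before its label.
      for (i = 1; i < NF; i++) {
        if ($(i + 1) == "Active")    active = $i
        if ($(i + 1) == "Succeeded") succeeded = $i
        if ($(i + 1) == "Failed")    failed = $i
      }
    }
    END { exit !(found && active == 0 && succeeded == 0 && failed == 0) }'
}

# Example usage:
#   kubectl describe job -n kube-system helm-install-harvester-cloud-provider \
#     | job_is_stalled && echo "job appears stalled"
```

A job that has only just started can briefly show the same counts, so treat a positive result as a hint and confirm it together with the node taint from step 5.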

Root cause

A race condition during the RKE2 bootstrap stage causes the helm-install-harvester-cloud-provider pod to encounter a rapid create, delete, create loop. This cycling desynchronizes the job controller, stalling the job and preventing it from updating the node object.

kube-system-kube-controller-manager log:
...
[job_controller.go:659] "Unhandled Error" err="syncing job: tracking status: adding uncounted pods to status:
Operation cannot be fulfilled on jobs.batch \"helm-install-harvester-cloud-provider\":
StorageError: invalid object, Code: 4, Key: /registry/jobs/kube-system/helm-install-harvester-cloud-provider,
ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 6ae0b0b2-34c6-42e2-afc9-d7616ec93f2e,
  UID in object meta: 8515b052-3a07-43e8-bf01-d898e2b9c27c" logger="UnhandledError"

The UID mismatch in the StorageError log indicates that the job object was recreated so rapidly that the controller’s internal state became inconsistent with the actual object stored in etcd.
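To look for this pattern in a saved kube-controller-manager log, the two UIDs can be extracted and compared. The following is a minimal sketch: the function name job_uid_mismatch is hypothetical, and the patterns assume the message wording shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kube-controller-manager log text on stdin and
# succeeds when the StorageError messages mention at least two distinct UIDs.
job_uid_mismatch() {
  local uids
  uids=$(grep -oE 'UID in (precondition|object meta): [0-9a-f-]{36}' \
    | awk '{print $NF}' | sort -u | wc -l)
  [ "$uids" -ge 2 ]
}

# Example usage (controller-manager.log is a hypothetical file name):
#   job_uid_mismatch < controller-manager.log && echo "UID mismatch found"
```

The check counts distinct UIDs across the whole input, so it is best applied to log excerpts that concern a single job.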

An upstream fix for this race condition is under development and will be included in upcoming releases.

Resolution

This issue is resolved in RKE2 v1.35.4 and all subsequent releases.

RKE2 v1.35.4 is available by default starting with Rancher v2.14.1. Refer to the Rancher documentation for information about upgrading Rancher and its guest clusters. SUSE Virtualization requires no changes or upgrades to resolve this issue.

Workaround

  1. Manually delete the stalled job to force the controller to recreate it.

    kubectl delete job -n kube-system helm-install-harvester-cloud-provider
    job.batch "helm-install-harvester-cloud-provider" deleted from kube-system namespace

    A new job will be created within a few minutes.

  2. Check if the uninitialized taint has been dropped.

    kubectl get nodes -o jsonpath='{.items[*].spec.taints}'

    Once the job is recreated successfully, querying the node taints will return an empty response. This allows the pending pods to be scheduled. The guest cluster will automatically recover.

  3. Reprovision the failed cluster.
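The check in step 2 can be repeated automatically after the job is deleted. The following is a sketch, not a supported script: the function names has_uninitialized_taint and wait_for_node_init are hypothetical, and the default of 30 attempts at 10-second intervals is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the output of
#   kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
# on stdin and succeeds while any node still carries the uninitialized taint.
has_uninitialized_taint() {
  grep -qF 'node.cloudprovider.kubernetes.io/uninitialized'
}

# Hypothetical helper: polls until the taint is gone or the attempts run out.
# Arguments: number of attempts (default 30) and delay in seconds (default 10).
wait_for_node_init() {
  local attempts="${1:-30}" delay="${2:-10}" i=0 taints
  while [ "$i" -lt "$attempts" ]; do
    # Only report success when kubectl itself succeeded and no taint remains.
    if taints=$(kubectl get nodes -o jsonpath='{.items[*].spec.taints}') \
       && ! printf '%s\n' "$taints" | has_uninitialized_taint; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage, after deleting the stalled job as described in step 1:
#   wait_for_node_init && echo "nodes initialized" || echo "taint still present"
```

The function treats a failed kubectl call as "not yet initialized", so a connection problem results in a timeout instead of a false success.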