Fleet Monitor

An advanced diagnostic tool for troubleshooting SUSE® Rancher Prime Continuous Delivery deployments.

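Overview

Extracts insights from the SUSE® Rancher Prime Continuous Delivery GitOps lifecycle by capturing a snapshot of all relevant resources and running automated diagnostics. This command helps identify why bundles get stuck in the targeting and deployment stages, and provides actionable information about the health of your SUSE® Rancher Prime Continuous Delivery installation.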

fleet monitor [flags]

Options

  -n, --namespace string          Namespace to monitor (default: all namespaces)
      --system-namespace string   Fleet system namespace (default: cattle-fleet-system)
      --agent-staleness duration  Consider agent stale after this duration (default: 24h)
      --watch                     Watch for changes and output continuously
      --interval int              Interval in seconds between checks when watching (default: 60)
  -h, --help                      help for monitor

Quick Start

The monitor command outputs compact JSON (one snapshot per line). Pipe it through jq for readable formatting.

# Single snapshot with formatted output
fleet monitor | jq

# Single snapshot with human-readable analysis
fleet monitor | fleet analyze

# Continuous monitoring with built-in watch mode (every 60 seconds)
fleet monitor --watch --interval 60 >> monitor.json
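
Because each line of the collected file is one complete JSON snapshot, the file can also be processed line by line outside of jq. The following is a minimal Python sketch of that idea, not part of the tool itself. The sample lines and the shape of the entries inside `stuckBundleDeployments` are assumptions made for illustration; only the one-snapshot-per-line layout and the `diagnostics` key come from this page.

```python
import json
from typing import Iterable, Iterator


def iter_snapshots(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed snapshot per line, skipping blank or non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            snapshot = json.loads(line)
        except json.JSONDecodeError:
            # For example, a log message that ended up in the same file.
            continue
        if isinstance(snapshot, dict):
            yield snapshot


# Hypothetical sample standing in for the contents of monitor.json.
sample_lines = [
    '{"diagnostics": {"stuckBundleDeployments": []}}',
    "",
    "not json",
    '{"diagnostics": {"stuckBundleDeployments": [{"name": "bd-1"}]}}',
]

snapshots = list(iter_snapshots(sample_lines))
latest = snapshots[-1]
print(len(snapshots), len(latest["diagnostics"]["stuckBundleDeployments"]))
```

To read a real file, pass an open file object instead of the sample list, for example `iter_snapshots(open("monitor.json"))`.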

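What It Detects

The monitor command runs comprehensive diagnostics to detect the following issues:

Resource Lifecycle Issues

Bundles with generation mismatch: bundles that are not progressing through their lifecycle (generation != observedGeneration)
Stuck BundleDeployments: BundleDeployments where the agent has not applied a new deploymentID
Multiple finalizers: resources with more than one finalizer (indicates a bug; only Content resources should have multiple finalizers, for reference counting, in SUSE® Rancher Prime Continuous Delivery v0.11.1 through v0.14.x)
Orphaned resources: resources with a deletion timestamp that cannot be garbage collected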

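Data Consistency Issues

API time travel: the Kubernetes API server returns stale cached data (detected by fetching resources multiple times)
Commit hash mismatch: bundles or BundleDeployments not updated to the latest commit of their GitRepo
Force sync generation drift: bundles that do not reflect their GitRepo's forceSyncGeneration value
UID mismatch: Secrets with owner references to deleted or recreated resources
DeploymentID mismatch: BundleDeployments where spec.deploymentID != status.appliedDeploymentID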

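Target Matching Issues

Bundles with no deployments: bundles that were created, but whose target selectors match no cluster
GitRepos with no bundles: GitRepos that created no bundles (possibly a path, target, or processing error)
ClusterGroups with no clusters: ClusterGroups whose selector matches no clusters
Orphaned BundleDeployments: BundleDeployments whose parent bundle has been deleted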

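Performance Issues

Large bundles: bundles larger than 1 MB may affect etcd performance
Missing Content resources: bundles that reference a resourcesSHA256Sum with no corresponding Content resource
High resource count: a large number of bundle resources can put pressure on etcd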
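
The size check can be reproduced on a saved snapshot. The sketch below is illustrative only: it assumes each bundle entry carries the `name` and `sizeBytes` fields used by the jq examples on this page, and it treats "1 MB" as 1048576 bytes, the same divisor those examples use. The sample data is hypothetical.

```python
ONE_MIB = 1048576  # same divisor the jq example on this page uses for sizeMB


def large_bundles(bundles, threshold=ONE_MIB):
    """Return (name, sizeBytes) pairs above the threshold, largest first."""
    big = [
        (b["name"], b["sizeBytes"])
        for b in bundles
        if (b.get("sizeBytes") or 0) > threshold
    ]
    return sorted(big, key=lambda item: item[1], reverse=True)


# Hypothetical bundle entries, not real monitor output.
sample_bundles = [
    {"name": "small-app", "sizeBytes": 20480},
    {"name": "huge-chart", "sizeBytes": 3145728},
    {"name": "big-crds", "sizeBytes": 1572864},
]

result = large_bundles(sample_bundles)
print(result)
```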

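Agent and Cluster Issues

Agent not ready: clusters whose agent status is not ready
Missing last seen: clusters with no agent heartbeat timestamp
Stale last seen: clusters whose agent has not checked in recently (default: 24 hours, configurable with `--agent-staleness`)
Missing agent bundle: the expected agent BundleDeployment is missing from the cluster namespace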
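
The "missing" and "stale" last-seen checks amount to comparing a heartbeat timestamp against a threshold. The following sketch shows one way to express that, under two assumptions that this page does not confirm: that `lastSeen` is an RFC 3339 string such as `2024-05-01T12:00:00Z`, and that a heartbeat counts as stale only when it is strictly older than the threshold. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_last_seen(
    last_seen: Optional[str],
    now: datetime,
    staleness: timedelta = timedelta(hours=24),  # mirrors the 24h default
) -> str:
    """Classify a heartbeat timestamp as 'missing', 'stale', or 'fresh'."""
    if not last_seen:
        return "missing"
    # Normalize a trailing "Z" so fromisoformat works on older Python versions.
    seen = datetime.fromisoformat(last_seen.replace("Z", "+00:00"))
    return "stale" if now - seen > staleness else "fresh"


# Fixed reference time so the example is reproducible.
now = datetime(2024, 5, 2, 13, 0, tzinfo=timezone.utc)

print(classify_last_seen(None, now))
print(classify_last_seen("2024-05-01T12:00:00Z", now))
print(classify_last_seen("2024-05-02T01:00:00Z", now))
```

Passing a different `staleness` value corresponds to changing `--agent-staleness` on the command line.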

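Ownership Chain Issues

Broken ownership: bundles without a GitRepo owner, or BundleDeployments without a bundle owner
Invalid Secret owners: incorrect or missing owner references on bundle Secrets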

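Generation/Observed Mismatches

GitRepo generation drift: GitRepo generation != observedGeneration (the controller has not processed the update)
Bundle generation drift: bundle generation != observedGeneration (the controller has not processed the update)
BundleDeployment sync generation drift: BundleDeployment syncGeneration != forceSyncGeneration (the agent has not applied the force sync)
Stale Content generation: Content resources with a stale generation value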

Usage Examples

Basic Monitoring

# Single snapshot with pretty formatting
fleet monitor | jq

# Monitor specific namespace
fleet monitor -n fleet-local | jq

# Check fleet-default namespace (common for downstream clusters)
fleet monitor -n fleet-default | jq

Continuous Monitoring

# Collect snapshots every 60 seconds using watch mode
fleet monitor --watch --interval 60 >> monitor.json

# Or monitor with a shorter interval (every 30 seconds)
fleet monitor --watch --interval 30 >> monitor.json

Targeted Diagnostics

# Check for stuck resources
fleet monitor | jq '.diagnostics | {
  bundlesWithGenerationMismatch: .bundlesWithGenerationMismatch | length,
  stuckBundleDeployments: .stuckBundleDeployments | length
}'

# Find bundles with old commits
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies'

# Check agent health across all clusters
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'

# Find large bundles that might impact etcd
fleet monitor | jq '.diagnostics.largeBundles'

# Check target matching issues
fleet monitor | jq '.diagnostics | {
  bundlesNoDeployments: .bundlesWithNoDeployments | length,
  gitreposNoBundles: .gitReposWithNoBundles | length,
  clusterGroupsNoClusters: .clusterGroupsWithNoClusters | length
}'

Comparing States

# Before making changes
fleet monitor > before.json

# Make changes to GitRepo, bundles, etc.
kubectl edit gitrepo my-repo

# After changes
fleet monitor > after.json
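
Once both snapshots exist, comparing them usually means checking which diagnostic lists grew or shrank. The sketch below is one possible way to do that and is not a feature of the tool. It assumes only that `diagnostics` is an object whose values are mostly lists, as the jq examples on this page suggest; the sample snapshots are hypothetical.

```python
def diagnostic_counts(snapshot: dict) -> dict:
    """Map each list-valued diagnostics key to the number of entries it holds."""
    diagnostics = snapshot.get("diagnostics") or {}
    return {k: len(v) for k, v in diagnostics.items() if isinstance(v, list)}


def diff_counts(before: dict, after: dict) -> dict:
    """Return {key: (before_count, after_count)} for keys whose count changed."""
    b, a = diagnostic_counts(before), diagnostic_counts(after)
    return {
        key: (b.get(key, 0), a.get(key, 0))
        for key in sorted(set(b) | set(a))
        if b.get(key, 0) != a.get(key, 0)
    }


# Hypothetical snapshots standing in for before.json and after.json.
before = {"diagnostics": {"stuckBundleDeployments": [{"name": "bd-1"}], "largeBundles": []}}
after = {"diagnostics": {"stuckBundleDeployments": [], "largeBundles": []}}

changes = diff_counts(before, after)
print(changes)
```

With real files, load each snapshot with `json.load(open("before.json"))` and `json.load(open("after.json"))` before calling `diff_counts`.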

Common Troubleshooting Scenarios

Scenario 1: Bundle Not Deploying

# Capture current state
fleet monitor | jq > bundle-status.json

# Check for bundles with generation mismatch
jq '.diagnostics.bundlesWithGenerationMismatch' bundle-status.json

# Check if bundle matched any targets
jq '.diagnostics.bundlesWithNoDeployments' bundle-status.json

# Check bundle-to-gitrepo consistency
jq '.diagnostics.gitRepoBundleInconsistencies' bundle-status.json

Scenario 2: Agent Not Reporting Status

# Check agent health
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'

# See detailed cluster info
fleet monitor | jq '.clusters[] | select(.agentStatus != "ready")'

# Check when agents last checked in
fleet monitor | jq '.clusters[] | {name, lastSeen, agentAge}'

Scenario 3: Resources Stuck with a Deletion Timestamp

# Find resources with deletion timestamps
fleet monitor | jq '{
  bundles: [.bundles[] | select(.deletionTimestamp != null) | .name],
  bundledeployments: [.bundledeployments[] | select(.deletionTimestamp != null) | .name]
}'

# Check finalizers preventing deletion
fleet monitor | jq '.bundles[] | select(.deletionTimestamp != null) | {name, finalizers}'

Scenario 4: Commits Not Propagating

# Track commits through the lifecycle
fleet monitor | jq '{
  gitrepo: .gitrepos[0].commit[0:8],
  bundles: [.bundles[] | {name, commit: .commit[0:8]}],
  bundledeployments: [.bundledeployments[] | {name, commit: .commit[0:8]}]
}'

# Find commit mismatches
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies[] |
  select(.commitMismatch == true)'

Scenario 5: Performance Issues

# Check bundle sizes
fleet monitor | jq '.diagnostics.largeBundles'

# Find the largest bundles by size
fleet monitor | jq '[.bundles[] | {name, size: .sizeBytes, sizeMB: (.sizeBytes / 1048576 | floor)}] |
  sort_by(.size) | reverse'

# Check for missing content resources
fleet monitor | jq '.diagnostics.bundlesWithMissingContent'

Continuous Monitoring Workflow

For long-term monitoring and trend analysis:

# 1. Start continuous collection with watch mode (runs in background)
nohup fleet monitor --watch --interval 60 >> /var/log/fleet-monitor.json 2>&1 &

# 2. Periodically analyze for issues
watch -n 300 "fleet analyze --issues /var/log/fleet-monitor.json | tail -30"

# 3. Generate daily reports
fleet analyze --diff /var/log/fleet-monitor.json > fleet-report-$(date +%Y%m%d).txt

# 4. Clean up old reports (keep the last 7 days)
find /var/log -name "fleet-report-*.txt" -mtime +7 -delete

Integrating with Alerting

The monitor command can be integrated with monitoring systems:

# Check if there are any issues (jq -e exits 0 when the expression is true, i.e. issues found, and 1 when healthy)
if fleet monitor | jq -e '
  .diagnostics.bundlesWithGenerationMismatch != [] or
  .diagnostics.stuckBundleDeployments != [] or
  .diagnostics.clustersWithAgentIssues != []
' > /dev/null; then
  echo "ALERT: Fleet issues detected!"
  fleet monitor | jq '.diagnostics' | mail -s "Fleet Alert" admin@example.com
fi

# Prometheus-style metrics export
fleet monitor | jq -r '
  "fleet_stuck_bundles \(.diagnostics.bundlesWithGenerationMismatch | length)",
  "fleet_stuck_bundledeployments \(.diagnostics.stuckBundleDeployments | length)",
  "fleet_agent_issues \(.diagnostics.clustersWithAgentIssues | length)",
  "fleet_large_bundles \(.diagnostics.largeBundles | length)"
'

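Understanding Diagnostics

Stuck Resources

A resource is considered "stuck" when:

Bundles (generation mismatch):

generation != observedGeneration (the controller has not yet processed the latest spec)

Note: bundles with a deletion timestamp are tracked separately in diagnostics.bundlesWithDeletionTimestamp.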

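BundleDeployments (stuck):

spec.deploymentID != status.appliedDeploymentID (the agent has not yet applied the latest deployment)
`syncGeneration` does not match `forceSyncGeneration` (force sync not applied)
Has a `deletionTimestamp` but still exists

BundleDeployments (sync generation mismatch):

syncGeneration != forceSyncGeneration when forceSyncGeneration > 0 (tracked separately from stuck BundleDeployments)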
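
The BundleDeployment conditions above can be read as a small predicate. The sketch below illustrates that logic only; it is not the tool's actual implementation. It works on a flattened, hypothetical dictionary whose keys (`deploymentID`, `appliedDeploymentID`, `syncGeneration`, `forceSyncGeneration`, `deletionTimestamp`) are assumptions chosen to mirror the field names in the text, and it applies the `forceSyncGeneration > 0` guard to the sync check.

```python
def stuck_reasons(bd: dict) -> list:
    """Return the reasons a (flattened, hypothetical) BundleDeployment looks stuck."""
    reasons = []
    if bd.get("deploymentID") != bd.get("appliedDeploymentID"):
        reasons.append("deploymentID not applied")
    force = bd.get("forceSyncGeneration") or 0
    if force > 0 and bd.get("syncGeneration") != force:
        reasons.append("force sync not applied")
    if bd.get("deletionTimestamp"):
        reasons.append("deletion pending")
    return reasons


# Hypothetical records, not real monitor output.
healthy = {
    "deploymentID": "s-abc",
    "appliedDeploymentID": "s-abc",
    "forceSyncGeneration": 0,
    "syncGeneration": None,
    "deletionTimestamp": None,
}
stuck = {
    "deploymentID": "s-new",
    "appliedDeploymentID": "s-old",
    "forceSyncGeneration": 2,
    "syncGeneration": 1,
    "deletionTimestamp": "2024-05-01T12:00:00Z",
}

print(stuck_reasons(healthy))
print(stuck_reasons(stuck))
```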

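API Consistency Check

The monitor fetches the same resource multiple times to detect "time travel": cases where the Kubernetes API server returns different resource versions because of a stale cache. This matters because stale data can make bundles look stuck when they are actually progressing.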
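
The idea of repeated fetches can be sketched as follows. This is a simplified illustration and not the monitor's actual algorithm: `fetch` is a hypothetical callable returning the resource version string of one resource, and any disagreement between fetches is flagged. A legitimate update that lands between two fetches would trigger the same flag, so a real check would need additional context.

```python
from typing import Callable, List


def fetch_versions(fetch: Callable[[], str], attempts: int = 3) -> List[str]:
    """Call the (hypothetical) fetch function several times and collect the results."""
    return [fetch() for _ in range(attempts)]


def is_inconsistent(versions: List[str]) -> bool:
    """True when repeated fetches of the same resource did not all agree."""
    return len(set(versions)) > 1


# Fake fetcher that simulates an API server answering from a stale cache once.
responses = iter(["100", "98", "100"])
observed = fetch_versions(lambda: next(responses), attempts=3)

print(observed, is_inconsistent(observed))
```

Resource versions are compared only for equality here, since Kubernetes treats them as opaque strings.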

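Commit Tracking

The monitor tracks Git commit hashes through the entire lifecycle:

1. The GitRepo fetches the latest commit
2. Bundles should reflect that commit
3. BundleDeployments should match their bundle's commit
4. Bundle Secrets store the commit in an annotation

A mismatch indicates where the sync process failed.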
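
To locate the stage at which a commit stopped propagating, the commit recorded at each stage can be compared against the GitRepo's commit. The sketch below assumes the snapshot layout used by the jq examples on this page (`gitrepos`, `bundles`, and `bundledeployments` arrays with `name` and `commit` fields) and, for simplicity, that every listed bundle belongs to the one GitRepo being checked. The sample data is hypothetical.

```python
def commit_mismatches(snapshot: dict, expected_commit: str) -> dict:
    """List resource names per stage whose commit differs from the expected one."""
    result = {}
    for stage in ("bundles", "bundledeployments"):
        result[stage] = [
            item["name"]
            for item in snapshot.get(stage) or []
            if item.get("commit") != expected_commit
        ]
    return result


# Hypothetical snapshot fragment, not real monitor output.
snapshot = {
    "gitrepos": [{"name": "my-repo", "commit": "aaaa1111"}],
    "bundles": [
        {"name": "b1", "commit": "aaaa1111"},
        {"name": "b2", "commit": "bbbb2222"},
    ],
    "bundledeployments": [{"name": "bd1", "commit": "bbbb2222"}],
}

expected = snapshot["gitrepos"][0]["commit"]
lagging = commit_mismatches(snapshot, expected)
print(lagging)
```

An empty list for `bundles` combined with entries under `bundledeployments` would suggest the controller has caught up while the agent has not.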

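Performance Considerations

The monitor command fetches all Fleet resources in the cluster
For large installations (1000+ bundles), consider:
Using --namespace to limit the scope
Running less frequently (for example, every 120+ seconds instead of 60)
Monitoring the resource usage of the monitor command itself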

Troubleshooting the Monitor Command

If the monitor command fails:

# Check Fleet controller is running
kubectl get pods -n cattle-fleet-system

# Verify you have proper RBAC permissions
kubectl auth can-i list bundles --all-namespaces
kubectl auth can-i list bundledeployments --all-namespaces

# Check if CRDs are installed
kubectl get crds | grep fleet.cattle.io

# Enable verbose logging
fleet monitor --verbose 2>&1 | tee monitor-debug.log