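Fleet Monitor
An advanced diagnostic tool for troubleshooting SUSE® Rancher Prime Continuous Delivery deployments.
Overview
Extracts insights from the SUSE® Rancher Prime Continuous Delivery GitOps lifecycle by capturing a snapshot of all relevant resources and running automated diagnostics. This command helps identify why bundles are stuck in the targeting and deployment phases, and provides actionable information about the health of your SUSE® Rancher Prime Continuous Delivery installation.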
fleet monitor [flags]
Options
-n, --namespace string Namespace to monitor (default: all namespaces)
--system-namespace string SUSE® Rancher Prime Continuous Delivery system namespace (default: cattle-fleet-system)
--agent-staleness duration Consider agent stale after this duration (default: 24h)
--watch Watch for changes and output continuously
--interval int Interval in seconds between checks when watching (default: 60)
-h, --help help for monitor
Quick Start
The monitor command outputs compact JSON, one snapshot per line. Pipe it through jq to format it for readability.
# Single snapshot with formatted output
fleet monitor | jq
# Single snapshot with human-readable analysis
fleet monitor | fleet analyze
# Continuous monitoring with built-in watch mode (every 60 seconds)
fleet monitor --watch --interval 60 >> monitor.json
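Because each snapshot is a single line of JSON, a file written in watch mode can be read as JSON Lines. The following is a minimal Python sketch that assumes only that format. The helper name `read_snapshots` is hypothetical, and the sample data is invented for illustration.

```python
import io
import json


def read_snapshots(lines):
    """Yield one parsed snapshot per non-empty line (JSON Lines)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between snapshots
            yield json.loads(line)


# Stand-in for a file written by `fleet monitor --watch >> monitor.json`.
# The contents of "diagnostics" below are invented sample data.
sample = io.StringIO(
    '{"diagnostics": {"largeBundles": []}}\n'
    '\n'
    '{"diagnostics": {"largeBundles": [{"name": "big"}]}}\n'
)
snapshots = list(read_snapshots(sample))
print(len(snapshots))
```

With a real capture, pass an open file object (for example `open("monitor.json")`) instead of the in-memory sample.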
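What It Detects
The monitor command runs comprehensive diagnostics to detect the following issues:
Resource Lifecycle Issues
- Bundles with generation mismatch: bundles that are not progressing through their lifecycle (generation != observedGeneration)
- Stuck BundleDeployments: BundleDeployments where the agent has not applied the new deploymentID
- Multiple finalizers: resources with more than one finalizer (this indicates a bug; in SUSE® Rancher Prime Continuous Delivery v0.11.1 through v0.14.x, only Content resources should have multiple finalizers, which they use for reference counting)
- Orphaned resources: resources with a deletion timestamp that cannot be garbage collected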
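Data Consistency Issues
- API time travel: the Kubernetes API server returns stale cached data (detected by fetching resources multiple times)
- Commit hash mismatch: Bundles or BundleDeployments that have not been updated to the GitRepo's latest commit
- ForceSyncGeneration drift: bundles that do not reflect their GitRepo's forceSyncGeneration value
- UID mismatch: Secrets with owner references to deleted or recreated resources
- DeploymentID mismatch: BundleDeployments where spec.deploymentID != status.appliedDeploymentID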
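The "API time travel" check relies on reading the same resource more than once. How the command compares those reads is not described here. The sketch below shows one possible approach, under the assumption that each read yields the object's `metadata.generation`, which should never decrease for the same object. The function name is hypothetical.

```python
def went_back_in_time(generations):
    """Return True if any read reports a lower generation than an earlier read.

    `generations` is the sequence of metadata.generation values observed
    across repeated fetches of the same object, in fetch order.
    """
    highest = None
    for gen in generations:
        if highest is not None and gen < highest:
            return True  # a later read returned older data
        highest = gen if highest is None else max(highest, gen)
    return False


print(went_back_in_time([3, 4, 3]))
```

A later read that reports an older generation suggests the API server answered from a stale cache.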
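Target Matching Issues
- Bundles with no deployments: bundles that were created, but no cluster matches their target selectors
- GitRepos with no bundles: GitRepos that have not created any bundles (possibly due to a path, targeting, or processing error)
- ClusterGroups with no clusters: ClusterGroups whose selectors do not match any cluster
- Orphaned BundleDeployments: BundleDeployments whose parent bundle has been deleted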
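Performance Issues
- Large bundles: bundles larger than 1MB may affect etcd performance
- Missing content resources: bundles that reference a resourcesSHA256Sum but have no corresponding Content resource
- High resource count: a large number of bundle resources may put pressure on etcd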
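The size check can be reproduced offline against a captured snapshot. This sketch assumes bundle entries carry the `name` and `sizeBytes` fields used by the jq examples on this page, and it treats 1MB as 1048576 bytes. Whether the command itself uses that exact threshold is an assumption, and `large_bundles` is a hypothetical helper.

```python
ONE_MIB = 1024 * 1024  # 1048576 bytes; assumed meaning of the 1MB limit


def large_bundles(bundles, limit=ONE_MIB):
    """Return the names of bundles whose sizeBytes exceeds the limit."""
    return [b["name"] for b in bundles if b.get("sizeBytes", 0) > limit]


# Invented sample entries for illustration.
bundles = [
    {"name": "small", "sizeBytes": 10},
    {"name": "huge", "sizeBytes": 2 * ONE_MIB},
    {"name": "unknown-size"},  # a missing sizeBytes is treated as 0
]
print(large_bundles(bundles))
```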
Usage Examples
Basic Monitoring
# Single snapshot with pretty formatting
fleet monitor | jq
# Monitor specific namespace
fleet monitor -n fleet-local | jq
# Check fleet-default namespace (common for local clusters)
# Check fleet-default namespace (the default workspace for downstream clusters)
fleet monitor -n fleet-default | jq
Continuous Monitoring
# Collect snapshots every 60 seconds using watch mode
fleet monitor --watch --interval 60 >> monitor.json
# Or monitor with a shorter interval (every 30 seconds)
fleet monitor --watch --interval 30 >> monitor.json
Targeted Diagnostics
# Check for stuck resources
fleet monitor | jq '.diagnostics | {
bundlesWithGenerationMismatch: .bundlesWithGenerationMismatch | length,
stuckBundleDeployments: .stuckBundleDeployments | length
}'
# Find bundles with old commits
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies'
# Check agent health across all clusters
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'
# Find large bundles that might impact etcd
fleet monitor | jq '.diagnostics.largeBundles'
# Check target matching issues
fleet monitor | jq '.diagnostics | {
bundlesNoDeployments: .bundlesWithNoDeployments | length,
gitreposNoBundles: .gitReposWithNoBundles | length,
clusterGroupsNoClusters: .clusterGroupsWithNoClusters | length
}'
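The counts can also be gathered for every diagnostic category at once. This Python sketch assumes that `diagnostics` is an object whose issue categories are lists, as the jq filters above suggest. `diagnostic_counts` is a hypothetical helper, and the sample snapshot is invented.

```python
def diagnostic_counts(snapshot):
    """Map each list-valued diagnostics key to its number of entries."""
    diagnostics = snapshot.get("diagnostics") or {}
    return {
        key: len(value)
        for key, value in diagnostics.items()
        if isinstance(value, list)  # ignore scalar fields, if any exist
    }


sample_snapshot = {
    "diagnostics": {
        "bundlesWithGenerationMismatch": [{"name": "a"}, {"name": "b"}],
        "stuckBundleDeployments": [],
    }
}
print(diagnostic_counts(sample_snapshot))
```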
Common Troubleshooting Scenarios
Scenario 1: Bundle Not Deploying
# Capture current state
fleet monitor | jq . > bundle-status.json
# Check for bundles with generation mismatch
jq '.diagnostics.bundlesWithGenerationMismatch' bundle-status.json
# Check if bundle matched any targets
jq '.diagnostics.bundlesWithNoDeployments' bundle-status.json
# Check bundle-to-gitrepo consistency
jq '.diagnostics.gitRepoBundleInconsistencies' bundle-status.json
Scenario 2: Agent Not Reporting Status
# Check agent health
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'
# See detailed cluster info
fleet monitor | jq '.clusters[] | select(.agentStatus != "ready")'
# Check when agents last checked in
fleet monitor | jq '.clusters[] | {name, lastSeen, agentAge}'
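The `--agent-staleness` option (default 24h) defines when an agent counts as stale. The sketch below illustrates that idea for a single `lastSeen` value. It assumes the field holds an RFC 3339 timestamp and that "stale" means strictly older than the threshold. Neither detail is confirmed here, and `is_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_seen, now, threshold=timedelta(hours=24)):
    """Return True if the last check-in is older than the threshold."""
    # fromisoformat on older Python versions rejects a trailing "Z",
    # so normalize it to an explicit UTC offset first.
    seen = datetime.fromisoformat(last_seen.replace("Z", "+00:00"))
    return now - seen > threshold


now = datetime(2024, 1, 3, tzinfo=timezone.utc)
print(is_stale("2024-01-01T00:00:00Z", now))
```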
Scenario 3: Resources Stuck with Deletion Timestamps
# Find resources with deletion timestamps
fleet monitor | jq '{
bundles: [.bundles[] | select(.deletionTimestamp != null) | .name],
bundledeployments: [.bundledeployments[] | select(.deletionTimestamp != null) | .name]
}'
# Check finalizers preventing deletion
fleet monitor | jq '.bundles[] | select(.deletionTimestamp != null) | {name, finalizers}'
Scenario 4: Commits Not Propagating
# Track commits through the lifecycle
fleet monitor | jq '{
gitrepo: .gitrepos[0].commit[0:8],
bundles: [.bundles[] | {name, commit: .commit[0:8]}],
bundledeployments: [.bundledeployments[] | {name, commit: .commit[0:8]}]
}'
# Find commit mismatches
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies[] |
select(.commitMismatch == true)'
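The comparison behind a commit mismatch can be sketched in Python for offline analysis of a saved snapshot. The `name` and `commit` fields come from the jq filters above. How a bundle is linked to its GitRepo in the snapshot is not shown on this page, so the expected commit is passed in by the caller. `commit_mismatches` is a hypothetical helper.

```python
def commit_mismatches(expected_commit, resources):
    """Return names of resources whose commit differs from the expected one."""
    return [r["name"] for r in resources if r.get("commit") != expected_commit]


# Invented sample data: one bundle is still on an older commit.
bundles = [
    {"name": "app-frontend", "commit": "abc12345"},
    {"name": "app-backend", "commit": "0ld00000"},
]
print(commit_mismatches("abc12345", bundles))
```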
Scenario 5: Performance Issues
# Check bundle sizes
fleet monitor | jq '.diagnostics.largeBundles'
# List bundles by size, largest first
fleet monitor | jq '[.bundles[] | {name, size: .sizeBytes, sizeMB: (.sizeBytes / 1048576 | floor)}] |
sort_by(.size) | reverse'
# Check for missing content resources
fleet monitor | jq '.diagnostics.bundlesWithMissingContent'
Continuous Monitoring Workflow
For long-term monitoring and trend analysis:
# 1. Start continuous collection with watch mode (runs in background)
nohup fleet monitor --watch --interval 60 >> /var/log/fleet-monitor.json 2>&1 &
# 2. Periodically analyze for issues
watch -n 300 "fleet analyze --issues /var/log/fleet-monitor.json | tail -30"
# 3. Generate daily reports
fleet analyze --diff /var/log/fleet-monitor.json > fleet-report-$(date +%Y%m%d).txt
# 4. Log rotation (keep last 7 days)
find /var/log -name "fleet-report-*.txt" -mtime +7 -delete
Integration with Alerting
The monitor command can be integrated with monitoring systems:
# Alert if any issues exist (jq -e exits 0 when the expression is true, i.e. issues found; 1 when false, i.e. healthy)
if fleet monitor | jq -e '
.diagnostics.bundlesWithGenerationMismatch != [] or
.diagnostics.stuckBundleDeployments != [] or
.diagnostics.clustersWithAgentIssues != []
' > /dev/null; then
echo "ALERT: Fleet issues detected!"
fleet monitor | jq '.diagnostics' | mail -s "Fleet Alert" admin@example.com
fi
# Prometheus-style metrics export
fleet monitor | jq -r '
"fleet_stuck_bundles \(.diagnostics.bundlesWithGenerationMismatch | length)",
"fleet_stuck_bundledeployments \(.diagnostics.stuckBundleDeployments | length)",
"fleet_agent_issues \(.diagnostics.clustersWithAgentIssues | length)",
"fleet_large_bundles \(.diagnostics.largeBundles | length)"
'
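The same export can be done in Python, for example when a snapshot has already been loaded for other processing. This is a sketch, not part of the tool. The metric names and diagnostics keys are copied from the jq example above, while `metrics_lines` and `has_issues` are hypothetical helpers. Missing or null keys are counted as 0.

```python
# Metric name -> diagnostics key, copied from the jq example above.
METRICS = {
    "fleet_stuck_bundles": "bundlesWithGenerationMismatch",
    "fleet_stuck_bundledeployments": "stuckBundleDeployments",
    "fleet_agent_issues": "clustersWithAgentIssues",
    "fleet_large_bundles": "largeBundles",
}


def metrics_lines(snapshot):
    """Render one 'name value' line per metric from a parsed snapshot."""
    diagnostics = snapshot.get("diagnostics") or {}
    return [
        f"{metric} {len(diagnostics.get(key) or [])}"
        for metric, key in METRICS.items()
    ]


def has_issues(snapshot):
    """Return True if any of the exported metrics is non-zero."""
    return any(not line.endswith(" 0") for line in metrics_lines(snapshot))


sample_snapshot = {"diagnostics": {"largeBundles": [{"name": "big"}]}}
print("\n".join(metrics_lines(sample_snapshot)))
```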
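Understanding the Diagnostics
Stuck Resources
A resource is considered "stuck" under the following conditions:
Bundles (generation mismatch):
- generation != observedGeneration (the controller has not yet processed the latest spec)
Note: Bundles with a deletion timestamp are tracked separately in diagnostics.bundlesWithDeletionTimestamp.
BundleDeployments (stuck):
- spec.deploymentID != status.appliedDeploymentID (the agent has not yet applied the latest deployment)
- syncGeneration does not match forceSyncGeneration (the forced sync has not been applied)
- Has a deletionTimestamp but still exists
BundleDeployments (sync generation mismatch):
- syncGeneration != forceSyncGeneration when forceSyncGeneration > 0 (tracked separately from stuck BundleDeployments)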
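The conditions above can be written as simple predicates. The sketch below assumes each resource has been flattened into a dictionary with the field names used in this section. That flat layout is an assumption made for illustration and may not match the snapshot's actual structure. Both function names are hypothetical.

```python
def bundle_generation_mismatch(bundle):
    """True when the controller has not processed the latest spec."""
    return bundle.get("generation") != bundle.get("observedGeneration")


def bundledeployment_stuck_reasons(bd):
    """Return the list of reasons a BundleDeployment looks stuck."""
    reasons = []
    if bd.get("deploymentID") != bd.get("appliedDeploymentID"):
        reasons.append("latest deployment not applied")
    force = bd.get("forceSyncGeneration") or 0
    if force > 0 and bd.get("syncGeneration") != force:
        reasons.append("force sync not applied")
    if bd.get("deletionTimestamp") is not None:
        reasons.append("deletion pending")
    return reasons


# Invented sample BundleDeployment for illustration.
sample_bd = {
    "deploymentID": "s-123:abc",
    "appliedDeploymentID": "s-122:abc",
    "forceSyncGeneration": 2,
    "syncGeneration": 1,
    "deletionTimestamp": None,
}
print(bundledeployment_stuck_reasons(sample_bd))
```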
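Performance Considerations
- The monitor command fetches all Fleet resources in the cluster
- For large installations (1000+ bundles), consider:
  - Using --namespace to limit the scope
  - Running less frequently (for example, every 120+ seconds instead of every 60)
  - Monitoring the resource usage of the monitor command itself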
Troubleshooting the Monitor Command
If the monitor command fails:
# Check Fleet controller is running
kubectl get pods -n cattle-fleet-system
# Verify you have proper RBAC permissions
kubectl auth can-i list bundles --all-namespaces
kubectl auth can-i list bundledeployments --all-namespaces
# Check if CRDs are installed
kubectl get crds | grep fleet.cattle.io
# Enable verbose logging
fleet monitor --verbose 2>&1 | tee monitor-debug.log