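Fleet Monitor

An advanced diagnostic tool for troubleshooting SUSE® Rancher Prime Continuous Delivery deployments.

Synopsis

Captures a snapshot of all relevant resources and runs automated diagnostics to extract insights from the SUSE® Rancher Prime Continuous Delivery GitOps lifecycle. The command identifies why bundles get stuck during the targeting and deployment phases and provides actionable information about the health of a SUSE® Rancher Prime Continuous Delivery installation.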

fleet monitor [flags]

Options

  -n, --namespace string          Namespace to monitor (default: all namespaces)
      --system-namespace string   Fleet system namespace (default: cattle-fleet-system)
      --agent-staleness duration  Consider agent stale after this duration (default: 24h)
      --watch                     Watch for changes and output continuously
      --interval int              Interval in seconds between checks when watching (default: 60)
  -h, --help                      help for monitor

Quick Start

The monitor command outputs compact JSON, one snapshot per line. Use jq to format it for readability.

# Single snapshot with formatted output
fleet monitor | jq

# Single snapshot with human-readable analysis
fleet monitor | fleet analyze

# Continuous monitoring with built-in watch mode (every 60 seconds)
fleet monitor --watch --interval 60 >> monitor.json

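What It Detects

The monitor command runs comprehensive diagnostics to detect the following:

Resource Lifecycle Issues

Bundles with generation mismatch: bundles that are not progressing through the lifecycle (generation != observedGeneration)
Stuck BundleDeployments: BundleDeployments where the agent has not applied a new deploymentID
Multiple finalizers: resources with more than one finalizer (this indicates a bug; only Content resources should have multiple finalizers, for reference counting, and only in SUSE® Rancher Prime Continuous Delivery v0.11.1 through v0.14.x)
Orphaned resources: resources with a deletion timestamp that cannot be garbage collected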
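
As a rough cross-check of the lifecycle diagnostics, a snapshot can also be filtered directly with jq. The sketch below assumes each entry in `.bundles[]` carries `name`, `finalizers`, and `deletionTimestamp` fields, as the scenario queries later on this page do; the function name `lifecycle_suspects` is illustrative and not part of the CLI.

```shell
# Reads one monitor snapshot on stdin and lists bundles that look suspicious:
# more than one finalizer, or a deletion timestamp that is still set.
# Assumption: .bundles[] entries expose name, finalizers, deletionTimestamp.
lifecycle_suspects() {
  jq -c '{
    multipleFinalizers: [.bundles[]? | select((.finalizers // []) | length > 1) | .name],
    pendingDeletion:    [.bundles[]? | select(.deletionTimestamp != null) | .name]
  }'
}

# Usage (requires the fleet CLI and jq):
#   fleet monitor | lifecycle_suspects
```

This only inspects bundles; the same filter could be pointed at `.bundledeployments[]` if those entries expose the same fields.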

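Data Consistency Issues

API time travel: the Kubernetes API server returns stale cached data (detected by fetching resources multiple times)
Commit hash mismatch: Bundles or BundleDeployments that have not been updated to the GitRepo's latest commit
ForceSyncGeneration drift: Bundles that do not reflect the GitRepo's forceSyncGeneration value
UID mismatch: Secrets with owner references to deleted or recreated resources
DeploymentID mismatch: BundleDeployments where spec.deploymentID != status.appliedDeploymentID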

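Target Matching Issues

Bundles with no deployments: bundles that were created but have no cluster matching their target selectors
GitRepos with no bundles: GitRepos that have not created any bundles (possibly because of path, target, or processing errors)
ClusterGroups with no clusters: ClusterGroups whose selectors match no clusters
Orphaned BundleDeployments: BundleDeployments whose parent Bundle has been deleted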

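Performance Issues

Large bundles: bundles larger than 1 MB, which can affect etcd performance
Missing Content resources: bundles that contain `resourcesSHA256Sum` but have no corresponding Content resource
High resource count: a large number of bundle resources, which can put pressure on etcd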

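Agent and Cluster Issues

Agent not ready: clusters whose agent status is not ready
Missing last seen: clusters with no agent heartbeat timestamp
Stale last seen: clusters whose agent has not checked in recently (default: 24 hours, configurable with `--agent-staleness`)
Missing agent bundle: cluster namespaces without the expected agent BundleDeployment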
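
To illustrate the staleness rule, the sketch below recomputes it from a snapshot with a caller-chosen threshold. It assumes `.clusters[]` entries expose `name` and `lastSeen` (as in the agent scenario later on this page) and that `lastSeen` is an RFC 3339 UTC timestamp without fractional seconds; `stale_clusters` is a hypothetical helper name, and the monitor's own implementation may differ.

```shell
# Lists clusters whose last agent check-in is older than a threshold.
# Assumptions: .clusters[] entries expose name and lastSeen, and lastSeen
# looks like 2024-01-01T00:00:00Z (UTC, no fractional seconds).
#   $1 = threshold in seconds (default 86400, i.e. 24h)
#   $2 = reference time as a Unix epoch (default: current time)
stale_clusters() {
  jq -r --argjson max "${1:-86400}" --argjson now "${2:-$(date +%s)}" '
    .clusters[]?
    | select(.lastSeen != null)
    | select(($now - (.lastSeen | fromdateiso8601)) > $max)
    | .name'
}

# Usage: clusters not seen in the last hour
#   fleet monitor | stale_clusters 3600
```

Clusters with no `lastSeen` value are skipped here on purpose, since the monitor reports them separately as missing rather than stale.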

Ownership Chain Issues

Broken ownership: Bundles without a GitRepo owner, or BundleDeployments without a Bundle owner
Invalid Secret owners: bundle Secrets with invalid or missing owner references

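Generation/Observation Mismatches

GitRepo generation drift: GitRepo generation != observedGeneration (the controller is not processing updates)
Bundle generation drift: Bundle generation != observedGeneration (the controller is not processing updates)
BundleDeployment syncGeneration drift: BundleDeployment syncGeneration != forceSyncGeneration (the agent is not applying the forced sync)
Stale Content generation: Content resources with outdated generation values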

Usage Examples

Basic Monitoring

# Single snapshot with pretty formatting
fleet monitor | jq

# Monitor specific namespace
fleet monitor -n fleet-local | jq

# Check fleet-default namespace (the default workspace for downstream clusters)
fleet monitor -n fleet-default | jq

Continuous Monitoring

# Collect snapshots every 60 seconds using watch mode
fleet monitor --watch --interval 60 >> monitor.json

# Or monitor with a shorter interval (every 30 seconds)
fleet monitor --watch --interval 30 >> monitor.json
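
Because watch mode appends one snapshot per line, the resulting file can be summarized per snapshot. The following is a minimal sketch, assuming the three `.diagnostics` arrays used elsewhere on this page are present in every snapshot; `issue_trend` is an illustrative name, not a CLI feature.

```shell
# Prints one tab-separated line per snapshot in a JSON-lines file:
# position in the file, then the number of bundles with a generation
# mismatch, stuck BundleDeployments, and clusters with agent issues.
# jq -s loads the whole file into memory, so trim very large files first.
issue_trend() {
  jq -rs '
    to_entries[]
    | [ .key + 1,
        (.value.diagnostics.bundlesWithGenerationMismatch // [] | length),
        (.value.diagnostics.stuckBundleDeployments // [] | length),
        (.value.diagnostics.clustersWithAgentIssues // [] | length) ]
    | @tsv' "$1"
}

# Usage:
#   issue_trend monitor.json
```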

Targeted Diagnostics

# Check for stuck resources
fleet monitor | jq '.diagnostics | {
  bundlesWithGenerationMismatch: .bundlesWithGenerationMismatch | length,
  stuckBundleDeployments: .stuckBundleDeployments | length
}'

# Find bundles with old commits
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies'

# Check agent health across all clusters
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'

# Find large bundles that might impact etcd
fleet monitor | jq '.diagnostics.largeBundles'

# Check target matching issues
fleet monitor | jq '.diagnostics | {
  bundlesNoDeployments: .bundlesWithNoDeployments | length,
  gitreposNoBundles: .gitReposWithNoBundles | length,
  clusterGroupsNoClusters: .clusterGroupsWithNoClusters | length
}'

Comparing States

# Before making changes
fleet monitor > before.json

# Make changes to GitRepo, bundles, etc.
kubectl edit gitrepo my-repo

# After changes
fleet monitor > after.json
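
One way to compare the two files is to reduce each snapshot to a sorted list of diagnostic counts and diff the results. This is a sketch rather than a built-in feature: `diag_counts` is a hypothetical helper, and it assumes most entries under `.diagnostics` are arrays (non-array values such as plain counts are skipped).

```shell
# Prints "<diagnostic> <count>" for every array under .diagnostics, sorted
# by name, so that two snapshots can be compared with a plain diff.
diag_counts() {
  jq -r '
    .diagnostics
    | to_entries
    | map(select(.value | type == "array"))
    | sort_by(.key)
    | .[]
    | "\(.key) \(.value | length)"' "$1"
}

# Usage:
#   diag_counts before.json > before.counts
#   diag_counts after.json  > after.counts
#   diff before.counts after.counts
```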

Common Troubleshooting Scenarios

Scenario 1: Bundle Not Deploying

# Capture current state
fleet monitor | jq > bundle-status.json

# Check for bundles with generation mismatch
jq '.diagnostics.bundlesWithGenerationMismatch' bundle-status.json

# Check if bundle matched any targets
jq '.diagnostics.bundlesWithNoDeployments' bundle-status.json

# Check bundle-to-gitrepo consistency
jq '.diagnostics.gitRepoBundleInconsistencies' bundle-status.json

Scenario 2: Agent Not Reporting Status

# Check agent health
fleet monitor | jq '.diagnostics.clustersWithAgentIssues'

# See detailed cluster info
fleet monitor | jq '.clusters[] | select(.agentStatus != "ready")'

# Check when agents last checked in
fleet monitor | jq '.clusters[] | {name, lastSeen, agentAge}'

Scenario 3: Resources Stuck with Deletion Timestamps

# Find resources with deletion timestamps
fleet monitor | jq '{
  bundles: [.bundles[] | select(.deletionTimestamp != null) | .name],
  bundledeployments: [.bundledeployments[] | select(.deletionTimestamp != null) | .name]
}'

# Check finalizers preventing deletion
fleet monitor | jq '.bundles[] | select(.deletionTimestamp != null) | {name, finalizers}'

Scenario 4: Commits Not Propagating

# Track commits through the lifecycle
fleet monitor | jq '{
  gitrepo: .gitrepos[0].commit[0:8],
  bundles: [.bundles[] | {name, commit: .commit[0:8]}],
  bundledeployments: [.bundledeployments[] | {name, commit: .commit[0:8]}]
}'

# Find commit mismatches
fleet monitor | jq '.diagnostics.gitRepoBundleInconsistencies[] |
  select(.commitMismatch == true)'

Scenario 5: Performance Issues

# Check bundle sizes
fleet monitor | jq '.diagnostics.largeBundles'

# List bundles by size, largest first
fleet monitor | jq '[.bundles[] | {name, size: .sizeBytes, sizeMB: (.sizeBytes / 1048576 | floor)}] |
  sort_by(.size) | reverse'

# Check for missing content resources
fleet monitor | jq '.diagnostics.bundlesWithMissingContent'

Continuous Monitoring Workflow

For long-term monitoring and trend analysis:

# 1. Start continuous collection with watch mode (runs in background);
#    stderr goes to a separate file so the JSON-lines file stays parseable
nohup fleet monitor --watch --interval 60 >> /var/log/fleet-monitor.json 2>> /var/log/fleet-monitor.err &

# 2. Periodically analyze for issues
watch -n 300 "fleet analyze --issues /var/log/fleet-monitor.json | tail -30"

# 3. Generate daily reports
fleet analyze --diff /var/log/fleet-monitor.json > fleet-report-$(date +%Y%m%d).txt

# 4. Log rotation (keep last 7 days)
find /var/log -name "fleet-report-*.txt" -mtime +7 -delete

Integration with Alerting

The monitor command can be integrated with monitoring systems:

# jq -e exits 0 when the expression is true, so this branch runs when issues are found
if fleet monitor | jq -e '
  .diagnostics.bundlesWithGenerationMismatch != [] or
  .diagnostics.stuckBundleDeployments != [] or
  .diagnostics.clustersWithAgentIssues != []
' > /dev/null; then
  echo "ALERT: Fleet issues detected!"
  fleet monitor | jq '.diagnostics' | mail -s "Fleet Alert" admin@example.com
fi
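
For schedulers or probes that expect exit code 0 to mean healthy, the check can be wrapped the other way around. The sketch below relies on `jq -e` exiting 0 when the last output is neither false nor null; `fleet_is_healthy` is a hypothetical name, and missing diagnostic keys are treated as empty.

```shell
# Reads one monitor snapshot on stdin.
# Exit status 0 = no issues in the three checked diagnostics, non-zero otherwise.
fleet_is_healthy() {
  jq -e '
    (.diagnostics.bundlesWithGenerationMismatch // [] | length) == 0 and
    (.diagnostics.stuckBundleDeployments // [] | length) == 0 and
    (.diagnostics.clustersWithAgentIssues // [] | length) == 0
  ' > /dev/null
}

# Usage:
#   fleet monitor | fleet_is_healthy || echo "ALERT: Fleet issues detected!"
```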

# Prometheus-style metrics export
fleet monitor | jq -r '
  "fleet_stuck_bundles \(.diagnostics.bundlesWithGenerationMismatch | length)",
  "fleet_stuck_bundledeployments \(.diagnostics.stuckBundleDeployments | length)",
  "fleet_agent_issues \(.diagnostics.clustersWithAgentIssues | length)",
  "fleet_large_bundles \(.diagnostics.largeBundles | length)"
'

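Understanding the Diagnostics

Stuck Resources

A resource is considered "stuck" when it meets the following conditions:

Bundles (generation mismatch):

generation != observedGeneration (the controller has not processed the latest spec)

Note: Bundles with a deletion timestamp are tracked separately, as a count, in `diagnostics.bundlesWithDeletionTimestamp`.

BundleDeployments (stuck):

spec.deploymentID != status.appliedDeploymentID (the agent has not applied the latest deployment)
`syncGeneration` does not match `forceSyncGeneration` (the forced sync has not been applied)
`deletionTimestamp` is set, but the resource still exists

BundleDeployments (syncGeneration mismatch):

syncGeneration != forceSyncGeneration while forceSyncGeneration > 0 (tracked separately from stuck BundleDeployments)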
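
The first condition in each group can be cross-checked against the cluster directly with kubectl and jq. This is a sketch under assumptions: it presumes the Bundle status exposes `observedGeneration` next to the standard `metadata.generation`, and the function names are illustrative. It does not reproduce the monitor's full logic (no sync-generation or deletion checks).

```shell
# Reads `kubectl get bundles -A -o json` on stdin and prints namespace/name
# for every Bundle whose spec generation has not been observed yet.
lagging_bundles() {
  jq -r '
    .items[]
    | select(.metadata.generation != .status.observedGeneration)
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Reads `kubectl get bundledeployments -A -o json` on stdin and prints
# namespace/name for every BundleDeployment whose applied deployment ID
# differs from the requested one.
unapplied_bundledeployments() {
  jq -r '
    .items[]
    | select(.spec.deploymentID != .status.appliedDeploymentID)
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Usage:
#   kubectl get bundles -A -o json | lagging_bundles
#   kubectl get bundledeployments -A -o json | unapplied_bundledeployments
```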

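API Consistency Checks

The monitor fetches the same resource several times to detect "time travel": cases where the Kubernetes API server returns different resource versions because of a stale cache. This matters because stale data can make a bundle look stuck when it is actually progressing.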
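
The idea can be illustrated by reading one resource twice and comparing the versions. This sketch is not the monitor's implementation. It assumes resourceVersions are decimal integers, which is typical for etcd-backed clusters but not guaranteed by the Kubernetes API contract; `went_back_in_time`, `my-bundle`, and the namespace are placeholders.

```shell
# Returns 0 (true) when the second resourceVersion is lower than the first,
# which suggests the second read was served from a stale cache.
# Assumption: both arguments are decimal integers.
went_back_in_time() {
  [ "$2" -lt "$1" ]
}

# Usage (my-bundle and fleet-default are placeholder names):
#   first=$(kubectl get bundle my-bundle -n fleet-default -o jsonpath='{.metadata.resourceVersion}')
#   second=$(kubectl get bundle my-bundle -n fleet-default -o jsonpath='{.metadata.resourceVersion}')
#   if went_back_in_time "$first" "$second"; then
#     echo "stale read: $first -> $second"
#   fi
```

A single pair of reads will rarely catch the problem; repeating the comparison several times makes a stale response more likely to show up.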

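Commit Tracking

The monitor tracks Git commit hashes through the entire lifecycle:

1. The GitRepo fetches the latest commit.
2. The Bundle should reflect that commit.
3. The BundleDeployment should match the Bundle's commit.
4. Bundle Secrets store the commit in an annotation.

A mismatch shows where the sync process is failing.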

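Performance Considerations

The monitor fetches every Fleet resource in the cluster.
For large installations (more than 1,000 bundles), consider the following:
Use --namespace to limit the scope.
Run it less frequently (for example, every 120 seconds instead of every 60).
Monitor the resource usage of the monitor command itself.

Troubleshooting the Monitor Command

If the monitor command fails: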

# Check Fleet controller is running
kubectl get pods -n cattle-fleet-system

# Verify you have proper RBAC permissions
kubectl auth can-i list bundles --all-namespaces
kubectl auth can-i list bundledeployments --all-namespaces

# Check if CRDs are installed
kubectl get crds | grep fleet.cattle.io

# Enable verbose logging
fleet monitor --verbose 2>&1 | tee monitor-debug.log