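Troubleshooting etcd Nodes

This section contains commands and tips for troubleshooting nodes with the `etcd` role.

Checking if the etcd Container is Running

The status of the etcd container should be *Up*. The duration shown after `Up` is how long the container has been running.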

docker ps -a -f=name=etcd$

Example output:

CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS          PORTS     NAMES
d26adbd23643   rancher/mirrored-coreos-etcd:v3.5.7   "/usr/local/bin/etcd…"   30 minutes ago   Up 30 minutes             etcd

etcd Container Logging

The container logs often point to the cause of the problem.

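docker logs etcd

Common log messages and their explanations:

Log: health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused
Explanation: A connection to the address shown on port 2380 cannot be established. Check whether the etcd container is running on the host with that address.

Log: xxx is starting a new election at term x
Explanation: The etcd cluster has lost its quorum and is trying to elect a new leader. This can happen when the majority of the nodes running etcd are down or unreachable.

Log: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}
Explanation: The host firewall is blocking network communication.

Log: rafthttp: request cluster ID mismatch
Explanation: The node whose etcd instance logs `rafthttp: request cluster ID mismatch` is trying to join a cluster that has already been formed with another peer. Remove the node from the cluster, then add it again.

Log: rafthttp: failed to find member
Explanation: The cluster state (`/var/lib/etcd`) contains wrong information, so the node cannot join the cluster. Remove the node from the cluster, clean the state directory, then add the node again.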
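To see at a glance which of these messages dominate a long log, the lines can be counted by category. The following is a minimal sketch, not part of the original procedure: the helper name `summarize_etcd_log` and the category labels are hypothetical, and the patterns only cover the messages listed in this section.

```shell
# Hypothetical helper: count known etcd log messages read from stdin.
# Prints "<count> <category>" lines, sorted by category name.
summarize_etcd_log() {
  awk '
    /health check for peer .* could not connect/ { n["peer-unreachable"]++ }
    /is starting a new election at term/         { n["election"]++ }
    /dial tcp .*: i\/o timeout/                  { n["dial-timeout"]++ }
    /request cluster ID mismatch/                { n["cluster-id-mismatch"]++ }
    /failed to find member/                      { n["member-not-found"]++ }
    END { for (k in n) print n[k], k }
  ' | sort -k2
}

# Illustrative input; the log lines below are samples, not real output.
printf '%s\n' \
  'a1 is starting a new election at term 5' \
  'rafthttp: request cluster ID mismatch' \
  'a1 is starting a new election at term 6' |
  summarize_etcd_log
```

On a node, the input would come from `docker logs etcd 2>&1 | summarize_etcd_log`. The `2>&1` matters because `docker logs` replays the container's stderr stream on stderr.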

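etcd Cluster and Connectivity Checks

The address etcd listens on depends on the address configuration of the host it runs on. If an internal address is configured for that host, the endpoint for `etcdctl` must be specified explicitly. If any of the commands responds with `Error: context deadline exceeded`, the etcd instance is unhealthy (either quorum is lost or the instance has not correctly joined the cluster).

Check etcd Members on All Nodes

The output should list all nodes with the `etcd` role and should be identical on every node.

Command: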

docker exec etcd etcdctl member list
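The endpoint checks in this section derive the `ETCDCTL_ENDPOINTS` value from this member list using `cut`, `sed`, and `paste`. As a minimal sketch of what that pipeline does, the hypothetical helper `client_urls` below reads member list lines on stdin and joins the fifth comma-separated field into a single comma-separated string. It assumes the simple output layout in which the fifth field holds the client URLs; the member lines are made up for illustration.

```shell
# Hypothetical helper: turn `etcdctl member list` output (stdin) into a
# comma-separated list of client URLs suitable for ETCDCTL_ENDPOINTS.
client_urls() {
  cut -d, -f5 |        # fifth field: client URLs (assumed output layout)
    sed -e 's/ //g' |  # strip the space that follows each comma
    paste -sd ',' -    # join all lines with commas
}

# Illustrative input; on a node this would come from:
#   docker exec etcd etcdctl member list
printf '%s\n' \
  'aaaa, started, etcd-node1, https://10.0.0.1:2380, https://10.0.0.1:2379, false' \
  'bbbb, started, etcd-node2, https://10.0.0.2:2380, https://10.0.0.2:2379, false' |
  client_urls
# prints https://10.0.0.1:2379,https://10.0.0.2:2379
```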

Check Endpoint Status

The values for RAFT TERM should be equal on all members, and the RAFT INDEX values should not be far apart.

Command:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table

Example output:

+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 |  3.5.7  |   24 MB |     false |        72 |      66887 |
| https://IP:2379 | 5feed52d940ce4cf |  3.5.7  |   24 MB |      true |        72 |      66887 |
| https://IP:2379 | db6b3bdb559a848d |  3.5.7  |   25 MB |     false |        72 |      66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
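The comparison of RAFT TERM and RAFT INDEX can also be done mechanically. The sketch below is one possible approach, not part of the original procedure: the hypothetical helper `raft_summary` reads the table output on stdin, locates the two columns by their header names (so that additional columns in other etcd versions do not shift the result), and reports how many distinct terms it saw and how far the indexes are spread.

```shell
# Hypothetical helper: summarize `etcdctl endpoint status --write-out table`
# output read from stdin.
raft_summary() {
  awk -F'|' '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    /^\|/ {
      if (!term_col) {
        # Header row: find the columns by name.
        for (i = 2; i < NF; i++) {
          name = trim($i)
          if (name == "RAFT TERM")  term_col = i
          if (name == "RAFT INDEX") index_col = i
        }
        next
      }
      # Data row: track distinct terms and the min/max index.
      term = trim($term_col); idx = trim($index_col) + 0
      if (!(term in seen)) { seen[term] = 1; terms++ }
      if (rows == 0 || idx < min) min = idx
      if (rows == 0 || idx > max) max = idx
      rows++
    }
    END { printf "members=%d distinct_terms=%d index_spread=%d\n", rows, terms, max - min }
  '
}
```

A possible invocation is to pipe the endpoint status command shown above into `raft_summary`. A healthy cluster would be expected to report `distinct_terms=1` and a small `index_spread`; what counts as small depends on the write load.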

Check Endpoint Health

Command:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health

Example output:

https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms

Check Connectivity on Port TCP/2379

Command:

for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
   echo "Validating connection to ${endpoint}/health"
   docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
done

Example output:

Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}

Check Connectivity on Port TCP/2380

Command:

for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
  echo "Validating connection to ${endpoint}/version";
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
done

Example output:

Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}

etcd Alarms

etcd triggers alarms, for instance when it runs out of space.

Command:

docker exec etcd etcdctl alarm list

Example output when the NOSPACE alarm is triggered:

memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE

etcd Space Errors

Related error messages are `etcdserver: mvcc: database space exceeded` or `applying raft message exceeded backend quota`. The `NOSPACE` alarm will be triggered.

Resolutions:

Compact the Keyspace

Command:

rev=$(docker exec etcd etcdctl endpoint status --write-out json | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+')
docker exec etcd etcdctl compact "$rev"

Example output:

compacted revision xxx
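The compaction command first extracts the current revision from the JSON status output. As a small sketch of that extraction step, the hypothetical helper `extract_revision` below reads JSON on stdin and prints the first `revision` value it finds. Taking only the first match is an added safeguard, because the status output contains one entry per queried endpoint. The JSON sample is abbreviated and illustrative.

```shell
# Hypothetical helper: print the first "revision" value found in the JSON
# produced by `etcdctl endpoint status --write-out json` (read from stdin).
extract_revision() {
  grep -Eo '"revision":[0-9]+' |  # isolate the "revision":<number> pairs
    grep -Eo '[0-9]+' |           # keep only the digits
    head -n 1                     # one value, even with several endpoints
}

# Illustrative, abbreviated input.
printf '%s\n' \
  '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":66887,"raft_term":72}}}]' |
  extract_revision
# prints 66887
```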

Defrag All etcd Members

Command:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag

Example output:

Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]

Check Endpoint Status

Command:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table

Example output:

+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 |  e973e4419737125 |  3.5.7  |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | 4a509c997b26c206 |  3.5.7  |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | b217e736575e9dd3 |  3.5.7  |  553 kB |      true |        32 |    2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+

Disarm Alarm

After verifying that the DB size went down after compaction and defragmentation, disarm the alarm so that etcd allows writes again.

Command:

docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list

Example output:

docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list

Configure Log Level

In etcd v3.5 and later, the log level can no longer be changed dynamically.

etcd v3.5 and Later

To configure the log level for etcd, edit the cluster YAML:

services:
  etcd:
    extra_args:
      log-level: "debug"

etcd v3.4 and Earlier

In earlier etcd versions, you can change the log level dynamically through the API. Use the following command to enable debug logging:

docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log

To reset the log level to the default (`INFO`), use the following command.

Command:

docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log

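etcd Content

To investigate the contents of etcd, you can either watch streaming events or query etcd directly. Examples of both follow.

Watch Streaming Events

Command: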

docker exec etcd etcdctl watch --prefix /registry

If you only want to see the affected keys (and not the binary data), append `| grep -a ^/registry` to the command to filter for keys only.

Query etcd Directly

Command:

docker exec etcd etcdctl get /registry --prefix=true --keys-only

You can process the data to get a count summary per key with the following command:

docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
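To make the grouping logic of this pipeline easier to follow, here is the same counting step as a standalone sketch. The hypothetical helper `count_registry_keys` reads key names on stdin, groups them by the third path segment (or by the third and fourth segments for `cattle.io` API groups), and prints the counts in descending order. The key names in the sample are illustrative.

```shell
# Hypothetical helper: summarize etcd key names (stdin) by resource type.
count_registry_keys() {
  grep -v '^$' |   # --keys-only output separates keys with blank lines
    awk -F'/' '
      {
        # /registry/<type>/...  or  /registry/<group>.cattle.io/<type>/...
        if ($3 ~ /cattle\.io/) { h[$3 "/" $4]++ } else { h[$3]++ }
      }
      END { for (k in h) print h[k], k }
    ' |
    sort -nr       # highest count first
}

# Illustrative input.
printf '%s\n' \
  '/registry/pods/default/a' '' \
  '/registry/pods/default/b' '' \
  '/registry/pods/kube-system/c' '' \
  '/registry/management.cattle.io/clusters/local' '' \
  '/registry/management.cattle.io/clusters/c-abc' '' \
  '/registry/secrets/default/s' |
  count_registry_keys
# prints:
# 3 pods
# 2 management.cattle.io/clusters
# 1 secrets
```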

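Replacing Unhealthy etcd Nodes

When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster.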