如何解决kubernetes 上的 Rabbitmq pod 进入 pod 初始化状态
我在 Kubernetes 上运行 3 个节点的 rabbitmq 集群。 Kubernetes 集群在 AWS Spot 实例上运行,并且其中一个 Kubernetes 节点以某种方式意外终止,其中一个 Rabbitmq pod 正在运行。现在 pod git 调度到另一个节点上,从那时起我的 rabbitmq pod 就停留在 pod 初始化状态。
Kubernetes 事件显示“FailedPostStartHook”。
日志:
9m46s Warning FailedPostStartHook pod/rabbitmq-0 Exec lifecycle hook ([/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode": "automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_devops(c96c1a6e-bf9a-450d-828d-ed0e8a0ad949)" Failed - error: command '/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode": "automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution,TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI,clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
* If target node is configured to use long node names,don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local']
Kubernetes statefulset 清单:
apiVersion: apps/v1
kind: StatefulSet
Metadata:
name: rabbitmq
namespace: devops
spec:
podManagementPolicy: OrderedReady
replicas: 3
revisionHistoryLimit: 3
selector:
matchLabels:
app: rabbitmq
serviceName: rabbitmq-service
template:
Metadata:
annotations:
labels:
app: rabbitmq
name: rabbitmq
spec:
containers:
- env:
- name: HOSTNAME
valueFrom:
fieldRef:
apiVersion: v1
fieldpath: Metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldpath: Metadata.namespace
- name: RABBITMQ_USE_LONGNAME
value: "true"
- name: RABBITMQ_BASIC_AUTH
valueFrom:
secretKeyRef:
key: password
name: rabbitmq
- name: RABBITMQ_NODENAME
value: rabbit@$(HOSTNAME).rabbitmq-service.$(NAMESPACE).svc.cluster.local
- name: K8S_SERVICE_NAME
value: rabbitmq-service
- name: RABBITMQ_DEFAULT_USER
value: admin
- name: RABBITMQ_DEFAULT_PASS
valueFrom:
secretKeyRef:
key: password
name: rabbitmq
- name: RABBITMQ_ERLANG_COOKIE
value: some-cookie
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldpath: Metadata.name
image: rabbitmq:3.8.1-management-alpine
imagePullPolicy: IfNotPresent
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- |
until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode": "automatic"}'
livenessProbe:
exec:
command:
- rabbitmqctl
- status
failureThreshold: 3
initialDelaySeconds: 60
periodSeconds: 10
successthreshold: 1
timeoutSeconds: 30
name: rabbitmq
ports:
- containerPort: 4369
protocol: TCP
- containerPort: 5672
protocol: TCP
- containerPort: 5671
protocol: TCP
- containerPort: 25672
protocol: TCP
- containerPort: 15672
protocol: TCP
readinessProbe:
exec:
command:
- rabbitmqctl
- status
failureThreshold: 3
initialDelaySeconds: 20
periodSeconds: 10
successthreshold: 1
timeoutSeconds: 30
resources:
limits:
cpu: "2"
memory: 3Gi
requests:
cpu: "1"
memory: 2Gi
volumeMounts:
- mountPath: /var/lib/rabbitmq/
name: rabbitmq-data
- mountPath: /etc/rabbitmq
name: config
dnsPolicy: ClusterFirst
initContainers:
- command:
- /bin/bash
- -euc
- |
rm -f /var/lib/rabbitmq/.erlang.cookie
cp /rabbitmqconfig/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
cp /rabbitmqconfig/enabled_plugins /etc/rabbitmq/enabled_plugins
image: rabbitmq:3.8.1-management-alpine
imagePullPolicy: Always
name: copy-rabbitmq-config
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /rabbitmqconfig
name: rabbitmq-configmap
- mountPath: /etc/rabbitmq
name: config
- mountPath: /var/lib/rabbitmq
name: rabbitmq-data
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: rabbitmq
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
volumes:
- configMap:
defaultMode: 420
items:
- key: rabbitmq.conf
path: rabbitmq.conf
- key: enabled_plugins
path: enabled_plugins
name: rabbitmq-configmap
name: rabbitmq-configmap
- emptyDir: {}
name: config
updateStrategy:
type: RollingUpdate
volumeClaimTemplates:
- apiVersion: v1
kind: PersistentVolumeClaim
Metadata:
creationTimestamp: null
name: rabbitmq-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageClassName: gp2
volumeMode: Filesystem
我尝试过的事情:
- 登录到被击中的 Pod 并执行(这个命令只是被击中了,没有任何反应)
rabbitmqctl stop_app
rabbitmqctl 重置
- 登录到被攻击的 Pod 并执行
rabbitmqctl force_boot
- 登录到被攻击的 Pod 并执行
rm /var/log/rabbitmq/*
以上都没有帮助。
请注意,其他 2 个 rabbitmq 节点运行良好,并提供流量并显示故障节点:
rabbitmq-2 rabbitmq 2021-07-04 12:19:07.233 [info] <0.490.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
rabbitmq-1 rabbitmq 2021-07-04 12:19:07.208 [info] <0.494.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
解决方法
运行 statefulset 命令的 rollout restart 对我有用。
kubectl rollout restart statefulset rabbitmq -n devops
执行此命令后,rabbitmq 集群启动并运行,所有三个节点都没有任何问题地加入集群。
完成此操作后,需要重新启动连接到此 rabbitmq 集群的应用程序。
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。