Argo workflow stuck in pending state because the liveness probe fails?

I am trying to set up a Hyperledger Fabric network on Kubernetes using this.

I am trying to create the channels. I run the command argo submit output.yaml -v, where output.yaml is the output of the command helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml, but with spec.securityContext added, as follows:

...
spec:
  securityContext:
    runAsNonRoot: true
    #runAsUser: 8737 (I commented this out because I don't know my user ID; not sure if this could cause a problem)

  entrypoint: channels
...

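My Argo workflow ends up stuck in the pending state. I say this because I checked my orderer and peer logs and saw no activity in them.

I referred to "Argo sample workflows stuck in the pending state" and started by getting the Argo logs: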

[user@vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"

I tried describing the workflow-controller pod:

[user@vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name:         workflow-controller-57fcfb5df8-qvn74
Namespace:    argo
Priority:     0
Node:         hlf-pool1-8rnem/10.104.0.8
Start Time:   Tue, 25 May 2021 13:44:56 +0800
Labels:       app=workflow-controller
              pod-template-hash=57fcfb5df8
Annotations:  <none>
Status:       Running
IP:           10.244.0.158
IPs:
  IP:           10.244.0.158
Controlled By:  ReplicaSet/workflow-controller-57fcfb5df8
Containers:
  workflow-controller:
    Container ID:  containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
    Image:         argoproj/workflow-controller:v3.0.5
    Image ID:      docker.io/argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v3.0.5
      --namespaced
    State:          Running
      Started:      Mon, 31 May 2021 13:08:11 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 31 May 2021 12:59:05 +0800
      Finished:     Mon, 31 May 2021 13:03:04 +0800
    Ready:          True
    Restart Count:  1333
    Liveness:       http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
    Environment:
      LEADER_ELECTION_IDENTITY:  workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  argo-token-hflpb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  argo-token-hflpb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                        From     Message
  ----     ------     ----                       ----     -------
  Warning  Unhealthy  7m44s (x3994 over 5d23h)   kubelet  Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
  Warning  BackOff    3m46s (x16075 over 5d22h)  kubelet  Back-off restarting failed container

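Could this failure be the reason my Argo workflow is stuck in the pending state? How should I go about troubleshooting it?

EDIT: Output of kubectl get pods --all-namespaces (FYI, these are running on Digital Ocean):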

[user@vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY   STATUS             RESTARTS   AGE
argo          argo-server-5695555c55-867bx            1/1     Running            1          6d19h
argo          minio-58977b4b48-r2m2h                  1/1     Running            0          6d19h
argo          postgres-6b5c55f477-7swpp               1/1     Running            0          6d19h
argo          workflow-controller-57fcfb5df8-qvn74    0/1     CrashLoopBackOff   1522       6d19h
default       hlf-ca--atlantis-58bbd79d9d-x4mz4       1/1     Running            0          21h
default       hlf-ca--karga-547dbfddc8-7w6b5          1/1     Running            0          21h
default       hlf-ca--nevergreen-7ffb98484c-nlg4j     1/1     Running            0          21h
default       hlf-orderer--groeifabriek--orderer0-0   1/1     Running            0          21h
default       hlf-peer--atlantis--peer0-0             2/2     Running            0          21h
default       hlf-peer--karga--peer0-0                2/2     Running            0          21h
default       hlf-peer--nevergreen--peer0-0           2/2     Running            0          21h
kube-system   cilium-2kjfz                            1/1     Running            3          26d
kube-system   cilium-operator-84bdd6f7b6-kp9vb        1/1     Running            1          6d20h
kube-system   cilium-operator-84bdd6f7b6-pkkf9        1/1     Running            1          6d20h
kube-system   coredns-55ff57f948-jb5jc                1/1     Running            0          6d20h
kube-system   coredns-55ff57f948-r2q4g                1/1     Running            0          6d20h
kube-system   csi-do-node-4r9gj                       2/2     Running            0          26d
kube-system   do-node-agent-sbc8b                     1/1     Running            0          26d
kube-system   kube-proxy-hpsc7                        1/1     Running            0          26d

Solution

I'll answer your question only partially, since I can't guarantee that everything else will work, but I do know how to fix the problem with the workflow-controller pod.

Answer

In short, you need to update the Argo workflow-controller to a newer version (at least 3.0.6, ideally 3.0.7, which is available), because this looks like a bug in version 3.0.5.

How I got there

First I installed Argo version 3.0.5 (this is not production-ready).

I ended up with the Argo workflow-controller pod restarting:

kubectl get pods -n argo
NAME                                   READY   STATUS    RESTARTS   AGE
argo-server-645cf8bc47-sbnqv           1/1     Running   0          9m7s
workflow-controller-768565d958-9lftf   1/1     Running   2          9m7s
curl-pod                               1/1     Running   0          6m47s

The same liveness probe failed:

kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name:         workflow-controller-768565d958-9lftf
Namespace:    argo
Priority:     0
Node:         worker1/10.186.0.3
Start Time:   Tue, 01 Jun 2021 14:25:00 +0000
Labels:       app=workflow-controller
              pod-template-hash=768565d958
Annotations:  <none>
Status:       Running
IP:           10.244.1.151
IPs:
  IP:           10.244.1.151
Controlled By:  ReplicaSet/workflow-controller-768565d958
Containers:
  workflow-controller:
    Container ID:  docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
    Image:         argoproj/workflow-controller:v3.0.5
    Image ID:      docker-pullable://argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v3.0.5
    State:          Running
      Started:      Tue, 01 Jun 2021 14:33:00 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 01 Jun 2021 14:29:00 +0000
      Finished:     Tue, 01 Jun 2021 14:33:00 +0000
    Ready:          True
    Restart Count:  2
    Liveness:       http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
    Environment:
      LEADER_ELECTION_IDENTITY:  workflow-controller-768565d958-9lftf (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-ts9zf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  8m57s                default-scheduler  Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
  Normal   Pulled     57s (x3 over 8m56s)  kubelet            Container image "argoproj/workflow-controller:v3.0.5" already present on machine
  Normal   Created    57s (x3 over 8m56s)  kubelet            Created container workflow-controller
  Normal   Started    57s (x3 over 8m56s)  kubelet            Started container workflow-controller
  Warning  Unhealthy  57s (x6 over 6m57s)  kubelet            Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
  Normal   Killing    57s (x2 over 4m57s)  kubelet            Container workflow-controller failed liveness probe, will be restarted

I also tested this endpoint from a pod in the same namespace, based on the curlimages/curl image, which has curl built in.

Here is the pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  namespace: argo
  labels:
    app: curl
  name: curl-pod
spec:
  containers:
  - image: curlimages/curl
    name: curl-pod
    # Keep the container alive so that kubectl exec can be used.
    # sleep needs a duration; the 30 seconds here is an arbitrary choice.
    command: ['sh', '-c', 'while true ; do sleep 30 ; done']
  dnsPolicy: ClusterFirst
  restartPolicy: Always

This resulted in the same error:

kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz
curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused

The next step was to try newer versions (3.10rc and then 3.0.7). It worked! The events of the new pod show no probe failures:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  27m   default-scheduler  Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
  Normal  Pulling    27m   kubelet            Pulling image "argoproj/workflow-controller:v3.0.7"
  Normal  Pulled     27m   kubelet            Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
  Normal  Created    27m   kubelet            Created container workflow-controller
  Normal  Started    27m   kubelet            Started container workflow-controller

I also checked the endpoint again with curl.
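The curl check against /healthz can be repeated after any upgrade without copying the pod IP by hand. This is a sketch under a few assumptions: the curl-pod from pod.yaml is still running in the argo namespace, and the controller pod carries the app=workflow-controller label seen in the describe output. The names check_healthz, KUBECTL and NAMESPACE are made up for illustration.

```shell
#!/bin/sh
# Sketch: probe the controller's liveness endpoint from inside the cluster.
# Assumption: a pod named "curl-pod" with curl installed exists in $NAMESPACE.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-argo}"

check_healthz() {
  # The pod IP changes with every new controller pod, so look it up each time.
  ip=$($KUBECTL -n "$NAMESPACE" get pod -l app=workflow-controller \
        -o jsonpath='{.items[0].status.podIP}') || return 1
  # Hit the same URL the kubelet probes and print only the HTTP status code.
  $KUBECTL -n "$NAMESPACE" exec curl-pod -- \
    curl -sS -o /dev/null -w '%{http_code}\n' "http://$ip:6060/healthz"
}
```

Calling check_healthz prints an HTTP status code when the connection succeeds; against the failing 3.0.5 controller, curl reports the connection refused error shown earlier.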