Microsoft Orleans in a Kubernetes StatefulSet pod crashes after several restarts

Microsoft Orleans v3.4.3 with Consul clustering, running in K8s

siloBuilder
     .UseConsulClustering(opt =>
     {
         opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
         opt.AclClientToken = AppConfig.Orleans.AclClientToken;
     })
     .Configure<ClusterOptions>(options =>
     {
         options.ClusterId = AppConfig.Orleans.ClusterID;
         options.ServiceId = AppConfig.Orleans.ServiceID;
     })
     .UseKubernetesHosting();

I configured the labels and environment variables for my pod according to the docs.

          - name: ORLEANS_SERVICE_ID # required by Orleans
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/serviceId']
          - name: ORLEANS_CLUSTER_ID # required by Orleans
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/clusterId']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP

It is a StatefulSet with only 1 pod, used for testing. On the initial start it runs fine. However, every time I restart the pod, a new entry is created in Consul.

It then crashes on subsequent starts.

The log says:

System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
 ---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host,CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host,CancellationToken token)
   at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
   --- End of inner exception stack trace ---

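I have to delete all the entries in Consul and then restart the pod, after which everything works.

The POD_NAME of a StatefulSet pod stays the same, so is it correct that each pod restart creates a new entry in Consul?

What could be the cause?

Thanks in advance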


Update: after several rounds of crashing and restarting, it finally stopped crashing. In the log I see the following message:

ProcessTableUpdate (called from DeclareDead) membership table: 5 silos, 1 are Active, 4 are Dead, Version=<31, 28123>. All silos: [SiloAddress=S10.18.123.244:11111:361163684 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.200:11111:361158057 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.210:11111:361161905 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.217:11111:361157424 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.244:11111:361163558 SiloName=ubs-job-dev-0 Status=Dead]

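The SiloName never changes and there is only one pod in the StatefulSet, yet it sees 5 silos, 4 of which are dead. It seems that every new pod is treated as a new silo, even though the pod name has not changed. Is this expected?

Solution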

(Failed to get ping responses from 1 of 1 active silos. 
Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. 
Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])

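It looks like your membership table (in Consul) thinks you already have active silos. When your "new" silo comes up and reads the membership table, it sees those active silos listed under their IP addresses.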

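To keep the cluster correct, a newly joining silo must be able to communicate with the existing silos. If the membership table is wrong, however (it lists IP addresses whose status is 3/Active for silos that are gone), you run into a problem: the new silo tries to ping the active silos, cannot reach them, and therefore fails to join the cluster and shuts itself down quickly.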

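You have a couple of solutions:

  • Clear the Consul table when you deploy the solution.
  • Change the deployment ID on every deployment.

You have evidently found the first solution (clearing the table).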
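
The second option could look roughly like the fragment below. This is only a sketch under assumptions: it presumes that the "deployment ID" corresponds to the ORLEANS_CLUSTER_ID variable from the question, which is read from the pod's orleans/clusterId label, and the label values shown (my-service, my-service-build-42) are hypothetical placeholders that your deployment pipeline would generate.

```yaml
# Fragment of a StatefulSet manifest. All values are hypothetical.
spec:
  template:
    metadata:
      labels:
        # Stays the same across deployments.
        orleans/serviceId: my-service
        # Assumed to change with each deployment, so the new silo does not
        # read the stale "Active" rows left behind under the old cluster ID.
        orleans/clusterId: my-service-build-42
```

Since a StatefulSet's selector cannot be changed after creation, this only works if orleans/clusterId is not part of spec.selector.matchLabels. Note also that a plain pod restart keeps the same label value, so this approach helps between deployments rather than between restarts.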

silo lifecycle
