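How to fix Horizontal Pod Autoscaler failing on Google Kubernetes Engine: Unable to read all metrics
I am trying to set up a Horizontal Pod Autoscaler to automatically scale my API server pods up and down based on CPU usage.
I currently have 12 pods running for my API, but they are using roughly 0% CPU.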
kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
api-server-deployment-578f8d8649-4cbtc     2/2     Running   2          12h
api-server-deployment-578f8d8649-8cv77     2/2     Running   2          12h
api-server-deployment-578f8d8649-c8tv2     2/2     Running   1          12h
api-server-deployment-578f8d8649-d8c6r     2/2     Running   2          12h
api-server-deployment-578f8d8649-lvbgn     2/2     Running   1          12h
api-server-deployment-578f8d8649-lzjmj     2/2     Running   2          12h
api-server-deployment-578f8d8649-nztck     2/2     Running   1          12h
api-server-deployment-578f8d8649-q25xb     2/2     Running   2          12h
api-server-deployment-578f8d8649-tx75t     2/2     Running   1          12h
api-server-deployment-578f8d8649-wbzzh     2/2     Running   2          12h
api-server-deployment-578f8d8649-wtddv     2/2     Running   1          12h
api-server-deployment-578f8d8649-x95gq     2/2     Running   2          12h
model-server-deployment-76d466dffc-4g2nd   1/1     Running   0          23h
model-server-deployment-76d466dffc-9pqw5   1/1     Running   0          23h
model-server-deployment-76d466dffc-d29fx   1/1     Running   0          23h
model-server-deployment-76d466dffc-frrgn   1/1     Running   0          23h
model-server-deployment-76d466dffc-sfh45   1/1     Running   0          23h
model-server-deployment-76d466dffc-w2hqj   1/1     Running   0          23h
My api_hpa.yaml looks like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server-deployment
  minReplicas: 4
  maxReplicas: 12
  targetCPUUtilizationPercentage: 50
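It has now been 24 hours, and the HPA still has not scaled my pods down to 4, even though they show no CPU usage.
When I look at the GKE Deployment details dashboard, I see the warning Unable to read all metrics.
Is this what prevents the autoscaler from scaling down my pods?
How can I fix it?
As I understand it, GKE runs a metrics server automatically: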
kubectl get deployment --namespace=kube-system
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
event-exporter-gke                         1/1     1            1           18d
kube-dns                                   2/2     2            2           18d
kube-dns-autoscaler                        1/1     1            1           18d
l7-default-backend                         1/1     1            1           18d
metrics-server-v0.3.6                      1/1     1            1           18d
stackdriver-metadata-agent-cluster-level   1/1     1            1           18d
Here is the configuration of that metrics server:
Name:                   metrics-server-v0.3.6
Namespace:              kube-system
CreationTimestamp:      Sun, 21 Feb 2021 11:20:55 -0800
Labels:                 addonmanager.kubernetes.io/mode=Reconcile
                        k8s-app=metrics-server
                        kubernetes.io/cluster-service=true
                        version=v0.3.6
Annotations:            deployment.kubernetes.io/revision: 14
Selector:               k8s-app=metrics-server,version=v0.3.6
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           k8s-app=metrics-server
                    version=v0.3.6
  Annotations:      seccomp.security.alpha.kubernetes.io/pod: docker/default
  Service Account:  metrics-server
  Containers:
   metrics-server:
    Image:      k8s.gcr.io/metrics-server-amd64:v0.3.6
    Port:       443/TCP
    Host Port:  0/TCP
    Command:
      /metrics-server
      --metric-resolution=30s
      --kubelet-port=10255
      --deprecated-kubelet-completely-insecure=true
      --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
    Limits:
      cpu:     48m
      memory:  95Mi
    Requests:
      cpu:        48m
      memory:     95Mi
    Environment:  <none>
    Mounts:       <none>
   metrics-server-nanny:
    Image:      gke.gcr.io/addon-resizer:1.8.10-gke.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /pod_nanny
      --config-dir=/etc/config
      --cpu=40m
      --extra-cpu=0.5m
      --memory=35Mi
      --extra-memory=4Mi
      --threshold=5
      --deployment=metrics-server-v0.3.6
      --container=metrics-server
      --poll-period=300000
      --estimator=exponential
      --scale-down-delay=24h
      --minClusterSize=5
      --use-metrics=true
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      MY_POD_NAME:        (v1:metadata.name)
      MY_POD_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /etc/config from metrics-server-config-volume (rw)
  Volumes:
   metrics-server-config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               metrics-server-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   metrics-server-v0.3.6-787886f769 (1/1 replicas created)
Events:
  Type    Reason             Age                    From                   Message
  ----    ------             ----                   ----                   -------
  Normal  ScalingReplicaSet  3m10s (x2 over 5m39s)  deployment-controller  Scaled up replica set metrics-server-v0.3.6-7c9d64c44 to 1
  Normal  ScalingReplicaSet  2m54s (x2 over 5m23s)  deployment-controller  Scaled down replica set metrics-server-v0.3.6-787886f769 to 0
  Normal  ScalingReplicaSet  2m50s (x2 over 4m49s)  deployment-controller  Scaled up replica set metrics-server-v0.3.6-787886f769 to 1
  Normal  ScalingReplicaSet  2m33s (x2 over 4m34s)  deployment-controller  Scaled down replica set metrics-server-v0.3.6-7c9d64c44 to 0
Edit: 2021-03-13
Here is the configuration of the API server deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-deployment
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      serviceAccountName: api-kubernetes-service-account
      nodeSelector:
        #<labelname>:value
        cloud.google.com/gke-nodepool: api-nodepool
      containers:
      - name: api-server
        image: gcr.io/questions-279902/taskserver:latest
        imagePullPolicy: "Always"
        ports:
        - containerPort: 80
        #- containerPort: 443
        args:
        - --disable_https
        - --db_ip_address=127.0.0.1
        - --modelserver_address=http://10.128.0.18:8501 # kubectl get service model-service --output yaml
        resources:
          # You must specify requests for cpu to autoscale
          # based on cpu utilization
          requests:
            cpu: "250m"
      - name: cloud-sql-proxy
        ...
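Solution
I don't see any "resources:" fields (e.g. cpu, memory) assigned, and that should be the root cause. Note that setting resource requests is a requirement for the HPA (Horizontal Pod Autoscaler), as explained in the official Kubernetes documentation:
Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric.
This will certainly produce the message that not all metrics could be read for the target Deployment.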
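Since each api-server pod runs two containers (READY 2/2) and the rule quoted above applies per container, a CPU request on api-server alone may not be enough. The cloud-sql-proxy settings are elided in the question, so the following is only a sketch under the assumption that the sidecar has no CPU request yet; the "100m" value is a hypothetical placeholder, not a recommendation.

```yaml
# Fragment of spec.template.spec.containers in the api-server-deployment manifest.
# Every container in the pod needs a CPU request for the HPA to compute utilization.
containers:
- name: api-server
  # ... existing fields unchanged ...
  resources:
    requests:
      cpu: "250m"          # already present in the question
- name: cloud-sql-proxy
  # ... existing fields unchanged (elided in the question) ...
  resources:
    requests:
      cpu: "100m"          # hypothetical value; size it from observed usage
```

After re-applying the Deployment, kubectl describe hpa api-hpa should show whether the autoscaler can now compute CPU utilization for the target.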