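Reproducing the OOM problem in testing
Test kernel: 3.10.0-862.3.2.el7.x86_64
Seven faulty pods that are known to trigger OOM were started on a single node. Testing showed that the 3.10 kernel created the seven tasks in parallel, so they hit OOM at the same time and left the kernel stuck on a lock. The server hung within 2-3 minutes: the simulated test kept triggering OOM until the CPUs were exhausted, and the server then rebooted on its own.
kernel: BUG: soft lockup - CPU#4 stuck for 22s! [handler20:1542]  <<-- messages of this kind are also a 3.10 kernel bug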
Nov 6 10:42:55 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,order=0,oom_score_adj=-998
Nov 6 10:42:55 GFS-6 kernel: runc:[1:CHILD] cpuset=c156bcb333882b0a8de413c6e7cbe73867d388dc63d99c7b72d926aa6e669b6a mems_allowed=0
Nov 6 10:43:02 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:02 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:03 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:03 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:07 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:07 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:08 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:08 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:09 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:09 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:11 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:11 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov 6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuset
Nov 6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpu
Nov 6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuacct
Nov 6 10:43:58 GFS-6 kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
Nov 6 10:43:58 GFS-6 kernel: PERCPU: Embedded 35 pages/cpu @ffff96fa7fc00000 s104856 r8192 d30312 u262144
Nov 6 10:43:58 GFS-6 kernel: #011RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=8.
Nov 6 10:43:58 GFS-6 kernel: core: CPUID marked event: 'cpu cycles' unavailable
Nov 6 10:43:58 GFS-6 kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
Nov 6 10:43:58 GFS-6 kernel: NMI watchdog: Shutting down hard lockup detector on all cpus <<-- all CPUs hung; the server rebooted abnormally on its own
Nov 6 10:46:02 GFS-6 systemd: Started Docker Application Container Engine. <<-- after the reboot
Nov 6 10:46:02 GFS-6 systemd: Reached target multi-user System.
Nov 6 10:46:02 GFS-6 systemd: Starting multi-user System.
Nov 6 10:46:02 GFS-6 systemd: Starting Update UTMP about System Runlevel Changes...
Nov 6 10:46:02 GFS-6 systemd: Started Update UTMP about System Runlevel Changes.
Nov 6 10:46:02 GFS-6 systemd: Startup finished in 1.456s (kernel) + 4.661s (initrd) + 9.786s (userspace) = 15.904s.
Nov 6 10:46:05 GFS-6 systemd: kubelet.service holdoff time over,scheduling restart.
Nov 6 10:46:05 GFS-6 systemd: Starting kubelet: The Kubernetes Node Agent...
Nov 6 10:46:05 GFS-6 systemd: Started kubelet: The Kubernetes Node Agent.
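To see how densely the OOM killer fired in a log like the one above, the events can be counted per timestamp. The following is only a sketch: the function name count_oom_events is made up for illustration, and it assumes syslog-style lines whose first three fields are month, day and time, as in the excerpt.

```shell
# Count "invoked oom-killer" events per timestamp.
# Reads syslog-style lines on stdin and prints "<count> <month> <day> <time>",
# keeping the order in which each timestamp was first seen.
count_oom_events() {
    awk '/invoked oom-killer/ {
             t = $1 " " $2 " " $3
             if (!(t in n)) order[++k] = t
             n[t]++
         }
         END { for (i = 1; i <= k; i++) print n[order[i]], order[i] }'
}
```

It could be used as `count_oom_events < /var/log/messages`, assuming the kernel messages are written to that file as they are on the test hosts.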
Kubernetes can no longer manage the node, and every pod on the node is down:
[root@k8s-m1 test]# kubectl get po -o wide --all-namespaces |grep k8snode6
default ngx-pod-6f977cf846-7k4vm 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-85mtx 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-hsf6x 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-lt68h 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-mqvcf 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-rmxzj 0/1 ContainerCreating 0 2m <none> k8snode6
default ngx-pod-6f977cf846-sgvrd 0/1 ContainerCreating 0 2m <none> k8snode6
kube-system kube-proxy-9mtnw 0/1 Error 3 125d 10.80.136.179 k8snode6
monitoring kube-prometheus-node-exporter-xbf9k 0/1 Error 1 63d 10.80.136.179 k8snode6
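The listing above can be filtered so that only the pods that are not Running on a given node remain. This is a minimal sketch with a hypothetical helper name, unhealthy_pods_on_node. It assumes the eight-column layout shown in the output (NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE); other kubectl versions may print the columns differently.

```shell
# Print "<namespace>/<pod> <status>" for every pod on the given node whose
# STATUS column is not "Running".
# Reads `kubectl get po -o wide --all-namespaces` output on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE.
unhealthy_pods_on_node() {
    local node="$1"
    awk -v node="$node" \
        '$8 == node && $4 != "Running" { print $1 "/" $2, $4 }'
}
```

For example: `kubectl get po -o wide --all-namespaces | unhealthy_pods_on_node k8snode6`.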
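Switching to kernel 4.19.1 and repeating the OOM test
Seven faulty pods that are known to trigger OOM were started on a single node.
Test kernel: 4.19.1-1.el7.elrepo.x86_64
Testing showed that the 4.19 kernel does not create the tasks in parallel, so the kernel lock bug could not be triggered for now.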
[root@k8snode7-180v136-taiji ~]# tail -f /var/log/messages|grep oom_kill
Nov 6 11:32:58 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:32:59 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:00 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:01 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:04 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:05 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:06 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:07 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov 6 11:33:08 GFS-7 kernel: oom_kill_process+0x262/0x290
......................
[root@k8s-m1 test]# kubectl get po --all-namespaces -o wide |grep k8snode7
default ngx-pod-74c88d6495-79krh 0/1 ContainerCreating 0 33m <none> k8snode7
kube-system kube-proxy-xt4c7 1/1 Running 1 55d 10.80.136.180 k8snode7
monitoring kube-prometheus-node-exporter-bbsjn 1/1 Running 1 60d 10.80.136.180 k8snode7
Summary: for now, a subset of the servers is being upgraded to kernel 4.19.1 as a canary rollout. More will be added later.
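During a staged rollout like this it helps to tell which hosts are still on a kernel older than 4.19. The sketch below compares a kernel release string against a minimum version. The function name kernel_older_than is hypothetical, and it relies on the -V (version sort) option of GNU sort, which is assumed to be available on the hosts.

```shell
# Succeed (exit 0) when kernel release $1 is older than version $2.
# Only the part before the first "-" of the release is compared, so
# "3.10.0-862.3.2.el7.x86_64" is treated as "3.10.0".
kernel_older_than() {
    local release="${1%%-*}" minimum="$2"
    if [ "$release" = "$minimum" ]; then
        return 1
    fi
    # sort -V puts the lower version first; older means the release sorts first.
    [ "$(printf '%s\n%s\n' "$release" "$minimum" | sort -V | head -n 1)" = "$release" ]
}
```

On a node it could be called as `kernel_older_than "$(uname -r)" 4.19 && echo "needs upgrade"`.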
Kernel upgrade procedure
1. Add the ELRepo repository
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
2. List the available kernel-related packages
yum --disablerepo="*" --enablerepo="elrepo-kernel" list available
3. Install the latest mainline stable kernel
yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel -y
Check the default boot order
awk -F\' '$1=="menuentry " {print $2}' /etc/grub2.cfg
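Boot entries are numbered from 0, and a newly installed kernel is inserted at the top of the list (the old kernel is now at position 1, while 4.0.2 is at position 0), so entry 0 must be selected. To boot the newest kernel by default, run: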
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
cat /boot/grub2/grub.cfg
yum remove kernel-3.10.0-327.el7.x86_64 kernel-devel-3.10.0-327.el7.x86_64 -y
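Since grub2-set-default takes a zero-based index, it can be handy to print each menu entry together with its index before picking one. The following sketch extends the awk one-liner above. The function name list_grub_entries is made up here, and like the original command it only sees top-level menuentry lines.

```shell
# Print each GRUB menu entry title prefixed by its zero-based index.
# $1: path to a grub.cfg-style file (defaults to /etc/grub2.cfg, as above).
list_grub_entries() {
    awk -F\' '$1 == "menuentry " { print i++ " : " $2 }' "${1:-/etc/grub2.cfg}"
}
```

After a reboot, `uname -r` shows which kernel is actually running, and `grub2-editenv list` should show the saved default entry.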
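Custom kernel versions
Other archived kernel versions can be downloaded from the following links:
- Ubuntu: Index of /~kernel-ppa/mainline
- RHEL: http://mirror.rc.usf.edu/compute_lock/elrepo/kernel/el7/x86_64/RPMS/
- Official kernel archive: https://cdn.kernel.org
The installation method below works for either the ml kernel or one of the archived kernel versions above; pick one.
Installing a kernel version of your choice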
# Pick one version; 4.20.13-1 is an alternative to 4.18.9-1
export Kernel_Version=4.18.9-1
# export Kernel_Version=4.20.13-1
wget http://mirror.rc.usf.edu/compute_lock/elrepo/kernel/el7/x86_64/RPMS/kernel-ml{,-devel}-${Kernel_Version}.el7.elrepo.x86_64.rpm
yum localinstall -y kernel-ml*
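A malformed version string is easy to pass to wget by accident, so the package URLs can be built and loosely checked before downloading. This is only a sketch: the names ELREPO_ARCHIVE and kernel_ml_urls are hypothetical, the base URL is the mirror given above, and the check only verifies the rough shape "X.Y.Z-N", not that the package exists.

```shell
# Mirror base URL, taken from the text above.
ELREPO_ARCHIVE="http://mirror.rc.usf.edu/compute_lock/elrepo/kernel/el7/x86_64/RPMS"

# Print the kernel-ml and kernel-ml-devel RPM URLs for version $1 (e.g. 4.18.9-1).
# Returns 1 if the version does not look like "X.Y.Z-N".
kernel_ml_urls() {
    local version="$1" pkg
    case "$version" in
        [0-9]*.[0-9]*.[0-9]*-[0-9]*) ;;
        *) echo "unexpected version format: $version" >&2; return 1 ;;
    esac
    for pkg in kernel-ml kernel-ml-devel; do
        printf '%s/%s-%s.el7.elrepo.x86_64.rpm\n' "$ELREPO_ARCHIVE" "$pkg" "$version"
    done
}
```

The output can be reviewed first and then fed to wget, for example `kernel_ml_urls 4.20.13-1 | xargs -n 1 wget`.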