CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

General support questions
urtv666
Posts: 3
Joined: 2018/07/05 07:09:31

CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

Post by urtv666 » 2018/07/05 07:30:57

Hi folks,

I've run into a strange issue where multiple CentOS 7.0 VMs on my VMware cluster hang at random times. When they hang, ping still works, but CPU/memory/disk usage drops to almost 0, and SSH and the console do not respond. Applications on them hang too.
Not all VMs have this issue, and the time of occurrence seems random.

I believe I can rule out a VMware-level issue, as the impacted VMs are on different VMware clusters and use different storage. I can also rule out an application-level issue, as the impacted VMs run totally different applications.

I captured VMSS dumps and used the crash tool to analyze 2 of the impacted VMs.
What I found is that the VMs still seem to be running at a basic OS level (ping works, uptime keeps increasing), but there is no active process on any CPU, while some processes are in the UN (uninterruptible sleep) state. Most of the UN processes are waiting on a mutex lock, and the others seem to be waiting on xfs, as shown below.
I'm not a developer and have little knowledge of the kernel, but this looks like some bug between the kernel and the xfs module...
Can anyone provide a clue, or even a bug number for a similar issue?
Thank you in advance.
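For reference, the conversion workflow was roughly the following (a sketch - the file names are placeholders, vmss2core is VMware's suspend-state converter, and the vmlinux path assumes the matching kernel-debuginfo package is installed):

```shell
# Convert the ESXi suspend/checkpoint state (.vmss plus .vmem) into a
# core file the crash utility understands; with -N, vmss2core writes
# a Linux-style dump named vmss.core in the current directory:
vmss2core -N myvm.vmss myvm.vmem

# Open the core against the debug vmlinux that exactly matches the
# guest's running kernel version (from the kernel-debuginfo package):
crash /usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux vmss.core
```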

mutex lock:
crash> bt 3555
PID: 3555 TASK: ffff8804265071c0 CPU: 0 COMMAND: "python"
#0 [ffff88008d9bbc10] __schedule at ffffffff815e6c7d
#1 [ffff88008d9bbc78] schedule_preempt_disabled at ffffffff815e83d9
#2 [ffff88008d9bbc88] __mutex_lock_slowpath at ffffffff815e616d
#3 [ffff88008d9bbce0] mutex_lock at ffffffff815e550f
#4 [ffff88008d9bbcf8] do_last at ffffffff811bf406
#5 [ffff88008d9bbda0] path_openat at ffffffff811c0322
#6 [ffff88008d9bbe48] do_filp_open at ffffffff811c0f3b
#7 [ffff88008d9bbf18] do_sys_open at ffffffff811aede3
#8 [ffff88008d9bbf70] sys_open at ffffffff811aeefe
#9 [ffff88008d9bbf80] tracesys at ffffffff815f2327 (via system_call)
RIP: 00007f523eccb9e0 RSP: 00007fff9e9581e8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff815f2327 RCX: ffffffffffffffff
RDX: 0000000000000180 RSI: 00000000000200c2 RDI: 0000000001a8d3e0
RBP: 00000000010bd0a0 R8: 00007fff9e957f40 R9: 00007fff9e957fc0
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff811aeefe
R13: ffff88008d9bbf78 R14: 00007f523f44fb48 R15: 0000000001a91850
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

xfs:
PID: 2265 TASK: ffff88002655f1c0 CPU: 1 COMMAND: "python"
#0 [ffff8804153679f8] __schedule at ffffffff815e6c7d
#1 [ffff880415367a60] schedule at ffffffff815e71b9
#2 [ffff880415367a70] xlog_grant_head_wait at ffffffffa020a06d [xfs]
#3 [ffff880415367ac0] xlog_grant_head_check at ffffffffa020a1ee [xfs]
#4 [ffff880415367b00] xfs_log_reserve at ffffffffa020daef [xfs]
#5 [ffff880415367b38] xfs_trans_reserve at ffffffffa01c7204 [xfs]
#6 [ffff880415367b80] xfs_create at ffffffffa01fbd0c [xfs]
#7 [ffff880415367c50] xfs_vn_mknod at ffffffffa01bd67b [xfs]
#8 [ffff880415367cb0] xfs_vn_create at ffffffffa01bd7d3 [xfs]
#9 [ffff880415367cc0] vfs_create at ffffffff811befdd
#10 [ffff880415367cf8] do_last at ffffffff811bfbf4
#11 [ffff880415367da0] path_openat at ffffffff811c0322
#12 [ffff880415367e48] do_filp_open at ffffffff811c0f3b
#13 [ffff880415367f18] do_sys_open at ffffffff811aede3
#14 [ffff880415367f70] sys_open at ffffffff811aeefe
#15 [ffff880415367f80] tracesys at ffffffff815f2327 (via system_call)
RIP: 00007fb84ab099e0 RSP: 00007fffb57e43a8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff815f2327 RCX: ffffffffffffffff
RDX: 0000000000000180 RSI: 00000000000200c2 RDI: 0000000002bba3e0
RBP: 00000000021ea0a0 R8: 00007fffb57e4100 R9: 00007fffb57e4180
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff811aeefe
R13: ffff880415367f78 R14: 00007fb84b28db48 R15: 0000000002bbe850
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
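For anyone who lands here later: on a box that is still partly responsive, you can spot tasks stuck in uninterruptible sleep (crash shows them as UN; ps shows state D) without taking a dump at all. A sketch:

```shell
# List tasks in uninterruptible sleep (ps state "D" == crash state "UN"),
# with the kernel symbol each one is blocked in (wchan):
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'

# With root and a working console, SysRq "w" dumps the stacks of all
# blocked tasks into the kernel ring buffer:
#   echo w > /proc/sysrq-trigger && dmesg | tail -n 50
```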
Attachments
crash log.zip
(52.99 KiB)

TrevorH
Site Admin
Posts: 33219
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

Post by TrevorH » 2018/07/05 08:12:26

CentOS 7.0 is 4 years old and unsupported. Run yum update to get up to date and onto 7.5.1804 and see if the problem is fixed.
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are deadest, do not use them.
Use the FAQ Luke

urtv666
Posts: 3
Joined: 2018/07/05 07:09:31

Re: CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

Post by urtv666 » 2018/07/05 09:14:48

Thank you for the upgrade suggestion. I'll talk with the users, but I'm afraid they will be more concerned about application compatibility with the new version.
One more question: is there a guide for using the crash tool?
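(For later readers, the basic setup is roughly this - a sketch, assuming the standard CentOS debuginfo repositories are reachable; the vmcore path is a placeholder:)

```shell
# Install crash plus the debuginfo that matches the running kernel
# (debuginfo-install comes from the yum-utils package):
yum -y install crash yum-utils
debuginfo-install -y kernel-$(uname -r)

# Analyze a vmcore (from kdump, or from a converted VMware snapshot):
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore

# Useful commands at the crash> prompt:
#   ps            - task list with states (UN = uninterruptible sleep)
#   bt <pid>      - kernel stack backtrace for one task
#   foreach UN bt - backtraces for every UN task
#   log           - kernel ring buffer (dmesg)
```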

TrevorH
Site Admin
Posts: 33219
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

Post by TrevorH » 2018/07/05 09:21:29

CentOS has a stable ABI, which means that apps built for one point release should function on all subsequent ones.

By not updating, you are exposing yourself to all the bugs listed on the Red Hat Errata pages (https://access.redhat.com/errata/#/?q=& ... _version=7). There are numerous high-severity issues in 7.0 that have been fixed in later versions, and these make 7.0 completely unsafe to use. There are currently 10 pages of updates marked as "critical" and a further 27 pages marked as "important" - at 10 entries per page, that's over 370 important or critical security vulnerabilities. Looking at all severity levels listed there, there are 66 pages - around 660 vulnerabilities. Ask Equifax how ignoring security updates went for them!

urtv666
Posts: 3
Joined: 2018/07/05 07:09:31

Re: CentOS 7 VMs on ESXi-hypervisor hung, but still can ping

Post by urtv666 » 2018/07/06 02:08:30

Thank you very much!

I searched Red Hat support and found a bug that looks exactly the same as mine, and the version matches. Now I feel 98% confident I can get the application team to cooperate with us on upgrading to a newer version. :lol:
Here is the link to the Red Hat bug in case someone else needs it:
https://access.redhat.com/solutions/2108471
