CentOS

Our CentOS 6.2 system on a blade server (hostname: Blade1) sporadically freezes when accessing drives mounted with mount.cifs. When the issue occurs, df freezes when checking the cifs mounts, and ls also freezes when run in the directory where the cifs shares are mounted. Several minutes later, the system responds again and continues normally. All of our work is done on networked Windows drives mounted over cifs, and we cannot access them while this is happening, so the freezing is a huge hindrance.
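For anyone who wants to detect the stall without tying up an interactive shell, here is a minimal sketch of a probe that runs a command in the background and gives up after a time limit. The function name probe_cmd, the return code 124, and the example path /mnt/share are assumptions for illustration, not anything from the system described above. The probed command itself may stay blocked in the kernel; the sketch only stops the caller from waiting on it.

```shell
#!/bin/sh
# probe_cmd LIMIT COMMAND [ARGS...]
# Runs COMMAND in the background and polls for its exit status.
# Returns COMMAND's exit status if it finishes within LIMIT seconds,
# or 124 (an arbitrary choice) if it has not finished by then.
probe_cmd() {
    limit=$1
    shift
    status_file=$(mktemp) || return 2
    # The subshell records the exit status once the command finishes.
    # If the command hangs, the file stays empty (and is left behind).
    ( "$@" >/dev/null 2>&1; echo "$?" > "$status_file" ) >/dev/null 2>&1 &
    waited=0
    while :; do
        if [ -s "$status_file" ]; then
            rc=$(cat "$status_file")
            rm -f "$status_file"
            return "$rc"
        fi
        [ "$waited" -ge "$limit" ] && break
        sleep 1
        waited=$((waited + 1))
    done
    echo "no answer after ${limit}s: $*" >&2
    return 124
}

# Hypothetical usage (the mount point is a placeholder):
#   probe_cmd 10 stat -f /mnt/share || logger "cifs probe failed or timed out"
```

Run from cron, a probe like this could timestamp each stall so it can be lined up against the kernel messages.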

I've done Google searches on the topic, and nothing seems to resolve the issue.

The kernel is 2.6.32-220.2.1.el6.x86_64. This is a minimal install with no GUI, plus additional networking packages to facilitate Samba integration with our Windows Server network. The cifs shares are hosted by a Windows SBS server.

Output of modinfo /lib/modules/2.6.32-220.2.1.el6.x86_64/kernel/fs/cifs/cifs.ko:
version: 1.68

The issue always coincides with messages of the following type in /var/log/messages:
Jan 6 11:40:38 Blade1 kernel: CIFS VFS: Unexpected lookup error -512

And you can see where the df command hangs:
Jan 6 11:46:19 Blade1 kernel: INFO: task df:2387 blocked for more than 120 seconds.
Jan 6 11:46:19 Blade1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 6 11:46:19 Blade1 kernel: df D 000000000000000d 0 2387 2373 0x00000080
Jan 6 11:46:19 Blade1 kernel: ffff882ffbea7bc8 0000000000000086 0000000000000000 0000000000000880
Jan 6 11:46:19 Blade1 kernel: 0000000000000000 ffff882ffbea7b88 ffff882ffbea7bc8 ffffffff8122b9e4
Jan 6 11:46:19 Blade1 kernel: ffff882ffb885078 ffff882ffbea7fd8 000000000000f4e8 ffff882ffb885078
Jan 6 11:46:19 Blade1 kernel: Call Trace:
Jan 6 11:46:19 Blade1 kernel: [] ? context_struct_compute_av+0x324/0x420
Jan 6 11:46:19 Blade1 kernel: [] __mutex_lock_slowpath+0x13e/0x180
Jan 6 11:46:19 Blade1 kernel: [] ? find_nls+0x59/0x100
Jan 6 11:46:19 Blade1 kernel: [] mutex_lock+0x2b/0x50
Jan 6 11:46:19 Blade1 kernel: [] cifs_reconnect_tcon+0x15a/0x340 [cifs]
Jan 6 11:46:19 Blade1 kernel: [] ? mntput_no_expire+0x30/0x110
Jan 6 11:46:19 Blade1 kernel: [] ? avc_has_perm+0x71/0x90
Jan 6 11:46:19 Blade1 kernel: [] ? __link_path_walk+0x768/0x1030
Jan 6 11:46:19 Blade1 kernel: [] smb_init+0x39/0x70 [cifs]
Jan 6 11:46:19 Blade1 kernel: [] CIFSSMBQFSInfo+0x64/0x250 [cifs]
... and it goes on.
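To line up the two kinds of messages quoted above, a small filter over the log can help. This is only a sketch: the function name cifs_hang_events is made up, and the patterns are taken from the two sample lines in this post, so other cifs or hung-task messages may be worded differently.

```shell
#!/bin/sh
# cifs_hang_events [LOGFILE]
# Prints CIFS VFS errors and hung-task warnings from LOGFILE
# (default: /var/log/messages) in the order they were logged.
cifs_hang_events() {
    log=${1:-/var/log/messages}
    grep -E 'CIFS VFS:|blocked for more than [0-9]+ seconds' "$log"
}

# Hypothetical usage:
#   cifs_hang_events /var/log/messages | tail -n 20
```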

Based on web searches, I've tried decreasing wsize in the mount options and changing parameters in /proc/fs/cifs/, yet the issue still occurs. I believe this issue may have been reported on other Linux distros, and the resolutions seem to rely on kernel updates and updates to the cifs version, but I am hoping there is a workaround I can use in the meantime.

This [url=https://bugzilla.redhat.com/show_bug.cgi?id=760018]upstream bugzilla[/url] might be related. You can try applying the [url=http://git.fedorahosted.org/git/?p=initscripts.git;a=commitdiff;h=3031ac68e251d5090c3e021fb387a9ba0c8f343d]proposed fix[/url] and see how it goes.

[code]
--- a/rc.d/init.d/network
+++ b/rc.d/init.d/network
@@ -184,10 +184,8 @@ case "$1" in
         # If this is a final shutdown/halt, check for network FS,
         # and unmount them even if the user didn't turn on netfs
         if [ "$RUNLEVEL" = "6" -o "$RUNLEVEL" = "0" -o "$RUNLEVEL" = "1" ]; then
-            NFSMTAB=$(LC_ALL=C awk '$3 ~ /^nfs/ { print $2 }' /proc/mounts)
-            SMBMTAB=$(LC_ALL=C awk '$3 == "smbfs" { print $2 }' /proc/mounts)
-            NCPMTAB=$(LC_ALL=C awk '$3 == "ncpfs" { print $2 }' /proc/mounts)
-            if [ -n "$NFSMTAB" -o -n "$SMBMTAB" -o -n "$NCPMTAB" ] ; then
+            NETMOUNTS=$(findmnt -m -t nfs,nfs4,smbfs,ncpfs,cifs 2>/dev/null)
+            if [ -n "$NETMOUNTS" ] ; then
                 /etc/init.d/netfs stop
             fi
         fi
[/code]

Thanks, I have applied the proposed fix on the system and will monitor the issue.

However, is there any explanation for why this manifests sporadically while the system is running in runlevel 3, and why this change might fix it when it seems to apply only to runlevels 0, 1, and 6?

[quote]
The kernel is 2.6.32-220.2.1.el6.x86_64.
[/quote]
There seems to be a question mark over the reliability of that kernel. Nothing is actually proven yet. However, if you continue to experience issues, it might be worthwhile booting the earlier kernel-2.6.32-220.el6.x86_64 package and seeing how that behaves.

Sorry for not mentioning it earlier, but this issue was also present in kernel-2.6.32-220.el6.x86_64. I had upgraded to 2.6.32-220.2.1.el6.x86_64 in the hope that the issue had been resolved. The issue has not recurred in the few hours since I applied the fix.

If this is indeed fixed, why would these changes stop an issue that was occurring "seemingly randomly" while running in runlevel 3? The script change only applies to runlevels 0, 1, and 6.

[quote]
If this is indeed fixed, why would these changes stop an issue that was occurring "seemingly randomly" while running in runlevel 3? The script change only applies to runlevels 0, 1, and 6.[/quote]
I don't know. After all, the bug referenced earlier might not be related to the problem you are seeing.

For mounting Windows shares I almost exclusively use autofs. This is because Windows is known to crash often (well, the situation may not be as bad as it used to be...), causing stale mounts. I'm not sure whether autofs would improve your situation, so this is just a thought.
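To make the autofs suggestion concrete, here is a minimal sketch that stages an auto.master line and a cifs map file in a scratch directory for review, rather than writing to /etc directly. The mount point /mnt/win, the map path /etc/auto.cifs, the credentials file /etc/cifs.cred, the 60-second timeout, and the function name stage_autofs_cifs are all placeholders, not values from this thread.

```shell
#!/bin/sh
# stage_autofs_cifs OUT_DIR SERVER SHARE
# Writes two files into OUT_DIR for manual review:
#   auto.master.snippet  a line to append to /etc/auto.master
#   auto.cifs            an indirect map with one cifs entry
stage_autofs_cifs() {
    out_dir=$1
    server=$2
    share=$3
    mkdir -p "$out_dir" || return 1
    # Shares appear under /mnt/win and are unmounted after 60s idle.
    printf '%s\n' '/mnt/win  /etc/auto.cifs  --timeout=60' \
        > "$out_dir/auto.master.snippet"
    # Map entry: key, mount options, location (note the leading colon).
    printf '%s  %s  %s\n' "$share" \
        '-fstype=cifs,credentials=/etc/cifs.cred' \
        "://$server/$share" > "$out_dir/auto.cifs"
}

# Hypothetical usage (server and share names are made up):
#   stage_autofs_cifs /tmp/autofs-stage sbs-server data
```

With a map like this, the share is mounted on first access to /mnt/win/data and dropped again after the idle timeout, which limits how long a stale mount can linger.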

I don't think that bug applies here. It has to do with pauses during shutdown caused by (what looks like) an incorrect order of initscripts, or some sort of race condition between the shutdown of the network and the unmounting of cifs shares.
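As a quick illustration of why the patch should not matter on a running system, the guard from the patched rc.d/init.d/network can be evaluated on its own. This is a sketch: the function name netfs_stop_applies is made up, and only the test expression is copied from the patch.

```shell
#!/bin/sh
# netfs_stop_applies LEVEL
# Succeeds only when the guard in the patched network initscript
# would let the "netfs stop" block run for that runlevel.
netfs_stop_applies() {
    level=$1
    [ "$level" = "6" -o "$level" = "0" -o "$level" = "1" ]
}

for rl in 0 1 3 6; do
    if netfs_stop_applies "$rl"; then
        echo "runlevel $rl: netfs stop block would run"
    else
        echo "runlevel $rl: guard is false, block skipped"
    fi
done
# Prints:
#   runlevel 0: netfs stop block would run
#   runlevel 1: netfs stop block would run
#   runlevel 3: guard is false, block skipped
#   runlevel 6: netfs stop block would run
```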

I suspect that the 220 series of kernels has a few rather large changes in it. Reading the list of bugs fixed [url=https://rhn.redhat.com/errata/RHSA-2011-1530.html]here on TUV's web site[/url], I found a relevant-sounding fix in [url=https://bugzilla.redhat.com/show_bug.cgi?id=654198]this bugzilla[/url] within the top 10 entries :-( I especially like the description of the fix in the bugzilla as "an experimental, proof-of-concept patch". There are several other cifs fixes in the list that could also be candidates for causing this, though.

The delta between each successive kernel version is available [url=http://www.centos.toracat.org/ajb/kernel-clog-diff/el6/]here[/url].

I am having doubts about the current [b]-220.x.y.el6[/b] kernel series. :-(

The fix did not work. I have rolled the kernel back to my previously installed 2.6.32-71.29.1 and will monitor the system to see if the issue still occurs.

There are several intermediate kernels you could try too - kernel-2.6.32-131.21.1.el6.x86_64 is the latest that isn't one of the 220 line.

CentOS 6.2 CIFS Network Freeze (df, mount, ls)