Server hangs up every day with messages like "soft lockup - CPU#0 stuck for 10s"

atrandafir
Posts: 5
Joined: 2011/08/10 18:12:50

Post by atrandafir » 2011/08/10 18:49:40

Hello everybody,

I am having problems with a server from 1and1. It started hanging once a week, then once every few days, and now it hangs every day, sometimes more than once a day.

The only error messages I was able to see while logged in at the emergency console are similar to this one: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:38]
You can see more of these here: http://pastebin.com/gsbK8JdN

I can't find anything else in /var/log/messages related to the problem.

This is a screenshot taken while the server was using a lot of resources for a PHP cron job, just before it froze: http://i55.tinypic.com/282oxas.png

Could someone point me in the right direction? So far other people have told me it could be a hardware-related problem, but I'd like to know what else I can investigate without having to bring down the server.

I would really appreciate some help on this one because I have to keep an eye on the website 24/7 to reboot the server when it hangs. The hosting company wants to shut it down for testing without offering an alternative to keep the website online.

Thank you,
Alex.

TrevorH
Site Admin
Posts: 33216
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Post by TrevorH » 2011/08/11 08:25:36

The first thing to try would be to update to the latest level of software. What you are running now, 2.6.18-194.26.1.el5, is quite old. That is a CentOS 5.5 kernel and the latest release is 5.6. No idea if your problem has been fixed in the meantime but it is worth a shot.

Are the messages you posted the first ones that appear? If not, can you check and post what happens first?

Your screenshot shows nothing stressed, plenty of RAM, plenty of CPU free.

Post by atrandafir » 2011/08/11 09:49:37

Hello,

How do I update? I tried to fix this problem by reinstalling the entire OS a few weeks ago, then ran yum update, and this is what I have now:

[root@s15412399 ~]# uname -a
Linux domain.com 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@s15412399 ~]# cat /etc/*release*
CentOS release 5.6 (Final)

Last night it crashed again. I had set up PuTTY on the emergency console to log all activity, and nothing appeared on the screen. It crashed again today, just a few minutes ago. I don't know where to look for information. I'm going to check the service logs to see if there's a problem somewhere.

r_hartman
Posts: 711
Joined: 2009/03/23 15:08:11
Location: Netherlands

Post by r_hartman » 2011/08/11 09:50:57

I've only ever seen this with RHEL 5.2 on a Sun AMD box with 8 CPUs and 16 NICs running Xen.
Solved by removing NICs, strange as it may sound. Xen only supports 7 or 8 NICs, iirc.

But your kernel is not a Xen kernel, so I'm not sure it's related. There's no indication of what virtualization software is being used.
Anyway, just thought I might mention it.

Maybe you can [url=https://www.centos.org/modules/newbb/viewtopic.php?topic_id=25128&forum=47]provide more info[/url] on your system?

Oh, and by the way, welcome to the CentOS fora. Please see the recommended reading in my signature.

EDIT: from your last post (crossed mine), you are at the latest kernel now, so while updating again (yum update) may get you a few packages, it is not going to solve this.

You state 'crashed'; do you mean it really dies? On the Sun box I mentioned the system would just keep running, issuing these messages at intervals but never grinding to a halt. Rebooting did not help in any way. From your listing it appears that generating these messages is the only thing the box is still doing. Is that correct?

Post by atrandafir » 2011/08/11 11:07:51

Pastebin result for getinfo.sh: http://pastebin.centos.org/37726
Specs of the server on the hosting page: http://bit.ly/oYnBuS (if it may help)

What are NICs? LAN cards?

And also, the server is dedicated, there is no virtualization software used.

By 'crashed' I mean that the server stops responding to anything. I think that when it happens, no service writes any more logs during that period.
One time I was connected to the emergency console when errors started to appear on the screen, and the server was still working, just not answering anything from the outside.
This is a log from when I tried to create a zip to back up all the images (15GB):
http://pastebin.centos.org/37728

Thank you. Let me know what else I can provide that might help.

Post by TrevorH » 2011/08/11 11:58:43

Ok, you're now up to date which at least eliminates known bugs!

Can you look in /var/log/messages, find where the latest batch of problems started, and put an excerpt of the messages around that point on pastebin? Say a couple of minutes prior to the first lockup message and then the _first_ lockup only. It's the period before the lockup that'll be most interesting.
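
A minimal sketch of pulling that excerpt with GNU grep, assuming the log lines contain the string "soft lockup" as in the console output earlier in the thread. The function name `first_lockup_excerpt` and the 200-line default are hypothetical choices, and a line count is only a rough stand-in for "a couple of minutes":

```shell
#!/bin/sh
# Print the first "soft lockup" line in a log plus the lines leading up to it.
# $1: log file (default /var/log/messages)
# $2: number of lines of leading context (default 200, an arbitrary choice)
first_lockup_excerpt() {
    log_file="${1:-/var/log/messages}"
    context_lines="${2:-200}"
    # -m 1 stops at the first matching line; -B prints that many lines before it.
    grep -m 1 -B "$context_lines" 'soft lockup' "$log_file"
}

# Run against the system log only if it is readable (this usually requires root).
if [ -r /var/log/messages ]; then
    first_lockup_excerpt /var/log/messages 200 \
        || echo "no soft lockup message found" >&2
fi
```

Redirecting the output to a file gives something that can be pasted as-is. If the lockup happened before the last log rotation, the first argument would need to point at the rotated file instead.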

Post by atrandafir » 2011/08/11 14:42:51

300 lines of /var/log/messages before manual restart: http://pastebin.centos.org/37734

I see nothing but messages from the drweb antivirus, which I had already removed once before without the problem going away.

Post by TrevorH » 2011/08/11 15:27:00

But that's completely different to the problem you had before? Before you were getting soft lockup messages with stack traces. This one looks like a power off/unplanned reboot - is that correct? So was it updating that changed the symptoms or has it always been a mixture of the two?

There are some settings that control things like when to reboot following a kernel panic. The main one is the time from kernel panic to automatic-reboot and is controlled by setting a number in /proc/sys/kernel/panic. The default setting is 0 which means to not auto-reboot so if it panics then it should just sit there with the panic screen still visible on the console - if this is happening then the contents of the panic screen would be useful info. If /proc/sys/kernel/panic contains anything other than 0 then it's the number of seconds to wait following the panic before an automatic reboot takes place - useful if you are concerned about overall system uptime rather than debugging why a particular crash is happening.
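
As a rough sketch of checking which of the two cases applies, assuming a shell on the affected box (the function name `describe_panic_timeout` is made up for illustration):

```shell
#!/bin/sh
# Report what the current kernel panic timeout means.
# $1: file to read (default /proc/sys/kernel/panic)
describe_panic_timeout() {
    panic_file="${1:-/proc/sys/kernel/panic}"
    value=$(cat "$panic_file") || return 1
    if [ "$value" -eq 0 ]; then
        echo "panic=0: no automatic reboot; the panic screen stays on the console"
    else
        echo "panic=$value: automatic reboot $value seconds after a panic"
    fi
}

# Only query the live system if the file exists (Linux only).
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout
fi
```

To change the value at runtime as root, `sysctl -w kernel.panic=60` or `echo 60 > /proc/sys/kernel/panic` should work. A `kernel.panic = 60` line in /etc/sysctl.conf is the usual way to keep the setting across reboots.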

If there is no kernel panic then the most likely explanation is a hardware problem and handing the box to your provider to run diagnostics on would be the next step IMO.

Incidentally, the warnings that your AV product is issuing about the lack of the lsb_release command could be addressed by installing the redhat-lsb package.

Post by atrandafir » 2011/08/13 12:23:28

We've got a new dedicated server, and we are planning to migrate our entire website there over the next few days and then leave the current server.

For now I have transferred the MySQL database to the new server and use it from there, to avoid the table corruption that tends to happen when the server is rebooted.

I have also set the kernel panic timeout to 60, so I hope the server will now reboot by itself when it hangs again and everything will get back to normal automatically. That should bring the server back faster when it hangs at 3 in the morning while I'm asleep and don't hear the Pingdom alerts that the website is down.

I just hope they will take a look at the server after we leave and don't give it to someone else without solving the problem.

Thank you all for your guidance!

PS: I previously tried setting the kernel parameter panic=60 in /boot/grub/grub.conf, but as far as I remember it did not reboot by itself. Let's see what happens now.
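
A small sketch for checking whether a panic= boot parameter actually reached the running kernel, assuming it would show up in /proc/cmdline. The function name `cmdline_panic_value` is hypothetical:

```shell
#!/bin/sh
# Print the value of the last panic=N token on the kernel command line,
# or nothing if no such token is present.
# $1: file to read (default /proc/cmdline)
cmdline_panic_value() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split the command line into one token per line, keep only panic=<digits>.
    tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^panic=\([0-9][0-9]*\)$/\1/p' \
        | tail -n 1
}

# Only query the live system if the file exists (Linux only).
if [ -r /proc/cmdline ]; then
    boot_value=$(cmdline_panic_value)
    echo "panic= on kernel command line: ${boot_value:-not set}"
fi
```

An empty result would suggest the option was not on the kernel line of the grub.conf entry that was actually booted. A timeout that is set but never fires would be consistent with a hang that never reaches a kernel panic, since the countdown only starts once the kernel panics.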
