CentOS 5/6 - random temporary freezes
CentOS 5/6 - random temporary freezes
I am working on high load daemon development, which listens on UDP and processes packets. Last few months I noticed some strange issue when it takes 500-700 ms to answer packet, while usually it takes 20 ms. I've run strace on all daemon processes and found this thing:
13:35:36.979887 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:35:36.979944 write(3, "[2014-04-15 13:35:36,979] WARNING WORKER 26 - [pkt#105132/AUTH] loadPresets - memory used: 0 kb\n", 99) = 99
13:35:37.599793 sendto(10, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14
13:35:37.599865 poll([{fd=10, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=10, revents=POLLIN}])
You can see that between write and sendto passed around 600 ms. At this time server was not overloaded (LA = 0.4, 16 cores). There were free memory. There were no load on disks.
So I took straces of other daemon processes plus database processes. And then did:
grep '13:35:37.4' *
grep '13:35:37.3' *
grep '13:35:37.2' *
None of commans show any lines. So I guess the system was just doing nothing during this 600 ms.
Is there any way to diagnose this issue? What it might be?
Kernel 2.6.18-194.11.1.el5
13:35:36.979887 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:35:36.979944 write(3, "[2014-04-15 13:35:36,979] WARNING WORKER 26 - [pkt#105132/AUTH] loadPresets - memory used: 0 kb\n", 99) = 99
13:35:37.599793 sendto(10, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14
13:35:37.599865 poll([{fd=10, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=10, revents=POLLIN}])
You can see that between write and sendto passed around 600 ms. At this time server was not overloaded (LA = 0.4, 16 cores). There were free memory. There were no load on disks.
So I took straces of other daemon processes plus database processes. And then did:
grep '13:35:37.4' *
grep '13:35:37.3' *
grep '13:35:37.2' *
None of commans show any lines. So I guess the system was just doing nothing during this 600 ms.
Is there any way to diagnose this issue? What it might be?
Kernel 2.6.18-194.11.1.el5
Re: CentOS 5/6 - random temporary freezes
Start by updating your systemzinok wrote:Kernel 2.6.18-194.11.1.el5
Re: CentOS 5/6 - random temporary freezes
Other server with the same problem has 2.6.32-431.el6.x86_64
Re: CentOS 5/6 - random temporary freezes
cpuspeed activated? NTP adjusting the time and there isn't an actual freeze?
Re: CentOS 5/6 - random temporary freezes
I think it is your 'write' that is taking longer than expected. Where is it writing to? To a local log file or to a network socket? Running strace with the -T option would show how long it takes to execute each system call.
Re: CentOS 5/6 - random temporary freezes
Write is writing to just a local file. Unfortunately the problem is not only on writes. Here is two more examples today:
13:08:12.663623 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:08:12.926411 write(3, "[2014-04-15 13:08:12,663] WARNING WORKER 80 - [pkt#3395853/ACCT-START] Time between ...
300 ms on just stat /etc/localtime
and
13:08:13.347081 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:08:13.530962 poll([{fd=18, events=POLLOUT}], 1, 3000) = 1 ([{fd=18, revents=POLLOUT}])
200 ms more...
It is not NTP time adjustment as seen on two separate servers and measured by tcpdump at both of them.
13:08:12.663623 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:08:12.926411 write(3, "[2014-04-15 13:08:12,663] WARNING WORKER 80 - [pkt#3395853/ACCT-START] Time between ...
300 ms on just stat /etc/localtime
and
13:08:13.347081 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
13:08:13.530962 poll([{fd=18, events=POLLOUT}], 1, 3000) = 1 ([{fd=18, revents=POLLOUT}])
200 ms more...
It is not NTP time adjustment as seen on two separate servers and measured by tcpdump at both of them.
Re: CentOS 5/6 - random temporary freezes
Or alternatively, your program does something else between those two system calls.. Use the -T option for strace to see how much each system call takes. Is your application multithreaded? If yes, how do you differentiate the system calls running in separate threads in your strace output?
Re: CentOS 5/6 - random temporary freezes
Will add -T. As for multithreaded - no, it is multiprocess daemon. Separate straces recorded for each process. There are two cases - most of the time other processes continue to work during this "freeze" at one of processes. Sometimes (seen 2-3 times) - none of processes work during freeze.
Re: CentOS 5/6 - random temporary freezes
Both servers are the same make/model hardware?zinok wrote:It is not NTP time adjustment as seen on two separate servers and measured by tcpdump at both of them.
Re: CentOS 5/6 - random temporary freezes
Nope, totally different.drk wrote:Both servers are the same make/model hardware?zinok wrote:It is not NTP time adjustment as seen on two separate servers and measured by tcpdump at both of them.