I have an Amazon AWS EC2 virtual machine (a T2.medium instance) with two VCPUs and 4GB of memory. The machine is running a LAMP stack of web sites - Apache 2.4 httpd, PHP-FPM and Mariadb. Plus I have an old copy of Firefox running headless to a VNC server being used to generate PDF files (it works!).
This machine was running with Centos 7.2 for 1.5 years. The 8GB swap device would creep up in utilisation over the months to something like 2GB used, but the system was stable and performed well with very little swapping.
Until I did a 'mass update' of Centos two weeks ago. I did a simple 'yum update', stood back, waited 15 minutes, rebooted, everything worked (other than having to remove (again) a couple of default configuration files that got reinstalled). Great.
But we soon started seeing very poor performance. Mariadb was reporting lots of queries to its 'slow log' (SQL statements taking more than 10 seconds). My analysis of the Apache httpd access_log showed that many web pages were taking 2-3 times to run than usual. lfd was sending me messages multiple times a day telling me about a high load average (6 - 9+).
iostat showed me that the disk is 100% utilised during these patches of slow performance, pegging out at 62.5MB/s throughput to the sole (Amazon EBS) disk volume. This is *exactly* the peak throughput I'd seen over the past year when the disk was running flat out (for backups and such). Nothing on the (virtual) hardware side of things has changed at all; just my doing a 'yum update all'.
The periods of bad performance come and go, once every half hour or more frequent. I think this *might* coincide with when the Firefox process kicks off (every 15 minutes) to do its PDF runs. The mariadb process would typically take 1.5GB - 2GB of virtual memory, Firefox roughly 1GB.
The Amazon performance graphs show that, while the maximum disk throughput limit is unaltered at 62.5MB/s, there's a lot more disk I/O going on these past two weeks than all of the months before.
At the moment I'm assuming that the system is simply swapping heavily, thrashing, at these times of poor performance. I'll discuss that further with my 'detailed' questions about my performance monitoring results below. Or it could be something that's just demanding more disk activity, and maybe swapping isn't actually the issue. Either way, I don't know why this happened immediately after a simple 'yum update' from Centos 7.2 to 7.4.1708. I don't know if a Linux kernel parameter has changed - swappiness? Disk readahead? Disk scheduling? Something else? Or if mariadb has changed its memory use somehow. The Firefox I'm running is an old version; that didn't change, but maybe linked libraries did?
So my general question is - does anyone know of any changes in memory use, any change in memory footprint, any kernel disk/memory changes, in going from Centos 7.2 (a couple of years old) to the current 7.4.1708? Has anyone had any problems like this? Any ideas?
I'd also appreciate any help in understanding the output of a couple of the Linux performance monitoring tools. When the system is slow everything points to the prime reason being disk activity - top says that 80% - 100% of the time is spent in I/O 'wait', iostat reports that the disk is 100% utilised. Here's the 60-second iostat output for a situation this morning when when the machine slowed to a crawl for just over a minute, at 11:56:
The system is spending 86% of its time in I/O wait, and the disk is 100%+ utilised. So the primary symptom of the problem is that of extreme disk activity. Which makes me automatically think that the system is heavily swapping, as I know from experience that Linux performance dives when it starts swapping badly.
Code: Select all
11/18/2017 11:56:04 AM avg-cpu: %user %nice %system %iowait %steal %idle 4.29 0.00 6.92 86.47 0.24 2.08 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 26.70 115.22 1140.21 7.54 64626.80 510.73 113.50 107.98 93.86 93.90 88.09 0.89 101.89 xvdap1 9.97 0.56 1125.92 3.42 64485.09 35.62 114.26 106.08 93.71 93.89 31.67 0.90 101.89 xvdap2 16.73 114.66 14.29 4.12 141.71 475.11 67.00 1.91 103.24 94.10 134.95 34.55 63.61
But note that the swap device itself - xvdap2 is barely being used. Almost all of the throughput is going through the sole root disk partition/mount. The *utilisation* of the swap device is high because its I/O is being queued with all of the other requests, but it only read & wrote 600KB/second.
vmstat output is consistent with that of iostat:
vmstat didn't run at exactly the same time as iostat for its 60-second sample so I include the two lines of output that showed the heavy system activity.
Code: Select all
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp----- r b swpd free buff cache si so bi bo in cs us sy id wa st AEST 0 10 1072284 114196 0 461412 64 9 33366 425 2034 764 8 2 53 37 0 2017-11-18 11:55:20 1 0 1153372 118272 0 523608 180 1398 53783 1534 2165 985 12 8 8 72 0 2017-11-18 11:56:20
Here the bi & bo reflect pretty much what iostat told us - throughput close to the 62.5MB/s limit of the disk volume (in other tests the bi/bo is practically identical to that reported by iostat). Wait time is near the 80% with the second line, which is where vmstat saw most of the performance hit.
And the si/so - swap in/out - is very low, again consistent with iostat.
So is the system swapping, or is there something else to blame for the periodic increase in heavy disk I/O and associated abrupt loss of system performance?
I have the output of one other system utility to present - I also have iotop running to see what processes are using up the disk. Here are the first few lines of the iotop output around this time, sorted on disk read througput:
My first 'what is Linux doing?' question is - is the system swapping or not?
Code: Select all
11:56:05 Total DISK READ : 49.51 M/s | Total DISK WRITE : 75.94 K/s 11:56:05 Actual DISK READ: 63.16 M/s | Actual DISK WRITE: 1384.80 K/s TIME PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 11:56:05 5406 be/4 psrv 28.45 M/s 62.89 K/s 0.00 % 3.63 % ./firefox/firefox http://pdf.php 11:56:05 5274 be/4 mysql 3.74 M/s 12.19 K/s 0.00 % 1.52 % mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/ru n/mariadb/mariadb.pid --socket=/var/lib/mysql/mysql.sock 11:56:05 13187 be/4 userxxxx 1908.05 K/s 0.00 B/s 0.00 % 20.44 % php-fpm: pool xxxxxxxx 11:56:05 1530 be/4 apache 1657.59 K/s 62.71 B/s 0.00 % 59.76 % httpd -DFOREGROUND
My second is "just what is iotop doing?".
Notice that the 'Total DISK READ' is 49MB/s while the 'Actual DISK READ' is 63MB/sec, the top limit of the disk's throughput. The man page for iotop says that these two figures "may not be equal at any given moment of time due to data caching and I/O operations reordering that take place inside Linux kernel".
During all of these periods of poor system performance the ACTUAL is normally right at the limit of 62MB/s whereas the TOTAL is usually much less, sometimes only a third, around 20MB/s.
The other thing to note is that the Firefox process is using up half of the disk's throughput - almost 30MB/s - but mariadb is using a measly 4MB/s, and nothing else is using the disk much at all.
I'll offer my thoughts on what's happen and then beg for help from the floor.
I think the system is swapping. But I think most of the data being shovelled on and off the disk is program *text* - the binary code - and not program *data*. So the I/O isn't going to the swap device at all, since program text isn't written to the swap device, as the binaries are always present on the root volume.
Is that reasonable? Are Firefox and mariadb's mysqld - the two big players here - pretty big, in terms of code size?
So I think that's why iostat shows that the actual swap device is only being lightly used. And I'm *guessing* that the si/so metrics of vmstat refer likewise to data being moved on and off the *swap device*, rather than reflecting actual kernel *swapping activity*. (The man page says 'swapped in/to disk', but doesn't say *what* disk.) Something I've never come across until now.
And I'm assuming that - in my iotop example above - there was 14MB/s of disk activity, the difference betwee the TOTAL and ACTUAL disk figures - and usually that difference in other observerations is more like 20MB/s - 30MB/s - that was swapping overhead ... to the root volume, not the swap device.
So, I would very much welcome any ideas. Obviously I'd love it if someone could say, well, yes, there is a well-know issue in going from Centos 7.2 to 7.4, Mariadb uses memory more aggressively, the kernel does more disk I/O, scheduler changed, something, to solve my primary problem:
Q1. Does anyone know of any changes going from Centos 7.2 to 7.4 that would cause a system to suddenly start incurring twice the amount of disk I/O and suffer severe intermittent performance degradation? (And is there a fix?)
And on the kernel side I'd appreciate any advice on the other four secondary questions:
Q2. Is my system swapping?
Q3. Does swapping program text get reported by vmstat, or is it 'invisible' to the vmstat si/so metrics? Is it reasonable to think that most of my swap activity is program text/code rather than data to the swap device?
Q4. Is swapping activity a well-known reason for the difference between the TOTAL and ACTUAL disk throughput reported by iotop?
Q5. Is there any other metric in Linux that shows swapping *activity* rather than disk I/O to the *swap device*? So I can know for sure whether Linux is madly swapping code/text pages to/from the root volume, even if vmstat and iostat say there is no activity with the swap device?
Thanks for any help!