Dropping Connections
Posted: 2016/05/27 19:43:26
We're having a strange issue on one of our networks used for website hosting and email services. We have a number of XenServer hosts, each running any number of CentOS VMs. Within each VM there may be one or more public IP addresses, configured as alias interfaces eth0:1 through eth0:xx.
Up until recently these were all working normally. A few days ago we started seeing SSH connections drop randomly. At first it was intermittent, and then on some VMs it became so frequent that we couldn't do anything. It also seemed to affect Apache connections, as the websites would suddenly go offline. The only way to restore service was to run a small script that runs ifup eth0:xx for each virtual interface as well as for the base eth0 interface.
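Our actual script isn't reproduced here, but a minimal sketch of that kind of workaround might look like the following. It assumes the standard CentOS layout where each alias has its own ifcfg-eth0:N file under /etc/sysconfig/network-scripts, and that the base interface is brought up before the aliases; the function names and the optional repeat interval are hypothetical, for illustration only.

```python
import os
import re
import subprocess
import sys
import time

# CentOS keeps one ifcfg-<name> file per interface in this directory
# (assumed location; adjust if your layout differs).
SCRIPTS_DIR = "/etc/sysconfig/network-scripts"


def interfaces_from_filenames(filenames, base="eth0"):
    """Return the base interface followed by its numbered aliases
    (eth0:1, eth0:2, ...) in numeric order, from ifcfg-* file names."""
    pattern = re.compile(r"^ifcfg-" + re.escape(base) + r":(\d+)$")
    aliases = []
    for name in filenames:
        m = pattern.match(name)
        if m:
            aliases.append((int(m.group(1)), f"{base}:{m.group(1)}"))
    return [base] + [iface for _, iface in sorted(aliases)]


def bring_up(interfaces, dry_run=False):
    """Run `ifup` on each interface in order and return the commands used.
    With dry_run=True nothing is executed."""
    commands = [["ifup", iface] for iface in interfaces]
    if not dry_run:
        for cmd in commands:
            # check=False: keep going even if one interface fails to come up
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Optional first argument: repeat every N seconds (0 = run once),
    # mirroring the stopgap of re-running ifup every few seconds.
    interval = float(sys.argv[1]) if len(sys.argv) > 1 else 0
    while True:
        bring_up(interfaces_from_filenames(os.listdir(SCRIPTS_DIR)))
        if interval <= 0:
            break
        time.sleep(interval)
```

Sorting aliases numerically rather than as strings keeps eth0:10 after eth0:2. This only re-applies the configuration; it doesn't explain why the addresses stop answering in the first place.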
Having seen this before, I suspected an IP conflict; however, extensive searching has not turned up any duplicate IPs. Next we searched for duplicate MAC addresses. We didn't find any of those either, but to be safe we manually changed some of the MAC addresses anyway. Fast forward a few days, and whatever is causing this problem seems to be getting worse. It's now affecting about 20% of the CentOS 5 VMs on the network. If we run ifup on the virtual interfaces every few seconds, the sites and SSH connections stay up most of the time.
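For anyone wanting to repeat the duplicate-IP check, here is a rough sketch of one way to do it. It assumes you have collected Linux `arp -an` output (lines like `? (203.0.113.10) at 00:16:3e:12:34:56 [ether] on eth0`) from several machines or at several points in time; the collection step and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Matches Linux `arp -an` lines such as:
#   ? (203.0.113.10) at 00:16:3e:12:34:56 [ether] on eth0
# Entries shown as <incomplete> have no MAC and are skipped.
ARP_LINE = re.compile(
    r"\((\d{1,3}(?:\.\d{1,3}){3})\) at ([0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})"
)


def find_ip_conflicts(lines):
    """Return {ip: sorted list of MACs} for every IP seen with more than one MAC."""
    seen = defaultdict(set)
    for line in lines:
        m = ARP_LINE.search(line)
        if m:
            ip, mac = m.group(1), m.group(2).lower()
            seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

A single ARP cache holds only one MAC per IP at a time, so a conflict only shows up once snapshots taken over time or from different hosts are merged. The reverse case, one MAC answering for several IPs, is expected on these VMs because of the eth0:N aliases, so the sketch doesn't flag it.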
I've been checking logs and running arpwatch, but so far I haven't found any clues pointing to the cause. We've cold booted some of the VMs, the host machines, the main switch, and an upstream switch. Loads and network traffic have been within normal ranges. We haven't found any strange logins or other unusual activity so far. Has anyone else seen something like this?