Dropping Connections

rotorboy
Posts: 27
Joined: 2005/03/03 23:02:04

Dropping Connections

Post by rotorboy » 2016/05/27 19:43:26

We're having a strange issue in one of our networks used for website hosting and email services. We have a number of XenServer hosts, each with any number of CentOS VMs. Within the VMs there may be one or more public IP addresses configured as aliases eth0:1 through eth0:xx.
Up until recently these were all working normally. A few days ago we started seeing SSH connections dropping randomly. At first it was intermittent, and then on some VMs it became frequent to the point that we couldn't do anything. This also seemed to be affecting the Apache connections, as the websites would suddenly go offline. The only way to restore service was to run a small script that goes through and does an ifup eth0:xx for each virtual interface as well as the base eth0 interface.
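
For readers who want to reproduce that workaround, here is a minimal sketch of such a script. It is not the poster's actual script: the function names, the alias numbers, and the choice to bring up the base interface first are all assumptions, and it needs Python 3.6 or later, which is not the stock interpreter on CentOS 5.

```python
import subprocess


def build_ifup_commands(base="eth0", alias_numbers=(1, 2, 3)):
    """Return one ifup argv list for the base interface and each alias.

    The alias numbers are hypothetical; substitute the ones configured
    on the VM in question.
    """
    names = [base] + [f"{base}:{n}" for n in alias_numbers]
    return [["ifup", name] for name in names]


def bring_up_all(base="eth0", alias_numbers=(1, 2, 3), dry_run=True):
    """Run one ifup per interface, or only print the commands if dry_run."""
    results = []
    for argv in build_ifup_commands(base, alias_numbers):
        if dry_run:
            print(" ".join(argv))
            results.append((argv[1], None))
        else:
            # Needs root; the return code is recorded rather than raised
            # so one failing alias does not stop the rest.
            proc = subprocess.run(argv)
            results.append((argv[1], proc.returncode))
    return results


if __name__ == "__main__":
    bring_up_all("eth0", range(1, 4), dry_run=True)
```

With `dry_run=True` nothing is changed on the system, which makes it easy to check the interface list before running it for real.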

Having seen this before, I thought there was an IP conflict; however, extensive searching has not turned up any duplicate IPs. Next we searched for duplicate MAC addresses. We didn't find any of these either, but to be safe we went ahead and changed some of the MAC addresses manually. A few days on, whatever is causing this problem seems to be getting worse. It's affecting about 20% of the CentOS 5 VMs on the network. If we run the ifup on the virtual interfaces every few seconds, the sites and SSH connections stay up most of the time.

I've been checking logs and running arpwatch, but so far I'm not finding any clues that point to the cause. We've cold booted some of the VMs, the machines, the main switch, and an upstream switch. Loads and network traffic have been within normal ranges. No strange logins or other activity that we have found so far. Has anyone else seen something like this?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Dropping Connections

Post by aks » 2016/05/28 17:13:22

Hmm, a bit of a mystery.

So let's be clear: is the interface going into the "down" state "all by itself"? Otherwise you'd have to ifdown and then ifup the interface.

If that's the case, Google link flap and link flap prevention. Also check CPU utilization.

TrevorH
Site Admin
Posts: 33191
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Dropping Connections

Post by TrevorH » 2016/05/28 19:26:32

Mostly it sounds like duplicate IP addresses. If you're hosting other people, then maybe they are just adding an extra IP that they shouldn't be using?
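
One way to test the duplicate-IP theory directly is an ARP duplicate address detection probe. The sketch below wraps the iputils `arping -D` mode, which exits with status 0 when no other host replies. The helper names, the interface, and the address 192.0.2.10 (a documentation-range placeholder) are assumptions, and the probe needs root on a host in the same broadcast segment.

```python
import subprocess


def build_dad_probe(ip, iface="eth0", count=2):
    """Return the argv for an iputils arping duplicate-address probe."""
    return ["arping", "-D", "-I", iface, "-c", str(count), ip]


def duplicate_suspected(ip, iface="eth0", count=2):
    """Return True if the probe exits nonzero.

    In -D mode arping exits 0 when no replies are received. A nonzero
    status means either that some host answered for the address or that
    the probe itself failed, so treat True as "look closer", not proof.
    """
    proc = subprocess.run(
        build_dad_probe(ip, iface, count),
        stdout=subprocess.DEVNULL,
    )
    return proc.returncode != 0


if __name__ == "__main__":
    # Hypothetical address; replace with one of the affected public IPs.
    print(" ".join(build_dad_probe("192.0.2.10")))
```

Running the probe while the address is deliberately taken down on its rightful VM is the cleanest check, since any reply then has to come from somewhere else.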

rotorboy
Posts: 27
Joined: 2005/03/03 23:02:04

Re: Dropping Connections

Post by rotorboy » 2016/05/28 20:08:20

I'll do some googling on link flap etc. I was using different terms so maybe that'll get me what I need.

There's nothing in the logs to indicate the interface has been brought down or disconnected. The connection simply stops responding to some or all IPs. It seems exactly like what you'd get if an IP is being brought up in duplicate across a different system or possibly even if another system fires up with a duplicate MAC address, except I'm not finding any evidence that anyone else is trying to bring up my IPs and I've tried changing MAC addresses to no avail.

Over the last 18 hours or so arpwatch has reported the correct MAC and IP for most of the IPs in the block. So far it has reported only one case where an IP moved between two MAC addresses. I thought I had found something there until I realized that the second MAC belonged to eth1 on the same machine. I'm not sure why the server had the IP on eth0 and then switched it to eth1 for a little while, but I don't think that was anything more than an unrelated blip. The IP didn't actually go down during that time.
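
The check described here, an IP showing up behind more than one MAC, can be sketched as a small helper over (IP, MAC) observations. How those pairs are collected (arpwatch reports, periodic neighbor-table snapshots) is left out because the formats vary; the function names and sample addresses below are hypothetical. As a side note, Linux by default may answer ARP for a local address on any interface attached to the same segment, which is one possible explanation for the eth0/eth1 case.

```python
from collections import defaultdict


def find_flip_flops(observations):
    """Return {ip: [macs]} for every IP seen with more than one MAC.

    observations is an iterable of (ip, mac) pairs in the order seen.
    MACs are lowercased and kept in first-seen order.
    """
    seen = defaultdict(list)
    for ip, mac in observations:
        mac = mac.lower()
        if mac not in seen[ip]:
            seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


def all_on_same_host(macs, host_macs):
    """True if every MAC belongs to one host's known NICs (e.g. eth0, eth1)."""
    known = {m.lower() for m in host_macs}
    return all(m.lower() in known for m in macs)


if __name__ == "__main__":
    sample = [
        ("192.0.2.10", "AA:BB:CC:00:00:01"),
        ("192.0.2.11", "aa:bb:cc:00:00:02"),
        ("192.0.2.10", "aa:bb:cc:00:00:03"),
    ]
    for ip, macs in find_flip_flops(sample).items():
        print(ip, macs)
```

A flip-flop where `all_on_same_host` is true points at the machine's own interfaces rather than at a conflicting system elsewhere on the network.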

I'm in the data centre now and will try swapping switches to see if maybe the switch is messed up. If the switch has a memory problem or something else is wrong with it, I'm thinking that whenever we run the ifup eth0:xx the server advertises the MAC and the route gets re-established, at least temporarily. I might just be grasping at straws.
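
To make the "advertises the MAC" idea concrete: the usual mechanism is a gratuitous ARP, a broadcast ARP packet in which the sender and target IP are the same address, which lets switches and neighbors refresh their tables. My understanding is that the CentOS ifup scripts send such an announcement through arping after configuring an address, but treat that as an assumption. The sketch below only builds the 42-byte frame so its layout can be inspected; the function names and sample values are hypothetical, and actually sending it would need a raw socket and root.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    src_mac = mac_to_bytes(mac)
    src_ip = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC, EtherType ARP.
    eth_header = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, 6-byte MACs, 4-byte IPs, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + src_ip      # sender hardware and protocol address
    arp += b"\x00" * 6 + src_ip  # target MAC unset, target IP == sender IP
    return eth_header + arp


if __name__ == "__main__":
    frame = build_gratuitous_arp("aa:bb:cc:00:00:01", "192.0.2.10")
    print(len(frame), frame.hex())
```

If re-announcing the address this way restores connectivity each time, that suggests some device's MAC or ARP table keeps losing or overwriting the entry, which fits both the faulty-switch and the duplicate-address theories.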

Thanks for the help so far.
