Centos 7 network disappearing periodically

SpacedCowboy
Posts: 4
Joined: 2017/09/25 02:50:32
Location: San Jose, CA, USA

Post by SpacedCowboy » 2017/09/25 03:06:11

So I have a new server set up (Threadripper 1950X, Asus X399 ROG Zenith Extreme motherboard) and everything is great apart from the network interface.

This machine has a public IP address (x.x.x.x) and a private IP address (192.168.1.102). Trying to get NetworkManager to assign both of those to a single ethernet port was an exercise in frustration - I could either get one or the other, but not both IP addresses available at the same time. A lot of the time 'ifconfig -a' would show one, neither or both of the "configured" IP addresses, and there was no logic to whether the interface would work or not.

So, nix NetworkManager.

Code: Select all

[root@xanadu]# systemctl status NetworkManager
● NetworkManager.service - Network Manager
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:NetworkManager(8)
... and now the config files look like:

Code: Select all

[root@xanadu]# ls -1 /etc/sysconfig/network-scripts/ifcfg*
/etc/sysconfig/network-scripts/ifcfg-eth0    
/etc/sysconfig/network-scripts/ifcfg-eth0:1
/etc/sysconfig/network-scripts/ifcfg-lo

[root@xanadu]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth0
HWADDR=10:7B:44:90:34:7B
DEVICE=eth0
ONBOOT=yes
DNS1=192.168.1.254
DOMAIN=gornall.net
IPV6_PRIVACY=no
IPADDR=<public-address>
NETMASK=<public-netmask>
GATEWAY=<public-gateway>
DNS2=8.8.8.8
PREFIX=24

[root@xanadu]# cat /etc/sysconfig/network-scripts/ifcfg-eth0:1
TYPE=Alias
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth0
HWADDR=10:7B:44:90:34:7B
DEVICE=eth0
ONBOOT=yes
DNS1=192.168.1.254
DOMAIN=gornall.net
IPV6_PRIVACY=no
IPADDR=192.168.1.102
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS2=8.8.8.8
PREFIX=24
This works, after a fashion. The network comes up on boot and stays up for a while (I'm not sure how long, but it's on the order of an hour). Then nothing can connect from the outside. From the console, 'ip address' shows both the public and private IPs, but the machine can't ping out.

At this point, doing a 'systemctl restart network' will recover the network, and everything is sweetness and light again. At the moment my stopgap is to have cron run this every 5 minutes...
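A slightly less blunt version of that cron stopgap is sketched below: it restarts the network only when a ping to the gateway fails. This is an illustration rather than anything posted in the thread. The function name check_and_restart, the "netwatch" log tag and the default gateway address (taken from the alias config above) are all assumptions.

```shell
#!/bin/sh
# Sketch only: restart the legacy "network" service when the gateway stops answering.
# GATEWAY defaults to the private gateway from the config above; adjust as needed.
GATEWAY="${GATEWAY:-192.168.1.254}"

check_and_restart() {
    # Send three pings with a two-second reply timeout; any reply counts as "up".
    if ! ping -c 3 -W 2 "$GATEWAY" >/dev/null 2>&1; then
        logger -t netwatch "gateway $GATEWAY unreachable, restarting network"
        systemctl restart network
    fi
}
```

To use it, add a final line that calls check_and_restart, save the script somewhere like /usr/local/sbin/netwatch.sh (a hypothetical path), and point the cron entry at that instead of the bare restart. Like the original cron job, this only hides the symptom; it does nothing about the underlying cause.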

So, what is causing my network interface to drop? Is it some last vestige of NetworkManager that I don't know about? It also crossed my mind that it could be a cron job (but I can't find anything in the crontabs), or that power management is switching the interface off. Is there an easy way to turn off power management on CentOS 7? I couldn't see anything in the BIOS related to power management...

Any hints gratefully received :)

TrevorH
Forum Moderator
Posts: 24052
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Centos 7 network disappearing periodically

Post by TrevorH » 2017/09/25 13:19:52

You have far too many settings in your alias file. Only a very limited subset of the available parameters can safely be used in an alias file. Stick to DEVICE=eth0:0, IPADDR=, NETMASK=, and ONBOOT=yes.
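Following that advice, a pared-down alias file might look like the sketch below. The address and netmask are the private ones from the original post. The file name ifcfg-eth0:0 is an assumption, chosen so that it matches the DEVICE value (the file posted above was named ifcfg-eth0:1 but declared DEVICE=eth0).

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0:0  (sketch, not the poster's actual file)
DEVICE=eth0:0
IPADDR=192.168.1.102
NETMASK=255.255.255.0
ONBOOT=yes
```

Everything else (GATEWAY, DNS1, HWADDR and so on) would then live only in the parent ifcfg-eth0 file.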
CentOS 5 died in March 2017 - migrate NOW!
Full time Geek, part time moderator. Use the FAQ Luke

SpacedCowboy
Posts: 4
Joined: 2017/09/25 02:50:32
Location: San Jose, CA, USA

Re: Centos 7 network disappearing periodically

Post by SpacedCowboy » 2017/09/25 14:17:12

Ok, thanks :) I just copied the original one, wasn't sure what was supposed to be there. I'll give that a try tonight.

It could also be a symptom of something more fundamentally wrong. I noticed that 'dmesg' was reporting a lot of stuff, so I ran 'dmesg --clear' and about an hour later (that's just when I checked, I'm not sure it needs an hour), it's *full* of

Code: Select all

[39180.055463] WARNING: CPU: 22 PID: 60661 at drivers/iommu/amd_iommu.c:2445 dma_ops_domain_unmap.part.16+0x62/0x70
[39180.055465] Modules linked in: arc4 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bnep vfat fat snd_hda_codec_hdmi edac_mce_amd edac_core kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw eeepc_wmi gf128mul asus_wmi glue_helper ablk_helper sparse_keymap cryptd ath10k_pci ath10k_core snd_hda_codec_realtek snd_hda_codec_generic ath mac80211 snd_hda_intel
[39180.055501]  snd_hda_codec wil6210 btusb snd_hda_core btrtl btbcm btintel snd_hwdep bluetooth snd_seq cfg80211 pcspkr snd_seq_device snd_pcm sg rfkill snd_timer ccp snd soundcore i2c_piix4 shpchp i2c_designware_platform pinctrl_amd gpio_amdpt i2c_designware_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic nouveau video drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul igb crct10dif_common crc32c_intel ahci libahci mxm_wmi libata ptp serio_raw nvme pps_core dca nvme_core i2c_algo_bit atlantic(T) i2c_core wmi dm_mirror dm_region_hash dm_log dm_mod
[39180.055541] CPU: 22 PID: 60661 Comm: httpd Tainted: G        W      ------------ T 3.10.0-693.2.2.el7.x86_64 #1
[39180.055542] Hardware name: System manufacturer System Product Name/ROG ZENITH EXTREME, BIOS 0211 07/12/2017
[39180.055544]  0000000000000000 000000005ad0ae86 ffff88103cd83cc0 ffffffff816a3db1
[39180.055547]  ffff88103cd83d00 ffffffff810879c8 0000098d68ffc000 ffff8810388b2b78
[39180.055549]  0000000000000000 0000000000001090 0000000001f6f661 ffff881038f5a400
[39180.055551] Call Trace:
[39180.055552]  <IRQ>  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
[39180.055562]  [<ffffffff810879c8>] __warn+0xd8/0x100
[39180.055565]  [<ffffffff81087b0d>] warn_slowpath_null+0x1d/0x20
[39180.055568]  [<ffffffff8154e272>] dma_ops_domain_unmap.part.16+0x62/0x70
[39180.055570]  [<ffffffff8154f143>] __unmap_single.isra.22+0xa3/0x110
[39180.055573]  [<ffffffff8154f5a1>] unmap_page+0x51/0x70
[39180.055583]  [<ffffffffc0060f56>] aq_ring_tx_clean+0x46/0x100 [atlantic]
[39180.055591]  [<ffffffffc0060748>] aq_vec_poll+0x168/0x220 [atlantic]
[39180.055595]  [<ffffffff8158799d>] net_rx_action+0x16d/0x380
[39180.055598]  [<ffffffff81090b3f>] __do_softirq+0xef/0x280
[39180.055602]  [<ffffffff816b6a9c>] call_softirq+0x1c/0x30
[39180.055606]  [<ffffffff8102d3c5>] do_softirq+0x65/0xa0
[39180.055608]  [<ffffffff81090ec5>] irq_exit+0x105/0x110
[39180.055611]  [<ffffffff816b7636>] do_IRQ+0x56/0xe0
[39180.055614]  [<ffffffff816ac22d>] common_interrupt+0x6d/0x6d
[39180.055615]  <EOI>  [<ffffffff816b512b>] ? sysret_audit+0x17/0x21
[39180.055619] ---[ end trace 96b9b203d4458f03 ]---
[39352.515625] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000003aaa8d8 flags=0x0000]
[41044.540004] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000006a0a4cc flags=0x0000]
[41044.540009] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000006a0a4cc flags=0x0000]
... or variants like:

Code: Select all

[ 4556.829916] WARNING: CPU: 18 PID: 0 at drivers/iommu/amd_iommu.c:2445 dma_ops_domain_unmap.part.16+0x62/0x70
 (lots of these)
[ 4556.830472] CPU: 18 PID: 748 Comm: systemd-journal Tainted: G        W      ------------ T 3.10.0-693.2.2.el7.x86_64 #1
[ 4557.632782] CPU: 18 PID: 0 Comm: swapper/18 Tainted: G        W      ------------ T 3.10.0-693.2.2.el7.x86_64 #1
[39180.055541] CPU: 22 PID: 60661 Comm: httpd Tainted: G        W      ------------ T 3.10.0-693.2.2.el7.x86_64 #1
I noticed the BIOS is version 0211, and there have been two more releases since then, so I'm also going to try flashing the BIOS this evening. These are new boards, so a certain amount of turmoil and churn is inevitable. Generally I don't like updating the BIOS on $500 motherboards, but in this case I think it's warranted...

It's kind of weird to me that the box can have all these issues and still carry on. Back when I were a lad, walking up hill both to and from work at the coal mine in the sleeting rain, kernel and/or device driver issues crashed the machine... [grin]

Anyway, thanks for the help - I'll make the changes and post back.

TrevorH
Forum Moderator
Posts: 24052
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Centos 7 network disappearing periodically

Post by TrevorH » 2017/09/25 15:34:29

I'm not sure how stable Ryzen support is yet. I was reading the upstream kernel release notes just now and noticed that even in the $latest mainline kernel, fixes are still being made for Ryzen support.

SpacedCowboy
Posts: 4
Joined: 2017/09/25 02:50:32
Location: San Jose, CA, USA

Re: Centos 7 network disappearing periodically

Post by SpacedCowboy » 2017/09/25 17:30:32

Well... I was under the impression that it actually was stable...
... but it's not totally surprising if they've found more issues - my own fault for not properly doing my homework. Not sure my wife will be quite as sanguine as I am though [grin].

When the machine is working, it's a complete beast, I just want that "warm fuzzy" feeling that there's nothing going wrong with it behind the scenes. I hate kludges...

Here's hoping that the BIOS update and the fixes you identified in the network config will help :)

SpacedCowboy
Posts: 4
Joined: 2017/09/25 02:50:32
Location: San Jose, CA, USA

Re: Centos 7 network disappearing periodically

Post by SpacedCowboy » 2017/09/26 01:43:47

Ok, so two things:
  • Installing the new BIOS seems to have fixed the kernel-messages issue. It's been up for over an hour now and the dmesg output is unchanged (i.e. there have been no messages in that time, whereas I was getting a good few dozen in the same time frame before).
  • Editing the ifcfg-eth0:0 file as suggested has made the Ethernet rock-solid as well. There's no need for the cron job, and for the first time I'm seeing devices eth0 and eth0:0 separately in the output of ifconfig.
So, all's well that ends well. Many many thanks for the help on the ethernet configuration :)

[Edit] Well, maybe not. I came back from my evening meal and the network wasn't responsive (pinging from my Mac reported "host is down"). Logging in at the physical console worked fine, and I could run 'service network restart', which restored network functionality.

I'd previously done a 'dmesg --clear', and on regaining network access, typing 'dmesg' returns:

Code: Select all

[ 6074.448784] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000003cf0ccc flags=0x0000]
[ 6312.577996] fuse init (API version 7.22)
[ 6317.870908] Bluetooth: RFCOMM TTY layer initialized
[ 6317.870916] Bluetooth: RFCOMM socket layer initialized
[ 6317.870922] Bluetooth: RFCOMM ver 1.11
[ 6380.164841] TCP: lp registered
(The 'TCP: lp registered' message only appeared after I typed 'service network restart'; the others were there before.)

I've seen a few suggestions for using 'iommu=amd' or 'iommu=soft' as kernel arguments, which might solve the 'AMD-Vi' thing. As for the networking, is it possible the Bluetooth stack is somehow interfering with my network? I don't need or want Bluetooth on this machine (it's a server), so I've disabled it in case that's the issue.
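For anyone wanting to experiment with the 'iommu=soft' suggestion, one way to add a kernel argument on CentOS 7 is grubby. The sketch below is only an illustration under assumptions: both helper names are made up, and whether 'iommu=soft' actually cures the AMD-Vi faults on this board is not something the thread establishes.

```shell
#!/bin/sh
# has_kernel_arg ARG CMDLINE: succeed if ARG appears as a whole word in CMDLINE.
has_kernel_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# add_iommu_soft: append iommu=soft to the boot entry of every installed kernel.
# Hypothetical helper, needs root; it is defined here but not called.
add_iommu_soft() {
    grubby --update-kernel=ALL --args="iommu=soft"
}
```

After running add_iommu_soft and rebooting, something like has_kernel_arg iommu=soft "$(cat /proc/cmdline)" && echo active confirms that the argument took effect. The change can be undone with grubby --update-kernel=ALL --remove-args="iommu=soft".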

So in the time it took to type this, the network dropped out again, and in 'dmesg' I see two new lines:

Code: Select all

[ 6674.495456] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000006b8b8cc flags=0x0000]
[ 6674.495462] AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x000e address=0x0000000006b8b8cc flags=0x0000]
Interestingly enough PCIe device 07:00.0 is the 10G ethernet controller that I'm currently using. One more thing I can try is to downgrade to the 1G ethernet controller that's onboard. Generally I like having the network more capable than the internet pipe, but beggars can't be choosers :)
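Matching a fault address to a device can be done mechanically. The sketch below (the function name count_page_faults is an assumption) reads kernel log text on stdin and prints each PCI device that logged an IO_PAGE_FAULT, followed by how many times it did so.

```shell
#!/bin/sh
# count_page_faults: read kernel log lines on stdin and print "<device> <count>"
# for every PCI device that appears in an AMD-Vi IO_PAGE_FAULT event.
count_page_faults() {
    sed -n 's/.*IO_PAGE_FAULT device=\([0-9a-fA-F:.]*\).*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Typical use would be dmesg | count_page_faults, followed by lspci -s 07:00.0 (or whichever address is reported) to see which piece of hardware sits at that address.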

And after a bit of googling, it seems the drivers for the 10G card were only upstreamed in Linux 4.11, so this could be why the Ethernet keeps dropping. Running on the 1G port seems OK so far (only tested for 50 minutes, though; we'll see if it gets through the night...)

rklrkl
Posts: 71
Joined: 2005/10/22 22:06:04
Location: U.K.

Re: Centos 7 network disappearing periodically

Post by rklrkl » 2017/12/29 23:05:15

I'm running CentOS 7.4 fine on a Ryzen 7 1700 (SM961 NVMe drive, Intel SSD, some SATA 3 HDDs, 64GB RAM). My Asus Prime X370 Pro motherboard has a ridiculous number of BIOS updates (15 at the last count!) and I religiously apply those of course because they do help a lot with such a new platform. It's great that Asus let you update the BIOS from within the BIOS, which is something that should have been available many years ago (instead of either DOS boot floppies with a DOS BIOS exe or Windows-only BIOS updaters).

The first step is to try a live 7.4 ISO and see if your machine will boot from it - if it can (and hopefully see the network at least for a brief time!), install it onto your drive and then I would highly recommend upgrading the kernel to the latest stable release from ELRepo at http://elrepo.org/ - they build fresh kernels using CentOS 7's kernel config within hours of the latest stable release coming out.

You'll need the kernel-ml and kernel-ml-devel packages from ELRepo (as for the old 3.X kernel packages, I left them all installed except the kernel package itself, which I deleted once I had successfully booted into the ELRepo 4.X kernel). The reason for the kernel upgrade is that it has pretty much all the Ryzen/Threadripper CPU/motherboard support, unlike the one that ships with CentOS 7. I can now even monitor the CPU temp using the MATE Sensors Applet, which I definitely couldn't do with the 3.X kernel.
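As a rough sketch of that upgrade, the snippet below first checks whether the running kernel is already new enough, then offers a helper that installs the ELRepo packages. Both function names are made up for illustration. The 4.11 threshold in the usage note is the version mentioned earlier in the thread for the 10G driver. The install step assumes the elrepo-release package has already been set up as described at http://elrepo.org/.

```shell
#!/bin/sh
# kernel_at_least MIN RELEASE: succeed if RELEASE (e.g. from `uname -r`) is >= MIN.
# Relies on GNU sort's -V (version sort) option.
kernel_at_least() {
    running=${2%%-*}   # drop the "-693.2.2.el7.x86_64" style suffix
    lowest=$(printf '%s\n%s\n' "$1" "$running" | sort -V | head -n 1)
    [ "$lowest" = "$1" ]
}

# install_kernel_ml: pull the mainline kernel packages from ELRepo (needs root).
# Hypothetical helper; it is defined here but not called.
install_kernel_ml() {
    yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
}
```

A typical invocation would be kernel_at_least 4.11 "$(uname -r)" || install_kernel_ml. The new kernel still has to be selected at boot (or made the GRUB default) before it takes effect.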
