CentOS 7.3 Installation (Anaconda) failure and SSD on NVMe

cwahlgren
Posts: 12
Joined: 2017/03/03 09:06:03

Post by cwahlgren » 2017/04/17 16:59:29

Hi,

This is sort of a continuation of an older topic (viewtopic.php?f=49&t=61593&p=259750), where I tried to install CentOS 7.3 on a new Intel NUC-like box from GIGABYTE that has an SSD on an NVMe interface. I'm now trying to get to the bottom of whether something is wrong with CentOS 7/Fedora 25 or with this little box.

Hardware:
- Gigabyte GB-BKi5A-7200 with Intel i5 7200U (7th gen)
- Crucial 32GB 2133MHz DDR4, non-ECC, CT2K16G4SFD8213
- Intel 512GB M.2 PCI Express 3.0 x4 (NVMe), SSDPEKKW512G7X1

Works:
- Memtest86+ 5.01 says 0 Errors after 3 completed test rounds (> 9 hours).
- Ubuntu 16.04.1: Installed server with 5 VMs running for several days without any problem.
- CentOS-7.3 (x86_64-Everything-1702): Initial Minimal installation with EXT4 everywhere. Updated the OS, then installed the group "Server with GUI", then installed and ran oVirt with some VMs for weeks without any problem.

Fails:
- Fedora 25: Installation from Fedora 25 Live ISO hangs after reboot with XFS errors (see other topic).
- CentOS 7.3: Any installation type with XFS and/or more packages than Minimal usually causes a sudden freeze during package installation, or kernel:nvme errors, as seen in the latest installation (see logs below). The latest failures surprised me, since I had succeeded earlier with a Minimal installation on EXT4.


Looking at the syslog (see below) from the latest failed installations, it looks like a hardware error (to non-professional eyes). But if I eventually succeed in installing CentOS, or Ubuntu as earlier, the box runs flawlessly for days without any problems. Is there a bug in the Anaconda kernel in CentOS 7.3 (and Fedora 25)? I've searched the internet for the messages below and only found some similar references to "kernel:nvme ... reset controller" in kernel archives from 2014, plus other references saying that all major distributions have supported NVMe for a while. Also, for the earlier installations described in the other topic I thought it was XFS that caused these issues, but it seems I still get other kinds of errors with EXT4 as well; the kernel might just handle the nvme errors differently with XFS (i.e. freeze).

- Should I trust this box (defective hardware)?
- Should I avoid CentOS/Fedora?
- Is there something else to try (installation options)?
- Why does it seem I'm alone with this problem? Or is this still pretty new hardware that hasn't been tested enough?
When I contacted Gigabyte, they said this box is mainly aimed at Windows 10 usage.

BR,
Christian

syslog:
---
...
09:54:42,655 WARNING kernel:[ 330.638786] nvme 0000:03:00.0: Failed status: 0xfee00398, reset controller.
09:54:42,656 WARNING kernel:nvme 0000:03:00.0: Failed status: 0xfee00398, reset controller.
09:54:43,557 INFO kernel:[ 331.746218] nvme 0000:03:00.0: Refused to change power state, currently in D3
09:54:43,908 INFO kernel:nvme 0000:03:00.0: Refused to change power state, currently in D3
09:54:43,908 WARNING kernel:nvme 0000:03:00.0: Removing after probe failure status: -19
09:54:43,908 INFO kernel:nvme0n1: detected capacity change from 512110190592 to 0
09:54:43,908 ERR kernel:blk_update_request: I/O error, dev nvme0n1, sector 233234688
09:54:43,908 ERR kernel:blk_update_request: I/O error, dev nvme0n1, sector 23892048
09:54:43,908 ERR kernel:blk_update_request: I/O error, dev nvme0n1, sector 233234944
...
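
For anyone comparing logs, here is a minimal Python sketch for summarizing the NVMe events in an Anaconda syslog like the excerpt above. The regular expressions are assumptions written against these sample lines only (other kernels may word the messages differently), and the function name summarize_nvme_events is hypothetical.

```python
import re

# Patterns modelled on the sample lines in this post; they are assumptions,
# not an exhaustive list of messages the nvme driver can print.
RESET_RE = re.compile(r"nvme (\S+): Failed status: (0x[0-9a-fA-F]+), reset controller")
REMOVED_RE = re.compile(r"nvme (\S+): Removing after probe failure status: (-?\d+)")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def summarize_nvme_events(lines):
    """Collect controller resets, device removals and failed sectors.

    Sets are used because the installer log repeats some kernel messages
    (once with and once without the [uptime] prefix).
    """
    resets, removals, bad_sectors = set(), set(), set()
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            resets.add((m.group(1), m.group(2)))
        m = REMOVED_RE.search(line)
        if m:
            removals.add((m.group(1), int(m.group(2))))
        m = IO_ERROR_RE.search(line)
        if m:
            bad_sectors.add((m.group(1), int(m.group(2))))
    return {
        "resets": sorted(resets),
        "removals": sorted(removals),
        "bad_sectors": sorted(bad_sectors),
    }
```

It could be fed the lines of a saved syslog file, e.g. summarize_nvme_events(open(path)), where path is wherever the log was copied to.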

lspci -v (just before installation; when the above error occurs, device 03:00.0 disappears):
---
00:00.0 Host bridge: Intel Corporation Device 5904 (rev 02)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, fast devsel, latency 0
Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:02.0 VGA compatible controller: Intel Corporation Device 5916 (rev 02) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, fast devsel, latency 0, IRQ 11
Memory at de000000 (64-bit, non-prefetchable)
Memory at c0000000 (64-bit, prefetchable)
I/O ports at f000
Expansion ROM at <unassigned> [disabled]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [ac] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [100] Process Address Space ID (PASID)
Capabilities: [200] Address Translation Service (ATS)
Capabilities: [300] Page Request Interface (PRI)
Kernel modules: i915

00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21) (prog-if 30 [XHCI])
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, medium devsel, latency 0, IRQ 125
Memory at df330000 (64-bit, non-prefetchable)
Capabilities: [70] Power Management version 2
Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
Kernel driver in use: xhci_hcd

00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: fast devsel, IRQ 11
Memory at df34e000 (64-bit, non-prefetchable)
Capabilities: [50] Power Management version 3
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-

00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, fast devsel, latency 0, IRQ 137
Memory at df34d000 (64-bit, non-prefetchable)
Capabilities: [50] Power Management version 3
Capabilities: [8c] MSI: Enable+ Count=1/1 Maskable- 64bit+
Kernel driver in use: mei_me
Kernel modules: mei_me

00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21) (prog-if 01 [AHCI 1.0])
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 133
Memory at df348000 (32-bit, non-prefetchable)
Memory at df34c000 (32-bit, non-prefetchable)
I/O ports at f090
I/O ports at f080
I/O ports at f060 [size=32]
Memory at df34b000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [70] Power Management version 3
Capabilities: [a8] SATA HBA v1.0
Kernel driver in use: ahci
Kernel modules: ahci

00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 122
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Memory behind bridge: df200000-df2fffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Device 1000
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Access Control Services
Capabilities: [220] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 123
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: df100000-df1fffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Device 1000
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Access Control Services
Capabilities: [220] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 124
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
Memory behind bridge: df000000-df0fffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Device 1000
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Access Control Services
Capabilities: [220] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1f.0 ISA bridge: Intel Corporation Device 9d58 (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, medium devsel, latency 0

00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, fast devsel, latency 0
Memory at df344000 (32-bit, non-prefetchable) [size=16K]

00:1f.3 Audio device: Intel Corporation Device 9d71 (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device fa55
Flags: bus master, fast devsel, latency 32, IRQ 11
Memory at df340000 (64-bit, non-prefetchable) [size=16K]
Memory at df320000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 3
Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit+
Kernel modules: snd_hda_intel

00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: medium devsel, IRQ 16
Memory at df34a000 (64-bit, non-prefetchable) [size=256]
I/O ports at f040 [size=32]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection I219-LM (rev 21)
Subsystem: Gigabyte Technology Co., Ltd Device e000
Flags: bus master, fast devsel, latency 0, IRQ 132
Memory at df300000 (32-bit, non-prefetchable) [size=128K]
Capabilities: [c8] Power Management version 3
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [e0] PCI Advanced Features
Kernel driver in use: e1000e
Kernel modules: e1000e

01:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller (prog-if 30 [XHCI])
Subsystem: Gigabyte Technology Co., Ltd Device 1000
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at df200000 (64-bit, non-prefetchable) [size=32K]
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [68] MSI-X: Enable+ Count=8 Masked-
Capabilities: [78] Power Management version 3
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [200] Advanced Error Reporting
Capabilities: [280] #19
Capabilities: [300] Latency Tolerance Reporting
Kernel driver in use: xhci_hcd

02:00.0 Network controller: Intel Corporation Device 24fb (rev 10)
Subsystem: Intel Corporation Device 2110
Flags: bus master, fast devsel, latency 0, IRQ 138
Memory at df100000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [c8] Power Management version 3
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [40] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 30-e3-7a-ff-ff-8f-52-55
Capabilities: [14c] Latency Tolerance Reporting
Capabilities: [154] L1 PM Substates
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi

03:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03) (prog-if 02 [NVM Express])
Subsystem: Intel Corporation Device 390a
Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
Memory at df000000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [158] #19
Capabilities: [178] Latency Tolerance Reporting
Capabilities: [180] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
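
To check whether the NVMe controller is still on the bus after a failure, the lspci output from before and after could be compared with a small Python sketch like the one below. It assumes the default "bus:device.function class: description" header lines shown above (no PCI domain prefix); the function names list_pci_devices and nvme_slots are hypothetical.

```python
import re

# Header line of an lspci entry, e.g.
# "03:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 ..."
# Assumes output without the domain prefix (i.e. not "lspci -D").
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ([^:]+): (.+)$")


def list_pci_devices(lspci_text):
    """Map each PCI slot to a (device class, description) tuple."""
    devices = {}
    for line in lspci_text.splitlines():
        m = DEVICE_RE.match(line.strip())
        if m:
            devices[m.group(1)] = (m.group(2), m.group(3))
    return devices


def nvme_slots(devices):
    """Return the slots whose class is 'Non-Volatile memory controller'."""
    return sorted(
        slot for slot, (dev_class, _) in devices.items()
        if dev_class == "Non-Volatile memory controller"
    )
```

If nvme_slots() returns ['03:00.0'] before the installation and an empty list afterwards, the controller has dropped off the bus as described above.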
Last edited by cwahlgren on 2017/04/17 17:47:57, edited 1 time in total.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Post by hunter86_bg » 2017/04/17 17:28:43

Could you try installing "Server with GUI" for the same minor version with XFS and then try to "yum upgrade" to the latest version?

Maybe it's just some issue with XFS file system and anaconda, which sounds like a bug.

cwahlgren
Posts: 12
Joined: 2017/03/03 09:06:03

Post by cwahlgren » 2017/04/17 17:46:42

hunter86_bg wrote: Could you try installing "Server with GUI" for the same minor version with XFS and then try to "yum upgrade" to the latest version?

Maybe it's just some issue with XFS file system and anaconda, which sounds like a bug.
I've tried this several times. Anaconda just freezes during package installation (at about 500 of 1200 packages). My guess, as I wrote, is that the probable "nvme" bug hits harder when XFS is involved, since when it fails with EXT4 I get an error dialog and can still access the other consoles (and get logs). And, if I remember correctly, as soon as I've been able to upgrade to a later kernel I can install the rest. I think I've already tried to install "Server with GUI" after the first reboot, and that also failed.

I was just able to install with a Minimal installation in the same way as I described earlier: Minimal with EXT4, reboot, update OS, reboot, install "Server with GUI". It has now been running for an hour or so without any errors in /var/log/messages (but the exact same procedure has also failed sometimes). I've used one 410GB partition for /data, which I will now reformat to XFS to at least confirm that the problem isn't specific to XFS itself on NVMe.

tsol
Posts: 9
Joined: 2017/04/19 20:33:01

Post by tsol » 2017/04/22 04:28:21

Did you try disabling the Virtualization I/O setting in the BIOS for the install?

cwahlgren
Posts: 12
Joined: 2017/03/03 09:06:03

Post by cwahlgren » 2017/09/14 16:09:32

Hi,

Just an update on this thread. I have the same issue with the brand new CentOS-7.4.1708 and Anaconda: it either freezes totally or halts during package installation (this time at ~500 of ~1200 packages) due to an NVMe write error when I have selected XFS as the root file system. That is at least similar to the behaviour I saw with C-7.3 and with the then-current Fedora in April. As I think I mentioned earlier, I could install and run Ubuntu (latest version in April) with XFS without any issues at all. Now I'm running C-7.4 with EXT4 without any problems.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Post by TrevorH » 2017/09/14 16:12:49

There's a bug report on bugs.centos.org and upstream about this, and the blame is put firmly on the firmware of the Intel 600p SSD. Are you running the latest firmware?
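
One way to see which firmware the drive reports is to read the controller's sysfs attributes. The Python sketch below is only an illustration: it assumes the running kernel exposes "model" and "firmware_rev" under /sys/class/nvme/<controller>/ (older kernels, possibly including the installer's, may not), and the function name read_nvme_identity is hypothetical.

```python
from pathlib import Path


def read_nvme_identity(ctrl="nvme0", sysfs_root="/sys/class/nvme"):
    """Return the model and firmware revision strings of an NVMe controller.

    Assumption: the kernel exposes these attributes in sysfs. A value of
    None means the attribute file was not found.
    """
    base = Path(sysfs_root) / ctrl
    info = {}
    for attr in ("model", "firmware_rev"):
        path = base / attr
        info[attr] = path.read_text().strip() if path.is_file() else None
    return info
```

The reported firmware_rev can then be compared by hand with whatever version Intel lists as current for the drive.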