SMART Monitoring for NVME Storage

Issues related to hardware problems
Post Reply
dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

SMART Monitoring for NVME Storage

Post by dcrdev » 2017/08/07 13:30:35

Just built my new storage server - it uses a Samsung p951 NVMe drive for the OS. Something I didn't realise, having gone with CentOS over Fedora Server this time round, is that the version of smartmontools in RHEL 7 doesn't actually support NVMe drives: smartctl just prompts you to manually specify a device type and then fails.

NVMe support was introduced in smartmontools 6.5, whereas RHEL ships 6.3 - I saw this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1418868 asking for a rebase, but it was closed as a duplicate of another bug I don't appear to have permission to view. Additionally, the version of smartmontools shipping with RHEL 7.4 hasn't been rebased, nor has NVMe support been backported.

This is a bit problematic for me, as I need regular monitoring of the drive and notification via email if something is wrong. Does anyone know what the status of a backport or a rebase is? It would have seemed logical to me that support would have been included in 7.4, given the heavy focus on NVMe features...

I could - and probably will - rebase the package myself, but that's overhead I don't particularly want.
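For reference, once a build with NVMe support (6.5 or later) is installed, the sort of thing I'm after is roughly this entry in /etc/smartd.conf - an untested sketch, with the device path and mail address as placeholders:

Code: Select all

# sketch only - adjust the device path and address for your system
# -d nvme  : force the NVMe driver (needs smartmontools >= 6.5)
# -H       : check the overall health / critical warning status
# -l error : warn when the error log entry count increases
# -m/-M    : mail alerts to this address; -M test sends a test mail when smartd starts
/dev/nvme0 -d nvme -H -l error -m root@localhost -M test

followed by a restart of smartd (systemctl restart smartd) to pick it up.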

User avatar
TrevorH
Site Admin
Posts: 33220
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: SMART Monitoring for NVME Storage

Post by TrevorH » 2017/08/07 14:15:02

I've got no idea what is or isn't in the 7.4 package, but when I faced this issue on 7.3 I rebuilt the Fedora f23 SRPM.
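In case it helps, rebuilding a Fedora SRPM on CentOS 7 goes roughly like this - a sketch only, with the package filenames purely illustrative:

Code: Select all

# rough sketch of rebuilding a Fedora smartmontools SRPM on CentOS 7;
# the exact SRPM filename/version will differ - treat these as placeholders
yum install -y rpm-build yum-utils
yum-builddep smartmontools-6.5-*.src.rpm         # pull in the build dependencies
rpmbuild --rebuild smartmontools-6.5-*.src.rpm   # binaries land under ~/rpmbuild/RPMS/
yum install ~/rpmbuild/RPMS/x86_64/smartmontools-6.5-*.rpm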
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: SMART Monitoring for NVME Storage

Post by dcrdev » 2017/08/07 15:23:11

Yeah, that's what I'm going to do - however, as mentioned, it's something I'd rather not have to do, as it's another package I have to manage updates for.

I downloaded the RHEL 7.4 package from my developer account and no, it doesn't have NVMe support. Do you think it's worth opening another bug report? If NVMe support is provided, then the associated monitoring tools should also support it. Like I say though, they closed the last report as a duplicate of another bug that is restricted, so I have no way of knowing what's happening with it.

User avatar
TrevorH
Site Admin
Posts: 33220
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: SMART Monitoring for NVME Storage

Post by TrevorH » 2017/08/07 15:25:34

I can't read the linked bug either, only @redhat.com or the reporter can. And yes, I suspect it's worth opening another bug if only in order to say "I can't read that one" as part of it!
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke


dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: SMART Monitoring for NVME Storage

Post by dcrdev » 2017/08/07 21:16:55

One more thing - I've built the Fedora package and, minus having to create an SELinux policy module to allow NVMe access, it's all working well. However, I just received an email stating "Device: /dev/nvme0, number of Error Log entries increased from 21 to 23". The output of smartctl -a is very different from the output I'm used to with non-NVMe drives.
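For anyone else who hits the SELinux denials, the policy module step is roughly the usual audit2allow dance - a sketch only, and the module name is arbitrary:

Code: Select all

# build and load a local policy module from the smartd AVC denials in the audit log
# (audit2allow comes from policycoreutils-python on CentOS 7)
ausearch -m avc -c smartd --raw | audit2allow -M smartd_nvme
semodule -i smartd_nvme.pp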

Anyway, could someone tell me whether the output below is concerning at all:

Code: Select all

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-514.26.2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW256HEHP-00000
Serial Number:                      REMOVED
Firmware Version:                   CXB7401Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            34,961,362,944 [34.9 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Aug  7 22:07:33 2017 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     68 Celsius
Critical Comp. Temp. Threshold:     71 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    803,005 [411 GB]
Data Units Written:                 525,252 [268 GB]
Host Read Commands:                 15,073,989
Host Write Commands:                12,399,531
Controller Busy Time:               11
Power Cycles:                       83
Power On Hours:                     40
Unsafe Shutdowns:                   42
Media and Data Integrity Errors:    0
Error Information Log Entries:      23
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               67 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         23     0  0x0000  0x4212  0x028            0     -     -
  1         22     0  0x0000  0x4212  0x028            0     -     -
  2         21     0  0x001d  0x4004  0x02c            0     0     -
  3         20     0  0x001c  0x4004  0x02c            0     0     -
  4         19     0  0x001f  0x4004  0x02c            0     0     -
  5         18     0  0x001e  0x4004  0x02c            0     0     -
  6         17     0  0x0075  0x4004  0x02c            0     0     -
  7         16     0  0x0057  0x4004  0x02c            0     0     -
  8         15     0  0x0075  0x4004  0x02c            0     0     -
  9         14     0  0x0035  0x4004  0x02c            0     0     -
 10         13     0  0x00bd  0x4004  0x02c            0     0     -
 11         12     0  0x00b8  0x4004  0x02c            0     0     -
 12         11     0  0x00ac  0x4004  0x02c            0     0     -
 13         10     0  0x0089  0x4004  0x02c            0     0     -
 14          9     0  0x0004  0x4004  0x02c            0     0     -
 15          8     0  0x001d  0x4004  0x02c            0     0     -
... (7 entries not shown)

ddemchak
Posts: 12
Joined: 2017/07/31 13:01:52

Re: SMART Monitoring for NVME Storage

Post by ddemchak » 2017/08/08 01:51:10

SMART really isn't the best for monitoring SSDs. It's great for HDDs, as the vendors have mostly standardized the attributes.
If you are going to use SMART for SSDs, the vendor's tool will be the most accurate. Keep in mind that even for HDDs, where SMART is more reliable, it is only about 80% accurate per Google's massive HDD study.

With all that said, I highly recommend using a RAID-1 set-up. RAID-Z1 with ZFS is my current recommendation, as unlike other RAID it can detect and correct errors, not just tell you to replace failing drives. With a RAID-Z1 set-up you will be able to see from dmesg/syslog if anything is throwing errors or failing.
You will get much more from RAID-Z1 than simple monitoring: fault tolerance and ECC along with redundancy.
I recommend reading: https://en.wikipedia.org/wiki/ZFS

You can also use mdadm for a "plain Jane" RAID set-up, roughly as sketched below. Since this is an OS drive, software RAID is fine - it does not matter much for mirroring (RAID 1) - and it avoids the high price of controllers that support NVMe RAID.
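Something like this for a plain two-disk mirror - the device and partition names are just examples, substitute your own:

Code: Select all

# sketch of a plain mdadm RAID1 mirror across two NVMe partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p2 /dev/nvme1n1p2
mdadm --detail --scan >> /etc/mdadm.conf   # persist the array definition
cat /proc/mdstat                           # watch the initial resync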

If you still want to pursue the SMART route, you will need to look at the vendor's tool or documentation to determine what the SMART attributes mean, as there is a lot of variance among SSDs currently in regard to SMART attrs.

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: SMART Monitoring for NVME Storage

Post by dcrdev » 2017/08/08 08:48:23

I'm already using ZFS as my root filesystem, and striped mirrored vdevs for my storage pool - admittedly the root filesystem is a single-disk pool. I just don't see much benefit in RAID for the OS - there's no valuable data on it, and the configuration is backed up regularly to the storage RAID.

Back to the matter at hand though - ZFS will show me whether there are any existing data errors, but it will not tell me whether the actual disk controller has an error. SMART monitoring would generally be a good indicator, but like I said, the output is very different from what I'm used to. This isn't an SSD thing - I've used SSDs in the past and their SMART output is a lot more meaningful than the above.
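To be clear, on the ZFS side the checks only amount to something like this - the pool name is just an example:

Code: Select all

# per-device READ/WRITE/CKSUM counters plus any files with errors
zpool status -v rpool
# force a full verification of all data against its checksums
zpool scrub rpool

which tells me about data ZFS has already seen go bad, not about the controller itself.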

I think you're right - I need to use the manufacturer's tool to test the drive, but in usual fashion that will require Windows, sigh.

ddemchak
Posts: 12
Joined: 2017/07/31 13:01:52

Re: SMART Monitoring for NVME Storage

Post by ddemchak » 2017/08/08 17:02:35

Fair point - yes, there is usually more data. I have seen similar output when polling SMART info from HDDs behind hardware RAID controllers.
I'm not sure if that's all there is for this particular drive, or if NVMe is causing it to show much less. Which takes us back to using the OEM tool - I'm not suggesting that as a permanent solution, but it would be useful for checking whether the drive has any issues now, and for seeing what the vendor-specific SMART attrs for that SSD are. I would forward any useful attr info over to the smartmontools devs.

As far as RAID-Z1 on an OS drive goes, I do prefer it, even when the drive holds only the OS, because I don't like dealing with the erratic system behaviour a corrupt root filesystem or a failing OS drive can cause. I consider end-to-end ECC a requirement for my storage servers.

User avatar
TrevorH
Site Admin
Posts: 33220
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: SMART Monitoring for NVME Storage

Post by TrevorH » 2017/08/08 17:18:31

The smartctl output for NVMe drives is totally different to that for non-NVMe drives.
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

Post Reply