Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Issues related to configuring your network
alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by alpha754293 » 2019/09/10 16:38:24

chemal wrote:
2019/09/10 04:49:43
Marvell? That's fake RAID, isn't it? (I see you mentioned Marvell before, I must have read over it.)

A single 860 EVO writes at about 500 MB/s (sequentially).

Edit: Google says a Marvell 9230 is h/w RAID, but really low-end. It has two PCIe 2.0 lanes for a theoretical max of 1 GB/s, of which you can get ~800 MB/s in reality.
Yeah, it's 2x PCIe 2.0 lanes, which should be 10 Gbps.

At this point, though, it's hard to tell whether it's the Marvell controller or whether the system is truly limited to 10 Gbps, mostly because the two numbers happen to be the same.

The only way I would be able to test that would be to get an Avago/Broadcom/LSI MegaRAID 12 Gbps SAS RAID controller (like the MegaRAID 9341-8i) and see whether it alleviates some of the bandwidth limitation, since at least that card is PCIe 3.0 x8 (64 Gbps).

Also, interestingly enough, there AREN'T any Avago/Broadcom/LSI PCIe 3.0 x16 RAID controllers at ALL. They have x16 host bus adapters, but not x16 RAID HBAs, which means that if I were to use a non-RAID HBA, the RAID would have to be software RAID, which of course comes with its own set of issues.

I'm still debating whether I want to switch over to the MegaRAID 9341-8i, because doing so will max out my system's PCIe lane supply. (The Core i7-4930K supplies 40 PCIe 3.0 lanes: 16 are taken up by the Mellanox NIC, 16 are taken up by the GTX Titan, which leaves 8 for the MegaRAID HBA if I go with it.)

I'm also undecided as to whether I want to go with a 12 Gbps SAS/SATA/NVMe MegaRAID, just 12 Gbps SAS/SATA (MegaRAID 9341-8i), or just 6 Gbps SAS/SATA (9271-8i).

Again though, at 768.67 MB/s for the four-drive RAID array, that works out to 192.1675 MB/s per drive, which is a far cry from the ~500 MB/s advertised sequential write speed.

192.1675 MB/s is only about 1.5 Gbps per drive, and each drive sits on a 6 Gbps SATA interface.

Again, combined, it only musters 6.14936 Gbps (out of the 10 Gbps raw rate of the Marvell HW RAID controller's PCIe 2.0 x2 link), which suggests that the controller's PCIe 2.0 x2 interface isn't the limiting factor unless it's only about 61.5% efficient in how it uses that link.

I would have expected that if I were maxing out the controller's interface, I would be seeing something closer to 10 Gbps with the testing methodology you suggested. So I'm still confused; something isn't adding up, either with the PCIe 2.0 x2 HW RAID controller's interface bandwidth or with the suggested testing methodology.

With the hardware that I've got, and if the advertised specs are to be believed, I should be pushing upwards of 16.64 Gbps.
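
(Just to lay out the arithmetic behind those numbers, assuming the advertised ~520 MB/s sequential write per 860 EVO:)

Code: Select all

# expected aggregate for 4 drives at the advertised ~520 MB/s each
echo "4 * 520 * 8 / 1000" | bc -l     # = 16.64 Gbps
# what the array actually measures
echo "768.67 * 8 / 1000" | bc -l      # = 6.14936 Gbps
# which is, per drive
echo "768.67 / 4" | bc -l             # = 192.1675 MB/s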

The suggested testing methodology, I think, takes care of the buffering that might be inflating the results (which is fair), but then I'm getting results that are closer to the maximum random write speeds (based on max IOPS x 4 drives) than to the advertised sequential speeds.

Again, maybe it's just me, but the data and the results don't make sense given the hardware.

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by chemal » 2019/09/10 20:24:10

To me this makes perfect sense. Your bottleneck is the Marvell controller.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by alpha754293 » 2019/09/10 22:29:29

chemal wrote:
2019/09/10 20:24:10
To me this makes perfect sense. Your bottleneck is the Marvell controller.
What is your rationale for that?

The local host can get 6.14936 Gbps, okay, but when I test it with NFSoRDMA, I'm at 4.7 Gbps for the entire array.

Wouldn't you think that if the Marvell RAID controller were the bottleneck/limiting factor here, I would be able to peg it at a higher data rate than what the current data shows?

This is also coupled with the fact that when I run the same test with NFSoRDMA, I can only achieve about 75% of what the local host can manage, despite the 100 Gbps interface. So far, based on the data presented and what's been discussed here, I'm not sure I follow how you can conclude that the bottleneck is the Marvell RAID controller and not the Samsung SSDs themselves.

Part of the reason why I'm asking is that before I invest MORE money in the 12 Gbps MegaRAID SAS RAID HBA, I want to be sure that it isn't the Samsung 860 EVO SSDs themselves that are the limiting factor.

IF the Samsung SSDs were the limiting factor, then I would have bought the new controller for no reason, as it would do little to alleviate a bottleneck that sits with the SSDs themselves.

The headnode only has 64 GB of RAM; otherwise, I'd test it some more using a RAM drive (tmpfs), which should, in theory, be faster than the Samsung SSD SATA RAID.
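
(If/when I get the chance, the tmpfs test would be something along these lines; the size and mount point are just placeholders, and note that oflag=direct doesn't work against tmpfs:)

Code: Select all

mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/10Gfile bs=1024k count=10240 conv=fdatasync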

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by chemal » 2019/09/10 23:42:45

Wouldn't you think that if the Marvell RAID controller were the bottleneck/limiting factor here, I would be able to peg it at a higher data rate than what the current data shows?
The Marvell is connected with two PCIe 2.0 lanes because one isn't enough and (say) 1.6 aren't an option. Two lanes don't imply that the controller is capable of constantly accepting and processing data at a rate of 10 Gb/s (which is 8 Gb/s net because of 8b/10b encoding).

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by alpha754293 » 2019/09/11 03:12:58

chemal wrote:
2019/09/10 23:42:45
Wouldn't you think that if the Marvell RAID controller were the bottleneck/limiting factor here, I would be able to peg it at a higher data rate than what the current data shows?
The Marvell is connected with two PCIe 2.0 lanes because one isn't enough and (say) 1.6 aren't an option. Two lanes don't imply that the controller is capable of constantly accepting and processing data at a rate of 10 Gb/s (which is 8 Gb/s net because of 8b/10b encoding).
Ahhh..okay...that makes sense now. Forgot about the 8b/10b encoding rate. Damn it!

So, at 6.14936 Gbps out of 8 Gbps (data, not line) rate, that represents an efficiency of 76.867%.

So really, if I want to see higher throughput, I would need to drop in a faster RAID HBA like the MegaRAID 9341-8i, right? (Please check my thinking. And thank you for correcting/educating/reminding me about the 8b/10b encoding overhead.)

Though that STILL leaves the question: why does NFSoRDMA only achieve 4.7-ish Gbps out of the ~6 Gbps that the array behind the PCIe 2.0 x2 HW RAID controller can manage locally?

4.712 Gbps / 8 Gbps (max PCIe 2.0 x2 data rate) = 58.90%

4.712 Gbps / 6.14936 Gbps (max the RAID0 host can achieve) = 76.63%

I'm not sure if I understand why the NFSoRDMA can't achieve the ~6 Gbps that the host is able to attain.
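
(Re-doing the link budget with the 8b/10b overhead in mind, the numbers work out like this:)

Code: Select all

# PCIe 2.0 x2: 2 lanes x 5 GT/s raw, x 8/10 for the 8b/10b encoding
echo "2 * 5 * 8 / 10" | bc -l          # = 8 Gbps usable
# local RAID0 result vs. that link
echo "6.14936 / 8 * 100" | bc -l       # = 76.867 %
# NFSoRDMA result vs. the link, and vs. the local result
echo "4.712 / 8 * 100" | bc -l         # = 58.90 %
echo "4.712 / 6.14936 * 100" | bc -l   # = 76.63 %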

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by TrevorH » 2019/09/11 11:38:05

Your measurement method is flawed. If you run dd without telling it not to then the system will cache the output file in RAM and only flush it to disk when it feels like it. You need to add one of oflag=direct or conv=fdatasync to the dd to stop that from happening. Without it, you are measuring the speed of your RAM and not much else. You could also add status=progress to the dd so it shows you the throughput as it runs.
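
For example, either of these forms (the file name and size here are only illustrative):

Code: Select all

# bypass the page cache entirely
dd if=/dev/zero of=testfile bs=1024k count=10240 oflag=direct
# or let it cache, but include the final flush to disk in the timing
dd if=/dev/zero of=testfile bs=1024k count=10240 conv=fdatasync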
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by alpha754293 » 2019/09/11 12:16:01

TrevorH wrote:
2019/09/11 11:38:05
Your measurement method is flawed. If you run dd without telling it not to then the system will cache the output file in RAM and only flush it to disk when it feels like it. You need to add one of oflag=direct or conv=fdatasync to the dd to stop that from happening. Without it, you are measuring the speed of your RAM and not much else. You could also add status=progress to the dd so it shows you the throughput as it runs.
a) Again, as I have mentioned/told you before, what do you think happens with asynchronous writes?

b) I first discovered that the testing methodology you proposed isn't universally valid, because oflag=direct, for one, is LITERALLY NOT a valid option when you test against tmpfs mount points, which means the statement is NOT universally true or valid.

c) I did a little research into this and, in so doing, determined that this method really only tests SYNCHRONOUS writes, which is not how my mount point is configured. I do not need the file system to send back an acknowledgement that a write operation has completed successfully before the next write operation can proceed.

This is what the "conv=fdatasync" flag is about (it makes dd flush the file data to disk, via fdatasync(), before it reports its timing).

(Sources: https://linux.die.net/man/2/fdatasync, http://man7.org/linux/man-pages/man1/dd.1.html)

The proposed testing methodology would be true and valid IF my NFS export in /etc/exports were going to have the 'sync' option. By default, if the sync option is not specified, 'async' is the default operation mode for NFS exports.

My research also indicates that you want to use sync if you want to protect against data loss during writes: each write operation must be acknowledged before the next one can commence. In async operation, that acknowledgement is not required, which means that in the event of a power outage or some other drive anomaly, the data (block/file/object) may not have been completely written to disk, which can result in data corruption.
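
(In /etc/exports terms the difference is just the export option; something along these lines, where the path and client range are placeholders:)

Code: Select all

# async (what I'm running): the server may acknowledge writes before they are on disk
/export/scratch   192.168.1.0/24(rw,async)
# sync: every write must be committed to stable storage before the server replies
/export/scratch   192.168.1.0/24(rw,sync)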

d) Further, I actually did test my file system with conv=fdatasync. I just didn't publish the results, so you cannot assume, based ONLY on what I have published, what I have or haven't run/tested.

For your information, testing it (on the host) with:

Code: Select all

dd if=/dev/zero of=10Gfile bs=1024k count=10240 conv=fdatasync
Results in:

Code: Select all

10737418240 bytes (11 GB) copied, 13.8863 s, 773 MB/s
which, as you will note, is not that different than async write speeds.

(That's with:

Code: Select all

$ cat /proc/sys/vm/dirty_background_bytes
0
$ cat /proc/sys/vm/dirty_bytes
0
)
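
(Both of those reading 0 means, if I'm reading the kernel documentation right, that the ratio-based counterparts are what's actually in effect; they can be checked the same way:)

Code: Select all

$ cat /proc/sys/vm/dirty_background_ratio
$ cat /proc/sys/vm/dirty_ratio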

The system is in the middle of a VOF-to-DPM run right now (it's got about another day-and-a-half left on it, out of a 3 day run), so I can do more testing with it after that.

But suffice it to say that, with the exception of NFSoRDMA (where the results are still lower than what the host can obtain, and where I don't really have a good or reasonable explanation for the difference), the results are close enough between sync and async that, to me, it doesn't really seem to matter.

Further, again, since the NFS export defaults to async mode unless you explicitly set it to sync via /etc/exports, it wouldn't make any sense to test synchronous writes against an asynchronous mount point.

That's not how it is ultimately going to be used, and therefore the results from the synchronous write tests would be invalidated simply by the fact that the NFS share is exported async and is also mounted by the clients as such.
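
(For context, the clients mount that export over RDMA with something along these lines; the server name, export path, and mount point are placeholders:)

Code: Select all

# 20049 is the conventional port for NFS over RDMA
mount -t nfs -o proto=rdma,port=20049 headnode:/export/scratch /mnt/scratch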

(The reverse would also be true -- testing asynchronous writes against a synchronous mount point would likewise be invalid.)

(And I'm also not worried about power loss, because the systems are all sitting behind UPSes. If, despite that, a write still gets interrupted, my first thought would be that the drive is dying, which means it will have to be replaced anyway. And given that this is a scratch file server where ALL of the data is transient/short-term, again, I'm not that worried about it.)

Thanks.

I paused the run for a little while so that I could test NFSoRDMA with sync, and here is the result:

Code: Select all

10737418240 bytes (11 GB) copied, 26.6492 s, 403 MB/s
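(For comparison, that sync result works out to:)

Code: Select all

echo "403 * 8 / 1000" | bc -l    # = 3.224 Gbps, vs. ~4.7 Gbps async over NFSoRDMA
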
Again, I actually ran through the assessment of whether I should be running async or sync, and it came down to "how important is the data?" and "how bad is it if the data gets corrupted due to async writes?"

What I ended up with was that if the results were corrupted in the middle of a run, the whole run is trashed anyways, which would necessitate a re-run from the very beginning (i.e. NOT a restart).

Ergo, if I'm just going to trash the data and start over again rather than try to pick up from where it left off, then in the event of data corruption (e.g. intermittent power during an async write), the runs would stop anyway while I find out what's going on with the drive (normally via its SMART data), and that will tell me if there's an issue.

As I've already experienced, when that happens the system drops into emergency maintenance mode anyway, and I can review the SMART data to find out what the issue is, which almost always results in replacing the drive.
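
(Reviewing that is just a smartctl run; whether it works through the Marvell controller directly or needs a passthrough flag like -d sat depends on how the controller exposes the drives, so treat this as a sketch, with the device node as an example:)

Code: Select all

# overall health check plus the full SMART attribute table
smartctl -H -A /dev/sda
# if the controller hides the drive behind a SAT layer, this may be needed instead
smartctl -H -A -d sat /dev/sda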

That means that until then, a worn-out drive (and I've worn out SSDs before) would put itself into a read-only state, at which point it'll only write at about 2 MB/s, sync or async, which also means I won't really be able to execute the runs anyway.

So, the net result, given all of these factors, is that async is fine for me. I don't have hot spares, and the data itself isn't all that important to me; otherwise I wouldn't be running a RAID0 array to begin with.

Again, in almost all such cases it will end with a drive needing to be physically replaced (at which point managing the RAID is going to be "fun"/"interesting", although at least this is HW RAID, so it would probably be easier to manage than if it had been md RAID).
Last edited by alpha754293 on 2019/09/11 12:38:58, edited 1 time in total.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by TrevorH » 2019/09/11 12:31:20

10737418240 bytes (11 GB) copied, 13.8863 s, 773 MB/s
Though you wrote a lot of justification, the numbers now add up:

773 * 1000 * 1000 * 8 = 6184000000 bps, i.e. about 6.18 Gbps (dd reports decimal MB/s)
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by alpha754293 » 2019/09/11 12:42:50

TrevorH wrote:
2019/09/11 12:31:20
10737418240 bytes (11 GB) copied, 13.8863 s, 773 MB/s
Though you wrote a lot of justification, the numbers now add up:

773 * 1000 * 1000 * 8 = 6184000000 bps, i.e. about 6.18 Gbps (dd reports decimal MB/s)
Yeah, I tested both sync and async writes.

I just didn't (originally) publish the sync results because, ultimately, I'm not using a synchronous NFS export, so synchronous test results would be inappropriate/invalid/meaningless with respect to an asynchronous NFS export.

Ergo, there's no point in having a discussion about something that I wasn't going to use anyway.

But I DID test it. (Because I was curious to see what, if any, differences the RAID0 array was going to have.)

And as you can tell, with no dirty data sitting in the buffers, there's not much of a difference (between async and sync).

But in the NFSoRDMA results, there's quite a bit of difference, so I'm going to pick (and utilise) the faster of the two (async).

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps?

Post by chemal » 2019/09/11 15:05:39

But in the NFSoRDMA results, there's quite a bit of difference, ...
You can experiment with the number of nfs server processes in /etc/sysconfig/nfs.
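
On CentOS 7 that's the RPCNFSDCOUNT setting; 16 below is just an example value to try:

Code: Select all

# /etc/sysconfig/nfs -- number of nfsd threads (the default is 8)
RPCNFSDCOUNT=16
Then restart the NFS server and confirm the thread count:

Code: Select all

systemctl restart nfs-server
cat /proc/fs/nfsd/threads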
