Setting up for computation speed

General support questions
CentWarren
Posts: 1
Joined: 2017/05/21 18:25:04

Setting up for computation speed

Post by CentWarren » 2017/05/21 19:55:50

Hi all,

I am new to CentOS and Linux (I installed the operating system five days ago on a newly purchased computer). I come from a Windows background and was so relieved when I found the Emacs psychotherapist programme to talk to. I've read CentOS System Administration Essentials and have been working through The Definitive Guide to CentOS and CentOS Bible. I have a few books on the Linux kernel too but am still a complete newbie. I'm using the KDE environment with CentOS 7.3, kernel version 3.10.0-514.16.1.el7.x86_64.

I'm implementing quantitative finance models at the retail level and am trying to set up to make maximum use of my hardware. I do not have an FPGA so calculations will have to be implemented at the software level. The goal is to get calculation times as close to the FPGA level of determinism as possible.

My hardware is as follows:
1) Intel Core i7 6800K 3.4GHz Broadwell, 12 threads, LGA2011-v3
2) MSI X99A Gaming Pro Carbon Intel motherboard
3) Antec HCP 1300W PSU
4) Corsair Vengeance 64GB 3200MHz DDR4 RAM
5) 3 x MSI GeForce GTX 1060 6GB GDDR5 1280-core graphics cards
6) Samsung 960 EVO 250GB NVMe SSD
7) 4 x 4TB hard drives

I've managed to install the graphics card drivers and get CUDA up and running, install R and Eclipse etc. My problem is I don't have any background in systems administration or enterprise level operating systems. I'm familiar with C/C++/R/VBA/Javascript/CUDA C and am in my first week of BASH. I'm attempting to write my application as far as possible in C.

I'd like to know if the following would actually result in a speed up or if I'm just grasping at straws:

1)Install the Real-Time Linux patch to give incoming tick data priority and the ability to preempt processes. Is there any way to get the OS to only execute on a designated core and to reserve other CPU cores exclusively for calculations? Would this stop system processes from introducing random latencies?

2)Create a RAID 1 setup for the operating system disk with my SSD (3200 MB/s read speed) (with a small partition just for the OS and essential libraries/programmes) and a RAM disk. Each time the operating system starts, create a RAM disk and use RAID 1 mirroring to mirror the contents of the SSD where the boot and operating system partitions would be. Does anyone know if this would work? Assuming with a lean kernel and if load balancing is possible. What would happen if we unmount the SSD?

3)Create a RAID 10 setup with the other four 4TB hard disks for reading tick data - is this the best one can do for speed and redundancy from conventional hard drives? I was thinking with btrfs.

4)Optimise the Kernel - does anyone know any good textbooks for this?

5)Instead of 2) just create a kernel with everything needed to run the application and with no need to save changes after power off. Then load that kernel entirely into RAM.

Any other suggestions would be great. Determinism at the expense of throughput is fine as more machines can just be bought.

I understand this will take time and am motivated to undertake the learning. This is my fifth day of Linux. I'm coming from a statistical/financial background and really apologise for misusing or misinterpreting any computer science ideas above. I can optimise code but I don't know how to further optimise the computer setup. I don't have a corporation backing me so I have to do everything from the ground up. I have every day, Monday-Sunday, to sit and code and will gladly accept any textbook recommendations on creating efficient systems.

Thank you

Edit:
6)Set hard processor affinities instead of real-time patch. Or a combination of both? Should hyperthreading be disabled when doing this? Does hyperthreading cause random latencies? I have 6 cores. Leave two open for operating system and break up programme along other four cores? Would one full core suffice?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Setting up for computation speed

Post by aks » 2017/05/22 17:39:02

Is there any way to get the OS to only execute on a designated core and to reserve other CPU cores exclusively for calculations?
Yes, but this is probably tilting at windmills :D You can look at cpuset and cgroups to "tie" processes to specific processors/cores. Often this will not achieve what most people think it will (Linux is pretty good at scheduling jobs these days). Here's an excellent write-up on CPU utilisation: http://www.brendangregg.com/blog/2017-0 ... wrong.html Read it - it is very good.
Each time the operating system starts, create a RAM disk and use RAID 1 mirroring to mirror the contents of the SSD where the boot and operating system partitions would be. Does anyone know if this would work? Assuming with a lean kernel and if load balancing is possible. What would happen if we unmount the SSD?


RAID1 is good for reading (you can "round robin" the reads) but there is a penalty for writes (the write is flushed to both disks, and often, depending on numerous details, the call doesn't return until BOTH operations are complete). Yes, you can have a mirror of the boot disks/partitions; IMO it's more for redundancy (disk 1 dies, but I can still boot from the "other" disk).
Create a RAM disk
: containing what? So far, RAM is still quicker than disk (including "traditional" SSDs) - but would the application be better off using that RAM directly rather than whatever you put in the RAM disk?
Assuming with a lean kernel and if load balancing is possible
: I don't know what you mean now.
What would happen if we unmount the SSD?
What would happen if you unmounted the boot disk (NOTE: not the root volume)? You'd run into problems with updates. Most of the "stuff" in the boot partition/disk is resident in RAM by then (or swapped to disk, depending), so we only need to go back to the disk/partition to read something not already resident, or to write something. Removing the root volume is a bad thing (unless we have it somewhere else, like in RAM, but that would probably fail, as various things need to be written at various times and often must be flushed to persistent storage).
Create a RAID 10 setup with the other four 4TB hard disks for reading tick data - is this the best one can do for speed and redundancy from conventional hard drives? I was thinking with btrfs.
BTRFS is a copy-on-write (COW) filesystem like ZFS. It generally offers better performance than "traditional" filesystems (like, say, ext3), but RH still has not switched it on as the default for new systems, as there are still quite a few edge cases where performance suffers (as far as I understand). Additionally, saying something is "faster" than something else without taking the usage patterns into account is completely meaningless. As a general rule for storage systems, columns add performance. This adage comes from the "spinning rust" (magnetic disk) era, whereby accessing a large file spread across multiple mechanical devices (i.e. columns) will always be quicker than accessing the same large file on one disk. Mind, if the file is a sector (or less) long, it can't be "spread". Also, there is still the write penalty of the mirror part of RAID10 (or even RAID01). I don't know what tick data is; I kinda assume it's application-layer data.
Optimise the Kernel - does anyone know any good textbooks for this?
Well, a good place to start is the kernel documentation (IMO). But it's always a trade-off, and this is so dependent on what your application(s) usage patterns are.
Instead of 2) just create a kernel with everything needed to run the application and with no need to save changes after power off. Then load that kernel entirely into RAM.
RAM is still "faster" than disk (even SSD - and not all SSDs are the same). There's new tech coming out, often marketed as "persistent RAM"; so far (AFAIK) it approaches RAM speed for some specific use cases, but RAM is still faster than persistent storage (well, in the affordable market). The ultimate performant application is something that exists only in the L1 cache of the CPU and completely dominates the CPU (so it's never evicted from the L1 cache). Unfortunately, that amount of code would not do anything useful.
Set hard processor affinities instead of real-time patch. Or a combination of both? Should hyperthreading be disabled when doing this? Does hyperthreading cause random latencies? I have 6 cores. Leave two open for operating system and break up programme along other four cores? Would one full core suffice?
I think you're making a rod for your own back; don't try to second-guess the scheduling algorithms - they are written by very clever and experienced people. Mind, if you can do better, then please do, and send it upstream! Hyperthreading (IMO) is quite useful for virtualised workloads where you are "throwing" bucketloads of work at the CPU. That depends though. Say I send a workload (A) and it completes in time, then I send another workload (B). If workload A and workload B are very similar (so I don't have to evict various caches and registers, no interrupts happen in the meantime, and each task can be completed in a single CPU cycle) then yes, that's good; otherwise not ... well, once again, it depends. Yes, hyperthreading can "create" (seemingly) random latency - it depends. Some workloads I run on a single core; it depends on what is required.
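For completeness: if you do decide to reserve cores for calculations, the usual knob on CentOS 7 is the isolcpus= kernel parameter rather than (or alongside) the RT patch. A sketch of /etc/default/grub, assuming you want cores 2-5 kept away from the general scheduler (the core numbers are only an example - check your own topology with lscpu first):

```
# /etc/default/grub - sketch only; keep your existing parameters and
# append isolcpus. Cores 2-5 are then skipped by the normal scheduler
# and only run tasks explicitly pinned there.
GRUB_CMDLINE_LINUX="... isolcpus=2-5"
```

After editing, regenerate the config with grub2-mkconfig -o /boot/grub2/grub.cfg, reboot, and then start the application with taskset -c 2-5 ./yourapp so it actually lands on the isolated cores.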

Sorry, no easy answers for these kinds of questions. On a side note, as you're doing it all in C, have a look at the optimisation flags you can pass for your specific hardware (often doing something in hardware is "faster" than in software, for the specific criteria the hardware was designed for).

Any typos are my own!
