[RESOLVED] - ELRepo NVIDIA RPM sets and survivability

warron.french
Posts: 616
Joined: 2014/03/27 20:21:58

[RESOLVED] - ELRepo NVIDIA RPM sets and survivability

Post by warron.french » 2018/05/29 16:27:00

Since I am doing this for work and not at home, can someone please tell me: if I am working with a Red Hat 7 variant (in this case actual Red Hat 7, minor version 4), would an installed NVIDIA RPM-set survive updates to the kernel?

Admitting up front that I don't have much experience with compiling drivers, updating kernels, and then observing what happens to the driver's functionality across kernel updates: my question is whether an NVIDIA RPM-set will survive at least 3-4 kernel updates. Also important: will the same NVIDIA drivers installed from a specific NVIDIA RPM-set survive Red Hat 7 minor version updates, for example from 7.4 to 7.5 and again to 7.6?

I don't know how to ask this question more precisely (or more generically, for that matter) to ensure driver survivability.

Thank you for any education, advice, and information.
Last edited by warron.french on 2018/05/31 01:32:18, edited 1 time in total.
Thanks,
War

jlehtone
Posts: 4530
Joined: 2007/12/11 08:17:33
Location: Finland

Re: ELRepo NVIDIA RPM sets and survivability

Post by jlehtone » 2018/05/29 19:50:19

Regular kernel modules are tied to a specific kernel version. Most of them are supplied with the kernel and thus the coupling is not a problem. Modules that you compile yourself are likely to have the same strict coupling.

The ElRepo team compiles modules with less version dependency; their modules do not require a specific kernel version number, and when you install a new kernel the module is made available to the new kernel too (as a symlink, not an actual copy). The same module thus functions with multiple kernels. The NVidia drivers do get updates too, and thus one rarely spans multiple kernel versions with the same NVidia module.
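
For anyone who wants to look at those symlinks: on RHEL/CentOS 7 they normally sit in a weak-updates directory under /lib/modules for each kernel. Below is a minimal sketch, assuming that layout. The function name list_weak_updates and its optional alternate-root argument are made up for illustration; this is not an ELRepo tool.

```shell
# List kernel modules that are symlinked into a kernel's weak-updates
# directory, together with the file each link points to.
# Assumption: the layout <root>/lib/modules/<kernel>/weak-updates.
# list_weak_updates is a hypothetical helper name.
list_weak_updates() {
    kver=$1          # kernel release, e.g. the output of `uname -r`
    root=${2:-}      # optional alternate root directory (useful for testing)
    dir="$root/lib/modules/$kver/weak-updates"
    [ -d "$dir" ] || return 0   # no such directory: nothing to list
    find "$dir" -type l | sort | while IFS= read -r link; do
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}

# Example call for the running kernel:
#   list_weak_updates "$(uname -r)"
```

If the link target lives under an older kernel's directory, the new kernel is reusing that older build of the module rather than carrying its own copy.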

When does this break? When the kernel API changes drastically. Luckily, the RHEL/CentOS kernels are rather stable and the security updates usually do not change the API.

The RHEL point updates, however, tend to add more to the kernel and thus change the API; the old module won't function with the new kernel. However, ElRepo has usually recompiled the NVidia kernel module (NVidia version remains the same) against the new kernel soon after RHEL point update release.

IME, the "survivability" is good with ElRepo.


I must add that I have the elrepo repository disabled by default. I don't allow cron to update from there. I want more control. For example, when the CentOS-7.5 updates became available, I ran:
1. yum update --exclude=kernel
2. yum update
3. reboot to multi-user.target
4. yum --enablerepo=elrepo update
5. reboot back to graphical
All of that might not be necessary, but what works ...

What I particularly ensure is that the NVidia driver is updated after the Xorg/Mesa OpenGL packages have been updated.
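
The staged update above could be written down roughly as follows. This is only a sketch: the function names and the DRY_RUN switch are assumptions added for illustration, and switching targets with systemctl set-default is just one way to "reboot to multi-user.target". By default nothing is executed; the commands are only printed.

```shell
# Staged update: base OS first, NVidia driver from elrepo afterwards.
# DRY_RUN=1 (the default) prints each command instead of running it.
# Set DRY_RUN=0 and run as root to actually execute the steps.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Stage 1: update everything except the kernel, then the kernel too,
# with elrepo still disabled; then boot into text mode.
stage_base_update() {
    run yum update --exclude=kernel
    run yum update
    run systemctl set-default multi-user.target
    run reboot
}

# Stage 2 (after the reboot): pull the driver update from elrepo,
# then go back to the graphical target.
stage_elrepo_update() {
    run yum --enablerepo=elrepo update
    run systemctl set-default graphical.target
    run reboot
}
```

Calling stage_base_update in dry-run mode lists the four commands of the first stage; stage_elrepo_update is meant to be run only after the machine has come back up in multi-user mode.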

warron.french
Posts: 616
Joined: 2014/03/27 20:21:58

Re: ELRepo NVIDIA RPM sets and survivability

Post by warron.french » 2018/05/30 20:07:20

jlehtone, thanks for replying, and with so much detail. Much of it is taking me a while to comprehend, but it is sinking in. I understand your first and second paragraphs; however, I start losing you around the 3rd and 4th. For instance, paragraph 3 says the kernel module(s) break when the kernel API changes drastically, but in paragraph 4 you state that the RHEL point updates tend to add more to the kernel and thus change the API.

Do these statements mean that typical kernel updates for RHEL, for example, kernel-3.10.0-693, when updated to kernel-3.10.0-693.17.1, will not be a problem, but if the RHEL-point update moved the kernel (if the RHEL-point was applied) to, as in my example, kernel-3.10.0-862 - it will break the kernel modules? They will not be back-linked to the original installed (actual file copy)? Am I correct in this understanding?


It is very important that I understand this mechanism, the standards, the expected behaviors, etc. Thank you for educating me.
Thanks,
War

jlehtone
Posts: 4530
Joined: 2007/12/11 08:17:33
Location: Finland

Re: ELRepo NVIDIA RPM sets and survivability

Post by jlehtone » 2018/05/30 20:41:21

warron.french wrote:Do these statements mean that typical kernel updates for RHEL, for example, kernel-3.10.0-693, when updated to kernel-3.10.0-693.17.1, will not be a problem, but if the RHEL-point update moved the kernel (if the RHEL-point was applied) to, as in my example, kernel-3.10.0-862 - it will break the kernel modules? They will not be back-linked to the original installed (actual file copy)?
There is no hard rule on how far compatibility extends, but 693 to 693.* might have retained the API, and 693.* to 862 definitely did change it.

The symlinks are created regardless, because that mechanism does not know about the API. It is only when the kernel attempts to load the module (on boot) that errors like "undefined reference" or "missing symbols" appear; the module cannot load, and as a result X11 fails to start.
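
The 693 versus 862 distinction can be read straight out of the kernel release string. The sketch below is a rough heuristic only, since there is no hard rule; the function names are invented for this example, and the only assumption is that the build number is the first field after the dash, as in the kernel names quoted above.

```shell
# Extract the RHEL build number (693, 862, ...) from a kernel release
# string such as 3.10.0-693.17.1.el7.x86_64 or kernel-3.10.0-862.
kernel_base_build() {
    rel=${1#kernel-}              # tolerate a leading "kernel-" (package name)
    rel=${rel#*-}                 # drop the upstream version: "693.17.1.el7..."
    printf '%s\n' "${rel%%.*}"    # keep everything before the first dot: "693"
}

# Succeeds when both kernels share the same build number, which suggests
# (but does not guarantee) that they belong to the same point release.
same_point_release() {
    [ "$(kernel_base_build "$1")" = "$(kernel_base_build "$2")" ]
}

# Turn the comparison into a human-readable hint.
rebuild_hint() {
    if same_point_release "$1" "$2"; then
        echo "same build series: the existing kmod will probably still load"
    else
        echo "new build series: expect to need an updated kmod"
    fi
}
```

For example, rebuild_hint kernel-3.10.0-693 kernel-3.10.0-693.17.1 reports the same build series, while comparing kernel-3.10.0-693 with kernel-3.10.0-862 reports a new one.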

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: ELRepo NVIDIA RPM sets and survivability

Post by chemal » 2018/05/30 21:55:32

warron.french wrote:It is very important that I understand this mechanism, the standards, the expected behaviors, etc.
RH guarantees a stable kernel ABI for all kernels in rhel7. There is a list of all kernel symbols covered by this guarantee, see https://access.redhat.com/solutions/444773. Nvidia's kernel module also uses symbols that are not on this list. In theory, this means nvidia's kernel module could break with every kernel update. In practice, it usually survives all bugfix updates between minor releases, but tends to need a rebuild for each new minor release.
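
To see which of a module's symbols fall outside that guarantee, one could compare the module's symbol list with the whitelist. A minimal sketch follows, assuming two plain text files as input: one with the symbols the module needs (one per line) and one whitelist file with a symbol per line, possibly indented and with [section] header lines. The helper name symbols_off_whitelist is hypothetical.

```shell
# Print the symbols a module needs that are NOT on a kABI whitelist.
#   $1  file with the module's required symbols, one per line
#   $2  whitelist file; leading/trailing whitespace, blank lines and
#       [section] header lines are ignored
# symbols_off_whitelist is a hypothetical helper name.
symbols_off_whitelist() {
    awk '
        FILENAME == ARGV[1] {
            # First file: the whitelist. Trim whitespace, remember symbols.
            gsub(/^[ \t]+|[ \t]+$/, "")
            if ($0 != "" && $0 !~ /^\[/) ok[$0] = 1
            next
        }
        # Second file: the module symbols. Print the ones not whitelisted.
        $0 != "" && !($0 in ok) { print }
    ' "$2" "$1"
}
```

The symbol list might be produced with something like modprobe --dump-modversions /path/to/module.ko, keeping only the second column, and the whitelist files are shipped by Red Hat in the kernel-abi-whitelists package; where exactly those files land on a given system is something to verify locally rather than take from this sketch. Every symbol printed is one that Red Hat makes no stability promise for.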

warron.french
Posts: 616
Joined: 2014/03/27 20:21:58

Re: ELRepo NVIDIA RPM sets and survivability

Post by warron.french » 2018/05/31 01:31:14

Thank you both for the feedback.

I also got a response from Elrepo (Phil Perry) with a similar answer. He wrote:
On 29/05/18 17:25, warron.french wrote:
Can someone please tell me since I am doing this work for my employment, and not at home, if I am working with a redhat-7 variant (in this case I am actually working with Red Hat 7, and of the minor version 4) would the installation of an NVIDIA RPM-set survive updates to the kernel?


Yes, elrepo's kmod packages are kABI-tracking, meaning they will survive kernel updates as long as the kmod module only uses kernel symbols on Red Hat's kABI whitelist. If they use symbols not on the whitelist then they will only survive kernel updates for which those symbols' ABI is not changed.

Admitting up front that I don't have much experience with compiling drivers, and updating kernels and then observing what happens to the functionality of the driver over kernel updates - my question is whether an NVIDIA RPM-set will survive at least 3-4 kernel updates? Also, important, will the same NVIDIA drivers installed from a specific NVIDIA RPM-set survive Red Hat 7 minor version updates; from 7.4 to 7.5 and again 7.6 for example?


In practice, the nvidia kmods typically survive kernel updates within a point release, but will typically require a package update for point releases (e.g., when going from 7.4 to 7.5).
Anyway, I now have a clear understanding of what to expect and how to handle it.

I am going to mark this as RESOLVED. Thanks guys.
Thanks,
War
