7.2->7.3 = kernel regression: Intel e1000e-drivers won't load anymore / cannot enumerate switches below PCI2PCIe bridges

rujaajur
Posts: 4
Joined: 2017/03/31 07:09:48


Post by rujaajur » 2017/03/31 08:37:26

Hi and thank you for reading,

We are using a couple of embedded boards (Intel Atom D525-based) with a dual Intel 82574L Gbit network card (PCI-104 based). Since the upgrade from 7.2 to 7.3, the driver (e1000e) for the add-on network card can no longer be loaded successfully, so both additional Intel Gbit network ports on the add-on card are no longer usable or accessible. The same system worked properly under all 7.0, 7.1 and 7.2 kernels we used. Rebooting into an older (pre-7.3) kernel instantly brings back the missing network ports.

Output of the system's getinfo script (for driver data):
http://pastebin.centos.org/75791/



To test whether this was kernel or subsystem/driver related, I kept the older kernel installed alongside the fully updated (current) 7.3 RPM packages. With that setup, only the 7.3 kernel series appears to be affected. I further assume that this is due to an improper or incomplete kernel backport for 7.3, because I also tested the recent "elrepo" kernels (4.4.x and 4.10.x), which do not show this issue. The following kernels are currently installed for further testing:

Code: Select all

kernel-3.10.0-327.36.3.el7.x86_64
kernel-3.10.0-514.10.2.el7.x86_64
kernel-3.10.0-620.el7_bz1394089_v2.x86_64
kernel-lt-4.4.54-1.el7.elrepo.x86_64
kernel-ml-4.10.3-1.el7.elrepo.x86_64
kernel-plus-3.10.0-514.el7.centos.plus.x86_64
The only kernels exhibiting this ACPI-related issue are the 7.3 series kernels (3.10.0-514*), including the "centos-plus" kernel (3.10.0-514.el7.centos.plus), as well as the one (3.10.0-620.el7_bz1394089_v2) which I found on the official RHEL bug tracker and tested.
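To keep track of which installed kernels fall into the suspect range, here is a minimal Python sketch. The rule it encodes (3.10.0 builds from 514 upward are suspect) is only an assumption drawn from the tests described in this post, not an official statement about affected builds. The names `is_suspect` and `FIRST_SUSPECT_BUILD` are made up for illustration.

```python
import re

# Kernel package names exactly as listed in this post.
INSTALLED = [
    "kernel-3.10.0-327.36.3.el7.x86_64",
    "kernel-3.10.0-514.10.2.el7.x86_64",
    "kernel-3.10.0-620.el7_bz1394089_v2.x86_64",
    "kernel-lt-4.4.54-1.el7.elrepo.x86_64",
    "kernel-ml-4.10.3-1.el7.elrepo.x86_64",
    "kernel-plus-3.10.0-514.el7.centos.plus.x86_64",
]

# Assumption based only on this thread: 3.10.0 builds >= 514 show the issue.
FIRST_SUSPECT_BUILD = 514


def is_suspect(package: str) -> bool:
    """Return True if the package looks like a 3.10.0 kernel with build >= 514."""
    match = re.search(r"-(\d+\.\d+\.\d+)-(\d+)", package)
    if match is None:
        raise ValueError(f"cannot parse kernel version from {package!r}")
    version, build = match.group(1), int(match.group(2))
    return version == "3.10.0" and build >= FIRST_SUSPECT_BUILD


for package in INSTALLED:
    print("suspect" if is_suspect(package) else "ok     ", package)
```

The check deliberately ignores everything after the build number, so the centos-plus and bz1394089 test kernels are classified together with the regular 7.3 kernels.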


Here are some dmesg extracts via "dmesg | grep e1000e" ...
from 7.2 (successful) booting:

Code: Select all

[    0.795616] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    0.801994] pci 0000:05:00.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    0.808156] pci 0000:08:00.0: reg 0x14: [mem 0xfe100000-0xfe17ffff]
[    0.813680] pci 0000:06:02.0:   bridge window [mem 0xfe100000-0xfe1fffff]
[    1.255382] pci 0000:06:02.0:   bridge window [mem 0xfe100000-0xfe1fffff]
[    1.273735] pci 0000:05:00.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.292101] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.325285] pci_bus 0000:05: resource 1 [mem 0xfe100000-0xfe2fffff]
[    1.325296] pci_bus 0000:06: resource 1 [mem 0xfe100000-0xfe2fffff]
[    1.325319] pci_bus 0000:08: resource 1 [mem 0xfe100000-0xfe1fffff]
[   11.975360] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.5-k
[   11.975361] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[   11.976509] e1000e 0000:07:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[   12.023565] e1000e 0000:07:00.0: irq 29 for MSI/MSI-X
[   12.023584] e1000e 0000:07:00.0: irq 30 for MSI/MSI-X
[   12.023985] e1000e 0000:07:00.0: irq 31 for MSI/MSI-X
[   12.155177] e1000e 0000:07:00.0 eth2: registered PHC clock
[   12.160750] e1000e 0000:07:00.0 eth2: (PCI Express:2.5GT/s:Width x1) 00:d0:81:0a:03:c4
[   12.168788] e1000e 0000:07:00.0 eth2: Intel(R) PRO/1000 Network Connection
[   12.175760] e1000e 0000:07:00.0 eth2: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[   12.183702] e1000e 0000:08:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[   12.193366] e1000e 0000:08:00.0: irq 32 for MSI/MSI-X
[   12.193383] e1000e 0000:08:00.0: irq 33 for MSI/MSI-X
[   12.193397] e1000e 0000:08:00.0: irq 34 for MSI/MSI-X
[   12.325919] e1000e 0000:08:00.0 eth3: registered PHC clock
[   12.331662] e1000e 0000:08:00.0 eth3: (PCI Express:2.5GT/s:Width x1) 00:d0:81:0a:03:c5
[   12.339700] e1000e 0000:08:00.0 eth3: Intel(R) PRO/1000 Network Connection
[   12.346671] e1000e 0000:08:00.0 eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
full dmesg from 7.2: http://pastebin.centos.org/75796/

from 7.3 (erroneous) booting:

Code: Select all

[    0.816656] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    0.822119] pci 0000:05:00.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.004875] pci 0000:08:00.0: reg 0x14: [mem 0xfe100000-0xfe17ffff]
[    1.074173] pci 0000:08:00.0: can't claim BAR 1 [mem 0xfe100000-0xfe17ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    1.449062] pci 0000:05:00.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.468398] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.501577] pci_bus 0000:05: resource 1 [mem 0xfe100000-0xfe2fffff]
[    1.501589] pci_bus 0000:06: resource 1 [mem 0xfe100000-0xfe2fffff]
[   12.750261] e1000e: Intel(R) PRO/1000 Network Driver - 3.3.5.3-NAPI
[   12.750263] e1000e: Copyright(c) 1999 - 2016 Intel Corporation.
[   12.750441] e1000e 0000:07:00.0: can't derive routing for PCI INT A
[   12.782223] e1000e 0000:07:00.0: PCI INT A: no GSI - using ISA IRQ 5
[   12.791139] e1000e: probe of 0000:07:00.0 failed with error -5
[   12.791366] e1000e 0000:08:00.0: can't derive routing for PCI INT A
[   12.803800] e1000e 0000:08:00.0: PCI INT A: no GSI - using ISA IRQ 5
[   12.810823] e1000e: probe of 0000:08:00.0 failed with error -5
full dmesg from 7.3: http://pastebin.centos.org/75801/
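To make the failing lines easier to compare per device, the 7.3 extract can be grouped by PCI address. This is a rough Python sketch on a few lines copied from the extract above. The list of failure markers is an illustrative guess rather than a complete set of kernel messages, and `failures_by_device` is a made-up helper name.

```python
import re
from collections import defaultdict

# A few lines copied from the 7.3 extract above.
DMESG_73 = """\
[    1.074173] pci 0000:08:00.0: can't claim BAR 1 [mem 0xfe100000-0xfe17ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[   12.750441] e1000e 0000:07:00.0: can't derive routing for PCI INT A
[   12.791139] e1000e: probe of 0000:07:00.0 failed with error -5
[   12.791366] e1000e 0000:08:00.0: can't derive routing for PCI INT A
[   12.810823] e1000e: probe of 0000:08:00.0 failed with error -5
"""

# Substrings treated as failure markers (illustrative, not exhaustive).
MARKERS = ("can't claim", "can't derive routing", "failed with error")

# Full PCI address: domain:bus:device.function
DEVICE = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def failures_by_device(log: str) -> dict:
    """Group failure lines by the first PCI address found in each line."""
    grouped = defaultdict(list)
    for line in log.splitlines():
        if not any(marker in line for marker in MARKERS):
            continue
        match = DEVICE.search(line)
        if match:
            # Drop the "[ timestamp ]" prefix for readability.
            grouped[match.group(0)].append(line.split("] ", 1)[1])
    return dict(grouped)


for device, lines in sorted(failures_by_device(DMESG_73).items()):
    print(device)
    for text in lines:
        print("   ", text)
```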


Judging from the details from within dmesg:

Code: Select all

[    0.224182] pci 0000:00:1f.0: can't claim BAR 13 [io  0x0800-0x087f]: address conflict with ACPI CPU throttle [io  0x0810-0x0815]
...
[    0.289182] PCI: Using ACPI for IRQ routing
[    0.290021] PCI: Discovered peer bus 07
[    0.291006] PCI: root bus 07: using default resources
[    0.291011] PCI: Probing PCI hardware (bus 07)
[    0.291085] ACPI: \: failed to evaluate _DSM (0x1001)
[    0.292006] PCI host bridge to bus 0000:07
[    0.293010] pci_bus 0000:07: root bus resource [io  0x0000-0xffff]
[    0.294009] pci_bus 0000:07: root bus resource [mem 0x00000000-0xfffffffff]
[    0.295008] pci_bus 0000:07: No busn resource found for root bus, will use [bus 07-ff]
[    0.296010] pci_bus 0000:07: busn_res: can not insert [bus 07-ff] under domain [bus 00-ff] (conflicts with (null) [bus 00-ff])
[    0.296065] pci 0000:07:00.0: [8086:10d3] type 00 class 0x020000
[    0.296113] pci 0000:07:00.0: reg 0x10: [mem 0xfe280000-0xfe29ffff]
[    0.296146] pci 0000:07:00.0: reg 0x14: [mem 0xfe200000-0xfe27ffff]
[    0.296180] pci 0000:07:00.0: reg 0x18: [io  0xc000-0xc01f]
[    0.296213] pci 0000:07:00.0: reg 0x1c: [mem 0xfe2b0000-0xfe2b3fff]
[    0.296300] pci 0000:07:00.0: reg 0x30: [mem 0xfe2a0000-0xfe2affff pref]
[    0.296436] pci 0000:07:00.0: PME# supported from D0 D3hot D3cold
[    0.296646] pci_bus 0000:07: busn_res: [bus 07-ff] end is updated to 07
[    0.296655] pci_bus 0000:07: busn_res: can not insert [bus 07] under domain [bus 00-ff] (conflicts with (null) [bus 00-ff])
[    0.296666] PCI: Discovered peer bus 08
[    0.297007] PCI: root bus 08: using default resources
[    0.297011] PCI: Probing PCI hardware (bus 08)
[    0.297067] ACPI: \: failed to evaluate _DSM (0x1001)
[    0.298006] PCI host bridge to bus 0000:08
[    0.299008] pci_bus 0000:08: root bus resource [io  0x0000-0xffff]
[    0.300009] pci_bus 0000:08: root bus resource [mem 0x00000000-0xfffffffff]
[    0.301008] pci_bus 0000:08: No busn resource found for root bus, will use [bus 08-ff]
[    0.302010] pci_bus 0000:08: busn_res: can not insert [bus 08-ff] under domain [bus 00-ff] (conflicts with (null) [bus 00-ff])
[    0.302061] pci 0000:08:00.0: [8086:10d3] type 00 class 0x020000
[    0.302106] pci 0000:08:00.0: reg 0x10: [mem 0xfe180000-0xfe19ffff]
[    0.302140] pci 0000:08:00.0: reg 0x14: [mem 0xfe100000-0xfe17ffff]
[    0.302173] pci 0000:08:00.0: reg 0x18: [io  0xb000-0xb01f]
[    0.302207] pci 0000:08:00.0: reg 0x1c: [mem 0xfe1b0000-0xfe1b3fff]
[    0.302296] pci 0000:08:00.0: reg 0x30: [mem 0xfe1a0000-0xfe1affff pref]
[    0.302427] pci 0000:08:00.0: PME# supported from D0 D3hot D3cold
[    0.303035] pci_bus 0000:08: busn_res: [bus 08-ff] end is updated to 08
[    0.303043] pci_bus 0000:08: busn_res: can not insert [bus 08] under domain [bus 00-ff] (conflicts with (null) [bus 00-ff])
[    0.304278] PCI: pci_cache_line_size set to 64 bytes
[    0.304343] pci 0000:07:00.0: can't claim BAR 0 [mem 0xfe280000-0xfe29ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.305012] pci 0000:07:00.0: can't claim BAR 1 [mem 0xfe200000-0xfe27ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.306010] pci 0000:07:00.0: can't claim BAR 2 [io  0xc000-0xc01f]: address conflict with PCI Bus 0000:00 [io  0x0d00-0xffff window]
[    0.307010] pci 0000:07:00.0: can't claim BAR 3 [mem 0xfe2b0000-0xfe2b3fff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.309014] pci 0000:08:00.0: can't claim BAR 0 [mem 0xfe180000-0xfe19ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.310018] pci 0000:08:00.0: can't claim BAR 1 [mem 0xfe100000-0xfe17ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.311010] pci 0000:08:00.0: can't claim BAR 2 [io  0xb000-0xb01f]: address conflict with PCI Bus 0000:00 [io  0x0d00-0xffff window]
[    0.312010] pci 0000:08:00.0: can't claim BAR 3 [mem 0xfe1b0000-0xfe1b3fff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.313046] Expanded resource reserved due to conflict with PCI Bus 0000:00
...
[    0.464309] pci 0000:07:00.0: can't claim BAR 6 [mem 0xfe2a0000-0xfe2affff pref]: address conflict with reserved [mem 0xbf600000-0xffffffff]
[    0.477053] pci 0000:08:00.0: can't claim BAR 6 [mem 0xfe1a0000-0xfe1affff pref]: address conflict with reserved [mem 0xbf600000-0xffffffff]
...
[    0.659113] pci 0000:07:00.0: BAR 1: no space for [mem size 0x00080000]
[    0.665810] pci 0000:07:00.0: BAR 1: trying firmware assignment [mem size 0x00080000]
[    0.673761] pci 0000:07:00.0: BAR 1: [mem size 0x00080000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.683885] pci 0000:07:00.0: BAR 1: failed to assign [mem size 0x00080000]
[    0.690927] pci 0000:07:00.0: BAR 0: no space for [mem size 0x00020000]
[    0.697618] pci 0000:07:00.0: BAR 0: trying firmware assignment [mem size 0x00020000]
[    0.705569] pci 0000:07:00.0: BAR 0: [mem size 0x00020000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.715691] pci 0000:07:00.0: BAR 0: failed to assign [mem size 0x00020000]
[    0.722739] pci 0000:07:00.0: BAR 6: no space for [mem size 0x00010000 pref]
[    0.729870] pci 0000:07:00.0: BAR 6: failed to assign [mem size 0x00010000 pref]
[    0.737390] pci 0000:07:00.0: BAR 3: no space for [mem size 0x00004000]
[    0.744083] pci 0000:07:00.0: BAR 3: trying firmware assignment [mem size 0x00004000]
[    0.752036] pci 0000:07:00.0: BAR 3: [mem size 0x00004000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.762157] pci 0000:07:00.0: BAR 3: failed to assign [mem size 0x00004000]
[    0.769199] pci 0000:07:00.0: BAR 2: no space for [io  size 0x0020]
[    0.775544] pci 0000:07:00.0: BAR 2: trying firmware assignment [io  size 0x0020]
[    0.783147] pci 0000:07:00.0: BAR 2: [io  size 0x0020] conflicts with PCI Bus 0000:00 [io  0x0d00-0xffff window]
[    0.793443] pci 0000:07:00.0: BAR 2: failed to assign [io  size 0x0020]
[    0.800136] pci_bus 0000:07: Some PCI device resources are unassigned, try booting with pci=realloc
[    0.809308] pci_bus 0000:07: resource 4 [io  0x0000-0xffff]
[    0.809314] pci_bus 0000:07: resource 5 [mem 0x00000000-0xfffffffff]
[    0.809325] pci 0000:08:00.0: BAR 1: no space for [mem size 0x00080000]
[    0.816019] pci 0000:08:00.0: BAR 1: trying firmware assignment [mem size 0x00080000]
[    0.823974] pci 0000:08:00.0: BAR 1: [mem size 0x00080000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.834094] pci 0000:08:00.0: BAR 1: failed to assign [mem size 0x00080000]
[    0.841138] pci 0000:08:00.0: BAR 0: no space for [mem size 0x00020000]
[    0.847830] pci 0000:08:00.0: BAR 0: trying firmware assignment [mem size 0x00020000]
[    0.855787] pci 0000:08:00.0: BAR 0: [mem size 0x00020000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.865911] pci 0000:08:00.0: BAR 0: failed to assign [mem size 0x00020000]
[    0.872954] pci 0000:08:00.0: BAR 6: no space for [mem size 0x00010000 pref]
[    0.880083] pci 0000:08:00.0: BAR 6: failed to assign [mem size 0x00010000 pref]
[    0.887602] pci 0000:08:00.0: BAR 3: no space for [mem size 0x00004000]
[    0.894295] pci 0000:08:00.0: BAR 3: trying firmware assignment [mem size 0x00004000]
[    0.902244] pci 0000:08:00.0: BAR 3: [mem size 0x00004000] conflicts with reserved [mem 0xbf600000-0xffffffff]
[    0.912369] pci 0000:08:00.0: BAR 3: failed to assign [mem size 0x00004000]
[    0.919410] pci 0000:08:00.0: BAR 2: no space for [io  size 0x0020]
[    0.926724] pci 0000:08:00.0: BAR 2: trying firmware assignment [io  size 0x0020]
[    0.934329] pci 0000:08:00.0: BAR 2: [io  size 0x0020] conflicts with PCI Bus 0000:00 [io  0x0d00-0xffff window]
[    0.944628] pci 0000:08:00.0: BAR 2: failed to assign [io  size 0x0020]
[    0.951323] pci_bus 0000:08: Some PCI device resources are unassigned, try booting with pci=realloc
[    0.960494] pci_bus 0000:08: resource 4 [io  0x0000-0xffff]
[    0.960500] pci_bus 0000:08: resource 5 [mem 0x00000000-0xfffffffff]
...
[    4.374074] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[    4.380032] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[    4.387708] e1000e 0000:07:00.0: can't derive routing for PCI INT A
[    4.394117] e1000e 0000:07:00.0: PCI INT A: no GSI - using ISA IRQ 5
[    4.402578] e1000e: probe of 0000:07:00.0 failed with error -5
[    4.409403] e1000e 0000:08:00.0: can't derive routing for PCI INT A
[    4.415801] e1000e 0000:08:00.0: PCI INT A: no GSI - using ISA IRQ 5
[    4.423473] e1000e: probe of 0000:08:00.0 failed with error -5
as well as from the regular changelog for the 7.2 -> 7.3 kernel commit diff (https://git.centos.org/summary/rpms!kernel), it seems most probable to me that this issue originates from one of the following types of kernel commits: "ACPI", "BAR", "PAT", "MTRR".
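Each "can't claim BAR" line carries two address ranges: the BAR as programmed by the firmware and the resource it collides with. The following Python sketch only checks the arithmetic on one line copied from the log above. It does not explain why the kernel rejects the claim, and `parse_ranges` is a made-up helper name.

```python
import re

# Matches "[mem 0x...-0x..." and "[io  0x...-0x..." as printed in dmesg.
RANGE = re.compile(r"\[(mem|io)\s+0x([0-9a-f]+)-0x([0-9a-f]+)")


def parse_ranges(line: str) -> list:
    """Return every (kind, start, end) address range found in a dmesg line."""
    return [(kind, int(start, 16), int(end, 16))
            for kind, start, end in RANGE.findall(line)]


# One line copied from the 7.3 log above (timestamp removed).
LINE = ("pci 0000:07:00.0: can't claim BAR 1 [mem 0xfe200000-0xfe27ffff]: "
        "address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]")

(bar_kind, bar_start, bar_end), (win_kind, win_start, win_end) = parse_ranges(LINE)

# Size of the BAR in bytes (inclusive range).
bar_size = bar_end - bar_start + 1

# Does the BAR lie completely inside the window it is reported to conflict with?
inside = win_start <= bar_start and bar_end <= win_end

print(hex(bar_size), inside)
```

The computed size, 0x80000, matches the later "BAR 1: no space for [mem size 0x00080000]" message. The BAR also lies entirely inside the bus 0000:00 window named in the conflict.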




QUESTIONS:

Do you have ...
a) any suggestions on how to further pinpoint the origin of this issue, i.e. why the dual Gbit network card no longer works with 7.3?

b) any idea how we could work around this ACPI resource issue with all 7.3+ kernels?

c) any advice on how to report this regression upstream to Red Hat? Apparently only their backported kernels starting with 7.3 are affected (see above: e.g. the elrepo mainline/long-term 4.10/4.4 kernels are definitely NOT affected). It would be really great if this could at least be fixed for 7.4.
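Regarding a), one low-tech starting point would be to compare the two boot logs after stripping the timestamps, so that only messages unique to the 7.3 boot remain. Below is a minimal Python sketch of that idea, using a handful of lines copied from the extracts above in place of the full pastebin logs. The helper names `strip_timestamps` and `only_in_second` are made up for illustration.

```python
import re

# Leading "[   12.345678] " timestamp as printed by dmesg.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

# Sample lines copied from the 7.2 (working) extract.
GOOD = """\
[    0.795616] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    0.808156] pci 0000:08:00.0: reg 0x14: [mem 0xfe100000-0xfe17ffff]
[   12.023565] e1000e 0000:07:00.0: irq 29 for MSI/MSI-X
"""

# Sample lines copied from the 7.3 (failing) extract.
BAD = """\
[    0.816656] pci 0000:04:06.0:   bridge window [mem 0xfe100000-0xfe2fffff]
[    1.004875] pci 0000:08:00.0: reg 0x14: [mem 0xfe100000-0xfe17ffff]
[    1.074173] pci 0000:08:00.0: can't claim BAR 1 [mem 0xfe100000-0xfe17ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[   12.791139] e1000e: probe of 0000:07:00.0 failed with error -5
"""


def strip_timestamps(log: str) -> list:
    """Return the non-empty lines of a dmesg log without their timestamps."""
    return [TIMESTAMP.sub("", line) for line in log.splitlines() if line.strip()]


def only_in_second(first_log: str, second_log: str) -> list:
    """Messages that appear in second_log but nowhere in first_log, in order."""
    seen = set(strip_timestamps(first_log))
    return [line for line in strip_timestamps(second_log) if line not in seen]


for message in only_in_second(GOOD, BAD):
    print(message)
```

On full logs this will also list harmless differences (driver version banners, IRQ numbers), so the output still needs to be read by hand.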


Thanks a lot for your help!

Best regards,
RJ



[EDIT]: just modified subject, to reflect the actual reason causing this regression![/EDIT]
Last edited by rujaajur on 2017/12/11 08:47:57, edited 1 time in total.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Update 7.2->7.3 = kernel regression [ACPI?!]: Intel e1000e-drivers cannot be loaded anymore / resource conflicts

Post by TrevorH » 2017/03/31 09:57:06

[ 12.750261] e1000e: Intel(R) PRO/1000 Network Driver - 3.3.5.3-NAPI
Are you sure that's using the 3.10.0-514.10.2.el7 kernel? My modinfo here says e1000e is 3.2.6-k not 3.3.5.3-NAPI
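For reference, the driver version banners can be pulled out of a saved dmesg log with a short Python sketch like the one below. It only reads the banner lines already quoted in this thread and is no replacement for checking `modinfo` on the running system. `driver_versions` is a made-up helper name.

```python
import re

# Banner line printed by the e1000e driver when it loads.
BANNER = re.compile(r"e1000e: Intel\(R\) PRO/1000 Network Driver - (\S+)")

# The three banner lines quoted in this thread.
LINES = """\
[   11.975360] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.5-k
[   12.750261] e1000e: Intel(R) PRO/1000 Network Driver - 3.3.5.3-NAPI
[    4.374074] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
"""


def driver_versions(log: str) -> list:
    """Return every e1000e version string announced in the log, in order."""
    return BANNER.findall(log)


print(driver_versions(LINES))
```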

rujaajur
Posts: 4
Joined: 2017/03/31 07:09:48


Post by rujaajur » 2017/03/31 10:35:34

TrevorH wrote: Are you sure that's using the 3.10.0-514.10.2.el7 kernel? My modinfo here says e1000e is 3.2.6-k not 3.3.5.3-NAPI
Hi TrevorH,

yes, 100% positive that we are using 3.10.0-514.10.2.el7! The stock e1000e 3.2.6-k (as you mentioned) could not be loaded at first (due to the ACPI resource conflict mentioned above), so we then compiled the latest official Intel driver (yes, Intel hosts its official driver sources at SF) -> https://sourceforge.net/projects/e1000/ ... e/3.3.5.3/ ... because the e1000e-3.3.5.3 changelog claims that it specifically fixes/adds support for RHEL 7.3. Unfortunately that latest driver did not work either.

I also tried the other 7.3 kernels, namely 3.10.0-514.el7, 3.10.0-514.2.2.el7 and 3.10.0-514.6.2.el7, unfortunately with the very same result as with the latest 3.10.0-514.10.2.el7. Looking closer at the upper parts of the linked dmesg log, it seems that the hardware cannot get valid ACPI resources to begin with once any 7.3 kernel starts loading. So the e1000e driver failing to load is mainly a consequence and not the initial cause of the trouble.
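The ordering can be checked directly from the timestamps: the first resource failure for a device shows up seconds before the driver even tries to probe it. Here is a small Python sketch on four lines copied from the 7.3 log in the first post, with `first_seen` as a made-up helper name.

```python
import re

# Splits a dmesg line into timestamp (seconds since boot) and message.
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Lines copied from the 7.3 log in the first post.
EXCERPT = """\
[    0.304343] pci 0000:07:00.0: can't claim BAR 0 [mem 0xfe280000-0xfe29ffff]: address conflict with PCI Bus 0000:00 [mem 0xbf700000-0xffffffff window]
[    0.715691] pci 0000:07:00.0: BAR 0: failed to assign [mem size 0x00020000]
[    4.387708] e1000e 0000:07:00.0: can't derive routing for PCI INT A
[    4.402578] e1000e: probe of 0000:07:00.0 failed with error -5
"""


def first_seen(log: str, needle: str):
    """Timestamp of the first line whose message contains needle, or None."""
    for raw in log.splitlines():
        match = LINE.match(raw)
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


resource_failure = first_seen(EXCERPT, "can't claim BAR")
probe_failure = first_seen(EXCERPT, "probe of 0000:07:00.0 failed")

print("first BAR claim failure:", resource_failure)
print("driver probe failure:   ", probe_failure)
```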



EDIT: I have additionally installed and tried the most recent Oracle UEK4 kernel for RHEL7 (4.1.12-61.1.28.el7uek, Feb 24th 2017) and can confirm that this kernel series is also unaffected, just like the elrepo kernels.

For now, only the official RHEL 7.3 kernels (3.10.0-514*) and the newer 7.4 BETA (e.g. 3.10.0-620.el7_bz1394089_v2) appear to be affected by this regression. Unfortunately I do not have access to intermediate RHEL 7.3 BETA kernel revisions to further pinpoint when this bug was actually introduced.

rujaajur
Posts: 4
Joined: 2017/03/31 07:09:48


Post by rujaajur » 2017/05/03 08:29:50

Re-tested with the latest 7.3 kernel (*-514.16.1); as somewhat expected, the issue still persists. I therefore also registered at the official Red Hat Bugzilla and opened a bug there myself (unfortunately it is still marked as private, and I cannot change this myself).


SHORT QUESTION: Are there ANY adverse side effects to be expected if we boot a fully updated 7.3 SYSTEM with the last regular 7.2 kernel (3.10.0-327.36.3)?


At least in the meantime, until this regression gets fixed, this could be a viable way to proceed. What do you think?
=> I am very thankful for your input!

rujaajur
Posts: 4
Joined: 2017/03/31 07:09:48


Post by rujaajur » 2017/12/11 08:44:50

Over the last months this bug has been analyzed, resolved and even verified (over at the official Red Hat Bugzilla).

Still, a fix apparently is not scheduled before RHEL 7.5, unfortunately:

* actual reason for this bug: the kernel can no longer enumerate PCIe switches below PCI-to-PCIe bridges
