Shutdown called from unknown source[Solved]

General support questions
Post Reply
TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Shutdown called from unknown source[Solved]

Post by TimO » 2018/09/23 00:48:16

Centos 7.5.1804 on an HP DL380 G7

At 08:09:53 on Sept 20, something initiated a shutdown of this system. Since that time, running the system in anything except init 1 will last about 2 minutes before something calls for a shutdown. Running in init 1 will go indefinitely as will running a live DVD of any kind and the iLO on the server shows it's health to be fine. Also ran memtest for 18 hours with no errors. I'm sure it's not hardware related. I've disabled everything I can think of that would initiate an automatic shutdown: all the hp tools, snmp, Eaton power monitor, disabled ASR in the BIOS.

Log snippet:

Sep 22 13:29:46 tgate journal: No devices in use, exit
Sep 22 13:29:50 tgate systemd-shutdownd: Shutdown canceled.
Sep 22 13:30:19 tgate systemd-shutdownd: Shutdown canceled.
Sep 22 13:30:21 tgate systemd-shutdownd: Shutting down at Sat 2018-09-22 13:30:21 MDT (poweroff)...
Sep 22 13:30:21 tgate systemd-shutdownd: Creating /run/nologin, blocking further logins...
Sep 22 13:30:21 tgate alsactl[930]: /usr/sbin/alsactl: save_state:1595No soundcards found...
Sep 22 13:30:21 tgate alsactl[930]: alsactl daemon stopped
--
Sep 22 14:04:49 tgate hpasrd[8184]: Successfully mapped device.
Sep 22 14:04:55 tgate hpasrd[8184]: Stopping the timer.
Sep 22 14:04:58 tgate journal: No devices in use, exit
Sep 22 14:05:35 tgate systemd-shutdownd: Shutting down at Sat 2018-09-22 14:05:35 MDT (poweroff)...
Sep 22 14:05:35 tgate systemd-shutdownd: Creating /run/nologin, blocking further logins...
Sep 22 14:05:35 tgate snmpd[7888]: Received TERM or STOP signal... shutting down...
Sep 22 14:05:35 tgate ntpd[7444]: ntpd exiting on signal 15


In this log snippet; I just started to type "shutdown -c" repeatedly after I set init 2. Only managed to type it twice before something called for a shutdown. This is driving me quite crazy. Am currently downloading the newest iso to try and force an overwrite of all packages. Definitely don't want to do a fresh install.

Does anybody have an idea where this shutdown request may be originating from?
Last edited by TimO on 2018/10/08 03:10:32, edited 1 time in total.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: Shutdown called from unknown source

Post by hunter86_bg » 2018/09/23 04:48:08

What is the 'IdleAction' value in '/etc/systemd/logind.conf'?
It could be defined in separate file -> '/etc/systems/logins.d/file.conf'

TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Re: Shutdown called from unknown source

Post by TimO » 2018/09/23 05:12:43

hunter86_bg wrote:
2018/09/23 04:48:08
What is the 'IdleAction' value in '/etc/systemd/logind.conf'?
It could be defined in separate file -> '/etc/systems/logins.d/file.conf'
It's shown thusly.
#IdleAction=ignore
#IdleActionSec=30min

If only there was a way to record requests going in to systemctl.....

Addendum: Another strange thing is that on the initial shutdown when this issue began; no notifications were sent to any of the 3 machines logged into it. It doesn't do so on the local terminal either; the screen just goes blank and the machine shuts down.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: Shutdown called from unknown source

Post by hunter86_bg » 2018/09/23 11:44:06

Check for script in the cron or systemd timer that invokes a service to shutdown.
Also, old school admins prefer putting stuff in rc.local , so check there also.
Consider changing the '/etc/systemd/logind.conf' so the machine will ignore the push of the power button.

TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Re: Shutdown called from unknown source

Post by TimO » 2018/09/25 04:01:30

No odd crons or .timers.

Little more info:

I haven't found the problem, but I stumbled into a work-a-round of sorts.
If I boot into runlevel 1, do systemctl start network, and init 3 the system will stay up and mostly work. It actually appeared to be working normally until I noticed that some local network addresses weren't resolving correctly. I checked /etc/resolv.conf and saw that my modem by name was listed first, modem by IP second, ISP nameserver 3rd and my local(this same machine) was listed last. I removed the modem entries and put my machine first and within 2 minutes the machine received a poweroff command and shut down.

If I start named as well as networking while in runlevel 1; resolv.conf is correct: my nameserver first and ISP namserver 2nd and all DNS services work correctly(locally anyway). Then if I init 3; my resolv.conf gets set as above and things just mostly work.

This is sort of leading me to believe there is an issue with bind or the network manager or something that is called in multi-user.target is borking things up.

Thoughts???

desertcat
Posts: 843
Joined: 2014/08/07 02:17:29
Location: Tucson, AZ

Re: Shutdown called from unknown source

Post by desertcat » 2018/09/25 16:51:59

TimO wrote:
2018/09/25 04:01:30
No odd crons or .timers.

Little more info:

I haven't found the problem, but I stumbled into a work-a-round of sorts.
If I boot into runlevel 1, do systemctl start network, and init 3 the system will stay up and mostly work. It actually appeared to be working normally until I noticed that some local network addresses weren't resolving correctly. I checked /etc/resolv.conf and saw that my modem by name was listed first, modem by IP second, ISP nameserver 3rd and my local(this same machine) was listed last. I removed the modem entries and put my machine first and within 2 minutes the machine received a poweroff command and shut down.

If I start named as well as networking while in runlevel 1; resolv.conf is correct: my nameserver first and ISP namserver 2nd and all DNS services work correctly(locally anyway). Then if I init 3; my resolv.conf gets set as above and things just mostly work.

This is sort of leading me to believe there is an issue with bind or the network manager or something that is called in multi-user.target is borking things up.

Thoughts???
OK there are a few things that come to mind, these are things to do just to rule them out as a cause ie they probably are NOT the cause, but who knows?

1) Can you log in as another user?? See if the problem goes away, if it does then the problem is in the users /home/USER .
[While I did not have your problem, the second I logged in as alternate User the problem disappeared.]

2) Have you checked the amount of FREE space in each partition still available?
[ It could be you are running out of space, in which case you need to free up some space in the partition. Copy huge files to a backup media]

3) Check temperatures.
[If something is getting too hot, the machine shuts itself down to protect itself.]

4) Have you done a hardware diagnostics test, especially of the HDD or SSD?
[This could be a sign that there is an impending disaster on the horizon. If you get wonky answers.BACKUP THE ENTIRE DRIVE NOW! (Ask me how I know!)]

5) This *may* be a problem with GDM. Check the size contents of your /var/log/Xorg.0.log files, as well as you /var/log/gdm files and check their size and contents.
[See about switching to LightDM. I try to avoid GDM like the plaque.]

6) Have you tried an alternate Desktop Environment?
[Say KDE if you are running GNOME, to see if this is a problem with your DE.]

OK that's all I can think of at this point, but these are things that have happened to me in the past and how I fixed them. By eliminating these problems you can possibly narrow it down to say a faulty config file.

TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Re: Shutdown called from unknown source

Post by TimO » 2018/09/30 18:28:49

desertcat wrote:
2018/09/25 16:51:59
TimO wrote:
2018/09/25 04:01:30
No odd crons or .timers.

Little more info:

I haven't found the problem, but I stumbled into a work-a-round of sorts.
If I boot into runlevel 1, do systemctl start network, and init 3 the system will stay up and mostly work. It actually appeared to be working normally until I noticed that some local network addresses weren't resolving correctly. I checked /etc/resolv.conf and saw that my modem by name was listed first, modem by IP second, ISP nameserver 3rd and my local(this same machine) was listed last. I removed the modem entries and put my machine first and within 2 minutes the machine received a poweroff command and shut down.

If I start named as well as networking while in runlevel 1; resolv.conf is correct: my nameserver first and ISP namserver 2nd and all DNS services work correctly(locally anyway). Then if I init 3; my resolv.conf gets set as above and things just mostly work.

This is sort of leading me to believe there is an issue with bind or the network manager or something that is called in multi-user.target is borking things up.

Thoughts???
OK there are a few things that come to mind, these are things to do just to rule them out as a cause ie they probably are NOT the cause, but who knows?

1) Can you log in as another user?? See if the problem goes away, if it does then the problem is in the users /home/USER .
[While I did not have your problem, the second I logged in as alternate User the problem disappeared.]
Doesn't matter whether I log in or not. If I boot the machine directly in to any multi-user runlevel; it shuts down within 2 minutes of completing its boot.
2) Have you checked the amount of FREE space in each partition still available?
[ It could be you are running out of space, in which case you need to free up some space in the partition. Copy huge files to a backup media]
No disks or partitions are approaching capacity; neither in space or i-node usage.
3) Check temperatures.
[If something is getting too hot, the machine shuts itself down to protect itself.]
All temperatures are green. System health is green across the board. Snmp/HP-Health would have reported any issues as well
4) Have you done a hardware diagnostics test, especially of the HDD or SSD?
[This could be a sign that there is an impending disaster on the horizon. If you get wonky answers.BACKUP THE ENTIRE DRIVE NOW! (Ask me how I know!)]
Everything hardware related tests fine. The machine stays up indefinitely as long as I don't correct resolv.conf
5) This *may* be a problem with GDM. Check the size contents of your /var/log/Xorg.0.log files, as well as you /var/log/gdm files and check their size and contents.
[See about switching to LightDM. I try to avoid GDM like the plaque.]
Generally don't use a graphical desktop. When I do; it's LightDM and Mate.
6) Have you tried an alternate Desktop Environment?
[Say KDE if you are running GNOME, to see if this is a problem with your DE.]
Haven't tried an alternative DE since no DE or login is required for the issue to present itself
OK that's all I can think of at this point, but these are things that have happened to me in the past and how I fixed them. By eliminating these problems you can possibly narrow it down to say a faulty config file.
Thank you for your thoughts, but I really think the issue revolves around something network related.

TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Re: Shutdown called from unknown source

Post by TimO » 2018/09/30 20:08:02

I've also found that rebooting my modem in multi-user mode results in a shutdown after the network is reconfigured.

Log of modem reboot:
Sep 29 14:33:33 tgate kernel: e1000e: ens5 NIC Link is Down
Sep 29 14:33:39 tgate NetworkManager[1556]: <info> [1538253219.3169] device (ens5): state change: activated -> unavailable (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:33:39 tgate NetworkManager[1556]: <info> [1538253219.3206] manager: NetworkManager state is now CONNECTED_LOCAL
Sep 29 14:33:39 tgate dbus[1009]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Sep 29 14:33:39 tgate dbus[1009]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Sep 29 14:33:39 tgate nm-dispatcher: req:1 'down' [ens5]: new request (4 scripts)
Sep 29 14:33:39 tgate nm-dispatcher: req:1 'down' [ens5]: start running ordered scripts...
Sep 29 14:33:39 tgate nm-dispatcher: req:2 'connectivity-change': new request (4 scripts)
Sep 29 14:33:39 tgate nm-dispatcher: req:2 'connectivity-change': start running ordered scripts...
Sep 29 14:34:06 tgate kernel: e1000e: ens5 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 29 14:34:06 tgate kernel: e1000e 0000:15:00.0 ens5: 10/100 speed: disabling TSO
Sep 29 14:34:06 tgate NetworkManager[1556]: <info> [1538253246.3494] device (ens5): carrier: link connected
Sep 29 14:34:06 tgate NetworkManager[1556]: <info> [1538253246.3506] device (ens5): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:34:17 tgate kernel: e1000e: ens5 NIC Link is Down
Sep 29 14:34:17 tgate NetworkManager[1556]: <info> [1538253257.8295] device (ens5): state change: disconnected -> unavailable (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:34:35 tgate NetworkManager[1556]: <info> [1538253275.3587] device (ens5): carrier: link connected
Sep 29 14:34:35 tgate kernel: e1000e: ens5 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 29 14:34:35 tgate kernel: e1000e 0000:15:00.0 ens5: 10/100 speed: disabling TSO
Sep 29 14:34:35 tgate NetworkManager[1556]: <info> [1538253275.3594] device (ens5): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:34:52 tgate kernel: e1000e: ens5 NIC Link is Down
Sep 29 14:34:52 tgate NetworkManager[1556]: <info> [1538253292.9685] device (ens5): state change: disconnected -> unavailable (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:34:55 tgate NetworkManager[1556]: <info> [1538253295.3136] device (ens5): carrier: link connected
Sep 29 14:34:55 tgate kernel: e1000e: ens5 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 29 14:34:55 tgate kernel: e1000e 0000:15:00.0 ens5: 10/100 speed: disabling TSO
Sep 29 14:34:55 tgate NetworkManager[1556]: <info> [1538253295.3143] device (ens5): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'external')
Sep 29 14:35:00 tgate systemd-shutdownd: Shutting down at Sat 2018-09-29 14:35:00 MDT (poweroff)...
Sep 29 14:35:00 tgate systemd-shutdownd: Creating /run/nologin, blocking further logins...


I'm sure this is because it sets my resolv.conf to the correct values which I verified on reboot. The correct resolv.conf which is set if I boot directly in to a multi-user runlevel and has been in use for 3+ years and will(now) cause the machine to shutdown:
search pri.teedyo.net teedyo-mt.us
nameserver 172.16.39.2
nameserver 127.0.0.1


Incorrect resolv.conf which is set if I start the network while in runlevel 1 before going multi-user and does not cause a shutdown.
search PK5001Z pri.teedyo.net teedyo-mt.us
nameserver 192.168.0.1
nameserver 205.171.2.65
nameserver 172.16.39.2


PK5001Z/192.168.0.1 is my modem.

So far, I haven't been able to identify anything in the configuration files or scripts that will trigger a shutdown if my resolv.conf is set a certain way. I don't really understand how anything related to resolv.conf can trigger a shutdown.

Will keep looking, of course.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: Shutdown called from unknown source

Post by hunter86_bg » 2018/09/30 23:40:44

Try the following (at the same time):
1) rename the shutdown binary to something else
2) rename the init binary to something else
3) rename the reboot binary to something else
If something is calling them, they won't exist. In case you need to restore them, you can boot from rescue media and rename them back.

Also, boot from rescue media, and check the logs in detail. You can chroot into your installation and disable ssh , if you have ipmi.

TimO
Posts: 9
Joined: 2016/04/21 20:35:34

Re: Shutdown called from unknown source

Post by TimO » 2018/10/08 03:03:19

I finally sorted out the issue. Turns out it was a hardware problem, but not with the server. One of the UPSs lost the ability to communicate with its web/snmp card. Apparently this put the web/snmp card in to some kind of panic state. It would wait the power loss verification time and the notification delay and then issue a shutdown. It didn't send a notification because the communication problem scrambled some of its settings and it didn't know that it was supposed to. I thought that I had isolated this possibility from peripherals when I disabled snmp. In hindsight, it looks as though the monitoring program (Eaton-IPP) maintains a connection to Eaton branded UPSs even without system level snmp. The reason my 3 step boot process allowed the machine to stay up was because the incorrect resolv.conf prevented the web/snmp card from identifying the machine it was supposed to be protecting.

It was a fluke that I found the communication problem with the UPS. I was examining the web page for the UPS and noticed that it hadn't performed a battery test for a while so I decided to initiate a manual test. It informed me that it was unable to initiate the command so I tried a few others with the same result. I decided to restart the UPS and when it came up it was obvious that it had lost communication with the web/snmp card because everything about the UPS was listed as 'unknown'.

As to why my 3 step boot process resulted in an improper resolv.conf; I can only speculate. When I did 'systemctl start network' from runlevel 1, it configured the network correctly via the init version of network. When I entered runlevel 3; NetworkManger ran and found its normal interfaces already configured. For some reason it went looking for other items to configure and in so doing it overwrote resolv.conf with improper settings. Confusing as it was; I'm grateful because it allowed the machine to operate almost normally until I found the problem. If this odd issue hadn't allowed the machine to remain up; I would have had to come up with something like hunter's disabling of shutdown on my own due to the lack of internet access. Yep, no failover server here. OK, there is, but it's disabled.

Thank you all for your ears.

Post Reply