
GSC/cPanel shutdown timeouts may cause data corruption


PeterNic
08-19-2007, 01:25 PM
This thread also applies to any GSC that has had cPanel (or another control panel or large application) installed.

If the GSC/cPanel appliance takes too long to shut down, AppLogic may assume it has hung on shutdown and kill it -- just as you would power off a physical server that appears to be hung on shutdown.

The shutdown timeout in AppLogic is sufficient for most single-function appliances; however, cPanel runs many services inside the appliance and may take longer to shut down, especially if it is not given enough CPU/memory resources.

Having the shutdown aborted may lead to database and/or filesystem corruption. While both databases and file systems take significant care to minimize the risk, aborted shutdowns -- especially if they happen regularly -- will eventually cause problems.

Here's what you can do:

1. In AppLogic 1.2

a) give the appliance enough CPU and memory so that it can shut down within the timeout. BTW, if it doesn't have enough resources to shut down reasonably quickly, then it probably doesn't have enough resources to operate well anyway.

OR

b) prior to stopping the cPanel application, log in to the appliance and execute 'shutdown now' (don't specify -h or use the halt command). This stops all services without actually shutting down the appliance, so AppLogic still thinks the appliance is running (if you did halt, AppLogic would assume the appliance had crashed and helpfully restart it for you). Once you lose the shell console (which happens when the shutdown stops the network services, after everything else -- incl. MySQL -- has stopped), go to the grid shell and stop the cPanel application.

AND/OR

c) upgrade to AppLogic 2.0 (even the beta version)


2. In AppLogic 2.0 (beta and later)

AppLogic 2 introduces 3 significant changes in the shutdown process compared to version 1.2:

- the default shutdown timeout is significantly extended
- you can now override the shutdown timeout in the appliance's attributes and specify a custom timeout
- you can force-kill an appliance that has hung during shutdown (if you decide not to wait out the now longer shutdown timeout)


If the extended timeout is still not long enough for your appliance to shut down successfully, use one of the following approaches:

- give the appliance more resources (see (a) above)
- extend the shutdown timeout of the appliance to give it sufficient time to shut down (open the cPanel application in the AppLogic infrastructure editor, right-click the appliance, select Attributes, check the override shutdown timeout option, and specify the new shutdown timeout in seconds)
- use the manual shutdown described in (b) above -- this really is not needed if you use the other approaches
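
If you go the timeout-override route, one rough way to pick a value is to measure how long a clean shutdown actually takes from the timestamps in /var/log/messages and then add headroom. The sketch below is only an illustration, not an AppLogic tool: it assumes syslog-style lines starting with a timestamp like 'Aug 19 20:20:13', it assumes the shutdown begins with a "shutting down for system halt" notice and ends with syslogd's "exiting on signal 15" line, and the headroom factor and floor are arbitrary numbers I picked for the example.

```python
import math
from datetime import datetime

# Assumed markers; other distributions or syslog daemons may word these differently.
START_MARKER = "shutdown: shutting down for system halt"
END_MARKER = "exiting on signal 15"


def parse_stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line."""
    # Syslog omits the year, so a fixed leap year is supplied to let any date parse.
    return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")


def shutdown_seconds(log_lines):
    """Seconds from the last shutdown notice to the syslog exit line after it."""
    lines = list(log_lines)
    starts = [i for i, line in enumerate(lines) if START_MARKER in line]
    if not starts:
        raise ValueError("no shutdown notice found")
    begin = starts[-1]
    for line in lines[begin + 1:]:
        if END_MARKER in line:
            delta = parse_stamp(line) - parse_stamp(lines[begin])
            return delta.total_seconds()
    raise ValueError("shutdown notice found, but no final syslog exit line")


def suggested_timeout(measured_seconds, headroom=3.0, floor=60):
    """Hypothetical rule of thumb: measured time times headroom, never below floor."""
    return max(floor, math.ceil(measured_seconds * headroom))


sample = [
    "Aug 19 20:20:13 srv1 shutdown: shutting down for system halt",
    "Aug 19 20:20:15 srv1 crond: crond shutdown succeeded",
    "Aug 19 20:20:19 srv1 exiting on signal 15",
]
print(shutdown_seconds(sample))                     # 6.0
print(suggested_timeout(shutdown_seconds(sample)))  # 60
```

Measure on a loaded appliance rather than an idle one; a cPanel box with busy MySQL will take noticeably longer than the bare numbers here suggest.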


You can confirm that the appliance shutdown completes normally by inspecting the /var/log/messages log inside the appliance -- at the end of the shutdown notices you should see something like the following lines:

Aug 19 20:20:13 srv1 shutdown: shutting down for system halt
Aug 19 20:20:13 srv1 init: Switching to runlevel: 0
Aug 19 20:20:14 srv1 rc: Stopping applogic_appliance: succeeded
Aug 19 20:20:14 srv1 haldaemon: haldaemon -TERM succeeded
Aug 19 20:20:14 srv1 messagebus: messagebus -TERM succeeded
Aug 19 20:20:14 srv1 atd: atd shutdown succeeded
Aug 19 20:20:14 srv1 xfs[3393]: terminating
Aug 19 20:20:14 srv1 xfs: xfs shutdown succeeded
Aug 19 20:20:14 srv1 httpd: httpd shutdown succeeded
Aug 19 20:20:14 srv1 sshd: sshd -TERM succeeded
Aug 19 20:20:14 srv1 xinetd[3281]: Exiting...
Aug 19 20:20:14 srv1 xinetd: xinetd shutdown succeeded
Aug 19 20:20:15 srv1 crond: crond shutdown succeeded
Aug 19 20:20:15 srv1 rc: Stopping applogic_cca: succeeded
Aug 19 20:20:15 srv1 rc: Stopping applogic_vma: succeeded
Aug 19 20:20:15 srv1 netfs: Unmounting NFS filesystems: succeeded
Aug 19 20:20:18 srv1 netfs: Unmounting CIFS filesystems: succeeded
Aug 19 20:20:18 srv1 rpc.statd[3080]: Caught signal 15, un-registering and exiting.
Aug 19 20:20:18 srv1 nfslock: rpc.statd shutdown succeeded
Aug 19 20:20:18 srv1 portmap: portmap shutdown succeeded
Aug 19 20:20:18 srv1 kernel: Kernel logging (proc) stopped.
Aug 19 20:20:18 srv1 kernel: Kernel log daemon terminating.
Aug 19 20:20:19 srv1 exiting on signal 15
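
If you'd rather not eyeball the log after every stop, a check like the one above can be scripted. This is a minimal sketch under a couple of assumptions: the marker strings are copied from the sample log above (your distribution or syslog daemon may word them differently), and the function name is mine, not anything AppLogic provides. It only looks at the most recent shutdown notice in the log and reports whether syslogd's own exit line follows it.

```python
# Markers taken from the sample log above (assumption: your log uses the same wording).
START_MARKER = "shutdown: shutting down for system halt"
END_MARKER = "exiting on signal 15"


def shutdown_completed(log_lines):
    """Return True if the most recent shutdown in the log ran to the end."""
    lines = [line.rstrip("\n") for line in log_lines]
    # Find the last shutdown notice; earlier shutdowns are ignored.
    start = None
    for index, line in enumerate(lines):
        if START_MARKER in line:
            start = index
    if start is None:
        return False
    # A clean shutdown ends with the syslog daemon reporting its own exit.
    return any(END_MARKER in line for line in lines[start + 1:])


sample = [
    "Aug 19 20:20:13 srv1 shutdown: shutting down for system halt",
    "Aug 19 20:20:15 srv1 crond: crond shutdown succeeded",
    "Aug 19 20:20:19 srv1 exiting on signal 15",
]
print(shutdown_completed(sample))      # True
print(shutdown_completed(sample[:2]))  # False
```

To run it against a real appliance, pass it the open log file, e.g. `shutdown_completed(open("/var/log/messages", errors="replace"))`, after the appliance has been started again.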


Regards,
-- Peter