Occasional server crash -- possible reasons


PeterNic
06-19-2009, 09:49 PM
We have identified two cases in which a server that is part of a grid may fail (hang or reboot). AppLogic restarts the appliances on other servers in order to recover the applications, but this still disrupts the app for a short time and breaks some volume mirrors.

Reason 1 (observed primarily on servers loaded with many small appliances): the hypervisor can steal memory from the AppLogic kernel and I/O subsystem, leaving them with insufficient memory. Grid maintainers can check the server logs to confirm whether this is happening, either before or after a server failure.

Reason 2: certain hotfixes replace the hypervisor and may, as a side effect, revert the memory allocation for the AppLogic kernel and I/O subsystem from 384M / 512M to 256M (the hypervisor's default). This is easy to check: maintainers can run 'cat /proc/meminfo' and look at MemTotal, and check the dom0 size on the Xen command line in /etc/grub.conf. It is also easy to fix: use aldo set <grid> dom0_vm_mb=XXX, where XXX is the desired value (usually 384 for AppLogic 2.1/2.2 and 512 for AppLogic 2.4+).
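
The check described for Reason 2 could be scripted roughly as follows. This is only a sketch, not an AppLogic tool: it assumes the dom0 size appears in /etc/grub.conf as the standard Xen boot option dom0_mem= (the post does not name the option), and the helper names and the 64 MB slack are made up for illustration. MemTotal normally reads somewhat lower than the configured dom0 size, which is why the comparison allows some slack.

```python
import re

# Xen size suffixes, expressed in KiB. A value with no suffix is treated as KiB.
_KIB_PER_UNIT = {"K": 1, "M": 1024, "G": 1024 * 1024}


def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo contents in MB, or None if absent."""
    match = re.search(r"^MemTotal:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) // 1024


def dom0_mem_mb(grub_text):
    """Return the first dom0_mem= value found in grub.conf contents, in MB.

    Assumption: the dom0 size is set with the Xen option dom0_mem=.
    Commented-out lines are ignored. Returns None if no value is found.
    """
    for line in grub_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        match = re.search(r"\bdom0_mem=(\d+)([KkMmGg]?)", stripped)
        if match:
            unit = match.group(2).upper() or "K"
            return int(match.group(1)) * _KIB_PER_UNIT[unit] // 1024
    return None


def looks_reverted(actual_mb, expected_mb, slack_mb=64):
    """True if the actual size is well below the expected size.

    slack_mb is a hypothetical tolerance, not a documented AppLogic value.
    """
    return actual_mb < expected_mb - slack_mb


def read_text(path):
    """Read a file, returning None if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


if __name__ == "__main__":
    expected_mb = 512  # 384 for AppLogic 2.1/2.2, 512 for AppLogic 2.4+
    meminfo = read_text("/proc/meminfo")
    grub = read_text("/etc/grub.conf")
    total = mem_total_mb(meminfo) if meminfo else None
    configured = dom0_mem_mb(grub) if grub else None
    print("MemTotal (MB):", total)
    print("dom0_mem in grub.conf (MB):", configured)
    for label, value in (("MemTotal", total), ("grub.conf", configured)):
        if value is not None and looks_reverted(value, expected_mb):
            print(label, "is well below", expected_mb, "MB; allocation may have been reverted")
```

Run it on the server itself with expected_mb set for your AppLogic version. A result near 256 on either line would match the symptom described in Reason 2.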

These problems are quite rare in most systems. If you suspect this may affect your system, please contact your service provider (for VPDC and other service users) or our helpdesk (for licensees).

We will post hotfix information as hotfixes for these issues are prepared.


Regards,
-- Peter

PeterNic
06-24-2009, 09:00 AM
The following hotfixes cause the issue described under Reason 2 above:

AppLogic 2.4.7/2.4.8: hf2841 is now recalled

AppLogic 2.1.1: hf2114 and e2365 are now recalled (hf2114 is OK if installed as a distro hotfix; the problem occurs only when the hotfix is installed on an already existing grid)


We have reviewed the grid version reports and there were very few grids affected; the customers are being notified.

Best regards,
-- Peter