controller failover
JeremyN
08-27-2008, 03:07 AM
Hi Guys,
Not sure if or how many times this has been mentioned in the past, but lately we have had a rash of hardware faults. This is usually no problem, as AppLogic is very fault tolerant in almost any circumstance, unless the failure happens to be on srv1 (or whichever node is running the controller). It's usually not too difficult to move the controller to another node, but at times this can cause extended downtime for a customer, depending on the circumstances.
I would be interested to see something implemented (or even to help implement it) that would keep track of the location of the mirrors for the controller VM boot/meta volumes, and have a watchdog timer, or something similar, in place on this node to detect a controller fault. Then, when a fault is detected, it would enable is_controller on the node holding the other volume streams and reboot the grid (or just start the controller), bringing the grid back online seamlessly. Beyond this, it would be a good idea to automatically replicate the boot/meta volumes upon bringing the new controller online, in case (god forbid) two faults in a row happen (and this has happened to me once :).
I realize there are a lot of variables to take into account here, and it may be easier said than done (and done in more ways than one), but the only real weakness I see here is the failure of the node hosting the controller domU. I would be interested in any feedback you guys may have on this.
-Jeremy
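To make the watchdog idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ControllerWatchdog`, `probe`, `on_fault` and `max_missed` are hypothetical names, and the actual steps (checking the controller, enabling is_controller on the mirror node, restarting the controller) are left to caller-supplied callbacks, since the commands for those are not given in this thread.

```python
from typing import Callable


class ControllerWatchdog:
    """Trigger a failover callback after several consecutive failed probes.

    Hypothetical sketch: `probe` returns True while the controller responds,
    `on_fault` performs whatever failover steps the site has chosen.
    """

    def __init__(self, probe: Callable[[], bool],
                 on_fault: Callable[[], None], max_missed: int = 3):
        self.probe = probe
        self.on_fault = on_fault
        self.max_missed = max_missed
        self.missed = 0        # consecutive failed probes so far
        self.tripped = False   # set once failover has been triggered

    def tick(self) -> bool:
        """Run one probe; return True only on the tick that triggers failover."""
        if self.tripped:
            return False       # failover already started, do nothing more
        try:
            alive = self.probe()
        except Exception:
            alive = False      # a probe that raises counts as a missed heartbeat
        if alive:
            self.missed = 0    # any success resets the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.tripped = True
            self.on_fault()
            return True
        return False
```

Calling `tick()` from a timer on the node that holds the mirror volumes would give the behaviour described above. Requiring several consecutive misses is one way to avoid failing over on a single dropped heartbeat.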
PeterNic
08-27-2008, 10:32 PM
Jeremy,
I will be happy to discuss what can be done right now -- most of the things you are describing can be done outside, either in an appliance (e.g., on a monitoring grid) or on the aldo server. Let's chat on the phone in the next few days - with you and/or anyone else at LT who might be involved.
We're also going to have the built-in solution in the next major release post 2.4 (as discussed elsewhere on the forums) to remove the final SPOFs from the grid that require manual intervention. This will also include some level of hardware monitoring, although I have not found a reliable indication of an impending HDD failure (no one has -- including the Google folks in an amazing HDD reliability report they published a few years back). I believe we have an adequate set of solutions and it will create an even more solid and self-healing system.
Best regards,
-- Peter
PeterNic
08-27-2008, 10:47 PM
Oh, btw, the failure of 2 adjacent servers is a near impossibility (I have done the computations and the odds are staggering -- the probability is less than that of winning the lotto jackpot AND getting hit by lightning on the way to claim your prize).
If I remember the case correctly, there was a network switch failure that made it look like srv2 had also failed. I was unable to verify whether srv1 had failed, since it was taken offline before I got to see the fault. And, of course, you can always configure your grid for 3x mirroring... but that's hardly worth it.
(PM me if your account of the incident is different -- but there weren't that many and I am sure we're talking about the same case.)
Best regards,
-- Peter
P.S.: having more hardware monitoring and acting upon it can reduce the probability of double fault to the mathematical one. The math assumes you *see* the failure and begin the recovery process; the second fault needs to happen within the recovery window. However, if you don't notice the failure -- which with HDDs is possible -- then the calculation does not model the physical environment correctly.
To put some numbers to that, if a server has an MTBF of 44,000 hours, then the chances of both failing at the same time are 0.0000074380% for any given day by my quick calculations. The reason the MTBF is so low for a single server is that you're only ever as good as the lowest MTBF of the non-redundant components, usually the motherboard.
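For anyone who wants to redo the arithmetic, here is a rough sketch in Python. It assumes independent failures at a constant rate, so the chance of one server failing within a window of w hours is approximately w / MTBF, and the chance of two specific servers both failing in that window is that value squared. The function name and the window sizes are my own assumptions; the quoted 0.0000074380% figure is reproduced with a 12-hour window, while a full 24-hour window gives roughly four times that.

```python
def double_fault_probability(mtbf_hours: float, window_hours: float) -> float:
    """Approximate chance that two independent servers both fail
    within the same window, using p = window / MTBF per server."""
    p_single = window_hours / mtbf_hours
    return p_single ** 2


if __name__ == "__main__":
    for window in (12, 24):
        p = double_fault_probability(44_000, window)
        # Multiply by 100 to express the fraction as a percentage.
        print(f"{window:>2} h window: {p * 100:.10f}%")
```

The linear approximation is only reasonable while the window is tiny compared with the MTBF, which is the case here.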
PeterNic
08-28-2008, 11:22 AM
Karl, this is correct. You have to take the MTTR (mean time to repair) into account -- which you have set to essentially 1 day, which is a typical number.
I would like to add one more variable to the theory, slightly tongue-in-cheek: MTTN (mean time to notice); it gets added to the mean time to repair. Over a large population of servers, it is important to have the right level of detailed monitoring, so a non-fatal failure does not go unnoticed.
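As a quick illustration of why the time to notice matters, here is a small Python sketch. It assumes the surviving mirror fails at a constant rate, so the chance of a second fault is roughly the exposure window (MTTN + MTTR) divided by the MTBF. The function name and the 72-hour MTTN are made-up values for the example, not measurements.

```python
def second_fault_probability(mtbf_hours: float,
                             mttn_hours: float,
                             mttr_hours: float) -> float:
    """Approximate chance that the remaining server fails while the
    first fault is still unnoticed or being repaired."""
    exposure_hours = mttn_hours + mttr_hours  # MTTN adds to MTTR
    return exposure_hours / mtbf_hours


if __name__ == "__main__":
    noticed_at_once = second_fault_probability(44_000, 0, 24)
    noticed_late = second_fault_probability(44_000, 72, 24)  # hypothetical MTTN
    print(f"noticed immediately: {noticed_at_once:.6f}")
    print(f"noticed after 72 h:  {noticed_late:.6f}")
    print(f"risk multiplier:     {noticed_late / noticed_at_once:.1f}x")
```

With these assumed numbers, a fault that sits unnoticed for three days makes a double fault about four times as likely as one that is caught immediately.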
Also, from the failures I have noticed, the hard disks seem to be responsible for over 50% of the failures - I think JeremyN may have better statistics over a larger sample than I see.
Thanks for the input!
-- Peter
Hi,
Are we any nearer to controller redundancy? We've just had a power outage (a bit of kit went dodgy and kept causing the PDU to trip) that seems to have killed one of our nodes (a fatal page fault very early in the boot process). That node happened to be running the controller, which obviously caused longer downtime than is ideal.
Cheers,
PeterNic
01-25-2009, 04:04 AM
Karl,
We're hard at work on this -- it is finally at the top of the list. Our next major release will have it (when I say major, I don't mean 3.0 -- I mean 2.5 or so).
-- Peter
That's good news, as it's one of the first things most of our customers ask when we explain the system to them, usually, "So the controller is redundant and moves around like the appliances then?"
PeterNic
09-21-2009, 07:43 PM
Yes, Yes, Yes, this is finally in!
The controller reboot / failure causes no impact on the running apps. Failure of the server that happens to run the controller is dealt with exactly as with any other appliance under AppLogic -- the controller is automatically restarted on another server with minimum fuss and you get full control over the grid in a few minutes (with a neat recovery progress screen during the process so you don't have to guess what's going on).
In addition, we have added some extensive diagnostics, most notably the early detection of imminent storage device failures. So far, since this was implemented in the 2.7 beta, outside of any tests, we got disk warnings on 3 servers and each of the servers had a disk fail within 24-72 hours. So far we have not gotten any false positives and I don't think we've had any hard disk failures without a warning (although the latter is always possible).
All of this takes us even further into allowing sysadmins to sleep well at night (and all through the night). Better SLAs and better service all around. I am really excited that this support is finally in place and operating well -- and I hope to see all grids upgrade to the 2.7 production release as soon as it is out.
Best regards,
-- Peter