deciphering available resources


jonesy
05-20-2008, 06:27 AM
I seem to bump into this issue every time I exhale, and I'm hoping someone can help me avoid apps that fail to start due to 'insufficient resource' errors.

The latest issue involves a new app that I set up with nothing more than one NET device, one LUX5 device, and one IN device. That's it. The gateways each take up 0.05 CPUs for a total of 0.10 CPUs, and the LUX5 device takes up 0.5 CPUs. So, altogether, 0.6 CPUs are needed. My dashboard says I have 3.20 CPUs available, but the application won't start due to 'insufficient resources'. The log says "insufficient CPU on at least one of the servers". I hope everyone can understand the frustration this causes, and would probably agree that, from the perspective of the guy who's setting up the applications, the dashboard is LYING, and there's no other obvious, convenient way to know at any given time if a grid has sufficient resources. It would also appear that I'm unable to take advantage of resources I'm paying for.

This isn't so bad right now because I'm just setting up an additional application, but this *has* happened to me on our existing, production, high-traffic web app. When it does, I have to go back and take away the resources I just doled out to help the app handle the load more efficiently, and then restart the grid *again* (there's another several minutes of downtime).

If anyone has any input on how to solve this issue, please let me know.

PeterNic
05-20-2008, 03:24 PM
Brian,

On grids with few servers or when you have lots of apps, it is possible to get fragmentation.
I have seen this only on grids that were full to the brim (above 90% -- straight into the red zone)...

I will try to give you more details in the hope of helping you avoid or solve this problem. The resource usage is pretty straightforward; we haven't had a reason to communicate it until now -- but here it is. Also, I would like to ask for some more info (see at the end).

Generally, AppLogic uses a scheduling policy that minimizes the impact of fragmentation, but it is possible to get into a bad situation where you have enough resources in total but they are not distributed in a way that lets you fully utilize them. There is a relatively simple way to get out of this situation, too (if you get into it in the first place).

With some simplification, the AppLogic scheduler essentially tries to fill up servers that already have things allocated on them before starting appliances on empty servers, following the proverbial "put the big stones in the jar first". This way, it leaves the biggest possible fragments for scheduling appliances. (There are other scheduling modes you can select, but this one is both the recommended one and the default.)
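To make the policy above concrete, here is a minimal sketch of a "big stones first, fullest server first" placement. It is not the actual AppLogic scheduler: the function names, the free-CPU numbers, and the rounding to 0.01 CPU are all assumptions made for illustration.

```python
def place(free_cpu, demand):
    """Return the index of the fullest server that still fits `demand`, or None."""
    candidates = [(free, i) for i, free in enumerate(free_cpu) if free >= demand]
    if not candidates:
        return None
    # The smallest remaining free CPU that still fits is the fullest usable server.
    return min(candidates)[1]


def schedule(free_cpu, demands):
    """Place appliances largest first; return (placement, remaining free CPU)."""
    free = list(free_cpu)
    placement = {}
    for name, cpu in sorted(demands.items(), key=lambda kv: -kv[1]):
        i = place(free, cpu)
        if i is None:
            raise RuntimeError(f"no single server has {cpu} CPU free for {name}")
        # Round to 0.01 CPU (assumed granularity) to avoid float drift.
        free[i] = round(free[i] - cpu, 2)
        placement[name] = i
    return placement, free


# Hypothetical grid: free CPU per server (made-up numbers).
free_cpu = [2.0, 0.6, 0.3]
# The app from the question: two gateways at 0.05 CPU and one LUX5 at 0.5 CPU.
demands = {"NET": 0.05, "LUX5": 0.5, "IN": 0.05}

placement, remaining = schedule(free_cpu, demands)
print(placement)   # all three appliances land on server 1
print(remaining)   # [2.0, 0.0, 0.3]
```

In this made-up layout everything is packed onto the partly used server 1, and the 2.0 CPU fragment on server 0 stays intact for a later, larger appliance.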

You can see the current utilization and fragmentation using the "srv list" or "srv list --map" commands.

I don't know the size of your grid, but it looks like something else is fishy here. For your 0.6 CPU app to be unschedulable, you must have no server with 0.5 CPU left and/or no two other servers with 0.05 each. For 0.05+0.05+0.50 CPUs to be unschedulable with 3.2 CPUs available, every server must have less than 0.5 CPU free, so you need to have at least 7 servers in your grid (3.2 / 0.49 ≈ 6.5). Is that the case?
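The server-count bound can be worked through mechanically. The sketch below assumes CPU is allocated in steps of 0.01 (so "less than 0.5 free" means at most 0.49 free); that granularity and the helper name are assumptions, not something taken from AppLogic.

```python
import math


def min_servers_for_failure(total_free, largest_demand, step=0.01):
    """Fewest servers over which `total_free` CPU can be spread so that
    no single server can fit `largest_demand` (assumed allocation step)."""
    max_free_per_server = largest_demand - step   # e.g. 0.49 for a 0.5 CPU appliance
    return math.ceil(total_free / max_free_per_server)


servers = min_servers_for_failure(total_free=3.2, largest_demand=0.5)
print(servers)  # 7, since 3.2 / 0.49 is about 6.53

# One hypothetical layout that reproduces the symptom: 3.2 CPU free in total,
# but no server can take the 0.5 CPU LUX5 appliance.
layout = [0.49, 0.49, 0.49, 0.49, 0.49, 0.49, 0.26]
print(round(sum(layout), 2), max(layout) >= 0.5)  # 3.2 False
```

On a grid smaller than that, a dashboard total of 3.2 free CPUs with a 0.6 CPU app failing to start would point at something other than fragmentation.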

Also, if you don't have at least one whole free server, then you are probably running without HA capability - if you are on a production grid, that probably means you need to get an extra server or two.

In general, you can correct the fragmentation by restarting apps and/or individual appliances. The scheduler will pick the best layout to pack the appliances among the available servers, leaving the largest possible resource fragments available (note that there is no resource fragmentation within a server, it exists only between servers).
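As a rough illustration of why restarting helps, the sketch below compares the largest free fragment in a scattered layout with the one left after repacking the same appliances from scratch, largest first. The server capacities and appliance sizes are made up, and the packing rule is a plain best-fit-decreasing heuristic, assumed here as a stand-in for the real scheduler.

```python
def repack(capacities, sizes):
    """Pack appliance sizes largest first onto the fullest server that fits.
    Returns the free CPU left on each server."""
    free = list(capacities)
    for size in sorted(sizes, reverse=True):
        fits = [i for i, f in enumerate(free) if f >= size]
        if not fits:
            raise RuntimeError(f"cannot place appliance of {size} CPU")
        i = min(fits, key=lambda k: free[k])   # fullest server that still fits
        free[i] = round(free[i] - size, 2)     # assumed 0.01 CPU granularity
    return free


# Hypothetical grid: three servers with 1.0 CPU each, four running appliances.
capacities = [1.0, 1.0, 1.0]
sizes = [0.6, 0.6, 0.4, 0.2]

# Scattered layout (0.6 | 0.6 | 0.4 + 0.2): 1.2 CPU free, but in 0.4 pieces.
before = [0.4, 0.4, 0.4]
after = repack(capacities, sizes)

print(max(before))  # 0.4 -> a 0.5 CPU appliance does not fit anywhere
print(after)        # [0.0, 0.2, 1.0] -> same total free, one whole server empty
```

The total free CPU is 1.2 in both cases; only its distribution between servers changes, which is the whole point of restarting to repack.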

This is as much as I can give you without more specifics on your usage (what the server/appliance map was and how it got this way). In any event, the next time you have this situation, please ping support and/or send the output of 'srv list --map' plus the app you expected to start -- maybe there is a bug in the scheduler that we haven't seen until now, or there is a simple way to avoid the situation.

Best regards,
-- Peter