local_only vols do not move when I pin a component to a different node
jody3t
03-01-2008, 05:27 PM
Hi,
First let me describe what my objective is:
I have several identical apps each with a mysql component that is HD IO intensive.
I want these scheduled so that the mysql servers do not run on the same grid node. I see how to do this within an app using the appliance's group field, but how can this be done across apps?
I am doing this manually right now by pinning each mysql component in each app to a different node in the grid.
I have recently changed the pinned server node to a different node on a couple of the mysql components and restarted the apps, but the "local_only" vols do not move when I pin a component to a different node.
So once vols for a component are scheduled and created on certain nodes, does this mean I cannot pin those components to a different node and still have the vols as "local_only"?
Maybe I am misunderstanding the meaning of "local_only" and High bandwidth volumes? I figured local_only vols would have at least 1 of the vols on the same node that the component is running on.
(For example, the component c2:main.mysql on my grid is running on srv3, but its "local_only" dbData vol is on srv1 and srv2, and the scheduler allowed it to start.)
This is from the docs:
"Performance constraints for the volume. AppLogic supports three volume constraints: none, high bandwidth and local only. For volumes requiring high-bandwidth access, AppLogic tries to schedule the appliance on the same server where the volume resides. If this is not possible, AppLogic logs a warning but it nevertheless runs the application. For volumes set here as requiring local-only access, AppLogic will not start the application unless it can ensure that the appliance can run on the same server where the volume is."
thanks,
Jody
jody3t
03-01-2008, 06:01 PM
I guess what I want is a "global group" where the scope of this attribute is across all apps on the grid. I would put my master mysql components in this "global group".
I will also be using the existing "group" attribute which has a scope limited to the app for multiple jboss components and also for the master and slave mysql components within a single app.
BUT if I can pin a component and have its "local_only" vols get pinned with it that would work for now too. I understand that the mirror would still have to run on another node, but it seems like at least 1 of the vols should be located on the same node with the pinned mysql. Also it would be nice to have the mirrors of the local_only vols spread more evenly across the grid nodes. Currently all of the master mysql data vols for our 3 main production apps are being placed on srv1 and srv2 even though the components are running on different nodes, ie, srv2, srv3, and srv4, so in this case only 1 of the mysql components really has its data vol as a local_only vol.
PeterNic
03-01-2008, 11:39 PM
Jody,
The "local_only" and "high_bandwidth" settings were originally designed to drive the scheduler when choosing how to place appliances on servers. These have somewhat lost their usability when we added mirroring. They still control the preferred server on which a component will be placed, based on where the volume is. Note that AppLogic will not move a volume on its own -- this is a long operation that can place significant load on the servers involved, so we don't want to do it without giving you the ability to control when and whether this is done.
We will look at the possibilities for adding support for global groups.
The problem with pinning a component is that this disables HA -- if the server to which the component is pinned dies, the component will not be restarted elsewhere.
What I think you can do now:
- use the high_bandwidth attribute to direct AppLogic to schedule the appliance wherever one of the volume streams is (btw, please test with and without this, let me know if it really makes a difference -- unless you approach the max bandwidth, I don't think it will)
- more importantly, you can move the volumes around as you need them; this will allow you to distribute the I/O to different spindles and avoid excessive seeks
There is no direct command to move a volume to a particular server (maybe we should add one). There is, however, an easy way to cause a volume to be moved. Handle with care -- this is an advanced operation; during the stream move, the volume is not redundant.
Let's say you had a 4-server grid; you just added 2 more servers and now have 6. You want to move a volume that has streams on servers 1 and 2 to servers 5 and 6. Here's how to do it:
- ensure that you are the only user working with the grid user interface
- disable all servers: "srv disable --all". Disabling a server tells AppLogic not to use that server for allocating new resources; it does not affect what currently runs on the server.
- enable servers 1 and 5: "srv enable srv1" , "srv enable srv5"
- migrate the desired volume, e.g.: "vol migrate myapp:data". This will move the stream from srv2 to srv5
- disable server 1 and enable server 6
- migrate the desired volume. This will move the stream from srv1 to srv6.
- re-enable all servers, "srv enable --all"
This operation can be done while the application is running (it will, of course, affect performance, as volume migration will essentially copy the full stream). The migrate command is intended to assist in preparing a server for removal from a grid, hence its ability to move volumes off a server; by leaving a particular server enabled, you can direct where the new stream will be created.
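The steps above can be sketched as a small Python helper that only builds the list of commands to type; it does not run anything against a grid. The function name `plan_stream_move` and its arguments are hypothetical, and the single-server form `srv disable srvN` is an assumption that mirrors the `srv enable srv1` form shown in the steps.

```python
def plan_stream_move(volume, streams, moves):
    """Build the command sequence for moving volume streams one at a time.

    volume  -- volume name as passed to "vol migrate", e.g. "myapp:data"
    streams -- servers currently holding the two streams
    moves   -- list of (source_server, target_server) pairs, in order
    """
    commands = ["srv disable --all"]
    current = list(streams)
    enabled = set()
    for src, dst in moves:
        # Leave enabled only the stream that stays put and the target server.
        wanted = {s for s in current if s != src} | {dst}
        for server in sorted(enabled - wanted):
            # Assumed single-server form of the disable command.
            commands.append(f"srv disable {server}")
        for server in sorted(wanted - enabled):
            commands.append(f"srv enable {server}")
        enabled = wanted
        commands.append(f"vol migrate {volume}")
        current = [dst if s == src else s for s in current]
    commands.append("srv enable --all")
    return commands


# The example from this post: streams on srv1 and srv2, moved to srv5 and srv6.
for command in plan_stream_move(
    "myapp:data", ["srv1", "srv2"], [("srv2", "srv5"), ("srv1", "srv6")]
):
    print(command)
```

Printing the plan first and typing the commands by hand keeps you in control of when each stream copy starts, which matters because the volume is not redundant during a move.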
A final note: when manually positioning streams, please try to keep streams of the same volume on adjacent servers (1+2, 3+4, 5+6, etc.). When moving a stream from a server with an odd number, move it to another server with an odd number; when moving a stream from a server with an even number (usually the odd number+1), move it to a server with an even number. This will keep the probability of total volume loss in case of a two-node failure somewhat lower.
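A minimal sketch of the odd/even rule as a check you could run before a manual move. It assumes server names end in their number (srv1, srv2, ...), as in this thread; the function names are hypothetical and are not part of AppLogic.

```python
import re


def server_number(name):
    """Return the trailing number of a server name such as "srv5"."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no trailing number in server name: {name!r}")
    return int(match.group(1))


def keeps_parity(src, dst):
    """True if a stream moves odd -> odd or even -> even."""
    return server_number(src) % 2 == server_number(dst) % 2


def is_adjacent_pair(a, b):
    """True if two servers form a pair such as 1+2, 3+4 or 5+6."""
    low, high = sorted((server_number(a), server_number(b)))
    return low % 2 == 1 and high == low + 1


print(keeps_parity("srv1", "srv5"))      # odd to odd
print(is_adjacent_pair("srv5", "srv6"))  # both streams end up on a 5+6 pair
```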
Best regards,
-- Peter
PeterNic
03-02-2008, 09:26 PM
Jody, another thing you can do for improved write performance and uniformity is to prefill the volumes when you create them. See http://support.3tera.net/showthread.php?p=752#post752 for details.
-- Peter
jody3t
03-03-2008, 07:07 AM
Thanks Peter,
The High Bandwidth and local_only constraints do not appear to do anything. In fact it seemed like the scheduler was trying to put the mysqlp appliance on a node other than where its volumes were located, so I used the srv disable command to get it to schedule to the nodes where their vols were located.
I don't know for sure that it makes a big difference, but we have 2 somewhat high load apps each with a mysql appliance and a "vmstat 60" log file indicates that the mysql located on the same node as its vols was able to regularly achieve 6000 to 8000 in the bi column while the mysql located on a node other than its data vol only got up to 2000 to 3000 in the bi column, and this machine also produced slower queries on average. This is anecdotal since I did not try specific queries on each one as a baseline.
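For anyone who wants to compare nodes the same way, here is a rough Python sketch that pulls the bi column (blocks read in from block devices) out of a saved "vmstat 60" log. The sample text is made up for illustration and is not from our servers; the function name `bi_values` is hypothetical. Keep in mind that the first sample vmstat prints is an average since boot, so you may want to drop it.

```python
from statistics import mean

# Made-up sample in the usual vmstat layout, for illustration only.
SAMPLE_LOG = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 812345  10234 456789    0    0  6500   120  300  500 10  5 80  5
 2  1      0 801234  10240 457000    0    0  7500   130  310  520 12  6 75  7
"""


def bi_values(lines):
    """Return the bi column from vmstat output as a list of ints."""
    bi_index = None
    values = []
    for line in lines:
        fields = line.split()
        if "bi" in fields:
            # Column header line; it repeats periodically in long logs.
            bi_index = fields.index("bi")
            continue
        if bi_index is None or len(fields) <= bi_index:
            continue
        if all(field.isdigit() for field in fields):
            values.append(int(fields[bi_index]))
    return values


samples = bi_values(SAMPLE_LOG.splitlines())
print(samples, mean(samples))
```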
It might be useful to have something that tracks HD IO usage for each appliance and where their volumes live and then recommends some volume migrations to better balance the IO. I am fairly sure that the 3 (we will be adding more so I guess even more data vols could end up on the same nodes) mysql DBs pointing to the same nodes is not going to be as efficient as having them spread more evenly.
I was thinking that the --rebalance option might do something like that but it isn't clear exactly what it does or how to use it.
Can you explain this vol migrate --rebalance option?
PeterNic
03-03-2008, 08:00 PM
Jody,
>> The High Bandwidth and local_only constraints do not appear to do anything. In fact it seemed like the scheduler was trying to put the mysqlp appliance on a node other than where its volumes were located, so I used the srv disable command to get it to schedule to the nodes where their vols were located.
I'll submit this to our QA as a defect report for reproduction; if we've broken this in the last release, we will fix it.
The workaround you found is great -- it is much better than pinning appliances to servers, as it will allow failover. In short, if developing/testing, it is easier to use pinning (because you are restarting more often); if running in production, tweaking the server assignment by disabling and re-enabling servers is better.
>> I don't know for sure that it makes a big difference, but we have 2 somewhat high load apps each with a mysql appliance and a "vmstat 60" log file indicates that the mysql located on the same node as its vols was able to regularly achieve 6000 to 8000 in the bi column while the mysql located on a node other than its data vol only got up to 2000 to 3000 in the bi column, and this machine also produced slower queries on average. This is anecdotal since I did not try specific queries on each one as a baseline.
Our tests show a much smaller difference between volumes that have one local stream and volumes that have both streams remote (from the viewpoint of the appliance). I'll look again; if you have some test that we can use to reproduce this, please let me know.
>> It might be useful to have something that tracks HD IO usage for each appliance and where their volumes live and then recommends some volume migrations to better balance the IO. I am fairly sure that the 3 (we will be adding more so I guess even more data vols could end up on the same nodes) mysql DBs pointing to the same nodes is not going to be as efficient as having them spread more evenly.
Agree, it is a good idea; we have discussed it several times and it is on the roadmap. Might want to accelerate it.
>> I was thinking that the --rebalance option might do something like that but it isn't clear exactly what it does or how to use it.
>> Can you explain this vol migrate --rebalance option?
The --rebalance option is a documentation error. We may add such an option, which could do pretty much what you are looking for, except that initially it won't take volume access stats into account.
Were you able to move volumes around using the vol migrate or the vol copy procedures in the previous post?
Regards,
-- Peter
jody3t
03-06-2008, 09:58 AM
>> Were you able to move volumes around using the vol migrate or the vol copy procedures in the previous post?
Yes, the technique you described for migrating vols worked fine, although the progress indicator paused at 99% for nearly half of the migration time, which was a little bit disconcerting. I did the vol migration while the appliance was running (although hardly any users were on it since it was around midnight) and it seems to have worked fine. This increases my confidence that mirroring and HA will actually work (if they ever have to).
>> Our tests show much smaller difference between volumes that have one local stream and volumes that have both streams remote (from viewpoint of the appliance). I'll look again; if you have some test that we can use to reproduce this, please let me know.
Note that I had 3 mysql appliances using the same spindles and it was during high usage. So I don't think it is so much that the vol was remote as it was the fact that 3 mysql DB data vols were located on the same nodes. So the global group idea would be more useful for balancing the vols rather than balancing the processes on the nodes (ie, more for balancing HD IO than for having VMs running on different nodes for failover).
Perhaps the differences only start to appear when there are many HD seeks that pile up. Undoubtedly a slowdown would eventually be seen after enough processes start to try to get enough things from the same spindle. Exactly when it would start to exhibit thrashing-like behavior I cannot say, but undoubtedly it would eventually happen. Moving the vols to a different spindle did seem to modestly improve the average cpu usage, particularly when concurrent slow queries were running, BUT the difference is not as large as I expected (on average), which I think is a testament to the efficiency of your underlying mirroring technology and possibly some unlucky timing of concurrent slow queries hitting.
Note that we have a few reports in our app that scan big multi GB tables and cannot be optimized further, and we will probably use an asynchronous batch processing system (emailing the resulting reports) as soon as we can code it, along with a slave mysql to offload these from the master mysql. Unfortunately I cannot quantify this stuff for you currently, since there are too many variables in our system and we haven't had time to devise good metrics yet (something we are lacking, but it is a brand new app, so that's how it goes). Another thing that exacerbates the problem is that our school customers tend to do similar things at about the same time (eg, several slow reports will hit the DBs at the same time) because their school schedules are so similar. This problem with spiking of usage was one of the main reasons that we moved to a grid architecture and it has worked out nicely so far.
jody3t
03-06-2008, 10:21 AM
>>This problem with spiking of usage was one of the main reasons that we moved to a grid architecture and it has worked out nicely so far.
On another related topic:
One serious problem we were having with traditional servers was that the dedicated hosting providers will only give a certain max WAN bandwidth to each server and there is no easy way to aggregate the bandwidth paid for across multiple servers, so we were running up against WAN bandwidth limits for our squid reverse proxy servers (in front of apache). The grid has alleviated this, since bandwidth is more easily aggregated. I have seen the bw peak at around 26Mbps for a squid (remember we are dealing with schools and they all want to start class at the top of the hour and the 1st page hits require a lot of JavaScript Dojo libs to download).
Now I have removed all of the pinned servers since it would prevent HA as you have said (might consider relaxing the pinned server constraint if the pinned node fails or at least have an option to ignore the pinning in case of node failure).
We have 3 squids (1 for each app cluster) and we will have more copies of this app running soon, so if they all happen to share the same node then they may eventually hit the 100Mbps per node WAN bandwidth limits. I was using the pinned servers to make sure the bw was balanced evenly. I am now using the bw resource to prevent the scheduler from putting too many of the gateways on the same node, but automating this by balancing the high bw components would be another cool optimization that you could do automatically after analyzing component bw stats (or recommend after running some kind of balancing optimization tool). This would be another possible use of the global group too.
PeterNic
03-07-2008, 08:47 AM
Jody, thanks for the good words and the ideas. I am glad the system is working out for you. We already have a pending change request to relax the pin (or have a "relaxed" pin); we'll take the other ideas into account when planning subsequent releases, too.
If there's anything else we can help with, please feel free to ask.
Best regards,
-- Peter