General questions about AppLogic


Vlad
04-10-2007, 09:02 PM
This thread is for a variety of technical & business questions of general interest. Most posts will be initiated by the 3tera team in response to questions received by email sent to info@3tera.com or posted directly here.

Vlad
04-10-2007, 09:09 PM
From: Sanjeev Singh
Sent: Tuesday, April 10, 2007 7:44 PM

<snip>

I do have some questions concerning how storage is handled in the 3tera system. In particular I am curious about how you keep replicated data consistent in the face of power failures, etc (including entire grid outages). I've read all the white papers on the 3tera site, but they don't go into enough detail for me to get a good feeling for this. Regarding multiple grids per VPD, what are the downsides of splitting a large application across multiple grids?

Apologies if these questions are already addressed in documentation on your site.

thanks,
- jeev


Sanjeev,

The AppLogic storage system is a distributed volume store. Within a grid, each virtual volume is mirrored across multiple servers. This is not replication - it is true RAID mirroring with full transactional integrity. So, if a node fails, the data remains accessible in real time.

Business continuity across datacenters is handled at the application/appliance level. AppLogic makes it trivial to deploy identical instances of any application on different grids, including grids located in different datacenters. Given this, within your application you can build a NAS appliance and a database appliance that synchronize with other instance(s) of the same app. So you can run one copy of your app in New York, another in San Francisco, and have them synchronize in real time. Add to this a global load balancing service (which can be another app), and you have what you need for a 100% uptime global presence.

These things are not new, but they were very expensive to implement and maintain. What AppLogic does here is it allows you to implement a well-known, tried and true schema of your choice, but do it very inexpensively. For example, on an average day, at 3Tera we are running around 200 servers in 4-5 datacenters spread around North America and Europe. They are usually configured into half a dozen to a dozen grids that are used for things like development, testing, tuning large-scale installations, hardware certifications, evaluations, even our own internal infrastructure (we eat our own dog food). All of this is operated by a single guy - our support manager - on a part-time basis.

In regard to multiple grids per VPD, we found that this is something people often want to do even if their whole app or service fits on a single grid. As long as the number of grids is relatively small (under 10 :-)) the management overhead of dealing with them separately is not significant. However, there are some advantages such as being able to take one of the grids down for maintenance and/or upgrades while running on the other. Also, it is easier to divide human responsibility when one person manages one grid than if multiple people manage a set of apps on the same grid.

Ultimately, the applications don't care, so you can structure things differently. You can start with several grids, then combine them into one, etc.

Vlad

sanjeev
04-11-2007, 01:02 PM
Vlad,

thanks again for your response. It might be best to describe some corner cases:

1. Assume I have one instance of a mysql database running, talking to a virtual SAN device. Now the node that the mysql database is running on appears to be dead to the grid (unpingable) and another instance of the database instance is recreated on another spare node. Now the "presumed dead" mysql instance comes back (the node was not really dead), and both mysql instances write to the SAN datastore, corrupting data. What prevents this from happening?

2. re: RAID transactionality. Do Layered Tech grids use SCSI disks? I am leery of the ability of SATA disks to observe the "write through" bit. It should still be possible to use SATA disks in this case, but since many SATA disk write buffers do not necessarily write to disk in the order in which writes were received, it is less efficient to implement transactional log writes (you may have to do 2 writes per transaction, flushing after each one).

3. Distributed transactions. If 3tera implements RAID as distributed transactions, then I just need to be confident in the distributed transaction scheme. Do you have a description of this scheme anywhere? Additionally, can applications make use of this mechanism for their own transaction processing needs?

thanks.

Vlad
04-11-2007, 03:20 PM
1. Assume I have one instance of a mysql database running, talking to a virtual SAN device. Now the node that the mysql database is running on appears to be dead to the grid (unpingable) and another instance of the database instance is recreated on another spare node. Now the "presumed dead" mysql instance comes back (the node was not really dead), and both mysql instances write to the SAN datastore, corrupting data. What prevents this from happening?



You are describing the classic "split brain" problem. What prevents this from happening in AppLogic is that the MySQL instance started on the second node IS the same instance that died on the first one. AppLogic cannot assign the same instance to more than one node at a time, so a split brain can never happen.

This is ensured at several levels in the system. For example, the node that lost connection with the grid will reboot itself before any of the instances is restarted elsewhere. At a different level, AppLogic will have to "mount" the virtual volume prior to creating the "new" instance. By default, each volume is accessed exclusively by a single instance (unless you specifically override this in the class definition for the appliance), so if the "old" instance was somehow alive, the exclusive mount would fail.


2. re: RAID transactionality. Do Layered Tech grids use SCSI disks? I am leery of the ability of SATA disks to observe the "write through" bit. It should still be possible to use SATA disks in this case, but since many SATA disk write buffers do not necessarily write to disk in the order in which writes were received, it is less efficient to implement transactional log writes (you may have to do 2 writes per transaction, flushing after each one).


Disk I/O is a distributed transaction between three or more servers - the server on which the instance itself is running, and the two or more servers on which the mirrors are located (the number of mirrors is defined when the grid is first installed and applies to all volumes). AppLogic guarantees that write-through, up to (but not including) the disk, has succeeded on at least one mirror before the write operation inside your instance returns. Standard configurations are SATA, but I don't see a reason you can't get SCSI if you want it so much. What you will be trading off for this is ease of scaling - with standard nodes, adding more nodes to your grid can be as quick as 10-15 minutes. With custom hardware, you will have to wait for someone to build those servers for you, which is usually measured in days. It is a business constraint, not a technical one, so if you are willing to bear the cost of keeping spares in stock, you will be OK.
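
To make the write guarantee above concrete, here is a minimal Python sketch. It is not AppLogic code: the `Mirror` class and the `mirrored_write` function are hypothetical names, and the sketch assumes only what is stated above, namely that a write returns to the instance once at least one mirror has acknowledged it.

```python
class Mirror:
    """Hypothetical in-memory stand-in for one server holding a volume mirror."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.blocks = {}

    def write(self, block_no, data):
        # Acknowledge the write only if this mirror is reachable.
        if not self.online:
            return False
        self.blocks[block_no] = data
        return True


def mirrored_write(mirrors, block_no, data):
    """Send the write to every mirror; succeed if at least one acknowledges."""
    acked = [m.name for m in mirrors if m.write(block_no, data)]
    if not acked:
        # No mirror took the write, so the caller must see a failure.
        raise IOError("no mirror acknowledged the write")
    return acked


# One healthy mirror and one unreachable mirror (names are made up).
mirrors = [Mirror("srv-a"), Mirror("srv-b", online=False)]
acked = mirrored_write(mirrors, 7, b"payload")
```

In this sketch the write still succeeds with one mirror down, and it fails loudly only when no mirror at all accepts it.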

sanjeev
04-12-2007, 10:17 AM
This is ensured at several levels in the system. For example, the node that lost connection with the grid will reboot itself before any of the instances is restarted elsewhere. At a different level, AppLogic will have to "mount" the virtual volume prior to creating the "new" instance. By default, each volume is accessed exclusively by a single instance (unless you specifically override this in the class definition for the appliance), so if the "old" instance was somehow alive, the exclusive mount would fail.

I understand the intent, but there can sometimes be practical difficulties with implementing these schemes. For example:

Node reboot: how is the reboot triggered? If it is from a separate "monitoring" thread it is sometimes possible for that thread to get stuck/hung and the rest of the server to continue operating. I've heard of cases where a monitoring thread hit a segfault and the core dumper took 30s to dump, in the meantime the rest of the application continued executing (incl. processing RPCs).

I am curious about the exclusive mount mechanism. Who decides that the old instance is no longer accessing the SAN?

I appreciate your patience with my questions. I've worked with large systems before and you can probably tell that I've been burned by issues like these.

Vlad
04-12-2007, 11:15 AM
I understand the intent, but there can sometimes be practical difficulties with implementing these schemes.


Look, every piece of software (and hardware, for that matter) is subject to bugs and design flaws. An online review is not likely to uncover most of those :-) This being said, the 3Tera team has built large storage systems before, and we are aware of the problems. The guy who keeps us honest with things like this is Richie Lary, the architect of VAX Clusters and long-time chief storage architect at Digital and Compaq.


Node reboot: how is the reboot triggered? If it is from a separate "monitoring" thread it is sometimes possible for that thread to get stuck/hung and the rest of the server to continue operating. I've heard of cases where a monitoring thread hit a segfault and the core dumper took 30s to dump, in the meantime the rest of the application continued executing (incl. processing RPCs).


The reboot is triggered by a timer. If the server in question fails to communicate with the grid controller for a predefined amount of time, the server goes into reboot. In the current implementation, this is done completely in software to maintain a modicum of hardware independence; however, the mechanism is designed with a hardware watchdog timer in mind. We will be adding support for those in the near future.
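
A software timer of this kind can be sketched roughly as follows. This is only an illustration, not the AppLogic implementation: the `SoftwareWatchdog` class, the 30-second timeout, and the injectable clock are all assumptions made for the example.

```python
import time


class SoftwareWatchdog:
    """Hypothetical node-side timer that expires if the controller goes quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_contact = clock()

    def heartbeat(self):
        # Call whenever the node successfully talks to the grid controller.
        self.last_contact = self.clock()

    def expired(self):
        # True once the controller has been silent for the full timeout.
        return self.clock() - self.last_contact >= self.timeout_s


# Simulated clock so the example runs instantly (timeout value is assumed).
now = [0.0]
wd = SoftwareWatchdog(timeout_s=30.0, clock=lambda: now[0])
now[0] = 10.0
wd.heartbeat()
now[0] = 39.0
still_ok = not wd.expired()   # 29 s since last contact
now[0] = 40.0
must_reboot = wd.expired()    # 30 s since last contact
```

A real node would act on `expired()` by rebooting itself. A hardware watchdog moves the same countdown out of software, so a hung kernel or process cannot prevent the reset.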


I am curious about the exclusive mount mechanism. Who decides that the old instance is no longer accessing the SAN?


I checked this with the guys who built the failover logic, and here is how it works. The grid controller is responsible for orchestrating the sequence. The controller knows all relationships between the instances and the volumes they are allowed to access. When the controller decides that a given node is dead, it cleans up by unmounting all volumes on other nodes that were accessed by instances which were running on that node. Only after that does the controller initiate a restart of the affected instances. So, my previous note about the mounts was wrong: the mounts are, in fact, exclusive, but this does not come into play in this use case. My first answer, that the instance manager will not restart the instance until proper cleanup is complete, is the right one. If the server in question is somehow not yet dead when the cleanup occurs, it will simply lose access to the volumes before the new incarnation of the instance is started on another server.
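
The ordering described above (revoke volume access first, restart second) can be sketched like this. The `fail_over` function and its data structures are hypothetical and exist only to show the sequence; the real controller logic is not public.

```python
def fail_over(dead_node, instances, mounts, spare_node):
    """Hypothetical controller routine: revoke volume access, then restart.

    instances: dict mapping instance name -> node it runs on
    mounts:    set of (instance name, volume name) pairs currently mounted
    Returns the ordered list of actions taken.
    """
    log = []
    affected = sorted(i for i, node in instances.items() if node == dead_node)

    # Step 1: unmount every volume used by instances on the dead node.
    # sorted() returns a list, so the set can be modified safely here.
    for inst, vol in sorted(mounts):
        if inst in affected:
            mounts.discard((inst, vol))
            log.append(("unmount", vol))

    # Step 2: only now restart the affected instances on another node.
    for inst in affected:
        instances[inst] = spare_node
        log.append(("restart", inst, spare_node))
    return log


# Made-up grid state: MySQL on node1 with its volume, a web server on node2.
instances = {"mysql": "node1", "web": "node2"}
mounts = {("mysql", "dbvol"), ("web", "wwwvol")}
log = fail_over("node1", instances, mounts, "node3")
```

Because the unmount always comes before the restart in the log, an old instance that is somehow still alive has already lost its volume by the time the new one starts.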


I appreciate your patience with my questions. I've worked with large systems before and you can probably tell that I've been burned by issues like these.


I can recognize a kindred soul when I see one :-) Unfortunately, there is only so much detail I can give you in casual Q&A, especially without a white board to draw on :-); plus, as you understand, some of the aspects of the architecture are proprietary.

If you decide at some point that it is worth your time to dive into a deeper evaluation of AppLogic, we will be happy to do a full design walk-through, including security audit, etc. We do this routinely for large customers...

PeterNic
04-18-2007, 12:06 PM
Node reboot: how is the reboot triggered? If it is from a separate "monitoring" thread it is sometimes possible for that thread to get stuck/hung and the rest of the server to continue operating. I've heard of cases where a monitoring thread hit a segfault and the core dumper took 30s to dump, in the meantime the rest of the application continued executing (incl. processing RPCs).

Sanjeev, in addition to the mechanisms described by Vlad, there is yet another failsafe: if for some reason the old server does not die and disconnect (as you said, the self-monitoring can fail) AND the controller cannot free the volumes otherwise, the grid controller will not restart the appliance, in order to prevent the split brain syndrome. This is a last-resort failsafe for when everything else goes wrong, and it requires human intervention.
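
Expressed as a decision rule, the failsafe looks roughly like the sketch below. The function name, its two boolean inputs, and the returned strings are assumptions made for illustration only.

```python
def restart_decision(node_confirmed_down, volumes_released):
    """Hypothetical last-resort check before restarting an appliance."""
    if node_confirmed_down or volumes_released:
        # Either the old server is gone or it can no longer reach the data.
        return "restart"
    # The old server may still be writing and its volumes could not be
    # freed: take no automatic action and wait for an operator.
    return "hold_for_operator"


normal_case = restart_decision(node_confirmed_down=True, volumes_released=True)
worst_case = restart_decision(node_confirmed_down=False, volumes_released=False)
```

The automatic restart is refused only when both safeguards fail at once, which is the case that needs a human.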

We are working on adding hardware watchdog support and remote power control (STONITH -- shoot the other node in the head) to make this an even more remote possibility.

Regards,
-- Peter