Friday, January 30, 2009

Replicating Boot Volumes

One key strategic decision that allowed for almost immediate failover to a DR site was replicating the boot volumes of tier 1 servers. I personally architected this with VMs, but the same can be done with a Windows server booting from the SAN. I have never personally done much booting from SAN with non-VMs. I tried this back in the Windows 2000 days, but there were issues with the page file.

Replicating the boot volume also makes things easier to manage, because you don't have to build out a server on the target side, ensure that it is up to the same spec as the source, etc. When you power up the server in the DR site, you not only have the replicated data LUNs, whether that be SQL, Exchange, flat files, etc., but also an exact copy of the underlying operating system.

I used RecoverPoint to do the replication. The last couple of versions had an option to specifically replicate the boot volume, which made it easy. One thing you really have to pay attention to is the need to quiesce the server and create point-in-time copies of the boot volume, rather than just replicating it sync or async. If you replicate without quiescing, in-flight transactions can be lost and you will blue screen (which I have done, of course).

RecoverPoint has a command line utility that I ran as a scheduled task. It quiesces the server and creates a point-in-time image that is sent to the remote side. When you fail over, always choose one of these images and not the latest I/O. The good thing with RecoverPoint is that you can normally choose any point in time and then change it if it doesn't work. The problem with the boot volume is that you have to allow direct access to the LUN, which then erases all of the other point-in-time copies, so you have to get it right the first time or you are screwed.
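Since you only get one shot at the boot volume, it helps to be explicit about the selection rule. Here is a minimal Python sketch of it, assuming you have the list of images on the target in some form. The `Image` class, the `quiesced` flag, and `latest_clean_image` are names I made up for illustration; they are not part of RecoverPoint or its command line utility.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Image:
    """One point-in-time image on the DR side (hypothetical model)."""
    taken_at: datetime
    quiesced: bool  # True only for images created by the scheduled quiesce task


def latest_clean_image(images: List[Image]) -> Optional[Image]:
    """Return the newest quiesced image, skipping any non-quiesced latest I/O."""
    clean = [img for img in images if img.quiesced]
    return max(clean, key=lambda img: img.taken_at, default=None)


images = [
    Image(datetime(2009, 1, 30, 1, 0), quiesced=True),
    Image(datetime(2009, 1, 30, 2, 0), quiesced=True),
    Image(datetime(2009, 1, 30, 2, 47), quiesced=False),  # latest I/O, not quiesced
]
chosen = latest_clean_image(images)
print(chosen.taken_at)  # 2009-01-30 02:00:00
```

The newest entry is deliberately ignored because it was not quiesced; the function returns `None` if no clean image exists at all, which is the case where you should stop and not enable direct access.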

Once you choose the latest clean image on the target, you then do the same for the data LUNs and boot the server. It will come up with the same IP address as the original, so you need a solution for this. There are a couple of ways of doing it.

One way is to use a global load balancer, such as one from Cisco or an F5 BIG-IP. This provides a front-end IP address that end users use to connect to the back-end servers. If the primary is up, the F5 forwards traffic there. If the primary is down and the DR side is up, it forwards traffic to the DR site. Never have them both up at the same time. If you do, make sure the F5 is set to use the primary at all times when it is available.
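The forwarding rule is simple enough to write down. This is only a sketch of the decision in Python, not load balancer configuration; on a real device you would express it in that product's own settings. `Site` and `choose_site` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Site(Enum):
    PRIMARY = "primary"
    DR = "dr"


def choose_site(primary_up: bool, dr_up: bool) -> Optional[Site]:
    """Pick where the front-end IP should forward traffic."""
    if primary_up:
        # Primary always wins, even if DR was accidentally left running.
        return Site.PRIMARY
    if dr_up:
        return Site.DR
    return None  # nothing is available to serve traffic


print(choose_site(primary_up=False, dr_up=True))  # Site.DR
print(choose_site(primary_up=True, dr_up=True))   # Site.PRIMARY
```

The second call is the "both up" case you want to avoid; the rule at least makes sure users keep landing on the primary if it happens.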

The other way to handle this is to use a stretched VLAN. There are downsides to this such as increased traffic, including broadcast traffic, that could fill the pipe. It does, however, allow you to boot up a server with the same IP address at a different site and the switches will see the change and forward accordingly.

There is of course the option to bring up the server and change the IP address, then change DNS to point to the new IP, but you will have to flush all of the end users' DNS caches. This can be done automatically with third-party products, but I think the first two options are a better fit.
