beingexchanged

Tuesday, September 8, 2009

CCR clustering is still clustering, and so is DAG

As more and more of my readers move to Exchange 2007 and 2010 from Exchange 2003 and earlier versions, I hear a lot about how using the new High Availability tools will finally free them from the yoke of clustering in Windows. While both CCR and DAG are definite improvements over traditional shared-disk clustering, neither is a departure from clustering entirely.

We’ll be talking about the new HA stuff in Exchange 2010 (along with much more of course) in the webinar Double-Take Software and Microsoft are presenting tomorrow.  I’m the speaker for Double-Take, and Patrick Foley from Microsoft is going to be doing their portion. It’s September 9th at 11am, and you can still register for free by clicking here.

In the meantime, it is important to realize that both CCR (Cluster Continuous Replication) and DAG (Database Availability Groups) are offshoots of Windows Failover Clustering (WFC). They both change the way WFC works, and by quite a lot, so you may never touch the underlying cluster technology, but it is still there.

CCR, as its name implies, is built directly on a cluster. This one is a bit easier to see as part of WFC, because you have to create a Failover Cluster first: specifically a majority-node-set cluster with a File Share Witness (the Node and File Share Majority quorum model in Windows Server 2008). After that, when you install Exchange Server 2007 you specify which node will be active and which will be passive. Setup then creates the clustered Exchange resources for you, making the overall process of setting up clustering for Exchange a lot easier.

DAG will permit you to create the cluster itself from Exchange 2010 command sets, eliminating the need to pre-create the Failover Cluster before getting the Exchange installation rolling. While this makes the process even easier than in 2007, it still requires that you have two or more servers capable of running Distributed Failover Clustering. This means that not every edition of Windows Server 2008 is going to be suitable for DAG (Failover Clustering ships only in the Enterprise and Datacenter editions), and it also means that, under the hood, you still need to know how Distributed Failover Clustering works to properly manage the DAG systems.

In both cases, the required level of understanding of clustering is greatly diminished from what was needed in Exchange 2003 and earlier versions.  Most of the guts of the cluster are controlled by Exchange itself, which is a double-edged sword.  On one side you have the fact that folks who don’t have a lot of cluster know-how can now set up HA solutions for Exchange.  On the other side, people who don’t have a lot of cluster know-how are facing troubleshooting clustered Exchange solutions they may not have realized were there.

Both solutions work great for Exchange.  While they don’t eliminate the need for 3rd-party products to help with overall HA (and I’m biased on this one, see disclaimer below), they do make mailbox server protection much more complete.  Just remember that you’re still running on a cluster, and arm yourself with the knowledge needed to keep it running smoothly.

posted by Mike Talon at 0 Comments

Wednesday, May 27, 2009

When your cluster goes “oops,” Using RecoverCMS

First, a quick note:  I’m posting this one from Windows Live Writer on Windows 7 RC1, which I’m happy to say is remarkably stable and much faster overall than Vista.  I’d recommend it wholeheartedly!

Funny story: I once had a client who swore that clustering was enough protection for their messaging environment, until an outage took out their entire cluster at once, causing them to be down for about a week. Now, that’s not the funny part, but what caused the outage is somewhat hilarious. More on that later.

Exchange 2003 and earlier had a pretty straightforward method for recovering an entire MSCS cluster if one had failed on you. You built one or more nodes of a brand new cluster, created an Exchange Virtual Server (EVS) Resource Group with the same parameters (names, IPs, etc.) as the production system had, and Exchange would do the rest.

With Exchange 2007, the rules changed significantly, leaving many cluster users confused as to how the system now works if they suffer a cataclysmic failure of the production cluster. Adding both Single Copy Cluster (SCC) and Cluster Continuous Replication (CCR) to the mix just makes things more confusing, so Microsoft created a new recovery method for Exchange 2007 clusters. Called RecoverCMS, the system is really a setup task rather than a true failover system, but since your failover system just went belly-up, that’s not a bad thing.

If your Recovery Time Objectives are flexible enough to handle some downtime if an entire cluster fails, then you can leverage this system to get back up and running, either at the original production site or at a new location. There are some definite limits to what you can do with it, which I’ll explain later, but the basics of how it works are pretty simple.

Step one is to rebuild, repair or replace the original cluster hardware. If the repair works then you’re done: just restore any missing data from tape or other backup (due disclaimer, see below, I am biased on backup tools) and then resume normal operations. If you rebuild or replace completely, bring up a new server that is configured with Exchange 2007 in the Passive Cluster Node configuration. You can find out how to do that:

Here for CCR clustering or,

Here for SCC Clustering

During that process you will also have installed the Exchange 2007 binaries on at least one node of the cluster system, so go to the directory that has the Exchange setup files and execute the following command:

Setup.com /recoverCMS /CMSName:<name> /CMSIPaddress:<ip>

Where <name> is the name of the EVS you’re restoring (the clustered mailbox server, or CMS, in Exchange 2007 terms), and <ip> is the IP address you want the recovered system to have, in theory the same IP as the original EVS had.
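
If you keep a scripted recovery runbook, it can help to assemble that command line from validated inputs rather than typing it under pressure. The following is a minimal Python sketch, not an official tool: the function name and the sample values are hypothetical, and it only builds the string shown above after checking that the IP address parses.

```python
import ipaddress


def build_recover_cms_command(cms_name: str, cms_ip: str) -> str:
    """Assemble the Setup.com /recoverCMS command line from the post.

    The helper is illustrative only; it does not run Setup.com.
    """
    if not cms_name or any(ch.isspace() for ch in cms_name):
        raise ValueError("CMS name must be non-empty and contain no whitespace")
    # ipaddress.ip_address raises ValueError for a malformed address.
    ip = ipaddress.ip_address(cms_ip)
    return f"Setup.com /recoverCMS /CMSName:{cms_name} /CMSIPaddress:{ip}"


# Hypothetical values, for illustration only.
print(build_recover_cms_command("EXCMS01", "192.0.2.25"))
```

Catching a mistyped address before Setup starts is the only point of the validation; everything else is left to Setup itself.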

The rest of the procedure is pretty automated, and when finished, you will have a new EVS running on your new cluster node(s) that matches the original EVS and has all the users already assigned to it.  From there, you can restore your data if it was also lost to the disaster.

There are a few things that are extremely important to be aware of before you begin:

1 – Keep in mind that /recoverCMS is designed to restore a failed cluster only.  Attempting to use it for migration or for any other purpose will result in unpredictable behavior and is not supported by MSFT.

2 – You will need to manually create the volumes that existed on the failed cluster before you run /recoverCMS.  If volumes are missing then the recovery will fail.  They don’t have to be the same physical disk or size, just large enough to hold the data and with the same drive letters as the original cluster held.

3 – The System Attendant service will start and then immediately stop after you recover. This is normal; just bring the resource back online when you’re ready.

4 – Your databases are not mounted after a recovery. You must mount them manually through PowerShell or the Exchange Management Console after you’re done with the restore.

5 – Do NOT try to use this across operating systems. If you started on Windows Server 2003, you must recover to Windows Server 2003, and 2008 to 2008. It will not work if you try to go from one to the other.

6 – While you can pre-configure many portions of this system, it will still take some time to run through a /recoverCMS procedure from start to finish, so if you need a second-stage failover, /recoverCMS isn’t the best bet.  I’m quite biased on this (see disclaimer below), but unless you can be down for a few hours if both cluster nodes fail, you might want to go with another tool to provide remote site failover in addition to SCC or CCR clustering.

7 – Finally, SCR and CCR will not automatically work with /recoverCMS.  You will need to stop SCR if it’s running before you recover, and neither will resume automatically after the recovery is done.  Once you’re set up in the new node configuration, re-enable CCR and SCR manually as required.

/RecoverCMS is a great way to restore a failed cluster system to new hardware or rebuilt hardware after a fault.  You still need to back up your data to some device outside the cluster itself, but once you have that backup /recoverCMS can get your cluster back up and running much faster than the manual methodologies used in previous versions of Exchange.

As to the funny story I mentioned at the top of the blog, this particular client was in a hardened datacenter with UPS systems, 24/7 staff and a backup generator. They were convinced that clustering was going to be more than enough for them. After trying to explain that a shared-disk cluster (the only option at the time) had weak points, I finally gave up and let them be.

A few months later I got a great phone call. Apparently, unbeknownst to the client, the datacenter crew had run all power connections through the UPS, including the generator. The UPS was rated to handle the full power load of the datacenter on 1 of its 2 redundant circuit loops. So far so good. Well, this particular datacenter was in the middle of the dot-com boom (this was some time ago) and had grown exponentially in a short period of time. What they had was well over half the full expected load on each of the two circuits, and one was failing. So they diligently got replacement parts and moved the load over to the good circuit. Since circuit 1 was carrying over half the expected load, and circuit 2 was already carrying over half as well, the combined load immediately overloaded the UPS, shorting it out.

The way it was explained to me, a solenoid shot through the casing of the UPS…and there was indeed a nice hole in the unit to back that up. No one was hurt, but needless to say, the whole datacenter was offline until they replaced the UPS 4 days later, so they lost about one business week without anything happening to the physical cluster at all. Just goes to show you that anything that can go wrong, will.


Wednesday, May 20, 2009

Outlook, can you hear me? Can you feel me near you?

Might be showing my age and/or taste in music with that particular title (and if you’re totally confused by it, check out This YouTube video), but I think that it’s a great way to describe an annoyance that can happen if you’re using versions of Outlook before Outlook 2007.  Since a large portion of the users of Office are on the 2003 version (and many even earlier than that), resolution to a new server in the event of a disaster recovery event is a subject that is just as confusing as the famous rock opera I’m making use of in my title today.

When Outlook 2007 was introduced to the world with Exchange 2007, a lot was made (and rightfully so) of the new AutoDiscover features that this platform brought into the Enterprise Email marketplace.  The long and short of the AutoDiscover solution set is this:

When an Outlook 2007 client cannot find its home server – either because it is a brand new install of Outlook or because the home server has moved or been replaced – the Exchange 2007 AutoDiscover system can help Outlook 2007 find its home.  If the Outlook client can see an Exchange Server (or be directed to one by Active Directory), the Server can tell Outlook where the mailbox information for the user’s profile exists, and direct Outlook to connect to the appropriate CAS or Mailbox systems and get connected.  All the user/Admin has to do is tell Outlook the user’s email address and password, and AD with Exchange 2007 will handle it from there.  So if you’re installing Outlook for the first time, you don’t have to manually configure the Profile anymore – a great boon to Admins everywhere.

This system also kicks in if you perform Database Portability during a disaster, and have replicated the database with SCR; or have used a 3rd party disaster recovery/availability solution (see disclaimer below for all my bias information on that one =).  Once the Exchange system is responding again, AD can ferry the Outlook 2007 client to the new home for that mailbox, requiring only that the end-user close and reopen Outlook to complete the process.

However, what many folks do not realize right off the bat is that this solution set is ONLY available if you have both Outlook 2007 and Exchange 2007 as your messaging platform.  All users who need to take advantage of AutoDiscover must be using that combination of tools, and no other.  As you might expect, POP3 and IMAP systems do not AutoDiscover, but the majority of my clients were unaware that Outlook 2003 and earlier also cannot take advantage of this system, even if you have upgraded to Exchange 2007 as the messaging platform of choice.  It’s also worth noting that AutoDiscover doesn’t officially work in Exchange 2003 – no matter what Outlook version you are on.  Before I get blasted by mail on this one, I know some folks have sometimes seen it to work on Outlook 2003 with Exchange 2007, but it bombs more than it works, and officially it’s not supported.  For proof, I direct you to this article by the MS Exchange Team.

Since the code to perform AutoDiscover wasn’t in Outlook 2003, users on that client software will not be able to dynamically re-link to the new Exchange server unless the original mailbox server is still responding.  If it is, then Outlook can find the new server via the original server and re-home itself.  If not, Outlook must be manually re-directed to the new server.
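
The rules in the last few paragraphs boil down to a small decision table. As a sketch only, here it is in Python. The function and the return labels are made up for illustration, version numbers are simplified to release years, and the "2007 or later" test reflects the expectation for Exchange 2010 mentioned in this post rather than anything confirmed.

```python
def rehome_mode(outlook_year: int, exchange_year: int, original_server_up: bool) -> str:
    """Describe how a client finds its mailbox after a move or failover."""
    # AutoDiscover needs both sides on the 2007 generation or later.
    if outlook_year >= 2007 and exchange_year >= 2007:
        return "autodiscover"
    # Legacy clients can still be referred on by the original server.
    if original_server_up:
        return "redirect-via-original-server"
    # Otherwise the profile has to be pointed at the new server by hand.
    return "manual-redirect"


print(rehome_mode(2003, 2007, original_server_up=False))
```

The DNS and SPN workarounds are outside this table, since they change what the client sees rather than how it decides.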

Of course, there are ways around this. You could update DNS to re-direct anyone calling for “Server 1” to the IP Address of “Server 2”, effectively re-routing all client software including POP and IMAP. Outlook 2003 will still need to be re-profiled unless you take over the Service Principal Name (SPN) of “Server 1” on “Server 2,” but it will be a smoother transition. Using a 3rd party tool (see disclaimer below) you may have the option of automated DNS and SPN updates, which will allow even legacy Outlook clients to jump to the new server with no more intervention than is required on Outlook 2007 with Exchange 2007, even if your whole system is on the 2003 versions of those software platforms or earlier.

So, you are not short of options if you have any legacy servers and/or clients (or non-MAPI clients) in your environment. You just need to be aware that the Exchange 2007/Outlook 2007 solutions for AutoDiscover services are not backward compatible, and plan accordingly. Right now it looks like Exchange 2010 will have AutoDiscover that is backward compatible to Outlook 2007 only, so this soon-to-be-released platform is not going to solve this particular problem unless you’re planning on upgrading everything else in the environment to at least Outlook 2007 first.

Since I like to let folks continue to discover on their own, here’s a link to the White Paper from MSFT on AutoDiscover.

Finally, Update Roll-Up 8 for Exchange 2007 is out there, which can make life easier if you’re doing a fresh install of Exchange 2007 and want to get up to date with patches and fixes post SP1 quickly.  You can get it at this link.


Monday, May 4, 2009

The Dread Pirate Re-Seed – Part 2

Last week, we talked about re-seeding operations between two nodes of an Active/Passive CCR Cluster. Since both nodes will most likely be inside the same logical network (though on Windows Server 2008 they are not required to be), even unexpected re-seed operations on a CCR Cluster should be relatively painless. Though it’s a full copy to target, it is happening over a LAN, and is therefore faster and less resource-intensive than a WAN copy would be.

This week, let’s look at how the game changes when re-seed operations come up under Standby Continuous Replication (SCR). Just to be clear, re-seeds are not the normal method for data protection with SCR. Normally, when a log file is closed and a new prime log (usually E00 for the first Storage Group) is created, the closed log is shipped over to a target Exchange 2007 server if SCR is enabled. Once on the target, the log is held until the target reaches the number of logs specified in the log replay caching system (50 by default). After that, Exchange commits the logs to the database on the target, effectively providing SQL-like Log Shipping (or Continuous Data Mirroring, Asynchronous) between two Exchange 2007 Servers.
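
To make the hold-then-commit behaviour concrete, here is a toy Python model. It assumes logs can be represented by plain integer generation numbers and that the target always holds back the newest 50; both are simplifications for illustration, and the function name is hypothetical rather than anything Exchange exposes.

```python
def split_by_replay_lag(copied_generations, lag=50):
    """Split copied log generations into (replayed, held_back).

    The newest `lag` logs stay on disk un-replayed; anything older
    is treated as committed to the target database.
    """
    ordered = sorted(copied_generations)
    if lag <= 0:
        return ordered, []
    return ordered[:-lag], ordered[-lag:]


# 120 logs copied so far: the oldest 70 are replayed, the newest 50 wait.
replayed, held = split_by_replay_lag(range(1, 121))
print(len(replayed), len(held))
```

With fewer than 50 logs on the target, nothing is replayed yet in this model, which matches the description above.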

Whenever the two servers are not in communication with each other, there is the potential for operations to occur on the source that are not seen by the target, and vice versa.  Since this could easily result in a corrupted database, Exchange will check to ensure it knows what state the data on both servers is in before resuming normal SCR operations.  If the state is known, then SCR continues – transmitting all logs not yet on the target, committing all but the last 50 logs, and continuing on its way as normal.

If the state is unknown, a re-seed operation will need to occur.  More specifically if there is a gap in log file enumeration for some reason, you will require a re-seed.  You can see the reasons why re-seeds are required at the TechNet website listed here.
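
The gap check itself is simple to picture. The Python sketch below is a simplified model, not Exchange’s actual logic: it assumes integer log generation numbers and invented function names, and it asks whether every log the target still needs is still available on the source.

```python
def first_missing_generation(target_highest, source_highest, source_available):
    """Return the first log generation the target needs but the source lacks.

    target_highest: newest generation already copied to the target.
    source_highest: newest closed generation on the source.
    source_available: generations still on disk at the source.
    Returns None when the log stream is unbroken.
    """
    available = set(source_available)
    for generation in range(target_highest + 1, source_highest + 1):
        if generation not in available:
            return generation
    return None


def needs_reseed(target_highest, source_highest, source_available):
    """A gap in the sequence means log shipping cannot simply resume."""
    return first_missing_generation(
        target_highest, source_highest, source_available
    ) is not None


# Target stopped at 100; a backup then truncated source logs through 105.
print(needs_reseed(100, 110, range(106, 111)))
```

In the sample call, generations 101 through 105 are gone from the source, so the model reports that a re-seed is needed.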

The one that is of most concern to Exchange Admins is that if the two servers are not in communication during a backup window, you will have to re-seed the database before normal SCR operations can continue. For a local SCR pair this is not a big issue, as LAN connectivity is much more reliable than a WAN link, and even if a re-seed is required, it will be relatively fast. Across a WAN, however, you are more likely to suffer occasional outages that do not interrupt business operations. If these outages extend past the time of the backup window, the backup tool will most likely truncate logs committed to tape without SCR being able to transmit those changes to the target Exchange Server.

Since minor network outages are a common (though hopefully not very common) issue with modern networks, the likelihood of requiring a re-seed due to this sequence of events is something that could be considered a normal part of Exchange SCR operations.  If your databases are small, then it won’t be an issue.  For larger databases, remember that a re-seed operation is a full copy of all data files to the target device from source, which could be problematic depending on your WAN throughput.
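
A quick back-of-the-envelope calculation shows why database size and WAN throughput matter so much. The Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a guessed "usable fraction" of the link, since a re-seed rarely gets the whole pipe; the function name and the sample figures are illustrative, not measurements.

```python
def reseed_hours(database_gb: float, link_mbps: float, usable_fraction: float = 0.7) -> float:
    """Estimate hours needed to copy a database across a link."""
    if database_gb < 0 or link_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("sizes and rates must be positive; fraction in (0, 1]")
    bits_to_send = database_gb * 8 * 1000**3
    bits_per_second = link_mbps * 1_000_000 * usable_fraction
    return bits_to_send / bits_per_second / 3600


# Hypothetical case: a 200 GB database over a 10 Mbps WAN link at 70% usable.
print(round(reseed_hours(200, 10), 1))  # roughly 63.5 hours
```

The same 200 GB over a gigabit LAN at the same usable fraction comes out well under an hour, which is why local re-seeds are far less of a concern.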

So with SCR, re-seeds do become a definite issue to contend with.  Since the occurrence of re-seed operations should be limited, you may be able to keep the systems successfully in sync without too much trouble. However, if you have larger databases or smaller WAN pipes, re-seeds can create problems for your network, especially if you use a backup tool that truncates your logs, or use circular logging for any reason.

Of course, there are alternatives to SCR for WAN protection and availability (see disclaimer below, I’m biased here), so you are not out of luck in terms of WAN operations with Exchange 2007. Locally, CCR is a spectacular choice for nearly any sized Exchange system, with the exception of very large (over 1TB) datasets that may just take too long to re-seed even over a LAN. Remotely, if you have properly planned for re-seeds then you will also be able to successfully utilize SCR, but if you do not (or cannot) plan for these required operations, then you’ll risk a failure in your failsafe system, which is not a great situation to be in.

In short, re-seeds are a normal part of SCR operations. They should not happen every day (or even every week) but you will most likely experience them over the course of your SCR lifetime, so plan for them now.

 

Next week is TechEd 2009!  I’ll be out in Los Angeles with the Double-Take Software crew, so please stop by and say hi. I’ll also be blogging from the event, so keep an eye on this column and follow me on Twitter at http://twitter.com/talonnyc.
