beingexchanged

Tuesday, January 12, 2010

A word on Cluster Groups

Many clients are on Exchange 2003 or 2007 and will need to deal with Cluster Groups in Microsoft Cluster Services (MSCS) or Failover Clustering Services (FCS) including Cluster Continuous Replication (CCR).  So, it is important to understand one very critical restriction of Exchange Clustering that I’ve seen several clients trip over.

When installing a Microsoft Cluster of most flavors, you will configure Groups, which are logical units used to contain Resources like IP Addresses, Network Names, Disks and Services.  By default, a Cluster Group will be created that contains the name, IP address and Quorum Disk for the cluster itself.  It may also contain a networked Distributed Transaction Coordinator (DTC) resource for the cluster as a whole.  It is very tempting to place all other resources in this group, but you should avoid doing that at all costs for 2 significant reasons:

1 – It’s not supported by Microsoft.  For proof, I refer you to this TechNet article.  There is a long explanation of many things having to do with configuring an Exchange Cluster, but here’s the specific info I’m referring to:

“It is an Exchange best practice to install the MSDTC resource into the default cluster group. However, the MSDTC resource is the only resource supported in the default cluster group. Exchange resources should not be added to the default cluster group, as that configuration is not supported.” [emphasis added]

TechNet and other Microsoft sites have many more examples of this warning, and it is well documented by Microsoft and the Exchange Product Team.

2 – It makes life more difficult in day-to-day administration.  There may be instances where you want to perform operations on the Cluster Group without interrupting Exchange services for your organization.  You can normally accomplish this by moving the Cluster Group to a Cluster Node that isn’t hosting any Exchange Resource Groups, and performing your activities on that node.  If you have Exchange Resources in the Cluster Group, then this option disappears.  The same goes for many 3rd-Party products (see disclaimer at the end of the blog), which may not accept Exchange Resources that appear in the Cluster Group, as they must treat the Cluster Group and the Exchange Resource Group independently for administrative purposes.

So, as tempting as it is, avoid installing Exchange Resources into the Cluster Group at all costs.  If you already have put Exchange Resources into the Cluster Group, and you don’t plan on upgrading just yet, then seriously consider migrating to a supported cluster configuration when time permits.  Issues that arise from unsupported configuration and limited administration tend to hit without warning, and at the worst possible time.  Taking time to move to a supported platform will keep your organization in the safe zone, and make life a lot easier for you over time.
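As a rough illustration of the rule above, here is a minimal Python sketch that scans an inventory of cluster groups and flags anything in the default Cluster Group other than the core cluster resources and MSDTC.  The inventory dictionary, the group names and the resource type labels are all made up for the example; they are not read from a real cluster or taken from any Microsoft API.

```python
# Hypothetical inventory: group name -> list of (resource name, resource type).
# The type strings are illustrative labels, not values queried from a real cluster.
DEFAULT_GROUP = "Cluster Group"
ALLOWED_IN_DEFAULT = {
    "IP Address",
    "Network Name",
    "Physical Disk",
    "Distributed Transaction Coordinator",
}


def misplaced_resources(groups):
    """Return names of resources in the default group that do not belong there."""
    return [
        name
        for name, rtype in groups.get(DEFAULT_GROUP, [])
        if rtype not in ALLOWED_IN_DEFAULT
    ]


inventory = {
    "Cluster Group": [
        ("Cluster IP Address", "IP Address"),
        ("Cluster Name", "Network Name"),
        ("Quorum Disk", "Physical Disk"),
        ("MSDTC", "Distributed Transaction Coordinator"),
        ("Exchange Information Store (EVS1)", "Microsoft Exchange Information Store"),
    ],
    "EVS1 Group": [
        ("Exchange System Attendant (EVS1)", "Microsoft Exchange System Attendant"),
    ],
}

# Lists the Exchange resource that was wrongly placed in the default group.
print(misplaced_resources(inventory))
```

An empty list would mean the default group holds only what the quoted guidance allows.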

posted by Mike Talon at 0 Comments

Tuesday, September 8, 2009

CCR clustering is still clustering, and so is DAG

As more and more of my readers move to Exchange 2007 and 2010 from Exchange 2003 and earlier versions, I hear a lot about how using the new High Availability tools will finally free them from the yoke of clustering in Windows.  While both CCR and DAG are definite improvements over traditional shared-disk clustering, neither is a departure from clustering entirely.

We’ll be talking about the new HA stuff in Exchange 2010 (along with much more of course) in the webinar Double-Take Software and Microsoft are presenting tomorrow.  I’m the speaker for Double-Take, and Patrick Foley from Microsoft is going to be doing their portion. It’s September 9th at 11am, and you can still register for free by clicking here.

In the meantime, it is important to realize that both CCR (Cluster Continuous Replication) and DAG (Database Availability Groups) are offshoots of Windows Failover Clustering (WFC).  They both change the way WFC works, and by quite a lot, so you may never touch the underlying cluster technology, but it is still there.

CCR – as its name implies – works by allowing you to create a cluster during the installation of Exchange 2007.  This one is a bit easier to see as part of WFC, as you have to create a Failover Cluster first – specifically a Distributed Majority-Node File Share Witness Failover Cluster.  After that, when you install Exchange Server you can specify which server(s) will be the Active node(s) and which will be passive.  This creates the clustered Exchange resources for you, making the overall process of setting up clustering for Exchange a lot easier.  As this one has Cluster in the name, it’s easier to see the WFC roots.

DAG will permit you to create the cluster itself from Exchange 2010 command sets, eliminating the need to pre-create the Failover Cluster prior to getting the Exchange installation rolling.  While this makes the process even easier than in 2007, it still requires that you have two or more servers capable of running Distributed Failover Clustering.  This means that not every version of Windows 2008 is going to be suitable for DAG, but also means that – under the hood – you still need to know how Distributed Failover Clustering works to properly manage the DAG systems.

In both cases, the required level of understanding of clustering is greatly diminished from what was needed in Exchange 2003 and earlier versions.  Most of the guts of the cluster are controlled by Exchange itself, which is a double-edged sword.  On one side you have the fact that folks who don’t have a lot of cluster know-how can now set up HA solutions for Exchange.  On the other side, people who don’t have a lot of cluster know-how are facing troubleshooting clustered Exchange solutions they may not have realized were there.

Both solutions work great for Exchange.  While they don’t eliminate the need for 3rd-party products to help with overall HA (and I’m biased on this one, see disclaimer below), they do make mailbox server protection much more complete.  Just remember that you’re still running on a cluster, and arm yourself with the knowledge needed to keep it running smoothly.

posted by Mike Talon at 0 Comments

Wednesday, May 27, 2009

When your cluster goes “oops,” Using RecoverCMS

First, a quick note:  I’m posting this one from Windows Live Writer on Windows 7 RC1, which I’m happy to say is remarkably stable and much faster overall than Vista.  I’d recommend it wholeheartedly!

Funny story: I once had a client who swore that clustering was enough protection for their messaging environment, until an outage took out their entire cluster at once – causing them to be down for about a week.  Now, that’s not the funny part, but what caused the outage is somewhat hilarious; more on that later.

Exchange 2003 and earlier had a pretty straight-forward method for recovering an entire MSCS cluster if one had failed on you.  You built one or more nodes of a brand new cluster, created an Exchange Virtual Server (EVS) Resource Group with the same parameters (names, IP’s etc) as the production system had, and Exchange would do the rest.

With Exchange 2007, the rules changed significantly, leaving many cluster users confused as to how the system now works if they suffer a cataclysmic failure of the production cluster.  Adding both Single Copy Cluster (SCC) and Cluster Continuous Replication (CCR) to the mix just makes things more confusing, so Microsoft created a new recovery method for Exchange 2007 clusters.  Called RecoverCMS, the system is really a setup task rather than a true failover system, but since your failover system just went belly-up, that’s not a bad thing.

If your Recovery Time Objectives are flexible enough to handle some downtime if an entire cluster fails, then you can leverage this system to get back up and running, either at the original production site, or at a new location.  There are some definite limits to what you can do with it, which I’ll explain later, but the basics of how it works are pretty simple.

Step one is rebuild, repair or replace the original cluster hardware. If the repair works then you’re done, just restore any missing data from tape or other backup (due disclaimer, see below, I am biased on backup tools) and then resume normal operations. If you rebuild or replace completely, bring up a new server that is configured with Exchange 2007 in the Passive Cluster Node configuration.  You can find out how to do that:

Here for CCR clustering or,

Here for SCC Clustering

During that process you will also have installed the Exchange 2007 binaries on at least one node of the cluster system, so go to the directory that has the Exchange setup files and execute the following command:

Setup.com /recoverCMS /CMSName:<name> /CMSIPaddress:<ip>

Where <name> is the name of the EVS you’re restoring from, and <ip> is the IP address you want the recovered system to have – in theory the same IP as the original EVS had.
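If you script your recovery runbook, it can help to build that command line from validated inputs rather than typing it under pressure.  The following is only a sketch in Python: the helper name, the EVS name "EVS1" and the address 192.0.2.25 are placeholders of mine, and the only switches used are the ones shown in the command above.

```python
import ipaddress


def recover_cms_command(cms_name, cms_ip):
    """Build the Setup.com /recoverCMS command shown above as an argument list."""
    if not cms_name or any(ch.isspace() for ch in cms_name):
        raise ValueError("CMS name must be non-empty and contain no whitespace")
    ipaddress.ip_address(cms_ip)  # raises ValueError on a malformed address
    return [
        "Setup.com",
        "/recoverCMS",
        f"/CMSName:{cms_name}",
        f"/CMSIPaddress:{cms_ip}",
    ]


# Placeholder values; substitute the name and IP of the EVS being recovered.
print(" ".join(recover_cms_command("EVS1", "192.0.2.25")))
```

The sketch only assembles and prints the command; actually running it belongs in the Exchange setup directory on the rebuilt node, as described above.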

The rest of the procedure is pretty automated, and when finished, you will have a new EVS running on your new cluster node(s) that matches the original EVS and has all the users already assigned to it.  From there, you can restore your data if it was also lost to the disaster.

There are a few things that are extremely important to be aware of before you begin:

1 – Keep in mind that /recoverCMS is designed to restore a failed cluster only.  Attempting to use it for migration or for any other purpose will result in unpredictable behavior and is not supported by MSFT.

2 – You will need to manually create the volumes that existed on the failed cluster before you run /recoverCMS.  If volumes are missing then the recovery will fail.  They don’t have to be the same physical disk or size, just large enough to hold the data and with the same drive letters as the original cluster held.

3 – The System Attendant service will start and then immediately stop after you recover.  This is normal; just bring the resource back online when you’re ready.

4 – Your databases are not mounted after a recovery.  You must mount them manually through PowerShell or the Exchange Management Console after you’re done with the restore.

5 – Do NOT try to use this across OS’s. If you started on Windows Server 2003, you must recover to Windows Server 2003, and 2008 to 2008.   It will not work if you try to go from one to the other.

6 – While you can pre-configure many portions of this system, it will still take some time to run through a /recoverCMS procedure from start to finish, so if you need a second-stage failover, /recoverCMS isn’t the best bet.  I’m quite biased on this (see disclaimer below), but unless you can be down for a few hours if both cluster nodes fail, you might want to go with another tool to provide remote site failover in addition to SCC or CCR clustering.

7 – Finally, SCR and CCR will not automatically work with /recoverCMS.  You will need to stop SCR if it’s running before you recover, and neither will resume automatically after the recovery is done.  Once you’re set up in the new node configuration, re-enable CCR and SCR manually as required.
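Several of the points above (matching drive letters, matching OS, stopping SCR) lend themselves to a quick pre-flight check before you run the recovery.  Here is a minimal Python sketch of that idea.  The dictionary keys and sample values are hypothetical; in practice you would fill them in from your own documentation of the failed cluster and an inspection of the rebuilt node.

```python
def preflight(original, target):
    """Compare a record of the failed cluster with the rebuilt node.

    Both arguments are plain dicts with hypothetical keys:
      "os"          -> operating system name, e.g. "Windows Server 2003"
      "volumes"     -> set of drive letters, e.g. {"E", "L"}
      "scr_running" -> bool, only read from the original record
    Returns a list of human-readable problems; an empty list means no issue found.
    """
    problems = []
    if original["os"] != target["os"]:
        problems.append(f"OS mismatch: {original['os']} -> {target['os']}")
    missing = sorted(original["volumes"] - target["volumes"])
    if missing:
        problems.append("missing volumes: " + ", ".join(missing))
    if original.get("scr_running"):
        problems.append("stop SCR before running /recoverCMS")
    return problems


failed = {"os": "Windows Server 2003", "volumes": {"E", "F", "L"}, "scr_running": True}
rebuilt = {"os": "Windows Server 2008", "volumes": {"E", "L"}}

for problem in preflight(failed, rebuilt):
    print(problem)
```

This covers only the checks that can be reduced to simple comparisons; it does not replace reading through the full list before you begin.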

/RecoverCMS is a great way to restore a failed cluster system to new hardware or rebuilt hardware after a fault.  You still need to back up your data to some device outside the cluster itself, but once you have that backup /recoverCMS can get your cluster back up and running much faster than the manual methodologies used in previous versions of Exchange.

As to the funny story I mentioned at the top of the blog, this particular client was in a hardened datacenter with UPS systems, 24/7 staff and a backup generator.  They were convinced that clustering was going to be more than enough for them.  After trying to explain that a shared-disk cluster (the only option at the time) had weak points, I finally gave up and let them be.  A few months later I got a great phone call.  Apparently – unbeknownst to the client – the datacenter crew had run all power connections through the UPS – including the generator.  The UPS was rated to handle the full power load of the datacenter on 1 of its 2 redundant circuit loops.  So far so good.  Well, this particular datacenter was in the middle of the dot-com boom (this was some time ago) and had grown exponentially in a short period of time.  What they had was well over half the full expected load on each of the two circuits, and one was failing.  So they diligently got replacement parts and moved the load over to the good circuit.  Since circuit 1 was carrying over half the expected load, and circuit 2 was already carrying over half the load as well, they immediately overloaded the UPS, shorting it out.  The way it was explained to me, a solenoid shot through the casing of the UPS…and there was indeed a nice hole in the unit to back that up. No one was hurt, but needless to say, the whole datacenter was offline until they replaced the UPS 4 days later, so they lost about one business week without anything happening to the physical cluster at all.  Just goes to show you that anything that can go wrong, will.

posted by Mike Talon at 0 Comments

Monday, April 27, 2009

The Dread Pirate Re-Seed – Part 1

Among the most common questions I get from clients about the new data-protection features in Exchange 2007 (and the soon-to-be-released Exchange 2010) is, “What is a re-seed and why does it happen?”  This mostly falls into the category of “fear of the unknown” since the technology is new, and documentation on how it works is somewhat scarce.

Re-seeds are a commonly confusing part of most protection methods, though they fall under different names and methodologies.  In a solution like Double-Take (see disclaimer below), they’re called re-mirrors or re-synchronization operations – and are typically differences only.  In a tape-backup solution it’s a restore operation, and might be everything, incremental pieces, or some combination thereof.  In Exchange 2007 these operations are called re-seeds: basically the replay of data from a server that has a “correct” copy to one that does not.  Currently, we see these operations in Exchange 2007 Local Continuous Replication (LCR), Cluster Continuous Replication (CCR) and Server – or Standby – Continuous Replication (SCR).  Today, we’ll talk about CCR, and address LCR/SCR next week.

CCR allows an active node of a 2-node Active/Passive Exchange Cluster to replicate a copy of its data to the passive node.  This allows the passive node to take over with a nearly-current copy of the data if the production system fails due to hardware or software failure.  There is a log replay lag to be considered, but it’s only 50 logs that need to be applied to the passive node during a rollover event, and that does give you some measure of protection against corruption if you catch it fast enough.  Otherwise, the system acts much like a traditional Single Copy Cluster (formerly Shared Disk Cluster) in behavior, and is controlled with a combination of Windows cluster tools and PowerShell.

Whenever a log file is committed, and a new prime log (usually E00) is created, the closed log is copied over to the passive node via an SMB share, where it is held until it passes the 50 log replay limit and is then committed to the database, or a rollover occurs and the logs are committed immediately.  Exchange 2010 will move away from the SMB share, but will utilize a similar methodology overall, if the beta is to be taken at face value.
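To make the hold-then-commit behavior described above a little more concrete, here is a toy model in Python.  It is a simplification of my own and not how Exchange is implemented: log generations are just integers, the class and method names are invented, and the lag of 50 is the figure quoted above.

```python
from collections import deque

REPLAY_LAG = 50  # number of copied logs held back, per the figure quoted above


class PassiveCopy:
    """Toy model of the passive node: hold copied logs, replay the oldest later."""

    def __init__(self):
        self.pending = deque()  # copied but not yet replayed log generations
        self.replayed = []      # generations committed to the passive database

    def receive(self, generation):
        """A closed log arrives from the active node."""
        self.pending.append(generation)
        while len(self.pending) > REPLAY_LAG:
            self.replayed.append(self.pending.popleft())

    def rollover(self):
        """On a rollover, everything still pending is committed immediately."""
        while self.pending:
            self.replayed.append(self.pending.popleft())


passive = PassiveCopy()
for gen in range(1, 61):  # 60 closed logs shipped from the active node
    passive.receive(gen)

print(len(passive.replayed), len(passive.pending))  # 10 replayed, 50 held back
passive.rollover()
print(len(passive.replayed), len(passive.pending))  # 60 replayed, 0 held back
```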

In order to get the passive node in sync with the active data, the CCR system starts with a re-seed operation.  All data from the database is copied from the active node to the passive node, as well as any non-truncated logs.  From then on, only log files are copied, as they are committed on the active node.  If all goes well, this will probably be the only re-seed you see unless you have a rollover.

If you do flip nodes – let’s say from Node A to Node B – then Node B will re-seed back to Node A if Node A becomes divergent.  Node A is divergent if Exchange cannot determine which logs still exist on it, if its logs are inconsistent, or if some are missing.  A graceful rollover will not cause a re-seed, but most emergency rollovers will require it.

The same will happen if you haven’t rolled over, but instead Node B was offline for some other reason.  When Node B comes back online, Node A will see if all the required logs are on both machines, and then either just continue CCR protection or else initiate a re-seed to copy the data over again if anything is amiss.  The only issue here is if your backup tools purge logs while Node B is still offline.  In that case the servers will be considered divergent and need a re-seed to get back up and running properly.
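The decision described above boils down to one question: does the active node still hold every log the other node is missing?  Below is a simplified Python sketch of that check.  The function name and its parameters are hypothetical, log generations are modeled as plain integers, and the real divergence checks in Exchange consider more than log presence.

```python
def next_action(active_logs, passive_last_gen, active_current_gen):
    """Decide between resuming log shipping and a full re-seed.

    active_logs        -- set of closed log generations still on the active node
    passive_last_gen   -- highest generation the returning node already holds
    active_current_gen -- highest closed generation on the active node
    """
    needed = set(range(passive_last_gen + 1, active_current_gen + 1))
    missing = needed - active_logs
    return "re-seed" if missing else "resume log shipping"


# Node B went offline at generation 100; Node A is now at generation 120.
all_logs_kept = set(range(90, 121))
print(next_action(all_logs_kept, 100, 120))  # every needed log is still present

# Same outage, but a backup purged everything up to generation 110 on Node A.
purged = set(range(111, 121))
print(next_action(purged, 100, 120))  # generations 101-110 are gone
```

The second case is the backup-purge scenario mentioned above: the logs Node B needs no longer exist anywhere, so only a full copy can bring it back in line.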

Finally, if a cluster is restored from a backup (tape or otherwise) to the active node, then a re-seed must be manually initiated to re-sync the nodes properly.  You will see errors telling you to do this after the restore is complete and you bring Node A back online.

One other condition exists, but it is a manually created condition. If you perform Offline Defragmentation of the database, you will trigger a re-seed operation when Node A is brought back online.  As long as the first Exchange log is still present (which it should be) then this will happen automatically. Otherwise, it will need to be initiated manually.

So, why is this an issue?  Normally, it’s not, but keep in mind that re-seed operations are *full* copies of the entire database.  So if you have relatively small databases and only a few of them, this isn’t a problem.  But let’s say you have over 1 Terabyte of data in your Exchange cluster.  Re-seeding that much data locally will be time and resource consuming, and doing it over a WAN (for distributed failover clustering) could be problematic – to say the least.  So you want to avoid re-seed operations at all costs and wherever possible, which means treating the CCR cluster very carefully, and following all the best practices from Microsoft on Exchange 2007 Clustering in general.
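To put rough numbers on that, here is a back-of-the-envelope calculation in Python.  The link speeds are illustrative choices of mine rather than measurements from any real deployment, 1 Terabyte is taken as 10^12 bytes, and the estimate ignores protocol overhead and competing traffic unless you lower the efficiency factor.

```python
def reseed_hours(database_bytes, link_mbps, efficiency=1.0):
    """Estimate hours needed to copy a database over a link of the given speed.

    efficiency is the usable fraction of the link (1.0 means the full line rate).
    """
    bits = database_bytes * 8
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600


ONE_TB = 10**12  # bytes

for mbps in (1000, 100, 45):
    print(f"{mbps:>5} Mbps: {reseed_hours(ONE_TB, mbps):.1f} hours")
```

Even at a full 100 Mbps the copy takes most of a day, and on a slower WAN link it stretches to about two days, which is why avoiding unnecessary re-seeds matters.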

For information on when re-seeds occur, take a look at this TechNet article.  They’re not an everyday occurrence, but you will need to be sure you know when and why they will happen to avoid confusion and frustration.

posted by Mike Talon at 0 Comments

Monday, August 11, 2008

Quite a stretch for clustering in 2008

Microsoft Clustering Services (MSCS) have existed in one form or another since NT4, but have always suffered from a significant limitation.  All cluster resources had to exist within the same logical subnet, or else you couldn't create the cluster itself.  Windows Server 2008 allows for some flexibility in that regard, with the ability to create nodes of a contiguous cluster in different logical subnets.

Before we dive too far into that, you may want to see the official MSFT information here:

http://technet.microsoft.com/en-us/library/cc770625.aspx

So what does this mean for you and me?  It means that we can create CCR clusters on Exchange 2007 that stretch between physical locations and subnets.  However, to do this you'll need to be on Server 2008; the function just isn't available in Server 2003. This allows you to provide basic availability for Mailbox Role (MBX) servers in your Exchange environment, but doesn't take care of everything when it comes to DR planning.
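If you are not sure whether a planned pair of nodes actually crosses a subnet boundary, the check is simple arithmetic on the addresses and masks.  Here is a small Python sketch using the standard ipaddress module; the function name and the node addresses are made up for the example.

```python
import ipaddress


def spans_subnets(*node_interfaces):
    """Return True if the given 'address/prefix' strings fall in more than one subnet."""
    networks = {ipaddress.ip_interface(i).network for i in node_interfaces}
    return len(networks) > 1


# Both nodes in one subnet: the traditional single-subnet layout.
print(spans_subnets("10.1.0.11/24", "10.1.0.12/24"))  # False

# Nodes in different subnets: this is the layout that needs Server 2008.
print(spans_subnets("10.1.0.11/24", "10.2.0.11/24"))  # True
```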

First off, this applies only to Exchange 2007 Enterprise Edition, and then only to MBX role servers.  While most other roles are natively fault-tolerant with multiple servers installed with the same role able to stand in for each other, organizational or regulatory rules might not make that kind of redundancy possible.  Edge servers are the biggest example of this.  Since they're not tied into the domain structures, they don't contain any way to quickly flip traffic from one Edge server to another in different sites.  Third-party tools (see disclaimer below) can often take care of that function for you, as can working with your DNS provider to facilitate moving the MX records in the event of an emergency.

If you have legacy Exchange 2000/2003 servers, you're also not able to take advantage of this new MSCS feature set on those boxes.  The same goes with any non-Exchange tools, like SQL, anti-virus servers, anti-spam servers, etc.  Even if these servers run on Server 2008 clusters, they'll require some third-party intervention to handle the data replication for those systems.  This would include things like Blackberry servers, GoodLink systems and other non-Exchange remote email tools.

Finally, keep in mind that CCR clusters can only extend to Active/Passive, 2-node configurations. That will mean you can't use these solutions if you need to go beyond that model - which Exchange 2007 easily allows without CCR involved.

Server 2008 Failover Clustering is a great method for basic High Availability for Exchange 2007 CCR systems - even across subnets.  With some additional tools, it can become the center of an Exchange DR solution set that can help your organization withstand even site-wide emergencies.

posted by Mike Talon at 0 Comments