beingexchanged

Tuesday, January 12, 2010

A word on Cluster Groups

Many clients are on Exchange 2003 or 2007 and will need to deal with Cluster Groups in Microsoft Cluster Services (MSCS) or Failover Clustering Services (FCS) including Cluster Continuous Replication (CCR).  So, it is important to understand one very critical restriction of Exchange Clustering that I’ve seen several clients trip over.

When installing a Microsoft Cluster of most flavors, you will configure Groups, which are logical units used to contain Resources like IP Addresses, Network Names, Disks and Services.  By default, a Cluster Group will be created that contains the name, IP address and Quorum Disk for the cluster itself.  It may also contain a networked Distributed Transaction Coordinator (DTC) resource for the cluster as a whole.  It is very tempting to place all other resources in this group, but you should avoid doing that at all costs for 2 significant reasons:

1 – It’s not supported by Microsoft.  For proof, I refer you to this TechNet article.  There is a long explanation of many things having to do with configuring an Exchange Cluster, but here’s the specific info I’m referring to:

“It is an Exchange best practice to install the MSDTC resource into the default cluster group. However, the MSDTC resource is the only resource supported in the default cluster group. Exchange resources should not be added to the default cluster group, as that configuration is not supported.” [emphasis added]

TechNET and the Microsoft Sites have many other examples of this warning, and it is well documented by Microsoft and the Exchange Product Team.

2 – It makes life more difficult in day-to-day administration.  There may be instances where you want to perform operations on the Cluster Group without interrupting Exchange services for your organization.  You can normally accomplish this by moving the Cluster Group to a Cluster Node that isn’t hosting any Exchange Resource Groups, and performing your activities on that node.  If you have Exchange Resources in the Cluster Group, then this option disappears.  The same goes for many 3rd-Party products (see disclaimer at the end of the blog), which may not accept Exchange Resources that appear in the Cluster Group, as they must treat the Cluster Group and the Exchange Resource Group independently for administrative purposes.

So, as tempting as it is, avoid installing Exchange Resources into the Cluster Group at all costs.  If you already have put Exchange Resources into the Cluster Group, and you don’t plan on upgrading just yet, then seriously consider migrating to a supported cluster configuration when time permits.  Issues that arise from unsupported configuration and limited administration tend to hit without warning, and at the worst possible time.  Taking time to move to a supported platform will keep your organization in the safe zone, and make life a lot easier for you over time.
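
To make the rule concrete, here is a minimal sketch of the kind of check you could run against an inventory of your cluster groups. Everything in it is an assumption for illustration: the inventory is a hand-built dictionary rather than output from any real cluster tool, and the resource type names are examples, so adjust the allow-list to whatever your own cluster reports.

```python
# Resource types that belong in the default Cluster Group: the cluster's own
# name, IP address and quorum disk, plus the MSDTC resource.
# These type names are illustrative assumptions, not read from a live cluster.
ALLOWED_IN_DEFAULT_GROUP = {
    "Network Name",
    "IP Address",
    "Physical Disk",
    "Distributed Transaction Coordinator",
}


def unsupported_resources(groups, default_group="Cluster Group"):
    """Return the resource types in the default group that are not allowed there."""
    found = groups.get(default_group, [])
    return sorted(r for r in found if r not in ALLOWED_IN_DEFAULT_GROUP)


# Hypothetical inventory: group name -> resource types found in that group.
inventory = {
    "Cluster Group": [
        "Network Name",
        "IP Address",
        "Physical Disk",
        "Distributed Transaction Coordinator",
        "Microsoft Exchange System Attendant",
    ],
    "Exchange Group": ["Microsoft Exchange Information Store"],
}

for resource in unsupported_resources(inventory):
    print("Move out of the default Cluster Group:", resource)
```

An empty result means the default group holds only what it should; anything listed is a candidate for migration into its own resource group.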

posted by Mike Talon at 0 Comments

Thursday, November 5, 2009

Time to pay the bills! Exchange 2003 and GeoCluster.

Exchange 2007 introduced the idea of Cluster Continuous Replication (CCR) to the world, allowing you to extend an Exchange Cluster between sites (especially on Server 2008) and to create more than one copy of the mailbox data. Exchange 2010 will introduce Database Availability Groups (DAG), further pushing the technology to provide up to 16 total copies of the mailbox data in any number of locations. Both of these technologies are stellar in their own right, but leave those who are still running Exchange 2003 solidly in the dust. Granted, Exchange 2003 is nearing end-of-life, but with a large portion of the market still running on it (at the very least until the upgrades are done), many folks need solutions.

As I work for Double-Take Software, of course I’m happy to advocate our cluster-extending technology to help alleviate the situation on earlier versions of Exchange Server. This is both because they pay me to vocally advocate it (the FCC may be watching) and because it works remarkably well. More so for the latter reason.

GeoCluster (which was once a stand-alone product but is now a feature set of Double-Take Availability) allows you to create a Microsoft Cluster using Microsoft Clustering Services (MSCS) on Server 2003, but to do so without creating a shared-disk configuration that could lead to a single point of failure and that restricts how far apart the nodes can physically be. The idea is simple: GeoCluster works under the hood of MSCS, replicating data on each disk resource from the owning node to all potential owning nodes in the cluster. So Exchange sees a traditional cluster, but in reality the disks are replicated, creating multiple copies of the data based on the active node for each disk.

Since GeoCluster can support any valid cluster configuration, you can freely create clusters that span more than 2 nodes, or even more than one physical site. Keep in mind, however, that you’ll still be limited by single-subnet restrictions in Server 2003’s MSCS implementation. The good news is that moving resources from node to node works exactly the same way as it would in a shared-disk cluster, and therefore automatic failover and on-command moves are all possible.

If you lose a node, GeoCluster lets the MSCS engine arbitrate who should take over, then begins replicating data from that new owner to all the other, surviving, potential owners. Once you repair or replace the original node, the system will sync up the volumes and be ready to allow you to move the resources back to the original node if you want to. This replication is all done with the Double-Take Replication Engine, which allows GeoCluster to have the same level of write-order integrity and data reliability as any other Double-Take connection.

So, until you’re ready to make the jump to Exchange 2007 and beyond, or if you cannot take advantage of CCR for whatever reason, have a look at the GeoCluster solution. It is a cost-effective and reliable way to make MSCS more flexible, and it does so without making Exchange work differently than it was designed to function.

Don’t believe me?  Check out this TechNET blog post about what the MSFT Virtualization Team does with partners like DBTK.  We help them with clustering solutions for Hyper-V, and can help you with that and much more.

Tomorrow, back to my usual, non-vendor-specific stuff =)

posted by Mike Talon at 0 Comments

Wednesday, May 27, 2009

When your cluster goes “oops,” Using RecoverCMS

First, a quick note:  I’m posting this one from Windows Live Writer on Windows 7 RC1, which I’m happy to say is remarkably stable and much faster overall than Vista.  I’d recommend it wholeheartedly!

Funny story: I once had a client who swore that clustering was enough protection for their messaging environment, until an outage took out their entire cluster at once – causing them to be down for about a week.  Now, that’s not the funny part, but what caused the outage is somewhat hilarious.  More on that later.

Exchange 2003 and earlier had a pretty straight-forward method for recovering an entire MSCS cluster if one had failed on you.  You built one or more nodes of a brand new cluster, created an Exchange Virtual Server (EVS) Resource Group with the same parameters (names, IP’s etc) as the production system had, and Exchange would do the rest.

With Exchange 2007, the rules changed significantly, leaving many cluster users confused as to how the system now works if they suffer a cataclysmic failure of the production cluster.  Adding both Single Copy Cluster (SCC) and Cluster Continuous Replication (CCR) to the mix just makes things more confusing, so Microsoft created a new recovery method for Exchange 2007 clusters.  Called RecoverCMS, the system is really a setup task rather than a true failover system, but since your failover system just went belly-up, that’s not a bad thing.

If your Recovery Time Objectives are flexible enough to handle some downtime if an entire cluster fails, then you can leverage this system to get back up and running, either at the original production site, or at a new location.  There are some definite limits to what you can do with it, which I’ll explain later, but the basics of how it works are pretty simple.

Step one is to rebuild, repair or replace the original cluster hardware. If the repair works then you’re done: just restore any missing data from tape or other backup (due disclaimer, see below, I am biased on backup tools) and then resume normal operations. If you rebuild or replace completely, bring up a new server that is configured with Exchange 2007 in the Passive Cluster Node configuration.  You can find out how to do that:

Here for CCR clustering or,

Here for SCC Clustering

During that process you will also have installed the Exchange 2007 binaries on at least one node of the cluster system, so go to the directory that has the Exchange setup files and execute the following command:

Setup.com /recoverCMS /CMSName:<name> /CMSIPaddress:<ip>

Where <name> is the name of the EVS you’re restoring, and <ip> is the IP address you want the recovered system to have – in theory the same IP as the original EVS had.
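
If you keep recovery runbooks, it can help to generate that command line from validated inputs instead of typing it under pressure. The following is only a sketch: the helper name build_recover_cms_command and the sample name and address are my own placeholders, and the only thing taken from the procedure itself is the command format shown above.

```python
import ipaddress


def build_recover_cms_command(cms_name, cms_ip):
    """Assemble the /recoverCMS command line from a name and an IP address.

    Raises ValueError if the name is empty or contains whitespace, or if the
    address is not a well-formed IP address.
    """
    if not cms_name or any(ch.isspace() for ch in cms_name):
        raise ValueError("CMS name must be non-empty and contain no whitespace")
    ip = ipaddress.ip_address(cms_ip)  # raises ValueError on a malformed address
    return f"Setup.com /recoverCMS /CMSName:{cms_name} /CMSIPaddress:{ip}"


# Placeholder values for illustration only.
print(build_recover_cms_command("EXCMS01", "10.0.0.25"))
```

The helper only builds the string; running it is still a manual step from the directory that holds the Exchange setup files.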

The rest of the procedure is pretty automated, and when finished, you will have a new EVS running on your new cluster node(s) that matches the original EVS and has all the users already assigned to it.  From there, you can restore your data if it was also lost to the disaster.

There are a few things that are extremely important to be aware of before you begin:

1 – Keep in mind that /recoverCMS is designed to restore a failed cluster only.  Attempting to use it for migration or for any other purpose will result in unpredictable behavior and is not supported by MSFT.

2 – You will need to manually create the volumes that existed on the failed cluster before you run /recoverCMS.  If volumes are missing then the recovery will fail.  They don’t have to be the same physical disk or size, just large enough to hold the data and with the same drive letters as the original cluster held.

3 – The System Attendant service will start and then immediately stop after you recover.  This is normal; just bring the resource back online when you’re ready.

4 – Your databases are not mounted after a recovery.  You must mount them manually through PowerShell or the Exchange Management Console after you’re done with the restore.

5 – Do NOT try to use this across OS’s. If you started on Windows Server 2003, you must recover to Windows Server 2003, and 2008 to 2008.   It will not work if you try to go from one to the other.

6 – While you can pre-configure many portions of this system, it will still take some time to run through a /recoverCMS procedure from start to finish, so if you need a second-stage failover, /recoverCMS isn’t the best bet.  I’m quite biased on this (see disclaimer below), but unless you can be down for a few hours if both cluster nodes fail, you might want to go with another tool to provide remote site failover in addition to SCC or CCR clustering.

7 – Finally, SCR and CCR will not automatically work with /recoverCMS.  You will need to stop SCR if it’s running before you recover, and neither will resume automatically after the recovery is done.  Once you’re set up in the new node configuration, re-enable CCR and SCR manually as required.
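
Two of the points above (matching drive letters and matching OS versions) are easy to check before you start. Here is a minimal sketch of such a pre-flight check. It assumes you have recorded the original cluster's drive letters and OS version somewhere outside the cluster; the function name and the way the values are passed in are hypothetical, and nothing here queries a real server.

```python
def preflight_problems(original_drives, current_drives, original_os, target_os):
    """Return a list of problems that would block a /recoverCMS run.

    Drive letters are compared case-insensitively. Sizes are deliberately not
    checked here, since the volumes only need to be large enough for the data.
    """
    problems = []
    have = {d.upper() for d in current_drives}
    missing = sorted({d.upper() for d in original_drives} - have)
    if missing:
        problems.append("missing volumes: " + ", ".join(missing))
    if original_os != target_os:
        problems.append(f"OS mismatch: original {original_os}, target {target_os}")
    return problems


# Hypothetical example: the original cluster used E: and F: on Server 2003.
for problem in preflight_problems(["E:", "F:"], ["e:"], "2003", "2008"):
    print("Fix before recovery:", problem)
```

An empty list does not guarantee a clean recovery, since it covers only points 2 and 5 above, but it catches the two mistakes that make the recovery fail outright.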

/RecoverCMS is a great way to restore a failed cluster system to new hardware or rebuilt hardware after a fault.  You still need to back up your data to some device outside the cluster itself, but once you have that backup /recoverCMS can get your cluster back up and running much faster than the manual methodologies used in previous versions of Exchange.

As to the funny story I mentioned at the top of the blog, this particular client was in a hardened datacenter with UPS systems, 24/7 staff and a backup generator.  They were convinced that clustering was going to be more than enough for them.  After trying to explain that a shared-disk cluster (the only option at the time) had weak points, I finally gave up and let them be.  A few months later I got a great phone call.  Apparently – unbeknownst to the client – the datacenter crew had run all power connections through the UPS – including the generator.  The UPS was rated to handle the full power load of the datacenter on 1 of its 2 redundant circuit loops.  So far so good.  Well, this particular datacenter was in the middle of the dot-com boom (this was some time ago) and had grown exponentially in a short period of time.  What they had was well over half the full expected load on each of the two circuits, and one was failing.  So they diligently got replacement parts and moved the load over to the good circuit.  Since the failing circuit was carrying over half the expected load, and the good circuit was already carrying over half the load, they immediately overloaded the UPS, shorting it out.  The way it was explained to me, a solenoid shot through the casing of the UPS…and there was indeed a nice hole in the unit to back that up. No one was hurt, but needless to say, the whole datacenter was offline until they replaced the UPS 4 days later, so they lost about one business week without anything happening to the physical cluster at all.  Just goes to show you that anything that can go wrong, will.

posted by Mike Talon at 0 Comments

Monday, April 27, 2009

The Dread Pirate Re-Seed – Part 1

Among the most common questions I get from clients about the new data-protection features in Exchange 2007 (and the soon-to-be-released Exchange 2010) is, “What is a re-seed and why does it happen?”  This mostly falls into the category of “fear of the unknown” since the technology is new, and documentation on how it works is somewhat scarce.

Re-seeds are a commonly confusing part of most protection methods, though they fall under different names and methodologies.  In a solution like Double-Take (see disclaimer below), they’re called re-mirrors or re-synchronization operations – and are typically differences only.  In a tape-backup solution it’s a restore operation, and might be everything, incremental pieces, or some combination thereof.  In Exchange 2007 these operations are called re-seeds: basically the replay of data from a server that has a “correct” copy to one that does not.  Today, we see these operations in Exchange 2007 Local Continuous Replication (LCR), Cluster Continuous Replication (CCR) and Server – or Standby – Continuous Replication (SCR).  In this post we’ll talk about CCR, and address LCR/SCR next week.

CCR allows an active node of a 2-node Active/Passive Exchange Cluster to replicate a copy of its data to the passive node.  This allows the passive node to take over with a nearly-current copy of the data if the production system fails due to hardware or software failure.  There is a log replay lag to be considered, but it’s only 50 logs that need to be applied to the passive node during a rollover event, and that does give you some measure of protection against corruption if you catch it fast enough.  Otherwise, the system acts much like a traditional Single Copy Cluster (formerly Shared Disk Cluster) in behavior, and is controlled with a combination of Windows cluster tools and PowerShell.

Whenever a log file is committed, and a new prime log (usually E00) is created, the closed log is copied over to the passive node via an SMB share, where it is held until it passes the 50 log replay limit and is then committed to the database, or a rollover occurs and the logs are committed immediately.  Exchange 2010 will move away from the SMB share, but will utilize a similar methodology overall, if the beta is to be taken at face value.

In order to get the passive node in sync with the active data, the CCR system starts with a re-seed operation.  All data from the database is copied from the active node to the passive node, as well as any non-truncated logs.  From then on, only log files are copied, as they are committed on the active node.  If all goes well, this will probably be the only re-seed you see unless you have a rollover.

If you do flip nodes – let’s say from Node A to Node B – then Node B will re-seed back to Node A if Node A becomes divergent. Node A is considered divergent if Exchange cannot determine what logs still exist on it, if the logs are inconsistent, or if some are missing.  A graceful rollover will not cause a re-seed, but most emergency rollovers will require it.

The same will happen if you haven’t rolled over, but instead Node B was offline for some other reason.  When Node B comes back online, Node A will see if all the required logs are on both machines, and then either just continue CCR protection or else initiate a re-seed to copy the data over again if anything is amiss.  The only issue here is if your backup tools purge logs while Node B is still offline.  In that case the servers will be considered divergent and need a re-seed to get back up and running properly.

Finally, if a cluster is restored from a backup (tape or otherwise) to the active node, then a re-seed must be manually initiated to re-sync the nodes properly.  You will see errors telling you to do this after the restore is complete and you bring Node A back online.

One other condition exists, but it is a manually created condition. If you perform Offline Defragmentation of the database, you will trigger a re-seed operation when Node A is brought back online.  As long as the first Exchange log is still present (which it should be) then this will happen automatically. Otherwise, it will need to be initiated manually.
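
The conditions described so far can be collected into one small decision helper. This is a sketch of my summary above and nothing more: it is not how Exchange itself decides, and the class, field and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PassiveCopy:
    """Hypothetical flags describing the state of the passive node's copy."""
    never_seeded: bool = False
    logs_missing_or_inconsistent: bool = False
    active_restored_from_backup: bool = False
    offline_defrag_on_active: bool = False


def reseed_reasons(copy):
    """Return every reason, from the conditions in this post, to expect a re-seed."""
    checks = [
        (copy.never_seeded, "initial seed of the passive node"),
        (copy.logs_missing_or_inconsistent,
         "copies are divergent (logs missing or inconsistent)"),
        (copy.active_restored_from_backup, "active node restored from backup"),
        (copy.offline_defrag_on_active, "offline defragmentation of the database"),
    ]
    return [reason for flag, reason in checks if flag]


# A graceful rollover with all logs intact: no reasons, so no re-seed expected.
print(reseed_reasons(PassiveCopy()))

# Backup software purged logs while the passive node was offline.
print(reseed_reasons(PassiveCopy(logs_missing_or_inconsistent=True)))
```

Returning a list of reasons rather than a simple yes/no makes it easier to explain to others why a full copy is about to cross the wire.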

So, why is this an issue?  Normally, it’s not, but keep in mind that re-seed operations are *full* copies of the entire database.  So if you have relatively small databases and only a few of them, this isn’t a problem.  But let’s say you have over 1 Terabyte of data in your Exchange cluster.  Re-seeding that much data locally will be time and resource consuming, and doing it over a WAN (for distributed failover clustering) could be problematic – to say the least.  So you want to avoid re-seed operations at all costs and wherever possible, which means treating the CCR cluster very carefully, and following all the best practices from Microsoft on Exchange 2007 Clustering in general.
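
To put rough numbers on that, here is a back-of-the-envelope estimate of how long a full copy takes. It is a sketch under simple assumptions: a steady link, a single efficiency factor standing in for protocol overhead and competing traffic (the 0.7 default is my guess, not a measured figure), and no allowance for disk speed or throttling.

```python
def reseed_hours(database_bytes, link_megabits_per_second, efficiency=0.7):
    """Estimate hours needed to copy a database over a link of the given speed.

    efficiency is the assumed fraction of the link actually usable for the copy.
    """
    usable_bits_per_second = link_megabits_per_second * 1_000_000 * efficiency
    seconds = database_bytes * 8 / usable_bits_per_second
    return seconds / 3600


ONE_TERABYTE = 10**12  # decimal terabyte, in bytes

# Same 1 TB re-seed over a few hypothetical link speeds.
for mbps in (10, 100, 1000):
    print(f"{mbps} Mbps: about {reseed_hours(ONE_TERABYTE, mbps):.0f} hours")
```

Even with a perfectly efficient 100 Mbps link, 1 Terabyte works out to a little over 22 hours, which is why an unplanned re-seed across a WAN is worth avoiding.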

For information on when re-seeds occur, take a look at this TechNet article.  They’re not an everyday occurrence, but you will need to be sure you know when and why they will happen to avoid confusion and frustration.

posted by Mike Talon at 0 Comments

Tuesday, August 26, 2008

To MTA or not to MTA?

Possibly the longest running debate in the history of Exchange 2003 is if you should disable the MTA Stacks service on an Exchange 2003 Cluster.  There are some very valid reasons to remove it, and one HUGE reason to leave it alone.

First, what is the MTA?  The Message Transfer Agent is a compatibility solution put in place by Microsoft to allow the movement of messages from Exchange 2000 and 2003 Servers (stand-alone and clustered) to other messaging platforms.  That could be an Exchange 5.5 Server, Lotus Notes, etc.  This is not the only method that could be used to move information between server platforms, but especially for 5.5 it is the preferred method.

You may be tempted to remove the MTA Stacks resource from a cluster, as it's one more resource that could go sideways on you, and it does offer an additional attack surface for those who would try to hack your system.  You may also try to remove it when you move to a pure Exchange 2003 environment, as such a configuration would seemingly not require it at all.

In theory, you'd be technically correct.  Reducing potential problems and removing avenues to attack are generally considered good things.  But an explicit ruling from Microsoft on the matter and the pain caused by the operations required to reinstall it if you need it in future can make you think twice.

The Microsoft Knowledge Base is pretty explicit that removing the MTA is not supported.  You can read about it here:

KB 810489 "MTA Stacks service supportability guidelines for Exchange 2000 Server and Exchange Server 2003"

Within that KB, along with the explicit note that removing the MTA is a bad idea, is the problem of reinstallation later on.  Most notably, if you ever want to re-install the MTA Resource, you have to delete the Exchange Virtual Server entirely (by deleting the SA Resource) and reinstall the whole EVS.  So, if you have to call into support, they'll tell you that you need the MTA resource in place. And in order to put it back in place, you have to first destroy the live clustered Exchange server.  That's a catch-22 you can avoid by not removing the MTA at all.

Long story short, keeping the MTA does offer a potential avenue for attack, but removing it creates an absolute headache if you need support later on.  For now, at least, leaving the MTA in place is a much better option - just make sure you have set up your firewall to block potential attacks!

posted by Mike Talon at 0 Comments

Monday, August 11, 2008

Quite a stretch for clustering in 2008

Microsoft Clustering Services (MSCS) have existed in one form or another since NT4, but have always suffered from a significant limitation.  All cluster resources had to exist within the same logical subnet, or else you couldn't create the cluster itself.  Windows Server 2008 allows for some flexibility in that regard, with the ability to create nodes of a contiguous cluster in different logical subnets.

Before we dive too far into that, you may want to see the official MSFT information here:

http://technet.microsoft.com/en-us/library/cc770625.aspx

So what does this mean for you and me?  It means that we can create CCR clusters on Exchange 2007 that stretch between physical locations and subnets.  However, to do this you'll need to be on Server 2008; the function just isn't available in Server 2003. This allows you to provide basic availability for Mailbox Role (MBX) servers in your Exchange environment, but doesn't take care of everything when it comes to DR planning.

First off, this applies only to Exchange 2007 Enterprise Edition, and then only to MBX role servers.  While most other roles are natively fault-tolerant with multiple servers installed with the same role able to stand in for each other, organizational or regulatory rules might not make that kind of redundancy possible.  Edge servers are the biggest example of this.  Since they're not tied into the domain structures, they don't contain any way to quickly flip traffic from one Edge server to another in different sites.  Third-party tools (see disclaimer below) can often take care of that function for you, as can working with your DNS provider to facilitate moving the MX records in the event of an emergency.

If you have legacy Exchange 2000/2003 servers, you're also not able to take advantage of this new MSCS feature set on those boxes.  The same goes with any non-Exchange tools, like SQL, anti-virus servers, anti-spam servers, etc.  Even if these servers run on Server 2008 clusters, they'll require some third-party intervention to handle the data replication for those systems.  This would include things like Blackberry servers, GoodLink systems and other non-Exchange remote email tools.

Finally, keep in mind that CCR clusters can only extend to Active/Passive, 2-node configurations. That will mean you can't use these solutions if you need to go beyond that model - which Exchange 2007 easily allows without CCR involved.

Server 2008 Failover Clustering is a great method for basic High Availability for Exchange 2007 CCR systems - even across subnets.  With some additional tools, it can become the center of an Exchange DR solution set that can help your organization withstand even site-wide emergencies.

posted by Mike Talon at 0 Comments

Friday, July 11, 2008

Dial-tone revisited

The theory of Dial-Tone Recovery (DTR) is one that has often been overlooked in the world of Disaster Recovery (DR) for Exchange Server.  However, even in Exchange 2007, DTR can provide a great method for immediate restoration of email services, though with a few things to keep in mind.

For those who haven't heard of DTR before, here's a primer:

If a primary Exchange 2000, 2003 or 2007 server fails, you can attempt to restore services by deleting the corrupted databases and re-starting Exchange services.  This will create blank databases and allow users to send and receive new mail, access new calendar items and access all shared contact information.  Running a /disasterrecovery install of Exchange on a rebuilt box with no data will do the same thing.  Though end-users can access their email systems again, there will be no historical data, so this isn't a total solution set for true DR, but gives you some options for immediate availability.

In an emergency, this can give you time to perform restoration steps - which could take quite a while to finish - without making everyone wait to get back basic send/receive capability. You can restore a copy of the historical data via several methods, taking the time you need to do it right.

If you used a brick-level tape or disk backup solution, you can restore mailboxes via that tool's recovery system. Archiving solutions and Continuous Data Recovery systems (like TimeData from Double-Take - see disclaimer below) can let you move mailbox, folder and other data back over time as well.  If neither of those tools is at your disposal, but you do have a backup of the database and logs, you can restore those to a Recovery Storage Group, and use ExMerge or Exchange 2007 tools to bring back mailbox data and merge it with the new information on the DTR-recovered server.

If no backup is available at all, you can still provide Exchange services from the point of DTR onward.  While no historical info will be available to them, the end-users will be able to send and receive new email, calendar entries and Public Folder data.

posted by Mike Talon at 0 Comments