Timing is everything.
A week ago, I got a late-night phone call about a problem with an Exchange server that seemed to be related to an expired certificate: the admin had replaced the expired cert on one member of a two-node DAG, but not the other. He noticed the errors in the event log while troubleshooting a seemingly unrelated problem, installed the new cert, and then boom! Bad stuff started happening. The trouble was that the reported problem, inbound SMTP not flowing from a hosted filtering service that doesn’t use TLS, didn’t seem likely to involve certificate expiration at all. By the time he called me, he had installed the new certificate and rebooted the affected server, and all seemed to be well.
Fast forward to Sunday night. I’d planned to patch these same servers to get them on to Exchange 2010 SP2 UR3, in part because I’d noticed a worrisome number of events generated by the MSExchange ADAccess service, chiefly event ID 2112:
Process MSEXCHANGEADTOPOLOGYSERVICE.EXE (PID=8356). The Exchange computer HQ-EX02.blahblah does not have Audit Security Privilege on the domain controller HQ-DC01.blahblah. This domain controller will not be used by Exchange Active Directory Provider.
This was immediately followed by MSExchange ADAccess event ID 2102 with the rather ominous message that
Process MSEXCHANGEADTOPOLOGYSERVICE.EXE (PID=8356). All Domain Controller Servers in use are not responding:
However, the event ID 2080 logged by ADAccess indicated that all but one of the GCs were up and providing the necessary services, including confirming that their SACLs allowed Exchange the necessary access. I couldn’t puzzle it out in the time I had allotted, so I decided to take a backup (see rule #3) and wait to tackle the patching until I could be physically present. That turned out to be a very, very good idea.
Last night, I sat down to patch the affected systems. I began with the passive DAG node, updating it to SP2 and then installing UR3. I half-thought that this process might resolve the cause of the errors (see rule #2), but after a reboot I noticed they were still being logged. I suspected that the reported 2102 errors might be bogus, since I knew all of the involved GCs were running and available. As I started to dig around, I learned that this error often appears when there’s a problem with permissions; to be more specific, this SCOM article asserts that the problem is that the Exchange server(s) don’t have the SeSecurityPrivilege user right on the domain controllers. However, I was still a little skeptical. I checked the default DC GPO and, sure enough, the permissions were present, so I moved on to do some further investigation.
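If you want to check that right yourself, one quick way (a sketch only; the export path is arbitrary, and this has to be run on the DC with admin rights) is to dump the local security policy with secedit and look for SeSecurityPrivilege:

```
:: Export the user-rights portion of the DC's local security policy
secedit /export /cfg C:\temp\secpol.inf /areas USER_RIGHTS

:: Show who holds "Manage auditing and security log" (SeSecurityPrivilege);
:: the SID of the Exchange Servers group should appear in this list
findstr /C:"SeSecurityPrivilege" C:\temp\secpol.inf
```

Note that this shows the policy as applied on that DC, which is exactly what you care about, not what any single GPO says.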
Another possible cause is that the Exchange servers’ computer accounts aren’t in the Exchange Servers group, or that permissions on that group were jacked up somehow, but they appeared to be fine so I discounted that as a likely cause.
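Checking that one is quick; for example, from any domain-joined machine:

```
:: List the members of the Exchange Servers group; both DAG members'
:: computer accounts should show up here
net group "Exchange Servers" /domain
```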
Along the way I noticed that the FBA service wouldn’t start, but its error message was meaningless: all I got from the service control manager was a service-specific error code that resisted my best attempts to Bing it. Without that service, of course, you can’t use OWA in FBA mode, which would be a problem, so I made a mental note to dig into that later.
A little more searching turned up this article, which is dangerously wrong: it suggests adding the Exchange computer accounts to the Domain Admins security group. Please, please, don’t do this; not only does it not fix the problem, it can cause all sorts of other tomfoolery that you don’t want to have to deal with.
Still more digging revealed two common problems that were present on this server: the active NIC wasn’t first in the binding order, and IPv6 was disabled on the two enabled NICs. Now, you and I both know that IPv6 isn’t required to run Exchange, but Microsoft does not support disabling or removing IPv6 on Windows servers. And you know what they say about what “unsupported” means! So, I enabled IPv6 on the two adapters and got the binding order sorted out, then bounced the AD topology service and…
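Keep in mind that IPv6 can be disabled in two places: the per-adapter binding (the checkbox in the adapter’s properties) and the global DisabledComponents registry value. A quick way to check the latter (the value may simply not exist, which is fine; that means nothing has been disabled):

```
:: A nonzero DisabledComponents value means some IPv6 components are off;
:: 0xFFFFFFFF is the common "disable everything" setting you don't want
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents
```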
… voila! Everything seemed to be working normally, so I ran some tests to verify that *over was working as it should, then started patching the DAG primary, only to have setup fail partway through. Upon reboot, w3svc was caught in an endless loop of trying to load some of the in-proc OWA DLLs, and it kept retrying until I power-cycled the server. The real problem was that the Active Manager service on that node was starting, so the current active node would try to sync with it before mounting its copy of the DAG databases, but it never got an answer. Net result: no mounted databases on either server, and an unhappy expression on my face as the clock ticked past 11pm.
I put the primary member server in safe mode, then set the Exchange and w3svc services to manual start and rebooted it. Rather than spend a lot of time trying to pin down exactly what happened, I ran setup in recovery mode; it installed the binaries, after which the services restarted normally. I did a switchover back to the original primary node, verified mail flow, and went home. Life was good.
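For reference, recovery mode is just a setup switch; it reads the server’s configuration back out of Active Directory and lays the binaries down again (run it from media matching the installed version, SP2 in this case):

```
:: Rebuild the Exchange installation from the configuration stored in AD
setup.com /m:RecoverServer
```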
Until, that is, this morning, when I got an e-mail: “OWA is down.” I checked the servers and, sure enough, the errors were back and the FBA service was again refusing to start. After some creative swearing, I once again started digging around in the guts of the server. I couldn’t shake the feeling that this was a legitimate permissions problem of some kind.
At that point, I found this article, which pointed out something critical about GPOs: you have to check the effective policy, not just the one defined in the default policy. Sure enough, when I used RSoP to check the effective policy on the DCs, the Exchange servers did not have SeSecurityPrivilege on the DCs: there was a separate GPO controlling audit logging permissions, and it had recently been changed to remove the Exchange Servers group. That was easy to fix: I added the Exchange Servers group back to the GPO, ran gpupdate, rebooted the passive node, and found that the services all started normally and ran without error. A quick switchover let me restart the topology service on the primary DAG member, after which it too ran without errors. End result: problem solved.
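The same lesson in command form: gpresult shows you the policy that actually won, and gpupdate applies your fix without waiting for the next refresh interval (the report path here is arbitrary):

```
:: Generate an RSoP report on the DC and check which GPO wins for
:: "Manage auditing and security log" (SeSecurityPrivilege)
gpresult /h C:\temp\rsop.html /scope computer

:: After correcting the GPO, force a policy refresh
gpupdate /force
```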
It’s still not entirely clear to me why that particular service needs to have the SeSecurityPrivilege right assigned. I’m trying to find that out and will update this post once I do. In the meantime, if you have similar symptoms, check to verify that the effective policy is correct.