Newbie questions - 2008R2 2-Node Cluster and Failures
I'm trying to troubleshoot the sequence of events from an outage on a two-node Windows Server 2008 R2 MSCS-based cluster (we have an IP address and a SQL Server instance clustered). I'll refer to the nodes as node05 and node06.

Both nodes run on VMware ESXi 5.x, with the database and quorum disks attached as VMware RDMs to an IBM XIV array over Fibre Channel. node06's RDMs are set to use fixed-path addressing, while node05's RDMs are (incorrectly) set to use round-robin multipathing (we are working to correct this). Each node runs in a different blade center, and there is a private heartbeat network.

At the beginning of the event, node05 was the primary node.
At 22:08:20, both nodes report that node05 was removed from active failover cluster membership. Twenty-three seconds later, both nodes report that disks were "unexpectedly lost" from their respective node. These errors continue for more than two minutes before things seem to come back online.

Our investigation shows that, at least from the ESX perspective, connectivity to the SAN LUNs was never lost, and SAN monitoring shows nothing "dropping". In addition, I don't see anything in the OS/System event logs indicating that storage was lost; the disk errors show up only in the cluster logs. So I don't believe any sort of SAN disruption triggered the event, but I want to make sure that theory fits with how a two-node MSCS cluster actually functions.

I'm theorizing that the node-removal event (possibly triggered by a network disruption) occurred first and may have triggered SCSI-3-based "fencing", which would have resulted in the disks appearing unavailable on both nodes even though the SAN was still up. However, my understanding is that the SCSI reservation requests and the subsequent SCSI reset that occur in a "split" happen at staggered intervals (three seconds for the "primary" node and seven seconds for "challenger" nodes), so this should have been resolved quickly, not in the two-plus minutes we saw.
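To check my own understanding of that challenge/defense timing, I put together a toy walk-through of one arbitration round. This is only a sketch of the model as I understand it, not Microsoft's implementation; the 3-second defense and 7-second challenge intervals come from the description above, and the function name and the way the "bus reset" is modeled are my own assumptions:

```python
# Toy model of disk arbitration after a cluster "split".
# Assumption (mine): the defending node renews its reservation every
# DEFENSE_INTERVAL seconds, while a challenger breaks the reservation
# and waits CHALLENGE_WAIT seconds before re-checking the disk.

DEFENSE_INTERVAL = 3   # seconds between the owner's reservation renewals
CHALLENGE_WAIT = 7     # seconds a challenger waits after breaking the reservation

def arbitrate(defender_alive: bool) -> str:
    """Report who owns the disk after one challenge round."""
    # Challenger issues a bus reset at t=0, clearing the reservation.
    reservation_held = False
    # Walk the timeline second by second until the challenger re-checks.
    for t in range(1, CHALLENGE_WAIT + 1):
        # A live defender re-places its reservation on its next 3s tick.
        if defender_alive and t % DEFENSE_INTERVAL == 0:
            reservation_held = True
    # At t = CHALLENGE_WAIT the challenger inspects the disk.
    if reservation_held:
        return "defender keeps the disk; challenger backs off"
    return "challenger reserves the disk and brings it online"

print(arbitrate(defender_alive=True))   # a healthy owner defends within ~3s
print(arbitrate(defender_alive=False))  # a dead owner loses the disk in ~7s
```

Either way the arbitration should settle within roughly one challenge interval, which is exactly why the two-plus minutes of disk errors doesn't add up for me.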
Can anyone confirm whether I'm on the right track with this thinking? Or could someone describe how a typical failure scenario would play out if the heartbeat network were disrupted for a period of time?
First, does this configuration pass the cluster storage validation tests? If not, that needs to be sorted out before doing anything else.
Next, node05 being removed from active failover cluster membership is an indication that there was some sort of network connectivity issue between the nodes. For whatever reason, node06 lost connectivity to node05, and, assuming node06 owned the witness resource and was therefore able to maintain quorum, that caused node05 to drop out of the cluster.
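To make the quorum arithmetic concrete, here is a minimal sketch of the Node and Disk Majority model that a two-node 2008 R2 cluster typically runs. The vote counts (one per node, one for the disk witness) are the standard model; the function and variable names are just mine for illustration:

```python
# Node and Disk Majority: each node gets one vote, the disk witness gets one.
# A partition stays up only if it holds a strict majority of total votes.

def partition_survives(nodes_in_partition: int, holds_witness: bool,
                       total_nodes: int = 2, witness_votes: int = 1) -> bool:
    total_votes = total_nodes + witness_votes            # 2 nodes + witness = 3
    partition_votes = nodes_in_partition + (1 if holds_witness else 0)
    return partition_votes > total_votes // 2            # strict majority required

# node06 alone but holding the disk witness: 2 of 3 votes -> stays up.
print(partition_survives(1, holds_witness=True))   # True
# node05 alone without the witness: 1 of 3 votes -> drops out of the cluster.
print(partition_survives(1, holds_witness=False))  # False
```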
When node05 drops out of the cluster, the other cluster node forcefully attempts to take ownership of the resources node05 owned. That will cause the event logs on node05 to fill with errors about the reservation being lost and disk errors about the host being unable to flush its cache to disk. That is expected in this situation.

You shouldn't, however, see the same loss-of-reservation messages on node06; that is not expected behavior. I'd take a look at the cluster logs to see what might have happened while that node was attempting to bring the disks online.
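On 2008 R2 you can regenerate the text cluster log with "cluster log /g" and then narrow it to the window around the incident. A throwaway filter like the one below is one way to line up both nodes' logs side by side; the timestamp pattern matches the cluster.log layout I'm used to seeing (PID.TID::YYYY/MM/DD-HH:MM:SS.mmm LEVEL ...), but verify it against your own files, and the window values are of course just examples:

```python
# Throwaway filter: print ERR/WARN lines from a generated cluster.log
# that fall inside a time window around the incident.
import re
import sys

START, END = "22:07:00", "22:12:00"   # window around the 22:08:20 event
LINE = re.compile(r"::\d{4}/\d{2}/\d{2}-(\d{2}:\d{2}:\d{2})\.\d+\s+(ERR|WARN)\b")

with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if m and START <= m.group(1) <= END:
            print(line.rstrip())
```

Run it as "python filter_log.py cluster.log" for each node. One caveat: if I recall correctly, cluster.log timestamps are written in GMT, so you may need to shift the window relative to the local times in the System event log.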