Sep 1, 2011

Cannot add the 2nd Node on the Cluster

This kind of problem is common and the solution can vary, but the troubleshooting approach below applies to many cases and may help you.
The environment is a two-node Windows Server 2008 R2 SP1 cluster. One server (let's call it 'XYZ02') was evicted and now needs to be added again, but the add operation is not working.

The Add Node Wizard runs and then shows the following report:
Node:  xyz02.sa.com.br
Started 8/10/2011 4:48:54 PM
Completed 8/10/2011 4:55:16 PM

Adding xyz02.sa.com.br to the cluster.
Validating cluster state on node xyz02.
Getting current node membership of cluster bldv01
Adding node afgbrosabld03 to Cluster configuration data.
Validating installation of the Network FT Driver on node xyz02.
Validating installation of the Cluster Disk Driver on node xyz02.
Configuring Cluster Service on node xyz02.
Waiting for notification that Cluster service on node xyz02.sa.com.br has started.
Waiting for notification that node afgbrosabld03 is a fully functional member of the cluster.
Unable to successfully cleanup.
The server 'xyz02.sa.com.br' could not be added to the cluster.
An error occurred while adding node 'xyz02.sa.com.br' to cluster 'bldv01'.
This operation returned because the timeout period expired



Let's check the Cluster environment:

NAME: xyz02
Operating System. . . . : Windows Server 2008 Datacenter R2 SP1
Roles. . . . . . . . . . . . . : Hyper-V, Failover Cluster
Anti-virus . . . . . . . . . . : SEP 11.0.6005
Virtualized. . . . . . . . . : NO

NAME: xyz01
Operating System. . . . : Windows Server 2008 Datacenter R2 SP1
Roles. . . . . . . . . . . . . : Hyper-V, Failover Cluster
Anti-virus . . . . . . . . . . : SEP 11.0.6005
Virtualized. . . . . . . . . : NO

Cluster Name: BLDV01
After you receive the message above, running the Add Node Wizard again displays the following message: 'The computer "xyz02" is joined to the Cluster'
Solution

    · First action: run the validation with all tests against both nodes to confirm that the nodes are 100% OK. If you find any error or warning, try to fix it before continuing the troubleshooting.
    · Check the Event Viewer. In this case we found:
    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          8/10/2011 5:02:59 PM
    Event ID:      1090
    Task Category: Startup/Shutdown
    Level:         Critical
    User:          SYSTEM
    Computer:      xyz02.sa.com.br
    Description: The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error '2'. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.
    

    · To fix the Critical error logged in Event Viewer, and to make sure the node can go through the Add Node Wizard again, run a forced evict from the command line:
    
    Cluster <ClusterName> node <NodeName> /force
    
    Use the article "How to Evict a Node from a Windows Server 2008 Failover Cluster" - http://technet.microsoft.com/en-us/library/bb676524%28EXCHG.80%29.aspx
    
    
    · The following actions were taken:
        ○ TCP/IP parameters set to Enabled (RSS, TCPA, Chimney).
        ○ Windows Firewall was disabled on both servers.
        ○ IPv6 was disabled on all NICs on all servers.
        ○ The SEP antivirus was removed from node XYZ02.
    
    · The customer informed us that, at this point, he couldn't remove the SEP antivirus from node XYZ01.
    
    · Checking the Cluster.log from XYZ02
 
b20:578.08/11[14:42:14.967](000000) INFO  [CS] Cluster Service started
b20:578.08/11[14:42:15.888](000000) INFO  [API] RpcServerRegister => 0
b20:578.08/11[14:42:15.903](000000) ERR   [API] DmQueryString failed to retrieve the security   descriptor status 2, default security descriptor will be used for authorizing client connections
b20:578.08/11[14:42:15.935](000000) INFO  [CS] Reporting to SCM that cluster service has started.
b20:9f0.08/11[14:42:18.228](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:504.08/11[14:42:20.225](000000) WARN  [NETFTAPI] Failed to query parameters for 169.254.70.253 (status 80070490)
b20:504.08/11[14:42:20.225](000000) WARN  [NETFTAPI] Failed to query parameters for 169.254.70.253 (status 80070490)
b20:504.08/11[14:42:20.225](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:504.08/11[14:42:20.942](000000) INFO  [CHANNEL 10.72.0.157:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
b20:504.08/11[14:42:20.974](000000) WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.157:~3343~ is closed'
b20:b50.08/11[14:43:17.931](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:be0.08/11[14:43:19.225](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:be0.08/11[14:43:21.222](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:be0.08/11[14:43:21.924](000000) INFO  [CHANNEL 10.72.0.156:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
b20:be0.08/11[14:43:21.924](000000) WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.156:~3343~ is closed'
b20:b50.08/11[14:44:18.926](000000) WARN  cxl::ConnectWorker::operator (): HrError(0xd0000043)' because of '::NetftAddRoute( handle.handle, netFtRoute.get(), &netftSecurityContext )'
b20:9f0.08/11[14:44:18.958](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:9f0.08/11[14:44:18.958](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:be0.08/11[14:44:24.230](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:9f0.08/11[14:44:26.227](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:9f0.08/11[14:45:19.939](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:be0.08/11[14:45:21.234](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:884.08/11[14:45:22.934](000000) INFO  [CHANNEL 10.72.0.156:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
b20:884.08/11[14:45:22.934](000000) WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.156:~3343~ is closed'
b20:884.08/11[14:45:23.231](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:be0.08/11[14:46:20.951](000000) WARN  cxl::ConnectWorker::operator (): HrError(0xd0000043)' because of '::NetftAddRoute( handle.handle, netFtRoute.get(), &netftSecurityContext )'
b20:884.08/11[14:46:20.967](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:884.08/11[14:46:20.967](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:0e8.08/11[14:46:26.224](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:0e8.08/11[14:46:28.236](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:be0.08/11[14:47:21.932](000000) WARN  cxl::ConnectWorker::operator (): HrError(0xd0000043)' because of '::NetftAddRoute( handle.handle, netFtRoute.get(), &netftSecurityContext )'
b20:b9c.08/11[14:47:21.963](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:b9c.08/11[14:47:21.963](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:b9c.08/11[14:47:27.236](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:b9c.08/11[14:47:29.233](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:0e8.08/11[14:48:22.928](000000) WARN  cxl::ConnectWorker::operator (): HrError(0xd0000043)' because of '::NetftAddRoute( handle.handle, netFtRoute.get(), &netftSecurityContext )'
b20:b9c.08/11[14:48:22.960](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:b9c.08/11[14:48:22.960](000000) WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.2.243 (status 80070490)
b20:b9c.08/11[14:48:28.232](000000) DBG   [NETFTAPI] received NsiParameterNotification  for fe80::2445:9c2a:cf4e:46fd (IpDadStatePreferred )
b20:0e8.08/11[14:48:30.229](000000) DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.243 (IpDadStatePreferred )
b20:b9c.08/11[14:48:30.900](000000) ERR   [QUORUM] Node 2: Fail to form/join a cluster in 6:15.000
b20:b9c.08/11[14:48:30.900](000000) ERR   join/form timeout (status = 258)
b20:b9c.08/11[14:48:30.900](000000) ERR   join/form timeout (status = 258), executing OnStop
b20:b9c.08/11[14:48:30.978](000000) ERR   FatalError is Calling Exit Process.
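The numeric status values in the log above are easier to read once decoded. The following is a minimal Python sketch and was not part of the original troubleshooting. `decode_status` and the lookup table are hypothetical helpers, and the table lists only codes that appear in this post, with their Windows SDK names. It assumes that eight-digit values such as 80070490 are hexadecimal HRESULTs and that shorter values are decimal Win32 codes.

```python
# Win32 error codes seen in this post's logs (names as defined in winerror.h).
WIN32_NAMES = {
    0: "ERROR_SUCCESS",
    2: "ERROR_FILE_NOT_FOUND",
    21: "ERROR_NOT_READY",
    258: "WAIT_TIMEOUT",
    1168: "ERROR_NOT_FOUND",
    1226: "ERROR_GRACEFUL_DISCONNECT",
}

FACILITY_WIN32 = 7          # HRESULT facility used for wrapped Win32 errors
FACILITY_NT_BIT = 0x10000000  # set when an HRESULT wraps an NTSTATUS value


def decode_status(text):
    """Return a readable form of a status value copied from cluster.log.

    Assumption: values with a 0x prefix or exactly eight digits are hex
    HRESULTs; anything else is a decimal Win32 error code.
    """
    raw = text.strip().lower()
    is_hex = raw.startswith("0x") or len(raw) == 8
    value = int(raw, 16) if is_hex else int(raw)

    if is_hex:
        if value & FACILITY_NT_BIT:
            # HRESULT built from an NTSTATUS: clear the NT bit to recover it.
            return "NTSTATUS 0x%08X" % (value & ~FACILITY_NT_BIT)
        if (value >> 16) & 0x7FF != FACILITY_WIN32:
            return "HRESULT 0x%08X" % value
        # Win32 facility: the low 16 bits carry the Win32 error code.
        value &= 0xFFFF

    return "%s (Win32 %d)" % (WIN32_NAMES.get(value, "unknown"), value)
```

Read this way, the "status 80070490" warnings are "element not found" results, and "status = 258" on the join/form line is a plain wait timeout, which matches the timeout reported by the wizard.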


    · Checking the Cluster.log from XYZ01

304:350.08/10[16:49:02.864](000000) ERR   [IM] Unable to find adapter
dc4:178.08/10[16:49:02.864](000000) WARN  [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface c810ac97-6966-472f-b382-968ee1f06d0b changed to state 3.
304:438.08/10[16:49:06.811](000000) WARN  [FTI][Initiator] Ignoring duplicate connection: usable route already exists
304:438.08/10[16:49:06.811](000000) INFO  [CHANNEL 10.72.0.159:~64177~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:438.08/10[16:49:06.811](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64177~ is closed'
304:ab8.08/10[16:50:02.816](000000) INFO  [CHANNEL 10.72.0.159:~64278~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:ab8.08/10[16:50:02.816](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64278~ is closed'
304:350.08/10[16:50:06.809](000000) ERR   [IM] Unable to find adapter
dc4:178.08/10[16:50:06.809](000000) WARN  [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface c810ac97-6966-472f-b382-968ee1f06d0b changed to state 3.
304:350.08/10[16:51:02.814](000000) ERR   [IM] Unable to find adapter
dc4:178.08/10[16:51:02.814](000000) WARN  [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface c810ac97-6966-472f-b382-968ee1f06d0b changed to state 3.
304:ae8.08/10[16:51:06.808](000000) WARN  [FTI][Initiator] Ignoring duplicate connection: usable route already exists
304:ae8.08/10[16:51:06.808](000000) INFO  [CHANNEL 10.72.0.159:~64430~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:ae8.08/10[16:51:06.808](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64430~ is closed'
dd8:498.08/10[16:51:58.460](000000) ERR   [RHS] RhsCall::Perform_NativeEH: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQ returned 21.'
dd8:498.08/10[16:51:58.491](000000) ERR   [RHS] RhsCall::Perform_NativeEH: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'
304:514.08/10[16:52:02.812](000000) INFO  [CHANNEL 10.72.0.159:~64465~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:514.08/10[16:52:02.812](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64465~ is closed'
304:350.08/10[16:52:06.806](000000) ERR   [IM] Unable to find adapter
dc4:178.08/10[16:52:06.806](000000) WARN  [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface c810ac97-6966-472f-b382-968ee1f06d0b changed to state 3.
304:02c.08/10[16:53:02.811](000000) INFO  [CHANNEL 10.72.0.159:~64587~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:02c.08/10[16:53:02.811](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64587~ is closed'
304:02c.08/10[16:54:02.809](000000) INFO  [CHANNEL 10.72.0.159:~64662~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:02c.08/10[16:54:02.809](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64662~ is closed'
304:b88.08/10[16:55:02.807](000000) INFO  [CHANNEL 10.72.0.159:~64762~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
304:b88.08/10[16:55:02.807](000000) WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.159:~64762~ is closed'
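Reading two logs like these by eye is tedious, so a small script can count entries per level and list the remote endpoints whose channels were closed. This is a rough Python sketch under the assumption that every line follows the layout shown in the excerpts above (process:thread, date, time, level, message). `summarize`, `LINE_RE`, and `ENDPOINT_RE` are hypothetical names, and the sample lines are copied from the XYZ02 log in this post.

```python
import re
from collections import Counter

# Layout assumed from the excerpts: pid:tid.MM/DD[HH:MM:SS.mmm](000000) LEVEL message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+):(?P<tid>[0-9a-f]+)\.(?P<date>\d{2}/\d{2})"
    r"\[(?P<time>[\d:.]+)\]\(\d+\)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
# Matches "channel to remote endpoint 10.72.0.157:~3343~ is closed"
ENDPOINT_RE = re.compile(r"remote endpoint (\S+):~(\d+)~")


def summarize(lines):
    """Count log entries per level and closed channels per (address, port)."""
    levels = Counter()
    closed = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines or anything that is not a log entry
        levels[match.group("level")] += 1
        endpoint = ENDPOINT_RE.search(match.group("msg"))
        if endpoint:
            closed[(endpoint.group(1), int(endpoint.group(2)))] += 1
    return levels, closed


SAMPLE = [
    "b20:504.08/11[14:42:20.942](000000) INFO  [CHANNEL 10.72.0.157:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)",
    "b20:504.08/11[14:42:20.974](000000) WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.72.0.157:~3343~ is closed'",
    "b20:b9c.08/11[14:48:30.900](000000) ERR   join/form timeout (status = 258)",
]

levels, closed = summarize(SAMPLE)
```

Running the same function over the full Cluster.log of each node gives a quick view of which peer address keeps showing up in the closed-channel warnings.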

    · Based on the logs, we found that XYZ01 couldn't receive any data from node XYZ02 (the remote endpoint errors point to the IP address of server XYZ02).
    · As a solution, we asked the customer to remove SEP from server XYZ01 and then try again.
    · The customer removed the SEP antivirus and we ran the evict cleanup described above.
    · After that, we were finally able to add the server to the cluster.

Conclusion: the main issue was that the SEP antivirus blocked the communication between the nodes (TCP/UDP 3343).
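Since the cluster traffic in question uses port 3343, a quick reachability probe from one node to the other can be a useful first check. The following is a minimal Python sketch, not something used in the original case. `tcp_port_open` is a hypothetical helper, and it tests TCP only, so it says nothing about UDP 3343. Note also that in the logs above the channels were opened and then closed, so a successful connect does not prove that a security product lets cluster traffic through.

```python
import socket


def tcp_port_open(host, port=3343, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    A refused, filtered, or timed-out connection (or a name that does not
    resolve) returns False. UDP is not tested.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (run from XYZ01 while the Cluster service is listening on XYZ02):
#   tcp_port_open("xyz02.sa.com.br")
```

A False result from both directions while the Cluster service is running on the target is a hint to look at firewalls or endpoint protection before digging deeper into the cluster configuration.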


Related Articles
===============================
Windows Server 2008 Failover Clusters: Networking (Part 1)
http://blogs.technet.com/b/askcore/archive/2010/02/12/windows-server-2008-failover-clusters-networking-part-1.aspx
"How to Evict a Node from a Windows Server 2008 Failover Cluster" - http://technet.microsoft.com/en-us/library/bb676524%28EXCHG.80%29.aspx

2 comments:

  1. In my case, the problem arose because the cluster network Live Migration was off
    http://blog.it-kb.ru/2014/10/31/server-could-not-be-added-to-the-cluster-hyper-v-waiting-for-notification-that-node-is-a-fully-functional-member-of-the-cluster-error-code-is-0x5b4-unable-to-successfully-cleanup/

  2. First of all thanks to Chang for publishing this post.

    I'm not very much aware in Networking and its cards.

    We faced a similar issue and resolved it by enabling IPv6 on the NIC, which the cluster uses by default to communicate with the nodes for the heartbeat (UDP, remote endpoint port 3343).

    https://blogs.technet.microsoft.com/askcore/2012/07/09/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership/

    Happy Learning.