Jun 24, 2011

Cluster Windows 2008 Error ID 1034 or 1069

 

A few days ago I worked on a very interesting cluster case.
The main issue was:
Issue Definition
After server “HOST02” is rebooted, an error event with ID 1034 is logged in the System log.

Environment
---------------------------
   NAME:HOST02
   Operating System . . . . : Windows Server 2008 R2 Datacenter SP1
   Roles . . . . . . . . . .: Cluster Server
   Applications . . . . . ..: Hyper-V

The System log shows the following messages:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1034
Level: Error
User: SYSTEM
Task Category: Physical Disk Resource
Node: HOST02.sankyocorp.com.br
Description: Cluster physical disk resource 'Cluster Disk 5' cannot be brought online because the associated disk could not be found. The expected signature of the disk was 'FAED5952'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.


Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1034
Level: Error
User: SYSTEM
Task Category: Physical Disk Resource
Node: HOST02.sankyocorp.com.br
Description: Cluster resource 'Cluster Disk 5' in clustered service or application 'Cluster Group' failed.

    - The disk signature was checked and validated through Diskpart. No problem was found.
    - Tested the cluster quorum: the Q: drive was online and working. Moved it to the other node and it was OK.
    - The error message is logged only during the restart process.
    - Collected MSDT data and checked the Cluster.log:

778:7a4.06/16[11:10:02.347](000000) INFO  [CS] Cluster Service started << Start Cluster Service
778:7a4.06/16[11:10:02.363](000000) WARN  [DM] Node 2: Failed to unload restored hive from the registry with error STATUS_INVALID_PARAMETER(c000000d)
778:7a4.06/16[11:10:02.472](000000) DBG   [DM] Delete hive C:\Windows\Cluster\CLUSDB.bak failed: ERROR_FILE_NOT_FOUND(2)
778:7a4.06/16[11:10:02.472](000000) DBG   [DM] Delete logger files for hive at C:\Windows\Cluster\CLUSDB.bak failed: ERROR_FILE_NOT_FOUND(2)
778:7a4.06/16[11:10:02.488](000000) INFO  [API] RpcServerRegister => 0
778:7a4.06/16[11:10:02.488](000000) WARN  [API] Failed to write SPNs to node's computer object - status 1355
778:748.06/16[11:10:02.597](000000) INFO  [NODE] Node 2: New join with n1: stage: 'Authenticate Initial Connection' status HrError(0x80090311) reason: '[SV] Authentication failed'
778:7a4.06/16[11:10:02.597](000000) INFO  [CS] Reporting to SCM that cluster service has started.
778:79c.06/16[11:10:02.612](000000) WARN  [JPM] Node 2: Exception WSANO_DATA(11004)' because of 'GetAddrInfo failed' while trying to query DNS for node names
778:748.06/16[11:10:02.644](000000) WARN  cxl::ConnectWorker::operator (): HrError(0x80090311)' because of '[SV] Authentication or Authorization Failed'
778:748.06/16[11:10:06.512](000000) INFO  [NODE] Node 2: New join with n1: stage: 'Authenticate Initial Connection' status HrError(0x80090311) reason: '[SV] Authentication failed'
778:748.06/16[11:10:06.512](000000) WARN  cxl::ConnectWorker::operator (): HrError(0x80090311)' because of '[SV] Authentication or Authorization Failed'
bd8:bec.06/16[11:10:07.776](000000) ERR   [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQ returned 21.'
778:70c.06/16[11:10:07.776](000000) WARN  [RCM] Failed to load restype 'MSMQ': error 21.
bd8:bec.06/16[11:10:07.792](000000) ERR   [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'
778:70c.06/16[11:10:07.792](000000) WARN  [RCM] Failed to load restype 'MSMQTriggers': error 21.
988:9b4.06/16[11:10:08.478](000000) WARN  [RES] Physical Disk <Cluster Disk 5>: Open: invalid device number!
988:a84.06/16[11:10:08.962](000000) ERR   [RES] Physical Disk <Cluster Disk 5>: Failed to located disk with id, no *Number* FAED5952
778:748.06/16[11:10:08.977](000000) ERR   [RCM] Arbitrating resource 'Cluster Disk 5' returned error 1168
778:748.06/16[11:10:08.977](000000) ERR   [RCM] rcm::RcmResource::HandleFailure: (Cluster Disk 5)
778:748.06/16[11:10:08.977](000000) WARN  [RCM] Skipping failure processing in because RCM PostForm has not yet been called.
778:70c.06/16[11:10:09.024](000000) ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): ERROR_RESOURCE_FAILED(5038)' because of 'Failed to bring quorum resource 790c1d90-bce9-48fc-b6b0-e982152f7a74 online, status 5038'
778:70c.06/16[11:10:10.506](000000) DBG   [HM] Connection attempt to BRGRUHOST01 failed with error WSAENETUNREACH(10051): Failed to connect to remote endpoint 10.10.4.201:~3343~.
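
When a Cluster.log is long, it can help to filter it down to the warnings and errors before reading it. The following is only a minimal sketch in Python, assuming every line follows the layout seen in the excerpt above (process:thread, date, time, a counter in parentheses, level, message). The names `LINE_RE`, `parse_line` and `filter_levels` are made up for this illustration and are not part of any Microsoft tool.

```python
import re

# Layout assumed from the excerpt above:
# <pid>:<tid>.<MM/DD>[<HH:MM:SS.mmm>](<counter>) <LEVEL>  <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+):(?P<tid>[0-9a-f]+)\."
    r"(?P<date>\d{2}/\d{2})\[(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\]"
    r"\(\d+\)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def filter_levels(lines, levels=("ERR", "WARN")):
    """Keep only the parsed entries whose level is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in levels]


# A few lines copied from the excerpt above.
SAMPLE = [
    "778:7a4.06/16[11:10:02.347](000000) INFO  [CS] Cluster Service started",
    "988:9b4.06/16[11:10:08.478](000000) WARN  [RES] Physical Disk <Cluster Disk 5>: Open: invalid device number!",
    "988:a84.06/16[11:10:08.962](000000) ERR   [RES] Physical Disk <Cluster Disk 5>: Failed to located disk with id, no *Number* FAED5952",
    "778:748.06/16[11:10:08.977](000000) ERR   [RCM] Arbitrating resource 'Cluster Disk 5' returned error 1168",
]

for entry in filter_levels(SAMPLE):
    print(entry["time"], entry["level"], entry["message"])
```

Lines that do not match the assumed layout are skipped, so the filter should be treated as a reading aid and not as a complete parser.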

This kind of issue can also occur when you are moving a resource or restarting a node, in which case Event ID 1069 is logged.
Example: On the first node of the cluster we are seeing Event ID 1069 "Cluster resource 'Cluster Disk 2' in clustered service or application 'Cluster Group' failed" every time we reboot the server. It doesn't matter if the server is the active node or passive node at the time of the reboot. The error still seems to occur every time we reboot. This does not seem to have a serious negative impact on the overall functionality of the cluster but I can't put this into production until this is resolved. I have not seen this error on the second node.

In this case the issue and the solution are the same, but to be 100% sure you need to check the Cluster.log.

Solution

=========

Network connectivity is not fully initialized when the node first attempts the join. The node starts the form process, tries to arbitrate for the witness disk, fails, then tries a join. The join does succeed, but an event 1034 or 1069 is logged.
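
To check whether another Cluster.log shows the same pattern, a rough heuristic is to look for network or authentication symptoms that appear before the first disk arbitration failure. The sketch below is only an illustration in Python: the marker strings are taken from the log excerpt earlier in this post and are an assumption, not an official or complete list, and the function names are hypothetical.

```python
# Marker strings copied from the Cluster.log excerpt in this post (assumed, not exhaustive).
NETWORK_MARKERS = ("GetAddrInfo failed", "WSAENETUNREACH", "Authentication failed")
DISK_MARKERS = ("Arbitrating resource", "Failed to bring quorum resource")


def first_index(lines, markers):
    """Return the index of the first line containing any marker, or None."""
    for index, line in enumerate(lines):
        if any(marker in line for marker in markers):
            return index
    return None


def looks_like_early_start(lines):
    """True when a network symptom is logged before the first disk failure."""
    net = first_index(lines, NETWORK_MARKERS)
    disk = first_index(lines, DISK_MARKERS)
    return net is not None and disk is not None and net < disk


# Lines copied from the excerpt, in their original order.
SAMPLE = [
    "778:79c.06/16[11:10:02.612](000000) WARN  [JPM] Node 2: Exception WSANO_DATA(11004)' because of 'GetAddrInfo failed' while trying to query DNS for node names",
    "778:748.06/16[11:10:08.977](000000) ERR   [RCM] Arbitrating resource 'Cluster Disk 5' returned error 1168",
]

print(looks_like_early_start(SAMPLE))
```

A positive result only suggests that the service started before the network was ready. The full log still has to be read to confirm it.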

We set the Cluster Service startup type to 'Automatic (Delayed Start)' to give the network stack time to fully initialize before the cluster service starts.
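
The same change can be made from the command line with `sc.exe config ... start= delayed-auto` instead of the Services console. The sketch below wraps that command in Python. It assumes the Cluster Service short name is `ClusSvc`, that it runs on the node itself from an elevated prompt, and the `--apply` switch is invented for this example. Without that switch it only prints the command.

```python
import subprocess
import sys

SERVICE_NAME = "ClusSvc"  # assumed short name of the Cluster Service


def delayed_start_command(service=SERVICE_NAME):
    """Build the sc.exe command that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects "start=" and its value as two separate tokens.
    return ["sc.exe", "config", service, "start=", "delayed-auto"]


def apply_delayed_start(service=SERVICE_NAME):
    """Run the command; raises CalledProcessError if sc.exe reports a failure."""
    return subprocess.run(delayed_start_command(service), check=True)


if __name__ == "__main__":
    if sys.platform == "win32" and "--apply" in sys.argv:
        apply_delayed_start()
    else:
        # Dry run: show the command that would be executed.
        print(" ".join(delayed_start_command()))
```

Printing the command first makes it easy to review before changing a production node.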
