Aug 15, 2011

Cannot start the Cluster Service

Could not start the Cluster Service service on SRVNAME.
Error 1067: The process terminated unexpectedly.
 
This is a simple case. Customers sometimes don't agree with the conclusion, but I believe the information below makes it safe to say what is going on.
In this scenario we have a two-node Windows Server 2003 SP2 cluster, and one of the servers cannot start the “Cluster Service” service. When we try to start it, the error message shown above is displayed.
 
Event Viewer shows event ID 1209 from the ClusDisk source:
 
        Event Type:        Error
        Event Source:        ClusDisk
        Event Category:        None
        Event ID:        1209
        Date:                8/5/2011
        Time:                4:02:44 PM
        Computer:        SRVNAME
        Description: Cluster service is requesting a bus reset for device \Device\ClusDisk0.
 
Let's take a look at the cluster environment:
OS Version and SP:  Windows Server 2003 R2 Enterprise SP2
Hostname: SRVNAME01 / SRVNAME02
Is Virtual (Y/N): N
Roles: Cluster Print Server
Domain: cro.local
  
Additional Features:
==================
Antivirus: SEP 11.0.6200
 
The cluster environment contains 3 LUNs (the quorum disk, disk S:, and disk P:).
For further testing we added a fourth LUN (named disk N:).
 
Important TIP: Disks Q:, S:, and P: were online in Disk Management but inaccessible through Explorer or the Command Prompt.
Disk N:, however, worked perfectly: we could read and write data on it.
 
Solution
==========

 
    · We tried starting the Cluster service on both nodes, but it did not work.
    · For troubleshooting, we asked the customer to shut down node 2 (SRVNAME02).
    · Based on the logs and the symptoms, this is likely a storage issue.
    · Analyzing Cluster.log:
 
598:7f8.08/05[16:52:39.256](6522675) WARN [NM] Network 4e1a3d12-269a-4f33-a3f0-bf0393bf6d10 (Public) is up.
598:3c8.08/05[16:52:39.256](6522676) WARN [NM] Network 88798bd5-4cef-4f45-a6a0-d6631a0e4fd3 (Private) is up.
a68:dfc.08/05[16:54:31.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to write (sector 11), error 170.
a68:dfc.08/05[16:54:32.412](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 4 attempts left
a68:dfc.08/05[16:54:32.412](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:32.912](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 3 attempts left
a68:dfc.08/05[16:54:32.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:33.412](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 2 attempts left
a68:dfc.08/05[16:54:33.412](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:33.912](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 1 attempts left
a68:dfc.08/05[16:54:33.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:b7c.08/05[16:54:33.912](6522678) WARN Physical Disk <Quorum Disk>: [PnP] RemoveDisk: disk a323c64d not found or previously removed
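
When the log is long, it can help to count the arbitration failures instead of reading them one by one. The following is a minimal sketch, assuming the cluster.log line format shown in the excerpt above; the regular expression and the function name `summarize_diskarb_failures` are illustrative and are not part of any Microsoft tool.

```python
import re
from collections import Counter

# Pattern for the [DiskArb] read/write failures in the log excerpt above.
# It is written against that excerpt only; other log formats may differ.
DISKARB_FAILURE = re.compile(
    r"\[(?P<time>[\d:.]+)\]\(\d+\)\s+ERR\s+"
    r"Physical Disk <(?P<resource>[^>]+)>: \[DiskArb\] "
    r"Failed to (?P<op>read|write)\s+\(sector (?P<sector>\d+)\), "
    r"error (?P<code>\d+)\."
)


def summarize_diskarb_failures(lines):
    """Count failures per (resource, operation, error code)."""
    counts = Counter()
    for line in lines:
        match = DISKARB_FAILURE.search(line)
        if match:
            key = (match["resource"], match["op"], int(match["code"]))
            counts[key] += 1
    return counts


# Lines copied from the excerpt above.
SAMPLE = """\
a68:dfc.08/05[16:54:31.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to write (sector 11), error 170.
a68:dfc.08/05[16:54:32.412](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 4 attempts left
a68:dfc.08/05[16:54:32.412](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:32.912](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 3 attempts left
a68:dfc.08/05[16:54:32.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:33.412](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 2 attempts left
a68:dfc.08/05[16:54:33.412](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:dfc.08/05[16:54:33.912](000008) WARN Physical Disk <Quorum Disk>: [DiskArb] Retry arbitration, 1 attempts left
a68:dfc.08/05[16:54:33.912](000008) ERR  Physical Disk <Quorum Disk>: [DiskArb] Failed to read  (sector 12), error 170.
a68:b7c.08/05[16:54:33.912](6522678) WARN Physical Disk <Quorum Disk>: [PnP] RemoveDisk: disk a323c64d not found or previously removed
"""

if __name__ == "__main__":
    summary = summarize_diskarb_failures(SAMPLE.splitlines())
    for (resource, op, code), count in sorted(summary.items()):
        print(f"{resource}: {count} failed {op}(s) with error {code}")
```

Every failure in the excerpt hits the same resource with the same error code, which is what points at the disk itself rather than at the cluster networks (both reported as up).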
 
    · What does error 170 mean?
 
Status 170 means "The requested resource is in use." This can be related to persistent reservation problems; it can also be caused by MPIO, fibre/HBA drivers, or some type of lower-level file system driver or software such as antivirus, quota management, or an open file agent for backup software:

00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
00000c94.000008d4::<date and time>.616 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.
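
The numeric status in these log lines is a standard Win32 error code, so its text can be looked up directly; on a Windows command prompt, `net helpmsg 170` prints the same message. Below is a small sketch of the same lookup in Python. The helper name `describe_win32_error` and the fallback table are assumptions made for illustration; `ctypes.FormatError` is only available when Python runs on Windows, so the fallback covers other platforms.

```python
import ctypes

# Fallback table for non-Windows hosts; only code 170 comes from the text above.
KNOWN_CODES = {170: "The requested resource is in use."}


def describe_win32_error(code):
    """Return the message text for a Win32 error code."""
    # ctypes.FormatError exists only on Windows builds of Python.
    format_error = getattr(ctypes, "FormatError", None)
    if format_error is not None:
        return format_error(code).strip()
    return KNOWN_CODES.get(code, f"Unknown error {code}")


if __name__ == "__main__":
    print(170, "->", describe_win32_error(170))
```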
 
    · The customer still had doubts about a storage issue, because LUN 4 (disk N:) was available.
    · To troubleshoot and validate the cluster-to-storage path, we took the following steps:
        o Disabled the Cluster service
        o Disabled the Cluster Disk driver:
                § Click Start, point to Administrative Tools, click Computer Management, and then click Device Manager.
                § Make sure that Show hidden devices is enabled. To enable hidden devices, click Show hidden devices on the View menu.
                § Expand Non-Plug and Play Drivers, right-click Cluster Disk Driver, click Properties, click the Driver tab, in the Startup type list click Disabled, and then click OK.
                § Quit Computer Management, and then restart your computer.
                http://support.microsoft.com/kb/890549
    
    · After restarting SRVNAME01 we tried to access disks Q:, P:, and S:, but all of them were still unavailable.
    · We explained to the customer that, with the Cluster service and the Cluster Disk driver out of the picture, this shows the problem is located in the storage.
    · To fix the issue and help the customer, we re-enabled ClusDisk, set the Cluster service to Manual, and restarted the server.
    · Started the Cluster service with the "fixquorum" parameter (/fq switch).
    · Changed the quorum disk (Q:) to point to the new LUN, disk N:.
    · Set the Cluster service to Automatic and restarted again.
    · Started node 2, and it joined the cluster successfully.
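
The recovery steps above can be sketched as a command plan. This is a dry run that only prints the commands; it is not the exact sequence we typed. It assumes that `/fq` is the short form of the `/fixquorum` startup switch on Windows Server 2003, that the service name is `clussvc`, and that the new disk has already been added to the cluster as a physical disk resource named "Disk N:" (a hypothetical resource name). Check the `cluster /quorum` syntax against your own environment before running anything.

```python
import subprocess

# Dry run only: the commands are printed, never executed.
# "Disk N:" as a cluster resource name is an assumption for illustration.
RECOVERY_STEPS = [
    ("Set the Cluster service to manual start",
     ["sc", "config", "clussvc", "start=", "demand"]),
    ("Start the Cluster service without the quorum disk",
     ["net", "start", "clussvc", "/fixquorum"]),
    ("Point the quorum at the new disk resource",
     ["cluster", "/quorum:Disk N:"]),
    ("Set the Cluster service back to automatic start",
     ["sc", "config", "clussvc", "start=", "auto"]),
]


def render_plan(steps):
    """Return one printable Windows command line per step."""
    return [subprocess.list2cmdline(argv) for _, argv in steps]


if __name__ == "__main__":
    commands = render_plan(RECOVERY_STEPS)
    for (description, _), command in zip(RECOVERY_STEPS, commands):
        print(f"{description}:\n    {command}")
```

The reboots and the re-enabling of the Cluster Disk driver in Device Manager are left out on purpose, since those were done by hand.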

Conclusion: The issue occurred because the quorum disk was not available to bring up the Cluster service, due to a communication problem with the storage.
Adding a new LUN made it possible to change the quorum configuration and start the Cluster service. This again showed the customer that the failure lies with the original disks (Q:, S:, and P:), which need to be checked.

More Information
=================
 
<http://blogs.technet.com/b/askcore/archive/2008/02/06/troubleshooting-cluster-logs-101-why-did-the-resources-failover-to-the-other-node.aspx>
