Search This Blog

Mar 19, 2012

Logging event id 1069 and 1558 every 15 minutes


    Hello,

    I'm posting a new issue that I saw in few days and was solved with a solution that it's not so simple. Actually the solution is simple but the way to do that in same cases could create a second problem :)

    Let's check the scenario and actions takens.

    Subjective
    ============

    Getting Events 1069 and 1558 logged every 15 minutes on the server USPHXE0251 which has the Cluster Group.
    In this scenario we had Exchange 2010 DAG impacting but these "errors" can appears on SQL or any Cluster activity.

    Typed "cluster res" command
    Listing status for all available resources:

    Resource             Group                Node            Status
    -------------------- -------------------- --------------- ------
    Cluster IP Address   Cluster Group        E0251      Online
    Cluster Name         Cluster Group        E0251      Online
    File Share Witness (\\e0202.xyz.com\DAG01.xyz.com) Cluster Group      E0251      Online


    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Event ID:      1069
    Task Category: Resource Control Manager
    Level:         Error
    User:          SYSTEM
    Computer:      E0233.xyz.com
    Description: Cluster resource 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)' in clustered service or application 'Cluster Group' failed.

    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Event ID:      1558
    Task Category: Quorum Manager
    Level:         Warning
    User:          SYSTEM
    Computer:     E0251.xyz.com
    Description: The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.

  1. In this case my Failover Cluster is running with Exchange 2010 SP1 and we've configure the Quorum with FSM option. The File server is Hub Server.

  2. Actions Takens
    ===============

    • Looking to HUB server (Witness Disk Servers) we see the following drivers are with the same driver level for SMB:
      • contains the MRXSMB10.sys = 6.1.7601.21767 and MRXSMB20.sys = 6.1.7601.17605


    • Generated the Cluster.log using the command "cluster log  /gen"

    00002bcc.00001eb8::2012/02/28-21:47:49.278 ERR   [QUORUM] Node 4: Failing quorum resource due to witness failure
    00002bcc.00002c90::2012/02/28-21:47:49.278 INFO  [GUM] Node 4: Processing RequestLock 4:368
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(578) result 0.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) Online-->ProcessingFailure.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Online --> Failed)
    00002bcc.00001eb8::2012/02/28-21:47:49.278 ERR   [RCM] rcm::RcmResource::HandleFailure: (File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [QUORUM] Node 4: PostRelease for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] resource File Share Witness (\\e0202.xyz.com\DAG01.xyz.com): failure count: 2, restartAction: 2.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] Greater than restartPeriod time has elapsed since first failure, resetting failureTime and failureCount.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] Will restart resource in 500 milliseconds.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Failed --> Pending)
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource].
    00002284.000042a4::2012/02/28-21:47:49.278 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Terminating resource ...
    00002284.000042a4::2012/02/28-21:47:49.278 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Resource is offline.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] HandleMonitorReply: TERMINATERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(579) result 0.
    00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) [Terminating to DelayRestartingResource]-->DelayRestartingResource.
    00002bcc.00001d1c::2012/02/28-21:47:49.278 INFO  [GUM] Node 4: Processing GrantLock to 4 (sent by 6 gumid: 140952)
    00002bcc.00002c90::2012/02/28-21:47:49.278 INFO  [QUORUM] Node 4: Witness Failed Gum Handler [QUORUM] Node 4
    00002bcc.00002c90::2012/02/28-21:47:49.278 INFO  [QUORUM] Node 4: witness attach failed. next restart will happen at 2012/02/28-22:02:49.278
    00002bcc.0000238c::2012/02/28-21:47:49.278 INFO  [QUORUM] Node 4: quorum is not owned by anyone
    00002bcc.00004178::2012/02/28-21:47:49.792 INFO  [RCM] Delay-restarting File Share Witness (\\e0202.xyz.com\DAG01.xyz.com) and any waiting dependents.
    00002bcc.00004178::2012/02/28-21:47:49.792 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) DelayRestartingResource-->OnlineCallIssued.
    00002284.00002360::2012/02/28-21:47:49.792 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Beginning arbitration ...
    00002284.00002360::2012/02/28-21:47:49.792 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Opening file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log.
    00002284.00002360::2012/02/28-21:47:49.808 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Attempting to lock file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log, try 1 of 30.
    00002284.00002360::2012/02/28-21:47:49.808 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Succeeded in locking file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log
    00002bcc.00004178::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4: PostArbitrate => 0 for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d
    00002bcc.00004178::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4 CompareAndSetWitnessTag: ignoring any existing data on witness resource.
    00002bcc.00004178::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4 CompareAndSetWitnessTag: writing witness tag 164:164:509200
    00002284.00002360::2012/02/28-21:47:49.824 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Writing file share witness epoch data.
    00002284.00002360::2012/02/28-21:47:49.824 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Wrote 88 bytes to the witness file share.
    00002bcc.00004178::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4 CompareAndSetWitnessTag: releasing witness share lock
    00002284.00002360::2012/02/28-21:47:49.824 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Releasing locked witness share.
    00002284.00002360::2012/02/28-21:47:49.824 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Bringing resource online ...
    00002284.00002360::2012/02/28-21:47:49.824 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Resource is online.
    00002bcc.00000c60::2012/02/28-21:47:49.824 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(579) result 0.
    00002bcc.00000c60::2012/02/28-21:47:49.824 INFO  [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) OnlineCallIssued-->Online.
    00002bcc.00000c60::2012/02/28-21:47:49.824 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Pending --> Online)
    00002bcc.00000c60::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4: PostOnline for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d
    00002bcc.0000238c::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4: quorum is arbitrated by node 4
    00002bcc.0000238c::2012/02/28-21:47:49.824 INFO  [QUORUM] Node 4: releasing witness lock (if held) because witness is not needed for quorum in new view.
    00002284.000029f4::2012/02/28-21:47:49.886 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Ignoring request to release witness share because it is not currently locked.

    • Restarted the all Servers involved for tests (Cluster nodes and File Server).
    • After rebooting new "Host Current Server" is generating the event ID 1069 and 1558
    • Checked the Links below:


    • Basically in that articles Microsoft request to update some drivers like SMB, Clussvc.exe and TCPIP.
    • Changed the Cluster parameters
    cluster /prop SameSubnetDelay=2000 (The default value is 1000 milliseconds, we could set it to 2000 milliseconds.)
    cluster /prop CrossSubnetDelay=2000

    • Installed Updates
    ++++++++

    KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
    Clussvc.exe 6.1.7601.21772

    KB2661010 - IP packets are not routed through a Windows Server 2008 R2–based LAN router in a VLAN environment
    Fwpkclnt.sys 6.1.7601.17514
    Tcpip.sys 6.1.7601.17754

    Fwpkclnt.sys 6.1.7601.21889
    Tcpip.sys 6.1.7601.21889

    KB2616514 - Cluster service sends unnecessary registry key change notifications among cluster nodes in Windows Server 2008 or in Windows Server 2008 R2
    Clussvc.exe 6.1.7601.17730
    Clussvc.exe 6.1.7601.21867

    KB2612966 - Paged pool memory leak when you access some shared files in Windows 7 or in Windows Server 2008 R2
    Mrxsmb10.sys 6.1.7601.21819
    Mrxsmb20.sys 6.1.7601.21819

    ++++++++++++++++++

    • Installed the following updates and rebooted all DAGs and Hub (FSW Server).
    • Changed the Quorum model to Node and Majority and then go back to FSW.
    • AV exclusion contains the correct drive "X:\Witnesses"
    • Generated Cluster.log again. Getting the same issues on Cluster.log. Nothing changed so far.

    00001738.00002760::2012/03/12-22:23:21.225 ERR   [QUORUM] Node 3: Failing quorum resource due to witness failure
    00001738.00001b38::2012/03/12-22:23:21.225 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(7) result 0.
    00001738.00001b38::2012/03/12-22:23:21.225 ERR   [RCM] rcm::RcmResource::HandleFailure: (File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))
    00002168.00002808::2012/03/12-22:23:21.771 INFO  [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Ignoring request to release witness share because it is not currently locked.


    • Running a Validate this Cluster and everything is up, running and OK.
    • Checked Network Provider status inside regedit.
    • Our HUB Server is running under VM  technology and contains the following info:   on the fileshare witness server following value exist. "vmhgfs,RDPNP,LanmanWorkstation"  - but it doesn't impact anything. The values are correct.
    • File Share witness is on  a Virtual Machine. Checked on Microsoft technet that recommend configuration sould be use a File Server running on physical box .
    • Changed the FSW server to a physical server. (It doesn't worked).



    Solution
    ==============
    • Shutdown all nodes leaving just one online.
    • The last one need to be restarted to perform the FORM process.
    • After completing the reboot process turned on all others nodes.

Update: 2014/10/30 : Based on anomynous feedback the article listed below it seems to fix the problem.
http://support.microsoft.com/kb/2750820
I'm not able to make this test but it's important to keep the troubleshooting mindset to figure out the root cause.



    Forming a Cluster
    The first server that comes online in a cluster, either after installation or after the entire cluster has been shut down for some reason, forms the cluster. To succeed at forming a cluster, a server must:
    • Be running the Cluster service.
    • Be unable to locate any other nodes in the cluster (in other words, no other nodes can be running).
    • Acquire exclusive ownership of the quorum resource.


    Information
    ==================
    • According to Microsoft this logs although being generated every 15 minutes doesn’t means Cluster impact, but so far the only way to fix is using the form procedure. The Quorum (Witness Disk) is the first resource brought online when cluster service attempts to form a cluster.



8 comments:

  1. Hi there! Keep it up! This is a good read. I will be looking forward to visit your page again and for your other posts as well. Thank you for sharing your thoughts about rcm solutions in your area. I am glad to stop by your site and know more about rcm solutions .
    The important functions (of a piece of equipment) to preserve with routine maintenance are identified, their dominant failure modes and causes determined and the consequences of failure ascertained. Levels of criticality are assigned to the consequences of failure. Some functions are not critical and are left to "run to failure" while other functions must be preserved at all cost. Maintenance tasks are selected that address the dominant failure causes. This process directly addresses maintenance preventable failures. Failures caused by unlikely events, non-predictable acts of nature, etc. will usually receive no action provided their risk (combination of severity and frequency) is trivial (or at least tolerable). When the risk of such failures is very high, RCM encourages (and sometimes mandates) the user to consider changing something which will reduce the risk to a tolerable level.
    The RCM Blitz™ Solutions process was put together to allow companies to complete a traditional RCM analysis on a major critical asset by defining the envelope, completing the upfront tasks, and then setting up your meeting to complete the analysis in less than 5 days.

    ReplyDelete
  2. Excellent article, I enjoyed seeing the full process of your troubleshooting. Thank you very much for sharing this information.

    ReplyDelete
  3. Thanks for putting this online Chang!
    Great work..

    I just have one question, We are seeing the same issue here on E2K10 servers in a DAG. If I understand the process, wouldn't it be enough to restart the cluster service on the last node ?
    If that were the case, my databases would not need to be brought offline...

    ReplyDelete
    Replies
    1. Yes that is possible if all others nodes were turned off.
      In my case we restarted the servers since we had time to keep end users with no Outlook service.
      If it doesn't work you will need restart the box.

      Delete
  4. eventid 1069
    Cluster resource ''File Share Witness in clustered service or application 'Cluster Group' failed
    I reboot one of the member of DAG (E2k10SP2 server)
    Result:
    AT THE CLUSTER CORE RESOURCES, everything was offline or failed
    location,failover cluster manager - cluster core resources - cluster was offline.

    Naula
    Solution:::
    restart the Cluster service on the passive DAG (not on the current host server)


    ReplyDelete
  5. http://support.microsoft.com/kb/2750820

    ReplyDelete