Chang's Blog: Logging event id 1069 and 1558 every 15 minutes

Hello,

I'm posting a new issue that I saw in few days and was solved with a solution that it's not so simple. Actually the solution is simple but the way to do that in same cases could create a second problem :)

Let's check the scenario and actions takens.

Subjective

============

Getting Events 1069 and 1558 logged every 15 minutes on the server USPHXE0251 which has the Cluster Group.

In this scenario we had Exchange 2010 DAG impacting but these "errors" can appears on SQL or any Cluster activity.

Typed "cluster res" command

Listing status for all available resources:

Resource Group Node Status

-------------------- -------------------- --------------- ------

Cluster IP Address Cluster Group E0251 Online

Cluster Name Cluster Group E0251 Online

File Share Witness (\\e0202.xyz.com\DAG01.xyz.com) Cluster Group E0251 Online

Log Name: System

Source: Microsoft-Windows-FailoverClustering

Event ID: 1069

Task Category: Resource Control Manager

Level: Error

User: SYSTEM

Computer: E0233.xyz.com

Description: Cluster resource 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)' in clustered service or application 'Cluster Group' failed.

Log Name: System

Source: Microsoft-Windows-FailoverClustering

Event ID: 1558

Task Category: Quorum Manager

Level: Warning

User: SYSTEM

Computer: E0251.xyz.com

Description: The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.

In this case my Failover Cluster is running with Exchange 2010 SP1 and we've configure the Quorum with FSM option. The File server is Hub Server.

Actions Takens

===============

Looking to HUB server (Witness Disk Servers) we see the following drivers are with the same driver level for SMB:

contains the MRXSMB10.sys = 6.1.7601.21767 and MRXSMB20.sys = 6.1.7601.17605

Generated the Cluster.log using the command "cluster log /gen"

00002bcc.00001eb8::2012/02/28-21:47:49.278 ERR [QUORUM] Node 4: Failing quorum resource due to witness failure

00002bcc.00002c90::2012/02/28-21:47:49.278 INFO [GUM] Node 4: Processing RequestLock 4:368

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(578) result 0.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) Online-->ProcessingFailure.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Online --> Failed)

00002bcc.00001eb8::2012/02/28-21:47:49.278 ERR [RCM] rcm::RcmResource::HandleFailure: (File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [QUORUM] Node 4: PostRelease for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] resource File Share Witness (\\e0202.xyz.com\DAG01.xyz.com): failure count: 2, restartAction: 2.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] Greater than restartPeriod time has elapsed since first failure, resetting failureTime and failureCount.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] Will restart resource in 500 milliseconds.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Failed --> Pending)

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource].

00002284.000042a4::2012/02/28-21:47:49.278 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Terminating resource ...

00002284.000042a4::2012/02/28-21:47:49.278 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Resource is offline.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] HandleMonitorReply: TERMINATERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(579) result 0.

00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) [Terminating to DelayRestartingResource]-->DelayRestartingResource.

00002bcc.00001d1c::2012/02/28-21:47:49.278 INFO [GUM] Node 4: Processing GrantLock to 4 (sent by 6 gumid: 140952)

00002bcc.00002c90::2012/02/28-21:47:49.278 INFO [QUORUM] Node 4: Witness Failed Gum Handler [QUORUM] Node 4

00002bcc.00002c90::2012/02/28-21:47:49.278 INFO [QUORUM] Node 4: witness attach failed. next restart will happen at 2012/02/28-22:02:49.278

00002bcc.0000238c::2012/02/28-21:47:49.278 INFO [QUORUM] Node 4: quorum is not owned by anyone

00002bcc.00004178::2012/02/28-21:47:49.792 INFO [RCM] Delay-restarting File Share Witness (\\e0202.xyz.com\DAG01.xyz.com) and any waiting dependents.

00002bcc.00004178::2012/02/28-21:47:49.792 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) DelayRestartingResource-->OnlineCallIssued.

00002284.00002360::2012/02/28-21:47:49.792 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Beginning arbitration ...

00002284.00002360::2012/02/28-21:47:49.792 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Opening file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log.

00002284.00002360::2012/02/28-21:47:49.808 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Attempting to lock file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log, try 1 of 30.

00002284.00002360::2012/02/28-21:47:49.808 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Succeeded in locking file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log

00002bcc.00004178::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4: PostArbitrate => 0 for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d

00002bcc.00004178::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4 CompareAndSetWitnessTag: ignoring any existing data on witness resource.

00002bcc.00004178::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4 CompareAndSetWitnessTag: writing witness tag 164:164:509200

00002284.00002360::2012/02/28-21:47:49.824 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Writing file share witness epoch data.

00002284.00002360::2012/02/28-21:47:49.824 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Wrote 88 bytes to the witness file share.

00002bcc.00004178::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4 CompareAndSetWitnessTag: releasing witness share lock

00002284.00002360::2012/02/28-21:47:49.824 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Releasing locked witness share.

00002284.00002360::2012/02/28-21:47:49.824 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Bringing resource online ...

00002284.00002360::2012/02/28-21:47:49.824 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Resource is online.

00002bcc.00000c60::2012/02/28-21:47:49.824 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(579) result 0.

00002bcc.00000c60::2012/02/28-21:47:49.824 INFO [RCM] TransitionToState(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)) OnlineCallIssued-->Online.

00002bcc.00000c60::2012/02/28-21:47:49.824 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Pending --> Online)

00002bcc.00000c60::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4: PostOnline for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d

00002bcc.0000238c::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4: quorum is arbitrated by node 4

00002bcc.0000238c::2012/02/28-21:47:49.824 INFO [QUORUM] Node 4: releasing witness lock (if held) because witness is not needed for quorum in new view.

00002284.000029f4::2012/02/28-21:47:49.886 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Ignoring request to release witness share because it is not currently locked.

Restarted the all Servers involved for tests (Cluster nodes and File Server).
After rebooting new "Host Current Server" is generating the event ID 1069 and 1558
Checked the Links below:

http://social.technet.microsoft.com/Forums/en-US/exchangesvravailabilityandisasterrecovery/thread/8ce21cf2-372b-4f83-b9d0-73963d31bcf1

http://blogs.technet.com/b/exchange/archive/2011/11/20/recommended-windows-hotfix-for-database-availability-groups-running-windows-server-2008-r2.aspx

Basically in that articles Microsoft request to update some drivers like SMB, Clussvc.exe and TCPIP.
Changed the Cluster parameters

cluster /prop SameSubnetDelay=2000 (The default value is 1000 milliseconds, we could set it to 2000 milliseconds.)

cluster /prop CrossSubnetDelay=2000

Installed Updates

++++++++

KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working

Clussvc.exe 6.1.7601.21772

KB2661010 - IP packets are not routed through a Windows Server 2008 R2–based LAN router in a VLAN environment

Fwpkclnt.sys 6.1.7601.17514

Tcpip.sys 6.1.7601.17754

Fwpkclnt.sys 6.1.7601.21889

Tcpip.sys 6.1.7601.21889

KB2616514 - Cluster service sends unnecessary registry key change notifications among cluster nodes in Windows Server 2008 or in Windows Server 2008 R2

Clussvc.exe 6.1.7601.17730

Clussvc.exe 6.1.7601.21867

KB2612966 - Paged pool memory leak when you access some shared files in Windows 7 or in Windows Server 2008 R2

Mrxsmb10.sys 6.1.7601.21819

Mrxsmb20.sys 6.1.7601.21819

++++++++++++++++++

Installed the following updates and rebooted all DAGs and Hub (FSW Server).
Changed the Quorum model to Node and Majority and then go back to FSW.
AV exclusion contains the correct drive "X:\Witnesses"
Generated Cluster.log again. Getting the same issues on Cluster.log. Nothing changed so far.

00001738.00002760::2012/03/12-22:23:21.225 ERR [QUORUM] Node 3: Failing quorum resource due to witness failure

00001738.00001b38::2012/03/12-22:23:21.225 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)', gen(7) result 0.

00001738.00001b38::2012/03/12-22:23:21.225 ERR [RCM] rcm::RcmResource::HandleFailure: (File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))

00002168.00002808::2012/03/12-22:23:21.771 INFO [RES] File Share Witness <File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>: Ignoring request to release witness share because it is not currently locked.

Running a Validate this Cluster and everything is up, running and OK.
Checked Network Provider status inside regedit.
Our HUB Server is running under VM technology and contains the following info: on the fileshare witness server following value exist. "vmhgfs,RDPNP,LanmanWorkstation" - but it doesn't impact anything. The values are correct.
File Share witness is on a Virtual Machine. Checked on Microsoft technet that recommend configuration sould be use a File Server running on physical box .
Changed the FSW server to a physical server. (It doesn't worked).

Solution

==============

Shutdown all nodes leaving just one online.
The last one need to be restarted to perform the FORM process.
After completing the reboot process turned on all others nodes.

Update: 2014/10/30 : Based on anomynous feedback the article listed below it seems to fix the problem.

http://support.microsoft.com/kb/2750820

I'm not able to make this test but it's important to keep the troubleshooting mindset to figure out the root cause.

More information à http://technet.microsoft.com/en-us/library/cc738051(v=ws.10).aspx

Forming a Cluster

The first server that comes online in a cluster, either after installation or after the entire cluster has been shut down for some reason, forms the cluster. To succeed at forming a cluster, a server must:

Be running the Cluster service.
Be unable to locate any other nodes in the cluster (in other words, no other nodes can be running).
Acquire exclusive ownership of the quorum resource.

Information

==================

According to Microsoft this logs although being generated every 15 minutes doesn’t means Cluster impact, but so far the only way to fix is using the form procedure. The Quorum (Witness Disk) is the first resource brought online when cluster service attempts to form a cluster.

8 comments:

AnonymousDec 11, 2012, 7:35:00 PM
Hi there! Keep it up! This is a good read. I will be looking forward to visit your page again and for your other posts as well. Thank you for sharing your thoughts about rcm solutions in your area. I am glad to stop by your site and know more about rcm solutions .
The important functions (of a piece of equipment) to preserve with routine maintenance are identified, their dominant failure modes and causes determined and the consequences of failure ascertained. Levels of criticality are assigned to the consequences of failure. Some functions are not critical and are left to "run to failure" while other functions must be preserved at all cost. Maintenance tasks are selected that address the dominant failure causes. This process directly addresses maintenance preventable failures. Failures caused by unlikely events, non-predictable acts of nature, etc. will usually receive no action provided their risk (combination of severity and frequency) is trivial (or at least tolerable). When the risk of such failures is very high, RCM encourages (and sometimes mandates) the user to consider changing something which will reduce the risk to a tolerable level.
The RCM Blitz™ Solutions process was put together to allow companies to complete a traditional RCM analysis on a major critical asset by defining the envelope, completing the upfront tasks, and then setting up your meeting to complete the analysis in less than 5 days.
SteveJan 30, 2013, 4:21:00 PM
Excellent article, I enjoyed seeing the full process of your troubleshooting. Thank you very much for sharing this information.
BertFeb 1, 2013, 12:12:00 PM
Thanks for putting this online Chang!
Great work..

I just have one question, We are seeing the same issue here on E2K10 servers in a DAG. If I understand the process, wouldn't it be enough to restart the cluster service on the last node ?
If that were the case, my databases would not need to be brought offline...

AnonymousMar 16, 2013, 2:11:00 PM
eventid 1069
Cluster resource ''File Share Witness in clustered service or application 'Cluster Group' failed
I reboot one of the member of DAG (E2k10SP2 server)
Result:
AT THE CLUSTER CORE RESOURCES, everything was offline or failed
location,failover cluster manager - cluster core resources - cluster was offline.

Naula
Solution:::
restart the Cluster service on the passive DAG (not on the current host server)

AnonymousOct 29, 2014, 4:25:00 PM
http://support.microsoft.com/kb/2750820
UnknownDec 4, 2019, 11:13:00 AM
Me están haciendo espionaje
UnknownDec 4, 2019, 11:14:00 AM
Es legal espionaje civil

Chang's Blog

Search This Blog

Mar 19, 2012

Logging event id 1069 and 1558 every 15 minutes

8 comments: