- In this case my Failover Cluster is running with Exchange 2010 SP1 and we've configure the Quorum with FSM option. The File server is Hub Server.
- Looking to HUB server (Witness Disk Servers) we see the following drivers are with the same driver level for SMB:
- contains the MRXSMB10.sys = 6.1.7601.21767 and MRXSMB20.sys = 6.1.7601.17605
- Generated the Cluster.log using the command "cluster log /gen"
- Restarted the all Servers involved for tests (Cluster nodes and File Server).
- After rebooting new "Host Current Server" is generating the event ID 1069 and 1558
- Checked the Links below:
- Basically in that articles Microsoft request to update some drivers like SMB, Clussvc.exe and TCPIP.
- Changed the Cluster parameters
- Installed Updates
- Installed the following updates and rebooted all DAGs and Hub (FSW Server).
- Changed the Quorum model to Node and Majority and then go back to FSW.
- AV exclusion contains the correct drive "X:\Witnesses"
- Generated Cluster.log again. Getting the same issues on Cluster.log. Nothing changed so far.
- Running a Validate this Cluster and everything is up, running and OK.
- Checked Network Provider status inside regedit.
- Our HUB Server is running under VM technology and contains the following info: on the fileshare witness server following value exist. "vmhgfs,RDPNP,LanmanWorkstation" - but it doesn't impact anything. The values are correct.
- File Share witness is on a Virtual Machine. Checked on Microsoft technet that recommend configuration sould be use a File Server running on physical box .
- Changed the FSW server to a physical server. (It doesn't worked).
- Shutdown all nodes leaving just one online.
- The last one need to be restarted to perform the FORM process.
- After completing the reboot process turned on all others nodes.
Hello,
I'm posting a new
issue that I saw in few days and was solved with a solution that it's not so
simple. Actually the solution is simple but the way to do that in same cases
could create a second problem :)
Let's check the
scenario and actions takens.
Subjective
============
Getting Events 1069
and 1558 logged every 15 minutes on the server USPHXE0251 which has the
Cluster Group.
In this scenario we
had Exchange 2010 DAG impacting but these "errors" can appears on
SQL or any Cluster activity.
Typed "cluster
res" command
Listing status for
all available resources:
Resource Group Node Status
--------------------
-------------------- --------------- ------
Cluster IP
Address Cluster Group E0251 Online
Cluster Name Cluster Group E0251 Online
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Task Category:
Resource Control Manager
Level: Error
User: SYSTEM
Computer: E0233.xyz.com
Description:
Cluster resource 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)'
in clustered service or application 'Cluster Group' failed.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1558
Task Category:
Quorum Manager
Level: Warning
User: SYSTEM
Computer: E0251.xyz.com
Description: The
cluster service detected a problem with the witness resource. The witness
resource will be failed over to another node within the cluster in an attempt
to reestablish access to cluster configuration data.
Actions Takens
===============
00002bcc.00001eb8::2012/02/28-21:47:49.278
ERR [QUORUM] Node 4: Failing quorum
resource due to witness failure
00002bcc.00002c90::2012/02/28-21:47:49.278
INFO [GUM] Node 4: Processing
RequestLock 4:368
00002bcc.00001eb8::2012/02/28-21:47:49.278 INFO [RCM] HandleMonitorReply:
FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)',
gen(578) result 0.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
Online-->ProcessingFailure.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM]
rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Online --> Failed)
00002bcc.00001eb8::2012/02/28-21:47:49.278
ERR [RCM]
rcm::RcmResource::HandleFailure: (File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [QUORUM] Node 4: PostRelease for
0dd9ec96-71d1-4949-806d-7d5403ff3f6d
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] resource File Share Witness
(\\e0202.xyz.com\DAG01.xyz.com):
failure count: 2, restartAction: 2.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] Greater than restartPeriod
time has elapsed since first failure, resetting failureTime and failureCount.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] Will restart resource in 500
milliseconds.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM]
rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Failed --> Pending)
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
[WaitingToTerminate to DelayRestartingResource]-->[Terminating to
DelayRestartingResource].
00002284.000042a4::2012/02/28-21:47:49.278
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Terminating resource ...
00002284.000042a4::2012/02/28-21:47:49.278
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Resource is offline.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] HandleMonitorReply:
TERMINATERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)',
gen(579) result 0.
00002bcc.00001eb8::2012/02/28-21:47:49.278
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
[Terminating to DelayRestartingResource]-->DelayRestartingResource.
00002bcc.00001d1c::2012/02/28-21:47:49.278
INFO [GUM] Node 4: Processing GrantLock
to 4 (sent by 6 gumid: 140952)
00002bcc.00002c90::2012/02/28-21:47:49.278
INFO [QUORUM] Node 4: Witness Failed
Gum Handler [QUORUM] Node 4
00002bcc.00002c90::2012/02/28-21:47:49.278
INFO [QUORUM] Node 4: witness attach
failed. next restart will happen at 2012/02/28-22:02:49.278
00002bcc.0000238c::2012/02/28-21:47:49.278
INFO [QUORUM] Node 4: quorum is not
owned by anyone
00002bcc.00004178::2012/02/28-21:47:49.792
INFO [RCM] Delay-restarting File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com)
and any waiting dependents.
00002bcc.00004178::2012/02/28-21:47:49.792
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
DelayRestartingResource-->OnlineCallIssued.
00002284.00002360::2012/02/28-21:47:49.792
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Beginning arbitration ...
00002284.00002360::2012/02/28-21:47:49.792
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Opening file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log.
00002284.00002360::2012/02/28-21:47:49.808
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Attempting to lock file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log,
try 1 of 30.
00002284.00002360::2012/02/28-21:47:49.808
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Succeeded in locking file \\e0202.xyz.com\DAG01.xyz.com\0dd9ec96-71d1-4949-806d-7d5403ff3f6d\Witness.log
00002bcc.00004178::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4: PostArbitrate
=> 0 for 0dd9ec96-71d1-4949-806d-7d5403ff3f6d
00002bcc.00004178::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4
CompareAndSetWitnessTag: ignoring any existing data on witness resource.
00002bcc.00004178::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4
CompareAndSetWitnessTag: writing witness tag 164:164:509200
00002284.00002360::2012/02/28-21:47:49.824
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Writing file share witness epoch data.
00002284.00002360::2012/02/28-21:47:49.824
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Wrote 88 bytes to the witness file share.
00002bcc.00004178::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4
CompareAndSetWitnessTag: releasing witness share lock
00002284.00002360::2012/02/28-21:47:49.824
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Releasing locked witness share.
00002284.00002360::2012/02/28-21:47:49.824
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Bringing resource online ...
00002284.00002360::2012/02/28-21:47:49.824
INFO [RES] File Share Witness <File
Share Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Resource is online.
00002bcc.00000c60::2012/02/28-21:47:49.824
INFO [RCM] HandleMonitorReply:
ONLINERESOURCE for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)',
gen(579) result 0.
00002bcc.00000c60::2012/02/28-21:47:49.824
INFO [RCM] TransitionToState(File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com))
OnlineCallIssued-->Online.
00002bcc.00000c60::2012/02/28-21:47:49.824
INFO [RCM]
rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Pending --> Online)
00002bcc.00000c60::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4: PostOnline for
0dd9ec96-71d1-4949-806d-7d5403ff3f6d
00002bcc.0000238c::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4: quorum is
arbitrated by node 4
00002bcc.0000238c::2012/02/28-21:47:49.824
INFO [QUORUM] Node 4: releasing witness
lock (if held) because witness is not needed for quorum in new view.
00002284.000029f4::2012/02/28-21:47:49.886 INFO [RES] File Share Witness <File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Ignoring request to release witness share because it is not currently locked.
cluster /prop SameSubnetDelay=2000 (The default value is 1000
milliseconds, we could set it to 2000 milliseconds.)
cluster /prop CrossSubnetDelay=2000
++++++++
KB2550886
- A transient communication failure causes a Windows Server 2008 R2 failover
cluster to stop working
Clussvc.exe
6.1.7601.21772
KB2661010
- IP packets are not routed through a Windows Server 2008 R2–based LAN router
in a VLAN environment
Fwpkclnt.sys
6.1.7601.17514
Tcpip.sys
6.1.7601.17754
Fwpkclnt.sys
6.1.7601.21889
Tcpip.sys
6.1.7601.21889
KB2616514
- Cluster service sends unnecessary registry key change notifications among
cluster nodes in Windows Server 2008 or in Windows Server 2008 R2
Clussvc.exe
6.1.7601.17730
Clussvc.exe
6.1.7601.21867
KB2612966
- Paged pool memory leak when you access some shared files in Windows 7 or in
Windows Server 2008 R2
Mrxsmb10.sys
6.1.7601.21819
Mrxsmb20.sys
6.1.7601.21819
++++++++++++++++++
00001738.00002760::2012/03/12-22:23:21.225 ERR [QUORUM] Node 3: Failing quorum resource
due to witness failure
00001738.00001b38::2012/03/12-22:23:21.225 INFO [RCM] HandleMonitorReply:
FAILURENOTIFICATION for 'File Share Witness (\\e0202.xyz.com\DAG01.xyz.com)',
gen(7) result 0.
00001738.00001b38::2012/03/12-22:23:21.225 ERR [RCM] rcm::RcmResource::HandleFailure:
(File Share Witness (\\e0202.xyz.com\DAG01.xyz.com))
00002168.00002808::2012/03/12-22:23:21.771 INFO [RES] File Share Witness <File Share
Witness (\\e0202.xyz.com\DAG01.xyz.com)>:
Ignoring request to release witness share because it is not currently locked.
Solution
==============
http://support.microsoft.com/kb/2750820
I'm not able to make this test but it's important to keep the troubleshooting mindset to figure out the root cause.
- Be running the Cluster service.
- Be unable to locate any other nodes in the cluster (in other words, no other nodes can be running).
- Acquire exclusive ownership of the quorum resource.
- According to Microsoft this logs although being generated every 15 minutes doesn’t means Cluster impact, but so far the only way to fix is using the form procedure. The Quorum (Witness Disk) is the first resource brought online when cluster service attempts to form a cluster.
Forming a Cluster
The first server that comes online in a cluster, either after
installation or after the entire cluster has been shut down for some reason,
forms the cluster. To succeed at forming a cluster, a
server must:
Information
==================
Hi there! Keep it up! This is a good read. I will be looking forward to visit your page again and for your other posts as well. Thank you for sharing your thoughts about rcm solutions in your area. I am glad to stop by your site and know more about rcm solutions .
ReplyDeleteThe important functions (of a piece of equipment) to preserve with routine maintenance are identified, their dominant failure modes and causes determined and the consequences of failure ascertained. Levels of criticality are assigned to the consequences of failure. Some functions are not critical and are left to "run to failure" while other functions must be preserved at all cost. Maintenance tasks are selected that address the dominant failure causes. This process directly addresses maintenance preventable failures. Failures caused by unlikely events, non-predictable acts of nature, etc. will usually receive no action provided their risk (combination of severity and frequency) is trivial (or at least tolerable). When the risk of such failures is very high, RCM encourages (and sometimes mandates) the user to consider changing something which will reduce the risk to a tolerable level.
The RCM Blitz™ Solutions process was put together to allow companies to complete a traditional RCM analysis on a major critical asset by defining the envelope, completing the upfront tasks, and then setting up your meeting to complete the analysis in less than 5 days.
Excellent article, I enjoyed seeing the full process of your troubleshooting. Thank you very much for sharing this information.
ReplyDeleteThanks for putting this online Chang!
ReplyDeleteGreat work..
I just have one question, We are seeing the same issue here on E2K10 servers in a DAG. If I understand the process, wouldn't it be enough to restart the cluster service on the last node ?
If that were the case, my databases would not need to be brought offline...
Yes that is possible if all others nodes were turned off.
DeleteIn my case we restarted the servers since we had time to keep end users with no Outlook service.
If it doesn't work you will need restart the box.
eventid 1069
ReplyDeleteCluster resource ''File Share Witness in clustered service or application 'Cluster Group' failed
I reboot one of the member of DAG (E2k10SP2 server)
Result:
AT THE CLUSTER CORE RESOURCES, everything was offline or failed
location,failover cluster manager - cluster core resources - cluster was offline.
Naula
Solution:::
restart the Cluster service on the passive DAG (not on the current host server)
http://support.microsoft.com/kb/2750820
ReplyDeleteMe están haciendo espionaje
ReplyDeleteEs legal espionaje civil
ReplyDelete