
LanMan Resource fails unexpectedly (Error V-16-2-13067)

Fred2010
Level 6

Hi all, 

I am currently installing a 3-node NetBackup 7.1 cluster on VSF HA 5.1 SP2, running on Windows 2008 R2 (SP1).

The installation went fine and failover of the Netbackup resource back and forth shows no problems whatsoever.

This weekend however, for no apparent reason, the LanMan Resource failed on the first node and the entire Netbackup resource was failed over to the next available node.

In the Eventlog I see:

Agent is calling clean for resource(NetBackup_Server-Lanman) because the resource became OFFLINE unexpectedly, on its own. 

and in the VCS Log:

May 22, 2011 5:51:21 AM V-16-2-13067 (SRV0401) Agent is calling clean for resource(NetBackup_Server-Lanman) because the resource became OFFLINE unexpectedly, on its own.  V-16-2-13067

(SRV0401) Agent is calling clean for resource(NetBackup_Server-Lanman) because the resource became OFFLINE unexpectedly, on its own.

No jobs were running (none are configured yet), and nobody was working on the machine when it happened.

After the failover, the first node was left in a FAILED state...

All Systems are running:

Windows 2008 R2 Enterprise, SP1, Fully patched, x64
VSF HA Windows 5.1 SP2, x64
Netbackup 7.1, x64

I've included some logs with this post. I hope someone can help with this problem!

Thanks in advance for any input you might have...

Fred

Accepted Solutions

Fred2010
Level 6

Hi Ireyes,

Thank you for the info you provided! I will indeed increase logging as you have suggested, hoping that if it occurs again, I am able to provide more info...

I've also opened a case with Symantec ( Case 414-760-213 )

They suggested the following (After receiving VxExplorer Logging). Maybe it can help others reading this as well:

 

The below message indicates that the service being monitored by the cluster is offline. This was not initiated by Veritas Cluster Server (VCS) and does not indicate a problem with VCS itself. VCS monitors the status of the service and triggers a fault if the items being monitored go offline unexpectedly. This is by design.

2011/05/22 05:51:21 VCS ERROR V-16-2-13067 (SRV0401) Agent is calling clean for resource(NetBackup_Server-Lanman) because the resource became OFFLINE unexpectedly, on its own.

2011/05/22 05:51:21 VCS INFO V-16-1-10307 Resource NetBackup_Server-Lanman (Owner: unknown, Group: NetBackup_Server) is offline on SRV0401 (Not initiated by VCS)

The Lanman circular logs, which contain debugging messages, show the following:

2011/05/22 05:51:21 VCS DBG_21 V-16-50-0 Lanman:NetBackup_Server-Lanman:monitor:VLibNetIP::IsNetBTEnabled() returned 2, 0x0000006F

            LibVirtualName.cpp:VLibVirtualName::Check[601]

2011/05/22 05:51:21 VCS DBG_21 V-16-50-0 Lanman:NetBackup_Server-Lanman:monitor:vname.Check() returned 2, 0x0000006F

            LanmanAgent.cpp:CLanmanAgent::Monitor[412]

There is a Windows API call (GetAdaptersAddresses()) that is returning that error. We call it routinely to get the network adapters' information.

This error would only happen if there’s any change to the network adapters when that API call is made.
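For anyone reading similar Lanman debug lines, here is a minimal Python sketch that pulls the failing call and the error code out of a DBG_21 line and maps the code to its Win32 name. The helper name `decode_debug_line` and the small lookup table are my own illustration, not part of VCS. The mapping of 0x6F (decimal 111) to ERROR_BUFFER_OVERFLOW comes from winerror.h. The Windows documentation for GetAdaptersAddresses lists that code for a caller buffer that is too small, which would be consistent with the adapter list changing between calls, though that reading is an assumption on my part.

```python
import re

# Hand-picked subset of Win32 error codes (values from winerror.h).
WIN32_ERRORS = {
    0x00000000: "ERROR_SUCCESS",
    0x00000057: "ERROR_INVALID_PARAMETER",
    0x0000006F: "ERROR_BUFFER_OVERFLOW",
}

# Matches "monitor:<call>() returned <status>, 0x<code>" in a Lanman debug line.
RETURNED = re.compile(
    r"monitor:(?P<call>\S+\(\)) returned (?P<status>\d+), (?P<code>0x[0-9A-Fa-f]+)"
)


def decode_debug_line(line):
    """Return (call, numeric code, symbolic name), or None if the line does not match."""
    m = RETURNED.search(line)
    if m is None:
        return None
    code = int(m.group("code"), 16)
    return m.group("call"), code, WIN32_ERRORS.get(code, "unknown")


line = (
    "2011/05/22 05:51:21 VCS DBG_21 V-16-50-0 Lanman:NetBackup_Server-Lanman:"
    "monitor:VLibNetIP::IsNetBTEnabled() returned 2, 0x0000006F"
)
print(decode_debug_line(line))
```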

Were any changes made to the network settings at the times when these events occurred?

 Examples of changes are adding or removing IP addresses, changes to network settings, etc.

 I have also seen a GPO change cause this particular Windows API to return this error.

5/21/2011        10:33:18 PM   INFORMATION           1704(0x400006a8)  SceCli  srv0401.scrambled.somewhat     

Security policy in the Group policy objects has been applied successfully.

 Possible cause:

 A network configuration change may be occurring, causing this particular Windows API to return an error.

 Workaround:

 Set Lanman ToleranceLimit to 2

 The time it takes to detect a resource fault or failure depends on the MonitorInterval attribute for the resource type. When a resource faults, the next monitor detects it. The agent may not declare the resource as faulted if the ToleranceLimit attribute is set to non-zero. If the monitor entry point reports offline more often than the number set in ToleranceLimit, the resource is declared faulted. However, if the resource remains online for the interval designated in the ConfInterval attribute, previous reports of offline are not counted against ToleranceLimit.
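To make the ToleranceLimit and ConfInterval interaction described above easier to follow, here is a simplified Python model. It is only a sketch of the rule as the paragraph states it, not the actual VCS agent framework logic. The function name `first_fault_time` and the values 60 and 600 seconds are illustrative assumptions.

```python
def first_fault_time(reports, monitor_interval, tolerance_limit, conf_interval):
    """Return the time in seconds at which the resource is declared faulted, or None.

    reports: one boolean per monitor cycle (True = monitor reported online).
    Simplified model: a fault is declared once offline reports exceed
    tolerance_limit; offline reports are forgotten after the resource has
    stayed online for conf_interval seconds.
    """
    offline_count = 0
    online_since = None
    for cycle, online in enumerate(reports):
        now = cycle * monitor_interval
        if online:
            if online_since is None:
                online_since = now
            # Online long enough: earlier offline reports no longer count.
            if now - online_since >= conf_interval:
                offline_count = 0
        else:
            online_since = None
            offline_count += 1
            if offline_count > tolerance_limit:
                return now
    return None


# A single transient offline report, as in this thread.
transient = [True, False, True, True]
print(first_fault_time(transient, 60, 0, 600))  # faults with ToleranceLimit 0
print(first_fault_time(transient, 60, 2, 600))  # tolerated with ToleranceLimit 2
```

With ToleranceLimit at 0 the single offline report faults the resource immediately. With ToleranceLimit at 2 the same sequence is tolerated, which is the point of the suggested workaround.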

When the agent determines that the resource is faulted, it calls the clean entry point (if implemented) to verify that the resource is completely offline. The monitor following clean verifies the offline. The agent then tries to restart the resource according to the number set in the RestartLimit attribute (if the value of the attribute is non-zero) before it gives up and informs HAD that the resource is faulted. However, if the resource remains online for the interval designated in ConfInterval, earlier attempts to restart are not counted against RestartLimit. (Quoted from Chapter 22, VCS Performance Considerations.)

In most cases, ToleranceLimit is 0. The time it takes to detect a resource failure is the time it takes the agent monitor to detect failure, plus the time to clean up the resource if the clean entry point is implemented. Therefore, the time it takes to detect failure depends on the MonitorInterval, the efficiency of the monitor and clean (if implemented) entry points, and the ToleranceLimit (if set).
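As a rough back-of-the-envelope check on the trade-off, the sketch below estimates a worst-case detection time from the attributes named above. The formula is my own simplification (monitor cycles assumed to start every MonitorInterval seconds, with the failure occurring just after a monitor cycle), and the function name and sample numbers are hypothetical.

```python
def worst_case_detection_seconds(monitor_interval, tolerance_limit=0,
                                 monitor_seconds=0.0, clean_seconds=0.0):
    """Estimate the worst-case time from failure to the declared fault.

    Assumes the failure happens right after a monitor cycle, so
    (tolerance_limit + 1) further cycles must report offline, the last
    one takes monitor_seconds to run, and clean takes clean_seconds.
    """
    offline_reports_needed = tolerance_limit + 1
    return offline_reports_needed * monitor_interval + monitor_seconds + clean_seconds


# Illustrative values only: 60 s monitor interval, 5 s monitor, 10 s clean.
print(worst_case_detection_seconds(60, 0, 5, 10))
print(worst_case_detection_seconds(60, 2, 5, 10))
```

In this model, raising ToleranceLimit from 0 to 2 adds two more monitor intervals before a real failure is acted on, so the workaround trades slower detection for fewer false failovers.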

Fred


2 REPLIES 2

Ireyes
Level 3
Employee Accredited Certified

The logs provided do not contain sufficient detail to determine why the Lanman agent faulted on its own.

This error is displayed for any resource under cluster control (that is, in the monitor cycle) when VCS is not able to monitor the resource as online: http://www.symantec.com/docs/TECH70812

You can try to increase logging for the Lanman agent to see if we are able to log any specific errors, but you will have to wait until the issue reproduces.

TECHnote to increase logging. http://www.symantec.com/docs/TECH67017

If the agent comes online and stays online for an undetermined amount of time, you may want to start looking at connectivity issues with AD.

If it's a timing issue, you may consider increasing the "RestartLimit" for the agent, as per http://www.symantec.com/docs/TECH54737; this may add a little more tolerance.

In most cases we see that VCS is just reacting, and the fault is a result of issues with AD.

 

 
