01-03-2018 04:49 PM
Hello,
We are constantly but seemingly randomly getting this popup message during backups, which results in jobs failing with status 13, 23, 24, etc. We have done troubleshooting on the network side and no changes have helped. We have tried some media server timeout setting changes to no avail. Any suggestions would be much appreciated.
Thanks,
Jamie
01-03-2018 09:22 PM
Technical Solution: Unable to connect to the EMM server (77) after network changes
01-03-2018 10:48 PM
The fact that the errors are intermittent says to me that there were no 'recent network changes', right?
Tell us more about master and media servers - same location and subnet or media server(s) at remote location?
Are you using hosts files or DNS for comms?
Are all servers (master and media) W2008 or any other OS's as well?
Are errors seen specifically during high load?
Are OS updates up to date? (There have been specific OS-related issues with W2008 under high I/O)
What kind of network troubleshooting has been done?
What about intermittent DNS issues?
Any continuous monitoring during backup time?
Something like a continuous ping?
What about resource monitoring on the master? (memory, cpu, network)
Have you enabled any kind of logs that may indicate 'emm heartbeat' timeouts?
See this TN for logs that are needed to troubleshoot emm comms:
https://www.veritas.com/support/en_US/article.100021074
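The continuous monitoring suggested above could be sketched roughly as follows. This is a minimal Python sketch, not a NetBackup tool: it repeatedly opens a TCP connection to one port and logs a timestamped OK/FAIL line, so that failures can be lined up against job failure times. The function names are made up for illustration, and the host and port are whatever you choose to watch.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (ok, seconds_taken, error_text)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, ""
    except OSError as exc:  # covers refused, timed out and DNS failures
        return False, time.monotonic() - started, str(exc)


def monitor(host, port, count, interval=10.0, log=print):
    """Probe `count` times, log one line per probe, return the failure count."""
    failures = 0
    for attempt in range(count):
        ok, elapsed, error = probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        status = "OK" if ok else "FAIL"
        log(f"{stamp} {host}:{port} {status} {elapsed * 1000:.0f}ms {error}")
        if not ok:
            failures += 1
        if attempt + 1 < count:
            time.sleep(interval)
    return failures
```

For example, `monitor("master.example.com", 1556, count=360, interval=10.0)` would probe every 10 seconds for about an hour. Here `master.example.com` is a placeholder, and 1556 is assumed to be the PBX port in use (it is the NetBackup default).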
01-03-2018 11:02 PM
One more thing - status 23 and 24 are 100% network-related issues.
Here is a good explanation of these errors:
https://vox.veritas.com/t5/NetBackup/Error-23/m-p/738836#M201891
Has anyone EVER encountered a network team who were prepared to admit that the problem might be network-related?
Oh! And hopefully you are aware that your NBU version ran out of support about a year ago?
Not that the NBU version is causing the problems, it just means that you cannot log a support call with Veritas.
Support has tools to assist with network troubleshooting.
01-03-2018 11:58 PM
Has anyone EVER encountered a network team who were prepared to admit that the problem might be network-related?
Nope, I don't think so ....
01-04-2018 06:16 AM
Hi all,
Thanks for the replies. Both NICs on the master server/media server are now changed to 1000 FULL on the server and switch side (from auto). Since failures occur with so many clients, we will check those as a future test. I also cleared the cache as advised.
Regarding the other questions:
- Master and media server are same box
- Using DNS and it appears to be working fine
- We have 2008 and 2012 in our environment
- I've noticed no load issues during the failures or as backups are running.
- OS updates are current.
- Regarding network troubleshooting: so far we have only verified whether there are any port errors, and we've tried some different teaming settings. I mentioned the current status of that above. We have changed the master timeout settings (I know, probably ill-advised). Here are the current settings:
Client connect: 3600
Client read: 3600
Backup start...: 300
Backup end...: 300
File browse: 1800
Media server connect: 30
Use OS dependent: "not checked"
- No continuous monitoring yet, partly because of how intermittent this issue is. I am often looking at the console as jobs fail and have checked connectivity; there never seem to be any visible issues at the time.
- No logs enabled yet, is your "heartbeat" suggestion still valid if master/media is same box?
01-04-2018 11:32 PM - edited 01-04-2018 11:51 PM
Thanks for the additional information:
.... Both NICs on the master server/media server ....
Master and media server are same box
All very important info that we did not know about.
The 2 NICs - can we assume that they are not bonded and installed for different purposes?
Do they have IP addresses on different subnets/VLans?
And linked to different hostnames?
This is important as the master/media also communicates between different processes via TCP/IP ports. Even though it's on the same server.
If the IPs for the 2 NICs are on the same subnet and linked to the same hostname, then TCP/IP is going to 'round robin' between the NICs/IPs, causing major issues: for example, a request goes out on one IP and the response comes back from the other IP, while the sender is still expecting the response on the original IP.
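As a rough illustration of the "same subnet" check described above, here is a minimal Python sketch. It assumes you have typed in each adapter's IPv4 address and subnet mask by hand (for example from ipconfig output); the adapter names and the documentation-range addresses in the usage note are placeholders, and `shared_subnets` is a made-up helper name.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(adapters):
    """Return {network: [adapter names]} for networks used by 2+ adapters.

    `adapters` maps an adapter name to an (ipv4_address, netmask) pair.
    """
    by_network = defaultdict(list)
    for name, (address, netmask) in adapters.items():
        # ip_interface accepts "address/netmask" and derives the network
        interface = ipaddress.ip_interface(f"{address}/{netmask}")
        by_network[interface.network].append(name)
    return {
        str(network): names
        for network, names in by_network.items()
        if len(names) > 1
    }
```

With `{"NIC 1": ("192.0.2.10", "255.255.255.0"), "NIC 2": ("192.0.2.11", "255.255.255.0")}` as input, both adapters are reported under the network 192.0.2.0/24, which is the situation to watch out for. An empty result means no two adapters share a subnet.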
Please confirm config on the master/media with the following commands:
ipconfig /all
bpclntcmd -self (in ...\netbackup\bin)
nbemmcmd -listhosts -all (in ...\netbackup\bin\admincmd)
nbemmcmd -getemmserver
The logs will be handy to trace internal comms.
We once picked up an error with NIC config on a media server via admin log on the master server (request going to one IP on the media server and response coming back from different IP).
I also remember some time back where @RiaanBadenhorst picked up an issue with 2 NICs on a master and PBX comms... will see if I can find it.
**** Found it ****
https://vox.veritas.com/t5/NetBackup/PBX/m-p/418924
01-05-2018 05:55 AM
Hi, so...
The 2 NICs - can we assume that they are not bonded and installed for different purposes?
Correct. Only 1 IP address is configured. These are HP servers using HP's network teaming software to handle how the NICs work together. This software is used on all master/media servers in our enterprise and we do see a few of these status failures from time to time, but nothing like this. I have attached a screenshot for more info. As a test we could dissolve the team and ensure only 1 of the 2 could ever be used, but I would expect the current "NFT" config we have is effectively doing the same thing.
Do they have IP addresses on different subnets/VLans?
And linked to different hostnames?
No.
ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : US010099
Primary Dns Suffix . . . . . . . : schaeffler.com
Node Type . . . . . . . . . . . . : Peer-Peer
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : root.hld
schaeffler.com
de.ina.com
emea.ina.com
emea.**bleep**.com
emea.luk.com
na.luk.com
na.**bleep**.com
na.ina.com
sa.ina.com
ap.ina.com
ap.**bleep**.com
ina.com
luk.com
**bleep**.com
lat-suhl.de
Ethernet adapter Local Area Connection 5:
Connection-specific DNS Suffix . : schaeffler.com
Description . . . . . . . . . . . : HP Network Team #1
Physical Address. . . . . . . . . : 2C-44-FD-81-14-18
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::61c4:419c:fecb:91af%19(Preferred)
IPv4 Address. . . . . . . . . . . : 10.217.2.198(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Lease Obtained. . . . . . . . . . : Tuesday, January 02, 2018 1:30:04 PM
Lease Expires . . . . . . . . . . : Monday, February 11, 2154 3:16:50 PM
Default Gateway . . . . . . . . . : 10.217.2.254
DHCP Server . . . . . . . . . . . : 10.217.2.205
DHCPv6 IAID . . . . . . . . . . . : 506217725
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-20-2C-D8-B5-2C-44-FD-81-14-18
DNS Servers . . . . . . . . . . . : 10.217.2.205
10.217.2.218
10.216.2.246
Primary WINS Server . . . . . . . : 10.217.2.193
Secondary WINS Server . . . . . . : 10.160.160.73
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter Local Area Connection 4:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : HP Ethernet 1Gb 4-port 331FLR Adapter #4
Physical Address. . . . . . . . . : 2C-44-FD-81-14-1B
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Ethernet adapter Local Area Connection 3:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : HP Ethernet 1Gb 4-port 331FLR Adapter #3
Physical Address. . . . . . . . . : 2C-44-FD-81-14-1A
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Tunnel adapter isatap.schaeffler.com:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : schaeffler.com
Description . . . . . . . . . . . : Microsoft ISATAP Adapter
Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Tunnel adapter isatap.{2A5119FB-C5EE-4556-AED2-5105BAAF1105}:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft ISATAP Adapter #2
Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Tunnel adapter isatap.{FB136663-96B7-4A31-87AF-1AE45EB73772}:
Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft ISATAP Adapter #3
Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
bpclntcmd -self
gethostname() returned: nbuus002
host nbuus002: us010099.schaeffler.com at 10.217.2.198
aliases: us010099.schaeffler.com nbuus002 10.217.2.198
getfqdn(nbuus002) returned: us010099.schaeffler.com
nbemmcmd -listhosts -all
NBEMMCMD, Version: 7.6.0.3
The following hosts were found:
server nbuus002
master nbuus002
virtual_machine de012418vc.schaeffler.com
Command completed successfully.
nbemmcmd -getemmserver
NBEMMCMD, Version: 7.6.0.3
These hosts were found in this domain: nbuus002
Checking with the host "nbuus002"...
Server Type Host Version Host Name EMM Server
MASTER 7.6 nbuus002 nbuus002
Command completed successfully.
01-05-2018 05:57 AM
Whoops, forgot the attachment...
01-07-2018 08:40 AM - edited 01-07-2018 08:42 AM
Is it possible AV could be a factor? We have the recommended excludes set but in the past we had success with 636 errors when we removed it as a test. And I'm also seeing 636 again this morning along with the others already mentioned.
01-08-2018 12:19 AM
I doubt that AV will cause these errors. Even 636 is network-related.
It is recommended though to exclude NBU binaries/services from AV scanning.
Have you had a chance to go through @RiaanBadenhorst's post as suggested a few days ago?
Have you enabled logs as suggested?
In all honesty - I am not a network expert, but one thing I know is that IPv6 on 7.6 is not supported.
Is it possible to disable IPv6 and DHCP?
Something else that I have noticed is that the hostname and the NBU name are different:
Host Name . . . . . . . . . . . . : US010099
host nbuus002: us010099.schaeffler.com at 10.217.2.198
Do you have a hosts entry for host nbuus002?
10.217.2.198 nbuus002 us010099.schaeffler.com
If not, please do so.
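To double-check that the hosts entry is in place, something like the following minimal Python sketch could be used. It only parses hosts-file text that you pass in (on Windows the file normally lives at %SystemRoot%\System32\drivers\etc\hosts) and reports which names are not mapped to the expected address. `parse_hosts` and `names_not_mapped` are made-up helper names.

```python
def parse_hosts(text):
    """Map each lower-cased host name in hosts-file text to its first address."""
    names_to_address = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # The first entry wins, as it does when the OS reads the file
            names_to_address.setdefault(name.lower(), address)
    return names_to_address


def names_not_mapped(text, expected_address, names):
    """Return the names that do not map to expected_address in the hosts text."""
    mapping = parse_hosts(text)
    return [n for n in names if mapping.get(n.lower()) != expected_address]


SAMPLE_HOSTS = """
# Entry suggested in this thread
10.217.2.198 nbuus002 us010099.schaeffler.com
"""
```

Calling `names_not_mapped(SAMPLE_HOSTS, "10.217.2.198", ["nbuus002", "us010099.schaeffler.com"])` returns an empty list, meaning both names point at the expected address. Any name that comes back in the list still needs an entry.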
And about status 636 -
Adjusting KeepAlive normally solves the problem.
Ensure master and all clients have the same setting.
See:
https://vox.veritas.com/t5/NetBackup/Error-636/td-p/770527
https://vox.veritas.com/t5/NetBackup/Having-trouble-with-636-status-code/m-p/654987#M170327
https://www.veritas.com/support/en_US/article.TECH202675
01-08-2018 01:43 PM - edited 01-08-2018 01:46 PM
Hi, yes I read his thread but it didn't appear relevant since he was using multiple IP addresses...?
To be clear you are suggesting to create the ltid log, correct? Looks like I need to create the debug folder as well if it doesn't exist?
I disabled IPv6 but, oddly enough, when I reran a bunch of jobs I almost immediately got the status 77 popup and nothing is currently backing up. bpbrm seems hung even after the services stop, and I had to kill the processes manually.
B:\VERITAS\NetBackup\bin>bpps
* US010099 1/08/18 16:25:46.764
COMMAND PID LOAD TIME MEM START
NbWin 6944 0.000% 3.681 29M 1/03/18 13:33:49.316
bpbrm 3964 0.000% 0.296 17M 1/05/18 19:00:03.270
bpbrm 8672 0.000% 0.140 17M 1/05/18 21:35:36.467
bpbrm 6084 0.000% 0.421 17M 1/05/18 23:30:02.187
bpbrm 10664 0.000% 0.202 17M 1/06/18 19:00:05.624
bpps 4816 0.000% 0.218 6.7M 1/08/18 16:25:45.532
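A listing like the one above can be checked for leftover processes automatically. The following is a minimal Python sketch, assuming the bpps column layout shown above (COMMAND PID LOAD TIME MEM START) and the m/dd/yy date format. `parse_bpps` and `stale_processes` are made-up helper names, not NetBackup commands. It flags bpbrm processes that started more than 24 hours before the snapshot time in the header line.

```python
from datetime import datetime, timedelta

BPPS_TIME_FORMAT = "%m/%d/%y %H:%M:%S.%f"

SAMPLE = """\
* US010099 1/08/18 16:25:46.764
COMMAND PID LOAD TIME MEM START
NbWin 6944 0.000% 3.681 29M 1/03/18 13:33:49.316
bpbrm 3964 0.000% 0.296 17M 1/05/18 19:00:03.270
bpbrm 8672 0.000% 0.140 17M 1/05/18 21:35:36.467
bpbrm 6084 0.000% 0.421 17M 1/05/18 23:30:02.187
bpbrm 10664 0.000% 0.202 17M 1/06/18 19:00:05.624
bpps 4816 0.000% 0.218 6.7M 1/08/18 16:25:45.532
"""


def parse_bpps(text):
    """Return (snapshot_time, [(command, pid, start_time), ...])."""
    snapshot = None
    processes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0] == "*":
            # Header line: "* HOST date time"
            snapshot = datetime.strptime(" ".join(fields[2:4]), BPPS_TIME_FORMAT)
        elif len(fields) == 7 and fields[1].isdigit():
            # Process line: the last two fields are the start date and time
            start = datetime.strptime(" ".join(fields[5:7]), BPPS_TIME_FORMAT)
            processes.append((fields[0], int(fields[1]), start))
    return snapshot, processes


def stale_processes(text, command="bpbrm", max_age=timedelta(hours=24)):
    """Return [(pid, age)] for `command` processes older than max_age."""
    snapshot, processes = parse_bpps(text)
    if snapshot is None:
        raise ValueError("no '* HOST date time' header line found")
    return [
        (pid, snapshot - start)
        for name, pid, start in processes
        if name == command and snapshot - start > max_age
    ]
```

Run against the output above, `stale_processes(SAMPLE)` flags all four bpbrm processes (PIDs 3964, 8672, 6084 and 10664), each of which started more than a day before the listing was taken.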
We use that nbu alias in many locations. I have added the line to the hosts file as suggested.
Keepalive is currently at 300000 from previous troubleshooting of 636. Are you suggesting we try 900000? Also, it's only set on the media server at the moment, per that TN.
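To compare keepalive values across hosts, a small Python sketch like the one below may help. It assumes the values were collected by hand from the Windows KeepAliveTime registry value (which is in milliseconds) on each host, and that a host without the value falls back to the Windows default of 7200000 ms (2 hours). The host names in the usage note are placeholders and the function names are made up.

```python
WINDOWS_DEFAULT_KEEPALIVE_MS = 7_200_000  # 2 hours, used when the value is absent


def to_minutes(milliseconds):
    """Convert a KeepAliveTime value in milliseconds to minutes."""
    return milliseconds / 60_000


def effective_keepalive(settings):
    """Replace None (value not set) with the assumed Windows default."""
    return {
        host: WINDOWS_DEFAULT_KEEPALIVE_MS if value is None else value
        for host, value in settings.items()
    }


def mismatched_hosts(settings, reference_host):
    """Return {host: value_ms} for hosts whose value differs from the reference."""
    effective = effective_keepalive(settings)
    reference = effective[reference_host]
    return {host: value for host, value in effective.items() if value != reference}
```

For instance, with `{"master": 300000, "client-a": None}` and `"master"` as the reference, `client-a` is reported at 7200000 ms. For scale, `to_minutes(300000)` is 5 minutes and `to_minutes(900000)` is 15 minutes.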
01-09-2018 06:01 AM
Circling back a bit... if we take this repeated error at face value, it's saying that NetBackup loses sight of the device list from the EMM server, which is itself when one server is both master and media server? Even though the resultant job errors are "network errors", doesn't this point to some sort of internal communication/connection issue?
01-09-2018 06:48 AM
@weigojmi wrote:
Circling back a bit... if we take this repeated error at face value, it's saying that NetBackup loses sight of the device list from the EMM server, which is itself when one server is both master and media server? Even though the resultant job errors are "network errors", doesn't this point to some sort of internal communication/connection issue?
Yes, as per my post of a couple of days ago:
.... the master/media also communicates between different processes using TCP/IP ports. Even though it's on the same server.
There is something in TCP config on the master/media server that is causing inter-process comms to get confused/lost.
I have tried making some suggestions with my very limited networking knowledge.
If nobody else can contribute further, your only other option is to enable logging and see if that helps in any way.
ltid logging needs directories to be created and a VERBOSE entry in vm.conf, followed by a restart of the NBU Device Management Service.
Unified logs such as nbemm need their logging level to be increased with the vxlogcfg command. The logs then need to be read with the vxlogview command.