full backup failed with error code 26

aidil_1 · ‎05-20-2019

Hi all

Thanks in advance for reading this post. Any tips, ideas, or advice are highly appreciated.

environment settings
- NBU 7.7.2 on Windows 2012 R2
- 1x master server, which also acts as the media server
- all clients are VMware VMs, accessed via vCenter (Virtual Center); snapshot backups run over the LAN using NBD transport
- all backups go to disk
- daily incremental backups from Monday to Friday
- weekly\monthly\yearly full backups on Saturday
- no scheduled backups on Sunday

issue
- no issue with the daily backups
- about 50% of the full backups fail with EXIT STATUS 26 (client/server handshaking failed)
- restarting the failed jobs in stages eventually gets them to complete

From what I have read, error 26 is related to a network or communication error. When the issue occurred, I tested the network (bpclntcmd) and it turned out OK.
Later on, restarting the jobs also turned out OK. If there are too many failures, I have to restart them in stages; otherwise the result is the same: error code 26.
This suggests congestion or a bottleneck somewhere, but I have no clear idea where to look.
In the worst case, jobs get stuck and bpup/bpdown has to be done.

I logged a case with Symantec, and the engineer suspected an issue with network sockets, which I do not understand well. Hopefully the good folks out there can help me with it.
Symantec advised clearing the network socket connections, which I had to ask the NW team to do, but no logs or evidence were shared to show it was done. I am totally dependent on the NW team.
They also advised rebooting the server, which I confirmed was rebooted based on the server uptime.
The issue still keeps occurring, and it ruined my lovely weekend.

Thanks again in advance for reading this post. Any tips, ideas, or advice are highly appreciated.

Marianne · ‎05-24-2019

@aidil_1

I was hoping that one of our 'resident network experts' would pick up on this.

I am surprised that you say that you 'logged a case to Symantec' when the name changed to Veritas about 3 years ago...

Anyway - did the Support engineer ask you for bpbrm and bpcd logs ?
I would be curious to see logs for a successful connection attempt and for a failed attempt.

Just out of curiosity - if the master is also the media server, then it is also the backup host (NBU client) for the VMware backups, right?
So, which hostnames do you use for bpclntcmd testing?
bpclntcmd only tests forward and reverse name lookup; it does not test port connection.
bptestbpcd on the master/media server will test name lookup as well as port connection, but since the one server is master, media server and client, all initial comms are internal (nothing going out yet to the vCenter or ESX server).
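
The difference between a name-lookup test and a port-connection test can be illustrated with a small script. This is only a sketch and not a replacement for bpclntcmd or bptestbpcd: the function name check_host is made up for the example, and which hosts and ports to probe is an assumption that should be confirmed for your own environment.

```python
import socket


def check_host(hostname, port, timeout=5.0):
    """Forward lookup, reverse lookup, then a plain TCP connect to one port.

    Hypothetical helper for illustration only; it does not speak any
    NetBackup protocol, it only checks that the port accepts a connection.
    """
    result = {"host": hostname, "port": port, "ip": None,
              "reverse": None, "port_open": False, "error": None}
    try:
        # Forward lookup: name -> IP address
        result["ip"] = socket.gethostbyname(hostname)
    except OSError as exc:
        result["error"] = f"forward lookup failed: {exc}"
        return result
    try:
        # Reverse lookup: IP address -> name (failure is recorded, not fatal)
        result["reverse"] = socket.gethostbyaddr(result["ip"])[0]
    except OSError:
        result["reverse"] = None
    try:
        # Port test: this is the part a pure name-lookup check never does
        with socket.create_connection((result["ip"], port), timeout=timeout):
            result["port_open"] = True
    except OSError as exc:
        result["error"] = f"connect failed: {exc}"
    return result
```

For example, check_host("vcenter.example.local", 443) would test a placeholder vCenter name on HTTPS. Ports commonly cited for this kind of setup are 1556 (PBX), 13724 (vnetd) and 13782 (bpcd) on NetBackup hosts, and 902 on ESX hosts for NBD transport; treat that list as an assumption to verify against the documentation for your version.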

To trace ports that are being used on the NBU server, you can check 'netstat -a' output when backups fail. Save output to a file, and compare with same output when backups are good.
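
Comparing two saved netstat outputs by eye is tedious, so a small script can count TCP connections per state in each file and show the difference. This is a minimal sketch that assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State); the function names are made up for the example.

```python
from collections import Counter


def count_states(netstat_text):
    """Count TCP connections per state in saved 'netstat -a' output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have at least: Proto, Local Address, Foreign Address, State.
        # UDP rows have no state column and header rows do not start with TCP.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            counts[fields[3].upper()] += 1
    return counts


def compare_states(good_text, bad_text):
    """Return {state: count_in_bad - count_in_good} for every state seen."""
    good = count_states(good_text)
    bad = count_states(bad_text)
    return {state: bad[state] - good[state]
            for state in sorted(set(good) | set(bad))}
```

Read each saved file with open(path).read() and pass the two strings to compare_states. A large positive number for TIME_WAIT, CLOSE_WAIT or FIN_WAIT_2 in the failing snapshot is the kind of difference worth looking at more closely.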

Your network team should also be able to trace/monitor network comms between vCenter and ESX servers during backup window.

Handy NetBackup Links

aidil_1 · ‎05-24-2019

Thanks for the advice.

The last time I worked with NBU was early 2016, before switching to other software, and now I'm picking it up again.
Maybe at that time it was still Symantec? :)

Yes, my master is also the backup host for the VMware backups.
When a backup for client A fails, I run bpclntcmd for client A.

I'll get the logs and see if I can schedule a task to run "netstat -a" on an hourly basis.
Thanks again, will keep posting.
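
One possible way to capture the hourly snapshots is a short script that a scheduled task calls once per run, writing each output to a timestamped file. This is a sketch under assumptions: the function name, the file naming pattern and the output folder are invented for the example, and the scheduling itself is left to whatever tool you already use.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def save_snapshot(out_dir, command=("netstat", "-a"), now=None):
    """Run the command once and save its output to a timestamped text file.

    Returns the path of the file that was written.
    """
    now = now or datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"netstat_{now:%Y%m%d_%H%M%S}.txt"
    # check=True raises CalledProcessError if the command exits non-zero
    completed = subprocess.run(list(command), capture_output=True,
                               text=True, check=True)
    target.write_text(completed.stdout)
    return target
```

Calling save_snapshot with an output folder of your choice from an hourly task gives one file per run, which can then be compared between a good night and a failing one. Passing command=("netstat", "-an") skips name resolution, which usually makes the command finish faster.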

Krutons · ‎05-24-2019

Has this been an issue since the beginning or something new that you have seen slowly start to occur more often?

aidil_1 · ‎05-24-2019

hi,

Can't really tell, as I took this over as-is from someone else. It puzzles me a lot, as it mostly impacts the full (weekly\monthly\yearly) backups only. Any thoughts or ideas?

Krutons · ‎05-24-2019

How many full backups are running at a time?

During those full backups, I would run netstat -t a few times and copy the output so you can post it here.

Could you post the output of your TCP parameters on the master server?
netsh int tcp show global

aidil_1 · ‎05-24-2019

hi,

d:\Program Files\Veritas\NetBackup\MY_scr>netsh int tcp show global
Querying active state...

TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State : enabled
Chimney Offload State : disabled
NetDMA State : disabled
Direct Cache Access (DCA) : disabled
Receive Window Auto-Tuning Level : normal
Add-On Congestion Control Provider : none
ECN Capability : enabled
RFC 1323 Timestamps : disabled
Initial RTO : 3000
Receive Segment Coalescing State : enabled
Non Sack Rtt Resiliency : disabled
Max SYN Retransmissions : 2

As above. I'll collect netstat data throughout the weekend and will post once ready. Is any other input needed?

aidil_1 · ‎05-24-2019

hi krutons, marianne,

Finally got a failed backup, which helped with collecting the logs. The first restart attempt completed successfully.
As I went through the bpbrm logs, there are no doubt a lot of errors related to sockets. I need help from you and the community out there.
As for the bpcd log, it looks clean to me.
As for netstat, I'm not really sure how to read it.

Am I facing a network issue? A similar number of clients run on weekdays and weekends; the only difference is incremental versus full backup.

Attached are the logs requested; I hope they are sufficient.

Marianne · ‎05-27-2019

Can you please show us all text in Job Details for a failed job?

It will help with locating exact PIDs in the logs.

Marianne · ‎05-27-2019

@aidil_1

I am still curious about the Support call that you logged - did the Veritas engineer not ask you for higher level logs?
I noticed that bpbrm log is at the default level of 0:
bpbrm main: INITIATING (VERBOSE = 0)

I have had a look at 2 of the bpbrm logs - there is for sure an issue with internal comms on the master/media server itself - bpbrm (on the master) has difficulty connecting to bpjobd (on the master):

04:57:08.137 [5852.9540] <2> check_connect_time: Giving up trying to connect. Function elapsed time (335) > client connect timeout (300).
04:57:08.137 [5852.9540] <16> bpbrm send_event_msg_to_monitor: failed to connect to bpjobd on master server capbkup0001 (23)
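
To pull lines like these out of a large bpbrm log, a small parser can help. The sketch below assumes the layout seen in the two lines above: a timestamp, then two numbers in brackets (taken here to be process ID and thread ID), then a severity in angle brackets, then the message. Treating 16 and above as errors is also an assumption, and the function names are made up for the example.

```python
import re

# Layout assumed from the quoted lines: HH:MM:SS.mmm [pid.tid] <severity> message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> (?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    return {
        "time": match["time"],
        "pid": int(match["pid"]),
        "tid": int(match["tid"]),
        "severity": int(match["sev"]),
        "message": match["msg"],
    }


def bpjobd_failures(lines, min_severity=16):
    """Return parsed entries at or above min_severity that mention bpjobd."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["severity"] >= min_severity and "bpjobd" in entry["message"]:
            hits.append(entry)
    return hits
```

Passing an open log file to bpjobd_failures gives a list of the failed bpjobd connection attempts with their timestamps and PIDs, which can then be matched against the PIDs shown in Job Details.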

LOTS of errors like these:
async_connect: [vnet_connect.c:1771] getsockopt SO_ERROR returned 10061 0x274d
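
For reference, 10061 and 0x274d are the same number in decimal and hex, and 10061 is the Winsock code WSAECONNREFUSED (connection refused). The sketch below decodes such a line using a small hand-written table of common Winsock codes; the table is deliberately incomplete and the function name is made up for the example.

```python
import re

# A few common Winsock error codes (not a complete list)
WINSOCK_ERRORS = {
    10048: ("WSAEADDRINUSE", "address already in use"),
    10053: ("WSAECONNABORTED", "connection aborted by local software"),
    10054: ("WSAECONNRESET", "connection reset by peer"),
    10055: ("WSAENOBUFS", "no buffer space available"),
    10060: ("WSAETIMEDOUT", "connection timed out"),
    10061: ("WSAECONNREFUSED", "connection refused"),
}

SO_ERROR_RE = re.compile(r"SO_ERROR returned (\d+) (0x[0-9a-fA-F]+)")


def decode_so_error(log_line):
    """Return (code, name, meaning) for a 'getsockopt SO_ERROR returned' line."""
    match = SO_ERROR_RE.search(log_line)
    if not match:
        return None
    code = int(match.group(1))
    hex_code = int(match.group(2), 16)
    if code != hex_code:
        raise ValueError(f"decimal {code} and hex {hex_code} do not agree")
    name, meaning = WINSOCK_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return code, name, meaning
```

A refused connection generally means nothing accepted the connection on the target port at that moment, which fits bpbrm being unable to reach bpjobd on the same server.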

You will see in netstat output that there are LOTS of ports in CLOSE_WAIT, FIN_WAIT_2 (all bpjobd) and TIME_WAIT status.

I have extracted all the bpjobd connection attempts from bpbrm_194139_2.txt - see attached file.

My recommendation is to work with your Windows Admin to look at TCP tuning on the master server.
I am not a Windows or networking expert, but simply Googling 'Windows 2012 TCP error 10061' finds all sorts of Microsoft URLs with TcpTimedWaitDelay and MaxUserPort tuning suggestions.
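
As a rough illustration of why those two settings tend to be mentioned together: every closed outbound connection keeps its local port in TIME_WAIT for a while, so the number of usable ports divided by the TIME_WAIT duration gives an approximate ceiling on how many new connections per second can be sustained. The sketch below is back-of-the-envelope arithmetic only. The figures are illustrative assumptions, not measurements from this server: 16384 is the default dynamic port count on Windows Server 2008 and later (49152 to 65535), and the TIME_WAIT durations are example values.

```python
def max_connection_rate(dynamic_ports, time_wait_seconds):
    """Approximate ceiling on new outbound connections per second.

    Assumes every connection uses one dynamic port and holds it in
    TIME_WAIT for time_wait_seconds after closing.
    """
    if dynamic_ports <= 0 or time_wait_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return dynamic_ports / time_wait_seconds


def ports_held_in_time_wait(connections_per_second, time_wait_seconds):
    """Approximate number of ports sitting in TIME_WAIT at a steady rate."""
    return connections_per_second * time_wait_seconds


# Illustrative values only (assumptions, not taken from the thread)
for wait in (120, 30):
    rate = max_connection_rate(16384, wait)
    print(f"TIME_WAIT {wait:>3}s -> about {rate:.0f} new connections/second")
```

On Windows 2012 the current dynamic range can be checked with "netsh int ipv4 show dynamicport tcp"; whether port exhaustion is really what is happening here would still need to be confirmed from the netstat snapshots.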

PS:
No need for Activity Monitor output.
I think we know what's happening (or not!) from bpbrm logs.

aidil_1 · ‎05-28-2019

Hi Marianne,

Thanks for your advice. I will liaise with my Windows team then and will post the result once available.

As for the support team, we had a WebEx session with the engineer. If I recall correctly, logs were created with the necessary verbosity and then reverted by the end of the session.

Thanks all. Will update soon.

Krutons · ‎05-28-2019

I agree with Marianne: engage your Windows team and get them involved.

Kleber_Marra · ‎05-29-2019

Run the test again after adding the hostname to the hosts file.
