cancel
Showing results for 
Search instead for 
Did you mean: 

NetBackup 8.3 - Active Directory GRT is Broken

Hi All,

Just found another bug in NetBackup 8.3 related to AD GRT jobs. Apparently the image duplication phase where the real mapping should happen between the Backup Image and AD is failing miserably. I noticed couple of things while I was observing the entire process. For the moment I'm not sure if the issue is with the NBU client only but I will make few more tests including a client downgrade to rule out possibilities.

1. When the initial "System State" backup is being taken from a Domain Controller, the jobs sometimes fail with the following message:

error encountered while attempting to get additional files for system state:\

Obviously there is also an indication that initial GRT information assesment was unsuccessful. Unfortunately this behavior doesn't seem to be very consistent over the jobs so the failure seems to appear randomly. This have happened on 4 different DC's some of which are RODC. Once I managed to get a healthy image from the DC's I moved on to the duplication part.

2. During the duplication phase I noticed that the "nblbc.exe" binary is constantly consuming an entire CPU core for a while but it is not yet clear what happens in the background. The binary runs like that for about 20-30 minutes non-stop then it terminates with the following message within the "ncflbc" log folder:

0,51216,158,351,52,1596142489535,6192,1152,0:,53Smiley FrustratedpsRestoreCurrentVer:False (../ResourceChild.cpp:719),28:ResourceChildBEDS::_attach(),1
0,51216,159,351,1,1596144159553,6192,1152,0:,89:write() failed, wrote 0 of 276, error 232 (Unknown error) (../TransporterConsole.cpp:219),27:TransporterConsole::write(),1
0,51216,159,351,2,1596144159568,6192,1152,0:,256:
An Exception of type [FileWriteException] has occured at:
Module: ../TransporterConsole.cpp, Function: TransporterConsole::write(), Line: 225
File:*Console*
OS Error: 22 (The device does not recognize the command.

)
(../TransporterConsole.cpp:225),27:TransporterConsole::write(),1
0,51216,311,351,1,1596144159568,6192,1152,0:,69:exception in m_transport->write() (../BRMObserverDepreciated.cpp:407),29:BRMObserverDepreciated::write,1
0,51216,159,351,3,1596144159568,6192,1152,0:,50:Terminate Signal called . (../TfiExitEvent.cpp:43),22:TfiExitEvent::signal(),1
2,51216,309,351,128,1596144159568,6192,1152,0:,0:,0:,2,(37|)
2,51216,309,351,129,1596144159615,6192,1152,0:,0:,0:,2,(37|)
1,51216,309,351,130,1596144159615,6192,1152,0:,0:,38:ConfigDataIterator::metadataIdCallback,3,(35|A21:3:Error_Messages_Max;|A2:10|)
1,51216,309,351,131,1596144159615,6192,1152,0:,0:,38:ConfigDataIterator::metadataIdCallback,3,(35|A25:3:CLIENT_CONNECT_TIMEOUT;|A3:300|)

After about another 15 minutes the duplication job times out in NBU console with status code 191 and the following error:

Jul 30, 2020 11:37:44 PM - Error bpduplicate (pid=11388) db_IMAGE() failed: database system error (220)
Jul 30, 2020 11:37:44 PM - Error bpduplicate (pid=11388) Status = no images were successfully processed.
Jul 30, 2020 11:37:44 PM - Error bpduplicate (pid=11388) Duplicate of backupid sns01dcvm04.skynet.local_1596138967 failed, database system error (220).
Jul 30, 2020 11:37:44 PM - Error bpduplicate (pid=11388) Status = no images were successfully processed.
Jul 30, 2020 11:37:44 PM - end Duplicate; elapsed time 0:43:17

3. When I try to view the content of the images throuhg "BAR" client, I have partial success. Although I'm able to open the "System State" and "Active Directory" folders any subsequent tries further through the OU's are making the client to get stuck with "Communicating with server" messages. Those folders that I managed to open were indicated throuh respective restore jobs in NBU console but no further jobs have been generated. Once the timeout has been reached approximately 5 minutes later. another "ERROR: file read failed" message appeared on the screen.

I haven't yet turned on debugging for "bpbkar" or "ncflbc" logs but I will collect those in case NBU client 8.2 wouldn't solve these issues.

Also I noticed that it doesn't matter between which type of storage pools I would engage the deduplication. I tried Adv. Disk to Adv. Disk and also Adv. Disk to MSDP but the result is all the same. As it seems the process is stuck somewhere on the client side.

My GRT AD implementation used to work perfectly with NBU 8.2 and except for the recent upgrade to NBU 8.3 nothing else has changed in the environment.

Any suggestions on the debug logs or other aspects are highly welcomed.

Cheers.

3 Replies

Re: NetBackup 8.3 - Active Directory GRT is Broken

So this is rather strange. While I was looking at logs and processes I was retrying the duplication jobs few more times and surprisingly on the 4th time they went throuhg to a different storage device. At the same time though the "nblbc.exe" process kept doing the same behavior but this time it had only the following (supposedly error) output:

0,51216,158,351,49,1596147990576,8400,8524,0:,53Smiley Very Happyevice Name : NFS Network (../SubContextBEDS.cpp:182),14:_dumpDLEInfo(),1
0,51216,158,351,50,1596147990576,8400,8524,0:,54Smiley Very Happyevice Name : System?State (../SubContextBEDS.cpp:182),14:_dumpDLEInfo(),1
0,51216,158,351,51,1596147990576,8400,8524,0:,75Smiley Very Happyevice Name : Active Directory Application Mode (../SubContextBEDS.cpp:182),14:_dumpDLEInfo(),1
0,51216,158,351,52,1596147991154,8400,8524,0:,53Smiley FrustratedpsRestoreCurrentVer:False (../ResourceChild.cpp:719),28:ResourceChildBEDS::_attach(),1
1,51216,309,351,128,1596149160389,8400,8524,0:,0:,38:ConfigDataIterator::metadataIdCallback,3,(35|A21:3:Error_Messages_Max;|A2:10|)
1,51216,309,351,129,1596149160404,8400,8524,0:,0:,38:ConfigDataIterator::metadataIdCallback,3,(35|A25:3:CLIENT_CONNECT_TIMEOUT;|A3:300|)
0,51216,309,351,130,1596149160404,8400,8524,0:,93Smiley Tonguerocessing ProcessDebug:nblbc; using the Window's registry (../DefaultConfigProvider.cpp:345),40Smiley Very HappyefaultConfigProvider::getValueNonBpConf,1
1,51216,309,351,131,1596149160404,8400,8524,0:,0:,40Smiley Very HappyefaultConfigProvider::getValueNonBpConf,3,(36|S20:\HKEY_LOCAL_MACHINE\|S61Smiley FrustratedOFTWARE\Veritas\NetBackup\CurrentVersion\Config\ProcessDebug|S5:nblbc|D32:2|)
0,51216,309,351,132,1596149160404,8400,8524,0:,93Smiley Tonguerocessing Granular:CopyLocal; using the Window's registry (../DefaultConfigProvider.cpp:345),40Smiley Very HappyefaultConfigProvider::getValueNonBpConf,1
1,51216,309,351,133,1596149160404,8400,8524,0:,0:,40Smiley Very HappyefaultConfigProvider::getValueNonBpConf,3,(36|S20:\HKEY_LOCAL_MACHINE\|S57Smiley FrustratedOFTWARE\Veritas\NetBackup\CurrentVersion\Config\Granular|S9:CopyLocal|D32:2|)
1,51216,309,351,134,1596149161857,8400,8524,0:,0:,14:nbfs_unmount(),0,(28|A33:INF - Lets terminate the event...|)
1,51216,309,351,135,1596149161857,8400,8524,0:,0:,14:nbfs_unmount(),0,(28|A25:INF - Event terminated...|)

Also in the meantime I changed the client on one of the DC's to v8.2 however it behave literally the same way as the v8.3 clients. Right now I'm completely puzzled on what's going on.

Does anyone know what is the exact purpose of the "nblbc.exe" process and how does it relate to the GRT operations ? I assume that it is the main responsible binary to do the mapping between AD/Exchange and the backup images. What I really don't understand is the constant CPU load it does for about 20 - 30 minutes then it times out. It would be nice to know what operations are involved in the background to and from each of the involved servers. This issue now looks more like a timeout between the NBU client and the NBU servers.

I would appreciate if someone could provide some enlightenmenet on the GRT related procedures.

Cheers

 

Re: NetBackup 8.3 - Active Directory GRT is Broken

Ok so I looked a bit further at the background situation on the NBU master server and it seems that the issue might be somewhere within the "bpdbm" processes which are engaged shortly after the "nblbc" process is running on the client.

I noticed that the "bpdbm" instances kept using a lot of CPU so I looked at the logs and this is what got generated in huge amounts:

03:20:12.541 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{01E0C052-3E0E-4048-B588-1D69F857E066}
03:20:12.541 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.541 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59428 TO 192.168.12.84.13723 fd = 2036
03:20:12.541 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...
03:20:12.541 [12040.1020] <2> get_long_base: (1) cannot read (byte 1) from network: An established connection was aborted by the software in your host machine.
03:20:12.541 [12040.1020] <2> job_connect: Can't get server time
03:20:12.541 [12040.1020] <2> check_connect_time: It took 0 seconds to connect.
03:20:12.541 [12040.1020] <2> check_connect_time: Try again with 83 seconds left.
03:20:12.541 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{C269A78A-B5D9-4CD6-9F7F-D2F51BB417FA}
03:20:12.541 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.541 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59429 TO 192.168.12.84.13723 fd = 2036
03:20:12.541 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...
03:20:12.541 [12040.1020] <2> get_long_base: (1) cannot read (byte 1) from network: An established connection was aborted by the software in your host machine.
03:20:12.541 [12040.1020] <2> job_connect: Can't get server time
03:20:12.541 [12040.1020] <2> check_connect_time: It took 0 seconds to connect.
03:20:12.541 [12040.1020] <2> check_connect_time: Try again with 83 seconds left.
03:20:12.557 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{5850DF88-5291-4527-B1E6-ABDF27BD2C61}
03:20:12.557 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.557 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59430 TO 192.168.12.84.13723 fd = 2028
03:20:12.557 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...
03:20:12.557 [12040.1020] <2> get_long_base: (1) cannot read (byte 1) from network: An established connection was aborted by the software in your host machine.
03:20:12.557 [12040.1020] <2> job_connect: Can't get server time
03:20:12.557 [12040.1020] <2> check_connect_time: It took 0 seconds to connect.
03:20:12.557 [12040.1020] <2> check_connect_time: Try again with 83 seconds left.
03:20:12.557 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{C0421D7F-53D9-4C97-B858-F6634C4D30C9}
03:20:12.557 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.557 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59431 TO 192.168.12.84.13723 fd = 2028
03:20:12.557 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...
03:20:12.557 [12040.1020] <2> get_long_base: (1) cannot read (byte 1) from network: An established connection was aborted by the software in your host machine.
03:20:12.557 [12040.1020] <2> job_connect: Can't get server time
03:20:12.557 [12040.1020] <2> check_connect_time: It took 0 seconds to connect.
03:20:12.557 [12040.1020] <2> check_connect_time: Try again with 83 seconds left.
03:20:12.572 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{79A485BB-EDB5-4B73-9549-E548D74DD4CE}
03:20:12.572 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.572 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59432 TO 192.168.12.84.13723 fd = 2028
03:20:12.572 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...
03:20:12.572 [12040.1020] <2> get_long_base: (1) cannot read (byte 1) from network: An established connection was aborted by the software in your host machine.
03:20:12.572 [12040.1020] <2> job_connect: Can't get server time
03:20:12.572 [12040.1020] <2> check_connect_time: It took 0 seconds to connect.
03:20:12.572 [12040.1020] <2> check_connect_time: Try again with 83 seconds left.
03:20:12.572 [12040.1020] <4> create_user_group_id_marker_WIN: connecting to named pipe:\\.\pipe\7100_{3DEBB42D-DC38-4586-BC0F-E1846B61998C}
03:20:12.572 [12040.1020] <4> create_user_group_id_marker_WIN: successfully connected to server named pipe
03:20:12.572 [12040.1020] <2> logconnections: BPJOBD CONNECT FROM 192.168.12.84.59435 TO 192.168.12.84.13723 fd = 2036
03:20:12.572 [12040.1020] <2> job_authenticate_connection: VxSS authentication check for now...

Any ideas from anyone ?

Re: NetBackup 8.3 - Active Directory GRT is Broken

Problem is now resolved with a workaround.

Apparently NBU 8.3 seems to have issues when the master server's services are running under domain based accounts instead of LocalSystem.

This behavior was encountered during an entirely different issue with Exchange policies about which I have posted the details here:

NetBackup 8.3 - Exchange Policy Bugs 

Once I changed the master server's service accounts to run under LocalSystem (except for Messaging Broker and Web Management services), I was able to duplicate the previously taken and failing images on the first go without seeing any of the "bpdbm" and "nblbc" issues which I previously described. Also I took fresh images of the DC's and those were also replicated without any problems.

Please note that this is just a workaround so the situation is far from ideal. More investigation shall be done by the backend engineers on this rather odd behavior of the product.

Cheers