
netbackup client side deduplication backups fail with error 87

NavGee
Level 5

 

Client side deduplication backups fail with error 87 - any ideas?

 

04/30/2014 17:26:34 - Info bpbrm (pid=31619) starting bpbkar on client
04/30/2014 17:26:34 - connecting
04/30/2014 17:26:34 - connected; connect time: 0:00:00
04/30/2014 17:26:36 - Info bpbkar (pid=9960) Backup started
04/30/2014 17:26:36 - Info bpbrm (pid=31619) bptm pid: 31630
04/30/2014 17:26:36 - Info bptm (pid=31630) start
04/30/2014 17:26:37 - Info bptm (pid=31630) using 262144 data buffer size
04/30/2014 17:26:37 - Info bptm (pid=31630) using 30 data buffers
04/30/2014 17:26:37 - Info basxtsprdbcka01 (pid=31630) Using OpenStorage client direct to backup from client basxpsprdfpsv01.be.xchanginghosting.com to basxtsprdbcka01
04/30/2014 17:26:39 - begin writing
04/30/2014 17:27:29 - Info bpbkar (pid=9960) change journal NOT enabled for <E:\>
04/30/2014 20:14:21 - Critical bptm (pid=31630) sts_close_handle failed: 2060022 software error
04/30/2014 20:14:21 - Critical bptm (pid=31630) cannot write image to disk, media close failed with status 2060022
04/30/2014 20:14:21 - Info basxtsprdbcka01 (pid=31630) StorageServer=PureDisk:basxtsprdbcka01; Report=PDDO Stats for (basxtsprdbcka01): scanned: 358440501 KB, CR sent: 132515037 KB, CR sent over FC: 0 KB, dedup: 63.0%, cache hits: 10793 (0.3%)
04/30/2014 20:14:21 - Critical bptm (pid=31630) sts_close_server failed: error 2060005 object is busy, cannot be closed
04/30/2014 20:14:24 - Info bptm (pid=31630) EXITING with status 87 <----------
04/30/2014 20:14:25 - Info bpbkar (pid=9960) done. status: 87: media close error
04/30/2014 20:14:25 - end writing; write time: 2:47:46
media close error  (87)
 

1 ACCEPTED SOLUTION

NavGee
Level 5

Details from last night's failure with Accelerator enabled as well as client-side deduplication.





05/06/2014 18:40:26 - Critical bptm (pid=19164) sts_close_server failed: error 2060005 object is busy, cannot be closed
05/06/2014 18:40:28 - Info bptm (pid=19164) EXITING with status 87 <----------
05/06/2014 18:56:00 - Error bpbrm (pid=19146) [ERROR][proxy_open_server_v7]CORBA::SystemException is caught in proxy_open_server_v7, minor = 1413546772, status = 2, info = system exception, ID 'IDL:omg.org/CORBA/COMM_FAILURE:1.0'.TAO exception, minor code = 14 (failed to recv request response; ENOTSUP), completed = MAYBE
05/06/2014 18:56:00 - Error bpbrm (pid=19146) libsts opensvh() 14/05/06 18:56:00: v11_open_server failed in plugin /usr/openv/lib/libstspinbostpxy.so err 2060057
05/06/2014 18:56:00 - Error bpbrm (pid=19146) sts_open_server failed: error 2060057
05/06/2014 18:56:00 - Error bpbrm (pid=19146) could not send server status message
05/06/2014 18:56:01 - Info bpbkar (pid=10396) done. status: 87: media close error
05/06/2014 18:56:01 - end writing; write time: 0:55:15
media close error (87)


Will update the worker threads value to 128 next.


25 REPLIES

SymTerry
Level 6
Employee Accredited

Hello. Offhand, please make sure that you have enough drive space for logging on the media server.

The logs of interest here are bptm and bpdm. Try running the backup again with the logging turned up, then review the logs and see what they say.
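If it helps, those legacy logs are only written once the log directories exist on the media server. A minimal sketch, assuming a UNIX media server (the VERBOSE level here is just my suggestion - tune it as needed and turn it back down afterwards):

# bptm/bpdm only write legacy logs if these directories exist on the media server
mkdir -p /usr/openv/netbackup/logs/bptm
mkdir -p /usr/openv/netbackup/logs/bpdm
# Raise the legacy logging verbosity (0-5)
echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf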

Mark_Solutions
Level 6
Partner Accredited Certified

Well, it writes plenty (130GB+) but then has an issue closing off the files - and this is after 2 hours 45 minutes - so there doesn't seem to be much communication going on in between.

Maybe worth setting a smaller storage unit fragment size (5000MB is what I use - see the command sketch below) and/or using checkpoint restart, so that it keeps the communication going regularly and avoids any timeouts.

And of course make sure the appliance is not overloaded / suffering from huge processing queues etc.
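For reference, the fragment size can also be changed from the command line with bpsturep. A sketch only - "dp_stu" is a placeholder for your actual storage unit label, and -mf sets the maximum fragment size in MB:

# Set the storage unit's maximum fragment size to 5000MB
/usr/openv/netbackup/bin/admincmd/bpsturep -label dp_stu -mf 5000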

 

NavGee
Level 5

Hi Mark, I've made your recommended changes and am currently rerunning the backup.


 

Mark_Solutions
Level 6
Partner Accredited Certified

All media close errors, which tends to imply something is not quite right - plenty has been backed up before the failure, so try the worker threads value, which should help, and give it a reboot once you have made the change so that it is fully registered.

Also run an All Log Entries report from overnight (or the command-line equivalent sketched below) to see if your disk pool is having any issues (going UP and DOWN quietly in the background) - it may need a DPS value adding / changing if that is the case.
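If you prefer the command line, bperror should give you the same data as the All Log Entries report; a sketch, with the 24-hour window as an assumption to adjust:

# All log entries from the last 24 hours, run on the master server
/usr/openv/netbackup/bin/admincmd/bperror -all -hoursago 24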

NavGee
Level 5

Hi Mark, I've just put some more info up but it's currently held by the site moderator.

 

Regards

 

Mark_Solutions
Level 6
Partner Accredited Certified

Try again but attach it as a text file - the paste you have done does not display properly and makes the thread too large - I need to remove it.

NavGee
Level 5

Please find all log entries attached to this post.

Regards

 

Mark_Solutions
Level 6
Partner Accredited Certified

Thanks - not going down, but it does look like things are struggling, so it is worth updating the worker threads to 128.

On your master and appliance, do you have any DPS* files under /usr/openv/netbackup/db/config/?

If so, what are they and what values do they have in them?

I recommend having only DPS_PROXYDEFAULTRECVTMO in there, with a value of 800 in it - a sketch of setting that follows.
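On a plain master server that is just a one-line touch file; a sketch (the value is the receive timeout, in seconds as far as I know - and note from the listing below that on the appliances the file is a symlink into /opt/NBUAppliance/config/, so edit the target there):

# Create / overwrite the touch file with an 800 second receive timeout
echo 800 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
cat /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO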

NavGee
Level 5

lrwxrwxrwx 1 root root  45 Aug 15  2013 DEFERRED_IMAGE_LIMIT -> /opt/NBUAppliance/config/DEFERRED_IMAGE_LIMIT
lrwxrwxrwx 1 root root  48 Aug 15  2013 DPS_PROXYDEFAULTRECVTMO -> /opt/NBUAppliance/config/DPS_PROXYDEFAULTRECVTMO
lrwxrwxrwx 1 root root  45 Aug 15  2013 LIFECYCLE_PARAMETERS -> /opt/NBUAppliance/config/LIFECYCLE_PARAMETERS
-rw-r--r-- 1 root root 343 Dec  6 23:32 behavior
-rw------- 1 root root 264 Oct 21  2013 dc
drwxr-xr-x 2 root root  96 May  7 06:00 shm
-rw-rw-rw- 1 root root 700 Dec  6 15:04 user_retention
-rw-rw-rw- 1 root root 700 Dec  6 15:04 user_retention.bak
maintenance-!> more DPS_PROXYDEFAULTRECVTMO
800
maintenance-!>
 

NavGee
Level 5

Mark, what's the file that needs to be updated for worker threads, mate?

Mark_Solutions
Level 6
Partner Accredited Certified

/disk/etc/puredisk/contentrouter.cfg

The WorkerThreads value needs changing from 64 to 128, and then reboot to make sure it is fully registered.
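A quick way to check and change it; a sketch that assumes the key appears as WorkerThreads=64 on its own line - take a copy of the file first:

# See where the value lives, keep a backup, then bump it to 128
grep -n WorkerThreads /disk/etc/puredisk/contentrouter.cfg
cp /disk/etc/puredisk/contentrouter.cfg /disk/etc/puredisk/contentrouter.cfg.bak
sed -i 's/^WorkerThreads=64$/WorkerThreads=128/' /disk/etc/puredisk/contentrouter.cfg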

NavGee
Level 5

Hi Mark

Worker threads updated on all 4 appliances. I can't understand it - we are only testing with 4 client server backups, 1 NDMP backup and replication, yet we have a resourcing issue with the appliances. Any thoughts, Mark?

Currently running a test backup.

 

NavGee
Level 5

 

We seem to get quite a few of these types of errors; I believe they are disk related. Could this affect the performance of the disk pools?

Compute Node actxtsprdbcka02

 

Time Monitoring Ran: Wed May 7 2014 16:09:03 BST

 

+-----------------------------------------------------------------------------------------+
|                                    RAID Information                                     |
|+---------------------------------------------------------------------------------------+|
||ID|Name|Status |Capacity| Type |Disks|  Write  |Enclosure|HotSpare | State |Acknowledge||
||  |    |       |        |      |     | Policy  |   ID    |Available|       |           ||
||--+----+-------+--------+------+-----+---------+---------+---------+-------+-----------||
||  |    |       |        |      |1 2 3|         |         |         |       |           ||
||  |    |       |        |      |4 5 6|         |         |         |       |           ||
||2 |VD-1|Optimal|35.469TB|RAID-6|7 8 9|WriteBack|41       |no       |Warning|No         ||
||  |    |       |        |      |10 11|         |         |         |       |           ||
||  |    |       |        |      |12 13|         |         |         |       |           ||
||  |    |       |        |      |14 15|         |         |         |       |           ||
|+---------------------------------------------------------------------------------------+|
+-----------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------------+
|                                Enclosure 2 Disk Information                                 |
|+-------------------------------------------------------------------------------------------+|
||ID| Slot |   Status   |Foreign|Firmware|Serial |Capacity|Type|Enclosure| State |Acknowledge||
||  |Number|            | State |Version |Number |        |    |   ID    |       |           ||
||--+------+------------+-------+--------+-------+--------+----+---------+-------+-----------||
||  |      |Unconfigured|       |        |       |        |    |         |       |           ||
||16|16    |(good), Spun|None   |A222    |HITACHI|2.727TB |SAS |41       |Warning|No         ||
||  |      |down        |       |        |       |        |    |         |       |           ||
|+-------------------------------------------------------------------------------------------+|
+---------------------------------------------------------------------------------------------+

Mark_Solutions
Level 6
Partner Accredited Certified

Yes - bad disks really affect performance.

The first one looks like a replaced disk that has not been brought back online, so it is still in WriteBack mode.

Log a support call and get them to look at it.

Do a DataCollect from the CLISH and get a copy of it ready (via Support - Logs - Share Open) as it is the first thing that they will ask for.
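From memory the rough sequence in the CLISH is as below - treat the exact menu names as an assumption and adjust to your appliance version:

# From the appliance CLISH, logged in as admin
Main_Menu > Support > DataCollect
# Then open the logs share to copy the output off the appliance
Main_Menu > Support > Logs > Share Open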

NavGee
Level 5

Will do, as we seem to be getting these hardware issues on all 4 appliances.

 

 

Regards

 

Mark_Solutions
Level 6
Partner Accredited Certified

What does the Web GUI show?

NavGee
Level 5

Checked the web GUI - all is OK apart from 1 appliance at the remote site, which has a failed disk.

Mark_Solutions
Level 6
Partner Accredited Certified

OK - that one will need fixing, but do get the others checked. Spun down is usually OK as it can be the hot spare disk, but the first one looks odd to me.

NavGee
Level 5

Yes, the first one is the one with the problem. From what I can tell, disk 16 is the hot spare; however, in this case it has a cross against it with "Unconfigured (good), Spun down".

The rest are fine.

Backup failed again with error 87 and, for the first time, error 40.