
Sharepoint restore performance

Steppan
Level 4

Hi, we have some SP2013 and SP2010 farms here, and there's something I'd like to check with those of you who back up SharePoint in your NetBackup infrastructure.

Our main SP2013 site has 3 main folders, as follows:

Folder ABC: ~40.000 items
Folder XYZ: ~241.000 items
Folder DEF: ~60.000 items

I'm restoring one file inside a subfolder of folder DEF (which has ~60k items). It's taking 4h to restore just a few files of no more than 1 MB.

The SP2013 farm consists of:

2 SharePoint servers (let's call them SP-A and SP-B). Their database is in an instance named SQLAX, which runs on one node of our SQL Failover Cluster (let's call the nodes SQL-A and SQL-B). This instance is running on the SQL-A node.

On the other hand, we restored an XLSX worksheet with its 30 versions from our SP2010 farm, and that took 1h55min.

Our SP2010 farm consists of 1 SharePoint server (let's call it SP2010), and its database is in an instance named MSSQL, which runs on one node of our SQL Failover Cluster. This instance is running on the SQL-B node.

The SP2010 farm has lots of folders, with a total of 549.885 documents. The XLSX I restored was alone in its folder. Why does it take so long to restore? How does it work? What are your restore times for SharePoint objects?

19 REPLIES 19

watsons
Level 6

Are you using GRT (granular restore) for the file-level recovery?

Netbackup version?

What kind of storage (disk or tape) are you backing up to?

Is it only SharePoint restores that have this issue? What about other types of data?

Yes, I'm trying to restore using GRT. NB 8.0

All images are backed up to a Dell DR4100 (OST) disk appliance. No tape for now (the SLP is configured to duplicate, but only a few months from now).

All other restores are pretty quick. Filesystem, VMware (even GRT inside VMs), Exchange, SQL. Everything is fine. Only SharePoint takes ages.

 

Can you share the restore job detailed status?

Tape_Archived
Moderator
   VIP   

"I think" this may not be related to NetBackup. While the restore is running, SharePoint is working on creating those thousands of records that you are restoring, and that's what is actually taking the time.

I'm starting a new restore now.

Just so you can compare: I started a Full job on April 12. It took 37h to complete. Please see the attached image.
The restore I'm starting now is from this job. Let's see what happens.

Is this right: 30Gb (30.412.919) takes over 37hrs to back up?

Something seems very wrong.  45Gb here and backup completes in 10mins.

Maybe the client is in dire need of resources (CPU & memory). 

Really annoying. :(

3 hours have passed since I started the restore. Take a look at the attached picture.

The two servers that make up this SP2013 farm don't seem to need more resources. Both are very low on resource use (disk, CPU, memory, etc.).

You have 45Gb in 10 minutes. Are you using GRT? Are your databases on the same server as the SP application or on a separate database server? How many files do you have? How long does it take to restore a single file?

No GRT; DB on separate server. Rarely had restore request but full database (few Gb) restores in minutes. 

Do you have a list that has a multi-level folder hierarchy?

(I hope this question makes sense to you. I'm repeating words I found in an old customer escalation.)

Ah.. so the backup was slow as well!

A slow backup usually means a slower restore.

Can you please share the job detailed status of jobID 32795? That might give us some hints.

Unfortunately I don't have the detailed information. Someone deleted the job from the Activity Monitor (I think it was the Veritas partner who is implementing NBU). :(

Is there any other way to see the detailed information now that the job is no longer in the Activity Monitor?

Just to update you: I was reading the SharePoint backup documentation, and it says we have to set up separate jobs for each type of backup. If I understand correctly, I need one job for everything (this job can have Full and Incremental schedules, but can't use GRT) and a separate job just for the site (like Microsoft Sharepoint Resources:\intranet.yourcompany.com), where I can select GRT but can only schedule fulls, not incrementals.

The odd thing is: I have 3 farms. 2 of them (SP2010) have "incorrectly" configured policies, with everything selected (Microsoft Sharepoint Resources:\) and GRT enabled, and the backups run OK. They need around 2h to finish a GRT backup of ~40Gb. The other farm (SP2013), whose backup ran for 37h, was using this type of selection too: everything checked, with GRT. Now we've split it in two: one Full/Incr policy with everything and no GRT (which we already ran; it took 10min to finish) and another policy with just the site where the docs are.

This second policy, just for the site with GRT, has been running for 1h (it's 8:46am now, we started it at 7:46) and it has stalled at "begin Sharepoint Granular Resolver: Policy Execution Manager Preprocessed" with nothing saved and 0% complete.

Thank you!

 

Just to update you guys: we found the problem. The SQL node that hosts the SharePoint databases had a lot of hung bpbkar32 and bphdb processes. We moved the SQL instance to the other node and now the backup is running a lot faster (it's still running, but so far it has taken 25min to back up 25Gb).

This is the second time this has happened during the NBU implementation. The first time was on our Exchange 2013 servers. I've been searching here and elsewhere for ways to deal with NetBackup processes hanging on servers after backups. It seems that even Veritas doesn't know how to do that. We have an engineer working with us, along with many open cases (around 6 cases for different issues, just in the implementation phase). When we asked him how to deal with this, he said "try a reboot". Let's be honest, you just can't reboot production servers in a big enterprise whenever the backup software hangs. The engineer doesn't even know how to track down why the processes hang. His answer was "because the gods wanted it to".

Does anyone know how to trace errors in these 2 processes and how to kill them without a reboot? I've found people asking about this in a dozen forums since 2001. I really don't want to believe that, 16 years later, Veritas still hasn't developed a way to kill those processes and keeps telling customers that a "reboot will work".

Thanks!

There's something more complicated going on here than what I'm reading. To kill a client backup process (bpbkar32, bphdb), just kill it from Task Manager or an equivalent tool. If the process is associated with a current job, that job will fail with a status in the 20's or 40's. I concede that it might be difficult to kill a NetBackup process by canceling the job that started it: if the process is truly hung, it's not likely to be responding to input.

To find out why the process is still running after you expect it to have ended, look in the client log. For the particular pid, look at its first log entries to find out why it was called, and at its last entries to find out what it's doing now. Look for the client pid in the bpbrm log on the media server to find out why bpbkar32 or bphdb was called. (Note that Windows re-uses pids, so not every instance of the pid you find will belong to the process of interest.)
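For what it's worth, the "first and last entries for a pid" check can be scripted. This is only a rough sketch: it assumes the log lines look like the bpbrm excerpt quoted later in this thread (`HH:MM:SS.mmm [pid.tid] <severity> function: message`), and the sample lines and the `example_func` name are made up for illustration, not real NetBackup output.

```python
import re

# Assumed layout, based on the bpbrm lines quoted later in this thread:
#   HH:MM:SS.mmm [pid.tid] <severity> function: message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\.(\d+)\] <(\d+)> (.*)$")


def lines_for_pid(lines, pid):
    """Return (timestamp, message) pairs for every line logged by `pid`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and int(m.group(2)) == pid:
            hits.append((m.group(1), m.group(5)))
    return hits


def first_and_last(lines, pid):
    """Return the first and last entries for `pid`, or None if it never logged."""
    hits = lines_for_pid(lines, pid)
    return (hits[0], hits[-1]) if hits else None


# Made-up sample lines, only to show the shape of the input.
sample = [
    "13:18:34.121 [111049.111049] <2> example_func: first line for pid 111049 (made-up)",
    "13:18:35.000 [222222.222222] <2> example_func: line from another process (made-up)",
    "13:40:02.500 [111049.111049] <2> example_func: last line for pid 111049 (made-up)",
]

print(first_and_last(sample, 111049))
```

Because pids get re-used, it is worth checking the timestamps of the hits against the job's start time before trusting the result.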

I'm disappointed that we haven't helped you with past issues. I am not aware of cases with backup processes hanging on Exchange servers, and I've monitored every Exchange-related escalation to engineering since 2007. I've personally handled many of them. The backline support people I work with would not ask you to reboot your Exchange servers.

Trust me, or better, search Google yourself for the 3 words "bpbkar32 process kill" and you will find a lot of people with this issue. I tried every process killer you can imagine (Task Manager, killall, pskill, Process Explorer, everything). Even with all NBU services down, these processes still hang, a lot of them. Like I said, it happened on our Exchange servers 2 weeks ago and the Veritas support engineer said nothing but "reboot your server". It happened again yesterday on one of our SQL nodes and they said the same. We have 8 SQL instances balanced across 2 SQL nodes. The hung node was node1: we moved its instances to node2, rebooted node1 after midnight, moved the instances back from node2 to node1, and everything is normal now.

I think you should train your engineers in Brazil on this product. It seems they don't know anything about it.

It's so difficult for a guy coming from HP Data Protector to accept how things work in NetBackup. A lot of processes doing a lot of things, too much that can hang, too much complexity. Logging is hard, maintaining it is hard, working with it is hard, implementing it is hard, everything is so difficult. It feels like I'm playing with a house of cards: you move one thing and you mess up everything. It's like walking on eggshells.

Really disappointed with the software. We paid a lot of money (around $100k) for "Enterprise Grade" software, as the Veritas sales guys described it, and we have no confidence in it. One day it works well, the next day it hangs. To me this is beta software, honestly.

Marianne
Moderator
Partner    VIP    Accredited Certified

If all NBU services are stopped and you are still unable to kill processes, then something else is holding onto those processes.

Antivirus software is normally the culprit.

You can use tools like Process Explorer and/or handle to find out what is locking the bpbkar processes.

 

Marianne,

Thanks for your input. I have seen some information about antivirus and hangs in other posts while searching this board. Our servers don't have AV installed/active, but I'll investigate the thread/process relationships using Process Explorer and look for clues.

For now we are dealing with another issue. Our transaction log backups are running fine in NetBackup. Every job has the blue doll icon. The jobs ran all night long. When we arrived at the company this morning, the transaction logs were full and the database was locked (our main CRM was offline). The jobs are not truncating the logs after backup.

I wonder when (if) it will stop surprising us with issues. Every day there's a new one.

 

Marianne
Moderator
Partner    VIP    Accredited Certified

NetBackup does not truncate logs.

It merely notifies the application of successful backup.
The application then needs to truncate logs.

You really need to stop blaming NetBackup for everything that goes wrong in your environment.

 

The guy who is doing the implementation has fixed the SQL issue. He didn't tell me what he did.

Back to the SP restore/backup. We're still fighting with it. I've been reading the logs and messages and trying to understand them.

Here is some information I collected in the last few days.

Of the 3 GRT policies I have, 2 are "ok" for backup.
SP04 has 2.5 million entries in 101Gb of data and takes 2h08 to finish.
SP02 has 2.8 million entries in 315Gb of data and takes 4h45 to finish.
SP11 has 360k entries in 26.9Gb of data and takes 32h to finish. This is the problematic one.
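To put those three jobs on the same scale, here is a quick back-of-the-envelope calculation of entries per minute and MB per minute. It is only a sketch using the figures quoted above; it assumes "Gb" means gigabytes and uses 1024 MB per GB.

```python
# Figures copied from the post above; durations converted to minutes.
# Assumption: "Gb" in the post means gigabytes, 1 GB = 1024 MB.
jobs = {
    "SP04": {"entries": 2_500_000, "gb": 101.0, "minutes": 2 * 60 + 8},
    "SP02": {"entries": 2_800_000, "gb": 315.0, "minutes": 4 * 60 + 45},
    "SP11": {"entries": 360_000, "gb": 26.9, "minutes": 32 * 60},
}


def rates(job):
    """Return (entries per minute, MB per minute) for one job."""
    entries_per_min = job["entries"] / job["minutes"]
    mb_per_min = job["gb"] * 1024 / job["minutes"]
    return entries_per_min, mb_per_min


for name, job in jobs.items():
    entries_per_min, mb_per_min = rates(job)
    print(f"{name}: {entries_per_min:,.0f} entries/min, {mb_per_min:,.1f} MB/min")
```

On these numbers SP11 catalogs entries roughly a hundred times slower than SP04, which points at the granular (per-entry) phase rather than at the raw data transfer.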

These are the messages from the SP04 backup. It sends the 5000 entries to bpdbm in a matter of seconds.

sp04-grt.png

sp04-grt-entries.png

 

These are the messages for the SP02 job. Almost the same as SP04. Entries are sent in a matter of seconds.

sp02-grt-entries.png

 

These are the entries for the SP11 job. It takes ~12min to reach 84% progress and then starts the granular processing. It takes 2 hours to start sending the first 5000 entries and then almost 6h until the next 5000 entries. After that the batches come at shorter intervals, but not seconds like SP04; it takes minutes.

sp11-grt-entries.png

And this is the end of SP11 job.

sp11-grt.png

Reading the bpbrm logs on the media server, I found something I would like to ask you experts about.

These are the bpbkar command lines for all three jobs:

13:18:34.121 [111049.111049] <2> start_client_for_non_mpx_backup_archive_verify_import: Going to execute cmd </usr/openv/netbackup/bin/bpbkar bpbkar -r 2678400 -ru root -dt 0 -to 0 -bpstart_time 1493137412 -clnt sp-11 -class SP11_GRT -sched FULL -st FULL -bpstart_to 3600 -bpend_to 3600 -read_to 3600 -stream_count 1 -stream_number 1 -jobgrpid 166100 -blks_per_buffer 512 -pdi -granular_backup -use_otm -fso -b sp-11_1493137111 -kl 8 -alt_client SQLAX -ct 8 -use_ofb> on host <sp-11>


00:03:31.397 [19166.19166] <2> start_client_for_non_mpx_backup_archive_verify_import: Going to execute cmd </usr/openv/netbackup/bin/bpbkar bpbkar -r 8035200 -ru root -dt 0 -to 0 -bpstart_time 1493089708 -clnt sp-04.xxxxxxxx.local -class SP04_GRT -sched FULL -st FULL -bpstart_to 3600 -bpend_to 3600 -read_to 3600 -stream_count 1 -stream_number 1 -jobgrpid 158266 -blks_per_buffer 512 -pdi -granular_backup -use_otm -fso -b sp-04.xxxxxx.local_1493089408 -kl 8 -alt_client MSSQL -ct 8 -use_ofb> on host <sp-04.xxxxxx.local>


04:02:27.042 [55948.55948] <2> start_client_for_non_mpx_backup_archive_verify_import: Going to execute cmd </usr/openv/netbackup/bin/bpbkar bpbkar -r 8035200 -ru root -dt 0 -to 0 -bpstart_time 1493104044 -clnt sp-02.xxxxxx.local -class SP02_GRT -sched FULL -st FULL -bpstart_to 3600 -bpend_to 3600 -read_to 3600 -stream_count 1 -stream_number 1 -jobgrpid 160674 -blks_per_buffer 512 -pdi -granular_backup -use_otm -fso -b sp-02.xxxxxxx.local_1493103744 -kl 8 -alt_client MSSQL -ct 8 -use_ofb> on host <sp-02.xxxxx.local>

What does "-r" stand for? Why does SP11 have a lower value than the other two policies?

Thank you!
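A side-by-side comparison of those command lines is easier with a small script than by eye. The sketch below assumes the "Going to execute cmd <...> on host" layout shown in the bpbrm lines above, and the two sample lines are abridged copies of the SP11 and SP04 entries (several flags dropped for brevity). It treats any flag followed by a non-flag token as taking a value, which is a guess about bpbkar's argument style, not documented behavior.

```python
import re


def parse_bpbkar_args(log_line):
    """Extract {flag: value} from a bpbrm 'Going to execute cmd <...>' line.

    Flags without a following value (e.g. -granular_backup) map to True.
    """
    m = re.search(r"Going to execute cmd <(.*?)> on host", log_line)
    if not m:
        return {}
    tokens = m.group(1).split()
    args = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
                args[tok] = tokens[i + 1]
                i += 2
                continue
            args[tok] = True
        i += 1
    return args


def differing_flags(a, b):
    """Return the sorted flags whose values differ between two parsed commands."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Abridged from the SP11 and SP04 lines in the post above.
sp11 = ("13:18:34.121 [111049.111049] <2> start_client_for_non_mpx_backup_archive_verify_import: "
        "Going to execute cmd </usr/openv/netbackup/bin/bpbkar bpbkar -r 2678400 -clnt sp-11 "
        "-class SP11_GRT -sched FULL -granular_backup -kl 8 -alt_client SQLAX -ct 8> on host <sp-11>")
sp04 = ("00:03:31.397 [19166.19166] <2> start_client_for_non_mpx_backup_archive_verify_import: "
        "Going to execute cmd </usr/openv/netbackup/bin/bpbkar bpbkar -r 8035200 -clnt sp-04.xxxxxxxx.local "
        "-class SP04_GRT -sched FULL -granular_backup -kl 8 -alt_client MSSQL -ct 8> on host <sp-04.xxxxxx.local>")

print(differing_flags(parse_bpbkar_args(sp11), parse_bpbkar_args(sp04)))
```

Apart from the client, policy and SQL instance names, `-r` is the only flag in these abridged lines whose value differs between the slow job and the fast one.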

-r is the retention level, expressed in seconds. You can verify this with the bpretlevel command: <install path>\NetBackup\bin\admincmd\bpretlevel -l

0 1468800 17 days
1 1209600 2 weeks
2 10713600 4 months
3 2678400 1 month
4 5356800 2 months
5 8035200 3 months
6 16070400 6 months
7 24105600 9 months
8 31536000 1 year
9 2147483647 infinite
10 0 expires immediately
11 220752000 7 years
12 315360000 10 years
13 432000 5 days
14 94608000 3 years
15 604800 1 week
16 63072000 2 years
17 1296000 15 days
18 2147483647 infinite
19 2147483647 infinite
20 2147483647 infinite
21 34819200 13 months
22 1468800 17 days
23 189216000 6 years
24 157680000 5 years
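To map the -r values from the three bpbkar command lines back to this listing, a small lookup is enough. This is just a sketch: the table is copied from the listing above (your own bpretlevel output may differ), and since several levels share the same number of seconds, the lookup returns a list.

```python
SECONDS_PER_DAY = 86_400

# level -> retention period in seconds, copied from the listing above.
RETENTION_SECONDS = {
    0: 1468800, 1: 1209600, 2: 10713600, 3: 2678400, 4: 5356800,
    5: 8035200, 6: 16070400, 7: 24105600, 8: 31536000, 9: 2147483647,
    10: 0, 11: 220752000, 12: 315360000, 13: 432000, 14: 94608000,
    15: 604800, 16: 63072000, 17: 1296000, 18: 2147483647, 19: 2147483647,
    20: 2147483647, 21: 34819200, 22: 1468800, 23: 189216000, 24: 157680000,
}


def levels_for(seconds):
    """Return every retention level whose period equals `seconds`."""
    return [level for level, s in RETENTION_SECONDS.items() if s == seconds]


def to_days(seconds):
    """Convert a retention period in seconds to days."""
    return seconds / SECONDS_PER_DAY


# The two -r values seen in the bpbkar command lines: SP11, then SP04/SP02.
for r in (2678400, 8035200):
    print(r, "-> level(s)", levels_for(r), "=", to_days(r), "days")
```

So SP11 runs with level 3 (1 month, 31 days) and the other two policies with level 5 (3 months, 93 days). The difference is only the retention chosen in each policy's schedule.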