Forum Discussion

devans3428
Level 4
12 years ago

Backups cause server to drop network packets, eventually leading to network failure

First, a little history. I just migrated a NetBackup 7.1.0.3 master/media server from Solaris 10 on a SPARC V440 to NetBackup 7.1.0.3 on Red Hat 6.4, running on a Dell PowerEdge R720xd. My clients are a mixture of Windows, Solaris, and Red Hat Linux. I performed the migration around the end of May, and for the first week everything ran fine for all clients except one.

That client runs the NetBackup 7.1.0.3 client on Solaris 9 on a Sun Fire 6800, with both RMAN and filesystem-level backups. It started to experience latency, to the point where the server would eventually drop packets and fall off the network. I had not experienced this issue before the migration from Solaris 10 to Red Hat. Even when the server cannot be reached via ssh or ftp, the backups never fail, but nothing else can connect to it (Oracle, ssh, etc.). If I get on the client console, the client's resources are fine and the server operates normally; the only issue is that it cannot ping out, and anything pinging it sees dropped packets or the pings stop altogether.

If I run tcpdump on the master/media server, the last thing I see is the master talking to the client and the client not responding. I have checked with our network team: they see only an increase in traffic, not to the point of dropping packets, and there are no errors on the switch or NIC. Keep in mind this was all working with the Solaris 10 master/media server. Below are the things I have done to try to resolve the issue, with no effect:

  1. Physically swapped the cable on the server, thus eliminating a bad cable
  2. Physically switched to a new NIC and a different port on the switch, with a new cable
  3. Moved from the IP on the public NIC to an exclusive backup NIC
  4. Verified with the network team that the switch is not dropping packets
  5. Moved the SAN from active/active to active/passive; this was to fix the trespassing LUNs
  6. Tried to duplicate the load by running seven 1 GB scp transfers simultaneously; the server did not even blink
  7. Tried running only one stream versus multiple streams
  8. Put in exclusions

I am at a loss at the moment. In my mind I know it's not a NetBackup issue and that it has to be something with the hardware, but I just can't find it. I am open to suggestions.
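One way to put numbers on the latency and packet loss described above is to ping the client at intervals while a backup runs and log the loss percentage, so it can be lined up against job start times. The following is only a minimal sketch, assuming the ping is run from the Red Hat master/media server, whose iputils `ping` prints a summary line such as `10 packets transmitted, 8 received, 20% packet loss`. Solaris `ping` prints a different format and would need a different pattern. The function names are hypothetical.

```python
import re
import subprocess

# Assumed format (Linux iputils ping summary line), for example:
# "10 packets transmitted, 8 received, 20% packet loss, time 9012ms"
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)

def parse_ping_summary(output):
    """Return sent/received/loss figures from ping output, or None if no summary is found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return {
        "sent": int(match.group(1)),
        "received": int(match.group(2)),
        "loss_pct": float(match.group(3)),
    }

def run_ping(host, count=10):
    """Run ping with the iputils '-c' count flag and parse its summary (Linux only)."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

Calling `run_ping` against the client's public and backup addresses every minute or so during a backup window would show whether loss climbs gradually or appears all at once.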

 

 

  • Thanks Nicolai. I have increased NUMBER_DATA_BUFFERS_DISK and NUMBER_DATA_BUFFERS to 128 for starters, and I have moved the backups on the client to a NIC isolated from the public traffic. I still see a minimal increase in ping times on the backup NIC, and from time to time I may miss a packet. However, the network continues to function on both the public and backup NICs. So in conclusion, the fix was to lower SIZE_DATA_BUFFERS_DISK to the minimum, 262144. I will mark this one resolved after a week of backups to ensure nothing else creeps up. As a side note, I did increase the TCP sliding window on the Solaris client, but I don't believe increasing the window had any effect.

     

    Thanks
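For anyone applying the same settings: NetBackup reads these values from plain one-line files named after each setting on the media server. The sketch below writes the three values from the reply above and works out the buffer memory they imply. It assumes the default UNIX location `/usr/openv/netbackup/db/config`, which may differ on your install, and the helper names are hypothetical. The multiple-of-1024 check reflects the usual guidance for buffer sizes, so verify it against the tuning guide for your version.

```python
from pathlib import Path

# Values taken from the reply above.
SETTINGS = {
    "NUMBER_DATA_BUFFERS": 128,
    "NUMBER_DATA_BUFFERS_DISK": 128,
    "SIZE_DATA_BUFFERS_DISK": 262144,
}

def write_buffer_settings(config_dir, settings):
    """Write each setting as a one-line file named after the setting."""
    config_dir = Path(config_dir)
    for name, value in settings.items():
        # Assumption: buffer sizes are expected to be multiples of 1024.
        if name.startswith("SIZE_") and value % 1024 != 0:
            raise ValueError(f"{name} should be a multiple of 1024, got {value}")
        (config_dir / name).write_text(f"{value}\n")

def buffer_memory_bytes(number_of_buffers, buffer_size):
    """Buffer memory used by one stream: buffer count times buffer size."""
    return number_of_buffers * buffer_size

# Example (needs write access to the directory, which is an assumed path):
# write_buffer_settings("/usr/openv/netbackup/db/config", SETTINGS)
```

With 128 disk buffers of 262144 bytes each, `buffer_memory_bytes(128, 262144)` gives 33554432 bytes, or 32 MiB per stream, which is worth keeping in mind when several streams run at once.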
