Forum Discussion

ashok_veritas
12 years ago

Drives disappearing from fabric

Hi All,

We are also facing some drive issues in our environment. Full backups of some very large databases run over the weekend, and on those days all drives go down on the master server as well as on the media servers. When I checked /var/adm/messages I noticed invalid drive errors, and also found that some drives are disappearing and reappearing from the fabric.

Please find the output of the commands below:

bash-3.00$ cat /var/adm/messages |grep -i disappear


Sep 10 04:12:41 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 disappeared from fabric
Sep 10 16:51:18 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 disappeared from fabric


cat /var/adm/messages |grep -i reappeared


Sep 10 04:12:51 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 reappeared in fabric
Sep 10 16:51:28 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 reappeared in fabric
------------------------------------------------------------------------------------------------------------------------------
The output of cat /var/adm/messages | grep -i drive has been attached below.
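For reference, both sets of events together with their timestamps can be pulled in one pass (a quick sketch against the same log file; the PWWN is the one from the messages above):

bash-3.00$ egrep -i 'disappeared from fabric|reappeared in fabric' /var/adm/messages
bash-3.00$ grep 500507630f50f601 /var/adm/messages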

Any help on this would be much appreciated.

Master server: Solaris 5.10

NetBackup version: 7.0

Tape library: IBM TS3580

Please let me know in case more information is required.

 

 


5 Replies

  • Thanks Marianne for moving this to a new post.

     

    Ashok,

    You have a SAN issue, not a NetBackup issue.  You should speak to your SAN team to troubleshoot.

    It could well be a faulty HBA or switch.

     

    You have many errors ...

    1. OS-level (fabric) errors:

    Sep 10 04:12:41 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 disappeared from fabric

    2. NetBackup (tldd/tldcd) errors:

    Sep 10 03:03:41 dm4cfi tldd[24118]: [ID 316020 daemon.notice] TLD(0) unload==TRUE, but no unload, drive 11 (device 19)
    Sep 10 03:04:00 dm4cfi tldd[24118]: [ID 921149 daemon.notice] TLD(0) [24118] waited 8 seconds for ready, drive 11
    Sep 10 04:12:54 dm4cfi tldcd[19981]: [ID 718619 daemon.error] TLD(0) drive does not exist in robot, drive = 16
    Sep 10 04:12:54 dm4cfi tldcd[19990]: [ID 587315 daemon.error] TLD(0) drive does not exist in robot, drive = 8

     

    NBU will never work properly if there are issues at the OS level. The errors in 1 (above) are the OS-level errors, and these MUST be fixed before the errors shown in 2.

    In fact, when 1 is fixed, I suspect 2 will disappear.

    So, the correct way to look at this is to completely forget about NBU, because you will ALWAYS have errors in NBU until you fix the OS-level stuff.

    The errors you see in 1 are caused because the drives are becoming disconnected from the SAN.
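    Once your SAN team has had a look, a quick way to confirm the drives are back at the OS level is something like the following (a rough sketch using standard Solaris 10 commands; controller and device names will differ on your hosts):

    fcinfo hba-port               # link state, driver and firmware of each HBA port
    luxadm -e port                # each FC port reported as CONNECTED / NOT CONNECTED
    cfgadm -al | grep -i tape     # tape attachment points and whether they are configured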

    The cause can be anything on the SAN, for example:

    HBA fault

    Firmware issue

    Switch fault

    GBIC fault

    You will need to speak with your SAN team and investigate with them.  There is a high possibility you will have to swap parts to see whether they are faulty, for example, by inserting a new HBA into the server.
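    Once the fabric is stable, the NetBackup view of the drives can be re-checked as well (a minimal sketch using the standard volmgr commands, assuming the default install path):

    /usr/openv/volmgr/bin/sgscan          # what the sg driver can see on this host
    /usr/openv/volmgr/bin/tpconfig -d     # configured drives and their device paths
    /usr/openv/volmgr/bin/vmoprcmd -d     # drive status (UP/DOWN) per host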

    Martin

  • Hi Martin,

    Thanks for your information.

    Could you please confirm which additional logs are required, so that I can collect them for further investigation.

     

     

    Regards,

    Ashok

  • None from NBU

    I am not aware of any SAN logs on the server; the only one I can think of is the messages/system log, but this will only show the devices disappearing/reappearing, so apart from showing the time this happened and the fact that 'it did happen', they are not going to help much.
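    If it helps when talking to the SAN team, the full timeline of those events can be pulled out of the messages file in one go (a quick sketch against the standard log path):

    grep 'fctl:' /var/adm/messages | egrep -i 'disappeared|reappeared'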

    This is not a problem you are going to fix with logs; it is a case of changing parts and maybe drivers/firmware (HBA and switch). I think the chance of this being a driver issue is very low - in fact, thinking about it, I doubt this could be caused by a driver. The chance of it being firmware is also very low. Most likely something is faulty, and as to what is faulty, it could be anything from the HBA to the drive.

    You will have to work with either your SAN team or your hardware vendor, I think.

    Martin

    This TechNote shows an example where an issue with an HBA caused devices to disappear.

    http://www.symantec.com/docs/TECH33767

    Martin

  • Hi Martin,

    Thanks for your prompt response.

    I will check with the SAN team for investigation and will get back to you with some more information.

     

    Ashok