Solved: SAN Client: ARCHIVE Python device disappears

rizwan84tx · ‎03-01-2012

Hi All,

Recently I've been experiencing issues with a SAN client. Let me explain how it's all set up:

SAN Client: Windows 2008 R2 EE, on QLE2462.

FT Media Server: RHEL 5.5 x64, target HBA on QLE2462

Zoning: The WWPN of one SAN Client port is zoned to the WWPNs of both target HBA ports.

Master Server: Windows 2008 R2 EE

Initially, 4 ARCHIVE Python devices were showing up in the client's Device Manager and backups were running fine. Now I see the ARCHIVE Python devices disappearing most of the time. Unusually, I sometimes see all 4 ARCHIVE Python devices, and then they drop to 3 or 2 and so on; I also see SCSI devices (screenshot attached). You may question the zoning; I had the same thought and checked, and there were no changes in zoning.

I checked the FT target device state: the targets are active and the FT services are running.

[root@FTSERVER admincmd]# ./nbftconfig -listtargets

m FTSERVER

d 0 0 w # active FABRIC QLE246x Series FC Hba Qlogic

d 0 1 w # active FABRIC QLE246x Series FC Hba Qlogic

d 1 0 w # active FABRIC QLE246x Series FC Hba Qlogic

d 1 1 w # active FABRIC QLE246x Series FC Hba Qlogic

I tried re-installing the FT media server too, but nothing changed. The SAN Client's HBA drivers are the latest. I suspect something has gone wrong in the client's operating system and would like to find out what.

Any input from you experts is highly appreciated.

-Rizwan
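
A check like the `nbftconfig -listtargets` one above can be scripted when it has to be repeated while devices come and go. This is only a minimal sketch: it assumes, based purely on the output pasted in this post, that rows starting with `d` are target ports and that the word `active` marks a healthy one. The function name `count_ft_targets` is hypothetical and not part of NetBackup.

```python
def count_ft_targets(listing):
    """Return (total, active) counts for target rows in a -listtargets listing."""
    total = 0
    active = 0
    for line in listing.splitlines():
        fields = line.split()
        # Assumption: target rows begin with a lone "d"; the "m" row names the media server.
        if not fields or fields[0] != "d":
            continue
        total += 1
        # Assumption: the state column contains the literal word "active".
        if "active" in fields:
            active += 1
    return total, active


# Sample text copied from the output in this thread.
SAMPLE_LISTING = """\
m FTSERVER
d 0 0 w # active FABRIC QLE246x Series FC Hba Qlogic
d 0 1 w # active FABRIC QLE246x Series FC Hba Qlogic
d 1 0 w # active FABRIC QLE246x Series FC Hba Qlogic
d 1 1 w # active FABRIC QLE246x Series FC Hba Qlogic
"""

total_targets, active_targets = count_ft_targets(SAMPLE_LISTING)
print(f"{active_targets} of {total_targets} target ports active")
```

If all targets report active on the FT media server while devices still vanish on the client, the problem is more likely somewhere between the two (HBA, cable, switch port) than in the FT service itself.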


Mark_Solutions · ‎03-01-2012

Windows 2008 doesn't have the Removable Storage Service installed by default, but check that it is absent; if it is present, stop and disable it, as it will take the drives away from NetBackup.

Next, make sure that Windows is not interfering with the tape drives via its drivers by disabling the Windows Test Unit Ready (TUR) commands - see this tech note:

http://support.microsoft.com/kb/842411

This applies in the same way to all Windows versions.

rizwan84tx · ‎03-01-2012

Mark,

There is no Removable Storage Service; I have disabled TUR as per the MS tech note and rebooted the client. The issue still persists.

Mark_Solutions · ‎03-02-2012

Could a firewall / antivirus be blocking something? You don't use McAfee, do you?

Marianne · ‎03-02-2012

Any errors in Event Viewer System and/or Application logs?

The Windows Storport driver is known for 'misbehaving'.

rizwan84tx · ‎03-02-2012

Hi Marianne,

The System log in Event Viewer was reporting warnings from source QL2300:

Reset to device, \Device\RaidPort1, was issued.

The Application log has these messages from source QLManagementAgentJava:

Error: RetrieveTargetDataForTargets: Unable to get target data (0xaa) (The requested resource is in use. )

Warning: RetrieveLunDataForTargets: Unable to get lun data (0xaa) (The requested resource is in use. )

I performed OS updates and upgraded the firmware and drivers for the SAN client, and I still see target LUNs disappearing after a while. Is the HBA faulty?

rizwan84tx · ‎03-02-2012

Mark,

Yes, I have McAfee running on the SAN client.

Mark_Solutions · ‎03-02-2012

You have either a faulty HBA, faulty cables, a bad switch or a bad library - but something is sending resets to the device and then it shows as being in use.

You need to see if you can get logs from everything to see if you can pin it down.

There are settings on some switches that can cause a reset, as well as on the HBA itself.

Also worth checking your media servers for orphaned bptm processes that can be causing issues, in view of the "resource is in use" message - this is what you get when Removable Storage Service or TUR has not been dealt with, but also when bptm / bpbrm processes have been orphaned.

Are you sure nothing else is zoned to it that shouldn't be?

rizwan84tx · ‎03-02-2012

Nothing else is zoned. Also, I tried zoning both ways, but both gave the same issue.

First, I created a single zone with 2 target ports of the FT media server + 1 client port.
The other way, I created 2 zones, as below:

Zone a: Port 1 of target HBA + Client Port 1

Zone b: Port 2 of target HBA + Client Port 1

I'm sure the SAN switch can't be faulty, as other zoned devices are working without issues.

I shut down the NBU services on the FT media server; there were no orphaned processes.

I will try changing the FC cable on Monday and update you.

It's evening in India, and I've got to knock off. Happy weekend!!!

Mark_Solutions · ‎03-02-2012

Just noticed that you said you do have McAfee - worth disabling it for a while if possible, and also renaming the \system32\drivers\MFETDIK.sys file and then rebooting - this is its own firewall driver and can cause all sorts of issues:

http://www.symantec.com/docs/TECH56658

rizwan84tx · ‎03-05-2012

Uninstalled McAfee and used a different FC cable; still the same. So this is forcing me to replace the target HBA. If that doesn't work, blame Microsoft?

rizwan84tx · ‎03-09-2012

Mark,

As you guessed, the switch could be faulty. On checking the SAN switch logs for the port the client is connected to, the fabric log shows that port going offline many times:

Time Stamp   Input and *Action               S, P   Sn,Pn  Port  Xid
====================================================================
13:32:56.206 SCN LR_PORT (0);g=0x65b         D2,P0  D2,P0  6     NA
13:32:56.209 SCN Port Online;g=0x65b         D2,P0  D2,P1  6     NA
13:33:00.223 SCN Port Offline;g=0x65d        D2,P1  D2,P0  6     NA
13:33:00.243 *Removing all nodes from port   D2,P0  D2,P0  6     NA
13:33:02.270 SCN LR_PORT (0);g=0x65d         D2,P0  D2,P0  6     NA
13:33:02.276 SCN Port Online;g=0x65d         D2,P0  D2,P1  6     NA

Connected the client to a different FC port and BINGO!!! The issue got resolved. I'm marking your post as the solution.
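
For anyone chasing the same symptom, the fabric log excerpt above can be tallied per port to make the flapping obvious. This is a rough sketch, not a switch vendor tool: it assumes each event row ends with the four columns shown in the excerpt (S,P / Sn,Pn / Port / Xid) and that the events of interest read "SCN Port Online" and "SCN Port Offline". The function name `count_port_transitions` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def count_port_transitions(log_text):
    """Count 'Port Online' / 'Port Offline' events per switch port."""
    counts = defaultdict(lambda: {"online": 0, "offline": 0})
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            # Event rows start with a timestamp such as 13:32:56.206.
            datetime.strptime(fields[0], "%H:%M:%S.%f")
        except ValueError:
            continue  # header, separator, or anything else that is not an event
        # Assumption: the last four columns are S,P / Sn,Pn / Port / Xid.
        port = fields[-2]
        event = " ".join(fields[1:-4])
        if event.startswith("SCN Port Online"):
            counts[port]["online"] += 1
        elif event.startswith("SCN Port Offline"):
            counts[port]["offline"] += 1
    return {port: dict(c) for port, c in counts.items()}


# Excerpt copied from the fabric log in this post.
FABRIC_LOG = """\
13:32:56.206 SCN LR_PORT (0);g=0x65b  D2,P0 D2,P0 6 NA
13:32:56.209 SCN Port Online;g=0x65b  D2,P0 D2,P1 6 NA
13:33:00.223 SCN Port Offline;g=0x65d  D2,P1 D2,P0 6 NA
13:33:00.243 *Removing all nodes from port  D2,P0 D2,P0 6 NA
13:33:02.270 SCN LR_PORT (0);g=0x65d  D2,P0 D2,P0 6 NA
13:33:02.276 SCN Port Online;g=0x65d  D2,P0 D2,P1 6 NA
"""

for port, c in count_port_transitions(FABRIC_LOG).items():
    print(f"port {port}: online={c['online']} offline={c['offline']}")
```

A port that keeps alternating between online and offline within seconds, as port 6 does here, lines up with the reset warnings on the client and with devices dropping out of Device Manager.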

Mark_Solutions · ‎03-09-2012

Great news! - glad to have helped
