Forum Discussion

liuyl
Level 6
18 days ago

One media server keeps going "offline" with state "not reachable by master (6)"

We have a small NBU 8.2 domain with one master server and four media servers, which share one tape library (TL) with two SSO tape drives (TDs).
All of them work fine except one media server, which nbemmcmd -listhosts reports as offline with the state "not reachable by master (6)".
This problematic media server can be set active, but it stays online for only about 5 minutes each time and then goes offline again.
Note: there are almost no errors or warnings in any of the relevant log files, including the VxUL logs.




# nbemmcmd -listhosts -display_server -machinename erp-fzyf -machinetype media -verbose|grep State
    MachineState = not reachable by master (6)


# nbemmcmd  -updatehost -machinename erp-fzyf -machinetype media -machinestateop set_master_server_connectivity -masterserver erpbak
NBEMMCMD, Version: 8.2
Command completed successfully.


# nbemmcmd -listhosts -display_server -machinename erp-fzyf -machinetype media -verbose|grep State
    MachineState = active for tape and disk jobs (14)


# nbemmcmd -listhosts -display_server -machinename erp-fzyf -machinetype media -verbose|grep State
    MachineState = active for tape and disk jobs (14)


# nbemmcmd -listhosts -display_server -machinename erp-fzyf -machinetype media -verbose|grep State
    MachineState = not reachable by master (6)
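
To pin down exactly when the state flips back, the manual re-checks above could be replaced by a small polling loop. This is only a sketch: the function names `extract_state_code` and `watch_state` and the 30-second default interval are assumptions for illustration, and the loop assumes `nbemmcmd` is on the PATH of the master server.

```shell
#!/bin/sh
# Extract the numeric state code, e.g. 6 or 14, from
# "nbemmcmd -listhosts ... -verbose" output supplied on stdin.
extract_state_code() {
    sed -n 's/^[[:space:]]*MachineState = .*(\([0-9][0-9]*\))[[:space:]]*$/\1/p' | head -n 1
}

# Poll the media server state and print a timestamped line on every change.
# Arguments: media server name, polling interval in seconds (hypothetical default: 30).
watch_state() {
    host=$1
    interval=${2:-30}
    prev=""
    while :; do
        code=$(nbemmcmd -listhosts -display_server -machinename "$host" \
                   -machinetype media -verbose | extract_state_code)
        if [ "$code" != "$prev" ]; then
            printf '%s state=%s\n' "$(date '+%m/%d/%Y %H:%M:%S')" "$code"
            prev=$code
        fi
        sleep "$interval"
    done
}
```

Running `watch_state erp-fzyf` right after the `-updatehost` command would give timestamps that can be lined up against the VxUL entries.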


For this problematic media server, an entry with MachineStateOpCode = < 9 > appears every 10 minutes.
Note: none of the healthy media servers has a single opcode 9 entry under OID 144!



# vxlogview -p 51216  -o 144 -d all  -t 06:00:00  > /tmp/nb_144.txt


# grep  erp-fzyf /tmp/nb_144.txt|grep -Ei "\<9\>"|tail
06/05/2025 14:57:26.478 [Debug] NB 51216 da 144 PID:738 TID:139688411072256 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:07:26.444 [Debug] NB 51216 da 144 PID:738 TID:139688404768512 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:17:26.554 [Debug] NB 51216 da 144 PID:738 TID:139688404768512 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:27:26.453 [Debug] NB 51216 da 144 PID:738 TID:139688404768512 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:37:26.474 [Debug] NB 51216 da 144 PID:738 TID:139688411072256 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:47:26.499 [Debug] NB 51216 da 144 PID:738 TID:139688404768512 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 15:57:26.438 [Debug] NB 51216 da 144 PID:738 TID:139689423341440 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 16:07:26.447 [Debug] NB 51216 da 144 PID:738 TID:139688404768512 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 16:17:26.534 [Debug] NB 51216 da 144 PID:738 TID:139688411072256 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >
06/05/2025 16:27:26.462 [Debug] NB 51216 da 144 PID:738 TID:139689423341440 File ID:111 [No context] 1 [DeviceAllocatorImpl::updateMachineState]  - MachineName = < erp-fzyf >, NetBackupVersion = < 0 >, MachineStateOpCode = < 9 >


# grep  erp-hana1 /tmp/nb_144.txt|grep -Ei "\<9\>"|tail


# grep  erp-hana2 /tmp/nb_144.txt|grep -Ei "\<9\>"|tail


# grep  erp-test /tmp/nb_144.txt|grep -Ei "\<9\>"|tail
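
The four per-host greps above could also be collapsed into one pass that counts opcode 9 entries per MachineName. A minimal sketch, assuming the log lines keep the `MachineName = < name >` and `MachineStateOpCode = < 9 >` layout shown above; the function name `count_opcode9` is made up for illustration.

```shell
#!/bin/sh
# Count "MachineStateOpCode = < 9 >" entries per MachineName.
# Reads vxlogview text output on stdin and prints "name count" lines.
count_opcode9() {
    awk '
        /MachineStateOpCode = < 9 >/ {
            if (match($0, /MachineName = < [^ ]+ >/)) {
                # Skip the 16-character "MachineName = < " prefix
                # and drop the 2-character " >" suffix.
                name = substr($0, RSTART + 16, RLENGTH - 18)
                count[name]++
            }
        }
        END { for (n in count) print n, count[n] }
    ' | sort
}
```

For example, `count_opcode9 < /tmp/nb_144.txt` lists only the hosts that have at least one such entry, so healthy media servers simply do not appear.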

7 Replies

  • Here is the most representative, verified output:

    a) On the master server (erpbak)
    # bptestbpcd -verbose -host erp-fzyf
    1 1 1
    127.0.0.1:54422 -> 127.0.0.1:58722 PROXY 10.130.23.60:51643 -> 10.130.23.56:1556 
    127.0.0.1:59052 -> 127.0.0.1:34300 PROXY 10.130.23.60:59543 -> 10.130.23.56:1556 
    LOCAL_CERT_ISSUER_NAME = O=vx,OU=root@erpbak,CN=broker
    LOCAL_CERT_SUBJECT_COMMON_NAME = 7df64390-9f5a-43b7-9487-95b2572b6d75
    PEER_CERT_ISSUER_NAME = O=vx,OU=root@erpbak,CN=broker
    PEER_CERT_SUBJECT_COMMON_NAME = 4209c31d-1de4-4617-b0c3-7ec5ba6306f9
    PEER_NAME = erpbak
    HOST_NAME = erp-fzyf
    CLIENT_NAME = erp-fzyf
    VERSION = 0x08200000
    PLATFORM = linuxS_x86_3.0.76
    PATCH_VERSION = 8.2.0.0 
    SERVER_PATCH_VERSION = 8.2.0.0 
    MASTER_SERVER = erpbak
    EMM_SERVER = erpbak
    NB_MACHINE_TYPE = MEDIA_SERVER
    SERVICE_TYPE = VNET_DOMAIN_CLIENT_TYPE
    PROCESS_HINT = 4209c31d-1de4-4617-b0c3-7ec5ba6306f9


    b) On the problematic media server (erp-fzyf)

    # bptestbpcd -verbose -host erpbak
    1 1 1
    127.0.0.1:65093 -> 127.0.0.1:46806 PROXY 10.130.23.56:62677 -> 10.130.23.60:1556 
    127.0.0.1:65093 -> 127.0.0.1:46808 PROXY 10.130.23.56:11619 -> 10.130.23.60:1556 
    LOCAL_CERT_ISSUER_NAME = O=vx,OU=root@erpbak,CN=broker
    LOCAL_CERT_SUBJECT_COMMON_NAME = 4209c31d-1de4-4617-b0c3-7ec5ba6306f9
    PEER_CERT_ISSUER_NAME = O=vx,OU=root@erpbak,CN=broker
    PEER_CERT_SUBJECT_COMMON_NAME = 7df64390-9f5a-43b7-9487-95b2572b6d75
    PEER_NAME = erp-fzyf
    HOST_NAME = erpbak
    CLIENT_NAME = erpbak
    VERSION = 0x08200000
    PLATFORM = linuxR_x86_2.6.32
    PATCH_VERSION = 8.2.0.0 
    SERVER_PATCH_VERSION = 8.2.0.0 
    MASTER_SERVER = erpbak
    EMM_SERVER = erpbak
    NB_MACHINE_TYPE = MASTER_SERVER
    SERVICE_TYPE = VNET_DOMAIN_CLIENT_TYPE
    PROCESS_HINT = 4209c31d-1de4-4617-b0c3-7ec5ba6306f9
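
One quick sanity check on the two outputs above is that the certificate common names are mirrored: the LOCAL value seen in one direction should be the PEER value seen in the other. The sketch below assumes each output has been saved to a file; the helper names `bp_field` and `certs_mirrored` are hypothetical, and the check says nothing about why the state flips.

```shell
#!/bin/sh
# Print the value of KEY from the "KEY = value" lines of a saved bptestbpcd output.
# Arguments: KEY, file name.
bp_field() {
    awk -F' = ' -v k="$1" '
        { key = $1; gsub(/^[ \t]+/, "", key) }
        key == k { v = $2; gsub(/[ \t]+$/, "", v); print v; exit }
    ' "$2"
}

# Succeed only when both files carry the fields and the LOCAL/PEER
# certificate common names are swapped between the two directions.
certs_mirrored() {
    l1=$(bp_field LOCAL_CERT_SUBJECT_COMMON_NAME "$1")
    p1=$(bp_field PEER_CERT_SUBJECT_COMMON_NAME "$1")
    l2=$(bp_field LOCAL_CERT_SUBJECT_COMMON_NAME "$2")
    p2=$(bp_field PEER_CERT_SUBJECT_COMMON_NAME "$2")
    [ -n "$l1" ] && [ -n "$p1" ] && [ "$l1" = "$p2" ] && [ "$p1" = "$l2" ]
}
```

With the outputs saved as, say, `master.txt` and `media.txt` (file names chosen here for illustration), `certs_mirrored master.txt media.txt && echo mirrored` would print `mirrored` for the values posted above.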

      davidmoline
      Level 6

      Okay - that does look good. 

      First - have you tried "set_tape_active" and "set_disk_active" for the -machinestateop option, or even "reset_all"?

      Have you enabled ltid debugging on the media server and restarted ltid to see what may be happening?

      If none of these help, I'd suggest logging a support case.

      Cheers

       

  • Yes, I am sure about the results of every bi-directional bptestbpcd/bpclntcmd/bptestnetconn test.
    Note: I had also already read this KB article ... 

  • Hi liuyl​ 

    The nbemmcmd command you are using only updates the primary server to indicate that the media server "might" be online. After 5 minutes or so, the state goes back to offline because the master is unable to communicate with the media server.

    What you need to check is network communication between the master and media server (ping, bptestbpcd, etc.), and make sure that no firewalls are blocking required ports (primarily 1556, but others may be needed depending on version and functionality; refer to the network ports reference guide for more details).

    Cheers

      liuyl
      Level 6

      Yes, every bi-directional bptestbpcd/bpclntcmd/bptestnetconn test is successful.
      All the master/media servers are within the same subnet, and no iptables/firewall rules are blocking traffic.
      Also, the master receives a heartbeat from the problematic media server every minute!

      # grep fzyf /tmp/nb_219.txt|head
      06/05/2025 10:33:06.503 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687773497088 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:34:06.512 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687779800832 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:35:06.520 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687769294592 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:36:06.528 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687773497088 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:37:06.536 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687773497088 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:38:06.544 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687773497088 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:39:06.552 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687769294592 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:40:06.560 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687779800832 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:41:06.568 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687779800832 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      06/05/2025 10:42:06.576 [Diagnostic] NB 51216 rsrcevtmgr 219 PID:738 TID:139687779800832 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host erp-fzyf
      #
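
To confirm that no heartbeat is missed anywhere in the whole log, rather than only in the first ten lines, the gaps between consecutive heartbeat entries could be checked with awk. A sketch only: the function name `heartbeat_gaps` and the 90-second default threshold are assumptions, the host name is matched as a plain substring, and all lines are assumed to fall on the same day.

```shell
#!/bin/sh
# Report gaps longer than MAX seconds between consecutive heartbeat lines for HOST.
# Arguments: HOST, MAX in seconds (hypothetical default: 90).
# Reads vxlogview text on stdin; does not handle logs that cross midnight.
heartbeat_gaps() {
    awk -v host="$1" -v max="${2:-90}" '
        index($0, "Heartbeat received from host " host) {
            split($2, t, ":")                      # $2 looks like HH:MM:SS.mmm
            now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
            if (seen && now - prev > max)
                printf "gap of %d s before %s\n", now - prev, $2
            prev = now
            seen = 1
        }
    '
}
```

For example, `heartbeat_gaps erp-fzyf 90 < /tmp/nb_219.txt` prints nothing when every heartbeat arrives roughly one minute after the previous one, as in the excerpt above.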

        davidmoline
        Level 6

        Okay - can you send the output of "bptestbpcd -verbose -host < >" from the master to the media server and from the media server to the master?

        Are you also sure that name resolution is working correctly on both systems? (If it is not, the problem is more likely on the media server, given that the other media servers are working fine.)
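
A forward-then-reverse lookup is one simple way to test the name resolution question raised here. The sketch below assumes a Linux host where `getent hosts` is available; the wrapper `lookup` and the function `check_name_resolution` are hypothetical names, and NetBackup's own bpclntcmd checks remain the authoritative test.

```shell
#!/bin/sh
# Resolver wrapper, kept separate so another lookup tool can be swapped in.
# Assumes getent (glibc) is available on the host.
lookup() {
    getent hosts "$1"
}

# Resolve NAME to an address, then resolve that address back and
# check that NAME appears among the returned names or aliases.
check_name_resolution() {
    name=$1
    ip=$(lookup "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$ip" ]; then
        echo "FAIL: $name does not resolve"
        return 1
    fi
    if lookup "$ip" | awk -v n="$name" '
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }'; then
        echo "OK: $name <-> $ip"
    else
        echo "MISMATCH: $ip does not map back to $name"
        return 1
    fi
}
```

It would need to be run on both the master and the media server, for both names (for example `check_name_resolution erpbak` and `check_name_resolution erp-fzyf`), since each host may resolve names from a different hosts file or DNS server.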