Forum Discussion

Spartacus81's avatar
12 years ago

Disk Volume Pool Down (2074)

Friends,

given up on fixing disk pool volume down issue after spending whole morning.. this is the typical one..

nbdevquery and nbdevconfig works successfully.. you can inventory the disk pool from.. you can see the storage configuration properties under credentials tab.. disk pool stays alive less than 10secs and it goes down..  here is the spoold logs from that path.

gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad

number of errors being reported of connection reset by peer.. i can bet my brand new BMW 325D on this.. the name resolution/nslookup and everything working okay.. no problem.. this problem arose over the weeked due where the media server (built in storage) went offline in the middle of the backup due to unexpected power outage..  looking at the logs below its suggesting connection reset by peer; REceive session shutdown

any suggestion will highly be appreciated.

 

February 25 13:49:19 INFO [140449875887872]: Task Type         : spad
February 25 13:49:19 INFO [140449875887872]: Started at        : Mon Feb 25 13:49:19 2013
February 25 13:49:19 INFO [140449875887872]: Task ID           : 0
February 25 13:49:19 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:49:19 INFO [140449875887872]: Session ID        : 1563389238
February 25 13:49:19 INFO [140449875887872]: Data Selection    : 1
February 25 13:49:19 INFO [140449875887872]: Agent Name        : spoold
February 25 13:49:19 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39478
February 25 13:49:19 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:49:19 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:49:19 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:49:19 INFO [140449875887872]: Crypto engine     : server
February 25 13:49:19 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:49:19 INFO [140449875887872]: Data Compression  : enabled
February 25 13:49:19 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:49:19 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39478: spad handler running
February 25 13:49:48 INFO [140449875887872]: newEvent Adding new event. type:4
February 25 13:50:14 INFO [140449875887872]: Task Type         : spad
February 25 13:50:14 INFO [140449875887872]: Started at        : Mon Feb 25 13:50:14 2013
February 25 13:50:14 INFO [140449875887872]: Task ID           : 0
February 25 13:50:14 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:50:14 INFO [140449875887872]: Session ID        : 1393531371
February 25 13:50:14 INFO [140449875887872]: Data Selection    : 1
February 25 13:50:14 INFO [140449875887872]: Agent Name        : spoold
February 25 13:50:14 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39494
February 25 13:50:14 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:50:14 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:50:14 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:50:14 INFO [140449875887872]: Crypto engine     : server
February 25 13:50:14 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:50:14 INFO [140449875887872]: Data Compression  : enabled
February 25 13:50:14 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:50:14 INFO [140449875887872]: task 1 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39494: spad handler running
February 25 13:50:32 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:50:46 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:53:08 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:53:08 ERR [140449875887872]: 25000: NetSelect: invalid pipe
February 25 13:53:08 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got 0 bytes. Cause: connection reset by peer
February 25 13:53:08 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got -1 bytes. Cause: connection reset by peer
February 25 13:53:08 INFO [140449875887872]: task 1 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39494: spad handler returns
February 25 13:53:08 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39478: spad handler returns
February 25 13:53:08 INFO [140449875887872]: Receive session shutdown
February 25 13:53:08 INFO [140449875887872]: Receive session shutdown
February 25 13:53:08 INFO [140449875887872]: Uptime                  : 173.54 sec
February 25 13:53:08 INFO [140449875887872]: Uptime                  : 228.61 sec

February 25 13:53:08 INFO [140449875887872]: Messages Received       : 19
February 25 13:53:08 INFO [140449875887872]: Messages Received       : 8
February 25 13:53:08 INFO [140449875887872]: Bytes Received          : 3996
February 25 13:53:08 INFO [140449875887872]: Bytes Received          : 2104
February 25 13:53:08 INFO [140449875887872]: Seconds spent receiving : 0.00
February 25 13:53:08 INFO [140449875887872]: Seconds spent receiving : 0.00
February 25 13:53:08 INFO [140449875887872]: Messages Sent           : 19
February 25 13:53:08 INFO [140449875887872]: Messages Sent           : 8
February 25 13:53:08 INFO [140449875887872]: Bytes Sent              : 1820
February 25 13:53:08 INFO [140449875887872]: Bytes Sent              : 744
February 25 13:53:08 INFO [140449875887872]: Seconds spent sending   : 0.00
February 25 13:53:08 INFO [140449875887872]: Seconds spent sending   : 0.00
February 25 13:53:08 INFO [140449875887872]: 0.00 MB sent in 173.54 sec
February 25 13:53:08 INFO [140449875887872]: 0.00 MB sent in 228.61 sec
February 25 13:53:08 INFO [140449875887872]: spad completed at: Mon Feb 25 13:53:07 2013
February 25 13:53:08 INFO [140449875887872]: spad completed at: Mon Feb 25 13:53:07 2013
February 25 13:53:43 INFO [140449875887872]: Task Type         : spad
February 25 13:53:43 INFO [140449875887872]: Started at        : Mon Feb 25 13:53:43 2013
February 25 13:53:43 INFO [140449875887872]: Task ID           : 0
February 25 13:53:43 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:53:43 INFO [140449875887872]: Session ID        : 291392991
February 25 13:53:43 INFO [140449875887872]: Data Selection    : 1
February 25 13:53:43 INFO [140449875887872]: Agent Name        : spoold
February 25 13:53:43 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39541
February 25 13:53:43 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:53:43 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:53:43 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:53:43 INFO [140449875887872]: Crypto engine     : server
February 25 13:53:43 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:53:43 INFO [140449875887872]: Data Compression  : enabled
February 25 13:53:43 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:53:43 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39541: spad handler running
February 25 13:54:17 INFO [140449875887872]: newEvent Adding new event. type:4
February 25 13:54:45 INFO [140449875887872]: Task Type         : spad
February 25 13:54:45 INFO [140449875887872]: Started at        : Mon Feb 25 13:54:45 2013
February 25 13:54:45 INFO [140449875887872]: Task ID           : 0
February 25 13:54:45 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:54:45 INFO [140449875887872]: Session ID        : 1820742354
February 25 13:54:45 INFO [140449875887872]: Data Selection    : 1
February 25 13:54:45 INFO [140449875887872]: Agent Name        : spoold
February 25 13:54:45 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39553
February 25 13:54:45 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:54:45 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:54:45 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:54:45 INFO [140449875887872]: Crypto engine     : server
February 25 13:54:45 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:54:45 INFO [140449875887872]: Data Compression  : enabled
February 25 13:54:45 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:54:45 INFO [140449875887872]: task 9 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39553: spad handler running
February 25 13:56:23 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:57:42 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got 0 bytes. Cause: connection reset by peer
February 25 13:57:42 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39541: spad handler returns
February 25 13:57:42 INFO [140449875887872]: Receive session shutdown
February 25 13:57:42 INFO [140449875887872]: Uptime                  : 238.56 sec
February 25 13:57:42 INFO [140449875887872]: Messages Received       : 6

 

17 Replies

  • Thanks Mark.. i have tried all those tricks already but no joy (83/84).. trust me i wont let my BMW go that easy..

    did you see my last reply.. disk pool gone down again during the queue processing...

     

     

  • OK - looking good then - the processqueue has probably re-enabled the system so it should be OK now

    Once it has finished kick it off another couple of times to get it down as it was quite large and need it running a couple of times after a break like this to get it back up.

    The 83/84 error for client side de-dupe may cost you a BMW as this is when the client talks directly to the storage to write its backups so if it doesn't resolve it correctly it will fail.

    If it is just one or two clients you could take a look at the clients ost-plugins directory and find the media server folder within it - delete it and let it try again or take a look to see if anything in there looks wrong.

    I think you will be OK on your systems now as the PUT and GET are both saying Yes.

  • BINGO... the disk pool has gone down during the queueprocessing which ran okay for like half an hour...

    trying to up it .. command works but --getmode gives the connection error???

  • Mark its showing BUSY since i ran the queueprocessing and am constantly monitoring the storaged.log

    tail -f storaged.log

    and its moving so far so good.. disk pool hasnt gone down yet.. fingers crossed! i had couple of dupications pending which have now completed.. just noticed so things are sort of normal.. backups are scheduled to run at 22:00 and the diskpool is only 10TB 48% used (built in) so hopefully queueprocessing will finish that time...

    --getmode is showing the restult as expected..

    gsrugnbapp01:~ # /usr/openv/pdde/pdcr/bin/crcontrol --getmode
    Mode : GET=Yes PUT=Yes DEREF=Yes SYSTEM=Yes STORAGED=Yes REROUTE=No COMPACTD=Yes RECOVERCRDB=No
     

    --dsstat is also appearing.. inventory and storage server properties in the console all working okay..

    one thing my little brain is struggling to understand this error 83/84 every now and again when you have client side dedupliation enabled.. when you change it to media server deduplication it works.... Grrr

  • OK - so busy No but Pending Yes - so it is not running, or at least wasn't.

    what does --getmode show?

    As you have kicked off the processqueue it may resolve things but after one of these failure it wont actually be usable again until the getmode shwos YES for all of the important bits (PUT and GET)

    Think we may need the storaged log for this one as it is obviously not playing ball at the moment.

  • Thanks Mark... up'd the diskpool and ran quickly process queue info before it goes down on another session... fingers crossed to see if that helps...

    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
    total queue size : 2156929592
    creation date of oldest tlog : Mon Feb 25 12:20:48 2013

    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
    Busy   : no
    Pending: yes
    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
    Busy   : no
    Pending: yes
    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueue
    OK
    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueue
    OK
    gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad #

     

     

     

     

     

  • Hi

    Sounds like corruption may have crept in but the repeated failures during lack of connection may have caused other issues

    do a crcontrol --get mode first to seee what the state of play is with the storage

    Next a crcontrol --queueinfo and --processqueueinfo to see what the process queue is up to (if anything)

    The spoold log should actually have the answer to take a look there or if it is not too big post it on here.

    I have had situations where we just need to kick queue processing off manually to bring it back to life due to 3 failures (runs as 12 noon so would have been trying at the time of the failure) (storaged logs will say something to the effect that queue processing has failed 3 time and the storage is being shut down - call support!)

    Also when a fragement has been orphaned and needs manually deleting from the de-dupe catalog

    Hopefully it is the first and manully running queue processing will bring it back to life (crcontrol --processqueue)

    Hope this helps