Solved: Standard environment checks

Chitra · ‎04-26-2009

Hi All,

Please share your best practices for doing a "health check" of an environment.

I monitor jobs through the Veritas 6.5 console daily, and I want some simple, systematic steps I can go through (such as checking elapsed time, KB/sec, drives, etc.) that will help me easily identify any problems in the environment.

I sometimes have trouble identifying problems like "Hung Backup", "Hung Environment", "Slow Job", etc.

Also, is there an easier way to identify jobs that haven't triggered automatically in an environment, without manual intervention?

Your expert advice and best tips will help me monitor the servers more efficiently and diagnose problems on them in a timely fashion!

Thanks in advance.

Chitra

J_H_Is_gone · ‎04-27-2009

If you can get NOM set up, it would be good.
NOM has alerts that will send emails for things like:

Long-running jobs - for me, if a job runs over 20 hours I get an email.
or if a tape drive goes down
or a job is suspended
or a tape is frozen
or a job fails with a status other than 1 or 150

Plus you have the NOM console that you can just 'watch' for issues.

NOM comes free/included - you just have to set it up on a server - BUT NOT ON THE MASTER.
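
For anyone who wants to see the alert rules above spelled out, here is a minimal sketch in Python. It is not how NOM implements its alerts; the Job record, the state names and the alerts_for function are all made up for illustration, and only the 20-hour threshold and the ignored status codes 1 and 150 come from the post.

```python
from dataclasses import dataclass
from typing import List, Optional

LONG_RUNNING_HOURS = 20          # threshold quoted in the post
IGNORED_STATUSES = {0, 1, 150}   # 0 is success; 1 and 150 are the codes the post ignores


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not NOM's schema."""
    job_id: int
    state: str                     # assumed values: "active", "suspended", "done"
    elapsed_hours: float
    status: Optional[int] = None   # exit status, only set once the job is done


def alerts_for(jobs: List[Job]) -> List[str]:
    """Return one alert line per job that matches a rule."""
    alerts = []
    for job in jobs:
        if job.state == "active" and job.elapsed_hours > LONG_RUNNING_HOURS:
            alerts.append(f"job {job.job_id}: running for {job.elapsed_hours:.1f} h")
        elif job.state == "suspended":
            alerts.append(f"job {job.job_id}: suspended")
        elif (job.state == "done" and job.status is not None
              and job.status not in IGNORED_STATUSES):
            alerts.append(f"job {job.job_id}: failed with status {job.status}")
    return alerts
```

Drive-down and frozen-tape alerts would follow the same pattern with their own record types.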

scorpy_582 · ‎04-26-2009

Some of the most important ones:

-> First and foremost: status of backups
-> Error report: can be seen in the GUI or with the bperror command
-> Volume pools: make sure you have the media needed for future backups
-> Status of drives

Note: All of these can be checked in the GUI.
Also, on a UNIX server, with basic scripting knowledge, alert emails can be generated based on your requirements.
Some common examples: email for the status of backups, a daily bperror report (there are many switches and many different reports possible), status of drives every few hours, and a media report (using the available_media script).
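
As a rough illustration of such an alert email, the sketch below only builds the message. It assumes the failed backups, down drives and scratch count have already been collected by whatever means you have (for example from the reports mentioned above); the function name, the addresses and the min_scratch threshold are all hypothetical.

```python
from email.message import EmailMessage
from typing import List, Tuple


def build_report(failed: List[Tuple[str, int]], down_drives: List[str],
                 scratch_tapes: int, min_scratch: int = 10) -> EmailMessage:
    """Build a plain-text health summary from data collected elsewhere."""
    lines = [f"Failed backups: {len(failed)}"]
    lines += [f"  {client}: status {status}" for client, status in failed]
    lines.append(f"Drives down: {', '.join(down_drives) or 'none'}")
    lines.append(f"Scratch tapes: {scratch_tapes}")

    # Flag the mail if anything needs a human to look at it.
    problems = bool(failed or down_drives or scratch_tapes < min_scratch)

    msg = EmailMessage()
    msg["Subject"] = "Backup health: " + ("ATTENTION" if problems else "OK")
    msg["From"] = "backup-monitor@example.com"   # placeholder address
    msg["To"] = "backup-team@example.com"        # placeholder address
    msg.set_content("\n".join(lines))
    return msg
```

The returned message can be handed to smtplib.SMTP(...).send_message(msg) from a scheduled job; the mail server name is site-specific and is left out here.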

If you need anything more specific, please let us know.

Srikanth_Gubbal · ‎04-26-2009

List the operations you would like to make easier, and I will try to suggest the best way to do each.

Chitra · ‎04-26-2009

Thanks all for the suggestions.

We are part of the monitoring team, so we only do very basic troubleshooting.

We don't have CLI access and rely entirely on our consoles. Checking reports, scratch counts, drives and so on is not the main problem, but a few environments have a lot of activity: sometimes we have close to 500 jobs active on the console. Naturally we cannot check each job individually, so I want some key things (tabs, options) which, if checked systematically, will confirm that nothing is wrong in the environment.

Operations like restores and tape management are no problem; it's just the monitoring.

Thanks again,

Chitra

Yongkang · ‎04-26-2009

Are you using NOM and/or VBR?

Srikanth_Gubbal · ‎04-26-2009

You can check elapsed time and KB/sec in the Activity Monitor: right-click on a job, look for the columns option, and move the column you want into view.

Monitoring generally includes checking whether any jobs failed, which clients are running at very low speed, and so on. You can pick all of these out by clicking the column headers: every column header in the Activity Monitor acts as a filter (it sorts the list), so once you click the Status header, the jobs are ordered starting from 0. Play with the Activity Monitor and let me know.
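
The same triage can be written down as a small sketch, in case it helps to see the logic outside the console. It assumes the job list is available as rows; the Row fields, the state labels and the 1,000 KB/sec cut-off are illustrative assumptions, not anything the Activity Monitor exports in this form.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """One job line; a hypothetical stand-in for an Activity Monitor row."""
    client: str
    state: str              # assumed labels: "Active" or "Done"
    status: Optional[int]   # None while the job is still running
    kbps: int


def triage(rows: List[Row], min_kbps: int = 1000) -> Tuple[List[Row], List[Row]]:
    """Return (failed jobs, worst status first; slow active jobs, slowest first)."""
    failed = sorted(
        (r for r in rows if r.status is not None and r.status > 1),
        key=lambda r: r.status,
        reverse=True,
    )
    slow = sorted(
        (r for r in rows if r.state == "Active" and r.kbps < min_kbps),
        key=lambda r: r.kbps,
    )
    return failed, slow
```

With several hundred jobs on screen, this leaves two short lists to look at instead of checking every job.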

Dcase · ‎05-28-2009

"Also, for unix server, with basic scripting knowledge, alert emails can be generated based on the requirements.
Some common examples:Email for status of backup, daily report of bperror (there are many switches and many different reports possible), status of drives every few hours, media report (using available_media script)"

Is there no way to do this in a Windows environment? I'm looking for an email/alert that will be triggered for low scratch. We use NOM but the "Low available media" alert doesn't seem to work.

Thanks,
Dale

alazanowski · ‎05-28-2009

While others will focus on "put out the fire and rerun", I believe in fixing things for good across the board and making everything run smoothly. Depending on what rights you have, it's important to check a few things:

1) Do you have a lot of re-queuing or failed jobs?
--- If you do, it's important to find out what they have in common. Is there a down drive? Is there a mismapped drive? Is there a ghost tape in the robot? Do the media server and master server need their services recycled?

2) How long are the jobs being queued?

----- A lot of database jobs will fail or seem hung if they have waited a long time to get access to drives. It's important that database policies have the highest priority, since they need to finish so the database can get back into active mode. Flat-file backups generally don't need as high a priority.

3) How long are the backups running? Is this causing delays for others?

---- Sometimes you need to look at what is getting backed up. If it's the /tmp directory with a ton of small files, there's no need to back it up; it only delays the job finishing. You can also look at the KB/sec throughput to see how performance is. Most 100 Mbit full-duplex connections will write at roughly 8,000 KB/sec to 10,000 KB/sec if the network is healthy and everything is set up correctly. Gigabit connections can write significantly more (I've seen 55,000 KB/sec and more). Making sure that the network switches and network card ports are set to their highest speed and full duplex, with no auto-negotiation, can resolve a lot of backup performance issues. A ton of small files significantly slows down backups because for each file the backup has to write the header, read the tiny file, write it, then stop its writing pattern and do it all again for the next one. It may be worth asking "can we zip and archive them at a certain point?", or whether this is a better fit for VCB or FlashBackup.
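
To put the throughput figures above in context, here is a quick back-of-the-envelope calculation. It assumes 1 KB = 1000 bytes and ignores protocol overhead, so the ceiling works out to 12,500 KB/sec for a 100 Mbit link and 125,000 KB/sec for gigabit; the function names are just for illustration.

```python
def line_rate_kb_per_sec(link_mbit: float) -> float:
    """Theoretical ceiling of a link in KB/sec (assuming 1 KB = 1000 bytes)."""
    bits_per_sec = link_mbit * 1_000_000
    return bits_per_sec / 8 / 1000


def utilisation(observed_kb_per_sec: float, link_mbit: float) -> float:
    """Fraction of the theoretical line rate that a job is achieving."""
    return observed_kb_per_sec / line_rate_kb_per_sec(link_mbit)


# The three figures quoted in the post, as (link speed in Mbit, observed KB/sec).
for link, observed in [(100, 8_000), (100, 10_000), (1000, 55_000)]:
    share = utilisation(observed, link)
    print(f"{link} Mbit: {observed} KB/sec is {share:.0%} of line rate")
```

So 8,000 to 10,000 KB/sec is about 64% to 80% of what a 100 Mbit link can carry, and a job running far below that range on such a link is worth a closer look.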

There are tons upon tons of recommendations you can get on how to react and improve your environment. It all depends on your current circumstances and your availability.

Raghuraam · ‎05-28-2009

If you have to monitor more than one master server, then make use of NOM. Related docs are linked below:

http://seer.entsupport.symantec.com/docs/290226.htm

http://seer.entsupport.symantec.com/docs/290227.htm

quebek · ‎05-29-2009

Hello
When I check jobs via the GUI, I sort by the Status column, so all jobs which finished with a status greater than 1 are easily found.
I also use the filter (Ctrl+T) in the Windows-based console (in the Java console, right-click in the right-hand pane while in the Activity Monitor and select the FILTER option) and filter by the desired client or policy name.

ABT · ‎07-24-2009

Not sure how to ask this question...
I can't see anything in Activity Monitor, on either the master or the media servers, when launching it.

Any ideas? I'm running 6.5.2A

Any help is appreciated.
Thanks
ABT

Andy_Welburn · ‎07-24-2009

This thread has essentially been 'solved'.

A new post will keep all comments relevant to your problem & allow you to mark any future solution for your specific issue.

See you there ;)
