This is the first in a series of two guides that take you through some general performance troubleshooting steps I've gathered over the last 8-9 years supporting Enterprise products for two popular, large software vendors. I'd welcome anyone's thoughts and comments, and will enhance the guide whenever I get the chance, both as I gain more experience and in response to other people's comments, questions and suggestions.
The first thing really is that we have to get the terminology right.
A crash, for our purposes, is when a process under Windows terminates unexpectedly. A process dump may or may not be created. The process may or may not get restarted. This type of issue is out of scope for this guide. (We'll worry about that if I write that guide after this one!)
BSODs are a type of crash (but of the beloved Windows kernel, rather than an application), as are outright system freezes. Both of these, though, are also out of scope, and I have close to zero knowledge of troubleshooting BSODs or freezes (so someone else will have to write that guide!)
A hang I would define as arising from one of the following two situations :-
* A process drops to 0% CPU usage for a considerable time (when you're actually expecting it to spring to life and do something useful)
* A process consumes 100% CPU usage (or close to it) for a considerable period of time (when it doesn't normally consume this much CPU for this length of time)
Troubleshooting these two types of issue is similar, but there are some differences, as we will go into.
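To make the two definitions above concrete, here is a minimal Python sketch that labels a run of CPU readings for a single process. The thresholds (1% and 95%) and the window of 20 samples are purely illustrative assumptions, not fixed rules, and the readings are assumed to be normalised to a 0-100 scale.

```python
from typing import Sequence


def classify_cpu_samples(
    samples: Sequence[float],
    idle_below: float = 1.0,   # assumed cut-off for "dropped to 0% CPU"
    busy_above: float = 95.0,  # assumed cut-off for "100% CPU, or close to it"
    min_samples: int = 20,     # assumed window, e.g. 20 samples at 15 s = 5 minutes
) -> str:
    """Return 'idle-hang', 'busy-hang' or 'normal' for one process's CPU readings."""
    recent = list(samples)[-min_samples:]
    if len(recent) < min_samples:
        # Not enough data yet to call it a "considerable time".
        return "normal"
    if all(value <= idle_below for value in recent):
        return "idle-hang"
    if all(value >= busy_above for value in recent):
        return "busy-hang"
    return "normal"


if __name__ == "__main__":
    print(classify_cpu_samples([0.0] * 20))
    print(classify_cpu_samples([99.0] * 20))
```

What counts as "a considerable time" depends entirely on the process, so the window size is the first thing you would tune.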
Precision is key
The next thing we really need to do is try to define what the problem is. For example, "My system runs at 100% CPU" isn't particularly accurate or precise. We have to begin to observe when this happens, how often it happens, and what process(es) are involved. Then we have to gather some data, theorise about the problem, and devise tests to prove it.
A *better* description is "My Enterprise Vault Server runs at 100% CPU randomly throughout the day, and I can see it's the retrievaltask process".
This, of course, is the 100% CPU usage situation outlined above, and I will explain this area in more detail in the next article.
In my experience you should collect the following :-
* A performance monitor log file
* A number of userdumps of the suspected process. At least two, preferably three or even four.
The performance monitor log file takes a bit of setting up, but isn't too hard if you follow these guidelines.
I normally start with things like Process, Processor, Memory and Paging File, and capture all instances at 15-second intervals for a period of an hour. If you already know a lot about the issue you're troubleshooting then you may capture less than this, at a more frequent (or less frequent) interval, over a shorter or longer period... this is partly an art form, I think!
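As a rough sketch of that starting point, the Python below builds (but does not run) a `logman create counter` command for those four objects, sampling every 15 seconds for an hour. The collector name `hang_baseline` and the output path are hypothetical, and the wildcard counter paths are my assumption of what "all instances" translates to. Check the command against your own system before relying on it.

```python
from typing import List

# Assumed counter paths for "all instances" of the four objects named above.
COUNTERS = [
    r"\Process(*)\*",
    r"\Processor(*)\*",
    r"\Memory\*",
    r"\Paging File(*)\*",
]


def format_hms(seconds: int) -> str:
    """Format a number of seconds as hh:mm:ss, the form logman expects."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_logman_command(
    name: str = "hang_baseline",                  # hypothetical collector name
    output: str = r"c:\perflogs\hang_baseline",   # hypothetical output path
    interval_s: int = 15,
    duration_s: int = 3600,
) -> List[str]:
    """Build the argument list for creating a counter log; nothing is executed."""
    return [
        "logman", "create", "counter", name,
        "-c", *COUNTERS,
        "-si", format_hms(interval_s),   # sample interval
        "-rf", format_hms(duration_s),   # run for this long, then stop
        "-o", output,
    ]


def expected_samples(interval_s: int = 15, duration_s: int = 3600) -> int:
    """How many samples per counter the capture should contain."""
    return duration_s // interval_s


if __name__ == "__main__":
    print(" ".join(build_logman_command()))
    print(expected_samples(), "samples per counter")
```

Creating the collector does not start it; you would follow up with `logman start hang_baseline` (or whatever name you chose). Shortening the interval or lengthening the run grows the log accordingly, which is part of the trade-off described above.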
The userdumps need to be taken a few minutes apart. The two easiest ways are to use adplus.vbs, which comes as part of the Windows Debugging Tools, or to install DebugDiag. Using DebugDiag you just right-click on the process periodically to create a dump file. With adplus.vbs (my favourite) you run a command like this :-
Note that it takes a few minutes for adplus to run, as it's taking a memory dump of a process. You can observe it in action: a little command window appears in a minimised state, and that's actually cdb (the command-line debugger) capturing the data via the adplus.vbs script. Wait for that window to disappear, then start a timer in your mind of 2-5 minutes before you take the next dump file, and so on.
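A minimal Python sketch of an adplus.vbs run in hang mode, repeated with a gap between dumps, might look like the following. The process name `RetrievalTask.exe`, the three-minute gap and the assumption that adplus.vbs is in the current directory are all illustrative. The switches used (`-hang`, `-pn`, `-o`) are the standard adplus ones, but treat the exact command line as an assumption rather than a prescription.

```python
import subprocess
import time
from typing import Callable, List


def build_adplus_command(process_name: str, output_dir: str = r"c:\dumps") -> List[str]:
    """Build a hang-mode adplus command line for a process name (not a PID)."""
    # Assumes adplus.vbs is in the current directory (the debugging tools folder).
    return ["cscript.exe", "adplus.vbs", "-hang", "-pn", process_name, "-o", output_dir]


def take_dumps(
    process_name: str,
    count: int = 3,        # assumed: three dumps
    gap_s: int = 180,      # assumed: three minutes between dumps
    run: Callable = subprocess.run,
    sleep: Callable = time.sleep,
) -> int:
    """Run adplus `count` times, pausing `gap_s` seconds between runs."""
    command = build_adplus_command(process_name)
    for attempt in range(count):
        # NOTE: the script may return while the debugger is still writing the
        # dump, so treat gap_s as a minimum rather than an exact spacing.
        run(command, check=True)
        if attempt < count - 1:
            sleep(gap_s)
    return count


if __name__ == "__main__":
    # Only print the command here; call take_dumps() on the affected server.
    print(" ".join(build_adplus_command("RetrievalTask.exe")))
```

The `run` and `sleep` parameters exist only so the loop can be tried out without actually invoking the debugger.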
When you're done with adplus.vbs, and you've got 3-4 dump files, all of the requisite data will be in the c:\dumps folder.
These dump files along with the performance monitor log file are what you need to gather for Support.
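Before sending anything off, a quick check along these lines can confirm that you really have at least two dumps and the log file. This is only a sketch: it assumes the dumps end in `.dmp` somewhere under the output folder, and the performance log path in the example is hypothetical.

```python
from pathlib import Path
from typing import List


def find_dumps(dump_root: str = r"c:\dumps") -> List[Path]:
    """Return every .dmp file under dump_root, including subfolders."""
    return sorted(Path(dump_root).rglob("*.dmp"))


def ready_for_support(dump_root: str, perf_log: str, min_dumps: int = 2) -> bool:
    """True if there are at least min_dumps dump files and the log file exists."""
    return len(find_dumps(dump_root)) >= min_dumps and Path(perf_log).is_file()


if __name__ == "__main__":
    # Hypothetical performance log path; substitute wherever your log was written.
    print(ready_for_support(r"c:\dumps", r"c:\perflogs\hang_baseline.blg"))
```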