
Datastore Utilization Alarm Trigger Test


I had an opportunity to run a datastore utilization alarm test at 95% utilization. The datastore critical alarm was set to 95% and the free space was set to 10% before the customer implemented VMware snapshot-based backups. The goal of the test was to measure the response times of the teams involved: first the time from the datastore reaching 95% until the alarm reaches the monitoring team, and then the time until the VMware platform team is notified to perform a vMotion to relieve the datastore out-of-space condition. Below are the steps performed and the actual results.

Test Steps and Expectations:

  1. Datastore 95% alarm triggers (Monitoring Team should receive the alarm)
  2. Ticket should be escalated/raised with the Platform Team (performed by Monitoring Team)
  3. Platform Team should acknowledge the ticket (performed by VMware Platform Team)

Record Test Results:

  1. Capture how much time elapses for ticket to be raised with platform team
  2. Document the time each step is performed
  3. Identify any other groups that should be notified when NetBackup fills up the datastore
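
If the times are written down by hand, a small script can do the elapsed-time arithmetic. This is only a rough sketch: the step names, the `elapsed_between_steps` helper, and the timestamps are all placeholders I made up for illustration, not values from the actual test.

```python
from datetime import datetime, timedelta


def elapsed_between_steps(timestamps):
    """Return the gap between consecutive steps and the total elapsed time.

    `timestamps` maps a step name to the time it happened, in the order
    the steps were performed (dicts keep insertion order).
    """
    ordered = list(timestamps.items())
    gaps = [
        (earlier, later, t_later - t_earlier)
        for (earlier, t_earlier), (later, t_later) in zip(ordered, ordered[1:])
    ]
    total = ordered[-1][1] - ordered[0][1]
    return gaps, total


# Placeholder timestamps (hypothetical, not the recorded test data).
steps = {
    "alarm_received": datetime(2024, 1, 1, 10, 0),
    "ticket_raised": datetime(2024, 1, 1, 10, 25),
    "ticket_acknowledged": datetime(2024, 1, 1, 11, 0),
}

gaps, total = elapsed_between_steps(steps)
for earlier, later, gap in gaps:
    print(f"{earlier} -> {later}: {gap}")
print(f"total: {total}")
```

Replace the placeholder timestamps with the times documented for each step to get the per-step and end-to-end numbers.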

Actual Test Results: 

Total time elapsed from step 1 to step 3 was 1 hr. This does not include the vMotion time needed to resolve the datastore out-of-space or full condition. It was clear that the 95% threshold and 10% free space left too little margin given the response times of the respective teams. It was therefore concluded that free space should be at least 20% and the alarm threshold should be lowered to 85% to avoid a potential outage.
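
To see why the alarm threshold matters against a 1 hr response time, here is a rough back-of-the-envelope sketch. The datastore capacity and the growth rate are numbers I assumed purely for illustration (they were not measured in this test), and the `hours_until_full` helper is hypothetical. It also assumes the datastore fills at a constant rate, which snapshot growth will not necessarily do.

```python
def hours_until_full(capacity_gb, alarm_threshold_pct, growth_gb_per_hour):
    """Hours between the alarm firing and the datastore reaching 100%,
    assuming a constant fill rate."""
    remaining_gb = capacity_gb * (100 - alarm_threshold_pct) / 100
    return remaining_gb / growth_gb_per_hour


# Assumed values for illustration only.
CAPACITY_GB = 2000          # hypothetical datastore size
GROWTH_GB_PER_HOUR = 100    # hypothetical snapshot/backup growth rate
RESPONSE_HOURS = 1          # step 1 to step 3 time observed in the test

for threshold in (95, 85):
    headroom = hours_until_full(CAPACITY_GB, threshold, GROWTH_GB_PER_HOUR)
    margin = headroom - RESPONSE_HOURS
    print(f"alarm at {threshold}%: {headroom:.1f} h until full, "
          f"{margin:.1f} h left after ticket acknowledgement")
```

With these assumed numbers, an alarm at 95% leaves no time for the vMotion itself once the ticket is acknowledged, while an alarm at 85% leaves some room. Plug in your own capacity and observed growth rate to size the threshold for your environment.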