Within our organisation we use Enterprise Vault (EV) solely for email archiving. We have six Exchange 2000 active/passive clusters, each with a dedicated EV server attached, split equally across two data centres.
Before our virtualisation adventure took place we were running our EV email archiving servers on end-of-life hardware (it was already end of life when the project implemented EV), and hardware failures were occurring with ever-increasing frequency. Typically mirrored disks would go down, RAID card batteries would fail, RSA cards would stop functioning; the list goes on. It was just one big headache, and that was only the hardware.
On the software side we were running EV 2007 7.5 SP1, which gave us no end of users complaining that their emails were not completely archiving and their mailboxes were not decreasing in size. We were also running on Windows 2003 SP1, which Symantec had advised us several times had serious shortcomings with MSMQ. Essentially the outgoing queues did not get processed, so any archive messages which needed to travel between EV servers did not get delivered, and we were back to the problem of users reporting that their email items were not being archived properly.
With an ever-expanding user base in the thousands this was becoming a real headache. All the while PST migrations were taking place as more users were migrated to the service, and EV servers were failing with all manner of issues. Luckily we have warm standby boxes (building blocks) which were constantly utilised; without these we would have been royally $cr3w3d several times over.
All of this added up to over 70% of our email support incident queue being directly related to Enterprise Vault.
Our first plan for improving the infrastructure was to upgrade each EV server to Windows 2003 SP2 to fix the aforementioned MSMQ issues. This proved to be a lengthy process as we have a lot of pre-requisites built around SP2. However, this avenue of progress was quickly dismissed after the second server we started this process on suffered a severe hard disk failure. One mirror disk went down in the server and the other disk in the mirror decided to go down with it. Luckily our hardware vendor was able to supply a replacement disk, rebuild the mirror and restore the data.
At this point we gave up on upgrading the software levels on the existing infrastructure and decided to virtualise the entire environment onto new, clean Windows 2003 SP2 builds. Once we were on the virtualised infrastructure, the plan was to snapshot all machines and then upgrade Enterprise Vault itself from SP1 to SP5.
We built the new Windows 2003 servers in the virtual environment instead of using P2V tools. We had two concerns which led us to do this. The original server builds were old and clapped out. In addition we had been warned about injection of drivers when running P2V. Although our ESX guys had not experienced it themselves, they were concerned it could become an issue. Erring on the side of caution we went with new builds.
Our installations went well, with SAN storage snapped across by our storage and ESX teams. Our only issue came about when we put the new servers into production. We use building blocks (DNS aliases) in our environment, and we installed the new servers with the same DNS aliases as the old servers. The net result was that the services were running on one server and the tasks on another. However, a call to Symantec and some database and registry updates later, we had all services and tasks running on the new servers and could remove the old servers from the administration console server list.
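For anyone attempting a similar cutover with building-block DNS aliases, a quick sanity check that each alias really resolves to the intended new server may save some head scratching. The following is only a minimal sketch and not what we ran at the time: the alias and host names are hypothetical, and it only checks DNS resolution, so it would not have caught the EV database and registry entries that Symantec helped us correct.

```python
import socket


def check_aliases(expected, resolve=socket.gethostbyname):
    """Compare each DNS alias against the server it is supposed to point at.

    expected: dict mapping alias name -> intended server host name
              (all names here are hypothetical examples).
    resolve:  function turning a name into an IP address; it can be
              swapped out for testing without a real DNS server.
    Returns a dict of alias -> (intended host, what the alias resolved to)
    for every alias that does not match or could not be looked up.
    """
    mismatches = {}
    for alias, host in expected.items():
        try:
            alias_ip = resolve(alias)
            host_ip = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            mismatches[alias] = (host, f"lookup failed: {exc}")
            continue
        if alias_ip != host_ip:
            mismatches[alias] = (host, alias_ip)
    return mismatches


if __name__ == "__main__":
    # Hypothetical alias-to-server mapping; replace with your own names.
    wanted = {"evalias1": "evnew1", "evalias2": "evnew2"}
    for alias, (host, got) in check_aliases(wanted).items():
        print(f"{alias}: expected {host}, got {got}")
```

Passing the resolver in as a parameter is just a convenience so the check can be tried against a fake lookup table before pointing it at production DNS.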
At this point we stopped all activity for a week, held our breath and waited for the end-user response. All seemed good: we had a reduction in the number of support incidents and the feedback was positive.
We then prepared for the big bang approach to the SP5 upgrade. This went through easily, with minimal fuss.
After the upgrade we considered what to do with the old EV servers. One of them had actually had a network card failure while we were migrating service off it. We realised that using the old servers in a disaster recovery scenario was not feasible, let alone running the entire production environment off them. They are now either in landfill or have been reincarnated as mobile phones.
It has now been several weeks since the virtualisation and upgrade projects were put to bed. We have seen a 98% reduction in the number of support incidents for Enterprise Vault. All of my monitoring scripts, which were so essential for maintaining a once-creaking system, are now almost redundant. Hardware failures are a thing of the past now that we enjoy HA through ESX's clustering. We are confident we are getting better performance than on physical hardware. We still have the building blocks standby servers in place, just in case of OS corruption on an individual EV server VM.
While I am satisfied with this victory, I now need something else to fix. I feel better for having had this rant about the kind of terrible condition systems can get into when left unchecked. Your comments and questions are welcome. :)