How collections and sparse collections work
One feature of Enterprise Vault is Collections: Enterprise Vault gathers multiple archived items into a single Collection, which is a Microsoft Cabinet (CAB) file. The main reason for doing this is to reduce backup times.
For instance, a single 10MB CAB file will back up more quickly than many small files, say 100 x 10KB files. One thing to note, however, is that the CAB files are not compressed, so extracting a 10MB CAB file results in 10MB of DVS files. The reason is that DVS files are already highly compressed, and attempting to compress data that is already compressed saves little or nothing and can even produce a bigger file.
Enterprise Vault Collections are configured on the Collections tab, where you can configure when the collections run, how big the CAB files can be, and how old the items have to be before they can be collected into a CAB file.
Note that once you enable Collections, you cannot disable them. The best you can do is either make the age of the files to collect so old that nothing would ever be collected, or limit the amount of time that the collections process can run (e.g. setting the start AND end time to 11:00AM).
A word of caution on the second method, though: when an item is retrieved from a CAB file, it is put back in its original location and named as an ARCHDVS file (or ARCHDVSSP, ARCHDVSCC etc. on EV8). Those files are not automatically deleted after the user has finished reading the email.
Instead, it is the Collections process itself that comes along afterwards and deletes the ARCHDVSxx files after a certain period of time. If the collections window is set too short or has 0 seconds to run, then the ARCHDVS files cannot be cleaned up and you will end up using the same disk space twice unnecessarily.
Where are CAB Files stored?
Collections themselves are stored in different places depending on your version of Enterprise Vault.
Enterprise Vault 2007 and below:
The following folder structure is used to store DVS files, and the Collections are placed in the “Day” folder.
Files are stored in a yyyy\mm\dd\hh format. For example:
E:\Enterprise Vault Stores\Journal Vault\ptn1\2010\01\30\17\<saveset>.dvs
The above would represent an item archived at 5pm on 30th January 2010.
The CAB files are stored in the \dd\ folder, so the path may look like:
E:\Enterprise Vault Stores\Journal Vault\ptn1\2010\01\30\Collection12345.cab
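The EV2007 layout described above can be sketched in a few lines of Python. This is only an illustration of the folder naming, not anything Enterprise Vault itself ships: the function names are made up for this example, and it assumes the partition root and the archive time are already known.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def ev2007_saveset_path(partition_root, archived_at, saveset_name):
    # Hypothetical helper: year, month, day and hour folders, then <saveset>.dvs
    return PureWindowsPath(
        partition_root,
        f"{archived_at:%Y}",
        f"{archived_at:%m}",
        f"{archived_at:%d}",
        f"{archived_at:%H}",
        f"{saveset_name}.dvs",
    )


def ev2007_collection_folder(partition_root, archived_at):
    # Hypothetical helper: CAB files sit one level up, in the day folder
    return PureWindowsPath(
        partition_root,
        f"{archived_at:%Y}",
        f"{archived_at:%m}",
        f"{archived_at:%d}",
    )


root = r"E:\Enterprise Vault Stores\Journal Vault\ptn1"
when = datetime(2010, 1, 30, 17, 0)  # 5pm on 30th January 2010

print(ev2007_saveset_path(root, when, "example"))
print(ev2007_collection_folder(root, when))
```

The saveset name "example" is a placeholder; real saveset names are assigned by Enterprise Vault.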
In Enterprise Vault 8, however, the folder structure is a little different.
Items are stored in \yyyy\mm-dd\LETTER
Example:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\A\074\<saveset>.dvs
The above would suggest an item archived on 11th January 2010.
However, rather than storing items in an additional hour folder as EV2007 did, EV8 uses parts of the DVS file name.
In this example the file is called A07465CEEC2320A040210B08E3549781.DVS. The name is based on the Transaction ID assigned to the item. EV takes the first character of the Transaction ID (A) as one folder, and then creates a subfolder named after the next three characters (074).
As another example, if an item called 107DC3824ADB33CDABCE5C15B7B46BD1.DVS was archived on January 11th 2010, it would be stored in the following location:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\1\07D
On Enterprise Vault 8, collection files are stored in the folder named after the first character of the Transaction ID.
For instance, the collection file may be stored here:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\1\Collection12345.cab
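As a rough sketch, the EV8 folder for a saveset can be derived from its archive date and Transaction ID as shown below. The function names are invented for this example, and the code assumes the rule is exactly as described here: first character, then the next three characters.

```python
from datetime import date
from pathlib import PureWindowsPath


def ev8_saveset_folder(partition_root, archived_on, transaction_id):
    # Hypothetical helper: yyyy, then mm-dd, then first character,
    # then the next three characters of the Transaction ID
    tid = transaction_id.upper()
    return PureWindowsPath(
        partition_root,
        f"{archived_on:%Y}",
        f"{archived_on:%m-%d}",
        tid[0],
        tid[1:4],
    )


def ev8_collection_folder(partition_root, archived_on, transaction_id):
    # Hypothetical helper: CAB files sit in the first-character folder
    return ev8_saveset_folder(partition_root, archived_on, transaction_id).parent


root = r"E:\Enterprise Vault Stores\Journal Vault"
day = date(2010, 1, 11)

print(ev8_saveset_folder(root, day, "A07465CEEC2320A040210B08E3549781"))
print(ev8_saveset_folder(root, day, "107DC3824ADB33CDABCE5C15B7B46BD1"))
print(ev8_collection_folder(root, day, "107DC3824ADB33CDABCE5C15B7B46BD1"))
```

The two Transaction IDs are the ones used in the examples above, so the printed folders should line up with the paths shown in the text.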
What happens when I delete an item or run storage expiry?
When items are added to a CAB file, they remain there until a process called Sparse Collections is run. This extracts the valid savesets and then deletes the CAB file; those savesets are then re-collected at a later date.
When an item is deleted, Enterprise Vault cannot simply delete that item from within a CAB file (this applies to other archive formats such as ZIP or RAR as well). Therefore you get into a situation where items are deleted from the databases and indexes but still remain in the CAB files.
So Enterprise Vault looks up all the items in a CAB file and determines which ones are still valid. If only a certain percentage of the items in the CAB file still exist, then EV extracts the remaining valid items and deletes the CAB file.
So how does Enterprise Vault know which CAB files to check?
When a collection file is created, two SQL columns are populated in the Collections table.
One is called RefCount and one is called TotalCount.
When a Collection is first created, EV counts how many items are stored in it and sets RefCount and TotalCount to the same number. So if 100 items are stored, both RefCount and TotalCount will be set to 100.
Then, when an item is deleted or expired from that collection, RefCount is reduced, but TotalCount remains the same.
So if 50 items that belong to that CAB file are deleted, RefCount will be set to 50 and TotalCount will remain at 100. When RefCount hits 0, none of the DVS files within that CAB file exist in the database or the indexes any more, so the CAB file and all its contents can be deleted.
But what happens if you have a RefCount of 1 and a TotalCount of 100? The 1 item that still exists in EV is stopping the other 99 items from being removed from disk and freeing up storage. This is where the Sparse Collections process comes in.
The remaining items are extracted to their original location, RefCount is set to 0 and then EV deletes the CAB file. By default, Enterprise Vault will initiate Sparse Collections when RefCount is 15% of TotalCount or lower.
So if you have 100 items stored in a CAB file, as soon as RefCount hits 15 items or lower, EV will extract them and then delete the CAB file. So if you ever run storage expiry, make sure you run your collections process afterwards so that you can reclaim disk space promptly.
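The counting logic can be modelled with a small Python sketch. This is not Enterprise Vault's code: the class, function and action names are invented for illustration, and the only values taken from the text are the two counters and the 15% default.

```python
from dataclasses import dataclass

SPARSE_PERCENT = 15  # default threshold described in the text


@dataclass
class Collection:
    # Hypothetical stand-in for one row of the Collections table
    total_count: int
    ref_count: int

    def remove_item(self):
        # Deleting or expiring an item lowers ref_count only
        if self.ref_count > 0:
            self.ref_count -= 1


def next_action(collection, sparse_percent=SPARSE_PERCENT):
    # Nothing in the CAB is referenced any more: it can simply be deleted
    if collection.ref_count == 0:
        return "delete_cab"
    # Integer comparison avoids floating point rounding at the boundary
    if collection.ref_count * 100 <= collection.total_count * sparse_percent:
        return "extract_remaining_then_delete_cab"
    return "keep"


cab = Collection(total_count=100, ref_count=100)
for _ in range(50):
    cab.remove_item()
print(cab, next_action(cab))

for _ in range(35):
    cab.remove_item()
print(cab, next_action(cab))
```

After the first loop the collection has a reference count of 50 and is left alone; after the second it is down to 15 of 100, which meets the threshold for sparse collection.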