Forum Discussion

Michael_Conlon1
9 years ago

General Sizing Guidelines - Disk space Calculation

I am trying to determine the amount of disk space that I will need for my DI 5 RP1 server. Currently I have a 100GB drive assigned for DI, and I need to grow it because it has been consumed. The question is: how big do I need to grow the disk?

We recently migrated to a new VNX which contains two large shares. The current size across the two shares is 3.5TB, consisting of 60M files in 930k folders. The daily audit files average about 150MB.

According to the sizing guidelines in Appendix B of the DI 5 Admin Guide, I should only need the following (I doubled everything because it is a single-instance installation):

200MB per 1 million files.

 60M files x 200MB per 1M files = 12,000MB, or 12GB estimated.

40MB per 1 million events.

 Not sure where to find the stats for this. Activity shows 40M.

 40M events x 40MB per 1M events = 1,600MB, or 1.6GB estimated.

Due to the single-instance installation, this server is also a collector:

 60-80 GB per collector.

Total disk space needed: 80GB for the collector + 12GB + 1.6GB = 93.6GB
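For reference, here is the same Appendix B arithmetic as a small Python script (it uses only the per-unit figures quoted above, nothing else):

# Appendix B style estimate, using only the per-unit figures quoted above.
files_millions  = 60    # 60M files across the two shares
events_millions = 40    # events shown under Activity

mb_per_million_files  = 200   # 200MB per 1 million files
mb_per_million_events = 40    # 40MB per 1 million events
collector_gb          = 80    # upper end of the 60-80GB per collector

files_gb  = files_millions  * mb_per_million_files  / 1000   # 12.0 GB
events_gb = events_millions * mb_per_million_events / 1000   # 1.6 GB

print(f"Estimated total: {collector_gb + files_gb + events_gb:.1f} GB")   # 93.6 GB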

Current size of DI server disk space is 100GB. 

Only scans and audit data have been configured so far; no reports or anything else yet.

Are there more detailed sizing guidelines that can be used to get a better estimate of space consumption?

 

  • Hello Michael.

    Sizing is very specific to the environment. Since you are using a single-tier configuration, I am left to assume you intend to have a very small environment and will advise accordingly, but first let me explain the application architecture a bit to describe your storage needs.

    The application is designed to capture access events from users hitting devices within the shares you are monitoring. Scans create the objects that store the events, such as files and directories, by encapsulating the metadata into databases stored locally on an indexer. By design, the collector node periodically runs scans to update the object references in the databases and constantly captures I/O to the device's shares.

    Once the collector receives the raw data from the device, it needs to be parsed to format the data for database records. That requires twice the space, as the raw file is not removed until the completed parsed file is created. The resources necessary are CPU and memory, which are somewhat configurable based on the number of threads allotted and the speed at which files are received. Once the files are generated and prepared, they are transferred to the indexer.

    Even though your indexer is on the same machine, the process remains the same: the files are internally transferred from the collector node to the indexer node, requiring enough disk space to hold two files for each transaction, the outgoing and the incoming file. These are recorded on each node as the file is transferred out, acknowledged, and then deleted, and the staging area is cleaned up. Until the file is deleted from the collector there will be two copies.

    The indexer will store these files until the schedule initiates a job to index all the files, which is the process that places the data into the records of the databases stored on the indexer node. Again, there are configurable resources required and processing done to compare the new data to the existing object. Some data is used to identify or create the object, and other data is used to update the record(s). On an existing object it is quite likely that much of the data is discarded once the correct record is located, as only the new data, or delta, is committed to the database. This means the disk space allocated to the storage of the temporary data files is not the same as the resulting data stored in the database.
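    As a purely illustrative sketch of those transient copies (the 150MB figure is the daily audit file size from your post; the multipliers simply reflect the raw-plus-parsed and outgoing-plus-incoming copies described above, not measured values):

    # Illustrative only: transient staging space for one audit file as it
    # moves through the pipeline described above.
    raw_file_mb = 150                    # daily audit file size from the original post

    parse_peak_mb    = raw_file_mb * 2   # raw + parsed copy on the collector
    transfer_peak_mb = raw_file_mb * 2   # outgoing (collector) + incoming (indexer) copy

    print(f"peak during parse    : ~{parse_peak_mb} MB")
    print(f"peak during transfer : ~{transfer_peak_mb} MB")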

    The Management Server, often referred to as the MS or console, is responsible for scanning the directory services to identify users, scheduling the jobs required by the different processes, displaying the data in a web GUI, and configuring the application and reports. The reports are stored on the MS, and once you start using this functionality of the application you will require storage for the reports you create as well. In a smaller environment the data captured per report will be less, and it is the number of reports you create or store that will affect the need for storage.

    Another assumption I need to make is that you have likely added additional disk to the node you are placing the entire application on, and have assigned the data directory (databases and application environmental configuration) and the application directory (installed binaries and application configuration) to the non-system disk. I am purposely avoiding the discussion of matching the sector size of the disks to the segment size of the databases on the indexer, since the indexer is only one component of the application and the files stored by the other worker node functions do not benefit from this match in either performance or disk utilization. I assume you can increase disk space at will, either through the datastore allocation for a VM, a hardware configuration change, or a product like Veritas Storage Foundation, which allows resizing of volumes and layout changes at will. Also, since you have already outgrown the 100GB of disk without fully using the application's capabilities, there is likely an alteration required to the math you are using.

    A mechanism to capture the number of events you currently have stored in a Matrix Storage Unit (MSU), or a share in human terms, is to run a utility and check the current count in the summary at the bottom. For example, you would choose one of your shares: go to the Settings tab, then Filers, and choose your VNX array name.

    filers.jpg

    Then, in the next window, select the Monitored Shares view and mouse over the column header to expose the MSUID column.

    MSUIDcolumn.jpg

    Using that number, we can run a command-line utility to determine the count of events already captured.

    Note: the hierarchy is set up as the last 3 digits of your MSUID, then the MSUID itself, so MSU 21987 would be under \987\21987, for example.
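    To make that rule concrete, here is a tiny sketch (the helper function is mine, not part of the product; the base path is the data directory used in the command below):

    # Sketch of the directory rule above: last 3 digits of the MSUID,
    # then the MSUID itself, under the indexer data directory.
    def msu_index_dir(msuid: int, base: str = r"C:\DataInsight\data\indexer\default") -> str:
        return rf"{base}\{str(msuid)[-3:]}\{msuid}"

    print(msu_index_dir(21987))   # C:\DataInsight\data\indexer\default\987\21987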

    C:\Program Files\DataInsight\bin>idxreader.exe C:\DataInsight\data\indexer\default\1\1\ -a

    Expect a long listing of captured events to scroll by; if you have 40M it will take a while to complete, but at the bottom will be a summary that you can use to assess the activity on that share (MSU).

    Total Events: 36366 (create=1395, delete=264, read=30505, write=2828, rename=795, mkdir=123, rmdir=10, renamedir=14, security=432)
    Timespan for Events: 1460570280-1460642760 (Wed 13 Apr 17:58:00 2016 GMT - Thu 14 Apr 14:06:00 2016 GMT)
    Results generated in: 32.96 minutes

    Note: I have trimmed the events down from 165M in this dataset due to time constraints, and as this is a lab environment for testing, I have not followed my own advice and have left the node majorly under-powered.

    That number will come into play later.
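    If it is useful, that summary can also be turned into a rough daily event rate. A minimal sketch (my own script, not a shipped utility), assuming the 'Total Events' and 'Timespan for Events' lines appear in the format shown above:

    import re

    # Derive a rough daily event rate from the idxreader summary shown above.
    summary = (
        "Total Events: 36366 (create=1395, delete=264, read=30505, write=2828, "
        "rename=795, mkdir=123, rmdir=10, renamedir=14, security=432)\n"
        "Timespan for Events: 1460570280-1460642760 (Wed 13 Apr 17:58:00 2016 GMT "
        "- Thu 14 Apr 14:06:00 2016 GMT)\n"
    )

    total = int(re.search(r"Total Events:\s*(\d+)", summary).group(1))
    start, end = map(int, re.search(r"Timespan for Events:\s*(\d+)-(\d+)", summary).groups())

    days = (end - start) / 86400.0               # the timespan is in epoch seconds
    print(f"~{total / days:,.0f} events/day over {days:.1f} days")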

     

    For an appropriate estimation of disk space we now need a new formula, and for that we need some basic assumptions.
    The storage needed for metadata and database object record creation is approximately (~) 0.0001 MB per file, and each event, once recorded, takes ~30 bytes; these correspond to the quantities in your basic calculations and can be estimated by checking the on-disk properties of your current C:\DataInsight\data\indexer\default directory. As you add new events this size will also increase. From a resource perspective, it is suggested that a smaller environment minimally have 8 CPU cores, 32GB of RAM, and 500GB of disk space (doubled for medium to large environments) to handle the processing we discussed above and keep the application humming along. Based on your current count of one filer and the large shares you have mentioned, the on-disk storage for the records and future reporting should be fine in this range.

    The doubling of disk space that you added because of the single tier was close in perspective, since there will be multiple copies of files in play until the raw data is actually incorporated into the database records. There is a tool support could use to capture the files for a disk-space estimate, but unfortunately it keeps copies in a separate location, until they are deleted, for this calculation or troubleshooting, which would triple the necessary disk space and is not an option in your current situation. You could also open a support case, and a technician could do some counts and assessments on your system to see the current growth rate and suggest a more accurate estimate based on your current system and environment.
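    As a rough worked example of those per-unit figures applied to the counts from your post (illustrative only, not measured from your system):

    # Apply the per-unit figures from this reply to the numbers in the
    # original post. All inputs come from the thread, not from the system.
    files  = 60_000_000            # files across the two shares
    events = 40_000_000            # events reported under Activity

    mb_per_file     = 0.0001       # ~0.0001 MB of record space per file
    bytes_per_event = 30           # ~30 bytes per recorded event

    index_mb  = files * mb_per_file                     # ~6,000 MB
    events_mb = events * bytes_per_event / 1_000_000    # ~1,200 MB

    print(f"index records : {index_mb / 1024:.1f} GB")
    print(f"event records : {events_mb / 1024:.1f} GB")
    # The suggested 500GB minimum for a small single-tier node dwarfs these
    # totals, leaving headroom for staging copies, reports, and growth.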

    There is, however, another self-assessment tool you could use; it should already be current in your environment and is useful for seeing the recent trend on your single-tier MS.

    Using the Settings tab on the MS console in the GUI, there are a couple of places to review the data that is captured as statistics.
    1) Settings >> DataInsight Servers >> click on your MS node link and view Statistics.
    Note: there are many graphs, and each has a scope covering size, # of files, time lag, server, and location, plus a separate 'Select charts' setting for the chart perspective.

    diskperf.jpg

    2) Settings >> Performance

    sysperf.jpg

    You can use these settings views to watch your performance and measure current status or past performance, giving a more personalized view of how much data you are processing and ultimately storing.

    If you have the ability to increase disk size quickly and without any downtime, and your trending shows a slow, steady increase, you could also try 200GB and keep an eye on when you may require a further allotment. You should really avoid filling the disk, since that would stop the database updates and halt the application; in a single-tier environment you do not have the stop-gap of storing the data on the collector while you increase indexer disk space. My estimate was based on the assumptions above, the future need for reporting, and the fact you are running single tier; feel free to customize it to your own needs.
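    As a hypothetical illustration of that trending advice (the growth rate below is a placeholder to be read off the statistics charts, not a measurement from your system):

    # Hypothetical headroom projection: days until the data disk fills,
    # given current usage and an observed daily growth rate.
    disk_gb       = 200.0    # proposed new disk size
    used_gb       = 100.0    # space already consumed
    growth_gb_day = 0.5      # placeholder; read the real rate from the charts

    days_left = (disk_gb - used_gb) / growth_gb_day
    print(f"~{days_left:.0f} days of headroom at {growth_gb_day} GB/day")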

     

    Rod

     
