Keeping true to my last blog post, “Unstructured Data – GDPR’s Wild Wild West”, I will continue with the Western theme.
In my last post, I wrote about the hidden pitfalls of finding and managing data in unstructured data sources. In this post, I want to introduce the idea of determining the value of data in your environment.
When I talk with customers about intelligent data management, I usually say that over 50% of unstructured data is stale, redundant, obsolete or trivial. Every time I discuss this topic, everyone in the room not only agrees but usually says “I bet ours is higher than that!”
Veritas’ own Databerg Report reveals that only 15% of data in most organizations is considered “clean”, while 32% is considered redundant, obsolete or trivial (ROT). This leaves the rest, more than half of all data, as “dark data”. This is data whose value has not been determined, so it cannot yet be placed in either the “clean” or the “ROT” category.
So how do you mine your unstructured data sources to find the gold? How do you place value on data when you know nothing of the data beyond a file name?
The “value” of data comes in two forms, and you need both to get a complete picture of what to do with a file.
First, you have contextual information. This is information that will give you insight into how the data is being used, who is using it, who is modifying it, how often it is being used or modified, who owns the data (actual and inferred), and how much of your data is stale.
Next, you have content. You can’t get a complete understanding of the data in your environment by looking at usage and utilization statistics alone. There is also value in the content of the files themselves. By determining the content of a file, you can determine how the data should be managed, how long you will keep it, and which compliance and retention settings apply to it under your corporate policy.
Once we understand both the context and the content, we can now start to place data into different logical “buckets” of information.
- The first bucket may be data that is not being actively used and hasn’t been modified or even opened in a long time. On top of that, by looking at the content of the files we can determine that they hold no business value. These files are perfect candidates for deletion.
- The second bucket may be data that is not being actively used and hasn’t been modified or opened in a long time, but that contains a keyword, phrase, or pattern which, under corporate policy, means we must keep it for a set period. Because the data is not being used but has “value” to the company based on its content, it is a perfect candidate for archival.
- A third bucket may be data that is actively used, opened and modified often, and holds content that requires it to be retained for a set period based on corporate policy. This is data that we will want to keep on fast storage and make sure that it is highly available.
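To make the three buckets concrete, here is a minimal Python sketch of the decision logic. Everything in it is an assumption for illustration: the one-year staleness threshold, the retention patterns, the `FileInfo` fields, and the bucket names are hypothetical and do not describe how any Veritas product works. Files that are active but hold no retention-relevant content are not covered by the buckets above, so the sketch simply flags them for review.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy values; a real corporate policy would define these.
STALE_AFTER = timedelta(days=365)
RETENTION_PATTERNS = [
    re.compile(r"\binvoice\b", re.IGNORECASE),  # example keyword
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # example ID-like pattern
]


@dataclass
class FileInfo:
    name: str
    last_accessed: datetime  # context: when the file was last opened
    last_modified: datetime  # context: when the file was last changed
    text: str                # content: extracted text of the file


def is_stale(f: FileInfo, now: datetime) -> bool:
    """Context check: not opened or modified within the threshold."""
    return now - max(f.last_accessed, f.last_modified) > STALE_AFTER


def must_retain(f: FileInfo) -> bool:
    """Content check: any retention keyword or pattern appears in the text."""
    return any(p.search(f.text) for p in RETENTION_PATTERNS)


def bucket(f: FileInfo, now: datetime) -> str:
    """Combine context and content into one of the buckets."""
    stale, retain = is_stale(f, now), must_retain(f)
    if stale and not retain:
        return "delete"                # first bucket
    if stale and retain:
        return "archive"               # second bucket
    if retain:
        return "keep-on-fast-storage"  # third bucket
    return "review"                    # active, no retention match: not covered above
```

The point of the sketch is that neither signal is enough on its own: the same stale file lands in “delete” or “archive” depending only on what its content says.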
Data management has matured to the point where we can monitor how data is being used, view the content within a file, and then manage the file accordingly. Imagine a world where you can be completely hands-off when managing your data, letting automation take care of everything. From the moment a file is created on the network we can monitor its use, look at the content, and then automatically classify and tag the file. Based on this classification we can then assign an automated retention and defensible deletion plan for the data. The file is managed throughout its lifecycle without any intervention from the IT staff.
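As a rough illustration of the last step, the Python sketch below maps a classification tag to a retention plan and works out when a file becomes eligible for deletion. The tags, retention periods, and storage tiers are made-up examples, not values from any real policy or product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPlan:
    keep_days: int     # how long the file must be kept after creation
    storage_tier: str  # where the file should live while it is retained


# Hypothetical tag-to-plan table; real values come from corporate policy.
PLANS = {
    "financial": RetentionPlan(keep_days=7 * 365, storage_tier="archive"),
    "hr": RetentionPlan(keep_days=3 * 365, storage_tier="archive"),
    "no-business-value": RetentionPlan(keep_days=0, storage_tier="standard"),
}


def deletion_date(tag: str, created: date) -> date:
    """Earliest date on which a file with this tag may be deleted."""
    return created + timedelta(days=PLANS[tag].keep_days)


def eligible_for_deletion(tag: str, created: date, today: date) -> bool:
    """True once the retention period for the tag has fully elapsed."""
    return today >= deletion_date(tag, created)
```

Keeping the policy in a single table like this is what makes the deletion defensible: every decision can be traced back to a tag and a documented retention period rather than to someone’s judgment on the day.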
Gone are the days of simply looking at a mountain or a stream and wondering where the gold may be. I’m sure the Gold Rush would have been a lot easier if there were signs and arrows pointing to where the valuables were stored. Veritas can help you find the gold in your mountains of unstructured data.