A regular day in an organization produces countless emails, instant messages, and documents that may be kept long after their useful lifespan. It doesn’t stop there: there are also thousands of ZIP files, log files, archived web content, partially developed and then abandoned applications, code snippets… Organizations often hoard unmeasured and unknown amounts of data long after it has stopped delivering business benefit.
What percentage of Data Collection is considered to be Dark?
Gartner defines this as dark data: the information assets organizations collect, process, and store during regular business activities but generally fail to use for other purposes. Dark data is growing at a rate of 62% per year, according to IDG. By 2022, they say, 93% of all data will be dark data.
The worrying fact is that dark data contains huge amounts of sensitive data, and most organizations fail to account for this in their security strategy. As dark data grows, so does the risk of exposure.
Moreover, under regulations like GDPR and CCPA, individuals have the right to ask organizations how their data is managed. They can request a copy of this data and even demand its deletion. Identifying and extracting the right data from dark data is very difficult if you don’t know exactly where your sensitive data assets are. In fact, GDPR presumes that organizations know exactly what data, especially personal data, they hold.
However, we cannot simply do away with dark data by deleting it. Dark data is a gold mine of historical data that data-driven organizations can use to analyze trends, understand consumer behavior, and make sound decisions. Doing so while remaining compliant with data regulations calls for complete awareness and protection of your sensitive data assets.
How can you locate all the sensitive data locations (including the dark data)?
There are several methods available to find sensitive data locations. Some organizations depend on database or application experts and their knowledge to decide what data is sensitive. Rudimentary methods like dictionary matching find data only when column names follow a known pattern or when data is stored under known column names such as National Identifier, First Name, and so forth. These methods do not yield complete discovery, because they miss data hidden in hard-to-find locations, such as a “value” or “description” field, much like dark data.
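To make the limitation concrete, here is a minimal sketch of dictionary matching on column names. The term list, the `normalize` and `flag_columns` helpers, and the sample columns are all hypothetical, chosen only for illustration; a real discovery tool would use a far larger dictionary.

```python
import re

# Hypothetical dictionary of column names that suggest sensitive data.
SENSITIVE_TERMS = {"national_identifier", "first_name", "last_name", "ssn", "email"}

def normalize(column_name: str) -> str:
    """Lowercase a column name and collapse separators into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", column_name.strip().lower()).strip("_")

def flag_columns(column_names):
    """Return the columns whose normalized name appears in the dictionary."""
    return [name for name in column_names if normalize(name) in SENSITIVE_TERMS]

# Sample schema (assumed): two obviously named columns and two generic ones.
columns = ["First Name", "National Identifier", "order_total", "value", "description"]
flagged = flag_columns(columns)
print(flagged)
```

Only the two explicitly named columns are flagged. Any personal data sitting in the generic `value` or `description` columns goes unnoticed, because the check never looks at the data itself.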
Another method is regular expression (regex) and pattern-based searching, but it is not enough either. Such searches fail to find sensitive data in complex columns, composite columns, BLOBs, CLOBs, key-value pairs, phantom tables, and so on, and they produce a high number of false positives.
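A small sketch shows both failure modes of pattern-based searching. The pattern below matches the common dashed layout of a US Social Security number; that pattern and the sample rows are assumptions made for illustration, not taken from any particular product.

```python
import re

# Illustrative pattern (assumption): digits laid out as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_matches(text: str):
    """Return every substring of text that has the NNN-NN-NNNN shape."""
    return SSN_PATTERN.findall(text)

rows = [
    "Customer SSN 123-45-6789 on file",   # genuine hit
    "Part number 987-65-4321 shipped",    # same shape, but not an SSN
    '{"k": "ssn", "v": "123456789"}',     # undashed value inside a key-value pair
]
results = [find_matches(row) for row in rows]
print(results)
```

The second row is a false positive, since the pattern only sees the shape of the text. The third row is a miss, since the value is stored without dashes inside a key-value structure. Loosening the pattern to catch the third row would produce even more false positives.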
Moreover, the challenges of accurate discovery do not stop there. Incomplete discovery of sensitive data in dark data stores will also lead to incomplete anonymization and monitoring, leaving a lot of data unprotected and hence vulnerable.
To manage data growth effectively, especially the growth of dark data, and to keep that data secure, organizations need to deploy the right tools to protect all of their sensitive data assets, including dark data.