The last post of the DataMasking Project Diary covered the process of identifying where sensitive data lives. The topic is complex, and we still have more to say about how to analyze databases and their content, which is the subject of today’s post. We want to locate sensitive data. To do so, we go through an iterative and incremental process in which we filter and segregate results to adapt them to the particularities of each organization. This improves the precision of the results, which is essential to guarantee the security of the information.
Basically, the data analysis process checks each field in the database, and its content, in order to detect sensitive information. When the analyzer finds a field that could contain sensitive information, it records it, and the finding is later presented to the configuration and security team for verification. Unfortunately, the complexity of these systems and the many differences in the way data is stored make this problem difficult to solve automatically. For this reason, generic solutions usually don’t work.
Before presenting our solution, let’s look at a simple example that shows how challenging this procedure can be.
- In the world of data, it’s common practice to use the same field to store different kinds of information. Imagine a field named document that stores either the ID or the passport number of each customer. Since the two documents don’t have the same length, the database team decided to add a 0 at the beginning of the shorter one so that both have the same length. This alters the shape of the data, so inspectors running with their generic configuration no longer detect it. This reveals a deeper issue: the way we store data alters its structure and representation. To solve it and reach every piece of sensitive data stored, intelligent and adaptable mechanisms are needed.
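To make the padding problem concrete, here is a minimal Python sketch. The formats are assumptions chosen only for illustration (an ID of 8 digits plus a check letter, padded with one leading 0 to reach 10 characters), and the pattern names are hypothetical; this is not how Icaria TDM inspectors are implemented.

```python
import re

# Assumed format, for illustration only: an ID is 8 digits plus a check letter.
GENERIC_ID = re.compile(r"\d{8}[A-Z]")
# Adapted pattern: tolerate leading zeros that were added as left padding.
PADDED_ID = re.compile(r"0*\d{8}[A-Z]")


def matches_generic(value: str) -> bool:
    """True if the whole value has the unpadded ID shape."""
    return GENERIC_ID.fullmatch(value) is not None


def matches_padded(value: str) -> bool:
    """True if the value is an ID, with or without leading zero padding."""
    return PADDED_ID.fullmatch(value) is not None


stored = "0" + "12345678Z"  # the ID as stored after padding

print(matches_generic(stored))  # False: the padding changed the shape
print(matches_padded(stored))   # True: the adapted pattern still finds it
```

The generic pattern is correct for the document itself but wrong for the way this particular database stores it, which is why the inspectors have to be adapted per installation.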
Icaria TDM is designed to solve this problem. The proposed solution starts by locating data with generic mechanisms, looking for the most common types of information, such as IDs, telephone numbers, credit cards, etc. In addition, if the Sensitive Data Inventory tells us that any other kind of information is present, we create custom inspectors for it and include them from the very beginning, so that we can test them in a real environment.
With the first results we can make some decisions. We analyze the output looking for false positives, such as a field named document that contains a document name rather than a document ID, and we set the inspectors to skip them. The information is not binary: inspectors provide a confidence level, so we also use these results to configure them to rank some fields above others and to discard those with a low match rate. Once this customization is done, we repeat the process, check the new results and evaluate them with the person responsible for the data at the client company. Afterwards, we tune the inspectors again to make them more accurate, and we iterate until we feel confident with the output.
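One refinement pass of this kind can be sketched in Python as follows. The `Finding` record, the `refine` function, the skip list and the threshold are all hypothetical names and values used for illustration; the real inspectors and their configuration may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    table: str
    field: str
    kind: str          # kind of sensitive data the inspector suspects
    confidence: float  # assumed to be in the range 0.0 to 1.0


def refine(findings, skip, min_confidence):
    """Drop confirmed false positives and weak matches, strongest first."""
    kept = [
        f for f in findings
        if (f.table, f.field) not in skip and f.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


findings = [
    Finding("customers", "document", "national ID", 0.92),
    Finding("files", "document", "national ID", 0.55),  # holds file names
    Finding("customers", "phone", "telephone", 0.97),
    Finding("orders", "notes", "telephone", 0.10),
]
# Reviewed with the data owner: files.document is a document name, not an ID.
skip = {("files", "document")}

for finding in refine(findings, skip, min_confidence=0.3):
    print(finding.table, finding.field, finding.kind, finding.confidence)
```

Each iteration would extend the skip list and adjust the threshold based on the review with the client, then run the analysis again.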
This inspection process consists mainly of two distinct procedures working together. One scans the name of the field, looking for matches against a list of possible values. The other studies the content of the tables, column by column, looking for known patterns, such as the ID pattern mentioned above. The outputs of both are weighted, and the report is built from the result.
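The combination of the two procedures can be sketched in Python like this. The candidate list, the ID pattern and the 0.4 weight are assumptions made up for the example, and a simple weighted average stands in for whatever weighting the product actually applies.

```python
import re

# Hypothetical candidate words for the field name inspector.
NAME_CANDIDATES = {"document", "passport", "phone", "card"}
# Assumed ID shape for the content inspector: optional zero padding,
# 8 digits and a check letter.
ID_PATTERN = re.compile(r"0*\d{8}[A-Z]")


def name_score(field_name: str) -> float:
    """1.0 if any word of the field name is in the candidate list, else 0.0."""
    words = set(re.split(r"[_\W]+", field_name.lower()))
    return 1.0 if words & NAME_CANDIDATES else 0.0


def content_score(values) -> float:
    """Fraction of non-empty values in the column that match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if ID_PATTERN.fullmatch(v))
    return hits / len(non_empty)


def combined_score(field_name: str, values, name_weight: float = 0.4) -> float:
    """Weighted average of the name score and the content score."""
    return (name_weight * name_score(field_name)
            + (1 - name_weight) * content_score(values))


column = ["12345678Z", "012345678Z", "", "quarterly_report.pdf"]
print(combined_score("document", column))
```

A field with a suspicious name but no matching content, or the reverse, ends up with an intermediate score, which is what makes the ranking in the report useful.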
This report is our main tool. It indicates the name of the table, the name of the field, the kind of information it could contain and how reliable the result is. During the first iterations we also look for other fields that probably won’t contain sensitive data themselves but that could indicate its presence in the table. Some examples are the city or the expiration date, which could mean that the table has fields with addresses or credit card numbers.
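The idea of hint fields can be sketched in Python as a small lookup. The `HINTS` mapping and the schema below are hypothetical, and the word splitting is deliberately naive (it handles names such as `expiration_date` but not camelCase).

```python
import re

# Hypothetical hint list: a word in a field name -> the kind of sensitive
# data that is likely to live somewhere else in the same table.
HINTS = {
    "city": "address",
    "expiration": "credit card number",
}


def suspected_kinds(schema):
    """Map each table to the kinds of sensitive data its hint fields suggest."""
    result = {}
    for table, fields in schema.items():
        kinds = set()
        for field in fields:
            # Compare whole words so that "capacity" does not match "city".
            words = set(re.split(r"[_\W]+", field.lower()))
            kinds.update(kind for word, kind in HINTS.items() if word in words)
        if kinds:
            result[table] = kinds
    return result


schema = {
    "customers": ["name", "City", "capacity"],
    "payments": ["amount", "expiration_date"],
    "logs": ["timestamp", "message"],
}
print(suspected_kinds(schema))
```

Tables returned by a check like this are good candidates for a closer visual inspection, even if no field in them has been flagged yet.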
So, what can we customize in Icaria? We have already given some hints. As we pointed out, the field name inspector does its job by checking for matches against a list of candidates. We have a default list, but each installation requires it to be modified. The inspector also needs a list of names that won’t contain sensitive data even though they might seem to, which is also used to add exclusions when required. One very common example is fields containing the word “Client”: they sometimes hold the client name, but they could also hold the client number, which may not need masking, so we would add an exception for fields called “Client_ID” once we have checked that they shouldn’t be masked. The length of the field and of the data can also modify the sensitivity assigned by the analysis.
Regarding the data inspectors, we can customize the way the information is represented, so they are quite flexible in their searches. Beyond that, our tools let us easily create new inspectors when a kind of information hasn’t been considered before or when it needs to be very specific.
At this point, the question could be: how do we know what the inspector should look for? The first analysis usually gives us valuable data. Not always, but fairly often, we discover the shape of the stored data at this step, based on the report and our visual inspection. We check it with the client, ask whether they know of any other relevant information, get some examples of the possible values and start adapting the inspectors.
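The candidate list and the exclusion list of the field name inspector could be sketched in Python as below. `make_name_inspector` and both lists are hypothetical, meant only to illustrate the “Client” and “Client_ID” example; they are not Icaria’s configuration format.

```python
import re


def make_name_inspector(candidates, exclusions):
    """Build a field name check from a candidate list and an exclusion list."""
    candidate_words = {c.lower() for c in candidates}
    excluded_names = {e.lower() for e in exclusions}

    def inspect(field_name: str) -> bool:
        lowered = field_name.lower()
        # Exclusions are exact field names that were reviewed and cleared.
        if lowered in excluded_names:
            return False
        # Otherwise flag the field if any of its words is a candidate.
        words = set(re.split(r"[_\W]+", lowered))
        return bool(words & candidate_words)

    return inspect


# Default candidates plus an exception agreed with the client.
inspect = make_name_inspector(
    candidates=["client", "document", "phone"],
    exclusions=["Client_ID"],
)

print(inspect("Client_Name"))  # True: may hold the client name
print(inspect("Client_ID"))    # False: reviewed, does not need masking
print(inspect("order_total"))  # False: no candidate word in the name
```

Keeping the exclusions as data rather than code means each installation can adjust them after every iteration without touching the inspector itself.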
Two more points to highlight. First, it is important to remember that this is an iterative process that leads to continuous improvement, and that it requires deciding whether we prefer to miss some fields or to accept some false positives. By considering the reports, the visual analysis and the client feedback, we adjust the search parameters and analyzers until we reach a production version, which is then run periodically to look for newly added tables and fields with relevant data. Second, we have described the analysis as a way to find information to be masked, but that is not its only purpose. Companies can also use it to search for relevant information for Big Data uses, or simply to be aware of the information contained in their databases.