Subsetting: not the whole database, just part of it

In the previous post we have seen how our tool can transform real data into data that isn’t real but that looks like to use it in nonproductive environments such us QA or development. Now, we are bringing a new feature that, according with our experience, is a great utility when databases have large volumes of data, most of it unnecessary.

To understand why it is useful, we must first understand the problem we wish to address.  Let’s suppose that we are working with databases containing customers who have subscribed some services from the company, and we want to prepare a dissociated environment for the QA and development teams with that data. Then, the production database will contain many customers, each with much data linked. However, it is likely that for nonproductive environments the whole database it is not necessary, but having customers that meet certain circumstances is enough, and that even not all records of this selection of clients are needed.

Consequently, our proposal consists in moving a coherent subset of the database (or a part of it) starting from a table or root entity from which the related data is obtained. With that seeds we move over the rest of the entities and transfer them to another DB with the same structure. In addition, during the transfer of the coherent data, the process of decoupling is carried out to ensure the protection of the customer info.

Thus, the new database has dissociated data and does not have a volume greater than necessary, with advantages like greater speed in replicating and/or restoring data, faster decoupling time, lower volumes of data, easier to find useful information, etc.

Image. Setting up the subsetting path

Now, this solution seems a good alternative until we find some catalog entities that other entities refer to. We call catalog entities to those that offer general info such as the catalog of products or services offered by the company. Usually it is necessary to transfer them completely and better not to take them through the tree but with a complete transfer process that is faster in such scenarios. As a result, our application not only allows you to define trees from roots, but also allows to set which entities will be transfer in its entirety.

Taking the selective dissociation to the extreme, in icaria TDM we have implemented an additional functionality, the inbox. It is a tool designed for the execution of the subsetting for a very reduced amount of seeds, usually only for one. That is, imagine that we are looking for a client in production that we want to move to other environment, instead of executing the whole subset process we could create a request to the inbox to move and decouple this client, using a very friendly interface for any user, independently of his training on the platform. In addition, it has the advantage of being a very delimited method that enables data managers to give access to a greater number of users.

To sum up, these alternatives are extremely useful and needed depending on the characteristics of the installation and the preferences of the users. The limitation of space is one of the biggest drawbacks that we usually face, and the selective method is the ideal solution to face this obstacle, obtaining the maximum performance with the least effort. Nearly two out of three projects with icaria TDM use it.

Conceptually, this idea is simple, but the definition is relatively laborious, usually the data model is not completely known nor is the right documentation available. We tend to test it during the PoC, so part of the needed definition is done. However, some improvements in the routes to obtain coherent data are missing. To make life easier it is time to create and test a new icaria TDM plugin whose objective is the discovery of the relationships between data.