Synthetic data generation

Generating synthetic data to test business management software has characteristics that set it apart from other use cases that share the same name but belong to very different application domains.

Generating synthetic data for a test file in a Big Data environment, for example, likely involves few persistence technologies and data types, and few relationships and dependencies among them. What it probably requires is a massive volume of generated data.

Business support applications have different needs. The volume of data is not an issue. Instead, the structures (data models) are complex. The relationships and dependencies among the data are numerous. And often, there are several persistence technologies involved. Consider the case of generating a synthetic customer to test a process for contracting a new product. This involves CRM applications, billing, collections, service provisioning, etc. Each of these has a database with a different technology, hosting a different application data model, with dependent relationships among them.

Synthetic data generated by icaria TDM in an internal repository and available for self-service

Synthetic data for testing business applications

The generation of synthetic data by icaria TDM is perfectly adapted to the use case defined by the needs of business applications. It can generate a complex and coherent data structure for several applications at once, supported by different database technologies and requiring data that maintains the referential integrity of the information.

Operating Principles

The essential idea of the icaria TDM synthetic data generation engine is to combine two things. The first is a coherent data structure, created by the applications themselves, which offers the structural integrity needed in the data domain relevant to the tests that will consume the data. The second is a set of synthetic data generation rules that modify the necessary attributes so that the synthetic data can coexist with the original in the same test environment.

The objective isn't to generate a million records for a customer table. Instead, it is to create customers along with their accounts, contracts, movements, services, invoices, and claims across various applications and database technologies, keeping the data coherent so that processes spanning multiple applications can be tested (integrated tests).

All this with minimal specifications from the icaria TDM user.
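
The principle above can be illustrated with a minimal Python sketch. It is not icaria TDM code or its API: the model instance, the `KEY_COLUMNS` set, and the `clone_with_new_keys` function are hypothetical names chosen for illustration. The sketch copies a small model that spans two applications and replaces every identifier with a fresh value, so that the copy keeps its internal references and does not collide with the original. It assumes that new identifiers are drawn from a range the original data does not use.

```python
from copy import deepcopy
from itertools import count

# Hypothetical model instance: two applications, each with its own tables.
MODEL = {
    "crm": {"customer": [{"customer_id": 100, "name": "Ana Ruiz"}]},
    "billing": {
        "account": [{"account_id": 7, "customer_id": 100}],
        "invoice": [{"invoice_id": 55, "account_id": 7, "amount": 42.0}],
    },
}

# Columns that hold identifiers, whether used as primary or foreign keys.
KEY_COLUMNS = {"customer_id", "account_id", "invoice_id"}


def clone_with_new_keys(model, key_columns, id_source):
    """Return a copy of `model` in which every identifier is replaced.

    The same original value in the same key column always maps to the same
    new value, so foreign keys keep pointing at the right rows.
    """
    mapping = {}  # (column, original value) -> new value
    clone = deepcopy(model)  # the original model is left untouched
    for tables in clone.values():
        for rows in tables.values():
            for row in rows:
                for column in row:
                    if column not in key_columns:
                        continue
                    key = (column, row[column])
                    if key not in mapping:
                        mapping[key] = next(id_source)
                    row[column] = mapping[key]
    return clone


# Assumption: identifiers from 1,000,000 upwards are unused by the original data.
synthetic = clone_with_new_keys(MODEL, KEY_COLUMNS, count(1_000_000))
```

In this sketch the customer in the CRM table and the account in the billing table end up with the same new `customer_id`, which is the kind of cross-application coherence the text describes.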

To achieve this goal, icaria TDM generates synthetic data from the following elements:

  1. A data domain. This is the set of tables and relationships of all the applications involved in the test that will consume the generated data.
  2. A data instance. This is an example of data available in the databases of the applications. It may come from the production environment, be delivered by icaria TDM through segmentation in a previous environment, and be modified there by the applications until it constitutes a perfect model. It is then stored in an internal repository of icaria TDM to preserve it for future use.
  3. A set of synthetic data generation rules. icaria TDM offers a wide catalog of synthetic data generation rules, which can be easily extended in a specific installation. These rules are applied at the necessary points of the data structure and are of two types:
    • Technical rules: generate technical attributes, such as unique identifiers.
    • Functional rules: provide values for attributes with a functional meaning, such as the customer's name.
icaria TDM offers a comprehensive synthetic data generation rules catalog

Synthetic Data Generation Process

This icaria TDM process, once configured by the Data Architect and made available to the users of the self-service portal, follows these steps:

  • Selection of the data template. The user has different templates depending on the data they need to generate. These templates cover a data domain - the customer, with their accounts, contracts, services, for example - for several applications simultaneously.
  • Choice of the model. The user chooses which real data structure will serve as the model for generating the manufactured copies.
  • Choice of the repository. Finally, the user decides how many synthetic copies to generate and where to store them. Typically, synthetic copies are generated in an internal repository of icaria TDM, so that they are preserved and can be delivered through self-service to any application environment as often as the tests require.
Synthetic data generation process summary before execution
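
The three choices made by the user can be pictured as a small request object. The Python sketch below is only an illustration under stated assumptions: `GenerationRequest`, `summarize`, `run`, and the in-memory `repositories` dictionary are hypothetical names, not part of icaria TDM, and the generation step is reduced to tagging each copy with a number instead of applying real generation rules.

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    template: str    # data template, e.g. a customer with accounts and contracts
    model: str       # real data structure used as the model
    repository: str  # where the synthetic copies are stored
    copies: int = 1  # how many synthetic copies to produce


def summarize(request):
    """Build a one-line summary to show before execution."""
    return (f"{request.copies} copies of model '{request.model}' "
            f"(template '{request.template}') -> repository '{request.repository}'")


def run(request, models, repositories):
    """Store `request.copies` copies of the chosen model in the repository."""
    if request.copies < 1:
        raise ValueError("copies must be at least 1")
    model = models[(request.template, request.model)]
    target = repositories.setdefault(request.repository, [])
    for number in range(1, request.copies + 1):
        copy = deepcopy(model)
        # Placeholder: a real engine would apply generation rules here.
        copy["copy_number"] = number
        target.append(copy)
    return target


# Hypothetical models, indexed by (template, model name).
MODELS = {
    ("customer", "retail_customer"): {
        "crm": {"customer": [{"customer_id": 100, "name": "Ana Ruiz"}]},
    },
}

repositories = {}
request = GenerationRequest("customer", "retail_customer", "internal", copies=3)
summary = summarize(request)
stored = run(request, MODELS, repositories)
```

Because the copies stay in the `internal` repository of the sketch, they can be read again later without regenerating them, which mirrors the reason the text gives for storing synthetic copies in an internal repository.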

