
This article outlines a set of possible strategies for removing data from active use in your projects.

Each strategy outlined below has comparative advantages and disadvantages and may suit some use cases but not others. Apply any of these strategies with care and with consideration for the nature of the data.

Below are some common example use cases:

  1. Data that has been removed from the source system
  2. Duplicate rows of data
  3. Malformed or incomplete data
  4. Erroneous data
  5. Fixes to ETL or LDM have changed the data

Basic Approach

Typically, a single ETL graph maps to a single dataset in a project, so you can design deletion strategies on a per-dataset basis.

You may decide to use different strategies for deleting data on each dataset, depending on the types of data, uses, and data volume. In some cases, a dataset does not need to be designed for deletion.

Identifying Records to Delete

A key determinant in your approach to deleting records is how to identify the records to remove.

  • Were the records removed from the source system? If so, is there a log or revision history that can be used to identify the records?
  • Do the records need to be removed from the project due to a corruption in the ETL process? If so:
    • Are the records an entire data set (bad data load)?
    • Are the records identifiable based on a set of logical conditions in the data (WHERE clause)?

Depending on the source, you may be able to determine what needs to be done:

  • If you cannot identify the records to delete, you need to pursue one of the data replacement strategies (Strategy 1 and Strategy 2).

    For simplicity, periodic full data loads from the source system are recommended where possible.

  • If you can identify the records to delete, you can use one of the strategies in which the specific records are flagged or explicitly removed from the dataset (Strategy 3 and Strategy 4).
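When the source system can provide a full list of current keys, records deleted at the source can be identified by comparing key sets. A minimal sketch of that comparison (the key values are illustrative):

```python
def find_deleted_keys(source_keys, project_keys):
    """Keys present in the project but no longer present in the source
    system are candidates for deletion."""
    return sorted(set(project_keys) - set(source_keys))

# Illustrative key sets: the project still holds two records
# that have been removed from the source system.
source_keys = {"ord-1001", "ord-1003", "ord-1005"}
project_keys = {"ord-1001", "ord-1002", "ord-1003", "ord-1004", "ord-1005"}

to_delete = find_deleted_keys(source_keys, project_keys)
```

The resulting key list can then feed any of the key-based strategies below (flagging or explicit removal).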

Strategy 1 - Full Data Loads Only

If possible, you can configure your source systems and your ETL graphs to perform full data loads each time. Whenever the ETL executes, all existing data is removed from the target dataset and is replaced by the incoming data.

Basic Steps:

  1. Source system must be configured to provide the full dataset each time it is queried or accessed.
  2. The GD Dataset Writer component must be configured to perform full data loads.
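The contract of a full load, and why it purges deletions where an incremental load does not, can be stated in miniature (the row data is illustrative):

```python
def full_load(target_rows, incoming_rows):
    """A full data load discards all existing rows and replaces them
    with the incoming extract."""
    return list(incoming_rows)

def incremental_load(target_rows, incoming_rows, key="id"):
    """An incremental load upserts incoming rows but never deletes:
    rows removed at the source survive in the target."""
    merged = {row[key]: row for row in target_rows}
    merged.update({row[key]: row for row in incoming_rows})
    return list(merged.values())

existing = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
extract = [{"id": 1, "amount": 15}, {"id": 3, "amount": 30}]  # id=2 deleted at source

replaced = full_load(existing, extract)       # id=2 is gone
merged = incremental_load(existing, extract)  # id=2 lingers
```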

Advantages:

  • Full data loading is the most intuitive way to manage and purge data.
  • The entire dataset is replaced with each load. You do not have to manage incremental changes at all.

Challenges:

  • The period of available reporting data from the source system must match the period expected in the project.
  • If the source system does not support querying the full dataset in a single execution, the source data must be staged somehow.
  • For very large datasets, it may not be viable to perform a full data load each time.
  • Over time, full data loads can consume the SLIHash if a connection point is present, so performing full data loads across many datasets is not optimal.

Unless necessary, you should not define a connection point on fact datasets (tables). If you are running full data loads, a connection point is rarely necessary on a fact table.

If you have created your ETL graphs to write to the dataset in the GoodData project using the GD Dataset Writer with default configuration settings, no further configuration is required.

Strategy 2 - Two Versions of ETL

A modification of the previous strategy, this method extends a process that is configured to perform incremental data loads by creating a second process to perform periodic full data loads. This second data load effectively purges the GoodData project of records that have been deleted from the source system.

This method is similar to the first one. Since the full data loads are executed on a less frequent basis, they have a reduced impact on bandwidth and the user experience.

This strategy results in the creation of two separate ETL processes:

  • Basic loading process: The basic process utilizes incremental data loading to supplement the data in each dataset. These updates do not feature any deletions.
  • Periodic full loads: Periodically, a full data load is executed to rewrite the data to include only the non-deleted data. This process executes on a nightly or weekly basis, depending on data volumes and project usage.

This strategy can be implemented in one of the following ways:

  1. Modify the single, incremental-load graph to perform a full-load execution on a less frequent, off-hours basis. Then, publish that graph to the project, and configure the execution of the process. Drawback: you must perform the changes required to switch between data loading types and republish each time you make a modification to the process.
  2. Create a copy of the source process and then modify it to work with the other method of loading. Drawback: you must maintain two ETL graphs.
  3. Utilize a staging area, which is updated incrementally. Then, full updates are performed from the staging area into the GoodData project periodically and during off-peak hours.

Basic Steps:

This method may be implemented by modifying the existing incremental-load graph in the following ways:

  • The method by which source data is gathered must be modified to accommodate the other method of loading. If the source ETL uses incremental data loading, you must modify the query, the URL parameters, or another aspect of the reader so that it pulls all of the available data.
  • The GD Dataset Writer needs to be modified to perform the full or incremental data load.
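If a single graph is used for both load types (implementation option 1), the switch reduces to a scheduling decision. A sketch of that decision (the Sunday schedule and mode names are illustrative, not prescribed by the platform):

```python
from datetime import datetime

def choose_load_mode(run_time, full_load_weekday=6):
    """Return the load mode for this execution: a full load on the
    weekly off-peak day (Sunday by default, weekday() == 6),
    incremental on every other day."""
    return "full" if run_time.weekday() == full_load_weekday else "incremental"

sunday_run = choose_load_mode(datetime(2024, 1, 7))   # a Sunday
weekday_run = choose_load_mode(datetime(2024, 1, 8))  # a Monday
```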

Advantages:

  • Combines the advantages of full data loads and incremental data loads
  • Easy to implement in most cases
  • Ensures that, at each full data load, the data in the project is current with the source system

Challenges:

  • Full data loads can be problematic to schedule for worldwide users.
  • Depending on the frequency of data loads, there may be significant periods where data removed in the source system is still present in the GoodData project.
  • This strategy does not provide a method for handling errors introduced into the data outside of the source system. Malformed or incomplete data resulting from the ETL process isn’t purged until the problem is corrected and the next full data load is executed.
  • For more information on full versus incremental data loading, see Incremental Data Loading.

Strategy 3 - Add isDeleted Flag

In this strategy, an extra column is added to each dataset for which you wish to enable data deletion. This column identifies whether a record in the dataset is to be removed from use. The isDeleted flag is set to True or 1 if the record is not to be used in the project.

Basic Steps:

  1. Modify the LDM. Add an isDeleted attribute to each dataset in the logical data model for which you wish to enable marking records as deleted.
  2. This flag must also be applied to each step of the ETL process, depending on how and when deletion is flagged.
    1. If deletion is flagged before the ETL begins, then the flag is simply another field in the dataset and can be processed through the graph accordingly.
    2. If deletion must be flagged after the transformation, it may be difficult to track the key identifier(s) for the record.

    In some implementations, it may be easier to create a separate, smaller ETL graph to upload data directly from a flat file. However, you must acquire the key for each record that you wish to delete and store these keys in the file, along with the isDeleted flag value.

  3. After the ETL and LDM have been updated, you must update your project to respect the flag. Two options:
    1. Modify each metric to include a check AND isDeleted = False
    2. Add a Data Permissions filter for each user that checks that isDeleted = False.
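The effect of the flag on reporting can be illustrated in miniature: records are never physically removed, only marked, and every metric or Data Permissions filter selects the unflagged rows. A sketch (field names and keys are illustrative):

```python
def mark_deleted(rows, keys_to_delete, key="id"):
    """Set the isDeleted flag on rows whose key appears in the delete set;
    previously flagged rows stay flagged."""
    return [
        {**row, "isDeleted": row[key] in keys_to_delete or row.get("isDeleted", False)}
        for row in rows
    ]

def active_rows(rows):
    """What an 'isDeleted = False' condition in a metric or
    Data Permissions filter effectively selects."""
    return [row for row in rows if not row["isDeleted"]]

rows = [
    {"id": "ord-1001", "isDeleted": False},
    {"id": "ord-1002", "isDeleted": False},
]
marked = mark_deleted(rows, {"ord-1002"})
visible = active_rows(marked)
```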

If users can be invited to the project, you may need to modify the call to the Invitation API to apply the Data Permissions filters as part of the invitation process. For more information, see Inviting Users with Pre-Applied Data Permissions.

Advantages:

  • Data is not actually removed from the project, which enables easier correction of mistakes.
  • This method conforms to recommended practices in BI.

Challenges:

  • It may not be possible to retrieve the keys to deleted data from a source system. In some systems, that data is actually deleted and gone forever, and there may not be a record of what has been removed. As a workaround, you would have to maintain a separate record of all records that were uploaded to GoodData.
  • Some third-party data sources may not enable you to add the isDeleted flag to a data extract, in which case you must manage the flag manually.
  • Setting up this strategy requires significant modification to the LDM, ETL, metrics, and reporting. In particular, you must define a connection point in each logical data model and modify your ETL graph to guarantee uniqueness in the data in the connection point field.
  • Data is not removed from the project. If it is garbage data, in some cases, it is preferable to physically remove it from the project.
  • Over time, a significant volume of deleted records may require traditional purging.
  • Applying the isDeleted=False check to each metric is tedious in a large project, and nothing prevents the creation of a metric that omits this conditional check.
  • Since data may or may not be present in the project, debugging may be harder for project administrators from within the project.

Strategy 4 - Automate through MAQL DML

The Data Manipulation Language (DML) flavor of MAQL enables the configured use of DELETE statements. Experienced users can build queries to remove data directly from the fact and dimension tables in the data store.

Basic Steps:

  1. Validate your project.
  2. If possible, clone your project.
  3. Execute your MAQL DML through the gray pages.
  4. Validate your project.
  5. Verify that the records have been properly removed from the project.
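A script can assemble the MAQL DML statement and request body before submitting it through the gray pages. The identifier syntax and the /gdc/md/&lt;project-id&gt;/dml/manage payload shape below are assumptions to verify against the current GoodData API documentation before running against a live project:

```python
def build_delete_maql(dataset_attr, values):
    """Assemble a MAQL DML DELETE for specific attribute values.
    The attribute identifier and WHERE syntax are illustrative; verify
    against the MAQL DML reference before executing."""
    value_list = ", ".join(f'"{v}"' for v in values)
    return f"DELETE FROM {{{dataset_attr}}} WHERE {{{dataset_attr}}} IN ({value_list});"

def build_dml_payload(maql):
    """Wrap the statement in the JSON body POSTed to the gray pages
    (endpoint path and body shape are assumptions to verify)."""
    return {"manage": {"maql": maql}}

maql = build_delete_maql("attr.orders.id", ["ord-1002", "ord-1004"])
payload = build_dml_payload(maql)
```

Keeping the statement construction separate from its execution makes it easy to review the generated MAQL against a cloned project first, per the steps above.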

Advantages:

  • MAQL DML statements can be executed via API, allowing you to perform periodic purges of aged-out data, for example.
  • This method performs direct manipulation of the database and requires no skills beyond good scripting.
  • This method is particularly useful for purging garbage data, so that the project is clean.

Challenges:

  • MAQL DELETE statements are translated to corresponding SQL commands, which may generate unexpected side effects. For example, if attribute values are not valid for the attribute type, they may be converted to NULL values, in which case all records in the dataset could be deleted.

  • Currently, there is no easy, integrated set of tools for managing these kinds of deletions from your project. You must use combinations of APIs and the gray pages to complete these steps. If precautions are not taken, it may not be possible to roll back to a last-known-good version.
  • Executing DELETE statements on a project can have very harmful effects if they are not coded properly.

For more information on this strategy, see Deleting Records from Datasets.

Strategy 5 - Use DELETE-CENTER mode

You can delete old or obsolete data from a dataset while loading new data to the same or a different dataset using the REST API. The data to delete and the data to upload are specified in CSV files.

For more information, see Delete Old Data while Loading New Data to a Dataset via API.
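A script can prepare the delete file alongside the upload file using the standard library. A sketch (the column name and file layout are illustrative; match them to your SLI manifest):

```python
import csv
import io

def write_delete_csv(keys, key_column="id"):
    """Produce CSV content listing the records to delete: a header row
    with the key column, then one key per row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([key_column])
    for key in keys:
        writer.writerow([key])
    return buf.getvalue()

content = write_delete_csv(["ord-1002", "ord-1004"])
```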