Deleting Data from a Workspace

This article outlines strategies for removing data from active use in your workspaces.

Each strategy outlined below has comparative advantages and disadvantages and may suit some use cases but not others. Apply any of these strategies with care and with consideration for the nature of the data.

Below are some common example use cases:

  1. Data that has been removed from the source system
  2. Duplicate rows of data
  3. Malformed or incomplete data
  4. Erroneous data
  5. Data that has changed because of fixes to the ETL process or the logical data model (LDM)

Basic Approach

Typically, a single ETL graph maps to a single dataset in a workspace, so you can design deletion strategies on a per-dataset basis.

Identifying Records to Delete

A key determinant in your approach to deleting records is how to identify the records to remove.

  • Were the records removed from the source system? If so, is there a log or revision history that can be used to identify the records? If not, see the key-comparison sketch at the end of this section.
  • Do the records need to be removed from the workspace due to a corruption in the ETL process? If so:
    • Are the records an entire dataset (a bad data load)?
    • Are the records identifiable based on a set of logical conditions in the data (WHERE clause)?

Depending on your answers, one of the following approaches applies:

  • If you cannot identify the records to delete, you need to pursue one of the data replacement strategies (Strategy 1 or Strategy 2).

  • If you can identify the records to delete, you can use one of the strategies in which the specific records are flagged or explicitly removed from the dataset (Strategy 3 or Strategy 4).
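
If the source system keeps no record of deletions, one common workaround (also noted under Strategy 3 below) is to retain a snapshot of the keys from your previous load and compare it against the current extract. The following sketch illustrates the idea only; the file names and the id column are placeholders, not part of any GoodData tooling.

    import csv

    def read_keys(path, key_column="id"):
        """Return the set of key values found in a CSV extract."""
        with open(path, newline="") as f:
            return {row[key_column] for row in csv.DictReader(f)}

    # Keys loaded to the workspace last time vs. keys in today's source extract.
    previous_keys = read_keys("previous_load.csv")
    current_keys = read_keys("current_extract.csv")

    # Records that were loaded before but no longer exist in the source.
    deleted_keys = previous_keys - current_keys

    # Persist the result so a later step (isDeleted flagging or MAQL DML)
    # can consume it.
    with open("deleted_keys.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id"])
        for key in sorted(deleted_keys):
            writer.writerow([key])

The same comparison can be implemented inside an ETL graph; the point is simply that you must keep your own history of loaded keys if the source system does not.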

Strategy 1 - Full Data Loads Only

If possible, you can configure your source systems and your ETL graphs to perform full data loads each time. Whenever the ETL executes, all existing data is removed from the target dataset and is replaced by the incoming data.

Basic Steps:

  1. The source system must be configured to provide the full dataset each time it is queried or accessed.
  2. The GD Dataset Writer component must be configured to perform full data loads.

Advantages:

  • Full data loading is the most intuitive way to manage and purge data.
  • The entire dataset is replaced with each load. You do not have to manage incremental changes at all.

Challenges:

  • The period of available reporting data from the source system must match the period expected in the workspace.
  • If the source system doesn’t enable full queries of the available dataset in a single execution, the source data must first be staged elsewhere.
  • For very large datasets, it may not be viable to perform a full data load each time.
  • Over time, full data loads can consume the SLIHash if a connection point is present, so performing full data loads across a number of datasets is not optimal.

If you have created your ETL graphs to write to the dataset in the GoodData workspace using the GD Dataset Writer with default configuration settings, no further configuration is required.

Strategy 2 - Two Versions of ETL

A modification of the previous strategy, this method extends a process that is configured to perform incremental data loads by creating a second process to perform periodic full data loads. This second data load effectively purges the GoodData workspace of records that have been deleted from the source system.

This method is similar to the first one. Because the full data loads are executed less frequently, they have a reduced impact on bandwidth and the user experience.

This strategy results in the creation of two separate ETL processes:

  • Basic loading process: The basic process uses incremental data loading to add new data to each dataset. These updates do not include any deletions.
  • Periodic full loads: Periodically, a full data load is executed to rewrite the data to include only the non-deleted data. This process executes on a nightly or weekly basis, depending on data volumes and workspace usage.

This strategy can be implemented in one of the following ways:

  1. Modify the single, incremental-load graph to perform a full-load execution on a less frequent, off-hours basis. Then, publish that graph to the workspace, and configure the execution of the process. Drawback: you must make the changes required to switch between data loading types and republish each time you modify the process.
  2. Create a copy of the source process and then modify it to use the other method of loading. Drawback: you must maintain two ETL graphs. (A scheduling sketch for this approach follows this list.)
  3. Utilize a staging area, which is updated incrementally. Then, full updates are performed from the staging area into the GoodData workspace periodically and during off-peak hours.
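
As an illustration of the second option, the following sketch triggers the incremental process nightly and the full-load process once a week. It is a sketch only: it assumes an already authenticated requests session and the process execution resource of the GoodData API, and every identifier and graph path shown is a placeholder.

    import datetime

    HOST = "https://secure.gooddata.com"
    PROJECT_ID = "<project-id>"
    INCREMENTAL_PROCESS_ID = "<incremental-process-id>"
    FULL_LOAD_PROCESS_ID = "<full-load-process-id>"

    def run_graph(session, process_id, graph_path):
        """Request an execution of a deployed ETL graph (assumed resource)."""
        url = (f"{HOST}/gdc/projects/{PROJECT_ID}/dataload/processes/"
               f"{process_id}/executions")
        payload = {"execution": {"graph": graph_path, "params": {}}}
        response = session.post(url, json=payload)
        response.raise_for_status()
        return response.json()

    def nightly_run(session):
        # Run the full-load process on Sundays; otherwise run the incremental one.
        if datetime.date.today().weekday() == 6:
            return run_graph(session, FULL_LOAD_PROCESS_ID, "graphs/full_load.grf")
        return run_graph(session, INCREMENTAL_PROCESS_ID, "graphs/incremental_load.grf")

In practice, you would more often configure two schedules on the deployed processes themselves; the script form is shown only to make the scheduling logic explicit.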

Basic Steps:

This method may be implemented by modifying the existing incremental-load graph in the following ways:

  • The method by which source data is gathered must be modified to accommodate the other method of loading. If the source ETL uses incremental data loading, you must modify the query, the URL parameters, or another aspect of the reader so that it pulls all the available data.
  • The GD Dataset Writer must be configured to perform the appropriate type of data load (full or incremental).

Advantages:

  • Combines the advantages of full data loads and incremental data loads
  • Easy to implement in most cases
  • Ensures that, at each full data load, the data in the workspace is brought current with the source system

Challenges:

  • Full data loads can be problematic to schedule for worldwide users.
  • Depending on the frequency of data loads, there may be significant periods where data removed in the source system is still present in the GoodData workspace.
  • This strategy does not provide a method for handling errors introduced into the data outside the source system. Malformed or incomplete data resulting from the ETL process isn’t purged until the problem is corrected and the next full data load is executed.

For more information on full versus incremental data loading, see Incremental Data Loading.

Strategy 3 - Add isDeleted Flag

In this strategy, an extra column is added to each dataset for which you wish to enable data deletion. This column identifies whether a record in the dataset is to be removed from use. The isDeleted flag is set to True (or 1) if the record is not to be used in the workspace.

Basic Steps:

  1. Modify the LDM. Add an isDeleted attribute to each dataset in the logical data model for which you wish to enable marking records as deleted.

  2. Modify the ETL. The flag must be carried through each step of the ETL process; how it is populated depends on how and when deletion is flagged.

    1. If deletion is flagged before the ETL begins, then the flag is simply another field in the dataset and can be processed through the graph accordingly.
    2. If deletion must be flagged after the transformation, it may be difficult to track the key identifier(s) for the record.
  3. After the ETL and LDM have been updated, you must update your workspace to respect the flag. Two options:

    1. Modify each metric to include a check AND isDeleted = False
    2. Add a Data Permissions filter for each user that restricts records to isDeleted = False. (A sketch of creating such a filter via the API follows this list.)
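
A minimal sketch of the second option follows, assuming an authenticated requests session and the Data Permissions (user filter) resources of the GoodData API. The project, attribute, element, and user identifiers are placeholders, and the exact payloads should be checked against the current API documentation.

    HOST = "https://secure.gooddata.com"
    PROJECT_ID = "<project-id>"
    ISDELETED_ATTR_URI = f"/gdc/md/{PROJECT_ID}/obj/<isDeleted-attribute-id>"
    FALSE_ELEMENT_URI = f"{ISDELETED_ATTR_URI}/elements?id=<false-element-id>"
    USER_PROFILE_URI = "/gdc/account/profile/<user-id>"

    def create_isdeleted_filter(session):
        """Create a user filter object that keeps only isDeleted = False records."""
        payload = {
            "userFilter": {
                "content": {
                    "expression": f"[{ISDELETED_ATTR_URI}] = [{FALSE_ELEMENT_URI}]"
                },
                "meta": {"category": "userFilter", "title": "isDeleted is False"},
            }
        }
        response = session.post(f"{HOST}/gdc/md/{PROJECT_ID}/obj", json=payload)
        response.raise_for_status()
        return response.json()["uri"]

    def assign_filter_to_user(session, filter_uri):
        """Assign the filter to one user; repeat for every user in the workspace."""
        payload = {
            "userFilters": {
                "items": [{"user": USER_PROFILE_URI, "userFilters": [filter_uri]}]
            }
        }
        response = session.post(f"{HOST}/gdc/md/{PROJECT_ID}/userfilters",
                                json=payload)
        response.raise_for_status()

Whichever option you choose, the check must be applied consistently; the Data Permissions approach centralizes it, while the metric approach leaves it up to metric authors.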

Advantages:

  • Data is not actually removed from the workspace, which enables easier correction of mistakes.
  • This method conforms to recommended practices in BI.

Challenges:

  • It may not be possible to retrieve the keys to deleted data from a source system. In some systems, that data is actually deleted and gone forever, and there may not be a record of what has been removed. As a workaround, you would have to maintain a separate record of all records that were uploaded to GoodData.
  • Some third-party data sources may not enable you to add the isDeleted flag to a data extract, which would require you to manage the flag manually.
  • Setting up this strategy requires significant modification to the LDM, ETL, metrics, and reporting. In particular, you must define a connection point in each logical data model and modify your ETL graph to guarantee uniqueness of the data in the connection point field.
  • Data is not removed from the workspace. If it is garbage data, in some cases, it is preferable to physically remove it from the workspace.
  • Over time, a significant volume of deleted records may require traditional purging.
  • Applying the isDeleted = False check to each metric is tedious in a large workspace, and nothing prevents users from creating a metric that omits this conditional check.
  • Because some data is present in the workspace but hidden from use, debugging from within the workspace may be harder for workspace administrators.

Strategy 4 - Automate through MAQL DML

The Data Manipulation Language (DML) flavor of MAQL supports DELETE statements. Experienced users can build queries that remove data directly from the fact and dimension tables in the data store.

Basic Steps:

  1. Validate your workspace.
  2. If possible, clone your workspace.
  3. Execute your MAQL DML through the gray pages or via the API (see the sketch after these steps).
  4. Validate your workspace.
  5. Verify that the records have been properly removed from the workspace.
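
For step 3, the following sketch submits a DELETE statement through the API rather than manually through the gray pages. It assumes an authenticated requests session and the MAQL DML resource at /gdc/md/<project>/dml/manage; the dataset, attribute, and label identifiers in the statement, as well as the shape of the poll response, are illustrative assumptions.

    import time

    HOST = "https://secure.gooddata.com"
    PROJECT_ID = "<project-id>"

    # Hypothetical statement: delete records of an "orders" dataset whose
    # isDeleted label carries the value "true".
    MAQL_DELETE = (
        'DELETE FROM {attr.orders.id} '
        'WHERE {label.orders.isdeleted} = "true";'
    )

    def execute_maql_dml(session, maql):
        """Submit a MAQL DML statement and poll the returned task until it finishes."""
        url = f"{HOST}/gdc/md/{PROJECT_ID}/dml/manage"
        response = session.post(url, json={"manage": {"maql": maql}})
        response.raise_for_status()
        poll_uri = response.json()["uri"]

        # Poll the asynchronous task link until the deletion completes.
        while True:
            status = session.get(f"{HOST}{poll_uri}").json()
            state = status.get("taskState", {}).get("status")
            if state not in ("RUNNING", "PREPARED"):
                return state  # for example "OK" or "ERROR"
            time.sleep(5)

After the task reports success, continue with steps 4 and 5 to validate the workspace and verify that the records are gone.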

Advantages:

  • MAQL DML statements can be executed via API, allowing you to perform periodic purges of aged-out data, for example.
  • This method performs direct manipulation of the database and requires no skills beyond good scripting.
  • This method is particularly useful for purging garbage data, so that the workspace is clean.

Challenges:

  • Currently, there is no easy, integrated set of tools for managing these kinds of deletions from your workspace. You must use a combination of APIs and the gray pages to complete these steps. If precautions are not taken, it may not be possible to roll back to a last-known-good version.
  • Executing DELETE statements on a workspace can have very harmful effects if they are not coded properly.

For more information on this strategy, see Deleting Records from Datasets.

Strategy 5 - Use DELETE-CENTER mode

You can delete old or obsolete data from a dataset while loading new data to the same or a different dataset using the REST API. The data to delete and the data to upload are specified in CSV files.

For more information, see Delete Old Data while Loading New Data to a Dataset via API.