Delete Data from Datasets in Automated Data Distribution
You can adjust your Automated Data Distribution (ADD) output stage to delete old or obsolete data from a dataset while loading new data to the dataset. When you delete old data and upload new data to datasets, deleting and uploading the data is done in a single transaction. This helps keep data in your workspace consistent and avoid situations when new data is already in the dataset but the old data are not yet removed.
The primary use case for deleting data using ADD is to delete data during incremental load. ADD deletes data using the data loading API (see Delete Old Data while Loading New Data to a Dataset via API). To set up ADD to be able to delete data, update the structures in the output stage according to Naming Convention for Output Stage Objects.
Deletion Modes
- Deleting by the data load table
- Deleting by the delete side table
- Deleting using both the data load table and delete side table
Deleting by the Data Load Table
Add the x__deleted column to the data load table. The x__deleted column indicates whether a specific row should be deleted from the dataset when running data load.
Deleting by the data load table works in the following load modes:
- Incremental load: The data is processed in the following way:
- Data after a certain point in time (based on LSLTS, Last Successfully Loaded Time Stamp) and where
x__deleted=FALSE
is loaded to the datasets using the UPSERT method. - Data after a certain point in time (based on LSLTS) and where
x__deleted=TRUE
is deleted from the datasets.
- Data after a certain point in time (based on LSLTS, Last Successfully Loaded Time Stamp) and where
- Full load: All data where
x__deleted=FALSE
is loaded to the datasets. The datasets are completely rewritten.
Deleting by Connection Point
Imagine you have the following data in ADD:
Previously loaded data
cp_invoice_id | r_customer | f_totalprice | x__timestamp | x__deleted |
---|---|---|---|---|
100 | 3 | 324 | 2017-12-01 0:00:00.000000 | FALSE |
101 | 78 | 109 | 2017-12-01 0:00:00.000000 | FALSE |
New incremental data update:
cp_invoice_id | r_customer | f_totalprice | x__timestamp | x__deleted |
---|---|---|---|---|
101 | 3 | NULL | 2017-12-02 0:00:00.000000 | TRUE |
“cp_invoice_id” is the connection point (CP) for the dataset.
Previously, you loaded data to the dataset by CP with “invoice_id”=“101”.
Now, you want to delete particular records from the dataset as a part of your data load. The data where “invoice_id”=”101” will be deleted from the dataset.
Deleting by Fact Table Grain
Imagine you have the following data in ADD:
Previously loaded data
a_ticket_id | a_organization | f_timespentsec | x__timestamp | x__deleted |
---|---|---|---|---|
12345 | ABC company | 458 | 2017-12-01 0:00:00.000000 | FALSE |
12346 | CDE company | 11342 | 2017-12-01 0:00:00.000000 | FALSE |
New incremental data update:
a_ticket_id | a_organization | f_timespentsec | x__timestamp | x__deleted |
---|---|---|---|---|
12345 | ABC company | NULL | 2017-12-02 0:00:00.000000 | TRUE |
Fact Table Grain (FTG) consists of “a_ticket_id” and “a_organization” attributes.
Previously, you loaded data to the dataset by FTG with “ticket_id”=”12345” and “organization”=“ABC company”.
Now, you want to delete particular records from the dataset as a part of your data load. The data where “ticket_id”=”12345” AND “organization”=”ABC company” will be deleted from the dataset.
You do not have to explicitly specify FTG in the output stage table. When data load is running, the output stage table is compared to the logical data model, and FTG is automatically recognized.
Deleting by Attributes (no CP or FTG)
Imagine you have the following data in ADD:
Previously loaded data
a_branch | a_organization | a_address | f_mrr | x__timestamp | x__deleted |
---|---|---|---|---|---|
NYC | ABC company | 5th avenue | 20000 | 2017-12-01 0:00:00.000000 | FALSE |
LA | CDE company | 3rd street | 100000 | 2017-12-01 0:00:00.000000 | FALSE |
New incremental data update:
a_branch | a_organization | a_address | f_mrr | x__timestamp | x__deleted |
---|---|---|---|---|---|
NYC | ABC company | 5th avenue | 20000 | 2017-12-02 0:00:00.000000 | TRUE |
If no CP or FTG exists, data is deleted by all attributes (facts are ignored). For data to be deleted, all attribute values must match.
Previously, you loaded data to the dataset, and now you want to delete particular records from the dataset as a part of your data load. The data where attribute “branch”=”NYC” AND “organization”=”ABC company” AND “address”=”5th avenue” will be deleted from the dataset.
Deleting by the Delete Side Table
Create a delete side table that describes columns by which you want delete data.
You do not have to specify all columns according to the definition of the dataset within your LDM. The delete side table lets you flexibly define criteria by which you can delete data (the delete vector).
The delete side table is an optional entity in the ADD output stage. The delete side table adheres to the same naming conventions as the data load table (see Naming Convention for Output Stage Objects). The delete side table works the same way as deleting by attributes using the data load table (no connection point or Fact Table Grain present) except the delete side table must not contain any fact.
Deleting by the delete side table works in the following load modes:
- Incremental load: Data after a certain point in time (based on LSDTS, Last Successfully Deleted Time Stamp) is deleted from the datasets.
- Interval or full load: The delete side table is skipped.
Imagine you have the following data in ADD:
Previously loaded data (data load table)
a_branch | a_organization | a_address | x__timestamp |
---|---|---|---|
LA | CDE company | 3rd street | 2017-12-01 0:00:00.000000 |
New incremental data update (delete side table)
a_branch | a_organization | x__timestamp |
---|---|---|
LA | CDE company | 2017-12-02 0:00:00.000000 |
Previously, you loaded data to the dataset, and now you want to delete particular records from the dataset as a part of your data load. The data where “branch”=”LA” AND “organization”=”CDE company” will be deleted from the dataset.