Load Modes in Automated Data Distribution v2 for Object Storage Services
Automated Data Distribution (ADD) v2 for object storage services supports the following load modes:
- Full load: All data is loaded to the datasets. The datasets are completely rewritten.
- Incremental load: Only data after a certain point in time is loaded, using the `UPSERT` method.
- Delete mode load: Some old data is deleted while new data is uploaded.
Full Mode and Incremental Mode
Full Mode
When ADD v2 runs for the first time, data is uploaded in full mode.
After that, ADD v2 loads data in full mode to those datasets for which the source files have full load mode set (the file names have the `{load_mode}` section set to `full`).
Incremental Mode
ADD v2 loads data in incremental mode to those datasets for which the source files have incremental mode set (the file names have the `{load_mode}` section set to `inc`).
How Data is Loaded
During data load in either full or incremental mode, ADD v2 takes the timestamp from the name of the uploaded file and applies it to all records in the file. This timestamp is called the “last load timestamp” (LSTS) and is stored in the dataset. At the next run, ADD v2 finds the source file whose timestamp is the first one greater than the dataset’s LSTS and loads that file to the dataset. In a single run, ADD v2 processes only one source file.
Example: Imagine that you have the following source files to load to the Products dataset in incremental mode:
- `Products_20200820120000_inc.csv` (file 1)
- `Products_20200821120000_inc.csv` (file 2)
- `Products_20200821130000_inc.csv` (file 3)
- `Products_20200821133000_inc.csv` (file 4)
Last time, ADD v2 ran on August 21, 2020, at 12:30, and loaded the file `Products_20200821120000_inc.csv` (file 2) to the dataset. The LSTS for the Products dataset is set to `20200821120000`.
When ADD v2 runs again on August 22, 2020, at 12:30, it will load the file `Products_20200821130000_inc.csv` (file 3), because ADD v2 takes the source file with the first timestamp after the dataset’s LSTS (`20200821120000`), and that is file 3. After the load is complete, the LSTS is set to `20200821130000`.
The next time ADD v2 runs, on August 22, 2020, at 18:30, it will load the file `Products_20200821133000_inc.csv` (file 4). If more files have been added to the folder by that time, they will be ignored in this run. The dataset’s LSTS is set to `20200821133000`.
Incremental Mode Scenarios
Loading Data to a Newly Added Client Workspace
When a new client workspace is added to a segment, it does not have an LSTS. Instead, the workspace creation time is used at the next ADD v2 run.
When ADD v2 runs, it finds the source files with the first timestamp that is greater than the workspace creation time and loads these files to the workspace. If at any point data load fails, see Loading Data Failing for a Workspace and Loading Data to a Workspace after Three Successive Load Failures.
We recommend that you upload all the data to the new workspace first to bring it to the same level as the other workspaces in the segment (see Full Mode).
Loading Data Failing for a Workspace
The LSTS is defined per client workspace. If data load fails for a client workspace for any reason (such as invalid data in the source file or a network issue), the LSTS of this workspace does not get updated. ADD v2 will try to load the failed file to this workspace again at the next execution. If the data load succeeds and a newer source file is available, ADD v2 will load it to the workspace, too.
Example: Imagine that you have two client workspaces, Workspace 1 and Workspace 2. Both have the LSTS set to `T0` after the last successful data load.
You provide the source file `Dataset_T1_inc.csv`.
ADD v2 runs and loads the file to Workspace 1, whose LSTS is now set to `T1`. However, the data load for Workspace 2 fails for some reason, and its LSTS remains `T0`.
You provide two more source files, `Dataset_T2_inc.csv` and `Dataset_T3_inc.csv`.
When ADD v2 runs again:
- Workspace 1 (with the LSTS at `T1`) loads the file `Dataset_T2_inc.csv`.
- Workspace 2 (with the LSTS at `T0`) tries to load the file `Dataset_T1_inc.csv` again (because it failed last time). If the data load succeeds, Workspace 2 then loads the file `Dataset_T2_inc.csv`.
Both workspaces now have the LSTS set to `T2`.
The file `Dataset_T3_inc.csv` will be loaded at the next ADD v2 run. After the load completes, both workspaces will have the LSTS set to `T3`.
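The per-workspace behavior from this example can be modeled with a short Python sketch. It is a simplification: it ignores the three-failure limit covered in the next section and treats timestamps as sortable strings:

```python
def run_add(workspaces, source_files, load):
    """One ADD v2 run over all client workspaces (simplified model).

    workspaces:   dict mapping workspace name -> its current LSTS
    source_files: dict mapping timestamp -> source file name
    load:         callable returning True on success, False on failure

    A workspace keeps loading the next-newer file until a load fails or
    no newer file remains; its LSTS advances only on success, so a
    failed file is retried at the next run.
    """
    for ws in workspaces:
        for ts in sorted(source_files):
            if ts <= workspaces[ws]:
                continue
            if not load(ws, source_files[ts]):
                break  # LSTS stays put; this file is retried next run
            workspaces[ws] = ts

# Workspace 1 succeeded at T1 earlier; Workspace 2 failed and stayed at T0.
workspaces = {"Workspace 1": "T1", "Workspace 2": "T0"}
files = {"T1": "Dataset_T1_inc.csv", "T2": "Dataset_T2_inc.csv"}
run_add(workspaces, files, lambda ws, f: True)  # both loads succeed this time
# Workspace 2 catches up: it loads T1, then T2, so both end at T2.
```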
Loading Data to a Workspace after Three Successive Load Failures
In some cases, ADD v2 may keep failing at loading data to a workspace. To prevent the workspace from getting stuck at an old timestamp, ADD v2 stops trying to load the data to this workspace after the third failure.
In the previous example, Workspace 2 failed to load the file `Dataset_T1_inc.csv`. Let’s assume that it also failed to load the files `Dataset_T2_inc.csv` and `Dataset_T3_inc.csv` in the two following ADD v2 runs. When the file `Dataset_T4_inc.csv` is provided, it will not be loaded to Workspace 2. The workspace’s LSTS remains `T0`. The workspace is now missing the data loads for the timestamps `T1`, `T2`, `T3`, and `T4`.
After data failed to load to a workspace three times in a row (for three successive timestamps), load the missing data to this workspace manually.
Steps:
- Export the missing data from the database to a source file, and assign it a new timestamp. In the previous example, you would export the data from the timestamp `T1` to the timestamp `T4` inclusive and give the source file a timestamp of `T5`.
- Upload the source file to your object storage service.
- Delete the old source files that failed to load from the object storage service.
- Run ADD v2 to load the source file with the new timestamp to the workspace.
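The article prescribes re-exporting the missing range from the database; when the failed source files themselves already contain all the missing data, a similar single-file export can be approximated by consolidating them. A hypothetical Python sketch, keeping the newest row per connection point so that later files win, mimicking `UPSERT` order:

```python
import csv
import io

def consolidate(files_in_order, key_column):
    """Merge several incremental CSV exports (oldest first) into one,
    keeping only the newest row for each value of key_column."""
    merged, header = {}, None
    for content in files_in_order:
        reader = csv.DictReader(io.StringIO(content))
        header = reader.fieldnames
        for row in reader:
            merged[row[key_column]] = row  # later files overwrite earlier rows
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header)
    writer.writeheader()
    writer.writerows(merged.values())
    return out.getvalue()

# Hypothetical contents of the failed T1 and T2 files:
t1 = "cp__invoice_id,f__totalprice\n100,324\n"
t2 = "cp__invoice_id,f__totalprice\n100,400\n101,109\n"
merged_csv = consolidate([t1, t2], "cp__invoice_id")
# Invoice 100 keeps its newer price (400); upload the result as the T5 file.
```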
Overriding Incremental Mode with Full Mode
You can load data in full mode to a dataset that has previously been updated in incremental mode (that is, when you need to completely overwrite the data in the dataset).
Steps:
- Export the full data (all the data that you want to load to the dataset) from the database to a source file.
- Assign the source file a new timestamp. For example, if the last file loaded incrementally had the timestamp `T1`, the new source file must have the timestamp `T2`.
- Mark the source file for full load (see Full Mode).
- Upload the source file to your object storage service.
- Run ADD v2 to load the source file with the new timestamp.
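Steps 2 and 3 can be illustrated with a small Python helper. It is hypothetical and assumes the `{dataset}_{timestamp}_{load_mode}.csv` pattern suggested by the examples in this article; see the article on the naming convention for source files for the authoritative format:

```python
def source_file_name(dataset, timestamp, load_mode):
    """Build a source file name from the dataset name, a 14-digit
    timestamp, and the load mode marker ('inc' or 'full')."""
    if load_mode not in ("inc", "full"):
        raise ValueError("load_mode must be 'inc' or 'full'")
    return f"{dataset}_{timestamp}_{load_mode}.csv"

# The full snapshot gets a timestamp newer than the last incremental
# file (a hypothetical value here) and the 'full' marker:
print(source_file_name("Products", "20200822120000", "full"))
# Products_20200822120000_full.csv
```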
Overriding Incremental Mode with Incremental Mode for a Specific Time Interval
You can load data to a dataset in incremental mode for a specific time interval.
Steps:
- Export the data for the required time interval from the database to a source file.
- Assign the source file a new timestamp. For example, if the last loaded file had the timestamp `T1`, the new source file must have the timestamp `T2`.
- Mark the source file for incremental load (see Incremental Mode).
- Upload the source file to your object storage service.
- Run ADD v2 to load the source file with the new timestamp to the workspace.
Loading Data to a Subset of Workspaces in a Segment in Full or Incremental Mode
You can load data to only some (not all) client workspaces in a segment. Prepare source files that contain the data for the client workspaces that you want to update. When ADD v2 runs, it loads the data only to those client workspaces and sets their LSTS to the timestamp from the source files. All other client workspaces are not updated with any data, but their LSTS is set to the same value as that of the updated ones, so that ADD v2 does not process the same source files for those workspaces at its next run.
Delete Mode
You can adjust your source files to delete old or obsolete data from a dataset while loading new data to it. Deleting the old data and uploading the new data are done in a single transaction. This helps keep the data in your workspace consistent and avoids situations where new data is already in the dataset but the old data has not yet been removed.
The primary use case for deleting data using ADD v2 is to delete data during incremental load. ADD v2 deletes data using the data loading API (see Delete Old Data while Loading New Data to a Dataset via API).
To set up ADD v2 to be able to delete data:
- Update the structures in the source files according to Naming Convention for Source Files in Automated Data Distribution v2 for Object Storage Services.
- Add the `x__deleted` column to the data load table. The `x__deleted` column indicates whether a specific row should be deleted from the dataset when data load runs.
Deleting data works in the following load modes:
- Incremental load: The data is processed in the following way:
  - Data after a certain point in time (based on the LSTS) where `x__deleted=FALSE` is loaded to the datasets using the `UPSERT` method.
  - Data after a certain point in time (based on the LSTS) where `x__deleted=TRUE` is deleted from the datasets.
- Full load: All data where `x__deleted=FALSE` is loaded to the datasets. The datasets are completely rewritten.
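The incremental-mode split can be sketched in Python. This is a hypothetical illustration of how the rows of a source file are partitioned, not ADD v2 code:

```python
import csv
import io

def split_incremental_rows(csv_content):
    """Partition an incremental source file into rows to UPSERT and
    rows to delete, based on the x__deleted column."""
    upserts, deletes = [], []
    for row in csv.DictReader(io.StringIO(csv_content)):
        target = deletes if row["x__deleted"].upper() == "TRUE" else upserts
        target.append(row)
    return upserts, deletes

source = """cp__invoice_id,r__customer,f__totalprice,x__deleted
100,3,324,FALSE
101,3,,TRUE
"""
upserts, deletes = split_incremental_rows(source)
# Invoice 100 is upserted; invoice 101 is deleted from the dataset.
```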
Deleting by Connection Point
Imagine you have the following data in ADD v2:
Previously loaded data
| cp__invoice_id | r__customer | f__totalprice | x__deleted |
|---|---|---|---|
| 100 | 3 | 324 | FALSE |
| 101 | 78 | 109 | FALSE |
New incremental data update
| cp__invoice_id | r__customer | f__totalprice | x__deleted |
|---|---|---|---|
| 101 | 3 | NULL | TRUE |
`cp__invoice_id` is the connection point (CP) for the dataset.
Previously, you loaded data to the dataset by CP with `invoice_id=101`.
Now, you want to delete particular records from the dataset as a part of your data load. The data where `invoice_id=101` will be deleted from the dataset.
Deleting by Fact Table Grain
Imagine you have the following data in ADD v2:
Previously loaded data
| a__ticket_id | a__organization | f__timespentsec | x__deleted |
|---|---|---|---|
| 12345 | ABC company | 458 | FALSE |
| 12346 | CDE company | 11342 | FALSE |
New incremental data update
| a__ticket_id | a__organization | f__timespentsec | x__deleted |
|---|---|---|---|
| 12345 | ABC company | NULL | TRUE |
The Fact Table Grain (FTG) consists of the `a__ticket_id` and `a__organization` attributes.
Previously, you loaded data to the dataset by FTG with `ticket_id=12345` and `organization=ABC company`.
Now, you want to delete particular records from the dataset as a part of your data load. The data where `ticket_id=12345 AND organization=ABC company` will be deleted from the dataset.
You do not have to explicitly specify FTG in the source files. When data load is running, the source file is compared to the logical data model, and FTG is automatically recognized.
Deleting by Attributes (no CP or FTG)
Imagine you have the following data in ADD v2:
Previously loaded data
| a__branch | a__organization | a__address | f__mrr | x__deleted |
|---|---|---|---|---|
| NYC | ABC company | 5th avenue | 20000 | FALSE |
| LA | CDE company | 3rd street | 100000 | FALSE |
New incremental data update
| a__branch | a__organization | a__address | f__mrr | x__deleted |
|---|---|---|---|---|
| NYC | ABC company | 5th avenue | 20000 | TRUE |
If no CP or FTG exists, data is deleted by all attributes (facts are ignored). For data to be deleted, all attribute values must match.
Previously, you loaded data to the dataset, and now you want to delete particular records from it as a part of your data load. The data where `branch=NYC AND organization=ABC company AND address=5th avenue` will be deleted from the dataset.
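This attribute-matching rule can be sketched in Python. A hypothetical illustration: rows are compared on all `a__*` columns, and `f__*` fact values are ignored:

```python
def delete_by_attributes(existing_rows, deletion_rows):
    """Drop rows whose attribute (a__*) values all match some deletion
    row; fact (f__*) values are ignored in the comparison."""
    def attr_key(row):
        # Canonical key built only from attribute columns.
        return tuple(sorted((k, v) for k, v in row.items() if k.startswith("a__")))
    doomed = {attr_key(r) for r in deletion_rows}
    return [r for r in existing_rows if attr_key(r) not in doomed]

existing = [
    {"a__branch": "NYC", "a__organization": "ABC company",
     "a__address": "5th avenue", "f__mrr": "20000"},
    {"a__branch": "LA", "a__organization": "CDE company",
     "a__address": "3rd street", "f__mrr": "100000"},
]
# The deletion row from the incremental update; its fact value may differ.
deletions = [{"a__branch": "NYC", "a__organization": "ABC company",
              "a__address": "5th avenue", "f__mrr": ""}]
remaining = delete_by_attributes(existing, deletions)  # only the LA row remains
```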