Load Modes in Automated Data Distribution v2 for Object Storage Services

Automated Data Distribution (ADD) v2 for object storage services supports the following load modes:

  • Full load: All data is loaded to the datasets. The datasets are completely rewritten.
  • Incremental load: Only data after a certain point in time is loaded using the UPSERT method.
  • Delete mode load: Some old data is deleted while new data is uploaded.

Full Mode and Incremental Mode

Full Mode

When ADD v2 runs for the first time, data is uploaded in full mode.

After that, ADD v2 loads data in full mode to those datasets for which the source files have full load mode set (the file names have the {load_mode} section set to full or not set at all; see GoodData-S3 Integration Details).

  • If you have only one source file for a dataset, the file name must not contain a timestamp.
  • If you have multiple source files for a dataset, all those files must have a timestamp for ADD v2 to determine what file to load.

For more information about the timestamps, see GoodData-S3 Integration Details.

Incremental Mode

ADD v2 loads data in incremental mode to those datasets for which the source files have incremental mode set (the file names have the {load_mode} section set to inc). The source files to be loaded in incremental mode must have the timestamp. For more information about the {load_mode} section and the timestamps, see GoodData-S3 Integration Details.
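The naming convention above can be illustrated with a small parser. This is a hedged sketch, not the actual ADD v2 implementation: the regex assumes simple dataset names without underscores and the `yyyyMMddHHmmss` timestamp format described in GoodData-S3 Integration Details.

```python
import re

# Illustrative pattern for <dataset>_<yyyyMMddHHmmss>_<load_mode>.csv, where
# the timestamp and the {load_mode} section (full/inc) are both optional.
# Assumption: the dataset name contains no underscores.
FILENAME_RE = re.compile(
    r"^(?P<dataset>[A-Za-z0-9]+?)"
    r"(?:_(?P<timestamp>\d{14}))?"
    r"(?:_(?P<mode>full|inc))?"
    r"\.csv$"
)

def parse_source_file(name):
    """Return (dataset, timestamp, load_mode) parsed from a source file name.

    When the {load_mode} section is absent, the file is treated as full load.
    """
    m = FILENAME_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized source file name: {name}")
    return m["dataset"], m["timestamp"], m["mode"] or "full"
```

For example, `parse_source_file("Products_20200820120000_inc.csv")` yields the dataset name, the timestamp, and the `inc` load mode, while `parse_source_file("Products.csv")` falls back to full mode with no timestamp.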

How Data is Loaded

During data load in either full or incremental mode, ADD v2 takes the timestamp from the name of the uploaded file and applies it to all records in the file. This timestamp is called the "last load timestamp" (LSTS), and it is stored in the dataset. At the next run, ADD v2 finds the source file with the earliest timestamp greater than the dataset's LSTS and loads that file to the dataset. In a single run, ADD v2 processes only one source file.

If a source file does not have a timestamp (as when you have a single source file for a dataset to load in full mode), the LSTS is stored in the dataset as 0.

Example:
Imagine that you have the following source files to load to the Products dataset in incremental mode:

  • Products_20200820120000_inc.csv (file 1)
  • Products_20200821120000_inc.csv (file 2)
  • Products_20200821130000_inc.csv (file 3)
  • Products_20200821133000_inc.csv (file 4)

Last time, ADD v2 ran on August 21, 2020, at 12:30, and loaded the file Products_20200821120000_inc.csv (file 2) to the dataset. The last load timestamp (LSTS) for the Products dataset is set to 20200821120000.

When ADD v2 runs again on August 22, 2020, at 12:30, it will load the file Products_20200821130000_inc.csv (file 3) because ADD v2 takes the source file with the earliest timestamp greater than the dataset's LSTS (20200821120000), and that is file 3. After the load is complete, the LSTS is set to 20200821130000.

The next time ADD v2 runs, on August 22, 2020, at 18:30, it will load the file Products_20200821133000_inc.csv (file 4). If more files have been added to the folder by that time, they will not be processed in this run. The dataset's LSTS is set to 20200821133000.
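The selection rule in the example above can be sketched as follows. This is an illustrative model, not the actual ADD v2 code; it assumes the 14-digit timestamps compare correctly as strings, which they do for the `yyyyMMddHHmmss` format.

```python
# Pick the single source file whose timestamp is the smallest one greater
# than the dataset's last load timestamp (LSTS). Names are illustrative.
def next_file_to_load(source_files, lsts):
    """source_files: {timestamp: filename}; lsts: last load timestamp, or "0"
    when the dataset has no timestamped loads yet."""
    candidates = sorted(ts for ts in source_files if ts > lsts)
    return source_files[candidates[0]] if candidates else None

files = {
    "20200820120000": "Products_20200820120000_inc.csv",  # file 1
    "20200821120000": "Products_20200821120000_inc.csv",  # file 2
    "20200821130000": "Products_20200821130000_inc.csv",  # file 3
    "20200821133000": "Products_20200821133000_inc.csv",  # file 4
}
# With LSTS = 20200821120000 (file 2 already loaded), file 3 is picked next;
# with LSTS = 20200821133000 (file 4 loaded), nothing remains to load.
```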

Incremental Mode Scenarios

Loading Data to a Newly Added Client Workspace

When a new client workspace is added to a segment, it does not have an LSTS. Instead, the workspace creation time is used at the next ADD v2 run.

When ADD v2 runs, it finds the source files with the earliest timestamp greater than the workspace creation time and loads them to the workspace. If at any point data load fails, see Loading Data Failing for a Workspace and Loading Data to a Workspace after Three Successive Load Failures.

We recommend that you first upload all the data to the new workspace to bring it to the same level as the other workspaces in the segment (see Full Mode).

Loading Data Failing for a Workspace

The LSTS is defined per client workspace. If data load fails for a client workspace for any reason (such as invalid data in the source file or a network issue), the LSTS of this workspace does not get updated. ADD v2 will try to load the failed file to this workspace again at the next execution. If the data load succeeds and a newer source file is available, ADD v2 will load it to the workspace, too.

Example:
Imagine that you have two client workspaces, Workspace 1 and Workspace 2. Both have the LSTS set to T0 after the last successful data load.

You provide the source file Dataset_T1_inc.csv.

ADD v2 runs and loads the file to Workspace 1, and its LSTS is set to T1 now. However, the data load for Workspace 2 fails for some reason, and its LSTS remains T0.

You provide two more source files, Dataset_T2_inc.csv and Dataset_T3_inc.csv.

When ADD v2 runs again:

  • Workspace 1 (having the LSTS at T1) loads the file Dataset_T2_inc.csv.
  • Workspace 2 (having the LSTS at T0) tries to load the file Dataset_T1_inc.csv again (because it failed last time). If the data load succeeds, Workspace 2 then loads the file Dataset_T2_inc.csv.

Both workspaces now have the LSTS set to T2.

The file Dataset_T3_inc.csv will be loaded at the next ADD v2 run. After the load completes, both workspaces will have the LSTS set to T3.
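The per-workspace retry behavior in the example above can be simulated with a short sketch. This is an illustrative model, not the actual ADD v2 code: for simplicity it lets a workspace catch up through all pending timestamps in one run, as in the failure example above, and all names are hypothetical.

```python
# Simulate per-workspace LSTS tracking: a failed load leaves the workspace's
# LSTS unchanged, so the same file is retried on the next run, and newer
# files are loaded afterwards in the same run.
def run_add(workspace_lsts, timestamps, failing=()):
    """Advance each workspace through its pending timestamps in order.

    workspace_lsts: {workspace: lsts}; failing: (workspace, timestamp)
    pairs whose load fails in this run.
    """
    for ws, lsts in workspace_lsts.items():
        for ts in sorted(t for t in timestamps if t > lsts):
            if (ws, ts) in failing:
                break                      # LSTS stays; retried next run
            workspace_lsts[ws] = ts        # successful load advances LSTS
    return workspace_lsts

lsts = {"Workspace 1": "T0", "Workspace 2": "T0"}
run_add(lsts, ["T1"], failing={("Workspace 2", "T1")})
# Workspace 1 is now at T1; Workspace 2 stays at T0 after the failure.
run_add(lsts, ["T1", "T2"])
# Workspace 2 retries T1, then loads T2; both workspaces end at T2.
```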

Loading Data to a Workspace after Three Successive Load Failures

In some cases, ADD v2 may repeatedly fail to load data to a workspace. To prevent the workspace from getting stuck at an old timestamp, ADD v2 stops trying to load the data to this workspace after the third failure.

In the previous example, Workspace 2 failed to load the file Dataset_T1_inc.csv. Let's assume that it failed to load the files Dataset_T2_inc.csv and Dataset_T3_inc.csv in the two following ADD v2 runs. When the file Dataset_T4_inc.csv is provided, it will not be loaded to Workspace 2. The workspace's LSTS remains T0. The workspace is now missing the data loads for the timestamps T1, T2, T3, and T4.

After data has failed to load to a workspace three times in a row (for three successive timestamps), load the missing data to this workspace manually.

Steps:

  1. Export the missing data from the database to a source file, and assign it a new timestamp.
    In the previous example, you would export the data from the timestamp T1 to the timestamp T4 inclusive and give the source file a timestamp of T5.
  2. Upload the source file to your S3 bucket.
  3. Delete from the S3 bucket the old source files that failed to load.
  4. Run ADD v2 to load the source file with the new timestamp to the workspace.
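Step 1 above can be sketched in a few lines. This is a hedged illustration under simplifying assumptions: the exports are represented as in-memory CSV texts that share one header, whereas real exports would come from your database, and the merged output would be uploaded as a file with the new timestamp (e.g. Dataset_T5_inc.csv).

```python
import csv, io

def merge_sources(csv_texts):
    """Concatenate CSV texts that share the same header into one CSV text."""
    rows, header = [], None
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)          # assumption: identical header in each file
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical exports covering the missing loads (T1 through T4):
merged = merge_sources([
    "cp__id,f__amount\n1,10\n",   # data from the T1 export
    "cp__id,f__amount\n2,20\n",   # data from the T4 export
])
# "merged" is the content you would save with the new timestamp and upload.
```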

Overriding Incremental Mode with Full Mode

You can load data in full mode to a dataset that has been previously updated in incremental mode (for example, when you need to completely overwrite the data in the dataset).

Steps:

  1. Export the full data (all the data that you want to load to the dataset) from the database to a source file.
  2. Assign the source file a new timestamp. For example, if the last file loaded incrementally had the timestamp T1, the new source file must have the timestamp T2.
  3. Mark the source file for full load (see Full Mode).
  4. Upload the source file to your S3 bucket.
  5. Run ADD v2 to load the source file with the new timestamp.

Overriding Incremental Mode with Incremental Mode for a Specific Time Interval

You can load data to a dataset in incremental mode for a specific time interval.

Steps:

  1. Export the data for the required time interval from the database to a source file.
  2. Assign the source file a new timestamp. For example, if the last loaded file had the timestamp T1, the new source file must have the timestamp T2.
  3. Mark the source file for incremental load (see Incremental Mode).
  4. Upload the source file to your S3 bucket.
  5. Run ADD v2 to load the source file with the new timestamp to the workspace.

Loading Data to a Subset of Workspaces in a Segment in Full or Incremental Mode

You can load data only to some (not all) client workspaces in a segment. Prepare source files that contain the data for the client workspaces that you want to update. When ADD v2 runs, it loads the data only to those client workspaces and sets their LSTS to the timestamp from the source files. All the other client workspaces are not updated with any data, but their LSTS is set to the same value as that of the updated workspaces, so that the same source files are not processed for those workspaces the next time ADD v2 runs.
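The LSTS synchronization described above can be sketched as follows. This is an illustrative model with hypothetical names, not the actual ADD v2 behavior in code.

```python
# When a run updates only a subset of client workspaces, the remaining
# workspaces receive no data, but their LSTS still advances to the same
# timestamp so the run's source files are not reprocessed for them later.
def load_subset(workspace_lsts, updated_workspaces, file_timestamp):
    loaded = {}
    for ws in workspace_lsts:
        if ws in updated_workspaces:
            loaded[ws] = file_timestamp          # data actually loaded here
        workspace_lsts[ws] = file_timestamp      # LSTS advances for everyone
    return loaded

lsts = {"Client A": "T0", "Client B": "T0", "Client C": "T0"}
loaded = load_subset(lsts, {"Client A"}, "T1")
# Only Client A received data, but all three workspaces now have LSTS = T1.
```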

Delete Mode

You can adjust your source files to delete old or obsolete data from a dataset while loading new data to the dataset. When you delete old data and upload new data to datasets, deleting and uploading the data is done in a single transaction. This helps keep the data in your project consistent and avoids situations where new data is already in the dataset but the old data has not yet been removed.

The primary use case for deleting data using ADD v2 is to delete data during incremental load. ADD v2 deletes data using the data loading API (see Delete Old Data while Loading New Data to a Dataset via API).

To set up ADD v2 to be able to delete data:

  1. Update the structures in the source files according to Naming Convention for Source Files in Automated Data Distribution v2 for Object Storage Services.
  2. Add the x__deleted column to the data load table.
    The x__deleted column indicates whether a specific row should be deleted from the dataset when running data load.

Deleting data works in the following load modes:

  • Incremental load: The data is processed in the following way:
    • Data after a certain point in time (based on the LSTS) where x__deleted=FALSE is loaded to the datasets using the UPSERT method.
    • Data after a certain point in time (based on the LSTS) where x__deleted=TRUE is deleted from the datasets.
  • Full load: All data where x__deleted=FALSE is loaded to the datasets. The datasets are completely rewritten.
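The incremental-mode behavior above can be sketched by partitioning the rows of a source file on the x__deleted flag. This is an illustrative sketch, not the actual ADD v2 pipeline; the CSV content and function name are hypothetical, and the column names follow the examples below.

```python
import csv, io

# Hypothetical incremental source file with the x__deleted column.
SOURCE = """\
cp__invoice_id,r__customer,f__totalprice,x__deleted
100,3,324,FALSE
101,3,,TRUE
"""

def split_by_deleted_flag(csv_text):
    """Rows with x__deleted=FALSE are upserted; rows with TRUE are deleted."""
    upserts, deletes = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        flagged = row.pop("x__deleted").upper() == "TRUE"
        (deletes if flagged else upserts).append(row)
    return upserts, deletes

upserts, deletes = split_by_deleted_flag(SOURCE)
# upserts holds the record for invoice 100; deletes holds invoice 101.
```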

Deleting by Connection Point

Imagine you have the following data in ADD v2:

Previously loaded data

cp__invoice_id | r__customer | f__totalprice | x__deleted
100            | 3           | 324           | FALSE
101            | 78          | 109           | FALSE

New incremental data update

cp__invoice_id | r__customer | f__totalprice | x__deleted
101            | 3           | NULL          | TRUE

cp__invoice_id is the connection point (CP) for the dataset.

Previously, you loaded data to the dataset by CP with invoice_id=101.

Now, you want to delete particular records from the dataset as a part of your data load. The data where invoice_id=101 will be deleted from the dataset.

Deleting by Fact Table Grain

Imagine you have the following data in ADD v2:

Previously loaded data

a__ticket_id | a__organization | f__timespentsec | x__deleted
12345        | ABC company     | 458             | FALSE
12346        | CDE company     | 11342           | FALSE

New incremental data update

a__ticket_id | a__organization | f__timespentsec | x__deleted
12345        | ABC company     | NULL            | TRUE

Fact Table Grain (FTG) consists of the a__ticket_id and a__organization attributes.

Previously, you loaded data to the dataset by FTG with ticket_id=12345 and organization=ABC company.

Now, you want to delete particular records from the dataset as a part of your data load. The data where ticket_id=12345 AND organization=ABC company will be deleted from the dataset.

You do not have to explicitly specify FTG in the source files. When data load is running, the source file is compared to the logical data model, and FTG is automatically recognized.

Deleting by Attributes (no CP or FTG)

Imagine you have the following data in ADD v2:

Previously loaded data

a__branch | a__organization | a__address | f__mrr | x__deleted
NYC       | ABC company     | 5th avenue | 20000  | FALSE
LA        | CDE company     | 3rd street | 100000 | FALSE

New incremental data update

a__branch | a__organization | a__address | f__mrr | x__deleted
NYC       | ABC company     | 5th avenue | 20000  | TRUE

If no CP or FTG exists, data is deleted by all attributes (facts are ignored). For data to be deleted, all attribute values must match.

Previously, you loaded data to the dataset, and now you want to delete particular records from the dataset as a part of your data load. The data where branch=NYC AND organization=ABC company AND address=5th avenue will be deleted from the dataset.
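The three deletion scenarios above (connection point, fact table grain, attributes) can be summarized as a key-selection sketch. This is an illustration, not the actual ADD v2 logic: the column prefixes (cp__, a__, f__) follow the examples above, and in practice the FTG columns are derived automatically from the logical data model rather than passed in.

```python
# Choose the columns used to match records for deletion:
# a connection point (cp__) wins; otherwise the fact table grain (FTG);
# when neither exists, all attribute columns (a__*) are matched and
# facts (f__*) are ignored.
def delete_key(columns, ftg=None):
    cp = [c for c in columns if c.startswith("cp__")]
    if cp:
        return cp
    if ftg:
        return list(ftg)
    return [c for c in columns if c.startswith("a__")]

# Connection point present -> delete by CP.
assert delete_key(["cp__invoice_id", "r__customer", "f__totalprice"]) == ["cp__invoice_id"]
# FTG recognized from the logical data model -> delete by FTG.
assert delete_key(["a__ticket_id", "a__organization", "f__timespentsec"],
                  ftg=["a__ticket_id", "a__organization"]) == ["a__ticket_id", "a__organization"]
# No CP or FTG -> delete by all attributes; facts are ignored.
assert delete_key(["a__branch", "a__organization", "a__address", "f__mrr"]) == [
    "a__branch", "a__organization", "a__address"]
```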
