Incremental Data Loading

GoodData now supports incremental load via the Automated Data Distribution (ADD) feature. Using ADD, you can choose between full and incremental load modes.

Ideally, each execution of an ETL process would result in a full load of a dataset with data that has been gathered and processed from the source system. In this ideal structure, the data in your GoodData project corresponds to the data in the source system, factoring in any transformations applied to it by the ETL process. If oddities or bad data are detected in the project, you simply re-run your full data load ETL process to renew the contents of your project.

However, full data loads may not always be possible for the following reasons:

  • For very large datasets, a full load may not be 1) possible to query from the source system or 2) loadable within a reasonable timeframe into the GoodData project.
  • Because a full data load replaces all content in the dataset, the GoodData project can never retain data for a longer period than the source system makes available.

By default, ETL graphs are designed to use the full data loading method of refreshing the target dataset in your GoodData project. However, in situations such as the above, you can modify the ETL graph to perform incremental data loads. An incremental data load is a method of updating the dataset in which only new or modified records are uploaded to the project. If the target record is modified in the incoming data or if there is no target record, the incoming data is written to the project.

Records cannot be removed through incremental data loading methods. If you perform a full data load, any records that were in the dataset and are absent in the new data no longer appear in the updated dataset. For more information on methods of removing data from your datasets, see Deleting Data from Project.

Overview

In an incremental data loading scenario, the incoming data is compared to the existing data in the dataset via a connection point. In GoodData, a connection point is a column in the dataset that has been identified as containing unique values for each record in the dataset. It functions similarly to a primary key in a database table.

So, using the defined connection point in the dataset to match incoming records to existing records, the loading processes within the GoodData platform can determine if the incoming records need to 1) modify existing records or 2) append new records to the dataset. For modified records, any updated values are written into the specific fields of the record.

Each value in the connection point field must be unique within the dataset records. If you upload three rows with the same connection point value, only the record that was uploaded last remains in your data; it replaces the two rows with the same value that were uploaded before it.
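
To make the upsert behavior concrete, here is a minimal Python sketch of the merge logic. The column names and sample values are hypothetical and only illustrate how a connection point decides between updating and appending:

# Minimal sketch of how incremental loading resolves records by connection point.
# The column names and sample values below are hypothetical.

existing = {
    "store-001": {"storeid": "store-001", "store_name": "Downtown", "employees": 12},
    "store-002": {"storeid": "store-002", "store_name": "Airport", "employees": 7},
}

incoming = [
    {"storeid": "store-002", "store_name": "Airport", "employees": 9},  # existing record -> updated
    {"storeid": "store-003", "store_name": "Harbor", "employees": 5},   # new record -> appended
]

for record in incoming:
    # The connection point ("storeid") decides whether a record updates or appends.
    existing[record["storeid"]] = record

# store-001 is absent from the incoming data but is left untouched;
# incremental loading never deletes records.
print(existing)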

Additional Considerations

Change How Data Is Queried

Keep in mind that if you are changing to an incremental data loading method, you may also need to change how you acquire data from your source system. For example, if your ETL graph currently pulls full data from the source system, you may need to change either 1) how data is queried from the source system in your ETL graph or 2) how the data is provided from the source system to your ETL staging area.

As part of any change to an incremental data loading method, you should review the entire ETL graph to verify that all steps of the process support an incremental loading method. The way in which data is read is particularly important, as you may create duplicated data if your ETL graph is set to incremental and you forget to change the extract step to grab only incremental data.
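
As a rough sketch of what an incremental extract step can look like, the following Python snippet builds a source query that is filtered by a high-water mark. The table name (payments), the column name (modified_at), and the query shape are assumptions for illustration only and are not part of any GoodData API:

# Sketch of building an incremental extract query from a stored high-water mark.
# The table and column names (payments, modified_at) are hypothetical.
from datetime import datetime, timedelta

def build_extract_query(last_successful_run):
    # A full load would simply be: SELECT * FROM payments
    # An incremental load restricts the extract to rows changed since the last run.
    return (
        "SELECT * FROM payments "
        "WHERE modified_at >= '{:%Y-%m-%d %H:%M:%S}'".format(last_successful_run)
    )

# Example: extract everything modified during the last day.
print(build_extract_query(datetime.now() - timedelta(days=1)))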

Multiple ETL Graphs

In some scenarios, you may find it useful to maintain multiple ETL graphs. For example, you may decide that it is most efficient to perform regular incremental data loads to refresh the data and to periodically do a full data load during off-peak hours to completely refresh the dataset.

In these scenarios, you may decide that the easiest method is to maintain separate ETL graphs and publish to separate processes.

Configuration Methods

In order to implement incremental data loading, you must define a connection point in your logical data model, which can be accomplished in one of the following ways:

Configuration through CloudConnect

Setting up the Connection Point through the LDM Modeler

Through the LDM Modeler in CloudConnect Designer, you can define a connection point for a dataset very easily.

Steps:

    1. Open CloudConnect Designer. Open the project containing the dataset you wish to configure for incremental data loading.
    2. Select the dataset.
    3. Click the Edit link. The Attributes and Facts window opens.
    4. If the dataset (for example, a Payments dataset) does not yet contain a connection point, click the Type column of the field that you wish to make a connection point.

    In most cases, the connection point is defined on an attribute.

    5. Click the Browse… button. In the Select Type window, click the Connection Point radio button.
    6. Click OK. The field is now defined as a connection point.
    7. Click OK again. Save the file.
    8. Publish the model to the server.

Define Mode in GD Dataset Writer

You must also configure the mode in the GD Dataset Writer component of your ETL graph to write incremental data. This component writes data into the GoodData project that you designate.

Steps:

  1. In CloudConnect Designer, open the ETL graph you wish to configure for incremental data loading.
  2. Double-click the GD Dataset Writer component. The GD Dataset Writer configuration is displayed.
  3. Click the Mode setting.
  4. From the drop-down, select Incremental (append data from input).
  5. Click OK.
  6. Save the graph file.

Change the Input

Before you test the graph, you should verify that the input to the graph is set to query the source for incremental data, instead of full data loads.

Testing

You should test this configuration, for example by using a dummy data file. During testing, verify the following:

  • Append - adding new records
  • Update - modifying existing records

You might create a test set of records in a flat file, change the ETL to read from this file, and then upload them. Then, you could make modifications to fields in the records, except for the connection point field. When you re-execute the graph using this test file, the existing test records should be modified accordingly.
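
A minimal Python sketch of preparing such test files, assuming a hypothetical flat-file layout with a storeid connection point, might look like this:

# Sketch of preparing flat files for the append and update test scenarios.
# The file names and columns are hypothetical.
import csv

columns = ["storeid", "store_name", "employees"]

# First run: initial test records (tests Append).
initial = [
    {"storeid": "store-001", "store_name": "Downtown", "employees": "12"},
    {"storeid": "store-002", "store_name": "Airport", "employees": "7"},
]

# Second run: same connection point values, modified facts (tests Update).
modified = [
    {"storeid": "store-001", "store_name": "Downtown", "employees": "14"},
    {"storeid": "store-002", "store_name": "Airport", "employees": "9"},
]

def write_test_file(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

write_test_file("test_load_1.csv", initial)   # execute the graph against this file first
write_test_file("test_load_2.csv", modified)  # then re-execute the graph against this file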

Configuration through the CL Tool

The CL Tool is a legacy interface and may be deprecated in a future release. Instead of using the CL Tool to set up incremental data loading, use the incremental load mode of the Automated Data Distribution (ADD) feature.

Setting up the Connection Point Using the CL Tool

In the CL Tool, incremental data loading is executed by adding the following parameter:

TransferData(incremental="true")

If the incremental parameter is false, the data loading script deletes all of the existing data and then loads your incoming data as the complete contents of the dataset.

The connection point is essentially the primary key of the dataset. It can also provide connections to other datasets (through the REFERENCE ldmType). In the following example, a CONNECTION_POINT is defined:

<schema>
	<name>store</name>
	<columns>
		<column>
			<name>storeid</name>
			<title>StoreID</title>
			<ldmType>CONNECTION_POINT</ldmType>
			<folder>Store</folder>
		</column>
		<column>
			<name>store_name</name>
			<title>Store name</title>
			<ldmType>ATTRIBUTE</ldmType>
			<folder>Store</folder>
		</column>
		<column>
			<name>employees</name>
			<title>Employees</title>
			<ldmType>FACT</ldmType>
			<folder>Store</folder>
		</column>
	</columns>
</schema>

In the above example, the StoreID column is the connection point of the dataset.

Incremental Data Loading Examples

Example 1 - Loading Data for the Last Three Days

Suppose you need to upload data from three days back until today, and you must execute this load every day. Every day, you therefore run an incremental data load that covers the last three days.

If some data already exists (has the same connection point value), the incoming data replaces it with the new values. If the records you are uploading are completely new, the script appends them.

If the data load fails on any given day, the missing records are added by the next day's load. However, if the data load fails three days in a row, the oldest day falls outside the loaded window and its data is missing from the project.
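
As an illustration only, the date arithmetic for such a rolling window might look like the following Python sketch; the function name and window handling are assumptions:

# Sketch of computing the rolling window for the daily incremental load:
# from three days back up to and including today.
from datetime import date, timedelta

def loading_window(today=None):
    today = today or date.today()
    return [today - timedelta(days=offset) for offset in range(3, -1, -1)]

print(loading_window(date(2011, 8, 13)))
# [datetime.date(2011, 8, 10), datetime.date(2011, 8, 11),
#  datetime.date(2011, 8, 12), datetime.date(2011, 8, 13)]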

Example 2 - Google Analytics

The most recently available Google Analytics data is typically from the day before, so when you perform an incremental load, you load data from the previous day, and you execute this data load each day.

For example, yesterday we used the Google Analytics API to get the page views and visits by referrers. The API may return something like this:

2011-08-10,Google,500,30
2011-08-10,Bing,50,15

By adding the unique connection point to the records, our data load looks something like the following:

ea3d273d6296684da21d3dc2aa82d0b7,2011-08-10,Google,500,30
f4583c4214d22677d991e25d041336ea,2011-08-10,Bing,50,15

In this case, the connection point is an MD5 hash computed from the attribute values. In the CL Tool, this can be done easily with the IDENTITY element. If you add the IDENTITY transformation to the dataset, the transformed column is automatically filled with an MD5 hash of every non-fact column (attributes, labels, and so on). The transformation column is added to the dataset and uploaded as part of the regular load.

Below, you can review the definition of the transformation column:

<schema>
	<name>store</name>
	<columns>
		<column>
			<name>storeid</name>
			<title>StoreID</title>
			<ldmType>CONNECTION_POINT</ldmType>
			<transformation>IDENTITY</transformation>
			<folder>Store</folder>
		</column>
		...
	</columns>
</schema>
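
If you were computing such a hash outside the CL Tool, for example in a preprocessing script of your own, a rough Python equivalent could look like the sketch below. The exact concatenation that the IDENTITY transformation uses internally is not documented here, so the resulting hash values may differ; treat this purely as an illustration:

# Sketch of deriving a connection point as an MD5 hash of the non-fact columns.
# This only mirrors the idea of the IDENTITY transformation; the hash values
# produced by the CL Tool itself may differ.
import hashlib

def connection_point(record, attribute_columns):
    raw = "|".join(str(record[col]) for col in attribute_columns)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

row = {"date": "2011-08-10", "referrer": "Google", "pageviews": 500, "visits": 30}
# Only the attribute columns feed the hash, so the same date/referrer
# combination always maps to the same connection point value, no matter
# how the fact values (pageviews, visits) change between loads.
print(connection_point(row, ["date", "referrer"]))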

The following day, the Google Analytics API returns larger values, which is normal because yesterday’s data may not have been complete:

2011-08-10,Google,509,33 
2011-08-10,Bing,52,19

The hash-based connection point is added to the data:

ea3d273d6296684da21d3dc2aa82d0b7,2011-08-10,Google,509,33
f4583c4214d22677d991e25d041336ea,2011-08-10,Bing,52,19

When this data is uploaded, the corresponding records are updated based on the connection point values.