Merging Data Using Staging Tables
CloudConnect is a legacy tool and will be discontinued. We recommend that you use the GoodData data pipeline to prepare your data, as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with the Logical Data Modeler.
Use staging tables when loading data into Data Warehouse, because:
- Data Warehouse does not support upsert operations.
- Data Warehouse does not enforce the uniqueness of the primary key during data load. Duplicate records load without error, but they may cause join queries to fail at query time (see the sketch below).
You cannot load your data directly into the target table and expect that any matching records already in the table are automatically overwritten.
To add data for records that may already exist in the target table, first load the data into an empty staging table, then use the MERGE statement to merge the staged data into the target table.
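For illustration, here is a minimal sketch of the failure mode. The orders table and the error text are illustrative assumptions, not part of the example further below:

-- The load succeeds even if customers.csv contains duplicate ids,
-- because the primary key is not enforced during load:
COPY customer FROM LOCAL '/data/customers.csv' ABORT ON ERROR DIRECT;

-- A join on the duplicated key can then fail at query time:
SELECT o.order_id, c.name_last
FROM orders o
JOIN customer c ON c.id = o.customer_id;
-- possible error: Duplicate primary key detected in FK-PK join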
Segmentation in Staging
Shuffling data across cluster nodes during a merge degrades performance. Where possible, segment the staging table the same way as the target table's projections so that each node can merge its data locally.
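Before creating the staging table, you can check how the target table is segmented. A minimal sketch using the Vertica system catalog, assuming the target table is customer:

-- List the segmentation expressions of the target table's projections:
SELECT projection_name, segment_expression
FROM v_catalog.projections
WHERE anchor_table_name = 'customer';

-- Or dump the full DDL, including the SEGMENTED BY clause:
SELECT EXPORT_OBJECTS('', 'customer');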
Example:
-- Session-scoped staging table; PRESERVE ROWS keeps the loaded data after COMMIT
CREATE TEMP TABLE in_customer (
    id VARCHAR(32) PRIMARY KEY,
    name_first VARCHAR(255),
    name_last VARCHAR(255),
    created_at DATETIME,
    is_deleted BOOLEAN
) ON COMMIT PRESERVE ROWS
SEGMENTED BY HASH(id) ALL NODES; -- consistent with the target table
-- Load the batch; ABORT ON ERROR fails the whole load on any rejected row
COPY in_customer FROM LOCAL '/data/customers.csv' ABORT ON ERROR DIRECT;
MERGE /*+direct*/ -- "direct" improves the performance of large batch operations
INTO customer tgt USING in_customer src
ON src.id = tgt.id
WHEN MATCHED THEN UPDATE
SET name_first = src.name_first, name_last = src.name_last,
created_at = src.created_at, is_deleted = src.is_deleted
WHEN NOT MATCHED THEN INSERT
(id, name_first, name_last, created_at, is_deleted)
VALUES
(src.id, src.name_first, src.name_last, src.created_at, src.is_deleted);
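Because the staging table is created with ON COMMIT PRESERVE ROWS, its rows survive a COMMIT. If you reuse the same session for another batch, empty the staging table first; a minimal sketch:

COMMIT;                     -- persist the merged data
TRUNCATE TABLE in_customer; -- start the next batch from an empty staging table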