Merging Data Using Staging Tables
CloudConnect is a legacy tool and will be discontinued. We recommend that you use the GoodData data pipeline to prepare your data, as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with the Logical Data Modeler.
Use staging tables when loading data into Data Warehouse, because:
- Data Warehouse does not support upsert operations.
- Data Warehouse does not enforce the uniqueness of the primary key during data load. Duplicate records load without error, but they may cause join queries to fail at query time (see the sketch below).
You cannot load your data directly into the target table and expect that any matching records already in the table are automatically overwritten.
To add data for records that may already exist in the target table, first load the data into an empty staging table, then use the MERGE statement to merge the staged data into the target table.
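For illustration, here is a minimal sketch of the failure mode. The orders table and the error text are illustrative assumptions, not part of the example further below:

-- The load succeeds even if customers.csv contains duplicate ids,
-- because the primary key is not enforced during load:
COPY customer FROM LOCAL '/data/customers.csv' ABORT ON ERROR DIRECT;

-- A join on the duplicated key can then fail at query time:
SELECT o.order_id, c.name_last
FROM orders o
JOIN customer c ON c.id = o.customer_id;
-- possible error: Duplicate primary key detected in FK-PK join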
Segmentation in Staging
Shuffling data across cluster nodes during a merge degrades performance. Where possible, segment the staging table the same way as the target table's projections so that each node can merge its data locally.
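Before creating the staging table, you can check how the target table is segmented. A minimal sketch using the Vertica system catalog, assuming the target table is customer:

-- List the segmentation expressions of the target table's projections:
SELECT projection_name, segment_expression
FROM v_catalog.projections
WHERE anchor_table_name = 'customer';

-- Or dump the full DDL, including the SEGMENTED BY clause:
SELECT EXPORT_OBJECTS('', 'customer');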
Example:
-- Session-scoped staging table; PRESERVE ROWS keeps the loaded data after COMMIT
CREATE TEMP TABLE in_customer (
    id VARCHAR(32) PRIMARY KEY,
    name_first VARCHAR(255),
    name_last VARCHAR(255),
    created_at DATETIME,
    is_deleted BOOLEAN
) ON COMMIT PRESERVE ROWS
SEGMENTED BY HASH(id) ALL NODES; -- consistent with the target table
-- Load the batch; ABORT ON ERROR fails the whole load on any rejected row
COPY in_customer FROM LOCAL '/data/customers.csv' ABORT ON ERROR DIRECT;
MERGE /*+direct*/ -- "direct" improves the performance of large batch operations
INTO customer tgt USING in_customer src
ON src.id = tgt.id
WHEN MATCHED THEN UPDATE
SET name_first = src.name_first, name_last = src.name_last,
created_at = src.created_at, is_deleted = src.is_deleted
WHEN NOT MATCHED THEN INSERT
(id, name_first, name_last, created_at, is_deleted)
VALUES
(src.id, src.name_first, src.name_last, src.created_at, src.is_deleted);
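Because the staging table is created with ON COMMIT PRESERVE ROWS, its rows survive a COMMIT. If you reuse the same session for another batch, empty the staging table first; a minimal sketch:

COMMIT;                     -- persist the merged data
TRUNCATE TABLE in_customer; -- start the next batch from an empty staging table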