Merging Data Using Staging Tables

Use staging tables when loading data into Data Warehouse, because:

  • Data Warehouse does not support upsert operations.
  • Data Warehouse does not enforce the uniqueness of primary keys during data load; as a result, duplicate records may trigger errors at query time in queries that join on the key.

You cannot load your data directly into the target table and expect that matching records already in the table will be overwritten automatically.
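To see why, consider a direct load (a minimal sketch that reuses the customer table and CSV file from the example below): rows whose id already exists in the table are not replaced, they are added alongside the old ones.

-- Direct load into the target table: existing rows with the same id
-- are NOT overwritten, so re-loading the same data duplicates records.
COPY customer FROM LOCAL '/data/customers.csv' DELIMITER ',' ABORT ON ERROR DIRECT;

-- Any id loaded twice now appears twice:
SELECT id, COUNT(*) AS copies
FROM customer
GROUP BY id
HAVING COUNT(*) > 1;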

Segmentation in Staging

Segment the staging table the same way as the target table. If the two segmentations differ, the MERGE has to reshuffle the staging data across cluster nodes, which degrades performance.
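Because the MERGE joins the staging table to the target on id, segmenting both tables by the same expression lets each node join its local slice of the data. To check how the target is segmented before creating the staging table, you can inspect its projections (a sketch assuming the cluster exposes the Vertica catalog view v_catalog.projections):

-- List the segmentation expression of the target table's projections;
-- reuse the same expression in the staging table definition.
SELECT projection_name, segment_expression
FROM v_catalog.projections
WHERE anchor_table_name = 'customer';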

Example:

CREATE TEMP TABLE in_customer (
  id VARCHAR(32) PRIMARY KEY,
  name_first VARCHAR(255),
  name_last VARCHAR(255),
  created_at DATETIME,
  is_deleted BOOLEAN
) ON COMMIT PRESERVE ROWS
  SEGMENTED BY HASH(id) ALL NODES; -- consistent with the target table

COPY in_customer FROM LOCAL '/data/customers.csv' DELIMITER ',' ABORT ON ERROR DIRECT; -- DELIMITER ',' assumes comma-separated input; the default delimiter is '|'

MERGE /*+direct*/ -- "direct" improves the performance of large batch operations
  INTO customer tgt USING in_customer src
  ON src.id = tgt.id
  WHEN MATCHED THEN UPDATE
    SET name_first = src.name_first, name_last = src.name_last,
        created_at = src.created_at, is_deleted = src.is_deleted
  WHEN NOT MATCHED THEN INSERT
    (id, name_first, name_last, created_at, is_deleted)
    VALUES
    (src.id, src.name_first, src.name_last, src.created_at, src.is_deleted);
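
Also note that a MERGE typically fails when more than one staging row matches the same target row, so deduplicate the staging table before merging if the source may contain duplicate keys (a sketch; keeping the most recent record per id is an assumption for this example):

-- Find duplicate keys in the staging table; MERGE rejects a source
-- that matches the same target row more than once.
SELECT id, COUNT(*) AS copies
FROM in_customer
GROUP BY id
HAVING COUNT(*) > 1;

-- One way to de-duplicate: keep only the latest record per id
-- (assumes created_at is never NULL for duplicated ids).
DELETE FROM in_customer
WHERE (id, created_at) NOT IN (
  SELECT id, MAX(created_at) FROM in_customer GROUP BY id
);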