Choosing a Primary Attribute Identifier in Your ETL Graph

In CloudConnect Designer, when you perform field mappings from the metadata of your GD Dataset Writer component, you may be required to select the primary identifier of some attributes. This identifier is a label of the attribute.

CloudConnect is a legacy tool and will be discontinued. We recommend that to prepare your data you use the GoodData data pipeline as described in Data Preparation and Distribution.

Contents:

This article describes how to manage these changes through the Choose a primary identifier of the following attributes dialog in the GD Dataset Writer for the following cases:

  • You have one dataset
  • You have two more datasets connected through a connection point.

Before you begin, please check your data for the primary and foreign keys. These keys must be used as the primary identification, which functions like a primary key; it stores only unique values. More information is provided below.

Case 1: One Dataset

Example Model

Here is a simple example, which contains a single dataset.

This dataset contains the following fields. Note that the Employee ID attribute contains three attribute labels:

  • First Name
  • Last Name
  • Identification number.

Here is some sample data in a CSV file:

ETL Designer - mapping fields in GD Dataset Writer

When you are mapping fields in GD Dataset Writer, the following dialog appears. Select the primary attribute label, which describes the Employee ID attribute.

Behavior variations

Depending on the field that you select, the behavior varies. In the following examples, a simple report was created to sum salary values for the Employee ID attribute, the label of which was changed in each of the following reports.

1. Select Employee ID (label.employee.employeeid) as the primary identifier

Secondary-click the Employee ID column to change the attribute label.

Primary identifier = Employee ID*

Primary identifier = Last Name

Primary identifier = First Name

2. Select First name (label.employee.employeeid.firstname) as the primary identifier

Secondary-click the Employee ID column to change the attribute label.

Primary identifier = Employee ID

Primary identifier = First Name

Primary identifier = Last Name

In Example 2, only four rows are displayed. The primary identifier is retrieving only the unique values, storing only the last data associated with each value. These facts are summed together in the displayed metric.

The primary identifier behaves like a primary key; it stores only the unique values and store only the last one.

For example, in the source data, there are three rows in which the value for First Name is “Jan”. In the generated report, their salaries are summed together and displayed in a single row as a unique value.

This behavior is consistent whenever you select the primary identifier and have repetitive values in multiple rows.

Case 2: Two Datasets

Example Model

In this example, two datasets are connected through a connection point:

In this example, you must map the fields for both datasets. Since they are connected through a connection point, you must specify in the Employee dataset the foreign key-like reference in the Work dataset.

The field-level information on these datasets is displayed below.

Field mapping for Employee dataset

Field mapping for Work dataset

CSV files

Example data for each dataset is listed below.

employee.csv

work.csv

ETL Designer - mapping fields in GD Dataset Writer

When you are mapping fields in the GD Dataset Writer, the following Employee Dataset Writer dialog window appears for each dataset.

Employee Dataset Writer

Field mapping dialog - Select the foreign key (the primary key is the Work ID connection point in the Work dataset)

For the Employee dataset, the Work ID functions like the foreign key for the dataset, since it is the connection point for the Work dataset.

Select the referenced label (foreign key) and the primary identifier of the Employee ID attribute:

Selecting the referenced label and the primary key

Then, you must select the label that identifies the attribute.

Work Dataset Writer

For the Work dataset, you must identify the primary attribute label.

Field mapping - primary attribute label

You must select the label that identifies the attribute.

Behavior variations

Depending on the field you select, the behavior varies when data is uploaded to the GoodData platform.

1. Select Employee ID (label.employee.employeeid) as the primary identifier in Employee dataset

Behaviors are the same as in the first case. See above.

2. Select First name (label.employee.employeeid.firstname) as the primary identifier in Employee dataset

Work dataset with Employee ID

Work dataset with First Name

Work dataset with Last Name

Here again, values are summed based on unique values for the label identifying the first name. So, due to the multiple instances of “Jan,” the reported value is the sum of all salaries for all instances of Jan.

3. Select First name (label.employee.employeeid.firstname) as the primary identifier in Employee dataset and Title (label.work.workid.title) as the primary identifier in Work Dataset (and also in Employee dataset writer as a reference)

(empty value) generated by using Work ID as primary key and foreign key

The (empty value) for the Work ID is generated because the field Work ID is used as the connection point (primary key and reference key).

employee.csv

work.csv

To correct this use case, you must change the employee.csv. The worktitle column has been added, as a reference to the person’s job Title.

employee.csv with worktitle column

When the above CSVs are reloaded, the following values are displayed in your reports:

Employee dataset with Work ID = worktitle

Employee dataset with Work ID = First Name

Employee dataset with Work ID = Last Name

What It Means

Example 1

Work and employee are now connected through the primary key, Work ID. In this example, the work ID is represented directly by the work id attribute.

Employee.csv

employee id,fname,lname,idnumber,salary,workid
1,Martin,Cacky,65456,1,1
2,Jan,Cisar,3543,2,2
3,Jan,Vlcek,34535,3,2
4,Jan,Nanista,354354,4,3
5,Michal,Manak,74,5,4
6,Pert,Stedry,121,6,5

Work.csv

work id,title,url
1,alfa,www.alfa.cz
2,beta,www.beta.cz
3,beta,www.beta.sk
4,gamma,www.gamma.net
5,delta,www.delta.com

When performing the mapping in the GD Dataset Writer component, you must choose “Work ID” as the primary identifier:

Field mapping Work ID in Work dataset


Select the same name in the field mapping dialog for the Employee dataset:

Mapping Work ID in Employee dataset

Example 2

Work.csv and Employee.csv are connected through primary key workid. In this example, this primary key is identified as the title of the work.

Employee.csv

employee id,fname,lname,idnumber,salary,workid
1,Martin,Cacky,65456,1,alfa
2,Jan,Cisar,3543,2,beta
3,Jan,Vlcek,34535,3,beta
4,Jan,Nanista,354354,4,beta
5,Michal,Manak,74,5,gamma
6,Pert,Stedry,121,6,delta

Work.csv

work id,title,url
1,alfa,www.alfa.cz
2,beta,www.beta.cz
3,beta,www.beta.sk
4,gamma,www.gamma.net
5,delta,www.delta.com

Choose “Title” as the primary identifier:

Choose primary identifier for Work ID in Work dataset


Also, select the same name in the field mapping dialog for the Employee dataset:

Select field mapping for Work ID reference in Employee dataset

This reference in the Employee dataset must be the same as in the Work dataset. Otherwise, your data will not be connected together and (empty values) are displayed.

Conclusion

  1. Be careful while mapping fields in GD Dataset Writer.
  2. Remember to have some attributes with at least one attached label.
  3. Before selecting the primary identifier, check your source data to verify the values for primary and foreign keys.

In some future release, this feature will be included in the LDM Modeler.