Example Data Schema and How to Structure Data
CloudConnect is a legacy tool and will be discontinued. We recommend that to prepare your data you use the GoodData data pipeline as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with Logical Data Modeler.
In this tutorial, you create the logical data model for a simple payroll application. The data stored in this application is transactional in nature. For purposes of the exercise, the data that is stored is the following:
Employee ID | Name | Pay Stub ID | Pay Date | Payment | Department | Department ID |
---|---|---|---|---|---|---|
35 | Peter Chen | P0001 | 2012/10/10 | 3250 | Marketing | D001 |
21 | Janelle Smith | P0002 | 2012/10/10 | 3850 | Support | D002 |
18 | Martin Jones | P0003 | 2012/10/10 | 2650 | Marketing | D001 |
19 | Kristina Larsen | P0004 | 2012/10/10 | 3350 | Support | D002 |
16 | Steven Francis | P0005 | 2012/10/10 | 9650 | Technical Writing | D003 |
35 | Peter Chen | P0006 | 2012/10/24 | 3250 | Marketing | D001 |
21 | Janelle Smith | P0007 | 2012/10/24 | 3850 | Support | D002 |
18 | Martin Jones | P0008 | 2012/10/24 | 2650 | Marketing | D001 |
19 | Kristina Larsen | P0009 | 2012/10/24 | 3350 | Support | D002 |
16 | Steven Francis | P0010 | 2012/10/24 | 9650 | Technical Writing | D003 |
The above rows show the payroll information for five employees across two periods. It contains all the information you need to track these transactions:
Field Name | Description |
---|---|
Employee ID | Internal identifier for the employee |
Name | First and last name of the employee |
Pay Stub ID | Internal identifier for the payment made to the employee |
Pay Date | Date that the payment was made |
Payment | Actual amount paid to the employee on the pay date |
Department | Department to which the employee belongs |
Department ID | Internal identifier for the department |
If you look at the source data again, you may notice a few things:
- Employee ID and Name both uniquely identify the employee in this dataset. However, since it is possible for the value for Name to be shared by multiple employees, Employee ID is used as the key.
- Name and Department contain text data. Within the scope of the set of employees, the range of possibilities for this data is finite; there are only so many names and employees in the company.
- If you know the Employee ID, you know the Name and the Department, too.
- Pay Stub ID contains a unique internal identifier for the transaction in which the employee receives his pay for the pay period.
- Payment lists the actual payment made to the employee. Payments range from 2900 to 9900.
Structuring the Data
As transactional data, data sets such as one shown above, work fine. But there are some issues with the relationships implied by the transactions:
- Some of the fields in the data are not describing the same object. For example, the Payment field has no direct connection to the Department field.
- Field values can be mapped to multiple values in other fields. For example, in the data structure, an employee may receive payments from multiple departments. In this flat file format, there are no defined hierarchies.
In the logical data model, you want to clarify these relationships so that they are enforced in underlying physical data model. Data fields in the incoming feeds that are not directly related should be logically separated, and ambiguous many-to-many relationships should be resolved into 1:N relationships. Through the logical data model, you can enforce the cardinalities of the data feed without having to understand the specifics of the data.
For example, if you know the Employee ID, the Name, Department and Department ID values should always be consistent. Using the example data, the employee data might be stored separately in the following format:
Employee ID | Name | Department | Department ID |
---|---|---|---|
35 | Peter Chen | Marketing | D0001 |
21 | Janelle Smith | Support | D0002 |
18 | Martin Jones | Marketing | D0001 |
19 | Kristina Larsen | Support | D0002 |
16 | Steven Francis | Technical Writing | D0003 |
Now, the transactional data can be redesigned to be stored in the following format:
Employee ID | Pay Stub ID | Payment | Pay Date |
---|---|---|---|
35 | P0001 | 3250 | 2012/10/10 |
21 | P0002 | 3850 | 2012/10/10 |
18 | P0003 | 2650 | 2012/10/10 |
19 | P0004 | 3350 | 2012/10/10 |
16 | P0005 | 9650 | 2012/10/10 |
35 | P0006 | 3250 | 2012/10/24 |
21 | P0007 | 3850 | 2012/10/24 |
18 | P0008 | 2650 | 2012/10/24 |
19 | P0009 | 3350 | 2012/10/24 |
16 | P0010 | 9650 | 2012/10/24 |
Through the values in the Employee ID, you can access Name and Department information. A single Employee ID value can now point to multiple payment records. It is mapping 1 value to N values. In this manner, the set of employee information is the parent to the child transactional data. These two data sets form a hierarchy of information, as specified by the Employee ID.
You may have noticed another hierarchy in the above data. One value for the Department field may point to multiple values for Employee, which makes sense. A department may contain multiple employees. So the departmental mapping looks like the following:
Department | Department ID |
---|---|
Marketing | D001 |
Support | D002 |
Technical Writing | D003 |
To create an optimized data model, you can split it into multiple datasets and create separate datasets for Department and Employee. In this tutorial, you step through the process of rendering the above flat file structure into a set of logical relationships that build hierarchies and enforce cardinalities implicit in the data. This creation process happens in the following steps:
- Build the dataset to match the flat file of transactions.
- Remove the employee fields into a separate dataset. Build the hierarchical relationship between the employee data set and the transaction dataset.
- Remove the department fields into a separate dataset. Build the hierarchical relationship.
- Add the Date dataset to the logical data model.
- Validate and publish.