Example Data Schema and How to Structure Data

In this tutorial, you create the logical data model for a simple payroll application. The data stored in this application is transactional in nature. For purposes of the exercise, the data that is stored is the following:

Employee ID

Name

Pay Stub ID

Pay Date

Payment

Department

Department ID

35

Peter Chen

P0001

2012/10/10

3250

Marketing

D001

21

Janelle Smith

P0002

2012/10/10

3850

Support

D002

18

Martin Jones

P0003

2012/10/10

2650

Marketing

D001

19

Kristina Larsen

P0004

2012/10/10

3350

Support

D002

16

Steven Francis

P0005

2012/10/10

9650

Technical Writing

D003

35

Peter Chen

P0006

2012/10/24

3250

Marketing

D001

21

Janelle Smith

P0007

2012/10/24

3850

Support

D002

18

Martin Jones

P0008

2012/10/24

2650

Marketing

D001

19

Kristina Larsen

P0009

2012/10/24

3350

Support

D002

16

Steven Francis

P0010

2012/10/24

9650

Technical Writing

D003

 The above rows show the payroll information for five employees across two periods. It contains all the information you need to track these transactions:

Field Name

Description

Employee ID

Internal identifier for the employee

Name

First and last name of the employee

Pay Stub ID

Internal identifier for the payment made to the employee

Pay Date

Date that the payment was made

Payment

Actual amount paid to the employee on the pay date

Department

Department to which the employee belongs

Department ID

Internal identifier for the department

 If you look at the source data again, you may notice a few things:

  • Employee ID and Name both uniquely identify the employee in this dataset. However, since it is possible for the value for Name to be shared by multiple employees, Employee ID is used as the key.
  • Name and Department contain text data. Within the scope of the set of employees, the range of possibilities for this data is finite; there are only so many names and employees in the company.
  • If you know the Employee ID, you know the Name and the Department, too.
  • Pay Stub ID contains a unique internal identifier for the transaction in which the employee receives his pay for the pay period.
  • Payment lists the actual payment made to the employee. Payments range from 2900 to 9900.

Structuring the Data

As transactional data, data sets such as one shown above, work fine. But there are some issues with the relationships implied by the transactions:

  • Some of the fields in the data are not describing the same object. For example, the Payment field has no direct connection to the Department field.
  • Field values can be mapped to multiple values in other fields. For example, in the data structure, an employee may receive payments from multiple departments. In this flat file format, there are no defined hierarchies.

In the logical data model, you want to clarify these relationships so that they are enforced in underlying physical data model. Data fields in the incoming feeds that are not directly related should be logically separated, and ambiguous many-to-many relationships should be resolved into 1:N relationships. Through the logical data model, you can enforce the cardinalities of the data feed without having to understand the specifics of the data.

For example, if you know the Employee ID, the Name, Department and Department ID values should always be consistent. Using the example data, the employee data might be stored separately in the following format:

Employee ID

Name

Department

Department ID

35

Peter Chen

Marketing

D0001

21

Janelle Smith

Support

D0002

18

Martin Jones

Marketing

D0001

19

Kristina Larsen

Support

D0002

16

Steven Francis

Technical Writing

D0003

 Now, the transactional data can be redesigned to be stored in the following format:

Employee ID

Pay Stub ID

Payment

Pay Date

35

P0001

3250

2012/10/10

21

P0002

3850

2012/10/10

18

P0003

2650

2012/10/10

19

P0004

3350

2012/10/10

16

P0005

9650

2012/10/10

35

P0006

3250

2012/10/24

21

P0007

3850

2012/10/24

18

P0008

2650

2012/10/24

19

P0009

3350

2012/10/24

16

P0010

9650

2012/10/24

 Through the values in the Employee ID, you can access Name and Department information. A single Employee ID value can now point to multiple payment records. It is mapping 1 value to N values. In this manner, the set of employee information is the parent to the child transactional data. These two data sets form a hierarchy of information, as specified by the Employee ID.

You may have noticed another hierarchy in the above data. One value for the Department field may point to multiple values for Employee, which makes sense. A department may contain multiple employees. So the departmental mapping looks like the following:

Department

Department ID

Marketing

D001

Support

D002

Technical Writing

D003

 To create an optimized data model, you can split it into multiple datasets and create separate datasets for Department and Employee. In this tutorial, you step through the process of rendering the above flat file structure into a set of logical relationships that build hierarchies and enforce cardinalities implicit in the data. This creation process happens in the following steps:

  1. Build the dataset to match the flat file of transactions.
  2. Remove the employee fields into a separate dataset. Build the hierarchical relationship between the employee data set and the transaction dataset.
  3. Remove the department fields into a separate dataset. Build the hierarchical relationship.
  4. Add the Date dataset to the logical data model.
  5. Validate and publish.