Create a Logical Data Model from CSV Files

This tutorial guides you through the process of creating a logical data model (LDM) in your workspace from CSV files. A newly created workspace does not have an LDM therefore you are going to create an LDM from scratch.

For the purpose of this tutorial, you are going to create a simple LDM with three datasets and publish it to your workspace.

You create an LDM in the LDM Modeler. To do so, perform the following steps:

  1. Prepare CSV files.
  2. Add a dataset with attributes.
  3. Create a dataset with facts and a Date dataset.
  4. Publish the LDM.

Prepare CSV Files

Make sure that your CSV files meet the requirements described in CSV File Requirements.

In this tutorial, you are going to use two CSV files, product.csv and order_line.csv. Download them to your machine.

  • product.csv (download as .CSV)

    product_id,product_name,category
    150,Skirt,Clothing
    210,Artego,Electronics
    240,PortaCode,Electronics
    420,Elentrix,Outdoor
    430,Integres,Outdoor
    120,"Polo Shirt",Clothing
    110,Pullover,Clothing
    310,Applica,Home
    330,Optique,Home
    
  • order_line.csv (download as .CSV)

    order_line_id,order_id,order_status,date,product_id,price,quantity
    10668-9VYN74-2,10668-9VYN74,Cancelled,2016-07-12,210,100.33,1.00
    10697-7GBN87-1,10697-7GBN87,Delivered,2017-12-05,310,39.35,1.00
    10907-0TPH53-3,10907-0TPH53,Delivered,2015-03-28,220,50.40,1.00
    10778-7INQ32-1,10778-7INQ32,Delivered,2018-02-08,230,78.40,1.00
    10240-3SBQ40-3,10240-3SBQ40,Delivered,2017-05-07,160,21.49,1.00
    10356-7ZBU60-1,10356-7ZBU60,Cancelled,2015-03-01,140,21.49,1.00
    10700-0ACT49-1,10700-0ACT49,Returned,2019-02-20,150,21.82,1.00
    10175-0YRN35-3,10175-0YRN35,Delivered,2015-04-11,150,19.15,3.00
    10152-6HOB25-1,10152-6HOB25,Cancelled,2016-03-04,140,29.67,1.00
    

Add a Dataset with Attributes

You are going to create your first dataset from the product.csv file. This CSV file contains only attributes. Attributes are data that is to be used in grouping or segmenting the values resulting from the computed functions (for example, countries, names, IDs). In almost all use cases, this data is non-numerical. For more information about the attributes, see Attributes in Logical Data Models.

You are also going to set one of the attributes in the dataset as a primary key. The primary key defines the unique identifier for a row of data in a file and serves as a connection point that allows you to connect this dataset to another dataset. In other words, you will be able to create a relationship between the dataset that you are going to create from the product.csv file and another dataset that you will be creating later in this tutorial. A relationship between two datasets allows you to use information from one dataset to slice the data in the other dataset. The dataset with the data that you want to use to help analyze the other data is the one that must have a primary key. In this tutorial, this dataset is the one created from the product.csv file. For more information about primary keys, see Primary and Foreign Keys and Connection Points in Logical Data Models.

Steps:

  1. On the top navigation bar, select Data. The LDM Modeler opens. You see a blank canvas area in view mode.

  2. Click Edit. The LDM Modeler is switched to edit mode.

  3. Drag the product.csv file and drop it in the blank canvas area. The file preview opens. 

     This preview shows the data from your file. Based on the column names and the contents of those columns in the CSV file, the column names and the types of the data are suggested for the dataset that will be created from this file. For example, the product_name column is detected as an attribute based on the values in the column (non-numerical data) and will be named Product Name in the dataset.

  4. Notice that the option for creating a new dataset from the CSV file is selected, and the name for the new dataset is predefined as Product based on the name of the CSV file. Keep this option selected and leave the name as suggested.

  5. Notice that the product_id column is detected as a fact, which is a numerical piece of data, which in a business environment is used to measure a business process (see Facts in Logical Data Models). This is because it contains numbers. By default, numbers are auto-detected to be facts. However, the product_id column contains the identifiers of the products. Those are non-measurable descriptive data, and therefore must be an attribute. Because the product IDs are unique, you can use it as a primary key.

  6. Change the data type for the product_id column from Fact to Primary key. If you do not set the primary key now, you will be able to do it after the dataset has been created (see Update a Logical Data Model).

  7. Review the suggested column names and the data types for the other columns. These columns are correctly detected to be attributes.

  8. Click Import. The CSV file is imported, and the Product dataset is added. 

     Notice the following:

    • The dataset is marked with orange. This is because the dataset contains only attributes and no facts.
    • The Product Id attribute is marked as the primary key. This means that you will be able to create a relationship between this dataset and other datasets in the logical data model.

    You are now going to add another dataset from the order_line.csv file.

Create a Dataset with Facts and a Date Dataset

You are going to create another dataset from the order_line.csv file. This CSV file contains attributes, facts, and dates.

In an LDM, dates are managed through a separate object, the Date dataset (see Dates in the Logical Data Model). Therefore, after you have imported the order_line.csv file, not one but two datasets will be added to your LDM: one Date dataset for the dates, and the other one with the rest of information. These two datasets will be automatically connected with a relationship.

You are also going to establish a relationship between the dataset created from the order_line.csv file and the Product dataset that you created earlier.

Steps:

  1. Drag the order_line.csv file and drop it in the canvas area. The file preview opens. 
     
     Notice the following: 
    • The order_line_id, order_id, and order_status columns are correctly detected as attributes.
    • The date column is correctly detected as dates in the yyyy-MM-dd format and will be converted to a separate Date dataset.
    • The price and quantity columns are correctly detected as facts.
  2. Notice that the option for creating a new dataset from the CSV file is selected, and the name for the new dataset is predefined as Order Line based on the name of the CSV file. Keep this option selected and leave the name as suggested.
  3. Notice that the product_id column is detected as a fact. This is because it contains numbers. However, this column contains the product identifiers and therefore must be an attribute. Earlier in this tutorial, when creating the Product dataset, you set the Product Id attribute to be the primary key of this dataset. Now, you can set the product_id column in the order_line.csv file as a reference and thus create a relationship between these two datasets.
  4. Change the data type for the product_id column from Fact to Reference. When prompted, select Product as the dataset to set the reference to. 

  5. Click Import. The CSV file is imported, and the Order Line dataset and the Date dataset are added. 
     Notice the following:
    • The Order Line dataset is marked with green. This is because the dataset contains facts.
    • The Date dataset is marked with blue. This is because this is a date dimension dataset.
    • A relationship is created between the Date dataset and the Order Line dataset. This is because the dates were extracted from the order_line.csv file. The direction of the arrow determines which dataset’s data can be used to analyze (slice) the data from the other dataset. The relationship between the Date dataset and the Order Line dataset allows you to slice the data from the Order Line dataset (an incoming relationship to the Order Line dataset) by date (an outgoing relationship from the Date dataset).
    • A relationship is created between the Product and the Order Line dataset. This is because you connected these datasets by the Product Id primary key in the Product dataset. The direction of the arrow determines which dataset’s data can be used to analyze (slice) the data from the other dataset. The relationship between the Product dataset and the Order Line dataset allows you to slice the data from the Order Line dataset (an incoming relationship to the Order Line dataset) by product (an outgoing relationship from the Product dataset).
    • References to the Product Id attribute and the Date dataset have been added to the Order Line dataset. These references indicate that the Product Id field and the Date dataset are foreign keys in the Order Line dataset. 

Your LDM is ready. You can now publish it.

Publish the LDM

To publish the LDM, follow the instructions from Publish a Logical Data Model.

Mapping the Datasets to Source Tables

Any dataset in your LDM must have each field (a fact or an attribute) unambiguously mapped to a column in a source table (represented by a table in your data warehouse or a CSV file). The dataset itself must be mapped to this source table.

During data load, the data from the columns in the source table will be loaded to the corresponding facts or attributes in the dataset.

In a dataset that was created from a CSV file, the dataset fields are mapped to the columns in that CSV file. You can use CSV files with the same column names to load data to the dataset. For more information, see Update Data in a Dataset from a CSV File.

If you want to switch to loading data from a table in the data warehouse and this table has the same name and the column names as the CSV file, you can keep the mapping as is. If the table has a different name and/or column names, manually update the dataset so that its fields are mapped to the table columns and the dataset has the table name mapped to it. For more information, see Mapping between a Logical Data Model and the Data Source.