Basics of Data Modeling

This article provides an overview of data modeling basics. For experienced data modelers, these sections can be skimmed. Beginning users should review these contents.

Contents:

Transactional vs Dimensional Modeling

In transactional data modeling, each transaction is stored in the database as a single record. All of the information required to process the transaction is contained in the record, including references to other reference data.

In dimensional modeling, numerical and text data are stored as facts and attributes, respectively, in a dimensional model. In a dimensional model, each dimension provides a different perspective or axis on the transactional data. 

Suppose you are storing sales transactions from a retail store. In the transactional model, each transaction conducted at the register is recorded either as a sale or a return in the transactions table. This table may identify the product, customer, and date of the transaction. 

In a dimensional model of this data, you can examine all of the transactions by product, customer, date or some combination of them. Product, customer, and date are considered dimensions of the transaction data.

GoodData consumes transactional records and renders them into a dimensional model. These transactional records are extracted from their source data, transformed as necessary for storage, and loaded into the dimensional model that you are defining in LDM Modeler.

Model Representations

In the following diagram, you can see a visual representation of the example from the previous section. Product, customer, and time are represented as visual dimensions of the transactional dataset.

The following diagram is used for representation purposes only. It represents a data in three dimensions in the traditional MOLAP definition. In GoodData, however, cubes do not exist, and the data within a GoodData project can be assembled across N-dimensions. For example, the Sales dataset may include additional dimensions for sales person, region, and other useful measurements. When the number of dimensions exceeds three, a data cube becomes a more abstract concept and cannot be easily represented in a single diagram.

The cube is used for descriptive purposes only. In HOLAP (Hybrid OLAP) modeling, the data structure known here as a cube is generated on the fly from underlying database tables and does not necessarily correspond to any clear planar representation. 

Sales data cube from MOLAP Modeling
In this diagram, the highlighted segment within the cube denotes all of the transactions on the given date (03/15/13) for the specified product (X) and customer (Y). This set of transactions is listed in the insert table.

The numerical data (Amt) is known as a fact. A fact can be aggregated using one of the available functions. These aggregations by total, average, minimum, maximum, and count for specific fields are called metrics. Metrics that operate on specific facts are defined in the GoodData Portal.

Metrics that aggregate on the above aggregation functions only are very simple to define. Metrics that are more specific to your enterprise data typically involve more complex expressions that you define using MAQL through the GoodData Portal  

In a data cube, all of the line-by-line transactional Sales data has been reorganized so that it can be accessed quickly and easily by customer, product, or time (date or even time of day). Each cube within the cube represents an aggregation of Sales data for a specific product for a specific customer, and a specific time. 

Because of the way the cube is organized in the database, you can perform queries across multiple cubes within the cube based on the dimensional relationships that have been defined. These slices across the data cube represent aggregations of the transactional data. In the following diagram, the slice represents the sales of Product X on Date Z for all customers.

A slice of the Sales data cube
This diagram may represent a simple aggregation: the sum of all transactions at the cash register for Product X on a single date.

GoodData provides multiple aggregation functions, including sum, average, minimum, maximum, and count of items. Each of these functions could be applied to the above slice. These functions are applied to the data through metrics.

Here are some simple slicing exercises to visual using the above data cube:

  • The sum of all sales for all customers and products on a single date
  • The maximum amount for a sale for a single month
  • The best-selling product for the quarter

Data Cube Hierarchies

In the preceding examples, you may have noticed that for the Time dimension, multiple aggregation levels are represented. The first example aggregates by a single day, the second by a month, and the third by quarter. These aggregation levels represent different granularities of aggregation. 

Collectively, they represent a dimensional hierarchy in the data cube.

  • Sales for a single day represent the aggregation of all transactions for that date.
  • Sales for a month represent the aggregation of all Sales for all days within the month (aggregated).
  • Sales for a quarter represent the aggregation of all Sales for months in the quarter.

Quarter, Month, and Day are aggregation levels in the hierarchy of the Sales data cube. Similar hierarchies typically exist for customer and product:

  • Sales information by customer may be aggregated by parent company.
  • Sales information by product may be aggregated by brand.

These aggregations can be represented in your logical data model.