Set the Grain of a Fact Table to Avoid Duplicate Records
CloudConnect is a legacy tool and will be discontinued. We recommend that to prepare your data you use the GoodData data pipeline as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with Logical Data Modeler.
The fact table grain improves performance of incremental uploads and simplifies data management. Normally, when deduplicating records, you must create a special column or ‘connection point’ in your project. This approach leads to complexity in the ETL process because all attributes that form a unique identifier together must be concatenated, passed to a hash function, and then added to the dataset.
This article describes where it is helpful to use the fact table grain feature and how you can set the grain of a fact table.
Knowledge of MAQL DDL is helpful.
What is the Fact Table Grain?
The fact table grain functionality sets a new compound primary key for a table. This means you no longer need to use connection points for incremental uploads to fact tables.
The grain of a fact table defines the lowest level of detail that the fact table is divided into. For more information about grain in fact tables, see Fact Table. As there may be billions of records in the table that need to be inserted or updated on a daily basis, full data upload requires potentially slow transfer of an extremely large amount of data with possibility of errors or missing values. Therefor, it is more advantageous to update tables of records incrementally. To do this, each record must have a unique identifier so that only those records that need to be overwritten are overwritten.
For example, in a billing system that has invoices with multiple invoice lines, each line represents a product bought by a customer and does not have a natural primary key. Each Invoice has a unique Invoice ID, and you can identify any entity by specifying the Invoice ID and Invoice Line. The Invoice ID and the line number together form a compound key that uniquely identifies any record within the table invoice line. This can be set as the grain of the table.
Where the Fact Grain Helps
The fact grain helps where there are extremely large volumes of data that need to be updated regularly. However, owing to the effort needed to implement it, we do not recommend using the fact grain in the following cases:
- Projects are small (up to 1 GB).
- Projects do not use incremental upload mode (only full load is used).
- Projects do not contain fact tables with connection points.
Key Benefits of the Fact Table Grain
- The fact table grain simplifies incremental loading of data. This is especially useful when working with large fact tables leading to much quicker ETL times.
- You no longer have to worry about surrogate keys because the system checks for existing data rows.
Using the Fact Table Grain
You can implement the fact table grain using the API (see Data Modeling API (Cloud Connect)) or CloudConnect LDM Modeler (see Work with Fact Table Grain Using CloudConnect Modeler).
Fact tables that have a grain do not have to contain facts. In this case, the common loading method of all attributes is applied to this table as well. Attributes that are part of a grain must not contain any labels. If you remove the grain from an attribute with loaded data and then add a connection point to that attribute, you must refresh data when you upload new data (a full upload; or a hard synchronize, see Managing LDM Project Changes in CloudConnect).
The limit of 2-32 attributes or references only is applied (that is, a compound key can be up to 32 columns max).