Page tree
Skip to end of metadata
Go to start of metadata

After trying the simplest examples on how to load data into GoodData project, at some point (probably sooner than later) you will encounter situation that your data are not in the format required for the load. Luckily, there’s a large number of different transformer components for various scenarios. However, without knowledge which is the right one to choose for the task you need to deliver, you increase the overall complexity of your transformation or even fail altogether. And as transformations over single dataset are probably the most frequent, it’s essential to make the right choice here. In this article we will have a look at three components - Reformat, Denormalizer, and Normalizer.

What is the same

All the components we are covering in this article belong into the Transformers section, which apart from other facts means that one input port and at least one output port has to be connected. However, it’s not necessary for the input and output ports to share the same metadata. Moreover, each of these components has to implement the transform() CTL function, defining the input to output mapping. This can be either achieved by setting the mapping in graphical wizard or by directly editing the source.

For details on mapping functions please see the documentation ###here###.

function integer transform() {
 
    $out.0.text = $in.0.text
 
    return OK;
}

What is the difference

Probably the best way to describe the differences lies in the fact how each of those components trigger the transform() function related to the number of records they are processing.

Component nameDescription
ReformatTransform function is triggered exactly once per every input record
DenormalizerTransform function is triggered once per one or more input records
NormalizerTransform function is triggered at least once per input record

Reformat

If you look at the description above carefully, it is probably clear that both Denormalizer and Normalizer can be used to simulate Reformat if set accordingly. However, I recommend to avoid doing this as it makes the graph more difficult to read.

The Reformat component is arguably the most popular among transformers. It allows using wide range of CTL functions inside the transform() function as well as emulating various other transformers (ExtFilter and SimpleCopy to name a few). Traditional usage includes string manipulation, data type conversion etc. However, Reformat won’t help you in case you want to split or merge the input records.

 function integer transform() {
     
    //Let's join the names and create a fullname field
    $out.0.FullName = $in.0.FirstName + ' ' + $in.0.Surname;
    $out.0.DateAsDate = str2date($in.0.DateAsString,"dd-MM-yyyy");
     
    //We want to track only the active users and filter the inactive out
    if ($in.0.IsActive) { return OK; } else { return SKIP; }
 
}

Denormalizer

Using Denormalizer, you’re basically lowering the number of records on input in favor of making the dataset wider, usually with intention to merge rows belonging to the same entity. Apart from the transform() function, each instance of Denormalizer also has to implement function append(), which defines the way the input records are merged.

Parameter nameDescription
Key Rowssharing the same key are merged (in our example it’s the Person field)
Group SizeAn alternative to Key parameter, used when the number of rows per output row is constant
DenormalizeDefines the transform() and append() functions

A simple example input data for Denormalizer could look like this:

PersonAttributeValue
John DoeAge31
John DoeCityAnytown
John DoeSexM

This situation is common when your source is not a relational database with a fixed set of fields, or if the source allows complex data types. As a result there’s a collection of attributes for every entity (person in our case) and for some entities there might be more rows than for others. Imagine that a following couple of rows in our example dataset would look like this:

PersonAttributeValue
Jane DoeAge46
Jane DoeCityNotown
Jane DoeSexF
Jane DoeOccupationGoodWorker

However, no matter how many different records there are for each entity you process, the dataset you will be eventually loading your data into has to have some fixed structure. You can decide to go for all the attributes you can find in your data source and fill the missing with nulls or “N/A” strings, or pick just the attributes that are common for all the records, but the final structure has to be fixed, such as:

PersonAgeCitySexOccupation
John Doe31AnytownMN/A
Jane Doe46NotownFGoodWorker

Using Denormalizer is the way to go if the attributes for entity you’re interested in is spread over several records.

DenormalizedMeta1 myRecord;
 
function integer append() {
     
    setStringValue(myRecord,$in.0.Attribute,$in.0.Value);
    myRecord.Name = $in.0.Person;
 
    return 1;
     
}
 
function integer transform() {
 
$out.0.* = myRecord;
 
    return 1;
}

Normalizer

In the Excel world, it’s quite common to see datasets formatted like this:

NameSalary_2011Salary_2012Salary_2013
John Doe100001100012000

Now, this is perfectly correct to use in a spreadsheet. However, for the future data analysis it’s a bit unfortunate, as the Salary_XX is in reality the very same fact, only preaggregated for a certain year. With dataset like this, some questions become difficult to answer (“What’s the sum of salaries over the years”). Moreover, what if 2014 is added, which can be expected? In case ETL is poorly designed, it might be necessary to change the LDM because of such a small updates.

In similar case, it’s probably worth converting the data into something like this:

NameYearSalary
John Doe201110000
John Doe201211000
John Doe201312000

In this case, we’re perfectly fine with adding new records. Also, it’s a lot easier for us if some records are missing.

Apart from transform(), we also need to define function count() which defines how many records will be produced per one input record. Based on the use case, the number might be fixed or computed dynamically (one field containing a list of strings would be a good example for this). Let’s use the length of our input metadata in case of our dataset:

function integer count() {
     
    //Get the number of fields in the input
    return (length($in.0.*)-1);
}
 
function integer transform(integer idx) {
 
    $out.0.Name = $in.0.Name;
    $out.0.Year = num2str(2011 + idx);
     
    //Access the input record by index
     
    $out.0.Salary = getStringValue($in.0.*,idx+1);
 
     
    return OK;
}

Example

I have created a simple example transformation to illustrate the use cases from this article. Feel free to download the example here.

  • No labels