Integrate Data Warehouses Directly to GoodData based on Your DW Schema

This tutorial guides you through the process of setting up direct data distribution to load data from a Snowflake data warehouse to the GoodData platform.

This tutorial assumes that you do not yet have a GoodData workspace, or that you want to generate or update an existing LDM so that it best fits the data structure in your data warehouse. During this tutorial, you are going to generate the Output Stage based on your database schema and then generate the LDM to correspond with the Output Stage.

If you already have a GoodData workspace with the logical data model (LDM) that meets your business requirements for data analysis and which you do not want to update, see Integrate Data Warehouses Directly to GoodData based on an Existing LDM.

By the end of this tutorial, you should be able to configure your Snowflake instance, connect it to the GoodData platform, and set up the data load to run regularly.

This tutorial describes the process using the web-based interface. Alternatively, you can use the REST API.

  • This tutorial focuses on the Snowflake data warehouse. If you are using the Amazon Redshift data warehouse, you can still follow this tutorial because the steps are similar or the same. Where Redshift requires different actions, we provide you with Redshift-specific information.
  • This tutorial assumes that you access your projects at https://secure.gooddata.com/. If you are a white-labeled customer, replace secure.gooddata.com with your white-labeled domain in the procedure steps.

Contents:


Create GoodData-specific assets in your Snowflake environment

Although creating GoodData-specific assets is optional, we strongly recommend that you do it.

Having separate assets for GoodData integration helps you reliably distinguish the data to load to the GoodData platform from the rest of your data in your Snowflake instance, improves performance, and makes troubleshooting easier (you can filter out the GoodData-related actions and log records).

Do the following:

  • Create a dedicated user and a role.

  • Create a dedicated warehouse. This ensures that dedicated computing power is available solely for GoodData integration and is used to unload data from Snowflake for use with the GoodData platform.
  • Grant USAGE privileges on the database and schema that you will be using for GoodData integration.

  • Grant the CREATE STAGE privilege on the schema. GoodData integration uses a Snowflake internal stage to unload and download data, and a new internal stage is created every time a new data integration process starts running.

  • Grant SELECT privileges on the existing and future tables/views within your schema.

For more information about permissions in Snowflake, see Security Access Control Privileges in the Snowflake documentation.

The following SQL script creates the minimum scope of assets needed to make the Snowflake-GoodData integration work. The last SQL commands grant SELECT privileges on all existing and future tables/views in the schema; if you need to ensure higher security, adjust them to grant privileges only to specific tables/views.

In the SQL script, replace the placeholders for the user password, the database name, and the schema name with your values.

-- use the role with privileges to create new assets
USE ROLE ACCOUNTADMIN;

-- create a role
CREATE ROLE "GOODDATA_INTEGRATION" COMMENT = 'Role for GoodData integration.';

-- create a user
CREATE USER GOODDATA_INTEGRATION PASSWORD = '{user_password}' COMMENT = 'User for GoodData integration.' DEFAULT_ROLE = "GOODDATA_INTEGRATION" DEFAULT_WAREHOUSE = 'GOODDATA_INTEGRATION' MUST_CHANGE_PASSWORD = FALSE;

-- grant the role to the user
GRANT ROLE "GOODDATA_INTEGRATION" TO USER GOODDATA_INTEGRATION;

-- create a warehouse
CREATE WAREHOUSE GOODDATA_INTEGRATION WITH WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 600 AUTO_RESUME = TRUE COMMENT = 'Warehouse for GoodData integration.';

-- grant the role privileges to the warehouse
GRANT OPERATE ON WAREHOUSE GOODDATA_INTEGRATION TO ROLE GOODDATA_INTEGRATION;
GRANT USAGE ON WAREHOUSE GOODDATA_INTEGRATION TO ROLE GOODDATA_INTEGRATION;

-- grant the role privileges to the database and schema
GRANT USAGE ON DATABASE {database_name} TO ROLE GOODDATA_INTEGRATION;
GRANT USAGE ON SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;

-- grant the role privileges to create an internal stage
GRANT CREATE STAGE ON SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;

-- grant the role privileges to allow to select data from all existing and future tables/views within the schema
GRANT SELECT ON ALL TABLES IN SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;
GRANT SELECT ON FUTURE TABLES IN SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;
GRANT SELECT ON ALL VIEWS IN SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA {database_name}.{schema_name} TO ROLE GOODDATA_INTEGRATION;
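
Before you continue, you can optionally verify the setup. The following commands are a minimal sanity check, assuming the asset names from the script above; they list the privileges granted to the role and confirm that the role is assigned to the user:

-- optional sanity check of the GoodData integration assets
SHOW GRANTS TO ROLE GOODDATA_INTEGRATION;  -- lists the warehouse, database, schema, and table/view privileges
SHOW GRANTS TO USER GOODDATA_INTEGRATION;  -- confirms that the GOODDATA_INTEGRATION role is granted to the user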

Also, see GoodData-Snowflake Integration Details for access rights recommended for the user who is referenced in Data Sources.

You are now going to set up a connection between Snowflake and the GoodData platform.


For Redshift users

If you are using Redshift, use the following SQL script to set up a user who will connect using IAM (role- or key-based) authentication:

-- create a group
CREATE GROUP GOODDATA_INTEGRATION;

-- create a user
CREATE USER GOODDATA_INTEGRATION PASSWORD disable IN GROUP GOODDATA_INTEGRATION;
  
-- grant the group privileges to the schema
GRANT USAGE ON SCHEMA {schema_name} TO GROUP GOODDATA_INTEGRATION;

-- grant the SELECT privilege on tables/views to the group
GRANT SELECT ON ALL TABLES IN SCHEMA {schema_name} TO GROUP GOODDATA_INTEGRATION;
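
If you want to double-check the Redshift setup, a minimal sanity check (assuming the group and user names from the script above; replace the schema and table placeholders with your values) might look like this:

-- confirm that the group exists (grolist contains the IDs of its members)
SELECT groname, grolist FROM pg_group WHERE groname = 'gooddata_integration';

-- then, connected as the GOODDATA_INTEGRATION user, confirm read access to one of your tables
SELECT 1 FROM {schema_name}.{table_name} LIMIT 1;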

Also, see GoodData-Redshift Integration Details for access rights recommended for the user who is referenced in Data Sources.

Set up a connection between Snowflake and the GoodData platform

The connection parameters are stored in an entity called a Data Source. You are going to configure the Data Source entity.

Steps:

  1. Go to the gray page for configuring Data Sources:

    https://secure.gooddata.com/gdc/dataload/dataSources
  2. Provide the Data Source parameters, and click Create:
    • Name: enter a meaningful name for your Data Source.
      Example: ACME analytics data product
    • URL: enter the URL to connect to your Snowflake account.
      Example: jdbc:snowflake://acme.snowflakecomputing.com
    • Username: enter the username of the user that will be used for Snowflake-GoodData integration.
      Example: GOODDATA_INTEGRATION
    • Password: enter the password for the specified username.
    • Database: enter the name of the database where the schema that you want to use is located.
      Example: ACME
    • Schema: enter the name of the schema where the tables and views that you want to load are located.
      Example: ANALYTICS
    • Warehouse: enter the name of the warehouse that will be used for Snowflake-GoodData integration.
      Example: GOODDATA_INTEGRATION
    • Prefix: enter the prefix that specifies what tables and views will be loaded to the GoodData platform.
      Example: out_

     
    The Data Source entity is configured, and the Data Source identifier is returned.

    The Data Source is created under the currently authenticated user of the GoodData platform. This user becomes the owner of the newly created Data Source and only this user can access and manage it.

You are now going to create a GoodData-compatible Output Stage.


For Redshift users

If you are using Redshift, switch to Redshift on the gray page for configuring Data Sources, and decide which authentication method you want to use (basic or IAM). Then, provide the Data Source parameters. The parameters are the same as when creating a Data Source for Snowflake, except that for Redshift you do not set the Warehouse parameter.
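
For reference, a Redshift connection URL typically follows the standard JDBC format (the cluster endpoint and database name below are placeholders; 5439 is the default Redshift port):

jdbc:redshift://{cluster_endpoint}:5439/{database_name}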

Create the GoodData-compatible Output Stage on top of your data

The Output Stage is a set of tables and views that serve as a source for loading data into your GoodData workspaces. To create the Output Stage, you are going to use the API that scans all tables within the schema defined in your Data Source and suggests SQL DDLs that you can use to create views on top of your tables. You will then execute those SQL DDLs on your schema.

To learn how Snowflake data types are mapped to GoodData LDM data types, see GoodData-Snowflake Integration Details.

For Redshift users: see GoodData-Redshift Integration Details.


The views must comply with the following naming conventions of the GoodData Output Stage:

| Element type | Column prefix | Column name format | Example | Notes |
|---|---|---|---|---|
| Attribute | a__ | a__{attribute_name} | a__name | |
| Fact | f__ | f__{fact_name} | f__price | |
| Primary key (connection point) | cp__ | cp__{attribute_name} | cp__id | |
| Foreign key (reference) | r__ | r__{referenced_table_name} | r__orders | |
| Date | d__ | d__{date_dimension_name} | d__orderdate | If you want to use a shared date dimension for more tables/views, use the same {date_dimension_name}. |
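
To illustrate the shared date dimension note above, here is a minimal sketch (the shipments and invoices tables and their columns are hypothetical): both views alias their date columns to the same d__date name, so the two datasets share one date dimension.

-- minimal sketch of a shared date dimension (hypothetical tables and columns)
CREATE OR REPLACE VIEW {schema_name}.out_shipments AS SELECT
        id AS cp__id,
        shipped_at AS d__date
FROM {schema_name}.shipments;

CREATE OR REPLACE VIEW {schema_name}.out_invoices AS SELECT
        id AS cp__id,
        invoiced_at AS d__date
FROM {schema_name}.invoices;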


Example:

Imagine that you have two tables within your schema. One table holds the information about your orders, and the other one holds the list of your customers.

You want to load these tables to the GoodData platform. To do that, you need to create a GoodData-compatible Output Stage.


The 'orders' table:

| Column name | Data type | Corresponding Output Stage object name |
|---|---|---|
| id | VARCHAR (primary key) | cp__id |
| customerid | INTEGER (foreign key) | r__customers |
| description | VARCHAR | a__description |
| price | NUMERIC(12,2) | f__price |
| modified_at | TIMESTAMP | x__timestamp |

Note that a view is created where the modified_at column is renamed to x__timestamp. In this example, modified_at indicates the time when a particular record within the table was modified. This timestamp allows you to use incremental loading. This column itself will not be loaded to the GoodData workspace.

For more information about the naming convention and special data loading columns that you can use (such as x__timestamp), see Naming Convention for Output Stage Objects.


The 'customers' table:

| Column name | Data type | Corresponding Output Stage object name |
|---|---|---|
| id | INTEGER (primary key) | cp__id |
| name | VARCHAR | a__name |


The corresponding Output Stage SQL DDL would look like the following:

CREATE OR REPLACE VIEW {schema_name}.out_orders AS SELECT
        id AS cp__id,
        customerid AS r__customers,
        price AS f__price,
        description AS a__description,
        modified_at AS x__timestamp
FROM {schema_name}.orders;

CREATE OR REPLACE VIEW {schema_name}.out_customers AS SELECT
        id AS cp__id,
        name AS a__name
FROM {schema_name}.customers;

Steps:

  1. Go to the gray page for generating the Output Stage:

    https://secure.gooddata.com/gdc/dataload/dataSources/{Data_Source_id}/generateOutputStage

    where {Data_Source_id} is the identifier of the Data Source that you created at the step Set up a connection between Snowflake and the GoodData platform.

  2. If your original data is located in a different schema than the one your Output Stage will use, enter the name of that schema.
  3. Click Generate output stage.

    The link for polling for status of the Output Stage generation is returned.
  4. Poll for the status until the gray page returns SQL DDLs to create GoodData-compatible views.
    Here is an example of the SQL DDLs:

    USE DATABASE "ACME";
    USE SCHEMA "ANALYTICS";
    --------------------------------------------------
    -- out_orders --
    --------------------------------------------------
    CREATE OR REPLACE VIEW "out_orders" AS SELECT
    
            "id" AS "cp__id",
            "customerid" AS "r__customers",
            "description" AS "a__description",
            "price" AS "f__price",
            "modified_at" AS "d__modified_at"
    
            , x__timestamp AS x__timestamp /* Remove this line if it's not incremental load */
            , x__client_id AS x__client_id /* Remove this line if loading into all client workspaces */
            , x__deleted AS x__deleted /* Remove this line if inserting only */
    
    FROM "orders";
  5. Copy the returned SQL DDLs and paste them into your Snowflake SQL client.
  6. Review the suggested SQL DDLs and modify them, if needed.
    Specifically, make sure that your timestamp column (in this example, modified_at, which the generator suggested mapping to "d__modified_at") is renamed to x__timestamp.
  7. Execute the SQL DDLs.
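
As a quick check (a minimal sketch, assuming the out_ prefix and the database and schema from your Data Source), you can confirm that the Output Stage views exist and are readable:

-- optional check; quote the view name if it was created with quoted, lowercase identifiers
SHOW VIEWS LIKE 'out_%' IN SCHEMA {database_name}.{schema_name};
SELECT * FROM {database_name}.{schema_name}."out_orders" LIMIT 10;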

You have configured everything on the Snowflake side.

You are now going to create a GoodData workspace where your data will be loaded to.

Create a GoodData workspace

A GoodData workspace (also known as 'project') contains all data and metadata, such as dashboards, reports, insights, and metrics.

  • If you already have a GoodData workspace for loading data from your Snowflake instance, find the workspace ID and copy it for later use (see Find the Project ID).
  • If you do not have a GoodData workspace for loading data from your Snowflake instance, create one (see Create a Project). Copy the ID of the newly created workspace and save it for later use.

    To create a new workspace, you must have the authorization token. If you do not know it, find the authorization token of any of your existing workspaces or of the trial workspace that was automatically created when you registered with GoodData. For more information, see Find the Project Authorization Token.

You are now going to create a logical data model in your GoodData workspace.

Generate a logical data model in your GoodData workspace

To load data into your workspace, you need a Logical Data Model (LDM) created within your workspace. The LDM provides a layer of abstraction between the information that a GoodData user accesses and the method that is used to store data.

Based on the Output Stage that you created earlier in this tutorial, you are going to generate the JSON structure describing the logical data model. The types of LDM elements are recognized based on the Output Stage naming convention. Then, you will apply the generated JSON to your workspace. Dividing the process of creating the LDM into several steps allows you to customize any step.

To learn more about GoodData data modeling, see GoodData Modeling Concepts.

Steps:

  1. Go to the gray page for generating the logical data model:

    https://secure.gooddata.com/gdc/dataload/dataSources/{Data_Source_id}/generateModel

    where {Data_Source_id} is the identifier of the Data Source that you have created at the step Set up a connection between Snowflake and the GoodData platform.

  2. Click Generate project model.

    The link for polling for status of the LDM generation is returned.
  3. Poll for the status until the gray page returns the JSON structure describing the logical data model. The JSON is compatible with the API for generating a MAQL DDL script that creates or updates an LDM.
    Here is an example of the JSON structure:

    {
      "projectModelView": {
        "model":
          {
            "projectModel": {
              "datasets": [ {
                "dataset": {
                  "identifier": "dataset.orders",
                  "title": "orders",
                  "anchor": {
                    "attribute": {
                      "identifier": "attr.orders.id",
                      "title": "id",
                      "folder": "orders",
                      "labels": [ {
                        "label": {
                          "identifier": "label.orders.id",
                          "title": "id",
                          "type": "GDC.text",
                          "dataType": "VARCHAR (128)"
                        }
                      } ]
                    }
                  },
                  "attributes": [ {
                    "attribute": {
                      "identifier": "attr.orders.description",
                      "title": "description",
                      "folder": "orders",
                      "labels": [ {
                        "label": {
                          "identifier": "label.orders.description",
                          "title": "description",
                          "type": "GDC.text",
                          "dataType": "VARCHAR (128)"
                        }
                      } ]
                    }
                  } ],
                  "facts": [ {
                    "fact": {
                      "identifier" : "fact.orders.price",
                      "title": "price",
                      "folder": "orders",
                      "dataType": "DECIMAL (12, 2)"
                    }
                  } ],
                  "production": true
                }
              }
            ]
          }
        }
      }
    }
  4. Copy the returned JSON structure starting with the object that contains the projectModel key, which is the following segment:

    {
      "projectModel": {
        ...

    Do not copy the outer wrapper, that is, the projectModelView and model keys and their enclosing braces.

  5. Go to the gray page for generating a MAQL DDL script (you will be using the generated script to create the logical data model later in this procedure):

    https://secure.gooddata.com/gdc/projects/{project_id}/model/diff

    where {project_id} is the ID of your workspace that you obtained at the step Create a GoodData workspace.

  6. Paste the copied JSON structure into the gray page form, and click Create Diff.

    The link for polling for status of the MAQL DDL generation is returned.

  7. Poll for the status until the gray page returns the MAQL DDLs to create the logical data model. The MAQL DDLs are compatible with the API for executing a MAQL DDL script that creates or updates an LDM.
  8. In the returned MAQL DDLs, locate the maqlDdl key, and copy the key content.
    Here is an example of the maqlDdl key:

    "maqlDdl": "CREATE DATASET {dataset.orders} VISUAL(TITLE \"orders\");\n
    CREATE FOLDER {dim.orders} VISUAL(TITLE \"orders\") TYPE ATTRIBUTE;\n
    CREATE ATTRIBUTE {attr.orders.id} VISUAL(TITLE \"id\", FOLDER {dim.orders}) 
    AS KEYS {f_orders.id} FULLSET;\nALTER DATASET {dataset.orders} 
    ADD {attr.orders.id};\nALTER ATTRIBUTE {attr.orders.id} 
    ADD LABELS {label.orders.id} VISUAL(TITLE \"id\") AS {f_orders.nm_id};\n
    CREATE ATTRIBUTE {attr.orders.description} VISUAL(TITLE \"description\", 
    FOLDER {dim.orders}) AS KEYS {d_orders_description.id} FULLSET, 
    {f_orders.description_id};\nALTER DATASET {dataset.orders} 
    ADD {attr.orders.description};\nALTER ATTRIBUTE {attr.orders.description} 
    ADD LABELS {label.orders.description} VISUAL(TITLE \"description\") 
    AS {d_orders_description.nm_description};\nCREATE FOLDER {ffld.orders} 
    VISUAL(TITLE \"orders\") TYPE FACT;\nCREATE FACT {fact.orders.price} 
    VISUAL(TITLE \"price\", FOLDER {ffld.orders}) AS {f_orders.f_price};\n
    ALTER DATASET {dataset.orders} ADD {fact.orders.price};\nSYNCHRONIZE {dataset.orders};",
    
  9. Use any free online JSON escape tool to make the following changes in the copied content:

    1. Remove all instances of \n.
    2. Replace \" with ".

    Your final content should look similar to the following:

    CREATE DATASET {dataset.orders} VISUAL(TITLE "orders");
    CREATE FOLDER {dim.orders} VISUAL(TITLE "orders") TYPE ATTRIBUTE;
    CREATE ATTRIBUTE {attr.orders.id} VISUAL(TITLE "id", FOLDER {dim.orders}) AS KEYS {f_orders.id} FULLSET;
    ALTER DATASET {dataset.orders} ADD {attr.orders.id};
    ALTER ATTRIBUTE {attr.orders.id} ADD LABELS {label.orders.id} VISUAL(TITLE "id") AS {f_orders.nm_id};
    CREATE ATTRIBUTE {attr.orders.description} VISUAL(TITLE "description", FOLDER {dim.orders}) AS KEYS {d_orders_description.id} FULLSET, {f_orders.description_id};
    ALTER DATASET {dataset.orders} ADD {attr.orders.description};
    ALTER ATTRIBUTE {attr.orders.description} ADD LABELS {label.orders.description} VISUAL(TITLE "description") AS {d_orders_description.nm_description};
    CREATE FOLDER {ffld.orders} VISUAL(TITLE "orders") TYPE FACT;
    CREATE FACT {fact.orders.price} VISUAL(TITLE "price", FOLDER {ffld.orders}) AS {f_orders.f_price};
    ALTER DATASET {dataset.orders} ADD {fact.orders.price};
    SYNCHRONIZE {dataset.orders};


  10. Copy the final content.
  11. Go to the gray page for generating the logical data model:

    https://secure.gooddata.com/gdc/md/{project_id}/ldm/manage2

    where {project_id} is the ID of your workspace that you obtained at the step Create a GoodData workspace.

  12. Paste the final content into the gray page form, and click submit.

    The link for polling for status of the LDM generation is returned.

  13. Poll for the status until the gray page returns OK.
    The logical data model is created within your workspace.


Once you have created a logical data model, we recommend that you use the API for validating the mapping between the Output Stage and the logical data model.

You are now going to deploy, schedule, and run the Automated Data Distribution (ADD) process that will load the data from your Snowflake Output Stage to the workspace.

Deploy, schedule, and run the ADD process

Deploy the ADD process to the GoodData platform and then create a schedule for the deployed process to automatically execute the data loading process at a specified time.

Steps:

  1. From the Projects tab of the Data Integration Console (see Accessing Data Integration Console), select your workspace.
  2. Click Deploy Process.

    The deploy dialog opens.
  3. From the Component dropdown, select Automated Data Distribution.
  4. From the Data Source dropdown, select the Data Source that you have created at the step Set up a connection between Snowflake and the GoodData platform.
  5. Enter a descriptive name for the ADD process.
  6. Click Deploy.
    The ADD process is deployed.
    You are now going to schedule the deployed process.
  7. For the deployed ADD process, click Create new schedule.
  8. In the Runs dropdown, set the frequency of execution to manually. This means that the process will run only when manually triggered.
  9. (Optional) If you want to increase the number of download processes to execute in parallel (by default, it is 1), add the queryParallelism parameter and set it to 2, 3, or 4.
  10. Click Schedule.
    The schedule is saved and opens for your preview.
    You are now going to manually run the scheduled process.
  11. Click Run.
    The schedule is queued for execution and is run as platform resources are available.
  12. If the schedule fails with errors, fix the errors, and run the schedule again. Repeat until the process finishes with a status of OK, which means that the ADD process has loaded the data to your workspace.
  13. (Optional) In the Runs dropdown, set the frequency of execution to whatever schedule fits your business needs. Click Save.
    The schedule is saved.

You can now start analyzing your data. Use Analytical Designer and create your first insight.

Summary

You have completed the tutorial. You should now be able to set up direct data distribution from Snowflake.