Direct Data Distribution from Data Warehouses

Direct data distribution from data warehouses covers extracting consolidated and cleaned data directly from a data warehouse and distributing it to your workspaces.

Integration with the following data warehouses is supported:

  • Snowflake
  • Amazon Redshift

You can integrate data from your Snowflake instance or Redshift cluster directly into the GoodData platform. The Automated Data Distribution (ADD) process synchronizes data from Snowflake/Redshift with your customers’ workspaces based on a defined schedule. This is a key approach in building optimized multi-tenant analytics for all your customers and users without the runaway costs associated with executing direct queries to your data warehouse. For more information about ADD benefits and usage, see Automated Data Distribution Reference.

Setting up direct data distribution from Snowflake/Redshift requires actions on your Snowflake instance/Redshift cluster and on the GoodData platform.

Follow our step-by-step tutorials to integrate Snowflake/Redshift with GoodData with as much automation as possible.
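
For illustration, the warehouse-side part of the setup for Snowflake might look like the following sketch: a dedicated role and user that the GoodData platform can use to read the Output Stage schema. All object names (GOODDATA_ROLE, GOODDATA_USER, the ANALYTICS.OUT schema, the GOODDATA_WH warehouse) are placeholders; the tutorial for your warehouse lists the exact privileges required.

    -- Hypothetical sketch for Snowflake; all names are placeholders.
    CREATE ROLE IF NOT EXISTS GOODDATA_ROLE;

    -- Read-only access to the schema that will hold the Output Stage.
    GRANT USAGE ON DATABASE ANALYTICS TO ROLE GOODDATA_ROLE;
    GRANT USAGE ON SCHEMA ANALYTICS.OUT TO ROLE GOODDATA_ROLE;
    GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.OUT TO ROLE GOODDATA_ROLE;
    GRANT SELECT ON ALL VIEWS IN SCHEMA ANALYTICS.OUT TO ROLE GOODDATA_ROLE;
    GRANT SELECT ON FUTURE TABLES IN SCHEMA ANALYTICS.OUT TO ROLE GOODDATA_ROLE;

    -- A dedicated virtual warehouse keeps the cost of ADD loads isolated.
    GRANT USAGE ON WAREHOUSE GOODDATA_WH TO ROLE GOODDATA_ROLE;

    -- The user whose credentials you will store in the Data Source.
    CREATE USER IF NOT EXISTS GOODDATA_USER
      PASSWORD = '<strong-password>'
      DEFAULT_ROLE = GOODDATA_ROLE;
    GRANT ROLE GOODDATA_ROLE TO USER GOODDATA_USER;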

Components of Direct Data Distribution

Data Source

A Data Source is an entity that stores data warehouse credentials and the location of the Output Stage.

The Data Source is the main reference point when you are performing the following tasks:

  • Generating the Output Stage. The API that you are using scans the data warehouse schema stored in your Data Source and generates recommended views for the Output Stage.
  • Generating a logical data model (LDM). The API scans the Output Stage connected to your Data Source and provides a definition of the LDM, which can then be used for generating the LDM in your workspace via the API.
  • Validating the mapping between the Data Source and the LDM. The API compares the Output Stage connected to your Data Source to the LDM and returns a list of inconsistencies. Validate the mapping after you have changed the Output Stage or the LDM to see what changes are required.
  • Managing data mapping items. You can use data mapping to override how data is loaded into workspaces. To do so, provide an alternative mapping for the project_id or client_id values.

For more information about creating and listing Data Sources, see the API Reference.

Output Stage

The Output Stage is a set of tables and/or views that will serve as the source for loading data to the GoodData platform. You can prepare the Output Stage manually or generate it for the Data Source.

If you decide to create the Output Stage manually, make sure that the following requirements are met:

  • The Output Stage is located in the warehouse, database, and schema that you specified in your Data Source.
  • All views and tables in your Output Stage have the prefix that you specified in the Data Source. All views and tables without the prefix are ignored during the data load.
  • All columns are named according to the naming convention (see Naming Convention for Output Stage Objects). Column names can contain double underscores (__) only as part of the prefix.
  • All columns in the Output Stage are compatible with the GoodData data types. For more information, see GoodData-Snowflake Integration Details and GoodData-Redshift Integration Details.
  • The names of the views and tables in the Output Stage do not contain special characters.
  • There are no name collisions between your columns/tables and the GoodData technical columns and tables (see 'Special Columns in Output Stage Tables' in Naming Convention for Output Stage Objects).
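
To make these requirements concrete, here is a minimal sketch of a manually created Output Stage view. It assumes that out_ is the prefix and analytics.out is the schema specified in the Data Source; the source table and its columns are hypothetical, and the column prefixes are those referenced in Naming Convention for Output Stage Objects.

    -- Hypothetical example; out_ stands for the prefix set in the Data Source.
    CREATE OR REPLACE VIEW analytics.out.out_orders AS
    SELECT
        order_id     AS cp__order_id,  -- connection point
        customer_id  AS r__customer,   -- reference to another Output Stage view
        order_status AS a__status,     -- attribute
        order_amount AS f__amount      -- fact
    FROM analytics.src.orders;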

If you decide to generate the Output Stage for the Data Source, review the resulting SQL code to check the following:

  • All columns that should serve as connection points are prefixed with cp__.
  • All columns that should serve as references are prefixed with r__.
  • Attributes that are represented by numeric values (for example, a customer tier that can be 1, 2, or 3) are prefixed with a__. Otherwise, the Data Source by default identifies all columns with a numeric data type (INT, FLOAT, and so on) as facts (the prefix f__); see the example after this list.
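
For example, suppose the generated code mapped a numeric customer_tier column to a fact. A minimal sketch of the correction, with hypothetical table and column names:

    -- Generated: the numeric customer_tier column defaulted to a fact (f__).
    -- Corrected: change the prefix so the tier is loaded as an attribute.
    CREATE OR REPLACE VIEW out_customer AS
    SELECT
        customer_id   AS cp__customer_id,
        customer_tier AS a__customer_tier  -- attribute despite the numeric type
    FROM src.customer;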

You can generate the Output Stage from a different schema than the schema that will contain the Output Stage.

Best Practices

For better data load performance, we recommend that you apply the following best practices:

  • Use tables/views to store data for all customers together with a client_id differentiator, as shown in the sketch after this list. ADD can load such data into the customers’ workspaces much faster than when the data is stored per customer in dedicated tables/views.
  • Use incremental loads instead of full loads.
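
A minimal sketch that combines both practices, extending the hypothetical out_orders view from the earlier example. It assumes the special column names described in 'Special Columns in Output Stage Tables' (x__client_id for the client differentiator, x__timestamp for incremental loads); verify the exact names in Naming Convention for Output Stage Objects.

    -- One shared view for all customers; ADD splits the rows by client.
    CREATE OR REPLACE VIEW analytics.out.out_orders AS
    SELECT
        client_id    AS x__client_id,  -- routes each row to its workspace
        updated_at   AS x__timestamp,  -- lets ADD load only new or changed rows
        order_id     AS cp__order_id,
        order_amount AS f__amount
    FROM analytics.src.orders;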