Direct Data Distribution from Data Warehouses and Object Storage Services

Direct data distribution from data warehouses and object storage services covers extracting consolidated and cleaned data directly from a data warehouse/object storage service and distributing it to your GoodData workspaces.


Supported Data Warehouses and Object Storage Services

In addition to the GoodData ADS (Agile Data Warehousing Service; see Data Warehouse), the GoodData platform supports direct integration with the following third-party data warehouses:

  • Snowflake
  • Amazon Redshift
  • Google BigQuery

The GoodData platform also supports direct integration with the Amazon S3 object storage service (https://aws.amazon.com/s3/).

You can integrate data from your Snowflake instance, Redshift cluster, BigQuery project, or Amazon S3 bucket directly into the GoodData platform. The Automated Data Distribution (ADD) v2 process synchronizes data from the data warehouse/object storage service with your customers’ workspaces on a defined schedule. This approach is key to building optimized multi-tenant analytics for all your customers and users without the runaway costs of executing direct queries against your data warehouse or object storage service. For more information about ADD v2 benefits and usage, see Automated Data Distribution v2 for Data Warehouses and Automated Data Distribution v2 for Object Storage Services.

Setting up direct data distribution from a data warehouse/object storage service requires actions both on your Snowflake instance, Redshift cluster, BigQuery project, or S3 bucket and in your GoodData workspace. Follow our step-by-step tutorials, which walk you through integrating your data warehouse/object storage service with GoodData and automate as much of the integration as possible.

Depending on your experience, you can start with your own data or you can first try using our sample data for your warehouse-GoodData integration to better understand the processes involved:

If you have a GoodData workspace with the logical data model (LDM) that meets your business requirements for data analysis, see Integrate Data Warehouses Directly to GoodData based on an Existing LDM.

For a step-by-step tutorial on integrating your S3 bucket with GoodData, see Integrate Data from Your Amazon S3 Bucket Directly to GoodData.

Components of Direct Data Distribution

Data Source

In data warehouses, a Data Source is an entity that stores data warehouse credentials and the location of the Output Stage.

The Data Source is the main reference point when you are performing the following tasks:

  • Generating the Output Stage. During this process, the platform scans the data warehouse schema referenced by your Data Source and generates recommended views for the Output Stage.
  • Generating an LDM. The process scans the Output Stage connected to your Data Source and produces an LDM definition, which you can then use to generate the LDM in your workspace.
  • Validating the mapping between the Data Source and the LDM. The Output Stage connected to your Data Source is compared to the LDM, and a list of inconsistencies is returned. Validate the mapping after you change the Output Stage or the LDM to see what updates are required.
  • Managing the mapping between workspaces and your customers' data. You can provide a custom mapping scheme that matches project IDs to client IDs. For more information about client IDs, see Automated Data Distribution v2 for Data Warehouses.

You can perform all the above tasks using individual API calls. For more information about creating and listing Data Sources, see the API Reference.
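
As a sketch of what such an API call can look like, the following Python snippet creates a Data Source for a Snowflake instance. The endpoint path, payload fields, and credentials are illustrative assumptions based on the Data Source structure described above (warehouse credentials plus the Output Stage location); consult the API Reference for the authoritative request format:

    import requests

    GOODDATA_HOST = "https://secure.gooddata.com"  # assumption: your platform domain

    # Assumption: the payload mirrors the Data Source structure described
    # above -- warehouse credentials plus the location of the Output Stage.
    data_source = {
        "dataSource": {
            "name": "snowflake-prod",
            "prefix": "out_",  # prefix shared by the Output Stage objects
            "connectionInfo": {
                "snowflake": {
                    "url": "jdbc:snowflake://acme.snowflakecomputing.com",
                    "authentication": {
                        "basic": {"userName": "gooddata_loader", "password": "***"}
                    },
                    "database": "ANALYTICS",
                    "schema": "OUTPUT_STAGE",   # where the Output Stage lives
                    "warehouse": "GOODDATA_WH",
                }
            },
        }
    }

    session = requests.Session()
    session.headers.update({"Accept": "application/json"})
    # Platform authentication (SST/TT tokens) is omitted for brevity.
    response = session.post(f"{GOODDATA_HOST}/gdc/dataload/dataSources", json=data_source)
    response.raise_for_status()
    print("Created Data Source:", response.json())

Listing the existing Data Sources would be a GET request against the same resource; again, see the API Reference for the exact contract.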


In object storage services, a Data Source is an entity that stores connection details for an object storage service. For more information, see Integrate Data from Your Amazon S3 Bucket Directly to GoodData.

Output Stage

The Output Stage is used only when you set up direct data distribution from data warehouses. It is not used with object storage services.

The Output Stage is a set of tables and/or views that serve as the source for loading data into your workspaces. You can prepare the Output Stage manually or generate it for the Data Source.

If you decide to create the Output Stage manually, make sure that the tables and views follow the naming convention for Output Stage objects (see Naming Convention for Output Stage Objects in Automated Data Distribution v2 for Data Warehouses).

If you decide to generate the Output Stage for the Data Source, review the resulting SQL code to check the following:

  • All columns that should serve as connection points are prefixed with cp__.
  • All columns that should serve as references are prefixed with r__.
  • Attributes that are represented by numeric values (for example, a customer tier that can be 1, 2, or 3) are prefixed with a__. Otherwise, the Data Source by default identifies all columns with a numeric data type (INT, FLOAT, and others) as facts (the prefix f__).

For more information, see Naming Convention for Output Stage Objects in Automated Data Distribution v2 for Data Warehouses.

You can generate the Output Stage from a different schema than the schema that will contain the Output Stage.
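
To make the naming convention and the cross-schema setup concrete, here is a minimal sketch that creates one Output Stage view in Snowflake using the official Python connector. All object names are hypothetical; the view is created in the OUTPUT_STAGE schema while its SELECT reads from a different schema, SOURCE_DATA. The x__client_id column is assumed to follow the ADD v2 convention for routing rows to client workspaces:

    import snowflake.connector  # official Snowflake Python connector

    # Assumption: hypothetical connection parameters; the session's default
    # schema (OUTPUT_STAGE) is where the Output Stage view is created.
    conn = snowflake.connector.connect(
        user="gooddata_loader",
        password="***",
        account="acme",
        warehouse="GOODDATA_WH",
        database="ANALYTICS",
        schema="OUTPUT_STAGE",
    )

    # Column prefixes follow the Output Stage naming convention:
    # cp__ = connection point, r__ = reference, a__ = attribute, f__ = fact.
    create_view = """
    CREATE OR REPLACE VIEW out_orders AS
    SELECT
        o.order_id      AS cp__order_id,     -- connection point
        o.customer_id   AS r__customer_id,   -- reference to another dataset
        o.customer_tier AS a__customer_tier, -- numeric attribute, not a fact
        o.order_amount  AS f__order_amount,  -- fact
        o.client_id     AS x__client_id      -- routes rows to client workspaces
    FROM SOURCE_DATA.orders o
    """

    conn.cursor().execute(create_view)
    conn.close()

Note that without the a__ prefix, customer_tier would be treated as a fact because of its numeric data type.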

For more information, see the following tutorials:

  • Integrate Data Warehouses Directly to GoodData based on an Existing LDM
  • Integrate Data from Your Amazon S3 Bucket Directly to GoodData
