Workspace Data Loading Blueprint
The workspace data loading blueprint is the simplest blueprint in terms of its GoodData platform footprint. It suits solutions that rely on the GoodData platform’s scalable distribution of the analytical experience. This blueprint relies on a third-party process (usually running outside the GoodData platform) that pushes data to one or more GoodData workspaces via the GoodData SLI REST API (for details, see Loading Data via REST API).
Capabilities
Data is loaded on a per-dataset basis. The GoodData platform allows only one dataset to be loaded per workspace at any given time (concurrent uploads to the same workspace are not supported). Both incremental and full data loads into a GoodData workspace are supported. This blueprint relies on the GoodData SLI REST API (see Loading Data via REST API). Several SDKs wrap the REST API in language-specific programmatic components.
Appending vs. Upserting vs. Deleting Data
The following mechanisms control how data is uploaded to or removed from a dataset via the GoodData SLI REST API:
- Mode parameter in the upload manifest. The mode parameter supports the following values:
  - FULL removes all existing records from the loaded dataset and inserts the data passed to the GoodData SLI REST API.
    - If there are rows with duplicate keys (either fact table grain or connection point), the “last row wins”.
  - INCREMENTAL keeps the dataset’s current data and merges/upserts (inserts or updates) the data submitted to the GoodData SLI REST API.
    - The “last row wins” strategy applies here as well, regardless of whether the “last row” is in the same data increment or a subsequent one.
  - DELETE-CENTER deletes old data from a dataset via the GoodData SLI REST API. It can be used together with INCREMENTAL mode.
- Dataset connection point. If the uploaded dataset has a connection point (anchor), records with the same connection point value are replaced (the “last row wins” strategy is used).
- Fact table grain. The fact table grain feature (for more information, see Set the Grain of a Fact Table to Avoid Duplicate Records) works in the same way as the connection point. The key difference is that it is significantly faster than the connection point.
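The difference between FULL and INCREMENTAL, and the “last row wins” rule, can be illustrated with a small local model. This is only a sketch of the semantics described above, not a call to the GoodData SLI REST API: the `load` function, the `(key, value)` row shape, and the use of a dictionary to stand in for a dataset are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical row shape: (key, value). The key stands in for the
# connection point or fact table grain of the dataset.
Row = Tuple[str, int]


def load(existing: Dict[str, int], rows: Iterable[Row], mode: str) -> Dict[str, int]:
    """Return the modeled dataset contents after loading `rows` in `mode`."""
    if mode == "FULL":
        result: Dict[str, int] = {}      # FULL drops all existing records
    elif mode == "INCREMENTAL":
        result = dict(existing)          # INCREMENTAL keeps current data
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    for key, value in rows:
        result[key] = value              # a later row with the same key wins
    return result


current = {"a": 1, "b": 2}
print(load(current, [("b", 5), ("c", 7)], "INCREMENTAL"))
print(load(current, [("b", 5), ("b", 6)], "FULL"))
```

In this model, the INCREMENTAL call keeps `a`, updates `b`, and inserts `c`, while the FULL call discards the existing records and keeps only the last row submitted for `b`.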
Workspace Data Loading Guidelines
This section describes the guidelines for the workspace data loading blueprint.
Incremental Data Loading
We recommend loading data incrementally when it exceeds approximately 25M records / 1 GB of total loaded data (uncompressed size).
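A loading process can apply this guideline with a simple check before each upload. The sketch below is an assumption-laden illustration: the function and constant names are hypothetical, 1 GB is taken as 10^9 bytes, and exceeding either threshold is treated as sufficient to prefer incremental loading.

```python
# Thresholds taken from the guideline above; both values are approximate.
RECORD_THRESHOLD = 25_000_000
SIZE_THRESHOLD_BYTES = 1_000_000_000  # assumption: 1 GB = 10^9 bytes


def recommend_mode(record_count: int, uncompressed_bytes: int) -> str:
    """Suggest a load mode for a dataset of the given size (hypothetical helper)."""
    # Assumption: crossing either threshold is enough to go incremental.
    if record_count > RECORD_THRESHOLD or uncompressed_bytes > SIZE_THRESHOLD_BYTES:
        return "INCREMENTAL"
    return "FULL"


print(recommend_mode(record_count=40_000_000, uncompressed_bytes=600_000_000))
```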
Ordering the Dataset Loading
Connected datasets must be loaded in a particular order. Make sure that all datasets referenced from a particular dataset are loaded before that dataset is loaded. In other words, load the connected datasets in the “left-to-right” direction when looking at the workspace’s LDM visualization.
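The required order is a topological ordering of the dataset reference graph. The following sketch derives it with Python's standard `graphlib` module (Python 3.9 or later). The dataset names and the `references` mapping are made up for illustration; in practice the references would come from your own LDM.

```python
from graphlib import TopologicalSorter


def load_order(references: dict[str, set[str]]) -> list[str]:
    """Order datasets so that every referenced dataset precedes its referrer.

    `references` maps each dataset to the set of datasets it references.
    """
    # TopologicalSorter treats the mapped values as predecessors, so the
    # referenced datasets are emitted before the datasets that reference them.
    return list(TopologicalSorter(references).static_order())


# Hypothetical model: orders references customers and products,
# customers references regions.
references = {
    "orders": {"customers", "products"},
    "customers": {"regions"},
    "products": set(),
    "regions": set(),
}
print(load_order(references))
```

If the references contain a cycle, `static_order()` raises `graphlib.CycleError`, which signals that no valid loading order exists for the mapping as given.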
Multiple Datasets Loading
The GoodData platform supports the so-called batch load of CSV data (REST API), in which multiple CSV files are submitted at the same time. This multiload optimizes the data load processing on the GoodData platform side. We recommend using this optimization as much as possible. See the following tutorials for more details:
- Batch load of CSV data (REST API) - see Data Loading Modes in CloudConnect
- Loading multiple datasets with Ruby SDK
Fact Table Grain
We recommend using the fact table grain feature (for more information, see Set the Grain of a Fact Table to Avoid Duplicate Records) for incremental loading of large volumes of data. Use the fact table grain instead of the connection point on large fact tables. Because these tables have no connection point, they can’t be referenced from any other dataset (they are the ultimate centers of your LDM visualization). Consider merging large datasets that have a connection point (e.g. dimensions) with the datasets they reference in order to remove the connection point and use the fact table grain instead.
Connection Point Values
The fact table grain can’t be used for datasets that are referenced from other datasets (most frequently dimensions). You have to use the connection point for these datasets. Make sure that the values of the connection points are the same as in the previous load. The GoodData platform “remembers” every unique connection point value. Data loading becomes considerably slower when the number of unique connection point values (including those that are no longer used) reaches roughly tens of millions.
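Before each load, a loading process can compare the connection point values it is about to send with those sent previously, to spot unexpected churn. The sketch below is a local bookkeeping model only: the function name, the result keys, and the idea of keeping the previously loaded values in a Python set are assumptions, and nothing here queries the platform.

```python
from typing import Dict, Iterable, Set, Union


def connection_point_drift(
    remembered: Set[str], current_load: Iterable[str]
) -> Dict[str, Union[Set[str], int]]:
    """Compare values in the upcoming load with those loaded before."""
    current = set(current_load)
    return {
        # Values never loaded before: each one adds to the remembered total.
        "new": current - remembered,
        # Values loaded earlier but absent now: still remembered until stripped.
        "unused": remembered - current,
        # Modeled count of unique values remembered after this load.
        "remembered_total": len(remembered | current),
    }


previous = {"c-001", "c-002", "c-003"}
print(connection_point_drift(previous, ["c-002", "c-003", "c-004"]))
```

A steadily growing `unused` set in this model would suggest that key values are being regenerated between loads rather than kept stable.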
The connection point values that are no longer used can be stripped with the optimize lookups functionality:
https://secure.gooddata.com/gdc/md/your-workspace-id/etl/mode
Use this resource when many of the unique connection point values that were loaded in the past are no longer present in your workspace’s records.
Limitations, Concurrency and Performance
The current platform limits related to workspace uploads are listed on the Platform Limits page. Viewing that page requires a login. If you do not have access, you can register by signing in on the GoodData support portal.