ELT Blueprint FAQ
Question: Our multi-tenant solution requires downloading separate files for each tenant. How should we process the individual files? Should we first collect all files, consolidate them outside of ADS, and then load (COPY LOCAL) them into the ADS input stage tables? Or can we load (COPY LOCAL) each file separately?
Answer: We recommend downloading all tenant files to S3 (single platform process execution), consolidating the files on S3 and loading them into ADS (single platform process execution), transforming the data via SQL queries into the output stage tables (single platform process execution), and loading the data into the tenants’ workspaces (single platform process execution per tenant workspace). Regarding consolidation of multiple files on S3: if all files have the same structure, you can load them all with a single COPY LOCAL statement instead of consolidating them first.
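For illustration, here is a minimal sketch of loading several same-structure tenant files with one statement; the schema, table, column, and file names are hypothetical, and Vertica-style COPY syntax is assumed for ADS:

```sql
-- Minimal sketch: all listed files must share the same structure.
-- input_stage.orders and the file paths are hypothetical examples.
COPY input_stage.orders (tenant_id, order_id, amount, created_at)
FROM LOCAL '/downloads/tenant_a_orders.csv', '/downloads/tenant_b_orders.csv'
DELIMITER ','
ABORT ON ERROR;
```

Loading all files in one statement keeps the number of load statements (and the associated locking) low compared to one COPY LOCAL per tenant file.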
Question: Can the downloader data increments include data updates (as opposed to pure data additions)? Does this affect the performance of the merge?
Answer: Yes, the downloaded data on S3 can contain duplicates; these are resolved later during the merge processing. In fact, we recommend implementing certain safety overlaps to make sure that downloaders do not lose any data. Ideally, the de-duplication process uses an append-only strategy, which has the lowest lock contention. Although the MERGE command has higher lock contention, it is usually the most convenient (and performant) way to implement the de-duplication. In either case, all tenants’ data should be processed at once (no INSERT / MERGE per tenant).
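As an illustration only, the following sketch shows a single MERGE that de-duplicates an increment for all tenants at once; the table and column names are hypothetical, and the inner query keeps only the latest version of each record in case the increment itself contains duplicates from the safety overlaps:

```sql
MERGE INTO stage.orders tgt
USING (
    -- keep only the newest version of each record within the increment,
    -- since safety overlaps can produce duplicate rows
    SELECT tenant_id, order_id, amount, updated_at
    FROM (
        SELECT tenant_id, order_id, amount, updated_at,
               ROW_NUMBER() OVER (PARTITION BY tenant_id, order_id
                                  ORDER BY updated_at DESC) AS rn
        FROM input_stage.orders_increment
    ) ranked
    WHERE rn = 1
) src
ON tgt.tenant_id = src.tenant_id AND tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    amount     = src.amount,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (tenant_id, order_id, amount, updated_at)
    VALUES (src.tenant_id, src.order_id, src.amount, src.updated_at);
```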
Question: Where are the download, extract, and merge platform processes deployed?
Answer: In a single-tenant solution, all platform processes are deployed in the single workspace. In a multi-tenant solution, the download, extract, merge, and transform processes are deployed to a common (shared) workspace that contains no dashboards or reports. The workspace loading platform processes are always deployed to the tenant’s workspace.
Question: Can customers push data to the GoodData platform?
Answer: Yes, this scenario is supported. Customers can push data to a (properly secured) shared location (for example, WebDAV over HTTPS or Amazon S3). The downloader then downloads the data from the shared location. The downloader schedule can be invoked by the push process when it finishes (via the schedule execution REST API). The downloader platform process can also be executed manually via the schedule’s Run Now button. Please note that we don’t support pushing data directly to the downloader’s Amazon S3 bucket mentioned above in this blueprint. Also, we do not recommend using the GoodData platform’s workspace-level WebDAV storage for data push purposes. For more details, see Workspace Specific Data Storage.
Question: How does GoodData back up the data pipeline persistent data?
Answer: We do full ADS cluster backups that are suitable for restoring all GoodData customers at the same time. Restoring a single ADS schema is very complicated. Also, the ADS backup is stored in the same datacenter (optimized for fast recovery), so it is not fully fault-tolerant. GoodData’s ultimate disaster recovery relies on the Amazon S3 bucket where the downloader collects the data files. These files would need to be reloaded in case of a disaster event. Your solution should automate this ultimate disaster recovery mechanism.
Question: How to make sure the workspaces will not start loading before the ADS output stage tables are ready?
Answer: Use run-after scheduling to sequence the transformation and workspace loading platform processes. Using ADS transactions to enforce consistency among multiple ADS output stage tables is not recommended (this approach is very error-prone and involves exclusive locking).
Question: Can I DELETE or UPDATE data in ADS?
Answer: Try to avoid DELETEs and UPDATEs as much as you can. Use TRUNCATE TABLE or append-only strategies to implement data retention. Consider ADS table partitioning and DROP PARTITION.
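A minimal sketch of the partitioning approach, with a hypothetical table partitioned by month; the exact partition-dropping function name and signature depend on the ADS (Vertica) version, so treat it as illustrative:

```sql
-- Hypothetical table partitioned by month so that old data can be removed
-- without DELETE statements.
CREATE TABLE out_stage.events (
    workspace_id VARCHAR(32)  NOT NULL,   -- tenant discriminator
    event_id     INTEGER      NOT NULL,
    occurred_at  TIMESTAMP    NOT NULL,
    payload      VARCHAR(1000)
)
PARTITION BY EXTRACT(YEAR FROM occurred_at) * 100 + EXTRACT(MONTH FROM occurred_at);

-- Retention: drop a whole month at once instead of deleting rows.
SELECT DROP_PARTITION('out_stage.events', 201601);
```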
Question: Can I create the ADS output tables on per-tenant basis in the multi-tenant solutions?
Answer: No. Please use shared ADS output stage tables with a discriminator column (ideally the workspace ID) instead. The main concern here is the ADS catalog size.
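For illustration, a hypothetical shared output stage table with the workspace ID as the discriminator column (the load_id column is used by the incremental loading sketch further below):

```sql
-- One shared table for all tenants; workspace_id discriminates the rows.
CREATE TABLE out_stage.customers (
    workspace_id VARCHAR(32) NOT NULL,  -- tenant discriminator
    customer_id  INTEGER     NOT NULL,  -- business key (unique per tenant only)
    name         VARCHAR(255),
    created_at   TIMESTAMP,
    load_id      INTEGER     NOT NULL   -- used for incremental workspace loads
);
```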
Question: My tenants’ data does not have IDs that are unique across all tenants. What should I do?
Answer: Use a composite key containing the business key plus a tenant discriminator column.
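Continuing the hypothetical out_stage.customers sketch above, the composite key can be declared like this (note that ADS may not enforce the constraint, so the merge and transformation logic still has to honor it):

```sql
-- Composite key: business key plus tenant discriminator.
ALTER TABLE out_stage.customers
    ADD CONSTRAINT pk_customers PRIMARY KEY (workspace_id, customer_id);
```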
Question: How to implement incremental workspace loading?
Answer: You’ll need to maintain a specific timestamp (or load ID) column for each record in the ADS output table, plus an ADS-based audit table that stores the last timestamp (or load ID) loaded into a particular workspace. The incremental workspace load uses these columns to determine the correct data increment. The incremental workspace loading must be retry-able, so that if it fails, the correct increment is loaded during the next workspace load process execution.
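A minimal sketch of the load ID variant, reusing the hypothetical out_stage.customers table from above; the :workspace_id and :loaded_load_id placeholders stand for parameters supplied by the workspace loading platform process:

```sql
-- Hypothetical audit table: the last load ID delivered to each workspace.
CREATE TABLE meta.workspace_load_audit (
    workspace_id VARCHAR(32) NOT NULL,
    last_load_id INTEGER     NOT NULL
);

-- Select the increment for one workspace. The query is deterministic until
-- the watermark moves, so a failed load can simply be retried.
SELECT c.customer_id, c.name, c.created_at
FROM out_stage.customers c
JOIN meta.workspace_load_audit a ON a.workspace_id = c.workspace_id
WHERE c.workspace_id = :workspace_id
  AND c.load_id > a.last_load_id;

-- Move the watermark forward only after the workspace load has succeeded.
UPDATE meta.workspace_load_audit
SET last_load_id = :loaded_load_id
WHERE workspace_id = :workspace_id;
```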
Question: How can I add a new tenant into my multi-tenant solution?
Answer: Provision a workspace for the tenant, then deploy and schedule the workspace data loading platform process.
Question: Where should I store a per-tenant configuration?
Answer: We recommend the following mechanisms:
- The tenant’s workspace load process parameters for all parameters specific to loading that tenant’s workspace (for example, the tenant’s discriminator value, the mapping of custom fields to generic custom columns, etc.).
- An ADS audit table for customizing the transformation platform processes. This includes parameters for incremental processing (determining which Amazon S3 files haven’t been extracted yet, and which records haven’t been incrementally loaded into the ADS output stage tables, etc.).
- The downloader’s Amazon S3 bucket for the incoming metadata (the structure of the data in the Amazon S3 bucket, etc.).
Question: My tenant uses custom data processing for different tiers of their projects. How should I implement the individual tiers?
Answer: Use separate data pipelines (separate platform processes, Amazon S3 buckets, and ADS instances) for each tier. Use parametrization (see above) for simple customizations (for example, a tenant’s specific time zone).