Recommended ETL Development Practices
CloudConnect is a legacy tool and will be discontinued. To prepare your data, we recommend using the GoodData data pipeline as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with the Logical Data Modeler.
The following recommendations can be integrated into your ETL development processes to simplify the effort and improve the overall quality of the finished product.
Planning Your ETL Project
Think about what you have to do. How does the ETL project need to be structured? Can some of the processing be handled in the source system before the data is passed to the ETL?
Use a small sample of data to build and test your ETL project. Do not process massive volumes of data until your ETL has been completely finished and debugged; smaller datasets are easier to validate. As a final test of your ETL graphs, submit data in production volumes to the project, which can reveal problems with processing-intensive steps such as joins, sorts, aggregations, and lookups.
The maximum execution time for an ETL graph in the GoodData platform is 5 hours. At this point, the graph is forcibly shut down. In CloudConnect Designer, this limit is not imposed, so you should pay attention to the duration of very long ETL executions in CloudConnect Designer.
ETL Graph Architecture
Building sequences of ETL steps in CloudConnect Designer is relatively simple and intuitive. It can be challenging, though, to build complex workflows across multiple graphs. You should invest time in designing and executing your ETL workflow logic. In a future release, CloudConnect Designer will include graphically designed workflows, which will simplify building interactions between ETL graphs.
Build out your ETL graphs dataset by dataset.
Try to divide the overall ETL project into smaller, integrated parts. Typically, these parts are Extract, Transform, and Load steps. This compartmentalization facilitates debugging and error recovery.
- The execution sequence of the steps is orchestrated by an orchestrator graph, which uses a sequence of RunGraph components.
- In the orchestrator graph, you may use multiple RunGraph components. You can arrange these in sequence by adding an index identifier to the components. RunGraph components are run in order: 0, 1, 2, and so on.
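The orchestrator pattern above can be sketched as follows: child graphs run in ascending index order, and a failure stops the sequence. This is a minimal Python sketch; the graph names and the runner callable are illustrative placeholders, not CloudConnect APIs.

```python
# Sketch of the orchestrator pattern: run child graphs in index order,
# stopping at the first failure (as chained RunGraph components do).
# Graph names and the run_graph callable are illustrative assumptions.

def run_in_sequence(graphs, run_graph):
    """Run each (index, name) pair in ascending index order.

    run_graph returns True on success; stop on the first failure and
    report which graphs actually ran.
    """
    executed = []
    for index, name in sorted(graphs):
        ok = run_graph(name)
        executed.append((name, ok))
        if not ok:
            break  # do not start later steps after a failure
    return executed

# Example: the Transform step fails, so Load never starts.
results = run_in_sequence(
    [(0, "extract.grf"), (1, "transform.grf"), (2, "load.grf")],
    run_graph=lambda name: name != "transform.grf",
)
```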
After you have an idea about the desired ETL architecture and connections between the parts, you may start building the graph steps. Do not attempt to build mechanisms that enable restarting of the graph with a partially processed dataset. The unit of execution of an ETL graph is the entire graph.
When the ETL successfully completes:
- In some cases, you may wish to maintain and update a record or timestamp counter, which may be useful in incremental data loading environments. For more information on tracking your most recently loaded record, see Hardening Your ETL.
- Create a backup of the data that was uploaded to GoodData. Through the ETL graph, store these backups in an online repository, such as an S3 bucket.
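One practical detail of the backup step above is giving each run its own object key, so earlier backups are never overwritten. The sketch below shows one possible naming scheme; the `etl-backups` prefix and key layout are assumptions, not a GoodData convention.

```python
# Sketch of naming per-run backups of uploaded data for an online
# repository such as an S3 bucket. The prefix and naming scheme are
# illustrative assumptions.
from datetime import datetime, timezone

def backup_key(dataset: str, run_time: datetime, prefix: str = "etl-backups") -> str:
    """Build a deterministic object key, e.g. etl-backups/orders/2024-01-15T09-30-00.csv"""
    stamp = run_time.strftime("%Y-%m-%dT%H-%M-%S")
    return f"{prefix}/{dataset}/{stamp}.csv"

key = backup_key("orders", datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc))
# With boto3, the upload would then be roughly:
#   boto3.client("s3").upload_file("orders.csv", "my-bucket", key)
```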
Debugging ETL in CloudConnect
- Between the CloudConnect components, add debug edges, where debug data can be captured and reviewed after graph execution. You can also create force failure points during execution, but these points need to be removed in the production ETL.
- For debugging purposes, you might consider designing a secondary output of each step to be a file via the CSV Writer component.
- Results can be shared between steps in your graphs through the data folder. This intermediate storage is valid for the life of a graph’s execution. If the graph fails to complete, the data remains available through the log.
- Before you publish any ETL project to a production environment, you should apply stress testing by processing the maximum estimated data load through the ETL. Sort and Join operations are important to test in your transformations.
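Sharing results between steps through files, as described above, can be sketched as a pair of write/read helpers. This is a simplified stand-in: a temporary directory plays the role of the graph's data folder, and the file names are illustrative.

```python
# Sketch of sharing results between graph steps through files in an
# intermediate data folder. Paths and file names are illustrative.
import csv
import os
import tempfile

def write_step_output(data_dir, step, rows):
    path = os.path.join(data_dir, f"{step}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path

def read_step_output(data_dir, step):
    path = os.path.join(data_dir, f"{step}.csv")
    with open(path, newline="") as f:
        return [tuple(r) for r in csv.reader(f)]

data_dir = tempfile.mkdtemp()  # stands in for the graph's data folder
write_step_output(data_dir, "extract", [("1", "alice"), ("2", "bob")])
rows = read_step_output(data_dir, "extract")
```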
Documentation for ETL Projects
- Create your source-target field mappings and document them in an easy-to-read and accessible format.
- You may use labels in CloudConnect to do some in-process documentation.
- In addition to your in-graph documentation, you should create an external document, which describes each ETL graph, including source data, destination dataset, and summary information on each step of the process. In particular, you should explain any non-intuitive design decisions that you made and why you made them. If possible, describe any failure scenarios and steps that can be taken to recover from them, including whether it can be restarted safely. Associated with each ETL graph description should be the technical contact, who can assist if there are problems.
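Keeping the source-target field mappings machine-readable makes it easy to render the easy-to-read document mentioned above. The sketch below generates a Markdown table from a list of mappings; the field names and transformation notes are made up for illustration.

```python
# Sketch of rendering source-to-target field mappings as a readable
# table for the ETL project document. Field names are illustrative.
mappings = [
    ("sf.Opportunity.Amount", "dataset.opportunities.f_amount", "cast to decimal"),
    ("sf.Opportunity.CloseDate", "dataset.opportunities.dt_close", "parse ISO date"),
]

def render_mapping_table(rows):
    lines = [
        "| Source field | Target field | Transformation |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {src} | {dst} | {note} |" for src, dst, note in rows]
    return "\n".join(lines)

table = render_mapping_table(mappings)
```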
Building Individual ETL Graphs
Each individual ETL graph should be readable by a technical person who has no experience with the project beforehand. You or someone else may need to debug the graph when an API changes, for example, or you may need to enrich it with new functionality at a later time. The following are some general recommendations for making your ETL graphs more accessible:
- Use labels to add comments on what each graph does. Use in-code commenting to describe the functionality of more complex component functions.
- Avoid building nested calls to other ETL graphs, unless you carefully and consistently document them. It can be challenging to debug nested ETL.
- Always validate your ETL graph. Verify that your ETL works as expected in your local environment, before it is published into a production project. Add a reference to the validation test, date of local execution, and any version information about the ETL graph to your external ETL project document.
Hardening Your ETL
As much as possible, you should build features into your ETL to make it unbreakable. Such features include quality coding standards, robust data validation, and recovery practices.
- If you are using downloaded files as source data, you should consider performing backups of any downloaded source files, as the dataset may change or may not be available if you attempt to re-acquire from the source system at a later time. Source data should be backed up after completion of the Extract phase. In the event of disaster, you can recover your source data files without having to go back to the system of origin, where the data may have changed or been wiped clean. Backups may also facilitate recovery in GoodData if user error results in execution of a bad ETL graph or other unexpected event. Accidents happen. Plan for them.
- If an ETL graph fails, it should be able to be restarted without breaking the process or corrupting the dataset. Particularly if you are building the ETL graph to deploy across multiple projects using incremental data loads, building and testing this kind of recoverability will save considerable time in project maintenance. Restarts should work in the following conditions for incremental data loads:
- In case of timeout or other unspecified failure, the project should be able to restart without manual intervention to clean up data or set variables.
- The project should be able to withstand multiple restarts within a single day.
- If graph execution is interrupted, it should be recoverable even after a period of several days.
One method involves storing a record or timestamp counter in the ETL metadata. For example, if the LAST_LOADED parameter doesn’t exist or is set to zero, then it is the first execution of the ETL. As part of that execution, you can configure the ETL at runtime to do a full data load with replacement.
Do not update the LAST_LOADED parameter until all of the data in the graph has been successfully uploaded to GoodData and backups have been created.
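The LAST_LOADED pattern described above can be sketched as two small decisions: pick full versus incremental mode at the start of a run, and advance the watermark only after both the upload and the backup succeed, so a failed run restarts cleanly. Parameter storage is simplified to a dictionary here; in practice it would live in the ETL metadata.

```python
# Sketch of the LAST_LOADED pattern: choose the load mode on startup,
# and advance the counter only after a fully successful run, so that a
# restart repeats the same window instead of skipping data.
# Parameter storage is simplified to a dict for illustration.

def plan_load(params):
    """Return 'full' on the first run (LAST_LOADED missing or zero), else 'incremental'."""
    return "full" if not params.get("LAST_LOADED") else "incremental"

def commit_run(params, new_watermark, upload_ok, backup_ok):
    """Advance LAST_LOADED only when both the upload and the backup succeeded."""
    if upload_ok and backup_ok:
        params["LAST_LOADED"] = new_watermark
    return params

params = {}
first_mode = plan_load(params)   # first execution: full load with replacement
commit_run(params, "2024-01-15T00:00:00", upload_ok=True, backup_ok=True)
next_mode = plan_load(params)    # subsequent executions: incremental

# A failed upload leaves the watermark untouched, so a restart is safe.
commit_run(params, "2024-01-16T00:00:00", upload_ok=False, backup_ok=True)
```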
Consider delaying the loading of data from your ETL into GoodData until all of the datasets in your project are ready to be loaded, which reduces the possibility of data corruption. All transformation steps must complete without failure before the ETL performs the loading steps into GoodData. This structure ensures that either all data is loaded or no data is loaded at all, which simplifies debugging.
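The all-or-nothing structure just described can be sketched as: run every transformation first, and begin loading only once all of them have succeeded. The transform and load callables below are placeholders for the actual graph steps.

```python
# Sketch of all-or-nothing loading: if any transformation fails, no
# dataset is loaded. Transform and load callables are placeholders.

def run_etl(transforms, load):
    outputs = {}
    for name, fn in transforms.items():
        try:
            outputs[name] = fn()
        except Exception:
            return {"loaded": False, "reason": f"transform {name} failed"}
    for name, data in outputs.items():
        load(name, data)  # loading begins only after all transforms succeed
    return {"loaded": True, "reason": None}

def failing_transform():
    raise ValueError("bad source row")

loaded = []
result = run_etl(
    {"orders": lambda: [1, 2], "users": failing_transform},
    load=lambda name, data: loaded.append(name),
)
```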
Design your ETL graphs to fail in a graceful manner when unrecoverable errors are detected. If you use RunGraph components, this error trapping is easy to manage, as you can trap errors by delivering output through the second port of the component.
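The error-trapping idea above, where a RunGraph component's second output port carries failures, can be sketched as a wrapper that routes each step's outcome to a success path or an error path instead of crashing the run. The step functions here are illustrative.

```python
# Sketch of routing failures to an error path, analogous to delivering
# output through a RunGraph component's second port. Steps are
# illustrative placeholders.

def run_step(step, payload):
    """Return ('ok', result) on success or ('error', message) on failure."""
    try:
        return ("ok", step(payload))
    except Exception as exc:
        return ("error", f"{step.__name__}: {exc}")

def parse_amount(value):
    return float(value)

ok_port = run_step(parse_amount, "12.50")   # success path
err_port = run_step(parse_amount, "n/a")    # error path, with context
```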
You can log important events in your ETL graphs for debugging purposes using the CTL printLog(level, message) function call. Valid levels include debug, info, warn, error, and fatal.
- After you have built an ETL project, follow the validation and testing steps in the Uncover phase. You can create simple data validation reports in the GoodData project to validate that your ETL graph has executed properly.
- Validation should be run over a period of several execution runs, so that you can verify that data is being properly updated without creating gaps.
- In some cases, you can use the source system to generate post-ETL validation data. For example, the Salesforce SOQL interface enables you to query for data that has been transformed in the same way as your ETL transformation step, so the results can be compared.
- For more information on validating your projects, see Validate a Workspace.
- You should familiarize yourself with the Data Integration Console, which enables you to schedule graph executions and run them on an on-demand basis. Some of the features in Data Integration Console may be useful in debugging your ETL graphs after they have been deployed into the GoodData platform, including issues around connectivity, project parameters, and other execution issues. For more information, see Data Integration Console Reference.
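The multi-run validation described above often comes down to checking that daily incremental loads left no gaps. A minimal sketch of such a gap check, assuming one load per calendar day:

```python
# Sketch of a gap check for daily incremental loads: report any dates
# missing between the earliest and latest loaded date. Assumes one
# load per calendar day.
from datetime import date, timedelta

def find_gaps(loaded_dates):
    """Return the dates missing between the earliest and latest loaded date."""
    days = sorted(set(loaded_dates))
    if not days:
        return []
    gaps = []
    current = days[0]
    while current <= days[-1]:
        if current not in days:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps

gaps = find_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)])
```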
Optimizing Your ETL Graphs
- After you have completed the basic functionality of your ETL graphs, you should optimize memory consumption and performance of the ETL. Perform sorts in separate phases. Filter your data before sorting or joining. Basic database performance techniques can be applied.
- Make the runtime of each graph as short as possible. Perform units of work in increments. For example, if your graph requires polling, perform it in a separate graph and schedule that graph separately, so that it doesn’t block platform resources.
- Incremental and full data loads should be planned and executed differently. For more information, see Incremental Data Loading.
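The filter-before-join advice above can be sketched as follows: reducing the rows first keeps the lookup small and any subsequent sort cheap. The data and field names are illustrative.

```python
# Sketch of filtering data before a join: only qualifying rows
# participate, which keeps the lookup and any sort small.
# Data and field names are illustrative.

def join_filtered(orders, customers, min_amount):
    # Filter first, before joining.
    big_orders = [o for o in orders if o["amount"] >= min_amount]
    by_id = {c["id"]: c["name"] for c in customers}  # lookup built once
    return [(o["id"], by_id.get(o["customer_id"]), o["amount"])
            for o in big_orders]

rows = join_filtered(
    orders=[{"id": 1, "customer_id": 10, "amount": 50},
            {"id": 2, "customer_id": 11, "amount": 500}],
    customers=[{"id": 10, "name": "Acme"}, {"id": 11, "name": "Globex"}],
    min_amount=100,
)
```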