Phases of Building the Data Pipeline
This article describes the process of building a fully functional data pipeline.
The process consists of the following phases:
- Preparation: Gather information about data sources and data types.
- Testing Implementation: Create a simple testing data pipeline and make sure that it works in its minimum possible version.
- Production Implementation: Create a fully functional data pipeline based on the test pipeline.
Preparation
Identify the sources your data comes from. You may have one or multiple sources.
For example, a portion of your data may come from CSV files, while the rest may be stored in an SQL database.

Based on the data sources, identify the types of downloaders that you are going to need.
To see what downloaders are available, see “Bricks” in Data Preparation and Distribution Pipeline.

Decide whether you want to transform the source data before it gets distributed to the workspaces.
This decision depends on whether your data is ready as-is to be used for creating insights, or it has to be somehow transformed (pre-aggregate, de-normalize, process time-series data, build snapshots, consolidate multiple data sources, and so on).
If you choose to transform the data, you are going to need executors (such as SQL Executor).

To cover all possible scenarios, the following sections assume that you decided to transform the data and therefore include executors as part of the pipeline. However, using executors is optional. If you do not need to transform your data before distributing it to workspaces, skip all the executor-related steps in the following sections.
You can now proceed to the Testing Implementation phase, where you are going to create and test the data pipeline in its minimum version.
Testing Implementation
Choose a downloader to test the data pipeline with. For the testing implementation purposes, we are going to use CSV Downloader.
Prepare a source file. For the testing implementation purposes, create a simple CSV file.
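For illustration, a minimal source file can be generated with a few lines of Python; the file name (orders.csv) and its columns are made up and can be anything that fits your data model.

```python
import csv

# A minimal, made-up source file: three rows with hypothetical columns.
rows = [
    {"id": 1, "name": "Alpha", "amount": 100},
    {"id": 2, "name": "Beta", "amount": 250},
    {"id": 3, "name": "Gamma", "amount": 75},
]

with open("orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "amount"])
    writer.writeheader()
    writer.writerows(rows)
```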
Upload the source file to the remote location where you plan to store your source files going forward. For the testing implementation purposes, use your S3 bucket.
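If you prefer to upload the file programmatically rather than through the AWS console, a sketch using boto3 could look like this; the bucket name and key are placeholders and should correspond to wherever you decide to store the source files.

```python
import boto3

BUCKET = "my-pipeline-bucket"             # placeholder bucket name
KEY = "csv-downloader/source/orders.csv"  # placeholder key (path) in the bucket

# Credentials are taken from your AWS environment (profile, env variables, or role).
s3 = boto3.client("s3")
s3.upload_file("orders.csv", BUCKET, KEY)
```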
Create a simple configuration file for CSV Downloader.
- Copy the minimum layout of the configuration file (see “Minimum Layout of the Configuration File” in CSV Downloader).
- Replace the placeholders with your values.
- Save the file as configuration.json, and upload it to your S3 bucket, as sketched below.
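A quick way to catch malformed JSON before CSV Downloader reads the configuration is to parse the file locally and then upload it. The following sketch assumes boto3 and placeholder bucket and key names.

```python
import json
import boto3

BUCKET = "my-pipeline-bucket"             # placeholder bucket name
KEY = "configuration/configuration.json"  # placeholder key (path) in the bucket

# Parsing the file first fails fast on malformed JSON.
with open("configuration.json") as f:
    config = json.load(f)

boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key=KEY,
    Body=json.dumps(config, indent=2).encode("utf-8"),
)
```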
Deploy CSV Downloader in the service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick).
When deploying the downloader, specify the path to the configuration file (configuration.json in your S3 bucket).

Schedule CSV Downloader in the service workspace (see Schedule a Data Load). When scheduling CSV Downloader, add general and location-specific parameters to its schedule (see “Schedule Parameters” in CSV Downloader).
Run the CSV Downloader process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully and CSV Downloader downloads the source file and places the data in BDS.
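If you run processes over the API rather than in the Data Integration Console, the execution can be triggered with a request like the following sketch. It assumes the schedule-executions endpoint of the GoodData API, an already authenticated requests.Session, and placeholder host, project, and schedule IDs.

```python
import requests

HOST = "https://secure.gooddata.com"          # placeholder domain
PROJECT_ID = "<service-workspace-id>"         # placeholder
SCHEDULE_ID = "<csv-downloader-schedule-id>"  # placeholder

# `session` is assumed to already carry your GoodData authentication
# (the login flow is omitted here).
session = requests.Session()

resp = session.post(
    f"{HOST}/gdc/projects/{PROJECT_ID}/schedules/{SCHEDULE_ID}/executions",
    json={"execution": {}},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # the response links to the execution whose status you can poll
```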
Deploy ADS Integrator in the service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick). When deploying the integrator, specify the path to the configuration file (configuration.json in your S3 bucket).
Schedule ADS Integrator in the service workspace (see Schedule a Data Load). When scheduling ADS Integrator, add general and location-specific parameters to its schedule (see “Schedule Parameters” in ADS Integrator).
Run the ADS Integrator process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully and ADS Integrator integrates the downloaded data into Agile Data Warehousing Service (ADS).
Set up the ADD Output Stage (see Use Automated Data Distribution). After you complete this step, the ADD data loading process appears in the Data Integration Console for your client workspaces. The process is named “Automated Data Distribution”.
Create a view to link the integrated data to the ADD Output Stage (see Naming Convention for Output Stage Objects).
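The view itself is plain SQL that you run against ADS. The object names in the following sketch are purely illustrative; the real view name must follow the naming convention referenced above.

```python
# Illustrative only: the integrated table and the view name are made up.
# The real view name must follow the Output Stage naming convention.
CREATE_VIEW_SQL = """
CREATE VIEW out_csv_orders AS
SELECT id, name, amount
FROM src_csv_orders_merged;
"""

# Save the statement so you can run it against ADS with your SQL client.
with open("create_output_stage_view.sql", "w") as f:
    f.write(CREATE_VIEW_SQL.strip() + "\n")
```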
Creating a view is a one-time action that you perform only in the testing implementation phase to verify that the data was integrated correctly.

Schedule the ADD process in the client workspace (see Schedule a Data Load).
Run the ADD process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully and the data is distributed to the client workspaces.
You can now proceed to the Production Implementation phase, where you are going to build a fully functional data pipeline based on the test pipeline.
For a better understanding of the Testing Implementation phase, review Build Your Data Pipeline.
Production Implementation
Update the configuration file (configuration.json) that you created in the testing implementation phase.
The configuration file covers the configurations for all your downloaders and integrators (executors do not require any parameters in the configuration file), so you do not have to create a separate configuration file for each downloader and integrator. The order of the sections in the configuration file is not important.

Depending on your business and technical requirements, you can do the following:
- Add more downloaders (add more instances of the same downloader or add other downloaders).
- Add more instances of the integrators.
- Customize the default settings of the downloaders and integrators.
For information about the configuration file parameters for each downloader and integrator, see the appropriate articles under Downloaders and Integrators. For the configuration file example, see Configure a Brick.
Prepare transformation scripts for the executors, and store the scripts for later use. For information about the transformation scripts, see SQL Executor. Skip this step if you decided not to transform your data before distributing it to workspaces.
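As a sketch of what a transformation script can contain, a simple pre-aggregation might look like the following; the table names and the aggregation are hypothetical, and SQL Executor describes how the scripts are organized and executed.

```python
# Hypothetical transformation: pre-aggregate order amounts per customer.
# The source and target table names are placeholders.
TRANSFORMATION_SQL = """
INSERT INTO agg_orders_by_customer (customer_id, total_amount)
SELECT customer_id, SUM(amount)
FROM src_csv_orders_merged
GROUP BY customer_id;
"""

with open("aggregate_orders.sql", "w") as f:
    f.write(TRANSFORMATION_SQL.strip() + "\n")
```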
Upload the updated configuration file (configuration.json) and the prepared transformation scripts to your S3 bucket.
Deploy the downloaders, integrators, and executors to your service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick). When deploying a downloader or integrator, specify the path to the configuration file (configuration.json).
Schedule the deployed downloaders in the service workspace. Depending on the amount of data to download, the required data load frequency, and other business requirements, you can schedule them in one of the following ways:
- The downloaders run one after another (see Configure Schedule Sequences).
- The downloaders are executed only when you run them manually, either via the Data Integration Console or API (see Schedule a Data Load for Manual Execution Only).
When scheduling a downloader, add the downloader-specific parameters to its schedule. For information about the schedule parameters for each downloader, see the appropriate articles under Downloaders.
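If you create the schedules over the API instead of the Data Integration Console, the request could look roughly like the following sketch. It assumes the schedules resource of the GoodData API, an authenticated requests.Session, and placeholder project ID, process ID, cron expression, and parameters.

```python
import requests

HOST = "https://secure.gooddata.com"   # placeholder domain
PROJECT_ID = "<service-workspace-id>"  # placeholder

# `session` is assumed to already carry your GoodData authentication.
session = requests.Session()

schedule = {
    "schedule": {
        "type": "MSETL",
        "cron": "0 3 * * *",  # placeholder: daily at 3:00 AM
        "params": {
            "PROCESS_ID": "<csv-downloader-process-id>",  # placeholder
            # Add the downloader-specific parameters here, exactly as listed
            # in the "Schedule Parameters" section of the downloader's article.
        },
    }
}

resp = session.post(
    f"{HOST}/gdc/projects/{PROJECT_ID}/schedules",
    json=schedule,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
```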
Schedule the deployed integrators, executors, and the ADD data loading process (see Schedule a Data Load).
To schedule integrators and executors, navigate to the Data Integration Console for your service workspace.
To schedule the ADD data loading process, navigate to the Data Integration Console for your client workspaces.
When scheduling an integrator or executor, add the integrator- or executor-specific parameters to its schedule. For information about the schedule parameters for each integrator and executor, see the appropriate articles under Integrators and Executors. Schedule the components so that they run in this exact order and each component starts only after the previous component completes:
- Downloaders
- Integrators
- Executors
- ADD
(Optional) Deploy Schedule Executor to your service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick) and schedule it. When scheduling Schedule Executor, add general and environment-specific parameters to its schedule (see “Schedule Parameters” in Schedule Executor).
(Optional) Set up notifications for the scheduled processes (see Create a Notification Rule for a Data Loading Process).
This completes the process of building the data pipeline. You can now place your source files in the specified remote location so that the scheduled downloaders can download them at their first run.