Phases of Building the Data Pipeline

This article describes the process of building a fully functional data pipeline.

The process consists of the following phases:

  1. Preparation: Gather information about data sources and data types.
  2. Testing Implementation: Create a simple testing data pipeline and make sure that it works in its minimum possible version.
  3. Production Implementation: Create a fully functional data pipeline based on the test pipeline.

Preparation

  1. Identify the sources your data comes from. You may have one or multiple sources. 
    For example, a portion of your data may come from CSV files, while the rest may be stored in an SQL database.  

  2. Based on the data sources, identify the types of downloaders that you are going to need. 
    To see what downloaders are available, see “Bricks” in Data Preparation and Distribution Pipeline.  

  3. Decide whether you want to transform the source data before it gets distributed to the workspaces. 
    This decision depends on whether your data is ready to be used for creating insights as-is, or whether it first has to be transformed (for example, pre-aggregated, de-normalized, processed as time-series data, built into snapshots, or consolidated from multiple data sources). 
    If you choose to transform the data, you are going to need executors (such as SQL Executor).

You can now proceed to the Testing Implementation phase, where you are going to create and test the data pipeline in its minimum version.

Testing Implementation

  1. Choose a downloader to test the data pipeline with. For the purposes of the testing implementation, we are going to use CSV Downloader.  

  2. Prepare a source file. For the purposes of the testing implementation, create a simple CSV file.
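
    For example, the source file might look like this (the column names and data below are made up for illustration; use whatever structure matches your data):

      id,name,signup_date,amount
      1,Alice,2024-01-15,120.50
      2,Bob,2024-02-20,75.00
      3,Carol,2024-03-05,210.25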

  3. Put the source file in the remote location where you plan to store your source files going forward. For the purposes of the testing implementation, use your S3 bucket.  

  4. Create a simple configuration file for CSV Downloader.

    1. Copy the minimum layout of the configuration file (see “Minimum Layout of the Configuration File” in CSV Downloader).
    2. Replace the placeholders with your values.
    3. Save the file as configuration.json, and put it in your S3 bucket.  
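
    After you replace the placeholders, configuration.json might look roughly like the following sketch. This is an illustration only: the authoritative section names and keys are those in "Minimum Layout of the Configuration File" in CSV Downloader, and the instance names (csv_downloader_1, ads_integrator), entity name (users), and bucket details below are assumed placeholders.

      {
        "entities": {
          "users": {
            "global": {
              "custom": {
                "hub": ["id"]
              }
            }
          }
        },
        "downloaders": {
          "csv_downloader_1": {
            "type": "csv",
            "entities": ["users"]
          }
        },
        "csv": {
          "type": "s3",
          "options": {
            "bucket": "my-s3-bucket",
            "folder": "source-files",
            "access_key": "<your AWS access key>"
          }
        },
        "integrators": {
          "ads_integrator": {
            "type": "ads_storage",
            "batches": ["csv_downloader_1"]
          }
        },
        "ads_storage": {
          "instance_id": "<your ADS instance ID>",
          "username": "<your GoodData user>",
          "options": {}
        }
      }
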
  5. Deploy CSV Downloader in the service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick). 
    When deploying the downloader, specify the path to the configuration file (configuration.json in your S3 bucket).  

  6. Schedule CSV Downloader in the service workspace (see Schedule a Data Load). When scheduling CSV Downloader, add general and location-specific parameters to its schedule (see “Schedule Parameters” in CSV Downloader).  

  7. Run the CSV Downloader process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully, and CSV Downloader downloads the source file and places the data in BDS.  

  8. Deploy ADS Integrator in the service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick). When deploying the integrator, specify the path to the configuration file (configuration.json in your S3 bucket).   

  9. Schedule ADS Integrator in the service workspace (see Schedule a Data Load). When scheduling ADS Integrator, add general and location-specific parameters to its schedule (see “Schedule Parameters” in ADS Integrator).  

  10. Run the ADS Integrator process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully, and ADS Integrator integrates the downloaded data into Agile Data Warehousing Service (ADS).  

  11. Set up the ADD Output Stage (see Use Automated Data Distribution). After you complete this step, the ADD data loading process appears in the Data Integration Console for your client workspaces. The process is named “Automated Data Distribution”.  

  12. Create a view to link the integrated data to the ADD Output Stage (see Naming Convention for Output Stage Objects).

  13. Schedule the ADD process in the client workspace (see Schedule a Data Load).  

  14. Run the ADD process (see Run a Scheduled Data Loading Process on Demand). When the process execution completes, check its log. If the process failed with errors, fix the errors, and run the process again. Repeat until the process completes successfully, and the data is distributed to the client workspaces.

You can now proceed to the Production Implementation phase, where you are going to build a fully functional data pipeline based on the test pipeline.

Production Implementation

  1. Update the configuration file (configuration.json) that you created during the Testing Implementation phase.

    Depending on your business and technical requirements, you can do the following:

    • Add more downloaders (add more instances of the same downloader or add other downloaders).
    • Add more instances of the integrators.
    • Customize the default settings of the downloaders and integrators.

    For information about the configuration file parameters for each downloader and integrator, see the appropriate articles under Downloaders and Integrators. For an example of the configuration file, see Configure a Brick.  
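
    For example, adding a second instance of CSV Downloader for another group of source files might extend the downloaders section of the sketch from the testing phase roughly as follows (the instance and entity names are assumed placeholders; the real parameters for each downloader are described in the articles under Downloaders):

      "downloaders": {
        "csv_downloader_1": {
          "type": "csv",
          "entities": ["users"]
        },
        "csv_downloader_2": {
          "type": "csv",
          "entities": ["orders"]
        }
      }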

  2. Prepare transformation scripts for the executors, and store the scripts for later use. For information about the transformation scripts, see SQL Executor. Skip this step if you decided not to transform your data before distributing it to workspaces.  

  3. Put the updated configuration file (configuration.json) and the prepared transformation scripts in your S3 bucket.  

  4. Deploy the downloaders, integrators, and executors to your service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick). When deploying a downloader or integrator, specify the path to the configuration file (configuration.json).  

  5. Schedule the deployed downloaders in the service workspace. Depending on the amount of data to download, the required data load frequency, and other business requirements, you can schedule them in different ways. 

    When scheduling a downloader, add the downloader-specific parameters to its schedule. For information about the schedule parameters for each downloader, see the appropriate articles under Downloaders.  

  6. Schedule the deployed integrators, executors, and the ADD data loading process (see Schedule a Data Load).

    • To schedule integrators and executors, navigate to the Data Integration Console for your service workspace.

    • To schedule the ADD data loading process, navigate to the Data Integration Console for your client workspaces.

    When scheduling an integrator or executor, add the integrator- or executor-specific parameters to its schedule. For information about the schedule parameters for each integrator and executor, see the appropriate articles under Integrators and Executors.

    Schedule the components so that they run in this exact order, with each component starting only after the previous one completes: 

    • Downloaders
    • Integrators
    • Executors
    • ADD  
  7. (Optional) Deploy Schedule Executor to your service workspace (see Deploy a Data Loading Process for a Data Pipeline Brick) and schedule it. When scheduling Schedule Executor, add general and environment-specific parameters to its schedule (see “Schedule Parameters” in Schedule Executor).  

  8. (Optional) Set up notifications for the scheduled processes (see Create a Notification Rule for a Data Loading Process).

This completes the process of building the data pipeline. You can now place your source files in the specified remote location so that the scheduled downloaders can download them at their first run.