Loading Data via REST API

This article describes how to load data from CSV files into your workspace in a single task instead of loading the CSV files one by one. Loading multiple CSV files at once saves time if your workspace uses incremental data loading, takes a long time to synchronize data, or contains many datasets: the individual data load tasks are combined into a single task, so you do not have to start a separate task for each dataset.

Loading data via REST API consists of the following steps:

  1. Prepare the SLI manifest.
  2. Prepare CSV files with data to load.
  3. Transfer the CSV files to user-specific storage.
  4. Run a load task.

Prepare the SLI Manifest

Prepare Single Load Interface (SLI) manifests, one for each dataset. These manifests map the data columns to the logical data model in your workspace.

The GoodData platform generates the SLI manifests automatically. You obtain them using the API.

Steps:

  1. Use the API for obtaining SLI manifests to retrieve the manifest for each dataset that you want to update:

    • API resource: https://secure.gooddata.com/gdc/md/{workspace_id}/ldm/singleloadinterface/{dataset}/manifest/

    • Method: GET

    • Request headers:

      Content-Type: application/json
      Accept: application/json
      

    The API returns the SLI manifest for the specified dataset. The following is an example of the SLI manifest for a dataset called “dataset.quotes”:

    {
      "dataSetSLIManifest": {
        "parts": [
          {
            "columnName": "id",
            "mode": "FULL",
            "populates": ["label.quotes.id"],
            "referenceKey": 1
          },
          {
            "columnName": "company",
            "mode": "FULL",
            "populates": ["label.quotes.symbol.company"]
          },
          {
            "columnName": "symbol",
            "mode": "FULL",
            "populates": ["label.quotes.symbol"],
            "referenceKey": 1
          },
          {
            "columnName": "sector",
            "mode": "FULL",
            "populates": ["label.quotes.sector"],
            "referenceKey": 1
          },
          {
            "columnName": "industry",
            "mode": "FULL",
            "populates": ["label.quotes.industry"],
            "referenceKey": 1
          },
          {
            "columnName": "market",
            "mode": "FULL",
            "populates": ["label.quotes.market"],
            "referenceKey": 1
          },
          {
            "columnName": "quote_date",
            "mode": "FULL",
            "populates": ["quote.date.mdyy"],
            "constraints": {"date": "yyyy-mm-dd"},
            "referenceKey": 1
          },
          {
            "columnName": "open_price",
            "mode": "FULL",
            "populates": ["fact.quotes.open_price"]
          },
          {
            "columnName": "high_price",
            "mode": "FULL",
            "populates": ["fact.quotes.high_price"]
          },
          {
            "columnName": "low_price",
            "mode": "FULL",
            "populates": ["fact.quotes.low_price"]
          },
          {
            "columnName": "close_price",
            "mode": "FULL",
            "populates": ["fact.quotes.close_price"]
          },
          {
            "columnName": "volume",
            "mode": "FULL",
            "populates": ["fact.quotes.volume"]
          },
          {
            "columnName": "adjusted_close_price",
            "mode": "FULL",
            "populates": ["fact.quotes.adjusted_close_price"]
          },
          {
            "columnName": "quote_date_dt",
            "mode": "FULL",
            "populates": ["dt.quotes.quote_date"]
          }
        ],
        "file": "data.csv",
        "dataSet": "dataset.quotes"
      }
    }
    
  2. Store the obtained SLI manifests for later use.
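
    For example, the following sketch performs Steps 1 and 2 for two datasets. It assumes the Python requests library and HTTP basic authentication (depending on your security settings, you may need token-based login instead); the workspace ID, dataset identifiers, and credentials are placeholders:

    import json
    import requests

    API = "https://secure.gooddata.com"
    WORKSPACE_ID = "your_workspace_id"                        # placeholder
    DATASETS = ["dataset.dim_team", "dataset.dim_assignee"]   # datasets to update

    session = requests.Session()
    session.auth = ("user@example.com", "PASSWORD")           # placeholder credentials
    session.headers.update({"Accept": "application/json"})

    for dataset in DATASETS:
        url = f"{API}/gdc/md/{WORKSPACE_ID}/ldm/singleloadinterface/{dataset}/manifest/"
        manifest = session.get(url).json()
        # Store each manifest in its own file for later assembly into upload_info.json.
        with open(f"{dataset}.manifest.json", "w") as f:
            json.dump(manifest, f, indent=2)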

  3. Do one of the following:

    • If you have only one CSV file to upload, name the obtained SLI manifest file upload_info.json, and proceed to Step 6.
    • If you have two or more CSV files to upload, proceed to Step 4.
  4. Create a JSON file, and name it upload_info.json. Copy the following JSON structure into the file:

    {
      "dataSetSLIManifestList" : [ ]
    }
    
  5. Insert the content of the stored SLI manifests into the dataSetSLIManifestList array. Your upload_info.json file should look similar to the following:

    {
      "dataSetSLIManifestList" : [
        {
          "dataSetSLIManifest": {
            "parts": [
              {
                "columnName": "team_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_team.id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "name",
                "mode": "FULL",
                "populates": [
                  "label.dim_team.name"
                ]
              }
            ],
            "file": "dim_team.csv",
            "dataSet": "dataset.dim_team"
          }
        },
        {
          "dataSetSLIManifest": {
            "parts": [
              {
                "columnName": "assignee_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "team_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.team_id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "name",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.name"
                ]
              }
            ],
            "file": "dim_assignee.csv",
            "dataSet": "dataset.dim_assignee"
          }
        }
      ]
    }
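
    If you stored the manifests with a script in the previous steps, you can assemble upload_info.json the same way. A minimal sketch (the manifest file names follow the earlier example):

    import json

    manifest_files = [
        "dataset.dim_team.manifest.json",       # stored in the previous steps
        "dataset.dim_assignee.manifest.json",
    ]

    manifests = []
    for name in manifest_files:
        with open(name) as f:
            manifests.append(json.load(f))      # each file holds one dataSetSLIManifest

    with open("upload_info.json", "w") as f:
        json.dump({"dataSetSLIManifestList": manifests}, f, indent=2)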
    
  6. Review the content of your upload_info.json file, and modify it as needed:

    • An attribute can have multiple labels assigned in the logical data model. "referenceKey": 1 specifies that the primary identifier of the attribute column is used.

      {
        "columnName": "team_id",
        "mode": "FULL",
        "populates": [
          "label.dim_assignee.team_id"
        ],
        "referenceKey": 1
      }
      

      Select the label that represents the attribute whose values you want to update.

    • To upload data incrementally, use "mode": "INCREMENTAL".

      {
        "columnName": "name",
        "mode": "INCREMENTAL",
        "populates": [
          "label.dim_team.name"
        ]
      }
      
    • For the columns that are related to the predefined date dimensions (provided by GoodData by default) or fiscal date dimensions, manually include a date constraint key with the appropriate date format. For example:

      "constraints": {"date": "yyyy-mm-dd"}
      

      Here is an example of a column related to the date dimension with the constraint key:

      {
        "columnName": "quote_date",
        "mode": "FULL",
        "populates": ["quote.date.mdyy"],
        "constraints": {"date": "yyyy-mm-dd"},
        "referenceKey": 1
      }
      

      If you do not include a date constraint key, you will see the following error message:

      Missing date constraint for column...
      
  7. Save the changes.

Prepare CSV Files with Data to Load

Prepare CSV files with the data that you want to upload. Wherever possible, use a tool or a library to generate your CSV files. Avoid generating them manually.
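
For example, Python's built-in csv module takes care of delimiters, quoting, and line endings for you. A minimal sketch (the columns and file name match the dim_team example above):

  import csv

  rows = [
      ("1", "Alpha Team"),     # example values
      ("2", "Beta Team"),
  ]

  # newline="" lets the csv module control the line endings itself
  with open("dim_team.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.writer(f)                  # comma-delimited, quotes only when needed
      writer.writerow(["team_id", "name"])    # header row matches the SLI manifest
      writer.writerows(rows)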

CSV File Requirements

Verify that your CSV files meet the requirements listed in the following sections.

File Format and Name

  • The files contain plain text in UTF-8 or a UTF-8-compatible encoding (for example, ASCII).
  • Each file has a header row as its first row.
  • File names are consistent.
  • Keys are used consistently across all the files.

File Structure

  • All rows have the same number of fields (columns).

  • All rows within a column are formatted consistently.

  • Each field in a row contains a value (no blank cells).

  • The column names and column order are the same in every file.

  • The columns have the same name and order as in the SLI manifest.

  • The header row contains the name of each column and has the same format and number of fields as the other rows.

  • Fields in a row are separated by commas (,).

  • There are no spaces between values. This is correct:

    field1,field2,field3
    

    This is incorrect (note the extra spaces):

    field1, field2, field3
    
  • Each row ends with a line break character: CRLF or LF. You can mix CRLF and LF within one file. The last row in a file may or may not end with a line break.

    field1,field2,field3<LF>
    field4,field5,field6<CRLF>
    field7,field8,field9<LF>
    field10,field11,field12
    
  • The files do not contain:

    • Empty rows at the end of the file
    • Delimiters after the last field in a row
    • Delimiters (including tabs or newlines) at the end of the last row
    • End-of-file (EOF) delimiters

Date Format

  • Dates are formatted consistently across all the files.
  • Date values are between the years 1900 and 2050.
  • Dates are specified in one of the following formats:
    • yyyy/mm/dd
    • dd/mm/yyyy

Number Format

  • Numbers are in the plain #######.## format.
  • Numbers do not contain: 
    • Thousands separator commas (,)
    • Currency symbols (for example, $)
    • Percentage signs (%)
    • Scientific notation (for example, 12E3)

Quotation Marks

  • Quotation marks are used consistently.
  • Quotation marks enclose the following fields:
    • Fields with embedded commas

      field1,"field2 with a comma, and a space",field3
      
    • Fields with embedded line breaks

      field1,field2,field3
      "field4 with
      line breaks",field5,field6
      
    • Fields with embedded double-quote characters (a pair of double-quote characters):

      field1,"field2 with ""double quotes""",field3
      
    • Fields with text strings

      field1,"field2 is a string.",field3
      

Processing Specific Characters

  • Double commas are converted to blank fields. This row:

    1,2,,
    

    will be parsed as the following:

    "1","2","",""
    
  • Leading and trailing spaces are not trimmed. This row:

    field1,    field2_with_leading_trailing_space    ,field3
    

    will be parsed as the following:

    field1,"    field2_with_leading_trailing_space    ",field3
    
  • The null character and the null (SQL) marker are not supported. This row:

    NULL,field1,field2
    

    will be parsed as the following:

    "NULL","field1","field2"
    

Transfer the CSV Files to User-Specific Storage

Steps:

  1. Create a directory in your user-specific storage. To find the server URL to connect to, see User Specific Data Storage.

    curl -i -X MKCOL -v https://username%40email.com:PASSWORD@secure.gooddata.com/gdc/uploads/new-directory/
    
  2. Zip the SLI manifest (the upload_info.json file) and the CSV files with data to load into a single archive file. Name the archive file “upload.zip”.
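
    For example, with Python's built-in zipfile module (the file names follow the earlier examples):

    import zipfile

    # Bundle the manifest and the data files into a single archive.
    with zipfile.ZipFile("upload.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        for name in ["upload_info.json", "dim_team.csv", "dim_assignee.csv"]:
            archive.write(name)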

  3. Upload the archive file to the newly created directory.

    curl -i -X PUT --data-binary @upload.zip -v https://user.name%40gooddata.com:PASSWORD@secure.gooddata.com/gdc/uploads/new-directory/upload.zip
    

Run a Load Task

The Single Load Interface (SLI) handles the loading of each dataset with the appropriate CSV file, as specified in the manifest.

The loading process consists of two phases:

  1. The first phase is common to all datasets. An error during this phase fails the load for all datasets.
  2. The second phase processes each dataset independently. Multiple datasets can be processed in parallel. An error encountered with a specific dataset does not affect the loading of the other datasets.

Steps:

  1. Use the API for starting data upload to load the data from the user-specific storage to your workspace: 

    • API resource: https://secure.gooddata.com/gdc/md/{workspace_id}/etl/pull2

    • Method: POST

    • Request body:

      {"pullIntegration": "new-directory"}
      
    • Request headers:

      Content-Type: application/json
      Accept: application/json
      

    The task starts, and a link for polling the data loading status is returned.

  2. Poll for the status until the OK task status is returned.
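
    For example, the following sketch starts the load (Step 1) and polls until a final status is returned. It assumes the Python requests library and HTTP basic authentication; the exact layout of the responses (the location of the poll link and of the status field) is an assumption here, so verify it against the API reference:

    import time
    import requests

    API = "https://secure.gooddata.com"
    WORKSPACE_ID = "your_workspace_id"                     # placeholder

    session = requests.Session()
    session.auth = ("user@example.com", "PASSWORD")        # placeholder credentials
    session.headers.update({"Accept": "application/json",
                            "Content-Type": "application/json"})

    resp = session.post(f"{API}/gdc/md/{WORKSPACE_ID}/etl/pull2",
                        json={"pullIntegration": "new-directory"})
    poll_uri = resp.json()["pull2Task"]["links"]["poll"]   # assumed response layout

    while True:
        status = session.get(API + poll_uri).json()
        state = status["wTaskStatus"]["status"]            # assumed response layout
        if state != "RUNNING":
            break
        time.sleep(10)                                     # wait between polls

    print("Load finished with status:", state)             # expect OK on success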

  3. Log in to your workspace, and verify that the data has been uploaded.

Tips for Uploading CSV Files Using CloudConnect

Define Metadata

If you are using the CSV Reader component to read a CSV file in your CloudConnect process, configure the fields that are passed between components of the ETL graph. To do so, define the metadata between the CSV Reader component and the next component in the ETL graph.

Steps:

  1. In your ETL graph, create a connection between the output port of the CSV Reader component and the input port of the next component.

  2. Double-click the connecting line. The metadata editor dialog appears.

  3. Configure the fields and their format, including the delimiters for end-of-line and end-of-file.

  4. Save the changes.

  5. Publish your logical data model to the server (see Publishing Logical Data Models in CloudConnect).

Use Batch Upload

To increase the overall throughput of your ETL, use batch mode. Batch mode groups all dataset uploads from a single CloudConnect graph execution into one batch that is executed directly on the GoodData data loading infrastructure. This frees the CloudConnect worker to move on to the next graph execution instead of waiting for each dataset to be loaded separately.

To enable batch mode, add the GDC_USE_BATCH_SLI_UPLOAD parameter to the process schedule.

Steps:

  1. From Data Integration Console (see Accessing Data Integration Console), click Projects to open the Projects page.
  2. Click the name of the workspace (project) that the process schedule belongs to.
  3. Locate the process for which you want to enable batch mode, and click the schedule name. The schedule details dialog opens.
  4. Click Add parameter.
  5. For the name of the parameter, enter GDC_USE_BATCH_SLI_UPLOAD. For the parameter value, enter TRUE.
  6. Click Save. Batch mode is set for the schedule.
  7. Run the graph to validate the setting.