Loading Data via REST API

This article describes how to load data from CSV files to your project in a single task instead of loading the CSV files one by one. Loading multiple CSV files at once helps you save time if your project uses incremental data loading, requires a long time to synchronize data, or contains many datasets (multiple data load tasks are integrated into a single task, and you do not have to start a data load task for each dataset).

Loading data via REST API consists of the following steps:

  1. Prepare the SLI manifest.
  2. Prepare CSV files with data to load.
  3. Transfer the CSV files to user-specific storage.
  4. Run a load task.

If you want to delete some old or obsolete data from a dataset while loading new data to the same or different dataset, see Delete Old Data while Loading New Data to a Dataset via API.

Prepare the SLI Manifest

Prepare Single Load Interface (SLI) manifests, one for each dataset. These manifests map the data columns to the logical data model in your project.

The SLI manifests are automatically generated by the GoodData platform. You are going to obtain them using the API.

The following procedure assumes that you access your projects at https://secure.gooddata.com/.

If you are a white-labeled customer, replace secure.gooddata.com with your white-labeled domain in the procedure steps.

Steps:

  1. Use the API for obtaining SLI manifests to retrieve the manifest for each dataset that you want to update:
    • API resource: https://secure.gooddata.com/gdc/md/{project_id}/ldm/singleloadinterface/{dataset}/manifest/
    • Method: GET
    • Request headers:

      Content-Type: application/json
      Accept: application/json

    The API returns the SLI manifest for the specified dataset. The following is an example of the SLI manifest for a dataset called 'dataset.quotes':

    {
      "dataSetSLIManifest": {
        "parts": [
          {
            "columnName": "id",
            "mode": "FULL",
            "populates": ["label.quotes.id"],
            "referenceKey": 1
          },
          {
            "columnName": "company",
            "mode": "FULL",
            "populates": ["label.quotes.symbol.company"]
          },
          {
            "columnName": "symbol",
            "mode": "FULL",
            "populates": ["label.quotes.symbol"],
            "referenceKey": 1
          },
          {
            "columnName": "sector",
            "mode": "FULL",
            "populates": ["label.quotes.sector"],
            "referenceKey": 1
          },
          {
            "columnName": "industry",
            "mode": "FULL",
            "populates": ["label.quotes.industry"],
            "referenceKey": 1
          },
          {
            "columnName": "market",
            "mode": "FULL",
            "populates": ["label.quotes.market"],
            "referenceKey": 1
          },
          {
            "columnName": "quote_date",
            "mode": "FULL",
            "populates": ["quote.date.mdyy"],
            "constraints": {"date": "yyyy-MM-dd"},
            "referenceKey": 1
          },
          {
            "columnName": "open_price",
            "mode": "FULL",
            "populates": ["fact.quotes.open_price"]
          },
          {
            "columnName": "high_price",
            "mode": "FULL",
            "populates": ["fact.quotes.high_price"]
          },
          {
            "columnName": "low_price",
            "mode": "FULL",
            "populates": ["fact.quotes.low_price"]
          },
          {
            "columnName": "close_price",
            "mode": "FULL",
            "populates": ["fact.quotes.close_price"]
          },
          {
            "columnName": "volume",
            "mode": "FULL",
            "populates": ["fact.quotes.volume"]
          },
          {
            "columnName": "adjusted_close_price",
            "mode": "FULL",
            "populates": ["fact.quotes.adjusted_close_price"]
          },
          {
            "columnName": "quote_date_dt",
            "mode": "FULL",
            "populates": ["dt.quotes.quote_date"]
          }
        ],
        "file": "data.csv",
        "dataSet": "dataset.quotes"
      }
    }
  2. Store the obtained SLI manifests for later use.
  3. Do one of the following:
    • If you have only one CSV file to upload, name the obtained SLI manifest file upload_info.json, and proceed to Step 6.
    • If you have two or more CSV files to upload, proceed to Step 4.
  4. Create a JSON file, and name it upload_info.json. Copy the following JSON structure into the file:

    {
      "dataSetSLIManifestList" : [ ]
    }
  5. Insert the content of the stored SLI manifests into the dataSetSLIManifestList array.
    Your upload_info.json file should look similar to the following:

    {
      "dataSetSLIManifestList" : [
        {
          "dataSetSLIManifest": {
            "parts": [
              {
                "columnName": "team_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_team.id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "name",
                "mode": "FULL",
                "populates": [
                  "label.dim_team.name"
                ]
              }
            ],
            "file": "dim_team.csv",
            "dataSet": "dataset.dim_team"
          }
        },
        {
          "dataSetSLIManifest": {
            "parts": [
              {
                "columnName": "assignee_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "team_id",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.team_id"
                ],
                "referenceKey": 1
              },
              {
                "columnName": "name",
                "mode": "FULL",
                "populates": [
                  "label.dim_assignee.name"
                ]
              }
            ],
            "file": "dim_assignee.csv",
            "dataSet": "dataset.dim_assignee"
          }
        }
      ]
    }
  6. Review the content of your upload_info.json file and modify it when needed:
    • If an attribute has multiple labels assigned in the logical data model, select the label that represents the attribute whose values you want to update.
      "referenceKey": 1 specifies that the primary identifier of the attribute column will be used.

      {
        "columnName": "team_id",
        "mode": "FULL",
        "populates": [
          "label.dim_assignee.team_id"
        ],
        "referenceKey": 1
      }

    • To upload data incrementally, use "mode": "INCREMENTAL".

      {
        "columnName": "name",
        "mode": "INCREMENTAL",
        "populates": [
          "label.dim_team.name"
        ]
      }
    • For the columns that are related to the predefined date dimensions (provided by GoodData by default) or fiscal date dimensions, manually include a date constraint key with the appropriate date format. For example:

      "constraints": {"date": "yyyy-MM-dd"}

      Here is an example of a column related to the date dimension with the constraint key:

      {
        "columnName": "quote_date",
        "mode": "FULL",
        "populates": ["quote.date.mdyy"],
        "constraints": {"date": "yyyy-MM-dd"},
        "referenceKey": 1
      }

      If you do not include a date constraint key, you will see the following error message:

      Missing date constraint for column...

      Do not use the constraint key for columns for any other purpose (for example, for columns related to generic attributes).

  7. Save the changes.
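Steps 4 and 5 can be sketched in Python. This is an illustration only; the dataset and label names are taken from the example above, and the manifests are abbreviated to a single column each:

```python
import json

def build_upload_info(manifests):
    """Wrap stored dataSetSLIManifest objects in the list structure
    that upload_info.json expects when loading multiple CSV files."""
    return {"dataSetSLIManifestList": manifests}

# Two stored manifests, abbreviated to a single column each:
team = {"dataSetSLIManifest": {
    "parts": [{"columnName": "team_id", "mode": "FULL",
               "populates": ["label.dim_team.id"], "referenceKey": 1}],
    "file": "dim_team.csv", "dataSet": "dataset.dim_team"}}
assignee = {"dataSetSLIManifest": {
    "parts": [{"columnName": "assignee_id", "mode": "FULL",
               "populates": ["label.dim_assignee.id"], "referenceKey": 1}],
    "file": "dim_assignee.csv", "dataSet": "dataset.dim_assignee"}}

# Save this output as upload_info.json next to the CSV files:
print(json.dumps(build_upload_info([team, assignee]), indent=2))
```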

Prepare CSV Files with Data to Load

Prepare CSV files with the data that you want to upload. Wherever possible, use a tool or a library to generate your CSV files. Avoid generating them manually.

CSV File Requirements

Verify that your CSV files meet the requirements listed in the following sections.

File Format and Name

  • The files contain plain text in the UTF-8 format or in a UTF-8 compliant format (for example, ISO-8859-1).
  • Each file has a header row, which appears as the first row of the file.
  • File names are consistent.
  • Keys are used consistently across all the files.

File Structure

  • All rows have the same number of fields (columns).
  • All rows within a column are formatted consistently.
  • Each field in a row contains a value (no blank cells).
  • The column names and column order are the same as in the SLI manifest.
  • The file header row contains the name of each column; it is formatted the same way and contains the same number of fields as the other rows.
  • Fields in a row are separated by commas (,), semicolons (;), vertical bars (|), or other standard delimiters.
  • There are no spaces between values.
    This is correct:

    ["User_ID","Name"]

    This is incorrect (note the extra space):

    ["User_ID", "Name"]
  • Each row ends with a line break character: CRLF or LF. You can use both CRLF and LF within one file. The last row in a file may or may not have an ending line break.

    field1,field2,field3<LF>
    field4,field5,field6<CRLF>
    field7,field8,field9<LF>
    field10,field11,field12

    On a Mac, save your CSV files as “Windows Comma Separated (.csv)” files because Macs may use an unsupported end-of-line character.

  • The files do not contain:
    • Empty rows at the end of the file
    • Delimiters after the last field in a row
    • Delimiters (including tabs or newlines) at the end of the last row
    • End-of-file (EOF) delimiters
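A CSV library applies most of these rules automatically, which is why generating files with a tool is recommended over writing them by hand. A minimal Python sketch using the standard csv module:

```python
import csv
import io

# Write rows the way the requirements describe: a header row first,
# comma delimiters with no surrounding spaces, LF line endings, and
# quoting applied only where needed (csv.QUOTE_MINIMAL).
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", lineterminator="\n",
                    quoting=csv.QUOTE_MINIMAL)
writer.writerow(["User_ID", "Name"])   # header row
writer.writerow(["1", "Alice"])
writer.writerow(["2", "Smith, Bob"])   # embedded comma gets quoted

print(buf.getvalue())
```

The field with an embedded comma is quoted automatically, and no spaces are inserted after delimiters.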

Date Format

  • Dates are formatted consistently across all the files.
  • Date values are between the years 1900 and 2050.
  • Dates are specified in one of the following formats:
    • yyyy/mm/dd
    • dd/mm/yyyy

    For CSV Uploader (see Add Data from a File to a Project), the only accepted date format for the ETL to process is yyyy-MM-dd. Each date must be converted to the yyyy-MM-dd format before being processed. Time entries appended to the yyyy-MM-dd format may be submitted to CSV Uploader in the following form: hh:mm:ss
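Converting dates to the yyyy-MM-dd format can be sketched with a small helper (hypothetical; the source format is whatever your data uses):

```python
from datetime import datetime

def to_gooddata_date(value, source_format="%d/%m/%Y"):
    """Convert a date string from a source format to yyyy-MM-dd."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

print(to_gooddata_date("31/01/2024"))               # dd/mm/yyyy -> 2024-01-31
print(to_gooddata_date("2024/01/31", "%Y/%m/%d"))   # yyyy/mm/dd -> 2024-01-31
```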

Number Format

  • Numbers are in the plain #######.## format.
  • Numbers do not contain:
    • Thousands separator commas (,)
    • Currency symbols (for example, $)
    • Percentage signs (%)
    • Scientific notation (for example, 12E3)
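Pre-cleaning numbers before export can be sketched as follows (a hypothetical helper, not part of the platform; it strips the disallowed characters listed above):

```python
def normalize_number(raw):
    """Strip thousands-separator commas, currency symbols, and percent
    signs so a value matches the plain #######.## format; float()
    expands scientific notation such as 12E3."""
    cleaned = raw.strip().lstrip("$").rstrip("%").replace(",", "")
    number = format(float(cleaned), "f")    # expands 12E3 -> 12000.000000
    return number.rstrip("0").rstrip(".")   # drop trailing zeros and dot

print(normalize_number("1,234.56"))   # 1234.56
print(normalize_number("$99.90"))     # 99.9
print(normalize_number("12E3"))       # 12000
```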

Quotation Marks

  • Quotation marks are used consistently.
  • Quotation marks enclose the following fields:
    • Fields with embedded commas

      field1,"field2 with a comma, and a space",field3
    • Fields with embedded line breaks

      field1,field2,field3
      "field4 with
      line breaks",field5,field6
    • Fields with embedded double-quote characters (each embedded double quote is escaped as a pair of double-quote characters):

      field1,"field2 with ""double quotes""",field3
    • Fields with text strings

      field1,"field2 is a string.",field3

Processing Specific Characters

  • Double commas are converted to blank fields.
    This row:

    1,2,,

    will be parsed as the following:

    "1","2","",""
  • Leading and trailing spaces are not trimmed.
    This row:

    field1,    field2_with_leading_trailing_space    ,field3

    will be parsed as the following:

    field1,"    field2_with_leading_trailing_space    ",field3
  • The null character and the null (SQL) marker are not supported.
    This row:

    NULL,field1,field2

    will be parsed as the following:

    "NULL","field1","field2"

Transfer the CSV Files to User-Specific Storage

Steps:

  1. Create a directory in the storage (see User Specific Data Storage).

    curl -i -X MKCOL -v https://username%40email.com:PASSWORD@secure-di.gooddata.com/uploads/new-directory/
  2. Zip the SLI manifest (the upload_info.json file) and the CSV files with data to load into a single archive file. Name the archive file upload.zip.

    Avoid putting dynamic fields or directory names inside the zip file.

    On a Mac, do not use the “Compress File” option to create a zip file for upload because it adds some hidden files to the archive. Use command line utilities instead.

  3. Upload the archive file to the newly created directory.

    curl -i -X PUT --data-binary @upload.zip -v https://user.name%40gooddata.com:PASSWORD@secure-di.gooddata.com/uploads/new-directory/upload.zip

If you do not want to or cannot use a zip file archiver for any reason, you can put the SLI manifest (the upload_info.json file) and the CSV files to your storage directory without zipping them. For these files to be processed correctly, they must be placed in the same directory.
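Creating the archive programmatically (instead of with a desktop archiver) can be sketched in Python; the file contents here are placeholders, and you would then PUT the resulting bytes as in step 3:

```python
import io
import zipfile

# Build upload.zip in memory: upload_info.json and the CSV files sit
# flat at the archive root (no directory structure), per step 2.
payloads = {
    "upload_info.json": b'{"dataSetSLIManifestList": []}',
    "dim_team.csv": b"team_id,name\n1,Core\n",
}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in payloads.items():
        zf.writestr(name, data)  # writestr keeps entries at the archive root

with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # entries are flat, with no leading directories
```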

Run a Load Task

The Single Load Interface (SLI) handles the loading of each dataset with the appropriate CSV file, as specified in the manifest.

The loading process consists of two phases:

  1. The first phase is common for all datasets. An error during this phase fails the load for all datasets.
  2. The second phase processes each dataset independently. Multiple datasets can be processed in parallel. An error encountered with a specific dataset does not affect the loading of the other datasets.

The following procedure assumes that you access your projects at https://secure.gooddata.com/.

If you are a white-labeled customer, replace secure.gooddata.com with your white-labeled domain in the procedure steps.

Steps:

  1. Use the API for starting data upload to load the data from the user-specific storage to your project:
    • API resource: https://secure.gooddata.com/gdc/md/{project_id}/etl/pull2
    • Method: POST
    • Request body:

      {"pullIntegration": "new-directory"}
    • Request headers:

      Content-Type: application/json
      Accept: application/json

    The task runs, and the link for polling for data loading status is returned.

  2. Poll for the status until the OK task status is returned.
  3. Log in to your project, and verify that the data has been uploaded.
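The polling in step 2 can be sketched as follows. This is a hypothetical helper: the status-fetching function is injected so the loop can be shown without network access; in practice it would GET the returned polling link and extract the task status string, and the platform may report states beyond the ones shown here:

```python
import time

def poll_until_ok(get_status, interval=5.0, max_attempts=60):
    """Poll the status link returned by /gdc/md/{project_id}/etl/pull2
    until the task reports OK. Raises on ERROR or timeout."""
    for _ in range(max_attempts):
        status = get_status()
        if status == "OK":
            return True
        if status == "ERROR":
            raise RuntimeError("data load failed")
        time.sleep(interval)
    raise TimeoutError("data load did not finish in time")

# Example with a stubbed status sequence:
statuses = iter(["RUNNING", "RUNNING", "OK"])
print(poll_until_ok(lambda: next(statuses), interval=0))  # True
```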

Tips for Uploading CSV Files Using CloudConnect

Define Metadata

If you are using the CSV Reader component to read a CSV file in your CloudConnect process, configure the fields that are passed between components of the ETL graph. To do so, define the metadata between the CSV Reader component and the next component in the ETL graph.

Steps:

  1. In your ETL graph, create a connection between the output port of the CSV Reader component and the input port of the next component in the graph.
  2. Double-click the connecting line.
    The metadata editor dialog appears.
  3. Configure the fields and their format.
    The following picture shows how to configure delimiters in the file for end-of-line and end-of-file.

  4. Save the changes.
  5. Publish your logical data model to the server (see Publishing a Logical Data Model).

Use Batch Upload

The following procedure applies to you only if your token was created before January 16, 2016. If your token was created on January 16, 2016, or later, batch mode is already set as default for you. For more details, see Data Loading Modes in CloudConnect.

To enable greater overall throughput for your ETL, use batch mode. Batch mode groups all dataset uploads from a single CloudConnect graph execution into one batch that is executed directly on the GoodData data loading infrastructure. The CloudConnect worker can then move on to the next graph execution rather than waiting for each dataset to be loaded separately before it is freed up.

To enable batch mode, add the GDC_USE_BATCH_SLI_UPLOAD parameter to the process schedule.

Steps:

  1. From Data Integration Console (see Accessing Data Integration Console), click Projects to open the Projects page.
  2. Click the name of the project that the process schedule belongs to.
  3. Locate the process for which you want to enable batch mode, and click the schedule name.
    The schedule details dialog opens.
  4. Click Add parameter.
  5. For the name of the parameter, enter GDC_USE_BATCH_SLI_UPLOAD. For the parameter value, enter TRUE.
  6. Click Save.
    Batch mode is set for the schedule.
  7. Run the graph to validate.
  • If a data loading process fails in the middle of the running graph, no data is loaded.
  • A finished Dataset Writer component does not mean that the data load finished successfully. The process waits until all dataset writers are finished, and then the data is loaded.