You are viewing our older product's guide. Click here for the documentation of GoodData Cloud, our latest and most advanced product.

Loading Data via REST API

This article describes how to load data from CSV files to your project in a single task instead of loading the CSV files one by one. Loading multiple CSV files at once helps you save time if your workspace uses incremental data loading, requires a long time to synchronize data, or contains many datasets (multiple data load tasks are integrated into a single task, and you do not have to start a data load task for each dataset).

Loading data via REST API consists of the following steps:

Prepare the SLI manifest.
Prepare CSV files with data to load.
Transfer the CSV files to user-specific storage.
Run a load task.

If you want to delete some old or obsolete data from a dataset while loading new data to the same or different dataset, see Delete Old Data while Loading New Data to a Dataset via API.

Prepare the SLI Manifest

Prepare Single Load Interface (SLI) manifests, one for each dataset. These manifests map the data columns to the logical data model in your workspace.

The SLI manifests are automatically generated by the GoodData platform. You are going to obtain them using the API.

The following procedure assumes that you access your workspaces at https://secure.gooddata.com/.

If you are a white-labeled customer, replace secure.gooddata.com with your white-labeled domain in the procedure steps.

In GoodData, terms workspace and project denote the same entity. For example, project ID is exactly the same as workspace ID. See Find Workspace ID.

Steps:

Use the API for obtaining SLI manifests to obtain the SLI manifests for each dataset you want to update:

API resource: https://secure.gooddata.com/gdc/md/{workspace_id}/ldm/singleloadinterface/{dataset}/manifest/
Method: GET

Request headers:

Content-Type: application/json
Accept: application/json

The API returns the SLI manifest for the specified dataset. The following sample is an example of the SLI manifest for the dataset called “dataset.quotes”:

{
  "dataSetSLIManifest": {
    "parts": [
      {
        "columnName": "id",
        "mode": "FULL",
        "populates": ["label.quotes.id"],
        "referenceKey": 1
      },
      {
        "columnName": "company",
        "mode": "FULL",
        "populates": ["label.quotes.symbol.company"]
      },
      {
        "columnName": "symbol",
        "mode": "FULL",
        "populates": ["label.quotes.symbol"],
        "referenceKey": 1
      },
      {
        "columnName": "sector",
        "mode": "FULL",
        "populates": ["label.quotes.sector"],
        "referenceKey": 1
      },
      {
        "columnName": "industry",
        "mode": "FULL",
        "populates": ["label.quotes.industry"],
        "referenceKey": 1
      },
      {
        "columnName": "market",
        "mode": "FULL",
        "populates": ["label.quotes.market"],
        "referenceKey": 1
      },
      {
        "columnName": "quote_date",
        "mode": "FULL",
        "populates": ["quote.date.mdyy"],
        "constraints": {"date": "yyyy-mm-dd"},
        "referenceKey": 1
      },
      {
        "columnName": "open_price",
        "mode": "FULL",
        "populates": ["fact.quotes.open_price"]
      },
      {
        "columnName": "high_price",
        "mode": "FULL",
        "populates": ["fact.quotes.high_price"]
      },
      {
        "columnName": "low_price",
        "mode": "FULL",
        "populates": ["fact.quotes.low_price"]
      },
      {
        "columnName": "close_price",
        "mode": "FULL",
        "populates": ["fact.quotes.close_price"]
      },
      {
        "columnName": "volume",
        "mode": "FULL",
        "populates": ["fact.quotes.volume"]
      },
      {
        "columnName": "adjusted_close_price",
        "mode": "FULL",
        "populates": ["fact.quotes.adjusted_close_price"]
      },
      {
        "columnName": "quote_date_dt",
        "mode": "FULL",
        "populates": ["dt.quotes.quote_date"]
      }
    ],
    "file": "data.csv",
    "dataSet": "dataset.quotes"
  }
}

Store the obtained SLI manifests for later use.
Do one of the following:
- If you have only one CSV file to upload, name the obtained SLI manifest file upload_info.json, and proceed to Step 6.
- If you have two or more CSV files to upload, proceed to Step 4.
Create a JSON file, and name it upload_info.json. Copy the following JSON structure into the file:
```
{
  "dataSetSLIManifestList" : [ ]
}
```

Insert the content of the stored SLI manifests into the dataSetSLIManifestList array. Your upload_info.json file should look similar to the following:

{
  "dataSetSLIManifestList" : [
    {
      "dataSetSLIManifest": {
        "parts": [
          {
            "columnName": "team_id",
            "mode": "FULL",
            "populates": [
              "label.dim_team.id"
            ],
            "referenceKey": 1
          },
          {
            "columnName": "name",
            "mode": "FULL",
            "populates": [
              "label.dim_team.name"
            ]
          }
        ],
        "file": "dim_team.csv",
        "dataSet": "dataset.dim_team"
      }
    },
    {
      "dataSetSLIManifest": {
        "parts": [
          {
            "columnName": "assignee_id",
            "mode": "FULL",
            "populates": [
              "label.dim_assignee.id"
            ],
            "referenceKey": 1
          },
          {
            "columnName": "team_id",
            "mode": "FULL",
            "populates": [
              "label.dim_assignee.team_id"
            ],
            "referenceKey": 1
          },
          {
            "columnName": "name",
            "mode": "FULL",
            "populates": [
              "label.dim_assignee.name"
            ]
          }
        ],
        "file": "dim_assignee.csv",
        "dataSet": "dataset.dim_assignee"
      }
    }
  ]
}

Review the content of your upload_info.json file and modify it when needed:
- The attributes have multiple labels assigned in the logical data model. "referenceKey": 1 specifies that the primary identifier of an attribute column will be used.
```
{
  "columnName": "team_id",
  "mode": "FULL",
  "populates": [
    "label.dim_assignee.team_id"
  ],
  "referenceKey": 1
}
```
  Select the label that represents the attribute to update its values.
- To upload data incrementally, use "mode": "INCREMENTAL".
```
{
  "columnName": "name",
  "mode": "INCREMENTAL",
  "populates": [
    "label.dim_team.name"
  ]
}
```
- For the columns that are related to the predefined date dimensions (provided by GoodData by default) or fiscal date dimensions, manually include a date constraint key with the appropriate date format. For example:
```
"constraints": {"date": "yyyy-mm-dd"}
```
  Here is an example of a column related to the date dimension with the constraint key:
```
{
  "columnName": "quote_date",
  "mode": "FULL",
  "populates": ["quote.date.mdyy"],
  "constraints": {"date": "yyyy-mm-dd"},
  "referenceKey": 1
}
```
  If you do not include a date constraint key, you will see the following error message:
```
Missing date constraint for column...
```
  Do not use the constraint key for columns for any other purpose (for example, for columns related to generic attributes).
Save the changes.

Prepare CSV Files with Data to Load

Prepare CSV files with the data that you want to upload. Wherever possible, use a tool or a library to generate your CSV files. Avoid generating them manually.

CSV File Requirements

Verify that your CSV files meet the requirements listed in the following sections.

File Format and Name

The files contain plain text in the UTF-8 format or in a UTF-8 compliant format (for example, ASCII).
Each file has the header row, which appears as the first row of the file.
File names are consistent.
Keys are used consistently across all the files.

File Structure

All rows have the same number of fields (columns).
All rows within a column are formatted consistently.
Each field in a row contains a value (no blank cells).
The order of the columns and column names are preserved.
The columns have the same name and order as in the SLI manifest.
The file header row contains the name of each column, is formatted the same way and contains the same number of fields as the other rows.
Fields in a row are separated by commas (,).
There are no spaces between values. This is correct:
```
field1,field2,field3
```
This is incorrect (note the extra spaces):
```
field1, field2, field3
```
Each row ends with a line break character: CRLF or LF. You can use both CRLF and LF within one file. The last field in a file may or may not have an ending line break.
```
field1,field2,field3<LF>
field4,field5,field6<CRLF>
field7,field8,field9<LF>
field10,field11,field12
```
On a Mac, save your CSV files as “Windows Comma Separated (.csv)” files because Macs may use an unsupported end of line character.
The files do not contain:
- Empty trailing rows
- Empty rows at the end of the file
- Delimiters after the last field in a row
- Delimiters (including tabs or newlines) at the end of the last row
- End-of-file (EOF) delimiters

Date Format

Dates are formatted consistently across all the files.
Date values are between the years 1900 and 2050.
Dates are specified in one of the following formats:
- yyyy/mm/dd
- dd/mm/yyyy

Number Format

Numbers are in the plain #######.## format.
Numbers do not contain:
- Thousands separator commas (,)
- Currency symbols (for example, $)
- Percentage signs (%)
- Scientific notations (for example, 12E3)

Quotation Marks

The quotation marks are used consistently.
The quotation marks are used in the following fields:
- Fields with embedded commas
```
field1,"field2 with a comma, and a space",field3
```
- Fields with embedded line breaks
```
field1,field2,field3
"field4 with
line breaks",field5,field6
```
- Fields with embedded double-quote characters (a pair of double-quote characters):
```
field1,"field2 with ""double quotes""",field3
```
- Fields with text strings
```
field1,"field2 is a string.",field3
```

Processing Specific Characters

Double commas are converted to blank fields. This row:
```
1,2,,
```
will be parsed as the following:
```
"1","2","",""
```

Leading and trailing spaces are not trimmed. This row:

field1,    field2_with_leading_trailing_space    ,field3

will be parsed as the following:

field1,"    field2_with_leading_trailing_space    ",field3

The null character and the null (SQL) marker are not supported. This row:
```
NULL,field1,field2
```
will be parsed as the following:
```
"NULL","field1","field2"
```

Transfer the CSV Files to User-Specific Storage

Steps:

Create a directory in your user-specific storage. To know the server URL to connect to, see User Specific Data Storage.
```
curl -i -X MKCOL -v https://username%40email.com:PASSWORD@secure.gooddata.com/gdc/uploads/new-directory/
```
Zip the SLI manifest (the upload_info.json file) and the CSV files with data to load into a single archive file. Name the archive file “upload.zip”.
Avoid putting dynamic fields or directory names inside of the zip file.
On a Mac, do not use the “Compress File” option to create a zip file for upload because it adds some hidden files to the archive. Use command line utilities instead.

Put the archive file to the newly created directory.

curl -i -X PUT --data-binary @upload.zip -v https://user.name%40gooddata.com:PASSWORD@secure.gooddata.com/gdc/uploads/new-directory/upload.zip

If you do not want to or cannot use a zip file archiver for any reason, you can put the SLI manifest (the upload_info.json file) and the CSV files to your storage directory without zipping them. For these files to be processed correctly, they must be located in the same location.

Run a Load Task

The Single Load Interface (SLI) handles the loading of each dataset with the appropriate CSV file, as specified in the manifest.

The loading process consists of two phases:

The first phase is common for all datasets. An error during this phase causes crash for all datasets.
The second phase processes each dataset independently. Multiple datasets can be processed in parallel. An error encountered with a specific dataset does not affect the loading of the other datasets.

The following procedure assumes that you access your workspaces at https://secure.gooddata.com/.

If you are a white-labeled customer, replace secure.gooddata.com with your white-labeled domain in the procedure steps.

Steps:

Use the API for starting data upload to load the data from the user-specific storage to your workspace:
- API resource: https://secure.gooddata.com/gdc/md/{workspace_id}/etl/pull2
- Method: POST
- Request body:
```
{"pullIntegration": "new-directory"}
```
- Request headers:
```
Content-Type: application/json
Accept: application/json
```
The task runs, and the link for polling for data loading status is returned.
Poll for the status until the OK task status is returned.
Log in to your workspace, and verify that the data has been uploaded.

Tips for Uploading CSV Files Using CloudConnect

Define Metadata

If you are using the CSV Reader component to read a CSV file in your CloudConnect process, configure the fields that are passed between components of the ETL graph. To do so, define the metadata between the CSV Reader component and the next component in the ETL graph.

Steps:

In your logical data model, create a connection between the output port of the CSV Reader component and the input port of the next component in the graph.
Double-click the connecting line. The metadata editor dialog appears.
Configure the fields and their format. The following picture shows how to configure delimiters in the file for end-of-line and end-of-file.

Save the changes.
Publish your logical data model to the server (see Publishing Logical Data Models in CloudConnect).

Use Batch Upload

The following procedure applies to you only if your token was created before January 16, 2016. If your token was created on January 16, 2016, or later, batch mode is already set as default for you. For more details, see Data Loading Modes in CloudConnect.

To enable greater overall throughput for your ETL, use batch mode. Batch mode groups all dataset uploads from a single CloudConnect graph execution into one batch to be executed directly on the GoodData data loading infrastructure. This enables the CloudConnect worker to move on to the next graph execution rather than waiting for all datasets to be separately, independently loaded before the CloudConnect worker is freed up.

To enable batch mode, add the GDC_USE_BATCH_SLI_UPLOAD parameter to the process schedule.

Steps:

From Data Integration Console (see Accessing Data Integration Console), click Projects to open the Projects page.
Click the name of the workspace (project) that the process schedule belongs to.
Locate the process for which you want to enable batch mode, and click the schedule name. The schedule details dialog opens.
Click Add parameter.
For the name of the parameter, enter GDC_USE_BATCH_SLI_UPLOAD. For the parameter value, enter TRUE.
Click Save. Batch mode is set for the schedule.
Run the graph to validate.

If a data loading process fails in the middle of the running graph, no data is loaded.
A finished Dataset Writer component does not mean that the data load finished successfully. The process waits until all dataset writers are finished, and then the data is loaded.

GoodData IP Addresses and IP Whitelisting

Delete Old Data while Loading New Data to a Dataset via API