CSV Downloader

CSV Downloader downloads CSV files, GZIPed and BZIPed CSV files, and Excel files from one of the following remote locations:

  • S3
  • WebDAV
  • SFTP
  • Google Cloud Storage
  • OneDrive

For information about how to prepare source files (format, structure) and additional details, see Input Data for CSV Downloader.

Configuration File

Creating and setting up the brick's configuration file is the third phase of building your data pipeline. For more information, see Phases of Building the Data Pipeline.

For more information about configuring a brick, see Configure a Brick.

Minimum Layout of the Configuration File

The following JSON sample is the minimum layout of the configuration file that CSV Downloader can use. If you use multiple bricks, you can reuse some sections of this configuration file for the other bricks. For example, the 'downloaders' section can list additional downloaders, and the 'integrators' and 'ads_storage' sections can be reused by other downloaders and executors (see Configure a Brick).

This sample describes the following use case:

  • One instance of CSV Downloader downloads two CSV files from a location on S3.
  • You provide your own manifest files for the downloaded CSV files (see Manifest File).
  • CSV Downloader generates a feed file (see Feed File).

Copy the sample and replace the placeholders with your values.

You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).

{  
  "entities": {
    "table_1_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    },
    "table_2_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    }
  },
  "downloaders": {
    "csv_downloader_id": {
      "type": "csv",
      "entities": ["table_1_name","table_2_name"]
    }
  },
  "csv": {
    "type": "s3",
    "options": {
      "bucket": "s3_bucket_name",
      "access_key": "s3_access_key",
      "folder": "path/to/manifest/file"
    }
  },
  "integrators": {
    "ads_integrator_id": {
      "type": "ads_storage",
      "batches": ["csv_downloader_id"]
    }
  },
  "ads_storage": {
    "instance_id": "data_warehouse_instance_id",
    "username": "dw_email@address.com",
    "options": {}
  }
}


The placeholders to replace with your values:

entities -> table_x_name (string, mandatory)
  The database tables into which CSV Downloader downloads the data

entities -> table_x_name -> global -> custom -> hub (array, mandatory)
  The table columns used to generate a primary key

downloaders -> csv_downloader_id (string, mandatory)
  The ID of the CSV Downloader instance
  Example: csv_downloader_1

downloaders -> csv_downloader_id -> entities (array, mandatory)
  The database tables into which CSV Downloader downloads the data
  Must be the same as the database table names in 'entities -> table_x_name'

csv -> options -> bucket (string, mandatory)
  The name of the S3 bucket where the source data is stored

csv -> options -> access_key (string, mandatory)
  The access key to the S3 bucket where the source data is stored

csv -> options -> folder (string, mandatory)
  The path to the folder where the manifest files are located
  Example: prod/files/manifests/
  Recommended: {SOURCE_ROOT}/manifest/

integrators -> ads_integrator_id (string, mandatory)
  The ID of the ADS Integrator instance
  Example: ads_integrator_1

integrators -> ads_integrator_id -> batches (array, mandatory)
  The ID of the CSV Downloader instance
  Example: csv_downloader_1

ads_storage -> instance_id (string, mandatory)
  The ID of the Agile Data Warehousing Service (ADS) instance into which the data is uploaded

ads_storage -> username (string, mandatory)
  The access username to the Agile Data Warehousing Service (ADS) instance into which the data is uploaded

Customize the Configuration File

You can change default values of the parameters and set up additional parameters.


Use an SFTP location

To use an SFTP location, set the 'type' parameter to 'sftp', and specify SFTP parameters under 'csv' -> 'options'.

"csv": {
  "type": "sftp",
  "options": {
    "username": "sftp_email@address.com",
    "host": "sftp_hostname",
    "auth_mode": "password|cert",
    "port": sftp_port_number // optional
  }
}
username (string, mandatory)
  The username to access the SFTP server to download the data from

host (string, mandatory)
  The host name of the SFTP server to download the data from
  Example: ftp.example.com

auth_mode (string, mandatory)
  The mode used to authenticate to the SFTP server from which the data is downloaded
  Available options:
  • password
  • cert

port (integer, optional, default: 22)
  The port on which the SFTP server is running
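
For certificate-based authentication, you also provide the path to the certificate via the 'cert_path' parameter (shown in Add multiple instances of CSV Downloader); the private key and its passphrase are entered as secure schedule parameters (see Schedule Parameters). A minimal sketch of such a section, with placeholder values:

"csv": {
  "type": "sftp",
  "options": {
    "username": "sftp_email@address.com",
    "host": "sftp_hostname",
    "auth_mode": "cert",
    "cert_path": "path/to/certificate"
  }
}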


Use a WebDAV location

To use a WebDAV location, set the 'type' parameter to 'webdav', and specify WebDAV parameters under 'csv' -> 'options'.

"csv": {
  "type": "webdav",
  "options": {
    "username": "webdav_email@address.com",
    "host": "webdav_hostname",
    "default_folder": "path/to/data" // optional
  }
}
username (string, mandatory)
  The username to access the WebDAV server to download the data from

host (string, mandatory)
  The host name of the WebDAV server to download the data from
  Example: https://secure-di.gooddata.com

default_folder (string, optional, default: root folder)
  The path to the data on the WebDAV server

Use a Google Cloud Storage location

To use a Google Cloud Storage location, set the 'type' parameter to 'google_cloud_storage', and specify Google Cloud Storage parameters under 'csv' -> 'options'.

"csv": {
  "type": "google_cloud_storage",
  "options": {
    "auth": {
      <content_from_client_secrets.json>
    },
    "bucket": "bucket_name"
  }
}
auth (JSON, mandatory)
  The content of client_secrets.json
  Example:

  "auth": {
    "type": "service_account",
    "project_id": "12345",
    "private_key_id": "54321",
    "client_email": "email@address.com",
    "client_id": "67890",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://accounts.google.com/o/oauth2/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/09876"
  }

bucket (string, mandatory)
  The name of the bucket where the source data is stored

Use a OneDrive location

To use a OneDrive location, set the 'type' parameter to 'one_drive', and specify OneDrive parameters under 'csv' -> 'options'.

"csv": {
  "type": "one_drive",
  "options": {
    "client_id": "onedrive_client_id"
  }
}
client_id (string, mandatory)
  The OneDrive client ID

Use link files

By default, CSV Downloader downloads data from a source and saves it to BDS. If the files to download are big, this process can be time-consuming. In this case, you can create a link file on BDS pointing to the source files instead of downloading the actual data.

You can use the link files only for CSV, GZIPed CSV or BZIPed CSV source files that are located on S3. You cannot use link files for Excel files.

When you use link files, you must do the following:

  • Make sure that the source files remain in their location and are not deleted or modified; otherwise, errors may occur.
  • Use a feed file and a manifest file. See Feed File and Manifest File.
  • Set the ADS Integrator instance that processes CSV Downloader to use link files. See 'Use link files' in ADS Integrator.


"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "access_key",
    "use_link_file": true|false
    }
  }
}
use_link_file (Boolean, optional, default: false)
  Specifies how to handle big source CSV files
  • If not set or set to 'false', the files are downloaded from the source.
  • If set to 'true', the files are not downloaded, and CSV Downloader only creates a link file on BDS with links to these files.

Add optional parameters to the S3 location

Depending on your business and technical requirements, you can set one or more optional parameters for your S3 location.

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "access_key",
    "server_side_encryption": true|false, // optional
    "region": "s3_bucket_region" // optional
    }
  }
}
server_side_encryption (Boolean, optional, default: false)
  Specifies whether CSV Downloader can work with an S3 location where server-side encryption is enabled
  • If not set or set to 'false', CSV Downloader cannot work with a server-side encrypted S3 location.
  • If set to 'true', CSV Downloader can work with a server-side encrypted S3 location.

region (string, optional, default: us-east-1)
  The region of the S3 bucket where the source data is stored
  CSV Downloader can auto-detect the region, but we recommend setting it explicitly.


Add multiple instances of CSV Downloader

When you use only one instance of CSV Downloader, it reads its remote location parameters from the 'csv' section.

If you use multiple instances of CSV Downloader and each instance has a separate location to download data from, add several 'csv' sections and give each section a unique name (for example, 'csv_users' or 'csv_sales').

"csv_users": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "s3_access_key"
  }
},
"csv_sales": {
  "type": "sftp",
  "options": {
    "username": "sftp_email@address.com",
    "host": "sftp_hostname",
    "auth_mode": "password|cert",
    "cert_path": "path/to/certificate" // only when auth_mode=cert
  }
}


For each instance of CSV Downloader in the 'downloaders' section, add the 'settings' parameter and set it to the name of the section that contains settings for this CSV Downloader instance.

"downloaders": {
  "csv_downloader_1_id": {
    "type": "csv",
    "settings": "csv_users",
    "entities": ["table_1_name","table_2_name"]
  },
  "csv_downloader_2_id": {
    "type": "csv",
    "settings": "csv_sales",
    "entities": ["table_1_name","table_2_name"]
  }
}


Describe the structure of the source CSV files

You can set the parameters describing the structure of the source CSV files.

"csv": {
  "type": "s3",
  "options": {
    "files_structure": {
      "skip_rows": number_of_rows, // optional
      "column_separator": "separator_character", // optional
      "record_terminator": "record_termination_character", // optional
      "escape_as": "escape_character", // optional
      "file_format": "gzip|xlsx|no_value", // optional
      "db_parser": "gdc|vertica_default|no_value", // optional
      "enclosed_by": "quote character", // optional
      "null_value": "null_value" // optional
    }
  }
}

You can also set any of these parameters at the entity level. To do so, add it to the 'custom' section of the entity.
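
For example, to keep the header row of one particular table while the other entities use the global settings, you could override 'skip_rows' in that entity's 'custom' section (a sketch based on the entity layout from the minimum configuration):

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "hub": ["column_1_name","column_2_name","column_3_name"],
        "skip_rows": 0
      }
    }
  }
}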

skip_rows (integer, optional, default: 1)
  The number of rows from the top of a file that CSV Downloader will skip
  By default, it is set to 1 and removes the first row (which typically contains a header) from a CSV file. If you want to use this parameter but preserve the data in the first row, set it to 0.

column_separator (character, optional, default: ,)
  The character used to separate fields in a CSV file
  NOTE: It cannot be the same as the character in the 'null_value' parameter.

record_terminator (character, optional, default: \n)
  The character used to separate rows in a CSV file
  NOTE: It cannot be the same as the character in the 'null_value' or 'column_separator' parameters.

escape_as (character, optional, default: n/a)
  An escape character
  You can define an escape character using any ASCII value in the range from E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
  Example: \"

file_format (string, optional, default: "" (no value, as in "file_format": ""))
  The format of the files to download
  The default file format is CSV, which is specified by an empty value of this parameter: "file_format": ""
  Available options:
  • empty value (""), which defaults to CSV
  • gzip
  • xlsx
  NOTE: If your files are spreadsheets (XLSX), you cannot use your own manifest files; you must let CSV Downloader generate manifest files for you (set the 'generate_manifests' parameter to 'true'; see Generate a manifest file).

db_parser (string, optional, default: gdc)
  The database parser
  If this option is not set, the default GoodData parser is used.
  To switch to the Vertica parser, explicitly set this parameter to 'vertica_default': "db_parser": "vertica_default"

enclosed_by (string, optional, default: n/a)
  The quote character within which to enclose data; this allows delimiter characters to be embedded in string values
  You can choose any ASCII value in the range from E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
  Example: \"

null_value (string, optional, default: "" (no value, as in "null_value": ""))
  A character that replaces nulls in the data when loading it to BDS (corresponds to NULL AS in the COPY command)
  You can specify a null value as any ASCII value in the range from E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
  NOTE: It cannot be the same as the character in the 'column_separator' parameter.


Set optional parameters for your manifest file

The optional 'generate_manifests' parameter specifies whether to generate a manifest file or to use the manifest file that you provide. By default, the 'generate_manifests' parameter is not set and defaults to 'false', which means that:

  • You have to create and provide your own manifest files (see Manifest File).
  • All manifest file properties take on their default values.
  • CSV Downloader expects the manifest files to be found in the folder that you specified in the 'folder' parameter (see Minimum Layout of the Configuration File).

Depending on your business and technical requirements, you can set one or more optional parameters to customize the default settings for the manifest file.

If you want CSV Downloader to generate the manifest file for you, set the 'generate_manifests' parameter to 'true' and see Generate a manifest file.

"csv": {
  "type": "s3",
  "options": {
    "generate_manifests": false,
    "manifest": "manifest_file_name", // optional
    "manifest_process_type": "move|history", // optional
    "number_of_manifest_in_one_run": number_of_manifests, // optional
    "data_location": "path/to/data/files" // optional
  }
}
generate_manifests (Boolean, optional, default: false)
  Specifies whether to generate a manifest file or to use the manifest file that you provide
  • If not set or set to 'false', you have to provide your own manifest file.
  • If set to 'true', CSV Downloader generates a manifest file for you. See Generate a manifest file.

manifest (string, optional, default: manifest_{time(%s)})
  The format of the manifest file name
  For information about the format, see Manifest File.

manifest_process_type (string, optional, default: move)
  Specifies how to handle the manifest file after it has been processed
  • If not set or set to 'move', the manifest file is moved to the 'processed' folder (see the 'move_data_after_processing_to_path' parameter in Generate a manifest file).
  • If set to 'history', the manifest file is not moved and remains in its original location.

number_of_manifest_in_one_run (integer, optional, default: 1)
  The maximum number of manifest files to process in one run

data_location (string, optional, default: n/a)
  The path to the data files
  Example: prod/files/data_source/


Generate a manifest file

The optional 'generate_manifests' parameter specifies whether to generate a new manifest file or to use the manifest file that you provide. By default, the 'generate_manifests' parameter is not set and defaults to 'false', which means that you have to create and provide your own manifest files.

Depending on your business and technical requirements, you can decide to make CSV Downloader generate the manifest file for you. In generated manifest files, load mode for all source data files is set to 'full' by default.

If data is loaded in incremental mode, the data files are by default moved to the 'processed' folder after being processed (the path to the folder can be set by the 'move_data_after_processing_to_path' parameter). To prevent the data files from being moved, set the 'keep_data_after_processing' parameter to 'true'.

If you want to provide your own manifest files, set the 'generate_manifests' parameter to 'false' and see Set optional parameters for your manifest file.

"csv": {
  "type": "s3",
  "options": {
    "generate_manifests": true,
    "data_location": "path/to/data/files",
    "file_name": "data_file_name", // optional
    "sheet_name": "sheet_file_name", // optional
    "date_from": "YYYYMMDD", // optional
    "date_to": "YYYYMMDD", // optional
    "keep_data_after_processing": false|true, // optional
    "move_data_after_processing_to_path": "path/to/move/data/to" // optional
  }
}
generate_manifests (Boolean, optional, default: false)
  Specifies whether to generate a new manifest file or to use the manifest file that you provide

data_location (string, mandatory)
  The path to the source data files
  Example: prod/files/data_source/

file_name (string, optional, default: {entity_name}_{version}_{time(%s)})
  The format of the source data file name
  For more information, see file_name.

sheet_name (string, optional, default: entities.xlsx)
  (Only for XLSX source files; see the 'file_format' parameter in Describe the structure of the source CSV files) The format of the spreadsheet name in a source data file
  If you want to customize the spreadsheet name format, you can use the keywords 'regex', 'time', or 'target_predicate' the same way they can be used in file_name. For example, you may want to set the name format to spreadsheet_{time(%Y.%m.%d)}.xlsx.
  NOTE: The names of individual sheets must be the same as the names of your entities (the tables to download). For example, if you have entities named 'company' and 'revenue', name the sheets in the source file 'company' and 'revenue' as well.

date_from (string, optional, default: n/a)
  Use 'date_from' and 'date_to' to specify the time interval within which the data will be downloaded (>= date_from, <= date_to).
  Format: YYYYMMDD
  Always use the 'date_from' and 'date_to' parameters together.

date_to (string, optional, default: n/a)
  See 'date_from'.

keep_data_after_processing (Boolean, optional, default: false)
  (For incremental load mode only) Specifies how to handle the data files after they have been processed
  • If not set or set to 'false', the processed data files are moved to the 'processed' folder (set by the 'move_data_after_processing_to_path' parameter).
  • If set to 'true', the processed data files are not moved and remain in their original location (set by the 'data_location' parameter).

move_data_after_processing_to_path (string, optional, default: n/a)
  The path where the data files are moved after processing if the 'keep_data_after_processing' parameter is not set or is set to 'false'
  Example: prod/files/processed/

file_name

The default format of the source file name is the following:

{entity_name}_{version}_{time(%s)}.csv|gz|zip

The entity represents the table to download data from. The {entity_name} section is mandatory, and the rest is optional.

When resolved, the name of a data file may look like the following:

entity_1.0_20160905140015.csv

To customize the file name format, you can use the following keywords:

  • entity_name: The name of the entity (the table to download data from)
  • version: Use 'version' to version your entities. For example, 1.0, 3.23, and so on.
    If you do not specify 'version' in the names of your data files and you create your own feed file, use the value 'default' in the 'version' column.

    Do not specify 'version' when you decide to generate the feed file instead of creating it on your own.

  • regex: If a file name has a changing part, use 'regex' to be able to process the files. For more information, see https://ruby-doc.org/core-2.1.1/Regexp.html.
  • time: If 'sequence' is not present, data source files are sorted by time. 'time' can be set as a timestamp ( {time(%s)} ) or any variant of the YYYYMMDD format (for example, {time(%Y-%m-%d-%H-%M-%S)} ). For more information about the tags, see http://ruby-doc.org/core-2.2.0/Time.html#method-i-strftime.
  • target_predicate: The field name used in partial full load (see Set partial full load mode for the entities).

When building the file name format, use symbols such as underscores to separate individual elements:

{entity_name}_{time(%Y%m%d%H%M%S)}_{version}

Processing file names with no separating symbols between individual elements (for example, {entity_name}{time(%Y%m%d%H%M%S)}{version} ) may cause an error.

Example:
To get the following source data file name:

Event_1.0_20160914_R.abc.csv

the file name format may look like the following:

{entity_name}_{version}_{time(%Y%m%d)}_{regex(R\\.abc)}

Notice that the dot in this regex is escaped and therefore matches a literal dot ( . ), not 'any character'.


Provide a feed file

By default, CSV Downloader does not inspect the content of the downloaded CSV files. Instead, you describe the content structure of the CSV files in a feed file.

If you do not provide the feed file, CSV Downloader generates it.

For more information about the feed file, see Feed File.

"csv": {
  "type": "s3",
  "options": {
    "data_structure_info": "path/to/feed/file", // optional
    "feed_on_bds": false|true, // optional
    "default_string_size": "255", // optional
    "generate_feed_from_path": "path/to/test/feed/file", // optional
    "ignore_columns_check": false|true // optional
  }
}
data_structure_info (string, optional, default: n/a)
  The path to the feed file
  Example: prod/files/feed
  Recommended: {SOURCE_ROOT}/feed
  If you enter a path ending with a slash ( / ), all files in the directory will be used.
  If the feed file is not present in the specified path, CSV Downloader will generate it.

feed_on_bds (Boolean, optional, default: false)
  Specifies whether the feed file is located on BDS instead of the location set by the 'data_structure_info' parameter
  • If not set or set to 'false', the feed file is located in the location set by the 'data_structure_info' parameter.
  • If set to 'true', the feed file is located on BDS.

default_string_size (string, optional, default: 255)
  The default size of the varchar data type for the generated feed file

generate_feed_from_path (string, optional, default: n/a)
  The path to generate a test feed file in
  If this parameter is explicitly set to a path, the feed file is generated but no data is downloaded from the source files. Use this option to test your settings before actually downloading data.

ignore_columns_check (Boolean, optional, default: false)
  Specifies whether to skip the comparison of column names in the feed file to those in the source files
  • If not set or set to 'false', the column names in the feed file are compared to those in the source files.
  • If set to 'true', the column names in the feed file are not compared to those in the source files.


Set up runtime parameters

You can set up optional processing parameters that are used at runtime.

"csv": {
  "type": "s3",
  "options": {
    "delete_data_after_processing": true|false, // optional
    "ignore_check_sum": true|false, // optional
    "number_of_threads": number_of_threads // optional
  }
}
delete_data_after_processing (Boolean, optional, default: false)
  Specifies whether to delete data from the source after processing
  • If not set or set to 'false', the source data is not deleted.
  • If set to 'true', the source data is deleted.

ignore_check_sum (Boolean, optional, default: true)
  Specifies whether to ignore MD5 checksums when downloading data
  • If not set or set to 'true', MD5 checksums are ignored when downloading data.
  • If set to 'false', MD5 checksums are verified when downloading data.

number_of_threads (integer, optional, default: 4 for S3, 1 for other locations)
  The number of threads used for copying data from the source location to BDS
  • When set to 1, the maximum file size that CSV Downloader can process is 2 GB.
  • When set to more than 1, the maximum file size is considerably smaller than 2 GB. However, you can configure CSV Downloader to process multiple files in one run.


Specify additional load options for the entities

For each entity, you can set up additional load options.

Setting up an entity option and then removing it does not restore the default value; the last value that was set is kept. To change the option back to its default value, set the default explicitly.

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "hub": ["column_1_name","column_2_name","column_3_name"],
        "export_type": "full|inc", // optional
        "skip_rows": number_of_rows // optional
      }
    }
  }
}
export_type (string, optional, default: n/a)
  The load mode used for loading the entity to the database
  • If you provide your own manifest file (that is, the 'generate_manifests' parameter is set to 'false'), incremental load is used. To use full mode, explicitly set this option to 'full'.
  • If CSV Downloader generates the manifest file for you (that is, the 'generate_manifests' parameter is set to 'true'), full load is used. To use incremental mode, explicitly set this option to 'inc'.

skip_rows (integer, optional, default: 1)
  The number of rows from the top of a file that CSV Downloader will skip
  By default, it is set to 1 and removes the first row (which typically contains a header) from a CSV file. Set it to 0 if you want to use the parameter but preserve the data in the first row.


Set partial full load mode for the entities

You can specify partial full load mode for some or all entities. In partial full load mode, existing rows are removed when they contain the same value in the specified field as the new data being loaded.

Example:
You have two clients: Client A and Client B. You already have data for those clients in your ADS instance from previous data loads. The data for both Client A and Client B is stored in the same table. The table contains, among others, the 'id' and 'client' columns. The names of these columns are specified in the 'hub' parameter of your configuration file (see Minimum Layout of the Configuration File).

This is how your data currently looks in the table on the ADS instance:

id | text    | client
1  | aaabbbb | clientA
1  | ccddd   | clientB


Now, you want to reload data for Client A without affecting Client B's data in the same table.

This is the new data that you want to load for Client A:

id | text    | client
2  | eeeeeff | clientA


After the data load completes, you want the table with the data to look like the following:

id | text    | client
2  | eeeeeff | clientA
1  | ccddd   | clientB

Notice that Client A has a different ID and text, and Client B's data remains the same.


To achieve this output, use partial full load. Set the 'client' field to be the partial full load field:

"partial_full_load_field": "client"

After the next data load runs, the table with the data will be updated to look exactly as you want it to.


For comparison, this is what your data would look like if you ran the data load in full load mode:

id | text    | client
2  | eeeeeff | clientA

Notice that the data for Client A is loaded correctly, but the data for Client B is no longer present in the table.


Depending on whether the field that you want to use as the partial full load field does or does not exist in your data, choose one of the following scenarios to set partial full load mode.

Scenario 1: The field that you want to be the partial full load field exists in your data

Set up the partial full load field:

  • If you want to use partial full load for a specific entity, add the 'partial_full_load_field' parameter to the 'custom' section for the entity and set it to the name of the field that will be used as the partial full load field:

    "entities": {
      "table_1_name": {
        "global": {
          "custom": {
            "hub": ["column_1_name","column_2_name","column_3_name"],
            "partial_full_load_field": "your_field_name"
          }
        }
      }
    }
  • If you want to use partial full load for all entities, add the 'partial_full_load_field' parameter to the 'options' section of the CSV Downloader instance, and set it to the name of the field that will be used as the partial full load field:

    "csv": {
      "type": "s3",
      "options": {
        "bucket": "s3_bucket_name",
        "access_key": "s3_access_key",
        "folder": "path/to/manifest/file",
        "partial_full_load_field": "your_field_name"
      }
    }


Scenario 2: The field that you want to be the partial full load field does not exist in your data

  1. Set up the field that will be used as the partial full load field. Although you can use any field for this purpose, we recommend the 'target_predicate' field; this procedure describes setting partial full load mode with it.
    Do one of the following:
    • If you provide your own manifest file, propagate the 'target_predicate' field from the manifest file (see the 'target_predicate' column in Manifest File).
    • If CSV Downloader generates the manifest file for you, use the 'target_predicate' keyword in the file name format (see file_name).
  2. Depending on whether you want to use partial full load for a specific entity or all entities, do one of the following:
    • If you want to use partial full load for a specific entity, configure the 'custom' section for the entity.
      1. Add the 'partial_full_load_field' parameter, and set it to 'target_predicate'.
      2. Add the 'computed_fields' section so that the 'target_predicate' field gets added to your source and stage tables. For more information about the computed fields, see Computed Fields.

        "entities": {
          "table_1_name": {
            "global": {
              "custom": {
                "hub": ["column_1_name","column_2_name","column_3_name"],
                "partial_full_load_field": "target_predicate",
                "computed_fields": [
                  {
                    "name": "target_predicate",
                    "type": "string-32",
                    "function": "runtime_metadata|target_predicate",
                    "table_type": ["src"],
                    "key": true,
                    "position": "first",
                    "encoding": "RLE"
                  },
                  {
                    "name": "target_predicate",
                    "type": "string-32",
                    "function": "value|target_predicate",
                    "table_type": ["stg"],
                    "key": true,
                    "position": "first",
                    "encoding": "RLE"
                  }
                ]
              }
            }
          }
        }
    • If you want to use partial full load for all entities, do the following:
      1. Add the 'partial_full_load_field' parameter to the 'options' section of the CSV Downloader instance, and set it to 'target_predicate':

        "csv": {
          "type": "s3",
          "options": {
            "bucket": "s3_bucket_name",
            "access_key": "s3_access_key",
            "folder": "path/to/manifest/file",
            "partial_full_load_field": "target_predicate"
          }
        }
      2. Add the 'computed_fields' section to the 'ads_storage' section so that the 'target_predicate' field gets added to your source and stage tables. For more information about the computed fields, see Computed Fields.

        "ads_storage": {
          "instance_id": "data_warehouse_instance_id",
          "username": "dw_email@address.com",
          "options": {},
          "computed_fields": [
            {
              "name": "target_predicate",
              "type": "string-32",
              "function": "runtime_metadata|target_predicate",
              "table_type": ["src"],
              "key": true,
              "position": "first",
              "encoding": "RLE"
            },
            {
              "name": "target_predicate",
              "type": "string-32",
              "function": "value|target_predicate",
              "table_type": ["stg"],
              "key": true,
              "position": "first",
              "encoding": "RLE"
            }
          ]
        }


Customize integrators

The 'integrators' and 'ads_storage' sections define the behavior of integrators.

"integrators": {
  "ads_integrator": {
    "type": "ads_storage",
    "batches": ["csv_downloader_id"]
  }
},
"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {}
}

If you want to add more parameters or change the defaults, see ADS Integrator.

Schedule Parameters

When scheduling CSV Downloader (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.

General Parameters

Some parameters must be entered as secure parameters (see Configuring Schedule Parameters).

Some parameter names include the names of their nesting parameters, which specify where the parameter belongs in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

  "csv": {
    "options": {
      "password": ""
    }
  }

Enter such parameters as is (see Schedule Example).

ID (string, mandatory, not secure)
  The ID of the CSV Downloader instance being scheduled
  Must be the same as the 'batches -> csv_downloader_id' parameter and the 'downloaders -> csv_downloader_id' parameter in the configuration file (see Minimum Layout of the Configuration File)
  Example: csv_downloader_1

server_side_encryption (Boolean, optional, not secure, default: false)
  Specifies whether CSV Downloader can work with an S3 location where server-side encryption is enabled
  • If not set or set to 'false', CSV Downloader cannot work with a server-side encrypted S3 location.
  • If set to 'true', CSV Downloader can work with a server-side encrypted S3 location.

Location-Specific Parameters

Depending on the location where your source data files are stored, provide the corresponding location-specific parameters as secure schedule parameters (see Configuring Schedule Parameters).

Some parameter names include the names of their nesting parameters, which specify where the parameter belongs in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

  "csv": {
    "options": {
      "password": ""
    }
  }

Enter such parameters as is (see Schedule Example).

S3

csv|options|secret_key (string, mandatory, secure)
  The secret key to the S3 bucket where the source data is stored
SFTP

csv|options|password (string, conditionally mandatory, secure)
  The password to the SFTP server to download the data from
  This parameter is mandatory only when 'auth_mode' is set to 'password' (see Use an SFTP location). Otherwise, do not use it.

csv|options|cert_key (string, conditionally mandatory, secure)
  The private key used for authentication
  This parameter is mandatory only when 'auth_mode' is set to 'cert' (see Use an SFTP location). Otherwise, do not use it.
  If the private key is encrypted, also provide the passphrase to decrypt it (see the 'csv|options|cert_passphrase' parameter).
  NOTE: Although you enter the private key as a secure parameter, its value is visible right after you paste it into the parameter value field (the show value check-box is selected) because it is a multi-line string. To hide the value, un-select the show value check-box. When the schedule is saved, the value will not appear in clear-text form in any GUI or log entries. For more information, see Configuring Schedule Parameters.

csv|options|cert_passphrase (string, conditionally mandatory, secure)
  The passphrase used to decrypt the private key if it is encrypted
  This parameter is mandatory only when 'auth_mode' is set to 'cert' (see Use an SFTP location) and the private key (see the 'csv|options|cert_key' parameter) is encrypted. Otherwise, do not use it.

WebDAV

csv|options|password (string, mandatory, secure)
  The password to the WebDAV server to download the data from

OneDrive

token (string, mandatory, secure)
  The OneDrive refresh token

Schedule Example

The following is an example of how you can specify schedule parameters:
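
A minimal sketch, assuming the S3 location and the 'csv_downloader_1' instance ID used in the examples above (enter the secret key as a secure parameter):

ID = csv_downloader_1
csv|options|secret_key = your_s3_secret_key   (secure)
server_side_encryption = false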

Best Practices

Format

  • Use GZIPed CSV files.
  • If you have multiple source files, compress each file separately (GZIPed CSV files, not tar.gz).

CSV File Structure

  • Use a comma delimiter ( , ).
  • End lines with \n or \r\n.
  • Quote strings that contain commas, newlines, or other special characters with a double quote character ( " ).
  • Escape quotes in strings with a backslash character ( \ ). Duplicated double quotes are accepted, but are slower to process.
  • Make the first line of each file a header.
  • Keep all dates and datetimes in the same format.

Example:

ID,Name,City,Note,timeCreated,dateOpened,dateClosed
1,Texas Account,"Dallas, TX","Width has to be 5\" at least",2017-11-12 23:08:22.000,2017-11-12,2018-01-11
2,NYC Account,"New York City, NY","",2017-10-28 11:18:12.352,2017-11-12,
3,Prague,Prague,"Width has to be 5\" at least",2017-10-15 13:02:22.193,2017-11-12,

Size

  • Keep the size of each CSV file below 100 MB. Split bigger (up to 1 GB) files into smaller files.
  • Keep the size of each XLSX sheet below 10 MB.
  • The maximum file size that CSV Downloader can process is 2 GB. This limit is applied only if the 'number_of_threads' parameter is set to 1 (see Set up runtime parameters). With a higher number of threads, the maximum file size is considerably smaller. Break large data files (greater than 1 GB) into smaller files, and configure CSV Downloader to process multiple files in one run.

Location

  • Store the source files on S3 (see S3 Bucket).

  • On S3, place the source files into the following path: {SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/
  • Keep the source files in the S3 bucket for a while after CSV Downloader has downloaded them. When the data integration process stabilizes over a number of successful executions, you can set up a retention period.

Load

  • Use link files (see Use link files).
  • If you need to do both daily loads and historical loads to the same table, use multiple instances of CSV Downloader (see Add multiple instances of CSV Downloader). If you use multiple instances of CSV Downloader with one ADS Integrator instance, make sure that each instance of CSV Downloader has a unique ID.
  • For each entity, prepare at least one source file.
  • If you are using the computed fields for setting partial full load mode, use the 'position' and 'encoding' parameters (see Computed Fields) for better performance. Set the 'position' parameter to 'first', and set the 'encoding' parameter to 'RLE'.

Additional Files

  • Provide your own manifest file (the 'generate_manifests' parameter is set to 'false'). See Manifest File.
  • Provide a feed file. See Feed File.

S3 Bucket

Set up the structure of the S3 bucket as follows:

{SOURCE_ROOT}
  The root folder/bucket

{SOURCE_ROOT}/feed/
  The location of a feed file (feed.txt)
  See Feed File.

{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/
  The location of data source files
  Keep the number of data source files in one folder below 500. If needed, distribute source data files among subfolders by year/month/date.

{SOURCE_ROOT}/manifest/
  The location of unprocessed manifest files
  Make sure that CSV Downloader has write access to this folder so that it can move the manifest files after they have been processed.

{SOURCE_ROOT}/manifest/processed/{YYYY}/{MM}/{DD}/
  The location where processed manifests are moved to
  CSV Downloader creates the 'processed' folder with subfolders by year/month/date to limit the number of files in a folder.
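
Put together, a bucket that follows this structure might look like the following (the entity name, version, and dates are illustrative placeholders):

{SOURCE_ROOT}/
  feed/feed.txt
  data/2017/11/12/company_1.0_1510444800.csv.gz
  manifest/manifest_1510444800
  manifest/processed/2017/11/11/manifest_1510358400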