CSV Downloader

CSV Downloader is a component of the data preparation and distribution pipeline (see Data Preparation and Distribution Pipeline). CSV Downloader downloads source files from a remote location and places them in the Big Data Storage (BDS) for ADS Integrator to take over.

CSV Downloader can download the source files in the following formats:

  • CSV files (uncompressed, GZIPed, or BZIPed)

  • JSON files (uncompressed or GZIPed)

  • Excel files

  • Parquet files (uncompressed or compressed with Snappy)

CSV Downloader can download the source files from the following remote locations:

  • S3
  • WebDAV
  • SFTP
  • Google Cloud Storage
  • OneDrive

For information about how to prepare source files (format, structure) and additional details, see Input Data for CSV Downloader.

Configuration File

Creating and setting up the brick’s configuration file is the third phase of building your data pipeline. For more information, see Phases of Building the Data Pipeline.

For more information about configuring a brick, see Configure a Brick.

Minimum Layout of the Configuration File

The following JSON sample is the minimum layout of the configuration file that CSV Downloader can use. If you are using multiple bricks, you can reuse some sections of this configuration file for other bricks. For example, the ‘downloaders’ section can list additional downloaders, and the ‘integrators’ and ‘ads_storage’ sections can be reused by other downloaders and executors (see Configure a Brick).

This sample describes the following use case:

  • One instance of CSV Downloader downloads two CSV files from an S3 location.
  • You provide your own manifest files for the downloaded CSV files (see Manifest File).
  • CSV Downloader generates a feed file (see Feed File).

Copy the sample and replace the placeholders with your values.

You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).

{  
  "entities": {
    "table_1_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    },
    "table_2_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    }
  },
  "downloaders": {
    "csv_downloader_id": {
      "type": "csv",
      "entities": ["table_1_name","table_2_name"]
    }
  },
  "csv": {
    "type": "s3",
    "options": {
      "bucket": "s3_bucket_name",
      "access_key": "s3_access_key",
      "folder": "path/to/manifest/file",
      "region": "s3_bucket_region_code"
    }
  },
  "integrators": {
    "ads_integrator_id": {
      "type": "ads_storage",
      "batches": ["csv_downloader_id"]
    }
  },
  "ads_storage": {
    "instance_id": "data_warehouse_instance_id",
    "username": "dw_email@address.com",
    "options": {}
  }
}

 

The placeholders to replace with your values:

  • entities -> table_x_name (string, mandatory, no default): The database tables into which CSV Downloader downloads the data
  • entities -> table_x_name -> global -> custom -> hub (array, mandatory, no default): The table columns to generate a primary key
  • downloaders -> csv_downloader_id (string, mandatory, no default): The ID of the CSV Downloader instance. Example: csv_downloader_1
  • downloaders -> csv_downloader_id -> entities (array, mandatory, no default): The database tables into which CSV Downloader downloads the data. Must be the same as the database table names in 'entities -> table_x_name'.
  • csv -> options -> bucket (string, mandatory, no default): The name of the S3 bucket where the source data is stored
  • csv -> options -> access_key (string, mandatory, no default): The access key to the S3 bucket where the source data is stored
  • csv -> options -> folder (string, mandatory, no default): The path to the folder where the manifest files are located. Example: prod/files/manifests/  Recommended: {SOURCE_ROOT}/manifest/
  • csv -> options -> region (string, mandatory: see description, no default): The code of the region of the S3 bucket where the source data is stored. This parameter is mandatory for all regions except US East (N. Virginia) (code: 'us-east-1'). If your region code is 'us-east-1', do not specify it. For the list of regions, see AWS Service Endpoints.
  • integrators -> ads_integrator_id (string, mandatory, no default): The ID of the ADS Integrator instance. Example: ads_integrator_1
  • integrators -> ads_integrator_id -> batches (array, mandatory, no default): The ID of the CSV Downloader instance. Example: csv_downloader_1
  • ads_storage -> instance_id (string, mandatory, no default): The ID of the Agile Data Warehousing Service (ADS) instance into which the data is uploaded
  • ads_storage -> username (string, mandatory, no default): The access username to the Agile Data Warehousing Service (ADS) instance into which the data is uploaded

Customize the Configuration File

You can change default values of the parameters and set up additional parameters.

 

Use uncompressed CSV files as source files

By default, CSV Downloader treats the source files as uncompressed CSV files (the ‘file_format’ parameter is either not explicitly set or set to an empty value as in "file_format": ""). If you have been using a different type of the source files and now want to switch back to using uncompressed CSV files, either delete the ‘file_format’ parameter from ‘csv’ -> ‘options’ -> ‘files_structure’ or set this parameter to an empty value.

"csv": {
  "type": "s3",
  "options": {
    "files_structure": {
      "file_format": ""
    }
  }
}

 

Use compressed CSV files as source files

By default, CSV Downloader treats the source files as uncompressed CSV files (the ‘file_format’ parameter is either not explicitly set or set to an empty value as in "file_format": "").

To use compressed (GZIPed or BZIPed) CSV files as source files, set the ‘file_format’ parameter under ‘csv’ -> ‘options’ -> ‘files_structure’ to ‘gzip’.

"csv": {
  "type": "s3",
  "options": {
    "files_structure": {
      "file_format": "gzip"
    }
  }
}

 

Use JSON files as source files

By default, CSV Downloader treats the source files as uncompressed CSV files (the file_format parameter is either not explicitly set or set to an empty value as in "file_format": ""). If your source files are JSON files (uncompressed or GZIPed), set CSV Downloader to process JSON files.

For the JSON files to be processed correctly, you must do the following:

  • Specify the parser that ADS Integrator will use to process the JSON files before it integrates the data from these files into ADS.

    When being integrated into the flex tables, the JSON files are processed with the Vertica fjsonparser parser with or without flattening the nested data (flatten_maps=true or flatten_maps=false). For example, a source JSON file looks like the following:

    {
      "id": 1,
      "devices": [
        "Projector",
        "Table"
      ],
      "date": "2017-05-01",
      "visits": {
        "visitors": 40050,
        "international": false,
        "company": "Acme"
      }
    }
    

     

    • If you choose to flatten the data, the nested data is flattened and the data elements are separated with either the default delimiter (.) or a user-defined delimiter. JSON arrays are converted to arrays of key/value pairs. The processed data will look like the following:

      {
        "date": "2017-05-01",
        "devices": {
          "0": "Projector",
          "1": "Table"
        },
        "id": "1",
        "visits.company": "Acme",
        "visits.international": "false",
        "visits.visitors": "40050",
      }
      
    • If you choose to not flatten the data, the nested data is not flattened. This allows ADS Integrator to use brackets ([ ]) for nested structures and single quotes (' ') for the names of the keys in key/value pairs later in the process when converting the data and loading it to the source tables. JSON arrays are converted to arrays of key/value pairs. The processed data will look like the following:

      {
        "date": "2017-05-01",
        "devices": {
          "0": "Projector",
          "1": "Table"
        },
        "id": "1",
        "visits": {
          "company": "Acme",
          "international": "false",
          "visitors": "40050"
        }
      }
      
  • Provide a feed file that contains XPath notations. The XPath notations define the rules for converting the processed data to the format that allows ADS Integrator to load this data from the flex tables to the source tables. For information about the supported XPath notations and examples of the feed file, see Feed File.

Steps:

  1. Because CSV Downloader treats the format of JSON files and CSV files similarly, set the format of the source files to CSV. To do so, either do not set the ‘file_format’ parameter under ‘csv’ -> ‘options’ -> ‘files_structure’ at all, or set it to an empty value ("file_format": ""), as described in Use uncompressed CSV files as source files.

  2. Set the db_parser parameter under csv -> options -> files_structure to either fjsonparser (if you want to flatten the nested data; you can also specify it as fjsonparser(flatten_maps=true)) or fjsonparser(flatten_maps=false) (if you do not want to flatten the nested data). For example:

    "csv": {
      "type": "s3",
      "options": {
        ...
        "files_structure": {
          ...
          "db_parser": "fjsonparser(flatten_maps=false)"
        }
      }
    }
    
  3. Provide the feed file. The feed file must contain the xpath column with the XPath notations. For information about the supported XPath notations and examples of the feed file, see Feed File.

  4. Specify the path to the feed file.

    "csv": {
      "type": "s3",
      "options": {
        ...
        "data_structure_info": "path/to/feed/file"
      }
    }
    
  5. (Optional) If your source files are stored in an S3 location, switch from using the default COPY FROM LOCAL command to using a faster COPY FROM S3 command. See “Use COPY FROM S3 instead of COPY FROM LOCAL” in ADS Integrator.
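
Putting these steps together, a minimal ‘csv’ section for JSON source files might look like the following sketch. The bucket, access key, folder, and feed file path are placeholder values to replace with your own:

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "s3_access_key",
    "folder": "path/to/manifest/file",
    "data_structure_info": "path/to/feed/file",
    "files_structure": {
      "db_parser": "fjsonparser(flatten_maps=false)"
    }
  }
}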

 

Use Excel files as source files

By default, CSV Downloader treats the source files as uncompressed CSV files (the ‘file_format’ parameter is either not explicitly set or set to an empty value as in "file_format": "").

To use Excel files as source files, set the ‘file_format’ parameter under ‘csv’ -> ‘options’ -> ‘files_structure’ to ‘xlsx’.

"csv": {
  "type": "s3",
  "options": {
    "files_structure": {
      "file_format": "xlsx"
    }
  }
}

 

Use Parquet files as source files

By default, CSV Downloader treats the source files as uncompressed CSV files (the ‘file_format’ parameter is either not explicitly set or set to an empty value as in "file_format": ""). If your source files are Parquet files (uncompressed or compressed with Snappy; ZIPed Parquet files are not supported), set CSV Downloader to process Parquet files.

Steps:

  1. Specify the Parquet type for the source files.

    • If you want to set Parquet for all entities, set the ‘file_format’ parameter under ‘csv’ -> ‘options’ -> ‘files_structure’ to ‘parquet’.

      "csv": {
        "type": "s3",
        "options": {
          ...
          "files_structure": {
            ...
            "file_format": "parquet"
          }
        }
      }
      
    • If you want to set Parquet for a specific entity, add the ‘file_format’ parameter to the ‘custom’ section for the entity and set it to ‘parquet’:

      "entities": {
        "table_1_name": {
          "global": {
            "custom": {
              "hub": ["column_1_name","column_2_name","column_3_name"],
              "file_format": "parquet"
            }
          }
        }
      }
      
  2. Provide a feed file (see Feed File) and configure the feed file parameters in the following way:

    "csv": {
      "type": "s3",
      "options": {
        ...
        "data_structure_info": "path/to/feed/file",
        "ignore_columns_check": true
      }
    }
    
    • ‘generate_feed_from_path’ is not set.
    • ‘data_structure_info’ is set to the path to the feed file.
    • ‘ignore_columns_check’ is set to ‘true’.
  3. Provide a manifest file (see Manifest File) and make sure that CSV Downloader does not generate manifest files:

    "csv": {
      "type": "s3",
      "options": {
        ...
        "generate_manifests": false
      }
    }
    
  4. Set the ‘db_parser’ parameter under ‘csv’ -> ‘options’ -> ‘files_structure’ to ‘vertica_default’.

    "csv": {
      "type": "s3",
      "options": {
        ...
        "files_structure": {
          ...
          "db_parser": "vertica_default"
        }
      }
    }
    
  5. Set the ADS Integrator instance that processes CSV Downloader to use COPY FROM S3. See ‘Use COPY FROM S3 instead of COPY FROM LOCAL’ in ADS Integrator.

  6. Specify whether ADS Integrator should fail if an error occurs and how it should treat rejected records. See ‘Specify how to proceed in case of errors and rejected records’ in ADS Integrator.
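
Taken together, the Parquet-related settings from the previous steps might combine into a ‘csv’ section like the following sketch (the bucket, access key, folder, and feed file path are placeholder values):

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "s3_access_key",
    "folder": "path/to/manifest/file",
    "data_structure_info": "path/to/feed/file",
    "ignore_columns_check": true,
    "generate_manifests": false,
    "files_structure": {
      "file_format": "parquet",
      "db_parser": "vertica_default"
    }
  }
}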

 

Use an S3 location

To use an S3 location, set the ‘type’ parameter to ‘s3’, and specify S3 parameters under ‘csv’ -> ‘options’.

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "s3_access_key",
    "folder": "path/to/manifest/file",
    "region": "s3_bucket_region_code"
  }
}
  • bucket (string, mandatory, no default): The name of the S3 bucket where the source data is stored
  • access_key (string, mandatory, no default): The access key to the S3 bucket where the source data is stored
  • folder (string, mandatory, no default): The path to the folder where the manifest files are located. Example: prod/files/manifests/  Recommended: {SOURCE_ROOT}/manifest/
  • region (string, mandatory: see description, no default): The code of the region of the S3 bucket where the source data is stored. This parameter is mandatory for all regions except US East (N. Virginia) (code: 'us-east-1'). If your region code is 'us-east-1', do not specify it. For the list of regions, see AWS Service Endpoints.

 

Use an SFTP location

To use an SFTP location, set the ‘type’ parameter to ‘sftp’, and specify SFTP parameters under ‘csv’ -> ‘options’.

"csv": {
  "type": "sftp",
  "options": {
    "username": "sftp_email@address.com",
    "host": "sftp_hostname",
    "auth_mode": "password|cert",
    "port": sftp_port_number // optional
  }
}
  • username (string, mandatory, no default): The username to access the SFTP server to download the data from
  • host (string, mandatory, no default): The host name of the SFTP server to download the data from. Example: ftp.example.com
  • auth_mode (string, mandatory, no default): The mode used to authenticate to the SFTP server from where the data will be downloaded. Available options: 'password' and 'cert'.
  • port (integer, optional, default: 22): The port on which the SFTP server is running

 

Use a WebDAV location

To use a WebDAV location, set the ‘type’ parameter to ‘webdav’, and specify WebDAV parameters under ‘csv’ -> ‘options’.

"csv": {
  "type": "webdav",
  "options": {
    "username": "webdav_email@address.com",
    "host": "webdav_hostname",
    "data_location": "path/to/data" // optional
  }
}
  • username (string, mandatory, no default): The username to access the WebDAV server to download the data from
  • host (string, mandatory, no default): The host name of the WebDAV server to download the data from. Example: https://secure-di.gooddata.com
  • data_location (string, optional, default: root folder): The path to the data on the WebDAV server

 

Use a Google Cloud Storage location

To use a Google Cloud Storage location, set the ‘type’ parameter to ‘google_cloud_storage’, and specify Google Cloud Storage parameters under ‘csv’ -> ‘options’.

"csv": {
  "type": "google_cloud_storage",
  "options": {
    "auth": {
      <content_from_client_secrets.json>
    },
    "bucket": "bucket_name"
  }
}
  • auth (JSON, mandatory, no default): The content of client_secrets.json. Example:

    "auth": {
      "type": "service_account",
      "project_id": "12345",
      "private_key_id": "54321",
      "client_email": "email@address.com",
      "client_id": "67890",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://accounts.google.com/o/oauth2/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/09876"
    }

  • bucket (string, mandatory, no default): The name of the bucket where the source data is stored

 

Use a OneDrive location

To use a OneDrive location, set the ‘type’ parameter to ‘one_drive’, and specify OneDrive parameters under ‘csv’ -> ‘options’.

"csv": {
  "type": "one_drive",
  "options": {
    "client_id": "onedrive_client_id"
  }
}
  • client_id (string, mandatory, no default): The OneDrive client ID

 

Enable processing of PGP-encrypted source files

If your source files in the remote location are encrypted using PGP, enable CSV Downloader to decrypt them before placing them in the BDS.

Steps:

  1. Enable processing of PGP-encrypted files.

    • To enable processing for all source files, add the ‘encryption’ parameter under ‘csv’ -> ‘options’, and give it a unique name (for example, ‘encryption_1’). For example, this is what it would look like for an S3 location:

      "csv": {
        "type": "s3",
        "options": {
          "bucket": "s3_bucket_name",
          "access_key": "access_key",
          "encryption": "encryption_1"
        }
      }
      

      For an SFTP location, it would look like the following:

      "csv": {
        "type": "sftp",
        "options": {
          "username": "sftp_email@address.com",
          "host": "sftp_hostname",
          "auth_mode": "password|cert",
          "encryption": "encryption_1"
        }
      }
      
    • To enable processing for a specific entity, add the ‘encryption’ parameter to the ‘custom’ section for the entity and give it a unique name (for example, ‘encryption_1’):

      "entities": {
        "table_1_name": {
          "global": {
            "custom": {
              "hub": ["column_1_name","column_2_name","column_3_name"],
              "encryption": "encryption_1"
            }
          }
        }
      }
      
  2. Add a new section to the configuration file and name it with the same unique name that you used for the ‘encryption’ parameter. In this example, it is ‘encryption_1’. This section should contain one parameter, ‘type’, set to ‘pgp’:

    "encryption_1": {
      "type": "pgp"
    }
    
  3. When later scheduling this CSV Downloader instance in the Data Integration Console, provide the parameters in the schedule that are related to the decryption of the PGP-encrypted source files (see Encryption-Specific Parameters).
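
Combining Steps 1 and 2, the relevant pieces of the configuration file might look like the following sketch for an S3 location (the bucket and access key are placeholder values):

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "access_key",
    "encryption": "encryption_1"
  }
},
"encryption_1": {
  "type": "pgp"
}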

 

Use link files

By default, CSV Downloader downloads data from a source and saves it to the BDS. If the files to download are big, this process can be time-consuming. In this case, you can create link files on the BDS pointing to the source files instead of downloading the actual data.

When you use link files, you must do the following:

  • Make sure that the source files remain in their location and are not deleted or modified, because deleting or modifying them may result in errors.

  • Use a feed file and a manifest file. See Feed File and Manifest File.

  • Set the ADS Integrator instance that processes CSV Downloader to use link files. See ‘Use link files’ in ADS Integrator.

  • If your source files are PGP-encrypted, enable the ADS Integrator instance that processes CSV Downloader to process encrypted files. See “Enable processing of PGP-encrypted source files when using link files” in ADS Integrator.

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "access_key",
    "use_link_file": true|false
  }
}
  • use_link_file (Boolean, optional, default: false): Specifies how to handle big source CSV files
    • If not set or set to 'false', the files are downloaded from the source.
    • If set to 'true', the files are not downloaded, and CSV Downloader only creates a link file on the BDS with links to these files.

 

Add optional parameters to the S3 location

Depending on your business and technical requirements, you can set one or more optional parameters for your S3 location.

"csv": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "access_key",
    "server_side_encryption": true|false // optional
  }
}
  • server_side_encryption (Boolean, optional, default: false): Specifies whether CSV Downloader can work with an S3 location where server-side encryption is enabled
    • If not set or set to 'false', CSV Downloader is not enabled to work with an S3 location where server-side encryption is enabled.
    • If set to 'true', CSV Downloader is enabled to work with an S3 location where server-side encryption is enabled.

 

Add multiple instances of CSV Downloader

When you use only one instance of CSV Downloader, it reads its remote location parameters from the ‘csv’ section.

If you use multiple instances of CSV Downloader and each instance has a separate location to download data from, add several ‘csv’ sections and give each section a unique name (for example, ‘csv_users’ or ‘csv_sales’).

"csv_users": {
  "type": "s3",
  "options": {
    "bucket": "s3_bucket_name",
    "access_key": "s3_access_key"
  }
},
"csv_sales": {
  "type": "sftp",
  "options": {
    "username": "sftp_email@address.com",
    "host": "sftp_hostname",
    "auth_mode": "password|cert"
  }
}

 

For each instance of CSV Downloader in the ‘downloaders’ section, add the ‘settings’ parameter and set it to the name of the section that contains settings for this CSV Downloader instance.

"downloaders": {
  "csv_downloader_1_id": {
    "type": "csv",
    "settings": "csv_users",
    "entities": ["table_1_name","table_2_name"]
  },
  "csv_downloader_2_id": {
    "type": "csv",
    "settings": "csv_sales",
    "entities": ["table_1_name","table_2_name"]
  }
}

 

Describe the structure of the source CSV files

You can set the parameters describing the structure of the source CSV files.

"csv": {
  "type": "s3",
  "options": {
    "files_structure": {
      "skip_rows": number_of_rows, // optional
      "column_separator": "separator_character", // optional
      "record_terminator": "record_termination_character", // optional
      "escape_as": "escape_character", // optional
      "db_parser": "gdc|vertica_default|fcsvparser-traditional|fcsvparser|fjsonparser|fjsonparser(flatten_maps=false)|{no_value}", // optional
      "enclosed_by": "quote_character", // optional
      "null_value": "null_value" // optional
    }
  }
}
  • skip_rows (integer, optional, default: 1): The number of rows from the top of a file that CSV Downloader will skip. By default, it is set to 1 and removes the first row (which typically contains a header) from a CSV file. If you want to use this parameter but want to preserve the data in the first row, set it to 0.
  • column_separator (character, optional, default: ,): A character that is used to separate fields in a CSV file. NOTE: It cannot be the same as the character in the null_value or record_terminator parameters.
  • record_terminator (character, optional, default: ,): A character that is used to separate rows in a CSV file. NOTE: It cannot be the same as the character in the null_value or column_separator parameters.
  • enclosed_by (string, optional, no default): The quote character within which to enclose data. This allows delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). Example: \"
  • escape_as (character, optional, no default): An escape character. You can define an escape character using any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). Example: \"
  • null_value (string, optional, default: "" (no value, as in "null_value": "")): A character that replaces nulls in the data when loading it to the BDS (corresponds to NULL AS in the COPY command). You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). NOTE: It cannot be the same as the character in the column_separator or record_terminator parameters.
  • db_parser (string, optional, default: gdc): The database parser used to process the source files
    • If not set or set to gdc, the default GoodData parser (GdcCsvParser) is used.
    • If set to vertica_default, the default Vertica parser is used.
    • If set to fcsvparser-traditional, the Vertica fcsvparser parser with type set to traditional is used.
    • If set to fcsvparser, the Vertica fcsvparser parser with type set to rfc4180 is used. NOTE: If you use fcsvparser, do not set the column_separator, enclosed_by, escape_as, and record_terminator parameters. This parser uses its own predefined settings that cannot be changed:
      • delimiter (,) (corresponds to the column_separator parameter)
      • enclosed_by (")
      • escape (") (corresponds to the escape_as parameter)
      • record_terminator (\n or \r\n)
    • If set to fjsonparser, the Vertica fjsonparser parser with flatten_maps set to true is used. NOTE: You can also specify it as fjsonparser(flatten_maps=true).
    • If set to fjsonparser(flatten_maps=false), the Vertica fjsonparser parser with flatten_maps set to false is used.

    NOTE:
    • If you use Parquet source files, set this parameter to vertica_default.
    • If you use JSON source files, set this parameter to either fjsonparser or fjsonparser(flatten_maps=false).
    • If you use the array data types in the feed file (see Feed File), set this parameter to either fcsvparser-traditional or fcsvparser.

 

Set optional parameters for your manifest file

The optional ‘generate_manifests’ parameter specifies whether to generate a manifest file or to use the manifest file that you provide. By default, the ‘generate_manifests’ parameter is not set and defaults to ‘false’, which means that:

  • You have to create and provide your own manifest files (see Manifest File).
  • All manifest file properties take on their default values.
  • CSV Downloader expects the manifest files to be found in the folder that you specified in the ‘folder’ parameter (see Minimum Layout of the Configuration File).

Depending on your business and technical requirements, you can set one or more optional parameters to customize the default settings for the manifest file.

"csv": {
  "type": "s3",
  "options": {
    "generate_manifests": false,
    "manifest": "manifest_file_name", // optional
    "manifest_process_type": "move|history", // optional
    "number_of_manifest_in_one_run": number_of_manifests, // optional
    "data_location": "path/to/data/files" // optional
  }
}
  • generate_manifests (Boolean, optional, default: false): Specifies whether to generate a manifest file or to use the manifest file that you provide
    • If not set or set to 'false', you have to provide your own manifest file.
    • If set to 'true', CSV Downloader generates a manifest file for you. See Generate a manifest file.
    NOTE: If you use Parquet source files, set this parameter to 'false'. If you use Excel source files, set this parameter to 'true'.
  • manifest (string, optional, default: manifest_{time(%s)}): The format of the manifest file name. For information about the format, see Manifest File.
  • manifest_process_type (string, optional, default: move): Specifies how to handle the manifest file after it has been processed
    • If not set or set to 'move', the manifest file is moved to the 'processed' folder (see the 'move_data_after_processing_to_path' parameter in Set up runtime parameters).
    • If set to 'history', the manifest file is not moved and remains in its original location.
  • number_of_manifest_in_one_run (integer, optional, default: 1): The maximum number of manifest files to process in one run
  • data_location (string, optional, no default): The path to the data files. Example: prod/files/data_source/

 

Generate a manifest file

The optional ‘generate_manifests’ parameter specifies whether to generate a new manifest file or to use the manifest file that you provide. By default, the ‘generate_manifests’ parameter is not set and defaults to ‘false’, which means that you have to create and provide your own manifest files.

Depending on your business and technical requirements, you can decide to make CSV Downloader generate the manifest file for you. For example, you must let CSV Downloader generate the manifest if you use Excel source files (see Use Excel files as source files).

In generated manifest files, load mode for all source data files is set to ‘full’ by default. If data is loaded in incremental mode, the data files are by default moved to the ‘processed’ folder after being processed (the path to the folder can be set by the ‘move_data_after_processing_to_path’ parameter). To prevent the data files from being moved, set the ‘keep_data_after_processing’ parameter to ‘true’.

"csv": {
  "type": "s3",
  "options": {
    "generate_manifests": true,
    "data_location": "path/to/data/files",
    "file_name": "data_file_name", // optional
    "sheet_name": "sheet_file_name", // optional
    "date_from": "YYYYMMDD", // optional
    "date_to": "YYYYMMDD", // optional
    "keep_data_after_processing": false|true, // optional
    "move_data_after_processing_to_path": "path/to/move/data/to" // optional
  }
}
  • generate_manifests (Boolean, optional, default: false): Specifies whether to generate a new manifest file or to use the manifest file that you provide. NOTE: If you use Parquet source files, set this parameter to 'false'. If you use Excel source files, set this parameter to 'true'.
  • data_location (string, mandatory, no default): The path to the source data files. Example: prod/files/data_source/
  • file_name (string, optional, default: {entity_name}_{version}_{time(%s)}): The format of the source data file name. For more information, see file_name.
  • sheet_name (string, optional, default: entities.xlsx): (Only for XLSX source files) The format of the spreadsheet name in a source data file. If you want to customize the spreadsheet name format, you can use the keywords 'regex', 'time', or 'target_predicate' the same way they can be used in file_name. For example, you may want to set the name format to spreadsheet_{time(%Y.%m.%d)}.xlsx. NOTE: The names of individual sheets must be the same as the names of your entities (tables to download). For example, if you have entities named 'company' and 'revenue', name the sheets in the source file 'company' and 'revenue' as well.
  • date_from (string, optional, no default) and date_to (string, optional, no default): Use 'date_from' and 'date_to' together to specify the time interval within which the data will be downloaded (>= date_from, <= date_to). Format: YYYYMMDD
  • keep_data_after_processing (Boolean, optional, default: false): (For incremental load mode only) Specifies how to handle the data files after they have been processed
    • If not set or set to 'false', the processed data files are moved to the 'processed' folder (set by the 'move_data_after_processing_to_path' parameter).
    • If set to 'true', the processed data files are not moved and remain in their original location (set by the 'data_location' parameter).
  • move_data_after_processing_to_path (string, optional, no default): The path where the data files will be moved to after processing if the 'keep_data_after_processing' parameter is not set or set to 'false'. Example: prod/files/processed/

file_name

The default format of the source file name is the following:

{entity_name}_{version}_{time(%s)}.csv|gz|zip

The entity represents the table to download data from. The {entity_name} section is mandatory, and the rest is optional.

When resolved, the name of a source data file may look like the following:

entity_1.0_20160905140015.csv

To customize the file name format, you can use the following keywords:

  • entity_name: The name of the entity (the table to download data from)

  • version: Use ‘version’ to version your entities. For example, 1.0, 3.23, and so on. If you do not specify ‘version’ in the names of your data files and you create your own feed file, use the value ‘default’ in the ‘version’ column.

  • regex: If a file name has a changing part, use ‘regex’ to be able to process the files. For more information, see https://ruby-doc.org/core-2.1.1/Regexp.html.

  • time: ‘time’ can be set as a timestamp ( {time(%s)} ) or any variation of the YYYYMMDD format (for example, {time(%Y-%m-%d-%H-%M-%S)} ). For more information about formatting tags, see http://ruby-doc.org/core-2.2.0/Time.html#method-i-strftime.

  • target_predicate: The field name used in partial full load (see Set partial full load mode for the entities).

When building the file name format, use symbols such as underscores to separate individual elements:

{entity_name}_{time(%Y%m%d%H%M%S)}_{version}

Processing file names with no separating symbols between individual elements (for example, {entity_name}{time(%Y%m%d%H%M%S)}{version} ) may cause an error.

Example:  To get the following source data file name:

Event_1.0_20160914_R.abc.csv

the file name format may look like the following:

{entity_name}_{version}_{time(%Y%m%d)}_{regex(R\\.abc)}

Notice that the dot in this regexp is escaped, so it matches a literal dot (.), not “any character”.
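
For illustration, the file name format from this example could be plugged into the ‘file_name’ parameter of a configuration that generates manifest files, as in the following sketch (the data location is a placeholder value):

"csv": {
  "type": "s3",
  "options": {
    "generate_manifests": true,
    "data_location": "prod/files/data_source/",
    "file_name": "{entity_name}_{version}_{time(%Y%m%d)}_{regex(R\\.abc)}"
  }
}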

 

Provide a feed file

By default, CSV Downloader does not check the content of downloaded CSV, JSON, or Parquet files. Therefore, you have to provide information about the content structure of the files in a feed file.

If you do not provide the feed file, CSV Downloader generates it. For more information about the feed file, see Feed File.

"csv": {
  "type": "s3",
  "options": {
    "data_structure_info": "path/to/feed/file", // optional
    "feed_on_bds": false|true, // optional
    "default_string_size": "255", // optional
    "generate_feed_from_path": "path/to/test/feed/file", // optional
    "ignore_columns_check": false|true // optional
  }
}
  • data_structure_info (string, optional, no default): The path to the feed file. Example: prod/files/feed  Recommended: {SOURCE_ROOT}/feed  If you enter a path ending with a slash ( / ), all files in the directory will be used. If the feed file is not present in the specified path, CSV Downloader will generate it. NOTE: If you use JSON source files or Parquet source files, set this parameter to the path to the feed file.
  • feed_on_bds (Boolean, optional, default: false): Specifies whether the feed file is located on the BDS instead of the location set by the 'data_structure_info' parameter
    • If not set or set to 'false', the feed file is located in the location set by the 'data_structure_info' parameter.
    • If set to 'true', the feed file is located on the BDS.
  • default_string_size (string, optional, default: 255): The default size of the VARCHAR data type for the generated feed file
  • generate_feed_from_path (string, optional, no default): The path to generate a test feed file in. If this parameter is explicitly set to some path, the feed file is generated but no data is downloaded from the source files. Use this option to test your settings before actually downloading data. NOTE: If you use Parquet source files, do not specify this parameter.
  • ignore_columns_check (Boolean, optional, default: true): Specifies whether to compare column names in the feed file to those in the source files
    • If not set or set to 'true', the column names in the feed file are not compared to those in the source files.
    • If set to 'false', the column names in the feed file are compared to those in the source files.
    NOTE: If you use Parquet source files, set this parameter to 'true'.

 

Set up runtime parameters

You can set up optional processing parameters that are used at runtime.

"csv": {
  "type": "s3",
  "options": {
    "delete_data_after_processing": true|false, // optional
    "ignore_check_sum": true|false, // optional
    "number_of_threads": number_of_threads // optional
  }
}
  • delete_data_after_processing (Boolean, optional, default: false): Specifies whether to delete data from the source after processing
    • If not set or set to 'false', the source data is not deleted.
    • If set to 'true', the source data is deleted.
  • ignore_check_sum (Boolean, optional, default: true): Specifies whether to ignore MD5 checksums when downloading data
    • If not set or set to 'true', MD5 checksums are ignored when downloading data.
    • If set to 'false', MD5 checksums are processed when downloading data.
  • number_of_threads (integer, optional, default: 4 for S3, 1 for other locations): The number of threads that will be used for copying data from the source location to the BDS
    • When set to 1, the maximum file size that CSV Downloader can process is 2 GB.
    • When set to a value greater than 1, the maximum file size is considerably smaller than 2 GB. However, you can configure CSV Downloader to process multiple files in one run.

 

Specify additional load options for the entities

For each entity, you can set up additional load options.

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "hub": ["column_1_name","column_2_name","column_3_name"],
        "export_type": "full|inc", // optional
        "skip_rows": number_of_rows // optional
      }
    }
  }
}
  • export_type (string, optional, no default): The load mode used for loading the entity to the database
    • If you provide your own manifest file (that is, the 'generate_manifests' parameter is set to 'false'), incremental load is used. To use full mode, explicitly set this parameter to 'full'.
    • If CSV Downloader generates the manifest file for you (that is, the 'generate_manifests' parameter is set to 'true'), full load is used. To use incremental mode, explicitly set this parameter to 'inc'.
  • skip_rows (integer, optional, default: 1): The number of rows from the top of a file that CSV Downloader will skip. By default, it is set to 1 and removes the first row (which typically contains a header) from a CSV file. Set it to 0 if you want to use the parameter but want to preserve the data in the first row.

 

Set partial full load mode for the entities

You can specify partial full load mode for some or all entities. In partial full load, the rows that should be removed must contain the same value in the specified field as the new data does.

Example: You have two clients: Client A and Client B. You already have data for those clients in your ADS instance from previous data loads. The data for both Client A and Client B is stored in the same table. The table contains, among others, the ‘id’ and ‘client’ columns. The names of these columns are specified in the ‘hub’ parameter of your configuration file (see Minimum Layout of the Configuration File).

This is how your data currently looks in the table on the ADS instance:

id    text       client
1     aaabbbb    clientA
1     ccddd      clientB

 

Now, you want to reload data for Client A without affecting Client B’s data in the same table.

This is the new data that you want to load for Client A:

id    text       client
2     eeeeeff    clientA

 

After the data load completes, you want the table with the data to look like the following:

id    text       client
2     eeeeeff    clientA
1     ccddd      clientB

Notice that Client A has a different ID and text, and Client B’s data remains the same.

 

To achieve this output, use partial full load, and use the ‘client’ field as its differentiator.

After the next data load runs, the table with the data will be updated to look exactly as you want it to.
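
For this example, assuming the entity is loaded into a table named table_1_name, the relevant part of the configuration (described in detail in Scenario 1 below) might look like the following sketch:

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "hub": ["id","client"],
        "partial_full_load_field": "client"
      }
    }
  }
}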

 

For comparison, this is what your data would look like if you ran the data load in full load mode:

id    text       client
2     eeeeeff    clientA

Notice that the data for Client A is loaded correctly, but the data for Client B is no longer present in the table.

 

How to set partial full load mode: Depending on whether the field that you want to use as the partial full load field does or does not exist in your data, choose one of the following scenarios.

Scenario 1: The field that you want to be the partial full load field exists in your data

  1. In the manifest file (see Manifest File), set up the full load mode for loading the source data. To do so, set the ‘export_type’ field to ‘full’.
  2. In the configuration file, set up the partial full load field:
    • If you want to use partial full load for a specific entity, add the ‘partial_full_load_field’ parameter to the ‘custom’ section for the entity and set it to the name of the field that will be used as the partial full load field:

      "entities": {
        "table_1_name": {
          "global": {
            "custom": {
              "hub": ["column_1_name","column_2_name","column_3_name"],
              "partial_full_load_field": "your_field_name"
            }
          }
        }
      }
      
    • If you want to use partial full load for all entities, add the ‘partial_full_load_field’ parameter to the ‘options’ section of the CSV Downloader instance, and set it to the name of the field that will be used as the partial full load field:

      "csv": {
        "type": "s3",
        "options": {
          "bucket": "s3_bucket_name",
          "access_key": "s3_access_key",
          "folder": "path/to/manifest/file",
          "partial_full_load_field": "your_field_name"
        }
      }
      

Scenario 2: The field that you want to be the partial full load field does not exist in your data

  1. In the manifest file (see Manifest File), set up the full load mode for loading the source data. To do so, set the ‘export_type’ field to ‘full’.
  2. In the configuration file, set up the field that will be used as the partial full load field. Although you can use any field as the partial full load field, we recommend that you use the ‘target_predicate’ field for that purpose, and this procedure describes the process of setting partial full load mode using the ‘target_predicate’ field.
  3. Depending on whether you want to use partial full load for a specific entity or all entities, do one of the following:
    • If you want to use partial full load for a specific entity, configure the ‘custom’ section for the entity. 
      1. Add the ‘partial_full_load_field’ parameter, and set it to ‘target_predicate’.

      2. Add the ‘computed_fields’ section so that the ‘target_predicate’ field gets added to your source and stage tables. For more information about the computed fields, see Computed Fields.

        "entities": {
          "table_1_name": {
            "global": {
              "custom": {
                "hub": ["column_1_name","column_2_name","column_3_name"],
                "partial_full_load_field": "target_predicate",
                "computed_fields": [
                  {
                    "name": "target_predicate",
                    "type": "string-32",
                    "function": "runtime_metadata|target_predicate",
                    "table_type": ["src"],
                    "key": true,
                    "position": "first",
                    "encoding": "RLE"
                  },
                  {
                    "name": "target_predicate",
                    "type": "string-32",
                    "function": "value|target_predicate",
                    "table_type": ["stg"],
                    "key": true,
                    "position": "first",
                    "encoding": "RLE"
                  }
                ]
              }
            }
          }
        }
        
    • If you want to use partial full load for all entities, do the following:
      1. Add the ‘partial_full_load_field’ parameter to the ‘options’ section of the CSV Downloader instance, and set it to ‘target_predicate’:

        "csv": {
          "type": "s3",
          "options": {
            "bucket": "s3_bucket_name",
            "access_key": "s3_access_key",
            "folder": "path/to/manifest/file",
            "partial_full_load_field": "target_predicate"
          }
        }
        
      2. Add the ‘computed_fields’ section to the ‘ads_storage’ section so that the ‘target_predicate’ field gets added to your source and stage tables. For more information about the computed fields, see Computed Fields.

        "ads_storage": {
          "instance_id": "data_warehouse_instance_id",
          "username": "dw_email@address.com",
          "options": {},
          "computed_fields": [
            {
              "name": "target_predicate",
              "type": "string-32",
              "function": "runtime_metadata|target_predicate",
              "table_type": ["src"],
              "key": true,
              "position": "first",
              "encoding": "RLE"
            },
            {
              "name": "target_predicate",
              "type": "string-32",
              "function": "value|target_predicate",
              "table_type": ["stg"],
              "key": true,
              "position": "first",
              "encoding": "RLE"
            }
          ]
        }
        

 

Customize integrators

The ‘integrators’ and ‘ads_storage’ sections define the behavior of integrators.

"integrators": {
  "ads_integrator": {
    "type": "ads_storage",
    "batches": ["csv_downloader_id"]
  }
},
"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {}
}

If you want to add more parameters or change the defaults, see ADS Integrator.

Schedule Parameters

When scheduling CSV Downloader (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.

General Parameters

Some parameters must be entered as secure parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

"csv": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

  • ID (string, mandatory, not secure, no default): The ID of the CSV Downloader instance being scheduled. Must be the same as the 'batches -> csv_downloader_id' parameter and the 'downloaders -> csv_downloader_id' parameter in the configuration file (see Minimum Layout of the Configuration File). Example: csv_downloader_1
  • server_side_encryption (Boolean, optional, not secure, default: false): Specifies whether CSV Downloader can work with an S3 location where server-side encryption is enabled
    • If not set or set to 'false', CSV Downloader is not enabled to work with an S3 location where server-side encryption is enabled.
    • If set to 'true', CSV Downloader is enabled to work with an S3 location where server-side encryption is enabled.

Location-Specific Parameters

Depending on the location where your source data files are stored, provide the corresponding location-specific parameters as secure schedule parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

"csv": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

  • S3
    • csv|options|secret_key (string, mandatory, secure, no default): The secret key to the S3 bucket where the source data is stored
  • SFTP
    • csv|options|password (string, mandatory: see description, secure, no default): The password to the SFTP server to download the data from. This parameter is mandatory only when 'auth_mode' is set to 'password' (see Use an SFTP location). Otherwise, do not use it.
    • csv|options|cert_key (string, mandatory: see description, secure, no default): The private key used for authentication. This parameter is mandatory only when 'auth_mode' is set to 'cert' (see Use an SFTP location). Otherwise, do not use it. The private key must be in the PEM (Privacy Enhanced Mail) format. If the private key is encrypted, also provide the passphrase to decrypt it (see the 'csv|options|cert_passphrase' parameter). NOTE: Although you are entering the private key as a secure parameter, its value will be visible right after you paste it in the parameter value field (the show value check-box is selected) because it is a multi-line string. To hide the value, un-select the show value check-box. When the schedule is saved, the value will not appear in clear-text form in any GUI or log entries. For more information, see Configure Schedule Parameters.
    • csv|options|cert_passphrase (string, mandatory: see description, secure, no default): The passphrase used to decrypt the private key if it is encrypted. This parameter is mandatory only when 'auth_mode' is set to 'cert' (see Use an SFTP location) and the private key (see the 'csv|options|cert_key' parameter) is encrypted. Otherwise, do not use it.
  • WebDAV
    • csv|options|password (string, mandatory, secure, no default): The password to the WebDAV server to download the data from
  • OneDrive
    • token (string, mandatory, secure, no default): The OneDrive refresh token

Encryption-Specific Parameters

If you enable processing of PGP-encrypted source files, provide the encryption-specific parameters as secure schedule parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

"csv": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

  • encryption_name|private_key (string, mandatory, secure, no default): The private key used for decrypting the PGP-encrypted files. If the private key itself is encrypted, also provide the passphrase to decrypt it (see the 'encryption_name|passphrase' parameter). NOTE: Although you are entering the private key as a secure parameter, its value will be visible right after you paste it in the parameter value field (the show value check-box is selected) because it is a multi-line string. To hide the value, un-select the show value check-box. When the schedule is saved, the value will not appear in clear-text form in any GUI or log entries. For more information, see Configure Schedule Parameters.
  • encryption_name|passphrase (string, mandatory: see description, secure, no default): The passphrase used to decrypt the private key if it is encrypted. This parameter is mandatory only when the private key (see the 'encryption_name|private_key' parameter) is encrypted. Otherwise, do not use it.

Schedule Example

The following is an example of how you can specify schedule parameters:
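
For instance, for an S3 location you might provide parameters like the following (the values are illustrative placeholders; enter 'csv|options|secret_key' as a secure parameter):

ID = csv_downloader_1
server_side_encryption = false
csv|options|secret_key = s3_secret_key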

Best Practices

For better performance, we recommend that you apply the following best practices when using CSV Downloader:

Source Files

Format

  • Use GZIPed CSV files.
  • If you have multiple source files, compress each file separately (GZIPed CSV files, not tar.gz).

CSV File Structure

  • Use the UTF-8 encoding without the byte order mark (BOM).
  • Use a comma delimiter ( , ).
  • End lines with \n or \r\n.
  • Quote strings that contain commas, newlines, or other special characters with a double quote character ( " ).
  • Escape quotes in strings with a backslash character ( \ ). Duplicated double quotes are accepted, but are slower to process.
  • Make the first line of each file a header.
  • Keep all dates and datetimes in the same format.

Example:

ID,Name,City,Note,timeCreated,dateOpened,dateClosed
1,Texas Account,"Dallas, TX","Width has to be 5\" at least",2017-11-12 23:08:22.000,2017-11-12,2018-01-11
2,NYC Account,"New York City, NY","",2017-10-28 11:18:12.352,2017-11-12,
3,Prague,Prague,"Width has to be 5\" at least",2017-10-15 13:02:22.193,2017-11-12,

Size

  • Keep the size of each CSV file below 100 MB. Split bigger (up to 1 GB) files into smaller files.
  • Keep the size of each XLSX sheet below 10 MB.
  • The maximum file size that CSV Downloader can process is 2 GB. This limit is applied only if the ‘number_of_threads’ parameter is set to 1 (see Set up runtime parameters). With a higher number of threads, the maximum file size is considerably smaller. Break large data files (greater than 1 GB) into smaller files, and configure CSV Downloader to process multiple files in one run.

Location

  • Store the source files on S3 (see S3 Bucket).
  • On S3, place the source files into the following path: {SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/
  • Keep the source files in the S3 bucket for a while after CSV Downloader has downloaded them. When the data integration process stabilizes over a number of successful executions, you can set up a retention period.

Load

  • Use link files (see Use link files).
  • If you need to do both daily loads and historical loads to the same table, use multiple instances of CSV Downloader (see Add multiple instances of CSV Downloader). If you use multiple instances of CSV Downloader for one ADS Integrator instance, make sure that each instance of CSV Downloader has a unique ID.
  • For each entity, prepare at least one source file.
  • If you are using the computed fields for setting partial full load mode, use the ‘position’ and ‘encoding’ parameters (see Computed Fields) for better performance. Set the ‘position’ parameter to ‘first’, and set the ‘encoding’ parameter to ‘RLE’.

Additional Files

  • Provide your own manifest file (the ‘generate_manifests’ parameter is set to ‘false’). See Manifest File.
  • Provide a feed file. See Feed File.

S3 Bucket

Set up the structure of the S3 bucket as follows:

  • {SOURCE_ROOT}: The root folder/bucket
  • {SOURCE_ROOT}/feed/: The location of a feed file (feed.txt). See Feed File.
  • {SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/: The location of data source files. Keep the number of data source files in one folder below 500. If needed, distribute source data files among subfolders by year/month/date.
  • {SOURCE_ROOT}/manifest/: The location of unprocessed manifest files. Make sure that CSV Downloader has write access to this folder so that it can move the manifest files after they have been processed.
  • {SOURCE_ROOT}/manifest/processed/{YYYY}/{MM}/{DD}/: The location where processed manifest files are moved to. CSV Downloader creates the 'processed' folder with subfolders by year/month/date to limit the number of files in a folder.