Input Data for CSV Downloader

Before you use CSV Downloader, prepare the input data.

For the best performance, we recommend that you use the following setup:

  • Provide your source files in the CSV format.
  • Provide the manifest file.
  • Provide the feed file.
  • Store all the files in an S3 bucket with the recommended folder structure.

Source Files

The source files contain the actual data that you want to add to your workspaces.

You can have one or more source files. For each entity in your data, prepare at least one source file.

  • Provide the source files as CSV files.
  • Compress each source file separately (GZIPed CSV files, not tar.gz).
  • Keep the size of each source file below 1 GB. Split bigger files into smaller files.
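The 1 GB limit above can be enforced at export time. The following sketch splits one large CSV into gzipped parts below a size threshold, repeating the header line in every part so each file stays valid on its own; the function name, file-name pattern, and threshold are illustrative, not part of CSV Downloader:

```python
import gzip

def split_csv_gzip(src_path, out_prefix, max_bytes=1_000_000_000):
    """Split one large CSV into gzipped parts, each below max_bytes of
    uncompressed data, repeating the header line in every part.
    Returns the list of part file names. Illustrative sketch only."""
    parts = []
    with open(src_path, encoding="utf-8", newline="") as src:
        header = src.readline()
        out, written = None, 0
        for line in src:
            size = len(line.encode("utf-8"))
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                name = f"{out_prefix}.{len(parts) + 1}.csv.gz"
                parts.append(name)
                out = gzip.open(name, "wt", encoding="utf-8", newline="")
                out.write(header)
                written = len(header.encode("utf-8"))
            out.write(line)
            written += size
        if out is not None:
            out.close()
    return parts
```

Each part is compressed separately (plain gzipped CSV, not tar.gz), matching the requirements above.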

Format and Structure

Use the following guidelines to format the contents of your source files:

  • Use the UTF-8 encoding without the byte order mark (BOM).
  • Use a comma delimiter ( , ).
  • End lines with \n or \r\n.
  • Quote strings that contain commas, newlines, or other special characters with a double quote character (").
  • Escape quotes in strings with a backslash character ( \ ). Duplicated double quotes are accepted, but are slower to process.
  • Make the first line of each file a header.
  • Keep all dates and datetimes in the same format.
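As a sketch of these guidelines in practice, the snippet below writes a gzipped source file with UTF-8 encoding (no BOM), comma delimiters, \n line endings, quoted strings, and backslash-escaped quotes (the faster option noted above). The file and column names are illustrative only:

```python
import csv
import gzip

rows = [
    ["ID", "Name", "City", "Note"],
    [1, "Texas Account", "Dallas, TX", 'Width has to be 5" at least'],
]

# "utf-8" writes no BOM; newline="" leaves line endings to the writer.
with gzip.open("account.1515628800.csv.gz", "wt",
               encoding="utf-8", newline="") as f:
    writer = csv.writer(
        f,
        delimiter=",",
        lineterminator="\n",
        quotechar='"',
        doublequote=False,   # do not double quotes...
        escapechar="\\",     # ...escape them with a backslash instead
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    )
    writer.writerows(rows)
```

With these dialect settings, a field such as `Dallas, TX` is quoted and an embedded quote is written as `\"`, matching the example that follows.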

Location

Place the source files in the following path on S3:

{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/

Example

In the following example, notice how the values containing commas in the ‘City’ column are quoted, and how double quotes inside quoted strings are escaped with a backslash.

ID,Name,City,Note,timeCreated,dateOpened,dateClosed
1,Texas Account,"Dallas, TX","Width has to be 5\" at least",2017-11-12 23:08:22.000,2017-11-12,2018-01-11
2,NYC Account,"New York City, NY","",2017-10-28 11:18:12.352,2017-11-12,
3,Prague,Prague,"Width has to be 5\" at least",2017-10-15 13:02:22.193,2017-11-12,

Manifest File

A manifest file describes the source data files: which files belong to the batch, what entity each file corresponds to, and the timestamp of the export (important for incremental loads). The manifest file can also contain MD5 checksums and row counts to verify file integrity.

CSV Downloader processes only the source files that are referenced in the manifest file; any source data file that is not listed there is ignored. This lets you control exactly what each batch exports. For example, you can export some entities monthly while other entities are exported daily.

Create a new manifest file for each load. Upload it as the last file after all the source files from a specific batch have been uploaded.

Format and Structure

The manifest file is a text file delimited with vertical bars ( | ).

The manifest file can have the following columns:

file_url (mandatory)
    The path to the source data file.
    Examples:
      • Source files on S3: s3://bucket/folder/account.1515628800.csv
      • Source files on SFTP, WebDAV, Google Cloud Storage, or OneDrive (do not include the root directory in the path): /folder/account.1515628800.csv

timestamp (mandatory)
    The UNIX timestamp representing the time when the source data file was uploaded to storage.

feed (mandatory)
    The name of the entity (table) to download data from. The name must match the name of the entity in the feed file.

feed_version (optional)
    The version in the feed file that the source data file is connected to. The version must match the version of the entity in the feed file.
    NOTE: You can have only one version of the same entity in one manifest file.

num_rows (optional)
    The number of rows in the source data file. Use 'num_rows' to verify the integrity of an upload batch. If you want CSV Downloader to skip the verification, put 'unknown' in this column.

md5 (optional)
    The MD5 checksum of the source data file. If you want CSV Downloader to skip the MD5 check, put 'unknown' in this column.

export_type (optional)
    The load mode used for loading the source data file to the database:
      • If not set or set to 'inc', incremental load is used.
      • If set to 'full', full load is used.
      • If set to 'delete', CSV Downloader deletes the data from ADS based on the primary key that is set by the 'hub' parameter in the configuration file for CSV Downloader.
        • The source CSV file must contain a header with the table columns that generate the primary key. The source file may or may not contain other columns (the feed file is ignored in this case, and only the columns generating the primary key are verified).
        • If the primary key contains more than one column (that is, the 'hub' parameter contains more than one column name), the column names in the header must be specified in the same order as in the 'hub' parameter.
        • The names of the columns that generate the primary key are case-sensitive. The column names in the header must use the same case as in the 'hub' parameter.

target_predicate (optional)
    The field used for partial full load (see 'Set partial full load mode for the entities' in CSV Downloader).
    NOTE: You can also use this field as a reference field for defining the partition of the source tables (see the 'drop_source_partition' parameter in ADS Integrator).
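Putting the columns above together, a small helper (hypothetical, not part of CSV Downloader) can compose one pipe-delimited manifest row, defaulting the optional columns to 'unknown' and 'inc':

```python
import time

def manifest_row(file_url, feed, feed_version="1.0",
                 num_rows="unknown", md5="unknown", export_type="inc",
                 timestamp=None):
    """Compose one pipe-delimited manifest row with the columns
    described above. 'unknown' skips the row-count and MD5 checks;
    the helper itself is illustrative only."""
    if timestamp is None:
        timestamp = int(time.time())  # upload time as a UNIX timestamp
    return "|".join(str(v) for v in
                    (file_url, timestamp, feed, feed_version,
                     num_rows, md5, export_type))
```

For example, `manifest_row("s3://bucket/folder/account.1515628800.csv.gz", "Account", timestamp=1515628800)` yields one row of the manifest shown in the example below.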

File Name Format

The name of a manifest file defines the order in which the manifest files will be processed.

Use the default format of the manifest file name:

manifest_{time(%s)}.csv

When resolved, the name of a manifest file may look like the following:

manifest_1468493700.csv
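The {time(%s)} placeholder resolves to a UNIX timestamp, so the name can be generated from the current time:

```python
import time

# Resolve manifest_{time(%s)}.csv using the current UNIX timestamp,
# e.g. manifest_1468493700.csv.
manifest_name = f"manifest_{int(time.time())}.csv"
```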

Location

Place the manifest file in the following path on S3:

{SOURCE_ROOT}/manifest/

Example

The following example of a manifest file defines the export of four entities: ‘Account’, ‘User’, ‘Product’ (all in version 1.0), and ‘Facts’ (in version 1.2). The ‘User’ and ‘Product’ entities are marked for full load mode with the MD5 check disabled. The ‘Facts’ entity is split into two files for performance reasons; both files are referenced in the manifest file.

file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
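The 'md5' and 'num_rows' values in the example above can be computed before the manifest is written. The sketch below assumes the checksum is taken over the gzipped file exactly as uploaded and that 'num_rows' excludes the header line:

```python
import gzip
import hashlib

def integrity_fields(path):
    """Compute the 'md5' and 'num_rows' manifest columns for one
    gzipped source file. Assumptions: MD5 of the file as uploaded
    (compressed), row count excluding the header. Illustrative only;
    CSV Downloader performs the actual verification."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in 1 MB chunks to keep memory use flat.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        num_rows = sum(1 for _ in f) - 1  # subtract the header line
    return md5.hexdigest(), num_rows
```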

For detailed information about the manifest file, see Manifest File.

Feed File

A feed file describes the structure of the source data files and the data types of their columns. Once you have created and uploaded the feed file, keep it in its folder and do not change it until the structure of the source files changes (for example, a column is added to a source file).

You can have only one feed file.

Format and Structure

The feed file is a text file delimited with vertical bars ( | ). The whole content of the feed file is case-sensitive.

The feed file has the following columns:

file (mandatory)
    The name of the entity.

version (mandatory)
    The version of the entity. Use 'version' to version entities in the source files if you need to process files with different structures. You can have multiple versions of the same entity, and you can keep an older version of the entity in the feed file to allow a smooth transition between versions.
    The version is a decimal number.
    Examples: 1.0, 3.23

field (mandatory)
    The name of the column. It can differ from the column name in the header of the CSV file because the header of the CSV file is ignored.

type (mandatory)
    The data type of the field.
    For the supported types, see Data Types.

order (mandatory)
    The order in which the fields are sorted in the source files (from the lowest to the highest number; the first column has order 0).

date_format (optional)
    The format of dates in the data.
    Example: 'DD.MM.YYYY'
    For more information, see Template Patterns for Date/Time Formatting in Vertica.

encoding (optional)
    The encoding type of the fields.
    For information about available options, see Encoding Types in Vertica.

Data Types

The following field data types are supported:

SQL type representation         Feed file type name
varchar(X)                      varchar(X)
varchar(255)                    varchar
decimal(16,10)                  decimal
decimal(X,Y)                    decimal(X,Y)
boolean                         boolean
integer                         integer
date (with time)                datetime
date (without time)             date
time (without date)             time
timestamp (without timezone)    timestamp without timezone
array[int]                      array[int]*
array[varchar]                  array[varchar]*
array[varchar(X)]               array[varchar(X)]*

* The database parser must be the Vertica fcsvparser parser with its type set to either traditional or rfc4180 (see ‘Describe the structure of the source CSV files’ in CSV Downloader):

"db_parser": "fcsvparser-traditional"

or

"db_parser": "fcsvparser"

File Name

Use the default format of the feed file name:

feed.txt

Location

Place the feed file in the following path on S3:

{SOURCE_ROOT}/feed/

Example

The following example of the feed file defines four entities: ‘Account’, ‘User’, ‘Product’, and ‘Facts’. The ‘Facts’ entity is defined in version 1.0 (initial version) and version 1.2 (the ‘product_id’ and ‘fact2’ fields were added):

file|version|field|type|order|date_format
Account|1.0|ID|integer|0|
Account|1.0|Name|varchar(255)|1|
Account|1.0|City|varchar(255)|2|
Account|1.0|Note|varchar(255)|3|
Account|1.0|timeCreated|timestamp|4|YYYY-MM-DD HH24:MI:SS.MS
Account|1.0|dateOpened|date|5|YYYY-MM-DD
Account|1.0|dateClosed|timestamp|6|YYYY-MM-DD HH24:MI:SS.MS
User|1.0|ID|integer|0|
User|1.0|Name|varchar(255)|1|
User|1.0|Attribute1|varchar(255)|2|
Product|1.0|ID|integer|0|
Product|1.0|product_name|varchar(255)|1|
Product|1.0|type|varchar(255)|2|
Facts|1.0|ID|integer|0|
Facts|1.0|fact1|decimal(20,5)|1|
Facts|1.2|ID|integer|0|
Facts|1.2|fact1|decimal(20,5)|1|
Facts|1.2|product_id|integer|2|
Facts|1.2|fact2|decimal(20,5)|3|
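Feed rows like those above can be generated from a simple field list. The helper below is purely illustrative; it assigns 'order' from 0 in list order, matching the column order of the source files:

```python
def feed_lines(entity, version, fields):
    """Render feed-file rows for one entity version.
    'fields' is a list of (field, type, date_format) tuples;
    'order' is assigned from 0 upward. Illustrative helper only."""
    return [f"{entity}|{version}|{name}|{ftype}|{i}|{fmt}"
            for i, (name, ftype, fmt) in enumerate(fields)]
```

For example, `feed_lines("Facts", "1.2", [("ID", "integer", ""), ("fact1", "decimal(20,5)", "")])` reproduces the first two ‘Facts’ 1.2 rows above.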

For detailed information about the feed file, see Feed File.

S3 Bucket

Set up the structure of the S3 bucket as follows:

{SOURCE_ROOT}
    The root folder/bucket.

{SOURCE_ROOT}/feed/
    The location of the feed file (feed.txt).

{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/
    The location of the source data files.
    NOTE: Keep the number of source data files in one folder below 500. If needed, distribute the files among subfolders by year/month/day.

{SOURCE_ROOT}/manifest/
    The location of unprocessed manifest files.
    NOTE: Make sure that CSV Downloader has write access to this folder so that it can move the manifest files after they have been processed.

{SOURCE_ROOT}/manifest/processed/{YYYY}/{MM}/{DD}/
    The location where processed manifest files are moved.
    NOTE: CSV Downloader creates the 'processed' folder with subfolders by year/month/day to limit the number of files in a folder.
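The dated data prefix can be derived from the upload time. The 'data_prefix' helper below is hypothetical; adjust it to your bucket layout:

```python
from datetime import datetime, timezone

def data_prefix(source_root, when=None):
    """Build the {SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/ prefix for a
    batch, using zero-padded month and day. Illustrative only."""
    when = when or datetime.now(timezone.utc)
    return f"{source_root}/data/{when:%Y}/{when:%m}/{when:%d}/"
```

For example, `data_prefix("s3://bucket", datetime(2018, 1, 11, tzinfo=timezone.utc))` yields `s3://bucket/data/2018/01/11/`.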