You are viewing our older product's guide. Click here for the documentation of GoodData Cloud, our latest and most advanced product.

Input Data for CSV Downloader

Before you use CSV Downloader, prepare the input data.

For the best performance, we recommend that you use the following setup:

Provide your source files in the CSV format.
Provide the manifest file.
Provide the feed file.
Store all the files in an S3 bucket with the recommended folder structure.

Source Files

The source files contain the actual data that you want to add to your workspaces.

You can have one or more source files. For each entity in your data, prepare at least one source file.

Provide the source files as CSV files.
Compress each source file separately (GZIPed CSV files, not tar.gz).
Keep the size of each source file below 1 GB. Split bigger files into smaller files.

Format and Structure

Use the follow guidelines to format the contents of your source files:

Use the UTF-8 encoding without the byte order mark (BOM).
Use a comma delimiter ( , ).
End lines with \n or \r\n.
Quote strings that contain commas, newlines, or other special characters with a double quote character (").
Escape quotes in strings with a backslash character ( \ ). Duplicated double quotes are accepted, but are slower to process.
Make the first line of each file a header.
Keep all dates and datetimes in the same format.

If you cannot export data in the recommended way, contact the GoodData specialist who is working with you on setting up the bricks.

Location

Place the source files on S3 into the following path:

{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/

Keep the source files in the S3 bucket for a while after CSV Downloader downloaded them. When the data integration process stabilizes over a number of successful executions, you can set up a retention period.

Example

In the following example, notice how values containing commas in the ‘City’ column are quoted, and quotes inside other quotes are escaped with a backslash.

ID,Name,City,Note,timeCreated,dateOpened,dateClosed
1,Texas Account,"Dallas, TX","Width has to be 5\" at least",2017-11-12 23:08:22.000,2017-11-12,2018-01-11
2,NYC Account,"New York City, NY","",2017-10-28 11:18:12.352,2017-11-12,
3,Prague,Prague,"Width has to be 5\" at least",2017-10-15 13:02:22.193,2017-11-12,

Manifest File

A manifest file describes the source data files, for example, which files belong to the batch, what entity they correspond to, the timestamp of export (this is important for incremental load). The manifest file can also contain hashes and the number of rows to ensure file integrity.

CSV Downloader processes only the source files that are referenced in the manifest file. A source data file that is not mentioned in the manifest file will not be processed. This way, you can specify only those source files in the manifest that you want to export, and CSV Downloader will process only those files. For example, you can export specific entities monthly while other entities should be exported daily.

Create a new manifest file for each load. Upload it as the last file after all the source files from a specific batch have been uploaded.

Though you can have multiple manifest files, we recommend that you have one manifest file per load batch. Data files within one manifest file are processed together. Having multiple manifest files may slow down performance.

Format and Structure

The manifest file is a text file delimited with vertical bars ( | ).

The manifest file can have the following columns:

Name	Mandatory?	Description
file_url	yes	The path to the source data file Examples: Source files on S3: `s3://bucket/folder/account.1515628800.csv` Source files on SFTP, WebDAV, Google Cloud Storage, or One Drive (do not include the root directory in the path): `/folder/account.1515628800.csv`
timestamp	yes	The UNIX timestamp representing the time when the source data file was uploaded to storage
feed	yes	The name of the entity (table) to download data from The name must match the name of the entity in the feed file.
feed_version	no	The version in the feed file that the source data file is connected to The version must match the version of the entity in the feed file. NOTE: You can have only one version of the same entity in one manifest file.
num_rows	no	The number of rows in the source data file Use 'num_rows' to verify integrity of an upload batch. If you want CSV Downloader to skip the verification, put 'unknown' to this column.
md5	no	The MD5 checksum of the source data file If you want CSV Downloader to skip the MD5 check, put 'unknown' to this column.
export_type	no	Load mode used for loading the source data file to the database If not set or set to 'inc', incremental load is used. If set to 'full', full load is used. If set to 'delete', CSV Downloader deletes the data from ADS based on the primary key that is set by the 'hub' parameter in the configuration file for CSV Downloader. The source CSV file must contain the header with the table columns that generate the primary key. The source file may or may not contain other columns (the feed file is ignored in this case, and only the columns generating the primary key are verified). If the primary key contains more than one column (that is, the 'hub' parameter contains more than one column name), the column names in the header must be specified in the same order as they are specified in the 'hub' parameter. The names of the columns generating the primary key columns are case-sensitive. The column names in the header must be specified in the same case as they are specified in the 'hub' parameter.
target_predicate	no	The field used for partial full load (see 'Set partial full load mode for the entities' in CSV Downloader) NOTE: You can also use this field as a reference field for defining the partition of the source tables (see the 'drop_source_partition' parameter in ADS Integrator).

File Name Format

The name of a manifest file defines the order in which the manifest files will be processed.

Use the default format of the manifest file name:

manifest_{time(%s)}.csv

When resolved, the name of a manifest file may look like the following:

manifest_1468493700.csv

Location

Place the manifest file on S3 into the following path:

{SOURCE_ROOT}/manifest/

Example

The following example of the manifest file defines export of four entities: ‘Account’, ‘User’, ‘Product’ (all in version 1.0), and ‘Facts’ (in version 1.2). The ‘User’ and ‘Product’ entities are marked for full load mode with the MD5 checksum disabled. The ‘Facts’ is split into two files for performance reasons. The both files are referenced in the manifest file.

file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc

For detailed information about the manifest file, see Manifest File.

Feed File

A feed file describes the structure of source data files and data types in their columns. Once you created and uploaded the feed file, keep it in the folder and do not change it until the structure of the source files changes (for example, a column is added to a source file).

You can have only one feed file.

Format and Structure

The feed file is a text file delimited with vertical bars ( | ). The whole content of the feed file is case-sensitive.

The feed file has the following columns:

Name	Mandatory?	Description
file	yes	The name of the entity
version	yes	The version of the entity Use 'version' to version entities in the source files if you want to process the files in different structures. You can put multiple versions of the same entity, and you can keep the older version of the entity in the feed file to allow smooth transition between versions. The version can have one decimal place. Example: `1.0` `3.23`
field	yes	The name of the column It can differ from the column name in the header of the CSV file because the header of the CSV file is ignored.
type	yes	The data type of the field For the supported types, see Data Types.
order	yes	The order in which the fields are sorted in the source files (from the lowest to the highest number; the first column has 0)
date_format	no	The format of dates in data Example: `'DD.MM.YYYY'` For more information, see Template Patterns for Date/Time Formatting in Vertica.
encoding	no	The encoding type of fields For information about available options, see Encoding Types in Vertica.

Data Types

The following field data types are supported:

SQL type representation	Feed file type name
`varchar(X)`	`varchar(X)`
`varchar(255)`	`varchar`
`decimal(16,10)`	`decimal`
`decimal(X,Y)`	`decimal(X,Y)`
`boolean`	`boolean`
`integer`	`integer`
`date (with time)`	`datetime`
`date (without time)`	`date`
`time (without date)`	`time`
`timestamp (without timezone)`	`timestamp without timezone`
`array[int]`	`array[int]*`
`array[varchar]`	`array[varchar]*`
`array[varchar(X)]`	`array[varchar(X)]*`

^*^ The database parser must be the Vertica fcsvparser parser with type set to either traditional or rfc4180 (see ‘Describe the structure of the source CSV files’ in CSV Downloader):

"db_parser": "fcsvparser-traditional"

"db_parser": "fcsvparser"

File Name

Use the default format of the feed file name:

feed.txt

Location

Place the feed file on S3 into the following path:

{SOURCE_ROOT}/feed/

Example

The following example of the feed file defines four entities: ‘Account’, ‘User’, ‘Product’, and ‘Facts’. The ‘Facts’ entity is defined in version 1.0 (initial version) and version 1.2 (the ‘product_id’ and ‘fact2’ fields were added):

file|version|field|type|order|date_format
Account|1.0|ID|integer|0|
Account|1.0|Name|varchar(255)|1|
Account|1.0|City|varchar(255)|2|
Account|1.0|Note|varchar(255)|3|
Account|1.0|timeCreated|timestamp|4|YYYY-MM-DD HH24:MI:SS.MS
Account|1.0|dateOpened|date|5|YYYY-MM-DD
Account|1.0|dateClosed|timestamp|6|YYYY-MM-DD HH24:MI:SS.MS
User|1.0|ID|integer|0|
User|1.0|Name|varchar(255)|1|
User|1.0|Attribute1|varchar(255)|2|
Product|1.0|ID|integer|0|
Product|1.0|product_name|varchar(255)|1|
Product|1.0|type|varchar(255)|2|
Facts|1.0|ID|integer|0|
Facts|1.0|fact1|decimal(20,5)|1|
Facts|1.2|ID|integer|0|
Facts|1.2|fact1|decimal(20,5)|1|
Facts|1.2|product_id|integer|2|
Facts|1.2|fact2|decimal(20,5)|3|

For detailed information about the feed file, see Feed File.

S3 Bucket

Set up the structure of the S3 bucket as follows:

Folder	Purpose	Note
`{SOURCE_ROOT}`	The root folder/bucket
`{SOURCE_ROOT}/feed/`	The location of the feed file (feed.txt)
`{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/`	The location of data source files	Keep the number of data source files in one folder below 500. If needed, distribute source data files among subfolders by year/month/date.
`{SOURCE_ROOT}/manifest/`	The location of unprocessed manifest files	Make sure that CSV Downloader has write access to this folder so that it can move the manifest files after they have been processed.
`{SOURCE_ROOT}/manifest/processed/{YYYY}/{MM}/{DD}/`	The location where processed manifests are moved to	CSV Downloader creates the 'processed' folder with subfolders by year/month/date to limit the number of files in a folder.

Manifest File