Input Data for CSV Downloader
Before you use CSV Downloader, prepare the input data.
For the best performance, we recommend that you use the following setup:
- Provide your source files in the CSV format.
- Provide the manifest file.
- Provide the feed file.
- Store all the files in an S3 bucket with the recommended folder structure.
Source Files
The source files contain the actual data that you want to add to your workspaces.
You can have one or more source files. For each entity in your data, prepare at least one source file.
- Provide the source files as CSV files.
- Compress each source file separately (GZIPed CSV files, not tar.gz).
- Keep the size of each source file below 1 GB. Split bigger files into smaller files.
Format and Structure
Use the follow guidelines to format the contents of your source files:
- Use the UTF-8 encoding without the byte order mark (BOM).
- Use a comma delimiter (
,
). - End lines with
\n
or\r\n
. - Quote strings that contain commas, newlines, or other special characters with a double quote character (
"
). - Escape quotes in strings with a backslash character (
\
). Duplicated double quotes are accepted, but are slower to process. - Make the first line of each file a header.
- Keep all dates and datetimes in the same format.
If you cannot export data in the recommended way, contact the GoodData specialist who is working with you on setting up the bricks.
Location
Place the source files on S3 into the following path:
{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/
Keep the source files in the S3 bucket for a while after CSV Downloader downloaded them. When the data integration process stabilizes over a number of successful executions, you can set up a retention period.
Example
In the following example, notice how values containing commas in the ‘City’ column are quoted, and quotes inside other quotes are escaped with a backslash.
ID,Name,City,Note,timeCreated,dateOpened,dateClosed
1,Texas Account,"Dallas, TX","Width has to be 5\" at least",2017-11-12 23:08:22.000,2017-11-12,2018-01-11
2,NYC Account,"New York City, NY","",2017-10-28 11:18:12.352,2017-11-12,
3,Prague,Prague,"Width has to be 5\" at least",2017-10-15 13:02:22.193,2017-11-12,
Manifest File
A manifest file describes the source data files, for example, which files belong to the batch, what entity they correspond to, the timestamp of export (this is important for incremental load). The manifest file can also contain hashes and the number of rows to ensure file integrity.
CSV Downloader processes only the source files that are referenced in the manifest file. A source data file that is not mentioned in the manifest file will not be processed. This way, you can specify only those source files in the manifest that you want to export, and CSV Downloader will process only those files. For example, you can export specific entities monthly while other entities should be exported daily.
Create a new manifest file for each load. Upload it as the last file after all the source files from a specific batch have been uploaded.
Though you can have multiple manifest files, we recommend that you have one manifest file per load batch. Data files within one manifest file are processed together. Having multiple manifest files may slow down performance.
Format and Structure
The manifest file is a text file delimited with vertical bars ( | ).
The manifest file can have the following columns:
Name | Mandatory? | Description |
---|---|---|
file_url | yes | The path to the source data file Examples:
|
timestamp | yes | The UNIX timestamp representing the time when the source data file was uploaded to storage |
feed | yes | The name of the entity (table) to download data from The name must match the name of the entity in the feed file. |
feed_version | no | The version in the feed file that the source data file is connected to The version must match the version of the entity in the feed file. NOTE: You can have only one version of the same entity in one manifest file. |
num_rows | no | The number of rows in the source data file Use 'num_rows' to verify integrity of an upload batch. If you want CSV Downloader to skip the verification, put 'unknown' to this column. |
md5 | no | The MD5 checksum of the source data file If you want CSV Downloader to skip the MD5 check, put 'unknown' to this column. |
export_type | no | Load mode used for loading the source data file to the database
|
target_predicate | no | The field used for partial full load (see 'Set partial full load mode for the entities' in CSV Downloader) NOTE: You can also use this field as a reference field for defining the partition of the source tables (see the 'drop_source_partition' parameter in ADS Integrator). |
File Name Format
The name of a manifest file defines the order in which the manifest files will be processed.
Use the default format of the manifest file name:
manifest_{time(%s)}.csv
When resolved, the name of a manifest file may look like the following:
manifest_1468493700.csv
Location
Place the manifest file on S3 into the following path:
{SOURCE_ROOT}/manifest/
Example
The following example of the manifest file defines export of four entities: ‘Account’, ‘User’, ‘Product’ (all in version 1.0), and ‘Facts’ (in version 1.2). The ‘User’ and ‘Product’ entities are marked for full load mode with the MD5 checksum disabled. The ‘Facts’ is split into two files for performance reasons. The both files are referenced in the manifest file.
file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
For detailed information about the manifest file, see Manifest File.
Feed File
A feed file describes the structure of source data files and data types in their columns. Once you created and uploaded the feed file, keep it in the folder and do not change it until the structure of the source files changes (for example, a column is added to a source file).
You can have only one feed file.
Format and Structure
The feed file is a text file delimited with vertical bars ( | ). The whole content of the feed file is case-sensitive.
The feed file has the following columns:
Name | Mandatory? | Description |
---|---|---|
file | yes | The name of the entity |
version | yes | The version of the entity Use 'version' to version entities in the source files if you want to process the files in different structures. You can put multiple versions of the same entity, and you can keep the older version of the entity in the feed file to allow smooth transition between versions. The version can have one decimal place. Example: |
field | yes | The name of the column It can differ from the column name in the header of the CSV file because the header of the CSV file is ignored. |
type | yes | The data type of the field For the supported types, see Data Types. |
order | yes | The order in which the fields are sorted in the source files (from the lowest to the highest number; the first column has 0) |
date_format | no | The format of dates in data Example: For more information, see Template Patterns for Date/Time Formatting in Vertica. |
encoding | no | The encoding type of fields For information about available options, see Encoding Types in Vertica. |
Data Types
The following field data types are supported:
SQL type representation | Feed file type name |
---|---|
varchar(X) | varchar(X) |
varchar(255) | varchar |
decimal(16,10) | decimal |
decimal(X,Y) | decimal(X,Y) |
boolean | boolean |
integer | integer |
date (with time) | datetime |
date (without time) | date |
time (without date) | time |
timestamp (without timezone) | timestamp without timezone |
array[int] | array[int]* |
array[varchar] | array[varchar]* |
array[varchar(X)] | array[varchar(X)]* |
^*^ The database parser must be the Vertica fcsvparser parser with type
set to either traditional
or rfc4180
(see ‘Describe the structure of the source CSV files’ in CSV Downloader):
"db_parser": "fcsvparser-traditional"
or
"db_parser": "fcsvparser"
File Name
Use the default format of the feed file name:
feed.txt
Location
Place the feed file on S3 into the following path:
{SOURCE_ROOT}/feed/
Example
The following example of the feed file defines four entities: ‘Account’, ‘User’, ‘Product’, and ‘Facts’. The ‘Facts’ entity is defined in version 1.0 (initial version) and version 1.2 (the ‘product_id’ and ‘fact2’ fields were added):
file|version|field|type|order|date_format
Account|1.0|ID|integer|0|
Account|1.0|Name|varchar(255)|1|
Account|1.0|City|varchar(255)|2|
Account|1.0|Note|varchar(255)|3|
Account|1.0|timeCreated|timestamp|4|YYYY-MM-DD HH24:MI:SS.MS
Account|1.0|dateOpened|date|5|YYYY-MM-DD
Account|1.0|dateClosed|timestamp|6|YYYY-MM-DD HH24:MI:SS.MS
User|1.0|ID|integer|0|
User|1.0|Name|varchar(255)|1|
User|1.0|Attribute1|varchar(255)|2|
Product|1.0|ID|integer|0|
Product|1.0|product_name|varchar(255)|1|
Product|1.0|type|varchar(255)|2|
Facts|1.0|ID|integer|0|
Facts|1.0|fact1|decimal(20,5)|1|
Facts|1.2|ID|integer|0|
Facts|1.2|fact1|decimal(20,5)|1|
Facts|1.2|product_id|integer|2|
Facts|1.2|fact2|decimal(20,5)|3|
For detailed information about the feed file, see Feed File.
S3 Bucket
Set up the structure of the S3 bucket as follows:
Folder | Purpose | Note |
---|---|---|
{SOURCE_ROOT} | The root folder/bucket | |
{SOURCE_ROOT}/feed/ | The location of the feed file (feed.txt) | |
{SOURCE_ROOT}/data/{YYYY}/{MM}/{DD}/ | The location of data source files | Keep the number of data source files in one folder below 500. If needed, distribute source data files among subfolders by year/month/date. |
{SOURCE_ROOT}/manifest/ | The location of unprocessed manifest files | Make sure that CSV Downloader has write access to this folder so that it can move the manifest files after they have been processed. |
{SOURCE_ROOT}/manifest/processed/{YYYY}/{MM}/{DD}/ | The location where processed manifests are moved to | CSV Downloader creates the 'processed' folder with subfolders by year/month/date to limit the number of files in a folder. |