Manifest File

A manifest file describes the source data files to download and confirms the completeness and integrity of an upload batch. It lists which files belong to the batch, which entity each file corresponds to, and the timestamp of the export (important for incremental load). It can also contain hashes and row counts to verify file integrity.

CSV Downloader processes only the source files that are referenced in the manifest file. A source data file that is not mentioned in the manifest file is not processed. This lets you list in the manifest only the source files that you want processed in a given load. For example, you can export specific entities monthly while other entities are exported daily.

When CSV Downloader reads the manifest file, it processes either all or none of the source files. It cannot process some source files from the manifest file and not process the rest.

Create a new manifest file for each load. Upload it as the last file after all the source files from a specific batch have been uploaded.
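
The upload order described above can be sketched as follows. This is only an illustration and not part of CSV Downloader: 'upload_batch' and the 'upload' callable are hypothetical names standing in for whatever client you use to write files to your storage.

```python
from typing import Callable, Iterable


def upload_batch(upload: Callable[[str], None],
                 source_files: Iterable[str],
                 manifest_file: str) -> None:
    """Upload every source file first, then the manifest as the last file.

    `upload` is a hypothetical callable that sends one local file to storage.
    """
    for path in source_files:
        upload(path)
    # The manifest goes last, so it never references a file that is still missing.
    upload(manifest_file)
```

Because the manifest arrives only after all of its source files, a batch that is still being uploaded is not picked up half-finished.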

Provide Your Own Manifest Files

The optional ‘generate_manifests’ parameter in the configuration file (see ‘Set optional parameters for your manifest file’ in CSV Downloader) specifies whether to generate a manifest file or to use the manifest file that you provide. By default, the ‘generate_manifests’ parameter is not set and defaults to ‘false’, which means that you have to provide your own manifest files.

When creating your own manifest files, make sure that their names and structure meet the requirements that are described in this section.

File Name Format

The name of a manifest file defines the order in which the manifest files will be processed.

Use the Default Format

The default format of the manifest file name is the following:

manifest_{time(%s)}.csv

When resolved, the name of a manifest file may look like the following:

manifest_1468493700.csv

If you want to use the default name format for your manifest, you can use it right away without setting any additional parameters.
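
As a minimal sketch, a name in the default format could be produced like this. The function name 'default_manifest_name' is an assumption made for illustration; '%s' is taken here to mean the UNIX timestamp in seconds, as in the resolved example above.

```python
import time
from typing import Optional


def default_manifest_name(epoch_seconds: Optional[int] = None) -> str:
    """Return a name in the default manifest_{time(%s)}.csv format."""
    if epoch_seconds is None:
        epoch_seconds = int(time.time())  # current UNIX timestamp in seconds
    return f"manifest_{epoch_seconds}.csv"


print(default_manifest_name(1468493700))  # manifest_1468493700.csv
```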

Customize the Format

If you want to customize the file name format, define your format and set the ‘manifest’ parameter in the configuration file to it (see ‘Set optional parameters for your manifest file’ in CSV Downloader).

You can use the following keywords in the file name:

  • sequence: Include ‘sequence’ in the file names to ensure that they are loaded in the correct order. If you use ‘sequence’, all manifest file names for a given manifest must contain consecutive sequence numbers (1, 2, 3, and so on; CSV Downloader always expects the last sequence number + 1); otherwise, you will receive an error.
  • regex: If a file name has a changing part, use ‘regex’ to be able to process the files. For more information, see https://ruby-doc.org/core-2.1.1/Regexp.html.
  • time: If ‘sequence’ is not present, manifest files are sorted by time. ‘time’ can be set as a timestamp ( {time(%s)} ) or in any YYYYMMDD-style date format (for example, {time(%Y-%m-%d-%H-%M-%S)} ). For more information about the format tags, see http://ruby-doc.org/core-2.2.0/Time.html#method-i-strftime.
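
The "last sequence number + 1" rule for ‘sequence’ can be checked before uploading with a sketch like the one below. It does not reproduce CSV Downloader's internal logic; 'check_sequence' and its arguments are hypothetical, and the behavior when nothing has been loaded yet is an assumption.

```python
from typing import List, Optional


def check_sequence(last_loaded: Optional[int], incoming: List[int]) -> None:
    """Raise ValueError unless each number is the previous one plus 1.

    `last_loaded` is the sequence number of the last processed manifest,
    or None if nothing has been loaded yet (assumption: the first incoming
    number is then accepted as the starting point).
    """
    expected = None if last_loaded is None else last_loaded + 1
    for number in sorted(incoming):
        if expected is not None and number != expected:
            raise ValueError(f"expected sequence {expected}, got {number}")
        expected = number + 1


check_sequence(3, [4, 5, 6])  # passes silently: 4, 5, 6 follow 3 without gaps
```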

Examples:

  • To get the following manifest file name:

    manifest_1.20180217104924.csv
    

    the file name format may look like the following:

    manifest_{sequence}.{time(%Y%m%d%H%M%S)}.csv
    
  • To get the following manifest file name:

    manifest-datafeed_pot_1.20160905140015.csv
    

    the file name definition may look like the following (notice how ‘\’ is escaped with ‘\’):

    {regex(manifest-datafeed_pot_\\d+)}.{time(%Y%m%d%H%M%S)}.csv
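
To see how the two formats above relate to the resolved names, here is a small Python sketch. It builds the first name with strftime and matches the second with a hand-written regular expression. The Python pattern only approximates how the format would be resolved and is an assumption, not CSV Downloader's own parser.

```python
import re
from datetime import datetime

# Format: manifest_{sequence}.{time(%Y%m%d%H%M%S)}.csv
sequence = 1
exported_at = datetime(2018, 2, 17, 10, 49, 24)
name = f"manifest_{sequence}.{exported_at.strftime('%Y%m%d%H%M%S')}.csv"
print(name)  # manifest_1.20180217104924.csv

# Format: {regex(manifest-datafeed_pot_\\d+)}.{time(%Y%m%d%H%M%S)}.csv
# Hand-written Python equivalent of the resolved pattern (an approximation).
pattern = re.compile(r"(manifest-datafeed_pot_\d+)\.(\d{14})\.csv")
match = pattern.fullmatch("manifest-datafeed_pot_1.20160905140015.csv")
parsed = datetime.strptime(match.group(2), "%Y%m%d%H%M%S")
print(parsed)  # 2016-09-05 14:00:15
```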
    

File Structure

The manifest file is a text file delimited with vertical bars ( | ).

The manifest file can have the following columns:

Each entry below gives the column name, whether the column is mandatory, and its description.

  • file_url (mandatory: yes): The path to the source data file.
    Examples:
    • Source files on S3: s3://bucket/folder/account.1515628800.csv
    • Source files on SFTP, WebDAV, Google Cloud Storage, or One Drive (do not include the root directory in the path): /folder/account.1515628800.csv
  • timestamp (mandatory: yes): The UNIX timestamp representing the time when the source data file was uploaded to storage.
  • feed (mandatory: yes): The name of the entity (table) to download data from. The name must match the name of the entity in the feed file (see Feed File).
  • feed_version (mandatory: no): The version in the feed file that the source data file is connected to. The version must match the version of the entity in the feed file (see Feed File). NOTE: You can have only one version of the same entity in one manifest file.
  • num_rows (mandatory: no): The number of rows in the source data file. Use 'num_rows' to verify the integrity of an upload batch. If you want CSV Downloader to skip the verification, enter 'unknown' in this column.
  • md5 (mandatory: no): The MD5 checksum of the source data file. If you want CSV Downloader to skip the MD5 check, enter 'unknown' in this column.
  • export_type (mandatory: no): Load mode used for loading the source data file to the database.
    • If not set or set to 'inc', incremental load is used.
    • If set to 'full', full load is used.
    • If set to 'delete', CSV Downloader deletes the data from ADS based on the primary key that is set by the 'hub' parameter in the configuration file for CSV Downloader.
      • The source CSV file must contain the header with the table columns that generate the primary key. The source file may or may not contain other columns (the feed file is ignored in this case, and only the columns generating the primary key are verified).
      • If the primary key contains more than one column (that is, the 'hub' parameter contains more than one column name), the column names in the header must be specified in the same order as they are specified in the 'hub' parameter.
      • The names of the columns generating the primary key are case-sensitive. The column names in the header must be specified in the same case as they are specified in the 'hub' parameter.
  • target_predicate (mandatory: no): The field used for partial full load (see 'Set partial full load mode for the entities' in CSV Downloader). NOTE: You can also use this field as a reference field for defining the partition of the source tables (see the 'drop_source_partition' parameter in ADS Integrator).
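
As an illustration of how one manifest line could be assembled from these columns, the following sketch computes the MD5 checksum of a local copy of a source file and joins the values with vertical bars. The function 'manifest_row' and its parameters are hypothetical. Two assumptions apply: the checksum is taken over the file bytes exactly as they are uploaded, and the row count is supplied by the export job, because this section does not state whether the header row is counted.

```python
import hashlib
from typing import Optional


def manifest_row(file_url: str, local_path: str, timestamp: int, feed: str,
                 feed_version: str, export_type: str = "inc",
                 num_rows: Optional[int] = None) -> str:
    """Build one pipe-delimited manifest line for a local copy of a source file."""
    digest = hashlib.md5()
    with open(local_path, "rb") as handle:
        # Hash the file in chunks so large exports do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    # 'unknown' asks CSV Downloader to skip the row-count verification.
    rows = "unknown" if num_rows is None else str(num_rows)
    return "|".join([file_url, str(timestamp), feed, feed_version,
                     rows, digest.hexdigest(), export_type])
```

The column order matches the header line file_url|timestamp|feed|feed_version|num_rows|md5|export_type.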

File Example

file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
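
A manifest like the one above can be sanity-checked before upload. The sketch below parses the pipe-delimited text with Python's csv module and checks two rules stated in this section: the mandatory columns are filled in, and each entity appears with only one version. 'read_manifest' is a hypothetical helper and does not reproduce CSV Downloader's own validation.

```python
import csv
import io

MANIFEST = """\
file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
"""


def read_manifest(text):
    """Parse manifest text and check the rules stated in this section."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="|"))
    versions = {}
    for row in rows:
        for column in ("file_url", "timestamp", "feed"):
            if not row.get(column):
                raise ValueError(f"missing mandatory column: {column}")
        version = row.get("feed_version")
        # Only one version of the same entity is allowed in one manifest.
        if versions.setdefault(row["feed"], version) != version:
            raise ValueError(f"multiple versions for feed {row['feed']}")
    return rows


rows = read_manifest(MANIFEST)
print(len(rows), sorted({row["feed"] for row in rows}))  # 3 ['Account', 'Facts']
```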