Manifest File
A manifest file describes the source data files to download and confirms the completeness and integrity of an upload batch. It lists which files belong to the batch, which entity each file corresponds to, and the timestamp of the export (this is important for incremental load). It can also contain hashes and the number of rows to verify file integrity.
CSV Downloader processes only the source files that are referenced in the manifest file. A source data file that is not mentioned in the manifest file will not be processed. This lets you control what gets loaded by listing only the source files that you want processed. For example, you can export specific entities monthly while other entities are exported daily.
When CSV Downloader reads the manifest file, it processes either all or none of the source files. It cannot process some source files from the manifest file and not process the rest.
Create a new manifest file for each load. Upload it as the last file after all the source files from a specific batch have been uploaded.
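The upload order described above can be sketched as follows. This is a minimal illustration and not part of CSV Downloader: 'upload_batch' and the 'upload' callable are hypothetical names that stand in for whatever client you use to copy files to your storage.

```python
from typing import Callable, Iterable, List


def upload_batch(upload: Callable[[str], None],
                 data_files: Iterable[str],
                 manifest_file: str) -> List[str]:
    """Upload all source data files first and the manifest last.

    `upload` is a hypothetical callable that copies one file to storage.
    Returns the paths in the order in which they were uploaded.
    """
    uploaded = []
    for path in data_files:
        upload(path)
        uploaded.append(path)
    # The manifest goes last so that it never references a file
    # that has not finished uploading yet.
    upload(manifest_file)
    uploaded.append(manifest_file)
    return uploaded
```

If any data file fails to upload, the exception propagates before the manifest is written, so an incomplete batch is never announced.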
Though you can have multiple manifest files, we recommend that you have one manifest file per load batch. Data files within one manifest file are processed together. Having multiple manifest files may slow down performance.
Provide Your Own Manifest Files
The optional ‘generate_manifests’ parameter in the configuration file (see ‘Set optional parameters for your manifest file’ in CSV Downloader) specifies whether to generate a manifest file or to use the manifest file that you provide. By default, the ‘generate_manifests’ parameter is not set and defaults to ‘false’, which means that you have to provide your own manifest files.
When creating your own manifest files, make sure that their names and structure meet the requirements that are described in this section.
If you want CSV Downloader to generate the manifest files for you, set the ‘generate_manifests’ parameter to ‘true’ and see ‘Generate a manifest file’ in CSV Downloader.
File Name Format
The name of a manifest file defines the order in which the manifest files will be processed.
Use the Default Format
The default format of the manifest file name is the following:
manifest_{time(%s)}.csv
When resolved, the name of a manifest file may look like the following:
manifest_1468493700.csv
If you want to use the default name format for your manifest, you can use it right away without setting any additional parameters.
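If you produce manifest files with a script, a name in the default format can be built from the current UNIX time. The following is a minimal sketch; 'default_manifest_name' is a hypothetical helper, not a CSV Downloader function.

```python
import time
from typing import Optional


def default_manifest_name(epoch_seconds: Optional[int] = None) -> str:
    """Build a name in the default manifest_{time(%s)}.csv format.

    Uses the current UNIX time when no timestamp is given.
    """
    if epoch_seconds is None:
        epoch_seconds = int(time.time())
    return f"manifest_{int(epoch_seconds)}.csv"


print(default_manifest_name(1468493700))  # manifest_1468493700.csv
```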
Customize the Format
If you want to customize the file name format, define your format and set the ‘manifest’ parameter in the configuration file to it (see ‘Set optional parameters for your manifest file’ in CSV Downloader).
You can use the following keywords in the file name:
- sequence: Include ‘sequence’ in the file names to ensure that they are loaded in the correct order. If you use ‘sequence’, every manifest file name must contain a sequence number, and the numbers must be consecutive (1, 2, 3, and so on; CSV Downloader always expects the last sequence number +1) for a given manifest; otherwise, you will receive an error.
- regex: If a file name has a changing part, use ‘regex’ to be able to process the files. For more information, see https://ruby-doc.org/core-2.1.1/Regexp.html.
- time: If ‘sequence’ is not present, manifest files are sorted by time. ‘time’ can be set as a timestamp ({time(%s)}) or in any kind of YYYYMMDD format (for example, {time(%Y-%m-%d-%H-%M-%S)}). For more information about tags, see http://ruby-doc.org/core-2.2.0/Time.html#method-i-strftime.
Examples:
To get the following manifest file name:
manifest_1.20180217104924.csv
the file name format may look like the following:
manifest_{sequence}.{time(%Y%m%d%H%M%S)}.csv
To get the following manifest file name:
manifest-datafeed_pot_1.20160905140015.csv
the file name definition may look like the following (notice how ‘\’ is escaped with another ‘\’):
{regex(manifest-datafeed_pot_\\d+)}.{time(%Y%m%d%H%M%S)}.csv
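The two example formats above can be reproduced and checked on the producer side before upload. The sketch below is an assumption-laden illustration: the Python patterns are a hand translation of the example formats, and 'sequenced_manifest_name' and 'find_sequence_gaps' are hypothetical helpers. It does not show how CSV Downloader itself resolves the keywords.

```python
import re
from datetime import datetime
from typing import List

# Assumed Python equivalents of the two example formats above.
SEQUENCED_NAME = re.compile(r"^manifest_(\d+)\.(\d{14})\.csv$")
REGEX_NAME = re.compile(r"^manifest-datafeed_pot_\d+\.(\d{14})\.csv$")


def sequenced_manifest_name(sequence: int, when: datetime) -> str:
    """Build a name for manifest_{sequence}.{time(%Y%m%d%H%M%S)}.csv."""
    return f"manifest_{sequence}.{when.strftime('%Y%m%d%H%M%S')}.csv"


def find_sequence_gaps(file_names: List[str], last_processed: int = 0) -> List[int]:
    """Return the sequence numbers missing between last_processed + 1
    and the highest sequence number found in file_names."""
    present = set()
    for name in file_names:
        match = SEQUENCED_NAME.match(name)
        if match is None:
            raise ValueError(f"name does not match the expected format: {name}")
        present.add(int(match.group(1)))
    if not present:
        return []
    expected = range(last_processed + 1, max(present) + 1)
    return [number for number in expected if number not in present]


name = sequenced_manifest_name(1, datetime(2018, 2, 17, 10, 49, 24))
print(name)  # manifest_1.20180217104924.csv
print(bool(REGEX_NAME.match("manifest-datafeed_pot_1.20160905140015.csv")))  # True
```

Running 'find_sequence_gaps' over the names that you are about to upload gives an early warning about a skipped sequence number, which would otherwise surface as an error during the load.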
File Structure
The manifest file is a text file delimited with vertical bars ( | ).
The manifest file can have the following columns:
Name | Mandatory? | Description |
---|---|---|
file_url | yes | The path to the source data file |
timestamp | yes | The UNIX timestamp representing the time when the source data file was uploaded to storage |
feed | yes | The name of the entity (table) to download data from. The name must match the name of the entity in the feed file (see Feed File). |
feed_version | no | The version in the feed file that the source data file is connected to. The version must match the version of the entity in the feed file (see Feed File). NOTE: You can have only one version of the same entity in one manifest file. |
num_rows | no | The number of rows in the source data file. Use 'num_rows' to verify the integrity of an upload batch. If you want CSV Downloader to skip the verification, enter 'unknown' in this column. |
md5 | no | The MD5 checksum of the source data file. If you want CSV Downloader to skip the MD5 check, enter 'unknown' in this column. |
export_type | no | The load mode used for loading the source data file to the database |
target_predicate | no | The field used for partial full load (see 'Set partial full load mode for the entities' in CSV Downloader). NOTE: You can also use this field as a reference field for defining the partition of the source tables (see the 'drop_source_partition' parameter in ADS Integrator). |
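When you generate manifest rows with a script, the columns above map naturally onto a small helper. The following is a minimal sketch with hypothetical function names. It assumes that the MD5 checksum is computed over the bytes of the file exactly as it is stored (for example, the compressed .gz file); verify this against your setup. The column order follows the header used in the file example in this section.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks.

    Assumption: the checksum covers the file bytes as stored.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(file_url: str, timestamp: int, feed: str, feed_version: str,
                 export_type: str, num_rows="unknown", md5: str = "unknown") -> str:
    """Join one manifest line with vertical bars.

    'unknown' in num_rows or md5 asks CSV Downloader to skip that check.
    """
    fields = [file_url, str(timestamp), feed, feed_version,
              str(num_rows), md5, export_type]
    return "|".join(fields)


print(manifest_row("s3://bucket/folder/user.1515628800.txt.gz",
                   1515628800, "User", "1.0", "full"))
# s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|unknown|unknown|full
```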
File Example
file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
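Before uploading a manifest like the one above, you may want to check it locally. The sketch below reads the example with Python's csv module and checks the rule that one manifest file can contain only one version of the same entity. 'read_manifest' and 'conflicting_versions' are hypothetical helpers and do not reproduce CSV Downloader's own validation.

```python
import csv
import io
from typing import Dict, List, Set

MANIFEST_TEXT = """\
file_url|timestamp|feed|feed_version|num_rows|md5|export_type
s3://bucket/folder/account.1515628800.csv.gz|1515628800|Account|1.0|3|366513286293c4b369bc7fafca23ddde|inc
s3://bucket/folder/user.1515628800.txt.gz|1515628800|User|1.0|0|unknown|full
s3://bucket/folder/product.1515628800.txt.gz|1515628800|Product|1.0|0|unknown|full
s3://bucket/folder/facts.1.1515628800.txt.gz|1515628800|Facts|1.2|15444|5d0a290ca7fc8d4dc7dd9cdd0dd15f96|inc
s3://bucket/folder/facts.2.1515628800.txt.gz|1515628800|Facts|1.2|52755|ba63d9912e49fa4f4b2e0797d3fcfa41|inc
"""


def read_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a vertical-bar-delimited manifest into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def conflicting_versions(rows: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return the feeds that appear with more than one feed_version."""
    versions: Dict[str, Set[str]] = {}
    for row in rows:
        versions.setdefault(row["feed"], set()).add(row.get("feed_version", ""))
    return {feed: found for feed, found in versions.items() if len(found) > 1}


rows = read_manifest(MANIFEST_TEXT)
print(len(rows))                   # 5
print(conflicting_versions(rows))  # {}
```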