ADS Integrator
ADS Integrator is a component of the data preparation and distribution pipeline (see Data Preparation and Distribution Pipeline). ADS Integrator takes the data that has been downloaded by a downloader (see Downloaders) and integrates it into Agile Data Warehousing Service (ADS) (see Data Warehouse Reference).
How ADS Integrator Works
The following picture shows how data for one entity is integrated into ADS when the default settings are used:
When integrating data downloaded by a downloader, ADS Integrator relies on the information stored in the S3 files' metadata and the S3 metadata folder. If you want to change any parameter in a downloader's configuration, change it manually in the metadata files.
Data Tables
Source Table
The source tables are the entry point of the data integration process. COPY commands integrate the prepared data from the BDS into the source tables.
If you use JSON files, the data is first integrated from the BDS into the flex tables and then copied from the flex tables to the source tables.
The naming convention of the source tables is the following: src_(downloader_name)_(entity_name)_merge
In some cases, the source tables can be used directly. However, when ADS Integrator runs with the default settings, data should be read from the stage tables. ADS Integrator adds system fields to each row.
The source tables have the following system fields:
Name | Type | Description |
---|---|---|
_sys_hub_id | string | The MD5 hash of all key fields |
_sys_hash | string | The MD5 hash of all fields except the system fields |
_sys_captured_at | timestamp | The time when the record was captured |
_sys_is_deleted | Boolean | Specifies whether the record is marked as deleted |
_sys_filename | string | The name of the source file |
_sys_load_id | integer | The reference to the '_sys_load_id' system table |
Stage Table
After successful integration, the data is loaded to the stage tables. Then, these tables are ready for use and the ETL can be connected here.
The naming convention of the stage tables is the following: stg_(downloader_name)_(entity_name)_merge
ADS Integrator adds system fields to each row.
The stage tables have the following system fields:
Name | Type | Description |
---|---|---|
_sys_id | string | The MD5 hash of all key fields. During integration, the value is copied from the '_sys_hub_id' field of the source table. |
_sys_hash | string | The MD5 hash of all fields except the system fields. During the integration, the value is copied from the '_sys_hash' field of the source table. |
_sys_valid_from | timestamp | During the integration, the value for this field is copied from the '_sys_captured_at' field of the source table. If multiple batches are integrated and each of them contains the same key, this value will contain the date when the key last occurred in the integrated data. |
_sys_load_id | integer | During the integration, the value for this field is copied from the '_sys_load_id' field of the source table. |
_sys_is_deleted | string | During the integration, the value for this field is filled in when the record is marked as 'is_deleted' (deleted records). The deleted records are used when you want to remove some of the already loaded values from the stage tables. See the 'remove_deleted_records' parameter in Add optional parameters. |
Flex Table
If your source files are JSON files (see “Use JSON files as source files” in CSV Downloader), ADS Integrator first integrates the JSON source files from the BDS into flex tables before copying them to the source tables. When integrating the JSON files into the flex tables, ADS Integrator uses the Vertica fjsonparser parser to process the JSON files, either with or without flattening the nested data. Then, ADS Integrator processes the data from the flex tables according to the XPath notations specified in the feed file (see Feed File) and integrates the data into the source tables.
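For illustration, consider a nested JSON source record (the record and its field names are made up for this example):

{
  "id": 123,
  "customer": {
    "name": "Jane Doe",
    "address": { "city": "Prague" }
  }
}

ADS Integrator loads the whole record into the '__raw__' field of the flex table; the XPath notations in the feed file then determine which values (for example, the nested city) are extracted into the typed columns of the source table.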
The naming convention of the flex tables is the following: src_(downloader_name)_(entity_name)_flex
The flex tables have the following system fields:
Name | Type | Description |
---|---|---|
__raw__ | string | The processed data from the source JSON files |
Rejected Table
If the ‘enable_rejected_table’ parameter is set to ‘true’, a special table is created for rejected records for each entity.
The naming convention of the rejected tables is the following: rejected_(downloader_name)_(entity_name)
In the rejected tables, you can use only SELECT queries. DML and DDL queries are not supported.
The maximum number of rows in a rejected table is 100,000. When this limit is reached, all the data is deleted from the table within the next run of ADS Integrator. To change the maximum allowed number of rows in rejected tables, set the 'rejected_table_cleanup_rows_limit' parameter.
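A minimal sketch of where to set this parameter (its placement under 'ads_storage' -> 'options', next to the related rejected-records parameters described in Specify how to proceed in case of errors and rejected records, is assumed here; the limit value is illustrative):

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "abort_on_error": false,
    "enable_rejected_table": true,
    "rejected_table_cleanup_rows_limit": 200000 // illustrative value; placement assumed
  }
}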
The rejected table contains the following fields:
Name | Type | Description |
---|---|---|
node_name | varchar | The name of the Vertica node where the source file was located |
file_name | varchar | The name of the file being loaded |
session_id | varchar | The session ID number in which the COPY statement occurred |
transaction_id | integer | The identifier for the transaction within the session, if any; otherwise, NULL |
statement_id | integer | The unique identification number of the statement within the transaction that included the rejected data |
batch_number | integer | (Internal use) The number of the batch (chunk) that the data comes from |
row_number | integer | The rejected row number from the source file |
rejected_data | long varchar | The data that was not loaded |
rejected_data_orig_length | integer | The length of the rejected data |
rejected_reason | varchar | The error that caused the rejected row. This column returns the same message that exists in a load exceptions file when you do not save to a table. |
System Tables
Batch Table
The batch table is used only when ADS Integrator is running in batch mode.
The batch table contains information about the processed batches. If a batch is processed correctly, the batch table contains the STARTED event and the FINISHED event.
The batch table contains the following fields:
Name | Type | Description |
---|---|---|
id | integer | The unique numeric ID generated from the sequence |
sequence | integer | The sequence number, if it exists |
identification | varchar | The identification of the batch. In most cases, the identification is the same as the ID of the downloader that downloaded this batch. |
load_last_batch | timestamptz | The time when the batch was created on the BDS |
filename | varchar | The name of the file from which the batch was created |
load_event | varchar | The type of the event. Possible values: STARTED, FINISHED |
load_time | timestamptz | The time when the event was created (by the now() function during the inserting) |
is_deleted | Boolean | Specifies whether the record is marked as deleted |
Load Table
The load table is used in both batch mode and entity mode.
The load table contains information about the progress of the load for each entity.
The load table contains the following fields:
Name | Type | Description |
---|---|---|
load_id | integer | The unique numeric ID generated from the sequence |
load_src | varchar | The type of the downloader that downloaded the entity |
load_entity | varchar | The name of the entity |
load_last_batch | timestamptz | The time when the file for the entity was saved on the BDS. If multiple files are integrated in one run, this field contains the last value. |
load_event | varchar | The type of the event |
load_time | timestamptz | The time when the record was inserted into the database |
is_deleted | Boolean | Specifies whether the record is marked as deleted |
batch_id | integer | (Only for batch mode) The reference to the batch table |
File Table
The file table is used in both batch mode and entity mode.
The file table contains information about the integrated files and additional details on each of them.
The file table contains the following fields:
Name | Type | Description |
---|---|---|
id | integer | The unique numeric ID generated from the sequence |
load_id | integer | The reference to the load table |
load_at | timestamptz | The time when the file was loaded to the database |
entity_name | varchar | The name of the entity connected to the file |
filename | varchar | The original name of the file |
bds_filename | varchar | The name of the file on the BDS |
file_size | integer | The size of the file |
file_hash | varchar | The hash of the file |
Configuration File
Creating and setting up the integrator’s configuration file is the third phase of building your data pipeline (see Build Your Data Pipeline).
For more information about configuring a brick, see Configure a Brick.
Minimum Layout of the Configuration File
The following JSON samples show the minimum layout of the configuration file that ADS Integrator can use. The samples differ in the load mode that ADS Integrator uses:
- Batch mode
- Entity mode
Choose the mode that is compatible with the downloader that you are using:
- Choose batch mode when you want to process the data downloaded by CSV Downloader.
- Choose entity mode when you want to process the data downloaded by any downloader except for CSV Downloader.
Copy the appropriate sample and replace the placeholders with your values.
You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).
Batch Mode
Batch mode is supported by CSV Downloader only.
In batch mode, the data for all entities is put to the source tables first. Then, if all data is correctly copied to the source tables, the process continues. If an error occurs while integrating any of the entities in the batch, the whole process stops, and no data is integrated into the stage tables.
By default, ADS Integrator processes only one batch, but you can change this (see Specify the number of batches to process in one run).
{
"integrators": {
"ads_integrator_id": {
"type": "ads_storage",
"batches": ["csv_downloader_1","csv_downloader_2"]
}
},
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {}
}
}
Entity Mode
Entity mode is supported by all downloaders except for CSV Downloader.
In entity mode, entities are processed one by one. The data from one entity is loaded to the source tables and then integrated into the stage tables.
ADS Integrator processes all the files from one entity that are present in the BDS.
{
"integrators": {
"ads_integrator_id": {
"type": "ads_storage",
"entities": ["entity_1","entity_2","entity_3"]
}
},
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {}
}
}
The placeholders to replace with your values:
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
integrators -> ads_integrator_id | string | yes | n/a | The ID of the ADS Integrator instance. Example: ads_integrator_1 |
integrators -> ads_integrator_id -> batches | array | see 'Description' | n/a | (Batch mode only) The IDs of the CSV Downloader instances that this ADS Integrator instance processes. Example: csv_downloader_1 |
integrators -> ads_integrator_id -> entities | array | see 'Description' | n/a | (Entity mode only) The names of the entities that this ADS Integrator instance processes. Example: entity_1 |
ads_storage -> instance_id | string | yes | n/a | The ID of the ADS instance into which the data is uploaded |
ads_storage -> username | string | yes | n/a | The access username to the ADS instance into which the data is uploaded |
Customize the Configuration File
You can change default values of the parameters and set up additional parameters.
When you set up an optional parameter, it is saved in the S3 file’s metadata (not in the metadata folder, but in an attribute of the file itself). To change the parameter value later, edit the metadata files manually.
Set the data integration strategy
The integration strategy defines how data is integrated into the stage tables on ADS.
- merge: This strategy uses one key for all operations (uniqueness, merging, and so on). This key is computed during the COPY command as an MD5 of all the fields that are marked as primary key in the ‘hub’ setting of the entity (see the system field ‘_sys_hub_id’ in the source table).
- merge_with_pk: This strategy uses all primary keys for all operations (uniqueness, merging, and so on).
The final output is the same for either strategy. The only difference is how the final projection created on stage tables is optimized.
For the merge integration strategy, the projection is optimized for one ID field.
For the merge_with_pk strategy, the projection is optimized for all primary keys in the order they are set up in the entity’s settings (the ‘hub’ parameter in the section of the configuration file that describes the downloader you are using):
"entities": { "table_1_name": { "global": { "custom": { "hub": ["column_1_name","column_2_name","column_3_name"] } } } }
If you want to set a different data integration strategy for a specific entity, use the advanced parameter ‘default_strategy_override’ (see Adjust advanced settings).
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"default_strategy": "merge|merge_with_pk" // optional
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
default_strategy | string | no | merge | The data integration strategy used to integrate data into the stage tables on ADS. Possible values: merge, merge_with_pk |
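For example, a sketch that keeps merge as the default strategy and overrides it for a single entity via the advanced 'default_strategy_override' parameter (see Adjust advanced settings; the entity name is illustrative):

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "default_strategy": "merge",
    "default_strategy_override": {"entity_1": "merge_with_pk"} // entity name is illustrative
  }
}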
Add multiple ADS Integrators
You can use multiple instances of ADS Integrator to process your downloaders. Update the 'integrators' section to contain multiple subsections, one per ADS Integrator instance. The key of each subsection must be the ID of the corresponding ADS Integrator instance.
Depending on what downloaders you are using, specify the load mode for each ADS Integrator instance: batch mode or entity mode.
"integrators": {
"ads_integrator_1_id": {
"type": "ads_storage",
"batches": ["csv_downloader_1","csv_downloader_2"]
},
"ads_integrator_2_id": {
"type": "ads_storage",
"batches": ["csv_downloader_3"]
},
"ads_integrator_3_id": {
"type": "ads_storage",
"entities": ["entity_1","entity_2","entity_3"]
},
"ads_integrator_4_id": {
"type": "ads_storage",
"entities": ["entity_4","entity_5","entity_6"]
}
}
Use a different server
By default, ADS Integrator uses secure.gooddata.com to access the ADS instance. You can specify a different URL (for example, if you are a white-labeled customer).
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"server": "your.server.url", // optional
"options": {}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
server | string | no | secure.gooddata.com | The URL used to access the ADS instance |
Use COPY FROM S3 instead of COPY FROM LOCAL
If your source files are stored in an S3 location (see 'Use an S3 location' in CSV Downloader), the files are first downloaded from the S3 location and then loaded using the COPY FROM LOCAL command. Using COPY FROM S3 loads the files directly from the S3 location, which may be faster.
- If you use your own S3 location and not a GoodData-managed one, contact GoodData Support so that they enable data load from your S3 location to your ADS instance. In the request, include the ID of your ADS instance (see the ‘instance_id’ parameter in Minimum Layout of the Configuration File) and the name of the S3 bucket.
- If you decide to use COPY FROM S3 and at the same time choose not to enable rejected tables (see the ‘enable_rejected_table’ parameter), you will not be able to review rejected records because they will be skipped and not stored anywhere.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"copy_from_s3": true|false
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
copy_from_s3 | Boolean | see 'Description' | false | Specifies which COPY command is used to load the data from an S3 location. If set to 'true', the COPY FROM S3 command loads the files directly from the S3 location; if set to 'false', the files are first downloaded and then loaded using the COPY FROM LOCAL command. NOTE: If you use Parquet source files (see 'Use Parquet files as source files' in CSV Downloader), set this parameter to 'true'. |
Specify how to proceed in case of errors and rejected records
By default, if an error occurs and some data cannot be integrated into ADS, ADS Integrator fails. To make ADS Integrator ignore errors, set the ‘abort_on_error’ parameter to ‘false’. This disables the ABORT ON ERROR option in each COPY command.
To additionally specify how to treat the records that have not been parsed during the process and have been rejected, use the optional ‘enable_rejected_table’ parameter.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"abort_on_error": true|false, // optional
"enable_rejected_table": true|false // optional
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
abort_on_error | Boolean | no | true | Specifies whether the ABORT ON ERROR option in each COPY command is enabled. Set this parameter to 'false' to make ADS Integrator ignore errors. |
enable_rejected_table | Boolean | no | false | (Only when 'abort_on_error' is set to 'false') Specifies how ADS Integrator treats the rejected records. If set to 'true', the rejected records are stored in the rejected tables (see Rejected Table); if set to 'false', they are skipped and not stored anywhere. NOTE: You can also set the 'enable_rejected_table' parameter at the entity level. To do so, add it to the 'custom' section of the entity (see the sketch after this table). |
rejected_table_cleanup_rows_limit | integer | no | 100000 | The maximum number of rows allowed in a rejected table. When the maximum number is reached, all the data is deleted from the table within the next run of ADS Integrator. IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Setting it to a greater value may result in consuming too much space on your ADS instance. Do not change the default unless you are absolutely sure of the results that you want to achieve by changing it. |
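As the notes in this table mention, some parameters (for example, 'enable_rejected_table') can also be set at the entity level. A minimal sketch of an entity's 'custom' section (the entity name is illustrative; the structure mirrors the 'hub' example in Set the data integration strategy):

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "enable_rejected_table": true // entity-level override; entity name is illustrative
      }
    }
  }
}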
To be notified of the ignored records, set up a notification triggered by a custom event (see Create a Notification Rule for a Data Loading Process). Set error-in-data-processing as the custom event that triggers the notification, and use the {$params.list_of_files} variable in the message body to include the list of ignored files in the notification email.
Add optional parameters
Depending on your business and technical requirements, you can set one or more optional parameters affecting the data integration process.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"disable_file_level_logging": true|false, // optional
"options": {
"integration_group_size": number_of_files, // optional
"remove_deleted_records": true|false, // optional
"group_by_version": true|false // optional
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
disable_file_level_logging | Boolean | no | false | Specifies whether ADS Integrator logs information to the '_sys_file' table. Set this parameter to 'true' if you want to disable the '_sys_filename' field (see Source Table for more information about the '_sys_filename' field; see the 'ignored_system_fields' parameter in Adjust advanced settings for more information about disabling system fields). |
integration_group_size | integer | no | 1 | The number of files to be integrated in one COPY command. NOTE: Keep the total size of the files in one integration group below 1 GB. Setting this parameter to a value greater than 1 stops loading new rows to the file table. Do not use this parameter if you have multiple versions of a data record in your source data (see Best Practices). |
remove_deleted_records | Boolean | no | false | Specifies how to handle the records in the stage table that have the '_sys_is_deleted' field set to 'true'. If set to 'true', such records are removed from the stage tables. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
group_by_version | Boolean | no | false | (For CSV Downloader only) Specifies how ADS Integrator processes different versions of the same data records in the source data (see 'version' in Feed File) in one run |
Configure additional settings in the source table and stage table
To extend the source table and stage table with additional settings, configure computed fields. For more details, see Computed Fields.
Add system fields to the load table or file table
To add the system fields to the load table or file table, configure system fields. These fields can be populated only from the runtime metadata, which is filled in by downloaders. To find out what metadata a downloader creates, see the corresponding downloader's article in Downloaders.
The ‘system_table_fields’ section contains arrays of parameters for the load table and file table.
"system_table_fields": {
"file_table": [
{
"name": "custom_parameter",
"type": "parameter_data_type",
"function": "function_name"
}
],
"load_table": [
{
"name": "custom_parameter",
"type": "parameter_data_type",
"function": "function_name"
}
]
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
name | string | yes | n/a | The name of the custom parameter |
type | string | yes | n/a | The data type of the custom parameter (for example, 'string-255') |
function | string | yes | n/a | The function that fills in the system field with the runtime metadata parameter. Format: runtime_metadata|metadata_parameter_name. Example: runtime_metadata|load_param |
Example:
"system_table_fields": {
"file_table": [
{
"name": "load_param",
"type": "string-255",
"function": "runtime_metadata|load_param"
}
]
}
Specify the number of batches to process in one run
When working with batch mode, you can specify how many batches ADS Integrator should process in one run.
Example: entity data in all batches is in incremental load mode. You have 10 batches to process, and the number of batches in one run is set to 5.
The 10 batches will be processed in the next two runs of ADS Integrator.
ADS Integrator downloads all the data for the number of batches that you have specified and checks whether all entities have only one version of metadata.
- If yes, ADS Integrator will process all data in one run: copy all the data to the source table and then use MERGE to put it to the stage table.
- If not, ADS Integrator will split the data into multiple parts and process them independently. The order of the batches will be preserved.
Example: some entity data is in full load mode, the rest is in incremental load mode. You have 10 batches to process, and the number of batches in one run is set to 5. The entity data in Batch 5 and Batch 8 is marked as 'full'.
In the first run, ADS Integrator takes the batches from 1 through 5. Batch 5 is the one with full load mode, so ADS Integrator ignores all incremental batches before this batch and only processes Batch 5.
In the second run, ADS Integrator ignores Batch 6 and Batch 7 (incremental mode), then processes Batch 8 (full mode), and then processes incremental batches: Batch 9 and Batch 10.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"number_of_batches_in_one_run": number_of_batches // optional
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
number_of_batches_in_one_run | integer | no | 100 | The number of batches that ADS Integrator should process in one run |
Use link files
When you are using CSV Downloader and chose to use link files (see 'Use link files' in CSV Downloader), set the ADS Integrator instance that processes this instance of CSV Downloader to use the link files.
If your source files are PGP-encrypted, see Enable processing of PGP-encrypted source files when using link files.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"remote": {
"type": "s3",
"access_key": "access_key",
"bucket": "bucket_name"
}
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
access_key | string | yes | n/a | The access key to the S3 bucket where the source data is stored |
bucket | string | yes | n/a | The name of the S3 bucket where the source data is stored |
When later scheduling this ADS Integrator instance in the Data Integration Console, provide the ‘ads_integrator|options|remote|secret_key’ parameter in the schedule (see General Parameters).
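Following the nesting convention for schedule parameters (see General Parameters), the 'ads_integrator|options|remote|secret_key' parameter is resolved as the following:

"ads_integrator": {
  "options": {
    "remote": {
      "secret_key": ""
    }
  }
}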
Enable processing of PGP-encrypted source files when using link files
When you are using link files in CSV Downloader (see 'Use link files' in CSV Downloader) and your source files are PGP-encrypted, enable the ADS Integrator instance that processes this instance of CSV Downloader to process the encrypted files.
Steps:
1. Set ADS Integrator that processes this instance of CSV Downloader to use the link files (see Use link files):

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "remote": {
      "type": "s3",
      "access_key": "access_key",
      "bucket": "bucket_name"
    }
  }
}

2. Add the 'encryption' parameter under 'ads_storage' -> 'options', and give it a unique name (for example, 'encryption_1'):

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "remote": {
      "type": "s3",
      "access_key": "access_key",
      "bucket": "bucket_name"
    },
    "encryption": "encryption_1"
  }
}

3. Add a new section to the configuration file and name it with the same unique name that you used for the 'encryption' parameter. In this example, it is 'encryption_1'. This section contains one parameter, 'type', set to 'pgp':

"encryption_1": {
  "type": "pgp"
}
When later scheduling this ADS Integrator instance in the Data Integration Console, provide the ‘ads_integrator|options|remote|secret_key’ parameter (see General Parameters) and the parameters that are related to the decryption of the PGP-encrypted source files (see Encryption-Specific Parameters).
Adjust advanced settings
ADS Integrator allows you to adjust various advanced settings, which, for example, can help improve performance.
Change these settings only if you are confident in executing the task or have no other options. Adjusting the advanced settings in a wrong way may generate unexpected side effects.
Proceed with caution.
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {
"resource_pool": "resource_pull", // optional
"pool_timeout": number_of_seconds, // optional
"integrate_without_sys_hash": true|false, // optional
"keep_values_when_deleted": true|false, // optional
"default_strategy_override": {"entity_1": "strategy_name","entity_2": "strategy_name"}, // optional
"max_connections": number_of_connections, // optional
"connection_refresh_time": number_of_minutes, // optional
"multi_load": true|false, // optional
"use_direct_hint_in_copy": true|false, // optional
"truncate_source_tables": true|false, // optional
"disable_statistics_stg": true|false, // optional
"disable_sys_load_id_logging": true|false, // optional
"disable_sys_id": true|false, // optional
"use_uniq_table": true|false, // optional
"insert_to_stage": true|false, // optional
"skip_projection_check": true|false, // optional
"segmentation": segmented|unsegmented, // optional
"partition_by": "field_name", // optional
"drop_source_partition": true|false, // optional
"source_projection": { <custom_segmentation> }, // optional
"ignored_system_fields": "["sys_field_1","sys_field_2"]", // optional
"join_files": true|false, // optional
"finish_in_source": true|false, // optional
"recreate_source_tables": true|false, // optional
"number_of_parallel_queries": number_of_copy_commands, // optional
"number_of_parallel_entities_integrations": number_of_entities_to_source, // optional
"number_of_parallel_integrations": number_of_entities_to_stage // optional
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
resource_pool | string | no | n/a | The resource pool to connect to the database |
pool_timeout | integer | no | 240 | The maximum timeout (in seconds) specifying how long ADS Integrator waits for a new connection to be created |
integrate_without_sys_hash | Boolean | no | false | Specifies whether the '_sys_hash' field is used. Set this parameter to 'true' if you want to disable the '_sys_hash' field (see Source Table and Stage Table for more information about the '_sys_hash' field; see the 'ignored_system_fields' parameter in this table for more information about disabling system fields). WARNING: Suppressing the '_sys_hash' field may result in reduced integration performance due to potentially costly hash joins. Proceed with caution. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
keep_values_when_deleted | Boolean | no | false | (Full loads only) Specifies how to handle the latest values in the records that are marked to be deleted from the ADS instance |
default_strategy_override | JSON | no | {} (empty hash) | Sets a different data integration strategy for a specific entity (a different value than the 'default_strategy' parameter sets; see Set the data integration strategy). Example: {"entity_1": "strategy_name","entity_2": "strategy_name"} |
max_connections | integer | no | 50 | The maximum number of open connections to ADS |
connection_refresh_time | integer | no | 15 | The period (in minutes) after which the connection gets refreshed |
multi_load | Boolean | no | false | Specifies whether multi-load mode is enabled (that is, one entity can be loaded by multiple instances of ADS Integrator). By default, one entity can be loaded by only one instance of ADS Integrator. For performance reasons, you may want to allow multiple instances of ADS Integrator to load the same entity. For example, if you have an entity that is downloaded by 10 downloaders, you may want to process this entity with two instances of ADS Integrator. |
use_direct_hint_in_copy | Boolean | no | true | Specifies whether the DIRECT hint is allowed in the COPY command. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
truncate_source_tables | Boolean | no | true | Specifies whether to truncate the source table and recreate the used sequences. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
disable_statistics_stg | Boolean | no | false | Specifies whether the ANALYZE_STATISTICS function is forbidden for use on the stage tables. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
disable_sys_load_id_logging | Boolean | no | false | (For CSV Downloader only) Specifies whether logging to the '_sys_load_id' system table is enabled |
disable_sys_id | Boolean | no | false | Specifies whether the '_sys_id' field (see Stage Table) and the '_sys_hub_id' field (see Source Table) are used |
use_uniq_table | Boolean | no | true | Specifies whether input data should be loaded to the uniq table. NOTE: Use this parameter only when you are sure that the input data does not contain duplicates. |
insert_to_stage | Boolean | no | false | Specifies whether the MERGE command is used when the data is inserted into the stage table |
skip_projection_check | Boolean | no | false | Specifies whether to check for the correct projection. NOTE: Set this parameter to 'true' if you want to set custom segmentation and the order in projections (see the 'source_projection' parameter in this table). |
source_projection | JSON | no | n/a | Sets custom segmentation and the column order in projections. NOTE: To use this parameter, set 'skip_projection_check' to 'true' (see the 'skip_projection_check' parameter in this table). |
segmentation | string | no | n/a | The expression to segment a table by. To prevent segmentation, set this parameter to 'unsegmented'. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
partition_by | string | no | n/a | The name of the field to set the partition to. NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
drop_source_partition | Boolean | no | false | (Only for CSV Downloader) Specifies whether to drop the partition in the source tables (see Source Table). NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity. |
ignored_system_fields | array | no | n/a | The system fields added by ADS Integrator to each table that you want to disable. NOTE: To be able to use this parameter, set 'finish_in_source' to 'true' (see the 'finish_in_source' parameter in this table). Additionally, see the 'disable_file_level_logging' parameter (Add optional parameters) for disabling the '_sys_filename' field, the 'integrate_without_sys_hash' parameter for disabling the '_sys_hash' field, and the 'disable_sys_id' parameter for disabling the '_sys_id' and '_sys_hub_id' fields. |
join_files | Boolean | no | false | Specifies whether to join files into one bigger file before sending them to ADS. NOTE: The files must not have headers. Using this parameter is not recommended for files greater than 100 MB. |
finish_in_source | Boolean | no | false | Specifies whether to end data integration at the source table phase. Set this parameter to 'true' if you want to disable some system fields that ADS Integrator adds to each table (see the 'ignored_system_fields' parameter in this table). |
recreate_source_tables | Boolean | no | false | Specifies how to handle the source tables |
number_of_parallel_queries | integer | no | 8 | The number of COPY commands to run in parallel when integrating one entity (see How ADS Integrator Works). SQL operation to run: COPY FROM LOCAL. Example: If ADS Integrator processes 100 batches and each batch has 5 files for the entity, the total number of files that need to be processed is 500. If you set the 'number_of_parallel_queries' parameter to 10, ADS Integrator creates 10 slots for COPY commands (up to 10 COPY commands can run simultaneously). Therefore, in the ideal case, the 500 files are processed 10 times as fast as when this setting is set to 1. IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure of the results that you want to achieve by changing it. |
number_of_parallel_entities_integrations | integer | no | 1 | (Batch mode only) The number of entities to be integrated in parallel to the source table (see How ADS Integrator Works). The operations performed during integration are similar to those performed when you have the 'finish_in_source' parameter set to 'true' (see the 'finish_in_source' parameter in this table). Example: Suppose that you have 5 entities, each with multiple files to process, you set this parameter to 2, and the 'number_of_parallel_queries' parameter is set to 5. ADS Integrator creates 2 slots (based on 'number_of_parallel_entities_integrations'), and inside each of these 2 slots it creates another 5 slots for COPY commands. As a result, up to 10 COPY commands run simultaneously, 5 for each entity. IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure of the results that you want to achieve by changing it. |
number_of_parallel_integrations | integer | no | 1 | (Batch mode only) The number of entities to be integrated in parallel to the stage table (filling in the uniq table, the diff table, and the stage table; see How ADS Integrator Works). Example: If you have 10 entities and you set this parameter to 2, ADS Integrator creates 2 slots for the integration command. If all the entities have the same amount of data, the integration (without loading to the source table) runs twice as fast as when this setting is set to 1. IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure of the results that you want to achieve by changing it. |
Schedule Parameters
When scheduling ADS Integrator (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.
General Parameters
Some parameters must be entered as secure parameters (see Configure Schedule Parameters).
Some parameters contain the names of the nesting parameters, which specify where they belong in the configuration hierarchy. For example, a parameter entered as ads|options|password is resolved as the following:
"ads": {
"options": {
"password": ""
}
}
Enter such parameters as is (see Schedule Example).
Name | Type | Mandatory? | Secure? | Default | Description |
---|---|---|---|---|---|
ID | string | yes | no | n/a | The ID of the ADS Integrator instance being scheduled Example: ads_integrator_1 |
server_side_encryption | Boolean | no | no | false | Specifies whether ADS Integrator can work with an S3 location where server-side encryption is enabled |
ads_storage|password | string | yes | yes | n/a | The password for the user that is specified in the configuration file under ads_storage -> username. The user and the password are used to access the specified ADS instance to which the data will be uploaded. |
ads_integrator|options|remote|secret_key | string | see 'Description' column | yes | n/a | (Only when ADS Integrator uses link files) The secret key to the S3 bucket where the source data is stored |
Encryption-Specific Parameters
If you enable processing of PGP-encrypted source files when using link files, provide the encryption-specific parameters as secure schedule parameters (see Configure Schedule Parameters).
Some parameters contain the names of the nesting parameters, which specify where they belong in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:
"csv": {
"options": {
"password": ""
}
}
Enter such parameters as is (see Schedule Example).
Parameter Name | Type | Mandatory? | Secure? | Default | Description |
---|---|---|---|---|---|
encryption_name|private_key | string | yes | yes | n/a | The private key used for decrypting the PGP-encrypted files. If the private key itself is encrypted, also provide the passphrase to decrypt it (see the 'encryption_name|passphrase' parameter). NOTE: Although you are entering the private key as a secure parameter, its value will be visible right after you paste it into the parameter value field (the show value check-box is selected) because it is a multi-line string. To hide the value, un-select the show value check-box. When the schedule is saved, the value will not appear in clear-text form in any GUI or log entries. For more information, see Configure Schedule Parameters. |
encryption_name|passphrase | string | see 'Description' | yes | n/a | The passphrase used to decrypt the private key if it is encrypted. This parameter is mandatory only when the private key (see the 'encryption_name|private_key' parameter) is encrypted. Otherwise, do not use it. |
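Applying the same nesting convention, a parameter entered as encryption_1|private_key (assuming your encryption section is named 'encryption_1', as in Enable processing of PGP-encrypted source files when using link files) is resolved as the following:

"encryption_1": {
  "private_key": ""
}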
Schedule Example
The following is an example of how you can specify schedule parameters:
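For instance, for an ADS Integrator instance that uses link files, the schedule might contain the following parameters (the values are illustrative; the password and the secret key are entered as secure parameters):

- ID: ads_integrator_1
- ads_storage|password: (secure value)
- ads_integrator|options|remote|secret_key: (secure value)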
Best Practices
For better performance, we recommend that you apply the following best practices when using ADS Integrator:
- Keep the total size of the files in one integration group below 1 GB. An integration group is usually a part of the manifest file containing x files, where x is the value of the 'integration_group_size' parameter (see Add optional parameters). The default of this parameter is 1; therefore, unless you change it, the files are integrated one by one.
- When processing many smaller files, set the ‘integration_group_size’ parameter to a higher value to copy all the files in a single COPY command (see Add optional parameters). Note, however, that setting this parameter to a higher value will stop loading new rows to the file table.
- Do not use the ‘integration_group_size’ parameter if you have multiple versions of a data record in your source data.
- For better performance on big tables, set the ‘disable_statistics_stg’ parameter to ‘true’ (see Adjust advanced settings).
- Use the ‘skip_projection_check’ parameter (see Adjust advanced settings) only as a last resort or a temporary solution.
- Use the ‘integrate_without_sys_hash’ parameter (see Adjust advanced settings) carefully. Suppressing the ‘_sys_hash’ field may result in reduced integration performance due to potentially costly hash joins.
- Make sure that the source table and stage table have the same columns so that optimized MERGE is used.