ADS Integrator

ADS Integrator is a component of the data preparation and distribution pipeline (see Data Preparation and Distribution Pipeline). ADS Integrator takes the data that has been downloaded by a downloader (see Downloaders) and integrates it into Agile Data Warehousing Service (ADS) (see Data Warehouse Reference).

Contents:

How ADS Integrator Works

The following picture shows how data for one entity is integrated into ADS when the default settings are used:

When integrating data downloaded by a downloader, ADS Integrator relies on the information stored in the S3 files' metadata and the S3 metadata folder. If you want to change any parameter in a downloader's configuration, do so in the metadata files manually.

Data Tables

Source Table

The source tables are at the beginning of the data integration process. COPY commands integrate the prepared data from the BDS to the source tables.

If you use JSON files, the data first is integrated from the BDS to the flex tables and then copied from the flex tables to the source tables.

The naming convention of the source tables is the following:
src_(downloader_name)_(entity_name)_merge

In some cases, the source tables can be used directly. However, when ADS Integrator runs with the default settings, data should be read from the stage tables. ADS Integrator adds system fields to each row.

The source tables have the following system fields:

NameTypeDescription

_sys_hub_id

string

The MD5 hash of all key fields

_sys_hashstring

The MD5 hash of all fields except the system fields

_sys_captured_attimestamp
  • In entity mode, contains the time when the file was saved on the BDS
  • In batch mode, contains the time when the batch was saved on the BDS
_sys_is_deletedBoolean
  • Is set to 'true', the row should be processed as a deleted record.
  • Is set to 'false', the row is processed normally.
_sys_filenamestringThe name of the source file
_sys_load_idintegerThe reference to the '_sys_load_id' system table

Stage Table

After successful integration, the data is loaded to the stage tables. Then, these tables are ready for use and the ETL can be connected here.

The naming convention of the stage tables is the following:
stg_(downloader_name)_(entity_name)_merge

ADS Integrator adds system fields to each row.

The stage tables have the following system fields:

NameTypeDescription

_sys_id

string

The MD5 hash of all key fields. During integration, the value is copied from the '_sys_hub_id' field of the source table.

_sys_hashstring

The MD5 hash of all fields except the system fields. During the integration, the value is copied from the '_sys_hash' field of the source table.

_sys_valid_fromtimestampDuring the integration, the value for this field is copied from the '_sys_captured_at' field of the source table.
If multiple batches are integrated and each of them contains the same key, this value will contain the date when the key last occurred in the integrated data.
_sys_load_idBooleanDuring the integration, the value for this field is copied from the '_sys_load_id' field of the source table.
_sys_is_deletedstring

During the integration, the value for this field is filled when the record is marked as 'is_deleted' (deleted records).

The deleted records are used when you want to remove some of already loaded values from the stage tables. The following options are available:

  • Send the FULL export
    If the entity export is marked as FULL, ADS Integrator handles the deleted records itself. All the records (identified by key) that are present in the stage table and are not present in the full export are marked as deleted (the field '_sys_is_deleted' is set to 'true').
  • (Only for Salesforce Downloader) Send a file with the deleted records
    ADS Integrator can generate a file with the deleted records. This file contains only the keys that should be deleted for the corresponding stage table. All the records mentioned in the deleted record file are marked as deleted in the stage table (the field '_sys_is_deleted' is set to 'true').

See the 'remove_deleted_records' parameter in Add optional parameters.

Flex Table

If your source files are JSON files (see "Use JSON files as source files" in CSV Downloader), ADS Integrator first integrates the JSON source files from the BDS into flex tables before copying them to the source tables. When integrating the JSON files into the flex tables, ADS Integrator uses the Vertica fjsonparser parser to process the JSON files, either with or without flattening the nested data. Then, ADS Integrator processes the data from the flex tables according to the XPath notations specified in the feed file (see Feed File) and integrates the data into the source tables.

The naming convention of the flex tables is the following:
src_(downloader_name)_(entity_name)_flex

The flex tables have the following system fields:

NameTypeDescription

__raw__

string

The processed data from the source JSON files

Rejected Table

If the 'enable_rejected_table' parameter is set to 'true', a special table is created for rejected records for each entity.

The naming convention of the rejected tables is the following:
rejected_(downloader_name)_(entity_name)

In the rejected tables, you can use only SELECT queries. DML and DDL queries are not supported.

The maximum number of rows in a rejected table is to 100000. When this limit is reached, all the data is deleted from the table within the next run of ADS Integrator. To change the maximum allowed number of the rows in rejected tables, set the 'rejected_table_cleanup_rows_limit' parameter.

The rejected table contains the following fields:

NameTypeDescription
node_namevarcharThe name of the Vertica node where the source file was located
file_namevarcharThe name of the file being loaded
session_idvarcharThe session ID number in which the COPY statement occurred
transaction_idintegerThe identifier for the transaction within the session, if any; otherwise, NULL
statement_idintegerThe unique identification number of the statement within the transaction that included the rejected data
batch_numberinteger(Internal use) The number of the batch (chunk) that the data comes from
row_numberintegerThe rejected row number from the source file
rejected_datalong varcharThe data that was not loaded
rejected_data_orig_lengthintegerThe length of the rejected data
rejected_reasonvarchar

The error that caused the rejected row

This column returns the same message that exists in a load exceptions file when you do not save to a table.

System Tables

Batch Table

The batch table is used only when ADS Integrator is running in batch mode.

The batch table contains information about the processed batches. If a batch is processed correctly, the batch table contains the STARTED event and the FINISHED event.

The batch table contains the following fields:

NameTypeDescription

id

integer

The unique numeric ID generated from the sequence

sequenceinteger

The sequence number, if exists

identificationvarchar

The identification of the batch
In most cases, the identification is same as the ID of the downloader that was used for downloading this batch.

load_last_batchtimestamptzThe time when the batch was created on the BDS
filenamevarcharThe name of the file from which the batch was created
load_eventvarchar

The type of the event

Possible values:

  • STARTED
  • FINISHED
load_timetimestamptzThe time when the event was created (by the now() function during the inserting)
is_deletedBoolean
  • Is set to 'true', the row is ignored, and the batch will be processed again.
  • Is set to 'false', the row is processed normally.

Load Table

The load table is used in both batch mode and entity mode.

The load table contains information about the progress of the load for each entity.

The load table contains the following fields:

NameTypeDescription

load_id

integer

The unique numeric ID generated from the sequence

load_srcvarchar

The type of the downloader that downloaded the entity

load_entityvarcharThe name of the entity
load_last_batchtimestamptzThe time when the file for the entity was saved on the BDS. If multiple files are integrated in one run, this field contains the last value.
load_eventvarchar

The type of the event.

Possible values:

  • STARTED
  • FINISHED
load_timetimestamptzThe time when the record was inserted into the database
is_deletedBoolean
  • Is set to 'true', the row is ignored.
  • Is set to 'false', the row is processed normally.
batch_idinteger(Only for batch mode) The reference to the batch table

File Table

The file table is used in both batch mode and entity mode.

The file table contains information about the integrated files and additional details on each of them.

The file table contains the following fields:

NameTypeDescription

id

integer

The unique numeric ID generated from the sequence

load_idinteger

The reference to the load table

load_attimestamptzThe time when the file was loaded to the database
entity_namevarcharThe name of the entity connected to the file
filenamevarcharThe original name of the file
bds_filenamevarcharThe name of the file on the BDS
file_sizeinteger
  • (For CSV Downloader only) The size of the file copied from the manifest file
  • (For all downloaders except for CSV Downloader) Empty
file_hashvarchar
  • (For CSV Downloader only) The MD5 hash of the file copied from the manifest file
  • (For all downloaders except for CSV Downloader) Empty

Configuration File

Creating and setting up the integrator's configuration file is the third phase of building your data pipeline (see Build Your Data Pipeline).

For more information about configuring a brick, see Configure a Brick.

Minimum Layout of the Configuration File

The following JSON samples are the minimum layout of the configuration file that ADS Integrator can use. The samples differ in load mode that ADS Integrator use:

  • Batch mode
  • Entity mode

Choose the mode that is compatible with the downloader that you are using:

  • Choose batch mode when you want to process the data downloaded by CSV Downloader.
  • Choose entity mode when you want to process the data downloaded by any downloader except for CSV Downloader.

Copy the appropriate sample and replace the placeholders with your values.

You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).

Batch Mode

Batch mode is supported by CSV Downloader only.

In batch mode, the data for all entities is put to the source tables first. Then, if all data is correctly copied to the source tables, the process continues. If an error occurred during integrating any of the entities into the batch, the whole process stops, and no data is integrated into the stage tables.

By default, ADS Integrator processes only one batch, but you can change this (see Specify the number of batches to process in one run).

{
  "integrators": {
    "ads_integrator_id": {
      "type": "ads_storage",
      "batches": ["csv_downloader_1","csv_downloader_2"]
    }
  },
  "ads_storage": {
    "instance_id": "data_warehouse_instance_id",
    "username": "dw_email@address.com",
    "options": {}
  }
}

Entity Mode

Entity mode is supported by all downloaders except for CSV Downloader.

In entity mode, entities are processed one by one. The data from one entity is loaded to the source tables and then integrated into the stage tables.

ADS Integrator processes all the files from one entity that are present in the BDS.

{
  "integrators": {
    "ads_integrator_id": {
      "type": "ads_storage",
      "entities": ["entity_1","entity_2","entity_3"]
    }
  },
  "ads_storage": {
    "instance_id": "data_warehouse_instance_id",
    "username": "dw_email@address.com",
    "options": {}
  }
}


The placeholders to replace with your values:

NameTypeMandatory?DefaultDescription
integrators -> ads_integrator_id stringyesn/aThe ID of the ADS Integrator instance
Example: ads_integrator_1
integrators -> ads_integrator_id -> batchesarraysee 'Description'n/a

(Batch mode only) The IDs of the CSV downloader instances that this ADS Integrator instance processes
Example: csv_downloader_1

integrators -> ads_integrator_id -> entitiesarraysee 'Description'n/a(Entity mode only) The names of the entities that this ADS Integrator instance processes
Example: entity_1
ads_storage -> instance_idstringyesn/aThe ID of the ADS instance into which the data is uploaded

ads_storage -> username

stringyesn/aThe access username to the ADS instance into which the data is uploaded

Customize the Configuration File

You can change default values of the parameters and set up additional parameters.

When you set up an optional parameter, it is saved in the S3 file's metadata (not in the metadata folder, but in an attribute of the file itself). To change the parameter value later, edit the metadata files manually.


Set the data integration strategy

The integration strategy defines how data is integrated into the stage tables on ADS.

  • merge: This strategy uses one key for all operations (uniqueness, merging, and so on). This key is computed during the COPY command as an MD5 of all the fields that are marked as primary key in the 'hub' setting of the entity (see the system field  '_sys_hub_id' in the source table).
  • merge_with_pk: This strategy uses all primary keys for all operations (uniqueness, merging, and so on).

The final output is the same for either strategy. The only difference is how the final projection created on stage tables is optimized.

  • For the merge integration strategy, the projection is optimized for one ID field.
  • For the merge_with_pk strategy, the projection is optimized for all primary keys in the order they are set up in the entity's settings (the 'hub' parameter in the section of the configuration file that describes the downloader you are using):

    "entities": {
      "table_1_name": {
        "global": {
          "custom": {
            "hub": ["column_1_name","column_2_name","column_3_name"]
          }
        }
      }
    }

If you want to set a different data integration strategy for a specific entity, use the advanced parameter 'default_strategy_override' (see Adjust advanced settings).

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "default_strategy": "merge|merge_with_pk" // optional
  }
}
NameTypeMandatory?DefaultDescription
default_strategystringnomerge

The data integration strategy used to integrate data into the stage tables on ADS

  • If not set or set to 'merge', the merging strategy is used.
  • If set to 'merge_with_pk', the merging strategy with primary keys is used.


Add multiple ADS Integrators

You can use multiple instances of ADS Integrator to process your downloaders. Update the 'integrators' section to contain multiple subsections: a subsection per ADS Integrator instance. Each subsection must refer the ID of the corresponding ADS Integrator instance.

Depending on what downloaders you are using, you need to specify load mode for each ADS Integrator instance: batch mode or entity mode.

"integrators": {
  "ads_integrator_1_id": {
    "type": "ads_storage",
    "batches": ["csv_downloader_1","csv_downloader_2"]
  },
  "ads_integrator_2_id": {
    "type": "ads_storage",
    "batches": ["csv_downloader_3"]
  },
  {
  "ads_integrator_3_id": {
    "type": "ads_storage",
    "entities": ["entity_1","entity_2","entity_3"]
  },
  "ads_integrator_4_id": {
    "type": "ads_storage",
    "entities": ["entity_4","entity_5","entity_6"]
  }
}


Use a different server

By default, ADS Integrator uses secure.gooddata.com to access to the ADS instance. You can specify a different URL (for example, you are a white-labeled customer).

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "server": "your.server.url", // optional
  "options": {}
}
NameTypeMandatory?DefaultDescription
serverstringnosecure.gooddata.comThe URL used to access the ADS instance


Use COPY FROM S3 instead of COPY FROM LOCAL

If your source files are stored in an S3 location (see 'Use an S3 location' in CSV Downloader), the files are first downloaded from the S3 location and then loaded using the COPY FROM LOCAL command. Using COPY FROM S3 loads the files directly from the S3 location which may be faster.

  • If you use your own S3 location and not a GoodData-managed one, contact GoodData Support so that they enable data load from your S3 location to your ADS instance. In the request, include the ID of your ADS instance (see the 'instance_id' parameter in Minimum Layout of the Configuration File) and the name of the S3 bucket.
  • If you decide to use COPY FROM S3 and at the same time choose not to enable rejected tables (see the 'enable_rejected_table' parameter), you will not be able to review rejected records because they will be skipped and not stored anywhere.
"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "copy_from_s3": true|false
  }
}
NameTypeMandatory?DefaultDescription
copy_from_s3Booleansee 'Description'false

Specifies what COPY command is used for downloading the data from an S3 location.

  • If not set or set to 'false', COPY FROM LOCAL is used.
  • If set to 'true', COPY FROM S3 is used.

NOTE: If you use Parquet source files (see 'Use Parquet files as source files' in CSV Downloader), set this parameter to 'true'.


Specify how to proceed in case of errors and rejected records

By default, if an error occurs and some data cannot be integrated into ADS, ADS Integrator fails. To make ADS Integrator ignore errors, set the 'abort_on_error' parameter to 'false'. This disables the ABORT ON ERROR option in each COPY command.

To additionally specify how to treat the records that have not been parsed during the process and have been rejected, use the optional 'enable_rejected_table' parameter.

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "abort_on_error": true|false, // optional
    "enable_rejected_table": true|false // optional
  }
}
NameTypeMandatory?DefaultDescription
abort_on_errorBooleannotrue

Specifies whether the ABORT ON ERROR option in each COPY command is enabled or disabled.

  • If not set or set to 'true', the ABORT ON ERROR option is enabled, and ADS Integrator fails when an error occurs.
  • If set to 'false', the ABORT ON ERROR option is disabled, ADS Integrator keeps running if an error occurs, and the rejected records are treated according to how the 'enable_rejected_table' parameter is set.
enable_rejected_tableBooleannofalse

(Only when 'abort_on_error' is set to 'false') Specifies how ADS Integrator treats the rejected records.

  • If not set or set to 'false' and:
    • the 'copy_from_s3' parameter is set to 'true' at the same time, the rejected records are skipped and not stored anywhere.
    • the 'copy_from_s3' parameter is not set or set to 'false' at the same time, ADS Integrator moves the rejected records to a file on the BDS after each run.
  • If set to 'true', ADS Integrator moves the rejected records to rejected tables.

NOTE: You can also set the 'enable_rejected_table' parameter at the entity level. To do so, add it to the 'custom' section of the entity.

rejected_table_cleanup_rows_limitintegerno100000

The maximum number of rows allowed in a rejected table

When the maximum number is reached, all the data is deleted from the table within the next run of ADS Integrator.

IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Setting it to a greater value may result in consuming too much space on your ADS instance. Do not change the default unless you are absolutely sure in the results that you want to achieve by changing it.

To be notified of the ignored records, set up a notification triggered by a custom event (see Create a Notification Rule for a Data Loading Process). Set error-in-data-processing as the custom event triggering the notification, and use the {$params.list_of_files} variable in the message body, which will include a list of ignored files in the notification email.


Add optional parameters

Depending on your business and technical requirements, you can set one or more optional parameters affecting the data integration process.

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "disable_file_level_logging": true|false, // optional
  "options": {
    "integration_group_size": number_of_files, // optional
    "remove_deleted_records": true|false, // optional
    "group_by_version": true|false // optional
  }
}
NameTypeMandatory?DefaultDescription
disable_file_level_loggingBooleannofalse

Specifies whether ADS Integrator logs information to the _sys_file table

  • If not set or set to 'false', ADS Integrator logs information to the _sys_file table
  • If set to 'true', ADS Integrator does not log any information to the _sys_file table

Set this parameter to 'true' if you want to disable the '_sys_filename' field (see Source Table for more information about the '_sys_filename' field; see the 'ignored_system_fields' parameter in Adjust advanced settings for more information about disabling system fields).

integration_group_sizeintegerno1

The number of files to be integrated in one COPY command

NOTE:

  • Set this parameter to a greater number when you are integrating a big number of small files. ADS processes the files in one COPY command serially. Note, however, that setting this parameter to a higher value will stop loading new rows to the file table.
  • Do not use this parameter (or, if using, keep it at its default) if you have multiple versions of a data record in your source data.
remove_deleted_recordsBooleannofalse

Specifies how to handle the records in the stage table that have the '_sys_is_deleted' field set to 'true'.

  • If not set or set to 'false', ADS Integrator does not remove the records.
  • If set to 'true', ADS Integrator purges the records.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

group_by_versionBooleannofalse

(For CSV Downloader only) Specifies how ADS Integrator processes different versions of the same data records in the source data (see 'version' in Feed File) in one run

  • If not set or set to 'false', ADS Integrator integrates data records as they are provided in the source files regardless of whether data records have only one version or multiple versions.
  • If set to 'true', ADS Integrator integrates the data records version by version. That is, ADS Integrator first integrates all data records for version 1, then proceeds to the data records for version 2, and so on.


Configure additional settings in the source table and stage table

To extend the source table and stage table with additional settings, configure computed fields. For more details, see Computed Fields.


Add system fields to the load table or file table

To add the system fields to the load table or file table, configure system fields. Data can populate these fields only from the runtime metadata. The runtime metadata is filled in by downloaders. To find out what metadata a downloader creates, see the corresponding downloader's article in Downloaders.

The 'system_table_fields' section contains arrays of parameters for the load table and file table.

"system_table_fields": {
  "file_table": [
    {
      "name": "custom_parameter",
      "type": "parameter_data_type",
      "function": "function_name"
    }
  ],
  "load_table": [
    {
      "name": "custom_parameter",
      "type": "parameter_data_type",
      "function": "function_name"
    }
  ]
}
NameTypeMandatory?DefaultDescription
namestringyesn/a

The name of the custom parameter

typestringyesn/a

The type of the custom parameter

Possible values:

  • varchar(X): string-X
  • varchar(255): string
  • integer: integer
  • decimal(X,Y): decimal-X-Y
  • decimal(16,10): decimal-16-10
  • boolean: boolean
  • date (without time): date-false
  • date (with time): date-true
  • time (without date): time
functionstringyesn/a

The function that will fill in the system field with the runtime metadata parameter

Format: runtime_metadata|value
value specifies the name of the runtime metadata parameter. For example, if you are using CSV Downloader, you can use non-standard fields in the manifest files downloaded by CSV Downloader that are copied to 'runtime_metadata'.

Example: runtime_metadata|load_param


Example:

"system_table_fields": {
  "file_table": [
    {
      "name": "load_param",
      "type": "string-255",
      "function": "runtime_metadata|load_param"
    }
  ]
}


Specify the number of batches to process in one run

When working with batch mode, you can specify how many batches ADS Integrator should process in one run.

Example: Entity data in all batches is in incremental load mode
You have 10 batches to process, and the number of batches in one run is set to 5.

The 10 batches will be processed in the next two runs of ADS Integrator.

ADS Integrator downloads all the data for the number of the batches that you have specified and checks whether all entities have only one version of metadata.

  • If yes, ADS Integrator will process all data in one run: copy all the data to the source table and then use MERGE to put it to the stage table.
  • If not, ADS Integrator will split the data into multiple parts and process them independently. The order of the batches will be preserved.

Example: Some entity data is in full load mode, the rest is in incremental load mode
You have 10 batches to process, and the number of batches in one run is set to 5. The entity data in Batch 5 and Batch 8 are marked as 'full'.

In the first run, ADS Integrator takes the batches from 1 through 5. Batch 5 is the one with full load mode, so ADS Integrator ignores all incremental batches before this batch and only processes Batch 5.

In the second run, ADS Integrator ignores Batch 6 and Batch 7 (incremental mode), then processes Batch 8 (full mode), and then processes incremental batches: Batch 9 and Batch 10.

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "number_of_batches_in_one_run": number_of_batches // optional
  }
}
NameTypeMandatory?DefaultDescription
number_of_batches_in_one_runintegerno100The number of batches that ADS Integrator should process in one run


Use link files

When you are using CSV Downloader and chose to use link files (see 'Use link files' in CSV Downloader), set ADS Integrator that processes this instance of CSV Downloader to use the link files.

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "remote": {
      "type": "s3",
      "access_key": "access_key",
      "bucket": "bucket_name"
    }
  }
}
NameTypeMandatory?DefaultDescription
access_keystringyesn/aThe access key to the S3 bucket where the source data is stored
bucketstringyesn/aThe name of the S3 bucket where the source data is stored

When later scheduling this ADS Integrator instance in the Data Integration Console, provide the 'ads_integrator|options|remote|secret_key' parameter in the schedule (see General Parameters).


Enable processing of PGP-encrypted source files when using link files

When you are using link files in CSV Downloader (see 'Use link files' in CSV Downloader) and your source files are PGP-encrypted, enable ADS Integrator that processes this instance of CSV Downloader to process encrypted files.

Steps:

  1. Set ADS Integrator that processes this instance of CSV Downloader to use the link files.

    "ads_storage": {
      "instance_id": "data_warehouse_instance_id",
      "username": "dw_email@address.com",
      "options": {
        "remote": {
          "type": "s3",
          "access_key": "access_key",
          "bucket": "bucket_name"
        }
      }
    }
    NameTypeMandatory?DefaultDescription
    access_keystringyesn/aThe access key to the S3 bucket where the source data is stored
    bucketstringyesn/aThe name of the S3 bucket where the source data is stored
  2. Add the 'encryption' parameter under 'ads_storage' -> 'options', and give it a unique name (for example, 'encryption_1').

    "ads_storage": {
      "instance_id": "data_warehouse_instance_id",
      "username": "dw_email@address.com",
      "options": {
        "remote": {
          "type": "s3",
          "access_key": "access_key",
          "bucket": "bucket_name"
        },
        "encryption": "encryption_1"
      }
    }
  3. Add a new section to the configuration file and name it with the same unique name that you used for the 'encryption' parameter. In this example, it is 'encryption_1'. This section should contain one parameter, 'type', set to 'pgp':

    "encryption_1": {
      "type": "pgp"
    }
  4. When later scheduling this ADS Integrator instance in the Data Integration Console, provide the 'ads_integrator|options|remote|secret_key' parameter (see General Parameters) and the parameters that are related to the decryption of the PGP-encrypted source files (see Encryption-Specific Parameters).


Adjust advanced settings

ADS Integrator allows you to adjust various advanced settings, which, for example, can help improve performance.

Change these settings only if you are confident in executing the task or have no other options. Adjusting the advanced settings in a wrong way may generate unexpected side effects.

Proceed with caution.

"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {
    "resource_pool": "resource_pull", // optional
    "pool_timeout": number_of_seconds, // optional
    "integrate_without_sys_hash": true|false, // optional
    "keep_values_when_deleted": true|false, // optional
    "default_strategy_override": {"entity_1": "strategy_name","entity_2": "strategy_name"}, // optional
    "max_connections": number_of_connections, // optional
    "connection_refresh_time": number_of_minutes, // optional
    "multi_load": true|false, // optional
    "use_direct_hint_in_copy": true|false, // optional
    "truncate_source_tables": true|false, // optional
    "disable_statistics_stg": true|false, // optional
    "disable_sys_load_id_logging": true|false, // optional
    "disable_sys_id": true|false, // optional
    "use_uniq_table": true|false, // optional
    "insert_to_stage": true|false, // optional
    "skip_projection_check": true|false, // optional
    "segmentation": segmented|unsegmented, // optional
    "partition_by": "field_name", // optional
    "drop_source_partition": true|false, // optional
    "source_projection": { <custom_segmentation> }, // optional
    "ignored_system_fields": "["sys_field_1","sys_field_2"]", // optional
    "join_files": true|false, // optional
    "finish_in_source": true|false, // optional
    "recreate_source_tables": true|false, // optional
    "number_of_parallel_queries": number_of_copy_commands, // optional
    "number_of_parallel_entities_integrations": number_of_entities_to_source, // optional
    "number_of_parallel_integrations": number_of_entities_to_stage // optional
  }
}

Change these settings only if you are confident in executing the task or have no other options. Adjusting the advanced settings in a wrong way may generate unexpected side effects.

Proceed with caution.

NameTypeMandatory?DefaultDescription
resource_poolstringnon/aThe resource pool to connect to the database
pool_timeoutintegerno240The maximum timeout (in seconds) specifying how long ADS Integrator waits for a new connection to be created
integrate_without_sys_hashBooleannofalse

Specifies whether '_sys_hash' (see Source Table and Stage Table) is used.

  • If not set or set to 'false', '_sys_hash' is used.
  • If set to 'true', '_sys_hash' is not used.

Set this parameter to 'true' if you want to disable the '_sys_hash' field (see Source Table and Stage Table for more information about the '_sys_hash' field; see the 'ignored_system_fields' parameter in this table for more information about disabling system fields).

WARNING: Suppressing the '_sys_hash' field may result in reduced integration performance due to potentially costly hash joins. Proceed with caution.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

keep_values_when_deletedBooleannofalse

(Full loads only) Specifies how to handle the latest values in the records that are marked to be deleted from the ADS instance.

  • If not set or set to 'false', ADS Integrator replaces the latest values with NULLs but keeps the key fields.
  • If set to 'true', ASD integrator does not change the latest values in any way (they remain intact).
default_strategy_overrideJSONno

{ }

(empty hash, as in:
"default_strategy_override" : {} )

Sets a different data integration strategy for a specific entity (a different value than the 'default_strategy' parameter sets, see Set the data integration strategy)

Example:

"default_strategy_override": {
  "entity_1" : "merge_with_pk",
  "entity_2" : "merge_with_pk"
}
max_connectionsintegerno50The maximum number of open connections to ADS
connection_refresh_timeintegerno15The period (in minutes) after which the connection gets refreshed
multi_loadBooleannofalse

Specifies whether multi-load mode is enabled (that is, one entity can be loaded by multiple instances of ADS Integrator)

By default, one entity can be loaded by only one instance of ADS Integrator. For performance reasons, you may want to allow multiple instances of ADS Integrator to load the same entity. For example, if you have an entity that is downloaded by 10 downloaders, you may want to process this entity with two instances of ADS Integrator.

  • If not set or set to 'false', one entity can be loaded by only one instance of ADS Integrator.
  • If set to 'true', one entity can be loaded by only one instance of ADS Integrator. ADS Integrator creates one source table for each integrator instance and entity. This source table has a postfix set to the ID of the integrator instance. All data processed by the integrator instance is loaded to this specific source table and then merged to the stage table (there is always only one stage table).
use_direct_hint_in_copyBooleannotrue

Specifies whether the DIRECT hint is allowed in the COPY command

  • If not set or set to 'true', the DIRECT hint is allowed.
  • If set to 'false', the DIRECT hint is not allowed.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

truncate_source_tablesBooleannotrue

Specifies whether to truncate the source table and to recreate the used sequences

  • If not set or set to 'true', the source table can be truncated and the used sequences can be recreated.
  • If set to 'false', the source table cannot be truncated and the used sequences cannot be recreated.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

disable_statistics_stgBooleannofalse

Specifies whether the ANALYZE_STATISTICS function is forbidden for use on the stage tables

  • If not set or set to 'false', ADS Integrator can run the ANALYZE_STATISTICS function on the stage tables.
  • If set to 'true', ADS Integrator is not allowed to run the ANALYZE_STATISTICS function on the stage tables. This may help improve performance on big tables.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

disable_sys_load_id_loggingBooleannofalse

(For CSV Downloader only) Specifies whether logging to the '_sys_load_id' system table is enabled

  • If not set or set to 'false', logging to the '_sys_load_id' system table is enabled.
  • If set to 'true', logging to the '_sys_load_id' system table is disabled.
disable_sys_idBooleannofalseSpecifies whether the '_sys_id' field (see Stage Table) and the '_sys_hub_id' field (see Source Table) are used
  • If not set or set to 'false', the both '_sys_id' and '_sys_hub_id' fields are used.
  • If set to 'true', neither the '_sys_id' field nor the '_sys_hub_id' field is used.
use_uniq_tableBooleannotrue

Specifies whether input data should be loaded to the uniq table

  • If not set or set to 'true', the data are loaded to the uniq table.
  • If set to 'false', the data skips the uniq table.

NOTE: Use this parameter only when you are sure that the input data does not contain duplicates.

insert_to_stageBooleannofalse

Specifies whether the MERGE command is used when the data is inserted into the stage table

  • If not set or set to 'false', the data is inserted into the stage tables using the MERGE command.
  • If set to 'true', the data is inserted into the stage tables directly, without using the MERGE command.
skip_projection_checkBooleannofalse

Specifies whether to check for the correct projection.

  • If not set or set to 'false', a check for the correct projection is enabled.
  • If set to 'true', a check for the correct projection is disabled. Do so only as a last resort or a temporary solution.

NOTE: Set this parameter to 'true' if you want to set custom segmentation and the order in projections (see the 'source_projection' parameter in this table).

source_projectionJSONnon/a

Sets custom segmentation and the order in projections

Example:

"source_projection": {
  "segmentation": ["name"],
  "order_by": ["id"]
}

NOTES:

  • To be able to use this parameter, set 'skip_projection_check' to 'true' (see the 'skip_projection_check' parameter in this table).
  • You can set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.
segmentationstringnon/a

The expression to segment a table by
To prevent segmentation, set this parameter to 'unsegmented'.

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

partition_bystringnon/a

The name of the field to set the partition to

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

drop_source_partitionBooleannofalse

(Only for CSV Downloader) Specifies whether to drop the partition in the source tables (see Source Table)

  • If not set or set to 'false', the partition is the source tables remains intact.
  • If set to 'true', the partition in the source tables is dropped at the beginning of each data integration. The partition is defined based on the value of the 'target_predicate' field in the manifest file (see the 'target_predicate' column in Manifest File).

NOTE: You can also set this parameter at the entity level. To do so, add it to the 'custom' section of the entity.

ignored_system_fieldsarraynofalse

The system fields added by ADS Integrator to each table that you want to disable

NOTE: To be able to use this parameter, set 'finish_in_source' to 'true' (see the 'finish_in_source' parameter in this table).
Additionally:

  • If you want to disable the '_sys_hash' field (see Source Table and Stage Table), set 'integrate_without_sys_hash' to 'true' (see the 'integrate_without_sys_hash' parameter in this table).
  • If you want to disable the '_sys_filename' field (see Source Table), set 'disable_file_level_logging' to 'true' (see the 'disable_file_level_logging' parameter in Add optional parameters).
join_filesBooleannofalse

Specifies whether to join files to one bigger file before sending them to ADS

  • If not set or set to 'false', the files are not joined.
  • If set to 'true', ADS Integrator joins the files.

NOTE: The files must not have headers. Using this parameter is not recommended for files greater than 100 Mb.

finish_in_sourceBooleannofalse

Specifies whether to end data integration at the source table phase

  • If not set or set to 'false', data integration runs through all the process steps (see How ADS Integrator Works).
  • If set to 'true', ADS Integrator ends the integration at the source table phase. No other steps will be executed.

Set this parameter to 'true' if you want to disable some system fields that ADS Integrator adds to each table (see the 'ignored_system_fields' parameter in this table).

recreate_source_tablesBooleannofalse

Specifies how to handle the source tables

  • If not set or set to 'false', ADS Integrator does not drop the source tables.
  • If set to 'true', ADS Integrator drops and recreates the source tables during each run.
number_of_parallel_queriesintegerno8

The number of COPY commands to run in parallel when integrating one entity (see How ADS Integrator Works)

SQL operation to run: COPY FROM LOCAL

Example: If ADS Integrator processes 100 batches and each batch has 5 files for the entity, the total number of files that needs to be processed is 500. If you set the 'number_of_parallel_queries' parameter to 10, ADS Integrator will create 10 slots for COPY commands (up to 10 COPY commands can run simultaneously). Therefore, in the ideal case, the 500 files will be processed 10 times as fast as when this setting is set to 1.

IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure in the results that you want to achieve by changing it.

number_of_parallel_entities_integrationsintegerno1

(Batch mode only) The number of entities to be integrated in parallel to the source table (see How ADS Integrator Works). The operations performed during integration are similar to those performed when you have the 'finish_in_source' parameter set to 'true' (see the 'finish_in_source' parameter in this table).

SQL operations to run:

  • Creating flex tables (with fjsonparser)
  • Preparing the source table (truncate, projections, sequences)
  • Processing files with deleted records (see the 'remove_deleted_records' parameter in Add optional parameters)
  • COPY FROM LOCAL
  • Logging

Example: If you set this parameter to 2 and the 'number_of_parallel_queries' parameter is set to 5, then you have 5 entities, and each of these entities has multiple files for processing. ADS Integrator will create 2 slots (based on 'number_of_parallel_entities_integrations'). Inside these 2 slots, ADS Integrator will create another 5 slots for COPY commands. As a result, you are going to have up to 10 COPY commands running simultaneously, 5 for each entity.

IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure in the results that you want to achieve by changing it.

number_of_parallel_integrationsintegerno1

(Batch mode only) The number of entities to be integrated in parallel to the stage table (filling in the uniq table, the diff table, and the stage table; see How ADS Integrator Works)

SQL operations to run:

  • Analyzing statistics on the source and stage tables
  • Truncating the uniq and diff tables
  • Inserting into the uniq and diff tables
  • Updating the stage table records that are marked as 'is_deleted' (see the '_sys_is_deleted' field in the stage table)
  • Merging data from the diff table to the stage table
  • Logging

Example: If you have 10 entities and you set this parameter to 2, ADS Integrator will create 2 slots for the integration command. If all the entities have the same amount of data, the integration (without loading to the source table) runs twice as fast as when this setting is set to 1.

IMPORTANT: The default value is sufficient for an average workspace on a standard ADS instance. Do not change the default unless you are absolutely sure in the results that you want to achieve by changing it.

Schedule Parameters

When scheduling ADS Integrator (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.

General Parameters

Some parameters must be entered as secure parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as ads|options|password is resolved as the following:

"ads": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

NameTypMandatory?Secure?DefaultDescription
IDstringyesnon/a

The ID of the ADS Integrator instance being scheduled
Example: ads_integrator_1

server_side_encryptionBooleannonofalse

Specifies whether ADS Integrator can work with an S3 location where server-side encryption is enabled

  • If not set or set to 'false', ADS Integrator is not enabled to work with an S3 location where server-side encryption is enabled.
  • If set to 'true', ADS Integrator is enabled to work with an S3 location where server-side encryption is enabled.
ads_storage|passwordstringyesyesn/a

The password for the user that is specified in the configuration file under ads_storage -> username

The user and the password are used to access the specified ADS instance where the data will be uploaded to.

ads_integrator|options|remote|secret_keystringsee 'Description' columnyesn/a(Only when ADS Integrator uses link files) The access key to the S3 bucket where the source data is stored

Encryption-Specific Parameters

If you enable processing of PGP-encrypted source files when using link files, provide the encryption-specific parameters as secure schedule parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as csv|options|password is resolved as the following:

"csv": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

Parameter NameTypeMandatory?Secure?DefaultDescription
encryption_name|private_keystringyesyesn/a

The private key used for decrypting the PGP-encrypted files

If the private key itself is encrypted, also provide the passphrase to decrypt it (see the 'encryption_name|passphrase' parameter).

NOTE: Although you are entering the private key as a secure parameter, its value will be visible right after you paste it in the parameter value field (the show value check-box is selected) because it is a multi-line string. To hide the value, un-select the show value check-box. When the schedule is saved, the value will not appear in clear-text form in any GUI or log entries. For more information, see Configuring Schedule Parameters.

encryption_name|passphrase

stringsee 'Description'yesn/a

The passphrase used to decrypt the private key if it is encrypted

This parameter is mandatory only when the private key (see the 'encryption_name|private_key' parameter) is encrypted. Otherwise, do not use it.

Schedule Example

The following is an example of how you can specify schedule parameters:

Best Practices

For better performance, we recommend that you apply the following best practices when using ADS Integrator:

  • Keep the total size of the files in one integration group below 1 GB. An integration group is usually a part of the manifest file containing x files, where x is the value of the 'integration_group_size' parameter (see Add optional parameters). The default of this parameter is 1, therefore unless you change it, the files are integrated one by one.
  • When processing many smaller files, set the 'integration_group_size' parameter to a higher value to copy all the files in a single COPY command (see Add optional parameters). Note, however, that setting this parameter to a higher value will stop loading new rows to the file table.
  • Do not use the 'integration_group_size' parameter if you have multiple versions of a data record in your source data.
  • For better performance on big tables, set the 'disable_statistics_stg' parameter to 'true' (see Adjust advanced settings).
  • Use the 'skip_projection_check' parameter (see Adjust advanced settings) only as a last resort or a temporary solution.
  • Use the 'integrate_without_sys_hash' parameter (see Adjust advanced settings) carefully. Suppressing the '_sys_hash' field may result in reduced integration performance due to potentially costly hash joins.
  • Make sure that the source table and stage table have the same columns so that optimized MERGE is used.
Powered by Atlassian Confluence and Scroll Viewport.