GoodData-S3 Integration Details

When setting up direct distribution of data from CSV files in your S3 object storage service, follow the best practices for the following:

  • Source file requirements
  • Load frequency

This article applies to the GoodData and S3 integration use case (see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).


Source Files

The source files are raw CSV files or compressed (zipped) CSV files.

Prepare one source file per dataset in your logical data model (LDM).

File Name

Use the following format of the source file name:

{dataset_name}_{timestamp}_{load_mode}_{part}.csv|zip
  • {dataset_name} is the name of the dataset in your LDM to which the data from this file will be loaded. {dataset_name} is mandatory.
  • {timestamp} is a timestamp in the yyyyMMddHHmmss format specifying when the data was exported from the source database. The timestamp lets Automated Data Distribution (ADD) v2 (see Automated Data Distribution v2 for Object Storage Services) detect which file should be loaded to the dataset. The timestamp must increase with each subsequent file: if a source file with timestamp T1 was loaded to the dataset, a new source file for this dataset must have a timestamp greater than T1.
    • For the source files to be loaded in incremental mode, the timestamp is mandatory.
    • For the source files to be loaded in full mode:
      • If you have only one source file for a dataset, the file name must not contain a timestamp.
      • If you have multiple source files for a dataset, all those files must have a timestamp for ADD v2 to determine what file to load.
  • {load_mode} is the mode of loading data from the source file to the dataset.
    • To set full mode for a source file, specify full.
    • To set incremental mode for a source file, specify inc.
    • If you do not specify the mode explicitly, it defaults to full.
  • {part} is a suffix in the partX format used when you split a large source file (see Size) into multiple smaller files. Add the {part} section to each of those smaller files to indicate which part of the large file (part 1, part 2, part 3, ...) it is. Multiple source files for the same dataset that have the same timestamp and a {part} section are all loaded and concatenated (see the naming sketch after the examples below).
  • Both CSV files and compressed (zipped) CSV files are supported. A compressed file must contain one CSV file with the same name. For example, the compressed file invoiceItems.zip must contain invoiceItems.csv.

The following are examples of file names:

  • products.csv
  • products_full.csv
  • invoiceItems.zip
  • customers_20200529090000.csv
  • customers_20200529100000.csv
  • premiumContracts_part1.zip
  • premiumContracts_part2.zip
  • users_20200529090000_inc.csv
  • orders_20200529090000_inc_part1.csv
  • orders_20200529090000_inc_part2.csv
  • orders_20200529090000_inc_part3.csv
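
The following Python sketch shows how such names can be assembled programmatically. The source_file_name helper is hypothetical, not part of the GoodData tooling; it only mirrors the convention described above.

  from datetime import datetime

  def source_file_name(dataset, exported_at=None, load_mode=None, part=None, ext="csv"):
      # Assemble {dataset_name}_{timestamp}_{load_mode}_{part}.csv|zip,
      # skipping the sections that do not apply.
      sections = [dataset]
      if exported_at is not None:
          sections.append(exported_at.strftime("%Y%m%d%H%M%S"))  # yyyyMMddHHmmss
      if load_mode is not None:
          sections.append(load_mode)  # "full" or "inc"; omitting it means full
      if part is not None:
          sections.append(f"part{part}")
      return "_".join(sections) + "." + ext

  source_file_name("products")                                    # products.csv
  source_file_name("customers", datetime(2020, 5, 29, 9, 0, 0))   # customers_20200529090000.csv
  source_file_name("orders", datetime(2020, 5, 29, 9), "inc", 1)  # orders_20200529090000_inc_part1.csv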

File Structure

The structure of the source files must correspond to your LDM. Every time you change the LDM, reflect the changes in the source files.
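
For example, if a dataset named orders in your LDM has attributes order_id and status, a date, and a fact amount (hypothetical field names and values), the source file for that dataset might look like this:

  order_id,status,date,amount
  O-1001,PAID,2020-05-29,125.50
  O-1002,OPEN,2020-05-29,80.00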

Size

  • Keep the size of the source files:
    • Under 50 GB for a raw CSV file
    • Under 10 GB for a compressed (zipped) CSV file
  • Split larger files into smaller files.

    For better performance, use compressed CSV files, up to 1 GB each.

  • Keep the total size of the files in one load:
    • Under 1 TB for raw CSV files
    • Under 200 GB for compressed CSV files
  • Divide larger amounts of data into multiple loads. For example, place a portion of the source files with timestamp 1 in the S3 bucket, and run the ADD v2 process to load these files. Then, place the remaining source files with timestamp 2, and run the ADD v2 process again.
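
If a single export exceeds these limits, a script along the lines of the following sketch can split it into compressed parts. The split_csv helper, the chunk size, and the assumption that every part repeats the CSV header are illustrative choices, not GoodData requirements:

  import csv, io, itertools, zipfile

  def split_csv(src_path, dataset, timestamp, rows_per_part=1_000_000):
      # Split a large CSV export into zipped parts named
      # {dataset}_{timestamp}_part{N}.zip, each containing one same-named CSV.
      with open(src_path, newline="") as src:
          reader = csv.reader(src)
          header = next(reader)
          for part in itertools.count(1):
              chunk = list(itertools.islice(reader, rows_per_part))
              if not chunk:
                  break
              name = f"{dataset}_{timestamp}_part{part}"
              buf = io.StringIO()
              writer = csv.writer(buf)
              writer.writerow(header)  # repeat the header in every part (assumption)
              writer.writerows(chunk)
              # A compressed source file must contain one CSV with the same name.
              with zipfile.ZipFile(f"{name}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
                  zf.writestr(f"{name}.csv", buf.getvalue())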

Location

Depending on the amount of data and the load frequency, you can organize your source files in the following ways:

  • Place the source files directly in the folder according to the S3 path in the Data Source (or in the folder that you added to that path when deploying the ADD v2 process; see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).

    folder
    |-- datasetA_20200529090000_part1.csv
    |-- datasetA_20200529090000_part2.csv
    |-- datasetB.csv
    |-- datasetC.zip
  • Organize the source files into sub-folders per date. The timestamps in the source files in a sub-folder must correspond to the date used for that sub-folder.

    folder
    |--- 20200529
    | |-- datasetA_20200529090000_inc.csv
    | |-- datasetB_full.zip
    | |-- datasetC.zip
    |--- 20200530
    | |-- datasetA_20200530090000_inc.csv
    | |-- datasetB_20200530090000_inc.zip
    | |-- datasetC.zip

For better performance, keep only the files that still need to be loaded. Remove files that are outdated or have already been loaded.
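
A minimal sketch of maintaining such a dated layout with boto3 follows; the bucket name, prefix, and file names are placeholders:

  import boto3
  from datetime import datetime, timezone

  s3 = boto3.client("s3")
  bucket = "my-gooddata-source"  # placeholder bucket name
  prefix = "folder"              # the S3 path configured in the Data Source

  # Upload today's exports into a dated sub-folder, e.g. folder/20200530/.
  date_folder = datetime.now(timezone.utc).strftime("%Y%m%d")
  for local_file in ["datasetA_20200530090000_inc.csv", "datasetC.zip"]:
      s3.upload_file(local_file, bucket, f"{prefix}/{date_folder}/{local_file}")

  # After a successful ADD v2 run, remove files that were already loaded:
  s3.delete_object(Bucket=bucket, Key=f"{prefix}/20200529/datasetA_20200529090000_inc.csv")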

Load Frequency

Align the frequency of exporting data from your database to source files with the frequency of loading the source files to your workspaces. For example, if you want to load data to your workspace once a day, generate the source file from your database once a day as well.
