GoodData-S3 Integration Details

When setting up direct distribution of data from source files in your S3 object storage service, pay attention to the considerations and best practices listed in this article.

This article is applicable to the use case of GoodData and S3 integration (see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).

Bucket Policy

We recommend that you allow read access to the S3 bucket where the source files are stored. To do so, add a bucket policy that grants the s3:GetObject permission.
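
As an illustration of the recommendation above, the following sketch builds a bucket policy document that grants s3:GetObject on every object in a bucket. The bucket name, the principal ARN, the Sid value, and the function name build_read_policy are placeholders chosen for this example, not values taken from GoodData or from your AWS account. Adjust them to your own setup.

```python
import json


def build_read_policy(bucket: str, principal_arn: str) -> str:
    """Return a bucket policy (as JSON text) that grants s3:GetObject.

    Both arguments are placeholders supplied by the caller; this sketch
    does not know which principal your integration actually uses.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadSourceFiles",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # Object-level permissions apply to keys, hence the "/*".
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Hypothetical bucket name and IAM user ARN, for illustration only.
    print(build_read_policy(
        "my-source-bucket",
        "arn:aws:iam::123456789012:user/example-loader",
    ))
```

The printed JSON can be pasted into the bucket policy editor or applied with the tooling you already use to manage the bucket.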

Source Files

The source files can be any of the following file types:

  • Raw CSV files and compressed CSV files (.zip or .gz)
  • Raw Parquet files or compressed Parquet files (.parquet or .parquet.gz)

Prepare one source file per dataset in your logical data model (LDM).

File Name

Use the following format of the source file name:

{dataset_name}_{timestamp}_{load_mode}_{part}.csv|parquet|zip|gz
  • {dataset_name} is the name of the dataset in your LDM to which the data from this file will be loaded.
  • {timestamp} is a timestamp in the yyyyMMddHHmmss format specifying when the data was exported from the source database. The timestamp lets Automated Data Distribution (ADD) v2 (see Automated Data Distribution v2 for Object Storage Services) detect which file should be loaded to the dataset. The timestamp must increase with each subsequent file. If a source file with timestamp T1 was loaded to the dataset, a new source file for this dataset must have a timestamp greater than T1.

    When you first set up direct distribution of data from your S3 object storage service, we recommend that you set the timestamp of the first S3 folder or file to load to today's date and a time that is later than the current time.

  • {load_mode} is the mode of loading data from the source file to the dataset.
    • To set full mode for a source file, specify full.
    • To set incremental mode for a source file, specify inc.
  • {part} is a suffix in the partX format where X is a positive integer (1, 2, 3, ...) that is used when you split a large source file (see Size) into a few smaller files.
    Add the {part} section to each of those smaller files to indicate what part of the large file it is: part1, part2, part3, and so on. Multiple source files for the same dataset, with the same timestamp and the {part} section, will all be loaded and concatenated.
  • A compressed file (.zip or .gz) must contain one CSV or Parquet file with the same name. For example, the compressed file invoiceItems_20200529090000_inc.zip must contain invoiceItems_20200529090000_inc.csv.
  • A compressed Parquet file must have the .parquet.gz extension.
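
The naming rules above can be sketched as a small helper that assembles a file name from its parts. The function name build_file_name and its parameters are hypothetical and exist only for this example. The sketch assumes that the extension is passed without a leading dot.

```python
from datetime import datetime
from typing import Optional


def build_file_name(dataset: str, exported_at: datetime, load_mode: str,
                    extension: str, part: Optional[int] = None) -> str:
    """Assemble {dataset_name}_{timestamp}_{load_mode}_{part}.{extension}."""
    if load_mode not in ("full", "inc"):
        raise ValueError("load_mode must be 'full' or 'inc'")
    if part is not None and part < 1:
        raise ValueError("part must be a positive integer")

    # yyyyMMddHHmmss corresponds to this strftime pattern.
    segments = [dataset, exported_at.strftime("%Y%m%d%H%M%S"), load_mode]
    if part is not None:
        segments.append(f"part{part}")
    return "_".join(segments) + "." + extension


if __name__ == "__main__":
    exported = datetime(2020, 5, 29, 9, 0, 0)
    print(build_file_name("customers", exported, "inc", "zip"))
    # customers_20200529090000_inc.zip
    print(build_file_name("orders", exported, "inc", "zip", part=2))
    # orders_20200529090000_inc_part2.zip
```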

The following are examples of file names:

  • customers_20200529090000_inc.zip
  • premiumContracts_20200529090000_full.zip
  • orders_20200529090000_inc_part1.zip
  • orders_20200529090000_inc_part2.zip
  • orders_20200529090000_inc_part3.zip
  • products_20200529090000_full.gz
  • products_20200529090000_full.parquet
  • products_20200529090000_full.parquet.gz
  • products_20200529090000_full.csv.gz
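
To check names like the examples above before uploading, a regular expression is one possible approach. The following is a minimal sketch, not the validation that ADD v2 itself performs. The pattern FILE_NAME_RE and the function parse_file_name are names invented for this example, and the set of accepted extensions is inferred from the examples in this article.

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

# Assumed pattern, derived from the naming format described in this article.
FILE_NAME_RE = re.compile(
    r"^(?P<dataset>.+)_(?P<timestamp>\d{14})_(?P<mode>full|inc)"
    r"(?:_part(?P<part>[1-9]\d*))?"
    r"\.(?P<ext>csv\.gz|parquet\.gz|csv|parquet|zip|gz)$"
)


def parse_file_name(name: str) -> Optional[Dict[str, Any]]:
    """Return the name's components, or None if it does not match."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        return None
    try:
        # Reject digit strings that are not real dates, such as February 30.
        datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    except ValueError:
        return None
    part = match.group("part")
    return {
        "dataset": match.group("dataset"),
        "timestamp": match.group("timestamp"),
        "mode": match.group("mode"),
        "part": int(part) if part else None,
        "ext": match.group("ext"),
    }


if __name__ == "__main__":
    print(parse_file_name("orders_20200529090000_inc_part1.zip"))
    print(parse_file_name("orders_2020_inc.zip"))  # None: timestamp too short
```

Because the timestamp has a fixed width, comparing two timestamp strings with the usual string comparison gives the same order as comparing the dates, which is convenient when checking that a new file is newer than the last loaded one.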

File Structure

The structure of the source files must correspond to your LDM. Every time you change the LDM, reflect the changes in the source files.

Size

  • Keep the size of the source files:
    • Under 50 GB for a raw CSV file
    • Under 10 GB for a Parquet file
    • Under 10 GB for compressed files (.zip or .gz)
  • Split larger files into smaller files.

    For better performance, use compressed files (.zip or .gz), up to 1 GB each.

  • Keep the total size of the files in one load:
    • Under 1 TB for raw CSV or Parquet files
    • Under 200 GB for compressed files (.zip or .gz)
  • Divide larger amounts of data into multiple loads. For example, place a portion of the source files with timestamp 1 in the S3 bucket, and run the ADD v2 process to load these files. Then, place the remaining source files with timestamp 2, and run the ADD v2 process again.
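
The limits above can be checked before a load with a short script. The sketch below rests on several assumptions that this article does not state: 1 GB is treated as 10^9 bytes and 1 TB as 1000 GB, the totals for raw and compressed files are checked separately, and file sizes are supplied by the caller in bytes. The names check_load and per_file_limit are hypothetical.

```python
from typing import Dict, List

GB = 10 ** 9  # assumption: decimal gigabytes


def is_compressed(name: str) -> bool:
    """Treat .zip and .gz (including .csv.gz and .parquet.gz) as compressed."""
    return name.lower().endswith((".zip", ".gz"))


def per_file_limit(name: str) -> int:
    """Return the per-file size limit in bytes for the given file name."""
    lower = name.lower()
    if is_compressed(lower) or lower.endswith(".parquet"):
        return 10 * GB
    if lower.endswith(".csv"):
        return 50 * GB
    raise ValueError(f"unsupported file type: {name}")


def check_load(files: Dict[str, int]) -> List[str]:
    """Return a list of problems for one load; an empty list means none found."""
    problems = []
    raw_total = 0
    compressed_total = 0
    for name, size in files.items():
        if size >= per_file_limit(name):  # "under" means strictly less
            problems.append(f"{name} is too large; split it into parts")
        if is_compressed(name):
            compressed_total += size
        else:
            raw_total += size
    if raw_total >= 1000 * GB:
        problems.append("raw files total 1 TB or more; use multiple loads")
    if compressed_total >= 200 * GB:
        problems.append("compressed files total 200 GB or more; use multiple loads")
    return problems


if __name__ == "__main__":
    print(check_load({
        "orders_20200529090000_inc_part1.zip": 1 * GB,
        "customers_20200529090000_full.csv": 60 * GB,
    }))
```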

Location

Depending on the amount of data and the load frequency, you can organize your source files in the following ways:

  • Place the source files directly in the folder specified by the S3 path in the Data Source (see Create a Data Source) or in the folder that you added to that path when deploying the ADD v2 process (see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).

    folder
    |-- datasetA_20200529090000_inc_part1.zip
    |-- datasetA_20200529090000_inc_part2.zip
    |-- datasetB_20200529090000_inc.zip
    |-- datasetB_20200530090000_inc.zip
    |-- datasetC_20200530090000_full.zip
  • Organize the source files into sub-folders per date. The timestamps in the source files in a sub-folder must correspond with the date used in this sub-folder.

    folder
    |--- 20200529
    | |-- datasetA_20200529090000_inc.zip
    | |-- datasetB_20200529090000_full.zip
    | |-- datasetC_20200529090000_full.zip
    |--- 20200530
    | |-- datasetA_20200530090000_inc.zip
    | |-- datasetB_20200530090000_inc.zip
    | |-- datasetC_20200530090000_full.zip
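
For the per-date layout, the sub-folder name can be derived from the timestamp already present in the file name, which keeps the two consistent. The following is a minimal sketch. The function name s3_key_for and the default prefix "folder" are placeholders for this example.

```python
import re


def s3_key_for(file_name: str, base_prefix: str = "folder") -> str:
    """Return '<base_prefix>/<yyyyMMdd>/<file_name>' for a source file."""
    # Capture the date portion (first 8 digits) of the 14-digit timestamp.
    match = re.search(r"_(\d{8})\d{6}_(?:full|inc)", file_name)
    if match is None:
        raise ValueError(f"no timestamp found in: {file_name}")
    return f"{base_prefix.rstrip('/')}/{match.group(1)}/{file_name}"


if __name__ == "__main__":
    print(s3_key_for("datasetA_20200529090000_inc.zip"))
    # folder/20200529/datasetA_20200529090000_inc.zip
```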

For better performance, keep only the files that still have to be loaded. Remove files that are old or have already been loaded.

Load Frequency

Align the frequency of exporting data from your database to source files with the frequency of loading the source files to your workspaces. For example, if you want to load data to your workspace once a day, generate the source file from your database once a day as well.
