GoodData-S3 Integration Details
When setting up direct distribution of data from source files in your S3 object storage service, pay attention to the considerations and best practices listed in this article.
This article applies to the GoodData and S3 integration use case (see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).
Bucket Policy
We recommend that you allow read access to the S3 bucket where the source files are stored. To do so, add a bucket policy that grants the `s3:GetObject` permission.
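For illustration, the sketch below applies such a policy with boto3. The bucket name and `Sid` are placeholders, and a wildcard `Principal` makes the objects publicly readable; narrow it to the specific account or role that should read the files.

```python
import json

import boto3  # AWS SDK for Python

# Placeholder bucket name; use the bucket that holds your source files.
BUCKET = "my-gooddata-source-files"

# A minimal policy granting read access (s3:GetObject) to the objects in
# the bucket. Replace the wildcard Principal with the identity that
# actually reads the files.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```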
Source Files
The source files are any of the following file types:
- Raw CSV files and compressed CSV files (`.zip` or `.gz`)
- Raw Parquet files or compressed Parquet files (`.parquet` or `.parquet.gz`)
Prepare one source file per dataset in your logical data model (LDM). When mapping your LDM objects, you cannot map more than one field to the same source column.
File Name
Use the following format for the source file name:
{dataset_name}_{timestamp}_{load_mode}_{part}.csv|parquet|zip|gz
- `{dataset_name}` is the name of the dataset in your LDM to which the data from this file will be loaded.
- `{timestamp}` is a timestamp in the `yyyyMMddHHmmss` format specifying when the data was exported from the source database. The timestamp lets Automated Data Distribution (ADD) v2 (see Automated Data Distribution v2 for Object Storage Services) detect which file should be uploaded to the dataset. The timestamp must increase with each subsequent file: if a source file with timestamp T1 was loaded to the dataset, a new source file for this dataset must have a timestamp greater than T1. When initially setting up direct distribution of data from your S3 object storage service, we recommend that you set the timestamp of the first S3 folder or file to load to today's date and a time later than the current time.
- `{load_mode}` is the mode of loading data from the source file to the dataset.
  - To set full mode for a source file, specify `full`.
  - To set incremental mode for a source file, specify `inc`.
- `{part}` is a suffix in the `partX` format, where `X` is a positive integer (1, 2, 3, ...) used when you split a large source file (see Size) into a few smaller files. Add the `{part}` section to each of those smaller files to indicate what part of the large file it is: `part1`, `part2`, `part3`, and so on. Multiple source files for the same dataset, with the same timestamp and the `{part}` section, will all be loaded and concatenated.

A compressed file (`.zip` or `.gz`) must contain one CSV or Parquet file with the same name. For example, the compressed file `invoiceItems_20200529090000_inc.zip` must contain `invoiceItems_20200529090000_inc.csv`.

A compressed Parquet file must be named with the `.parquet.gz` extension.
The following are examples of file names:
customers_20200529090000_inc.zip
premiumContracts_20200529090000_full.zip
orders_20200529090000_inc_part1.zip
orders_20200529090000_inc_part2.zip
orders_20200529090000_inc_part3.zip
products_20200529090000_full.gz
products_20200529090000_full.parquet
products_20200529090000_full.parquet.gz
products_20200529090000_full.csv.gz
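If you generate file names programmatically, a small helper can enforce this format. The sketch below is illustrative only; the regular expression and the assumption that dataset names are alphanumeric are ours, not part of ADD v2.

```python
import re
from datetime import datetime
from typing import Optional

# Encodes {dataset_name}_{timestamp}_{load_mode}_{part}.ext as described
# above; the _partX suffix is optional. Assumes alphanumeric dataset names.
NAME_RE = re.compile(
    r"^(?P<dataset>[A-Za-z0-9]+)"
    r"_(?P<timestamp>\d{14})"            # yyyyMMddHHmmss
    r"_(?P<mode>full|inc)"               # load mode
    r"(?:_part(?P<part>[1-9]\d*))?"      # optional partX with X >= 1
    r"\.(?:parquet\.gz|csv\.gz|csv|parquet|zip|gz)$"
)

def build_name(dataset: str, mode: str, part: Optional[int] = None,
               ext: str = "zip") -> str:
    """Build a source file name, using the current time as the timestamp."""
    ts = datetime.now().strftime("%Y%m%d%H%M%S")  # yyyyMMddHHmmss
    suffix = f"_part{part}" if part else ""
    return f"{dataset}_{ts}_{mode}{suffix}.{ext}"

assert NAME_RE.match("orders_20200529090000_inc_part1.zip")
assert NAME_RE.match("products_20200529090000_full.parquet.gz")
assert not NAME_RE.match("orders_20200529_inc.zip")  # timestamp too short
```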
File Structure
The structure of the source files must correspond to your LDM. Every time you change the LDM, reflect the changes in the source files.
- Each field (fact or attribute) in a dataset must be mapped to a column in the source file for this dataset.
- The column names in a source file must either follow the naming convention (see Naming Convention for Source Files in Automated Data Distribution v2 for Object Storage Services) or correspond to the mapping as defined in the LDM (see Update Data in a Dataset from a CSV File).
- A source file can have more columns than there are fields in its corresponding dataset. Extra columns that are not mapped to any field in the dataset are ignored at data load.
- The header must contain only one row.
- If a source file is split into a few smaller files (see Size), each smaller file must contain the header.
- If you want to distribute the data from one source file to multiple workspaces, the source file must contain the `x__client_id` column that distinguishes what data should be loaded to what workspace (see Automated Data Distribution v2 for Object Storage Services).
- If you want to delete some old data while new data is uploaded from a source file, the source file must contain the `x__delete` column (see Load Modes in Automated Data Distribution v2 for Object Storage Services).
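Before uploading, you may want to sanity-check a file's header against the columns mapped in your LDM. A minimal sketch, assuming a CSV file and hypothetical column names:

```python
import csv

def check_header(path: str, mapped_columns: set) -> None:
    """Verify that the CSV header contains every column mapped in the LDM.

    Extra columns are fine (they are ignored at data load); missing mapped
    columns would break the mapping.
    """
    with open(path, newline="") as f:
        header = next(csv.reader(f))  # the header must be a single row
    missing = mapped_columns - set(header)
    if missing:
        raise ValueError(f"{path}: missing mapped columns {sorted(missing)}")

# Hypothetical dataset fields plus the optional ADD v2 system column.
check_header(
    "customers_20200529090000_inc.csv",
    {"customer_id", "customer_name", "x__client_id"},
)
```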
Size
Keep the size of the source files:
- Under 50 GB for a raw CSV file
- Under 10 GB for a Parquet file
- Under 10 GB for compressed files (`.zip` or `.gz`)
Split larger files into smaller files.
For better performance, use compressed files (`.zip` or `.gz`), up to 1 GB each.

Keep the total size of the files in one load:
- Under 1 TB for raw CSV or Parquet files
- Under 200 GB for compressed files (`.zip` or `.gz`)
Divide larger amounts of data into multiple loads. For example, place a portion of the source files with timestamp 1 in the S3 bucket, and run the ADD v2 process to load these files. Then, place the remaining source files with timestamp 2, and run the ADD v2 process again.
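One way to stay under these limits is to split a large CSV into `partX` files, repeating the header row in each part as the File Structure section requires. A minimal sketch with a hypothetical rows-per-part threshold:

```python
import csv

def split_csv(path: str, dataset: str, timestamp: str, mode: str,
              rows_per_part: int = 1_000_000) -> list:
    """Split a large CSV into {dataset}_{timestamp}_{mode}_partX.csv files,
    repeating the header row in every part."""
    parts, out, writer = [], None, None
    count = rows_per_part  # force a new part file on the first data row
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            if count >= rows_per_part:
                if out:
                    out.close()
                name = f"{dataset}_{timestamp}_{mode}_part{len(parts) + 1}.csv"
                out = open(name, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # every part repeats the header
                parts.append(name)
                count = 0
            writer.writerow(row)
            count += 1
    if out:
        out.close()
    return parts

split_csv("orders.csv", "orders", "20200529090000", "inc")
```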
Location
Depending on the amount of data and the load frequency, you can organize your source files in the following ways:
Place the source files directly in the folder according to the S3 path in the Data Source (see Create a Data Source), or in a folder that you added to that path when deploying the ADD v2 process (see Integrate Data from Your Amazon S3 Bucket Directly to GoodData).
```
folder
|-- datasetA_20200529090000_inc_part1.zip
|-- datasetA_20200529090000_inc_part2.zip
|-- datasetB_20200529090000_inc.zip
|-- datasetB_20200530090000_inc.zip
|-- datasetC_20200530090000_full.zip
```
Organize the source files into sub-folders per date. The timestamps in the source files in a sub-folder must correspond with the date used in this sub-folder.
```
folder
|--- 20200529
|    |-- datasetA_20200529090000_inc.zip
|    |-- datasetB_20200529090000_full.zip
|    |-- datasetC_20200529090000_full.zip
|--- 20200530
|    |-- datasetA_20200530090000_inc.zip
|    |-- datasetB_20200530090000_inc.zip
|    |-- datasetC_20200530090000_full.zip
```
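With the per-date layout, the upload step only needs to prefix the object key with the date sub-folder. A sketch using boto3, with placeholder bucket and folder names:

```python
import boto3

s3 = boto3.client("s3")

# Placeholders: the bucket and the folder from the Data Source's S3 path.
BUCKET = "my-gooddata-source-files"
PREFIX = "add-v2"

def upload_for_date(local_path: str, file_name: str, date: str) -> None:
    """Upload a source file into the sub-folder matching its date."""
    # The sub-folder name must correspond with the date in the timestamp.
    key = f"{PREFIX}/{date}/{file_name}"
    s3.upload_file(local_path, BUCKET, key)

upload_for_date("out/datasetA_20200529090000_inc.zip",
                "datasetA_20200529090000_inc.zip", "20200529")
```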
For better performance, keep only the files that have to be downloaded. Remove the files that are old or have already been loaded.
Load Frequency
Align the frequency of exporting data from your database to source files with the frequency of loading the source files to your workspaces. For example, if you want to load data to your workspace once a day, generate the source file from your database once a day as well.