This tutorial guides you through the process of setting up direct data distribution to load data from CSV files stored in an Amazon S3 bucket to the GoodData platform.
This tutorial assumes that you have one or multiple GoodData workspaces (also known as "projects"), and these workspaces have a logical data model (LDM) that meets your business requirements for data analysis.
By the end of this tutorial, you should be able to connect your S3 bucket to the GoodData platform, and set up the data load to run regularly.
If you have multiple workspaces and do not have them organized in segments, set up Automated Data Distribution (ADD) v2 for object storage services first. For more information, see Set Up Automated Data Distribution v2 for Object Storage Services.
Prepare source files
Prepare the source files that you want to load to your workspaces. Make sure that the source files meet the requirements described in GoodData-S3 Integration Details.
Set up a connection between your S3 bucket and the GoodData platform
The connection details are described within the entity called Data Source.
Create a Data Source for the S3 bucket where your source files are stored.
- From the Data Integration Console (see Accessing Data Integration Console), click the Data sources tab.
- Click Create data source in the bottom left corner, and select S3 data source.
The connection parameter screen appears.
- Enter the name of the Data Source.
- Enter the path to your S3 bucket, the access key, and the secret key.
- (Optional) Enter the region and select the Server side encryption check box if your S3 bucket support these options.
- Click Test connection.
If the connection succeeds, the confirmation message appears.
- Click Save.
The Data Source is created. The screen with your connection details appears.
You are now going to deploy, schedule, and run the Automated Data Distribution (ADD) v2 process that will load the data from your S3 bucket to the workspace. For more details about ADD v2, see Automated Data Distribution v2 for Object Storage Services.
Deploy, schedule, and run the ADD v2 process
Deploy the ADD v2 process to the GoodData platform and then create a schedule for the deployed process to automatically execute the data loading process at a specified time.
- If you have only one workspace to load data to, deploy the ADD v2 process to this workspace.
- If you have multiple workspaces to load data to and you have the workspaces organized into segments (see Set Up Automated Data Distribution v2 for Object Storage Services), deploy the ADD v2 to the service workspace.
- From the Projects tab of the Data Integration Console (see Accessing Data Integration Console), select your workspace.
- Click Deploy Process.
The deploy dialog opens.
- From the Component dropdown, select Automated Data Distribution.
- From the Data Source dropdown, select the Data Source that you have created at the step Set up a connection between your S3 bucket and the GoodData platform.
The path to your S3 bucket is obtained from that Data Source and displayed for your review.
- (Optional) Add sub-folders to the path to point to the folder where the source files are located.
- Specify the workspaces to load the data to.
If you want to load the data to the current workspace, select Current Project.
If you have only one workspace to load data to, the Current Project option is the only option available and it is pre-selected by default.
- If you have multiple workspaces to load data to and you have the workspaces organized into segments (see Set Up Automated Data Distribution v2 for Object Storage Services), select Segment (LCM) and then select the segment containing the workspaces to distribute the data to.
- Enter a descriptive name for the ADD v2 process.
- Click Deploy.
The ADD v2 process is deployed.
You are now going to schedule the deployed process.
- For the deployed ADD v2 process, click Create new schedule.
- In the Runs dropdown, set the frequency of execution to manually. This means that the process will run only when manually triggered.
- Click Schedule.
The schedule is saved and opens for your preview.
You are now going to manually run the scheduled process.
- Click Run.
The schedule is queued for execution and is run as platform resources are available.
- If the schedule fails with errors, fix the errors, and run the schedule again. Repeat until the process finishes with a status of OK, which means that the ADD v2 process has loaded the data to your workspace.
- (Optional) In the Runs dropdown, set the frequency of execution to whatever schedule fits your business needs. Click Save.
The schedule is saved.
You can now start analyzing your data. Use Analytical Designer and create your first insight.
You have completed the tutorial. You should now be able to set up direct data distribution from an S3 bucket.