Preparing a Data Loading Process

An ETL (Extract - Transform - Load) graph is a series of steps used to pull data from a source, transform it into an appropriate format, and load it into the GoodData platform.

You create an ETL graph in CloudConnect as a sequence of visual objects. You can set them up to gather data from nearly any enterprise source database, platform, messaging service, or file system.

A CloudConnect project can contain one or more ETL graphs. ETL graphs are stored as GRF files in the ‘graph’ directory. CloudConnect provides a suite of pre-built components for manipulating data. You can also write custom transform functions.

Demo ETL Graph

When executed, the ETL graph in this demo project accesses Twitter, downloads recent tweets mentioning GoodData, parses those tweets, and uploads that information into your GoodData project.

In CloudConnect, double-click the twitter.grf file.

The demo ETL graph opens in the CloudConnect Designer panel. The following picture shows the demo ETL graph:

The graph runs a linear process with the following steps, in order:

| Sequence | Component | Description |
|---|---|---|
| 1 | Twitter API Call | Reads directly from the Twitter API to extract tweets that contain the value “gooddata”. The Twitter API Call component is a modified version of the REST Connector component. See REST Connector. |
| 2 | JSONReader | Converts the JSON stream received from the Twitter API Call component into XML, as defined by the transformation in the Mapping field. |
| 3 | Reformat | Converts the XML data to the internal data types required by the logical data model. For example, the text string for the Retweets value is converted to an integer, so that it can be used as a fact in the computation of a metric. These reformatting changes are defined as transformations in this component. See Reformat Component. |
| 4 | GD Dataset Writer | Loads the cleaned, GoodData-formatted data into the corresponding tables in the GoodData database. See GD Dataset Writer Component. |
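
To make the data flow concrete, the four steps described above can be sketched outside CloudConnect in plain Python. This is only an illustration under stated assumptions: the sample payload, its field names (statuses, id_str, retweet_count), and the functions extract, parse, reformat, and load are hypothetical stand-ins, not the Twitter API or any CloudConnect component interface. No network call is made and nothing is sent to GoodData.

```python
import json

# Hypothetical sample payload standing in for a Twitter API response.
# The field names are assumptions made for this sketch.
SAMPLE_RESPONSE = """
{"statuses": [
  {"id_str": "1", "text": "Trying out gooddata dashboards", "retweet_count": "3"},
  {"id_str": "2", "text": "gooddata ETL demo", "retweet_count": "0"}
]}
"""


def extract(raw):
    """Step 1 stand-in: hand back the raw JSON text (the real graph calls the API)."""
    return raw


def parse(raw):
    """Step 2 stand-in: turn the JSON text into flat records whose values are strings."""
    for status in json.loads(raw)["statuses"]:
        yield {
            "id": status["id_str"],
            "text": status["text"],
            "retweets": str(status["retweet_count"]),
        }


def reformat(records):
    """Step 3 stand-in: cast the retweet count from text to an integer fact."""
    for record in records:
        yield {**record, "retweets": int(record["retweets"])}


def load(records, dataset):
    """Step 4 stand-in: append rows to an in-memory list instead of a GoodData dataset."""
    dataset.extend(records)
    return len(dataset)


dataset = []
loaded = load(reformat(parse(extract(SAMPLE_RESPONSE))), dataset)
print(loaded, dataset[0]["retweets"])
```

Each function consumes the output of the previous one, which mirrors the linear, left-to-right wiring of the components in the graph. The cast in reformat corresponds to the Retweets example: a text value becomes an integer so that it can serve as a fact.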

Enable Debugging

Before publishing the graph to a live project, we recommend that you first run the graph on your local computer to test its configuration and functionality.

In this demo graph, debugging is enabled on the connections between components. Connections configured for debugging are marked with a bug icon.

Run Graphs Locally

Running an ETL graph locally does not push data to the remote GoodData project.

Steps:

  1. Click the Server Explorer tab. A list of your available projects is displayed in the bottom pane. 
  2. Right-click My Twitter Project, and select Set as Working Project. The project is updated with a new icon to indicate that it is the working project.
  3. Right-click anywhere in the Editor pane, and select Run As -> 1 Graph (Locally). As the process runs, the CloudConnect Console window displays a real-time log of CloudConnect activity. If all steps are processed correctly, the console log displays an “Execution of graph successful” message.

Review Uploaded Records

Review the count of records passed between each pair of connected components in the ETL graph.

Steps:

  1. Right-click the bug icon in a connection, and select View Data. The View data dialog opens.
  2. Clear the Truncate Strings and Byte Arrays check box.
  3. Click OK. The list of records is generated. 
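
The idea behind reviewing counts on each connection can be sketched in plain Python: records pass through a counting wrapper on every edge, so a drop between two edges shows where records were rejected. This is a hypothetical illustration, not how CloudConnect implements debugging. The edge names, the sample rows, and the keep_numeric filter are assumptions made for the sketch.

```python
from collections import Counter


def counted(edge_name, records, counts):
    """Yield records unchanged while tallying how many pass through this edge."""
    for record in records:
        counts[edge_name] += 1
        yield record


def keep_numeric(records):
    """Hypothetical transform: keep rows whose retweet value is numeric and cast it."""
    for record in records:
        if record["retweets"].isdigit():
            yield {"retweets": int(record["retweets"])}


# Hypothetical input rows; the second one cannot be converted to an integer.
raw_rows = [{"retweets": "3"}, {"retweets": "n/a"}, {"retweets": "0"}]

counts = Counter()
edge_in = counted("reader->reformat", raw_rows, counts)
edge_out = counted("reformat->writer", keep_numeric(edge_in), counts)
loaded = list(edge_out)

print(dict(counts))
```

In this sketch three records enter the transform and two leave it, so the difference points at the step that discarded a row. Comparing the counts shown on adjacent connections in the graph serves the same purpose.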

Deploy the Graph to the GoodData Platform

After you have tested the graph locally, publish it to your GoodData project on the GoodData platform.

Steps:

  1. Click the Server Explorer tab, and select the twitter.grf file.
  2. In the toolbar, click the GD icon.
  3. From the dropdown, select Deploy to GoodData Server.
  4. Select My Twitter Project, and click Deploy. The ETL graph is published to the selected GoodData project.