Google Sheets Downloader
Google Sheets Downloader is a component of the data preparation and distribution pipeline (see Data Preparation and Distribution Pipeline). Google Sheets Downloader downloads data from a Google Sheets spreadsheet. You can either specify the source spreadsheet directly by its ID or point to the Google Drive folder where the spreadsheet is stored.
Source Spreadsheet Requirements
Make sure that your source Google Sheets spreadsheet meets the following requirements:
- The source spreadsheet has one or more sheets with data.
- In each sheet:
- The first row is a header.
- Each column with data has a name.
- The names of the columns contain only ASCII characters, preferably in lowercase and separated by underscores (not spaces).
Configuration File
Creating and setting up the brick’s configuration file is the third phase of building your data pipeline. For more information, see Build Your Data Pipeline.
For more information about configuring a brick, see Configure a Brick.
Minimum Layout of the Configuration File
The following JSON sample is the minimum layout of the configuration file that Google Sheets Downloader can use. If you are using more bricks, you can reuse some sections of this configuration file for other bricks. For example, the ‘downloaders’ section can list more downloaders, and the ‘integrators’ and ‘ads_storage’ sections can be reused by other downloaders and executors (see Configure a Brick).
This sample describes the use case of one instance of Google Sheets Downloader that downloads two sheets from a Google Sheets spreadsheet.
Copy the sample and replace the placeholders with your values.
You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).
{
"entities": {
"table_1_name": {
"global": {
"custom": {
"hub": ["column_1_name","column_2_name","column_3_name"]
}
}
},
"table_2_name": {
"global": {
"custom": {
"hub": ["column_1_name","column_2_name","column_3_name"]
}
}
}
},
"downloaders": {
"gsheets_downloader_id": {
"type": "gs",
"entities": ["table_1_name","table_2_name"]
}
},
"gs": {
"client_id": "google_client_id",
"spreadsheet_id": "spreadsheet_id"
},
"integrators": {
"ads_integrator_1": {
"type": "ads_storage",
"entities": ["table_1_name","table_2_name"]
}
},
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {}
}
}
The placeholders to replace with your values:
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
entities -> table_x_name | string | yes | n/a | The database table into which Google Sheets Downloader loads the data One table corresponds to one sheet in the source spreadsheet. Specify as many tables for as many sheets with data that are in the source spreadsheet. For example, if the source spreadsheet contains five sheets, specify five tables under ‘entities’. The name of the table must be the same as the name of the sheet from which the data will be downloaded. For example, if you have a sheet named 'company', name the table 'company':
|
entities -> table_x_name -> global -> custom -> hub | array | yes | n/a | The table columns to generate a primary key The table column names must correspond to the column names in the sheet from which the data will be downloaded. Example: |
downloaders -> gsheets_downloader_id | string | yes | n/a | The ID of the Google Sheets Downloader instance Example: |
downloaders -> gsheets_downloader_id -> entities | array | yes | n/a | The database table into which Google Sheets Downloader loads the data Must be the same as the database table names in 'entities -> table_x_name' |
gs -> client_id | string | yes | n/a | Your Google client ID Example: |
gs -> spreadsheet_id | string | yes | n/a | The ID of the Google Sheets spreadsheet to download data from You can find the spreadsheet ID in the URL of the spreadsheet: |
integrators -> ads_integrator_id | string | yes | n/a | The ID of the ADS Integrator instance Example: |
integrators -> ads_integrator_id -> entities | array | yes | n/a | The names of the entities that this ADS Integrator instance processes Must be the same as the database table names in 'entities -> table_x_name' |
ads_storage -> instance_id | string | yes | n/a | The ID of the Agile Data Warehousing Service (ADS) instance into which the data is uploaded |
ads_storage -> username | string | yes | n/a | The access username to the Agile Data Warehousing Service (ADS) instance into which the data is uploaded |
Customize the Configuration File
You can change default values of the parameters and set up additional parameters.
Describe the structure of the source Google Sheets spreadsheet
You can set the parameters describing the structure of the source Google Sheets spreadsheet.
"gs": {
"client_id": "google_client_id",
"spreadsheet_id": "spreadsheet_id",
"options": {
"files_structure": {
"escape_as": "escape_character", // optional
"db_parser": "gdc|{no_value}", // optional
"enclosed_by": "quote_character" // optional
}
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
escape_as | character | no | n/a | An escape character You can define an escape character using any ASCII value in the range from E'\001' to E'\177' inclusive (any ASCII character except Example: |
db_parser | string | no | gdc | The database parser
|
enclosed_by | string | no | n/a | The quote character within which to enclose data. This allows delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except Example: |
Specify additional load options for the entities
For each entity, you can set up additional load options.
"entities": {
"table_1_name": {
"global": {
"custom": {
"hub": ["column_1_name","column_2_name","column_3_name"],
"skip_rows": number_of_rows, // optional
"db_parser": "gdc|{no_value}" // optional
}
}
}
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
skip_rows | integer | no | 0 | The number of rows from the top of a Google Sheets spreadsheet that Google Sheets Downloader will skip |
db_parser | string | no | gdc | The database parser
|
Add multiple instances of Google Sheets Downloader
When you use only one instance of Google Sheets Downloader, it reads its remote location parameters from the ‘gs’ section.
If you use multiple instances of Google Sheets Downloader and each instance has a separate location to download data from, add several ‘gs’ sections and give each section a unique name (for example, ‘gs_users’ or ‘gs_sales’).
"gs_users": {
"client_id": "google_client_1_id"
},
"gs_sales": {
"client_id": "google_client_2_id"
}
For each instance of Google Sheets Downloader in the ‘downloaders’ section, add the ‘settings’ parameter and set it to the name of the section that contains settings for this Google Sheets Downloader instance.
"downloaders": {
"gsheets_downloader_1_id": {
"type": "gs",
"settings": "gs_users",
"entities": ["table_1_name","table_2_name"]
},
"gsheets_downloader_2_id": {
"type": "gs",
"settings": "gs_sales",
"entities": ["table_1_name","table_2_name"]
}
}
Point to a folder containing the source Google Sheets spreadsheet
By default, Google Sheets Downloader looks for a specific source Google Sheets spreadsheet to download data from (see the ‘spreadsheet_id’ parameter in Minimum Layout of the Configuration File). If the source spreadsheet does not have a constant ID (for example, the spreadsheet is re-generated from time to time and its ID changes), point to a folder where the source spreadsheet is stored instead of specifying the spreadsheet ID. This way, you do not rely on the spreadsheet ID but rather on its location.
Make sure that your client ID has access to the Google Drive folder and that the API is enabled.
"gs": {
"client_id": "google_client_id",
"folder": "google_drive_folder_id"
},
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
folder | string | no | n/a | The ID of the Google Drive folder containing the source Google Sheets spreadsheet to download data from You can find the folder ID in the URL of the folder: NOTE: If you use both 'folder' and 'spreadsheet_id' parameters at the same time, the 'folder' parameter is used, and the 'spreadsheet_id' parameter is ignored. |
A typical use case is downloading data from a single source spreadsheet stored in the specified folder. If you want to download data from multiple source spreadsheets stored in the folder, make sure that all those spreadsheets have the same structure: each source spreadsheet must contain all the sheets that are specified as tables under the ‘entities’ section of the configuration file (see Minimum Layout of the Configuration File). If any source spreadsheet is missing a sheet that is specified in the configuration file, Google Sheets Downloader fails.
When multiple source spreadsheets are processed, the data will be uploaded to Big Data Storage as separate files, one file per source spreadsheet.
Add optional parameters
Depending on your business and technical requirements, you can set one or more optional parameters affecting the data load process.
"gs": {
"client_id": "google_client_id",
"spreadsheet_id": "spreadsheet_id",
"full": true|false, // optional
"value_render_option": "format" // optional
}
Name | Type | Mandatory? | Default | Description |
---|---|---|---|---|
full | Boolean | no | false | Specifies load mode used for loading data to the database
|
value_render_option | string | no | FORMATTED_VALUE | The format in which the API returns the cell values For available options, see https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption. |
Customize integrators
The ‘integrators’ and ‘ads_storage’ sections define the behavior of integrators.
"integrators": {
"ads_integrator_1": {
"type": "ads_storage",
"entities": ["table_1_name","table_2_name"]
}
},
"ads_storage": {
"instance_id": "data_warehouse_instance_id",
"username": "dw_email@address.com",
"options": {}
}
If you want to add more parameters or change the defaults, see ADS Integrator.
Schedule Parameters
When scheduling Google Sheets Downloader (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.
General Parameters
Some parameters must be entered as secure parameters (see Configure Schedule Parameters).
Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as gs|options|password
is resolved as the following:
"gs": {
"options": {
"password": ""
}
}
Enter such parameters as is (see Schedule Example).
Name | Typ | Mandatory? | Secure? | Default | Description |
---|---|---|---|---|---|
ID | string | yes | no | n/a | The ID of the Google Sheets Downloader instance being scheduled Must be the same as the value of the 'downloaders -> gsheets_downloader_id' parameter in the configuration file (see Minimum Layout of the Configuration File) Example: |
ads_storage|password | string | yes | yes | n/a | The password to access the ADS instance to upload the data to |
gs|client_secret | string | yes | yes | n/a | The Google client secret |
gs|refresh_token | string | yes | yes | n/a | The OAuth refresh token For information about using OAuth 2.0, see https://developers.google.com/identity/protocols/oauth2. |
Schedule Example
The following is an example of how you can specify schedule parameters: