Google Sheets Downloader

Google Sheets Downloader is a component of the data preparation and distribution pipeline (see Data Preparation and Distribution Pipeline). Google Sheets Downloader downloads data from a Google Sheets spreadsheet. You can either specify the source spreadsheet directly by its ID or point to the Google Drive folder where the spreadsheet is stored.

Source Spreadsheet Requirements

Make sure that your source Google Sheets spreadsheet meets the following requirements:

  • The source spreadsheet has one or more sheets with data.
  • In each sheet:
    • The first row is a header.
    • Each column with data has a name.
    • The names of the columns contain only ASCII characters, preferably in lowercase and separated by underscores (not spaces).

Configuration File

Creating and setting up the brick’s configuration file is the third phase of building your data pipeline. For more information, see Build Your Data Pipeline.

For more information about configuring a brick, see Configure a Brick.

Minimum Layout of the Configuration File

The following JSON sample is the minimum layout of the configuration file that Google Sheets Downloader can use. If you are using more bricks, you can reuse some sections of this configuration file for other bricks. For example, the ‘downloaders’ section can list more downloaders, and the ‘integrators’ and ‘ads_storage’ sections can be reused by other downloaders and executors (see Configure a Brick).

This sample describes the use case of one instance of Google Sheets Downloader that downloads two sheets from a Google Sheets spreadsheet.

Copy the sample and replace the placeholders with your values.

You can later enrich the sample with more parameters depending on your business and technical requirements (see Customize the Configuration File).

{
  "entities": {
    "table_1_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    },
    "table_2_name": {
      "global": {
        "custom": {
          "hub": ["column_1_name","column_2_name","column_3_name"]
        }
      }
    }
  },
  "downloaders": {
    "gsheets_downloader_id": {
      "type": "gs",
      "entities": ["table_1_name","table_2_name"]
    }
  },
  "gs": {
    "client_id": "google_client_id",
    "spreadsheet_id": "spreadsheet_id"
  },
  "integrators": {
    "ads_integrator_1": {
      "type": "ads_storage",
      "entities": ["table_1_name","table_2_name"]
    }
  },
  "ads_storage": {
    "instance_id": "data_warehouse_instance_id",
    "username": "dw_email@address.com",
    "options": {}
  }
}

The placeholders to replace with your values:

NameTypeMandatory?DefaultDescription
entities -> table_x_namestringyesn/a

The database table into which Google Sheets Downloader loads the data

One table corresponds to one sheet in the source spreadsheet.

Specify as many tables for as many sheets with data that are in the source spreadsheet. For example, if the source spreadsheet contains five sheets, specify five tables under ‘entities’.

The name of the table must be the same as the name of the sheet from which the data will be downloaded. For example, if you have a sheet named 'company', name the table 'company':

"entities": {
  "company": {
    "global": {
      ...
    }
  }
}
entities -> table_x_name -> global -> custom -> hubarrayyesn/a

The table columns to generate a primary key

The table column names must correspond to the column names in the sheet from which the data will be downloaded.

Example: ["id"]

downloaders ->  gsheets_downloader_idstringyesn/a

The ID of the Google Sheets Downloader instance

Example: gsheets_downloader_1

downloaders -> gsheets_downloader_id -> entitiesarrayyesn/a

The database table into which Google Sheets Downloader loads the data

Must be the same as the database table names in 'entities -> table_x_name'
gs -> client_idstringyesn/a

Your Google client ID

Example: 9414861317621.apps.googleusercontent.com

gs -> spreadsheet_idstringyesn/a

The ID of the Google Sheets spreadsheet to download data from

You can find the spreadsheet ID in the URL of the spreadsheet:  https://docs.google.com/spreadsheets/d/{spreadsheet_id}

integrators -> ads_integrator_idstringyesn/a

The ID of the ADS Integrator instance

Example: ads_integrator_1

integrators -> ads_integrator_id -> entitiesarrayyesn/aThe names of the entities that this ADS Integrator instance processes Must be the same as the database table names in 'entities -> table_x_name'
ads_storage -> instance_idstringyesn/aThe ID of the Agile Data Warehousing Service (ADS) instance into which the data is uploaded

ads_storage -> username

stringyesn/aThe access username to the Agile Data Warehousing Service (ADS) instance into which the data is uploaded

 

Customize the Configuration File

You can change default values of the parameters and set up additional parameters.

 

Describe the structure of the source Google Sheets spreadsheet

You can set the parameters describing the structure of the source Google Sheets spreadsheet.

"gs": {
  "client_id": "google_client_id",
  "spreadsheet_id": "spreadsheet_id",
  "options": {
    "files_structure": {
      "escape_as": "escape_character", // optional
      "db_parser": "gdc|{no_value}", // optional
      "enclosed_by": "quote_character" // optional
    }
  }
}
NameTypeMandatory?DefaultDescription
escape_ascharacternon/a

An escape character You can define an escape character using any ASCII value in the range from E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000')

Example: \"

db_parserstringnogdc

The database parser

  • If not set or set to 'gdc', the default GoodData parser (GdcCsvParser) is used.
  • If set to an empty value ("db_parser": ""), the default Vertica parser is used.
enclosed_bystringnon/a

The quote character within which to enclose data. This allows delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').

Example: \"

 

Specify additional load options for the entities

For each entity, you can set up additional load options.

"entities": {
  "table_1_name": {
    "global": {
      "custom": {
        "hub": ["column_1_name","column_2_name","column_3_name"],
        "skip_rows": number_of_rows, // optional
        "db_parser": "gdc|{no_value}" // optional
      }
    }
  }
}
NameTypeMandatory?DefaultDescription
skip_rowsintegerno0

The number of rows from the top of a Google Sheets spreadsheet that Google Sheets Downloader will skip

db_parserstringnogdc

The database parser

  • If not set or set to 'gdc', the default GoodData parser (GdcCsvParser) is used.
  • If set to an empty value ("db_parser": ""), the default Vertica parser is used.

 

Add multiple instances of Google Sheets Downloader

When you use only one instance of Google Sheets Downloader, it reads its remote location parameters from the ‘gs’ section.

If you use multiple instances of Google Sheets Downloader and each instance has a separate location to download data from, add several ‘gs’ sections and give each section a unique name (for example, ‘gs_users’ or ‘gs_sales’).

"gs_users": {
  "client_id": "google_client_1_id"
},
"gs_sales": {
  "client_id": "google_client_2_id"
}

 

For each instance of Google Sheets Downloader in the ‘downloaders’ section, add the ‘settings’ parameter and set it to the name of the section that contains settings for this Google Sheets Downloader instance.

"downloaders": {
  "gsheets_downloader_1_id": {
    "type": "gs",
    "settings": "gs_users",
    "entities": ["table_1_name","table_2_name"]
  },
  "gsheets_downloader_2_id": {
    "type": "gs",
    "settings": "gs_sales",
    "entities": ["table_1_name","table_2_name"]
  }
}

 

Point to a folder containing the source Google Sheets spreadsheet

By default, Google Sheets Downloader looks for a specific source Google Sheets spreadsheet to download data from (see the ‘spreadsheet_id’ parameter in Minimum Layout of the Configuration File). If the source spreadsheet does not have a constant ID (for example, the spreadsheet is re-generated from time to time and its ID changes), point to a folder where the source spreadsheet is stored instead of specifying the spreadsheet ID. This way, you do not rely on the spreadsheet ID but rather on its location.

"gs": {
  "client_id": "google_client_id",
  "folder": "google_drive_folder_id"
},
NameTypeMandatory?DefaultDescription
folderstringnon/a

The ID of the Google Drive folder containing the source Google Sheets spreadsheet to download data from

You can find the folder ID in the URL of the folder: https://drive.google.com/drive/folders/{folder_id}

NOTE: If you use both 'folder' and 'spreadsheet_id' parameters at the same time, the 'folder' parameter is used, and the 'spreadsheet_id' parameter is ignored.

 

Add optional parameters

Depending on your business and technical requirements, you can set one or more optional parameters affecting the data load process.

"gs": {
  "client_id": "google_client_id",
  "spreadsheet_id": "spreadsheet_id",
  "full": true|false, // optional
  "value_render_option": "format" // optional
}
NameTypeMandatory?DefaultDescription
fullBooleannofalse

Specifies load mode used for loading data to the database

  • If not set or set to 'false', incremental load is used.
  • If set to 'true', full load is used.
value_render_optionstringnoFORMATTED_VALUE

The format in which the API returns the cell values

For available options, see https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption.

 

Customize integrators

The ‘integrators’ and ‘ads_storage’ sections define the behavior of integrators.

"integrators": {
  "ads_integrator_1": {
    "type": "ads_storage",
    "entities": ["table_1_name","table_2_name"]
  }
},
"ads_storage": {
  "instance_id": "data_warehouse_instance_id",
  "username": "dw_email@address.com",
  "options": {}
}

If you want to add more parameters or change the defaults, see ADS Integrator.

Schedule Parameters

When scheduling Google Sheets Downloader (see Phases of Building the Data Pipeline -> Production Implementation), provide the parameters from this section in the schedule.

General Parameters

Some parameters must be entered as secure parameters (see Configure Schedule Parameters).

Some parameters contain the names of the nesting parameters, which specifies where they belong in the configuration hierarchy. For example, a parameter entered as gs|options|password is resolved as the following:

"gs": {
  "options": {
    "password": ""
  }
}

Enter such parameters as is (see Schedule Example).

NameTypMandatory?Secure?DefaultDescription
IDstringyesnon/a

The ID of the Google Sheets Downloader instance being scheduled

Must be the same as the value of the 'downloaders -> gsheets_downloader_id' parameter in the configuration file (see Minimum Layout of the Configuration File)

Example: gsheets_downloader_1

ads_storage|passwordstringyesyesn/a

The password to access the ADS instance to upload the data to

gs|client_secretstringyesyesn/aThe Google client secret
gs|refresh_tokenstringyesyesn/a

The OAuth refresh token

For information about using OAuth 2.0, see https://developers.google.com/identity/protocols/oauth2.

Schedule Example

The following is an example of how you can specify schedule parameters: