Capturing More Twitter Search API Data
CloudConnect is a legacy tool and will be discontinued. We recommend that you use the GoodData data pipeline to prepare your data, as described in Data Preparation and Distribution. For data modeling, see Data Modeling in GoodData to learn how to work with the Logical Data Modeler.
When the Twitter Search API graph is executed, the 100 most recent tweets are posted to the project as its entire dataset. However, it would be more useful if you could capture more of your enterprise's tweet history. To do so, you must reconfigure your request URL and maintain, internal to the CloudConnect process, an index identifying the most recent tweets that were uploaded.
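In the CloudConnect graph this logic is spread across components, but the control flow is easier to see as a short script. The following Python sketch illustrates the approach under stated assumptions: the endpoint URL reflects the v1.1 Search API, authentication is omitted, and `load_index`, `write_to_gooddata`, and `save_index` are hypothetical stand-ins for the lookup table, the GD Dataset Writer, and the index update described in the steps below.

```python
import requests  # stand-in for CloudConnect's HTTP Connector component

# Hypothetical endpoint; match it to the Search API version your project uses.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def fetch_older_tweets(query, last_loaded, count=100):
    """Request the next page of history: tweets with IDs below LAST_LOADED."""
    # Authentication is omitted for brevity; the real API requires it.
    params = {"q": query, "count": count, "max_id": last_loaded - 1}
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("statuses", [])

# One ETL run: fetch a page, write it, and only then advance the index,
# so a failed run can simply be re-executed.
last_loaded = load_index()                 # hypothetical: read LAST_LOADED from the lookup table
statuses = fetch_older_tweets("from:your_handle", last_loaded)
if statuses:
    current_loaded = min(s["id"] for s in statuses)  # oldest ID in the batch (CURRENT_LOADED)
    write_to_gooddata(statuses)            # hypothetical: the GD Dataset Writer step
    save_index(current_loaded)             # update LAST_LOADED only after a successful write
```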
Below are the basic, generalized steps for managing this process.
Steps:
1. The Twitter Search API can process a search parameter called max_id, which returns results with an ID less than the specified parameter value. For more information, review the Twitter Search API documentation.
2. Within your CloudConnect project, you must maintain an index identifying the ID of the last record that was loaded to GoodData (e.g. LAST_LOADED). You may define LAST_LOADED as a custom property or as a name/value pair within an internal lookup table in the project.
3. For each ETL run, the current value for LAST_LOADED must be included as a name-value pair in the URL in the HTTP Connector (Twitter API call) component.
4. As part of the capture, you must extract the id value from the JSON response.
5. The LAST_LOADED value must be updated only after the current dataset has been written through the GD Dataset Writer component.
   - In the stream, you might write the value for the current record to a CURRENT_LOADED custom variable. This step could be completed in the JSONReader component.
   - In the event of a failure during ETL, you can re-run the ETL process to reload the tweets from the failed upload.
6. After the last record has been written by the GD Dataset Writer, you can write the value of CURRENT_LOADED to LAST_LOADED.
7. In the GD Dataset Writer component, you must modify the Mode parameter so that each execution of the ETL process performs an incremental data load instead of a full data load, which replaces all of the data in the project.
8. For each dataset that you update using an incremental data load, you must verify that the logical data model for the dataset contains a connection point. This point must mark the attribute used to identify uniqueness among records of the dataset. You can perform incremental data loads into a dataset only if it contains a connection point; a toy illustration of why follows these steps.
9. You should run this process a few times locally to verify that the tweets are being captured sequentially; one way to check is sketched below.
10. If everything looks good, publish the CloudConnect project to the working project. Then, you can configure a recurring scheduled execution of the ETL process through the Data Integration Console. For more information, see Data Integration Console Reference.
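Two quick illustrations may help. First, the connection point requirement in step 8: an incremental load behaves like an upsert keyed on the unique attribute, while a full load replaces everything. This is a toy Python analogy, not GoodData's implementation; `existing_records` and `new_batch` are assumed to be lists of tweet dicts.

```python
# Full load: the new batch replaces the dataset wholesale.
dataset = {tweet["id"]: tweet for tweet in new_batch}

# Incremental load: records merge on the connection-point attribute (the tweet ID),
# so existing rows survive and re-sent rows collapse onto a single record.
dataset = {tweet["id"]: tweet for tweet in existing_records}
dataset.update({tweet["id"]: tweet for tweet in new_batch})
```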
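Second, for the local verification in step 9, one possible check (assuming you save each run's batch locally, in execution order) is that the ID ranges of successive batches strictly descend:

```python
def verify_sequential(batches):
    """Assert that each run's tweets are strictly older than the previous run's."""
    previous_oldest = None
    for batch in batches:                       # batches: list of tweet lists, in run order
        ids = [tweet["id"] for tweet in batch]
        if previous_oldest is not None and max(ids) >= previous_oldest:
            raise ValueError("batches overlap; the LAST_LOADED index is not advancing")
        previous_oldest = min(ids)
```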