Capturing More Twitter Search API Data

When the Twitter Search API graph is executed, the most recent 100 tweets are posted to the project as the entire dataset. However, it would be more useful to capture more of your enterprise's tweet history. To do so, you must reconfigure your request URL and maintain, within the CloudConnect process, an index that identifies the most recent tweets that were uploaded.

Below are the basic, generalized steps for managing this process.

Steps:

  1. The Twitter Search API accepts a search parameter called max_id, which restricts results to tweets with IDs lower than the specified value. For more information, review the Twitter Search API documentation.

  2. Within your CloudConnect project, you must maintain an index that identifies the ID of the last record loaded to GoodData (e.g. LAST_LOADED). You may define LAST_LOADED as a custom property or as a name/value pair in an internal lookup table in the project. (The first sketch following these steps outlines this logic.)

  3. For each ETL run, the current value of LAST_LOADED must be included as a name/value pair in the request URL of the HTTP Connector (Twitter API call) component.

  4. As part of the capture, you must extract the id value from the JSON response.

  5. The LAST_LOADED value must be updated only after the current dataset has been written through the GD Dataset Writer component.

    1. In the stream, you might write the ID value of the current record to a CURRENT_LOADED custom variable. This step could be completed in the JSONReader component.
    2. In the event of a failure during ETL, you can re-run the ETL process to reload the tweets from the failed upload.
  6. After the last record has been written by the GD Dataset Writer, you can write the value of CURRENT_LOADED to LAST_LOADED.

  7. In the GD Dataset Writer component, you must modify the Mode parameter so that each execution of the ETL process performs an incremental data load instead of a full data load, which replaces all of the data in the dataset.

  8. For each dataset that you update using an incremental data load, you must verify that the logical data model for the dataset contains a connection point. The connection point marks the attribute that uniquely identifies each record in the dataset. (The second sketch following these steps illustrates why this uniqueness matters.)

  9. You should run this process a few times locally to verify that tweets are being captured sequentially.

  10. If everything looks good, you can publish the CloudConnect project to the working project. You can then configure a recurring scheduled execution of the ETL process through the Data Integration Console. For more information, see the Data Integration Console Reference.
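
The sketch below distills steps 1 through 6 into plain Python, outside of CloudConnect, to make the control flow easier to follow. The endpoint URL, the state file, and the write_to_gooddata() helper are illustrative stand-ins for the HTTP Connector, the LAST_LOADED property, and the GD Dataset Writer component; supply your own query, authentication, and response-field names.

```python
# Minimal sketch of the incremental-capture logic in steps 1-6, expressed in
# plain Python rather than CloudConnect components. SEARCH_URL, STATE_FILE,
# and write_to_gooddata() are hypothetical stand-ins.
import json
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"  # assumed v1.1 endpoint
STATE_FILE = "last_loaded.json"                                 # stand-in for LAST_LOADED


def read_last_loaded():
    """Return the persisted LAST_LOADED ID, or None on the first run."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_loaded"]
    except FileNotFoundError:
        return None


def write_last_loaded(tweet_id):
    """Persist LAST_LOADED; called only after the dataset write succeeds (step 6)."""
    with open(STATE_FILE, "w") as f:
        json.dump({"last_loaded": tweet_id}, f)


def fetch_page(query, max_id=None):
    """Step 3: include max_id as a name/value pair in the request URL."""
    params = {"q": query, "count": 100}
    if max_id is not None:
        params["max_id"] = max_id
    resp = requests.get(SEARCH_URL, params=params)  # add OAuth credentials as required
    resp.raise_for_status()
    return resp.json().get("statuses", [])          # field name depends on the API version


def write_to_gooddata(records):
    """Stand-in for the GD Dataset Writer component running in incremental mode."""
    print("would incrementally load %d records" % len(records))


def run_once(query):
    last_loaded = read_last_loaded()                # step 2: read the maintained index
    tweets = fetch_page(query, max_id=last_loaded)
    if not tweets:
        return

    current_loaded = None
    records = []
    for tweet in tweets:
        current_loaded = tweet["id"]                # steps 4 and 5.1: extract id, track CURRENT_LOADED
        records.append({"id": tweet["id"], "text": tweet["text"]})

    write_to_gooddata(records)
    write_last_loaded(current_loaded)               # step 6: update LAST_LOADED only after the write
```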
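
To see why the connection point in step 8 matters: broadly speaking, when a connection point is defined, an incremental load updates existing records that share a connection-point value instead of appending duplicates, so the marked attribute (here the tweet ID) must be unique per record. The toy upsert below only models that behavior; it is not GoodData code.

```python
# Toy model of an incremental load keyed on the connection-point attribute.
def incremental_load(existing, new_records, key="id"):
    """Upsert new_records into existing, keyed on the connection point."""
    for rec in new_records:
        existing[rec[key]] = rec                      # same key -> record updated, not duplicated
    return existing


loaded = {"123": {"id": "123", "text": "already loaded tweet"}}
incremental_load(loaded, [
    {"id": "124", "text": "a newer tweet"},                   # new key -> appended
    {"id": "123", "text": "re-loaded after a failed run"},    # existing key -> replaced
])
```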