Delete Old Data while Loading New Data to a Dataset via API
You can delete old or obsolete data from a dataset while loading new data to the same or a different dataset using the REST API. The data to delete and the data to upload are specified in CSV files.
When you delete old data and upload new data to datasets, the deletion and the upload are done in a single transaction. This helps keep the data in your workspace consistent and avoids situations where the new data is already in the dataset but the old data has not yet been removed.
Specifications
- You can specify multiple CSV files to delete from the same dataset in one load.
- You can choose to only delete some data from a dataset, without uploading any new data.
- New data must be uploaded to a dataset in `INCREMENTAL` mode. `FULL` mode is not supported for loading when you are also going to delete some data.
How Deleting Works
Deleting data can be done by:
- Primary key (or fact table grain): this method is useful for routine deleting while uploading
- Foreign key: this method is useful for one-time fixes of data in a dataset
While you can upload new data by any label of an attribute, you can delete old data only by the primary attribute label. Therefore, the columns with the values to delete must have `referenceKey` set (marking the primary attribute label).

During the deletion phase, data is deleted only from the center of a dataset (the central table in a star/snowflake schema). Lookup tables are not modified, so some abandoned rows that are not referenced from any table may remain after the deletion completes. In most cases, this does not cause any issues in workspace data, and it can be fixed by loading data to the dataset in `FULL` mode.

When data processing starts, the first step is always to delete the data marked for deletion; the second step is to upload the new data, regardless of the order of the upload and delete sections in your manifest file. Therefore, the CSV files with data to upload should contain only the data that you want to add from scratch or use to overwrite existing data. All the data in the CSV files to upload is uploaded. No additional deletion runs after the data is uploaded.
During the deletion phase, though, data is deleted from each dataset according to the order specified in your manifest file.
For smooth performance, we recommend that you keep the number of records to delete as small as possible. Deleting up to millions of records is still possible but may cause performance issues and result in load failures due to a timeout. If you need to delete a large amount of data and load new data, consider loading the new data in `FULL` mode instead.
Steps
1. Build a manifest file that specifies the dataset and the data to remove and to add. Name the manifest file `upload_info.json`. For an example of the manifest file, see Manifest File and Input Data.
2. Prepare the CSV files that contain the data to load or to remove. For examples of the CSV files, see Manifest File and Input Data.
3. Create a ZIP file named `upload.zip` that contains `upload_info.json` (the manifest file) and the CSV files with the data. Upload this ZIP file to a directory on the GoodData platform. For more information, see User-Specific Storage.
4. Start the data upload via the API.
Manifest File and Input Data
Imagine that you have four CSV files: `data.csv`, `delete-pk.csv`, `delete-fk1.csv`, and `delete-fk2.csv`. You want to delete the data listed in `delete-pk.csv`, `delete-fk1.csv`, and `delete-fk2.csv`, and, at the same time, upload new data from `data.csv`.
The following table shows the structure (column names and values) of each CSV file that you are going to process:

| data.csv | delete-pk.csv | delete-fk1.csv | delete-fk2.csv |
|---|---|---|---|
| See the example CSV data file in Loading Data via REST API | `csv_id`<br>12246545<br>12246547 | `market,quote_date`<br>NASDAQ,8/15/08<br>NASDAQ,8/16/08<br>NASDAQ,8/18/08 | `sector`<br>Healthcare |
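For reference, the delete files from the table above can be produced with the standard `csv` module; the values are taken directly from the example:

```python
import csv

# delete-pk.csv: rows identified by the primary key label.
with open("delete-pk.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["csv_id"])
    writer.writerows([["12246545"], ["12246547"]])

# delete-fk1.csv: rows identified by a combination of foreign keys.
with open("delete-fk1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["market", "quote_date"])
    writer.writerows([
        ["NASDAQ", "8/15/08"],
        ["NASDAQ", "8/16/08"],
        ["NASDAQ", "8/18/08"],
    ])

# delete-fk2.csv: rows identified by a single foreign key.
with open("delete-fk2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sector"])
    writer.writerow(["Healthcare"])
```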
Build a manifest file to specify what data to upload and what to delete for each dataset. The manifest file maps the columns in the CSV files to your data model. You can build it from scratch, or you can retrieve an SLI manifest via API and then manually update it.
In the manifest file:
- For each CSV file to upload or delete data from, create a separate `dataSetSLIManifest` section (see Loading Data via REST API).
- For the data to delete, set the mode to `DELETE-CENTER`.
- For the data to load, set the mode to `INCREMENTAL`.

If you want to only delete some data from a dataset without uploading any new data, do not include a `dataSetSLIManifest` section with `INCREMENTAL` mode in the manifest file, and do not include any CSV file with data to upload.
The following manifest file specifies what data to upload from `data.csv`, and what data to delete from `delete-pk.csv`, `delete-fk1.csv`, and `delete-fk2.csv`:
{
"dataSetSLIManifestList": [
{
"dataSetSLIManifest": {
"parts": [
{
"columnName": "id",
"populates": [ "label.quotes.id" ],
"referenceKey": 1,
"mode": "DELETE-CENTER"
}
],
"file": "delete-pk.csv",
"dataSet": "dataset.quotes"
}
},
{
"dataSetSLIManifest": {
"parts": [
{
"columnName": "market",
"populates": [ "label.quotes.market" ],
"referenceKey": 1,
"mode": "DELETE-CENTER"
},
{
"columnName": "quote_date",
"populates": [ "quote.date.mdyy" ],
"constraints": { "date": "MM/dd/yy" },
"referenceKey": 1,
"mode": "DELETE-CENTER"
}
],
"file": "delete-fk1.csv",
"dataSet": "dataset.quotes"
}
},
{
"dataSetSLIManifest": {
"parts": [
{
"columnName": "sector",
"populates": [ "label.quotes.sector" ],
"referenceKey": 1,
"mode": "DELETE-CENTER"
}
],
"file": "data-fk2.csv",
"dataSet": "dataset.quotes"
}
},
{
"dataSetSLIManifest": {
"parts": [
# For the definition of this 'parts' section, see https://help.gooddata.com/pages/viewpage.action?pageId=86796461.
],
"file": "data.csv",
"dataSet": "dataset.quotes"
}
}
]
}
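Once `upload.zip` is in your user-specific storage directory, the load is started by calling the platform's ETL pull API. The sketch below only constructs the request pieces and does not send them; the project ID and directory name are placeholders, and you should verify the exact endpoint, payload, and authentication against the current GoodData API reference before use.

```python
import json

project_id = "PROJECT_ID"      # placeholder: your workspace/project ID
upload_dir = "my-upload-dir"   # placeholder: storage directory holding upload.zip

# Assumed endpoint and payload shape for starting the pull; confirm
# against the GoodData API documentation.
pull_url = f"/gdc/md/{project_id}/etl/pull2"
payload = json.dumps({"pullIntegration": upload_dir})

print(pull_url)
print(payload)
```

Sending this payload as an authenticated POST request would queue the combined delete-and-load task; its status can then be polled via the URI returned in the response.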