Deleting Records from Datasets

This article provides information and use cases on how to delete records from a dataset using MAQL DML, the Data Manipulation Language variant of MAQL.

This method of removing data from a workspace is just one of several available strategies, each of which applies to a different set of use cases and has different comparative strengths and weaknesses. Before you begin removing data, you should review the various available strategies. For more information, see Deleting Data from a Workspace.

You should already be familiar with MAQL, GoodData’s proprietary Multi-Dimension Analytical Query Language, which is used for specifying metrics, defining data models, and deleting data. For more information on MAQL for reporting, see MAQL - Analytical Query Language.

Overview

MAQL DML statements reference specific attributes, facts, and datasets in a GoodData workspace, which are translated via SQL into references to tables in the database.

Warnings

MAQL DML statements cause programmatic and permanent removal of data. Mistakes cannot be recovered.

Some other important considerations:

  • No validation of the statement is performed prior to execution. The statement is submitted and applied directly to the referenced objects.
  • There is no effective way to rollback the data. For example, if you delete the wrong attribute value, you cannot retrieve it. At best, you can re-run ETL to re-populate removed values.
  • As noted in the previous warning, it is very easy to wipe out values if the datatypes in the MAQL DML statement do not match the datatypes in the stored dataset.
  • It is very easy to introduce referential integrity issues in the data, which can be challenging to debug. Always delete fact data before you delete corresponding attribute values.
  • You cannot delete date records from a Date dimension using MAQL DML.

Performance

The speed at which data is removed from a workspace depends on a number of complex factors, some of which cannot be controlled or managed. The following conditions can typically impact performance of queries and data removal:

  • Attribute labels and their values (not indexed)
  • Facts are not indexed
  • Larger volumes of data

In the table below, you can see some generalized guidelines for data removal performance. Suppose that in this case you are attempting to remove records from a large table.

Removal of more than 10% of rows: Always slow.

Removal of less than 10% of rows: It depends:

OperationExamplePerformance
Rows are identified by a single attribute valueWHERE {label.mydataset.name} = "Joe User"Fast, since attribute labels are indexed
Rows are identified by a small number of attribute valuesWHERE {label.mydataset.name} = "Joe User" and {label.mydataset.company} = "GoodData"Fast, since attribute labels are indexed
Rows are identified by a large number of attribute valuesnone.Slow, since the indices are not used
Rows identified by a fact valueWHERE {fact.twitter.retweets} = 2Slow, since facts are not indexed

Sequence for deleting records from a dataset

At the physical data layer, there are multiple steps required to remove what is conceptually a single set of data. In the database, the data that represents the contents of a dataset is stored across multiple tables.

  • The data for all facts of the dataset are stored in a single table. So, if your dataset contains multiple facts, the data for each fact is stored in separate columns of the same table, with some columns shared across all facts in the dataset.
  • For each fact, any related attribute value is stored as a lookup identifier in a specific column in the fact table.
  • The data for each attribute in the dataset is stored in its own lookup table. This table includes the primary key for the attribute and all of its labels and their values. For each attribute value, there is an attribute value ID and the value itself.

In the above diagram, you can see how data is stored for an example dataset. Suppose you wish to remove the tweets issued by the user GoodData. In this case, you must perform deletes from each table using MAQL DML.

After you perform the delete from the fact table, the attribute value still exists in the attribute table, which has implications in reporting. For example, this value is still available for selection in the attribute filter selections in the report configuration of the workspace.

The above example is explored in greater detail at the bottom of the article. For more information, see Basic Workflow Example.

Applying MAQL DML

Validation

Before and after you finish each delete operations, it’s important to validate your entire workspace again. Validation shows you the actual status of the data model and checks the referential integrity of the whole model.

For more information, see Validate a Workspace.

General Syntax

In the example below, you can see the generalized syntax for a MAQL DELETE statement:

DELETE FROM <object> WHERE <condition>;

where:

  • <object> - determines the object from which the data is to be deleted. See below.
  • <condition> - identifies the records to be deleted.

Acquiring Object Identifiers

As part of building your MAQL DML statements, you need to acquire and use the internal identifiers for the objects you reference. Typically, these object references can be acquired from the logical data model.

  • The identifier for the workspace itself can be extracted from the URL when the workspace is opened in the GoodData Portal.
  • Most object identifiers can be retrieved in the LDM Modeler integrated in CloudConnect Designer. Select the object, and its internal identifier is displayed in the sidebar. For labels and hyperlinks, you must edit the dataset, and select the Show Identifiers checkbox.
  • Some object identifiers are automatically created for you in your logical data model and may be hidden from view. You may need to use other mechanisms to acquire them.
  • For more information, see Acquiring Identifiers for Workspace Metadata.

MAQL DML Interfaces

GoodData API

MAQL DML statements can be executed via GoodData API. The API is located at the following URL:

https://secure.gooddata.com/gdc/md/{workspace_id}/dml/manage

Gray Pages

You can also use the gray pages for your workspace to execute MAQL DML statements:

CloudConnect Designer

MAQL DML statements may also be executed in CloudConnect Designer using the MAQL DML Execute component, which is available under the Others category of ETL components.

Example MAQL DML Statements

The following examples may assist you in building your MAQL DML queries.

These examples reference the Twitter Search API project, which is part of the Getting Started workflow. For more information, see Getting Started.

Example - Deleting all rows in an attribute

The following example deletes or destroys all facts in the Twitter dataset:

DELETE FROM {attr.twitter.factsof};

In the above, the expression {attr.twitter.factsof} refers all the facts in the Twitter dataset. Since the statement contains no condition (no WHERE clause), all rows of data in the fact table are removed.

To remove the attribute values that are no longer referenced by the facts in the Twitter dataset, you must execute versions of the following statement for each affected attribute in the dataset. This one removes all attribute values for the author attribute:

DELETE FROM {attr.twitter.author};

After removing all data associated with the attributes, all data related to the dataset has been removed from the workspace.

Example - Building conditions into MAQL DML statements

You can use a WHERE to apply conditions to your MAQL DML statement. The left side of the condition can be defined from a fact value or a label for the attribute from which you are deleting data. The right side provides a value expression.

For example, suppose you need to delete all rows where the Retweets fact has a value of zero. Extending the previous example yields the following statement:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets}=0; 

Example - Adding logical operators

You may also apply logical operators to your DELETE statements. Into the WHERE, you may apply the following basic operators:

  • =
  • IN
  • NOT IN
  • <
  • >
  • BETWEEN
  • NOT BETWEEN
  • AND
  • OR
  • NOT

The following example shows how to construct a DELETE statement to remove all rows from the retweets and favorites facts in which the value is 0:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} = 0 OR {fact.twitter.favorites} = 0;

Example - Deleting a range of values

Depending on the type of data you are removing, you may use one of the following methods for removing rows containing a range of values.

Deleting rows based on fact values

Suppose you are not interested in any tweets that have not been retweeted at least five times. Also, you want to remove any tweets that have been retweeted more than 100 times, as they represent outliers that are skewing other metrics. Here is the statement to remove these rows:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} < 5 OR {fact.twitter.retweets} > 100;

You may also use the BETWEEN keyword for specifying ranges for facts. In the following example, you reference a range of attribute values using the BETWEEN keyword. This example deletes all rows in the fact table in which the number of favorites of the tweet is between 1 and 5:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.favorites} BETWEEN 1 AND 5;

In the above, the BETWEEN range is inclusive of the end points.

Deleting rows based on attribute values

Suppose you discover that an employee sent the same tweets from two different accounts (the GoodData and the GoodData-Mktng account). To remove the double-counting from the report, you want to remove the tweets from the marketing account for that date:

The format of the date must match the format specified in the date label.

DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData-Mktng";

Example - Deleting aged data based on date values

A common use case for MAQL DML is deleting data that has aged. Typically, this step involves deleting rows based on date values.

When a Date dimension is added to your logical data model, a set of attribute labels is automatically added to the workspace, so that dates can be displayed in all of the supported formats. Depending on the format of dates in your workspace data, you must reference the appropriate label when building a condition (WHERE clause) in your MAQL DML statement.

Suppose you need to remove the fact records from the twitter dataset from before January 1st, 1999:

In this case, you must reference the external date dimension in the format listed below:

DELETE FROM {attr.twitter.factsof} WHERE  {tweetdate.date.yyyymmdd} < "1999-01-01";

The right-hand side of the condition (1999-01-01) must use the same format as the values expressed in the date, or an error occurs. For more information on the supported date formats and their corresponding attribute labels, see Acquiring Object Identifiers for Workspace Metadata.

Basic Workflow Example

In this example, you can review the basic workflow required to remove data from a workspace. In this case, the example workspace for the Twitter Search API has been developed and published to the GoodData platform, resulting in an insight identifying tweets by author that contain the phrase gooddata.

Suppose that you wish to remove the tweet references submitted from the official GoodData account, which would provide better information on secondary Twitter activity. The imaginary insight contains eight tweets from the GoodData author.

Acquire identifiers for removing GoodData tweets

To build the MAQL statements, you must acquire the object identifiers for the following objects from the Twitter Search API project:

  • Attribute “Facts Of” records for the dataset
  • Attribute containing the GoodData value (The Author attribute)
  • Label of the attribute containing the GoodData value (The Author2 attribute label)

For more information on how to acquire these object identifiers, see Acquiring Object Identifiers for Workspace Metadata.

Building the MAQL DML statements

After you have retrieved the identifiers for the attribute labels, you can build your MAQL DML statements. You must build and execute two MAQL DML statements.

  1. Delete data from the fact table.
  2. Delete data from the attribute value table.

MAQL to delete fact data from dataset

First, you must delete the fact data from the dataset. This step removes the records from the reporting data, so any reports that utilizes these facts are immediately updated. Build your fact deletion statement in the following form, modifying based on the specific values for your workspace’s dataset:

DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData";

Legend:

  • <attr.twitter.factsof> - This reference corresponds to the identifier of the facts that pertain to the referenced attribute. For more information, see Acquiring Object Identifiers for an Attribute of Facts.

  • <label.twitter.author.author2> - This reference defines the attribute label of the Author attribute, where attribute values are stored.

MAQL to delete attribute value data from dataset

You must also build the statement to remove the attribute value from the attribute. If this step is not completed, the attribute value is available for selection when the report is filtered by attribute. If a user tries to apply the filter, the report is blank because there are no corresponding facts, which have already been deleted.

Build your fact deletion statement in the following form, modifying based on the specific values for your workspace’s dataset:

DELETE FROM {attr.twitter.author} WHERE {label.twitter.author.author2} = "GoodData";

Legend:

  • <attr.twitter.author> - This reference identifies the Author attribute, where the attribute values to remove are stored.
  • <label.twitter.author.author2> - This reference defines the attribute label of the Author attribute, where attribute values are stored.

Executing the MAQL DML via API

To execute the MAQL DML statement, you can execute it through a number of interfaces. In this example, it is executed through API using a commonly available REST client.

For more information about available interfaces, see MAQL DML Interfaces.

After you have deleted the fact records, you should execute the MAQL DML command to remove the value (GoodData, in this case) from the attribute label.

Verifying the change

If the API call is processed, the response includes a URI to the execution resource. You should review the URI to see if there were any errors in the status message. A status code of OK indicates that the deletion occurred.

You should also verify that the data has been removed from the workspace. Note that there is no data from the GoodData user.

Additional Documentation

There is significant overlap between MAQL DML and MAQL DQL. For more information on the basic syntax of MAQL DQL, see MAQL - Analytical Query Language.