Deleting Records from Datasets

This article provides information and use cases on how to delete records from a dataset using MAQL DML, the Data Manipulation Language variant of MAQL.

Contents:

This method of removing data from a project is just one of several available strategies, each of which applies to a different set of use cases and has different comparative strengths and weaknesses. Before you begin removing data, you should review the various available strategies. For more information, see Deleting Data from Projects.

MAQL DELETE statements are translated to a corresponding SQL command, which may generate unexpected side effects. For example, if attribute values are not applicable to the type of attribute, then they may be converted to NULL values, and all records in the dataset are deleted. Please proceed with caution. See Warnings below.

You should already be familiar with MAQL, GoodData’s proprietary Multi-Dimension Analytical Query Language, which is used for specifying metrics, defining data models, and deleting data. For more information on MAQL for reporting, see 2019-05-31_12-02-48_MAQL - Multi-Dimension Analytical Query Language.

Overview

MAQL DML statements reference specific attributes, facts, and datasets in a GoodData project, which are translated via SQL into references to tables in the database.

Warnings

MAQL DML statements cause programmatic and permanent removal of data. Mistakes cannot be recovered.

Some other important considerations:

  • No validation of the statement is performed prior to execution. The statement is submitted and applied directly to the referenced objects.
  • There is no effective way to rollback the data. For example, if you delete the wrong attribute value, you cannot retrieve it. At best, you can re-run ETL to re-populate removed values.
  • As noted in the previous warning, it is very easy to wipe out values if the datatypes in the MAQL DML statement do not match the datatypes in the stored dataset.
  • It is very easy to introduce referential integrity issues in the data, which can be challenging to debug. Always delete fact data before you delete corresponding attribute values.
  • You cannot delete date records from a Date dimension using MAQL DML.

This method should be limited for use by only trained database administrators. Where possible, it is recommended that you pursue a different strategy for removing data from your project. For more information on the available strategies, see Deleting Data from Projects.

Performance

The speed at which data is removed from a project depends on a number of complex factors, some of which cannot be controlled or managed. The following conditions can typically impact performance of queries and data removal:

  • Attribute labels and their values (not indexed)
  • Facts are not indexed
  • Larger volumes of data

When the MAQL DML statement is issued to the platform, it may take a significant period of time before the data is removed from the project. When the URI resource in the response is queried, a Running message appears until the data has been removed, and users of the report are viewing the old data until it has been purged. As a result, for large projects or large volumes of deleted data, you should schedule these executions during off-peak hours.

In the table below, you can see some generalized guidelines for data removal performance. Suppose that in this case you are attempting to remove records from a large table.

Removal of more than 10% of rows: Always slow.

Removal of less than 10% of rows: It depends:

OperationExamplePerformance
Rows are identified by a single attribute valueWHERE {label.mydataset.name} = "Joe User"Fast, since attribute labels are indexed
Rows are identified by a small number of attribute valuesWHERE {label.mydataset.name} = "Joe User" and {label.mydataset.company} = "GoodData"Fast, since attribute labels are indexed
Rows are identified by a large number of attribute valuesnone.Slow, since the indices are not used
Rows identified by a fact valueWHERE {fact.twitter.retweets} = 2Slow, since facts are not indexed

Sequence for deleting records from a dataset

At the physical data layer, there are multiple steps required to remove what is conceptually a single set of data. In the database, the data that represents the contents of a dataset is stored across multiple tables.

  • The data for all facts of the dataset are stored in a single table. So, if your dataset contains multiple facts, the data for each fact is stored in separate columns of the same table, with some columns shared across all facts in the dataset.
  • For each fact, any related attribute value is stored as a lookup identifier in a specific column in the fact table.
  • The data for each attribute in the dataset is stored in its own lookup table. This table includes the primary key for the attribute and all of its labels and their values. For each attribute value, there is an attribute value ID and the value itself.

In the above diagram, you can see how data is stored for an example dataset. Suppose you wish to remove the tweets issued by the user GoodData. In this case, you must perform deletes from each table using MAQL DML.

It is important to perform the first deletion from the fact table to remove the data first. Then, you can perform any related deletes from the attribute value table. If you perform deletions in the opposite order, you can create referential integrity problems, in which the fact data has empty values associated with the attribute.

After you perform the delete from the fact table, the attribute value still exists in the attribute table, which has implications in reporting. For example, this value is still available for selection in the attribute filter selections in the report configuration of the project.

The above example is explored in greater detail at the bottom of the article. For more information, see Basic Workflow Example.

Applying MAQL DML

Validation

Before and after you finish each delete operations, it’s important to validate your entire project again. Validation shows you the actual status of the data model and checks the referential integrity of the whole model.

Referential integrity checks are not part of a data model-only validation check. Validate the entire project.

For more information, see Validate a Project.

General Syntax

In the example below, you can see the generalized syntax for a MAQL DELETE statement:

DELETE FROM <object> WHERE <condition>;

where:

  • <object> - determines the object from which the data is to be deleted. See below.
  • <condition> - identifies the records to be deleted.

Acquiring Object Identifiers

As part of building your MAQL DML statements, you need to acquire and use the internal identifiers for the objects you reference. Typically, these object references can be acquired from the logical data model.

  • The identifier for the project itself can be extracted from the URL when the project is opened in the GoodData Portal.
  • Most object identifiers can be retrieved in the LDM Modeler integrated in CloudConnect Designer. Select the object, and its internal identifier is displayed in the sidebar. For labels and hyperlinks, you must edit the dataset, and select the Show Identifiers checkbox.
  • Some object identifiers are automatically created for you in your logical data model and may be hidden from view. You may need to use other mechanisms to acquire them.
  • For more information, see Acquiring Identifiers for Project Metadata.

MAQL DML Interfaces

You cannot execute MAQL DML statements through the GoodData Portal.

GoodData API

MAQL DML statements can be executed via GoodData API. The API is located at the following URL:

https://secure.gooddata.com/gdc/md/{project-id}/dml/manage

You must be a project administrator to invoke this API.

Gray Pages

You can also use the gray pages for your project to execute MAQL DML statements:

CloudConnect Designer

MAQL DML statements may also be executed in CloudConnect Designer using the MAQL DML Execute component, which is available under the Others category of ETL components.

Example MAQL DML Statements

The following examples may assist you in building your MAQL DML queries.

These examples reference the Twitter Search API project, which is part of the Getting Started workflow. For more information, see Getting Started.

Example - Deleting all rows in an attribute

This example is not commonly used, since it is so destructive.

The following example deletes or destroys all facts in the Twitter dataset:

DELETE FROM {attr.twitter.factsof};

In the above, the expression {attr.twitter.factsof} refers all the facts in the Twitter dataset. Since the statement contains no condition (no WHERE clause), all rows of data in the fact table are removed.

After you have completed the above execution, the attribute values associated with these facts remain in the attribute tables. They may appear in the list of available attribute values when filtering your reports, although there will be no data associated with them.

To remove the attribute values that are no longer referenced by the facts in the Twitter dataset, you must execute versions of the following statement for each affected attribute in the dataset. This one removes all attribute values for the author attribute:

Do not execute this statement before removing the facts, or you will cause referential integrity problems in your fact table, where references to attribute values may remain. Always delete from the fact table first.

DELETE FROM {attr.twitter.author};

After removing all data associated with the attributes, all data related to the dataset has been removed from the project.

Example - Building conditions into MAQL DML statements

You can use a WHERE to apply conditions to your MAQL DML statement. The left side of the condition can be defined from a fact value or a label for the attribute from which you are deleting data. The right side provides a value expression.

For example, suppose you need to delete all rows where the Retweets fact has a value of zero. Extending the previous example yields the following statement:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets}=0; 

Example - Adding logical operators

You may also apply logical operators to your DELETE statements. Into the WHERE, you may apply the following basic operators:

  • =
  • IN
  • NOT IN
  • <
  • >
  • BETWEEN
  • NOT BETWEEN
  • AND
  • OR
  • NOT

The following example shows how to construct a DELETE statement to remove all rows from the retweets and favorites facts in which the value is 0:

Quotes are not necessary around the values for facts, since they are numeric.

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} = 0 OR {fact.twitter.favorites} = 0;

Example - Deleting a range of values

Depending on the type of data you are removing, you may use one of the following methods for removing rows containing a range of values.

Deleting rows based on fact values

Suppose you are not interested in any tweets that have not been retweeted at least five times. Also, you want to remove any tweets that have been retweeted more than 100 times, as they represent outliers that are skewing other metrics. Here is the statement to remove these rows:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} < 5 OR {fact.twitter.retweets} > 100;

You may also use the BETWEEN keyword for specifying ranges for facts. In the following example, you reference a range of attribute values using the BETWEEN keyword. This example deletes all rows in the fact table in which the number of favorites of the tweet is between 1 and 5:

DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.favorites} BETWEEN 1 AND 5;

In the above, the BETWEEN range is inclusive of the end points.

Deleting rows based on attribute values

If you want to delete any attribute value, you must first delete the rows in the fact table that reference the attribute value first. If you do not follow this workflow, referential integrity is broken. For example, if you want to delete Author = “GoodData”, you must modify or remove all records of employees who work in that department first.

Suppose you discover that an employee sent the same tweets from two different accounts (the GoodData and the GoodData-Mktng account). To remove the double-counting from the report, you want to remove the tweets from the marketing account for that date:

The format of the date must match the format specified in the date label.

DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData-Mktng";

To remove facts from the dataset based on attribute values, you must reference a label of the attribute ({label.twitter.author.author2}). You cannot reference the attribute itself.

Example - Deleting aged data based on date values

A common use case for MAQL DML is deleting data that has aged. Typically, this step involves deleting rows based on date values.

When a Date dimension is added to your logical data model, a set of attribute labels is automatically added to the project, so that dates can be displayed in all of the supported formats. Depending on the format of dates in your project data, you must reference the appropriate label when building a condition (WHERE clause) in your MAQL DML statement.

MAQL DML does not support the deletion of dates directly from a GoodData Date dimension. In general, you should avoid implementing dates as fact values in your data model. Use the attribute label references as part of your delete statements.

Suppose you need to remove the fact records from the the twitter dataset from before January 1st, 1999:

In this case, you must reference the external date dimension in the format listed below:

DELETE FROM {attr.twitter.factsof} WHERE  {tweetdate.date.yyyymmdd} < "1999-01-01";

The right-hand side of the condition (1999-01-01) must use the same format as the values expressed in the date, or an error occurs. For more information on the supported date formats and their corresponding attribute labels, see Acquiring Object Identifiers for Project Metadata.

Basic Workflow Example

In this example, you can review the basic workflow required to remove data from a project. In this case, the CloudConnect example project for the Twitter Search API has been developed and published to the GoodData platform, resulting in a report identifying tweets by author that contain the phrase gooddata.

For more information on this project, see Analyzing Twitter Search.

Suppose that you wish to remove the tweet references submitted from the official GoodData account, which would provide better information on secondary Twitter activity. The above report contains eight tweets from the GoodData author.

  •  In a real project, you would be better served by building a custom MAQL metric that invokes a WHERE clause to remove references counts in which Author=GoodData. This is just for demonstration purposes.
  • This specific project is designed to load the most recent data, so each time the ETL is executed, it grabs a full data load of the most recent 100 tweets referencing the term gooddata. If you needed to replace all of the data in the project, you could simply execute a new ETL run. For more information on this project, see Analyzing Twitter Search.

Acquire identifiers for removing GoodData tweets

To build the MAQL statements, you must acquire the object identifiers for the following objects from the Twitter Search API project:

  • Attribute “Facts Of” records for the dataset
  • Attribute containing the GoodData value (The Author attribute)
  • Label of the attribute containing the GoodData value (The Author2 attribute label)

For more information on how to acquire these object identifiers, see Acquiring Object Identifiers for Project Metadata.

Building the MAQL DML statements

After you have retrieved the identifiers for the attribute labels, you can build your MAQL DML statements. You must build and execute two MAQL DML statements.

  1. Delete data from the fact table.
  2. Delete data from the attribute value table.

It is important to remove data in the order listed above to maintain referential integrity. If you remove the values from the attribute value table first, then the facts still remain in the database, and any reports utilizing the facts, filtered by the attribute value, will contain (empty value).

MAQL to delete fact data from dataset

First, you must delete the fact data from the dataset. This step removes the records from the reporting data, so any reports that utilizes these facts are immediately updated. Build your fact deletion statement in the following form, modifying based on the specific values for your project’s dataset:

DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData";

Legend:

  • <attr.twitter.factsof> - This reference corresponds to the identifier of the facts that pertain to the referenced attribute. For more information, see Acquiring Object Identifiers for an Attribute of Facts.
  • <label.twitter.author.author2> - This reference defines the attribute label of the Author attribute, where attribute values are stored.

    You must reference an attribute label in the WHERE clause.

MAQL to delete attribute value data from dataset

You must also build the statement to remove the attribute value from the attribute. If this step is not completed, the attribute value is available for selection when the report is filtered by attribute. If a user tries to apply the filter, the report is blank because there are no corresponding facts, which have already been deleted.

Build your fact deletion statement in the following form, modifying based on the specific values for your project’s dataset:

DELETE FROM {attr.twitter.author} WHERE {label.twitter.author.author2} = "GoodData";

Legend:

  • <attr.twitter.author> - This reference identifies the Author attribute, where the attribute values to remove are stored.
  • <label.twitter.author.author2> - This reference defines the attribute label of the Author attribute, where attribute values are stored.

Executing the MAQL DML via API

Before you begin making changes to your project, you must authenticate to the GoodData platform using the login API and then acquire the temporary token to access resources. For more information, see Authentication via API.

To execute the MAQL DML statement, you can execute it through a number of interfaces. In this example, it is executed through API using a commonly available REST client.

For more information about available interfaces, see MAQL DML Interfaces.

In the JSON statement, the quote marks around GoodData must be escaped (\").

After you have deleted the fact records, you should execute the MAQL DML command to remove the value (GoodData, in this case) from the attribute label.

Verifying the change

If the API call is processed, the response includes a URI to the execution resource. You should review the URI to see if there were any errors in the status message. A status code of OK indicates that the deletion occurred.

Syntax errors should be reported in the REST client.

You should also verify that the data has been removed from the project. Note that there is no data from the GoodData user.

Additional Documentation

There is significant overlap between MAQL DML and MAQL DQL. For more information on the basic syntax of MAQL DQL, see 2019-05-31_12-02-48_MAQL - Multi-Dimension Analytical Query Language.