Deleting Records from Datasets
This article provides information and use cases on how to delete records from a dataset using MAQL DML, the Data Manipulation Language variant of MAQL.
This method of removing data from a workspace is just one of several available strategies, each of which applies to a different set of use cases and has different comparative strengths and weaknesses. Before you begin removing data, you should review the various available strategies. For more information, see Deleting Data from a Workspace.
MAQL DELETE statements are translated to a corresponding SQL command, which may generate unexpected side effects. For example, if attribute values are not applicable to the type of attribute, then they may be converted to NULL values, and all records in the dataset are deleted. Please proceed with caution. See Warnings below.
By default, the delete operation on the PostgreSQL backend is performed using inverse copy. The performance of inverse copy is not optimal when deleting small amounts of data. To switch to direct delete, which is more suitable for deleting smaller amounts of data, please contact GoodData Support.
You should already be familiar with MAQL, GoodData’s proprietary Multi-Dimension Analytical Query Language, which is used for specifying metrics, defining data models, and deleting data. For more information on MAQL for reporting, see MAQL - Analytical Query Language.
Overview
MAQL DML statements reference specific attributes, facts, and datasets in a GoodData workspace, which are translated via SQL into references to tables in the database.
Warnings
MAQL DML statements cause programmatic and permanent removal of data. Mistakes cannot be recovered.
Some other important considerations:
- No validation of the statement is performed prior to execution. The statement is submitted and applied directly to the referenced objects.
- There is no effective way to rollback the data. For example, if you delete the wrong attribute value, you cannot retrieve it. At best, you can re-run ETL to re-populate removed values.
- As noted in the previous warning, it is very easy to wipe out values if the datatypes in the MAQL DML statement do not match the datatypes in the stored dataset.
- It is very easy to introduce referential integrity issues in the data, which can be challenging to debug. Always delete fact data before you delete corresponding attribute values.
- You cannot delete date records from a Date dimension using MAQL DML.
This method should be limited for use by only trained database administrators. Where possible, it is recommended that you pursue a different strategy for removing data from your workspace. For more information on the available strategies, see Deleting Data from a Workspace.
Performance
The speed at which data is removed from a workspace depends on a number of complex factors, some of which cannot be controlled or managed. The following conditions can typically impact performance of queries and data removal:
- Attribute labels and their values (not indexed)
- Facts are not indexed
- Larger volumes of data
When the MAQL DML statement is issued to the platform, it may take a significant period of time before the data is removed from the workspace. When the URI resource in the response is queried, a Running
message appears until the data has been removed, and users of the report are viewing the old data until it has been purged. As a result, for large workspaces or large volumes of deleted data, you should schedule these executions during off-peak hours.
In the table below, you can see some generalized guidelines for data removal performance. Suppose that in this case you are attempting to remove records from a large table.
Removal of more than 10% of rows: Always slow.
Removal of less than 10% of rows: It depends:
Operation | Example | Performance |
---|---|---|
Rows are identified by a single attribute value | WHERE {label.mydataset.name} = "Joe User" | Fast, since attribute labels are indexed |
Rows are identified by a small number of attribute values | WHERE {label.mydataset.name} = "Joe User" and {label.mydataset.company} = "GoodData" | Fast, since attribute labels are indexed |
Rows are identified by a large number of attribute values | none. | Slow, since the indices are not used |
Rows identified by a fact value | WHERE {fact.twitter.retweets} = 2 | Slow, since facts are not indexed |
Sequence for deleting records from a dataset
At the physical data layer, there are multiple steps required to remove what is conceptually a single set of data. In the database, the data that represents the contents of a dataset is stored across multiple tables.
- The data for all facts of the dataset are stored in a single table. So, if your dataset contains multiple facts, the data for each fact is stored in separate columns of the same table, with some columns shared across all facts in the dataset.
- For each fact, any related attribute value is stored as a lookup identifier in a specific column in the fact table.
- The data for each attribute in the dataset is stored in its own lookup table. This table includes the primary key for the attribute and all of its labels and their values. For each attribute value, there is an attribute value ID and the value itself.
In the above diagram, you can see how data is stored for an example dataset. Suppose you wish to remove the tweets issued by the user GoodData
. In this case, you must perform deletes from each table using MAQL DML.
It is important to perform the first deletion from the fact table to remove the data first. Then, you can perform any related deletes from the attribute value table. If you perform deletions in the opposite order, you can create referential integrity problems, in which the fact data has empty values associated with the attribute.
After you perform the delete from the fact table, the attribute value still exists in the attribute table, which has implications in reporting. For example, this value is still available for selection in the attribute filter selections in the report configuration of the workspace.
The above example is explored in greater detail at the bottom of the article. For more information, see Basic Workflow Example.
Applying MAQL DML
Validation
Before and after you finish each delete operations, it’s important to validate your entire workspace again. Validation shows you the actual status of the data model and checks the referential integrity of the whole model.
Referential integrity checks are not part of a data model-only validation check. Validate the entire workspace.
For more information, see Validate a Workspace.
General Syntax
In the example below, you can see the generalized syntax for a MAQL DELETE statement:
DELETE FROM <object> WHERE <condition>;
where:
<object>
- determines the object from which the data is to be deleted. See below.<condition>
- identifies the records to be deleted.
Acquiring Object Identifiers
As part of building your MAQL DML statements, you need to acquire and use the internal identifiers for the objects you reference. Typically, these object references can be acquired from the logical data model.
- The identifier for the workspace itself can be extracted from the URL when the workspace is opened in the GoodData Portal.
- Most object identifiers can be retrieved in the LDM Modeler integrated in CloudConnect Designer. Select the object, and its internal identifier is displayed in the sidebar. For labels and hyperlinks, you must edit the dataset, and select the Show Identifiers checkbox.
- Some object identifiers are automatically created for you in your logical data model and may be hidden from view. You may need to use other mechanisms to acquire them.
- For more information, see Acquiring Identifiers for Workspace Metadata.
MAQL DML Interfaces
You cannot execute MAQL DML statements through the GoodData Portal.
GoodData API
MAQL DML statements can be executed via GoodData API. The API is located at the following URL:
https://secure.gooddata.com/gdc/md/{workspace_id}/dml/manage
In GoodData, terms workspace and project denote the same entity. For example, project ID is exactly the same as workspace ID. See Find the Workspace ID.
You must be a workspace administrator to invoke this API.
Gray Pages
You can also use the gray pages for your workspace to execute MAQL DML statements:
CloudConnect Designer
MAQL DML statements may also be executed in CloudConnect Designer using the MAQL DML Execute component, which is available under the Others category of ETL components.
Example MAQL DML Statements
The following examples may assist you in building your MAQL DML queries.
These examples reference the Twitter Search API project, which is part of the Getting Started workflow. For more information, see Getting Started.
Example - Deleting all rows in an attribute
This example is not commonly used, since it is so destructive.
The following example deletes or destroys all facts in the Twitter dataset:
DELETE FROM {attr.twitter.factsof};
In the above, the expression {attr.twitter.factsof}
refers all the facts in the Twitter dataset. Since the statement contains no condition (no WHERE
clause), all rows of data in the fact table are removed.
After you have completed the above execution, the attribute values associated with these facts remain in the attribute tables. They may appear in the list of available attribute values when filtering your reports, although there will be no data associated with them.
To remove the attribute values that are no longer referenced by the facts in the Twitter dataset, you must execute versions of the following statement for each affected attribute in the dataset. This one removes all attribute values for the author
attribute:
Do not execute this statement before removing the facts, or you will cause referential integrity problems in your fact table, where references to attribute values may remain. Always delete from the fact table first.
DELETE FROM {attr.twitter.author};
After removing all data associated with the attributes, all data related to the dataset has been removed from the workspace.
Example - Building conditions into MAQL DML statements
You can use a WHERE
to apply conditions to your MAQL DML statement. The left side of the condition can be defined from a fact value or a label for the attribute from which you are deleting data. The right side provides a value expression.
For example, suppose you need to delete all rows where the Retweets fact has a value of zero. Extending the previous example yields the following statement:
DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets}=0;
Example - Adding logical operators
You may also apply logical operators to your DELETE
statements. Into the WHERE
, you may apply the following basic operators:
=
IN
NOT IN
<
>
BETWEEN
NOT BETWEEN
AND
OR
NOT
The following example shows how to construct a DELETE
statement to remove all rows from the retweets
and favorites
facts in which the value is 0:
Quotes are not necessary around the values for facts, since they are numeric.
DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} = 0 OR {fact.twitter.favorites} = 0;
Example - Deleting a range of values
Depending on the type of data you are removing, you may use one of the following methods for removing rows containing a range of values.
Deleting rows based on fact values
Suppose you are not interested in any tweets that have not been retweeted at least five times. Also, you want to remove any tweets that have been retweeted more than 100 times, as they represent outliers that are skewing other metrics. Here is the statement to remove these rows:
DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.retweets} < 5 OR {fact.twitter.retweets} > 100;
You may also use the BETWEEN
keyword for specifying ranges for facts. In the following example, you reference a range of attribute values using the BETWEEN
keyword. This example deletes all rows in the fact table in which the number of favorites of the tweet is between 1 and 5:
DELETE FROM {attr.twitter.factsof} WHERE {fact.twitter.favorites} BETWEEN 1 AND 5;
In the above, the BETWEEN
range is inclusive of the end points.
Deleting rows based on attribute values
If you want to delete any attribute value, you must first delete the rows in the fact table that reference the attribute value first. If you do not follow this workflow, referential integrity is broken. For example, if you want to delete Author = “GoodData”, you must modify or remove all records of employees who work in that department first.
Suppose you discover that an employee sent the same tweets from two different accounts (the GoodData
and the GoodData-Mktng
account). To remove the double-counting from the report, you want to remove the tweets from the marketing account for that date:
The format of the date must match the format specified in the date label.
DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData-Mktng";
To remove facts from the dataset based on attribute values, you must reference a label of the attribute ({label.twitter.author.author2}
). You cannot reference the attribute itself.
Example - Deleting aged data based on date values
A common use case for MAQL DML is deleting data that has aged. Typically, this step involves deleting rows based on date values.
When a Date dimension is added to your logical data model, a set of attribute labels is automatically added to the workspace, so that dates can be displayed in all of the supported formats. Depending on the format of dates in your workspace data, you must reference the appropriate label when building a condition (WHERE
clause) in your MAQL DML statement.
MAQL DML does not support the deletion of dates directly from a GoodData Date dimension. In general, you should avoid implementing dates as fact values in your data model. Use the attribute label references as part of your delete statements.
Suppose you need to remove the fact records from the twitter
dataset from before January 1st, 1999:
In this case, you must reference the external date dimension in the format listed below:
DELETE FROM {attr.twitter.factsof} WHERE {tweetdate.date.yyyymmdd} < "1999-01-01";
The right-hand side of the condition (1999-01-01
) must use the same format as the values expressed in the date, or an error occurs. For more information on the supported date formats and their corresponding attribute labels, see Acquiring Object Identifiers for Workspace Metadata.
Basic Workflow Example
In this example, you can review the basic workflow required to remove data from a workspace. In this case, the example workspace for the Twitter Search API has been developed and published to the GoodData platform, resulting in an insight identifying tweets by author that contain the phrase gooddata
.
Suppose that you wish to remove the tweet references submitted from the official GoodData account, which would provide better information on secondary Twitter activity. The imaginary insight contains eight tweets from the GoodData
author.
- In a real workspace, you would be better served by building a custom MAQL metric that invokes a
WHERE
clause to remove references counts in whichAuthor=GoodData
. This is just for demonstration purposes. - This specific workspace is designed to load the most recent data, so each time the ETL is executed, it grabs a full data load of the most recent 100 tweets referencing the term
gooddata
. If you needed to replace all of the data in the workspace, you could simply execute a new ETL run. For more information on this workspace, see Analyzing Twitter Search.
Acquire identifiers for removing GoodData tweets
To build the MAQL statements, you must acquire the object identifiers for the following objects from the Twitter Search API project:
- Attribute “Facts Of” records for the dataset
- Attribute containing the
GoodData
value (TheAuthor
attribute) - Label of the attribute containing the
GoodData
value (TheAuthor2
attribute label)
For more information on how to acquire these object identifiers, see Acquiring Object Identifiers for Workspace Metadata.
Building the MAQL DML statements
After you have retrieved the identifiers for the attribute labels, you can build your MAQL DML statements. You must build and execute two MAQL DML statements.
- Delete data from the fact table.
- Delete data from the attribute value table.
It is important to remove data in the order listed above to maintain referential integrity. If you remove the values from the attribute value table first, then the facts still remain in the database, and any reports utilizing the facts, filtered by the attribute value, will contain (empty value)
.
MAQL to delete fact data from dataset
First, you must delete the fact data from the dataset. This step removes the records from the reporting data, so any reports that utilizes these facts are immediately updated. Build your fact deletion statement in the following form, modifying based on the specific values for your workspace’s dataset:
DELETE FROM {attr.twitter.factsof} WHERE {label.twitter.author.author2} = "GoodData";
Legend:
<attr.twitter.factsof>
- This reference corresponds to the identifier of the facts that pertain to the referenced attribute. For more information, see Acquiring Object Identifiers for an Attribute of Facts.<label.twitter.author.author2>
- This reference defines the attribute label of theAuthor
attribute, where attribute values are stored.You must reference an attribute label in theWHERE
clause.
MAQL to delete attribute value data from dataset
You must also build the statement to remove the attribute value from the attribute. If this step is not completed, the attribute value is available for selection when the report is filtered by attribute. If a user tries to apply the filter, the report is blank because there are no corresponding facts, which have already been deleted.
Build your fact deletion statement in the following form, modifying based on the specific values for your workspace’s dataset:
DELETE FROM {attr.twitter.author} WHERE {label.twitter.author.author2} = "GoodData";
Legend:
<attr.twitter.author>
- This reference identifies theAuthor
attribute, where the attribute values to remove are stored.<label.twitter.author.author2>
- This reference defines the attribute label of theAuthor
attribute, where attribute values are stored.
Executing the MAQL DML via API
Before you begin making changes to your workspace, you must authenticate to the GoodData platform using the login API and then acquire the temporary token to access resources. For more information, see Authentication via API.
To execute the MAQL DML statement, you can execute it through a number of interfaces. In this example, it is executed through API using a commonly available REST client.
For more information about available interfaces, see MAQL DML Interfaces.
In the JSON statement, the quote marks around GoodData
must be escaped (\"
).
After you have deleted the fact records, you should execute the MAQL DML command to remove the value (GoodData
, in this case) from the attribute label.
Verifying the change
If the API call is processed, the response includes a URI to the execution resource. You should review the URI to see if there were any errors in the status
message. A status code of OK
indicates that the deletion occurred.
Syntax errors should be reported in the REST client.
You should also verify that the data has been removed from the workspace. Note that there is no data from the GoodData
user.
Additional Documentation
There is significant overlap between MAQL DML and MAQL DQL. For more information on the basic syntax of MAQL DQL, see MAQL - Analytical Query Language.