Filter delete operations with Amazon Web Services DMS

by Deepthi Saina and Vivekananda Mohapatra

Amazon Web Services Database Migration Service (Amazon Web Services DMS) is a fully managed service that helps you migrate databases to Amazon Web Services quickly and securely. Every customer’s use cases are unique; you can use Amazon Web Services DMS not only for a one-time data migration solution but also to replicate the data as per the requirements of your downstream applications.

In this post, we demonstrate two options to configure Amazon Web Services DMS to filter deletes when replicating data to the target database. The first strategy uses Amazon Simple Storage Service (Amazon S3), and the second strategy uses Amazon Web Services DMS transformation rules to flag delete operations during replication. This configuration allows you to selectively retain deleted data in the target database for auditing purposes, even when the source undergoes regular cleanup operations.

Use Amazon S3 as a staging environment

One way to filter deletes with Amazon Web Services DMS is to use Amazon S3 as a staging environment. Amazon Web Services DMS migrates inserts and updates from the source database to an S3 bucket and ignores deletes. You can use this approach for any supported source or target endpoints in Amazon Web Services DMS. Review the limitations for the chosen endpoints before starting the migration.

As shown in the following diagram, Amazon S3 acts as an intermediate system that filters out the deletes from applying to the final target by using the cdcInsertsAndUpdates endpoint setting available with Amazon S3 as a target.

Amazon Web Services Direct Connect is an optional but recommended component for improved security. Amazon Web Services Direct Connect is a networking service that provides an alternative to using the internet to connect to Amazon Web Services. Using Amazon Web Services Direct Connect, data that would have previously been transported over the internet is delivered through a private network connection between your facilities and Amazon Web Services.

When you use the cdcInsertsAndUpdates S3 target endpoint setting, an additional first field is added, which indicates whether the row was inserted (I) or updated (U) at the source database. How Amazon Web Services DMS creates and sets this field depends on the migration task type and the settings of includeOpForFullLoad, cdcInsertsOnly, and cdcInsertsAndUpdates.

If our Amazon Web Services DMS task performs both a full load and change data capture (CDC), we must add the includeOpForFullLoad parameter to our target S3 endpoint settings. Amazon Web Services DMS creates an additional field in our load files, which ensures that the columns in the load and CDC files are consistent.
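As a sketch, the relevant S3 target endpoint settings could look like the following (the bucket name, folder, and role ARN are placeholders, not values from this post):

```
{
  "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-role",
  "BucketName": "dms-staging-bucket",
  "BucketFolder": "test",
  "IncludeOpForFullLoad": true,
  "CdcInsertsAndUpdates": true
}
```

With these settings, every row in the full-load files carries I in the first field, and CDC files carry I or U; delete records are never written to the staging bucket, so the load and CDC files keep a consistent column layout.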

Without this parameter, the load and CDC files will have different numbers of columns, which can cause the following error if you try to use the data from both files together:

[SOURCE_UNLOAD   ]E:  Failed to write record id: 2, Number of values: 6 is not equal to number of columns: 5. [1020417]  (file_unload.c:477)

For example, let’s say we have the following data in our source table:

+----+----------------+-----+--------------------+-------------------------------+
| id | name           | age | email              | address                       |
+----+----------------+-----+--------------------+-------------------------------+
|  1 | John Doe       |  30 | john@example.com   | 123 Any Street, Any Town, USA |
|  2 | Jane Smith     |  25 | jane@example.com   | 456 Any Street, Any Town, USA |
|  3 | Carlos Salazar |  40 | carlos@example.com | 789 Any Street, Any Town, USA |
+----+----------------+-----+--------------------+-------------------------------+

Now, let’s say we perform the following DML operations on this table:

-- Update a row
UPDATE my_table SET age = 34 WHERE id = 1;
Query OK, 1 row affected (0.090 sec)
Rows matched: 1  Changed: 1  Warnings: 0

-- Delete a row
DELETE FROM my_table WHERE id = 2;
Query OK, 1 row affected (0.088 sec)

-- Insert another row
INSERT INTO my_table (id, name, age, email, address)
VALUES (5, 'Mary Major', 35, 'mary@example.com', '231 Main Street, Anytown, USA');

After performing these operations, the data present in our table is as follows:

+----+----------------+-----+--------------------+-------------------------------+
| id | name           | age | email              | address                       |
+----+----------------+-----+--------------------+-------------------------------+
|  1 | John Doe       |  34 | john@example.com   | 123 Any Street, Any Town, USA |
|  3 | Carlos Salazar |  40 | carlos@example.com | 789 Any Street, Any Town, USA |
|  5 | Mary Major     |  35 | mary@example.com   | 231 Main Street, Anytown, USA |
+----+----------------+-----+--------------------+-------------------------------+

The following is the Amazon Web Services DMS output in the S3 staging bucket when the cdcInsertsAndUpdates parameter is used:

U 1 John Doe 34 john@example.com 123 Any Street, Any Town, USA

I 5 Mary Major 35 mary@example.com 231 Main Street, Anytown, USA

We can observe that only the insert and the update were captured, and the delete was filtered out, when the cdcInsertsAndUpdates parameter was used. This parameter tells Amazon Web Services DMS to capture only inserted and updated change data and to filter out deleted change data.

Now we use this intermediate S3 bucket as a source and create another Amazon Web Services DMS task between Amazon S3 and the target database.

When configuring Amazon S3 as a source, we also need to provide a JSON file with the external table definition so that Amazon Web Services DMS can replicate the data correctly. The following is the example external table definition that we use:

{
    "TableCount": "1",
    "Tables": [
        {
            "TableName": "my_table",
            "TablePath": "test/my_table",
            "TableOwner": "test",
            "TableColumns": [
                {
                    "ColumnName": "operation_indicator",
                    "ColumnType": "STRING",
                    "ColumnNullable": "false",
                    "ColumnLength": 10
                },
                {
                    "ColumnName": "id",
                    "ColumnType": "INT8",
                    "ColumnNullable": "false",
                    "ColumnIsPk": "true"
                },
                {
                    "ColumnName": "name",
                    "ColumnType": "STRING",
                    "ColumnLength": "20"
                },
                {
                    "ColumnName": "age",
                    "ColumnType": "INT8"
                },
                {
                    "ColumnName": "email",
                    "ColumnType": "STRING",
                    "ColumnLength": "100"
                },
                {
                    "ColumnName": "address",
                    "ColumnType": "STRING",
                    "ColumnLength": "50"
                }
            ],
            "TableColumnsTotal": "6"
        }
    ]
}
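To wire this up, the S3 source endpoint for the second task references this table definition. The following is a hedged sketch of the relevant source endpoint settings; the bucket name, folder, CDC path, and role ARN are placeholders, and ExternalTableDefinition takes the table definition JSON escaped as a single string:

```
{
  "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-role",
  "BucketName": "dms-staging-bucket",
  "BucketFolder": "test",
  "CdcPath": "cdc",
  "ExternalTableDefinition": "<the table definition JSON, escaped as one string>"
}
```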

The target database shows that the deletes were not applied, and the record with ID 2 still exists:

+---------------------+----+----------------+-----+--------------------+-------------------------------+
| operation_indicator | id | name           | age | email              | address                       |
+---------------------+----+----------------+-----+--------------------+-------------------------------+
| U                   |  1 | John Doe       |  34 | john@example.com   | 123 Any Street, Any Town, USA |
| I                   |  2 | Jane Smith     |  25 | jane@example.com   | 456 Any Street, Any Town, USA |
| I                   |  3 | Carlos Salazar |  40 | carlos@example.com | 789 Any Street, Any Town, USA |
| I                   |  5 | Mary Major     |  35 | mary@example.com   | 231 Main Street, Anytown, USA |
+---------------------+----+----------------+-----+--------------------+-------------------------------+

By using Amazon S3 as a staging environment to filter DML operations, you can selectively capture only the changes that you want to replicate. For example, if we want to replicate only inserts, we can use the cdcInsertsOnly setting. This approach lets us retain the deleted data on the target for auditing purposes even after the records are deleted at the source.

Use an Amazon Web Services DMS transformation rule

As shown in the following diagram, Amazon Web Services DMS migrates and replicates the data from the on-premises data center, and its built-in transformation rule functionality flags the delete operations instead of applying them to the target.


Using the operation_indicator function makes sure that target records are flagged with a DML operation identifier in an extra column of the target table. To add this extra column, include the following transformation rule in the table mappings section of the Amazon Web Services DMS task. With this rule, records in the target table are flagged with the user-provided values specified in the expression parameter of the transformation rule. A prerequisite for using this functionality is that both the source and target tables have a primary key in place.

{
  "rule-type": "transformation",
  "rule-id": "2",
  "rule-name": "2",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "operation_indicator",
  "expression": "operation_indicator('D', 'U', 'I')",
  "data-type": {
    "type": "string",
    "length": 50
  }
}

This allows you to keep the record in the target even though it has been deleted in the source. The record deleted at the source is flagged in the target with the user-provided value D, as specified in the preceding transformation rule.

To test this functionality, we use the same table metadata and the DML transactions mentioned in the beginning of the post. The final output at the target table with the preceding activities is as follows:

+----+----------------+-----+--------------------+-------------------------------+---------------------+
| id | name           | age | email              | address                       | operation_indicator |
+----+----------------+-----+--------------------+-------------------------------+---------------------+
|  3 | Carlos Salazar |  40 | carlos@example.com | 789 Any Street, Any Town, USA |                     |
|  2 | Jane Smith     |  25 | jane@example.com   | 456 Any Street, Any Town, USA | D                   |
|  1 | John Doe       |  34 | john@example.com   | 123 Any Street, Any Town, USA | U                   |
|  5 | Mary Major     |  35 | mary@example.com   | 231 Main Street, Anytown, USA | I                   |
+----+----------------+-----+--------------------+-------------------------------+---------------------+

The record with ID 2 is still intact in the target, with a value of D in the operation_indicator column, as specified in the transformation rule. We can also view the table contents without deletes by running a query or creating a view with the condition operation_indicator <> 'D'.
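As a sketch, such a view could look like the following; my_table, the view name, and the column list follow the earlier examples in this post, and full-load rows with a blank or null indicator are kept as well:

```sql
-- Hypothetical view that hides soft-deleted rows on the target
CREATE VIEW my_table_active AS
SELECT id, name, age, email, address
FROM my_table
WHERE operation_indicator <> 'D'
   OR operation_indicator IS NULL;
```

Downstream applications can read from my_table_active, while auditors can query the base table directly to see the flagged deletes.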

Conclusion

In this post, we demonstrated two approaches to filtering out delete operations from the source database by using the built-in functionality of Amazon Web Services DMS. Depending on the data requirements at the target database, you can use either approach. Using Amazon S3 as a target adds the cost of intermediate data storage, but it can be useful if multiple consumers need to process different subsets of the data from Amazon S3. The operation_indicator function in a transformation rule is a good option when there is a single source-to-target migration or replication path.

Check out Database Migration—What Do You Need to Know Before You Start? to get started. Also review the recommended best practices associated with Amazon Web Services DMS and other S3 endpoint settings to address various use cases.


About the authors

Deepthi Saina is a Data Architect at Amazon Web Services based in the UK. She is a subject matter expert in Amazon RDS, Amazon Aurora, and Amazon Web Services DMS. In her role, she helps customers across EMEA migrate, modernize, and optimize database solutions on Amazon Web Services.

Vivekananda Mohapatra is a Lead Consultant with the Amazon Web Services ProServe team. He has deep expertise in database development and administration on Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, and Amazon Redshift databases. He is also a subject matter expert in Amazon Web Services DMS. He works closely with customers to help migrate and modernize their databases and applications to Amazon Web Services on a daily basis.