How-to guides

Migrating your plaintext data to be fully encrypted

This how to guide explains how to migrate your plaintext data to be fully encrypted with CipherStash. It also explains the different encryption modes, and how they work. Finally, it explains how to clean up your plaintext data.

Introduction

You have plaintext data you want to protect. CipherStash helps you encrypt your data in your existing database, while keeping it searchable.

But what are the steps you need to follow to go from plaintext to fully encrypted? CipherStash provides a way to do this without any application downtime, while keeping the encrypted and plaintext field values consistent.

CipherStash uses encryption modes to incrementally migrate your plaintext to ciphertext

Say you have a patients table in your application that has an email column you want to protect. All the values in that email column are currently in plaintext. To encrypt that email column, you need to create two additional encrypted columns:

  1. A "source" column, which contains the encrypted value from an AES-GCM encryption operation
  2. An "index" column, which contains an encrypted index that can be used to search and order the encrypted value

The CipherStash driver can operate in four modes:

  1. plaintext — the plaintext column is used for all CRUD operations.
  2. plaintext-duplicate — the plaintext column is used for all CRUD operations, but INSERTs and UPDATEs are also written to the encrypted columns.
  3. encrypted-duplicate — the encrypted columns are used for all CRUD operations, but INSERTs and UPDATEs are also written to the plaintext column.
  4. encrypted — the encrypted columns are used for all CRUD operations.

While it's possible to do a knife-switch cutover from plaintext to encrypted modes, we recommend CipherStash users migrate their data by stepping through the modes incrementally. This builds confidence in the performance and availability of the encryption, and gives you opportunity to roll back.

This also allows for zero-downtime migrations.

Understanding the different encryption modes

When integrating CipherStash into your application, you create a dataset config which describes which fields you want to encrypt, what encrypted indexes to enable, and what encryption mode to use for each field.

The following is an example of a dataset config with a single field configured:

1# dataset.yml
2tables:
3  - path: patients
4    fields:
5      - name: email
6        in_place: false
7        cast_type: utf8-str
8        mode: plaintext-duplicate
9        indexes:
10          - version: 1
11            kind: match
12            tokenizer:
13              kind: ngram
14              token_length: 3
15            token_filters:
16              - kind: downcase
17            k: 6
18            m: 2048
19            include_original: true
20          - version: 1
21            kind: ore
22          - version: 1
23            kind: unique
24

From the config, the field email of the type utf8-str is configured with match, ore and unique indexes. This field should be represented by the following database columns:

columndescription
emailplaintext source value for email
__email_encryptedencrypted source value for email
__email_matchencrypted match index for email
__email_oreencrypted ore index for email
__email_uniqueencrypted unique index for email

The encryption mode for this field is set to plaintext-duplicate.

Each field's encryption mode can have one of the following values:

  • plaintext-duplicate
  • encrypted-duplicate
  • encrypted

With CipherStash, the driver is responsible for database reads, writes, query mapping, encryption, and decryption. Each encryption mode affects how the driver behaves.

Encryption Mode: plaintext-duplicate

columndescriptionoperations
emailplaintext source value for emailreads, writes, lookups
__email_encryptedencrypted source value for emailwrites
__email_matchencrypted match index for emailwrites
__email_oreencrypted ore index for emailwrites
__email_uniqueencrypted unique index for emailwrites

In plaintext-duplicate mode, the driver uses the plaintext source column for all read operations. However, the driver writes to the plaintext source, encrypted source, and the encrypted index columns.

plaintext-duplicate-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in plaintext-duplicate mode

The driver handles SELECT statements and WHERE clauses as follows:

  • When selecting for the field's value, the driver retrieves the plaintext value from the plaintext column directly.
  • When the field is used as a filter in a WHERE clause, the driver will pass that through to the database.
  • In plaintext-duplicate mode, the driver does not use the encrypted source and index columns for read operations.

This works in the same way as plain SQL without the CipherStash driver.

plaintext-duplicate-writes

Figure: Mapping of INSERT and UPDATE statements for a field in plaintext-duplicate mode

The driver handles DB INSERT and UPDATE statements as follows. When inserting a new row, or updating the field, the driver will encrypt the field's value and encrypted index. The driver will then write the plaintext value, the encrypted value, and the encrypted index values into each column respectively.

This encryption mode is used when first setting up CipherStash on an existing plaintext field. When initially adding the encrypted source column, and the encrypted indexes to the database their values have not been populated yet. This mode allows the application to keep working with the plaintext values during reads, and incrementally write the encrypted values, while also allowing you to bulk encrypt every record in the database in the background without causing application downtime.

Encryption Mode: encrypted-duplicate

In the following example, the encryption mode on the dob field is set to encrypted-duplicate

1# dataset.yml
2tables:
3  - path: patients
4    fields:
5      - name: email
6        in_place: false
7        cast_type: utf8-str
8        mode: encrypted-duplicate
9        indexes:
10          - version: 1
11            kind: match
12            tokenizer:
13              kind: ngram
14              token_length: 3
15            token_filters:
16              - kind: downcase
17            k: 6
18            m: 2048
19            include_original: true
20          - version: 1
21            kind: ore
22          - version: 1
23            kind: unique
24
columndescriptionoperations
emailplaintext source value for emailwrites
__email_encryptedencrypted source value for emailreads, writes
__email_matchencrypted match index for emailwrites, match lookups (LIKE)
__email_oreencrypted ore index for emailwrites, ORE lookups (<, <=, >, >=)
__email_uniqueencrypted unique index for emailwrites, exact lookups (=)

In encrypted-duplicate mode, the CipherStash driver:

  • Uses the encrypted source column for retrieving the field value.
  • Uses the encrypted index columns for lookups by this field.
  • Writes to the plaintext source column, encrypted source column, and all index columns, in the same way as in the plaintext-duplicate mode.

encrypted-duplicate-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in encrypted-duplicate mode

The CipherStash driver handles SELECT statements and WHERE clauses as follows:

  • When selecting for the field's value, the driver retrieves the ciphertext from the encrypted source column, and decrypts it.
  • When the field is used as a filter in a WHERE clause, the driver will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.

To show how this works in practice, per the diagram above:

  • The clause email LIKE 'user%' is mapped to __email_match @> xxx, where xxx is the encrypted match term.
  • The clause email >= '[email protected]' is mapped to __email_ore > xxx, where xxx is the encrypted ore term.
  • The clause email = '[email protected]' is mapped to __email_unique = xxx, where xxx is the encrypted exact term.

Because this relies on the encrypted column and the encrypted index columns, the query will not work correctly if the table contains rows that have not been encrypted. This applies to rows where the encrypted source column and index columns are NULL. Such rows might incorrectly appear with NULL values, or might not appear in queries that filter based on the encrypted index.

Additionally, queries can fail if the required encrypted index columns are not configured. In the example above, supposed that the match index was left out, if you try to do a LIKE query on this field, the CipherStash driver will fail to map the query because a match index is required.

encrypted-duplicate-write

Figure: Mapping of INSERT and UPDATE statements for a field in encrypted-duplicate mode

In encrypted-duplicate mode, the CipherStash driver handles DB INSERT and UPDATE statements similar to plaintext-duplicate mode. When inserting a new row, or updating the field, the driver will encrypt the field's value and encrypted index. The driver will then insert the plaintext value, the encrypted value, and the encrypted index values into each column respectively.

The encrypted-duplicate mode is used when all records in the table have been fully encrypted, and there are no more encrypted source column or index columns with NULL values. This mode makes the application operate fully on encrypted reads and writes, while still keeping the value in the plaintext column consistent and updated. This allows for the application to be tested extensively with fully encrypted read-writes. If there is a bug or broken functionality discovered at this stage, the application can easily be reverted to the plaintext-duplicate mode.

Encryption mode: encrypted

In the following example, the field dob encryption mode is set to encrypted:

1# dataset.yml
2tables:
3  - path: patients
4    fields:
5      - name: email
6        in_place: false
7        cast_type: utf8-str
8        mode: encrypted
9        indexes:
10          - version: 1
11            kind: match
12            tokenizer:
13              kind: ngram
14              token_length: 3
15            token_filters:
16              - kind: downcase
17            k: 6
18            m: 2048
19            include_original: true
20          - version: 1
21            kind: ore
22          - version: 1
23            kind: unique
24
columndescriptionoperations
__email_encryptedencrypted source value for emailreads, writes
__email_matchencrypted match index for emailwrites, match lookups (LIKE)
__email_oreencrypted ore index for emailwrites, ore lookups (<, <=, >, >=)
__email_uniqueencrypted unique index for emailwrites, exact lookups (=)

In encrypted mode, the CipherStash driver handles reads and writes similar to encrypted-duplicate mode:

  • The driver uses the encrypted source column for retrieving the field value.
  • The driver uses the encrypted index columns for query filters.
  • The driver writes to the encrypted source and the encrypted index columns.

The only difference between the encrypted and encrypted-duplicate mode is that in the encrypted mode, the driver no longer writes to the plaintext column.

encrypted-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in encrypted mode

The driver handles SELECT statements and WHERE clauses similar to encrypted-duplicate mode. When selecting for the field's value, the driver retrieves the ciphertext from the encrypted source column, and decrypts it. When the field is used as a filter in a WHERE clause, the driver will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.

encrypted-write

Figure: Mapping of INSERT and UPDATE statements for a field in encrypted mode

In this mode, the driver handles database INSERT and UPDATE statements similar to plaintext-duplicate and encrypted-duplicate modes. However, the driver no longer writes into the plaintext column.

When inserting a new row, or updating the field, the driver will encrypt the field's value and encrypted index. The driver will then write the encrypted value and the encrypted index values into each column respectively.

This mode is used when the application has been verified to work correctly with CipherStash in encrypted-duplicate mode. When encrypted mode is used, the plaintext column is no longer used, and can be safely removed from the table.

Migrating a field from plaintext-duplicate mode to encrypted

Prerequisites

Before starting, please ensure that you:

  1. Have already created a CipherStash account (you can do this with the stash signup command)
  2. Have already installed Stash CLI in your local development environment
  3. Are connected to the Internet
  4. Have an application that integrates with CipherStash. You can use our example application to get started.
  5. Have an existing plaintext field that has been configured in plaintext-duplicate mode.

For this example, let's call this field email on the users table.

Example: migrating users.email from plaintext-duplicate to encrypted

For this example, let's assume that you have a field email on the table users, that have been configured in plaintext-duplicate mode. This field has the following configuration in your dataset config file.

1# dataset.yml
2tables:
3  # other tables and fields above here
4  - path: users
5    fields:
6      - name: email
7        in_place: false
8        cast_type: utf8-str
9        mode: plaintext-duplicate
10        indexes:
11          - version: 1
12            kind: match
13            tokenizer:
14              kind: ngram
15              token_length: 3
16            token_filters:
17              - kind: downcase
18            k: 6
19            m: 2048
20            include_original: true
21          - version: 1
22            kind: ore
23          - version: 1
24            kind: unique
25

We will also assume that the following columns have been created in your database:

  • email
  • __email_encrypted
  • __email_match
  • __email_ore
  • __email_unique

Encrypting all the records

In plaintext-duplicate mode, your application will write to all plaintext, encrypted, and index columns, while only reading from the plaintext column. Because of this, the application will work without issues even when retrieving records that have not been encrypted yet.

However, the application will not function correctly if it is switched to one of the other modes (plaintext-duplicate, encrypted-duplicate, or encrypted) at this stage. Before switching this field to either encrypted-duplicate or encrypted mode, we need to encrypt all records in the table.

The steps to take depends on which type of application you are working with.

Migrating to encrypted mode

At this point, all the records on the table have been encrypted. You have also verified that your application still works when you switch the users.email field to encrypted-duplicate mode. Your application is now ready to switch this field to encrypted mode.

As before, update the encryption mode in the dataset config file to encrypted.

1# dataset.yml
2tables:
3  # snip
4  - path: users
5    fields:
6      - name: email
7        in_place: false
8        cast_type: utf8-str
9        mode: encrypted # updated
10# snip
11

Next, upload the new dataset config to CipherStash.

1stash datasets config upload --file dataset.yml --client-id $CS_CLIENT_ID --client-key $CS_CLIENT_KEY
2

Finally, restart your application.

Once your application restarts, it will no longer write any values to the plaintext column you started with. All reads and writes should only depend on the encrypted source column and the encrypted index columns.

Cleaning up the plaintext column

At this point, it should be safe to drop the plaintext column from the database. The steps will depend on how database migrations are handled in your application.

Previous
Creating access keys