How we used the PostgreSQL wire protocol to bring you searchable encryption

Toby Hede, Principal Engineer

CipherStash Proxy provides searchable encryption for PostgreSQL without the need to adapt application SQL. CipherStash Proxy automatically encrypts and decrypts data, and supports most query types over encrypted values.

When using CipherStash Proxy, applications can interact with encrypted data in a Postgres database as if it were plaintext. SQL statements are intercepted, parsed, type-checked, and mapped to the configuration specifying the types of encryption to use. You can read about rewriting SQL in part one of our series Rewriting SQL on the fly in CipherStash Proxy.

In order to make all of this smart magic work to abstract the complexity of searchable encryption, CipherStash Proxy has to handle the nuances of the PostgreSQL wire protocol.

What is the PostgreSQL wire protocol?

The PostgreSQL protocol describes the communication process between client (applications accessing the database) and server (the database).

As in all things computer, nothing is as simple as it seems; the protocol is actually two protocols:

  • Simple

  • Extended

Regardless of flavor, a stream of messages is sent over the network. The first byte of a message identifies the message type, and the next four bytes specify the length of the rest of the message. The body of a message depends on the message type (you can find details in the PostgreSQL Message formats reference).
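That framing can be sketched in a few lines of Python. This is an illustration of the layout only, not the Proxy's actual implementation:

```python
import struct

def frame(msg_type: bytes, body: bytes) -> bytes:
    # The length field counts itself (4 bytes) plus the body, but not the type byte.
    return msg_type + struct.pack(">I", len(body) + 4) + body

def unframe(data: bytes):
    msg_type = data[:1]
    (length,) = struct.unpack(">I", data[1:5])
    body = data[5:1 + length]
    return msg_type, body

# A simple protocol Query message: type 'Q', body is the NUL-terminated SQL string.
msg = frame(b"Q", b"SELECT 1\x00")
```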

The simple query protocol is, as the name suggests, simple: the SQL statement is sent as a single string. The PostgreSQL team can’t be accused of false advertising here.

-- first byte is `Q`, denoting a simple protocol query
-- the next four bytes are the message length: 51, i.e. \x00\x00\x00\x33
Q\x00\x00\x00\x33SELECT encrypted_email FROM users WHERE id = 1\x00

The SQL string must contain any dynamic variables or data as literal values. Applications need to build this SQL string themselves, interpolating, quoting, and escaping any data, and suddenly the simple protocol is not so simple at all. The whole thing is a bad idea in practice and a potential source of SQL injection attacks.
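To make the hazard concrete, here is the classic failure mode of hand-built SQL strings (a deliberately naive illustration):

```python
# Naive string building for the simple protocol: interpolating user input
# directly into the SQL string, with quoting done by hand.
def naive_sql(email: str) -> str:
    return "SELECT * FROM users WHERE email = '" + email + "'"

safe = naive_sql("alice@example.com")

# An attacker-controlled value breaks out of the quotes entirely,
# turning the WHERE clause into a tautology.
hostile = naive_sql("x' OR '1'='1")
```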

Injection attacks are the number 3 vulnerability in the OWASP Top 10 because dynamically building a SQL string is not as simple in practice as it sounds.

Unsurprisingly, the extended protocol is not simple, and extends the query process into multiple steps, with the core benefit of separating the SQL statement from parameter data.

In the extended protocol the statement execution process is broken into a series of steps, and multiple messages are sent by the client:

  • Parse - sends a (parameterized) SQL statement and an optional list of parameter data types

  • Describe - requests a description of the parameters and rows returned

  • Bind - sends values to be mapped to the statement parameters

  • Execute - executes the statement with the bound parameters values
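The four messages can be sketched as raw bytes. This is a simplified rendering of the layouts (unnamed statement, unnamed portal, text-format parameters); the PostgreSQL Message formats reference is the authoritative source:

```python
import struct

def cstr(s: str) -> bytes:
    # Protocol strings are NUL-terminated.
    return s.encode() + b"\x00"

def frame(msg_type: bytes, body: bytes) -> bytes:
    # The length field counts itself (4 bytes) plus the body, not the type byte.
    return msg_type + struct.pack(">I", len(body) + 4) + body

sql = "SELECT email FROM users WHERE id = $1"

# Parse: unnamed statement (""), the SQL, one parameter of unspecified type (OID 0).
parse = frame(b"P", cstr("") + cstr(sql) + struct.pack(">HI", 1, 0))

# Describe: request a description of the ('S' = statement) unnamed statement.
describe = frame(b"D", b"S" + cstr(""))

# Bind: unnamed portal, unnamed statement, one text-format parameter value.
value = b"1"
bind = frame(b"B", cstr("") + cstr("")
             + struct.pack(">H", 0)                   # 0 param format codes: all text
             + struct.pack(">H", 1)                   # 1 parameter follows
             + struct.pack(">I", len(value)) + value  # length-prefixed value
             + struct.pack(">H", 0))                  # 0 result format codes: all text

# Execute: unnamed portal, 0 = no row limit.
execute = frame(b"E", cstr("") + struct.pack(">I", 0))
```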

In general, any application using parameterized statements is using the extended protocol under the covers.

What about prepared statements?

This is where things become nuanced.

All parameterized statements use the extended protocol, but not all parameterized statements are prepared.

A prepared statement adds a client-specified name to the Parse step. A named statement lives for the length of the database session (as long as the client keeps the connection).

Once a named statement exists, the client can refer to the statement by name and skip the Parse step.

Fun fact: all statements using the extended protocol actually have a name. If the client does not supply one to make it an “officially” prepared statement, the name is the empty string “” and the statement is reset on every Parse.

The performance benefit of prepared statements is difficult to quantify. The SQL string is only transmitted once, so there is less network overhead. Query planning typically occurs on Bind as parameter values may impact the query plan, so internal caching of statements in the database is not guaranteed.
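A toy model of that statement lifetime, purely to illustrate the naming rules (how PostgreSQL stores statements internally is its own business):

```python
# Session-side bookkeeping: named statements survive for the length of the
# session; the unnamed statement ("") is replaced on every Parse.
session_statements = {}

def on_parse(name: str, sql: str):
    session_statements[name] = sql  # "" overwrites the previous unnamed statement

def on_bind(name: str, params):
    # Named statements skip re-sending the SQL: look it up by name.
    sql = session_statements[name]
    return sql, params

# Prepare once, then Bind/Execute by name for the rest of the session.
on_parse("get_user", "SELECT email FROM users WHERE id = $1")
sql, params = on_bind("get_user", ["42"])
```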

Where was I?

Anyway, all of this is very complex and I hate it.

Let’s have a look at how transparently securing data interacts with the protocol.

How CipherStash Proxy uses the PostgreSQL protocol

Encrypting a value is generally relatively simple. Input data is encrypted using a service like CipherStash ZeroKMS or AWS KMS, and stored in the database.

Searchable encryption is slightly more complicated. Searchable encryption works by encrypting the same input value in several different ways, based on the types of search required, not just the value itself.

For example, an email might be configured to enable a unique index on the encrypted data, as well as partial match search operations (e.g. finding all emails that match a partial input).

An encrypted column is defined in the database as a PostgreSQL type called encrypted (part of the CipherStash Encrypt Query Language and bundled with the Proxy). The type enables the Proxy to discover encrypted columns by interrogating the database schema. The type is a wrapper around PostgreSQL’s native jsonb type, allowing us to store many encrypted terms in a single column.

Additional configuration defines the source data type and the types of searchable index terms to generate on write. The source data type is required so that a decrypted value can be returned as the original type.

This all sounds complicated, but is simpler in practice.

Assume that we want to encrypt an email in the users table.

We define:

  • an encrypted column named email_encrypted

  • with a cast type of text

  • and match index term.
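Sketched as a data structure, that configuration might look something like this. The field names here are illustrative only, not CipherStash's actual configuration format:

```python
# Hypothetical shape of a per-column encryption configuration.
encrypt_config = {
    "table": "users",
    "column": "email_encrypted",
    "cast_as": "text",      # decrypted values are returned as the original type
    "indexes": ["match"],   # searchable index terms generated on write
}
```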

Given a SQL statement:

INSERT INTO users (email_encrypted) VALUES ($1);

The client application passes the $1 parameter as a regular string.

CipherStash Proxy uses the database schema to identify email_encrypted as an encrypted column and generates both the encrypted value and the match index term.

To the application, the column and parameter are a string, but internally, the data is encrypted.

Totally simple.

The steps for transparently encrypting and decrypting data map neatly onto the extended protocol:

  1. Parse and type-check the SQL to identify encrypted columns

  2. Encrypt all input data destined for an encrypted column

  3. Execute the SQL

  4. Decrypt all returned data sourced from an encrypted column
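The four steps above can be sketched end to end with stubbed crypto and a stubbed database round trip. The stub functions here are stand-ins, not the Proxy's real encryption or execution path:

```python
# Which columns need encryption, as discovered from the schema (step 1).
ENCRYPTED_COLUMNS = {"users": {"email_encrypted"}}

def encrypt(value: str) -> str:      # stand-in for real encryption
    return "enc:" + value

def decrypt(value: str) -> str:      # stand-in for real decryption
    return value[len("enc:"):]

def run_statement(sql: str, params):  # stand-in for the database round trip
    return [params]                   # echo the stored row back

def run(table: str, column: str, sql: str, params):
    # 1. Parse/type-check: is the target column encrypted?
    needs_crypto = column in ENCRYPTED_COLUMNS.get(table, set())
    # 2. Encrypt inbound parameter data.
    if needs_crypto:
        params = [encrypt(p) for p in params]
    # 3. Execute the SQL unchanged.
    rows = run_statement(sql, params)
    # 4. Decrypt outbound values.
    if needs_crypto:
        rows = [[decrypt(v) for v in row] for row in rows]
    return rows

rows = run("users", "email_encrypted",
           "INSERT INTO users (email_encrypted) VALUES ($1) RETURNING email_encrypted",
           ["alice@example.com"])
```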

The same steps apply to queries using the simple protocol, but it is much less interesting.

So what happens behind the scenes?

Parse

As we saw, the parse messages contain the SQL statement.

Proxy parses (surprise) the SQL statement and type-checks the statement against the database schema. Parameters and literals that require encryption are identified and mapped to their encrypt configuration. Encrypted columns in the “projection” (the returned list of columns in a SELECT statement) are also identified, because returned data needs to be decrypted.

If the statement contains literals that need encryption (because the data's sensitive), the encryption happens in this step. (But also, it's probably best not to use literals.)

A parse message can contain an optional list of data types, which is where things get a bit tricky. What does this actually mean?

Assume we have a users table with an int4 column appropriately named int4_column.

Given a parameterized statement:

INSERT INTO users (int4_column) VALUES ($1);

The expected type of the $1 parameter will be an int4.

However, the protocol enables a client to specify a different type for this parameter in the Parse message. The specified type needs to follow PostgreSQL conversion rules, and has to be something that PostgreSQL can convert to the target column type. So, for example, the client might specify that it will actually send an int2 instead of an int4. The conversion actually happens in Bind, and if the column is encrypted, CipherStash Proxy needs to handle this conversion.
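In the binary encoding, that declared type changes the width of the parameter bytes, so the declared OID drives the decode. A minimal sketch, using the well-known OIDs for int2 and int4:

```python
import struct

INT2_OID, INT4_OID = 21, 23  # built-in OIDs for int2 and int4

def decode_int(payload: bytes, declared_oid: int) -> int:
    # The Parse message may have declared int2 even though the target column
    # is int4: the payload is then 2 bytes wide and must be widened on decode.
    if declared_oid == INT2_OID:
        return struct.unpack(">h", payload)[0]
    if declared_oid == INT4_OID:
        return struct.unpack(">i", payload)[0]
    raise ValueError(f"unsupported OID {declared_oid}")
```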

Describe

A describe message requests a description of the parameters and rows returned (if any) of a SQL statement.

Describe is where the magic really begins. And by magic I mean strategic misrepresentation.

Going back to our earlier users table with an encrypted column named email_encrypted configured as an encrypted text.

Given the SQL statement:

INSERT INTO users (email_encrypted) VALUES ($1);

The actual type of this column is our encrypted type containing jsonb.

On Describe, CipherStash Proxy reports the parameter type as the cast type, text. To the client, the column is text and the email value can be sent as a text parameter.

A similar process happens with the types of columns in any rows returned. Any encrypted columns are described as the cast type and not the actual column type.
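The rewrite amounts to swapping type OIDs in the description before it reaches the client. A sketch of the idea, where the OID of the encrypted type is hypothetical (in reality it is assigned when the type is installed):

```python
TEXT_OID = 25            # PostgreSQL's built-in text type
ENCRYPTED_OID = 99999    # hypothetical OID of the installed `encrypted` type

def rewrite_row_description(fields):
    # Each field is (column_name, type_oid); report encrypted columns as text.
    return [(name, TEXT_OID if oid == ENCRYPTED_OID else oid)
            for name, oid in fields]

fields = rewrite_row_description([("id", 23), ("email_encrypted", ENCRYPTED_OID)])
```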

Bind

A Bind message passes any required statement parameters in preparation for execution.

Bind is where encryption happens.

If a parameter references an encrypted column, the data is encrypted, transformed into the encrypted type representation, and passed through to the database to be stored as jsonb.

Of course, reality is slightly more complicated. We’ve already seen that the protocol enables a different type to be set in Parse (for example sending an int2 instead of an int4) and we need to handle this conversion. For various reasons, the extended protocol also has two different encoding formats (text and binary). Parameters need to be decoded using the appropriate approach before being encrypted.
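The two encodings for the same value look nothing alike on the wire, which is why the Bind format codes matter. A minimal sketch for an int4 parameter:

```python
import struct

TEXT_FORMAT, BINARY_FORMAT = 0, 1  # Bind format codes

def decode_int4_param(payload: bytes, format_code: int) -> int:
    # Format code 0: the value arrives as its text rendering, e.g. b"42".
    if format_code == TEXT_FORMAT:
        return int(payload.decode())
    # Format code 1: the value arrives as a 4-byte big-endian integer.
    if format_code == BINARY_FORMAT:
        return struct.unpack(">i", payload)[0]
    raise ValueError(f"unknown format code {format_code}")
```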

Execute

Once we have a parsed statement and bound any parameters, the statement can be executed.

The database takes the SQL and parameters and does the work. From the PostgreSQL perspective this is a totally vanilla SQL statement. An encrypted column expects jsonb, CipherStash Proxy ensures that the encrypted data is jsonb, and things just work (tm).

INSERT INTO users (email_encrypted) VALUES ('{...}');

More work is required for statements that return data.

The response to a SELECT statement is a RowDescription message followed by zero or more DataRow messages, each representing a single returned record. Proxy decrypts the data and encodes it into the expected type.

Going back to our earlier example schema:

SELECT email_encrypted FROM users;

The email_encrypted data is decrypted back into the original source text and passed through.

From the client perspective, the fact that email_encrypted is jsonb containing one or more index terms is completely hidden. The column is text at read time, but encrypted and secure inside the database.

What next?

As you can see, CipherStash Proxy uses statement type checking, statement mapping, and the PostgreSQL protocol to provide searchable encryption for PostgreSQL without you needing to adapt any SQL in your application.

Try out CipherStash Proxy yourself, or get in touch to find out more — we love to nerd out about protecting data with encryption.

Start protecting your data

Get started by creating a free account and choosing your integration path, or get in touch to learn more.