Securing AI and RAG pipelines

Protect sensitive data in AI retrieval-augmented generation pipelines with encrypted vector storage and searchable encryption

Retrieval-Augmented Generation (RAG) pipelines commonly store sensitive documents alongside vector embeddings. Without encryption, this data is exposed at rest and during retrieval, creating a significant attack surface. CipherStash lets you encrypt sensitive content while preserving the ability to search and retrieve it.

The problem

RAG architectures typically store:

Document chunks: the original text, often containing PII, financial data, or confidential business information
Metadata: source references, user associations, access tags
Vector embeddings: numeric representations used for similarity search

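If document chunks or metadata are stored in plaintext and exfiltrated from the database, their contents are immediately exposed. Encryption at rest alone does not help, because the database transparently decrypts data as soon as it is queried, so anyone with query access sees plaintext.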

Encrypting RAG context data

Use the Encryption SDK to encrypt sensitive fields before storing them alongside your embeddings.

Define a schema for your documents

import { encryptedTable, encryptedColumn } from "@cipherstash/stack/schema"

export const documents = encryptedTable("documents", {
  content: encryptedColumn("content")
    .freeTextSearch(),
  source: encryptedColumn("source")
    .equality(),
  userId: encryptedColumn("user_id")
    .equality(),
})

Encrypt before storage

import { Encryption } from "@cipherstash/stack"
import { documents } from "./schema"

const client = await Encryption({ schemas: [documents] })

async function ingestDocument(doc: { content: string; source: string; userId: string; embedding: number[] }) {
  const encryptedContent = await client.encrypt(doc.content, {
    column: documents.content,
    table: documents,
  })

  const encryptedSource = await client.encrypt(doc.source, {
    column: documents.source,
    table: documents,
  })

  const encryptedUserId = await client.encrypt(doc.userId, {
    column: documents.userId,
    table: documents,
  })

  if (encryptedContent.failure || encryptedSource.failure || encryptedUserId.failure) {
    throw new Error("Encryption failed")
  }

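  // Store encrypted fields alongside the vector embedding.
  // `db` is assumed to be a connected PostgreSQL client (for example a `pg` Pool) created elsewhere.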
  await db.query(
    `INSERT INTO documents (content, source, user_id, embedding)
     VALUES ($1::jsonb, $2::jsonb, $3::jsonb, $4)`,
    [encryptedContent.data, encryptedSource.data, encryptedUserId.data, JSON.stringify(doc.embedding)]
  )
}
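
The three encrypt calls in the ingestion example share a single failure check, which hides which field failed. A small helper can report the field by name. This is a minimal sketch rather than part of the SDK: `EncryptResult` and `unwrapOrThrow` are hypothetical names, and the result shape (an optional `failure` with a `message`, plus `data`) is an assumption inferred from how the example reads `.failure` and `.data`.

```typescript
// Hypothetical result shape, inferred from the `.failure` / `.data` checks
// in the ingestion example; the real SDK types may differ.
type EncryptResult<T> = {
  data?: T
  failure?: { message: string }
}

// Returns the encrypted payload, or throws an error naming the field that failed.
function unwrapOrThrow<T>(field: string, result: EncryptResult<T>): T {
  if (result.failure) {
    throw new Error(`Encryption failed for "${field}": ${result.failure.message}`)
  }
  if (result.data === undefined) {
    throw new Error(`Encryption returned no data for "${field}"`)
  }
  return result.data
}
```

Under these assumptions, `unwrapOrThrow("content", encryptedContent)` would replace the combined `if` check and yield the value to pass to the `INSERT`.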

Decrypt retrieved context

After vector similarity search retrieves relevant documents, decrypt the content before passing it to the LLM:

async function retrieveContext(queryEmbedding: number[], topK: number = 5) {
  // Vector similarity search returns encrypted rows.
  // `<->` is a distance operator (assuming the pgvector extension is installed).
  const results = await db.query(
    `SELECT content, source FROM documents
     ORDER BY embedding <-> $1
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), topK]
  )

  // Decrypt the content for each result
  const decryptedDocs = await Promise.all(
    results.rows.map(async (row) => {
      const content = await client.decrypt(row.content)
      const source = await client.decrypt(row.source)
      return {
        content: content.failure ? null : content.data,
        source: source.failure ? null : source.data,
      }
    })
  )

  return decryptedDocs.filter((doc) => doc.content !== null)
}
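
Once the chunks are decrypted, they still need to be assembled into the context handed to the LLM. The sketch below shows one way to do that while capping how much plaintext leaves the application. `ContextDoc` and `buildContext` are hypothetical names, the decrypted `content` is assumed to be a string, and the character budget is an illustrative stand-in for whatever token limit your model imposes.

```typescript
// Assumed shape of the documents returned by `retrieveContext`
// after rows with failed content decryption have been filtered out.
type ContextDoc = { content: string; source: string | null }

// Joins decrypted chunks into one context string. Stops at the first chunk
// that would push the total past `maxChars`. The budget counts the labelled
// chunk text only, not the blank-line separators between chunks.
function buildContext(docs: ContextDoc[], maxChars: number): string {
  const parts: string[] = []
  let used = 0
  for (const doc of docs) {
    const label = doc.source ?? "unknown source"
    const part = `[${label}]\n${doc.content}`
    if (used + part.length > maxChars) break
    parts.push(part)
    used += part.length
  }
  return parts.join("\n\n")
}
```

Because results arrive ordered by similarity, stopping at the first chunk that exceeds the budget keeps the most relevant chunks and leaves the rest out of the prompt.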

Searchable encrypted retrieval

When you need to filter documents by metadata before or alongside vector search, use searchable encryption with EQL:

-- Find documents for a specific user using encrypted equality search
SELECT content, source, embedding
FROM documents
WHERE eql_v2.eq(user_id, $1)
ORDER BY embedding <-> $2
LIMIT 10;

This combines encrypted metadata filtering with vector similarity, without ever decrypting the metadata in the database.
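
From application code, the same query can be issued as a parameterized statement. The sketch below only builds the statement and its parameters. `buildUserScopedQuery` and `ParameterizedQuery` are hypothetical names, and the encrypted search term for `user_id` is taken as an input because this section does not show how it is produced. How the term must be serialized for `eql_v2.eq` is likewise an assumption to check against the EQL documentation.

```typescript
// Statement text plus positional parameters, in the shape accepted by
// PostgreSQL clients such as `pg` (`db.query(text, values)`).
type ParameterizedQuery = { text: string; values: unknown[] }

function buildUserScopedQuery(
  encryptedUserId: unknown, // encrypted search term for user_id (assumed to come from the SDK)
  queryEmbedding: number[],
  topK: number = 10,
): ParameterizedQuery {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new RangeError("topK must be a positive integer")
  }
  return {
    text: `SELECT content, source, embedding
           FROM documents
           WHERE eql_v2.eq(user_id, $1)
           ORDER BY embedding <-> $2
           LIMIT $3`,
    // The embedding is serialized the same way as in the ingestion example.
    values: [encryptedUserId, JSON.stringify(queryEmbedding), topK],
  }
}
```

The returned object could then be passed to the database client, for example `db.query(q.text, q.values)`, and the resulting rows decrypted as in `retrieveContext`.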

Benefits for AI pipelines

Sensitive context stays encrypted: document chunks containing PII or confidential data are never stored in plaintext
Compliance-ready: encrypted storage helps meet GDPR, HIPAA, and SOC 2 data protection requirements
Selective decryption: only decrypt what the LLM needs, reducing exposure surface
Audit trail: track who retrieved which documents, and when, using identity-aware encryption