Multi-Tenant SaaS Architecture: The Decisions That Matter Most at Scale

We've architected multi-tenant platforms for HR tech, vertical CRM, and healthcare ops - from 50 tenants on a single RDS instance to 14,000 tenants on sharded Postgres with per-tenant encryption keys. The mistakes are predictable; the right tradeoffs depend on compliance tier, noisy-neighbor tolerance, and how often you onboard enterprise customers who demand isolated infrastructure.

Tenancy models: a spectrum, not a religion

Shared database, shared schema + tenant_id column - lowest ops cost, highest blast radius
Shared database, schema-per-tenant - migration pain, moderate isolation
Database-per-tenant - strongest isolation, highest cost and automation requirement
Cell-based - pools of tenants per cluster; enterprise tenants get dedicated cells

What we recommend at Vextrosys

Default to shared schema with strict RLS and application-layer tenant context for B2B SaaS under SOC 2. Move enterprise logos to dedicated cells or DB-per-tenant when contract or throughput demands it - automate provisioning from day one.

Row-level security in PostgreSQL

RLS is your safety net when an engineer forgets WHERE tenant_id = ?. We set the session variable app.current_tenant from the JWT when a request checks out a connection. PgBouncer transaction mode requires care: a thin proxy layer issues SET LOCAL inside each request's transaction.

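-- Enable RLS, and FORCE it so the table owner (who bypasses RLS by default) is covered too
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;
-- current_setting() raises an error when the variable is unset, so a missing context fails closed
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.current_tenant')::uuid);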

Connection pooling pitfalls

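PgBouncer in transaction pooling breaks SET: a session-level setting stays on the server connection and can leak to the next client that is handed it. SET LOCAL inside an explicit transaction is safe because it is discarded at commit or rollback. We route tenant-scoped API traffic through a sidecar that binds tenant context before queries run. Streaming (physical) read replicas carry the same policies automatically because they replay the primary's catalog changes. Logical replication does not replicate DDL, so on a logical subscriber the policies must be created by running the same migrations there.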
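
A minimal sketch of binding tenant context per transaction, not our production proxy. It assumes a DB-API connection in its default non-autocommit mode and a driver that uses %s placeholders (as psycopg does); the helper name tenant_transaction is hypothetical. It relies on PostgreSQL's set_config(name, value, is_local), which with is_local = true behaves like SET LOCAL but accepts a bind parameter:

```python
import uuid
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id):
    """Yield a cursor whose transaction has app.current_tenant bound locally.

    Assumes `conn` is a DB-API connection in non-autocommit mode, so the
    first statement implicitly opens a transaction.
    """
    # Parse as UUID first so a malformed id fails before touching the database.
    tenant = str(uuid.UUID(str(tenant_id)))
    cur = conn.cursor()
    try:
        # is_local=true scopes the setting to this transaction, like SET LOCAL,
        # so it cannot leak to the next client that reuses the server connection.
        cur.execute(
            "SELECT set_config('app.current_tenant', %s, true)", (tenant,)
        )
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Request handlers would then run every query through the yielded cursor, for example: with tenant_transaction(conn, claims["tenant_id"]) as cur: cur.execute("SELECT * FROM orders"). The claims dictionary here stands in for the decoded JWT.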

Tenant context propagation

API gateway validates JWT, extracts tenant_id and plan tier
Service mesh header x-tenant-id propagated to all internal gRPC calls
Workers consume tenant from envelope metadata, never from job body alone
Background jobs: partition Kafka topics by tenant hash for fair scheduling
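
The worker rule above can be sketched as follows. This is an illustration under assumptions, not our actual worker framework: Envelope, handle, tenant_context, and current_tenant are hypothetical names, and the metadata key mirrors the x-tenant-id header. The point is that the tenant is read from producer-set metadata and bound with contextvars, so anything inside the job sees one authoritative value:

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field

_current_tenant: ContextVar[str] = ContextVar("current_tenant")


@contextmanager
def tenant_context(tenant_id: str):
    """Bind the tenant for the duration of a block, then restore the old value."""
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def current_tenant() -> str:
    """Return the bound tenant; raises LookupError when none is bound."""
    return _current_tenant.get()


@dataclass
class Envelope:
    metadata: dict                            # set by the producer from the authenticated request
    body: dict = field(default_factory=dict)  # job payload, treated as untrusted


def handle(envelope: Envelope, job):
    # The tenant comes from envelope metadata only; a tenant_id inside the
    # body is never consulted.
    tenant_id = envelope.metadata["x-tenant-id"]
    with tenant_context(tenant_id):
        return job(envelope.body)
```

Because current_tenant() has no default, code that runs outside a bound context fails loudly instead of silently acting for the wrong tenant.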

The worst multi-tenant bug is not SQL injection - it's a cache key without tenant prefix. We mandate tenant-scoped cache namespaces in code review checklists.
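
One way to make the tenant prefix impossible to forget is to hand application code a wrapper instead of the raw cache client. A minimal sketch, assuming a dict-like backend; TenantCache and the tenant:{id}: key layout are hypothetical choices for illustration:

```python
class TenantCache:
    """Wrap a dict-like backend so every key carries a tenant namespace."""

    def __init__(self, backend, tenant_id: str):
        # Reject empty ids, and ids containing the separator, which could
        # otherwise make two tenants' keys collide.
        if not tenant_id or ":" in tenant_id:
            raise ValueError("a non-empty tenant_id without ':' is required")
        self._backend = backend
        self._prefix = f"tenant:{tenant_id}:"

    def _key(self, key: str) -> str:
        return self._prefix + key

    def get(self, key: str, default=None):
        return self._backend.get(self._key(key), default)

    def set(self, key: str, value) -> None:
        self._backend[self._key(key)] = value
```

With this shape, a reviewer only has to check that no code path touches the backend directly, rather than auditing every key string.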

Noisy neighbor and fair scheduling

One tenant's CSV import should not starve others. We implement per-tenant rate limits at the API gateway (token bucket by plan), separate SQS queues for bulk operations, and query timeouts on reporting replicas. Heavy analytics tenants get routed to ClickHouse with tenant_id in the sorting key.
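
A minimal in-process sketch of a token bucket keyed by tenant and sized by plan. The plan names and numbers in PLAN_LIMITS are made up for illustration, and a real gateway would keep bucket state in shared storage rather than a local dict:

```python
import time
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float     # maximum burst size
    refill_rate: float  # tokens added per second
    tokens: float       # tokens currently available
    updated: float      # clock reading at the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at capacity, then try to spend.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical plan limits: (burst capacity, requests per second).
PLAN_LIMITS = {"smb": (10.0, 5.0), "enterprise": (100.0, 50.0)}


class TenantRateLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can control time
        self._buckets = {}

    def allow(self, tenant_id: str, plan: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            capacity, rate = PLAN_LIMITS[plan]
            # New tenants start with a full bucket.
            bucket = self._buckets[tenant_id] = TokenBucket(capacity, rate, capacity, now)
        return bucket.allow(now)
```

Each tenant draws from its own bucket, so one tenant exhausting its burst has no effect on whether another tenant's request is admitted.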

CPU quotas on Kubernetes namespaces per cell for largest customers
S3 prefix isolation: s3://bucket/{tenant_id}/... with IAM ABAC where possible
Encryption: SSE-KMS with tenant-specific CMK for enterprise; shared key for SMB tier

Migrations and schema evolution

Shared schema means one migration path. We use expand-contract: add nullable columns, dual-write, backfill per tenant batch, then contract. For schema-per-tenant, we built a migration orchestrator that runs Flyway per schema with concurrency caps - a failed tenant migration must not block the fleet.

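# Tenant-aware migration job (conceptual; tenants, tenant_context, flyway, and emit_metric are internal helpers)
for tenant in tenants(batch_size=50):
    try:
        with tenant_context(tenant.id):
            flyway.migrate()
    except Exception:
        # Record the failure and move on: one tenant must not block the fleet
        emit_metric("migration.failure", tenant=tenant.id)
        continue
    emit_metric("migration.success", tenant=tenant.id)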
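
The concurrency cap for schema-per-tenant migrations can be sketched with a bounded thread pool. This is an illustration, not our orchestrator: migrate_fleet is a hypothetical name, and migrate_one stands for whatever callable runs the migration tool against a single schema:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def migrate_fleet(schemas, migrate_one, max_concurrency=4):
    """Run migrate_one(schema) for every schema; return (succeeded, failed).

    `failed` maps each schema name to the exception it raised, so a rerun
    can target only those tenants.
    """
    succeeded, failed = [], {}
    # max_workers is the concurrency cap: at most that many migrations run at once.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(migrate_one, s): s for s in schemas}
        for future in as_completed(futures):
            schema = futures[future]
            try:
                future.result()
                succeeded.append(schema)
            except Exception as exc:
                # Isolate the failure to this tenant and keep draining the rest.
                failed[schema] = exc
    return succeeded, failed
```

Returning failures instead of raising keeps the run going for healthy tenants and leaves a precise retry list for the ones that need attention.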

Custom fields without EAV hell

Vertical SaaS needs per-tenant custom attributes. Pure EAV crushes query performance. We use JSONB columns with GIN indexes for filterable custom fields, plus a metadata registry that defines types and validation. Search-heavy tenants sync to OpenSearch with flattened custom field mappings generated from the registry.
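
A minimal sketch of registry-driven validation, run before custom fields are written to the JSONB column. The registry layout, the three field types, and the tenant and field names are assumptions for illustration; a real registry would live in a table and support more types:

```python
from datetime import date

# Hypothetical registry: per-tenant field definitions.
REGISTRY = {
    "tenant-a": {
        "contract_value": {"type": "number", "required": True},
        "renewal_date": {"type": "date"},
        "region": {"type": "enum", "choices": ["emea", "amer", "apac"]},
    }
}


def _check(spec, value):
    kind = spec["type"]
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "date":
        try:
            date.fromisoformat(value)  # expects an ISO 8601 string
            return True
        except (TypeError, ValueError):
            return False
    if kind == "enum":
        return value in spec["choices"]
    return False  # unknown types are rejected


def validate_custom_fields(tenant_id, fields):
    """Return a list of error strings; an empty list means the payload is valid."""
    schema = REGISTRY.get(tenant_id, {})
    errors = [f"unknown field: {name}" for name in fields if name not in schema]
    for name, spec in schema.items():
        if name not in fields:
            if spec.get("required"):
                errors.append(f"missing required field: {name}")
        elif not _check(spec, fields[name]):
            errors.append(f"invalid {spec['type']} for field: {name}")
    return errors
```

The same registry entries could drive the generated OpenSearch mappings, so validation and search stay in agreement about each field's type.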

Enterprise onboarding

Automate tenant provisioning end to end and target under 5 minutes per tenant: DB role, RLS policies, default buckets, IdP SAML metadata slot, and feature flags. Manual provisioning does not scale past 200 tenants/year.

Observability and support

Every log line and trace span includes tenant_id (hashed in public logs if needed)
Support impersonation: time-boxed, audited, read-only by default
Per-tenant SLO dashboards: error rate, p95 latency, queue depth
Billing alignment: usage meters (API calls, seats, storage) emitted from the same tenant context
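
The first item above can be sketched with the standard logging module: a filter stamps tenant_id on every record from a context variable, optionally hashed. The names TenantFilter and current_tenant and the "-" placeholder for "no tenant bound" are assumptions for illustration:

```python
import hashlib
import logging
from contextvars import ContextVar

# "-" marks log lines emitted outside any tenant context.
current_tenant: ContextVar[str] = ContextVar("current_tenant", default="-")


class TenantFilter(logging.Filter):
    """Attach tenant_id to every record so formatters can always reference it."""

    def __init__(self, hash_ids: bool = False):
        super().__init__()
        self._hash_ids = hash_ids

    def filter(self, record: logging.LogRecord) -> bool:
        tenant = current_tenant.get()
        if self._hash_ids and tenant != "-":
            # Truncated SHA-256: stable across lines, so logs still correlate.
            tenant = hashlib.sha256(tenant.encode()).hexdigest()[:12]
        record.tenant_id = tenant
        return True  # annotate only; never drop the record


def configure_logging(hash_ids: bool = False) -> logging.Handler:
    """Install a root handler whose format always includes tenant_id."""
    handler = logging.StreamHandler()
    handler.addFilter(TenantFilter(hash_ids))
    handler.setFormatter(
        logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s")
    )
    logging.getLogger().addHandler(handler)
    return handler
```

An unsalted hash only obscures the identifier: anyone holding the list of tenant IDs can recompute it, so treat hashed logs as pseudonymized rather than anonymous.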

Compliance tiers

HIPAA and PCI workloads pushed us to cell isolation: dedicated VPC, no shared Redis for session data, BAA with AWS. SOC 2 Type II on shared infrastructure is achievable with RLS, encryption in transit/at rest, and annual pen tests that include cross-tenant escalation attempts.

Define isolation tier in the sales contract before engineering promises DB-per-tenant
Automate provisioning and deprovisioning including crypto key retirement
Test tenant isolation in CI with integration tests that attempt cross-tenant reads
Plan cell migration strategy before tenant 500 - moving tenants is harder than creating them

Multi-tenant architecture is an operational discipline. The schema choice matters less than consistent tenant context, automated provisioning, and observability that treats each tenant as a first-class dimension. That's the work we do in architecture engagements before the first production tenant lands.