
Frappe SaaS Multitenant Docker Standard (ADR-002)

Status: Accepted (data model). Partially superseded by ADR-003 for the IaC and provisioning control path.
Date: 2026-04-24
Supersedes: ADR-001 — Hybrid Multi-site Architecture
Partially superseded by: Control Plane Direct Provisioning (ADR-003) — Pulumi and Ansible are removed; the Worker calls the Hetzner Cloud API + cloud-init + Docker Remote API + scope-limited Agent inbound exec directly. §11, §17, and §18 below incorporate the ADR-003 changes; the rest of this document is unchanged.
Canonical: this document. Implementation ADR mirror: prego-control-plane/docs/rearchitecture/adr-002-multitenant-docker-standard.md; ADR-003 mirror: prego-control-plane/docs/rearchitecture/adr-003-direct-control-no-iac.md.

Companion documents

| Topic | Document |
| --- | --- |
| Worker → Hetzner → cloud-init → Agent (no Pulumi/Ansible) | Control Plane Direct Provisioning |
| Placement Engine — input/output, bin-packing, plan tier filter | Tenant Placement Policy |
| Pre-warmed empty sites | Site Pool Strategy |
| 8-state machine + transitions + side effects | Tenant Lifecycle |
| Agent — push telemetry + scope-limited inbound exec (mTLS, JWT) | Agent Security Standard |

Related platform docs:


1. Goal

Prego is a SaaS ERP built on Frappe/ERPNext. Operating tens of thousands of tenants requires that we never create per-tenant containers and never put per-tenant tables in a single shared schema.

The platform standard is:

Container = execution environment
Tenant = Frappe Site
Tenant DB = independent MariaDB database per Site
App Runtime = shared
DB Server = shared at the shard level
Redis = shared with namespace separation
DO NOT create app/db/redis containers per tenant.
DO NOT put per-tenant tables in a single shared database.

This standard affirms ADR-001’s decision to use one shared prego-frappe-bench container per app server and adds DB sharding, Redis pool, Site pool, plan-based isolation matrix, multi-cloud abstraction, expanded lifecycle, and push-only Agent.


2. Standard architecture

2.1 Top-level flow

[ Zuplo Gateway ]
        ↓
[ Cloudflare Worker — Control Plane ]
        ↓
[ Workflow Engine (D1-backed) / Queue ]
        ↓
[ Regional Dispatcher ]
        ↓
[ Docker Hosts ]

2.2 Per-server roles

```mermaid
flowchart TB
    subgraph cp [Control Plane Cloudflare Worker]
        api[Hono API]
        wfe[WorkflowEngine D1]
        plc[PlacementEngine]
        infra[InfraProvider abstraction]
    end
    api --> wfe
    wfe --> plc
    plc --> infra
    infra --> hetzner[HetznerProvider]
    infra --> aws[AWSProvider future]
    infra --> gcp[GCPProvider future]

    subgraph region [Region cluster sg eu us]
        subgraph appPool [App Pool shared bench]
            app1[app-001 prego-frappe-bench]
            app2[app-002 prego-frappe-bench]
        end
        subgraph dbShards [DB Shards]
            db1[db-shard-001 tenant 1-5000]
            db2[db-shard-002 tenant 5001-10000]
        end
        subgraph redisPool [Redis Pool]
            rc[redis-cache]
            rq[redis-queue]
            rs[redis-socketio]
        end
        sitePool[(Site Pool pre-warmed)]
    end
    wfe -->|"Docker API mTLS"| app1
    wfe -->|"Docker API mTLS"| app2
    appPool --> dbShards
    appPool --> redisPool
    sitePool -.->|reserve assign| appPool

    agentNode[Push Agent on host] -->|"OTLP HTTPS"| cp
    app1 -.-> agentNode
    db1 -.-> agentNode
    rc -.-> agentNode
```

3. Per-server Docker standard

3.1 App Server

  • Host: app-<region>-NNN (e.g. app-sgp-001)
  • Containers: frappe-app (the shared prego-frappe-bench), nginx, socketio (optional)
  • Internal layout:
/home/frappe/frappe-bench
├── apps/
└── sites/
    ├── tenant-a.com
    ├── tenant-b.com
    └── common_site_config.json

Principles:

  • One app container handles many tenant sites.
  • App container runs stateless-ish (only sites/ and logs/ are mounted).
  • Scale by CPU / memory / worker load, not by tenant count alone.
  • Horizontal scaling is the primary growth path.

3.2 DB Server

  • Host: db-shard-<region>-NNN (e.g. db-shard-sgp-001)
  • Container: mariadb
  • Internal layout:
MariaDB
├── tenant_a_db
├── tenant_b_db
├── tenant_c_db
└── tenant_d_db

Principles:

  • Never create per-tenant containers.
  • Always create a dedicated MariaDB database per tenant.
  • One DB server hosts many tenant DBs until shard capacity is reached.
  • When a shard hits capacity, create db-shard-<region>-NN+1.
db-shard-sgp-001 → tenant 1 ~ 5000
db-shard-sgp-002 → tenant 5001 ~ 10000

3.3 Redis Server

  • Host: redis-<region>-NNN
  • Container: redis

Logical separation (always required):

  • cache
  • queue
  • socketio

Initial pattern: single Redis instance with DB index per purpose (matches today’s prego-docker compose pattern: REDIS_CACHE_DB=1, REDIS_QUEUE_DB=2, REDIS_SOCKETIO_DB=3).
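Under the single-instance DB-index pattern, the three logical Redis URLs can be derived mechanically. A minimal sketch — the `redis_cache` / `redis_queue` / `redis_socketio` keys mirror Frappe's `common_site_config.json` layout, and the host name is illustrative:

```typescript
// Derive the three logical Redis URLs from one shared instance,
// using the DB indexes named above (cache=1, queue=2, socketio=3).
type RedisRole = "cache" | "queue" | "socketio";

const REDIS_DB_INDEX: Record<RedisRole, number> = {
  cache: 1,
  queue: 2,
  socketio: 3,
};

function redisUrl(host: string, role: RedisRole, port = 6379): string {
  return `redis://${host}:${port}/${REDIS_DB_INDEX[role]}`;
}

// Fragment of common_site_config.json for a shared Redis node:
const redisConfig = {
  redis_cache: redisUrl("redis-sgp-001", "cache"),
  redis_queue: redisUrl("redis-sgp-001", "queue"),
  redis_socketio: redisUrl("redis-sgp-001", "socketio"),
};
```

Moving to split instances later only changes the host per role; the role-to-index mapping becomes a role-to-host mapping in the `redis_pools` registry.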

Growth path (managed via redis_pools registry):

  • Split into redis-cache, redis-queue, redis-socketio instances
  • Or move to redis-cluster

4. Tenant placement standard

4.1 Forbidden patterns

Forbidden 1 — per-tenant app/db/redis containers:

tenant-a-app
tenant-a-db
tenant-a-redis

This pattern is not used for Trial / Starter / Business plans.

Forbidden 2 — per-tenant tables in a shared DB:

sales_invoice with tenant_id column

Incompatible with Frappe/ERPNext schema model. Forbidden.

4.2 Required pattern

Tenant
→ Frappe Site
→ independent MariaDB Database
→ Shared App Runtime
→ Shared Redis (logical separation)

Example:

tenant-a
→ site: tenant-a.pregoi.com
→ app: app-sgp-001
→ db: db-shard-sgp-001 / tenant_a_db
→ redis: redis-sgp-001 (cache db=1, queue=2, socketio=3)

5. Plan-based isolation matrix

| Plan | App | DB | Redis | Notes |
| --- | --- | --- | --- | --- |
| Trial | Shared App | Shared trial-only DB shard | Shared Redis | Site Pool reserves immediately |
| Starter | Shared App | Shared DB shard | Shared Redis | Default SaaS pattern |
| Business | Shared App | Business DB shard (plan_tier_lock='business') | Shared or Dedicated Redis | Stronger performance guarantee |
| Enterprise | Dedicated App | Dedicated DB | Dedicated Redis | Strong isolation, contract-driven |

Implementation note: db_shards.plan_tier_lock is the primary filter in PlacementEngine. See Tenant Placement Policy.


6. Site Pool strategy (summary)

A Frappe bench new-site run takes 2–5 minutes; at scale, that provisioning latency becomes the dominant UX cost. We pre-warm empty sites in a pool and promote them on tenant assignment.

sites/
├── pool-001 (state=available)
├── pool-002 (state=reserved)
└── pool-003 (state=available)
Provision:
pool-001 → tenant-a.pregoi.com (state=assigned)

States: available → reserved → assigned. Failures land in failed and are reaped by a background warmer. Full design in Site Pool Strategy.
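The pool state machine above can be sketched as a transition guard; the exact edge set (e.g. releasing a stale reservation back to available) is an assumption here — the authoritative design is in Site Pool Strategy:

```typescript
// Assumed site_pool state machine: available → reserved → assigned,
// with failed reaped by the background warmer.
type PoolState = "available" | "reserved" | "assigned" | "failed";

const NEXT: Record<PoolState, PoolState[]> = {
  available: ["reserved", "failed"],
  reserved: ["assigned", "available", "failed"], // release back on reservation timeout
  assigned: [],                                  // terminal for the pool; site now belongs to a tenant
  failed: ["available"],                         // background warmer re-warms
};

function transition(from: PoolState, to: PoolState): PoolState {
  if (!NEXT[from].includes(to)) {
    throw new Error(`illegal site_pool transition ${from} -> ${to}`);
  }
  return to;
}
```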


7. Control Plane responsibilities

The Cloudflare Worker Control Plane must not execute long-running tasks itself.

Responsibilities:

  • Receive tenant create requests
  • Persist tenant metadata (D1 tenants_master)
  • Record placement decision (D1 tenant_allocations, allocation_snapshots)
  • Enqueue workflow job (Cloudflare Queue → prego-provision-queue)
  • Provide status API (GET /v1/jobs/:id)

Forbidden:

  • Long polling inside a Worker request
  • Direct bench command execution from a Worker
  • Polling agents on every host from the Worker

The current implementation in src/index.ts (fetch / queue / scheduled handlers) already follows this rule. ADR-002 codifies it.


8. Workflow / Queue responsibilities

The workflow engine (D1-backed WorkflowEngine in src/orchestration/workflow-engine.ts) handles each provisioning job through the following ordered steps:

  1. Receive tenant creation request from queue
  2. Call PlacementEngine (region / plan / shard / pool decision)
  3. Reserve a Site from site_pool
  4. Create DB on chosen shard, or attach pool DB if pre-baked
  5. Update common_site_config.json on app server (Docker exec)
  6. Configure custom domain + DNS (Cloudflare DNS API)
  7. Transition tenant to active
  8. On failure: rollback (drop site, decrement counters, mark failed)

Initial implementation: Cloudflare Queue + custom WorkflowEngine (already in place). Growth path: consider Temporal / Inngest for cross-region sagas if needed.
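The eight ordered steps above amount to a compensating workflow: on failure, completed steps are rolled back in reverse order. A minimal sketch — step and helper names are illustrative, not the real workflow-engine.ts API:

```typescript
// Illustrative saga-style runner: run steps in order, and on any failure
// compensate the completed steps in reverse (drop site, decrement counters, …).
interface ProvisionContext {
  tenantId: string;
  siteName?: string;
  dbShardId?: string;
}

interface Step {
  name: string;
  run: (ctx: ProvisionContext) => Promise<void>;
  compensate?: (ctx: ProvisionContext) => Promise<void>;
}

async function runWorkflow(steps: Step[], ctx: ProvisionContext): Promise<void> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run(ctx);
      done.push(step);
    } catch (err) {
      for (const s of done.reverse()) {
        await s.compensate?.(ctx); // best-effort rollback, newest first
      }
      throw err; // surface the failure so the tenant lands in "failed"
    }
  }
}
```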


9. Placement Engine standard

Input:

tenant_id, region, plan, expected_users, storage_quota_gb, data_residency

Output:

app_server_id, db_shard_id, redis_pool_id, site_name, database_name

Initial algorithm:

  1. region match
  2. plan match (against db_shards.plan_tier_lock, dedicated server constraints)
  3. Available capacity (memory, tenant count)
  4. Least-loaded shard / app server

Growth path:

  • Bin-packing (multi-dimensional)
  • Tenant affinity (group same-customer tenants for cache locality)
  • Noisy-neighbor isolation (move hot tenants to dedicated shards)
  • Enterprise dedicated placement (single-tenant servers)

Full spec in Tenant Placement Policy.
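The four-step initial algorithm can be sketched as a filter chain over the db_shards registry; the shard shape below is an assumption modelled on the fields this document names (region, plan_tier_lock, capacity):

```typescript
// Sketch of the initial placement algorithm for the DB-shard dimension:
// 1) region match, 2) plan_tier_lock match, 3) capacity, 4) least-loaded.
interface DbShard {
  id: string;
  region: string;
  planTierLock: string | null; // e.g. "business"; null = accepts any plan
  tenantCount: number;
  maxTenants: number;
}

function pickShard(shards: DbShard[], region: string, plan: string): DbShard {
  const candidates = shards
    .filter((s) => s.region === region)
    .filter((s) => s.planTierLock === null || s.planTierLock === plan)
    .filter((s) => s.tenantCount < s.maxTenants);
  if (candidates.length === 0) throw new Error("no shard capacity in region");
  // Least-loaded by fill ratio; ties keep the earlier shard.
  return candidates.reduce((a, b) =>
    a.tenantCount / a.maxTenants <= b.tenantCount / b.maxTenants ? a : b
  );
}
```

The same filter-then-score pattern applies to app servers and Redis pools; the growth-path items (bin-packing, affinity) replace only the scoring step.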


10. Tenant lifecycle (8 states)

trial → provisioning → active → suspended
grace_period → terminated → data_purged
failed

Additional actions (out-of-band, not state transitions):

  • upgrade / downgrade (plan tier change)
  • migrate_shard (move DB to a different shard)
  • change_domain
  • install_app
  • backup / restore
  • rolling_migrate (Frappe / ERPNext version bump)

State machine, transition triggers, and side effects (DNS, billing, backup) are defined in Tenant Lifecycle.
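As a rough illustration of the 8-state machine, the allowed edges might look like the map below. The exact edge set is an assumption — the authoritative transitions live in the Tenant Lifecycle document:

```typescript
// Assumed transition map for the 8 tenant states; edges are illustrative.
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";

const TRANSITIONS: Record<TenantStatus, TenantStatus[]> = {
  trial: ["provisioning", "terminated"],
  provisioning: ["active", "failed"],
  active: ["suspended", "grace_period"],
  suspended: ["active", "grace_period"],
  grace_period: ["active", "terminated"], // reactivation window before deletion
  terminated: ["data_purged"],
  data_purged: [],                        // terminal
  failed: ["provisioning"],               // retry after rollback
};

function canTransition(from: TenantStatus, to: TenantStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```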


11. Agent standard

Updated by ADR-003 (2026-04-24). The Agent is no longer push-only. It now also exposes a scope-limited inbound /agent/v1/exec endpoint to run bench / mariadb / redis-cli, replacing what was previously an Ansible playbook run. Container lifecycle remains on the Docker Remote API.

Each Docker host runs a hardened Push + Scope-Limited Inbound Exec Agent, deployed by cloud-init at server boot.

11.1 Outbound (push) responsibilities

  • Push container status (running/stopped, image tag)
  • Push host metrics (CPU, memory, disk)
  • Push Frappe/bench worker status
  • Push site health summaries (per-site doctor)
  • Push DB connection / Redis memory metrics
  • Push backup completion reports

11.2 Inbound (exec) responsibilities — ADR-003

  • POST /agent/v1/exec with kind ∈ { bench, mariadb, redis-cli }
  • POST /agent/v1/exec/stream (SSE) for long-running commands (e.g., bench backup, bench migrate)
  • Single-shot per call; Worker supplies timeoutSeconds
  • Every invocation writes to agent_command_audit D1 table
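A Worker-side call to the exec endpoint might look like the sketch below. The request fields (`kind`, `args`, `timeoutSeconds`) follow the bullets above; everything else (URL layout aside from the documented path, response shape) is an assumption:

```typescript
// Sketch: single-shot exec call from the Worker to one host's Agent.
// The Worker supplies the timeout; the Agent enforces kind-scoped commands.
interface ExecRequest {
  kind: "bench" | "mariadb" | "redis-cli";
  args: string[];
  timeoutSeconds: number;
}

async function agentExec(agentUrl: string, jwt: string, req: ExecRequest): Promise<any> {
  const res = await fetch(`${agentUrl}/agent/v1/exec`, {
    method: "POST",
    headers: {
      authorization: `Bearer ${jwt}`, // short-lived JWT, scope = exec.<kind>
      "content-type": "application/json",
    },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`agent exec failed: ${res.status}`);
  return res.json();
}
```

Long-running commands (bench backup, bench migrate) would go to /agent/v1/exec/stream instead and consume SSE rather than a single JSON body.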

11.3 Hard boundary

The Agent MUST NOT:

  • Accept arbitrary shell (no /agent/v1/shell, no generic /exec without kind)
  • Run docker run or docker exec on behalf of callers — that path stays on the Docker Remote API (mTLS port 2376)
  • Hold long-lived sessions

The Worker MUST NOT:

  • Use SSH for any automated operation
  • Use Pulumi or Ansible to drive any host (deprecated by ADR-003)

11.4 Security

  • mTLS (CA-issued client cert; same CA as Docker Remote API)
  • Short-lived JWT (≤5 min), scope = exec.bench / exec.mariadb / exec.redis-cli for Path D, or heartbeat / metrics / logs / backup_report / rotate-token for Path E (push)
  • Optional Cloudflare egress IP allowlist on host firewall (open decision: ADR-003 OD-A)
  • systemd hardening (NoNewPrivileges, restricted capabilities)
  • Bootstrap token: single-use, ≤15 min TTL, embedded in cloud-init user_data

Full security standard in Agent Security Standard. Full provisioning sequence in Control Plane Direct Provisioning.


12. Observability standard

The Agent pushes telemetry to Control Plane, which forwards to the central collector.

Metrics collected:

  • server_cpu, server_memory, disk_pct
  • container_status
  • bench_worker_status
  • site_health (per site)
  • db_connections, db_disk_usage
  • redis_memory, redis_eviction_rate
  • queue_depth (Frappe RQ)
  • provisioning_duration (per tenant)
  • error_count

Recommended sinks:

  • OpenTelemetry envelope (OTLP/HTTP)
  • Grafana Cloud or Datadog (configurable per region)
  • Cloudflare Logpush for Worker logs

13. Backup / DR standard

| Plan | RPO | RTO | Backup mechanism |
| --- | --- | --- | --- |
| Trial | 24h | Best effort | bench backup daily, R2 7d retention |
| Starter | 12h | 24h | bench backup 12h, MariaDB dump daily, R2 30d |
| Business | 1h ~ 4h | 4h ~ 8h | MariaDB physical dump hourly, volume snapshot daily, R2 90d |
| Enterprise | Custom | Custom | Cross-region replica + custom RPO; contract-driven |

Backup methods (in priority order):

  • bench backup (Frappe-native, includes site files)
  • MariaDB dump (logical mysqldump or physical mariabackup)
  • Volume snapshot (provider-specific)
  • R2 / S3 archive (long-term cold storage)

14. Multi-cloud abstraction

Hetzner is the current sole cloud provider. To avoid vendor lock-in (regional coverage gaps, single-vendor outages) the Control Plane uses an InfraProvider interface.

```typescript
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
  listRegions(): Promise<Region[]>;
}
```

infra_providers D1 table records which provider serves which region. Initial implementation: Hetzner only. Expansion happens without changing PlacementEngine consumers.
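An in-memory stand-in makes the contract concrete; the ServerSpec / ServerHandle / Region shapes are not defined in this section, so the minimal versions below are assumptions. The real HetznerProvider would call the Hetzner Cloud API over HTTPS instead:

```typescript
// Assumed minimal shapes for the types the interface references.
interface ServerSpec { name: string; type: string; region: string; userData?: string }
interface ServerHandle { id: string; ip: string }
interface Region { id: string; name: string }

interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
  listRegions(): Promise<Region[]>;
}

// Illustrative in-memory fake; real implementation calls the cloud API.
class FakeHetznerProvider implements InfraProvider {
  readonly id = "hetzner" as const;
  private servers = new Map<string, ServerSpec>();

  async createServer(spec: ServerSpec): Promise<ServerHandle> {
    // spec.userData would carry the cloud-init payload in the real provider.
    const handle = { id: `srv-${this.servers.size + 1}`, ip: "10.0.0.1" };
    this.servers.set(handle.id, spec);
    return handle;
  }
  async destroyServer(id: string): Promise<void> {
    this.servers.delete(id);
  }
  async listRegions(): Promise<Region[]> {
    return [{ id: "sg", name: "Singapore" }];
  }
}
```

PlacementEngine consumers depend only on the interface, so adding AWSProvider or GCPProvider later is additive.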


15. TypeScript interface reference

These types are the contract, not the implementation. Definitions live in prego-control-plane/src/types.ts (Phase 2 will reconcile current TenantStatus 4-state and PlanTier 5-tier with this standard — see §17).

```typescript
export type TenantPlan = "trial" | "starter" | "business" | "enterprise";

export type TenantStatus =
  | "trial"
  | "provisioning"
  | "active"
  | "suspended"
  | "grace_period"
  | "terminated"
  | "data_purged"
  | "failed";

export type Region = "sg" | "eu" | "us";

export interface PlacementRequest {
  tenantId: string;
  region: Region;
  plan: TenantPlan;
  expectedUsers?: number;
  storageQuotaGb?: number;
  dataResidency?: string;
}

export interface PlacementResult {
  appServerId: string;
  dbShardId: string;
  redisPoolId: string;
  siteName: string;
  databaseName: string;
  infraProvider: "hetzner" | "aws" | "gcp";
}
```

16. Final principles

Tenant != Container
Tenant == Frappe Site
Site DB == Independent MariaDB Database
App Runtime == Shared
DB == Sharded by capacity and plan
Redis == Shared with namespace separation
Enterprise == Dedicated isolation

Every change to prego-control-plane, Docker deployment, or Frappe provisioning must be evaluated against these invariants.


17. Reconciliation with current implementation

ADR-002 supersedes ADR-001 and adds the following items. ADR-003 (2026-04-24) revises the IaC / provisioning rows below: Pulumi and Ansible are removed in favour of direct Worker calls.

| Standard item | Current state | Phase |
| --- | --- | --- |
| db_shards D1 table | Not present | Phase 2 |
| redis_pools D1 table | Not present (single Redis node only) | Phase 2 |
| site_pool D1 table | Not present (on-demand bench new-site only) | Phase 2 |
| infra_providers D1 table | Not present (Hetzner hardcoded) | Phase 4 |
| tenant_lifecycle_events D1 table | Partial (provisioning_workflows only) | Phase 2 |
| server_bootstrap_tokens D1 table (ADR-003) | Not present | Phase 2 |
| agent_command_audit D1 table (ADR-003) | Not present | Phase 2 |
| TenantStatus 8-state | 4-state (Pending \| Active \| Pending_Deletion \| Deleted) | Phase 5 |
| TenantPlan naming | free \| basic \| professional \| business \| enterprise (5-tier) | Phase 5 |
| Worker → Hetzner Cloud API direct (ADR-003) | Currently via prego-pulumi | Phase 3 (replaces Pulumi) |
| cloud-init dynamically generated by Worker (ADR-003) | Static template per environment | Phase 3 |
| Agent inbound /agent/v1/exec for bench / mariadb / redis-cli (ADR-003) | Currently via prego-ansible playbooks | Phase 3 (replaces Ansible) |
| InfraProvider interface | HetznerProvider direct call | Phase 4 (Worker-side, no Pulumi stack split) |
| Push + Inbound Exec Agent | Not deployed | Phase 3 |

Migration mapping (Phase 5 backfill):

| Current TenantStatus | New TenantStatus |
| --- | --- |
| Pending | provisioning |
| Active | active |
| Pending_Deletion | grace_period |
| Deleted | data_purged |

New states added without legacy mapping: trial, suspended, terminated, failed.
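The Phase 5 backfill reduces to a lookup over the mapping above; a minimal sketch (the function name is illustrative):

```typescript
// Legacy 4-state → new 8-state mapping, exactly as tabulated above.
// The four new states (trial, suspended, terminated, failed) have no legacy source.
const LEGACY_STATUS_MAP = {
  Pending: "provisioning",
  Active: "active",
  Pending_Deletion: "grace_period",
  Deleted: "data_purged",
} as const;

type LegacyStatus = keyof typeof LEGACY_STATUS_MAP;

function migrateStatus(legacy: LegacyStatus): string {
  return LEGACY_STATUS_MAP[legacy];
}
```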

TenantPlan reconciliation (open decision — see open decisions issue tracker): how to map professional (currently between basic and business) into the 4-tier standard. Most likely: professional → business, free → trial, basic → starter.


18. Cross-repo impact

This standard is canonical. The table below reflects the post-ADR-003 disposition.

| Repo | Status | Impact |
| --- | --- | --- |
| prego-control-plane | Active, expanded | D1 migrations (ADR-002 set + ADR-003 server_bootstrap_tokens, agent_command_audit); PlacementEngine v2; Worker-side InfraProvider directly calling Hetzner API; provisioning-workflow.ts rewrite; Agent inbound exec-client.ts; reconciliation of TenantStatus/PlanTier |
| prego-pulumi | Deprecated (ADR-003) | No new resources. Existing stacks readable for cutover only. See the migration runbook. Repo will be archived once servers.managed_by = 'pulumi' count reaches zero. |
| prego-ansible | Deprecated (ADR-003) | No new playbooks. Existing inventories retained for cutover only. Same archival condition as above. |
| prego-docker | Active, scope reduced | Frappe images only. Site-pool fast-init flags verified (bench new-site --no-setup-db etc.). The Agent image is published from a separate source (ADR-003 OD-C) and pulled by cloud-init at server boot — not packaged in the Frappe compose snippet. |
| prego_saas | Unchanged | No DocType change for ADR-002 / ADR-003. Future server-side helpers require a separate ADR. |

19. References
