Frappe SaaS Multitenant Docker Standard (ADR-002)
Status: Accepted (data model). Partially superseded by ADR-003 for the IaC and provisioning control path.
Date: 2026-04-24
Supersedes: ADR-001 — Hybrid Multi-site Architecture
Partially superseded by: Control Plane Direct Provisioning (ADR-003) — Pulumi and Ansible are removed; the Worker calls the Hetzner Cloud API + cloud-init + Docker Remote API + scope-limited Agent inbound exec directly. Sections §11, §17, and §18 below incorporate the ADR-003 changes; the rest of this document is unchanged.
Canonical: this document. Implementation ADR mirror lives at prego-control-plane/docs/rearchitecture/adr-002-multitenant-docker-standard.md; ADR-003 mirror at prego-control-plane/docs/rearchitecture/adr-003-direct-control-no-iac.md.
Companion documents
| Topic | Document |
|---|---|
| Worker → Hetzner → cloud-init → Agent (no Pulumi/Ansible) | Control Plane Direct Provisioning |
| Placement Engine — input/output, bin-packing, plan tier filter | Tenant Placement Policy |
| Pre-warmed empty sites | Site Pool Strategy |
| 8-state machine + transitions + side effects | Tenant Lifecycle |
| Agent — push telemetry + scope-limited inbound exec (mTLS, JWT) | Agent Security Standard |
Related platform docs:
- Repo responsibility matrix — which repo owns which artifact
- Tenant Provisioning Flow — Pulumi, Ansible, Frappe — historical pipeline (Pulumi/Ansible deprecated by ADR-003)
- SaaS Infrastructure — DB Separation and Usage-Based Scaling — DB tier rationale
1. Goal
Prego is a SaaS ERP built on Frappe/ERPNext. Operating tens of thousands of tenants requires that we never create per-tenant containers and never put per-tenant tables in a single shared schema.
The platform standard is:
- Container = execution environment
- Tenant = Frappe Site
- Tenant DB = independent MariaDB database per Site
- App Runtime = shared
- DB Server = shared at the shard level
- Redis = shared with namespace separation

DO NOT create app/db/redis containers per tenant.
DO NOT put per-tenant tables in a single shared database.

This standard affirms ADR-001's decision to use one shared prego-frappe-bench container per app server and adds DB sharding, a Redis pool, a Site pool, a plan-based isolation matrix, multi-cloud abstraction, an expanded lifecycle, and a push-only Agent.
2. Standard architecture
2.1 Top-level flow
```
[ Zuplo Gateway ]
       ↓
[ Cloudflare Worker — Control Plane ]
       ↓
[ Workflow Engine (D1-backed) / Queue ]
       ↓
[ Regional Dispatcher ]
       ↓
[ Docker Hosts ]
```

2.2 Per-server roles
```mermaid
flowchart TB
  subgraph cp [Control Plane Cloudflare Worker]
    api[Hono API]
    wfe[WorkflowEngine D1]
    plc[PlacementEngine]
    infra[InfraProvider abstraction]
  end
  api --> wfe
  wfe --> plc
  plc --> infra
  infra --> hetzner[HetznerProvider]
  infra --> aws[AWSProvider future]
  infra --> gcp[GCPProvider future]
  subgraph region [Region cluster sg eu us]
    subgraph appPool [App Pool shared bench]
      app1[app-001 prego-frappe-bench]
      app2[app-002 prego-frappe-bench]
    end
    subgraph dbShards [DB Shards]
      db1[db-shard-001 tenant 1-5000]
      db2[db-shard-002 tenant 5001-10000]
    end
    subgraph redisPool [Redis Pool]
      rc[redis-cache]
      rq[redis-queue]
      rs[redis-socketio]
    end
    sitePool[(Site Pool pre-warmed)]
  end
  wfe -->|"Docker API mTLS"| app1
  wfe -->|"Docker API mTLS"| app2
  appPool --> dbShards
  appPool --> redisPool
  sitePool -.->|reserve assign| appPool
  agentNode[Push Agent on host] -->|"OTLP HTTPS"| cp
  app1 -.-> agentNode
  db1 -.-> agentNode
  rc -.-> agentNode
```
3. Per-server Docker standard
3.1 App Server
- Host: `app-<region>-NNN` (e.g. `app-sgp-001`)
- Containers: `frappe-app` (the shared `prego-frappe-bench`), `nginx`, `socketio` (optional)
- Internal layout:

```
/home/frappe/frappe-bench
├── apps/
└── sites/
    ├── tenant-a.com
    ├── tenant-b.com
    └── common_site_config.json
```

Principles:
- One app container handles many tenant sites.
- App container runs stateless-ish (only `sites/` and `logs/` are mounted).
- Scale by CPU / memory / worker load, not by tenant count alone.
- Horizontal scaling is the primary growth path.
3.2 DB Server
- Host: `db-shard-<region>-NNN` (e.g. `db-shard-sgp-001`)
- Container: `mariadb`
- Internal layout:

```
MariaDB
├── tenant_a_db
├── tenant_b_db
├── tenant_c_db
└── tenant_d_db
```

Principles:
- Never create per-tenant containers.
- Always create a dedicated MariaDB database per tenant.
- One DB server hosts many tenant DBs until shard capacity is reached.
- When a shard hits capacity, create `db-shard-<region>-NN+1`:

```
db-shard-sgp-001 → tenant 1 ~ 5000
db-shard-sgp-002 → tenant 5001 ~ 10000
```

3.3 Redis Server
- Host: `redis-<region>-NNN`
- Container: `redis`

Logical separation (always required):
- `cache`
- `queue`
- `socketio`
Initial pattern: single Redis instance with DB index per purpose (matches today’s prego-docker compose pattern: REDIS_CACHE_DB=1, REDIS_QUEUE_DB=2, REDIS_SOCKETIO_DB=3).
Growth path (managed via the redis_pools registry):
- Split into dedicated `redis-cache`, `redis-queue`, `redis-socketio` instances
- Or move to `redis-cluster`
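The DB-index split above can be restated as a small helper that derives the three logical connection URLs from one shared instance. A minimal sketch, assuming nothing beyond the documented index assignment (cache=1, queue=2, socketio=3); the helper name and shape are illustrative, not part of the codebase:

```typescript
// Sketch: derive the three logical Redis endpoints from one shared instance.
// The DB indexes mirror today's compose pattern (REDIS_CACHE_DB=1, etc.);
// the helper itself is illustrative.
type RedisPurpose = "cache" | "queue" | "socketio";

const REDIS_DB_INDEX: Record<RedisPurpose, number> = {
  cache: 1,
  queue: 2,
  socketio: 3,
};

function redisUrl(host: string, purpose: RedisPurpose, port = 6379): string {
  return `redis://${host}:${port}/${REDIS_DB_INDEX[purpose]}`;
}
```

For example, `redisUrl("redis-sgp-001", "queue")` yields `redis://redis-sgp-001:6379/2`. Splitting into dedicated instances later only changes the `host` argument per purpose, not the callers.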
4. Tenant placement standard
4.1 Forbidden patterns
Forbidden 1 — per-tenant app/db/redis containers:
```
tenant-a-app
tenant-a-db
tenant-a-redis
```

This pattern is not used for Trial / Starter / Business plans.
Forbidden 2 — per-tenant tables in a shared DB:
```
sales_invoice with tenant_id column
```

Incompatible with the Frappe/ERPNext schema model. Forbidden.
4.2 Required pattern
Tenant → Frappe Site → independent MariaDB Database → Shared App Runtime → Shared Redis (logical separation)

Example:

```
tenant-a
→ site:  tenant-a.pregoi.com
→ app:   app-sgp-001
→ db:    db-shard-sgp-001 / tenant_a_db
→ redis: redis-sgp-001 (cache db=1, queue=2, socketio=3)
```

5. Plan-based isolation matrix
| Plan | App | DB | Redis | Notes |
|---|---|---|---|---|
| Trial | Shared App | Shared trial-only DB shard | Shared Redis | Site Pool reserves immediately |
| Starter | Shared App | Shared DB shard | Shared Redis | Default SaaS pattern |
| Business | Shared App | Business DB shard (plan_tier_lock='business') | Shared or Dedicated Redis | Stronger performance guarantee |
| Enterprise | Dedicated App | Dedicated DB | Dedicated Redis | Strong isolation, contract-driven |
Implementation note: db_shards.plan_tier_lock is the primary filter in PlacementEngine. See Tenant Placement Policy.
6. Site Pool strategy (summary)
Frappe bench new-site takes 2~5 minutes; provisioning latency at scale becomes the dominant UX cost. We pre-warm empty sites in a pool and promote them on tenant assignment.
```
sites/
├── pool-001 (state=available)
├── pool-002 (state=reserved)
└── pool-003 (state=available)
```

Provision: pool-001 → tenant-a.pregoi.com (state=assigned)

States: available → reserved → assigned. Failures land in failed and are reaped by a background warmer. Full design in Site Pool Strategy.
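The pool states can be sketched as a small transition table. A reading aid only: the release and reap edges below are assumptions beyond the three documented states, and the real state lives in the D1 site_pool table, not in memory:

```typescript
// Site-pool state machine sketch: available → reserved → assigned,
// with failures reaped by the background warmer.
type PoolState = "available" | "reserved" | "assigned" | "failed";

const POOL_TRANSITIONS: Record<PoolState, PoolState[]> = {
  available: ["reserved", "failed"],
  reserved: ["assigned", "available", "failed"], // release on timeout (assumption)
  assigned: [],                                  // terminal for the pool entry
  failed: ["available"],                         // re-warmed by the reaper (assumption)
};

function canTransition(from: PoolState, to: PoolState): boolean {
  return POOL_TRANSITIONS[from].includes(to);
}
```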
7. Control Plane responsibilities
The Cloudflare Worker Control Plane must not execute long-running tasks itself.
Responsibilities:
- Receive tenant create requests
- Persist tenant metadata (D1 `tenants_master`)
- Record placement decisions (D1 `tenant_allocations`, `allocation_snapshots`)
- Enqueue workflow jobs (Cloudflare Queue → `prego-provision-queue`)
- Provide a status API (`GET /v1/jobs/:id`)
Forbidden:
- Long polling inside a Worker request
- Direct `bench` command execution from a Worker
- Polling agents on every host from the Worker
The current implementation src/index.ts (fetch / queue / scheduled) already follows this rule. ADR-002 codifies it.
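A minimal sketch of the synchronous work a Worker request is allowed to do: assemble a job record for D1 and the queue message, then return the job id immediately. Field names here are illustrative assumptions, not the actual tenants_master schema:

```typescript
// Sketch of the Worker's fast path: build the record, enqueue, return.
// No bench calls, no polling, no long-running work in the request handler.
interface ProvisionRequest {
  tenantId: string;
  region: string;
  plan: string;
}

interface ProvisionJob {
  jobId: string;
  tenantId: string;
  region: string;
  plan: string;
  status: "queued";
}

// jobId is injected so the function stays pure and testable.
function buildProvisionJob(req: ProvisionRequest, jobId: string): ProvisionJob {
  return {
    jobId,
    tenantId: req.tenantId,
    region: req.region,
    plan: req.plan,
    status: "queued",
  };
}
```

The caller would persist this record to D1, send it to prego-provision-queue, and answer with `{ jobId }`; everything else happens in the queue consumer.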
8. Workflow / Queue responsibilities
The workflow engine (D1-backed WorkflowEngine in src/orchestration/workflow-engine.ts) handles each provisioning job through the following ordered steps:
- Receive the tenant creation request from the queue
- Call `PlacementEngine` (region / plan / shard / pool decision)
- Reserve a Site from `site_pool`
- Create the DB on the chosen shard, or attach a pool DB if pre-baked
- Update `common_site_config.json` on the app server (Docker exec)
- Configure custom domain + DNS (Cloudflare DNS API)
- Transition the tenant to `active`
- On failure: roll back (drop site, decrement counters, mark `failed`)
Initial implementation: Cloudflare Queue + custom WorkflowEngine (already in place). Growth path: consider Temporal / Inngest for cross-region sagas if needed.
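The ordered-steps-with-rollback behavior can be sketched as follows. Step names are illustrative, and the real WorkflowEngine persists step progress in D1 rather than in memory, but the compensation shape is the same: on failure, undo completed steps in reverse order:

```typescript
// Saga-style sketch: each step carries a compensating undo; a failure
// rolls back everything already done, newest first.
interface Step {
  name: string;
  run: () => void;
  undo: () => void;
}

function runWorkflow(steps: Step[]): { ok: boolean; failedAt?: string } {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      // Compensate in reverse order (e.g. drop site, decrement counters).
      for (const s of done.reverse()) s.undo();
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}
```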
9. Placement Engine standard
Input: tenant_id, region, plan, expected_users, storage_quota_gb, data_residency

Output: app_server_id, db_shard_id, redis_pool_id, site_name, database_name

Initial algorithm:
- `region` match
- `plan` match (against `db_shards.plan_tier_lock`, dedicated server constraints)
- Available capacity (memory, tenant count)
- Least-loaded shard / app server
Growth path:
- Bin-packing (multi-dimensional)
- Tenant affinity (group same-customer tenants for cache locality)
- Noisy-neighbor isolation (move hot tenants to dedicated shards)
- Enterprise dedicated placement (single-tenant servers)
Full spec in Tenant Placement Policy.
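The initial algorithm reduces to a filter-then-sort over shard rows. A sketch under stated assumptions: the field names loosely mirror the planned db_shards columns, and the load metric (tenant count over capacity) is an illustrative choice, not the spec:

```typescript
// Least-loaded shard selection: region filter, plan_tier_lock filter,
// capacity filter, then pick the shard with the lowest fill ratio.
interface DbShard {
  id: string;
  region: string;
  planTierLock: string | null; // null = accepts any plan tier
  tenantCount: number;
  maxTenants: number;
}

function pickShard(shards: DbShard[], region: string, plan: string): DbShard | undefined {
  return shards
    .filter((s) => s.region === region)
    .filter((s) => s.planTierLock === null || s.planTierLock === plan)
    .filter((s) => s.tenantCount < s.maxTenants)
    .sort((a, b) => a.tenantCount / a.maxTenants - b.tenantCount / b.maxTenants)[0];
}
```

A `business` tenant lands only on shards locked to `business` or unlocked shards; within the candidates, the emptiest shard wins. The growth-path items (bin-packing, affinity) replace the single-dimension sort, not the filter chain.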
10. Tenant lifecycle (8 states)
```
trial → provisioning → active → suspended
                                    ↓
                              grace_period → terminated → data_purged
            ↓
          failed
```

Additional actions (out-of-band, not state transitions):
- `upgrade` / `downgrade` (plan tier change)
- `migrate_shard` (move DB to a different shard)
- `change_domain`
- `install_app`
- `backup` / `restore`
- `rolling_migrate` (Frappe / ERPNext version bump)
State machine, transition triggers, and side effects (DNS, billing, backup) are defined in Tenant Lifecycle.
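As a reading aid, here is one plausible encoding of the 8-state machine. The authoritative transition table, with triggers and side effects, is the Tenant Lifecycle document; edges marked as assumptions below are not guaranteed by this ADR:

```typescript
// Plausible adjacency encoding of the lifecycle; reading aid, not the spec.
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";

const TRANSITIONS: Record<TenantStatus, TenantStatus[]> = {
  trial: ["provisioning"],
  provisioning: ["active", "failed"],
  active: ["suspended"],
  suspended: ["active", "grace_period"], // reactivation edge is an assumption
  grace_period: ["terminated"],
  terminated: ["data_purged"],
  data_purged: [],                        // terminal
  failed: ["provisioning"],               // retry after remediation (assumption)
};

function isAllowed(from: TenantStatus, to: TenantStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```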
11. Agent standard
Updated by ADR-003 (2026-04-24). The Agent is no longer push-only. It now also accepts a scope-limited inbound `/agent/v1/exec` endpoint to run `bench` / `mariadb` / `redis-cli`, replacing what was previously an Ansible playbook run. Container lifecycle remains on the Docker Remote API.
Each Docker host runs a hardened Push + Scope-Limited Inbound Exec Agent, deployed by cloud-init at server boot.
11.1 Outbound (push) responsibilities
- Push container status (running/stopped, image tag)
- Push host metrics (CPU, memory, disk)
- Push Frappe/bench worker status
- Push site health summaries (per-site doctor)
- Push DB connection / Redis memory metrics
- Push backup completion reports
11.2 Inbound (exec) responsibilities — ADR-003
- `POST /agent/v1/exec` with `kind ∈ { bench, mariadb, redis-cli }`
- `POST /agent/v1/exec/stream` (SSE) for long-running commands (e.g., `bench backup`, `bench migrate`)
- Single-shot per call; the Worker supplies `timeoutSeconds`
- Every invocation writes to the `agent_command_audit` D1 table
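A minimal sketch of the request validation the exec contract implies: an allowlisted kind, string-only args, and a mandatory Worker-supplied timeout. The body shape and error messages are assumptions for illustration:

```typescript
// Exec request validation sketch: anything outside the kind allowlist
// is rejected before the command is even considered.
const EXEC_KINDS = ["bench", "mariadb", "redis-cli"] as const;
type ExecKind = (typeof EXEC_KINDS)[number];

interface ExecRequest {
  kind: ExecKind;
  args: string[];
  timeoutSeconds: number;
}

function validateExecRequest(body: {
  kind?: string;
  args?: unknown;
  timeoutSeconds?: number;
}): ExecRequest {
  if (!EXEC_KINDS.includes(body.kind as ExecKind)) {
    throw new Error(`unsupported kind: ${body.kind}`);
  }
  if (!Array.isArray(body.args) || !body.args.every((a) => typeof a === "string")) {
    throw new Error("args must be string[]");
  }
  if (typeof body.timeoutSeconds !== "number" || body.timeoutSeconds <= 0) {
    throw new Error("timeoutSeconds required");
  }
  return { kind: body.kind as ExecKind, args: body.args, timeoutSeconds: body.timeoutSeconds };
}
```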
11.3 Hard boundary
The Agent MUST NOT:
- Accept arbitrary shell (no `/agent/v1/shell`, no generic `/exec` without `kind`)
- Run `docker run` or `docker exec` on behalf of callers — that path stays on the Docker Remote API (mTLS port 2376)
- Hold long-lived sessions
The Worker MUST NOT:
- Use SSH for any automated operation
- Use Pulumi or Ansible to drive any host (deprecated by ADR-003)
11.4 Security
- mTLS (CA-issued client cert; same CA as the Docker Remote API)
- Short-lived JWT (≤5 min), scope = `exec.bench` / `exec.mariadb` / `exec.redis-cli` for Path D, or `heartbeat` / `metrics` / `logs` / `backup_report` / `rotate-token` for Path E (push)
- Optional Cloudflare egress IP allowlist on the host firewall (open decision: ADR-003 OD-A)
- systemd hardening (`NoNewPrivileges`, restricted capabilities)
- Bootstrap token: single-use, ≤15 min TTL, embedded in cloud-init `user_data`
Full security standard in Agent Security Standard. Full provisioning sequence in Control Plane Direct Provisioning.
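The JWT scope rule can be illustrated with a small check: the requested exec kind must appear in the token's scope claim and the token must still be within its lifetime. Claim names and the space-separated scope format are assumptions; the real verification also covers the signature and mTLS identity:

```typescript
// Authorization sketch: signature verification is assumed to have already
// happened; this checks only expiry and the exec.<kind> scope.
interface AgentClaims {
  scope: string; // e.g. "exec.bench exec.mariadb" (assumed format)
  exp: number;   // unix seconds
}

function authorizeExec(claims: AgentClaims, kind: string, nowSeconds: number): boolean {
  if (claims.exp <= nowSeconds) return false; // token expired
  return claims.scope.split(" ").includes(`exec.${kind}`);
}
```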
12. Observability standard
The Agent pushes telemetry to Control Plane, which forwards to the central collector.
Metrics collected:
- `server_cpu`, `server_memory`, `disk_pct`
- `container_status`
- `bench_worker_status`
- `site_health` (per site)
- `db_connections`, `db_disk_usage`
- `redis_memory`, `redis_eviction_rate`
- `queue_depth` (Frappe RQ)
- `provisioning_duration` (per tenant)
- `error_count`
Recommended sinks:
- OpenTelemetry envelope (OTLP/HTTP)
- Grafana Cloud or Datadog (configurable per region)
- Cloudflare Logpush for Worker logs
13. Backup / DR standard
| Plan | RPO | RTO | Backup mechanism |
|---|---|---|---|
| Trial | 24h | Best effort | bench backup daily, R2 7d retention |
| Starter | 12h | 24h | bench backup 12h, MariaDB dump daily, R2 30d |
| Business | 1h ~ 4h | 4h ~ 8h | MariaDB physical dump hourly, volume snapshot daily, R2 90d |
| Enterprise | Custom | Custom | Cross-region replica + custom RPO; contract-driven |
Backup methods (in priority order):
1. `bench backup` (Frappe-native, includes site files)
2. MariaDB dump (`mysqldump`, logical, or `mariabackup`, physical)
3. Volume snapshot (provider-specific)
4. R2 / S3 archive (long-term cold storage)
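The plan/RPO table above can be restated as data for a backup scheduler. Values are copied from the table (Business pinned to the tighter 1h interval); Enterprise is contract-driven and intentionally left out. The record shape is an illustrative assumption:

```typescript
// Backup policy per plan tier, derived from the §13 table.
// Enterprise is contract-driven and handled outside this map.
interface BackupPolicy {
  intervalHours: number;
  retentionDays: number; // R2 retention
}

const BACKUP_POLICY: Record<"trial" | "starter" | "business", BackupPolicy> = {
  trial: { intervalHours: 24, retentionDays: 7 },
  starter: { intervalHours: 12, retentionDays: 30 },
  business: { intervalHours: 1, retentionDays: 90 },
};
```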
14. Multi-cloud abstraction
Hetzner is the current sole cloud provider. To avoid vendor lock-in (regional coverage gaps, single-vendor outages) the Control Plane uses an InfraProvider interface.
```typescript
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
  listRegions(): Promise<Region[]>;
}
```

The infra_providers D1 table records which provider serves which region. Initial implementation: Hetzner only. Expansion happens without changing PlacementEngine consumers.
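A sketch of how the infra_providers registry could back the region lookup. The row shape and function name are illustrative, not the actual D1 schema; the point is that PlacementEngine consumers resolve a provider id by region and never touch the concrete implementation:

```typescript
// Region → provider resolution sketch mirroring the infra_providers table.
type ProviderId = "hetzner" | "aws" | "gcp";

interface ProviderRegistryRow {
  region: string;
  providerId: ProviderId;
}

function providerForRegion(rows: ProviderRegistryRow[], region: string): ProviderId {
  const row = rows.find((r) => r.region === region);
  if (!row) throw new Error(`no infra provider registered for region ${region}`);
  return row.providerId;
}
```

Adding AWS for a new region then means inserting a row, not editing placement code.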
15. TypeScript interface reference
These types are the contract, not the implementation. Definitions live in prego-control-plane/src/types.ts (Phase 2 will reconcile current TenantStatus 4-state and PlanTier 5-tier with this standard — see §17).
```typescript
export type TenantPlan = "trial" | "starter" | "business" | "enterprise";

export type TenantStatus =
  | "trial"
  | "provisioning"
  | "active"
  | "suspended"
  | "grace_period"
  | "terminated"
  | "data_purged"
  | "failed";

export type Region = "sg" | "eu" | "us";

export interface PlacementRequest {
  tenantId: string;
  region: Region;
  plan: TenantPlan;
  expectedUsers?: number;
  storageQuotaGb?: number;
  dataResidency?: string;
}

export interface PlacementResult {
  appServerId: string;
  dbShardId: string;
  redisPoolId: string;
  siteName: string;
  databaseName: string;
  infraProvider: "hetzner" | "aws" | "gcp";
}
```

16. Final principles
- Tenant != Container
- Tenant == Frappe Site
- Site DB == Independent MariaDB Database
- App Runtime == Shared
- DB == Sharded by capacity and plan
- Redis == Shared with namespace separation
- Enterprise == Dedicated isolation

Every change to prego-control-plane, Docker deployment, or Frappe provisioning must be evaluated against these invariants.
17. Reconciliation with current implementation
ADR-002 supersedes ADR-001 and adds the following items. ADR-003 (2026-04-24) revises the IaC / provisioning rows below: Pulumi and Ansible are removed in favour of direct Worker calls.
| Standard item | Current state | Phase |
|---|---|---|
| db_shards D1 table | Not present | Phase 2 |
| redis_pools D1 table | Not present (single Redis node only) | Phase 2 |
| site_pool D1 table | Not present (on-demand bench new-site only) | Phase 2 |
| infra_providers D1 table | Not present (Hetzner hardcoded) | Phase 4 |
| tenant_lifecycle_events D1 table | Partial (provisioning_workflows only) | Phase 2 |
| server_bootstrap_tokens D1 table (ADR-003) | Not present | Phase 2 |
| agent_command_audit D1 table (ADR-003) | Not present | Phase 2 |
| TenantStatus 8-state | 4-state (Pending / Active / Pending_Deletion / Deleted) | Phase 5 |
| TenantPlan naming | free / basic / professional / business / enterprise (5-tier) | Phase 5 |
| Worker → Hetzner Cloud API direct (ADR-003) | Currently via prego-pulumi | Phase 3 (replaces Pulumi) |
| cloud-init dynamically generated by Worker (ADR-003) | Static template per environment | Phase 3 |
| Agent inbound /agent/v1/exec for bench / mariadb / redis-cli (ADR-003) | Currently via prego-ansible playbooks | Phase 3 (replaces Ansible) |
| InfraProvider interface | HetznerProvider direct call | Phase 4 (Worker-side, no Pulumi stack split) |
| Push + Inbound Exec Agent | Not deployed | Phase 3 |
Migration mapping (Phase 5 backfill):
| Current TenantStatus | New TenantStatus |
|---|---|
| Pending | provisioning |
| Active | active |
| Pending_Deletion | grace_period |
| Deleted | data_purged |
New states added without legacy mapping: trial, suspended, terminated, failed.
TenantPlan reconciliation (open decision — see open decisions issue tracker): how to map professional (currently between basic and business) into the 4-tier standard. Most likely: professional → business, free → trial, basic → starter.
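The Phase 5 backfill can be captured as two lookup tables: the status side from the mapping above, the plan side from the "most likely" mapping. The plan mapping is still an open decision, so treat these values as a working assumption, not the final migration:

```typescript
// Backfill sketch for Phase 5. The status mapping follows the table above;
// the plan mapping encodes the current best guess and may change.
const STATUS_BACKFILL: Record<string, string> = {
  Pending: "provisioning",
  Active: "active",
  Pending_Deletion: "grace_period",
  Deleted: "data_purged",
};

const PLAN_BACKFILL: Record<string, string> = {
  free: "trial",
  basic: "starter",
  professional: "business", // open decision, assumed mapping
  business: "business",
  enterprise: "enterprise",
};
```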
18. Cross-repo impact
This standard is canonical. The table below reflects the post-ADR-003 disposition.
| Repo | Status | Impact |
|---|---|---|
| prego-control-plane | Active, expanded | D1 migrations (ADR-002 set + ADR-003 server_bootstrap_tokens, agent_command_audit); PlacementEngine v2; Worker-side InfraProvider directly calling the Hetzner API; provisioning-workflow.ts rewrite; Agent inbound exec-client.ts; reconciliation of TenantStatus / PlanTier |
| prego-pulumi | Deprecated (ADR-003) | No new resources. Existing stacks readable for cutover only. See the migration runbook. Repo will be archived once the servers.managed_by = 'pulumi' count reaches zero. |
| prego-ansible | Deprecated (ADR-003) | No new playbooks. Existing inventories retained for cutover only. Same archival condition as above. |
| prego-docker | Active, scope reduced | Frappe images only. Site-pool fast-init flags verified (bench new-site --no-setup-db etc.). The Agent image is published from a separate source (ADR-003 OD-C) and pulled by cloud-init at server boot — not packaged in the Frappe compose snippet. |
| prego_saas | Unchanged | No DocType change for ADR-002 / ADR-003. Future server-side helpers require a separate ADR. |
19. References
- ADR-001 — Hybrid Multi-site Architecture (superseded by this standard)
- ADR-002 implementation mirror (prego-control-plane)
- ADR-003 — Direct Control Plane Provisioning (partial supersession for IaC/provisioning)
- Control Plane Direct Provisioning (canonical companion)
- Imperative Control Layer (Path 1 preserved; Path 3 added by ADR-003)
- Hybrid Multi-site Operations Runbook (§7.1 / §8.1 use the Worker /internal/servers/create post-ADR-003)
- Migration Runbook — From Pulumi/Ansible to Direct Control
- Tenant Provisioning Flow — Pulumi, Ansible, Frappe (historical; Pulumi/Ansible deprecated by ADR-003)
- SaaS Infrastructure — DB Separation and Usage-Based Scaling