
Frappe SaaS Multitenant Docker Standard (ADR-002)

Status: Accepted (data model). Partially superseded by ADR-003 for the IaC and provisioning control path.
Date: 2026-04-24
Supersedes: ADR-001 — Hybrid Multi-site Architecture
Partially superseded by: Control Plane Direct Provisioning (ADR-003) — Pulumi and Ansible are removed; the Worker calls the Hetzner Cloud API + cloud-init + Docker Remote API + scope-limited Agent inbound exec directly. §11, §17, and §18 below incorporate the ADR-003 changes; the rest of this document is unchanged.
Canonical: this document. Implementation ADR mirror: prego-control-plane/docs/rearchitecture/adr-002-multitenant-docker-standard.md; ADR-003 mirror: prego-control-plane/docs/rearchitecture/adr-003-direct-control-no-iac.md.

Companion documents

| Topic | Document |
| --- | --- |
| Worker → Hetzner → cloud-init → Agent (no Pulumi/Ansible) | Control Plane Direct Provisioning |
| Placement Engine — input/output, bin-packing, plan tier filter | Tenant Placement Policy |
| Pre-warmed empty sites | Site Pool Strategy |
| 8-state machine + transitions + side effects | Tenant Lifecycle |
| Agent — push telemetry + scope-limited inbound exec (mTLS, JWT) | Agent Security Standard |

Related platform docs:


1. Goal

Prego is a SaaS ERP built on Frappe/ERPNext. Operating tens of thousands of tenants requires that we never create per-tenant containers and never put per-tenant tables in a single shared schema.

The platform standard is:

Container = execution environment
Tenant = Frappe Site
Tenant DB = independent MariaDB database per Site
App Runtime = shared
DB Server = shared at the shard level
Redis = shared with namespace separation
DO NOT create app/db/redis containers per tenant.
DO NOT put per-tenant tables in a single shared database.

This standard affirms ADR-001’s decision to use one shared prego-frappe-bench container per app server and adds DB sharding, Redis pool, Site pool, plan-based isolation matrix, multi-cloud abstraction, expanded lifecycle, and push-only Agent.


2. Standard architecture

2.1 Top-level flow

[ Zuplo Gateway ]
        ↓
[ Cloudflare Worker — Control Plane ]
        ↓
[ Workflow Engine (D1-backed) / Queue ]
        ↓
[ Regional Dispatcher ]
        ↓
[ Docker Hosts ]

2.2 Per-server roles

```mermaid
flowchart TB
    subgraph cp [Control Plane Cloudflare Worker]
        api[Hono API]
        wfe[WorkflowEngine D1]
        plc[PlacementEngine]
        infra[InfraProvider abstraction]
    end
    api --> wfe
    wfe --> plc
    plc --> infra
    infra --> hetzner[HetznerProvider]
    infra --> aws[AWSProvider future]
    infra --> gcp[GCPProvider future]

    subgraph region [Region cluster sg eu us]
        subgraph appPool [App Pool shared bench]
            app1[app-001 prego-frappe-bench]
            app2[app-002 prego-frappe-bench]
        end
        subgraph dbShards [DB Shards]
            db1[db-shard-001 tenant 1-5000]
            db2[db-shard-002 tenant 5001-10000]
        end
        subgraph redisPool [Redis Pool]
            rc[redis-cache]
            rq[redis-queue]
            rs[redis-socketio]
        end
        sitePool[(Site Pool pre-warmed)]
    end
    wfe -->|"Docker API mTLS"| app1
    wfe -->|"Docker API mTLS"| app2
    appPool --> dbShards
    appPool --> redisPool
    sitePool -.->|reserve assign| appPool

    agentNode[Push Agent on host] -->|"OTLP HTTPS"| cp
    app1 -.-> agentNode
    db1 -.-> agentNode
    rc -.-> agentNode
```

3. Per-server Docker standard

3.1 App Server

  • Host: app-<region>-NNN (e.g. app-sgp-001)
  • Containers: frappe-app (the shared prego-frappe-bench), nginx, socketio (optional)
  • Internal layout:
/home/frappe/frappe-bench
├── apps/
└── sites/
    ├── tenant-a.com
    ├── tenant-b.com
    └── common_site_config.json

Principles:

  • One app container handles many tenant sites.
  • App container runs stateless-ish (only sites/ and logs/ are mounted).
  • Scale by CPU / memory / worker load, not by tenant count alone.
  • Horizontal scaling is the primary growth path.

3.2 DB Server

  • Host: db-shard-<region>-NNN (e.g. db-shard-sgp-001)
  • Container: mariadb
  • Internal layout:
MariaDB
├── tenant_a_db
├── tenant_b_db
├── tenant_c_db
└── tenant_d_db

Principles:

  • Never create per-tenant containers.
  • Always create a dedicated MariaDB database per tenant.
  • One DB server hosts many tenant DBs until shard capacity is reached.
  • When a shard hits capacity, create db-shard-<region>-NN+1.
db-shard-sgp-001 → tenant 1 ~ 5000
db-shard-sgp-002 → tenant 5001 ~ 10000

3.3 Redis Server

  • Host: redis-<region>-NNN
  • Container: redis

Logical separation (always required):

  • cache
  • queue
  • socketio

Initial pattern: single Redis instance with DB index per purpose (matches today’s prego-docker compose pattern: REDIS_CACHE_DB=1, REDIS_QUEUE_DB=2, REDIS_SOCKETIO_DB=3).
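Under the single-instance DB-index pattern, the three logical Redis URLs can be derived mechanically. A minimal sketch — the `redis_cache` / `redis_queue` / `redis_socketio` keys mirror Frappe's `common_site_config.json` layout, and the host name is illustrative:

```typescript
// Derive the three logical Redis URLs from one shared instance,
// using the DB indexes named above (cache=1, queue=2, socketio=3).
type RedisRole = "cache" | "queue" | "socketio";

const REDIS_DB_INDEX: Record<RedisRole, number> = {
  cache: 1,
  queue: 2,
  socketio: 3,
};

function redisUrl(host: string, role: RedisRole, port = 6379): string {
  return `redis://${host}:${port}/${REDIS_DB_INDEX[role]}`;
}

// Fragment of common_site_config.json for a shared Redis node:
const redisConfig = {
  redis_cache: redisUrl("redis-sgp-001", "cache"),
  redis_queue: redisUrl("redis-sgp-001", "queue"),
  redis_socketio: redisUrl("redis-sgp-001", "socketio"),
};
```

Moving to split instances later only changes the host per role; the role-to-index mapping becomes a role-to-host mapping in the `redis_pools` registry.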

Growth path (managed via redis_pools registry):

  • Split into redis-cache, redis-queue, redis-socketio instances
  • Or move to redis-cluster

4. Tenant placement standard

4.1 Forbidden patterns

Forbidden 1 — per-tenant app/db/redis containers:

tenant-a-app
tenant-a-db
tenant-a-redis

This pattern is not used for Trial / Starter / Business plans.

Forbidden 2 — per-tenant tables in a shared DB:

sales_invoice with tenant_id column

Incompatible with Frappe/ERPNext schema model. Forbidden.

4.2 Required pattern

Tenant
→ Frappe Site
→ independent MariaDB Database
→ Shared App Runtime
→ Shared Redis (logical separation)

Example:

tenant-a
→ site: tenant-a.pregoi.com
→ app: app-sgp-001
→ db: db-shard-sgp-001 / tenant_a_db
→ redis: redis-sgp-001 (cache db=1, queue=2, socketio=3)

5. Plan-based isolation matrix

| Plan | App | DB | Redis | Notes |
| --- | --- | --- | --- | --- |
| Trial | Shared App | Shared trial-only DB shard | Shared Redis | Site Pool reserves immediately |
| Starter | Shared App | Shared DB shard | Shared Redis | Default SaaS pattern |
| Business | Shared App | Business DB shard (plan_tier_lock='business') | Shared or Dedicated Redis | Stronger performance guarantee |
| Enterprise | Dedicated App | Dedicated DB | Dedicated Redis | Strong isolation, contract-driven |

Implementation note: db_shards.plan_tier_lock is the primary filter in PlacementEngine. See Tenant Placement Policy.


6. Site Pool strategy (summary)

A Frappe bench new-site run takes 2–5 minutes; at scale, that provisioning latency becomes the dominant UX cost. We pre-warm empty sites in a pool and promote them on tenant assignment.

sites/
├── pool-001 (state=available)
├── pool-002 (state=reserved)
└── pool-003 (state=available)
Provision:
pool-001 → tenant-a.pregoi.com (state=assigned)

States: available → reserved → assigned. Failures land in failed and are reaped by a background warmer. Full design in Site Pool Strategy.
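The pool state machine above can be sketched as a transition guard; the exact edge set (e.g. releasing a stale reservation back to available) is an assumption here — the authoritative design is in Site Pool Strategy:

```typescript
// Assumed site_pool state machine: available → reserved → assigned,
// with failed reaped by the background warmer.
type PoolState = "available" | "reserved" | "assigned" | "failed";

const NEXT: Record<PoolState, PoolState[]> = {
  available: ["reserved", "failed"],
  reserved: ["assigned", "available", "failed"], // release back on reservation timeout
  assigned: [],                                  // terminal for the pool; site now belongs to a tenant
  failed: ["available"],                         // background warmer re-warms
};

function transition(from: PoolState, to: PoolState): PoolState {
  if (!NEXT[from].includes(to)) {
    throw new Error(`illegal site_pool transition ${from} -> ${to}`);
  }
  return to;
}
```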


7. Control Plane responsibilities

The Cloudflare Worker Control Plane must not execute long-running tasks itself.

Responsibilities:

  • Receive tenant create requests
  • Persist tenant metadata (D1 tenants_master)
  • Record placement decision (D1 tenant_allocations, allocation_snapshots)
  • Enqueue workflow job (Cloudflare Queue → prego-provision-queue)
  • Provide status API (GET /v1/jobs/:id)

Forbidden:

  • Long polling inside a Worker request
  • Direct bench command execution from a Worker
  • Polling agents on every host from the Worker

The current implementation in src/index.ts (fetch / queue / scheduled handlers) already follows this rule. ADR-002 codifies it.


8. Workflow / Queue responsibilities

The workflow engine (D1-backed WorkflowEngine in src/orchestration/workflow-engine.ts) handles each provisioning job through the following ordered steps:

  1. Receive tenant creation request from queue
  2. Call PlacementEngine (region / plan / shard / pool decision)
  3. Reserve a Site from site_pool
  4. Create DB on chosen shard, or attach pool DB if pre-baked
  5. Update common_site_config.json on app server (Docker exec)
  6. Configure custom domain + DNS (Cloudflare DNS API)
  7. Transition tenant to active
  8. On failure: rollback (drop site, decrement counters, mark failed)

Initial implementation: Cloudflare Queue + custom WorkflowEngine (already in place). Growth path: consider Temporal / Inngest for cross-region sagas if needed.
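The eight ordered steps above amount to a compensating workflow: on failure, completed steps are rolled back in reverse order. A minimal sketch — step and helper names are illustrative, not the real workflow-engine.ts API:

```typescript
// Illustrative saga-style runner: run steps in order, and on any failure
// compensate the completed steps in reverse (drop site, decrement counters, …).
interface ProvisionContext {
  tenantId: string;
  siteName?: string;
  dbShardId?: string;
}

interface Step {
  name: string;
  run: (ctx: ProvisionContext) => Promise<void>;
  compensate?: (ctx: ProvisionContext) => Promise<void>;
}

async function runWorkflow(steps: Step[], ctx: ProvisionContext): Promise<void> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run(ctx);
      done.push(step);
    } catch (err) {
      for (const s of done.reverse()) {
        await s.compensate?.(ctx); // best-effort rollback, newest first
      }
      throw err; // surface the failure so the tenant lands in "failed"
    }
  }
}
```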


9. Placement Engine standard

Input:

tenant_id, region, plan, expected_users, storage_quota_gb, data_residency

Output:

app_server_id, db_shard_id, redis_pool_id, site_name, database_name

Initial algorithm:

  1. region match
  2. plan match (against db_shards.plan_tier_lock, dedicated server constraints)
  3. Available capacity (memory, tenant count)
  4. Least-loaded shard / app server

Growth path:

  • Bin-packing (multi-dimensional)
  • Tenant affinity (group same-customer tenants for cache locality)
  • Noisy-neighbor isolation (move hot tenants to dedicated shards)
  • Enterprise dedicated placement (single-tenant servers)

Full spec in Tenant Placement Policy.
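The four-step initial algorithm can be sketched as a filter chain over the db_shards registry; the shard shape below is an assumption modelled on the fields this document names (region, plan_tier_lock, capacity):

```typescript
// Sketch of the initial placement algorithm for the DB-shard dimension:
// 1) region match, 2) plan_tier_lock match, 3) capacity, 4) least-loaded.
interface DbShard {
  id: string;
  region: string;
  planTierLock: string | null; // e.g. "business"; null = accepts any plan
  tenantCount: number;
  maxTenants: number;
}

function pickShard(shards: DbShard[], region: string, plan: string): DbShard {
  const candidates = shards
    .filter((s) => s.region === region)
    .filter((s) => s.planTierLock === null || s.planTierLock === plan)
    .filter((s) => s.tenantCount < s.maxTenants);
  if (candidates.length === 0) throw new Error("no shard capacity in region");
  // Least-loaded by fill ratio; ties keep the earlier shard.
  return candidates.reduce((a, b) =>
    a.tenantCount / a.maxTenants <= b.tenantCount / b.maxTenants ? a : b
  );
}
```

The same filter-then-score pattern applies to app servers and Redis pools; the growth-path items (bin-packing, affinity) replace only the scoring step.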


10. Tenant lifecycle (8 states)

trial → provisioning → active → suspended
grace_period → terminated → data_purged
failed

Additional actions (out-of-band, not state transitions):

  • upgrade / downgrade (plan tier change)
  • migrate_shard (move DB to a different shard)
  • change_domain
  • install_app
  • backup / restore
  • rolling_migrate (Frappe / ERPNext version bump)

State machine, transition triggers, and side effects (DNS, billing, backup) are defined in Tenant Lifecycle.
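As a rough illustration of the 8-state machine, the allowed edges might look like the map below. The exact edge set is an assumption — the authoritative transitions live in the Tenant Lifecycle document:

```typescript
// Assumed transition map for the 8 tenant states; edges are illustrative.
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";

const TRANSITIONS: Record<TenantStatus, TenantStatus[]> = {
  trial: ["provisioning", "terminated"],
  provisioning: ["active", "failed"],
  active: ["suspended", "grace_period"],
  suspended: ["active", "grace_period"],
  grace_period: ["active", "terminated"], // reactivation window before deletion
  terminated: ["data_purged"],
  data_purged: [],                        // terminal
  failed: ["provisioning"],               // retry after rollback
};

function canTransition(from: TenantStatus, to: TenantStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```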


11. Agent standard

Updated by ADR-003 (2026-04-24). The Agent is no longer push-only. It now also exposes a scope-limited inbound /agent/v1/exec endpoint to run bench / mariadb / redis-cli, replacing what was previously an Ansible playbook run. Container lifecycle remains on the Docker Remote API.

Each Docker host runs a hardened Push + Scope-Limited Inbound Exec Agent, deployed by cloud-init at server boot.

11.1 Outbound (push) responsibilities

  • Push container status (running/stopped, image tag)
  • Push host metrics (CPU, memory, disk)
  • Push Frappe/bench worker status
  • Push site health summaries (per-site doctor)
  • Push DB connection / Redis memory metrics
  • Push backup completion reports

11.2 Inbound (exec) responsibilities — ADR-003

  • POST /agent/v1/exec with kind ∈ { bench, mariadb, redis-cli }
  • POST /agent/v1/exec/stream (SSE) for long-running commands (e.g., bench backup, bench migrate)
  • Single-shot per call; Worker supplies timeoutSeconds
  • Every invocation writes to agent_command_audit D1 table
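A Worker-side call to the exec endpoint might look like the sketch below. The request fields (`kind`, `args`, `timeoutSeconds`) follow the bullets above; everything else (URL layout aside from the documented path, response shape) is an assumption:

```typescript
// Sketch: single-shot exec call from the Worker to one host's Agent.
// The Worker supplies the timeout; the Agent enforces kind-scoped commands.
interface ExecRequest {
  kind: "bench" | "mariadb" | "redis-cli";
  args: string[];
  timeoutSeconds: number;
}

async function agentExec(agentUrl: string, jwt: string, req: ExecRequest): Promise<any> {
  const res = await fetch(`${agentUrl}/agent/v1/exec`, {
    method: "POST",
    headers: {
      authorization: `Bearer ${jwt}`, // short-lived JWT, scope = exec.<kind>
      "content-type": "application/json",
    },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`agent exec failed: ${res.status}`);
  return res.json();
}
```

Long-running commands (bench backup, bench migrate) would go to /agent/v1/exec/stream instead and consume SSE rather than a single JSON body.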

11.3 Hard boundary

The Agent MUST NOT:

  • Accept arbitrary shell (no /agent/v1/shell, no generic /exec without kind)
  • Run docker run or docker exec on behalf of callers — that path stays on the Docker Remote API (mTLS port 2376)
  • Hold long-lived sessions

The Worker MUST NOT:

  • Use SSH for any automated operation
  • Use Pulumi or Ansible to drive any host (deprecated by ADR-003)

11.4 Security

  • mTLS (CA-issued client cert; same CA as Docker Remote API)
  • Short-lived JWT (≤5 min), scope = exec.bench / exec.mariadb / exec.redis-cli for Path D, or heartbeat / metrics / logs / backup_report / rotate-token for Path E (push)
  • Optional Cloudflare egress IP allowlist on host firewall (open decision: ADR-003 OD-A)
  • systemd hardening (NoNewPrivileges, restricted capabilities)
  • Bootstrap token: single-use, ≤15 min TTL, embedded in cloud-init user_data

Full security standard in Agent Security Standard. Full provisioning sequence in Control Plane Direct Provisioning.


12. Observability standard

The Agent pushes telemetry to Control Plane, which forwards to the central collector.

Metrics collected:

  • server_cpu, server_memory, disk_pct
  • container_status
  • bench_worker_status
  • site_health (per site)
  • db_connections, db_disk_usage
  • redis_memory, redis_eviction_rate
  • queue_depth (Frappe RQ)
  • provisioning_duration (per tenant)
  • error_count

Recommended sinks:

  • OpenTelemetry envelope (OTLP/HTTP)
  • Grafana Cloud or Datadog (configurable per region)
  • Cloudflare Logpush for Worker logs

13. Backup / DR standard

| Plan | RPO | RTO | Backup mechanism |
| --- | --- | --- | --- |
| Trial | 24h | Best effort | bench backup daily, R2 7d retention |
| Starter | 12h | 24h | bench backup 12h, MariaDB dump daily, R2 30d |
| Business | 1h ~ 4h | 4h ~ 8h | MariaDB physical dump hourly, volume snapshot daily, R2 90d |
| Enterprise | Custom | Custom | Cross-region replica + custom RPO; contract-driven |

Backup methods (in priority order):

  • bench backup (Frappe-native, includes site files)
  • MariaDB dump (logical mysqldump or physical mariabackup)
  • Volume snapshot (provider-specific)
  • R2 / S3 archive (long-term cold storage)

14. Multi-cloud abstraction

Hetzner is the current sole cloud provider. To avoid vendor lock-in (regional coverage gaps, single-vendor outages) the Control Plane uses an InfraProvider interface.

```typescript
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
  listRegions(): Promise<Region[]>;
}
```

infra_providers D1 table records which provider serves which region. Initial implementation: Hetzner only. Expansion happens without changing PlacementEngine consumers.
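An in-memory stand-in makes the contract concrete; the ServerSpec / ServerHandle / Region shapes are not defined in this section, so the minimal versions below are assumptions. The real HetznerProvider would call the Hetzner Cloud API over HTTPS instead:

```typescript
// Assumed minimal shapes for the types the interface references.
interface ServerSpec { name: string; type: string; region: string; userData?: string }
interface ServerHandle { id: string; ip: string }
interface Region { id: string; name: string }

interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
  listRegions(): Promise<Region[]>;
}

// Illustrative in-memory fake; real implementation calls the cloud API.
class FakeHetznerProvider implements InfraProvider {
  readonly id = "hetzner" as const;
  private servers = new Map<string, ServerSpec>();

  async createServer(spec: ServerSpec): Promise<ServerHandle> {
    // spec.userData would carry the cloud-init payload in the real provider.
    const handle = { id: `srv-${this.servers.size + 1}`, ip: "10.0.0.1" };
    this.servers.set(handle.id, spec);
    return handle;
  }
  async destroyServer(id: string): Promise<void> {
    this.servers.delete(id);
  }
  async listRegions(): Promise<Region[]> {
    return [{ id: "sg", name: "Singapore" }];
  }
}
```

PlacementEngine consumers depend only on the interface, so adding AWSProvider or GCPProvider later is additive.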


15. TypeScript interface reference

These types are the contract, not the implementation. Definitions live in prego-control-plane/src/types.ts (Phase 2 will reconcile current TenantStatus 4-state and PlanTier 5-tier with this standard — see §17).

```typescript
export type TenantPlan = "trial" | "starter" | "business" | "enterprise";

export type TenantStatus =
  | "trial"
  | "provisioning"
  | "active"
  | "suspended"
  | "grace_period"
  | "terminated"
  | "data_purged"
  | "failed";

export type Region = "sg" | "eu" | "us";

export interface PlacementRequest {
  tenantId: string;
  region: Region;
  plan: TenantPlan;
  expectedUsers?: number;
  storageQuotaGb?: number;
  dataResidency?: string;
}

export interface PlacementResult {
  appServerId: string;
  dbShardId: string;
  redisPoolId: string;
  siteName: string;
  databaseName: string;
  infraProvider: "hetzner" | "aws" | "gcp";
}
```

16. Final principles

Tenant != Container
Tenant == Frappe Site
Site DB == Independent MariaDB Database
App Runtime == Shared
DB == Sharded by capacity and plan
Redis == Shared with namespace separation
Enterprise == Dedicated isolation

Every change to prego-control-plane, Docker deployment, or Frappe provisioning must be evaluated against these invariants.


17. Reconciliation with current implementation

ADR-002 supersedes ADR-001 and adds the following items. ADR-003 (2026-04-24) revises the IaC / provisioning rows below: Pulumi and Ansible are removed in favour of direct Worker calls.

| Standard item | Current state | Phase |
| --- | --- | --- |
| db_shards D1 table | Not present | Phase 2 |
| redis_pools D1 table | Not present (single Redis node only) | Phase 2 |
| site_pool D1 table | Not present (on-demand bench new-site only) | Phase 2 |
| infra_providers D1 table | Not present (Hetzner hardcoded) | Phase 4 |
| tenant_lifecycle_events D1 table | Partial (provisioning_workflows only) | Phase 2 |
| server_bootstrap_tokens D1 table (ADR-003) | Not present | Phase 2 |
| agent_command_audit D1 table (ADR-003) | Not present | Phase 2 |
| TenantStatus 8-state | 4-state (Pending \| Active \| Pending_Deletion \| Deleted) | Phase 5 |
| TenantPlan naming | free \| basic \| professional \| business \| enterprise (5-tier) | Phase 5 |
| Worker → Hetzner Cloud API direct (ADR-003) | Currently via prego-pulumi | Phase 3 (replaces Pulumi) |
| cloud-init dynamically generated by Worker (ADR-003) | Static template per environment | Phase 3 |
| Agent inbound /agent/v1/exec for bench / mariadb / redis-cli (ADR-003) | Currently via prego-ansible playbooks | Phase 3 (replaces Ansible) |
| InfraProvider interface | HetznerProvider direct call | Phase 4 (Worker-side, no Pulumi stack split) |
| Push + Inbound Exec Agent | Not deployed | Phase 3 |

Migration mapping (Phase 5 backfill):

| Current TenantStatus | New TenantStatus |
| --- | --- |
| Pending | provisioning |
| Active | active |
| Pending_Deletion | grace_period |
| Deleted | data_purged |

New states added without legacy mapping: trial, suspended, terminated, failed.
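The Phase 5 backfill reduces to a lookup over the mapping above; a minimal sketch (the function name is illustrative):

```typescript
// Legacy 4-state → new 8-state mapping, exactly as tabulated above.
// The four new states (trial, suspended, terminated, failed) have no legacy source.
const LEGACY_STATUS_MAP = {
  Pending: "provisioning",
  Active: "active",
  Pending_Deletion: "grace_period",
  Deleted: "data_purged",
} as const;

type LegacyStatus = keyof typeof LEGACY_STATUS_MAP;

function migrateStatus(legacy: LegacyStatus): string {
  return LEGACY_STATUS_MAP[legacy];
}
```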

TenantPlan reconciliation (open decision — see open decisions issue tracker): how to map professional (currently between basic and business) into the 4-tier standard. Most likely: professional → business, free → trial, basic → starter.


18. Cross-repo impact

This standard is canonical. The table below reflects the post-ADR-003 disposition.

| Repo | Status | Impact |
| --- | --- | --- |
| prego-control-plane | Active, expanded | D1 migrations (ADR-002 set + ADR-003 server_bootstrap_tokens, agent_command_audit); PlacementEngine v2; Worker-side InfraProvider directly calling Hetzner API; provisioning-workflow.ts rewrite; Agent inbound exec-client.ts; reconciliation of TenantStatus/PlanTier |
| prego-pulumi | Deprecated (ADR-003) | No new resources. Existing stacks readable for cutover only. See the migration runbook. Repo will be archived once servers.managed_by = 'pulumi' count reaches zero. |
| prego-ansible | Deprecated (ADR-003) | No new playbooks. Existing inventories retained for cutover only. Same archival condition as above. |
| prego-docker | Active, scope reduced | Frappe images only. Site-pool fast-init flags verified (bench new-site --no-setup-db etc.). The Agent image is published from a separate source (ADR-003 OD-C) and pulled by cloud-init at server boot — not packaged in the Frappe compose snippet. |
| prego_saas | Unchanged | No DocType change for ADR-002 / ADR-003. Future server-side helpers require a separate ADR. |

19. References
