Site Pool Strategy
Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002)
Related: Tenant Placement Policy, Tenant Lifecycle
1. Problem
`bench new-site` is slow:
- Empty site creation: 30–90 seconds
- With ERPNext install + sample CoA: 2–5 minutes
- Concurrent provisioning under load: 5–10 minutes (DB / disk contention)

For a SaaS funnel where the user expects to land on their workspace within 60 seconds, on-demand site creation is unacceptable above a rate of a few tenants per hour.
2. Solution
Maintain a pool of pre-created, empty Frappe sites on each app server. On tenant assignment, promote a pool site to the tenant by:
- Renaming the site directory (`pool-001` → `acme.pregoi.com`)
- Updating `site_config.json` with the new site name + DB binding
- Running `bench install-app` for tenant-selected apps
- Optionally importing customer CoA / templates
End-to-end promotion target: ≤ 5 seconds (excluding optional app installs).
3. Data model
3.1 site_pool table (new)
```sql
CREATE TABLE site_pool (
  pool_site_id           TEXT PRIMARY KEY,
  app_server_id          TEXT NOT NULL REFERENCES servers(server_id),
  region                 TEXT NOT NULL,
  pool_site_name         TEXT NOT NULL,  -- e.g. pool-sgp-001-042
  database_name          TEXT NOT NULL,  -- pre-created tenant_pool_xxx_db
  db_shard_id            TEXT NOT NULL REFERENCES db_shards(shard_id),
  state                  TEXT NOT NULL CHECK (state IN ('creating','available','reserved','assigned','failed','draining')),
  reserved_for_tenant_id TEXT,
  reserved_at            TEXT,
  reserved_until         TEXT,           -- TTL for stuck reservations
  assigned_tenant_id     TEXT,
  assigned_at            TEXT,
  assigned_site_name     TEXT,           -- post-promotion site name
  failure_reason         TEXT,
  created_at             TEXT NOT NULL DEFAULT (datetime('now')),
  updated_at             TEXT NOT NULL DEFAULT (datetime('now')),
  UNIQUE (app_server_id, pool_site_name)
);

CREATE INDEX idx_site_pool_state_app ON site_pool(state, app_server_id);
CREATE INDEX idx_site_pool_reserved_until ON site_pool(reserved_until) WHERE state = 'reserved';
```

Note that `creating` is included in the `state` CHECK because the warmer inserts the row before `bench new-site` runs (§4.1). The DB shard is pinned at creation time because a pool site has a real MariaDB database already attached.
3.2 Pool sizing parameters
Per-region configuration in scaling_config (existing — see migrations/0044_hybrid_multisite.sql):
| Key | Default | Meaning |
|---|---|---|
| `site_pool_target_per_app_server` | 5 | Desired available count per app server |
| `site_pool_min_available_global` | 20 | Region-wide floor that triggers urgent refill |
| `site_pool_max_per_app_server` | 15 | Hard cap (cost control) |
| `site_pool_reservation_ttl_seconds` | 600 | Stuck reservation reaped after this |
| `site_pool_refill_concurrency` | 3 | Parallel `bench new-site` per app server |
| `site_pool_refill_priority_threshold` | 2 | When available < this, refill becomes high-priority |
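As a sketch, these parameters combine into a per-server refill decision roughly as follows. The `PoolConfig` shape and the `totalCount` input (all rows on the server, any state) are illustrative, not the actual `scaling_config` accessor:

```typescript
// Sketch of the per-server refill decision built from the sizing parameters.
type PoolConfig = {
  targetPerAppServer: number;   // site_pool_target_per_app_server
  maxPerAppServer: number;      // site_pool_max_per_app_server
  priorityThreshold: number;    // site_pool_refill_priority_threshold
};

type RefillJob = { priority: "high" | "normal"; deficit: number };

function refillDecision(
  availableCount: number,
  totalCount: number,
  cfg: PoolConfig,
): RefillJob | null {
  // Hard cap: never create sites beyond max_per_app_server (cost control).
  const headroom = cfg.maxPerAppServer - totalCount;
  const deficit = Math.min(cfg.targetPerAppServer - availableCount, headroom);
  if (deficit <= 0) return null;
  // Below the priority threshold the refill job jumps the queue.
  const priority: "high" | "normal" =
    availableCount < cfg.priorityThreshold ? "high" : "normal";
  return { priority, deficit };
}
```

With the defaults (target 5, cap 15, threshold 2), a server holding 1 available site out of 6 total gets a high-priority job for 4 sites; a server already at the cap gets nothing even when below target.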
4. State machine
```mermaid
stateDiagram-v2
    [*] --> creating: warmer creates new pool site
    creating --> available: bench new-site succeeds
    creating --> failed: bench new-site fails
    available --> reserved: PlacementEngine reserves
    reserved --> assigned: provisioning workflow promotes
    reserved --> available: TTL expired or workflow rolled back
    reserved --> failed: promotion error
    assigned --> [*]: tenant terminated and site dropped
    failed --> draining: operator marks for cleanup
    draining --> [*]: cleanup complete
    available --> draining: app server scale-in
```
4.1 State transitions
| From | To | Trigger | Side effect |
|---|---|---|---|
| (none) | `creating` | Warmer cron decides to refill | Insert row, state `creating` |
| `creating` | `available` | `bench new-site` exit 0 + smoke check | Set `available`, log refill event |
| `creating` | `failed` | `bench new-site` non-zero exit | Capture stderr to `failure_reason`, alert if rate exceeds threshold |
| `available` | `reserved` | PlacementEngine selects pool site | Set `reserved_for_tenant_id`, `reserved_until = now + TTL` |
| `reserved` | `assigned` | Promotion workflow finishes (rename + config + first login OK) | Set `assigned_*`, also update `tenants_master.site_name` |
| `reserved` | `available` | TTL expired (workflow stalled) | Clear reservation; insert `tenant_lifecycle_events` row with reason `pool_reservation_ttl` |
| `reserved` | `failed` | Promotion rename / config errored | Capture error, alert |
| `assigned` | (deleted) | Tenant fully purged | Drop site + DB; row hard-deleted |
| `available` | `draining` | App server scale-in | Stop offering pool sites; let warmer drain |
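The legal transitions above can be encoded as a small lookup table used to guard state updates. A sketch — the terminal `purged` name is illustrative for the `[*]` / deleted end states:

```typescript
// Legal site_pool state transitions, as a guard table.
// "purged" stands in for the terminal (row deleted / cleanup complete) states.
const TRANSITIONS: Record<string, string[]> = {
  creating:  ["available", "failed"],
  available: ["reserved", "draining"],
  reserved:  ["assigned", "available", "failed"],
  assigned:  ["purged"],
  failed:    ["draining"],
  draining:  ["purged"],
};

function canTransition(from: string, to: string): boolean {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```

An UPDATE handler can reject any `(from, to)` pair the table does not list, turning illegal transitions into loud errors instead of silent data corruption.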
5. Warming SLO
The warmer cron runs every 60 seconds per region (Cloudflare Worker scheduled handler).
5.1 Refill decision
For each (app_server_id, region):
```
available_count = SELECT COUNT(*) FROM site_pool
                  WHERE app_server_id = $1 AND state = 'available'

if available_count < site_pool_refill_priority_threshold:
    deficit = site_pool_target_per_app_server - available_count
    enqueue HIGH-priority refill job (deficit, app_server_id)
elif available_count < site_pool_target_per_app_server:
    deficit = site_pool_target_per_app_server - available_count
    enqueue NORMAL-priority refill job
```

5.2 SLO targets
| SLO | Target | Measurement |
|---|---|---|
| `available` per app server | ≥ `site_pool_target_per_app_server` 99% of the time | 1-minute snapshots |
| Pool exhaustion (zero `available` for ≥ 5 min) | ≤ 1 event/region/week | Alert page |
| Promotion latency (`reserved` → `assigned`) | p95 ≤ 5 s | Per-tenant timing |
| `failed` ratio | ≤ 1% of `creating` attempts | Daily aggregate |
An SLO breach pages on-call (see `prego-control-plane/docs/runbook/hybrid-multisite-operations.md` for the response procedure).
6. Refill execution
The refill job runs inside the existing provisioning workflow (src/orchestration/workflow-engine.ts) with a dedicated workflow type site_pool_refill.
6.1 Steps per refill
```mermaid
flowchart TB
    pick[Select target app server and DB shard]
    pick --> name[Generate pool_site_name]
    name --> insert[Insert site_pool row state=creating]
    insert --> docker[Docker exec bench new-site --no-setup-db]
    docker --> dbcreate[Create empty MariaDB database on chosen shard]
    dbcreate --> bind[bench setup db-only]
    bind --> smoke[bench --site doctor smoke check]
    smoke --> done[Update state=available]
    docker -.->|error| failrow[Update state=failed]
    smoke -.->|error| failrow
```
6.2 Resource controls
- `site_pool_refill_concurrency` limits parallel `bench new-site` runs per app server (avoid I/O collapse)
- DB creation runs before `bench new-site --skip-redis-config-generation` to keep the bench step short
- Failed refills back off with exponential delay (30s → 60s → 120s → 5m) and alert after 3 consecutive failures on the same `(app_server, db_shard)` pair
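The backoff and alert rules above can be expressed as two small helpers — a sketch; the 0-based `failureCount` numbering is an assumption:

```typescript
// Backoff schedule from the resource controls: 30s → 60s → 120s → 5m,
// then stays at 5m for every subsequent failure.
function refillBackoffSeconds(failureCount: number): number {
  const SCHEDULE = [30, 60, 120, 300]; // seconds
  return SCHEDULE[Math.min(failureCount, SCHEDULE.length - 1)];
}

// Alert after 3 consecutive failures on the same (app_server, db_shard) pair.
function shouldAlertOnRefillFailure(consecutiveFailures: number): boolean {
  return consecutiveFailures >= 3;
}
```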
7. Promotion (reserve → assigned)
Triggered by the tenant provisioning workflow once PlacementEngine reserves a pool site.
7.1 Steps
- Verify the reservation is still owned by the current job (`reserved_for_tenant_id == job.tenant_id`).
- `bench --site pool-XXX rename-site --new-name acme.pregoi.com` (or equivalent rename).
- Update `site_config.json` (DB host, redis URLs, custom site flags).
- Optionally `bench --site acme.pregoi.com install-app prego_saas` if the tenant requires apps beyond the pre-installed set.
- Health probe (`bench --site acme.pregoi.com doctor`).
- Set `state='assigned'`, `assigned_*` fields.
- Update `tenants_master.site_name`, `tenant_runtime.origin_url`.
- Insert `tenant_lifecycle_events` row (provisioning → active).
7.2 Rollback
If any step fails:
- Revert the site rename if performed
- Reset row to `available` if database / file-system is untouched
- Reset row to `failed` and trigger a refill if the state is unrecoverable
- Compensate `tenant_allocations` counters
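The steps-plus-rollback flow can be sketched as a generic runner that unwinds completed steps in reverse order. The `PromotionStep` shape and step names are illustrative, not the actual workflow-engine.ts API:

```typescript
// Minimal promotion runner sketch: each step may register an undo action
// that is only executed if the step itself succeeded.
type PromotionStep = {
  name: string;
  run: () => void;    // throws on failure
  undo?: () => void;  // compensation (e.g. revert the rename)
};

function runPromotion(steps: PromotionStep[]): { ok: boolean; failedAt?: string } {
  const done: PromotionStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      // Roll back completed steps in reverse order.
      for (let i = done.length - 1; i >= 0; i--) done[i].undo?.();
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}
```

In the real workflow the undo of the rename step is what restores the `pool-XXX` name, after which the row can go back to `available`.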
8. Pre-installed app strategy
Pool sites are created with a base app set to minimize promotion latency:
| Plan | Pre-installed apps |
|---|---|
| trial, starter | frappe, erpnext, prego_saas |
| business | frappe, erpnext, prego_saas, hrms, crm |
| enterprise | Provisioned on-demand (no pool used; `bench new-site` directly) |
Enterprise uses a dedicated server (per Plan Isolation Matrix §5) and skips the pool. The longer provisioning time is acceptable for Enterprise contracts.
The pool is partitioned by app set: `site_pool.pool_site_name` encodes the app set (e.g. `pool-sgp-001-bus-042` for the Business set) so the warmer maintains separate counts per set.
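A hypothetical parser for this encoding — the exact segment layout is an assumption inferred from the examples `pool-sgp-001-042` and `pool-sgp-001-bus-042`; names without an app-set segment fall back to a `base` set here:

```typescript
// Parse pool-<region>-<server>-[<appset>-]<seq>, e.g. pool-sgp-001-bus-042.
// Format and the "base" fallback are illustrative assumptions.
type PoolName = { region: string; server: string; appSet: string; seq: string };

function parsePoolSiteName(name: string): PoolName | null {
  const m = name.match(/^pool-([a-z]+)-(\d+)-(?:([a-z]+)-)?(\d+)$/);
  if (!m) return null;
  return { region: m[1], server: m[2], appSet: m[3] ?? "base", seq: m[4] };
}
```

The warmer can then group `available` counts by `appSet` and run the refill decision once per set.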
9. Cost / disk footprint
A pool site occupies:
- One MariaDB database (~30 MB empty Frappe schema)
- One MariaDB database (~30 MB empty Frappe schema)
- One `sites/<pool>/private/` directory (~5 MB)
- One `sites/<pool>/site_config.json`
Per app server with target=5 and pre-installed Business set: ~250 MB disk + ~150 MB MariaDB. Negligible at the per-host scale; relevant when planning shard capacity.
Cost monitoring: a dedicated metric site_pool_disk_bytes is pushed by the Agent and aggregated regionally.
10. Edge cases
10.1 App server scale-in with pooled sites
When an app server is marked draining:
- Stop the warmer from refilling that server
- Existing `available` rows transition to `draining`
- Existing `reserved` rows are honored to completion (then drained)
- After all reservations close, drop pool sites + DBs
10.2 DB shard exhaustion mid-promotion
If the pinned DB shard runs out of capacity between reservation and promotion (rare; pool sites already have DBs), promotion still succeeds — the DB was created at pool-site creation time. Subsequent new pool sites will be pinned to a different shard.
10.3 Concurrent reservation race
Reservation uses an atomic D1 update:
```sql
UPDATE site_pool
SET state='reserved',
    reserved_for_tenant_id=?,
    reserved_at=datetime('now'),
    reserved_until=datetime('now', '+10 minutes')
WHERE pool_site_id=? AND state='available'
RETURNING pool_site_id;
```

If 0 rows are updated, the candidate was taken by a concurrent placement; the engine retries with the next candidate.
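The candidate-retry loop around that atomic UPDATE can be sketched as follows; the `reserve` callback stands in for the prepared D1 statement and returns true when exactly one row was updated:

```typescript
// Try candidates in placement order until one atomic reservation succeeds.
// `reserve` abstracts the D1 UPDATE ... WHERE state='available' RETURNING.
function reserveFirstAvailable(
  candidates: string[],                      // pool_site_ids, best first
  reserve: (poolSiteId: string) => boolean,  // true iff 1 row updated
): string | null {
  for (const id of candidates) {
    if (reserve(id)) return id; // won the race on this candidate
    // 0 rows updated: taken by a concurrent placement — try the next one
  }
  return null; // no candidate left: pool exhausted for this placement
}
```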
10.4 Reservation TTL expiry
A scheduled sweep every 60s reverts reserved rows whose reserved_until < now() back to available, unless the tenant has progressed beyond provisioning (in which case the row is moved to failed for manual triage — likely a bug).
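The sweep's per-row decision reduces to a small function — a sketch, with the tenant-progress check assumed to be supplied as a boolean by the caller:

```typescript
// Per-row decision of the 60s reservation sweep: expired reservations
// normally return to the pool; if the tenant somehow advanced beyond
// provisioning, park the row as failed for manual triage (likely a bug).
function sweepExpiredReservation(
  reservedUntil: Date,
  now: Date,
  tenantPastProvisioning: boolean,
): "keep" | "available" | "failed" {
  if (reservedUntil >= now) return "keep"; // not expired yet
  return tenantPastProvisioning ? "failed" : "available";
}
```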
11. Operator interface
Surfaced via existing CP admin SPA (/cp/console) and internal/* APIs:
- `GET /internal/site-pool?app_server_id=...` — current pool snapshot
- `POST /internal/site-pool/refill` — manual refill trigger
- `POST /internal/site-pool/:id/drain` — mark for cleanup
- `POST /internal/site-pool/:id/reset` — force `failed` → re-create
Audit: every operator action writes a row to tenant_lifecycle_events with actor='operator' (even though no tenant is involved, the table reuses the same event log).
12. References
- Frappe SaaS Multitenant Docker Standard (ADR-002)
- Control Plane Direct Provisioning (ADR-003) — site pool promotion (`bench --site pool-XXX rename-site`) is dispatched via the Agent inbound `/agent/v1/exec` with `kind=bench` (no Ansible)
- Tenant Placement Policy
- Tenant Lifecycle
- `prego-control-plane/src/orchestration/workflow-engine.ts`
- `prego-control-plane/migrations/0044_hybrid_multisite.sql` — `scaling_events` / `scaling_config`
- Hybrid Multi-site Operations Runbook — Site Pool ops section (extended by ADR-002)