
Site Pool Strategy

Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002)
Related: Tenant Placement Policy, Tenant Lifecycle

1. Problem

bench new-site is slow:

  • Empty site creation: 30-90 seconds
  • With ERPNext install + sample CoA: 2-5 minutes
  • Concurrent provisioning under load: 5-10 minutes (DB / disk contention)

For a SaaS funnel where the user expects to land in their workspace within 60 seconds, on-demand site creation is unacceptable above a rate of a few tenants per hour.

2. Solution

Maintain a pool of pre-created, empty Frappe sites on each app server. On tenant assignment, promote a pool site to the tenant by:

  1. Renaming the site directory (pool-001 → acme.pregoi.com)
  2. Updating site_config.json with the new site name + DB binding
  3. Running bench install-app for tenant-selected apps
  4. Optionally importing customer CoA / templates

End-to-end promotion target: ≤ 5 seconds (excluding optional app installs).


3. Data model

3.1 site_pool table (new)

CREATE TABLE site_pool (
  pool_site_id            TEXT PRIMARY KEY,
  app_server_id           TEXT NOT NULL REFERENCES servers(server_id),
  region                  TEXT NOT NULL,
  pool_site_name          TEXT NOT NULL,  -- e.g. pool-sgp-001-042
  database_name           TEXT NOT NULL,  -- pre-created tenant_pool_xxx_db
  db_shard_id             TEXT NOT NULL REFERENCES db_shards(shard_id),
  state                   TEXT NOT NULL CHECK (state IN ('creating','available','reserved','assigned','failed','draining')),
  reserved_for_tenant_id  TEXT,
  reserved_at             TEXT,
  reserved_until          TEXT,           -- TTL for stuck reservations
  assigned_tenant_id      TEXT,
  assigned_at             TEXT,
  assigned_site_name      TEXT,           -- post-promotion site name
  failure_reason          TEXT,
  created_at              TEXT NOT NULL DEFAULT (datetime('now')),
  updated_at              TEXT NOT NULL DEFAULT (datetime('now')),
  UNIQUE (app_server_id, pool_site_name)
);
CREATE INDEX idx_site_pool_state_app ON site_pool(state, app_server_id);
CREATE INDEX idx_site_pool_reserved_until ON site_pool(reserved_until) WHERE state = 'reserved';

The DB shard is pinned at creation time because a pool site has a real MariaDB database already attached.
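
For reference, a TypeScript shape of a site_pool row as the control plane would read it from D1. Field names mirror the DDL above; the type names themselves are illustrative.

// Illustrative types mirroring the site_pool DDL above.
type PoolSiteState =
  'creating' | 'available' | 'reserved' | 'assigned' | 'failed' | 'draining';

interface SitePoolRow {
  pool_site_id: string;
  app_server_id: string;
  region: string;
  pool_site_name: string;            // e.g. pool-sgp-001-042
  database_name: string;             // pre-created tenant_pool_xxx_db
  db_shard_id: string;
  state: PoolSiteState;
  reserved_for_tenant_id: string | null;
  reserved_at: string | null;        // TEXT timestamps, as stored by datetime('now')
  reserved_until: string | null;
  assigned_tenant_id: string | null;
  assigned_at: string | null;
  assigned_site_name: string | null;
  failure_reason: string | null;
  created_at: string;
  updated_at: string;
}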

3.2 Pool sizing parameters

Per-region configuration in scaling_config (existing — see migrations/0044_hybrid_multisite.sql):

| Key | Default | Meaning |
| --- | --- | --- |
| site_pool_target_per_app_server | 5 | Desired available count per app server |
| site_pool_min_available_global | 20 | Region-wide floor that triggers urgent refill |
| site_pool_max_per_app_server | 15 | Hard cap (cost control) |
| site_pool_reservation_ttl_seconds | 600 | Stuck reservation reaped after this |
| site_pool_refill_concurrency | 3 | Parallel bench new-site per app server |
| site_pool_refill_priority_threshold | 2 | When available < this, refill becomes high-priority |
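
A sketch of how the warmer might materialize these keys, with the defaults above as fallbacks. It assumes scaling_config exposes (region, key, value) rows; the loader name is illustrative.

// Defaults from the table above; scaling_config rows override them per region.
const POOL_DEFAULTS = {
  site_pool_target_per_app_server: 5,
  site_pool_min_available_global: 20,
  site_pool_max_per_app_server: 15,
  site_pool_reservation_ttl_seconds: 600,
  site_pool_refill_concurrency: 3,
  site_pool_refill_priority_threshold: 2,
} as const;

type PoolConfig = { -readonly [K in keyof typeof POOL_DEFAULTS]: number };

// Illustrative loader; assumes scaling_config(region, key, value) in D1.
async function loadPoolConfig(db: D1Database, region: string): Promise<PoolConfig> {
  const cfg: PoolConfig = { ...POOL_DEFAULTS };
  const { results } = await db
    .prepare('SELECT key, value FROM scaling_config WHERE region = ?')
    .bind(region)
    .all<{ key: string; value: string }>();
  for (const row of results) {
    if (row.key in cfg) cfg[row.key as keyof PoolConfig] = Number(row.value);
  }
  return cfg;
}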

4. State machine

stateDiagram-v2
    [*] --> creating: warmer creates new pool site
    creating --> available: bench new-site succeeds
    creating --> failed: bench new-site fails
    available --> reserved: PlacementEngine reserves
    reserved --> assigned: provisioning workflow promotes
    reserved --> available: TTL expired or workflow rolled back
    reserved --> failed: promotion error
    assigned --> [*]: tenant terminated and site dropped
    failed --> draining: operator marks for cleanup
    draining --> [*]: cleanup complete
    available --> draining: app server scale-in

4.1 State transitions

| From | To | Trigger | Side effect |
| --- | --- | --- | --- |
| (none) | creating | Warmer cron decides to refill | Insert row, state creating |
| creating | available | bench new-site exit 0 + smoke check | Set available, log refill event |
| creating | failed | bench new-site non-zero exit | Capture stderr to failure_reason, alert if rate exceeds threshold |
| available | reserved | PlacementEngine selects pool site | Set reserved_for_tenant_id, reserved_until = now + TTL |
| reserved | assigned | Promotion workflow finishes (rename + config + first login OK) | Set assigned_* fields, update tenants_master.site_name |
| reserved | available | TTL expired (workflow stalled) | Clear reservation; insert tenant_lifecycle_events row with reason pool_reservation_ttl |
| reserved | failed | Promotion rename / config errored | Capture error, alert |
| assigned | (deleted) | Tenant fully purged | Drop site + DB; row hard-deleted |
| available | draining | App server scale-in | Stop offering pool sites; let warmer drain |
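
A minimal guard that encodes the table above as an adjacency map, so workflow code can reject an illegal transition before touching D1. Sketch only; it reuses the PoolSiteState union from the §3.1 sketch, and the function name is illustrative.

// Legal transitions from §4.1; deletion is modeled as dropping the row.
const LEGAL_TRANSITIONS: Record<PoolSiteState, PoolSiteState[]> = {
  creating:  ['available', 'failed'],
  available: ['reserved', 'draining'],
  reserved:  ['assigned', 'available', 'failed'],
  assigned:  [],   // terminal: row is hard-deleted when the tenant is purged
  failed:    ['draining'],
  draining:  [],   // terminal: cleanup drops the row
};

function assertTransition(from: PoolSiteState, to: PoolSiteState): void {
  if (!LEGAL_TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal site_pool transition: ${from} -> ${to}`);
  }
}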

5. Warming SLO

The warmer cron runs every 60 seconds per region (Cloudflare Worker scheduled handler).

5.1 Refill decision

For each (app_server_id, region):

available_count = SELECT COUNT(*) FROM site_pool
                  WHERE app_server_id = $1 AND state = 'available'

if available_count < site_pool_refill_priority_threshold:
    deficit = site_pool_target_per_app_server - available_count
    enqueue HIGH-priority refill job (deficit, app_server_id)
elif available_count < site_pool_target_per_app_server:
    deficit = site_pool_target_per_app_server - available_count
    enqueue NORMAL-priority refill job (deficit, app_server_id)
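
The same decision, sketched as the body of the Worker's scheduled handler. loadPoolConfig is the sketch from §3.2; enqueueRefill is a hypothetical helper that inserts a site_pool_refill workflow job.

// Sketch: one warmer tick for a region (per §5.1).
// enqueueRefill is hypothetical; it would insert a site_pool_refill job.
declare function enqueueRefill(
  db: D1Database,
  job: { app_server_id: string; deficit: number; priority: 'high' | 'normal' }
): Promise<void>;

async function warmerTick(db: D1Database, region: string): Promise<void> {
  const cfg = await loadPoolConfig(db, region);
  const { results } = await db
    .prepare(
      `SELECT app_server_id, COUNT(*) AS available_count
         FROM site_pool
        WHERE region = ? AND state = 'available'
        GROUP BY app_server_id`
    )
    .bind(region)
    .all<{ app_server_id: string; available_count: number }>();

  for (const { app_server_id, available_count } of results) {
    if (available_count >= cfg.site_pool_target_per_app_server) continue;
    const deficit = cfg.site_pool_target_per_app_server - available_count;
    const priority =
      available_count < cfg.site_pool_refill_priority_threshold ? 'high' : 'normal';
    await enqueueRefill(db, { app_server_id, deficit, priority });
  }
}

Note that a server with zero available rows never appears in this GROUP BY; production code would drive the loop from the servers table instead, and would also respect site_pool_max_per_app_server before enqueueing.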

5.2 SLO targets

| SLO | Target | Measurement |
| --- | --- | --- |
| Available sites per app server | ≥ site_pool_target_per_app_server 99% of the time | 1-minute snapshots |
| Pool exhaustion (zero available for ≥ 5 min) | ≤ 1 event/region/week | Alert page |
| Promotion latency (reserved → assigned) | p95 ≤ 5 s | Per-tenant timing |
| failed ratio | ≤ 1% of creating attempts | Daily aggregate |

Breaches page on-call (see prego-control-plane/docs/runbook/hybrid-multisite-operations.md for response).


6. Refill execution

The refill job runs inside the existing provisioning workflow (src/orchestration/workflow-engine.ts) with a dedicated workflow type site_pool_refill.

6.1 Steps per refill

flowchart TB
    pick[Select target app server and DB shard]
    pick --> name[Generate pool_site_name]
    name --> insert[Insert site_pool row state=creating]
    insert --> docker[Docker exec bench new-site --no-setup-db]
    docker --> dbcreate[Create empty MariaDB database on chosen shard]
    dbcreate --> bind[bench setup db-only]
    bind --> smoke[bench --site doctor smoke check]
    smoke --> done[Update state=available]
    docker -.->|error| failrow[Update state=failed]
    smoke -.->|error| failrow

6.2 Resource controls

  • site_pool_refill_concurrency limits parallel bench new-site per app server (avoid I/O collapse)
  • Database creation is decoupled from bench new-site (hence --no-setup-db) so the bench step stays short
  • Failed refills back off with exponential delay (30s → 60s → 120s → 5m) and alert after 3 consecutive failures on the same (app_server, db_shard) pair
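
The backoff schedule from the last bullet, as a small helper (sketch; the alert hook name is illustrative):

// Backoff schedule per §6.2: 30s -> 60s -> 120s -> 5m (then stays at 5m).
const REFILL_BACKOFF_MS = [30_000, 60_000, 120_000, 300_000];

function refillBackoffMs(consecutiveFailures: number): number {
  const idx = Math.min(Math.max(consecutiveFailures - 1, 0), REFILL_BACKOFF_MS.length - 1);
  return REFILL_BACKOFF_MS[idx];
}

// Alert after 3 consecutive failures on the same (app_server, db_shard) pair.
function shouldAlertOnRefillFailure(consecutiveFailures: number): boolean {
  return consecutiveFailures >= 3;
}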

7. Promotion (reserve → assigned)

Triggered by the tenant provisioning workflow once PlacementEngine reserves a pool site.

7.1 Steps

  1. Verify reservation still owned by current job (reserved_for_tenant_id == job.tenant_id).
  2. bench --site pool-XXX rename-site --new-name acme.pregoi.com (or equivalent rename).
  3. Update site_config.json (DB host, redis URLs, custom site flags).
  4. Optionally bench --site acme.pregoi.com install-app prego_saas if the tenant requires apps beyond the pre-installed set.
  5. Health probe (bench --site acme.pregoi.com doctor).
  6. Set state='assigned', assigned_* fields.
  7. Update tenants_master.site_name, tenant_runtime.origin_url.
  8. Insert tenant_lifecycle_events row (provisioning → active).
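
A condensed sketch of steps 1, 6 and 7. The shell-level steps 2-5 are hidden behind a hypothetical execOnAppServer helper (the exact rename command depends on the bench version), and the tenant_runtime.origin_url update plus the step-8 lifecycle event are elided.

// Sketch of promotion steps 1, 6 and 7 (steps 2-5 hidden behind a helper).
// execOnAppServer is hypothetical: it runs a command in the bench container
// via the server Agent and throws on non-zero exit.
declare function execOnAppServer(serverId: string, cmd: string[]): Promise<void>;

async function promotePoolSite(
  db: D1Database,
  poolSiteId: string,
  tenantId: string,
  newSiteName: string
): Promise<void> {
  // Step 1: verify this job still owns the reservation.
  const row = await db
    .prepare(
      `SELECT pool_site_name, app_server_id FROM site_pool
        WHERE pool_site_id = ? AND state = 'reserved'
          AND reserved_for_tenant_id = ?`
    )
    .bind(poolSiteId, tenantId)
    .first<{ pool_site_name: string; app_server_id: string }>();
  if (!row) throw new Error('reservation lost or expired');

  // Steps 2-5: rename, rewrite site_config.json, optional install-app, doctor.
  await execOnAppServer(row.app_server_id, ['promote-site', row.pool_site_name, newSiteName]);

  // Steps 6-7: mark the row assigned and point the tenant at the new site.
  await db.batch([
    db.prepare(
      `UPDATE site_pool
          SET state = 'assigned', assigned_tenant_id = ?, assigned_site_name = ?,
              assigned_at = datetime('now'), updated_at = datetime('now')
        WHERE pool_site_id = ?`
    ).bind(tenantId, newSiteName, poolSiteId),
    db.prepare(`UPDATE tenants_master SET site_name = ? WHERE tenant_id = ?`)
      .bind(newSiteName, tenantId),
  ]);
}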

7.2 Rollback

If any step fails:

  • Revert site rename if performed
  • Reset row to available if database / file-system is untouched
  • Reset row to failed and trigger refill if state is unrecoverable
  • Compensate tenant_allocations counters

8. Pre-installed app strategy

Pool sites are created with a base app set to minimize promotion latency:

| Plan | Pre-installed apps |
| --- | --- |
| trial, starter | frappe, erpnext, prego_saas |
| business | frappe, erpnext, prego_saas, hrms, crm |
| enterprise | Provisioned on-demand (no pool used; bench new-site directly) |

Enterprise uses a dedicated server (per Plan Isolation Matrix §5) and skips the pool. The longer provisioning time is acceptable for Enterprise contracts.

The pool is partitioned by app set: site_pool.pool_site_name encodes the app set (e.g. pool-sgp-001-bus-042 for the Business set) so the warmer maintains separate counts per set.
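
Since per-set counting relies on the name encoding, a small parser helps. The segment layout pool-<region>-<server>[-<set>]-<seq> is inferred from the two examples above and should be treated as an assumption.

// Parses pool-sgp-001-042 (base set) and pool-sgp-001-bus-042 (Business set).
// The segment layout is inferred from the examples above; adjust as needed.
function parsePoolSiteName(
  name: string
): { region: string; server: string; appSet: string; seq: string } | null {
  const m = /^pool-([a-z]+)-(\d+)(?:-([a-z]+))?-(\d+)$/.exec(name);
  if (!m) return null;
  return { region: m[1], server: m[2], appSet: m[3] ?? 'base', seq: m[4] };
}

// parsePoolSiteName('pool-sgp-001-bus-042')
// => { region: 'sgp', server: '001', appSet: 'bus', seq: '042' }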


9. Cost / disk footprint

A pool site occupies:

  • One MariaDB database (~30 MB empty Frappe schema)
  • One sites/<pool>/private/ directory (~5 MB)
  • One sites/<pool>/site_config.json

Per app server with target=5 and pre-installed Business set: ~250 MB disk + ~150 MB MariaDB. Negligible at the per-host scale; relevant when planning shard capacity.

Cost monitoring: a dedicated metric site_pool_disk_bytes is pushed by the Agent and aggregated regionally.


10. Edge cases

10.1 App server scale-in with pooled sites

When an app server is marked draining:

  • Stop the warmer from refilling that server
  • Existing available rows transition to draining
  • Existing reserved rows are honored to completion (then drained)
  • After all reservations close, drop pool sites + DBs
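
The available → draining flip, as the D1 statement the scale-in handler might issue (sketch; reserved rows are deliberately left untouched, per the list above):

// Sketch: flip a draining server's available pool sites to 'draining'.
// Reserved rows are left to finish their promotions (per §10.1).
async function drainAvailablePoolSites(db: D1Database, appServerId: string): Promise<number> {
  const res = await db
    .prepare(
      `UPDATE site_pool
          SET state = 'draining', updated_at = datetime('now')
        WHERE app_server_id = ? AND state = 'available'`
    )
    .bind(appServerId)
    .run();
  return res.meta.changes; // number of pool sites moved to draining
}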

10.2 DB shard exhaustion mid-promotion

If the pinned DB shard runs out of capacity between reservation and promotion (rare; pool sites already have DBs), promotion still succeeds — the DB was created at pool-site creation time. Subsequent new pool sites will be pinned to a different shard.

10.3 Concurrent reservation race

Reservation uses an atomic D1 update:

UPDATE site_pool
   SET state = 'reserved',
       reserved_for_tenant_id = ?,
       reserved_at = datetime('now'),
       reserved_until = datetime('now', '+10 minutes')
 WHERE pool_site_id = ? AND state = 'available'
RETURNING pool_site_id;

If 0 rows updated → the candidate was taken by a concurrent placement; the engine retries with the next candidate.
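
A sketch of the retry loop around that statement (candidate ordering is PlacementEngine's concern; it is taken here as a pre-sorted list):

// Try candidates until one atomic reservation sticks (per the UPDATE above).
async function reservePoolSite(
  db: D1Database,
  tenantId: string,
  candidateIds: string[]
): Promise<string | null> {
  for (const poolSiteId of candidateIds) {
    const row = await db
      .prepare(
        `UPDATE site_pool
            SET state = 'reserved',
                reserved_for_tenant_id = ?,
                reserved_at = datetime('now'),
                reserved_until = datetime('now', '+10 minutes')
          WHERE pool_site_id = ? AND state = 'available'
          RETURNING pool_site_id`
      )
      .bind(tenantId, poolSiteId)
      .first<{ pool_site_id: string }>();
    if (row) return row.pool_site_id;   // won the race
    // null: a concurrent placement took this candidate; try the next one.
  }
  return null; // pool exhausted for these candidates
}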

10.4 Reservation TTL expiry

A scheduled sweep every 60s reverts reserved rows whose reserved_until < now() back to available, unless the tenant has progressed beyond provisioning (in which case the row is moved to failed for manual triage — likely a bug).
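
The sweep's core statements might look as follows. The "progressed beyond provisioning" guard is sketched against an assumed tenants_master.status column, and the tenant_lifecycle_events insert is omitted.

// Sketch of the 60s reservation sweep (per §10.4).
// tenants_master.status = 'provisioning' is an assumed column/value;
// the tenant_lifecycle_events insert is omitted for brevity.
async function sweepExpiredReservations(db: D1Database): Promise<void> {
  // Expired reservations whose tenant never left provisioning: back to available.
  await db
    .prepare(
      `UPDATE site_pool
          SET state = 'available',
              reserved_for_tenant_id = NULL, reserved_at = NULL,
              reserved_until = NULL, updated_at = datetime('now')
        WHERE state = 'reserved' AND reserved_until < datetime('now')
          AND reserved_for_tenant_id NOT IN (
                SELECT tenant_id FROM tenants_master
                 WHERE status != 'provisioning')`
    )
    .run();
  // Anything still expired had a tenant that progressed: likely a bug, triage.
  await db
    .prepare(
      `UPDATE site_pool
          SET state = 'failed', failure_reason = 'pool_reservation_ttl',
              updated_at = datetime('now')
        WHERE state = 'reserved' AND reserved_until < datetime('now')`
    )
    .run();
}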


11. Operator interface

Surfaced via existing CP admin SPA (/cp/console) and internal/* APIs:

  • GET /internal/site-pool?app_server_id=... — current pool snapshot
  • POST /internal/site-pool/refill — manual refill trigger
  • POST /internal/site-pool/:id/drain — mark for cleanup
  • POST /internal/site-pool/:id/reset — force failed → re-create

Audit: every operator action writes a row to tenant_lifecycle_events with actor='operator' (no tenant need be involved; the table doubles as the general event log).


12. References
