Tenant Lifecycle
Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002)

Related: Tenant Placement Policy, Site Pool Strategy
This document defines the canonical 8-state lifecycle for every Prego tenant and the migration plan from the current 4-state model in prego-control-plane/src/types.ts ('Pending' | 'Active' | 'Pending_Deletion' | 'Deleted').
1. State machine
```mermaid
stateDiagram-v2
    [*] --> trial: signup before payment
    [*] --> provisioning: paid checkout webhook
    trial --> provisioning: upgrade to paid
    trial --> terminated: trial expiry no conversion
    provisioning --> active: workflow success
    provisioning --> failed: workflow non-recoverable error
    active --> suspended: admin or billing failure
    suspended --> active: dispute resolved
    suspended --> grace_period: dunning final notice
    active --> grace_period: voluntary cancel
    grace_period --> active: reactivation within window
    grace_period --> terminated: grace expired
    failed --> provisioning: operator retry
    failed --> terminated: operator gives up
    terminated --> data_purged: retention window expired
    data_purged --> [*]
```
2. State definitions
| State | Meaning | DB shape | Site / DB exists? | Access |
|---|---|---|---|---|
trial | Pre-payment trial; tenant_id allocated, no infra yet (or trial-shard infra) | tenants_master row | Optional (trial-only shard or none) | Read-write trial scope |
provisioning | Workflow running (placement / pool / DNS) | tenants_master + provision_jobs running | Being created or rented from pool | None until active |
active | Production tenant | All tables populated | Yes | Full |
suspended | Temporarily blocked (billing failure, abuse, manual) | All tables populated | Yes (sites disabled) | None (HTTP 402/403) |
grace_period | Cancellation requested but reversible window open | All tables populated | Yes (read-only banner) | Read-only or full per policy |
terminated | Final cancellation, infra still present pending purge | tenants_master.status='terminated', tenant_runtime retained | Yes (offline) | None |
data_purged | All tenant data destroyed (DBs dropped, R2 archived) | tenants_master row only (audit) | No | None |
failed | Provisioning workflow failed unrecoverably | tenants_master + provision_jobs.status='Failed' | Partial / inconsistent | None |
3. Transitions
| Transition | Trigger | Actor | Required steps | Side effects |
|---|---|---|---|---|
(none) → trial | Trial signup form submitted | www → CP POST /v1/trial/start | Validate email, create tenants_master row, send OTP | Funnel event, no infra |
(none) → provisioning | Stripe Checkout webhook fires | CP webhook handler | Enqueue prego-provision-queue message | Funnel event |
trial → provisioning | Trial → paid upgrade | CP POST /v1/tenants/:id/convert | Enqueue provision with from_state='trial' | Stripe subscription created |
provisioning → active | Workflow completes all steps | WorkflowEngine | tenant_runtime populated, KV TENANT_ORIGINS updated, DNS live | Onboarding email, funnel provision_completed |
provisioning → failed | Workflow exhausted retries | WorkflowEngine | Persist error, leave partial infra for triage | Page on-call if rate exceeds threshold |
failed → provisioning | Operator retry | Admin SPA /cp/console | Reset provision_jobs.status='Pending', re-enqueue | Audit log |
failed → terminated | Operator abandons | Admin SPA | Mark terminated, schedule purge | Audit log |
active → suspended | Stripe invoice.payment_failed (after dunning) or abuse flag or manual | CP webhook / admin | Disable Frappe site (bench --site X disable-website or nginx 402), keep data | Email customer, funnel tenant_suspended |
suspended → active | Payment recovered or operator clears | CP webhook / admin | Re-enable site | Email customer |
suspended → grace_period | Final dunning notice sent | CP scheduled | Mark grace_period_end timestamp | Email customer |
active → grace_period | Customer requests cancel | Admin SPA / portal | Set grace_period_end = now + 30 days (plan-dependent) | Backup snapshot, email confirmation |
grace_period → active | Customer reactivates | Admin SPA / portal | Clear grace_period_end | Stripe subscription resumed |
grace_period → terminated | Grace expired | CP scheduled | Stop site, retain backup | Email customer, funnel tenant_terminated |
terminated → data_purged | Retention window elapsed (e.g. 90 days) | CP scheduled | Drop DBs, delete site directory, archive R2, remove tenant_runtime | Email customer (final), funnel tenant_data_purged |
Every transition writes a row to tenant_lifecycle_events:
`tenant_id`, `from_state`, `to_state`, `actor`, `reason`, `at`, `workflow_id`, `evidence_url`
4. Side effects per transition
4.1 DNS
- provisioning → active: create the `<subdomain>.pregoi.com` A/CNAME record via the Cloudflare DNS API (`src/clients/cloudflare/`)
- active → suspended: leave DNS in place; the app returns 402
- suspended → grace_period: no DNS change
- grace_period → terminated: optionally redirect to a “service ended” landing page
- terminated → data_purged: delete the DNS record and free the subdomain
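For the create and delete steps, Cloudflare's public v4 DNS records API takes a `POST` to `/zones/{zone_id}/dns_records` and a `DELETE` on `/zones/{zone_id}/dns_records/{record_id}`. The sketch below only builds those requests so it can be tested without network access. It is not the repo's `src/clients/cloudflare/` client; the helper names, the CNAME record type and `proxied: true` are assumptions.

```typescript
// Hypothetical request builders for the §4.1 DNS side effects.
// Send with fetch(spec.url, { method: spec.method, headers: spec.headers, body: spec.body }).
const CF_API = 'https://api.cloudflare.com/client/v4';

interface HttpRequestSpec {
  method: 'POST' | 'DELETE';
  url: string;
  headers: Record<string, string>;
  body?: string;
}

/** provisioning -> active: point <subdomain>.pregoi.com at the tenant origin (CNAME assumed). */
function createTenantRecord(
  zoneId: string,
  apiToken: string,
  subdomain: string,
  originHost: string,
): HttpRequestSpec {
  return {
    method: 'POST',
    url: `${CF_API}/zones/${zoneId}/dns_records`,
    headers: { Authorization: `Bearer ${apiToken}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      type: 'CNAME',
      name: `${subdomain}.pregoi.com`,
      content: originHost,
      ttl: 1, // 1 means "automatic" in the Cloudflare API
      proxied: true, // assumption: traffic goes through Cloudflare's proxy
    }),
  };
}

/** terminated -> data_purged: delete the record; its id comes from the create response. */
function deleteTenantRecord(zoneId: string, apiToken: string, recordId: string): HttpRequestSpec {
  return {
    method: 'DELETE',
    url: `${CF_API}/zones/${zoneId}/dns_records/${recordId}`,
    headers: { Authorization: `Bearer ${apiToken}` },
  };
}
```

A real client would persist the record id returned by the create call (for example alongside `tenant_runtime`) so that the purge step can delete it.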
4.2 Billing (Stripe)
- trial → provisioning: create the Stripe subscription
- active → suspended: pause the subscription (or rely on Stripe's failure state)
- grace_period → terminated: cancel the subscription and issue a final invoice
- terminated → data_purged: no Stripe change (subscription already cancelled)
4.3 Backup
- provisioning → active: schedule the first backup per plan tier (see ADR-002 §13)
- active → grace_period: take a terminal snapshot (full backup to R2)
- grace_period → terminated: take a final snapshot and store it under `R2://prego-backups/terminated/<tenant>/`
- terminated → data_purged: archived snapshots are retained per legal/compliance policy (default 7 years for Enterprise, 1 year for Starter, 90 days for Trial)
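The windows in §3 and §4.3 reduce to simple date arithmetic. A minimal sketch, assuming the documented defaults (30-day grace, a 90-day purge window, per-tier archive retention) and assuming archive retention is counted from the purge date, which this document does not state. The function names and the `PlanTier` type are hypothetical.

```typescript
// Hypothetical deadline helpers; all arithmetic is in UTC milliseconds.
type PlanTier = 'Trial' | 'Starter' | 'Enterprise'; // Business retention is not stated in this document

const DAY_MS = 24 * 60 * 60 * 1000;

function addDays(from: Date, days: number): Date {
  return new Date(from.getTime() + days * DAY_MS);
}

/** active -> grace_period: grace_period_end = now + 30 days by default (plan-dependent). */
function gracePeriodEnd(cancelledAt: Date, graceDays = 30): Date {
  return addDays(cancelledAt, graceDays);
}

/** terminated -> data_purged becomes due after the retention window ("e.g. 90 days"). */
function purgeDueAt(terminatedAt: Date, retentionDays = 90): Date {
  return addDays(terminatedAt, retentionDays);
}

/** Archive expiry per the §4.3 defaults, assumed to be counted from the purge date. */
function archiveExpiresAt(purgedAt: Date, tier: PlanTier): Date {
  if (tier === 'Trial') return addDays(purgedAt, 90);
  const years = tier === 'Enterprise' ? 7 : 1;
  const d = new Date(purgedAt.getTime());
  d.setUTCFullYear(d.getUTCFullYear() + years);
  return d;
}
```

Year-based retention uses calendar years rather than a fixed day count, so a 7-year window ends on the same calendar date regardless of leap years.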
4.4 KV / Routing
- provisioning → active: write `TENANT_ORIGINS[<host>] = origin_url`
- active → suspended: leave the KV entry; the gateway returns 402 based on `tenants_master.status`
- terminated → data_purged: delete the KV entry
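The gateway decision described above can be sketched as a pure function. In-memory maps stand in for the `TENANT_ORIGINS` KV namespace and for the `tenants_master.status` lookup; both stand-ins, `routeRequest`, and the 404 for a missing KV entry are assumptions for illustration.

```typescript
// Hypothetical gateway routing sketch for §4.4.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';

type RouteDecision =
  | { kind: 'proxy'; origin: string }
  | { kind: 'error'; httpStatus: number };

function routeRequest(
  host: string,
  tenantOrigins: ReadonlyMap<string, string>, // stand-in for TENANT_ORIGINS[<host>] = origin_url
  statusByHost: ReadonlyMap<string, TenantStatus>, // stand-in for tenants_master.status
): RouteDecision {
  const origin = tenantOrigins.get(host);
  if (origin === undefined) {
    // No KV entry: never provisioned, or removed at terminated -> data_purged.
    return { kind: 'error', httpStatus: 404 };
  }
  if (statusByHost.get(host) === 'suspended') {
    // The KV entry is kept while suspended; the status check alone blocks traffic.
    return { kind: 'error', httpStatus: 402 };
  }
  // Other states pass through; per-state policies from §2 (for example the
  // read-only banner in grace_period) are out of scope for this sketch.
  return { kind: 'proxy', origin };
}
```

Keying the 402 off status rather than deleting the KV entry means `suspended → active` needs no KV write, which matches the "leave KV" rule above.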
5. Migration mapping (4-state → 8-state)
The current code in prego-control-plane/src/types.ts defines:
```typescript
export type TenantStatus = 'Pending' | 'Active' | 'Pending_Deletion' | 'Deleted';
```
Backfill mapping (Phase 5; see ADR-002 §17):
| Current | New | Notes |
|---|---|---|
Pending | provisioning | Direct semantic match |
Active | active | Lower-case rename |
Pending_Deletion | grace_period | Naming aligned with billing semantics |
Deleted | data_purged | Distinct from terminated (which is “final cancel, data still present”) |
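The mapping table translates directly into typed data, which lets the compiler flag a legacy value that has no target. A sketch; `LegacyTenantStatus`, `TenantStatusV2`, `LEGACY_TO_CANONICAL` and `toCanonicalStatus` are hypothetical names, not the current contents of `src/types.ts`.

```typescript
// Hypothetical typed form of the 4-state -> 8-state backfill mapping.
type LegacyTenantStatus = 'Pending' | 'Active' | 'Pending_Deletion' | 'Deleted';

type TenantStatusV2 =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';

// Record<...> makes a missing legacy key a compile-time error.
const LEGACY_TO_CANONICAL: Record<LegacyTenantStatus, TenantStatusV2> = {
  Pending: 'provisioning',
  Active: 'active',
  Pending_Deletion: 'grace_period',
  Deleted: 'data_purged',
};

function toCanonicalStatus(status: LegacyTenantStatus | TenantStatusV2): TenantStatusV2 {
  // While the CHECK allows both forms, a row may already hold a canonical value.
  return Object.prototype.hasOwnProperty.call(LEGACY_TO_CANONICAL, status)
    ? LEGACY_TO_CANONICAL[status as LegacyTenantStatus]
    : (status as TenantStatusV2);
}
```

Passing canonical values through unchanged makes the function safe to apply twice, which suits a backfill that may be re-run.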
New states without backfill (created by future flows):
- `trial`: created by the `www` trial signup flow before payment
- `suspended`: created by Stripe dunning or an abuse flag
- `terminated`: a distinct interstitial state between grace and purge
- `failed`: distinct from `Pending` (`failed` requires operator action; `Pending` is in flight)
5.1 Phase 5 migration plan
- Add new D1 columns / a CHECK constraint allowing all 8 states (during the transition the CHECK accepts both legacy and new values).
- Backfill: `UPDATE tenants_master SET status = MAP(status)` per the table above.
- Update all readers (`src/types.ts`, `src/jobs.ts`, Admin SPA) to accept the new states.
- Drop legacy values from the CHECK constraint after the readers ship.
- Update the Zuplo OpenAPI specs in `prego-zuplo` (`config/*.oas.json`).
- Sync Admin SPA strings (`/cp/console` lifecycle widget).
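The `MAP(status)` step is pseudo-SQL. One way to make it concrete for D1, which runs SQLite, is a `CASE` expression generated from the mapping table, so the table and the migration cannot drift apart. A sketch; `buildBackfillSql`, `buildTransitionalCheck` and the constants are hypothetical. Where the CHECK clause lands (a new column or a rebuilt table) is left open, since SQLite cannot alter a constraint on an existing column in place.

```typescript
// Hypothetical SQL generators for the Phase 5 migration.
// Values are fixed constants, not user input, so string interpolation is safe here.
const BACKFILL_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ['Pending', 'provisioning'],
  ['Active', 'active'],
  ['Pending_Deletion', 'grace_period'],
  ['Deleted', 'data_purged'],
];

const CANONICAL_STATES: readonly string[] = [
  'trial', 'provisioning', 'active', 'suspended',
  'grace_period', 'terminated', 'data_purged', 'failed',
];

/** UPDATE that rewrites only legacy rows; canonical rows are left untouched. */
function buildBackfillSql(): string {
  const whens = BACKFILL_PAIRS.map(([from, to]) => `  WHEN '${from}' THEN '${to}'`).join('\n');
  const legacy = BACKFILL_PAIRS.map(([from]) => `'${from}'`).join(', ');
  return [
    'UPDATE tenants_master SET status = CASE status',
    whens,
    'END',
    `WHERE status IN (${legacy});`,
  ].join('\n');
}

/** Transitional CHECK accepting both forms (SQLite compares strings case-sensitively by default). */
function buildTransitionalCheck(): string {
  const all = [...BACKFILL_PAIRS.map(([from]) => from), ...CANONICAL_STATES];
  return `CHECK (status IN (${all.map((s) => `'${s}'`).join(', ')}))`;
}
```

Because the `WHERE` clause restricts the update to legacy values, re-running the backfill is a no-op once every row is canonical.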
5.2 Backwards compatibility window
For one quarter after Phase 5 ships, the API exposes both representations:
```jsonc
{
  "status": "active",        // canonical
  "_legacy_status": "Active" // until cohort N+1 SDKs ship
}
```
This follows the same pattern as `_legacy_job_id` / `_legacy_tenant_id`, already used for the ID format migration.
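A minimal sketch of how a response serializer might emit both fields during the window. It reverses the §5 mapping and, as an assumption this document does not settle, omits `_legacy_status` for the four states with no legacy equivalent (`trial`, `suspended`, `terminated`, `failed`). `serializeStatus` and `CANONICAL_TO_LEGACY` are hypothetical names.

```typescript
// Hypothetical dual-status serializer for the compatibility window.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';
type LegacyTenantStatus = 'Pending' | 'Active' | 'Pending_Deletion' | 'Deleted';

// Reverse of the §5 backfill table; states without a legacy value are absent.
const CANONICAL_TO_LEGACY: Partial<Record<TenantStatus, LegacyTenantStatus>> = {
  provisioning: 'Pending',
  active: 'Active',
  grace_period: 'Pending_Deletion',
  data_purged: 'Deleted',
};

interface TenantStatusBody {
  status: TenantStatus;
  _legacy_status?: LegacyTenantStatus;
}

function serializeStatus(status: TenantStatus): TenantStatusBody {
  const legacy = CANONICAL_TO_LEGACY[status];
  // Assumption: omit the legacy field rather than invent a value for new states.
  return legacy === undefined ? { status } : { status, _legacy_status: legacy };
}
```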
6. Out-of-band actions
Actions that do not transition the lifecycle state but operate on an active tenant:
| Action | Description | Notes |
|---|---|---|
upgrade | Plan tier change (Starter → Business) | May trigger shard migration if plan_tier_lock differs |
downgrade | Plan tier change (Business → Starter) | May trigger shard migration; data must remain |
migrate_shard | Move tenant DB to a different shard | Backup → restore on new shard → switch tenants_master.db_shard_id → drop on old shard |
change_domain | Custom domain change | DNS update + Frappe bench --site X set-config host_name + KV update |
install_app | Add Frappe app to active tenant | bench --site X install-app <app> via Docker exec |
backup | On-demand backup | bench --site X backup --with-files → R2 upload |
restore | Restore from snapshot | Suspend → restore → resume; logs to tenant_lifecycle_events with reason restore |
rolling_migrate | Frappe / ERPNext version bump | Per-server rolling, see hybrid-multisite-operations.md §2.1 |
All out-of-band actions write to tenant_lifecycle_events with from_state == to_state and actor set to the requester.
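A sketch of a `tenant_lifecycle_events` row using the columns listed in §3, plus a helper for out-of-band entries. The field types, `OutOfBandAction`, and using the action name as `reason` are assumptions (the document only states `reason` = `restore` for restores). Because the §7 contract writes no event when the state is unchanged, these rows would presumably come from a separate write path.

```typescript
// Hypothetical row shape and builder for out-of-band lifecycle events.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';

interface LifecycleEvent {
  tenant_id: string;
  from_state: TenantStatus | null; // null for the "(none)" origin
  to_state: TenantStatus;
  actor: string;
  reason: string;
  at: string; // ISO-8601 timestamp
  workflow_id: string | null;
  evidence_url: string | null;
}

type OutOfBandAction =
  | 'upgrade' | 'downgrade' | 'migrate_shard' | 'change_domain'
  | 'install_app' | 'backup' | 'restore' | 'rolling_migrate';

/** Out-of-band actions keep from_state == to_state and record the requester as actor. */
function outOfBandEvent(
  tenantId: string,
  currentState: TenantStatus,
  action: OutOfBandAction,
  requester: string,
  now: Date = new Date(),
  workflowId: string | null = null,
): LifecycleEvent {
  return {
    tenant_id: tenantId,
    from_state: currentState,
    to_state: currentState,
    actor: requester,
    reason: action,
    at: now.toISOString(),
    workflow_id: workflowId,
    evidence_url: null,
  };
}
```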
7. Idempotency & retries
Every transition is idempotent at the workflow level. Re-applying a transition that has already happened is a no-op and returns the current state.
Implementation contract:
```typescript
// Contract (signature only). TenantStatus is the 8-state union from §2.
declare function transitionTenant(
  tenantId: string,
  toState: TenantStatus,
  actor: string,
  reason: string,
): Promise<{ from: TenantStatus; to: TenantStatus; changed: boolean }>;
```
- If `currentState == toState`: returns `{ changed: false }` and writes no event.
- If the transition is illegal (e.g. `data_purged → active`): throws `IllegalTransitionError`, surfaced as HTTP 409.
- If the transition is legal: updates `tenants_master.status`, writes a `tenant_lifecycle_events` row, and fires side effects in declared order.
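The contract can be sketched against in-memory stand-ins for `tenants_master` and `tenant_lifecycle_events`. Everything here except the `transitionTenant` signature is hypothetical: `makeTransitionTenant`, `TransitionDeps`, and the injected `isLegal` and `sideEffects` hooks (injecting `isLegal` lets the §8 matrix live in one place).

```typescript
// Hypothetical in-memory implementation of the §7 contract.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';

class IllegalTransitionError extends Error {
  readonly httpStatus = 409; // callers map this to HTTP 409
  readonly from: TenantStatus;
  readonly to: TenantStatus;
  constructor(from: TenantStatus, to: TenantStatus) {
    super(`illegal tenant transition ${from} -> ${to}`);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
    this.name = 'IllegalTransitionError';
    this.from = from;
    this.to = to;
  }
}

interface LifecycleEventRow {
  tenantId: string;
  from: TenantStatus;
  to: TenantStatus;
  actor: string;
  reason: string;
}

interface TransitionDeps {
  statuses: Map<string, TenantStatus>; // stand-in for tenants_master.status
  events: LifecycleEventRow[]; // stand-in for tenant_lifecycle_events
  isLegal: (from: TenantStatus, to: TenantStatus) => boolean;
  sideEffects: (from: TenantStatus, to: TenantStatus, tenantId: string) => Promise<void>;
}

function makeTransitionTenant(deps: TransitionDeps) {
  return async function transitionTenant(
    tenantId: string,
    toState: TenantStatus,
    actor: string,
    reason: string,
  ): Promise<{ from: TenantStatus; to: TenantStatus; changed: boolean }> {
    const from = deps.statuses.get(tenantId);
    if (from === undefined) throw new Error(`unknown tenant ${tenantId}`);

    // Already in the target state: no-op and no event, so retries are safe.
    if (from === toState) return { from, to: toState, changed: false };

    if (!deps.isLegal(from, toState)) throw new IllegalTransitionError(from, toState);

    // Declared order: status, then event row, then side effects.
    deps.statuses.set(tenantId, toState);
    deps.events.push({ tenantId, from, to: toState, actor, reason });
    await deps.sideEffects(from, toState, tenantId);
    return { from, to: toState, changed: true };
  };
}
```

Following the order listed above, the status write happens before side effects. If a crash lands between the two, a retry sees `from == toState` and returns `changed: false`, so the side effects need their own retry path at the workflow level, which is where this section places idempotency.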
8. Allowed transition matrix
| From \ To | trial | provisioning | active | suspended | grace_period | terminated | data_purged | failed |
|---|---|---|---|---|---|---|---|---|
(none) | ✅ | ✅ | — | — | — | — | — | — |
trial | — | ✅ | — | — | — | ✅ | — | — |
provisioning | — | — | ✅ | — | — | — | — | ✅ |
active | — | — | — | ✅ | ✅ | — | — | — |
suspended | — | — | ✅ | — | ✅ | — | — | — |
grace_period | — | — | ✅ | — | — | ✅ | — | — |
terminated | — | — | — | — | — | — | ✅ | — |
data_purged | — | — | — | — | — | — | — | — |
failed | — | ✅ | — | — | — | ✅ | — | — |
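Any off-diagonal cell not marked ✅ throws `IllegalTransitionError`. Diagonal cells (from == to) are not errors: per §7 they return `{ changed: false }` without writing an event.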
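Encoding the matrix as data keeps the diagram, this table and the code checkable against each other. A sketch; `ALLOWED`, `FromState` and `assertLegalTransition` are hypothetical, and the key `'none'` stands for the "(none)" row. Same-state requests are assumed to be filtered out earlier, as §7 describes.

```typescript
// Hypothetical data encoding of the §8 transition matrix.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';
type FromState = TenantStatus | 'none'; // 'none' = tenant does not exist yet

const ALLOWED: Readonly<Record<FromState, readonly TenantStatus[]>> = {
  none: ['trial', 'provisioning'],
  trial: ['provisioning', 'terminated'],
  provisioning: ['active', 'failed'],
  active: ['suspended', 'grace_period'],
  suspended: ['active', 'grace_period'],
  grace_period: ['active', 'terminated'],
  terminated: ['data_purged'],
  data_purged: [], // terminal: no outgoing transitions
  failed: ['provisioning', 'terminated'],
};

class IllegalTransitionError extends Error {
  readonly httpStatus = 409;
  constructor(from: FromState, to: TenantStatus) {
    super(`illegal tenant transition ${from} -> ${to}`);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
    this.name = 'IllegalTransitionError';
  }
}

/** Throws for every off-diagonal cell not marked as allowed. */
function assertLegalTransition(from: FromState, to: TenantStatus): void {
  if (!ALLOWED[from].includes(to)) throw new IllegalTransitionError(from, to);
}
```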
9. Observability
Per-state counts are surfaced as a Cloudflare Analytics Engine metric:

`tenants_by_state{region, plan, state}`

Per-transition latency (e.g. the time from provisioning → active) is tracked as a histogram.
Dashboards:
- Funnel: `(none) → trial → provisioning → active` conversion rates and drop-offs
- Health: count of `failed` and `provisioning` tenants (`provisioning` counts as stalled if older than a threshold)
- Churn: `active → grace_period → terminated → data_purged` per cohort
The existing /internal/pending-tenants endpoint covers the “stalled in provisioning” case today and will be extended to surface stalls in grace_period (overdue purge) after Phase 5.
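A sketch of the two stall checks this section describes: `provisioning` rows older than a threshold, and `grace_period` rows whose `grace_period_end` has passed. It assumes a `status_changed_at` column recording when the current state was entered, which this document does not name; `TenantRow`, `Stall` and `findStalledTenants` are hypothetical.

```typescript
// Hypothetical stall detection over tenants_master-like rows.
type TenantStatus =
  | 'trial' | 'provisioning' | 'active' | 'suspended'
  | 'grace_period' | 'terminated' | 'data_purged' | 'failed';

interface TenantRow {
  tenant_id: string;
  status: TenantStatus;
  status_changed_at: Date; // assumed column: when the current state was entered
  grace_period_end: Date | null; // set on entry to grace_period (§3)
}

interface Stall {
  tenantId: string;
  status: TenantStatus;
  overdueMs: number;
}

function findStalledTenants(
  rows: readonly TenantRow[],
  now: Date,
  provisioningThresholdMs: number,
): Stall[] {
  const stalls: Stall[] = [];
  for (const row of rows) {
    let overdueMs = 0;
    if (row.status === 'provisioning') {
      // Stalled if the workflow has been running longer than the threshold.
      overdueMs = now.getTime() - row.status_changed_at.getTime() - provisioningThresholdMs;
    } else if (row.status === 'grace_period' && row.grace_period_end !== null) {
      // Grace window expired but grace_period -> terminated has not run yet.
      overdueMs = now.getTime() - row.grace_period_end.getTime();
    }
    if (overdueMs > 0) stalls.push({ tenantId: row.tenant_id, status: row.status, overdueMs });
  }
  // Most overdue first.
  return stalls.sort((a, b) => b.overdueMs - a.overdueMs);
}
```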
10. References
- Frappe SaaS Multitenant Docker Standard (ADR-002)
- Control Plane Direct Provisioning (ADR-003): lifecycle side effects (`bench backup`, `bench drop-site`, etc.) are dispatched via the Agent inbound endpoint `/agent/v1/exec` (no Ansible)
- Tenant Placement Policy
- Site Pool Strategy
- `prego-control-plane/src/types.ts`: current 4-state definition
- `prego-control-plane/src/repositories/workflows-repo.ts`: `WorkflowType` already includes `suspend` / `resume` / `delete`
- Hybrid Multi-site Operations Runbook