Tenant Lifecycle

Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002)
Related: Tenant Placement Policy, Site Pool Strategy

This document defines the canonical 8-state lifecycle for every Prego tenant and the migration plan from the current 4-state model in prego-control-plane/src/types.ts ('Pending' | 'Active' | 'Pending_Deletion' | 'Deleted').


1. State machine

```mermaid
stateDiagram-v2
    [*] --> trial: signup before payment
    [*] --> provisioning: paid checkout webhook
    trial --> provisioning: upgrade to paid
    trial --> terminated: trial expiry no conversion
    provisioning --> active: workflow success
    provisioning --> failed: workflow non-recoverable error
    active --> suspended: admin or billing failure
    suspended --> active: dispute resolved
    suspended --> grace_period: dunning final notice
    active --> grace_period: voluntary cancel
    grace_period --> active: reactivation within window
    grace_period --> terminated: grace expired
    failed --> provisioning: operator retry
    failed --> terminated: operator gives up
    terminated --> data_purged: retention window expired
    data_purged --> [*]
```

2. State definitions

| State | Meaning | DB shape | Site / DB exists? | Access |
|---|---|---|---|---|
| trial | Pre-payment trial; tenant_id allocated, no infra yet (or trial-shard infra) | tenants_master row | Optional (trial-only shard or none) | Read-write trial scope |
| provisioning | Workflow running (placement / pool / DNS) | tenants_master + provision_jobs running | Being created or rented from pool | None until active |
| active | Production tenant | All tables populated | Yes | Full |
| suspended | Temporarily blocked (billing failure, abuse, manual) | All tables populated | Yes (sites disabled) | None (HTTP 402/403) |
| grace_period | Cancellation requested; reversible window still open | All tables populated | Yes (read-only banner) | Read-only or full per policy |
| terminated | Final cancellation; infra still present pending purge | tenants_master.status='terminated', tenant_runtime retained | Yes (offline) | None |
| data_purged | All tenant data destroyed (DBs dropped, R2 archived) | tenants_master row only (audit) | No | None |
| failed | Provisioning workflow failed unrecoverably | tenants_master + provision_jobs.status='Failed' | Partial / inconsistent | None |
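The "Access" column can be encoded as data so that the gateway and the Admin SPA read the same source. The sketch below is hypothetical: the names `accessFor`, `ACCESS_BY_STATE` and the `gracePolicy` parameter are illustrative, and the read-only default for grace_period is an assumption standing in for "per policy".

```typescript
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";

type Access = "full" | "trial_read_write" | "read_only" | "none";

// Transcription of the Access column; "per_policy" marks grace_period,
// whose access is "read-only or full per policy".
const ACCESS_BY_STATE: Readonly<Record<TenantStatus, Access | "per_policy">> = {
  trial: "trial_read_write",
  provisioning: "none", // no access until active
  active: "full",
  suspended: "none", // gateway answers HTTP 402/403
  grace_period: "per_policy",
  terminated: "none",
  data_purged: "none",
  failed: "none",
};

// gracePolicy is a hypothetical knob; this sketch defaults it to read-only.
function accessFor(status: TenantStatus, gracePolicy: "read_only" | "full" = "read_only"): Access {
  const access = ACCESS_BY_STATE[status];
  return access === "per_policy" ? gracePolicy : access;
}
```

Using a `Record` keyed by the full union makes the compiler reject the table if a ninth state is ever added without an access rule.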

3. Transitions

| Transition | Trigger | Actor | Required steps | Side effects |
|---|---|---|---|---|
| (none) → trial | Trial signup form submitted | www → CP POST /v1/trial/start | Validate email, create tenants_master row, send OTP | Funnel event, no infra |
| (none) → provisioning | Stripe Checkout webhook fires | CP webhook handler | Enqueue prego-provision-queue message | Funnel event |
| trial → provisioning | Trial → paid upgrade | CP POST /v1/tenants/:id/convert | Enqueue provision with from_state='trial' | Stripe subscription created |
| provisioning → active | Workflow completes all steps | WorkflowEngine | tenant_runtime populated, KV TENANT_ORIGINS updated, DNS live | Onboarding email, funnel provision_completed |
| provisioning → failed | Workflow exhausted retries | WorkflowEngine | Persist error, leave partial infra for triage | Page on-call if rate exceeds threshold |
| failed → provisioning | Operator retry | Admin SPA /cp/console | Reset provision_jobs.status='Pending', re-enqueue | Audit log |
| failed → terminated | Operator abandons | Admin SPA | Mark terminated, schedule purge | Audit log |
| active → suspended | Stripe invoice.payment_failed (after dunning), abuse flag, or manual | CP webhook / admin | Disable Frappe site (`bench --site X disable-website` or nginx 402), keep data | Email customer, funnel tenant_suspended |
| suspended → active | Payment recovered or operator clears | CP webhook / admin | Re-enable site | Email customer |
| suspended → grace_period | Final dunning notice sent | CP scheduled | Set grace_period_end timestamp | Email customer |
| active → grace_period | Customer requests cancel | Admin SPA / portal | Set grace_period_end = now + 30 days (plan-dependent) | Backup snapshot, email confirmation |
| grace_period → active | Customer reactivates | Admin SPA / portal | Clear grace_period_end | Stripe subscription resumed |
| grace_period → terminated | Grace period expired | CP scheduled | Stop site, retain backup | Email customer, funnel tenant_terminated |
| terminated → data_purged | Retention window elapsed (e.g. 90 days) | CP scheduled | Drop DBs, delete site directory, archive R2, remove tenant_runtime | Email customer (final), funnel tenant_data_purged |
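The two time windows in this table (grace_period_end = now + 30 days on cancel, and the retention window before purge) reduce to simple date arithmetic. The following is a minimal sketch: the function names are hypothetical, 30 and 90 days are the example values from the table, and per-plan overrides are left to the caller because the document does not list them.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// active -> grace_period: grace_period_end = now + graceDays (30 by default).
function gracePeriodEnd(cancelRequestedAt: Date, graceDays = 30): Date {
  return new Date(cancelRequestedAt.getTime() + graceDays * DAY_MS);
}

// grace_period -> terminated becomes due once grace_period_end has passed.
function isGraceExpired(gracePeriodEndAt: Date, now: Date): boolean {
  return now.getTime() >= gracePeriodEndAt.getTime();
}

// terminated -> data_purged becomes due once the retention window has
// elapsed since termination (90 days is the example value above).
function isPurgeDue(terminatedAt: Date, now: Date, retentionDays = 90): boolean {
  return now.getTime() - terminatedAt.getTime() >= retentionDays * DAY_MS;
}
```

Working in UTC milliseconds treats a day as a fixed 24-hour span, which avoids daylight-saving surprises in a scheduled job.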

Every transition writes a row to tenant_lifecycle_events:

tenant_id, from_state, to_state, actor, reason, at, workflow_id, evidence_url

4. Side effects per transition

4.1 DNS

  • provisioning → active: create <subdomain>.pregoi.com A/CNAME via Cloudflare DNS API (src/clients/cloudflare/)
  • active → suspended: leave DNS, return 402 at app
  • suspended → grace_period: no DNS change
  • grace_period → terminated: optionally redirect to “service ended” landing
  • terminated → data_purged: delete DNS record, free subdomain

4.2 Billing (Stripe)

  • trial → provisioning: create Stripe subscription
  • active → suspended: pause subscription (or rely on Stripe failure state)
  • grace_period → terminated: cancel subscription, issue final invoice
  • terminated → data_purged: no Stripe change (subscription already cancelled)

4.3 Backup

  • provisioning → active: schedule first backup per plan tier (see ADR-002 §13)
  • active → grace_period: take terminal snapshot (full backup to R2)
  • grace_period → terminated: take final snapshot, store under R2://prego-backups/terminated/<tenant>/
  • terminated → data_purged: archive snapshots are retained per legal/compliance policy (default 7 years for Enterprise, 1 year for Starter, 90 days for Trial)
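The default archive retention periods above can be turned into an expiry date for each archived snapshot. This is a hypothetical sketch: `archiveExpiresAt` is an illustrative name, the clock is assumed to start at the terminated → data_purged transition, and Business is omitted because no default is given for it.

```typescript
type ArchivePlan = "Enterprise" | "Starter" | "Trial";

// Defaults from 4.3. Business has no listed default, so it is not guessed here.
const ARCHIVE_RETENTION: Readonly<Record<ArchivePlan, { years: number; days: number }>> = {
  Enterprise: { years: 7, days: 0 },
  Starter: { years: 1, days: 0 },
  Trial: { years: 0, days: 90 },
};

// Assumption: retention is counted from the moment the tenant is purged.
function archiveExpiresAt(purgedAt: Date, plan: ArchivePlan): Date {
  const { years, days } = ARCHIVE_RETENTION[plan];
  const expires = new Date(purgedAt.getTime());
  // Calendar-year arithmetic in UTC; a Feb 29 start rolls to Mar 1 in non-leap years.
  expires.setUTCFullYear(expires.getUTCFullYear() + years);
  // setUTCDate normalizes overflow (e.g. day 91 of January becomes April 1).
  expires.setUTCDate(expires.getUTCDate() + days);
  return expires;
}
```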

4.4 KV / Routing

  • provisioning → active: write TENANT_ORIGINS[<host>] = origin_url
  • active → suspended: leave KV; gateway returns 402 based on tenants_master.status
  • terminated → data_purged: delete KV entry
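Sections 4.1 to 4.4 can be collected into a single catalogue keyed by transition, which is what "fires side effects in declared order" in section 7 needs. The sketch below is hypothetical: `SIDE_EFFECTS` and `sideEffectsFor` are illustrative names, the steps are descriptions rather than real handlers, and ordering each list by subsection (DNS, Billing, Backup, KV) is an assumption, since the document does not define a cross-subsystem order.

```typescript
type Subsystem = "dns" | "billing" | "backup" | "kv";

interface SideEffect {
  subsystem: Subsystem;
  step: string; // description only; a real runner would map these to handlers
}

// Only entries that actually do something are listed; "leave DNS/KV" and
// "no Stripe change" are omitted because they are no-ops.
const SIDE_EFFECTS: Readonly<Record<string, readonly SideEffect[]>> = {
  "trial->provisioning": [{ subsystem: "billing", step: "create Stripe subscription" }],
  "provisioning->active": [
    { subsystem: "dns", step: "create <subdomain>.pregoi.com A/CNAME" },
    { subsystem: "backup", step: "schedule first backup per plan tier" },
    { subsystem: "kv", step: "write TENANT_ORIGINS[<host>] = origin_url" },
  ],
  "active->suspended": [{ subsystem: "billing", step: "pause subscription" }],
  "active->grace_period": [{ subsystem: "backup", step: "terminal snapshot (full backup to R2)" }],
  "grace_period->terminated": [
    { subsystem: "dns", step: "optional redirect to 'service ended' landing" },
    { subsystem: "billing", step: "cancel subscription, issue final invoice" },
    { subsystem: "backup", step: "final snapshot to R2://prego-backups/terminated/<tenant>/" },
  ],
  "terminated->data_purged": [
    { subsystem: "dns", step: "delete DNS record, free subdomain" },
    { subsystem: "kv", step: "delete TENANT_ORIGINS entry" },
  ],
};

function sideEffectsFor(from: string, to: string): readonly SideEffect[] {
  return SIDE_EFFECTS[`${from}->${to}`] ?? [];
}
```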

5. Migration mapping (4-state → 8-state)

The current code in prego-control-plane/src/types.ts defines:

```typescript
export type TenantStatus = 'Pending' | 'Active' | 'Pending_Deletion' | 'Deleted';
```

Backfill mapping (Phase 5 — see ADR-002 §17):

| Current | New | Notes |
|---|---|---|
| Pending | provisioning | Direct semantic match |
| Active | active | Lower-case rename |
| Pending_Deletion | grace_period | Naming aligned with billing semantics |
| Deleted | data_purged | Distinct from terminated (which is “final cancel, data still present”) |
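During the transition window, readers have to accept both spellings. A minimal sketch of a normalizer built from the table above follows; `normalizeStatus` and the guard functions are hypothetical names, not existing code in prego-control-plane.

```typescript
type LegacyStatus = "Pending" | "Active" | "Pending_Deletion" | "Deleted";

const CANONICAL_STATES = [
  "trial", "provisioning", "active", "suspended",
  "grace_period", "terminated", "data_purged", "failed",
] as const;
type TenantStatus = (typeof CANONICAL_STATES)[number];

// Backfill table as data.
const LEGACY_TO_CANONICAL: Readonly<Record<LegacyStatus, TenantStatus>> = {
  Pending: "provisioning",
  Active: "active",
  Pending_Deletion: "grace_period",
  Deleted: "data_purged",
};

function isLegacyStatus(raw: string): raw is LegacyStatus {
  return Object.prototype.hasOwnProperty.call(LEGACY_TO_CANONICAL, raw);
}

function isCanonicalStatus(raw: string): raw is TenantStatus {
  return (CANONICAL_STATES as readonly string[]).includes(raw);
}

// Accepts either spelling while the CHECK constraint allows both.
function normalizeStatus(raw: string): TenantStatus {
  if (isLegacyStatus(raw)) return LEGACY_TO_CANONICAL[raw];
  if (isCanonicalStatus(raw)) return raw;
  throw new Error(`unknown tenant status: ${raw}`);
}
```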

New states without backfill (created by future flows):

  • trial — created by www trial signup flow before payment
  • suspended — created by Stripe dunning / abuse flag
  • terminated — distinct interstitial between grace and purge
  • failed — distinct from Pending (failed requires operator action; Pending is in-flight)

5.1 Phase 5 migration plan

  1. Add the new D1 columns and a CHECK constraint that allows all 8 states (during the transition, the CHECK accepts both legacy and new values).
  2. Backfill: UPDATE tenants_master SET status = MAP(status), where MAP is the mapping in the table above.
  3. Update all readers (src/types.ts, src/jobs.ts, Admin SPA) to accept new states.
  4. Drop legacy values from CHECK constraint after readers ship.
  5. Update Zuplo OpenAPI in prego-zuplo config/*.oas.json.
  6. Sync admin SPA strings (/cp/console lifecycle widget).
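Step 2 can be expressed as a single SQL CASE statement generated from the mapping table, so the SQL and the table cannot drift apart. This is a hypothetical helper (`buildBackfillSql` is not part of the codebase); the values are fixed literals, never user input, which is why plain string interpolation is acceptable here.

```typescript
const BACKFILL_MAP: Readonly<Record<string, string>> = {
  Pending: "provisioning",
  Active: "active",
  Pending_Deletion: "grace_period",
  Deleted: "data_purged",
};

function buildBackfillSql(map: Readonly<Record<string, string>>): string {
  const entries = Object.entries(map);
  const whens = entries.map(([from, to]) => `  WHEN '${from}' THEN '${to}'`);
  const legacyList = entries.map(([from]) => `'${from}'`).join(", ");
  // The WHERE clause limits the update to legacy rows, so every row hits a
  // WHEN branch; re-running after the backfill matches nothing.
  return [
    "UPDATE tenants_master",
    "SET status = CASE status",
    ...whens,
    "END",
    `WHERE status IN (${legacyList});`,
  ].join("\n");
}
```

Because the statement only touches legacy values, it is safe to re-run, which fits the idempotency contract in section 7.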

5.2 Backwards compatibility window

For one quarter after Phase 5 ships, the API exposes both representations:

```jsonc
{
  "status": "active",          // canonical
  "_legacy_status": "Active"   // until cohort N+1 SDKs ship
}
```

This is the same pattern as the _legacy_job_id / _legacy_tenant_id fields already used for the ID format migration.
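A serializer for the dual representation could look like this. It is a sketch under one stated assumption: states with no legacy equivalent (trial, suspended, terminated, failed) omit `_legacy_status` entirely, since the document does not say what older SDKs should receive for them. `toStatusPayload` is a hypothetical name.

```typescript
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";
type LegacyStatus = "Pending" | "Active" | "Pending_Deletion" | "Deleted";

// Reverse of the backfill table; only four canonical states have a legacy name.
const CANONICAL_TO_LEGACY: Readonly<Partial<Record<TenantStatus, LegacyStatus>>> = {
  provisioning: "Pending",
  active: "Active",
  grace_period: "Pending_Deletion",
  data_purged: "Deleted",
};

interface StatusPayload {
  status: TenantStatus;
  _legacy_status?: LegacyStatus;
}

function toStatusPayload(status: TenantStatus): StatusPayload {
  const legacy = CANONICAL_TO_LEGACY[status];
  // Assumption: new-only states omit the legacy field rather than inventing a value.
  return legacy === undefined ? { status } : { status, _legacy_status: legacy };
}
```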


6. Out-of-band actions

Actions that do not transition the lifecycle state but operate on an active tenant:

| Action | Description | Notes / procedure |
|---|---|---|
| upgrade | Plan tier change (Starter → Business) | May trigger shard migration if plan_tier_lock differs |
| downgrade | Plan tier change (Business → Starter) | May trigger shard migration; data must be preserved |
| migrate_shard | Move tenant DB to a different shard | Backup → restore on new shard → switch tenants_master.db_shard_id → drop on old shard |
| change_domain | Custom domain change | DNS update + Frappe `bench --site X set-config host_name` + KV update |
| install_app | Add Frappe app to active tenant | `bench --site X install-app <app>` via Docker exec |
| backup | On-demand backup | `bench --site X backup --with-files` → R2 upload |
| restore | Restore from snapshot | Suspend → restore → resume; logs to tenant_lifecycle_events with reason restore |
| rolling_migrate | Frappe / ERPNext version bump | Per-server rolling; see hybrid-multisite-operations.md §2.1 |

All out-of-band actions write to tenant_lifecycle_events with from_state == to_state and actor set to the requester.
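As an example of an out-of-band action, the migrate_shard sequence from the table can be written as an ordered orchestration that ends by recording an event with from_state == to_state. This is a sketch: `ShardOps` and `migrateShard` are hypothetical names, and the adapters behind them (bench/Docker exec, R2, D1) are assumed, not existing APIs.

```typescript
// Row shape of tenant_lifecycle_events (section 3).
interface LifecycleEvent {
  tenant_id: string;
  from_state: string;
  to_state: string;
  actor: string;
  reason: string;
  at: string;
  workflow_id: string | null;
  evidence_url: string | null;
}

// Hypothetical adapters wrapping the real backup/restore/D1 operations.
interface ShardOps {
  backup(tenantId: string): Promise<string>; // returns a snapshot id
  restore(snapshotId: string, shardId: string): Promise<void>;
  switchShard(tenantId: string, shardId: string): Promise<void>; // tenants_master.db_shard_id
  dropOnShard(tenantId: string, shardId: string): Promise<void>;
  recordEvent(event: LifecycleEvent): Promise<void>;
}

async function migrateShard(
  ops: ShardOps, tenantId: string, state: string,
  fromShard: string, toShard: string, actor: string,
): Promise<void> {
  const snapshotId = await ops.backup(tenantId);
  await ops.restore(snapshotId, toShard);
  // The pointer moves only after a successful restore, and the old copy is
  // dropped last, so a failure in backup or restore leaves the tenant on fromShard.
  await ops.switchShard(tenantId, toShard);
  await ops.dropOnShard(tenantId, fromShard);
  // Out-of-band action: the lifecycle state is unchanged.
  await ops.recordEvent({
    tenant_id: tenantId, from_state: state, to_state: state, actor,
    reason: "migrate_shard", at: new Date().toISOString(),
    workflow_id: null, evidence_url: null,
  });
}
```

The sketch does not pause writes between backup and switch. The table does not say whether the tenant is put into maintenance first, so a real implementation would have to decide that explicitly.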


7. Idempotency & retries

Every transition is idempotent at the workflow level. Re-applying a transition that has already happened is a no-op and returns the current state.

Implementation contract:

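```typescript
// Contract only (no body). TenantStatus is the 8-state union in src/types.ts.
// `declare` is used because an ambient signature cannot carry `async`.
declare function transitionTenant(
  tenantId: string,
  toState: TenantStatus,
  actor: string,
  reason: string,
): Promise<{ from: TenantStatus; to: TenantStatus; changed: boolean }>;
```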
  • If currentState == toState: returns { changed: false } and writes no event.
  • If transition is illegal (e.g. data_purged → active): throws IllegalTransitionError with HTTP 409.
  • If transition is legal: updates tenants_master.status, writes tenant_lifecycle_events, fires side effects in declared order.
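The three rules above can be implemented against an injected store and a legality predicate. The sketch below is hypothetical (`makeTransitionTenant`, `TenantStore` and the compare-and-set `commit` are illustrative, not existing code). It adds one safeguard the contract implies but does not spell out: the status update is conditional on the state read earlier, so two concurrent callers cannot both apply a transition.

```typescript
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";

class IllegalTransitionError extends Error {
  readonly httpStatus = 409;
  readonly from: TenantStatus;
  readonly to: TenantStatus;
  constructor(from: TenantStatus, to: TenantStatus) {
    super(`illegal tenant transition: ${from} -> ${to}`);
    this.name = "IllegalTransitionError";
    this.from = from;
    this.to = to;
  }
}

interface LifecycleEvent {
  tenant_id: string; from_state: TenantStatus; to_state: TenantStatus;
  actor: string; reason: string; at: string;
}

// Hypothetical store. commit() must update tenants_master.status only if it still
// equals expectedFrom, insert the event in the same batch, and return false otherwise.
interface TenantStore {
  getStatus(tenantId: string): Promise<TenantStatus>;
  commit(tenantId: string, expectedFrom: TenantStatus, to: TenantStatus, event: LifecycleEvent): Promise<boolean>;
}

type TransitionResult = { from: TenantStatus; to: TenantStatus; changed: boolean };

function makeTransitionTenant(
  store: TenantStore,
  isLegal: (from: TenantStatus, to: TenantStatus) => boolean,
  runSideEffects: (tenantId: string, from: TenantStatus, to: TenantStatus) => Promise<void>,
) {
  return async function transitionTenant(
    tenantId: string, toState: TenantStatus, actor: string, reason: string,
  ): Promise<TransitionResult> {
    const from = await store.getStatus(tenantId);
    // Re-applying a transition that already happened: no-op, no event.
    if (from === toState) return { from, to: toState, changed: false };
    if (!isLegal(from, toState)) throw new IllegalTransitionError(from, toState);
    const event: LifecycleEvent = {
      tenant_id: tenantId, from_state: from, to_state: toState,
      actor, reason, at: new Date().toISOString(),
    };
    if (!(await store.commit(tenantId, from, toState, event))) {
      throw new Error(`concurrent transition on ${tenantId}; re-read and retry`);
    }
    // Side effects fire only after the state change is durable.
    await runSideEffects(tenantId, from, toState);
    return { from, to: toState, changed: true };
  };
}
```

Because side effects run after the commit, a crash between the two leaves the new state recorded with effects still pending. The effect handlers therefore have to be idempotent and safe to re-run.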

8. Allowed transition matrix

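| From \ To | trial | provisioning | active | suspended | grace_period | terminated | data_purged | failed |
|---|---|---|---|---|---|---|---|---|
| (none) | ✅ | ✅ | | | | | | |
| trial | | ✅ | | | | ✅ | | |
| provisioning | | | ✅ | | | | | ✅ |
| active | | | | ✅ | ✅ | | | |
| suspended | | | ✅ | | ✅ | | | |
| grace_period | | | ✅ | | | ✅ | | |
| terminated | | | | | | | ✅ | |
| data_purged | | | | | | | | |
| failed | | ✅ | | | | ✅ | | |

Any cell not marked ✅ throws IllegalTransitionError.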
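The matrix maps directly onto an adjacency table that a legality check can consult. The sketch below transcribes the 15 edges of the section 1 state diagram; `ALLOWED` and `canTransition` are hypothetical names, and `"(none)"` stands in for a tenant that does not exist yet.

```typescript
type TenantStatus =
  | "trial" | "provisioning" | "active" | "suspended"
  | "grace_period" | "terminated" | "data_purged" | "failed";
type FromState = TenantStatus | "(none)";

// Edges from the state diagram; data_purged is terminal.
const ALLOWED: Readonly<Record<FromState, readonly TenantStatus[]>> = {
  "(none)": ["trial", "provisioning"],
  trial: ["provisioning", "terminated"],
  provisioning: ["active", "failed"],
  active: ["suspended", "grace_period"],
  suspended: ["active", "grace_period"],
  grace_period: ["active", "terminated"],
  terminated: ["data_purged"],
  data_purged: [],
  failed: ["provisioning", "terminated"],
};

function canTransition(from: FromState, to: TenantStatus): boolean {
  return ALLOWED[from].includes(to);
}
```

A unit test that counts the edges (15) is a cheap guard against the diagram and the code drifting apart.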


9. Observability

Per-state counts are surfaced as a Cloudflare Analytics Engine metric:

tenants_by_state{region, plan, state}

Per-transition latency (e.g. the time from provisioning to active) is tracked as a histogram.
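The provisioning → active latency can be derived from tenant_lifecycle_events and then bucketed. This is a sketch under stated assumptions: the function names and bucket bounds are hypothetical, and a retry (failed → provisioning) restarts the clock, so the figure measures the last attempt rather than total time since checkout.

```typescript
// Subset of the tenant_lifecycle_events row used here.
interface LifecycleEvent {
  tenant_id: string;
  from_state: string;
  to_state: string;
  at: string; // ISO-8601 timestamp
}

function provisioningToActiveMs(events: readonly LifecycleEvent[]): number[] {
  const sorted = [...events].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  const enteredAt = new Map<string, number>();
  const durations: number[] = [];
  for (const e of sorted) {
    if (e.from_state === e.to_state) continue; // out-of-band action, not a transition
    if (e.to_state === "provisioning") {
      enteredAt.set(e.tenant_id, Date.parse(e.at)); // a retry restarts the clock
    } else if (e.from_state === "provisioning" && e.to_state === "active") {
      const start = enteredAt.get(e.tenant_id);
      if (start !== undefined) {
        durations.push(Date.parse(e.at) - start);
        enteredAt.delete(e.tenant_id);
      }
    }
  }
  return durations;
}

// counts[i] holds values <= upperBoundsMs[i] (and above the previous bound);
// the final slot is the overflow bucket.
function histogram(values: readonly number[], upperBoundsMs: readonly number[]): number[] {
  const counts: number[] = new Array<number>(upperBoundsMs.length + 1).fill(0);
  for (const v of values) {
    const i = upperBoundsMs.findIndex((b) => v <= b);
    const bucket = i === -1 ? upperBoundsMs.length : i;
    counts[bucket] = (counts[bucket] ?? 0) + 1;
  }
  return counts;
}
```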

Dashboards:

  • Funnel: (none) → trial → provisioning → active rates and dropouts
  • Health: count of tenants in failed, and in provisioning (stalled if older than a threshold)
  • Churn: active → grace_period → terminated → data_purged per cohort

The existing /internal/pending-tenants endpoint covers the “stalled in provisioning” case today and will be extended to surface stalls in grace_period (overdue purge) after Phase 5.
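The extended stall check could look like the following sketch. `findStalls` is a hypothetical name, the threshold is supplied by the caller, and `status_since` (when the tenant entered its current state) is an assumed column that this document does not define.

```typescript
// Hypothetical projection of tenants_master; status_since is an assumed column.
interface TenantRow {
  tenant_id: string;
  status: string;
  status_since: string;
  grace_period_end: string | null;
}

type StallKind = "provisioning_stalled" | "grace_period_overdue";

function findStalls(
  rows: readonly TenantRow[], now: Date, provisioningThresholdMs: number,
): { tenant_id: string; kind: StallKind }[] {
  const t = now.getTime();
  const stalls: { tenant_id: string; kind: StallKind }[] = [];
  for (const r of rows) {
    if (r.status === "provisioning" && t - Date.parse(r.status_since) > provisioningThresholdMs) {
      stalls.push({ tenant_id: r.tenant_id, kind: "provisioning_stalled" });
    } else if (r.status === "grace_period" && r.grace_period_end !== null && Date.parse(r.grace_period_end) < t) {
      // The grace window has closed but the scheduled job has not moved the tenant on.
      stalls.push({ tenant_id: r.tenant_id, kind: "grace_period_overdue" });
    }
  }
  return stalls;
}
```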


10. References
