Control Plane Direct Provisioning

Status: Accepted (canonical for the IaC + provisioning control path)
Date: 2026-04-24
Supersedes: Frappe SaaS Multitenant Docker Standard §3.2 (prego-pulumi), §3.3 (prego-ansible), §3.4 (prego-docker Agent compose snippet)
Companion to: Agent Security Standard (Agent role definition)
Implementation ADR: ADR-003 (prego-control-plane)

This document is the canonical source for how Prego provisions and manages tenant infrastructure. The Cloudflare Worker Control Plane is the only control plane. There is no Pulumi, no Ansible, no SSH-based configuration management.

1. The four execution paths

[Cloudflare Worker Control Plane]
        ↓ (D1-backed Workflow)
        ↓
[Provisioning Worker]
        ├─ Path A : Hetzner Cloud API   — server / volume / firewall create+delete
        ├─ Path B : cloud-init          — Docker + Agent bootstrap (one-shot, generated by Worker)
        ├─ Path C : Docker Remote API   — container lifecycle (mTLS port 2376)
        └─ Path D : Agent inbound API   — bench / mariadb / redis-cli scope-limited execution (mTLS + JWT)

Path	Direction	Auth	What it MUST do	What it MUST NOT do
A. Hetzner Cloud API	Worker → Hetzner	API token (Worker Secret)	Server / volume / firewall / placement-group lifecycle	Anything tenant-internal
B. cloud-init	Worker → new server (one-shot, via Hetzner `user_data`)	Bootstrap token (≤15 min TTL, single-use)	Install Docker, deploy mTLS certs, install + start Agent, seed bootstrap token	Receive runtime commands; long-lived
C. Docker Remote API	Worker → host:2376	mTLS (CA-issued client cert)	Container create / start / stop / exec / pull / inspect / logs	`bench` / `mariadb` / app-level commands
D. Agent inbound API	Worker → host:443	mTLS + 5-min JWT, scope=`exec.<kind>`	`bench`, `mariadb`, `redis-cli` (closed enum)	Arbitrary shell, `docker run`, direct `docker exec`
(E. Agent outbound, existing)	Agent → Worker	mTLS + JWT (`heartbeat`/`metrics`/`logs`/`backup_report`)	Telemetry, health, backup-report	Receive commands

Paths A+B replace prego-pulumi. Path D replaces prego-ansible. Path C is preserved verbatim from the prior imperative-control-layer design.
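
The MUST / MUST NOT columns above can be enforced in code with an exhaustive routing table, so that a call site cannot send an app-level command over the Docker API. The following is a minimal sketch, not the prego-control-plane implementation: the `Operation` names, `pathFor` and `assertPath` are hypothetical and chosen only to mirror the table.

```typescript
// Hypothetical operation names, used only to illustrate the path table.
type ExecutionPath = "A" | "B" | "C" | "D";

type Operation =
  | "server.create"
  | "server.delete"
  | "bootstrap.render"
  | "container.create"
  | "container.exec"
  | "exec.bench"
  | "exec.mariadb"
  | "exec.redis-cli";

// Exhaustive map: adding an Operation without assigning it a path
// is a compile-time error.
const PATH_FOR: Record<Operation, ExecutionPath> = {
  "server.create": "A",
  "server.delete": "A",
  "bootstrap.render": "B",
  "container.create": "C",
  "container.exec": "C",
  "exec.bench": "D",
  "exec.mariadb": "D",
  "exec.redis-cli": "D",
};

function pathFor(op: Operation): ExecutionPath {
  return PATH_FOR[op];
}

// Call-site guard: refuse to send an operation over the wrong path.
function assertPath(op: Operation, via: ExecutionPath): void {
  const expected = pathFor(op);
  if (expected !== via) {
    throw new Error(`${op} must use Path ${expected}, not Path ${via}`);
  }
}
```

With this shape, `assertPath("exec.mariadb", "C")` throws, which matches the rule that `mariadb` commands never travel over the Docker Remote API.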

2. End-to-end provisioning sequence

sequenceDiagram
    autonumber
    participant Op as Operator / API caller
    participant CP as Cloudflare Worker (Control Plane)
    participant D1 as D1 (servers, server_bootstrap_tokens)
    participant Hetz as Hetzner Cloud API
    participant Srv as New Server (boot)
    participant Ag as Agent (on Srv)

    Op->>CP: POST /internal/servers/create
    CP->>D1: INSERT server row (status=provisioning)
    CP->>D1: INSERT server_bootstrap_tokens (single-use, ≤15min)
    CP->>CP: render cloud-init (Docker + Agent + token)
    CP->>Hetz: POST /servers (user_data=base64(cloud-init))
    Hetz-->>CP: server id, ipv4
    CP->>D1: UPDATE server.host
    Note over Srv: Hetzner boots image, runs cloud-init
    Srv->>Srv: install docker; pull prego-agent
    Srv->>Ag: systemctl start prego-agent (token mounted)
    Ag->>CP: POST /internal/agent/handshake (token, mTLS)
    CP->>D1: token.consumed_at = now()
    CP->>Ag: 200 OK + long-lived JWT
    Ag->>CP: POST /internal/agent/heartbeat (every 30s)
    CP->>D1: server.status = active
    CP-->>Op: workflow.status = completed

Total target: ≤120 s from POST /internal/servers/create to server.status = 'active'.

Per-step contracts

Step	Input	Output	Failure mode	Idempotency
1. `POST /internal/servers/create`	`{role, region, infra_provider, spec, server_id}`	`{workflow_id}`	4xx if D1 row already exists with same `server_id`	Idempotent on `server_id`
2. Issue bootstrap token	`server_id`	`token_id`, raw token (returned only once, hashed in D1)	D1 write fail → workflow retry	Per-attempt token; previous tokens marked `superseded_at`
3. Render cloud-init	`bootstrap_token`, `agent_image`, `mtls_certs`	YAML ≤32 KB	Size overrun → fall back to 2-stage bootstrap (R2 fetch); see ADR-003 OD-B	Deterministic given same inputs
4. `Hetzner POST /servers`	server_type, location, image, ssh_keys=[] (none — no SSH), user_data	Hetzner server id	Hetzner error → mark workflow retry; no D1 server row update	Hetzner endpoint is not idempotent — Worker enforces idempotency by checking D1 first
5. Wait for cloud-init completion	`server.status == 'running'` (Hetzner-side)	—	5-min timeout → mark `failed`, destroy server, retry	Workflow-level retry; previous server destroyed
6. Agent handshake	bootstrap_token (over mTLS)	long-lived JWT for outbound push	Token already consumed → 409, server marked `failed_bootstrap`	Token enforces single-use
7. First heartbeat	JWT (mTLS)	server.agent_status = `connected`	No heartbeat in 90 s → workflow `failed_no_heartbeat`, destroy server	Heartbeat-driven; no replay needed
8. Container creation (if `role=app`)	server_id, image	container running	Docker API error → workflow `failed_container`, server retained for debug	Docker API idempotent on container name
9. `server.status = 'active'`	All above succeeded	—	—	Final transition is a single UPDATE
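
Steps 1 and 4 both depend on consulting D1 before acting. A minimal sketch of those two guards follows, assuming a hypothetical `ServerRow` shape with a `providerServerId` field that records the Hetzner server id once step 4 has succeeded; neither the field nor the function names come from the schema in this document.

```typescript
// Hypothetical view of the D1 `servers` row as seen by the workflow.
interface ServerRow {
  serverId: string;
  status: "provisioning" | "active" | "failed";
  providerServerId: string | null; // set once the Hetzner create has succeeded
}

// Step 1 guard: a second create for an existing server_id is rejected.
// The table specifies a 4xx but not the exact status code.
function acceptCreate(existing: ServerRow | undefined): boolean {
  return existing === undefined;
}

type CreateDecision =
  | { action: "call_provider" }
  | { action: "skip_provider"; providerServerId: string };

// Step 4 guard: Hetzner's POST /servers is not idempotent, so a workflow
// retry checks D1 first and skips the call if a server was already created.
function decideProviderCall(row: ServerRow): CreateDecision {
  if (row.providerServerId !== null) {
    return { action: "skip_provider", providerServerId: row.providerServerId };
  }
  return { action: "call_provider" };
}
```

Keeping the decision separate from the HTTP call makes the retry behaviour testable without a Hetzner account.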

3. Failure handling and rollback

Failure point	Worker action	Human action
Hetzner API 5xx during create	Exponential backoff (3 attempts)	None
cloud-init exceeds 32 KB	Fall back to 2-stage bootstrap from R2	None until OD-B is decided
Agent fails to handshake within 90 s	`Hetzner.deleteServer()` + token revoke + workflow `failed`; retry up to 2×	Inspect Hetzner serial console
Agent handshake succeeds but heartbeat stops	Mark `degraded` after 60 s; mark `failed` after 300 s	Page on-call
Docker container fails to start	Workflow `failed_container`; server retained for debug	SSH-less debug via Agent exec (`kind=bench`)
D1 write fails after Hetzner success	Workflow recorded as `inconsistent`; out-of-band reconciliation job	Manual D1 reconcile within 24 h
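
Two rows of this table reduce to small pure functions: the retry schedule for Hetzner 5xx responses and the heartbeat thresholds. The sketch below assumes Unix-second timestamps and a caller-supplied base delay, neither of which is specified here; the function names are hypothetical.

```typescript
type HeartbeatState = "connected" | "degraded" | "failed";

const DEGRADED_AFTER_S = 60; // table: mark degraded after 60 s
const FAILED_AFTER_S = 300;  // table: mark failed after 300 s

// Classify an Agent by how long it has been silent. Treating exactly 60 s
// (or 300 s) of silence as already over the threshold is an assumption.
function classifyHeartbeat(lastHeartbeatAt: number, now: number): HeartbeatState {
  const silentFor = now - lastHeartbeatAt;
  if (silentFor >= FAILED_AFTER_S) return "failed";
  if (silentFor >= DEGRADED_AFTER_S) return "degraded";
  return "connected";
}

// Delays to wait between attempts, doubling each time. With 3 attempts
// there are 2 waits. The base delay is an assumed parameter.
function backoffDelaysMs(attempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}
```

For example, `backoffDelaysMs(3, 500)` yields waits of 500 ms and 1000 ms between the three attempts.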

Rollback guarantee: every workflow step records its intent in D1 before calling Hetzner. If the Hetzner call succeeds but the follow-up D1 write fails, the housekeeping job (next 5-min window) reconciles by enumerating Hetzner servers and matching them against D1; orphans are destroyed.
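
The reconciliation step amounts to a set difference. A minimal sketch, assuming both sides can be compared by the provider-side server id (the matching key is not stated in this document) and using a hypothetical `findOrphans` helper:

```typescript
// Return provider-side server ids that have no matching D1 row.
// Per the rollback guarantee, these are the servers the housekeeping
// job would destroy.
function findOrphans(
  providerIds: readonly string[],
  d1ProviderIds: readonly string[],
): string[] {
  const known = new Set(d1ProviderIds);
  return providerIds.filter((id) => !known.has(id));
}
```

A real job would presumably restrict the enumeration to Prego-managed servers before destroying anything; that filter is not described here and is left out of the sketch.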

4. Bootstrap token security

Token shape: 256-bit random, returned to the Worker once at issuance, stored in D1 as SHA-256 hash only.
TTL: 15 minutes. After TTL the Agent handshake fails with 410 Gone; Worker auto-revokes.
Single-use: D1 column consumed_at; second use returns 409 Conflict.
Transport: only ever inside user_data (encrypted in transit by Hetzner) and over the Agent’s mTLS handshake.
Audit: every token issue + consume writes to agent_command_audit with command_kind='bootstrap'.
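
The TTL and single-use rules above can be expressed as one status decision at handshake time. The sketch below assumes Unix-second timestamps and that the row has already been looked up by the SHA-256 hash of the presented token; hashing and lookup are not shown. The ordering (consumed is checked before expired) is also an assumption, since the rules above do not say which wins when both apply.

```typescript
// Subset of server_bootstrap_tokens needed for the decision.
interface BootstrapTokenRow {
  tokenHash: string;          // sha256(token)
  issuedAt: number;           // Unix seconds (assumed unit)
  expiresAt: number;          // Unix seconds (assumed unit)
  consumedAt: number | null;
}

const TOKEN_TTL_S = 15 * 60; // 15-minute TTL

function expiryFor(issuedAt: number): number {
  return issuedAt + TOKEN_TTL_S;
}

type HandshakeStatus = 200 | 409 | 410;

function checkBootstrapToken(row: BootstrapTokenRow, now: number): HandshakeStatus {
  if (row.consumedAt !== null) return 409; // single-use: second use is a Conflict
  if (now >= row.expiresAt) return 410;    // past TTL: Gone
  return 200;
}
```

On a 200 the caller would set `consumed_at` in the same D1 statement that checks it is still NULL, so that two concurrent handshakes cannot both succeed.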

5. Cloud-init template invariants

The cloud-init template is rendered by the Worker (no static file in prego-docker). Invariants:

No SSH key authorisation — ssh_authorized_keys: []. The Agent is the only inbound path.
Docker daemon binds 0.0.0.0:2376 with mTLS — same CA as today; CA cert deployed via write_files.
Firewall — ufw allow 2376/tcp from <CP egress IPs>, ufw allow 443/tcp from <CP egress IPs> (Agent), default deny.
Agent unit file — systemd Restart=always, RestartSec=5s, hardening per Agent Security Standard §8.
Bootstrap token — written to /etc/prego-agent/bootstrap.token mode 0600 owner root. Agent reads, then unlinks.
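
A minimal sketch of the Worker-side render follows, covering only two of the invariants (no SSH keys, bootstrap token file) plus the 32 KB size gate from the step 3 contract. The function names are hypothetical, the Docker, firewall and Agent unit sections are omitted, and the size limit is applied to the rendered YAML as that contract states.

```typescript
const USER_DATA_LIMIT_BYTES = 32 * 1024; // 32 KB budget from the step 3 contract

// Render a partial cloud-config: no SSH keys, and the bootstrap token
// written as root with mode 0600.
function renderCloudInit(bootstrapToken: string): string {
  return [
    "#cloud-config",
    "ssh_authorized_keys: []",
    "write_files:",
    "  - path: /etc/prego-agent/bootstrap.token",
    "    owner: root:root",
    "    permissions: '0600'",
    // JSON string syntax is also a valid YAML double-quoted scalar.
    `    content: ${JSON.stringify(bootstrapToken)}`,
    "",
  ].join("\n");
}

function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

type BootstrapMode = "inline" | "two_stage_r2";

// Over budget: fall back to the 2-stage bootstrap that fetches from R2.
function chooseBootstrapMode(cloudInit: string): BootstrapMode {
  return utf8ByteLength(cloudInit) <= USER_DATA_LIMIT_BYTES
    ? "inline"
    : "two_stage_r2";
}
```

Because the render is a pure function of its inputs, the same inputs always produce the same YAML, which is what the determinism requirement in the step 3 contract asks for.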

6. D1 schema (additions on top of ADR-002 set)

CREATE TABLE server_bootstrap_tokens (
  token_id            TEXT PRIMARY KEY,
  server_id           TEXT NOT NULL,
  token_hash          TEXT NOT NULL,              -- sha256(token)
  purpose             TEXT NOT NULL CHECK (purpose IN ('agent_bootstrap','import_existing_server')),
  issued_at           INTEGER NOT NULL,
  expires_at          INTEGER NOT NULL,
  consumed_at         INTEGER,
  superseded_at       INTEGER,
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_bootstrap_token_server ON server_bootstrap_tokens(server_id);

CREATE TABLE agent_command_audit (
  audit_id            TEXT PRIMARY KEY,
  server_id           TEXT NOT NULL,
  workflow_id         TEXT,
  command_kind        TEXT NOT NULL CHECK (command_kind IN ('bench','mariadb','redis-cli','bootstrap')),
  args_redacted       TEXT NOT NULL,              -- joined args with secrets masked
  actor               TEXT NOT NULL CHECK (actor IN ('workflow','operator','system')),
  requested_at        INTEGER NOT NULL,
  started_at          INTEGER,
  finished_at         INTEGER,
  exit_code           INTEGER,
  duration_ms         INTEGER,
  stdout_truncated    TEXT,                       -- last 8 KB
  stderr_truncated    TEXT,                       -- last 8 KB
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_audit_server ON agent_command_audit(server_id);
CREATE INDEX idx_audit_workflow ON agent_command_audit(workflow_id);
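
The `args_redacted`, `stdout_truncated` and `stderr_truncated` columns imply two small transformations before an audit row is written. The sketch below is illustrative only: the list of sensitive flags is an assumption, and the tail is measured in UTF-16 code units rather than bytes, which simplifies the "last 8 KB" rule.

```typescript
const TAIL_LIMIT = 8 * 1024;

// Keep only the end of a stream (simplification: code units, not bytes).
function tail(output: string, limit: number = TAIL_LIMIT): string {
  return output.length <= limit ? output : output.slice(output.length - limit);
}

// Illustrative list only; the real set of sensitive flags is not given here.
const SENSITIVE_FLAGS = new Set([
  "--password",
  "--admin-password",
  "--mariadb-root-password",
]);

// Join args into one string, masking the value of any sensitive flag in
// both "--flag value" and "--flag=value" forms.
function redactArgs(args: readonly string[]): string {
  const out: string[] = [];
  let maskNext = false;
  for (const arg of args) {
    if (maskNext) {
      out.push("***");
      maskNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq);
    if (!SENSITIVE_FLAGS.has(flag)) {
      out.push(arg);
    } else if (eq === -1) {
      out.push(arg);
      maskNext = true; // the next argument is the secret value
    } else {
      out.push(`${flag}=***`);
    }
  }
  return out.join(" ");
}
```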

The servers table (defined under ADR-002) gains:

managed_by — worker_direct | pulumi (Pulumi value used only for in-flight cutover)
agent_status — unknown | connected | degraded | disconnected
agent_first_seen_at, agent_last_heartbeat_at
imported_from_pulumi_at (nullable; non-null only for Option B imports)

7. Worker-side TypeScript contracts

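// Provider abstraction for Path A; implementations live in the Worker.
// ServerSpec and ServerHandle are referenced here but not defined in this section.
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;   // server create
  destroyServer(id: string): Promise<void>;                // server delete
  attachVolume?(id: string, gb: number): Promise<void>;    // optional capability
}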

// src/clients/agent/types.ts
export type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

export interface AgentExecRequest {
  workflowId: string;
  serverId: string;
  kind: AgentCommandKind;
  args: string[];
  timeoutSeconds?: number;       // default 60
  workingDir?: string;           // default /home/frappe/frappe-bench for bench
}

export interface AgentExecResult {
  exitCode: number;
  stdoutTruncated: string;       // last 8 KB
  stderrTruncated: string;       // last 8 KB
  durationMs: number;
  auditId: string;
}
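
A short sketch of how a caller might build a request against these contracts, applying the documented defaults (60 s timeout, `/home/frappe/frappe-bench` for `bench`) and rejecting kinds outside the closed enum. The types are repeated so the block stands alone; `isAgentCommandKind` and `withDefaults` are hypothetical helpers.

```typescript
type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

interface AgentExecRequest {
  workflowId: string;
  serverId: string;
  kind: AgentCommandKind;
  args: string[];
  timeoutSeconds?: number;
  workingDir?: string;
}

const KINDS: readonly string[] = ["bench", "mariadb", "redis-cli"];

// Runtime check for values arriving as plain strings (e.g. from JSON).
function isAgentCommandKind(value: string): value is AgentCommandKind {
  return KINDS.indexOf(value) !== -1;
}

// Fill in the defaults stated in the interface comments.
function withDefaults(req: AgentExecRequest): AgentExecRequest {
  const out: AgentExecRequest = {
    ...req,
    timeoutSeconds: req.timeoutSeconds ?? 60,
  };
  if (out.workingDir === undefined && out.kind === "bench") {
    out.workingDir = "/home/frappe/frappe-bench";
  }
  return out;
}
```

Usage: `withDefaults({ workflowId: "wf-1", serverId: "srv-1", kind: "bench", args: ["migrate"] })` returns the same request with `timeoutSeconds: 60` and the bench working directory filled in.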

8. Agent inbound API contract

Endpoint	Method	Direction	Auth	Purpose
`/agent/v1/exec`	`POST`	Worker → Agent	mTLS + JWT scope=`exec.<kind>`	Synchronous scope-limited CLI execution
`/agent/v1/exec/stream`	`POST` (SSE)	Worker → Agent	same	Long-running command with progress stream
`/internal/agent/handshake`	`POST`	Agent → Worker	mTLS + bootstrap token	One-shot at first boot, exchanges token for long-lived JWT
`/internal/agent/heartbeat`	`POST`	Agent → Worker	mTLS + JWT scope=`heartbeat`	30 s liveness
`/internal/agent/metrics`	`POST`	Agent → Worker	mTLS + JWT scope=`metrics`	OTLP metrics push
`/internal/agent/logs`	`POST`	Agent → Worker	mTLS + JWT scope=`logs`	Log shipping
`/internal/agent/backup-report`	`POST`	Agent → Worker	mTLS + JWT scope=`backup_report`	bench backup completion notification
`/internal/agent/rotate-token`	`POST`	Agent → Worker	mTLS + current JWT	Periodic rotation

Forbidden (must not exist):

POST /agent/v1/shell (arbitrary shell)
POST /agent/v1/exec without kind enum
POST /agent/v1/docker/run or anything bypassing Path C
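
On the Agent side, the `exec.<kind>` scope and the 5-minute JWT lifetime suggest a check along the following lines. This is a sketch under assumptions: the claim names (`scope`, `iat`, `exp`) are hypothetical for this system, and signature verification and the mTLS layer are out of scope.

```typescript
type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

// Hypothetical claim layout; the JWT structure is not specified here.
interface ExecClaims {
  scope: string; // e.g. "exec.bench"
  iat: number;   // issued-at, Unix seconds
  exp: number;   // expiry, Unix seconds
}

const MAX_EXEC_JWT_LIFETIME_S = 5 * 60;

type AuthResult = { ok: true } | { ok: false; reason: string };

// A token scoped to one kind must not authorize another kind.
function authorizeExec(
  claims: ExecClaims,
  kind: AgentCommandKind,
  now: number,
): AuthResult {
  if (claims.scope !== `exec.${kind}`) {
    return { ok: false, reason: "scope_mismatch" };
  }
  if (claims.exp - claims.iat > MAX_EXEC_JWT_LIFETIME_S) {
    return { ok: false, reason: "lifetime_too_long" };
  }
  if (now >= claims.exp) {
    return { ok: false, reason: "expired" };
  }
  return { ok: true };
}
```

Matching the scope exactly, rather than by prefix, keeps a token issued for `exec.bench` from being replayed against a `mariadb` request.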

9. What changes in each repo

Repo	Status	Concrete change
`prego-control-plane`	Active, expanded	New D1 tables, Worker-side `InfraProvider`, Agent inbound `exec-client.ts`, `provisioning-workflow.ts` rewrite
`prego-pulumi`	Deprecated	No new resources. Existing stacks readable for cutover. See migration runbook.
`prego-ansible`	Deprecated	No new playbooks. See migration runbook.
`prego-docker`	Active, scope reduced	Frappe images only. Agent image is published from `prego-agent` (see ADR-003 OD-C) and pulled by cloud-init.
`prego_saas`	Unchanged	No DocType changes.

10. Coexistence with the rest of the standard

ADR-003 is intentionally narrow:

DB sharding, Redis pooling, Site Pool, 8-state Tenant Lifecycle, Plan Isolation Matrix, Backup/DR (Standard §3, §5, §6, §10, §13) — unchanged, but their operations now flow through Path C (Docker API) and Path D (Agent exec) instead of Ansible.
Push Agent telemetry (Agent Security Standard) — unchanged direction, now coexists with Path D inbound.
InfraProvider interface (Standard §14) — unchanged at type level; implementations move into the Worker (no Pulumi stacks).

11. Forbidden patterns (enforcement)

CI must fail any PR that introduces:

pulumi invocation from prego-control-plane/src/
ansible-playbook invocation from prego-control-plane/src/
ssh invocation from prego-control-plane/src/ (excluding doc strings clearly marked as historical)
New cross-repo dependency on prego-pulumi or prego-ansible
New /agent/v1/* endpoint without an explicit kind enum
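
The first three rules could be approximated by a text scan over changed files. The sketch below is deliberately crude: it matches the tool names anywhere on a line, including comments and strings, rather than detecting actual process invocations, and the `HISTORICAL:` marker used for the ssh exemption is a hypothetical convention.

```typescript
interface Violation {
  path: string;
  line: number; // 1-based
  rule: string;
}

const GUARDED_PREFIX = "prego-control-plane/src/";

const RULES: ReadonlyArray<{ rule: string; pattern: RegExp }> = [
  { rule: "pulumi", pattern: /\bpulumi\b/ },
  { rule: "ansible-playbook", pattern: /\bansible-playbook\b/ },
  { rule: "ssh", pattern: /\bssh\b/ },
];

// Hypothetical marker for doc strings "clearly marked as historical".
const HISTORICAL_MARKER = "HISTORICAL:";

function scanFile(path: string, source: string): Violation[] {
  if (path.indexOf(GUARDED_PREFIX) !== 0) return [];
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      // Only the ssh rule carries the historical exemption.
      if (rule === "ssh" && text.indexOf(HISTORICAL_MARKER) !== -1) continue;
      if (pattern.test(text)) {
        violations.push({ path, line: index + 1, rule });
      }
    }
  });
  return violations;
}
```

A CI job would run `scanFile` over each file in the PR diff and fail when the combined result is non-empty.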