Control Plane Direct Provisioning
Status: Accepted (canonical for the IaC + provisioning control path)
Date: 2026-04-24
Supersedes: Frappe SaaS Multitenant Docker Standard §3.2 (prego-pulumi), §3.3 (prego-ansible), §3.4 (prego-docker Agent compose snippet)
Companion to: Agent Security Standard (Agent role definition)
Implementation ADR: ADR-003 (prego-control-plane)
This document is the canonical source for how Prego provisions and manages tenant infrastructure. The Cloudflare Worker Control Plane is the only control plane. There is no Pulumi, no Ansible, no SSH-based configuration management.
1. The four execution paths
[Cloudflare Worker Control Plane]
        ↓ (D1-backed Workflow)
        ↓
[Provisioning Worker]
  ├─ Path A : Hetzner Cloud API — server / volume / firewall create+delete
  ├─ Path B : cloud-init — Docker + Agent bootstrap (one-shot, generated by Worker)
  ├─ Path C : Docker Remote API — container lifecycle (mTLS port 2376)
  └─ Path D : Agent inbound API — bench / mariadb / redis-cli scope-limited execution (mTLS + JWT)

| Path | Direction | Auth | What it MUST do | What it MUST NOT do |
|---|---|---|---|---|
| A. Hetzner Cloud API | Worker → Hetzner | API token (Worker Secret) | Server / volume / firewall / placement-group lifecycle | Anything tenant-internal |
| B. cloud-init | Worker → new server (one-shot, via Hetzner user_data) | Bootstrap token (≤15 min TTL, single-use) | Install Docker, deploy mTLS certs, install + start Agent, seed bootstrap token | Receive runtime commands; long-lived |
| C. Docker Remote API | Worker → host:2376 | mTLS (CA-issued client cert) | Container create / start / stop / exec / pull / inspect / logs | bench / mariadb / app-level commands |
| D. Agent inbound API | Worker → host:443 | mTLS + 5-min JWT, scope=exec.<kind> | bench, mariadb, redis-cli (closed enum) | Arbitrary shell, docker run, direct docker exec |
| (E. Agent outbound, existing) | Agent → Worker | mTLS + JWT (heartbeat/metrics/logs/backup_report) | Telemetry, health, backup-report | Receive commands |
Paths A+B replace prego-pulumi. Path D replaces prego-ansible. Path C is preserved verbatim from the prior imperative-control-layer design.
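The separation between the four paths can be pictured as a closed lookup: every operation the Worker performs belongs to exactly one path, and anything not listed is rejected. The sketch below is illustrative only. The path letters come from the table above, but the operation names and the `pathFor` helper are hypothetical and do not appear in prego-control-plane.

```typescript
// The four paths come from the table above; the operation names are hypothetical.
type ExecutionPath = "A" | "B" | "C" | "D";

type OperationType =
  | "server.create"
  | "server.delete"
  | "volume.create"
  | "firewall.create"
  | "host.bootstrap"
  | "container.create"
  | "container.start"
  | "container.stop"
  | "container.logs"
  | "agent.exec.bench"
  | "agent.exec.mariadb"
  | "agent.exec.redis-cli";

const PATH_BY_OPERATION: Record<OperationType, ExecutionPath> = {
  "server.create": "A", // Hetzner Cloud API
  "server.delete": "A",
  "volume.create": "A",
  "firewall.create": "A",
  "host.bootstrap": "B", // cloud-init, one-shot
  "container.create": "C", // Docker Remote API
  "container.start": "C",
  "container.stop": "C",
  "container.logs": "C",
  "agent.exec.bench": "D", // Agent inbound API, closed enum
  "agent.exec.mariadb": "D",
  "agent.exec.redis-cli": "D",
};

// Returns the only path allowed to carry the operation, or throws when the
// operation is not part of the closed set (for example an arbitrary shell).
function pathFor(operation: string): ExecutionPath {
  if (!Object.prototype.hasOwnProperty.call(PATH_BY_OPERATION, operation)) {
    throw new Error(`no execution path allows "${operation}"`);
  }
  return PATH_BY_OPERATION[operation as OperationType];
}
```

Under these assumptions, `pathFor("container.start")` resolves to Path C, while a request such as `pathFor("agent.shell")` fails because no path lists it.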
2. End-to-end provisioning sequence
sequenceDiagram
autonumber
participant Op as Operator / API caller
participant CP as Cloudflare Worker (Control Plane)
participant D1 as D1 (servers, server_bootstrap_tokens)
participant Hetz as Hetzner Cloud API
participant Srv as New Server (boot)
participant Ag as Agent (on Srv)
Op->>CP: POST /internal/servers/create
CP->>D1: INSERT server row (status=provisioning)
CP->>D1: INSERT server_bootstrap_tokens (single-use, ≤15min)
CP->>CP: render cloud-init (Docker + Agent + token)
CP->>Hetz: POST /servers (user_data=base64(cloud-init))
Hetz-->>CP: server id, ipv4
CP->>D1: UPDATE server.host
Note over Srv: Hetzner boots image, runs cloud-init
Srv->>Srv: install docker; pull prego-agent
Srv->>Ag: systemctl start prego-agent (token mounted)
Ag->>CP: POST /internal/agent/handshake (token, mTLS)
CP->>D1: token.consumed_at = now()
CP->>Ag: 200 OK + long-lived JWT
Ag->>CP: POST /internal/agent/heartbeat (every 30s)
CP->>D1: server.status = active
CP-->>Op: workflow.status = completed
Total target: ≤120 s from POST /internal/servers/create to server.status = 'active'.
Per-step contracts
| Step | Input | Output | Failure mode | Idempotency |
|---|---|---|---|---|
| 1. POST /internal/servers/create | {role, region, infra_provider, spec, server_id} | {workflow_id} | 4xx if D1 row already exists with same server_id | Idempotent on server_id |
| 2. Issue bootstrap token | server_id | token_id, raw token (returned only once, hashed in D1) | D1 write fail → workflow retry | Per-attempt token; previous tokens marked superseded_at |
| 3. Render cloud-init | bootstrap_token, agent_image, mtls_certs | YAML ≤32 KB | Size overrun → fall back to 2-stage bootstrap (R2 fetch); see ADR-003 OD-B | Deterministic given same inputs |
| 4. Hetzner POST /servers | server_type, location, image, ssh_keys=[] (none — no SSH), user_data | Hetzner server id | Hetzner error → mark workflow retry; no D1 server row update | Hetzner endpoint is not idempotent — Worker enforces idempotency by checking D1 first |
| 5. Wait for cloud-init completion | server.status == 'running' (Hetzner-side) | — | 5-min timeout → mark failed, destroy server, retry | Workflow-level retry; previous server destroyed |
| 6. Agent handshake | bootstrap_token (over mTLS) | long-lived JWT for outbound push | Token already consumed → 409, server marked failed_bootstrap | Token enforces single-use |
| 7. First heartbeat | JWT (mTLS) | server.agent_status = connected | No heartbeat in 90s → workflow failed_no_heartbeat, destroy server | Heartbeat-driven; no replay needed |
| 8. Container creation (if role=app) | server_id, image | container running | Docker API error → workflow failed_container, server retained for debug | Docker API idempotent on container name |
| 9. server.status = 'active' | All above succeeded | — | — | Final transition is a single UPDATE |
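Step 4 is the only non-idempotent call in the sequence, so a retried workflow has to consult D1 before it talks to Hetzner. A minimal sketch of that decision follows, assuming the servers row records the Hetzner server id once it is known. The `providerServerId` field, the status values other than `provisioning`, and the `decideHetznerCreate` helper are hypothetical names chosen for illustration.

```typescript
// Hypothetical row shape: the source does not name the column that stores
// the Hetzner server id, so providerServerId is an assumption.
interface ServerRow {
  serverId: string;
  status: "provisioning" | "active" | "failed";
  providerServerId: string | null;
}

type CreateDecision =
  | { action: "call_hetzner" }
  | { action: "reuse"; providerServerId: string }
  | { action: "abort"; reason: string };

// Decides what a (possibly retried) step 4 should do, based only on D1 state.
function decideHetznerCreate(row: ServerRow | undefined): CreateDecision {
  if (row === undefined) {
    return { action: "abort", reason: "no D1 row; step 1 must run first" };
  }
  if (row.providerServerId !== null) {
    // A previous attempt already created the server: do not POST again.
    return { action: "reuse", providerServerId: row.providerServerId };
  }
  if (row.status !== "provisioning") {
    return { action: "abort", reason: `unexpected status ${row.status}` };
  }
  return { action: "call_hetzner" };
}
```

This check cannot cover a crash between a successful Hetzner call and the D1 update that records it; that gap is what the housekeeping reconciliation described in section 3 is for.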
3. Failure handling and rollback
| Failure point | Worker action | Human action |
|---|---|---|
| Hetzner API 5xx during create | Exponential backoff (3 attempts) | None |
| cloud-init exceeds 32 KB | Fall back to 2-stage bootstrap from R2 | None until OD-B is decided |
| Agent fails to handshake within 90 s | Hetzner.deleteServer() + token revoke + workflow failed; retry up to 2× | Inspect Hetzner serial console |
| Agent handshake succeeds but heartbeat stops | Mark degraded after 60 s; mark failed after 300 s | Page on-call |
| Docker container fails to start | Workflow failed_container; server retained for debug | SSH-less debug via Agent exec (kind=bench) |
| D1 write fails after Hetzner success | Workflow recorded as inconsistent; out-of-band reconciliation job | Manual D1 reconcile within 24 h |
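The heartbeat row in the table above reduces to a threshold check on the time since the last heartbeat. The sketch below assumes millisecond timestamps and treats the 60 s and 300 s limits as inclusive; both are assumptions, as is the `classifyHeartbeat` name. How the `failed` outcome maps onto the `agent_status` column is not specified in this document.

```typescript
type HeartbeatHealth = "connected" | "degraded" | "failed";

const DEGRADED_AFTER_MS = 60 * 1000; // "mark degraded after 60 s"
const FAILED_AFTER_MS = 300 * 1000; // "mark failed after 300 s"

// Classifies an Agent by how long it has been silent.
// Timestamps are assumed to be epoch milliseconds.
function classifyHeartbeat(lastHeartbeatAtMs: number, nowMs: number): HeartbeatHealth {
  const silenceMs = nowMs - lastHeartbeatAtMs;
  if (silenceMs >= FAILED_AFTER_MS) return "failed";
  if (silenceMs >= DEGRADED_AFTER_MS) return "degraded";
  return "connected";
}
```

With a 30 s heartbeat interval, one missed heartbeat keeps the Agent `connected`; two consecutive misses are enough to reach `degraded`.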
Rollback guarantee: every workflow step writes to D1 before calling Hetzner. If the Hetzner call succeeds but the subsequent D1 write fails, the housekeeping job (next 5-min window) reconciles by enumerating Hetzner servers and matching them against D1; orphans are destroyed.
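The reconciliation pass is essentially a set difference between what Hetzner reports and what D1 knows. A minimal sketch, assuming both sides can be reduced to Hetzner server ids; the `ProviderServer` shape and the `findOrphans` helper are hypothetical.

```typescript
// Hypothetical, reduced view of a server as listed by the provider.
interface ProviderServer {
  id: string;
  name: string;
}

// Returns the provider-side servers that no D1 row refers to.
function findOrphans(
  providerServers: ProviderServer[],
  d1ProviderIds: string[],
): ProviderServer[] {
  // Prototype-free object used as a set of known ids.
  const known: { [id: string]: true } = Object.create(null);
  d1ProviderIds.forEach((id) => {
    known[id] = true;
  });
  return providerServers.filter((server) => known[server.id] !== true);
}
```

A real job would presumably also restrict the enumeration to servers created by this control plane (for example by label) and skip servers younger than the provisioning timeout, so that an in-flight create is not mistaken for an orphan. Neither rule is stated in this document.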
4. Bootstrap token security
- Token shape: 256-bit random, returned to the Worker once at issuance, stored in D1 as SHA-256 hash only.
- TTL: 15 minutes. After TTL the Agent handshake fails with 410 Gone; Worker auto-revokes.
- Single-use: D1 column consumed_at; second use returns 409 Conflict.
- Transport: only ever inside user_data (encrypted in transit by Hetzner) and over the Agent’s mTLS handshake.
- Audit: every token issue + consume writes to agent_command_audit with command_kind='bootstrap'.
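The TTL and single-use rules above translate into a small decision function on the handshake endpoint. The sketch below assumes the token row has already been looked up by its SHA-256 hash. The 409 and 410 codes come from this section; the 401 for unknown or superseded tokens, the precedence of "consumed" over "expired", and all names are assumptions.

```typescript
// Reduced view of a server_bootstrap_tokens row (camelCase names are hypothetical).
interface BootstrapTokenRow {
  expiresAt: number; // same unit as `now`
  consumedAt: number | null;
  supersededAt: number | null;
}

type HandshakeDecision =
  | { status: 200 }
  | { status: 401 | 409 | 410; reason: string };

// Decides the HTTP outcome of a handshake for the row matching the presented token hash.
function evaluateHandshake(
  row: BootstrapTokenRow | undefined,
  now: number,
): HandshakeDecision {
  if (row === undefined || row.supersededAt !== null) {
    // Assumption: the source does not say which status these cases return.
    return { status: 401, reason: "unknown or superseded token" };
  }
  if (row.consumedAt !== null) {
    return { status: 409, reason: "token already consumed" };
  }
  if (now >= row.expiresAt) {
    return { status: 410, reason: "token expired" };
  }
  return { status: 200 };
}
```

On a 200 outcome the caller would still need to set consumed_at atomically, so that two concurrent handshakes cannot both succeed.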
5. Cloud-init template invariants
The cloud-init template is rendered by the Worker (no static file in prego-docker). Invariants:
- No SSH key authorisation — ssh_authorized_keys: []. The Agent is the only inbound path.
- Docker daemon binds 0.0.0.0:2376 with mTLS — same CA as today; CA cert deployed via write_files.
- Firewall — ufw allow 2376/tcp from <CP egress IPs>, ufw allow 443/tcp from <CP egress IPs> (Agent), default deny.
- Agent unit file — systemd Restart=always, RestartSec=5s, hardening per Agent Security Standard §8.
- Bootstrap token — written to /etc/prego-agent/bootstrap.token mode 0600 owner root. Agent reads, then unlinks.
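To make the invariants concrete, here is a heavily abbreviated sketch of a Worker-side renderer together with the 32 KB size check from step 3. It is not the actual template: Docker installation, mTLS certificate deployment, and the systemd unit are omitted, and `CloudInitInputs`, `renderCloudInit`, and `utf8ByteLength` are hypothetical names. The firewall lines use the `ufw allow from <ip> to any port <port> proto tcp` form.

```typescript
// Hypothetical inputs; the real renderer also takes mTLS material.
interface CloudInitInputs {
  bootstrapToken: string;
  agentImage: string;
  controlPlaneEgressIps: string[];
}

const MAX_USER_DATA_BYTES = 32 * 1024;

// Counts UTF-8 bytes without relying on platform encoders.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

function renderCloudInit(inputs: CloudInitInputs): string {
  const firewall: string[] = ["ufw default deny incoming"];
  inputs.controlPlaneEgressIps.forEach((ip) => {
    firewall.push(`ufw allow from ${ip} to any port 2376 proto tcp`); // Docker API
    firewall.push(`ufw allow from ${ip} to any port 443 proto tcp`); // Agent
  });
  firewall.push("ufw --force enable");

  const lines: string[] = [
    "#cloud-config",
    "ssh_authorized_keys: []", // invariant: no SSH access
    "write_files:",
    "  - path: /etc/prego-agent/bootstrap.token",
    '    permissions: "0600"',
    "    owner: root:root",
    // JSON string syntax is also a valid YAML double-quoted scalar.
    `    content: ${JSON.stringify(inputs.bootstrapToken)}`,
    "runcmd:",
  ];
  firewall.forEach((command) => lines.push(`  - ${command}`));
  lines.push(`  - docker pull ${inputs.agentImage}`);
  lines.push("  - systemctl enable --now prego-agent");
  return lines.join("\n") + "\n";
}

// True when the rendered document fits the single-stage user_data budget.
function fitsUserData(rendered: string): boolean {
  return utf8ByteLength(rendered) <= MAX_USER_DATA_BYTES;
}
```

Because the output depends only on its inputs, the renderer stays deterministic as required by the per-step contract; when `fitsUserData` returns false the workflow would take the 2-stage bootstrap route instead.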
6. D1 schema (additions on top of ADR-002 set)
CREATE TABLE server_bootstrap_tokens (
  token_id      TEXT PRIMARY KEY,
  server_id     TEXT NOT NULL,
  token_hash    TEXT NOT NULL,  -- sha256(token)
  purpose       TEXT NOT NULL CHECK (purpose IN ('agent_bootstrap','import_existing_server')),
  issued_at     INTEGER NOT NULL,
  expires_at    INTEGER NOT NULL,
  consumed_at   INTEGER,
  superseded_at INTEGER,
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_bootstrap_token_server ON server_bootstrap_tokens(server_id);

CREATE TABLE agent_command_audit (
  audit_id         TEXT PRIMARY KEY,
  server_id        TEXT NOT NULL,
  workflow_id      TEXT,
  command_kind     TEXT NOT NULL CHECK (command_kind IN ('bench','mariadb','redis-cli','bootstrap')),
  args_redacted    TEXT NOT NULL,  -- joined args with secrets masked
  actor            TEXT NOT NULL CHECK (actor IN ('workflow','operator','system')),
  requested_at     INTEGER NOT NULL,
  started_at       INTEGER,
  finished_at      INTEGER,
  exit_code        INTEGER,
  duration_ms      INTEGER,
  stdout_truncated TEXT,  -- last 8 KB
  stderr_truncated TEXT,  -- last 8 KB
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_audit_server ON agent_command_audit(server_id);
CREATE INDEX idx_audit_workflow ON agent_command_audit(workflow_id);

The servers table (defined under ADR-002) gains:

- managed_by — worker_direct | pulumi (Pulumi value used only for in-flight cutover)
- agent_status — unknown | connected | degraded | disconnected
- agent_first_seen_at, agent_last_heartbeat_at
- imported_from_pulumi_at (nullable; non-null only for Option B imports)
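One way the single-use guarantee could be enforced against this schema is a conditional UPDATE, so that the check and the write happen in one statement. The sketch below only builds the statement and its positional parameters; the `PreparedQuery` shape and the `consumeTokenQuery` helper are hypothetical, and the choice to key on token_hash is an assumption.

```typescript
// Hypothetical holder for a statement plus its positional (?) parameters.
interface PreparedQuery {
  sql: string;
  params: Array<string | number>;
}

// Marks a token as consumed only if it is still unused, not superseded,
// and not expired. `now` must use the same unit as expires_at.
function consumeTokenQuery(tokenHash: string, now: number): PreparedQuery {
  return {
    sql:
      "UPDATE server_bootstrap_tokens " +
      "SET consumed_at = ? " +
      "WHERE token_hash = ? " +
      "AND consumed_at IS NULL " +
      "AND superseded_at IS NULL " +
      "AND expires_at > ?",
    params: [now, tokenHash, now],
  };
}
```

The caller would then inspect the changed-row count reported by the database driver: one row means the handshake may proceed, zero rows means the token was unknown, already consumed, superseded, or expired.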
7. Worker-side TypeScript contracts
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
}

// src/clients/agent/types.ts
export type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

export interface AgentExecRequest {
  workflowId: string;
  serverId: string;
  kind: AgentCommandKind;
  args: string[];
  timeoutSeconds?: number; // default 60
  workingDir?: string;     // default /home/frappe/frappe-bench for bench
}

export interface AgentExecResult {
  exitCode: number;
  stdoutTruncated: string; // last 8 KB
  stderrTruncated: string; // last 8 KB
  durationMs: number;
  auditId: string;
}

8. Agent inbound API contract
| Endpoint | Method | Direction | Auth | Purpose |
|---|---|---|---|---|
| /agent/v1/exec | POST | Worker → Agent | mTLS + JWT scope=exec.<kind> | Synchronous scope-limited CLI execution |
| /agent/v1/exec/stream | POST (SSE) | Worker → Agent | same | Long-running command with progress stream |
| /internal/agent/handshake | POST | Agent → Worker | mTLS + bootstrap token | One-shot at first boot, exchanges token for long-lived JWT |
| /internal/agent/heartbeat | POST | Agent → Worker | mTLS + JWT scope=heartbeat | 30 s liveness |
| /internal/agent/metrics | POST | Agent → Worker | mTLS + JWT scope=metrics | OTLP metrics push |
| /internal/agent/logs | POST | Agent → Worker | mTLS + JWT scope=logs | Log shipping |
| /internal/agent/backup-report | POST | Agent → Worker | mTLS + JWT scope=backup_report | bench backup completion notification |
| /internal/agent/rotate-token | POST | Agent → Worker | mTLS + current JWT | Periodic rotation |
Forbidden (must not exist):
- POST /agent/v1/shell (arbitrary shell)
- POST /agent/v1/exec without kind enum
- POST /agent/v1/docker/run or anything bypassing Path C
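On the Agent side, the closed enum and the per-kind JWT scope can be enforced before any process is spawned. The following is a sketch under assumptions: the three kinds and the `exec.<kind>` scope format come from this document, and the 60-second default mirrors AgentExecRequest, but the 400 and 403 status codes and the `authorizeExec` helper are hypothetical.

```typescript
// Mirrors the AgentCommandKind union from the Worker-side contracts.
const AGENT_COMMAND_KINDS = ["bench", "mariadb", "redis-cli"] as const;
type AgentCommandKind = (typeof AGENT_COMMAND_KINDS)[number];

function isAgentCommandKind(value: unknown): value is AgentCommandKind {
  return (
    typeof value === "string" &&
    (AGENT_COMMAND_KINDS as readonly string[]).indexOf(value) !== -1
  );
}

type ExecAuthorization =
  | { ok: true; kind: AgentCommandKind; timeoutSeconds: number }
  | { ok: false; status: 400 | 403; error: string };

// Validates the request kind against the closed enum, then checks that the
// JWT carries the matching exec.<kind> scope. Status codes are assumptions.
function authorizeExec(
  body: { kind?: unknown; timeoutSeconds?: number },
  jwtScopes: string[],
): ExecAuthorization {
  const kind = body.kind;
  if (!isAgentCommandKind(kind)) {
    return {
      ok: false,
      status: 400,
      error: "kind must be one of " + AGENT_COMMAND_KINDS.join(", "),
    };
  }
  if (jwtScopes.indexOf(`exec.${kind}`) === -1) {
    return { ok: false, status: 403, error: `missing scope exec.${kind}` };
  }
  const timeoutSeconds = body.timeoutSeconds === undefined ? 60 : body.timeoutSeconds;
  return { ok: true, kind, timeoutSeconds };
}
```

A request with `kind: "sh"` is rejected regardless of its scopes, and a token scoped only to `exec.bench` cannot run `mariadb`, which keeps each issued JWT limited to a single command family.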
9. What changes in each repo
| Repo | Status | Concrete change |
|---|---|---|
| prego-control-plane | Active, expanded | New D1 tables, Worker-side InfraProvider, Agent inbound exec-client.ts, provisioning-workflow.ts rewrite |
| prego-pulumi | Deprecated | No new resources. Existing stacks readable for cutover. See migration runbook. |
| prego-ansible | Deprecated | No new playbooks. See migration runbook. |
| prego-docker | Active, scope reduced | Frappe images only. Agent image is published from prego-agent (see ADR-003 OD-C) and pulled by cloud-init. |
| prego_saas | Unchanged | No DocType changes. |
10. Coexistence with the rest of the standard
ADR-003 is intentionally narrow:
- DB sharding, Redis pooling, Site Pool, 8-state Tenant Lifecycle, Plan Isolation Matrix, Backup/DR (Standard §3, §5, §6, §10, §13) — unchanged, but their operations now flow through Path C (Docker API) and Path D (Agent exec) instead of Ansible.
- Push Agent telemetry (Agent Security Standard) — unchanged direction, now coexists with Path D inbound.
- InfraProvider interface (Standard §14) — unchanged at type level; implementations move into the Worker (no Pulumi stacks).
11. Forbidden patterns (enforcement)
The CI must fail PRs that introduce:
- pulumi invocation from prego-control-plane/src/
- ansible-playbook invocation from prego-control-plane/src/
- ssh invocation from prego-control-plane/src/ (excluding doc strings clearly marked as historical)
- New cross-repo dependency on prego-pulumi or prego-ansible
- New /agent/v1/* endpoint without an explicit kind enum
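The first three rules lend themselves to a simple line-based scan in CI. The sketch below is a deliberately naive illustration, not the project's actual check: it matches whole words per line, and the `HISTORICAL:` marker used to exempt doc strings is a hypothetical convention, since this document does not define how historical text is marked.

```typescript
interface Finding {
  line: number; // 1-based line number within the scanned source
  rule: string;
}

// Whole-word matches only, so identifiers such as ssh_keys are not flagged.
const FORBIDDEN_INVOCATIONS: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "pulumi", pattern: /\bpulumi\b/ },
  { rule: "ansible-playbook", pattern: /\bansible-playbook\b/ },
  { rule: "ssh", pattern: /\bssh\b/ },
];

// Hypothetical marker for lines that are exempt as historical documentation.
const HISTORICAL_MARKER = "HISTORICAL:";

// Scans one source file's text and reports every forbidden invocation.
function scanSource(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (text.indexOf(HISTORICAL_MARKER) !== -1) return;
    FORBIDDEN_INVOCATIONS.forEach(({ rule, pattern }) => {
      if (pattern.test(text)) findings.push({ line: index + 1, rule });
    });
  });
  return findings;
}
```

A CI step would run this over every file under prego-control-plane/src/ and fail the PR when the result is non-empty. The dependency and endpoint rules need different inputs (package manifests and route definitions) and are not covered here.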