
Control Plane Direct Provisioning

Status: Accepted (canonical for the IaC + provisioning control path)
Date: 2026-04-24
Supersedes: Frappe SaaS Multitenant Docker Standard §3.2 (prego-pulumi), §3.3 (prego-ansible), §3.4 (prego-docker Agent compose snippet)
Companion to: Agent Security Standard (Agent role definition)
Implementation ADR: ADR-003 (prego-control-plane)

This document is the canonical source for how Prego provisions and manages tenant infrastructure. The Cloudflare Worker Control Plane is the only control plane. There is no Pulumi, no Ansible, no SSH-based configuration management.

1. The four execution paths

[Cloudflare Worker Control Plane]
↓ (D1-backed Workflow)
[Provisioning Worker]
├─ Path A : Hetzner Cloud API — server / volume / firewall create+delete
├─ Path B : cloud-init — Docker + Agent bootstrap (one-shot, generated by Worker)
├─ Path C : Docker Remote API — container lifecycle (mTLS port 2376)
└─ Path D : Agent inbound API — bench / mariadb / redis-cli scope-limited execution (mTLS + JWT)

| Path | Direction | Auth | What it MUST do | What it MUST NOT do |
| --- | --- | --- | --- | --- |
| A. Hetzner Cloud API | Worker → Hetzner | API token (Worker Secret) | Server / volume / firewall / placement-group lifecycle | Anything tenant-internal |
| B. cloud-init | Worker → new server (one-shot, via Hetzner user_data) | Bootstrap token (≤15 min TTL, single-use) | Install Docker, deploy mTLS certs, install + start Agent, seed bootstrap token | Receive runtime commands; long-lived |
| C. Docker Remote API | Worker → host:2376 | mTLS (CA-issued client cert) | Container create / start / stop / exec / pull / inspect / logs | bench / mariadb / app-level commands |
| D. Agent inbound API | Worker → host:443 | mTLS + 5-min JWT, scope=exec.<kind> | bench, mariadb, redis-cli (closed enum) | Arbitrary shell, docker run, direct docker exec |
| (E. Agent outbound, existing) | Agent → Worker | mTLS + JWT (heartbeat/metrics/logs/backup_report) | Telemetry, health, backup-report | Receive commands |

Paths A+B replace prego-pulumi. Path D replaces prego-ansible. Path C is preserved verbatim from the prior imperative-control-layer design.
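
The split between Path C and Path D can be enforced mechanically in the Worker. The following is a minimal sketch, assuming a hypothetical Operation union and routeOperation helper; neither name is defined by this standard, and only the ports, auth modes, and command kinds come from the table above.

```typescript
// Hypothetical operation model; type and function names are illustrative only.
type DockerOp = "create" | "start" | "stop" | "exec" | "pull" | "inspect" | "logs";
type AgentKind = "bench" | "mariadb" | "redis-cli";

type Operation =
  | { target: "docker"; op: DockerOp }
  | { target: "agent"; kind: AgentKind; args: string[] };

interface Route {
  path: "C" | "D";
  port: 2376 | 443;
  auth: string;
}

// Map an operation to the only path that is allowed to carry it.
function routeOperation(operation: Operation): Route {
  switch (operation.target) {
    case "docker":
      // Path C: container lifecycle only, mTLS to the Docker Remote API.
      return { path: "C", port: 2376, auth: "mTLS" };
    case "agent":
      // Path D: closed-enum CLI execution, mTLS plus a short-lived scoped JWT.
      return { path: "D", port: 443, auth: `mTLS + JWT scope=exec.${operation.kind}` };
  }
}
```

Because the union has no variant for a shell command or a docker run request, such operations cannot be expressed at the type level, which mirrors the MUST NOT column.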

2. End-to-end provisioning sequence

sequenceDiagram
    autonumber
    participant Op as Operator / API caller
    participant CP as Cloudflare Worker (Control Plane)
    participant D1 as D1 (servers, server_bootstrap_tokens)
    participant Hetz as Hetzner Cloud API
    participant Srv as New Server (boot)
    participant Ag as Agent (on Srv)

    Op->>CP: POST /internal/servers/create
    CP->>D1: INSERT server row (status=provisioning)
    CP->>D1: INSERT server_bootstrap_tokens (single-use, ≤15min)
    CP->>CP: render cloud-init (Docker + Agent + token)
    CP->>Hetz: POST /servers (user_data=base64(cloud-init))
    Hetz-->>CP: server id, ipv4
    CP->>D1: UPDATE server.host
    Note over Srv: Hetzner boots image, runs cloud-init
    Srv->>Srv: install docker; pull prego-agent
    Srv->>Ag: systemctl start prego-agent (token mounted)
    Ag->>CP: POST /internal/agent/handshake (token, mTLS)
    CP->>D1: token.consumed_at = now()
    CP->>Ag: 200 OK + long-lived JWT
    Ag->>CP: POST /internal/agent/heartbeat (every 30s)
    CP->>D1: server.status = active
    CP-->>Op: workflow.status = completed

Total target: ≤120 s from POST /internal/servers/create to server.status = 'active'.

Per-step contracts

| Step | Input | Output | Failure mode | Idempotency |
| --- | --- | --- | --- | --- |
| 1. POST /internal/servers/create | {role, region, infra_provider, spec, server_id} | {workflow_id} | 4xx if D1 row already exists with same server_id | Idempotent on server_id |
| 2. Issue bootstrap token | server_id | token_id, raw token (returned only once, hashed in D1) | D1 write fail → workflow retry | Per-attempt token; previous tokens marked superseded_at |
| 3. Render cloud-init | bootstrap_token, agent_image, mtls_certs | YAML ≤32 KB | Size overrun → fall back to 2-stage bootstrap (R2 fetch); see ADR-003 OD-B | Deterministic given same inputs |
| 4. Hetzner POST /servers | server_type, location, image, ssh_keys=[] (none, no SSH), user_data | Hetzner server id | Hetzner error → mark workflow retry; no D1 server row update | Hetzner endpoint is not idempotent; Worker enforces idempotency by checking D1 first |
| 5. Wait for cloud-init completion | | server.status == 'running' (Hetzner-side) | 5-min timeout → mark failed, destroy server, retry | Workflow-level retry; previous server destroyed |
| 6. Agent handshake | bootstrap_token (over mTLS) | long-lived JWT for outbound push | Token already consumed → 409, server marked failed_bootstrap | Token enforces single-use |
| 7. First heartbeat | JWT (mTLS) | server.agent_status = connected | No heartbeat in 90 s → workflow failed_no_heartbeat, destroy server | Heartbeat-driven; no replay needed |
| 8. Container creation (if role=app) | server_id, image | container running | Docker API error → workflow failed_container, server retained for debug | Docker API idempotent on container name |
| 9. server.status = 'active' | All above succeeded | | | Final transition is a single UPDATE |
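
Steps 1 and 4 rely on the Worker checking D1 before it touches the non-idempotent Hetzner endpoint. A minimal sketch of that ordering follows, using an in-memory Map as a stand-in for D1 and a fake provider; the names createServerOnce and FakeProvider are hypothetical, and 409 is an assumed choice for the "4xx" the table mentions.

```typescript
// In-memory stand-ins for D1 and the Hetzner client; both are hypothetical.
interface ServerRow {
  serverId: string;
  status: "provisioning" | "active";
  providerId?: string;
}

class FakeProvider {
  calls = 0;
  createServer(): string {
    this.calls += 1;
    return `hz-${this.calls}`;
  }
}

type CreateResult = { ok: true; providerId: string } | { ok: false; status: 409 };

function createServerOnce(
  db: Map<string, ServerRow>,
  provider: FakeProvider,
  serverId: string,
): CreateResult {
  // Step 1: reject a duplicate server_id before any external call is made.
  if (db.has(serverId)) return { ok: false, status: 409 };
  // Write the row first so a later crash leaves a record to reconcile against.
  const row: ServerRow = { serverId, status: "provisioning" };
  db.set(serverId, row);
  // Step 4: the provider call runs at most once per accepted server_id.
  const providerId = provider.createServer();
  row.providerId = providerId;
  return { ok: true, providerId };
}
```

Calling createServerOnce twice with the same server_id yields one provider call and one 409, which is the behaviour the Idempotency column describes.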

3. Failure handling and rollback

| Failure point | Worker action | Human action |
| --- | --- | --- |
| Hetzner API 5xx during create | Exponential backoff (3 attempts) | None |
| cloud-init exceeds 32 KB | Fall back to 2-stage bootstrap from R2 | None until OD-B is decided |
| Agent fails to handshake within 90 s | Hetzner.deleteServer() + token revoke + workflow failed; retry up to 2× | Inspect Hetzner serial console |
| Agent handshake succeeds but heartbeat stops | Mark degraded after 60 s; mark failed after 300 s | Page on-call |
| Docker container fails to start | Workflow failed_container; server retained for debug | SSH-less debug via Agent exec (kind=bench) |
| D1 write fails after Hetzner success | Workflow recorded as inconsistent; out-of-band reconciliation job | Manual D1 reconcile within 24 h |
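
The heartbeat thresholds and the retry budget in the table reduce to two small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, the 1-second base delay is not specified by this standard, and treating the exact 60 s and 300 s boundaries as already degraded or failed is an arbitrary choice.

```typescript
type AgentHealth = "connected" | "degraded" | "failed";

// Thresholds from the failure table: degraded after 60 s, failed after 300 s.
const DEGRADED_AFTER_MS = 60 * 1000;
const FAILED_AFTER_MS = 300 * 1000;

// Classify an Agent by how long it has been silent.
function classifyHeartbeat(lastHeartbeatAtMs: number, nowMs: number): AgentHealth {
  const silenceMs = nowMs - lastHeartbeatAtMs;
  if (silenceMs >= FAILED_AFTER_MS) return "failed";
  if (silenceMs >= DEGRADED_AFTER_MS) return "degraded";
  return "connected";
}

// Delays between attempts for exponential backoff; N attempts need N-1 waits.
// The base delay is an assumption for illustration.
function backoffDelaysMs(attempts: number, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}
```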

Rollback guarantee: every workflow step writes to D1 before calling Hetzner. If the Hetzner call succeeds but the follow-up D1 write fails, the housekeeping job (next 5-min window) reconciles by enumerating Hetzner servers and matching them against D1; orphans are destroyed.
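
The matching step of the housekeeping job is a set difference. A minimal sketch, assuming a hypothetical ProviderServer shape and a findOrphans helper; the real job's inputs are not specified here.

```typescript
// Hypothetical shape of a server as listed by the provider.
interface ProviderServer {
  id: string;
  name: string;
}

// Return provider-side servers with no matching D1 row (candidates for destruction).
function findOrphans(
  providerServers: ProviderServer[],
  d1ProviderIds: string[],
): ProviderServer[] {
  const known = new Set<string>(d1ProviderIds);
  return providerServers.filter((server) => !known.has(server.id));
}
```

A production job would presumably restrict the enumeration to servers that Prego created (for example by label or project), so that unrelated machines in the same account are never treated as orphans; that scoping is an assumption, not something this standard states.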

4. Bootstrap token security

  • Token shape: 256-bit random, returned to the Worker once at issuance, stored in D1 as SHA-256 hash only.
  • TTL: 15 minutes. After TTL the Agent handshake fails with 410 Gone; Worker auto-revokes.
  • Single-use: D1 column consumed_at; second use returns 409 Conflict.
  • Transport: only ever inside user_data (encrypted in transit by Hetzner) and over the Agent’s mTLS handshake.
  • Audit: every token issue + consume writes to agent_command_audit with kind='bootstrap'.
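
The token rules above (hash-only storage, 15-minute TTL, single use) can be sketched as follows. An in-memory Map stands in for D1, timestamps are assumed to be epoch milliseconds, and Node's crypto module is used for brevity; a Worker might use Web Crypto instead. The function names and the 401 for unknown tokens are assumptions.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface TokenRow {
  tokenHash: string; // sha256(token), hex
  expiresAt: number; // epoch ms (unit is an assumption)
  consumedAt: number | null;
}

const TTL_MS = 15 * 60 * 1000;

function sha256Hex(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

// Issue: the raw token is returned once; only its hash is stored.
function issueToken(store: Map<string, TokenRow>, now: number): string {
  const raw = randomBytes(32).toString("hex"); // 256 bits of randomness
  const tokenHash = sha256Hex(raw);
  store.set(tokenHash, { tokenHash, expiresAt: now + TTL_MS, consumedAt: null });
  return raw;
}

// Consume: 200 on first valid use, 409 on reuse, 410 after the TTL.
function consumeToken(
  store: Map<string, TokenRow>,
  raw: string,
  now: number,
): 200 | 401 | 409 | 410 {
  const row = store.get(sha256Hex(raw));
  if (!row) return 401; // assumption: unknown tokens are simply unauthorized
  if (row.consumedAt !== null) return 409;
  if (now > row.expiresAt) return 410;
  row.consumedAt = now;
  return 200;
}
```

Against a real database the consume step would need to be a single conditional UPDATE (for example, one guarded by consumed_at IS NULL) so that two concurrent handshakes cannot both succeed; the read-then-write shown here is only safe in this single-threaded sketch.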

5. Cloud-init template invariants

The cloud-init template is rendered by the Worker (no static file in prego-docker). Invariants:

  1. No SSH key authorisation (ssh_authorized_keys: []). The Agent is the only inbound path.
  2. Docker daemon binds 0.0.0.0:2376 with mTLS, using the same CA as today; the CA cert is deployed via write_files.
  3. Firewall: ufw allow 2376/tcp from <CP egress IPs>, ufw allow 443/tcp from <CP egress IPs> (Agent), default deny.
  4. Agent unit file: systemd Restart=always, RestartSec=5s, hardening per Agent Security Standard §8.
  5. Bootstrap token: written to /etc/prego-agent/bootstrap.token, mode 0600, owner root. The Agent reads it, then unlinks it.
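
A renderer that honours these invariants and the 32 KB limit might look like the sketch below. It is illustrative only: the CloudInitInput fields, the CA certificate path, and the agent image handling are assumptions, and Docker installation, daemon configuration, and the systemd unit file are omitted for brevity.

```typescript
// Hypothetical inputs; field names and the CA file path are illustrative.
interface CloudInitInput {
  bootstrapToken: string;
  agentImage: string;
  caCertPem: string;
  cpEgressCidrs: string[];
}

const MAX_USER_DATA_BYTES = 32 * 1024;

// Indent every line of a value so it can sit inside a YAML block scalar.
function indent(text: string, spaces: number): string {
  const pad = " ".repeat(spaces);
  return text.split("\n").map((line) => pad + line).join("\n");
}

// Deterministic: the same input always renders the same document.
function renderCloudInit(input: CloudInitInput): string {
  const lines: string[] = [
    "#cloud-config",
    "ssh_authorized_keys: []",
    "write_files:",
    "  - path: /etc/prego-agent/bootstrap.token",
    "    owner: root:root",
    "    permissions: '0600'",
    "    content: |",
    indent(input.bootstrapToken, 6),
    "  - path: /etc/docker/certs/ca.pem",
    "    content: |",
    indent(input.caCertPem, 6),
    "runcmd:",
    "  - ufw default deny incoming",
  ];
  for (const cidr of input.cpEgressCidrs) {
    lines.push(`  - ufw allow from ${cidr} to any port 2376 proto tcp`);
    lines.push(`  - ufw allow from ${cidr} to any port 443 proto tcp`);
  }
  lines.push("  - ufw --force enable");
  lines.push(`  - docker pull ${input.agentImage}`);
  lines.push("  - systemctl enable --now prego-agent");
  return lines.join("\n") + "\n";
}

// Size gate for the user_data payload, measured in UTF-8 bytes.
function fitsUserData(yaml: string): boolean {
  return new TextEncoder().encode(yaml).length <= MAX_USER_DATA_BYTES;
}
```

When fitsUserData returns false, the Worker would take the 2-stage bootstrap route described in the failure table rather than truncating the document.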

6. D1 schema (additions on top of ADR-002 set)

CREATE TABLE server_bootstrap_tokens (
  token_id      TEXT PRIMARY KEY,
  server_id     TEXT NOT NULL,
  token_hash    TEXT NOT NULL, -- sha256(token)
  purpose       TEXT NOT NULL CHECK (purpose IN ('agent_bootstrap','import_existing_server')),
  issued_at     INTEGER NOT NULL,
  expires_at    INTEGER NOT NULL,
  consumed_at   INTEGER,
  superseded_at INTEGER,
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_bootstrap_token_server ON server_bootstrap_tokens(server_id);

CREATE TABLE agent_command_audit (
  audit_id         TEXT PRIMARY KEY,
  server_id        TEXT NOT NULL,
  workflow_id      TEXT,
  command_kind     TEXT NOT NULL CHECK (command_kind IN ('bench','mariadb','redis-cli','bootstrap')),
  args_redacted    TEXT NOT NULL, -- joined args with secrets masked
  actor            TEXT NOT NULL CHECK (actor IN ('workflow','operator','system')),
  requested_at     INTEGER NOT NULL,
  started_at       INTEGER,
  finished_at      INTEGER,
  exit_code        INTEGER,
  duration_ms      INTEGER,
  stdout_truncated TEXT, -- last 8 KB
  stderr_truncated TEXT, -- last 8 KB
  FOREIGN KEY (server_id) REFERENCES servers(server_id)
);
CREATE INDEX idx_audit_server ON agent_command_audit(server_id);
CREATE INDEX idx_audit_workflow ON agent_command_audit(workflow_id);
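
The args_redacted column stores "joined args with secrets masked". One way to produce that value is sketched below; the redactArgs name and the list of secret-bearing flags are assumptions for illustration, and a real list would need to cover every flag the three command kinds accept.

```typescript
// Flags whose values are treated as secrets; this list is illustrative only.
const SECRET_FLAGS = new Set<string>(["--password", "--db-root-password", "--admin-password"]);

// Join args into one string, masking values passed as "--flag value" or "--flag=value".
function redactArgs(args: string[]): string {
  const out: string[] = [];
  let maskNext = false;
  for (const arg of args) {
    if (maskNext) {
      out.push("***");
      maskNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq);
    if (!SECRET_FLAGS.has(flag)) {
      out.push(arg);
    } else if (eq === -1) {
      out.push(arg);
      maskNext = true; // the secret is the next positional token
    } else {
      out.push(`${flag}=***`);
    }
  }
  return out.join(" ");
}
```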

The servers table (defined under ADR-002) gains:

  • managed_by: worker_direct | pulumi (the pulumi value is used only for in-flight cutover)
  • agent_status: unknown | connected | degraded | disconnected
  • agent_first_seen_at, agent_last_heartbeat_at
  • imported_from_pulumi_at (nullable; non-null only for Option B imports)

7. Worker-side TypeScript contracts

// src/clients/infra-provider/types.ts
export interface InfraProvider {
  readonly id: "hetzner" | "aws" | "gcp";
  createServer(spec: ServerSpec): Promise<ServerHandle>;
  destroyServer(id: string): Promise<void>;
  attachVolume?(id: string, gb: number): Promise<void>;
}

// src/clients/agent/types.ts
export type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

export interface AgentExecRequest {
  workflowId: string;
  serverId: string;
  kind: AgentCommandKind;
  args: string[];
  timeoutSeconds?: number; // default 60
  workingDir?: string; // default /home/frappe/frappe-bench for bench
}

export interface AgentExecResult {
  exitCode: number;
  stdoutTruncated: string; // last 8 KB
  stderrTruncated: string; // last 8 KB
  durationMs: number;
  auditId: string;
}
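
The defaults noted in the comments above and the closed AgentCommandKind enum are only types until something checks them at runtime. A minimal sketch of that check follows; isAgentCommandKind and withDefaults are hypothetical helper names, and the types are repeated so the block stands alone.

```typescript
type AgentCommandKind = "bench" | "mariadb" | "redis-cli";

interface AgentExecRequest {
  workflowId: string;
  serverId: string;
  kind: AgentCommandKind;
  args: string[];
  timeoutSeconds?: number;
  workingDir?: string;
}

const AGENT_COMMAND_KINDS: readonly string[] = ["bench", "mariadb", "redis-cli"];

// Runtime guard for untrusted input such as a parsed JSON body.
function isAgentCommandKind(value: string): value is AgentCommandKind {
  return AGENT_COMMAND_KINDS.indexOf(value) !== -1;
}

// Apply the documented defaults: 60 s timeout, bench working directory for bench only.
function withDefaults(req: AgentExecRequest): AgentExecRequest {
  const out: AgentExecRequest = { ...req, timeoutSeconds: req.timeoutSeconds ?? 60 };
  if (out.workingDir === undefined && req.kind === "bench") {
    out.workingDir = "/home/frappe/frappe-bench";
  }
  return out;
}
```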

8. Agent inbound API contract

| Endpoint | Method | Direction | Auth | Purpose |
| --- | --- | --- | --- | --- |
| /agent/v1/exec | POST | Worker → Agent | mTLS + JWT scope=exec.<kind> | Synchronous scope-limited CLI execution |
| /agent/v1/exec/stream | POST (SSE) | Worker → Agent | same | Long-running command with progress stream |
| /internal/agent/handshake | POST | Agent → Worker | mTLS + bootstrap token | One-shot at first boot, exchanges token for long-lived JWT |
| /internal/agent/heartbeat | POST | Agent → Worker | mTLS + JWT scope=heartbeat | 30 s liveness |
| /internal/agent/metrics | POST | Agent → Worker | mTLS + JWT scope=metrics | OTLP metrics push |
| /internal/agent/logs | POST | Agent → Worker | mTLS + JWT scope=logs | Log shipping |
| /internal/agent/backup-report | POST | Agent → Worker | mTLS + JWT scope=backup_report | bench backup completion notification |
| /internal/agent/rotate-token | POST | Agent → Worker | mTLS + current JWT | Periodic rotation |

Forbidden (must not exist):

  • POST /agent/v1/shell (arbitrary shell)
  • POST /agent/v1/exec without kind enum
  • POST /agent/v1/docker/run or anything bypassing Path C
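
On the Agent side, the scope=exec.<kind> rule and the ban on requests without a kind amount to one authorization check. The sketch below assumes the JWT signature and the mTLS client certificate have already been verified upstream; authorizeExec, its input shape, and the specific 400/401/403 status codes are assumptions.

```typescript
const EXEC_KINDS: readonly string[] = ["bench", "mariadb", "redis-cli"];
type ExecKind = "bench" | "mariadb" | "redis-cli";

interface ExecAuthInput {
  jwtScope: string; // for example "exec.bench"
  jwtExpiresAt: number; // epoch seconds
  body: { kind?: unknown };
}

type ExecAuthResult =
  | { allowed: true; kind: ExecKind }
  | { allowed: false; status: 400 | 401 | 403 };

function authorizeExec(input: ExecAuthInput, nowSeconds: number): ExecAuthResult {
  // Expired short-lived JWT.
  if (input.jwtExpiresAt <= nowSeconds) return { allowed: false, status: 401 };
  const kind = input.body.kind;
  // A missing kind or one outside the closed enum is rejected outright.
  if (typeof kind !== "string" || EXEC_KINDS.indexOf(kind) === -1) {
    return { allowed: false, status: 400 };
  }
  // The token must be scoped to exactly this command kind.
  if (input.jwtScope !== `exec.${kind}`) return { allowed: false, status: 403 };
  return { allowed: true, kind: kind as ExecKind };
}
```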

9. What changes in each repo

| Repo | Status | Concrete change |
| --- | --- | --- |
| prego-control-plane | Active, expanded | New D1 tables, Worker-side InfraProvider, Agent inbound exec-client.ts, provisioning-workflow.ts rewrite |
| prego-pulumi | Deprecated | No new resources. Existing stacks readable for cutover. See migration runbook. |
| prego-ansible | Deprecated | No new playbooks. See migration runbook. |
| prego-docker | Active, scope reduced | Frappe images only. Agent image is published from prego-agent (see ADR-003 OD-C) and pulled by cloud-init. |
| prego_saas | Unchanged | No DocType changes. |

10. Coexistence with the rest of the standard

ADR-003 is intentionally narrow:

  • DB sharding, Redis pooling, Site Pool, 8-state Tenant Lifecycle, Plan Isolation Matrix, Backup/DR (Standard §3, §5, §6, §10, §13) — unchanged, but their operations now flow through Path C (Docker API) and Path D (Agent exec) instead of Ansible.
  • Push Agent telemetry (Agent Security Standard) — unchanged direction, now coexists with Path D inbound.
  • InfraProvider interface (Standard §14) — unchanged at type level; implementations move into the Worker (no Pulumi stacks).

11. Forbidden patterns (enforcement)

The CI must fail PRs that introduce:

  • pulumi invocation from prego-control-plane/src/
  • ansible-playbook invocation from prego-control-plane/src/
  • ssh invocation from prego-control-plane/src/ (excluding doc strings clearly marked as historical)
  • New cross-repo dependency on prego-pulumi or prego-ansible
  • New /agent/v1/* endpoint without an explicit kind enum
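
The first three rules could be approximated by a line scanner run over prego-control-plane/src/ in CI. The sketch below is a heuristic, not the actual check: it flags string literals that begin with a forbidden tool name (as they would in a spawn or exec call), and the "historical:" marker used to exempt ssh mentions is an assumed convention.

```typescript
interface Violation {
  line: number;
  tool: string;
  text: string;
}

// Matches a string literal that starts with a forbidden tool name.
const FORBIDDEN = /["'`](pulumi|ansible-playbook|ssh)(?=[\s"'`])/;

// Marker that exempts an ssh mention; the exact marker text is an assumption.
const HISTORICAL_MARKER = "historical:";

function findForbidden(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    const match = FORBIDDEN.exec(text);
    if (!match) return;
    const tool = match[1] ?? "";
    // Only ssh has a documented exemption for clearly marked historical text.
    if (tool === "ssh" && text.indexOf(HISTORICAL_MARKER) !== -1) return;
    violations.push({ line: index + 1, tool, text: text.trim() });
  });
  return violations;
}
```

A CI step would fail the PR whenever findForbidden returns a non-empty list for any changed file. The cross-repo dependency and endpoint rules need different inputs (package manifests and route definitions) and are not covered by this sketch.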

12. References
