Agent Security Standard

Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002) and Control Plane Direct Provisioning (ADR-003)
Boundary with: Imperative Control Layer. Path 1 (Docker API) and Path 3 (Agent inbound exec) are distinct and mutually exclusive in scope.

1. Why an Agent at all

The original Imperative Control Layer design used no host agent for runtime commands: Cloudflare Workers talked directly to the Docker Remote API (mTLS, port 2376). ADR-003 (2026-04-24) updates that boundary: the Docker Remote API stays for container lifecycle, but a hardened Agent now also receives a closed enum of CLI commands (bench / mariadb / redis-cli), replacing what would otherwise be Ansible playbook runs.

The Agent therefore has two purposes:

Outbound telemetry push (ADR-002) — heartbeat, metrics, logs, backup reports, so the Control Plane never has to poll N hosts.
Inbound scope-limited exec (ADR-003) — POST /agent/v1/exec { kind, args } where kind ∈ { bench, mariadb, redis-cli }. No arbitrary shell. No docker run. No docker exec of arbitrary commands.

The four execution paths from Control Plane Direct Provisioning §1 (Paths A to D) coexist with the ADR-002 outbound push (Path E) as follows:

Concern	Owner	Direction	Protocol
Provision host (server, volume, firewall)	Hetzner Cloud API (Path A)	Worker → Hetzner	HTTPS + API token
Bootstrap host (Docker, certs, Agent install)	cloud-init (Path B)	Worker → host (one-shot)	Hetzner `user_data`
Run / pull / inspect / log a container	Docker Remote API (Path C)	Worker → Host	HTTPS + mTLS, port 2376
Run `bench` / `mariadb` / `redis-cli` (closed enum)	Agent inbound (Path D)	Worker → Host:443	HTTPS + mTLS + scope-limited JWT
Push metrics / logs / heartbeat / backup-report	Agent outbound (Path E, was ADR-002)	Host → Worker	HTTPS + mTLS + JWT

The Worker never uses SSH. The Worker never uses Pulumi or Ansible.

2. Hard boundary (forbidden / required)

	Forbidden	Required
Inbound endpoints	`POST /agent/v1/shell`, generic `/exec` without `kind`, `POST /agent/v1/docker/run`	Only `POST /agent/v1/exec` and `POST /agent/v1/exec/stream`, both with `kind ∈ { bench, mariadb, redis-cli }`
Inbound auth	Long-lived API key	mTLS + 5-min JWT, scope=`exec.<kind>`
Outbound auth	Long-lived API key in agent config	Short-lived JWT obtained from CP each rotation (TTL/2)
Container lifecycle (create / start / stop / pull / inspect / logs)	Routed through Agent	Routed through Docker Remote API (Path C)
Configuration management	SSH + Ansible playbook	Cloud-init at boot + Path D Agent exec for runtime CLI
Discovery	DNS-broadcast Agent endpoints; Agent scanning CP	CP knows Agent identity from `servers` table; both directions explicitly addressed

Closed enum invariant: any new Path D kind value requires an ADR amendment. The Agent process itself enforces the enum at request parse time and rejects unknown kind with HTTP 400 before any process spawn.

If a future requirement seems to need arbitrary shell on the host, reject it and instead do one of the following:

Wrap the operation in a bench sub-command (extends kind=bench cleanly), or
Route the operation through Path C (Docker exec into a tooling container), or
Open a new ADR proposing a new kind value with a defined arg surface.

3. Deployment

3.1 Bootstrap

Per ADR-003, the cloud-init user_data is dynamically generated by the Worker at server-create time (no static file in prego-docker). The Worker calls the Hetzner Cloud API directly with the rendered template, the per-server bootstrap token, and the Agent’s mTLS material.

Cloud-init writes (all files 0600, owned by prego-agent):

/etc/prego-agent/config.toml — Agent config (server_id, region, CP base URL, agent listen port)
/etc/prego-agent/bootstrap-token — single-use token (≤15 min TTL) used only for the handshake
/etc/systemd/system/prego-agent.service — systemd unit (see §8.3)
/etc/prego-agent/{ca,cert,key}.pem — mTLS material (CA pinned, client cert + key)

Cloud-init then:

docker pull <agent_image_pinned_by_worker>
systemctl daemon-reload
systemctl enable --now prego-agent
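The bootstrap steps above could be rendered by the Worker roughly as follows. This is a minimal sketch, not the actual template: `renderUserData`, `BootstrapInput`, and the TOML key names are hypothetical, and it assumes the `prego-agent` user is created elsewhere in the same user_data. Because JSON is valid YAML, the cloud-config body is emitted as JSON to avoid hand-written indentation.

```typescript
// Hypothetical input shape; the real Worker template may carry more fields.
interface BootstrapInput {
  serverId: string;
  region: string;
  cpBaseUrl: string;
  agentPort: number;
  agentImage: string;     // image reference pinned by the Worker
  bootstrapToken: string; // single-use, TTL <= 15 min
  caPem: string;
  certPem: string;
  keyPem: string;
}

function renderUserData(input: BootstrapInput): string {
  // All Agent files are 0600 and owned by prego-agent (section 3.1).
  // defer=true delays the write until users exist (assumes a cloud-init
  // version that supports deferred write_files).
  const agentFile = (path: string, content: string) => ({
    path,
    content,
    permissions: "0600",
    owner: "prego-agent:prego-agent",
    defer: true,
  });

  // Key names below are illustrative, not the real config schema.
  const configToml = [
    `server_id = ${JSON.stringify(input.serverId)}`,
    `region = ${JSON.stringify(input.region)}`,
    `cp_base_url = ${JSON.stringify(input.cpBaseUrl)}`,
    `listen_port = ${input.agentPort}`,
  ].join("\n");

  const cloudConfig = {
    write_files: [
      agentFile("/etc/prego-agent/config.toml", configToml),
      agentFile("/etc/prego-agent/bootstrap-token", input.bootstrapToken),
      agentFile("/etc/prego-agent/ca.pem", input.caPem),
      agentFile("/etc/prego-agent/cert.pem", input.certPem),
      agentFile("/etc/prego-agent/key.pem", input.keyPem),
      // The systemd unit from section 8.3 would be added the same way.
    ],
    runcmd: [
      ["docker", "pull", input.agentImage],
      ["systemctl", "daemon-reload"],
      ["systemctl", "enable", "--now", "prego-agent"],
    ],
  };

  return "#cloud-config\n" + JSON.stringify(cloudConfig, null, 2) + "\n";
}
```

The returned string would be passed as `user_data` in the Hetzner server-create call; nothing is written to a static file.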

The Agent’s first action on boot is to call POST /internal/agent/handshake with the bootstrap token over mTLS. CP marks the token consumed_at and returns the first outbound JWT (5-minute TTL, see §4.2.1); the Agent then unlinks the bootstrap-token file.

Bootstrap-token security details and D1 schema: see Control Plane Direct Provisioning §4 and §6.

3.2 Runtime

flowchart TB
    subgraph host [Docker Host]
        agent[prego-agent systemd]
        docker[Docker daemon stats API]
        proc[proc sys metrics]
        logs[bench logs]
        bench[prego-frappe-bench container]
        mariadb[mariadb local socket]
        redis[redis local socket]
    end
    subgraph cp [Control Plane Worker]
        execApi[POST agent v1 exec]
        handshake[POST internal agent handshake]
        rotateAgent[POST internal agent rotate-token]
        heartbeat[POST internal agent heartbeat]
        metrics[POST internal agent metrics]
        logsink[POST internal agent logs]
    end
    agent -->|"mTLS plus outbound JWT"| heartbeat
    agent -->|"mTLS plus outbound JWT"| metrics
    agent -->|"mTLS plus outbound JWT"| logsink
    agent -->|"mTLS plus bootstrap token"| handshake
    agent -->|"mTLS plus old JWT before TTL"| rotateAgent
    execApi -->|"mTLS plus per-request JWT"| agent
    docker -.-> agent
    proc -.-> agent
    logs -.-> agent
    agent -.->|"docker exec bench"| bench
    agent -.->|"unix socket"| mariadb
    agent -.->|"unix socket"| redis

The Agent is bidirectional (ADR-003): outbound for telemetry, inbound only for the closed kind enum on Path D. The cloud-init firewall opens port 2376 (Docker Remote API, mTLS) and port 443 (Agent inbound, mTLS), both restricted to CP egress IPs. Inbound port 22 (SSH) is closed for automated operations and is only re-enabled out-of-band for emergency manual intervention.

4. Authentication

4.1 mTLS

Mandatory at the TLS layer:

Server (CP) presents Cloudflare-issued cert for cp.pregoi.com
Client (Agent) presents client.pem issued by Prego internal CA
CP verifies client cert against infra_providers.ca_cert_ref for the server’s region
Agent verifies CP cert against ca.pem (pinned)

If mTLS fails, the request is dropped at the edge before the Worker runs (Cloudflare mTLS verification).

4.2 Short-lived JWT (5 min TTL)

There are two distinct JWT families, both signed by CP with EdDSA (Ed25519) using a key in Worker Secrets (AGENT_TOKEN_SIGNING_KEY):

4.2.1 Outbound JWT (Path E — Agent → CP)

Issued to the Agent for pushing telemetry. Claims:

{
  "sub": "agent:<server_id>",
  "iss": "cp.pregoi.com",
  "aud": "agent.pregoi.com",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["heartbeat","metrics","logs","backup_report"],
  "region": "sg",
  "infra_provider": "hetzner"
}

TTL: 5 minutes.
Rotation: Agent calls POST /internal/agent/rotate-token at TTL/2 (every 2.5 min) presenting the previous valid JWT. CP issues a new one and revokes the previous (sliding window).
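The TTL/2 rule can be derived from the token's own `iat` and `exp` claims rather than from a hard-coded interval. A minimal sketch, assuming the Agent schedules rotation from the decoded claims; the function and type names are hypothetical:

```typescript
// Only the two time claims matter here (seconds since the Unix epoch).
interface TokenWindow {
  iat: number;
  exp: number;
}

// Epoch second at which the Agent should call rotate-token: half the TTL after issuance.
function nextRotationAt(claims: TokenWindow): number {
  const ttl = claims.exp - claims.iat;
  if (ttl <= 0) throw new Error("exp must be later than iat");
  return claims.iat + ttl / 2;
}

// Delay in milliseconds from `nowSec`; zero if the rotation point has already passed.
function msUntilRotation(claims: TokenWindow, nowSec: number): number {
  return Math.max(0, (nextRotationAt(claims) - nowSec) * 1000);
}
```

With the example claims above (`iat` 1755600000, `exp` 1755600300) the rotation point is 1755600150, which is 2.5 minutes after issuance.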

4.2.2 Inbound JWT (Path D — Worker → Agent, ADR-003)

Minted per request by the Worker for a specific exec call. Claims:

{
  "sub": "worker:cp",
  "iss": "cp.pregoi.com",
  "aud": "agent:<server_id>",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["exec.bench"],
  "kind": "bench",
  "workflow_id": "wf_01HX...",
  "audit_id": "aud_01HX..."
}

TTL: 5 minutes (single-request use).
Scope is exactly one of exec.bench / exec.mariadb / exec.redis-cli. The Agent rejects requests where kind does not match the JWT scope.
The Agent does not persist the inbound JWT; it is verified and discarded per request.
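The claim checks the Agent applies after signature verification might look like the sketch below. Signature verification itself is omitted. The function name, the mapping of failures to 401 versus 403, and the reuse of the ±60 s skew allowance from §11 on the Agent side are all assumptions for illustration.

```typescript
type ExecKind = "bench" | "mariadb" | "redis-cli";

interface InboundClaims {
  aud: string;
  iat: number;
  exp: number;
  scope: string[];
  kind: string;
}

type ClaimCheck =
  | { ok: true }
  | { ok: false; status: 401 | 403; reason: string };

function checkInboundClaims(
  claims: InboundClaims,
  requestedKind: ExecKind,
  serverId: string,
  nowSec: number,
  skewSec: number = 60, // assumed; section 11 states this tolerance for CP
): ClaimCheck {
  // The token must be addressed to this specific Agent.
  if (claims.aud !== `agent:${serverId}`) {
    return { ok: false, status: 403, reason: "audience mismatch" };
  }
  // Reject tokens outside their validity window, allowing for clock skew.
  if (nowSec > claims.exp + skewSec || nowSec < claims.iat - skewSec) {
    return { ok: false, status: 401, reason: "token outside validity window" };
  }
  // Scope must be exactly one entry, and it must match both the kind claim
  // and the kind in the request body.
  const scopeOk = claims.scope.length === 1 && claims.scope[0] === `exec.${requestedKind}`;
  if (!scopeOk || claims.kind !== requestedKind) {
    return { ok: false, status: 403, reason: "scope does not match kind" };
  }
  return { ok: true };
}
```

A token minted for `exec.bench` therefore cannot be replayed against a `mariadb` request or against a different server.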

4.3 First-time bootstrap

The bootstrap-token in /etc/prego-agent/bootstrap-token is a one-shot JWT with scope: ["bootstrap"] and TTL 15 min. The Agent’s first action on boot is to exchange it for a working JWT, after which the bootstrap file is deleted.

5. Endpoints

The Agent and the Worker each host endpoints for the other. Both directions require mTLS + scope-limited JWT.

5.1 Agent → Worker (outbound push, Path E)

All endpoints live under /internal/agent/* on the existing CP Worker.

Endpoint	Method	JWT scope	Purpose
`/internal/agent/handshake`	POST	bootstrap	One-shot at first boot; exchanges the bootstrap token for the first outbound JWT (ADR-003)
`/internal/agent/rotate-token`	POST	(current JWT)	Exchange current JWT for a new one (sliding window)
`/internal/agent/heartbeat`	POST	`heartbeat`	Liveness ping every 30 s with brief container summary
`/internal/agent/metrics`	POST	`metrics`	OTLP-formatted metrics payload, every 60 s
`/internal/agent/logs`	POST	`logs`	Selected log lines (errors, warnings, audit) — rate-limited
`/internal/agent/backup-report`	POST	`backup_report`	Result of `bench backup` jobs triggered locally

5.2 Worker → Agent (inbound exec, Path D — ADR-003)

All endpoints live under /agent/v1/* on each host (port 443 by default).

Endpoint	Method	JWT scope	Purpose
`/agent/v1/exec`	POST	`exec.<kind>` (`exec.bench`, `exec.mariadb`, `exec.redis-cli`)	Synchronous scope-limited CLI execution. Body: `{ workflowId, kind, args[], timeoutSeconds?, workingDir? }`. Response: `{ exitCode, stdoutTruncated, stderrTruncated, durationMs, auditId }`.
`/agent/v1/exec/stream`	POST (SSE)	same	Long-running command (e.g., `bench backup`, `bench migrate`) with progress stream. Each SSE event is `{ type: 'stdout'

Validation:

kind MUST be in the closed enum; unknown values return HTTP 400 before any process spawn.
args[] is a string[] (no shell interpolation); the Agent invokes the binary directly with posix_spawn.
timeoutSeconds defaults to 60, max 1800.
workingDir defaults: /home/frappe/frappe-bench for bench; / for mariadb/redis-cli.
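The validation rules above can be sketched as a single parse step that runs before any process is spawned. The type and function names are hypothetical, and the lower bound of 1 second on `timeoutSeconds` is an assumption (the text only gives the default and the maximum).

```typescript
const EXEC_KINDS = ["bench", "mariadb", "redis-cli"] as const;
type ExecKindName = (typeof EXEC_KINDS)[number];

interface ValidExec {
  workflowId: string;
  kind: ExecKindName;
  args: string[];
  timeoutSeconds: number;
  workingDir: string;
}

type ExecValidation =
  | { ok: true; value: ValidExec }
  | { ok: false; status: 400; error: string };

function validateExecBody(body: unknown): ExecValidation {
  const fail = (error: string): ExecValidation => ({ ok: false, status: 400, error });
  if (typeof body !== "object" || body === null) return fail("body must be a JSON object");
  const raw = body as Record<string, unknown>;

  // Enum check comes first: unknown kinds never reach the spawn path.
  const rawKind = raw["kind"];
  const kind = EXEC_KINDS.find((k) => k === rawKind);
  if (kind === undefined) return fail("unknown kind");

  const workflowId = raw["workflowId"];
  if (typeof workflowId !== "string") return fail("workflowId must be a string");

  // args must already be a list of strings; nothing is split or interpolated.
  const args = raw["args"];
  if (!Array.isArray(args) || !args.every((a: unknown) => typeof a === "string")) {
    return fail("args must be string[]");
  }

  const timeoutSeconds = raw["timeoutSeconds"] ?? 60;
  if (
    typeof timeoutSeconds !== "number" ||
    !Number.isInteger(timeoutSeconds) ||
    timeoutSeconds < 1 || // assumed lower bound
    timeoutSeconds > 1800
  ) {
    return fail("timeoutSeconds must be an integer between 1 and 1800");
  }

  const defaultDir = kind === "bench" ? "/home/frappe/frappe-bench" : "/";
  const workingDir = raw["workingDir"] ?? defaultDir;
  if (typeof workingDir !== "string") return fail("workingDir must be a string");

  return {
    ok: true,
    value: { workflowId, kind, args: args as string[], timeoutSeconds, workingDir },
  };
}
```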

5.3 Common requirements (both directions)

Valid mTLS client cert (CA-issued; same CA as Docker Remote API)
Valid JWT with the correct scope for the endpoint
server_id in JWT MUST match an active row in servers table
Region match between JWT and servers.region
Optional Cloudflare egress IP allowlist on host firewall (open: ADR-003 OD-A)

Mismatches return HTTP 401/403 and are logged to Cloudflare Logpush for SOC review. Every Path D invocation also writes an agent_command_audit row in D1 (see Control Plane Direct Provisioning §6).

6. Payload contracts

6.1 `POST /internal/agent/heartbeat`

{
  "server_id": "app-sgp-001",
  "infra_provider": "hetzner",
  "agent_version": "0.4.2",
  "uptime_seconds": 84321,
  "containers": [
    { "name": "prego-frappe-bench", "image": "iamfork/prego-repo:v0.18.3", "state": "running", "restart_count": 0 }
  ],
  "site_count": 42,
  "ts": "2026-04-24T10:00:00Z"
}
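On the Worker side, the heartbeat contract above could be expressed as a type plus a runtime guard. This is a sketch only: the interface names are hypothetical and the field types are inferred from the single example payload, so optional or nullable fields may exist that are not modelled here.

```typescript
interface ContainerSummary {
  name: string;
  image: string;
  state: string;
  restart_count: number;
}

interface HeartbeatPayload {
  server_id: string;
  infra_provider: string;
  agent_version: string;
  uptime_seconds: number;
  containers: ContainerSummary[];
  site_count: number;
  ts: string; // ISO 8601 timestamp
}

function isContainerSummary(value: unknown): value is ContainerSummary {
  if (typeof value !== "object" || value === null) return false;
  const c = value as Record<string, unknown>;
  return (
    typeof c["name"] === "string" &&
    typeof c["image"] === "string" &&
    typeof c["state"] === "string" &&
    typeof c["restart_count"] === "number"
  );
}

function isHeartbeatPayload(value: unknown): value is HeartbeatPayload {
  if (typeof value !== "object" || value === null) return false;
  const h = value as Record<string, unknown>;
  const containers = h["containers"];
  const ts = h["ts"];
  return (
    typeof h["server_id"] === "string" &&
    typeof h["infra_provider"] === "string" &&
    typeof h["agent_version"] === "string" &&
    typeof h["uptime_seconds"] === "number" &&
    typeof h["site_count"] === "number" &&
    Array.isArray(containers) &&
    containers.every(isContainerSummary) &&
    typeof ts === "string" &&
    !Number.isNaN(Date.parse(ts)) // timestamp must be parseable
  );
}
```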

6.2 `POST /internal/agent/metrics`

OTLP/HTTP envelope (protobuf or JSON depending on cost). Standard metric names:

host.cpu.utilization (gauge)
host.memory.utilization (gauge)
host.disk.usage_pct{mountpoint=...} (gauge)
frappe.bench.worker_count{queue=...} (gauge)
frappe.site.health{site=...} (gauge: 0/1)
mariadb.connections.active (gauge)
mariadb.disk.usage_pct{database=...} (gauge)
redis.memory.usage_pct{db=...} (gauge)
frappe.queue.depth{queue=...} (gauge)
prego.provisioning.duration_seconds{phase=...} (histogram)

CP forwards these to the configured central sink (Grafana Cloud or Datadog) per ADR-002 §12.

6.3 `POST /internal/agent/logs`

{
  "server_id": "app-sgp-001",
  "lines": [
    {
      "ts": "2026-04-24T10:00:01Z",
      "level": "ERROR",
      "source": "bench",
      "site": "acme.pregoi.com",
      "message": "Worker timeout: ..."
    }
  ]
}

Rate-limited: max 1000 lines/min/server. Above that, the Agent applies local sampling and emits a frappe.logs.dropped_total counter.
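One way to enforce the per-minute budget on the Agent is a fixed-window counter, sketched below. Note the simplification: this version drops overflow lines outright rather than sampling them, and the class name and the injected clock parameter are hypothetical. `droppedTotal` stands in for the value behind the frappe.logs.dropped_total counter.

```typescript
class LogLineBudget {
  private readonly maxPerMinute: number;
  private windowStartMs = 0;
  private used = 0;
  droppedTotal = 0;

  constructor(maxPerMinute: number = 1000) {
    this.maxPerMinute = maxPerMinute;
  }

  // Returns how many of `count` lines may be shipped at time `nowMs`.
  admit(count: number, nowMs: number): number {
    // Start a fresh window once 60 s have elapsed since the current one began.
    if (nowMs - this.windowStartMs >= 60_000) {
      this.windowStartMs = nowMs;
      this.used = 0;
    }
    const allowed = Math.min(count, this.maxPerMinute - this.used);
    this.used += allowed;
    this.droppedTotal += count - allowed;
    return allowed;
  }
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test; a real Agent would supply its own monotonic time source.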

7. Token rotation in detail

sequenceDiagram
    participant Agent
    participant CP as Control Plane
    Note over Agent: t=0 boot. Bootstrap JWT on disk
    Agent->>CP: POST handshake bootstrap-token
    CP-->>Agent: JWT v1 exp=t+5min
    Note over Agent: Discard bootstrap-token
    loop every 2.5 min
        Agent->>CP: POST rotate-token JWT vN
        CP-->>Agent: JWT vN+1 exp=now+5min
    end
    Note over Agent: If rotation fails, retry with backoff. After 3 failures, fall back to bootstrap-token if file still exists otherwise alert

If rotation fails for 3 consecutive attempts without successful payload push, the Agent:

Writes a local breadcrumb file /var/log/prego-agent/rotation-failure.log
Stops sending metrics (silent failure preferred over leaking with stale auth)
Triggers a self-restart via systemd
After restart, attempts to re-bootstrap if a fresh bootstrap-token has been delivered out-of-band
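The retry-then-halt behaviour can be modelled as a small counter, as in the sketch below. The backoff schedule (1 s doubling per failure) is an assumption, since the text only says "retry with backoff", and the class and action names are hypothetical.

```typescript
type RotationAction = "retry" | "halt_and_restart";

interface RotationDecision {
  action: RotationAction;
  delayMs: number;
}

class RotationFailureTracker {
  private readonly maxFailures: number;
  private readonly baseDelayMs: number;
  private consecutiveFailures = 0;

  constructor(maxFailures: number = 3, baseDelayMs: number = 1000) {
    this.maxFailures = maxFailures;
    this.baseDelayMs = baseDelayMs;
  }

  // Any successful rotation resets the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): RotationDecision {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.maxFailures) {
      // Caller writes the breadcrumb, stops pushing metrics, and lets systemd restart it.
      return { action: "halt_and_restart", delayMs: 0 };
    }
    // Exponential backoff: base, 2x base, 4x base, ...
    const delayMs = this.baseDelayMs * 2 ** (this.consecutiveFailures - 1);
    return { action: "retry", delayMs };
  }
}
```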

CP detects missing heartbeats and alerts after 3 consecutive misses (90 s).

8. Capability model

The Agent process runs as a non-root user prego-agent with the minimum capabilities required to satisfy both the outbound push (Path E) and the inbound exec (Path D) responsibilities. ADR-003 widens the capability set to include the ability to invoke bench, mariadb, and redis-cli against the local host — but only those binaries.

8.1 Filesystem and socket access

Resource	Mode	Purpose
`/proc`, `/sys`	read	Host metrics
Docker socket (`/var/run/docker.sock`)	read-only via `docker` group	Container `list`, `stats`, `logs`, `inspect` (Path E telemetry only)
`/var/log/frappe/`, `/var/log/nginx/`, `/var/log/mariadb/`	read	Log shipping
`/var/log/prego-agent/`, `/var/lib/prego-agent/state.db`	read+write	Agent local state
`/etc/prego-agent/{ca,cert,key}.pem`	read (0600)	mTLS material
Bind to port 443 (or configured Agent port)	listen	Path D inbound

8.2 Process spawning (ADR-003)

The Agent invokes only these binaries via posix_spawn, with arguments validated against the kind enum:

`kind`	Binary	How invoked	Notes
`bench`	`docker exec -u frappe prego-frappe-bench /home/frappe/frappe-bench/env/bin/bench`	Path D	Runs inside the Frappe container; never as root on host
`mariadb`	`docker exec prego-mariadb mariadb` (DB host) OR `mariadb` (rare direct CLI)	Path D	Local Unix socket only
`redis-cli`	`redis-cli`	Path D	Local Unix socket only

The Agent never invokes sh, bash, python, ssh, docker run, docker create, or any binary outside this list.
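The table above maps each kind to a fixed argv prefix, with the caller's args appended as separate elements. A minimal sketch, assuming the `docker exec` variant for mariadb; the constant and function names are hypothetical:

```typescript
type SpawnKind = "bench" | "mariadb" | "redis-cli";

// Fixed prefixes taken from the table in section 8.2.
const ARGV_PREFIX: Record<SpawnKind, readonly string[]> = {
  bench: [
    "docker", "exec", "-u", "frappe", "prego-frappe-bench",
    "/home/frappe/frappe-bench/env/bin/bench",
  ],
  mariadb: ["docker", "exec", "prego-mariadb", "mariadb"],
  "redis-cli": ["redis-cli"],
};

// Builds the argument vector handed to the spawn call. No shell is involved,
// so each caller-supplied argument stays a single argv element.
function buildArgv(kind: SpawnKind, args: readonly string[]): string[] {
  return [...ARGV_PREFIX[kind], ...args];
}
```

Because the vector is never joined into a command line, an argument such as `"; rm -rf /"` is passed to the target binary as one literal string instead of being interpreted.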

8.3 systemd hardening

[Service]
User=prego-agent
# docker group membership is required for docker exec into the bench container
SupplementaryGroups=docker
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/prego-agent /var/lib/prego-agent
ReadOnlyPaths=/etc/prego-agent /var/run/docker.sock
# Allow listen on 443; otherwise no privileged ops:
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
SystemCallFilter=@system-service @network-io
SystemCallErrorNumber=EPERM
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true

8.4 Invariants enforced even if the Agent is compromised

Even if an attacker gains code execution inside the Agent process, the Agent cannot:

Execute docker run, docker create, or docker exec outside the prego-frappe-bench and prego-mariadb containers (Docker socket access is read-only at the operations level for telemetry, write only for the closed kind enum)
Spawn arbitrary binaries (only docker, mariadb, redis-cli are reachable; SystemCallFilter blocks generic exec families beyond @system-service)
Write to Frappe site directories on the host (covered by ProtectSystem=strict)
Modify systemd units, the Docker daemon configuration, or /etc/prego-agent/ materials
Open additional inbound ports beyond the configured Agent port (no CAP_NET_RAW, no CAP_NET_ADMIN)
Persist beyond the next systemctl restart prego-agent (no write to /etc/, no cron)

A compromise still allows Path D commands to be issued against the local Frappe / MariaDB / Redis. That blast radius is bounded to the host’s tenant set and is the intentional cost of replacing Ansible. mTLS + scope-limited JWT minimise the chance of compromise in the first place.

9. Version rollout

Agent versions follow the existing CI pipeline (prego-docker/.github/workflows/build.yml pattern).

Rollout strategy:

New Agent version published to internal artifact store (R2 or GHCR)
CP exposes GET /internal/agent/version-target?region=sg returning the desired pinned version
Agent self-update mechanism: on heartbeat, CP optionally responds with { "upgrade_to": "0.5.1" }
Agent downloads new binary, verifies signature, swaps via systemctl restart prego-agent
Failed upgrades roll back automatically (systemd ExecReload previous binary)

Canary: CP can target a small fraction of servers for upgrade first by setting agent_version_target per row.
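The heartbeat-driven upgrade hint could be computed on the CP side as sketched below. It assumes a per-row `agent_version_target` overrides the regional target when set (the canary case) and that versions are compared by equality rather than semver ordering. The function and interface names are hypothetical.

```typescript
interface ServerVersionRow {
  agent_version_target: string | null; // per-server canary override, if any
}

interface HeartbeatReply {
  upgrade_to?: string;
}

function heartbeatReply(
  reportedVersion: string,
  server: ServerVersionRow,
  regionTarget: string,
): HeartbeatReply {
  // The canary override wins; otherwise fall back to the region's pinned version.
  const target = server.agent_version_target ?? regionTarget;
  // Equality check only: any difference from the target produces a hint.
  return target === reportedVersion ? {} : { upgrade_to: target };
}
```

A canary server pinned to `0.5.1` receives the hint while the rest of the region, still targeting `0.4.2`, receives an empty reply.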

10. Audit

Two audit tables:

10.1 `agent_audit` (Path E events: bootstrap, rotation, upgrade, revoke)

CREATE TABLE agent_audit (
  audit_id     TEXT PRIMARY KEY,
  server_id    TEXT NOT NULL,
  agent_version TEXT,
  action       TEXT NOT NULL,         -- 'bootstrap','rotate','upgrade','revoke'
  result       TEXT NOT NULL,         -- 'ok','failed'
  detail       TEXT,
  at           TEXT NOT NULL DEFAULT (datetime('now'))
);

10.2 `agent_command_audit` (Path D exec invocations — ADR-003)

Schema in Control Plane Direct Provisioning §6. Every POST /agent/v1/exec and POST /agent/v1/exec/stream writes a row with kind, redacted args, exit code, duration, and truncated stdout/stderr. The Worker masks secrets in args (passwords, tokens) and persists the redacted row before it forwards the request to the Agent.

Both audit streams are pushed to Cloudflare Logpush (R2 long-term archive) and surfaced in the Admin SPA /cp/console.
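The args redaction described in §10.2 might be implemented along these lines. The matching rule (any flag whose name contains "password", "token", or "secret") is an assumption for illustration; the actual Worker may use an explicit per-command list.

```typescript
// Matches flags such as --admin-password, --token, -secret-key (hypothetical rule).
const SECRET_FLAG = /^--?[\w-]*(password|token|secret)[\w-]*$/i;

// Returns a copy of args that is safe to persist in agent_command_audit.
// Handles both "--flag value" and "--flag=value" forms.
function redactArgs(args: readonly string[]): string[] {
  const out: string[] = [];
  let maskNext = false;
  for (const arg of args) {
    if (maskNext) {
      out.push("***");
      maskNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq);
    if (!SECRET_FLAG.test(flag)) {
      out.push(arg);
    } else if (eq === -1) {
      out.push(arg);
      maskNext = true; // the secret is the following argument
    } else {
      out.push(`${flag}=***`);
    }
  }
  return out;
}
```

Only the persisted copy is masked; the original args are what the Agent must receive for the command to work. One known limitation of this rule is that a value-less flag whose name contains "password" would cause the next argument to be masked as well.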

11. Failure modes & mitigations

Failure	Detection	Mitigation
Agent process crash	Missing heartbeat ≥ 90 s	Systemd auto-restart; alert if still missing after 5 min
JWT compromise	Anomaly detector (multiple `server_id`s from same IP, etc.)	CP revokes JWT family, forces bootstrap
mTLS cert expiry	CP refuses request with HTTP 495	Agent triggers cert renewal via cloud-init re-run; out-of-band replacement
Clock skew	JWT `iat`/`exp` validation fails	Agent uses NTP; CP allows ±60 s clock skew
Network partition	All pushes time out	Agent buffers up to 1 MB locally, flushes on reconnect
Compromised host	Audit log anomalies	CP marks `servers.status='isolated'`, ops manual review

12. References

Frappe SaaS Multitenant Docker Standard (ADR-002) (parent)
Control Plane Direct Provisioning (ADR-003) — Path D (Agent inbound exec) added here
Imperative Control Layer — Path 1 (Docker Remote API) preserved; Path 3 added by ADR-003
prego-control-plane/src/clients/hetzner/cloud-init.ts — bootstrap injection point (now generated dynamically by the Worker per ADR-003)
prego-control-plane/wrangler.toml — Worker Secrets registry
Tenant Lifecycle — backup-report endpoint integration
ADR-003 Open Decisions (OD-A inbound auth, OD-B cloud-init size, OD-C agent distribution)