Agent Security Standard
Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002) and Control Plane Direct Provisioning (ADR-003)

Boundary with: Imperative Control Layer, where Path 1 (Docker API) and Path 3 (Agent inbound exec) are distinct and mutually exclusive in scope.
1. Why an Agent at all
The original imperative-control-layer chose no host agent for runtime commands — Cloudflare Workers talked directly to the Docker Remote API (mTLS, port 2376). ADR-003 (2026-04-24) updates that boundary: the Docker Remote API stays for container lifecycle, but a hardened Agent now also receives a closed enum of CLI commands (bench / mariadb / redis-cli), replacing what would otherwise be Ansible playbook runs.
The Agent therefore has two purposes:
- Outbound telemetry push (ADR-002): heartbeat, metrics, logs, backup reports, so the Control Plane never has to poll N hosts.
- Inbound scope-limited exec (ADR-003): `POST /agent/v1/exec { kind, args }` where `kind ∈ { bench, mariadb, redis-cli }`. No arbitrary shell. No `docker run`. No `docker exec` of arbitrary commands.
The four execution paths from Control Plane Direct Provisioning §1 (Paths A to D) coexist with the ADR-002 outbound push (Path E) as follows:
| Concern | Owner | Direction | Protocol |
|---|---|---|---|
| Provision host (server, volume, firewall) | Hetzner Cloud API (Path A) | Worker → Hetzner | HTTPS + API token |
| Bootstrap host (Docker, certs, Agent install) | cloud-init (Path B) | Worker → host (one-shot) | Hetzner user_data |
| Run / pull / inspect / log a container | Docker Remote API (Path C) | Worker → Host | HTTPS + mTLS, port 2376 |
| Run bench / mariadb / redis-cli (closed enum) | Agent inbound (Path D) | Worker → Host:443 | HTTPS + mTLS + scope-limited JWT |
| Push metrics / logs / heartbeat / backup-report | Agent outbound (Path E, was ADR-002) | Host → Worker | HTTPS + mTLS + JWT |
The Worker never uses SSH. The Worker never uses Pulumi or Ansible.
2. Hard boundary (forbidden / required)
| | Forbidden | Required |
|---|---|---|
| Inbound endpoints | POST /agent/v1/shell, generic /exec without kind, POST /agent/v1/docker/run | Only POST /agent/v1/exec and POST /agent/v1/exec/stream, both with kind ∈ { bench, mariadb, redis-cli } |
| Inbound auth | Long-lived API key | mTLS + 5-min JWT, scope=exec.<kind> |
| Outbound auth | Long-lived API key in agent config | Short-lived JWT obtained from CP each rotation (TTL/2) |
| Container lifecycle (create / start / stop / pull / inspect / logs) | Routed through Agent | Routed through Docker Remote API (Path C) |
| Configuration management | SSH + Ansible playbook | Cloud-init at boot + Path D Agent exec for runtime CLI |
| Discovery | DNS-broadcast Agent endpoints; Agent scanning CP | CP knows Agent identity from servers table; both directions explicitly addressed |
Closed enum invariant: any new Path D kind value requires an ADR amendment. The Agent process itself enforces the enum at request parse time and rejects unknown kind with HTTP 400 before any process spawn.
If a future requirement seems to require arbitrary shell on the host, it must be rejected — instead either:
- Wrap the operation in a `bench` sub-command (extends `kind=bench` cleanly), or
- Route the operation through Path C (Docker exec into a tooling container), or
- Open a new ADR proposing a new `kind` value with a defined arg surface.
3. Deployment
3.1 Bootstrap
Per ADR-003, the cloud-init user_data is dynamically generated by the Worker at server-create time (no static file in prego-docker). The Worker calls the Hetzner Cloud API directly with the rendered template, the per-server bootstrap token, and the Agent’s mTLS material.
Cloud-init writes (all files 0600, owned by prego-agent):
- `/etc/prego-agent/config.toml`: Agent config (server_id, region, CP base URL, agent listen port)
- `/etc/prego-agent/bootstrap-token`: single-use token (≤15 min TTL) used only for the handshake
- `/etc/systemd/system/prego-agent.service`: systemd unit (see §8.3)
- `/etc/prego-agent/{ca,cert,key}.pem`: mTLS material (CA pinned, client cert + key)
Cloud-init then:
- `docker pull <agent_image_pinned_by_worker>`
- `systemctl daemon-reload`
- `systemctl enable --now prego-agent`

The Agent’s first action on boot is to call `POST /internal/agent/handshake` with the bootstrap token over mTLS. CP marks the token `consumed_at`, returns the first outbound JWT (5-minute TTL, renewed by rotation; see §4.2.1), and the bootstrap-token file is then unlinked.
Bootstrap-token security details and D1 schema: see Control Plane Direct Provisioning §4 and §6.
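The single-use property of the bootstrap token can be sketched as follows. This is a minimal illustration, not the Worker's actual code: the `BootstrapTokenRow` shape, the in-memory record standing in for D1, and the rejection reasons are all assumptions; only the `consumed_at` marker and the 15-minute TTL come from the text.

```typescript
// Hypothetical row shape; the real D1 schema is defined in
// Control Plane Direct Provisioning §6 and is not reproduced here.
interface BootstrapTokenRow {
  serverId: string;
  expiresAt: number;         // epoch seconds, at most 15 min after issue
  consumedAt: number | null; // null until the handshake succeeds
}

type HandshakeResult =
  | { ok: true; serverId: string }
  | { ok: false; reason: "unknown" | "expired" | "already_consumed" };

// Marks the token consumed on first valid use and refuses every later use.
function consumeBootstrapToken(
  rows: Record<string, BootstrapTokenRow>,
  tokenId: string,
  now: number,
): HandshakeResult {
  const row = rows[tokenId];
  if (row === undefined) return { ok: false, reason: "unknown" };
  if (row.consumedAt !== null) return { ok: false, reason: "already_consumed" };
  if (now >= row.expiresAt) return { ok: false, reason: "expired" };
  row.consumedAt = now; // consume before any JWT is issued
  return { ok: true, serverId: row.serverId };
}
```

In a real store the check and the update would need to be one atomic statement so that two concurrent handshakes cannot both succeed.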
3.2 Runtime
```mermaid
flowchart TB
  subgraph host [Docker Host]
    agent[prego-agent systemd]
    docker[Docker daemon stats API]
    proc[proc sys metrics]
    logs[bench logs]
    bench[prego-frappe-bench container]
    mariadb[mariadb local socket]
    redis[redis local socket]
  end
  subgraph cp [Control Plane Worker]
    execApi[POST agent v1 exec]
    handshake[POST internal agent handshake]
    rotateAgent[POST internal agent rotate-token]
    heartbeat[POST internal agent heartbeat]
    metrics[POST internal agent metrics]
    logsink[POST internal agent logs]
  end
  agent -->|"mTLS plus outbound JWT"| heartbeat
  agent -->|"mTLS plus outbound JWT"| metrics
  agent -->|"mTLS plus outbound JWT"| logsink
  agent -->|"mTLS plus bootstrap token"| handshake
  agent -->|"mTLS plus old JWT before TTL"| rotateAgent
  execApi -->|"mTLS plus per-request JWT"| agent
  docker -.-> agent
  proc -.-> agent
  logs -.-> agent
  agent -.->|"docker exec bench"| bench
  agent -.->|"unix socket"| mariadb
  agent -.->|"unix socket"| redis
```
The Agent runs bidirectionally (ADR-003): outbound for telemetry, inbound only for the closed kind enum on Path D. The cloud-init firewall opens port 2376 (Docker Remote API, mTLS) and port 443 (Agent inbound, mTLS), both restricted to CP egress IPs. Inbound port 22 (SSH) is closed for automated operations and is re-enabled only out-of-band for emergency manual intervention.
4. Authentication
4.1 mTLS
Mandatory at the TLS layer:
- Server (CP) presents Cloudflare-issued cert for `cp.pregoi.com`
- Client (Agent) presents `client.pem` issued by Prego internal CA
- CP verifies client cert against `infra_providers.ca_cert_ref` for the server’s region
- Agent verifies CP cert against `ca.pem` (pinned)
If mTLS fails, the request is dropped at the edge before the Worker runs (Cloudflare mTLS verification).
4.2 Short-lived JWT (5 min TTL)
There are two distinct JWT families, both signed by CP with EdDSA (Ed25519) using a key in Worker Secrets (AGENT_TOKEN_SIGNING_KEY):
4.2.1 Outbound JWT (Path E — Agent → CP)
Issued to the Agent for pushing telemetry. Claims:
```json
{
  "sub": "agent:<server_id>",
  "iss": "cp.pregoi.com",
  "aud": "agent.pregoi.com",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["heartbeat", "metrics", "logs", "backup_report"],
  "region": "sg",
  "infra_provider": "hetzner"
}
```

- TTL: 5 minutes.
- Rotation: Agent calls `POST /internal/agent/rotate-token` at TTL/2 (every 2.5 min) presenting the previous valid JWT. CP issues a new one and revokes the previous (sliding window).
4.2.2 Inbound JWT (Path D — Worker → Agent, ADR-003)
Minted per request by the Worker for a specific exec call. Claims:
```json
{
  "sub": "worker:cp",
  "iss": "cp.pregoi.com",
  "aud": "agent:<server_id>",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["exec.bench"],
  "kind": "bench",
  "workflow_id": "wf_01HX...",
  "audit_id": "aud_01HX..."
}
```

- TTL: 5 minutes (single-request use).
- Scope is exactly one of `exec.bench` / `exec.mariadb` / `exec.redis-cli`. The Agent rejects requests where `kind` does not match the JWT scope.
- The Agent does not persist the inbound JWT; it is verified and discarded per request.
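The scope-to-kind check described above might look like the sketch below. It assumes signature verification has already happened and only covers the claim comparisons; `checkInboundClaims` and the rejection strings are illustrative names, not part of the standard.

```typescript
type InboundKind = "bench" | "mariadb" | "redis-cli";

interface InboundClaims {
  aud: string;     // "agent:<server_id>"
  exp: number;     // epoch seconds
  scope: string[]; // exactly one "exec.<kind>" entry
  kind: string;
}

// Returns null when the claims authorise `requestedKind` on this server,
// otherwise a short rejection reason. Signature checks are out of scope here.
function checkInboundClaims(
  claims: InboundClaims,
  serverId: string,
  requestedKind: InboundKind,
  now: number,
): string | null {
  if (claims.aud !== "agent:" + serverId) return "audience mismatch";
  if (now >= claims.exp) return "token expired";
  if (claims.scope.length !== 1 || claims.scope[0] !== "exec." + requestedKind) {
    return "scope does not match requested kind";
  }
  if (claims.kind !== requestedKind) return "kind claim does not match request";
  return null;
}
```

Requiring exactly one scope entry means a token minted for `bench` cannot be replayed against `mariadb` or `redis-cli`, even within its 5-minute lifetime.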
4.3 First-time bootstrap
The bootstrap-token in /etc/prego-agent/bootstrap-token is a one-shot JWT with scope: ["bootstrap"] and TTL 15 min. The Agent’s first action on boot is to exchange it for a working JWT, after which the bootstrap file is deleted.
5. Endpoints
The Agent and the Worker each host endpoints for the other. Both directions require mTLS + scope-limited JWT.
5.1 Agent → Worker (outbound push, Path E)
All endpoints live under /internal/agent/* on the existing CP Worker.
| Endpoint | Method | JWT scope | Purpose |
|---|---|---|---|
| /internal/agent/handshake | POST | bootstrap | One-shot at first boot; exchanges bootstrap token for the first outbound JWT (ADR-003) |
| /internal/agent/rotate-token | POST | (current JWT) | Exchange current JWT for a new one (sliding window) |
| /internal/agent/heartbeat | POST | heartbeat | Liveness ping every 30 s with brief container summary |
| /internal/agent/metrics | POST | metrics | OTLP-formatted metrics payload, every 60 s |
| /internal/agent/logs | POST | logs | Selected log lines (errors, warnings, audit), rate-limited |
| /internal/agent/backup-report | POST | backup_report | Result of bench backup jobs triggered locally |
5.2 Worker → Agent (inbound exec, Path D — ADR-003)
All endpoints live under /agent/v1/* on each host (port 443 by default).
| Endpoint | Method | JWT scope | Purpose |
|---|---|---|---|
| /agent/v1/exec | POST | exec.<kind> (exec.bench, exec.mariadb, exec.redis-cli) | Synchronous scope-limited CLI execution. Body: { workflowId, kind, args[], timeoutSeconds?, workingDir? }. Response: { exitCode, stdoutTruncated, stderrTruncated, durationMs, auditId }. |
| /agent/v1/exec/stream | POST (SSE) | same | Long-running command (e.g., bench backup, bench migrate) with progress stream. Each SSE event is `{ type: 'stdout' …` |
Validation:
- `kind` MUST be in the closed enum; unknown values return HTTP 400 before any process spawn.
- `args[]` is a `string[]` (no shell interpolation); the Agent invokes the binary directly with `posix_spawn`.
- `timeoutSeconds` defaults to 60, max 1800.
- `workingDir` defaults: `/home/frappe/frappe-bench` for `bench`; `/` for `mariadb` / `redis-cli`.
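The validation rules above can be expressed as a single parse step that runs before anything is spawned. The following is a sketch under assumptions: the function and type names are hypothetical, and treating a non-integer or non-positive `timeoutSeconds` as an error is a choice made here, not something the standard specifies.

```typescript
const EXEC_KINDS = ["bench", "mariadb", "redis-cli"] as const;
type ExecKind = (typeof EXEC_KINDS)[number];

interface ExecRequest {
  workflowId: string;
  kind: ExecKind;
  args: string[];
  timeoutSeconds: number;
  workingDir: string;
}

type ExecValidation =
  | { ok: true; request: ExecRequest }
  | { ok: false; status: 400; error: string };

function bad(error: string): ExecValidation {
  return { ok: false, status: 400, error };
}

function isExecKind(v: unknown): v is ExecKind {
  return typeof v === "string" && (EXEC_KINDS as readonly string[]).indexOf(v) !== -1;
}

// Validates the body and fills in defaults; nothing is spawned on failure.
function validateExecBody(body: Record<string, unknown>): ExecValidation {
  const { workflowId, kind, args, timeoutSeconds, workingDir } = body;
  if (typeof workflowId !== "string") return bad("workflowId must be a string");
  if (!isExecKind(kind)) return bad("unknown kind");
  if (!Array.isArray(args) || !args.every((a) => typeof a === "string")) {
    return bad("args must be string[]");
  }
  const timeout = timeoutSeconds === undefined ? 60 : timeoutSeconds;
  if (typeof timeout !== "number" || Math.floor(timeout) !== timeout || timeout < 1 || timeout > 1800) {
    return bad("timeoutSeconds must be an integer between 1 and 1800");
  }
  const dir = workingDir === undefined
    ? (kind === "bench" ? "/home/frappe/frappe-bench" : "/")
    : workingDir;
  if (typeof dir !== "string") return bad("workingDir must be a string");
  return { ok: true, request: { workflowId, kind, args, timeoutSeconds: timeout, workingDir: dir } };
}
```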
5.3 Common requirements (both directions)
- Valid mTLS client cert (CA-issued; same CA as Docker Remote API)
- Valid JWT with the correct scope for the endpoint
- `server_id` in JWT MUST match an `active` row in `servers` table
- Region match between JWT and `servers.region`
- Optional Cloudflare egress IP allowlist on host firewall (open: ADR-003 OD-A)
Mismatches return HTTP 401/403 and are logged to Cloudflare Logpush for SOC review. Every Path D invocation also writes an agent_command_audit row in D1 (see Control Plane Direct Provisioning §6).
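A sketch of the `servers`-table checks for an outbound (Path E) request follows. The standard says only that mismatches return 401/403; the split used here (401 when the identity cannot be tied to an active server, 403 when it can but the region or scope is wrong) is an assumption, as are the type and function names.

```typescript
interface ServerRecord {
  serverId: string;
  region: string;
  status: string; // e.g. "active" or "isolated"
}

interface OutboundClaims {
  sub: string;     // "agent:<server_id>"
  region: string;
  scope: string[];
}

// Assumed mapping: 401 = identity not established, 403 = identity known but not allowed.
function authorizeOutbound(
  claims: OutboundClaims,
  requiredScope: string,
  servers: ServerRecord[],
): 200 | 401 | 403 {
  const prefix = "agent:";
  if (claims.sub.indexOf(prefix) !== 0) return 401;
  const serverId = claims.sub.slice(prefix.length);
  const row = servers.filter((s) => s.serverId === serverId)[0];
  if (row === undefined || row.status !== "active") return 401;
  if (row.region !== claims.region) return 403;
  if (claims.scope.indexOf(requiredScope) === -1) return 403;
  return 200;
}
```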
6. Payload contracts
6.1 POST /internal/agent/heartbeat
```json
{
  "server_id": "app-sgp-001",
  "infra_provider": "hetzner",
  "agent_version": "0.4.2",
  "uptime_seconds": 84321,
  "containers": [
    {
      "name": "prego-frappe-bench",
      "image": "iamfork/prego-repo:v0.18.3",
      "state": "running",
      "restart_count": 0
    }
  ],
  "site_count": 42,
  "ts": "2026-04-24T10:00:00Z"
}
```

6.2 POST /internal/agent/metrics
OTLP/HTTP envelope (protobuf or JSON depending on cost). Standard metric names:
- `host.cpu.utilization` (gauge)
- `host.memory.utilization` (gauge)
- `host.disk.usage_pct{mountpoint=...}` (gauge)
- `frappe.bench.worker_count{queue=...}` (gauge)
- `frappe.site.health{site=...}` (gauge: 0/1)
- `mariadb.connections.active` (gauge)
- `mariadb.disk.usage_pct{database=...}` (gauge)
- `redis.memory.usage_pct{db=...}` (gauge)
- `frappe.queue.depth{queue=...}` (gauge)
- `prego.provisioning.duration_seconds{phase=...}` (histogram)
CP forwards these to the configured central sink (Grafana Cloud or Datadog) per ADR-002 §12.
6.3 POST /internal/agent/logs
```json
{
  "server_id": "app-sgp-001",
  "lines": [
    {
      "ts": "2026-04-24T10:00:01Z",
      "level": "ERROR",
      "source": "bench",
      "site": "acme.pregoi.com",
      "message": "Worker timeout: ..."
    }
  ]
}
```

Rate-limited: max 1000 lines/min/server. Above that, the Agent applies local sampling and emits a `frappe.logs.dropped_total` counter.
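The standard does not say how the local sampling works. One option, shown here purely as an assumption, is reservoir sampling (Algorithm R) over each one-minute batch: it keeps a uniform sample of at most 1000 lines and reports how many were dropped, which could feed `frappe.logs.dropped_total`. The class name and the injectable `random` source are hypothetical.

```typescript
// Keeps a uniform random sample of at most `capacity` items per batch.
class MinuteReservoir<T> {
  private kept: T[] = [];
  private seen = 0;
  private readonly capacity: number;
  private readonly random: () => number;

  constructor(capacity: number, random: () => number = Math.random) {
    this.capacity = capacity;
    this.random = random;
  }

  // Offer one log line; it either joins the sample or is counted as dropped.
  offer(line: T): void {
    this.seen++;
    if (this.kept.length < this.capacity) {
      this.kept.push(line);
      return;
    }
    // Algorithm R: the new item replaces a random slot with probability capacity/seen.
    const j = Math.floor(this.random() * this.seen);
    if (j < this.capacity) this.kept[j] = line;
  }

  // Called once per minute: returns the sample and the drop count, then resets.
  flush(): { lines: T[]; dropped: number } {
    const batch = { lines: this.kept, dropped: this.seen - this.kept.length };
    this.kept = [];
    this.seen = 0;
    return batch;
  }
}
```

A caller would create `new MinuteReservoir<LogLine>(1000)`, offer every candidate line, and flush on a 60-second timer. Below the limit nothing is dropped and line order is preserved; above it, order within the batch is not.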
7. Token rotation in detail
```mermaid
sequenceDiagram
  participant Agent
  participant CP as Control Plane
  Note over Agent: t=0 boot. Bootstrap JWT on disk
  Agent->>CP: POST handshake bootstrap-token
  CP-->>Agent: JWT v1 exp=t+5min
  Note over Agent: Discard bootstrap-token
  loop every 2.5 min
    Agent->>CP: POST rotate-token JWT vN
    CP-->>Agent: JWT vN+1 exp=now+5min
  end
  Note over Agent: If rotation fails, retry with backoff. After 3 failures, fall back to bootstrap-token if file still exists otherwise alert
```
If rotation fails for 3 consecutive attempts without successful payload push, the Agent:
- Writes a local breadcrumb file `/var/log/prego-agent/rotation-failure.log`
- Stops sending metrics (silent failure preferred over leaking with stale auth)
- Triggers a self-restart via systemd
- After restart, attempts to re-bootstrap if a fresh bootstrap-token has been delivered out-of-band
CP detects missing heartbeats and alerts after 3 consecutive misses (90 s).
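The timing rules in this section reduce to a few pure functions, sketched below. The constants come from the text (TTL/2 rotation, 3 consecutive failures, 30 s heartbeat, 3 misses); the function names and the `RotationAction` values are illustrative assumptions, and the backoff schedule is left unspecified, as it is in the source.

```typescript
const MAX_ROTATION_FAILURES = 3; // consecutive failures before the Agent halts
const HEARTBEAT_INTERVAL_S = 30; // heartbeat cadence from §5.1
const MISSED_BEFORE_ALERT = 3;   // 3 x 30 s = 90 s

// Rotate at the midpoint of the token lifetime (TTL/2).
function rotationDueAt(iat: number, exp: number): number {
  return iat + (exp - iat) / 2;
}

type RotationAction = "continue" | "retry_with_backoff" | "halt_and_restart";

// Pure transition: returns the new failure count and what the Agent does next.
function afterRotationAttempt(
  consecutiveFailures: number,
  succeeded: boolean,
): { failures: number; action: RotationAction } {
  if (succeeded) return { failures: 0, action: "continue" };
  const failures = consecutiveFailures + 1;
  return {
    failures,
    action: failures >= MAX_ROTATION_FAILURES ? "halt_and_restart" : "retry_with_backoff",
  };
}

// CP side: alert once three heartbeat intervals pass without a ping.
function heartbeatAlertDue(lastSeen: number, now: number): boolean {
  return now - lastSeen >= HEARTBEAT_INTERVAL_S * MISSED_BEFORE_ALERT;
}
```

With the example claims from §4.2.1 (`iat` 1755600000, `exp` 1755600300), `rotationDueAt` gives 1755600150, which is 2.5 minutes after issue.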
8. Capability model
The Agent process runs as a non-root user prego-agent with the minimum capabilities required to satisfy both the outbound push (Path E) and the inbound exec (Path D) responsibilities. ADR-003 widens the capability set to include the ability to invoke bench, mariadb, and redis-cli against the local host — but only those binaries.
8.1 Filesystem and socket access
| Resource | Mode | Purpose |
|---|---|---|
| /proc, /sys | read | Host metrics |
| Docker socket (/var/run/docker.sock) | read-only via docker group | Container list, stats, logs, inspect (Path E telemetry only) |
| /var/log/frappe/, /var/log/nginx/, /var/log/mariadb/ | read | Log shipping |
| /var/log/prego-agent/, /var/lib/prego-agent/state.db | read+write | Agent local state |
| /etc/prego-agent/{ca,cert,key}.pem | read (0600) | mTLS material |
| Bind to port 443 (or configured Agent port) | listen | Path D inbound |
8.2 Process spawning (ADR-003)
The Agent invokes only these binaries via posix_spawn, with arguments validated against the kind enum:
| kind | Binary | How invoked | Notes |
|---|---|---|---|
| bench | docker exec -u frappe prego-frappe-bench /home/frappe/frappe-bench/env/bin/bench | Path D | Runs inside the Frappe container; never as root on host |
| mariadb | docker exec prego-mariadb mariadb (DB host) OR mariadb (rare direct CLI) | Path D | Local Unix socket only |
| redis-cli | redis-cli | Path D | Local Unix socket only |
The Agent never invokes sh, bash, python, ssh, docker run, docker create, or any binary outside this list.
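To illustrate why passing `args[]` as an argument vector rules out shell injection, the table above can be turned into a fixed prefix per `kind`. This sketch is not the Agent's implementation (its language is not stated here), and `buildArgv` is a hypothetical name; it covers the `docker exec` form for `mariadb`, not the rare direct CLI.

```typescript
type SpawnKind = "bench" | "mariadb" | "redis-cli";

// Maps a kind to its fixed command prefix and appends caller args verbatim.
// The result is an argv array for posix_spawn-style execution, never a shell string.
function buildArgv(kind: SpawnKind, args: string[]): string[] {
  switch (kind) {
    case "bench":
      return [
        "docker", "exec", "-u", "frappe", "prego-frappe-bench",
        "/home/frappe/frappe-bench/env/bin/bench",
      ].concat(args);
    case "mariadb":
      return ["docker", "exec", "prego-mariadb", "mariadb"].concat(args);
    case "redis-cli":
      return ["redis-cli"].concat(args);
  }
}
```

An argument such as `"acme; rm -rf /"` stays a single argv element and reaches `bench` as a literal string, because no shell ever parses it.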
8.3 systemd hardening
```ini
[Service]
User=prego-agent
# Required for docker exec into the bench container
SupplementaryGroups=docker
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/prego-agent /var/lib/prego-agent
ReadOnlyPaths=/etc/prego-agent /var/run/docker.sock
# Allow listen on 443; otherwise no privileged ops:
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
SystemCallFilter=@system-service @network-io
SystemCallErrorNumber=EPERM
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
```

8.4 Invariants enforced even if the Agent is compromised
Even if an attacker gains code execution inside the Agent process, the Agent cannot:
- Execute `docker run`, `docker create`, or `docker exec` outside the `prego-frappe-bench` and `prego-mariadb` containers (Docker socket access is read-only at the operations level for telemetry, write only for the closed `kind` enum)
- Spawn arbitrary binaries (only `docker`, `mariadb`, `redis-cli` are reachable; `SystemCallFilter` blocks generic exec families beyond `@system-service`)
- Write to Frappe site directories on the host (covered by `ProtectSystem=strict`)
- Modify systemd units, the Docker daemon configuration, or `/etc/prego-agent/` materials
- Open additional inbound ports beyond the configured Agent port (no `CAP_NET_RAW`, no `CAP_NET_ADMIN`)
- Persist beyond the next `systemctl restart prego-agent` (no write to `/etc/`, no cron)
A compromise still allows Path D commands to be issued against the local Frappe / MariaDB / Redis. That blast radius is bounded to the host’s tenant set and is the intentional cost of replacing Ansible. mTLS + scope-limited JWT minimise the chance of compromise in the first place.
9. Version rollout
Agent versions follow the existing CI pipeline (prego-docker/.github/workflows/build.yml pattern).
Rollout strategy:
- New Agent version published to internal artifact store (R2 or GHCR)
- CP exposes `GET /internal/agent/version-target?region=sg` returning the desired pinned version
- Agent self-update mechanism: on heartbeat, CP optionally responds with `{ "upgrade_to": "0.5.1" }`
- Agent downloads new binary, verifies signature, swaps via `systemctl restart prego-agent`
- Failed upgrades roll back automatically (systemd `ExecReload` previous binary)
Canary: CP can target a small fraction of servers for upgrade first by setting agent_version_target per row.
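The heartbeat-driven upgrade hint could be computed as in the sketch below. It assumes that a non-null per-row `agent_version_target` overrides the region-wide target (which is how a canary would work) and that any version difference produces a hint; neither rule is spelled out in the text, and the names are hypothetical.

```typescript
interface RolloutRow {
  serverId: string;
  // Per-row canary override; null is assumed to mean "use the region target".
  agentVersionTarget: string | null;
}

// Builds the optional body CP returns on heartbeat.
function heartbeatUpgradeHint(
  reportedVersion: string,
  row: RolloutRow,
  regionTarget: string,
): { upgrade_to?: string } {
  const target = row.agentVersionTarget !== null ? row.agentVersionTarget : regionTarget;
  return target === reportedVersion ? {} : { upgrade_to: target };
}
```

For example, a canary row pinned to `0.5.1` that reports `0.4.2` receives `{ "upgrade_to": "0.5.1" }`, while a non-canary server in the same region receives an empty object until the region target moves.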
10. Audit
Two audit tables:
10.1 agent_audit (Path E events: bootstrap, rotation, upgrade, revoke)
```sql
CREATE TABLE agent_audit (
  audit_id      TEXT PRIMARY KEY,
  server_id     TEXT NOT NULL,
  agent_version TEXT,
  action        TEXT NOT NULL,  -- 'bootstrap','rotate','upgrade','revoke'
  result        TEXT NOT NULL,  -- 'ok','failed'
  detail        TEXT,
  at            TEXT NOT NULL DEFAULT (datetime('now'))
);
```

10.2 agent_command_audit (Path D exec invocations, ADR-003)
Schema in Control Plane Direct Provisioning §6. Every POST /agent/v1/exec and POST /agent/v1/exec/stream writes a row with kind, redacted args, exit code, duration, and truncated stdout/stderr. The Worker masks secrets in args (passwords, tokens) in the audit record before persisting it, and does so before it forwards the request to the Agent.
Both audit streams are pushed to Cloudflare Logpush (R2 long-term archive) and surfaced in the Admin SPA /cp/console.
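Masking of secrets in `args` before the audit row is written might be done as below. The list of sensitive flags is an assumption made for illustration; the standard does not enumerate them, and a real deployment would need a list matched to the actual `bench`, `mariadb`, and `redis-cli` argument surface.

```typescript
// Hypothetical list of flags whose values must never reach the audit table.
const SECRET_FLAGS = ["--password", "--admin-password", "--mariadb-root-password", "--token"];

// Returns a copy of args with secret values replaced by "***".
// Handles both "--flag value" and "--flag=value" forms; the input is not modified.
function redactArgs(args: string[]): string[] {
  const out: string[] = [];
  let maskNext = false;
  for (const arg of args) {
    if (maskNext) {
      out.push("***");
      maskNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq);
    if (SECRET_FLAGS.indexOf(flag) === -1) {
      out.push(arg);
    } else if (eq === -1) {
      out.push(arg);
      maskNext = true; // the value is the following element
    } else {
      out.push(flag + "=***");
    }
  }
  return out;
}
```

Only the audit copy is redacted in this sketch; the array forwarded to the Agent is left untouched, since the command needs the real values to run.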
11. Failure modes & mitigations
| Failure | Detection | Mitigation |
|---|---|---|
| Agent process crash | Missing heartbeat ≥ 90 s | Systemd auto-restart; alert if still missing after 5 min |
| JWT compromise | Anomaly detector (multiple server_ids from same IP, etc.) | CP revokes JWT family, forces bootstrap |
| mTLS cert expiry | CP refuses request with HTTP 495 | Agent triggers cert renewal via cloud-init re-run; out-of-band replacement |
| Clock skew | JWT iat/exp validation fails | Agent uses NTP; CP allows ±60 s clock skew |
| Network partition | All pushes time out | Agent buffers up to 1 MB locally, flushes on reconnect |
| Compromised host | Audit log anomalies | CP marks servers.status='isolated', ops manual review |
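The ±60 s clock-skew allowance in the table amounts to widening the JWT validity window on both ends. A minimal sketch, with `withinValidity` as a hypothetical name and the assumption that the tolerance applies to both `iat` and `exp`:

```typescript
// True when `now` falls inside [iat - skew, exp + skew]; all values in epoch seconds.
function withinValidity(iat: number, exp: number, now: number, skewSeconds: number = 60): boolean {
  return now >= iat - skewSeconds && now <= exp + skewSeconds;
}
```

With the example claims (`iat` 1755600000, `exp` 1755600300), a verifier whose clock runs 45 s behind the issuer still accepts a freshly minted token, while a token presented 61 s after `exp` is refused.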
12. References
- Frappe SaaS Multitenant Docker Standard (ADR-002) (parent)
- Control Plane Direct Provisioning (ADR-003): Path D (Agent inbound exec) added here
- Imperative Control Layer: Path 1 (Docker Remote API) preserved; Path 3 added by ADR-003
- `prego-control-plane/src/clients/hetzner/cloud-init.ts`: bootstrap injection point (now generated dynamically by the Worker per ADR-003)
- `prego-control-plane/wrangler.toml`: Worker Secrets registry
- Tenant Lifecycle: backup-report endpoint integration
- ADR-003 Open Decisions (OD-A inbound auth, OD-B cloud-init size, OD-C agent distribution)