
Agent Security Standard

Companion to: Frappe SaaS Multitenant Docker Standard (ADR-002) and Control Plane Direct Provisioning (ADR-003)

Boundary with: Imperative Control Layer. Path 1 (Docker API) and Path 3 (Agent inbound exec) are distinct and mutually exclusive in scope.

1. Why an Agent at all

The original imperative-control-layer chose no host agent for runtime commands — Cloudflare Workers talked directly to the Docker Remote API (mTLS, port 2376). ADR-003 (2026-04-24) updates that boundary: the Docker Remote API stays for container lifecycle, but a hardened Agent now also receives a closed enum of CLI commands (bench / mariadb / redis-cli), replacing what would otherwise be Ansible playbook runs.

The Agent therefore has two purposes:

  1. Outbound telemetry push (ADR-002) — heartbeat, metrics, logs, backup reports, so the Control Plane never has to poll N hosts.
  2. Inbound scope-limited exec (ADR-003) — POST /agent/v1/exec { kind, args } where kind ∈ { bench, mariadb, redis-cli }. No arbitrary shell. No docker run. No docker exec of arbitrary commands.

The four execution paths from Control Plane Direct Provisioning §1 (Paths A to D) coexist with the ADR-002 outbound push (Path E) as follows:

| Concern | Owner | Direction | Protocol |
|---|---|---|---|
| Provision host (server, volume, firewall) | Hetzner Cloud API (Path A) | Worker → Hetzner | HTTPS + API token |
| Bootstrap host (Docker, certs, Agent install) | cloud-init (Path B) | Worker → host (one-shot) | Hetzner user_data |
| Run / pull / inspect / log a container | Docker Remote API (Path C) | Worker → Host | HTTPS + mTLS, port 2376 |
| Run bench / mariadb / redis-cli (closed enum) | Agent inbound (Path D) | Worker → Host:443 | HTTPS + mTLS + scope-limited JWT |
| Push metrics / logs / heartbeat / backup-report | Agent outbound (Path E, was ADR-002) | Host → Worker | HTTPS + mTLS + JWT |

The Worker never uses SSH. The Worker never uses Pulumi or Ansible.


2. Hard boundary (forbidden / required)

| Area | Forbidden | Required |
|---|---|---|
| Inbound endpoints | POST /agent/v1/shell, generic /exec without kind, POST /agent/v1/docker/run | Only POST /agent/v1/exec and POST /agent/v1/exec/stream, both with kind ∈ { bench, mariadb, redis-cli } |
| Inbound auth | Long-lived API key | mTLS + 5-min JWT, scope=exec.<kind> |
| Outbound auth | Long-lived API key in agent config | Short-lived JWT obtained from CP each rotation (TTL/2) |
| Container lifecycle (create / start / stop / pull / inspect / logs) | Routed through Agent | Routed through Docker Remote API (Path C) |
| Configuration management | SSH + Ansible playbook | Cloud-init at boot + Path D Agent exec for runtime CLI |
| Discovery | DNS-broadcast Agent endpoints; Agent scanning CP | CP knows Agent identity from servers table; both directions explicitly addressed |

Closed enum invariant: any new Path D kind value requires an ADR amendment. The Agent process itself enforces the enum at request parse time and rejects unknown kind with HTTP 400 before any process spawn.

If a future requirement seems to require arbitrary shell on the host, it must be rejected — instead either:

  • Wrap the operation in a bench sub-command (extends kind=bench cleanly), or
  • Route the operation through Path C (Docker exec into a tooling container), or
  • Open a new ADR proposing a new kind value with a defined arg surface.


3. Deployment

3.1 Bootstrap

Per ADR-003, the cloud-init user_data is dynamically generated by the Worker at server-create time (no static file in prego-docker). The Worker calls the Hetzner Cloud API directly with the rendered template, the per-server bootstrap token, and the Agent’s mTLS material.

Cloud-init writes (all files 0600, owned by prego-agent):

  • /etc/prego-agent/config.toml: Agent config (server_id, region, CP base URL, agent listen port)
  • /etc/prego-agent/bootstrap-token: single-use token (≤15 min TTL) used only for the handshake
  • /etc/systemd/system/prego-agent.service: systemd unit (see §8.3)
  • /etc/prego-agent/{ca,cert,key}.pem: mTLS material (CA pinned, client cert + key)

Cloud-init then:

docker pull <agent_image_pinned_by_worker>
systemctl daemon-reload
systemctl enable --now prego-agent

The Agent’s first action on boot is to call POST /internal/agent/handshake with the bootstrap token over mTLS. CP marks the token consumed_at and returns the first outbound JWT (5-minute TTL, rotated as described in §4.2.1); the bootstrap-token file is then unlinked.

Bootstrap-token security details and D1 schema: see Control Plane Direct Provisioning §4 and §6.

3.2 Runtime

flowchart TB
    subgraph host [Docker Host]
        agent[prego-agent systemd]
        docker[Docker daemon stats API]
        proc[proc sys metrics]
        logs[bench logs]
        bench[prego-frappe-bench container]
        mariadb[mariadb local socket]
        redis[redis local socket]
    end
    subgraph cp [Control Plane Worker]
        execApi[POST agent v1 exec]
        handshake[POST internal agent handshake]
        rotateAgent[POST internal agent rotate-token]
        heartbeat[POST internal agent heartbeat]
        metrics[POST internal agent metrics]
        logsink[POST internal agent logs]
    end
    agent -->|"mTLS plus outbound JWT"| heartbeat
    agent -->|"mTLS plus outbound JWT"| metrics
    agent -->|"mTLS plus outbound JWT"| logsink
    agent -->|"mTLS plus bootstrap token"| handshake
    agent -->|"mTLS plus old JWT before TTL"| rotateAgent
    execApi -->|"mTLS plus per-request JWT"| agent
    docker -.-> agent
    proc -.-> agent
    logs -.-> agent
    agent -.->|"docker exec bench"| bench
    agent -.->|"unix socket"| mariadb
    agent -.->|"unix socket"| redis

The Agent runs bidirectionally (ADR-003): outbound for telemetry, inbound only for the closed kind enum on Path D. The cloud-init firewall opens port 2376 (Docker Remote API, mTLS) and port 443 (Agent inbound, mTLS), both restricted to CP egress IPs. Inbound port 22 (SSH) is closed for automated operations and is re-enabled only out-of-band for emergency manual intervention.


4. Authentication

4.1 mTLS

Mandatory at the TLS layer:

  • Server (CP) presents Cloudflare-issued cert for cp.pregoi.com
  • Client (Agent) presents client.pem issued by Prego internal CA
  • CP verifies client cert against infra_providers.ca_cert_ref for the server’s region
  • Agent verifies CP cert against ca.pem (pinned)

If mTLS fails, the request is dropped at the edge before the Worker runs (Cloudflare mTLS verification).

4.2 Short-lived JWT (5 min TTL)

There are two distinct JWT families, both signed by CP with EdDSA (Ed25519) using a key in Worker Secrets (AGENT_TOKEN_SIGNING_KEY):

4.2.1 Outbound JWT (Path E — Agent → CP)

Issued to the Agent for pushing telemetry. Claims:

{
  "sub": "agent:<server_id>",
  "iss": "cp.pregoi.com",
  "aud": "agent.pregoi.com",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["heartbeat","metrics","logs","backup_report"],
  "region": "sg",
  "infra_provider": "hetzner"
}
  • TTL: 5 minutes.
  • Rotation: Agent calls POST /internal/agent/rotate-token at TTL/2 (every 2.5 min) presenting the previous valid JWT. CP issues a new one and revokes the previous (sliding window).
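
As a rough illustration of the TTL/2 rule, the sketch below derives the next rotation time from a token's iat and exp claims. The function name next_rotation_at is hypothetical; only the TTL/2 rule and the example timestamps come from this section.

```python
def next_rotation_at(iat: int, exp: int) -> float:
    """Return the Unix time at which the Agent should rotate (half the TTL)."""
    if exp <= iat:
        raise ValueError("exp must be later than iat")
    return iat + (exp - iat) / 2


# Claim values from the example above: TTL is 300 s, so rotation is due 150 s after issue.
rotate_at = next_rotation_at(1755600000, 1755600300)
print(rotate_at - 1755600000)
```

Deriving the deadline from the claims rather than from a fixed timer keeps the schedule correct if CP ever issues a token with a different TTL.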

4.2.2 Inbound JWT (Path D — Worker → Agent, ADR-003)

Minted per request by the Worker for a specific exec call. Claims:

{
  "sub": "worker:cp",
  "iss": "cp.pregoi.com",
  "aud": "agent:<server_id>",
  "iat": 1755600000,
  "exp": 1755600300,
  "scope": ["exec.bench"],
  "kind": "bench",
  "workflow_id": "wf_01HX...",
  "audit_id": "aud_01HX..."
}
  • TTL: 5 minutes (single-request use).
  • Scope is exactly one of exec.bench / exec.mariadb / exec.redis-cli. The Agent rejects requests where kind does not match the JWT scope.
  • The Agent does not persist the inbound JWT; it is verified and discarded per request.
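
The claim checks described above could look roughly like the following sketch. It assumes the EdDSA signature has already been verified elsewhere and only inspects the decoded claims. The function name check_inbound_claims is hypothetical, and the 60-second skew tolerance is borrowed from the clock-skew row in §11.

```python
CLOCK_SKEW_SECONDS = 60  # tolerance taken from the clock-skew row in section 11


def check_inbound_claims(claims: dict, server_id: str, kind: str, now: int) -> None:
    """Raise PermissionError unless already-verified claims authorise `kind` here."""
    iat, exp = claims.get("iat"), claims.get("exp")
    if not isinstance(iat, int) or not isinstance(exp, int):
        raise PermissionError("iat and exp are required")
    if claims.get("aud") != f"agent:{server_id}":
        raise PermissionError("audience does not match this server")
    if now > exp + CLOCK_SKEW_SECONDS:
        raise PermissionError("token expired")
    if now < iat - CLOCK_SKEW_SECONDS:
        raise PermissionError("token issued in the future")
    # Scope must be exactly one entry, and it must match the requested kind.
    if claims.get("kind") != kind or claims.get("scope") != [f"exec.{kind}"]:
        raise PermissionError("kind does not match JWT scope")


example_claims = {
    "sub": "worker:cp",
    "aud": "agent:app-sgp-001",
    "iat": 1755600000,
    "exp": 1755600300,
    "scope": ["exec.bench"],
    "kind": "bench",
}
check_inbound_claims(example_claims, "app-sgp-001", "bench", now=1755600010)
```

Nothing is stored between calls, which matches the rule that the inbound JWT is verified and discarded per request.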

4.3 First-time bootstrap

The bootstrap-token in /etc/prego-agent/bootstrap-token is a one-shot JWT with scope: ["bootstrap"] and TTL 15 min. The Agent’s first action on boot is to exchange it for a working JWT, after which the bootstrap file is deleted.


5. Endpoints

The Agent and the Worker each host endpoints for the other. Both directions require mTLS + scope-limited JWT.

5.1 Agent → Worker (outbound push, Path E)

All endpoints live under /internal/agent/* on the existing CP Worker.

| Endpoint | Method | JWT scope | Purpose |
|---|---|---|---|
| /internal/agent/handshake | POST | bootstrap | One-shot at first boot; exchanges the bootstrap token for the first outbound JWT (ADR-003) |
| /internal/agent/rotate-token | POST | (current JWT) | Exchange current JWT for a new one (sliding window) |
| /internal/agent/heartbeat | POST | heartbeat | Liveness ping every 30 s with brief container summary |
| /internal/agent/metrics | POST | metrics | OTLP-formatted metrics payload, every 60 s |
| /internal/agent/logs | POST | logs | Selected log lines (errors, warnings, audit); rate-limited |
| /internal/agent/backup-report | POST | backup_report | Result of bench backup jobs triggered locally |

5.2 Worker → Agent (inbound exec, Path D — ADR-003)

All endpoints live under /agent/v1/* on each host (port 443 by default).

| Endpoint | Method | JWT scope | Purpose |
|---|---|---|---|
| /agent/v1/exec | POST | exec.<kind> (exec.bench, exec.mariadb, exec.redis-cli) | Synchronous scope-limited CLI execution. Body: { workflowId, kind, args[], timeoutSeconds?, workingDir? }. Response: { exitCode, stdoutTruncated, stderrTruncated, durationMs, auditId }. |
| /agent/v1/exec/stream | POST (SSE) | same | Long-running command (e.g., bench backup, bench migrate) with progress stream. Each SSE event is `{ type: 'stdout' ...` |

Validation:

  • kind MUST be in the closed enum; unknown values return HTTP 400 before any process spawn.
  • args[] is a string[] (no shell interpolation); the Agent invokes the binary directly with posix_spawn.
  • timeoutSeconds defaults to 60, max 1800.
  • workingDir defaults: /home/frappe/frappe-bench for bench; / for mariadb/redis-cli.
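
A minimal sketch of these validation rules, run before anything is spawned. The names ExecRequest, BadRequest, and parse_exec_request are hypothetical. Treating workflowId as required and using 1 second as the lower timeout bound are assumptions, since this section only states the default and the maximum.

```python
from dataclasses import dataclass

ALLOWED_KINDS = ("bench", "mariadb", "redis-cli")
DEFAULT_TIMEOUT, MAX_TIMEOUT = 60, 1800
DEFAULT_WORKDIR = {"bench": "/home/frappe/frappe-bench", "mariadb": "/", "redis-cli": "/"}


class BadRequest(Exception):
    """Maps to HTTP 400; raised before any process is spawned."""


@dataclass(frozen=True)
class ExecRequest:
    workflow_id: str
    kind: str
    args: tuple
    timeout_seconds: int
    working_dir: str


def parse_exec_request(body: dict) -> ExecRequest:
    kind = body.get("kind")
    if kind not in ALLOWED_KINDS:  # closed enum check comes first
        raise BadRequest(f"unknown kind: {kind!r}")
    args = body.get("args")
    if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
        raise BadRequest("args must be a list of strings")
    timeout = body.get("timeoutSeconds", DEFAULT_TIMEOUT)
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 1 <= timeout <= MAX_TIMEOUT:
        raise BadRequest("timeoutSeconds must be an integer between 1 and 1800")
    workflow_id = body.get("workflowId")
    if not isinstance(workflow_id, str) or not workflow_id:
        raise BadRequest("workflowId is required")  # assumption: mandatory field
    working_dir = body.get("workingDir", DEFAULT_WORKDIR[kind])
    if not isinstance(working_dir, str):
        raise BadRequest("workingDir must be a string")
    return ExecRequest(workflow_id, kind, tuple(args), timeout, working_dir)


req = parse_exec_request({"workflowId": "wf_demo", "kind": "bench", "args": ["--version"]})
```

Here req.timeout_seconds falls back to 60 and req.working_dir to /home/frappe/frappe-bench, while a body with kind "sh" raises BadRequest.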

5.3 Common requirements (both directions)

  • Valid mTLS client cert (CA-issued; same CA as Docker Remote API)
  • Valid JWT with the correct scope for the endpoint
  • server_id in JWT MUST match an active row in servers table
  • Region match between JWT and servers.region
  • Optional Cloudflare egress IP allowlist on host firewall (open: ADR-003 OD-A)

Mismatches return HTTP 401/403 and are logged to Cloudflare Logpush for SOC review. Every Path D invocation also writes an agent_command_audit row in D1 (see Control Plane Direct Provisioning §6).
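
On the CP side, the server_id, region, and scope checks for an outbound push might be sketched as below. D1 is SQLite-compatible, so an in-memory SQLite database stands in for it here. The three-column servers table, the 'active' status value, the function name authorize_push, and the exact split between 401 and 403 are all assumptions for illustration.

```python
import sqlite3

# Assumed minimal shape of the servers table; the real D1 schema is defined elsewhere.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE servers (server_id TEXT PRIMARY KEY, region TEXT, status TEXT)")
db.execute("INSERT INTO servers VALUES ('app-sgp-001', 'sg', 'active')")


def authorize_push(conn: sqlite3.Connection, claims: dict, endpoint_scope: str) -> int:
    """Return 200 if the (already signature-verified) claims may call the endpoint."""
    sub = claims.get("sub", "")
    if not sub.startswith("agent:"):
        return 401
    row = conn.execute(
        "SELECT region FROM servers WHERE server_id = ? AND status = 'active'",
        (sub[len("agent:"):],),
    ).fetchone()
    if row is None:
        return 401  # unknown or inactive server
    if row[0] != claims.get("region"):
        return 403  # region in JWT must match servers.region
    if endpoint_scope not in claims.get("scope", []):
        return 403  # token lacks the scope for this endpoint
    return 200


ok_claims = {"sub": "agent:app-sgp-001", "region": "sg", "scope": ["heartbeat", "metrics"]}
print(authorize_push(db, ok_claims, "heartbeat"))
```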


6. Payload contracts

6.1 POST /internal/agent/heartbeat

{
  "server_id": "app-sgp-001",
  "infra_provider": "hetzner",
  "agent_version": "0.4.2",
  "uptime_seconds": 84321,
  "containers": [
    { "name": "prego-frappe-bench", "image": "iamfork/prego-repo:v0.18.3", "state": "running", "restart_count": 0 }
  ],
  "site_count": 42,
  "ts": "2026-04-24T10:00:00Z"
}

6.2 POST /internal/agent/metrics

OTLP/HTTP envelope (protobuf or JSON depending on cost). Standard metric names:

  • host.cpu.utilization (gauge)
  • host.memory.utilization (gauge)
  • host.disk.usage_pct{mountpoint=...} (gauge)
  • frappe.bench.worker_count{queue=...} (gauge)
  • frappe.site.health{site=...} (gauge: 0/1)
  • mariadb.connections.active (gauge)
  • mariadb.disk.usage_pct{database=...} (gauge)
  • redis.memory.usage_pct{db=...} (gauge)
  • frappe.queue.depth{queue=...} (gauge)
  • prego.provisioning.duration_seconds{phase=...} (histogram)

CP forwards these to the configured central sink (Grafana Cloud or Datadog) per ADR-002 §12.

6.3 POST /internal/agent/logs

{
  "server_id": "app-sgp-001",
  "lines": [
    {
      "ts": "2026-04-24T10:00:01Z",
      "level": "ERROR",
      "source": "bench",
      "site": "acme.pregoi.com",
      "message": "Worker timeout: ..."
    }
  ]
}

Rate-limited: max 1000 lines/min/server. Above that, the Agent applies local sampling and emits a frappe.logs.dropped_total counter.
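
One way to picture the per-server cap is a fixed one-minute window with a drop counter, sketched below. This is a simplification: it drops every line beyond the limit instead of sampling, and the class name LogRateLimiter and its fields are hypothetical.

```python
class LogRateLimiter:
    """Fixed window: admit up to `limit` lines per window, count the rest as dropped."""

    def __init__(self, limit: int = 1000, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None
        self.sent_in_window = 0
        self.dropped_total = 0  # would back the frappe.logs.dropped_total counter

    def admit(self, now: float) -> bool:
        """Return True if a line arriving at time `now` may be shipped."""
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now  # start a fresh window
            self.sent_in_window = 0
        if self.sent_in_window < self.limit:
            self.sent_in_window += 1
            return True
        self.dropped_total += 1
        return False


limiter = LogRateLimiter(limit=3)
decisions = [limiter.admit(now=0.0) for _ in range(5)]
```

With a limit of 3, the first three lines in the window are admitted and the remaining two increment dropped_total; a line arriving 60 seconds later starts a new window.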


7. Token rotation in detail

sequenceDiagram
    participant Agent
    participant CP as Control Plane
    Note over Agent: t=0 boot. Bootstrap token on disk
    Agent->>CP: POST handshake bootstrap-token
    CP-->>Agent: JWT v1 exp=t+5min
    Note over Agent: Discard bootstrap-token
    loop every 2.5 min
        Agent->>CP: POST rotate-token JWT vN
        CP-->>Agent: JWT vN+1 exp=now+5min
    end
    Note over Agent: If rotation fails, retry with backoff. After 3 failures, fall back to bootstrap-token if file still exists otherwise alert

If rotation fails for 3 consecutive attempts without successful payload push, the Agent:

  1. Writes a local breadcrumb file /var/log/prego-agent/rotation-failure.log
  2. Stops sending metrics (silent failure preferred over leaking with stale auth)
  3. Triggers a self-restart via systemd
  4. After restart, attempts to re-bootstrap if a fresh bootstrap-token has been delivered out-of-band
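
The retry-then-give-up behaviour can be sketched as a small helper. The callable rotate, the exponential delays, and the use of OSError as the failure signal are assumptions; this section fixes only the limit of three consecutive failures.

```python
import time

MAX_FAILURES = 3


def rotate_with_backoff(rotate, sleep=time.sleep, base_delay: float = 1.0) -> bool:
    """Call rotate() up to three times; return False when the fallback steps should run."""
    for attempt in range(MAX_FAILURES):
        try:
            rotate()
            return True
        except OSError:  # covers ConnectionError and TimeoutError
            if attempt < MAX_FAILURES - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1 s, then 2 s
    return False
```

A False return is the point at which the Agent would write the breadcrumb file, stop pushing, and request a restart as listed above. Passing sleep as a parameter keeps the helper testable without real delays.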

CP detects missing heartbeats and alerts after 3 consecutive misses (90 s).


8. Capability model

The Agent process runs as a non-root user prego-agent with the minimum capabilities required to satisfy both the outbound push (Path E) and the inbound exec (Path D) responsibilities. ADR-003 widens the capability set to include the ability to invoke bench, mariadb, and redis-cli against the local host — but only those binaries.

8.1 Filesystem and socket access

| Resource | Mode | Purpose |
|---|---|---|
| /proc, /sys | read | Host metrics |
| Docker socket (/var/run/docker.sock) | read-only via docker group | Container list, stats, logs, inspect (Path E telemetry only) |
| /var/log/frappe/, /var/log/nginx/, /var/log/mariadb/ | read | Log shipping |
| /var/log/prego-agent/, /var/lib/prego-agent/state.db | read+write | Agent local state |
| /etc/prego-agent/{ca,cert,key}.pem | read (0600) | mTLS material |
| Bind to port 443 (or configured Agent port) | listen | Path D inbound |

8.2 Process spawning (ADR-003)

The Agent invokes only these binaries via posix_spawn, with arguments validated against the kind enum:

| kind | Binary | How invoked | Notes |
|---|---|---|---|
| bench | docker exec -u frappe prego-frappe-bench /home/frappe/frappe-bench/env/bin/bench | Path D | Runs inside the Frappe container; never as root on host |
| mariadb | docker exec prego-mariadb mariadb (DB host) OR mariadb (rare direct CLI) | Path D | Local Unix socket only |
| redis-cli | redis-cli | Path D | Local Unix socket only |

The Agent never invokes sh, bash, python, ssh, docker run, docker create, or any binary outside this list.
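
To make the "no shell" property concrete, the sketch below maps each kind to a fixed argv prefix and appends the caller's args as separate list elements. The prefixes are taken from the table above; ARGV_PREFIX and build_argv are hypothetical names, and Python is used only for illustration, not as a statement about the Agent's implementation language.

```python
ARGV_PREFIX = {
    "bench": ["docker", "exec", "-u", "frappe", "prego-frappe-bench",
              "/home/frappe/frappe-bench/env/bin/bench"],
    "mariadb": ["docker", "exec", "prego-mariadb", "mariadb"],
    "redis-cli": ["redis-cli"],
}


def build_argv(kind: str, args: list) -> list:
    """Return the exact argv to spawn; no string is ever parsed by a shell."""
    if kind not in ARGV_PREFIX:
        raise ValueError(f"unknown kind: {kind!r}")
    return [*ARGV_PREFIX[kind], *args]


argv = build_argv("bench", ["--site", "acme.pregoi.com", "migrate"])
# Shell metacharacters stay inert because each arg is a single argv element:
hostile = build_argv("redis-cli", ["PING; rm -rf /"])
```

Passing a list like argv to subprocess.run without shell=True hands the elements to the program unchanged, so the hostile string above reaches redis-cli as one literal argument rather than being interpreted.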

8.3 systemd hardening

[Service]
User=prego-agent
# Required for docker exec into the bench container:
SupplementaryGroups=docker
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/prego-agent /var/lib/prego-agent
ReadOnlyPaths=/etc/prego-agent /var/run/docker.sock
# Allow listen on 443; otherwise no privileged ops:
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
SystemCallFilter=@system-service @network-io
SystemCallErrorNumber=EPERM
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true

8.4 Invariants enforced even if the Agent is compromised

Even if an attacker gains code execution inside the Agent process, the Agent cannot:

  • Execute docker run, docker create, or docker exec outside the prego-frappe-bench and prego-mariadb containers (Docker socket access is read-only at the operations level for telemetry, write only for the closed kind enum)
  • Spawn arbitrary binaries (only docker, mariadb, redis-cli are reachable; SystemCallFilter blocks generic exec families beyond @system-service)
  • Write to Frappe site directories on the host (covered by ProtectSystem=strict)
  • Modify systemd units, the Docker daemon configuration, or /etc/prego-agent/ materials
  • Open additional inbound ports beyond the configured Agent port (no CAP_NET_RAW, no CAP_NET_ADMIN)
  • Persist beyond the next systemctl restart prego-agent (no write to /etc/, no cron)

A compromise still allows Path D commands to be issued against the local Frappe / MariaDB / Redis. That blast radius is bounded to the host’s tenant set and is the intentional cost of replacing Ansible. mTLS + scope-limited JWT minimise the chance of compromise in the first place.


9. Version rollout

Agent versions follow the existing CI pipeline (prego-docker/.github/workflows/build.yml pattern).

Rollout strategy:

  1. New Agent version published to internal artifact store (R2 or GHCR)
  2. CP exposes GET /internal/agent/version-target?region=sg returning the desired pinned version
  3. Agent self-update mechanism: on heartbeat, CP optionally responds with { "upgrade_to": "0.5.1" }
  4. Agent downloads new binary, verifies signature, swaps via systemctl restart prego-agent
  5. Failed upgrades roll back automatically (systemd ExecReload previous binary)

Canary: CP can target a small fraction of servers for upgrade first by setting agent_version_target per row.


10. Audit

Two audit tables:

10.1 agent_audit (Path E events: bootstrap, rotation, upgrade, revoke)

CREATE TABLE agent_audit (
  audit_id      TEXT PRIMARY KEY,
  server_id     TEXT NOT NULL,
  agent_version TEXT,
  action        TEXT NOT NULL, -- 'bootstrap','rotate','upgrade','revoke'
  result        TEXT NOT NULL, -- 'ok','failed'
  detail        TEXT,
  at            TEXT NOT NULL DEFAULT (datetime('now'))
);
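
As a quick check that the schema behaves as intended, the sketch below creates the table in an in-memory SQLite database (standing in for D1) and records one rotation event. The helper record_audit and the uuid-based audit_id format are assumptions for illustration.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE agent_audit (
        audit_id TEXT PRIMARY KEY,
        server_id TEXT NOT NULL,
        agent_version TEXT,
        action TEXT NOT NULL,
        result TEXT NOT NULL,
        detail TEXT,
        at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")


def record_audit(server_id, action, result, agent_version=None, detail=None):
    """Insert one audit row and return its ID; `at` is filled by the column default."""
    audit_id = f"aud_{uuid.uuid4().hex}"  # assumed ID format
    db.execute(
        "INSERT INTO agent_audit (audit_id, server_id, agent_version, action, result, detail)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (audit_id, server_id, agent_version, action, result, detail),
    )
    return audit_id


audit_id = record_audit("app-sgp-001", "rotate", "ok", agent_version="0.4.2")
```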

10.2 agent_command_audit (Path D exec invocations — ADR-003)

Schema in Control Plane Direct Provisioning §6. Every POST /agent/v1/exec and POST /agent/v1/exec/stream writes a row with kind, redacted args, exit code, duration, and truncated stdout/stderr. The Worker masks secrets in args (passwords, tokens) for the audit record before it forwards the request to the Agent, so unmasked values are never persisted.
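
A masking pass over args could be sketched as follows. The flag patterns are hypothetical, since the actual list of sensitive flags is not defined in this document; the sketch only shows masking both the --flag=value and the --flag value forms while leaving the original list untouched for forwarding.

```python
import re

# Hypothetical patterns: any flag whose name contains password, token, or secret.
SECRET_ASSIGN = re.compile(r"^(--?[\w-]*(?:password|token|secret)[\w-]*)=(.*)$", re.IGNORECASE)
SECRET_NAME = re.compile(r"^--?[\w-]*(?:password|token|secret)[\w-]*$", re.IGNORECASE)


def redact_args(args):
    """Return a copy of args with secret values replaced by '***'."""
    redacted, mask_next = [], False
    for arg in args:
        if mask_next:  # value following a bare secret flag
            redacted.append("***")
            mask_next = False
            continue
        match = SECRET_ASSIGN.match(arg)
        if match:
            redacted.append(f"{match.group(1)}=***")
        else:
            redacted.append(arg)
            mask_next = bool(SECRET_NAME.match(arg))
    return redacted


original = ["new-site", "acme.pregoi.com", "--admin-password", "s3cret",
            "--db-root-password=hunter2"]
audit_copy = redact_args(original)
```

The audit_copy list is what would be persisted; original is unchanged and is what the exec request would carry.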

Both audit streams are pushed to Cloudflare Logpush (R2 long-term archive) and surfaced in the Admin SPA /cp/console.


11. Failure modes & mitigations

| Failure | Detection | Mitigation |
|---|---|---|
| Agent process crash | Missing heartbeat ≥ 90 s | Systemd auto-restart; alert if still missing after 5 min |
| JWT compromise | Anomaly detector (multiple server_ids from same IP, etc.) | CP revokes JWT family, forces bootstrap |
| mTLS cert expiry | CP refuses request with HTTP 495 | Agent triggers cert renewal via cloud-init re-run; out-of-band replacement |
| Clock skew | JWT iat/exp validation fails | Agent uses NTP; CP allows ±60 s clock skew |
| Network partition | All pushes time out | Agent buffers up to 1 MB locally, flushes on reconnect |
| Compromised host | Audit log anomalies | CP marks servers.status='isolated', ops manual review |
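
The network-partition mitigation implies a bounded local buffer. A minimal sketch is shown below, assuming 1 MB means 1,000,000 bytes and that the oldest payloads are evicted first when the cap is exceeded; neither detail is specified here, and the class name PushBuffer is hypothetical.

```python
from collections import deque


class PushBuffer:
    """Hold serialized payloads during a partition, capped at max_bytes."""

    def __init__(self, max_bytes: int = 1_000_000):
        self.max_bytes = max_bytes
        self.size = 0
        self.items = deque()

    def add(self, payload: bytes) -> None:
        if len(payload) > self.max_bytes:
            return  # a single oversized payload can never fit
        self.items.append(payload)
        self.size += len(payload)
        while self.size > self.max_bytes:  # evict oldest first (assumed policy)
            self.size -= len(self.items.popleft())

    def flush(self, send) -> int:
        """Send buffered payloads in order; a failing send leaves the rest buffered."""
        sent = 0
        while self.items:
            send(self.items[0])  # may raise; the payload then stays in the buffer
            self.size -= len(self.items.popleft())
            sent += 1
        return sent


buf = PushBuffer(max_bytes=10)
for chunk in (b"aaaa", b"bbbb", b"cccc"):
    buf.add(chunk)
```

With a 10-byte cap, adding three 4-byte payloads evicts the first one, so a later flush delivers only the two most recent payloads in order.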

12. References
