Sockguard

Observability

Prometheus metrics, the active upstream watchdog, the Docker API readiness probe, and W3C trace correlation. Wire Sockguard into Prometheus, Grafana, and your existing tracing pipeline without an OTLP exporter.

Sockguard exposes four observability surfaces, each either opt-in or always-on, with zero external dependencies:

  1. Prometheus /metrics — request counters, deny counters, latency histograms, active-request gauge, build/start gauges, and watchdog state. Opt-in via metrics.enabled.
  2. Active upstream watchdog — periodically dials the Docker socket, logs reachable/unreachable transitions, exports state via /health and metrics. Opt-in via health.watchdog.enabled.
  3. Readiness probe — periodically issues a real GET /containers/json against the Docker API and serves the result at /ready, catching a daemon that accepts connections but no longer answers. Opt-in via health.readiness.enabled.
  4. Trace/log correlation — preserves valid W3C traceparent context or generates a fresh local trace, forwards a proxy-local span ID to Docker, and tags access, audit, and upstream-error logs with trace_id, trace_span_id, trace_parent_id, and trace_sampled. Always on, no knob, no OTLP dependency.

Enable the metrics endpoint

The metrics endpoint is local to Sockguard and is never forwarded to Docker. Like /health, it bypasses Docker API allow rules, but it remains behind listener security and clients.allowed_cidrs.

metrics:
  enabled: true
  path: /metrics      # default; must start with /, must differ from health.path

health:
  enabled: true
  path: /health
  watchdog:
    enabled: true
    interval: 5s      # positive duration; 1s–30s is typical
  readiness:
    enabled: true     # opt-in /ready probe against the Docker API
    path: /ready      # must differ from health.path, metrics.path, admin.path
    interval: 10s     # positive duration
    timeout: 5s       # positive duration; per-probe deadline

Equivalent environment variables:

SOCKGUARD_METRICS_ENABLED=true
SOCKGUARD_METRICS_PATH=/metrics
SOCKGUARD_HEALTH_WATCHDOG_ENABLED=true
SOCKGUARD_HEALTH_WATCHDOG_INTERVAL=5s
SOCKGUARD_HEALTH_READINESS_ENABLED=true
SOCKGUARD_HEALTH_READINESS_PATH=/ready
SOCKGUARD_HEALTH_READINESS_INTERVAL=10s
SOCKGUARD_HEALTH_READINESS_TIMEOUT=5s

metrics.enabled, health.watchdog.enabled, and health.readiness.enabled all default to false, so existing deployments stay quiet until you opt in.
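
The constraints noted in the YAML comments above can be checked before a deploy. The following Python sketch is not part of Sockguard. It assumes the config has already been loaded into a plain dict, that health.path defaults to /health, and that durations use a single unit such as 5s or 500ms. The names parse_duration and check_observability are hypothetical.

```python
import re

# Seconds per unit; only single-unit durations like "5s" are handled here.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a simple duration string such as "5s" to seconds."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", str(text))
    if m is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(m.group(1)) * _UNITS[m.group(2)]


def check_observability(cfg):
    """Return a list of problems found; an empty list means none were found."""
    metrics = cfg.get("metrics", {})
    health = cfg.get("health", {})
    watchdog = health.get("watchdog", {})
    readiness = health.get("readiness", {})
    paths = {
        "metrics.path": metrics.get("path", "/metrics"),
        "health.path": health.get("path", "/health"),  # assumed default
        "health.readiness.path": readiness.get("path", "/ready"),
        "admin.path": cfg.get("admin", {}).get("path"),  # None if unset
    }
    problems = []
    if not paths["metrics.path"].startswith("/"):
        problems.append("metrics.path must start with /")
    # Pairs that the config comments say must differ.
    for a, b in [
        ("metrics.path", "health.path"),
        ("health.readiness.path", "health.path"),
        ("health.readiness.path", "metrics.path"),
        ("health.readiness.path", "admin.path"),
    ]:
        if paths[b] is not None and paths[a] == paths[b]:
            problems.append(f"{a} must differ from {b}")
    durations = {
        "health.watchdog.interval": watchdog.get("interval"),
        "health.readiness.interval": readiness.get("interval"),
        "health.readiness.timeout": readiness.get("timeout"),
    }
    for name, value in durations.items():
        if value is None:
            continue  # unset: Sockguard's own default applies
        try:
            positive = parse_duration(value) > 0
        except ValueError:
            positive = False
        if not positive:
            problems.append(f"{name} must be a positive duration")
    return problems
```

An empty list means none of these particular checks failed; Sockguard's own validation remains the authority at startup and on hot reload.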

Metric reference

| Metric | Type | Labels | Notes |
|---|---|---|---|
| sockguard_build_info | gauge | version, commit, build_date, go_version | Constant 1. Use for version panels and Grafana annotations. |
| sockguard_start_time_seconds | gauge | (none) | Unix epoch when the metrics registry was created. Use for time() - sockguard_start_time_seconds. |
| sockguard_http_requests_total | counter | decision, method, profile, route, status | One per completed request. decision ∈ {allow, deny, would_deny, error}. |
| sockguard_http_denied_requests_total | counter | mode, profile, reason_code, route | Only deny-decision rows. mode ∈ {enforce, warn, audit}. Use for policy-violation alerts. Rule index is in the structured audit log (matched_rule field). |
| sockguard_http_request_duration_seconds | histogram | decision, method, profile, route | Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds. |
| sockguard_http_requests_active | gauge | (none) | Currently in-flight requests. Spikes during streaming endpoints (logs/attach/events). |
| sockguard_upstream_socket_up | gauge | (none) | 1 = reachable, 0 = unreachable. Only emitted when health.watchdog.enabled is true. |
| sockguard_upstream_watchdog_checks_total | counter | result | result ∈ {connected, unreachable}. Increments every interval; ratio is your error budget. |
| sockguard_upstream_api_up | gauge | (none) | 1 = the Docker API answered the readiness probe, 0 = it did not. Only emitted when health.readiness.enabled is true. |
| sockguard_upstream_readiness_checks_total | counter | result | result ∈ {ready, unreachable}. One per readiness probe interval; unreachable covers transport errors and non-2xx API responses. |
| sockguard_throttle_requests_total | counter | profile, reason_code, mode | Throttle denials. reason_code ∈ {rate_limit_exceeded, concurrency_cap, priority_floor}. mode ∈ {enforce, warn, audit}. |
| sockguard_inflight_requests | gauge | profile | In-flight count for profiles with max_inflight configured. Reflects real concurrency, including warn/audit-mode requests that would have been denied. |
| sockguard_policy_version | gauge | (none) | Monotonic policy generation counter. Ticks once at startup and once per successful hot reload. Matches the version field in GET /admin/policy/version. |
| sockguard_config_reload_total | counter | result | Hot-reload outcomes. result ∈ {ok, reject_load, reject_validation, reject_immutable, reject_signature}. |
| sockguard_config_reload_last_success_timestamp_seconds | gauge | (none) | Unix timestamp of the last successful reload. Omitted from scrape output until the first successful reload. |

Label cardinality

route is templated to bound cardinality: container/image IDs collapse to {id}, and namespaced image references like /images/linuxserver/qbittorrent:latest/json collapse to /images/{id}/json. Profile names come from clients.client_certificate_profiles or clients.source_ip_profiles; if neither matches, requests are tagged profile="default".
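
To make the collapsing behaviour concrete, here is a minimal Python sketch of route templating. It is an illustration only, not Sockguard's implementation: the function name template_route, the regular expressions, and the lists of collection endpoints and image actions are assumptions, and it looks at the path only (no query string or API version prefix).

```python
import re

# Collection-level endpoints that must not be mistaken for an ID.
# Illustrative subsets, assumed for this sketch.
_CONTAINER_COLLECTION = {"json", "create", "prune"}
_IMAGE_COLLECTION = {"json", "create", "prune", "search", "load", "get"}

_CONTAINER = re.compile(r"/containers/([^/]+)(/.*)?")
# The image reference before the action may itself contain slashes,
# so the first group is greedy and the action is matched at the end.
_IMAGE_ACTION = re.compile(r"/images/(.+)/(json|history|push|tag|get)")


def template_route(path):
    """Collapse container and image identifiers in a path to {id}."""
    m = _CONTAINER.fullmatch(path)
    if m and m.group(1) not in _CONTAINER_COLLECTION:
        return "/containers/{id}" + (m.group(2) or "")
    m = _IMAGE_ACTION.fullmatch(path)
    if m:
        return "/images/{id}/" + m.group(2)
    # Bare image reference with no trailing action, e.g. a delete.
    rest = path[len("/images/"):] if path.startswith("/images/") else ""
    if rest and rest not in _IMAGE_COLLECTION:
        return "/images/{id}"
    return path
```

Matching the action at the end of the path is what lets a namespaced reference such as linuxserver/qbittorrent:latest collapse to a single {id} segment rather than producing one label value per image.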

reason_code is a bounded enum: matched_deny_rule, no_matching_allow_rule, request_body_policy_denied, request_body_too_large, upstream_socket_unreachable, upstream_response_rejected_by_policy. New codes are added rarely and follow the same naming scheme.

Example scrape config

Sockguard's metrics endpoint serves the Prometheus 0.0.4 text format:

scrape_configs:
  - job_name: sockguard
    metrics_path: /metrics
    static_configs:
      - targets: ['sockguard:2375']

If Sockguard is behind mTLS, point Prometheus at the same TLS material:

scrape_configs:
  - job_name: sockguard
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file:   /etc/prometheus/sockguard-ca.pem
      cert_file: /etc/prometheus/scraper.pem
      key_file:  /etc/prometheus/scraper-key.pem
    static_configs:
      - targets: ['sockguard:2375']

The scraper certificate must be presented to satisfy listen.tls's mTLS requirement; client selectors apply just like any other Sockguard caller.

Useful PromQL

Upstream socket down for more than two scrapes:

sockguard_upstream_socket_up == 0

Use this as the expression of a Prometheus alerting rule with for: 30s, routed to Alertmanager, so a momentary blip during a restart doesn't page.

Deny rate by reason in the last 5 minutes:

sum by (reason_code) (rate(sockguard_http_denied_requests_total[5m]))

A sustained increase in matched_deny_rule usually means a client started making API calls that the policy was never expected to allow — investigate before relaxing the rule.

95th-percentile latency by route:

histogram_quantile(
  0.95,
  sum by (route, le) (rate(sockguard_http_request_duration_seconds_bucket[5m]))
)
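
histogram_quantile estimates the percentile by linear interpolation inside the bucket where the target rank falls, so the answer is only as precise as the bucket boundaries. The Python sketch below reproduces that calculation in simplified form. The function name and the bucket counts are made up for illustration and are not Sockguard data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total),
    like the `le` series of a Prometheus histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Past the last finite bucket: report that bucket's bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation between the bucket's two bounds.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for one route.
sample = [(0.05, 60), (0.1, 90), (0.25, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

With these counts the 95th percentile falls in the 0.1 to 0.25 second bucket, and any request slower than the largest finite bound is reported as that bound. For Sockguard's buckets that ceiling is 10 seconds, which matters for long-lived streaming endpoints.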

Watchdog reachability ratio (last 1h):

sum(rate(sockguard_upstream_watchdog_checks_total{result="connected"}[1h]))
/
sum(rate(sockguard_upstream_watchdog_checks_total[1h]))

A value below 1.0 means the upstream socket flapped during the window even if it currently reads up=1.

Daemon accepting connections but not answering the API (wedged daemon):

sockguard_upstream_api_up == 0 and sockguard_upstream_socket_up == 1

The socket is dialable but the Docker API is not answering the readiness probe: this is the wedged-daemon case that /health alone cannot see. Alert on it from a Prometheus rule with for: 30s, routed to Alertmanager.

Throttle rate by profile and reason:

sum by (profile, reason_code) (rate(sockguard_throttle_requests_total[5m]))

priority_floor denials under a sustained load usually indicate that a low-priority profile is consuming more than its fair share of capacity. Raise its priority tier, lower its max_inflight, or reduce its token budget.

Would-deny rate for profiles in warn/audit mode:

sum by (profile) (
  rate(sockguard_http_denied_requests_total{mode=~"warn|audit"}[5m])
)

A would-deny rate that approaches zero over time means the policy is safe to promote from warn to enforce.

Policy version drift (detect config change without confirmation):

changes(sockguard_policy_version[10m])

A value of 0 over the expected reload window means a SIGHUP or fsnotify event did not produce a successful reload — check sockguard_config_reload_total{result!="ok"}.

Reload rejection breakdown:

sum by (result) (rate(sockguard_config_reload_total[1h]))

reject_signature means the cosign bundle verification failed; reject_immutable means an operator tried to change a listener or TLS field without restarting.

Active upstream watchdog

By default /health answers from a cached upstream probe taken when a real request comes through. That's enough for liveness, but it can lag by minutes when traffic is sparse. Enabling the watchdog flips Sockguard to active monitoring:

  • A goroutine dials the upstream socket every health.watchdog.interval and records the result.
  • State transitions are logged at WARN (unreachable) and INFO (recovered) with upstream_socket, upstream_status, up, and error fields. Steady state produces no log noise.
  • /health returns the latest watchdog snapshot. Reachable: HTTP 200, {"status":"healthy","upstream":"connected",...}. Unreachable: HTTP 503, {"status":"unhealthy","upstream":"unreachable","error":"...",...}.
  • When metrics.enabled is also true, Sockguard exports sockguard_upstream_socket_up (gauge) and sockguard_upstream_watchdog_checks_total{result=...} (counter).

Example watchdog log lines

{"time":"...","level":"WARN","msg":"upstream socket watchdog state changed","upstream_socket":"/var/run/docker.sock","upstream_status":"unreachable","up":false,"error":"dial unix /var/run/docker.sock: connect: no such file or directory"}
{"time":"...","level":"INFO","msg":"upstream socket watchdog state changed","upstream_socket":"/var/run/docker.sock","upstream_status":"connected","up":true}

State-change semantics are once-per-edge: the watchdog logs only when the reachability flips, not on every interval, so log volume scales with outages rather than with the configured interval.
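
The once-per-edge behaviour can be sketched in a few lines of Python. This is an illustration of the idea rather than Sockguard's code: the class name is hypothetical, and it assumes the watchdog starts from a "connected" baseline, which the text above does not specify. The log fields mirror the example lines shown earlier.

```python
class EdgeTriggeredWatchdogLog:
    """Records a log entry only when reachability changes."""

    def __init__(self, upstream_socket="/var/run/docker.sock"):
        self.upstream_socket = upstream_socket
        self.last_up = True  # assumed starting state for this sketch
        self.entries = []

    def observe(self, up, error=None):
        """Feed one probe result; append a log entry only on a flip."""
        if up == self.last_up:
            return  # steady state: no log noise
        self.last_up = up
        entry = {
            "level": "INFO" if up else "WARN",
            "msg": "upstream socket watchdog state changed",
            "upstream_socket": self.upstream_socket,
            "upstream_status": "connected" if up else "unreachable",
            "up": up,
        }
        if error:
            entry["error"] = error
        self.entries.append(entry)


# Six probe intervals containing one three-interval outage.
log = EdgeTriggeredWatchdogLog()
for up in [True, False, False, False, True, True]:
    log.observe(up, error=None if up else "dial failed")
```

Six checks produce two entries, one WARN when the outage starts and one INFO when it ends, regardless of how short the interval is.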

Readiness probe

The watchdog dials the upstream socket — a liveness signal that only proves the socket accepts connections. That misses the failure mode where the daemon keeps accepting connections (so the dial and /_ping stay green) while request handling has wedged and GET /containers/json hangs. The readiness probe closes that gap:

  • A goroutine issues a real GET /containers/json?limit=1 against the upstream Docker API every health.readiness.interval, each call bounded by health.readiness.timeout.
  • /ready (default path; health.readiness.path) returns the latest result. Answering: HTTP 200, {"status":"healthy","upstream":"ready",...}. Wedged or unreachable: HTTP 503, {"status":"unhealthy","upstream":"unreachable",...} — any transport error or non-2xx API response counts as unready.
  • When metrics.enabled is also true, Sockguard exports sockguard_upstream_api_up (gauge) and sockguard_upstream_readiness_checks_total{result=...} (counter).

/health (and the watchdog) remain your liveness signal — "the process is up and the socket is dialable." /ready is your readiness signal — "the daemon is actually answering the API." In Kubernetes, point a liveness probe at /health and a readiness probe at /ready so a wedged daemon drains the pod from the Service endpoints (stop sending it traffic) without triggering a restart loop. Behind a load balancer, route the backend health check to /ready for the same reason.

The health.* block — readiness included — is immutable across hot reload, so changing the probe path, interval, or timeout requires a restart.

Trace and log correlation

Sockguard implements just enough of the W3C Trace Context spec to correlate logs with whatever upstream tracing system you already run. There is no config knob, no OTLP exporter, and no required dependency — Sockguard reads the traceparent header on the way in and writes one on the way out.

Field reference

Every access log line, audit event, and upstream-error log carries:

| Field | When emitted | Meaning |
|---|---|---|
| request_id | Always | 16-byte hex Sockguard-generated request identifier. Independent of W3C trace context. |
| trace_id | Always | 32-hex W3C trace ID. Inherited from a valid incoming traceparent, otherwise generated. |
| trace_span_id | Always | 16-hex span ID Sockguard generates for the proxied request. Always proxy-local. |
| trace_parent_id | Only when incoming traceparent valid | The span ID Sockguard's span is parented to. |
| trace_sampled | Always (boolean) | Inherited sampled flag from the incoming traceparent. Defaults to false. |

Example: caller sends valid traceparent

curl -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
     http://sockguard:2375/version

Resulting access log line:

{
  "msg": "request",
  "method": "GET",
  "path": "/version",
  "request_id": "9b90579488e330dc064063912617ac8f",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "trace_parent_id": "00f067aa0ba902b7",
  "trace_span_id": "f169285fbea26125",
  "trace_sampled": true,
  "status": 200
}

The forwarded request to Docker carries traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-f169285fbea26125-01, so Docker access logs that include trace context (or a sidecar that captures them) join cleanly to the caller's trace.

Example: no inbound trace context

When the caller doesn't send traceparent (or sends a malformed one), Sockguard generates a fresh local trace:

{
  "msg": "request",
  "method": "GET",
  "path": "/containers/json",
  "request_id": "ae3e66f51bef4de76bc87a472d3a01c9",
  "trace_id": "51f8b14be5fc5e2b3c83e9a28c098ac6",
  "trace_span_id": "3f09e482729e5949",
  "trace_sampled": false,
  "status": 200
}

trace_parent_id is omitted in this case because there is no parent span.
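
Both cases can be summarised in a short Python sketch. It is not Sockguard's implementation: the function name correlate is hypothetical, it only understands version 00 of the W3C traceparent format, and it assumes that only the sampled bit of the flags byte is carried forward.

```python
import re
import secrets

# Version 00 traceparent: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def correlate(incoming, span_id=None):
    """Return (log_fields, outgoing_traceparent) for one proxied request.

    `span_id` can be supplied to make the result reproducible; otherwise
    a random 16-hex proxy-local span ID is generated.
    """
    span_id = span_id or secrets.token_hex(8)
    m = _TRACEPARENT_V00.fullmatch(incoming.strip()) if incoming else None
    # All-zero trace or parent IDs are invalid under the W3C spec.
    valid = (
        m is not None
        and m.group(1) != "0" * 32
        and m.group(2) != "0" * 16
    )
    if valid:
        trace_id, parent_id, flags = m.groups()
        sampled = bool(int(flags, 16) & 0x01)
        fields = {
            "trace_id": trace_id,
            "trace_span_id": span_id,
            "trace_parent_id": parent_id,
            "trace_sampled": sampled,
        }
    else:
        # Missing or malformed header: start a fresh local trace.
        trace_id = secrets.token_hex(16)
        sampled = False
        fields = {
            "trace_id": trace_id,
            "trace_span_id": span_id,
            "trace_sampled": sampled,
        }
    flags_out = "01" if sampled else "00"
    return fields, f"00-{trace_id}-{span_id}-{flags_out}"


fields, outgoing = correlate(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    span_id="f169285fbea26125",
)
```

Fed the header from the first example, the sketch yields the same trace_id and trace_parent_id as the access log line and the same outgoing traceparent that is forwarded to Docker.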

Production checklist

  • Set metrics.enabled: true on every deployment that has a Prometheus scraper. The endpoint costs almost nothing when idle and gives you a per-route deny breakdown the moment something starts misbehaving.
  • Set health.watchdog.enabled: true whenever the proxy can outlive a Docker daemon restart (i.e. always). A 5-second interval is fine; lower it only if you need to detect outages in under 5 seconds.
  • Wire sockguard_upstream_socket_up == 0 for 30s into Alertmanager.
  • Set health.readiness.enabled: true and point your orchestrator's readiness check (Kubernetes readiness probe, load-balancer backend health) at /ready, leaving liveness on /health. This drains traffic from a proxy whose daemon has wedged without restart-looping it. Alert on sockguard_upstream_api_up == 0 and sockguard_upstream_socket_up == 1 for 30s.
  • Send Sockguard's structured logs to your existing log pipeline; the trace_id field joins them to whatever tracing system the caller uses.
  • If the same listener serves clients and Prometheus, gate the metrics path with clients.allowed_cidrs so a misbehaving caller can't poll your cardinality budget into oblivion.
  • Enable admin.enabled: true and configure a dedicated admin.listen.socket to get GET /admin/policy/version without exposing admin endpoints to containers that have socket access. The sockguard_policy_version gauge gives you the same counter for alerting, but the endpoint's bundle_signer and config_sha256 fields are only available via the HTTP response.
  • Alert on sockguard_config_reload_total{result!="ok"} so failed hot reloads surface immediately rather than silently running stale policy.
