Observability at Scale: Traces, Cardinality, and Cost

Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

A focused memo on observability at scale (traces, cardinality, and cost). The approach: define the model, state the properties, then design the system so those properties hold under failure and under attack.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
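One way to model the third outcome is to give it its own value in the result type. The sketch below is illustrative only: the Outcome type, the Classify helper, and the two sample errors are hypothetical names, and it assumes that every error other than a deadline or cancellation is a definite rejection, which will not hold for every transport (a connection reset can be just as ambiguous).

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Outcome models the three results a remote call can have.
type Outcome int

const (
	Succeeded Outcome = iota // the callee confirmed the effect
	Failed                   // the callee confirmed nothing was applied
	Unknown                  // timeout or cancellation: the effect may have been applied
)

func (o Outcome) String() string {
	switch o {
	case Succeeded:
		return "succeeded"
	case Failed:
		return "failed"
	default:
		return "unknown"
	}
}

// Sample errors used for the demonstration (hypothetical).
var (
	errTimeout  = fmt.Errorf("rpc: %w", context.DeadlineExceeded)
	errRejected = errors.New("validation rejected")
)

// Classify maps a call error to an Outcome. Deadline and cancellation errors
// are ambiguous because the request may already have reached the server.
// Assumption: all other errors mean the callee definitely rejected the call.
func Classify(err error) Outcome {
	switch {
	case err == nil:
		return Succeeded
	case errors.Is(err, context.DeadlineExceeded), errors.Is(err, context.Canceled):
		return Unknown
	default:
		return Failed
	}
}

func main() {
	fmt.Println(Classify(nil))
	fmt.Println(Classify(errTimeout))
	fmt.Println(Classify(errRejected))
}
```

Callers can then be forced to handle Unknown explicitly instead of folding it into the failure path.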

Key takeaways

Provenance is a cryptographic statement; ship evidence with artifacts.
Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
Short-lived credentials (OIDC) beat long-lived tokens in pipelines.
Define safety properties before performance goals.
Write assumptions down; treat them as interfaces.

Why this matters

Infrastructure-as-code without policy is just scripting the attack surface.
Policy drift is the default; guardrails must be automated and enforced.
Reproducibility is how you know what you shipped is what you built.
Runtime security needs evidence pipelines, not just dashboards.

Key questions

What is the minimum set of humans who can ship to production?
Where do you enforce policy (pre-merge, build, deploy, runtime)?
How do you rehearse incident response as code (runbooks, chaos, drills)?
How do you do safe rollouts (canary, blast-radius, rapid rollback)?
Which signals prove correctness (not just availability) in production?
What is your supply-chain threat model (dependency poisoning, CI compromise)?

Assumptions

Policy enforcement must be consistent across environments.
Dependencies can be compromised upstream (typosquatting, maintainer takeover).
CI runners are exposed to untrusted code (PRs, dependencies).
Observability pipelines can be attacked (log injection, PII leaks).
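The log-injection assumption can be made concrete with a small sanitizer. This is a minimal sketch, not a complete defense: SanitizeLogField is a hypothetical helper, it only neutralizes line breaks and other control characters, and it does nothing about PII.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// SanitizeLogField neutralizes characters that let untrusted input forge
// extra log lines or smuggle terminal escape sequences into a log viewer.
func SanitizeLogField(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch {
		case r == '\n':
			b.WriteString(`\n`) // keep the newline visible but inert
		case r == '\r':
			b.WriteString(`\r`)
		case unicode.IsControl(r):
			b.WriteString("?") // replace other control characters with a marker
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Untrusted input that tries to start a forged log entry on a new line.
	user := "alice\nlevel=error msg=forged"
	fmt.Println("login user=" + SanitizeLogField(user))
}
```

Structured logging with typed fields avoids much of this problem; escaping is the fallback when free-form text must be logged.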

Non-goals

Manual policy enforcement or manual security review as the only control.
Assuming deploy equals success without runtime evidence.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them: cap label cardinality, escape untrusted log fields, and keep PII out of telemetry.
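A cardinality explosion is cheap to trigger when a label value comes from user input (a path, a user ID, a header). One possible guard is a per-label cap with an overflow bucket. The sketch below assumes a single-goroutine caller; LabelLimiter, Admit, and the "__other__" bucket name are hypothetical.

```go
package main

import "fmt"

// Overflow is the bucket that absorbs values beyond the cap (illustrative name).
const Overflow = "__other__"

// LabelLimiter caps the number of distinct values accepted for one metric label.
// It is not safe for concurrent use; guard it with a mutex in real code.
type LabelLimiter struct {
	limit int
	seen  map[string]struct{}
}

func NewLabelLimiter(limit int) *LabelLimiter {
	return &LabelLimiter{limit: limit, seen: make(map[string]struct{})}
}

// Admit returns the value to record: the input if it is already known or the
// cap has not been reached, otherwise the Overflow bucket.
func (l *LabelLimiter) Admit(v string) string {
	if _, ok := l.seen[v]; ok {
		return v
	}
	if len(l.seen) >= l.limit {
		return Overflow
	}
	l.seen[v] = struct{}{}
	return v
}

func main() {
	lim := NewLabelLimiter(2)
	for _, route := range []string{"/login", "/cart", "/user/8f3a", "/login", "/user/91bc"} {
		fmt.Println(lim.Admit(route))
	}
}
```

The cap bounds both memory in the process and series count downstream, which is where the cost usually lands.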

Model & invariants

Build provenance is a cryptographic statement:

\mathrm{attest} \leftarrow \mathrm{Sign}_{k_\text{build}}(\mathrm{hash}(\text{artifact})\ \Vert\ \text{metadata}).

Make provenance verifiable: “what built this” must be cryptographically bound.
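The attestation formula above can be sketched with the Go standard library. Assumptions, all illustrative: SHA-256 as the hash, Ed25519 as the signature scheme, a raw byte concatenation for hash(artifact) ∥ metadata, and the helper names NewBuildKey, Attest, and Verify. A production system would more likely use a structured attestation format and a managed key.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// NewBuildKey generates a throwaway key pair for the demonstration.
func NewBuildKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// payload builds hash(artifact) || metadata, the message that gets signed.
// The digest has a fixed length, so the concatenation is unambiguous.
func payload(artifact, metadata []byte) []byte {
	digest := sha256.Sum256(artifact)
	return append(digest[:], metadata...)
}

// Attest signs the payload with the build key.
func Attest(buildKey ed25519.PrivateKey, artifact, metadata []byte) []byte {
	return ed25519.Sign(buildKey, payload(artifact, metadata))
}

// Verify recomputes the payload and checks the signature against it.
func Verify(pub ed25519.PublicKey, artifact, metadata, sig []byte) bool {
	return ed25519.Verify(pub, payload(artifact, metadata), sig)
}

func main() {
	pub, priv := NewBuildKey()
	artifact := []byte("binary contents")
	meta := []byte(`{"builder":"ci-runner-1"}`) // hypothetical metadata
	sig := Attest(priv, artifact, meta)

	fmt.Println("original verifies:", Verify(pub, artifact, meta, sig))
	fmt.Println("tampered verifies:", Verify(pub, []byte("binary contents!"), meta, sig))
}
```

Because the metadata is inside the signed message, changing either the artifact or the claimed builder invalidates the attestation.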

Policy should be code with diffs and reviews—guardrails, not guidelines.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
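A small sketch of what "observable" can mean here: every check of an invariant increments a counter when it fails, and the counter is exported as a metric. The Account type, its non-negative balance invariant, and the function names are hypothetical examples, not part of any real system described in this note.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// invariantViolations counts how often an "impossible" state was observed.
var invariantViolations atomic.Int64

// Account is a toy domain object; the invariant is Balance >= 0.
type Account struct{ Balance int64 }

// CheckInvariant reports whether the invariant holds and records a
// violation when it does not.
func CheckInvariant(a Account) bool {
	if a.Balance < 0 {
		invariantViolations.Add(1)
		return false
	}
	return true
}

// Violations returns the current count, suitable for export as a metric.
func Violations() int64 { return invariantViolations.Load() }

func main() {
	CheckInvariant(Account{Balance: 10})
	CheckInvariant(Account{Balance: -5}) // should never happen; now it is counted
	fmt.Println("invariant violations:", Violations())
}
```

An alert on any nonzero increase of this counter turns silent drift into a page.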

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.
Integrity: invalid transitions are rejected (and detectable).
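The integrity property can be illustrated with an explicit transition table: anything not listed is rejected with an error that can be logged and counted. The states below borrow the stage names used elsewhere in this note; the State type, the table, and Transition are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a stage an artifact can be in.
type State string

const (
	Built    State = "built"
	Attested State = "attested"
	Scanned  State = "scanned"
	Deployed State = "deployed"
)

// allowed lists the only legal transitions; anything else is rejected.
var allowed = map[State][]State{
	Built:    {Attested},
	Attested: {Scanned},
	Scanned:  {Deployed},
}

// ErrInvalidTransition lets callers detect rejections with errors.Is.
var ErrInvalidTransition = errors.New("invalid transition")

// Transition returns nil for a legal move and a wrapped error otherwise.
func Transition(from, to State) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
}

func main() {
	fmt.Println(Transition(Built, Attested)) // legal move
	fmt.Println(Transition(Built, Deployed)) // skips attestation and scanning
}
```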

Failure modes

Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
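The double-apply failure mode is commonly addressed with idempotency keys: the client picks a key per logical operation, and the server applies each key at most once. A minimal in-memory sketch, assuming a single process and no persistence; Ledger and Credit are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger applies each operation at most once, keyed by a client-chosen
// idempotency key.
type Ledger struct {
	mu      sync.Mutex
	balance int64
	applied map[string]int64 // key -> balance recorded at first application
}

func NewLedger() *Ledger {
	return &Ledger{applied: make(map[string]int64)}
}

// Credit applies amount once per key. A retry with the same key replays the
// stored result instead of applying the credit again.
func (l *Ledger) Credit(key string, amount int64) int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	if result, ok := l.applied[key]; ok {
		return result
	}
	l.balance += amount
	l.applied[key] = l.balance
	return l.balance
}

func main() {
	l := NewLedger()
	fmt.Println(l.Credit("req-1", 50)) // first attempt; suppose the response is lost
	fmt.Println(l.Credit("req-1", 50)) // retry after the timeout: not applied twice
	fmt.Println(l.Credit("req-2", 25)) // a new operation is applied normally
}
```

With this in place, a client that sees a timeout can retry with the same key and the ambiguity resolves to exactly one application.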

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart LR
  src["Source"] --> build["Build (reproducible)"]
  build --> attest["Attestation"]
  attest --> scan["SAST/DAST/SCA"]
  scan --> deploy["Deploy (policy gates)"]
  deploy --> runtime["Runtime Policy + Observability"]

Implementation notes

Build systems that can prove what happened after an incident.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

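// Treat CI as untrusted: keep tokens short-lived and scoped.
type Token struct {
  Value         string // opaque credential; never log or echo it
  ExpiresAtUnix int64  // expiry in Unix seconds; keep lifetimes to minutes, not days
  Scope         string // the single purpose this token is valid for
}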
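A token type only helps if every use checks both expiry and scope. The sketch below repeats the struct so it compiles on its own and adds a hypothetical ValidAt method; the scope strings and the five-minute lifetime are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Token mirrors the struct above.
type Token struct {
	Value         string
	ExpiresAtUnix int64
	Scope         string
}

// ValidAt reports whether the token is unexpired at nowUnix and was issued
// for exactly the requested scope. Scope matching is a plain equality check
// here; real systems may need a richer comparison.
func (t Token) ValidAt(nowUnix int64, scope string) bool {
	return nowUnix < t.ExpiresAtUnix && t.Scope == scope
}

func main() {
	now := time.Now().Unix()
	tok := Token{Value: "opaque", ExpiresAtUnix: now + 300, Scope: "deploy:staging"}

	fmt.Println(tok.ValidAt(now, "deploy:staging"))     // unexpired, matching scope
	fmt.Println(tok.ValidAt(now, "deploy:prod"))        // wrong scope
	fmt.Println(tok.ValidAt(now+600, "deploy:staging")) // past expiry
}
```

Passing the current time as an argument keeps the check deterministic and easy to test.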

Verification strategy

Dependency tampering drills: lockfile changes, integrity failures.
Runtime conformance: detect drift between desired and actual state.
Pipeline attack simulations: compromise a runner and measure blast radius.
Policy tests: unit tests for policy-as-code rules.
Rollback tests as part of release (not “if needed”).
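Policy tests are easiest when a policy is a pure function from a request to a list of violations. A minimal sketch with two made-up rules; the Deploy fields, the rule wording, and Evaluate are hypothetical and stand in for whatever policy engine is actually in use.

```go
package main

import "fmt"

// Deploy is the input a policy decision is made on (illustrative fields).
type Deploy struct {
	Attested   bool   // a verified provenance attestation is attached
	Privileged bool   // the workload requests elevated privileges
	Env        string // target environment, e.g. "staging" or "prod"
}

// Evaluate returns the violated rules; an empty result means the deploy is allowed.
func Evaluate(d Deploy) []string {
	var violations []string
	if !d.Attested {
		violations = append(violations, "missing attestation")
	}
	if d.Privileged && d.Env == "prod" {
		violations = append(violations, "privileged workload in prod")
	}
	return violations
}

func main() {
	// Table-driven cases double as documentation of the policy.
	cases := []struct {
		name string
		in   Deploy
	}{
		{"attested prod deploy", Deploy{Attested: true, Env: "prod"}},
		{"unattested staging deploy", Deploy{Env: "staging"}},
		{"privileged unattested prod deploy", Deploy{Privileged: true, Env: "prod"}},
	}
	for _, c := range cases {
		fmt.Printf("%s: %d violation(s) %v\n", c.name, len(Evaluate(c.in)), Evaluate(c.in))
	}
}
```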

Operational notes

Continuously scan and inventory dependencies; prioritize by exposure.
Keep a provenance trail for every artifact deployed to production.
Rehearse incident response for the pipeline itself.
Treat policy changes as security-sensitive deploys (review + rollout).
Audit who can ship and how; remove implicit paths.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Admission-control / rate-limit rejections (by reason).
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
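Error budget burn can be expressed as a ratio: the observed error rate divided by the error rate the SLO allows. A minimal sketch, assuming a request-count SLO; BurnRate and the sample numbers are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// BurnRate returns how fast the error budget is being consumed: the observed
// error ratio divided by the allowed error ratio (1 - slo). A value of 1 means
// the budget is being spent exactly at the sustainable pace.
func BurnRate(failed, total int64, slo float64) float64 {
	if total == 0 {
		return 0 // no traffic, nothing burned
	}
	allowed := 1 - slo
	if allowed <= 0 {
		return math.Inf(1) // an SLO of 100% leaves no budget at all
	}
	return (float64(failed) / float64(total)) / allowed
}

func main() {
	// 50 failures in 1000 requests against a 99% SLO: roughly five times
	// the sustainable pace.
	fmt.Printf("burn rate: %.1f\n", BurnRate(50, 1000, 0.99))
}
```

Alerting on burn rate over more than one window (a short one and a long one) is a common way to catch both fast and slow burns.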

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
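An explicit rollback trigger can be as small as a function from canary signals to a decision plus a reason. The sketch below is illustrative: the Signals and Thresholds fields, the threshold values, and ShouldRollback are assumptions, not a recommended set.

```go
package main

import "fmt"

// Signals are the measurements taken from the canary (illustrative fields).
type Signals struct {
	ErrorRate           float64 // fraction of failed requests
	P99Millis           float64 // tail latency in milliseconds
	InvariantViolations int64   // count of observed invariant violations
}

// Thresholds define when a rollout must stop.
type Thresholds struct {
	MaxErrorRate float64
	MaxP99Millis float64
}

// ShouldRollback returns true and the first breached condition, or false and
// an empty reason. Any invariant violation triggers a rollback on its own.
func ShouldRollback(s Signals, t Thresholds) (bool, string) {
	switch {
	case s.InvariantViolations > 0:
		return true, "invariant violation observed"
	case s.ErrorRate > t.MaxErrorRate:
		return true, "error rate above threshold"
	case s.P99Millis > t.MaxP99Millis:
		return true, "p99 latency above threshold"
	default:
		return false, ""
	}
}

func main() {
	limits := Thresholds{MaxErrorRate: 0.01, MaxP99Millis: 500}
	rollback, reason := ShouldRollback(Signals{ErrorRate: 0.03, P99Millis: 220}, limits)
	fmt.Println(rollback, reason)
}
```

Returning the reason alongside the decision gives the rollback event the context that the monitoring section asks for.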

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

What is the smallest CI compromise that becomes a prod compromise today?
Which deploy actions are irreversible and how do you mitigate that?
Can you answer “what code is running” with cryptographic evidence?
How quickly can you revoke all pipeline credentials in an incident?

Checklist

Telemetry captures correctness signals.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
