Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

Treat anomaly detection, and what "baseline" means in industrial systems, as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Gateways are security boundaries; isolate blast radius and enforce policy early.
  • Design for power loss and intermittent links; recovery is the primary feature.
  • Secure updates need rollback protection and staged rollout with safety rails.
  • Bind security decisions to evidence (audit, invariants, telemetry).
  • Write assumptions down; treat them as interfaces.

Why this matters

  • Identity and freshness are the foundation of telemetry integrity.
  • Operational constraints (bandwidth, CPU) drive protocol choices.
  • Edge systems fail differently: power loss, intermittent links, and physical access.
  • Adversaries can replay and spoof data to mislead control planes.

Key questions

  • How do you prevent replay and reordering from becoming false control signals?
  • What is your offline behavior (safe mode vs degraded mode)?
  • What does incident response look like at fleet scale?
  • How do devices enroll securely (no shared secrets, minimal manual steps)?
  • Where do you terminate trust (device, gateway, cloud) and why?
  • How do you handle intermittent connectivity without corrupting state?

Assumptions

  • Connectivity is intermittent and high-latency; retries amplify costs.
  • Devices experience power loss and abrupt restarts.
  • Gateways can be compromised; isolate blast radius.
  • Firmware updates can fail mid-flight; partial installation is possible.

Non-goals

  • Relying on the cloud to enforce edge-local safety properties.
  • Treating identity as a static certificate file.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

Fleet rollout safety is a monotone constraint:

$$\text{rollout}(v_{k+1}) \Rightarrow \text{can\_rollback}(v_k)\ \wedge\ \text{telemetry\_healthy}$$
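
The constraint can be made executable as a gate that the rollout controller calls before each stage. This is a minimal sketch, not a definitive implementation: `FleetState`, its fields, and `may_rollout` are hypothetical names, and how `telemetry_healthy` is computed is left to whatever health signal the fleet already trusts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetState:
    """Hypothetical snapshot of what the rollout controller knows."""
    current_version: int      # v_k, the version running today
    rollback_available: bool  # a bootable v_k image will survive the update
    telemetry_healthy: bool   # outcome of the fleet health check


def may_rollout(state: FleetState, target_version: int) -> bool:
    """Return True only if rolling out target_version keeps the invariant."""
    # Monotone: versions only move forward through this gate.
    if target_version <= state.current_version:
        return False
    # rollout(v_{k+1}) implies can_rollback(v_k) AND telemetry_healthy.
    return state.rollback_available and state.telemetry_healthy
```

A controller would call `may_rollout` per cohort and stop the stage on the first `False`.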

Use monotonic counters when time is untrusted; combine with nonces and bounded windows.
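
As an illustration, the counter and bounded-window parts can be sketched as a sliding-window replay check, similar in spirit to the anti-replay windows used by some transport protocols. `ReplayGuard` and its window size are assumptions made for this example; nonce handling (for challenge-response freshness) is not shown and would be layered on top.

```python
class ReplayGuard:
    """Per-device replay check: monotonic counter plus a bounded window."""

    def __init__(self, window=64):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.highest = -1   # highest counter accepted so far
        self.seen = set()   # accepted counters still inside the window

    def accept(self, counter):
        """Return True if the message is fresh, False if replayed or too old."""
        if counter < 0:
            return False
        # Too old: at or below the window floor, so we cannot prove freshness.
        if counter <= self.highest - self.window:
            return False
        # Duplicate inside the window: a replay.
        if counter in self.seen:
            return False
        self.seen.add(counter)
        if counter > self.highest:
            self.highest = counter
            floor = self.highest - self.window
            # Prune so memory stays bounded by the window size.
            self.seen = {c for c in self.seen if c > floor}
        return True
```

The window tolerates limited reordering on lossy links while keeping state per device bounded. The counter itself would need to be persisted across power loss for the guarantee to hold.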

Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
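
One way to keep the lifecycle honest is to encode it as an explicit state machine that rejects invalid transitions. The sketch below is illustrative only: the state names follow the lifecycle above, but the `ALLOWED` transition table (for example, that a rotated identity must re-attest, or that revocation is reachable from any active state) is an assumption, not something the note prescribes.

```python
from enum import Enum


class IdentityState(Enum):
    PROVISIONED = "provisioned"
    ATTESTED = "attested"
    ROTATED = "rotated"
    REVOKED = "revoked"
    FORENSICS = "forensics"


# Hypothetical transition table; adjust to the fleet's actual policy.
ALLOWED = {
    IdentityState.PROVISIONED: {IdentityState.ATTESTED, IdentityState.REVOKED},
    IdentityState.ATTESTED: {IdentityState.ROTATED, IdentityState.REVOKED},
    IdentityState.ROTATED: {IdentityState.ATTESTED, IdentityState.REVOKED},
    IdentityState.REVOKED: {IdentityState.FORENSICS},
    IdentityState.FORENSICS: set(),
}


class InvalidTransition(Exception):
    """Raised when a lifecycle step is not permitted."""


def advance(current: IdentityState, target: IdentityState) -> IdentityState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Raising on a bad transition, rather than silently ignoring it, gives the rejected attempt a place to be logged and counted.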

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
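
A minimal sketch of what that could look like in code, assuming an in-process counter stands in for a real metrics client. `violations` and `check_invariant` are hypothetical names.

```python
from collections import Counter

# Stand-in for a metrics backend: invariant name -> violation count.
violations = Counter()


def check_invariant(name, holds):
    """Record a violation when an invariant fails; return the check result."""
    if not holds:
        violations[name] += 1  # an alert would fire on any non-zero count
    return holds
```

A call such as `check_invariant("counter_monotonic", new > old)` then turns a silent assumption into a countable signal.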

Security properties

  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Authenticity: actions are bound to identity and purpose.
  • Integrity: invalid transitions are rejected (and detectable).
  • Replay resistance: duplicated inputs do not change outcomes.

Failure modes

  • Config drift that weakens security posture over time.
  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Observability gaps during incidents (missing evidence).
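
For the timeout-ambiguity case, a common mitigation is to make commands idempotent by keying them on a caller-supplied ID, so a retry after an ambiguous timeout returns the earlier result instead of applying twice. The sketch below assumes a hypothetical `CommandLog` held in memory; a real device would need to persist it across restarts and bound its size.

```python
class CommandLog:
    """Remembers results by command ID so retries do not re-apply."""

    def __init__(self):
        # In-memory only for illustration; persist and cap this in practice.
        self._results = {}

    def apply(self, command_id, action):
        """Run action once per command_id; return the stored result on retry."""
        if command_id in self._results:
            return self._results[command_id]
        result = action()
        self._results[command_id] = result
        return result
```

If the device can lose power between running `action` and recording the result, the two steps must be made atomic (or the action itself idempotent) for the guarantee to hold.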

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  dev["Device (identity + attestation)"] --> gw["Gateway"]
  gw --> bus["Message Bus"]
  bus --> ingest["Ingestion"]
  ingest --> tsdb["Time-Series Store"]
  tsdb --> apps["Analytics / Control Plane"]

Implementation notes

Treat the gateway as a security boundary, not a dumb proxy.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
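
A minimal sketch of this rule for a JSON telemetry payload, assuming hypothetical limits (`MAX_BODY_BYTES`, `MAX_FIELDS`) and a hypothetical `Rejected` exception. The size check runs before any parsing, so an oversized body costs only a length comparison.

```python
import json

MAX_BODY_BYTES = 4096  # assumed cap; tune per device class
MAX_FIELDS = 32        # assumed cap on top-level keys


class Rejected(Exception):
    """Raised when a payload is refused before heavy processing."""


def parse_telemetry(raw: bytes) -> dict:
    """Validate size and shape of a payload, then return the parsed object."""
    # Cap cost first: reject oversized bodies without parsing them.
    if len(raw) > MAX_BODY_BYTES:
        raise Rejected("body too large")
    try:
        doc = json.loads(raw)
    except (ValueError, RecursionError) as exc:
        # ValueError covers malformed JSON and bad encodings;
        # RecursionError covers pathologically deep nesting.
        raise Rejected("malformed JSON") from exc
    if not isinstance(doc, dict) or len(doc) > MAX_FIELDS:
        raise Rejected("unexpected shape")
    return doc
```

Counting `Rejected` by reason gives the gateway a rejection metric for free.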

Firmware update safety checklist:

  • Signed manifest with version + hash
  • Rollback protection (anti-downgrade)
  • A/B partitions or staged apply
  • Health check + watchdog
  • Telemetry proves rollout state
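
The first two checklist items can be sketched as a pre-install check. This is illustrative only: `Manifest`, `verify_image`, and `min_allowed_version` are hypothetical names, and the sketch assumes the manifest's signature has already been verified by a separate step that is not shown here.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest; assumed to be signature-verified upstream."""
    version: int
    sha256_hex: str


def verify_image(manifest: Manifest, image: bytes, min_allowed_version: int) -> bool:
    """Return True if the image matches the manifest and is not a downgrade."""
    # Rollback protection: refuse anything below the stored minimum version.
    if manifest.version < min_allowed_version:
        return False
    digest = hashlib.sha256(image).hexdigest()
    # Constant-time comparison of the computed and declared hashes.
    return hmac.compare_digest(digest, manifest.sha256_hex.lower())
```

On real hardware `min_allowed_version` would typically live in tamper-resistant storage and only ever increase, after the new image has passed its health check.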

Verification strategy

  • Power-loss fault injection during flash writes and installs.
  • Scale tests: provisioning bursts, reconnect storms, gateway failures.
  • Replay/reorder simulations for telemetry and control messages.
  • Key rotation drills across device + gateway + cloud.
  • Hardware-in-the-loop tests for update and recovery paths.

Operational notes

  • Treat time sync alerts as security signals (NTP manipulation).
  • Design rollouts to be interruptible and reversible.
  • Make revocation fast: emergency disable, quarantine, and re-enrollment.
  • Monitor fleet health by cohort (version, region, gateway).
  • Maintain an identity inventory: device → cert/keys → firmware version.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Rollback events and the conditions that triggered them.
  • Authz failures and policy denials (unexpected spikes).
  • Error budget burn + tail latency under load.

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Use canaries and staged rollout; stop early when signals degrade.
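
An explicit rollback trigger can be as small as a function from observed metrics to a list of reasons. The sketch below is one possible shape, not a recommendation: the metric names and the default thresholds in `RollbackThresholds` are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackThresholds:
    """Placeholder limits; real values come from the service's SLOs."""
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 500.0
    max_invariant_violations: int = 0


def rollback_reasons(error_rate: float, p99_latency_ms: float,
                     invariant_violations: int,
                     limits: RollbackThresholds = RollbackThresholds()) -> List[str]:
    """Return the names of breached thresholds; an empty list means continue."""
    reasons = []
    if error_rate > limits.max_error_rate:
        reasons.append("error_rate")
    if p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if invariant_violations > limits.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons
```

Returning the reasons rather than a bare boolean means the rollback event can be logged together with the condition that triggered it.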

Evidence

  • Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
  • Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

  • What is the blast radius of a compromised gateway?
  • How quickly can you revoke a compromised device identity globally?
  • Which messages are allowed to cause physical effects and under what conditions?
  • What does “safe behavior” mean when the cloud is unreachable?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Failure modes enumerated with mitigations.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.
  • Assumptions listed and reviewed.
  • Rollback plan rehearsed and automated.

Further reading

1. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/
2. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/