Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

Treat the integration of safety-critical and security-critical concerns as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Device identity is a lifecycle: provision → attest → rotate → revoke → forensics.
  • Replay protection must not rely on wall-clock time alone (counters + windows).
  • Gateways are security boundaries; isolate blast radius and enforce policy early.
  • Define safety properties before performance goals.
  • Measure correctness signals, not only latency/throughput.

Why this matters

  • Identity and freshness are the foundation of telemetry integrity.
  • Gateways become choke points; design them as security boundaries.
  • Fleet-scale updates turn bugs into global incidents; rollback must be engineered.
  • Edge systems fail differently: power loss, intermittent links, and physical access.

Key questions

  • How do you prevent replay and reordering from becoming false control signals?
  • Where do you terminate trust (device, gateway, cloud) and why?
  • How do you handle intermittent connectivity without corrupting state?
  • How do you provision identity and rotate it over years?
  • How do you do secure updates (rollback protection, staged rollout, recovery)?
  • What does incident response look like at fleet scale?

Assumptions

  • Devices experience power loss and abrupt restarts.
  • Time sync is weak; clocks drift and may be manipulated.
  • Some devices are physically accessible to attackers.
  • Gateways can be compromised; isolate blast radius.

Non-goals

  • Assuming perfect time synchronization at the edge.
  • Assuming firmware updates always complete successfully.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
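One way to bound per-request work at a gateway is to cap payload size and rate-limit each device before doing anything expensive. The sketch below is illustrative only: `MAX_PAYLOAD_BYTES`, `TokenBucket`, and `admit` are hypothetical names, and the caller is assumed to pass `now` from a monotonic clock.

```python
from dataclasses import dataclass

MAX_PAYLOAD_BYTES = 4096  # hypothetical cap; tune per message type


@dataclass
class TokenBucket:
    """Per-device token bucket: refills at `rate` tokens/second up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time; ignore negative elapsed time.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(bucket: TokenBucket, payload: bytes, now: float) -> str:
    """Cheap checks first, so rejected requests cost almost nothing."""
    if len(payload) > MAX_PAYLOAD_BYTES:
        return "reject:too_large"
    if not bucket.allow(now):
        return "reject:rate_limited"
    return "accept"
```

Returning a reason string rather than a bare boolean lets rejections be counted by reason.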

Model & invariants

At the edge, identity and freshness are everything. A typical anti-replay constraint:

\[
\text{accept}(m) \Rightarrow \mathrm{nonce}(m) \notin \mathrm{Seen} \ \wedge\ \mathrm{ts}(m) \in [t-\Delta,\ t+\Delta].
\]
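A minimal sketch of the anti-replay constraint above, reading t as the receiver's current time, Δ as the freshness window, and Seen as the set of nonces already accepted. `ReplayGuard` and its parameters are hypothetical. It assumes the message signature (which binds nonce and timestamp together) has already been verified, and that `now` never moves backwards. Because that last assumption is weak at the edge, a per-device counter is still worth adding alongside it.

```python
class ReplayGuard:
    """Accept a message only if its nonce is unseen and its timestamp is fresh."""

    def __init__(self, delta: float, max_nonces: int = 10_000):
        self.delta = delta            # freshness window, in seconds
        self.max_nonces = max_nonces  # hard bound on memory use
        self._seen = {}               # nonce -> timestamp of the accepted message

    def accept(self, nonce: bytes, ts: float, now: float) -> bool:
        # Freshness: ts must lie within [now - delta, now + delta].
        if not (now - self.delta <= ts <= now + self.delta):
            return False
        self._evict(now)
        if nonce in self._seen:
            return False              # replay
        if len(self._seen) >= self.max_nonces:
            return False              # fail closed rather than forget nonces
        self._seen[nonce] = ts
        return True

    def _evict(self, now: float) -> None:
        # A stored nonce whose timestamp is older than now - delta can be dropped:
        # a replay of that message would fail the freshness check anyway.
        stale = [n for n, ts in self._seen.items() if ts < now - self.delta]
        for n in stale:
            del self._seen[n]
```

For example, with `delta=5.0`, a message accepted at time 100 is rejected when replayed at time 101 (nonce seen) and again at time 200 (stale timestamp).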

Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
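The lifecycle can be made executable as a small state machine that rejects invalid transitions and records every valid one. This is a sketch under assumptions: the stage names mirror the lifecycle steps, but the allowed-transition table (for example, re-attesting after rotation, or revoking from any live stage) is an illustrative choice, not something the note specifies.

```python
from enum import Enum, auto


class Stage(Enum):
    PROVISIONED = auto()
    ATTESTED = auto()
    ROTATED = auto()
    REVOKED = auto()
    FORENSICS = auto()


# Hypothetical transition table; adjust to your own policy.
ALLOWED = {
    Stage.PROVISIONED: {Stage.ATTESTED, Stage.REVOKED},
    Stage.ATTESTED: {Stage.ROTATED, Stage.REVOKED},
    Stage.ROTATED: {Stage.ATTESTED, Stage.REVOKED},  # re-attest with the new key
    Stage.REVOKED: {Stage.FORENSICS},
    Stage.FORENSICS: set(),
}


class DeviceIdentity:
    def __init__(self, device_id: str):
        self.device_id = device_id
        self.stage = Stage.PROVISIONED
        self.audit = [(None, Stage.PROVISIONED)]  # (from, to) pairs, append-only

    def advance(self, target: Stage) -> None:
        """Move to `target`, or raise ValueError if the transition is not allowed."""
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"{self.device_id}: {self.stage.name} -> {target.name} not allowed")
        self.audit.append((self.stage, target))
        self.stage = target
```

Once a device is revoked, the only way forward is forensics. Returning it to service would mean provisioning a new identity.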

Define safe modes explicitly: what do devices do when policy can’t be fetched?

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Downgrade resistance: negotiation can’t silently weaken security posture.

Failure modes

  • Timeout ambiguity causing double-apply or partial state transitions.
  • Config drift that weakens security posture over time.
  • Mixed-version behavior that violates assumptions silently.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
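Timeout ambiguity is commonly handled by making commands idempotent: the sender retries with the same command ID, and the receiver applies it at most once. The sketch below is illustrative. `CommandApplier` and the `setpoint` field are hypothetical, and on a real device the result table and the state change would need to be committed durably and atomically to survive power loss.

```python
class CommandApplier:
    """Apply each command ID at most once; retries return the recorded result."""

    def __init__(self):
        self.setpoint = 0
        self._results = {}  # command_id -> setpoint after that command was applied

    def apply(self, command_id: str, delta: int) -> int:
        if command_id in self._results:
            # Duplicate delivery (e.g. a retry after a timeout): do not re-apply.
            return self._results[command_id]
        self.setpoint += delta
        self._results[command_id] = self.setpoint
        return self.setpoint
```

A retried `apply("c1", 5)` returns the same value as the first call and leaves `setpoint` unchanged.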

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

sequenceDiagram
  participant D as Device
  participant G as Gateway
  participant C as Cloud
  D->>G: telemetry(nonce, ctr, sig)
  G->>C: forward + policy tags
  C-->>G: update policy
  G-->>D: commands (bounded)

Implementation notes

Edge security is about recovery: safe defaults, staged updates, and fast revocation.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Firmware update safety checklist:
  • Signed manifest with version + hash
  • Rollback protection (anti-downgrade)
  • A/B partitions or staged apply
  • Health check + watchdog
  • Telemetry proves rollout state
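The first two checklist items can be sketched as a single verification step run before any flash write. Loud assumption: HMAC-SHA256 stands in for the manifest signature so the example stays within the standard library, whereas a real fleet would typically use an asymmetric signature so devices hold no signing secret. The manifest fields (`version`, `sha256`) and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    # Canonical JSON so signer and verifier hash identical bytes.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_update(manifest: dict, signature: str, image: bytes,
                  key: bytes, installed_version: int) -> str:
    """Return "accept" or a rejection reason; check authenticity before trusting fields."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return "reject:bad_signature"
    # Anti-downgrade: only strictly newer versions are installable.
    if manifest["version"] <= installed_version:
        return "reject:downgrade"
    image_hash = hashlib.sha256(image).hexdigest()
    if not hmac.compare_digest(image_hash, manifest["sha256"]):
        return "reject:hash_mismatch"
    return "accept"
```

The signature is checked first so that the version and hash fields are only read from a manifest already known to be authentic.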

Verification strategy

  • Replay/reorder simulations for telemetry and control messages.
  • Power-loss fault injection during flash writes and installs.
  • Hardware-in-the-loop tests for update and recovery paths.
  • Key rotation drills across device + gateway + cloud.
  • Scale tests: provisioning bursts, reconnect storms, gateway failures.

Operational notes

  • Treat time sync alerts as security signals (NTP manipulation).
  • Monitor fleet health by cohort (version, region, gateway).
  • Maintain an identity inventory: device → cert/keys → firmware version.
  • Design rollouts to be interruptible and reversible.
  • Make revocation fast: emergency disable, quarantine, and re-enrollment.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Admission-control / rate-limit rejections (by reason).
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).

Rollback plan

  • Define an explicit rollback trigger (metrics + thresholds).
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
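An explicit rollback trigger can be written as a pure function from cohort metrics to a list of reasons, so the decision is testable and auditable. This is a sketch under assumptions: the metric names and `RollbackThresholds` are hypothetical, and treating missing metrics as a trigger is one possible fail-safe policy, not the only one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RollbackThresholds:
    max_error_rate: float
    max_invariant_violations: int
    min_healthy_fraction: float


REQUIRED = ("error_rate", "invariant_violations", "healthy_fraction")


def rollback_reasons(metrics: Dict[str, float], t: RollbackThresholds) -> List[str]:
    """Return the reasons to roll back; an empty list means the rollout may continue."""
    missing = [name for name in REQUIRED if name not in metrics]
    if missing:
        # No evidence is treated as bad evidence.
        return ["missing:" + name for name in missing]
    reasons = []
    if metrics["error_rate"] > t.max_error_rate:
        reasons.append("error_rate")
    if metrics["invariant_violations"] > t.max_invariant_violations:
        reasons.append("invariant_violations")
    if metrics["healthy_fraction"] < t.min_healthy_fraction:
        reasons.append("healthy_fraction")
    return reasons
```

Evaluating this per cohort (version, region, gateway) lets a canary stage halt before the wider fleet is touched, and the returned reasons can be logged as the conditions that triggered the rollback.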

Evidence

  • Learn TLA+ (1) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

  • How quickly can you revoke a compromised device identity globally?
  • What does “safe behavior” mean when the cloud is unreachable?
  • What is the blast radius of a compromised gateway?
  • Which messages are allowed to cause physical effects and under what conditions?

Checklist

  • Rollback plan rehearsed and automated.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/