Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

A focused memo on offline-first edge design, specifically consistency during intermittent connectivity. The approach: define the model, state the properties, then design the system so those properties hold under failures and against adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Design for power loss and intermittent links; recovery is the primary feature.
  • Secure updates need rollback protection and staged rollout with safety rails.
  • Replay protection must not rely on wall-clock time alone (counters + windows).
  • Make failure modes explicit and observable.
  • Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

  • Gateways become choke points; design them as security boundaries.
  • Adversaries can replay and spoof data to mislead control planes.
  • Operational constraints (bandwidth, CPU) drive protocol choices.
  • Identity and freshness are the foundation of telemetry integrity.

Key questions

  • How do you prevent replay and reordering from becoming false control signals?
  • How do you handle intermittent connectivity without corrupting state?
  • How do you do secure updates (rollback protection, staged rollout, recovery)?
  • How do you provision identity and rotate it over years?
  • What does incident response look like at fleet scale?
  • Where do you terminate trust (device, gateway, cloud) and why?

Assumptions

  • Gateways can be compromised; isolate blast radius.
  • Time sync is weak; clocks drift and may be manipulated.
  • Firmware updates can fail mid-flight; partial installation is possible.
  • Devices experience power loss and abrupt restarts.
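
Power loss and abrupt restarts mean any local state write can be cut off halfway. A minimal sketch of one common mitigation follows: write to a temporary file, flush it to storage, then rename it over the old file. The function names and the JSON state format are illustrative assumptions, not something this memo prescribes, and a real device would also need to consider flash wear and syncing the parent directory.

```python
import json
import os
import tempfile


def save_state_atomically(path: str, state: dict) -> None:
    """Write state so a reader sees either the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory (same filesystem) so the
    # final rename does not have to copy data across devices.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to storage before the swap
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # If anything failed before the swap, remove the leftover temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str, default: dict) -> dict:
    """Return the last committed state, or the default if nothing was committed yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

If power is lost before `os.replace`, the previous state file is untouched and only a stray temp file remains, which a startup routine could delete.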

Non-goals

  • Relying on the cloud to enforce edge-local safety properties.
  • Assuming firmware updates always complete successfully.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

Fleet rollout safety is a monotone constraint:

$$\text{rollout}(v_{k+1}) \Rightarrow \text{can\_rollback}(v_k)\ \wedge\ \text{telemetry\_healthy}.$$
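
The constraint above can be read as a precondition check that runs before a rollout starts. The sketch below is one possible encoding; `FleetState`, its fields, and `may_rollout` are hypothetical names, and the requirement that the target version be strictly newer is an added assumption about what "monotone" means here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetState:
    current_version: int          # v_k, the version running today
    rollback_image_present: bool  # stands in for can_rollback(v_k)
    telemetry_healthy: bool       # health signals are within agreed bounds


def may_rollout(state: FleetState, target_version: int) -> bool:
    """Permit rollout of a newer version only if rollback is possible and telemetry is healthy."""
    moves_forward = target_version > state.current_version  # assumption: versions only increase
    return moves_forward and state.rollback_image_present and state.telemetry_healthy
```

The check is only as trustworthy as its inputs, so the two boolean fields should be derived from evidence (stored image verification, recent telemetry) rather than set by hand.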

Define safe modes explicitly: what do devices do when policy can’t be fetched?
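
One way to make the safe mode explicit is to encode the fallback order in a single function. The sketch below assumes a hypothetical `Policy` type, a caller-supplied `fetch` function that raises `ConnectionError` or `TimeoutError` when the link is down, and a cache age measured by a local monotonic timer. The specific safe mode shown (refuse actuation) is an illustrative choice, not a recommendation from this memo.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Policy:
    version: int
    allow_actuation: bool  # may the device cause physical effects?


# Hypothetical safe mode: keep running, but refuse actuation.
SAFE_MODE = Policy(version=0, allow_actuation=False)


def effective_policy(
    fetch: Callable[[], Policy],
    cached: Optional[Policy],
    cache_age_s: float,
    max_cache_age_s: float,
) -> Policy:
    """Prefer a fresh policy, then a recent cached one, then the safe mode."""
    try:
        return fetch()
    except (ConnectionError, TimeoutError):
        # Offline: a cached policy is acceptable only within a bounded age.
        if cached is not None and cache_age_s <= max_cache_age_s:
            return cached
        return SAFE_MODE
```

Writing the fallback as code forces a decision on the two questions that tend to be left implicit: how stale a cached policy may be, and what the device does once that bound is exceeded.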

Use monotonic counters when time is untrusted; combine with nonces and bounded windows.
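
A minimal sketch of the counter-plus-window idea, similar in spirit to the anti-replay windows used by some network protocols. It assumes each sender attaches a strictly increasing integer counter starting at 1 and that the message is authenticated before this check runs. `ReplayGuard` is a hypothetical name, and nonces and persistence across restarts are left out.

```python
class ReplayGuard:
    """Per-sender anti-replay check using a monotonic counter and a bounded window."""

    def __init__(self, window: int = 64) -> None:
        self.window = window
        self.highest = 0               # highest counter accepted so far
        self.seen: set[int] = set()    # accepted counters still inside the window

    def accept(self, counter: int) -> bool:
        if counter < 1:
            return False  # counters are assumed to start at 1
        if counter <= self.highest - self.window:
            return False  # too old: outside the window, cannot be checked for duplicates
        if counter in self.seen:
            return False  # duplicate: already accepted once
        self.seen.add(counter)
        if counter > self.highest:
            self.highest = counter
            # Forget counters that have fallen out of the window to bound memory.
            self.seen = {c for c in self.seen if c > self.highest - self.window}
        return True
```

The window tolerates reordering on lossy links (a late message is still accepted once) while keeping memory bounded. For the guarantee to survive a restart, `highest` would have to be persisted.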

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

  • Evidence: critical actions emit verifiable audit events.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Least authority: privileges are scoped by purpose and time.

Failure modes

  • Timeout ambiguity causing double-apply or partial state transitions.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Recovery paths that only work when nothing is broken.
  • Observability gaps during incidents (missing evidence).

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  dev["Device (identity + attestation)"] --> gw["Gateway"]
  gw --> bus["Message Bus"]
  bus --> ingest["Ingestion"]
  ingest --> tsdb["Time-Series Store"]
  tsdb --> apps["Analytics / Control Plane"]

Implementation notes

Prefer protocols that degrade safely under packet loss and skew.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
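
One way to make a timeout outcome explainable is to make the operation idempotent: the sender attaches a request ID, and the receiver records the outcome for each ID. A retry after an ambiguous timeout then returns the recorded outcome instead of applying the change twice. The class and field names below are hypothetical. A real device would persist the table and bound its size.

```python
class CommandApplier:
    """Applies setpoint changes at most once per request ID."""

    def __init__(self) -> None:
        self.setpoint = 0
        self.results: dict[str, int] = {}  # request_id -> setpoint recorded after applying

    def apply(self, request_id: str, delta: int) -> int:
        if request_id in self.results:
            # Retry of a request that already took effect: replay the recorded outcome.
            return self.results[request_id]
        self.setpoint += delta
        self.results[request_id] = self.setpoint
        return self.setpoint
```

With this in place, the sender's rule after a timeout becomes simple: resend with the same request ID until an answer arrives.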

Firmware update safety checklist:

  • Signed manifest with version + hash
  • Rollback protection (anti-downgrade)
  • A/B partitions or staged apply
  • Health check + watchdog
  • Telemetry proves rollout state
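
The first two checklist items can be sketched as a single acceptance check. This example makes a loud simplification: it uses HMAC-SHA256 with a shared key as a stand-in for the manifest signature so that it runs with the standard library alone. A production design would more likely verify an asymmetric signature so devices hold no signing secret. The manifest fields (`version`, `sha256`) and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Stand-in signer: HMAC over canonical JSON (assumption, not a real signature scheme)."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def accept_update(
    manifest: dict, signature: str, image: bytes, key: bytes, installed_version: int
) -> bool:
    """Accept only an authentic manifest for a newer version whose hash matches the image."""
    # 1. Authenticity: check the manifest before trusting any field inside it.
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False
    # 2. Anti-downgrade: refuse the same or an older version.
    if manifest["version"] <= installed_version:
        return False
    # 3. Integrity: the image must match the hash the manifest commits to.
    image_hash = hashlib.sha256(image).hexdigest()
    return hmac.compare_digest(image_hash, manifest["sha256"])
```

The order matters: version and hash are read only after the manifest is authenticated. For rollback protection to hold, `installed_version` has to come from storage an attacker cannot reset.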

Verification strategy

  • Key rotation drills across device + gateway + cloud.
  • Hardware-in-the-loop tests for update and recovery paths.
  • Power-loss fault injection during flash writes and installs.
  • Scale tests: provisioning bursts, reconnect storms, gateway failures.
  • Replay/reorder simulations for telemetry and control messages.

Operational notes

  • Design rollouts to be interruptible and reversible.
  • Maintain an identity inventory: device → cert/keys → firmware version.
  • Make revocation fast: emergency disable, quarantine, and re-enrollment.
  • Treat time sync alerts as security signals (NTP manipulation).
  • Monitor fleet health by cohort (version, region, gateway).

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Authz failures and policy denials (unexpected spikes).
  • Error budget burn + tail latency under load.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
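
An explicit rollback trigger can be as small as a function from cohort metrics to a list of reasons. The metrics and thresholds below are illustrative assumptions (this memo does not name specific signals). Returning the reasons rather than a bare boolean keeps a record of which condition fired.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackThresholds:
    max_error_rate: float     # fraction of failed operations tolerated
    max_crash_loops: int      # devices stuck rebooting in the cohort
    min_checkin_ratio: float  # fraction of updated devices that reported back


def rollback_reasons(
    error_rate: float, crash_loops: int, checkin_ratio: float, t: RollbackThresholds
) -> List[str]:
    """Return the names of every threshold breached; an empty list means continue."""
    reasons = []
    if error_rate > t.max_error_rate:
        reasons.append("error_rate")
    if crash_loops > t.max_crash_loops:
        reasons.append("crash_loops")
    if checkin_ratio < t.min_checkin_ratio:
        reasons.append("checkin_ratio")
    return reasons
```

A staged rollout would evaluate this per cohort after each stage and halt or roll back when the list is non-empty, logging the reasons alongside the other rollout evidence.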

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • How quickly can you revoke a compromised device identity globally?
  • What is the blast radius of a compromised gateway?
  • What does “safe behavior” mean when the cloud is unreachable?
  • Which messages are allowed to cause physical effects and under what conditions?

Checklist

  • Rollback plan rehearsed and automated.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/