Monthly research note. Theme: IIoT Platforms & Edge Security.
TL;DR
A focused memo on offline-first edge design and consistency during intermittent connectivity: define the model, state the properties, then design the system so those properties hold under failure and under attack.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Design for power loss and intermittent links; recovery is the primary feature.
- Secure updates need rollback protection and staged rollout with safety rails.
- Replay protection must not rely on wall-clock time alone (counters + windows).
- Make failure modes explicit and observable.
- Bind security decisions to evidence (audit, invariants, telemetry).
Why this matters
- Gateways become choke points; design them as security boundaries.
- Adversaries can replay and spoof data to mislead control planes.
- Operational constraints (bandwidth, CPU) drive protocol choices.
- Identity and freshness are the foundation of telemetry integrity.
Key questions
- How do you prevent replay and reordering from becoming false control signals?
- How do you handle intermittent connectivity without corrupting state?
- How do you do secure updates (rollback protection, staged rollout, recovery)?
- How do you provision identity and rotate it over years?
- What does incident response look like at fleet scale?
- Where do you terminate trust (device, gateway, cloud) and why?
Assumptions
- Gateways can be compromised; isolate blast radius.
- Time sync is weak; clocks drift and may be manipulated.
- Firmware updates can fail mid-flight; partial installation is possible.
- Devices experience power loss and abrupt restarts.
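The power-loss assumption has a direct consequence for how local state is written. The sketch below shows one common pattern (write to a temporary file, flush to disk, then rename over the old file), so that a crash leaves either the old state or the new state but not a half-written one. The function names and the JSON state format are hypothetical, and the sketch assumes a filesystem where rename within a directory is atomic.

```python
import json
import os
import tempfile


def save_state_atomic(path: str, state: dict) -> None:
    """Persist state so that a crash leaves the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # push file contents to stable storage
        os.replace(tmp_path, path)    # atomic swap of old state for new state
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)       # do not leave partial temp files behind
        raise


def load_state(path: str, default: dict) -> dict:
    """Load persisted state, or return the default on first boot."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A stricter version would also fsync the containing directory after the rename; that step is omitted here to keep the sketch short.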
Non-goals
- Relying on the cloud to enforce edge-local safety properties.
- Assuming firmware updates always complete successfully.
Parsing is an attacker-controlled interface—validate early and fail fast.
Model & invariants
Fleet rollout safety is a monotone constraint: the minimum firmware version a device will accept never decreases.
Define safe modes explicitly: what do devices do when policy can’t be fetched?
Use monotonic counters when time is untrusted; combine with nonces and bounded windows.
Invariants must be checkable from evidence you actually have (state + logs + counters).
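The counter-plus-window idea can be sketched as a sliding-window replay filter. This is a minimal illustration, not a protocol definition: the class name `ReplayWindow`, the default window size, and the assumption of one counter stream per sender are all hypothetical. It accepts each counter at most once, tolerates reordering inside the window, and rejects anything older than the window.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayWindow:
    """Sliding-window replay filter over a per-sender monotonic counter."""
    size: int = 64                            # how far behind the highest counter is still accepted
    highest: int = -1                         # highest counter accepted so far
    seen: set = field(default_factory=set)    # accepted counters still inside the window

    def accept(self, counter: int) -> bool:
        """Return True once per fresh counter; False for replays and stale values."""
        if counter < 0:
            return False
        if counter <= self.highest - self.size:
            return False                      # older than the bounded window
        if counter in self.seen:
            return False                      # duplicate inside the window: a replay
        self.seen.add(counter)
        if counter > self.highest:
            self.highest = counter
            # Drop counters that fell out of the window so memory stays bounded.
            self.seen = {c for c in self.seen if c > self.highest - self.size}
        return True
```

For this to survive restarts, `highest` (and ideally `seen`) would need to be persisted; otherwise a reboot reopens the window to old messages.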
Security properties
- Evidence: critical actions emit verifiable audit events.
- Replay resistance: duplicated inputs do not change outcomes.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Least authority: privileges are scoped by purpose and time.
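Downgrade resistance for firmware can be illustrated with a manifest check that verifies authenticity, then the version floor, then the image hash. The sketch below is hedged in several ways: the manifest fields (`version`, `sha256`), the function names, and the `rollback_floor` parameter are hypothetical, and HMAC with a shared key stands in for the asymmetric signature a real update system would normally use.

```python
import hashlib
import hmac
import json
from typing import Tuple


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over a canonical JSON encoding."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_update(manifest: dict, signature: str, image: bytes,
                  key: bytes, rollback_floor: int) -> Tuple[bool, str]:
    """Check authenticity first, then anti-rollback, then image integrity."""
    expected = sign_manifest(manifest, key)
    if not hmac.compare_digest(expected, signature):
        return False, "bad-signature"
    # Anti-downgrade: refuse anything below the stored version floor.
    if manifest["version"] < rollback_floor:
        return False, "rollback-rejected"
    if hashlib.sha256(image).hexdigest() != manifest["sha256"]:
        return False, "hash-mismatch"
    return True, "ok"
```

The floor would live in tamper-resistant storage and be raised only after a new version passes its health check, so a failed install cannot strand the device.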
Failure modes
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Recovery paths that only work when nothing is broken.
- Observability gaps during incidents (missing evidence).
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
flowchart TD
dev["Device (identity + attestation)"] --> gw["Gateway"]
gw --> bus["Message Bus"]
bus --> ingest["Ingestion"]
ingest --> tsdb["Time-Series Store"]
tsdb --> apps["Analytics / Control Plane"]
Implementation notes
Prefer protocols that degrade safely under packet loss and skew.
If you can’t explain a timeout outcome, you can’t make retries safe.
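One way to make a timeout outcome explainable is to give every operation a caller-chosen ID and record its result, so that a retry after an ambiguous timeout returns the recorded result instead of applying the change twice. The sketch below assumes an in-memory store and hypothetical names (`IdempotentApplier`, `op_id`); it illustrates the idea and is not a persistence design.

```python
class IdempotentApplier:
    """Applies each operation at most once, keyed by a caller-supplied operation ID."""

    def __init__(self) -> None:
        self.state = {}     # key -> current value
        self.results = {}   # op_id -> result recorded on first apply

    def apply(self, op_id: str, key: str, delta: int) -> int:
        # A retry of an already-applied operation returns the recorded result.
        if op_id in self.results:
            return self.results[op_id]
        new_value = self.state.get(key, 0) + delta
        self.state[key] = new_value
        self.results[op_id] = new_value
        return new_value
```

On a real device the state change and the result record would have to be persisted together, and the result table would need an eviction bound.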
Firmware update safety checklist:
- Signed manifest with version + hash
- Rollback protection (anti-downgrade)
- A/B partitions or staged apply
- Health check + watchdog
- Telemetry proves rollout state
Verification strategy
- Key rotation drills across device + gateway + cloud.
- Hardware-in-the-loop tests for update and recovery paths.
- Power-loss fault injection during flash writes and installs.
- Scale tests: provisioning bursts, reconnect storms, gateway failures.
- Replay/reorder simulations for telemetry and control messages.
Operational notes
- Design rollouts to be interruptible and reversible.
- Maintain an identity inventory: device → cert/keys → firmware version.
- Make revocation fast: emergency disable, quarantine, and re-enrollment.
- Treat time sync alerts as security signals (NTP manipulation).
- Monitor fleet health by cohort (version, region, gateway).
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Error budget burn + tail latency under load.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
- Rollback events and the conditions that triggered them.
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
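An explicit rollback trigger can be written as a pure function from cohort metrics to a decision, which makes it easy to test and audit. The sketch below is illustrative only: the metric names, the 2% failure threshold, and the minimum cohort size are assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortHealth:
    devices: int                  # devices in the canary cohort that have reported
    failed_health_checks: int     # devices that failed the post-update health check
    invariant_violations: int     # count of safety-invariant violations observed


def rollout_decision(h: CohortHealth,
                     max_failure_rate: float = 0.02,
                     min_devices: int = 50) -> str:
    """Return 'rollback', 'hold', or 'advance' for the next rollout stage."""
    if h.invariant_violations > 0:
        return "rollback"         # any invariant violation stops the rollout
    if h.devices < min_devices:
        return "hold"             # not enough evidence yet to advance
    failure_rate = h.failed_health_checks / h.devices
    return "rollback" if failure_rate > max_failure_rate else "advance"
```

Keeping the decision free of side effects means the inputs can be logged alongside the outcome, which supports reconstructing why a rollout stopped.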
Evidence
- Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
- Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Open questions
- How quickly can you revoke a compromised device identity globally?
- What is the blast radius of a compromised gateway?
- What does “safe behavior” mean when the cloud is unreachable?
- Which messages are allowed to cause physical effects and under what conditions?
Checklist
- Rollback plan rehearsed and automated.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- MQTT Version 5.0 (OASIS) — Messaging semantics, session behavior, and constraints at the edge.
- Uptane — Secure software updates for fleets with realistic threat models.
- NISTIR 8259A: IoT Device Cybersecurity Capability Core Baseline — Baseline capabilities and lifecycle expectations for devices.
- The Update Framework (TUF) Specification — Secure update metadata, compromise recovery, and key rotation.
- Learn TLA+ — Practical entry point for specification and model checking.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.