Monthly research note. Theme: IIoT Platforms & Edge Security.
TL;DR
Treat Zero Trust for IIoT (network segmentation and policy enforcement) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Design for power loss and intermittent links; recovery is the primary feature.
- Replay protection must not rely on wall-clock time alone (counters + windows).
- Secure updates need rollback protection and staged rollout with safety rails.
- Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
- Bind security decisions to evidence (audit, invariants, telemetry).
Why this matters
- Fleet-scale updates turn bugs into global incidents; rollback must be engineered.
- Adversaries can replay and spoof data to mislead control planes.
- Gateways become choke points; design them as security boundaries.
- Edge systems fail differently: power loss, intermittent links, and physical access.
Key questions
- How do devices enroll securely (no shared secrets, minimal manual steps)?
- What is your offline behavior (safe mode vs degraded mode)?
- How do you provision identity and rotate it over years?
- How do you prevent replay and reordering from becoming false control signals?
- How do you handle intermittent connectivity without corrupting state?
- What does incident response look like at fleet scale?
Assumptions
- Gateways can be compromised; isolate blast radius.
- Devices experience power loss and abrupt restarts.
- Some devices are physically accessible to attackers.
- Firmware updates can fail mid-flight; partial installation is possible.
Non-goals
- Treating identity as a static certificate file.
- Assuming firmware updates always complete successfully.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
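At the edge, identity and freshness are everything. A typical anti-replay constraint: accept a message only if its counter is greater than the highest counter already accepted from that device, or falls inside a bounded window and has not been seen before.
Use monotonic counters when time is untrusted; combine them with nonces and bounded windows.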
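One way to read the counter-plus-window rule is as a sliding acceptance window, similar in spirit to the anti-replay windows used by IPsec and DTLS. The Go sketch below is a minimal illustration under assumptions: `ReplayWindow`, `Accept`, `ErrReplay`, and the 64-entry `windowSize` are hypothetical names and values chosen for this example, and nonce handling and durable storage are left out.

```go
package main

import "errors"

// ErrReplay is returned for counters that were already accepted or that
// fall behind the acceptance window.
var ErrReplay = errors.New("replayed or stale counter")

// windowSize is an illustrative bound (an assumption, not from the note):
// how far behind the newest counter a late message may still be accepted.
const windowSize = 64

// ReplayWindow tracks, for one device, the highest accepted counter and a
// bitmap of the windowSize counters at or just below it.
type ReplayWindow struct {
	max    uint64 // highest counter accepted so far
	bitmap uint64 // bit i set => counter (max - i) was accepted
	seen   bool   // false until the first message is accepted
}

// Accept records ctr and returns nil if it is fresh, ErrReplay otherwise.
// It should be called only after the message signature has been verified.
func (w *ReplayWindow) Accept(ctr uint64) error {
	if !w.seen {
		w.max, w.bitmap, w.seen = ctr, 1, true
		return nil
	}
	if ctr > w.max {
		// Newer than anything seen: slide the window forward.
		shift := ctr - w.max
		if shift >= windowSize {
			w.bitmap = 0
		} else {
			w.bitmap <<= shift
		}
		w.bitmap |= 1
		w.max = ctr
		return nil
	}
	offset := w.max - ctr
	if offset >= windowSize {
		return ErrReplay // too old to track, so reject
	}
	mask := uint64(1) << offset
	if w.bitmap&mask != 0 {
		return ErrReplay // already accepted once
	}
	w.bitmap |= mask
	return nil
}
```

With this sketch, after counters 10 and 12 are accepted, a late 11 is accepted once, while a second 10 or 11 is rejected. On devices that lose power, the window state would need to be persisted before a message is acted on; otherwise a restart reopens the window.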
Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
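The lifecycle and the "impossible state" signal can be combined in one small state machine: legal transitions are listed explicitly, and any other transition is rejected and counted. This is a sketch under assumptions. The transition edges in `allowed`, and the names `Transition` and `Violations`, are hypothetical; a real fleet would derive the edges from its own policy and export the counter through its metrics system.

```go
package main

import "sync/atomic"

// State is one stage of the device identity lifecycle named in the note.
type State int

const (
	Provisioned State = iota
	Attested
	Rotated
	Revoked
	Forensics
)

// allowed lists the legal transitions. These edges are an assumption made
// for illustration only.
var allowed = map[State][]State{
	Provisioned: {Attested, Revoked},
	Attested:    {Rotated, Revoked},
	Rotated:     {Attested, Revoked},
	Revoked:     {Forensics},
	Forensics:   {},
}

// invariantViolations counts rejected transitions. It is expected to stay
// at zero; any increase is the "impossible state" becoming observable.
var invariantViolations int64

// Transition returns (to, true) if the move is legal. An illegal move
// keeps the current state and increments the violation counter.
func Transition(from, to State) (State, bool) {
	for _, next := range allowed[from] {
		if next == to {
			return to, true
		}
	}
	atomic.AddInt64(&invariantViolations, 1)
	return from, false
}

// Violations reports how many illegal transitions have been attempted.
func Violations() int64 {
	return atomic.LoadInt64(&invariantViolations)
}
```

In this sketch, `Transition(Revoked, Attested)` leaves the identity revoked and raises the counter, which is the value an alert would watch.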
Security properties
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
- Least authority: privileges are scoped by purpose and time.
- Evidence: critical actions emit verifiable audit events.
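One common way to make audit events verifiable is a hash chain, where each event commits to the hash of the one before it. The sketch below illustrates that idea only; the `Event` fields and the `Append` and `Verify` helpers are hypothetical, and it assumes the latest hash is also anchored somewhere the writer cannot alter (for example, signed or shipped off-device), since a chain alone does not detect truncation of the tail.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
)

// Event is one audit record. Hash commits to the previous record's hash,
// so editing an earlier record invalidates every later one.
type Event struct {
	Actor  string
	Action string
	Prev   string // hex hash of the previous event, "" for the first
	Hash   string // hex SHA-256 over Prev, Actor, and Action
}

// digest hashes the fields with length prefixes so that field boundaries
// cannot be shifted without changing the result.
func digest(prev, actor, action string) string {
	h := sha256.New()
	var n [8]byte
	for _, field := range []string{prev, actor, action} {
		binary.BigEndian.PutUint64(n[:], uint64(len(field)))
		h.Write(n[:])
		h.Write([]byte(field))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Append adds a new event linked to the current tail of the log.
func Append(events []Event, actor, action string) []Event {
	prev := ""
	if n := len(events); n > 0 {
		prev = events[n-1].Hash
	}
	return append(events, Event{
		Actor:  actor,
		Action: action,
		Prev:   prev,
		Hash:   digest(prev, actor, action),
	})
}

// Verify recomputes every link and reports whether the chain is intact.
func Verify(events []Event) bool {
	prev := ""
	for _, e := range events {
		if e.Prev != prev || e.Hash != digest(e.Prev, e.Actor, e.Action) {
			return false
		}
		prev = e.Hash
	}
	return true
}
```

Changing the `Action` of any stored event makes `Verify` return false, which is the property an incident responder relies on when reconstructing what happened.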
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
- Recovery paths that only work when nothing is broken.
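The double-apply case in the list above is often handled by making commands idempotent: each command carries an ID, and a retry with the same ID returns the recorded result instead of changing state again. The following is a minimal sketch under assumptions; `Setpoint`, `NewSetpoint`, and `Apply` are hypothetical names, and the in-memory map is unbounded here, whereas a real device would bound or expire it and persist it across restarts.

```go
package main

// Setpoint is a toy piece of device state changed by relative commands.
type Setpoint struct {
	value   int
	applied map[string]int // command ID -> value right after first apply
}

// NewSetpoint returns a Setpoint starting at zero with no applied commands.
func NewSetpoint() *Setpoint {
	return &Setpoint{applied: make(map[string]int)}
}

// Apply adds delta at most once per command ID. A retry with an ID that
// was already applied returns the recorded result and leaves state alone.
func (s *Setpoint) Apply(cmdID string, delta int) (value int, duplicate bool) {
	if v, ok := s.applied[cmdID]; ok {
		return v, true
	}
	s.value += delta
	s.applied[cmdID] = s.value
	return s.value, false
}
```

A sender that times out can then retry the same command ID safely: whether or not the first attempt landed, the state changes once.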
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch

    sequenceDiagram
        participant D as Device
        participant G as Gateway
        participant C as Cloud
        D->>G: telemetry(nonce, ctr, sig)
        G->>C: forward + policy tags
        C-->>G: update policy
        G-->>D: commands (bounded)

Implementation notes
Prefer protocols that degrade safely under packet loss and skew.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
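
    // Anti-replay sketch: monotonic counter + bounded window.

    // Counter is a per-device message counter that only moves forward.
    type Counter uint64

    // SeenStore persists the highest counter accepted for each device.
    type SeenStore interface {
    	// MaxCounter returns the highest counter accepted from deviceID.
    	MaxCounter(deviceID string) (Counter, error)
    	// UpdateMax records c as the new highest counter for deviceID.
    	UpdateMax(deviceID string, c Counter) error
    }

Verification strategy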
- Power-loss fault injection during flash writes and installs.
- Hardware-in-the-loop tests for update and recovery paths.
- Replay/reorder simulations for telemetry and control messages.
- Scale tests: provisioning bursts, reconnect storms, gateway failures.
- Key rotation drills across device + gateway + cloud.
Operational notes
- Design rollouts to be interruptible and reversible.
- Treat time sync alerts as security signals; they may indicate NTP manipulation.
- Maintain an identity inventory: device → cert/keys → firmware version.
- Monitor fleet health by cohort (version, region, gateway).
- Make revocation fast: emergency disable, quarantine, and re-enrollment.
Keep audit and config history queryable during incidents—evidence beats intuition.
What to monitor
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Authz failures and policy denials (unexpected spikes).
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
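An explicit rollback trigger, as described in the list above, can be written as a small pure function over cohort metrics so that the decision is testable and repeatable. The sketch below is one possible shape under assumptions: the `CohortStats` and `Thresholds` fields, the three-way `Decision`, and the order of the checks are all hypothetical.

```go
package main

// CohortStats summarizes one rollout cohort (for example, a canary group).
// Field names are assumptions for this sketch.
type CohortStats struct {
	Devices             int // devices that attempted the update
	FailedInstalls      int // devices that failed or reverted
	InvariantViolations int // invariant alerts seen during the rollout
}

// Thresholds is the explicit, reviewable rollback trigger.
type Thresholds struct {
	MinSample              int     // below this, there is too little data
	MaxFailureRate         float64 // e.g. 0.02 for 2%
	MaxInvariantViolations int     // usually 0
}

// Decision is the outcome of evaluating one cohort.
type Decision int

const (
	Continue Decision = iota // signals healthy: widen the rollout
	Hold                     // not enough data: wait, do not widen
	Rollback                 // trigger hit: stop and revert
)

// Evaluate maps metrics to a decision. Invariant violations are checked
// first because they signal a correctness problem at any sample size.
func Evaluate(s CohortStats, t Thresholds) Decision {
	if s.InvariantViolations > t.MaxInvariantViolations {
		return Rollback
	}
	if s.Devices == 0 || s.Devices < t.MinSample {
		return Hold
	}
	rate := float64(s.FailedInstalls) / float64(s.Devices)
	if rate > t.MaxFailureRate {
		return Rollback
	}
	return Continue
}
```

Keeping the thresholds in a plain struct means they can be versioned and preserved alongside the other rollout evidence.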
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- What does “safe behavior” mean when the cloud is unreachable?
- Which messages are allowed to cause physical effects and under what conditions?
- How quickly can you revoke a compromised device identity globally?
- What is the blast radius of a compromised gateway?
Checklist
- Rollback plan rehearsed and automated.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
Further reading
- NISTIR 8259A: IoT Device Cybersecurity Capability Core Baseline — Baseline capabilities and lifecycle expectations for devices.
- Uptane — Secure software updates for fleets with realistic threat models.
- MQTT Version 5.0 (OASIS) — Messaging semantics, session behavior, and constraints at the edge.
- The Update Framework (TUF) Specification — Secure update metadata, compromise recovery, and key rotation.
- Jepsen — Fault injection and correctness testing for distributed systems.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.