Zero Trust for IIoT: Network Segmentation and Policy Enforcement

Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

Treat Zero Trust for IIoT (network segmentation and policy enforcement) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Design for power loss and intermittent links; recovery is the primary feature.
Replay protection must not rely on wall-clock time alone; pair it with monotonic counters and bounded acceptance windows.
Secure updates need rollback protection and staged rollout with safety rails.
Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

Fleet-scale updates turn bugs into global incidents; rollback must be engineered.
Adversaries can replay and spoof data to mislead control planes.
Gateways become choke points; design them as security boundaries.
Edge systems fail differently: power loss, intermittent links, and physical access.

Key questions

How do devices enroll securely (no shared secrets, minimal manual steps)?
What is your offline behavior (safe mode vs degraded mode)?
How do you provision identity and rotate it over years?
How do you prevent replay and reordering from becoming false control signals?
How do you handle intermittent connectivity without corrupting state?
What does incident response look like at fleet scale?

Assumptions

Gateways can be compromised; isolate blast radius.
Devices experience power loss and abrupt restarts.
Some devices are physically accessible to attackers.
Firmware updates can fail mid-flight; partial installation is possible.

Non-goals

Treating identity as a static certificate file.
Assuming firmware updates always complete successfully.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

At the edge, identity and freshness are everything. A typical anti-replay constraint:

\text{accept}(m)\Rightarrow \mathrm{nonce}(m)\notin \mathrm{Seen}\ \wedge\ \mathrm{ts}(m)\in [t-\Delta,t+\Delta].

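Here t is the receiver's current time, \Delta is the tolerated clock skew, and \mathrm{Seen} is the set of nonces already accepted. Use monotonic counters when time is untrusted; combine them with nonces and bounded windows.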
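A minimal sketch of the acceptance rule above, assuming times are Unix seconds and that the nonce and timestamp are covered by the message signature (so an attacker cannot pair an old nonce with a new timestamp). `ReplayGuard`, `NewReplayGuard`, and `Accept` are hypothetical names, and the in-memory map stands in for whatever storage a real gateway would use.

```go
package main

import "fmt"

// ReplayGuard is a hypothetical in-memory freshness checker for one sender.
// Times are Unix seconds; windowSec plays the role of Δ in the constraint.
type ReplayGuard struct {
	windowSec int64
	seen      map[string]int64 // nonce -> timestamp of the accepted message
}

func NewReplayGuard(windowSec int64) *ReplayGuard {
	return &ReplayGuard{windowSec: windowSec, seen: make(map[string]int64)}
}

// Accept returns true only if ts lies in [now-Δ, now+Δ] and the nonce is unseen.
func (g *ReplayGuard) Accept(nonce string, ts, now int64) bool {
	// Forget nonces older than the window: a replay of such a message
	// already fails the timestamp check, so the set does not grow forever.
	for n, t := range g.seen {
		if now-t > g.windowSec {
			delete(g.seen, n)
		}
	}
	if ts < now-g.windowSec || ts > now+g.windowSec {
		return false
	}
	if _, dup := g.seen[nonce]; dup {
		return false
	}
	g.seen[nonce] = ts
	return true
}

func main() {
	g := NewReplayGuard(30)
	fmt.Println(g.Accept("n1", 1000, 1005)) // first delivery
	fmt.Println(g.Accept("n1", 1000, 1006)) // replay of the same message
	fmt.Println(g.Accept("n2", 900, 1006))  // timestamp outside the window
}
```

The sketch still trusts `now`; when the local clock can be manipulated, the counter-based check is the safer primary control and this window becomes a secondary filter.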

Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
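One way to make the lifecycle explicit is a small state machine that rejects transitions the lifecycle does not allow. The sketch below is illustrative only: the state names follow the arrow chain above, but the transition table (for example, re-attesting after a rotation, or revoking from any live state) is an assumption, as are the `Device`, `CanTransition`, and `Transition` names.

```go
package main

import "fmt"

// State is one stage of the device identity lifecycle.
type State int

const (
	Provisioned State = iota
	Attested
	Rotating
	Revoked
	Forensics
)

// next lists the allowed successor states. The table is an assumption:
// rotation returns to Attested, and any live state can be revoked.
var next = map[State][]State{
	Provisioned: {Attested, Revoked},
	Attested:    {Rotating, Revoked},
	Rotating:    {Attested, Revoked},
	Revoked:     {Forensics},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to State) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// Device pairs an identity with its current lifecycle state.
type Device struct {
	ID    string
	State State
}

// Transition moves the device to a new state or returns an error.
func (d *Device) Transition(to State) error {
	if !CanTransition(d.State, to) {
		return fmt.Errorf("device %s: transition %d -> %d not allowed", d.ID, d.State, to)
	}
	d.State = to
	return nil
}

func main() {
	d := &Device{ID: "dev-1"}
	fmt.Println(d.Transition(Attested)) // allowed
	fmt.Println(d.Transition(Revoked))  // allowed
	fmt.Println(d.Transition(Attested)) // rejected: revoked identities stay revoked
}
```

Keeping revocation terminal (apart from forensics) means a revoked device has to go through fresh provisioning under a new identity instead of being quietly reinstated.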

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
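As a rough illustration, an invariant can be wrapped in a check that counts every violation under a stable name, so that an alert can be attached to the counter. `InvariantMonitor` and its methods are hypothetical; a real deployment would presumably export the counts through its existing metrics pipeline.

```go
package main

import (
	"fmt"
	"sync"
)

// InvariantMonitor counts violations per invariant name.
type InvariantMonitor struct {
	mu         sync.Mutex
	violations map[string]uint64
}

func NewInvariantMonitor() *InvariantMonitor {
	return &InvariantMonitor{violations: make(map[string]uint64)}
}

// Check records a violation when holds is false and returns holds unchanged,
// so it can wrap an existing condition without altering control flow.
func (m *InvariantMonitor) Check(name string, holds bool) bool {
	if !holds {
		m.mu.Lock()
		m.violations[name]++
		m.mu.Unlock()
	}
	return holds
}

// Violations returns the number of recorded violations for one invariant.
func (m *InvariantMonitor) Violations(name string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.violations[name]
}

func main() {
	m := NewInvariantMonitor()
	lastCtr, newCtr := uint64(41), uint64(40)
	// A counter that moves backwards is an "impossible state" worth alerting on.
	m.Check("counter_monotonic", newCtr > lastCtr)
	fmt.Println("counter_monotonic violations:", m.Violations("counter_monotonic"))
}
```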

Security properties

Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
Least authority: privileges are scoped by purpose and time.
Evidence: critical actions emit verifiable audit events.

Failure modes

Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version behavior that violates assumptions silently.
Recovery paths that only work when nothing is broken.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

sequenceDiagram
  participant D as Device
  participant G as Gateway
  participant C as Cloud
  D->>G: telemetry(nonce, ctr, sig)
  G->>C: forward + policy tags
  C-->>G: update policy
  G-->>D: commands (bounded)
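The `telemetry(nonce, ctr, sig)` message in the diagram could be represented roughly as follows. This is a sketch under assumptions: the field layout, the length-prefixed encoding, and the choice of Ed25519 (from Go's standard `crypto/ed25519` package) are illustrative and are not specified by the diagram. `Telemetry`, `Sign`, `Verify`, and `MustGenerateKey` are hypothetical names.

```go
package main

import (
	"crypto/ed25519"
	"encoding/binary"
	"fmt"
)

// Telemetry mirrors telemetry(nonce, ctr, sig) from the diagram.
type Telemetry struct {
	DeviceID string
	Nonce    [16]byte
	Ctr      uint64
	Payload  []byte
	Sig      []byte
}

// signedBytes serializes every field except Sig. The device ID is
// length-prefixed so field boundaries cannot be shifted by an attacker.
func (t *Telemetry) signedBytes() []byte {
	var idLen [4]byte
	binary.BigEndian.PutUint32(idLen[:], uint32(len(t.DeviceID)))
	var ctr [8]byte
	binary.BigEndian.PutUint64(ctr[:], t.Ctr)

	buf := append([]byte{}, idLen[:]...)
	buf = append(buf, t.DeviceID...)
	buf = append(buf, t.Nonce[:]...)
	buf = append(buf, ctr[:]...)
	buf = append(buf, t.Payload...)
	return buf
}

// Sign fills in Sig using the device's private key.
func Sign(priv ed25519.PrivateKey, t *Telemetry) {
	t.Sig = ed25519.Sign(priv, t.signedBytes())
}

// Verify checks Sig against the device's registered public key.
func Verify(pub ed25519.PublicKey, t *Telemetry) bool {
	return ed25519.Verify(pub, t.signedBytes(), t.Sig)
}

// MustGenerateKey creates a key pair for the demo; real devices would
// generate and keep the private key in hardware-backed storage if available.
func MustGenerateKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := MustGenerateKey()
	msg := &Telemetry{DeviceID: "dev-1", Ctr: 7, Payload: []byte("temp=21.5")}
	Sign(priv, msg)
	fmt.Println("valid:", Verify(pub, msg))

	msg.Ctr = 8 // tampering with the counter invalidates the signature
	fmt.Println("after tamper:", Verify(pub, msg))
}
```

Because the counter and nonce are inside the signed bytes, a gateway that forwards the message cannot alter them without the cloud noticing, which fits the assumption that gateways can be compromised.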

Implementation notes

Prefer protocols that degrade safely under packet loss and skew.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
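A sketch of the rule of thumb for a JSON telemetry report, assuming a hypothetical `Report` shape and illustrative caps (`maxBodyBytes`, `maxReadings`). The body is read through `io.LimitReader` so an oversized request is rejected before decoding, and unknown fields are refused so the schema stays strict.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

const (
	maxBodyBytes = 4 << 10 // illustrative cap on request size
	maxReadings  = 64      // illustrative cap on work per request
)

var ErrTooLarge = errors.New("body exceeds size cap")

// Report is a hypothetical telemetry payload.
type Report struct {
	DeviceID string    `json:"device_id"`
	Readings []float64 `json:"readings"`
}

// ParseReport caps the bytes read, then parses, then validates.
func ParseReport(r io.Reader) (*Report, error) {
	// Read one byte past the cap so oversize input is detected, not truncated.
	raw, err := io.ReadAll(io.LimitReader(r, maxBodyBytes+1))
	if err != nil {
		return nil, err
	}
	if len(raw) > maxBodyBytes {
		return nil, ErrTooLarge
	}

	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var rep Report
	if err := dec.Decode(&rep); err != nil {
		return nil, fmt.Errorf("decode: %w", err)
	}
	if rep.DeviceID == "" || len(rep.DeviceID) > 64 {
		return nil, errors.New("device_id missing or too long")
	}
	if len(rep.Readings) == 0 || len(rep.Readings) > maxReadings {
		return nil, errors.New("readings count out of range")
	}
	return &rep, nil
}

// ParseReportString is a convenience wrapper for examples and tests.
func ParseReportString(s string) (*Report, error) {
	return ParseReport(strings.NewReader(s))
}

func main() {
	rep, err := ParseReportString(`{"device_id":"dev-1","readings":[21.5,21.7]}`)
	fmt.Println(rep, err)
}
```

Only after these checks pass would the request reach anything expensive, such as signature verification or a database write.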

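// Anti-replay sketch: monotonic counter + bounded window.

// Counter is a per-device message counter that must only increase.
type Counter uint64

// SeenStore persists the highest counter accepted from each device.
type SeenStore interface {
  // MaxCounter returns the highest counter accepted so far for deviceID.
  MaxCounter(deviceID string) (Counter, error)
  // UpdateMax records c as the new highest accepted counter for deviceID.
  UpdateMax(deviceID string, c Counter) error
}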
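One possible way to use the interface above is sketched below. It repeats the two declarations so that it compiles on its own. The in-memory `memStore`, the `Accept` function, the error values, and the `maxJump` bound are all assumptions; in particular, the "bounded window" is read here as a cap on how far a counter may jump forward, whereas a sliding window that tolerates reordering would be a different design.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Counter uint64

type SeenStore interface {
	MaxCounter(deviceID string) (Counter, error)
	UpdateMax(deviceID string, c Counter) error
}

var (
	ErrReplay = errors.New("counter not greater than last accepted")
	ErrJump   = errors.New("counter jumps beyond allowed window")
)

// maxJump is an illustrative bound on forward jumps.
const maxJump Counter = 1000

// memStore is an in-memory SeenStore for demonstration only; a real store
// must be durable so counters survive power loss and restarts.
type memStore struct {
	mu   sync.Mutex
	last map[string]Counter
}

func NewMemStore() SeenStore {
	return &memStore{last: make(map[string]Counter)}
}

func (s *memStore) MaxCounter(deviceID string) (Counter, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.last[deviceID], nil // unknown devices start at 0
}

func (s *memStore) UpdateMax(deviceID string, c Counter) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c <= s.last[deviceID] {
		return ErrReplay // re-check under the lock to close the race
	}
	s.last[deviceID] = c
	return nil
}

// Accept admits a message only if its counter advances, and not by too much.
// Store errors are returned as-is so the caller fails closed.
func Accept(store SeenStore, deviceID string, c Counter) error {
	last, err := store.MaxCounter(deviceID)
	if err != nil {
		return err
	}
	if c <= last {
		return ErrReplay
	}
	if c-last > maxJump {
		return ErrJump
	}
	return store.UpdateMax(deviceID, c)
}

func main() {
	s := NewMemStore()
	fmt.Println(Accept(s, "dev-1", 1))    // first message
	fmt.Println(Accept(s, "dev-1", 1))    // replay
	fmt.Println(Accept(s, "dev-1", 5000)) // suspicious jump
}
```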

Verification strategy

Power-loss fault injection during flash writes and installs.
Hardware-in-the-loop tests for update and recovery paths.
Replay/reorder simulations for telemetry and control messages.
Scale tests: provisioning bursts, reconnect storms, gateway failures.
Key rotation drills across device + gateway + cloud.

Operational notes

Design rollouts to be interruptible and reversible.
Treat time sync alerts as security signals (NTP manipulation).
Maintain an identity inventory: device → cert/keys → firmware version.
Monitor fleet health by cohort (version, region, gateway).
Make revocation fast: emergency disable, quarantine, and re-enrollment.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Authz failures and policy denials (unexpected spikes).

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
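An explicit rollback trigger can be as small as a pure function over cohort metrics. The sketch below is one possible shape, with hypothetical names (`Trigger`, `CohortStats`, `Evaluate`) and illustrative thresholds; the real metrics and limits would come from the fleet's own SLOs.

```go
package main

import "fmt"

// Decision is the outcome of evaluating a rollout stage.
type Decision int

const (
	Continue Decision = iota // widen the rollout
	Hold                     // not enough data yet; do not widen
	Rollback                 // thresholds breached; revert the cohort
)

// CohortStats summarizes one canary cohort.
type CohortStats struct {
	Updated, Failed, InvariantViolations int
}

// Trigger holds the explicit thresholds agreed before the rollout starts.
type Trigger struct {
	MinSample              int
	MaxFailureRate         float64
	MaxInvariantViolations int
}

// Evaluate applies the trigger to the observed stats.
func Evaluate(t Trigger, s CohortStats) Decision {
	// Invariant violations are treated as decisive even on small samples.
	if s.InvariantViolations > t.MaxInvariantViolations {
		return Rollback
	}
	if s.Updated == 0 || s.Updated < t.MinSample {
		return Hold
	}
	if float64(s.Failed)/float64(s.Updated) > t.MaxFailureRate {
		return Rollback
	}
	return Continue
}

func main() {
	trig := Trigger{MinSample: 50, MaxFailureRate: 0.02, MaxInvariantViolations: 0}
	fmt.Println(Evaluate(trig, CohortStats{Updated: 10}))
	fmt.Println(Evaluate(trig, CohortStats{Updated: 100, Failed: 1}))
	fmt.Println(Evaluate(trig, CohortStats{Updated: 100, Failed: 5}))
}
```

Keeping the function pure makes the trigger easy to rehearse against recorded metrics from past rollouts.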

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

What does “safe behavior” mean when the cloud is unreachable?
Which messages are allowed to cause physical effects and under what conditions?
How quickly can you revoke a compromised device identity globally?
What is the blast radius of a compromised gateway?

Checklist

Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Telemetry captures correctness signals.
