Secure Remote Access: Bastions, Just-in-Time, and Audit

Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

Define the model for remote access into the fleet, state the security properties it must provide, then design the system so those properties remain true under failure and under attack.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Replay protection must not rely on wall-clock time alone; pair it with monotonic counters and bounded acceptance windows.
Gateways are security boundaries; isolate blast radius and enforce policy early.
Design for power loss and intermittent links; recovery is the primary feature.
Bind security decisions to evidence (audit, invariants, telemetry).
Write assumptions down; treat them as interfaces.

Why this matters

Fleet-scale updates turn bugs into global incidents; rollback must be engineered.
Edge systems fail differently: power loss, intermittent links, and physical access.
Identity and freshness are the foundation of telemetry integrity.
Operational constraints (bandwidth, CPU) drive protocol choices.

Key questions

What is your offline behavior (safe mode vs degraded mode)?
How do you do secure updates (rollback protection, staged rollout, recovery)?
What does incident response look like at fleet scale?
How do you prevent replay and reordering from becoming false control signals?
Where do you terminate trust (device, gateway, cloud) and why?
How do you handle intermittent connectivity without corrupting state?

Assumptions

Devices experience power loss and abrupt restarts.
Gateways can be compromised, so the design must contain the blast radius of any single gateway.
Some devices are physically accessible to attackers.
Time sync is weak; clocks drift and may be manipulated.

Non-goals

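Assuming firmware updates always complete successfully (interrupted installs are treated as a normal case).
Treating identity as a static certificate file (identities are rotated, revoked, and re-enrolled).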

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
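
One way to bound per-request work is to reject oversized or over-rate requests before any parsing happens. The sketch below is illustrative only: `MAX_BODY_BYTES`, `TokenBucket`, and `admit` are hypothetical names, and the limits are placeholder values rather than recommendations.

```python
from dataclasses import dataclass

MAX_BODY_BYTES = 64 * 1024  # assumed cap; tune per deployment


@dataclass
class TokenBucket:
    """Token bucket with an injected clock so it can be tested deterministically."""
    capacity: float
    refill_per_sec: float
    tokens: float
    last: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(bucket: TokenBucket, body: bytes, now: float) -> str:
    """Return 'ok' or a rejection reason; the checks run before any parsing."""
    if len(body) > MAX_BODY_BYTES:
        return "too_large"
    if not bucket.allow(now):
        return "rate_limited"
    return "ok"
```

Returning a reason string rather than a boolean makes it straightforward to count rejections by cause, which matches the monitoring items later in this note.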

Model & invariants

Fleet rollout safety is a monotone constraint:

\[
\text{rollout}(v_{k+1}) \Rightarrow \text{can\_rollback}(v_k)\ \wedge\ \text{telemetry\_healthy}.
\]

Use monotonic counters when time is untrusted; combine with nonces and bounded windows.
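
A minimal sketch of the counter-plus-window part, in the style of a sliding anti-replay bitmap. It assumes each message carries a per-sender counter that is covered by message authentication and that the sender persists its counter across restarts; `ReplayWindow` is a hypothetical name, and nonces are not modeled here.

```python
class ReplayWindow:
    """Accept each counter at most once; tolerate reordering within `size`."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.highest = -1  # highest counter accepted so far
        self.bitmap = 0    # bit i set => counter (highest - i) was seen

    def accept(self, counter: int) -> bool:
        if counter < 0:
            return False
        if counter > self.highest:
            shift = counter - self.highest
            # Slide the window forward; bits that fall off the end are dropped.
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = counter
            return True
        offset = self.highest - counter
        if offset >= self.size:
            return False  # too old: outside the bounded window
        if self.bitmap & (1 << offset):
            return False  # duplicate: treated as a replay
        self.bitmap |= 1 << offset
        return True
```

The window size trades tolerance for reordering against how long a delayed message stays acceptable; no wall-clock time is consulted at any point.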

Define safe modes explicitly: what do devices do when policy can’t be fetched?
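
As an illustration of what "explicit" can mean, the sketch below names three modes and a default-deny command table for each. The mode names, the command classes, and the rule for choosing a mode are all assumptions made for this example, not a prescribed policy.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"      # fresh policy fetched and verified
    DEGRADED = "degraded"  # no fresh policy, but a verified cached copy exists
    SAFE = "safe"          # nothing trustworthy available


# Hypothetical command classes permitted in each mode.
ALLOWED = {
    Mode.NORMAL: {"read", "report", "actuate"},
    Mode.DEGRADED: {"read", "report"},
    Mode.SAFE: {"read"},
}


def select_mode(fresh_policy_ok: bool, cached_policy_ok: bool) -> Mode:
    if fresh_policy_ok:
        return Mode.NORMAL
    if cached_policy_ok:
        return Mode.DEGRADED
    return Mode.SAFE


def permitted(mode: Mode, command: str) -> bool:
    # Default-deny: anything not listed for the mode is refused.
    return command in ALLOWED[mode]
```

Writing the table down forces the question of which commands may still cause physical effects when the policy source is unreachable.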

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).
Least authority: privileges are scoped by purpose and time.
Downgrade resistance: negotiation can’t silently weaken security posture.
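
The evidence and least-authority properties can be sketched together: a just-in-time grant scoped to one operator, one target, and a short lifetime, with every authorization decision appended to a hash-chained log. This is a sketch under stated assumptions: `Grant`, `AuditLog`, and `authorize` are hypothetical names, the bastion's clock is assumed trustworthy, and a plain hash chain is only tamper-evident to a party that holds a trusted copy of the latest hash.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """A just-in-time grant: one operator, one target, one purpose, short TTL."""
    operator: str
    target: str
    purpose: str
    expires_at: float  # seconds on the bastion's clock (assumed trusted here)


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self) -> None:
        self.entries: list = []

    def append(self, event: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"prev": prev, "event": event, "hash": digest})

    def verify(self) -> bool:
        prev = "0" * 64
        for entry in self.entries:
            body = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def authorize(grant: Grant, operator: str, target: str,
              now: float, log: AuditLog) -> bool:
    """Allow only the named operator on the named target before expiry."""
    allowed = (
        grant.operator == operator
        and grant.target == target
        and now < grant.expires_at
    )
    # Denials are logged as well as approvals.
    log.append({"operator": operator, "target": target,
                "purpose": grant.purpose, "at": now, "allowed": allowed})
    return allowed
```

Logging denials alongside approvals keeps unexpected spikes in policy denials visible in the same evidence trail.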

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Config drift that weakens security posture over time.
Recovery paths that only work when nothing is broken.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
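
For the timeout-ambiguity case, a common mitigation is an idempotency key: the caller picks a request ID, and a retry with the same ID returns the recorded result instead of applying the change again. The sketch below is in-memory only and uses hypothetical names; a real device or gateway would need to persist the result and the state change atomically.

```python
class IdempotentApplier:
    """Apply each command at most once, keyed by a caller-chosen request ID."""

    def __init__(self) -> None:
        self.results: dict = {}
        self.state = 0

    def apply(self, request_id: str, delta: int) -> str:
        # A retry after an ambiguous timeout gets the recorded result back
        # rather than triggering a second application.
        if request_id in self.results:
            return self.results[request_id]
        self.state += delta
        result = f"applied:{self.state}"
        self.results[request_id] = result
        return result
```

For example, calling `apply("r1", 5)` twice leaves `state` at 5, not 10.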

Pitfall

Sampled traces and metrics can hide the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  dev["Device (identity + attestation)"] --> gw["Gateway"]
  gw --> bus["Message Bus"]
  bus --> ingest["Ingestion"]
  ingest --> tsdb["Time-Series Store"]
  tsdb --> apps["Analytics / Control Plane"]

Implementation notes

Treat the gateway as a security boundary, not a dumb proxy.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Firmware update safety checklist:
- Signed manifest with version + hash
- Rollback protection (anti-downgrade)
- A/B partitions or staged apply
- Health check + watchdog
- Telemetry proves rollout state
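
The first two checklist items can be sketched as a single verification step. Note the loud assumption: this example uses HMAC-SHA256 from the standard library as a stand-in for a signature, purely so the sketch is self-contained. A real fleet would normally use an asymmetric signature so that devices hold only a public key. The manifest fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(key: bytes, manifest: dict) -> str:
    # Stand-in for a real signature: HMAC over a canonical JSON encoding.
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_update(key: bytes, manifest: dict, tag: str,
                  image: bytes, installed_version: int) -> str:
    """Return 'ok' or the name of the first failed check."""
    # 1. Authenticate the manifest before trusting any field in it.
    if not hmac.compare_digest(sign_manifest(key, manifest), tag):
        return "bad_signature"
    # 2. Rollback protection: accept only strictly newer versions.
    if manifest["version"] <= installed_version:
        return "not_newer"
    # 3. The image must match the hash that the manifest commits to.
    if hashlib.sha256(image).hexdigest() != manifest["sha256"]:
        return "hash_mismatch"
    return "ok"
```

The order matters: the version and hash fields are read only after the manifest itself has been authenticated.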

Verification strategy

Scale tests: provisioning bursts, reconnect storms, gateway failures.
Replay/reorder simulations for telemetry and control messages.
Power-loss fault injection during flash writes and installs.
Hardware-in-the-loop tests for update and recovery paths.
Key rotation drills across device + gateway + cloud.

Operational notes

Treat time sync alerts as security signals; they may indicate NTP manipulation.
Monitor fleet health by cohort (version, region, gateway).
Maintain an identity inventory: device → cert/keys → firmware version.
Make revocation fast: emergency disable, quarantine, and re-enrollment.
Design rollouts to be interruptible and reversible.
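
Cohort-level monitoring can be as simple as grouping health reports by a cohort key before computing ratios. The report fields below (`version`, `region`, `gateway`, `healthy`) are assumed for illustration and are not a defined schema.

```python
from collections import defaultdict


def unhealthy_ratio_by_cohort(reports: list) -> dict:
    """Map (version, region, gateway) to the fraction of unhealthy reports."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for r in reports:
        key = (r["version"], r["region"], r["gateway"])
        totals[key] += 1
        if not r["healthy"]:
            bad[key] += 1
    return {key: bad[key] / totals[key] for key in totals}
```

A fleet-wide average can look fine while one gateway or one firmware version is failing; the per-cohort view is what exposes that.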

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Invariant violation rate (should be ~0).
Retry/timeout rates by endpoint and client cohort.

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
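
An explicit rollback trigger can be written as a pure function of canary metrics and thresholds, which makes it easy to review and rehearse. The metric names and the idea of treating missing data as a failure are assumptions of this sketch; the threshold values in the usage line are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float       # fraction of failed health checks
    max_p99_latency_ms: float
    min_reporting_ratio: float  # fraction of canary devices still reporting


def should_rollback(errors: int, total: int, p99_ms: float,
                    reporting: int, cohort_size: int, t: Thresholds) -> bool:
    if total == 0 or cohort_size == 0:
        return True  # no signal is treated as a bad signal
    return (
        errors / total > t.max_error_rate
        or p99_ms > t.max_p99_latency_ms
        or reporting / cohort_size < t.min_reporting_ratio
    )
```

With `Thresholds(0.02, 500.0, 0.95)`, a canary cohort where only 80 of 100 devices still report would trigger rollback even if the error rate looks healthy, since silent devices may be the ones that failed to boot.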

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

Which messages are allowed to cause physical effects and under what conditions?
How quickly can you revoke a compromised device identity globally?
What does “safe behavior” mean when the cloud is unreachable?
What is the blast radius of a compromised gateway?

Checklist

Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Telemetry captures correctness signals.
