Monthly research note. Theme: IIoT Platforms & Edge Security.
TL;DR
A focused memo on "Time-Series at Scale: Ingestion, Downsampling, and Query Isolation". The approach: define the model, state the properties, then design the system so those properties remain true under failure and adversarial conditions.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Design for power loss and intermittent links; recovery is the primary feature.
- Device identity is a lifecycle: provision → attest → rotate → revoke → forensics.
- Replay protection must not rely on wall-clock time alone; combine it with monotonic counters and acceptance windows.
- Define safety properties before performance goals.
- Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
Why this matters
- Operational constraints (bandwidth, CPU) drive protocol choices.
- Adversaries can replay and spoof data to mislead control planes.
- Gateways become choke points; design them as security boundaries.
- Identity and freshness are the foundation of telemetry integrity.
Key questions
- How do you handle intermittent connectivity without corrupting state?
- How do you prevent replay and reordering from becoming false control signals?
- How do you do secure updates (rollback protection, staged rollout, recovery)?
- What does incident response look like at fleet scale?
- How do you provision identity and rotate it over years?
- What is your offline behavior (safe mode vs degraded mode)?
Assumptions
- Firmware updates can fail mid-flight; partial installation is possible.
- Gateways can be compromised; isolate blast radius.
- Connectivity is intermittent and high-latency; retries amplify costs.
- Some devices are physically accessible to attackers.
Non-goals
- Assuming firmware updates always complete successfully.
- Assuming perfect time synchronization at the edge.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
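At the edge, identity and freshness are everything. A typical anti-replay constraint pairs a per-device monotonic counter with a bounded acceptance window: a message is accepted only if its counter has not been seen before and is not older than the window.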
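A minimal sketch of a counter-plus-window check, assuming each device attaches a monotonically increasing counter to every message and that the counter is covered by the message's authentication tag. `ReplayWindow` and its default size are illustrative names and values, not taken from any specific protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayWindow:
    """Per-device sliding window over a monotonic message counter (illustrative)."""
    size: int = 64                          # hypothetical window width
    highest: int = -1                       # highest counter accepted so far
    seen: set = field(default_factory=set)  # counters accepted inside the window

    def accept(self, counter: int) -> bool:
        # Too old: it has fallen behind the window, so freshness cannot be shown.
        if counter <= self.highest - self.size:
            return False
        # Already accepted once: this is a replay.
        if counter in self.seen:
            return False
        self.seen.add(counter)
        if counter > self.highest:
            self.highest = counter
            # Forget counters that have slid out of the window.
            floor = self.highest - self.size
            self.seen = {c for c in self.seen if c > floor}
        return True
```

The window tolerates reordering up to `size` messages without consulting wall-clock time. The check should run only after the message has been authenticated; otherwise an attacker can advance `highest` with forged counters.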
Define safe modes explicitly: what do devices do when policy can’t be fetched?
Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
If the system can enter an invalid state, it eventually will—usually during an incident.
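One way to keep identity out of invalid states is to make the lifecycle an explicit state machine that rejects any transition not listed. The sketch below is an assumption about how the stages could be wired together; the state names follow the lifecycle above, but the transition table itself is hypothetical.

```python
from enum import Enum


class IdentityState(Enum):
    PROVISIONED = "provisioned"
    ATTESTED = "attested"
    ROTATING = "rotating"
    REVOKED = "revoked"
    FORENSICS = "forensics"


# Hypothetical transition table: anything not listed here is rejected.
ALLOWED = {
    IdentityState.PROVISIONED: {IdentityState.ATTESTED, IdentityState.REVOKED},
    IdentityState.ATTESTED: {IdentityState.ROTATING, IdentityState.REVOKED},
    IdentityState.ROTATING: {IdentityState.ATTESTED, IdentityState.REVOKED},
    IdentityState.REVOKED: {IdentityState.FORENSICS},
    IdentityState.FORENSICS: set(),
}


def transition(current: IdentityState, target: IdentityState) -> IdentityState:
    """Return the new state, or raise if the move is not explicitly allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal identity transition: {current.value} -> {target.value}"
        )
    return target
```

In this table a revoked identity can never return to service; a device that should come back has to be provisioned again under a new identity.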
Security properties
- Evidence: critical actions emit verifiable audit events.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Replay resistance: duplicated inputs do not change outcomes.
- Authenticity: actions are bound to identity and purpose.
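Binding an action to identity and purpose can be illustrated by including both in the authenticated data. The sketch below uses a shared-key HMAC from the Python standard library as a stand-in; real fleets may prefer asymmetric signatures. The field layout (`device_id`, `purpose`, `counter`) is an assumption made for this example.

```python
import hashlib
import hmac


def _frame(*fields: bytes) -> bytes:
    # Length-prefix each field so ("ab", "c") and ("a", "bc") never collide.
    return b"".join(len(f).to_bytes(4, "big") + f for f in fields)


def sign(key: bytes, device_id: str, purpose: str, counter: int, payload: bytes) -> bytes:
    """Compute a tag over identity, purpose, counter, and payload together."""
    msg = _frame(
        device_id.encode(),
        purpose.encode(),
        counter.to_bytes(8, "big"),
        payload,
    )
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify(key: bytes, device_id: str, purpose: str, counter: int,
           payload: bytes, tag: bytes) -> bool:
    expected = sign(key, device_id, purpose, counter, payload)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, tag)
```

Because the purpose is part of the tagged data, a tag issued for telemetry does not verify when the same payload is presented as a control message.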
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Timeout ambiguity causing double-apply or partial state transitions.
- Config drift that weakens security posture over time.
- Recovery paths that only work when nothing is broken.
A recovery plan that isn’t exercised will fail when you need it.
Design sketch
flowchart TD
dev["Device (identity + attestation)"] --> gw["Gateway"]
gw --> bus["Message Bus"]
bus --> ingest["Ingestion"]
ingest --> tsdb["Time-Series Store"]
tsdb --> apps["Analytics / Control Plane"]
Implementation notes
Prefer protocols that degrade safely under packet loss and skew.
Acknowledge only after durability (or make “ack” explicitly best-effort).
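A minimal sketch of "ack after durability", assuming a local append-only log file stands in for the durable store. `append_durably`, `handle_message`, and the `send_ack` callback are hypothetical names introduced for this example.

```python
import os


def append_durably(path: str, record: bytes) -> None:
    """Append one length-prefixed record and return only after it is flushed."""
    with open(path, "ab") as f:
        f.write(len(record).to_bytes(4, "big") + record)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to stable storage


def handle_message(path: str, record: bytes, send_ack) -> None:
    append_durably(path, record)
    # Reached only if the write and fsync above did not raise.
    send_ack()
```

If the write fails, no ack is sent and the sender retries, which is why the downstream apply step needs to tolerate duplicates. How strong the `fsync` guarantee is depends on the filesystem and the storage hardware, so treat this as a lower bound on what a real gateway would need.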
Firmware update safety checklist:
- Signed manifest with version + hash
- Rollback protection (anti-downgrade)
- A/B partitions or staged apply
- Health check + watchdog
- Telemetry proves rollout state
Verification strategy
- Replay/reorder simulations for telemetry and control messages.
- Hardware-in-the-loop tests for update and recovery paths.
- Key rotation drills across device + gateway + cloud.
- Scale tests: provisioning bursts, reconnect storms, gateway failures.
- Power-loss fault injection during flash writes and installs.
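Tests for update paths need a precise statement of what a device must reject. The sketch below covers two items from the firmware update safety checklist, the version + hash check and rollback protection, and assumes the manifest signature has already been verified elsewhere. `Manifest` and `check_update` are illustrative names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    version: int  # monotonically increasing release number (assumed)
    sha256: str   # hex digest of the firmware image


def check_update(manifest: Manifest, image: bytes, installed_version: int) -> str:
    """Return "ok" or a rejection reason. Signature verification is assumed done."""
    # Anti-downgrade: never install a version at or below the current one.
    if manifest.version <= installed_version:
        return "rollback"
    # Integrity: the image must match the digest named in the manifest.
    if hashlib.sha256(image).hexdigest() != manifest.sha256:
        return "hash-mismatch"
    return "ok"
```

Returning a reason rather than a boolean lets replay and fault-injection tests assert on why an update was refused, and lets telemetry report the same reason from the field.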
Operational notes
- Design rollouts to be interruptible and reversible.
- Maintain an identity inventory: device → cert/keys → firmware version.
- Monitor fleet health by cohort (version, region, gateway).
- Treat time sync alerts as security signals (for example, NTP manipulation).
- Make revocation fast: emergency disable, quarantine, and re-enrollment.
Keep audit and config history queryable during incidents—evidence beats intuition.
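Monitoring by cohort, as suggested in the notes above, can be sketched as a simple group-by over device health reports. The report shape (a dict with `version`, `region`, `gateway`, and `healthy` fields) is an assumption made for illustration.

```python
from collections import defaultdict


def failure_rate_by_cohort(reports, keys=("version", "region", "gateway")):
    """Group device reports by cohort and return {cohort: failure fraction}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in reports:
        cohort = tuple(r[k] for k in keys)
        totals[cohort] += 1
        if not r["healthy"]:
            failures[cohort] += 1
    return {c: failures[c] / totals[c] for c in totals}
```

A fleet-wide average can hide a failing cohort; splitting by version, region, and gateway makes it easier to see whether a problem follows a firmware release or a single gateway.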
What to monitor
- Rollback events and the conditions that triggered them.
- Retry/timeout rates by endpoint and client cohort.
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Admission-control / rate-limit rejections (by reason).
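Counting rejections by reason only works if the admission check names its reason at the point of refusal. A minimal sketch, assuming two hypothetical caps (payload size and in-flight requests) and an in-process counter in place of a real metrics client:

```python
from collections import Counter

MAX_PAYLOAD_BYTES = 4096  # hypothetical cap
MAX_IN_FLIGHT = 100       # hypothetical cap

rejections = Counter()    # stand-in for a metrics client, keyed by reason


def admit(payload: bytes, in_flight: int) -> bool:
    """Admit a request only if it fits the caps; record why it was refused."""
    if len(payload) > MAX_PAYLOAD_BYTES:
        rejections["payload_too_large"] += 1
        return False
    if in_flight >= MAX_IN_FLIGHT:
        rejections["queue_full"] += 1
        return False
    return True
```

A spike in `payload_too_large` points at a misbehaving or hostile client, while a spike in `queue_full` points at capacity, and the two call for different responses.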
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Keep dual-write / dual-verify windows where appropriate.
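An explicit rollback trigger can be written down as a pure function of canary metrics, which makes it reviewable and testable before an incident. The thresholds below are placeholder values chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.02       # hypothetical: 2% failed health checks
    max_p99_latency_ms: float = 500.0  # hypothetical latency ceiling
    min_sample_size: int = 50          # avoid deciding on too little data


def should_roll_back(samples: int, errors: int, p99_latency_ms: float,
                     t: Thresholds = Thresholds()) -> bool:
    """Decide whether canary metrics cross the rollback thresholds."""
    if samples < t.min_sample_size:
        return False  # not enough evidence yet; keep the canary small
    if errors / samples > t.max_error_rate:
        return True
    return p99_latency_ms > t.max_p99_latency_ms
```

The minimum sample size is a design choice: it prevents a single noisy device from reverting a rollout, at the cost of reacting later. A stricter variant could roll back immediately on any hard failure, regardless of sample count.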
Evidence
- Learn TLA+ (1) — Practical entry point for specification and model checking.
  - Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
  - Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- What is the blast radius of a compromised gateway?
- How quickly can you revoke a compromised device identity globally?
- What does “safe behavior” mean when the cloud is unreachable?
- Which messages are allowed to cause physical effects and under what conditions?
Checklist
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Failure modes enumerated with mitigations.
- Telemetry captures correctness signals.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
Further reading
- Uptane — Secure software updates for fleets with realistic threat models.
- NISTIR 8259A: IoT Device Cybersecurity Capability Core Baseline — Baseline capabilities and lifecycle expectations for devices.
- The Update Framework (TUF) Specification — Secure update metadata, compromise recovery, and key rotation.
- MQTT Version 5.0 (OASIS) — Messaging semantics, session behavior, and constraints at the edge.
- Learn TLA+ — Practical entry point for specification and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.