Anomaly Detection: What 'Baseline' Means in Industrial Systems

Monthly research note. Theme: IIoT Platforms & Edge Security.

TL;DR

Treat the anomaly-detection baseline in industrial systems as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Gateways are security boundaries; isolate blast radius and enforce policy early.
Design for power loss and intermittent links; recovery is the primary feature.
Secure updates need rollback protection and staged rollout with safety rails.
Bind security decisions to evidence (audit, invariants, telemetry).
Write assumptions down; treat them as interfaces.

Why this matters

Identity and freshness are the foundation of telemetry integrity.
Operational constraints (bandwidth, CPU) drive protocol choices.
Edge systems fail differently: power loss, intermittent links, and physical access.
Adversaries can replay and spoof data to mislead control planes.

Key questions

How do you prevent replay and reordering from becoming false control signals?
What is your offline behavior (safe mode vs degraded mode)?
What does incident response look like at fleet scale?
How do devices enroll securely (no shared secrets, minimal manual steps)?
Where do you terminate trust (device, gateway, cloud) and why?
How do you handle intermittent connectivity without corrupting state?

Assumptions

Connectivity is intermittent and high-latency; retries amplify costs.
Devices experience power loss and abrupt restarts.
Gateways can be compromised; isolate blast radius.
Firmware updates can fail mid-flight; partial installation is possible.

Non-goals

Relying on the cloud to enforce edge-local safety properties.
Treating identity as a static certificate file.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

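Fleet rollout safety is a precondition on every version step: rolling out v_{k+1} requires that v_k can still be restored and that telemetry is healthy: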

\text{rollout}(v_{k+1}) \Rightarrow \text{can\_rollback}(v_k)\ \wedge\ \text{telemetry\_healthy}.
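
The implication above can be made executable as a gate that runs before each rollout step. The following is a minimal sketch, not a definitive implementation: the `FleetState` fields and the 1% threshold are illustrative assumptions, not taken from any specific platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetState:
    # All fields are hypothetical; a real fleet would derive them from telemetry.
    rollback_image_verified: bool  # v_k is still present and installable
    error_rate: float              # fraction of devices failing health checks
    max_error_rate: float = 0.01   # assumed health threshold


def can_rollback(state: FleetState) -> bool:
    return state.rollback_image_verified


def telemetry_healthy(state: FleetState) -> bool:
    return state.error_rate <= state.max_error_rate


def may_rollout_next(state: FleetState) -> bool:
    """Allow rollout(v_{k+1}) only if both preconditions hold."""
    return can_rollback(state) and telemetry_healthy(state)
```

Keeping the two predicates separate lets the gate report which precondition failed rather than a bare refusal.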

Use monotonic counters when time is untrusted; combine with nonces and bounded windows.
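
One way to combine the three mechanisms is a per-session sliding window, similar in spirit to the anti-replay windows used by network security protocols. The sketch below assumes the message has already been authenticated (for example by a MAC covering nonce and counter); the class name `ReplayGuard` and the window size of 64 are hypothetical.

```python
import hmac


class ReplayGuard:
    """Accept each (nonce, counter) pair at most once within a bounded window."""

    def __init__(self, session_nonce: bytes, window: int = 64):
        self.session_nonce = session_nonce  # issued by the receiver per session
        self.window = window                # how far behind `highest` we still accept
        self.highest = -1                   # highest counter accepted so far
        self.seen = set()                   # accepted counters inside the window

    def accept(self, nonce: bytes, counter: int) -> bool:
        # A message bound to another session's nonce is a cross-session replay.
        if not hmac.compare_digest(nonce, self.session_nonce):
            return False
        if counter < 0 or counter <= self.highest - self.window:
            return False  # older than the window: cannot prove freshness
        if counter in self.seen:
            return False  # exact duplicate
        self.seen.add(counter)
        if counter > self.highest:
            self.highest = counter
            floor = self.highest - self.window
            # Prune so memory stays bounded by the window size.
            self.seen = {c for c in self.seen if c > floor}
        return True
```

The window tolerates bounded reordering on lossy links while still rejecting duplicates. After a restart, `highest` would need to be restored from persistent storage, or a fresh session nonce issued.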

Treat device identity as a lifecycle: provision → attest → rotate → revoke → forensics.
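
The lifecycle can be sketched as an explicit state machine so that invalid transitions are rejected rather than silently accepted. The states follow the sequence above; the extra edges (re-attestation after rotation, revocation from any active state) are assumptions made for illustration.

```python
from enum import Enum


class IdentityState(Enum):
    PROVISIONED = "provisioned"
    ATTESTED = "attested"
    ROTATED = "rotated"
    REVOKED = "revoked"
    FORENSICS = "forensics"


S = IdentityState

# Allowed transitions. Edges beyond the linear lifecycle are assumptions.
ALLOWED = {
    S.PROVISIONED: {S.ATTESTED, S.REVOKED},
    S.ATTESTED: {S.ROTATED, S.REVOKED},
    S.ROTATED: {S.ATTESTED, S.REVOKED},  # new keys must be attested again
    S.REVOKED: {S.FORENSICS},
    S.FORENSICS: set(),                  # terminal: identity is never reused
}


def transition(current: IdentityState, target: IdentityState) -> IdentityState:
    """Return the new state, or raise if the lifecycle forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"invalid identity transition: {current.value} -> {target.value}"
        )
    return target
```

A revoked identity cannot return to service in this sketch; the device would have to be provisioned again under a new identity.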

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
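
A small helper is often enough to turn an invariant into a signal. In this sketch the in-process `Counter` stands in for a real metrics client, and the invariant name in the usage note is a hypothetical example.

```python
import logging
from collections import Counter

log = logging.getLogger("invariants")
VIOLATIONS = Counter()  # stand-in for a metrics client; keyed by invariant name


def check_invariant(invariant: str, holds: bool, **context) -> bool:
    """Record a violation (metric + log line) when `holds` is False."""
    if not holds:
        VIOLATIONS[invariant] += 1
        log.error("invariant violated: %s context=%r", invariant, context)
    return holds
```

For example, `check_invariant("firmware_version_monotonic", new >= old, old=old, new=new)` leaves the counter at zero in normal operation, so any non-zero value is worth an alert.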

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
Integrity: invalid transitions are rejected (and detectable).
Replay resistance: duplicated inputs do not change outcomes.

Failure modes

Config drift that weakens security posture over time.
Mixed-version behavior that violates assumptions silently.
Timeout ambiguity causing double-apply or partial state transitions.
Observability gaps during incidents (missing evidence).
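
Timeout ambiguity is commonly handled by making commands idempotent: the sender retries with the same key, and the receiver applies each key at most once. A minimal in-memory sketch follows; the `CommandApplier` class and the setpoint example are hypothetical.

```python
class CommandApplier:
    """Apply each command at most once, keyed by a sender-chosen idempotency key."""

    def __init__(self):
        self.setpoint = 0
        self.results = {}  # idempotency key -> result returned for that key

    def apply(self, key: str, delta: int) -> int:
        # A retry after a timeout carries the same key: return the recorded
        # result instead of applying the change a second time.
        if key in self.results:
            return self.results[key]
        self.setpoint += delta
        self.results[key] = self.setpoint
        return self.setpoint
```

On a real device the result table would have to be persisted atomically with the state change and bounded in size; otherwise a power loss between the two writes reintroduces the double-apply.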

Pitfall

Sampled telemetry can hide the rare schedule (event interleaving) that breaks your invariants.

Design sketch

flowchart TD
  dev["Device (identity + attestation)"] --> gw["Gateway"]
  gw --> bus["Message Bus"]
  bus --> ingest["Ingestion"]
  ingest --> tsdb["Time-Series Store"]
  tsdb --> apps["Analytics / Control Plane"]

Implementation notes

Treat the gateway as a security boundary, not a dumb proxy.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
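
As an illustration of the rule, the sketch below checks size before parsing and caps collection lengths before anything downstream sees the message. The message shape (`device_id`, `readings`) and both limits are assumptions chosen for the example.

```python
import json

MAX_BYTES = 4096    # assumed cap on raw payload size
MAX_READINGS = 64   # assumed cap on readings per message


class Rejected(ValueError):
    """Raised when a message is refused before any expensive work is done."""


def parse_telemetry(raw: bytes) -> dict:
    # 1. The cheapest check first: size, before any parsing.
    if len(raw) > MAX_BYTES:
        raise Rejected("payload too large")
    # 2. Parse. Deeply nested input can exhaust the recursion limit.
    try:
        msg = json.loads(raw)
    except (ValueError, RecursionError) as exc:
        raise Rejected("malformed JSON") from exc
    # 3. Validate shape and cap collection sizes.
    if not isinstance(msg, dict) or not isinstance(msg.get("device_id"), str):
        raise Rejected("missing device_id")
    readings = msg.get("readings")
    if not isinstance(readings, list) or len(readings) > MAX_READINGS:
        raise Rejected("readings missing or too many")
    if not all(isinstance(r, (int, float)) and not isinstance(r, bool)
               for r in readings):
        raise Rejected("non-numeric reading")
    # Return only the validated fields; unknown keys are dropped.
    return {"device_id": msg["device_id"], "readings": readings}
```

Range and plausibility checks on the readings themselves would follow as a later validation step.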

Firmware update safety checklist:
- Signed manifest with version + hash
- Rollback protection (anti-downgrade)
- A/B partitions or staged apply
- Health check + watchdog
- Telemetry proves rollout state

Verification strategy

Power-loss fault injection during flash writes and installs.
Scale tests: provisioning bursts, reconnect storms, gateway failures.
Replay/reorder simulations for telemetry and control messages.
Key rotation drills across device + gateway + cloud.
Hardware-in-the-loop tests for update and recovery paths.

Operational notes

Treat time sync alerts as security signals (NTP manipulation).
Design rollouts to be interruptible and reversible.
Make revocation fast: emergency disable, quarantine, and re-enrollment.
Monitor fleet health by cohort (version, region, gateway).
Maintain an identity inventory: device → cert/keys → firmware version.
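
Cohort monitoring can be as simple as grouping health reports by one attribute at a time. The sketch below assumes each report is a dict with a boolean `healthy` field plus cohort attributes such as `version` or `region`; these field names are illustrative.

```python
from collections import defaultdict


def failure_rate_by_cohort(reports, key):
    """Return {cohort value: fraction of unhealthy devices} for one attribute."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for report in reports:
        cohort = report[key]
        totals[cohort] += 1
        if not report["healthy"]:
            failures[cohort] += 1
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}
```

Comparing the result for `key="version"` against `key="gateway"` helps separate a bad firmware build from a failing gateway, which a fleet-wide average would blur together.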

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).
Error budget burn + tail latency under load.

Rollback plan

Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
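
An explicit trigger and a staged rollout fit together naturally: each stage is observed, and the first threshold breach stops the rollout. The sketch below is one possible shape; the metric names, thresholds, and stage fractions are all hypothetical and would normally be derived from SLOs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical values for illustration only.
THRESHOLDS = {"error_rate": 0.02, "p99_latency_ms": 500.0}
STAGES = (0.01, 0.10, 0.50, 1.00)  # fraction of the fleet per stage


def breaches(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics above their threshold.

    A missing metric counts as a breach: no evidence, no progress.
    """
    return [name for name, limit in thresholds.items()
            if metrics.get(name, float("inf")) > limit]


def staged_rollout(observe: Callable[[float], Dict[str, float]],
                   stages=STAGES,
                   thresholds=THRESHOLDS) -> Tuple[str, float, List[str]]:
    """Advance stage by stage; stop and report reasons on the first breach."""
    for fraction in stages:
        reasons = breaches(observe(fraction), thresholds)
        if reasons:
            return ("rollback", fraction, reasons)
    return ("complete", stages[-1], [])
```

Here `observe` is a caller-supplied function that deploys to the given fraction and returns the resulting metrics. Returning the breached metric names gives the rollback event the recorded trigger condition that the monitoring section asks for.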

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

What is the blast radius of a compromised gateway?
How quickly can you revoke a compromised device identity globally?
Which messages are allowed to cause physical effects and under what conditions?
What does “safe behavior” mean when the cloud is unreachable?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
