Failure Modes and Recovery

Draft

Capsule treats loss of data — and loss of the keys that decrypt it — as a first-class concern. This doc catalogues what can go wrong, how each failure is detected or contained, and the redundant, independent paths that restore a user’s entire asset collection — including after catastrophic software bugs, not just key loss.

It is a cross-cutting doc by nature: the failure-mode logic lives in many modules (key handling in `capsule-core::crypto::keys`, restore in `capsule-core::backup`, blob durability in `capsule-api`, and so on). The contract this doc owns is the set of failures the system is required to survive and the independent paths that must each remain workable. The Transport Security section covers the one piece of crypto config that lives outside the application layer.

Failure Mode Catalog

Failure mode	Detected / contained by	Recovery path
Master key loss	—	Master-key escrow (path 1) or cross-device recovery (path 2)
Device key loss	Device keys are disposable by design	Re-bootstrap from the master key (path 1/2); device keys are never recovered
AMK loss (album key)	—	OGK escrow (path 3) and the master-key-anchored backup escrow (path 4)
Write-tier key loss	—	Re-minted and redistributed over MLS at the next epoch; no asset is lost
Master key compromise	—	Rotate the master key + re-key affected albums — see Master-Key Compromise
Device compromise	—	Device revocation certificate + MLS `Remove`; surviving devices rotate group keys
AMK / write-tier compromise	—	MLS epoch bump mints a fresh AMK and write-tier key; the compromised epoch cannot read or sign future epochs
Server compromise	Server is never trusted for authorization or plaintext	Authorization is verified against MLS history; data is E2E-encrypted at rest
Classical primitive broken (Ed25519, X25519)	Hybrid construction	The PQ half (ML-DSA-65 / ML-KEM-768) still holds — confidentiality and authentication survive
PQ primitive broken (ML-DSA, ML-KEM)	Hybrid construction	The classical half still holds
Ciphertext corruption; chunk truncation, reorder, or deletion	AES-256-GCM-STREAM per-chunk tags + the manifest’s `ciphertext_hash` (algorithm fixed by `crypto_suite_id`)	Re-fetch the blob from a content-addressed copy (path 6)
Reader-signed / removed-writer / wrong-epoch / forged-chain / replayed manifest	The single `verify_asset` chokepoint	Asset is quarantined and surfaced in the audit trail
AMK distribution lag (manifest cites an in-flight epoch whose key has not yet arrived)	`verify_asset` pending outcome — `amk_version` is within the MLS-attested range but the AlbumKeyDistribution message is still in transit	Asset is held and retried as MLS state catches up; escalated to quarantine only if the key never arrives within the timeout — never misread as a forgery
MLS ratchet corruption or loss	—	The recovery path is independent of ratchet state (paths 1, 3, 4). State-divergence repair owned by MLS Resilience
Backup incompleteness (a referenced `amk_version` missing from the escrow)	Backup verification’s AMK-completeness check	Caught before the backup is relied on; re-export
Nonce reuse	Structurally prevented	STREAM derives per-chunk nonces; metadata blobs draw fresh random nonces; a fresh per-file key lets the STREAM counter start at zero
CBOR non-determinism breaking cross-peer signature verification	RFC 8949 §4.2 deterministic encoding	Byte-identical re-encoding; the signature verifies
Catastrophic software bug corrupting the library DB / index	The DB is a rebuildable cache, not a source of truth	Filesystem rebuild from CBOR sidecars (path 5)
Erroneous delete (bug or user)	Soft-delete is the default	Restore from trash within the retention window (path 7)
Stale-revival attempt (peer or restore sends an old-but-validly-signed manifest)	`prior_provenance_hash` chain (see Provenance) and matching server-side envelope check (see Threat Model)	Manifest is quarantined; chain advance is refused on both client and server
Suite-downgrade attempt (re-sign a manifest under a weaker `crypto_suite_id`)	Signature covers `crypto_suite_id` and `protocol_version`	Verification fails at `verify_asset`; manifest is quarantined
Derivative poisoning (buggy or hostile client overwrites a good thumbnail/embedding)	Every derivative carries a `DerivativeManifest` on its own chain	Overwrite without a valid manifest is rejected; provenance chain detects an unauthorized replacement
Cross-schema sidecar overwrite (old client writes back a sidecar after stripping unknown fields)	Sidecar signature covers every byte, including unknown fields; an old client refuses to write when `sidecar_schema` exceeds the highest version it knows	Old client cannot strip-and-resign; new client detects schema regression and quarantines
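
The AMK distribution lag row describes a three-way outcome: an asset whose key is still in transit is held, not treated as a forgery. The sketch below illustrates that classification only. The names `KeyCheck`, `KeyState`, and `classify_amk` are hypothetical, and representing the MLS-attested range as a pair of version numbers is an assumption; the actual `verify_asset` signature is not specified in this doc.

```rust
use std::time::Duration;

/// Hypothetical result of the key-availability step of a `verify_asset`-style check.
#[derive(Debug, PartialEq, Eq)]
pub enum KeyCheck {
    /// The key is present; the remaining checks (signature, chain) can proceed.
    KeyReady,
    /// The key is plausibly still in transit; hold the asset and retry.
    Pending,
    /// Surfaced in the audit trail with a reason; never silently dropped.
    Quarantined(&'static str),
}

/// Hypothetical snapshot of what the receiver knows about one album's keys.
pub struct KeyState {
    pub attested_min: u64, // lowest amk_version attested by MLS (assumed shape)
    pub attested_max: u64, // highest amk_version attested by MLS (assumed shape)
    pub key_present: bool, // has the AlbumKeyDistribution message arrived?
    pub waited: Duration,  // how long the asset has been held so far
}

pub fn classify_amk(amk_version: u64, s: &KeyState, timeout: Duration) -> KeyCheck {
    // A version outside the attested range can never be explained by lag.
    if amk_version < s.attested_min || amk_version > s.attested_max {
        return KeyCheck::Quarantined("amk_version outside MLS-attested range");
    }
    if s.key_present {
        return KeyCheck::KeyReady;
    }
    // In range but missing: lag until the timeout elapses, quarantine after.
    if s.waited >= timeout {
        return KeyCheck::Quarantined("AMK never arrived within timeout");
    }
    KeyCheck::Pending
}
```

Keeping `Pending` distinct from `Quarantined` is what lets an operator tell slow delivery from a forged epoch reference.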

Redundant Recovery Paths

Restoring a complete asset collection does not depend on any single mechanism. The following paths are independent — each annotated with the failures it survives:

1. Master-key escrow. A recovery passphrase or BIP39-style seed unwraps the server-side escrow blob → account master key → AMK escrow → every asset. Survives: total device loss. See Master-Key Escrow.
2. Cross-device recovery. Any signed-in device re-bootstraps a new device over a verified channel. Survives: partial device loss, and loss of the master-key backup, as long as one device survives. The first-device flow is owned by Device Enrollment.
3. Owner Group Key (OGK). Any current member of the owner set recovers every album’s AMK versions, independent of album membership. Survives: lost album membership, gaps in AMK distribution over MLS.
4. Portable backup artifact. A self-describing, versioned, encrypted archive, stored offline. Survives: server data loss, account compromise, escrow-blob corruption. See Backup Artifact for the container format.
5. Recovery-first filesystem rebuild. CBOR sidecars are the canonical metadata store; the database is a rebuildable query cache. The idempotent `rebuild_index()` (`capsule-core::library::rebuild`) walks `.cbor` sidecars and reconstructs the index. Survives: DB corruption and catastrophic bugs in the index/query layer.
6. Content-addressed durability redundancy. Ciphertext is addressed by the SHA-256 of its bytes, so any byte-identical copy (on another device or a federated peer) is independently verifiable. This is a durability path: it restores ciphertext, not keys. Survives: single-server data loss.
7. Trash soft-delete window. Deletes are soft first: `soft_delete()` / `purge_expired_trash()` (`capsule-core::library::trash`) give a reversal window before a hard purge. Survives: erroneous deletes by a bug or user.
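
The trash window in the last path can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the `capsule-core::library::trash` implementation: the function names `soft_delete` and `purge_expired_trash` come from the text above, but their signatures, the `Library` type, the `restore` and `insert` helpers, and the use of plain-second timestamps are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory library with a trash retention window.
/// Timestamps are plain seconds so the example stays deterministic.
pub struct Library {
    retention_secs: u64,
    live: HashMap<String, Vec<u8>>,
    trash: HashMap<String, (Vec<u8>, u64)>, // asset id -> (data, deleted_at)
}

impl Library {
    pub fn new(retention_secs: u64) -> Self {
        Self { retention_secs, live: HashMap::new(), trash: HashMap::new() }
    }

    pub fn insert(&mut self, id: &str, data: Vec<u8>) {
        self.live.insert(id.to_string(), data);
    }

    pub fn contains(&self, id: &str) -> bool {
        self.live.contains_key(id)
    }

    /// Move an asset to the trash instead of destroying it.
    pub fn soft_delete(&mut self, id: &str, now: u64) -> bool {
        match self.live.remove(id) {
            Some(data) => {
                self.trash.insert(id.to_string(), (data, now));
                true
            }
            None => false,
        }
    }

    /// Reverse a delete while the asset is still in the trash.
    pub fn restore(&mut self, id: &str) -> bool {
        match self.trash.remove(id) {
            Some((data, _)) => {
                self.live.insert(id.to_string(), data);
                true
            }
            None => false,
        }
    }

    /// Hard-purge entries whose retention window has elapsed.
    /// Returns the number of entries purged.
    pub fn purge_expired_trash(&mut self, now: u64) -> usize {
        let before = self.trash.len();
        let retention = self.retention_secs;
        self.trash
            .retain(|_, (_, deleted_at)| now.saturating_sub(*deleted_at) < retention);
        before - self.trash.len()
    }
}
```

The point of the split is that only `purge_expired_trash` destroys data, so an erroneous `soft_delete` stays reversible until the window closes.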

Account-type coverage. Registered accounts have all seven paths. Delegated/sponsored accounts are recovered via the sponsoring account’s master key, since their keys derive from it — effectively one path class, routed entirely through the sponsor; the sponsoree recovery matrix states the consequence (a sponsor who irrecoverably loses their master key loses every sponsoree’s data with it). Non-registered (share-link) accounts hold no collection of their own — recovery is not applicable.
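
The coverage rule above can be written down as a small lookup. This is only an illustrative sketch: `AccountType`, `Coverage`, and `coverage` are hypothetical names, and numbering the paths 1 through 7 follows the order of the list in this section.

```rust
/// Account classes named in the coverage paragraph (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccountType {
    Registered,
    Delegated, // delegated / sponsored
    ShareLink, // non-registered
}

/// What recovery looks like for each account class (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Coverage {
    /// Independent recovery paths, identified by their number in the list.
    Paths(Vec<u8>),
    /// Routed entirely through the sponsoring account's master key.
    ViaSponsor,
    /// The account holds no collection of its own.
    NotApplicable,
}

pub fn coverage(account: AccountType) -> Coverage {
    match account {
        AccountType::Registered => Coverage::Paths((1..=7).collect()),
        AccountType::Delegated => Coverage::ViaSponsor,
        AccountType::ShareLink => Coverage::NotApplicable,
    }
}
```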

Bug-Resistance Invariants

These cross-cutting properties make recovery robust specifically against catastrophic bugs, not just key loss:

The backup path is independent of the MLS ratchet. Restore never reconstructs ratchet state, so a ratchet bug cannot strand data. The master key — not any ratchet state — is the single backed-up root.
Hardware-bound, disposable device keys. Device keys live inside hardware, are non-exportable, and are never backed up — a lost device is re-bootstrapped, not recovered.
Cross-signing (Matrix-style). The master identity signs every device key; adding a device means an existing device signs it, so losing one device never compromises the account.
Every construction is versioned. KDF info strings, in-blob Argon2id parameters, the `crypto_suite_id` on every manifest and metadata blob, and the `sidecar_schema` on every sidecar all carry a version, so a buggy v2 never strands v1 data: v2 keys and structures coexist with v1 without a flag day. Signature coverage of `crypto_suite_id` defeats downgrade attempts.
verify_asset quarantines, never drops. A bug-produced invalid asset is neither silently dropped nor silently accepted; it is quarantined and surfaced in the audit trail so an operator can tell a bug from an attack.
Provenance is append-only. Each ProvenanceRecord carries the hash of its predecessor (prior_provenance_hash), and every record is hybrid-signed by the producing device. An attacker holding every current key still cannot rewrite a past record without forging an earlier (possibly retired) device’s signature — history is read-only. See Provenance.
Stale-revival is rejected. An incoming manifest whose prior_provenance_hash is behind the receiver’s stored latest_provenance_hash is treated as stale and quarantined — a deleted asset cannot be silently resurrected by a peer or a backup restore. The check is enforced both client-side and server-side (no key needed); see Threat Model.
Backup verification runs before reliance. Preview, dry-run, signature-chain, and AMK-completeness checks (see Backup Verification) detect an incomplete or broken backup before it is needed.
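
The stale-revival invariant can be sketched as a comparison between an incoming manifest's `prior_provenance_hash` and the receiver's chain. The sketch assumes the receiver keeps the ordered list of hashes it has accepted (the text above only says it stores `latest_provenance_hash`, so retaining history is an assumption), treats hashes as opaque 32-byte values, and does not model the first record of a chain. `ChainCheck` and `check_prior` are hypothetical names.

```rust
/// Opaque 32-byte provenance hash; how it is computed is out of scope here.
pub type Hash = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
pub enum ChainCheck {
    /// `prior` equals the stored head: the manifest extends the chain.
    Advance,
    /// `prior` is an earlier link: old but validly signed, so quarantine.
    Stale,
    /// `prior` is not on the known chain at all: quarantine as a mismatch.
    Unknown,
}

/// `history` is the receiver's accepted chain, oldest first; its last
/// element plays the role of `latest_provenance_hash`.
pub fn check_prior(history: &[Hash], prior_provenance_hash: &Hash) -> ChainCheck {
    match history.split_last() {
        Some((latest, _)) if latest == prior_provenance_hash => ChainCheck::Advance,
        Some((_, earlier)) if earlier.contains(prior_provenance_hash) => ChainCheck::Stale,
        _ => ChainCheck::Unknown,
    }
}
```

Because the check compares hashes only, it needs no key material, which is consistent with the same rule being enforceable on the server.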

Transport Security

All client-server communication is over HTTPS. While Capsule’s stack aims to be PQ-safe in due course, the transport layer (TLS) must be configured by the server administrator to be PQ-resistant as well. As of 2026, the standard is TLS 1.3 with hybrid X25519+ML-KEM key exchange enabled. Because application servers do not terminate TLS, the administrator must ensure that the ingress/reverse proxy is configured accordingly.

The version policy is the same everywhere: TLS 1.3 is required wherever both endpoints support it; TLS 1.2 is the absolute floor, admitted only for compatibility with legacy ingress/reverse-proxy deployments. Nothing below 1.2 is ever negotiated, and the PQ-hybrid key-exchange recommendation applies to 1.3 only. Every Capsule client attempts 1.3 first; deployment documentation shows administrators how to verify that their edge actually negotiates 1.3.
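
The version policy reduces to "take the highest version both sides support, and refuse anything under 1.2". The sketch below models only that rule. It is not rustls configuration: `TlsVersion`, `Negotiated`, and `negotiate` are hypothetical names, and it assumes each endpoint supports every version up to its stated maximum.

```rust
/// TLS protocol versions, declared in ascending order so that the derived
/// ordering matches protocol order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum TlsVersion {
    Tls10,
    Tls11,
    Tls12,
    Tls13,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Negotiated {
    pub version: TlsVersion,
    /// The PQ-hybrid key-exchange recommendation applies to TLS 1.3 only.
    pub pq_hybrid_applicable: bool,
}

/// Pick the highest mutually supported version, or `None` when that
/// version would fall below the TLS 1.2 floor.
pub fn negotiate(client_max: TlsVersion, server_max: TlsVersion) -> Option<Negotiated> {
    let version = client_max.min(server_max);
    if version < TlsVersion::Tls12 {
        return None;
    }
    Some(Negotiated {
        version,
        pq_hybrid_applicable: version == TlsVersion::Tls13,
    })
}
```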

The ingress hop is the one piece of cryptographic configuration that lives outside the application layer — the application code cannot enforce it, only document the requirement. Where Capsule code itself terminates or originates TLS — the SDK’s HTTP client, LAN peering’s mutual TLS, server-to-server egress — the implementation is the rustls pin in Dependencies, and the same version policy applies in code.

Validation

The failure-mode catalog itself is the verification spec: each row must have an executable test that exercises both the detection (the catalog’s middle column) and the recovery (the right column).

Per-recovery-path scenarios — seven smoke tests, one per path. Each takes a library to a “lost” state corresponding to the path’s Survives annotation, runs the recovery, and asserts every asset is recoverable. The tests share fixtures from the Keys and Provenance test surfaces.
Bug-resistance invariant checks — unit-test surface that asserts each invariant holds structurally:
- The backup-artifact format does not embed MLS ratchet state (assert by inspecting an exported artifact).
- Device private keys cannot be exported (asserted per-platform in the Keys hardware smoke).
- A v2 client can read a v1 sidecar and write a v2 sidecar that a v1 client still validates as v1 (cross-version round-trip).
Catalog completeness — a CI check that every row in the catalog has at least one referenced test. Adding a row without a test is a structural error, not a TODO.
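
A catalog-completeness check of this kind could look like the sketch below. It assumes the catalog rows and the test-to-row references have already been extracted into plain strings; how they are extracted, and the name `rows_without_tests`, are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Return every catalog row that no test references, in catalog order.
/// `tests` maps a test name to the catalog rows it claims to cover.
pub fn rows_without_tests<'a>(
    catalog: &[&'a str],
    tests: &BTreeMap<&str, Vec<&str>>,
) -> Vec<&'a str> {
    // Union of all rows referenced by at least one test.
    let covered: BTreeSet<&str> = tests.values().flatten().copied().collect();
    catalog
        .iter()
        .copied()
        .filter(|row| !covered.contains(*row))
        .collect()
}
```

A CI job would fail whenever the returned list is non-empty, which is what makes a row without a test a structural error.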

The full bounded recovery surface — including which paths must be exercised end-to-end across the full system — is in Module Map — E2E Test Surface.