
Failure Modes and Recovery

Capsule treats loss of data — and loss of the keys that decrypt it — as a first-class concern. This doc catalogues what can go wrong, how each failure is detected or contained, and the redundant, independent paths that restore a user’s entire asset collection — including after catastrophic software bugs, not just key loss.

It is a cross-cutting doc by nature: the failure-mode logic lives in many modules (key handling in capsule-core::crypto::keys, restore in capsule-core::backup, blob durability in capsule-api, etc.). The contract this doc owns is the set of failures the system is required to survive and the independent paths that must each remain workable. The closing transport security section is the one piece of crypto config that lives outside the application layer.

| Failure mode | Detected / contained by | Recovery path |
| --- | --- | --- |
| Master key loss | | Master-key escrow (path 1) or cross-device recovery (path 2) |
| Device key loss | Device keys are disposable by design | Re-bootstrap from the master key (path 1/2); device keys are never recovered |
| AMK loss (album key) | | OGK escrow (path 3) and the master-key-anchored backup escrow (path 4) |
| Write-tier key loss | | Re-minted and redistributed over MLS at the next epoch; no asset is lost |
| Master key compromise | | Rotate the master key + re-key affected albums; see Master-Key Compromise |
| Device compromise | | Device revocation certificate + MLS Remove; surviving devices rotate group keys |
| AMK / write-tier compromise | | MLS epoch bump mints a fresh AMK and write-tier key; the compromised epoch cannot read or sign future epochs |
| Server compromise | Server is never trusted for authorization or plaintext | Authorization is verified against MLS history; data is E2E-encrypted at rest |
| Classical primitive broken (Ed25519, X25519) | Hybrid construction | The PQ half (ML-DSA-65 / ML-KEM-768) still holds: confidentiality and authentication survive |
| PQ primitive broken (ML-DSA, ML-KEM) | Hybrid construction | The classical half still holds |
| Ciphertext corruption; chunk truncation, reorder, or deletion | AES-256-GCM-STREAM per-chunk tags + ciphertext_sha256 | Re-fetch the blob from a content-addressed copy (path 6) |
| Reader-signed / removed-writer / wrong-epoch / forged-chain / replayed manifest | The single verify_asset chokepoint | Asset is quarantined and surfaced in the audit trail |
| AMK distribution lag (manifest cites an in-flight epoch whose key has not yet arrived) | verify_asset pending outcome: amk_version is within the MLS-attested range but the AlbumKeyDistribution message is still in transit | Asset is held and retried as MLS state catches up; escalated to quarantine only if the key never arrives within the timeout, and never misread as a forgery |
| MLS ratchet corruption or loss | | The recovery path is independent of ratchet state (paths 1, 3, 4). State-divergence repair owned by MLS Resilience |
| Backup incompleteness (a referenced amk_version missing from the escrow) | Backup verification’s AMK-completeness check | Caught before the backup is relied on; re-export |
| Nonce reuse | Structurally prevented | STREAM derives per-chunk nonces; metadata blobs draw fresh random nonces; a fresh per-file key lets the STREAM counter start at zero |
| CBOR non-determinism breaking cross-peer signature verification | RFC 8949 §4.2 deterministic encoding | Byte-identical re-encoding; the signature verifies |
| Catastrophic software bug corrupting the library DB / index | The DB is a rebuildable cache, not a source of truth | Filesystem rebuild from CBOR sidecars (path 5) |
| Erroneous delete (bug or user) | Soft-delete is the default | Restore from trash within the retention window (path 7) |
| Stale-revival attempt (peer or restore sends an old-but-validly-signed manifest) | prior_provenance_hash chain (see Provenance) and matching server-side envelope check (see Threat Model) | Manifest is quarantined; chain advance is refused on both client and server |
| Suite-downgrade attempt (re-sign a manifest under a weaker crypto_suite_id) | Signature covers crypto_suite_id and protocol_version | Verification fails at verify_asset; manifest is quarantined |
| Derivative poisoning (buggy or hostile client overwrites a good thumbnail/embedding) | Every derivative carries a DerivativeManifest on its own chain | Overwrite without a valid manifest is rejected; provenance chain detects an unauthorized replacement |
| Cross-schema sidecar overwrite (old client writes back a sidecar after stripping unknown fields) | Sidecar signature covers every byte including unknown fields; old client refuses to write when sidecar_schema exceeds its max known | Old client cannot strip-and-resign; new client detects schema regression and quarantines |
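
The AMK distribution lag row in the catalog distinguishes three situations that are easy to conflate: a key that is present, a key that is attested but still in transit, and an epoch that MLS never attested. The following is a minimal sketch of that classification, not the actual verify_asset implementation. The names VerifyOutcome, QuarantineReason, and classify_amk_state, and the idea of passing the elapsed wait and timeout as plain durations, are assumptions made for illustration.

```rust
use std::ops::RangeInclusive;
use std::time::Duration;

/// Hypothetical outcome type; variant names are illustrative only.
#[derive(Debug, PartialEq, Eq)]
pub enum VerifyOutcome {
    /// The key for this epoch is available; manifest checks can proceed.
    Ready,
    /// The epoch is MLS-attested but its key has not arrived: hold and retry.
    Pending,
    /// Surfaced in the audit trail; never silently dropped.
    Quarantined(QuarantineReason),
}

#[derive(Debug, PartialEq, Eq)]
pub enum QuarantineReason {
    /// amk_version lies outside the MLS-attested range.
    UnattestedEpoch,
    /// The key did not arrive within the timeout.
    KeyNeverArrived,
}

/// Classifies the key state for a manifest that cites `amk_version`.
/// Signature and chain checks are out of scope for this sketch.
pub fn classify_amk_state(
    amk_version: u64,
    attested: &RangeInclusive<u64>,
    key_available: bool,
    waited: Duration,
    timeout: Duration,
) -> VerifyOutcome {
    // An epoch MLS never attested is not "lag"; it is quarantined at once.
    if !attested.contains(&amk_version) {
        return VerifyOutcome::Quarantined(QuarantineReason::UnattestedEpoch);
    }
    if key_available {
        return VerifyOutcome::Ready;
    }
    // Attested but missing: escalate only once the timeout has elapsed.
    if waited >= timeout {
        return VerifyOutcome::Quarantined(QuarantineReason::KeyNeverArrived);
    }
    VerifyOutcome::Pending
}
```

Keeping the quarantine reason separate from the outcome lets the audit trail record a late key differently from an unattested epoch, which is what allows an operator to tell lag from forgery.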

Restoring a complete asset collection does not depend on any single mechanism. The following paths are independent — each annotated with the failures it survives:

  1. Master-key escrow. A recovery passphrase or BIP39-style seed unwraps the server-side escrow blob → account master key → AMK escrow → every asset. Survives: total device loss. See Master-Key Escrow.
  2. Cross-device recovery. Any signed-in device re-bootstraps a new device over a verified channel. Survives: partial device loss, and loss of the master-key backup — as long as one device survives. The first-device flow is owned by Device Enrollment.
  3. Owner Group Key (OGK). Any current member of the owner set recovers every album’s AMK versions, independent of album membership. Survives: lost album membership, gaps in AMK distribution over MLS.
  4. Portable backup artifact. A self-describing, versioned, encrypted archive, stored offline. Survives: server data loss, account compromise, escrow-blob corruption. See Backup Artifact for the container format.
  5. Recovery-first filesystem rebuild. CBOR sidecars are the canonical metadata store; the database is a rebuildable query cache. The idempotent rebuild_index() (capsule-core::library::rebuild) walks .cbor sidecars and reconstructs the index. Survives: DB corruption and catastrophic bugs in the index/query layer.
  6. Content-addressed durability redundancy. Ciphertext is addressed by the SHA-256 of its bytes, so any byte-identical copy — on another device or a federated peer — is independently verifiable. This is a durability path: it restores ciphertext, not keys. Survives: single-server data loss.
  7. Trash soft-delete window. Deletes are soft first — soft_delete() / purge_expired_trash() (capsule-core::library::trash) give a reversal window before a hard purge. Survives: erroneous deletes by a bug or user.
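
Path 7 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the code in capsule-core::library::trash: the Trash struct, the AssetId alias, the restore method, and the signatures of soft_delete and purge_expired_trash (including passing the current time explicitly) are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Illustrative asset identifier; the real ID type is not specified here.
pub type AssetId = u64;

/// Hypothetical trash model: records when each asset was soft-deleted.
pub struct Trash {
    retention: Duration,
    deleted_at: HashMap<AssetId, SystemTime>,
}

impl Trash {
    pub fn new(retention: Duration) -> Self {
        Trash { retention, deleted_at: HashMap::new() }
    }

    /// Marks an asset as deleted without touching its bytes.
    /// A repeated delete keeps the original deletion time.
    pub fn soft_delete(&mut self, id: AssetId, now: SystemTime) {
        self.deleted_at.entry(id).or_insert(now);
    }

    /// Reverses a soft delete. Returns false if the asset is not in the trash.
    pub fn restore(&mut self, id: AssetId) -> bool {
        self.deleted_at.remove(&id).is_some()
    }

    /// Drops entries whose age has reached the retention window and returns
    /// their IDs (sorted) so the caller can perform the hard purge.
    pub fn purge_expired_trash(&mut self, now: SystemTime) -> Vec<AssetId> {
        let mut expired = Vec::new();
        for (id, at) in &self.deleted_at {
            // A clock that moved backwards is treated as zero age.
            let age = now.duration_since(*at).unwrap_or(Duration::ZERO);
            if age >= self.retention {
                expired.push(*id);
            }
        }
        expired.sort_unstable();
        for id in &expired {
            self.deleted_at.remove(id);
        }
        expired
    }
}
```

With a 30-day retention, an asset deleted at time t can be restored at t + 29 days, and is returned by the purge at t + 30 days. Returning the expired IDs rather than deleting bytes inside the method keeps the irreversible step in one place in the caller.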

Account-type coverage. Registered accounts have all seven paths. Delegated/sponsored accounts are recovered via the sponsoring account’s master key, since their keys derive from it. Non-registered (share-link) accounts hold no collection of their own — recovery is not applicable.

These cross-cutting properties make recovery robust specifically against catastrophic bugs, not just key loss:

  • The backup path is independent of the MLS ratchet. Restore never reconstructs ratchet state, so a ratchet bug cannot strand data. The master key — not any ratchet state — is the single backed-up root.
  • Hardware-bound, disposable device keys. Device keys live inside hardware, are non-exportable, and are never backed up — a lost device is re-bootstrapped, not recovered.
  • Cross-signing (Matrix-style). The master identity signs every device key; adding a device means an existing device signs it, so losing one device never compromises the account.
  • Every construction is versioned. KDF info strings, in-blob Argon2id parameters, the crypto_suite_id on every manifest and metadata blob, and the sidecar_schema on every sidecar mean a buggy v2 never strands v1 data: v2 keys and structures coexist with v1 without a flag day. Because signatures cover crypto_suite_id, downgrade attempts fail verification.
  • verify_asset quarantines, never drops. A bug-produced invalid asset is neither silently dropped nor silently accepted; it is quarantined and surfaced in the audit trail so an operator can tell a bug from an attack.
  • Provenance is append-only. Each ProvenanceRecord carries the hash of its predecessor (prior_provenance_hash), and every record is hybrid-signed by the producing device. An attacker holding every current key still cannot rewrite a past record without forging an earlier (possibly retired) device’s signature — history is read-only. See Provenance.
  • Stale-revival is rejected. An incoming manifest whose prior_provenance_hash is behind the receiver’s stored latest_provenance_hash is treated as stale and quarantined — a deleted asset cannot be silently resurrected by a peer or a backup restore. The check is enforced both client-side and server-side (no key needed); see Threat Model.
  • Backup verification runs before reliance. Preview, dry-run, signature-chain, and AMK-completeness checks (see Backup Verification) detect an incomplete or broken backup before it is needed.
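
The stale-revival rule above needs no key material, which is why it can run on the server as well as the client. Below is a minimal sketch of the comparison, assuming the receiver keeps the ordered hashes of the records it has already accepted for an asset. AssetChain, ChainDecision, and offer are hypothetical names; signature verification and the computation of the hashes themselves are deliberately left out.

```rust
/// Opaque 32-byte provenance hash; how it is computed is out of scope here.
pub type ProvenanceHash = [u8; 32];

/// Hypothetical decision type for an incoming manifest.
#[derive(Debug, PartialEq, Eq)]
pub enum ChainDecision {
    /// The manifest extends the current head; the chain advances.
    Advance,
    /// The manifest builds on an older record: stale, quarantined.
    QuarantineStale,
    /// The manifest builds on a record this receiver has never seen.
    QuarantineUnknown,
}

/// Per-asset chain state as seen by one receiver (client or server).
pub struct AssetChain {
    /// Accepted record hashes, oldest first. The last entry plays the
    /// role of latest_provenance_hash.
    history: Vec<ProvenanceHash>,
}

impl AssetChain {
    pub fn new(genesis: ProvenanceHash) -> Self {
        AssetChain { history: vec![genesis] }
    }

    pub fn latest_provenance_hash(&self) -> ProvenanceHash {
        *self.history.last().expect("chain always holds the genesis record")
    }

    /// Checks an incoming record's prior_provenance_hash against the head.
    /// Only a record that extends the head is allowed to advance the chain.
    pub fn offer(
        &mut self,
        prior_provenance_hash: ProvenanceHash,
        record_hash: ProvenanceHash,
    ) -> ChainDecision {
        if prior_provenance_hash == self.latest_provenance_hash() {
            self.history.push(record_hash);
            ChainDecision::Advance
        } else if self.history.contains(&prior_provenance_hash) {
            ChainDecision::QuarantineStale
        } else {
            ChainDecision::QuarantineUnknown
        }
    }
}
```

A validly signed but old manifest lands in QuarantineStale and leaves the head untouched, so a deleted asset is not resurrected by a lagging peer or a backup restore.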

All client-server communication is over HTTPS. Capsule’s own stack aims to be PQ-safe in due course, but the transport layer (TLS) is outside its control and must be configured by the server administrator to be PQ-resistant as well. As of 2026, the standard is TLS 1.3 with hybrid X25519+ML-KEM key exchange enabled (the X25519MLKEM768 named group). Because application servers do not terminate TLS, this must be configured on the ingress or reverse proxy that does.

This is the one piece of cryptographic configuration that lives outside the application layer — the application code cannot enforce it, only document the requirement.

The failure-mode catalog itself is the verification spec: each row must have an executable test that exercises both the detection (the catalog’s middle column) and the recovery (the right column).

  • Per-recovery-path scenarios — seven smoke tests, one per path. Each takes a library to a “lost” state corresponding to the path’s Survives annotation, runs the recovery, and asserts every asset is recoverable. The tests share fixtures from the Keys and Provenance test surfaces.
  • Bug-resistance invariant checks — unit-test surface that asserts each invariant holds structurally:
    • The backup-artifact format does not embed MLS ratchet state (assert by inspecting an exported artifact).
    • Device private keys cannot be exported (asserted per-platform in the Keys hardware smoke).
    • A v2 client can read a v1 sidecar and write a v2 sidecar that a v1 client still validates as v1 (cross-version round-trip).
  • Catalog completeness — a CI check that every row in the catalog has at least one referenced test. Adding a row without a test is a structural error, not a TODO.
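
The catalog completeness check can be sketched as a pure function that a CI job calls and fails on a non-empty result. This is an illustration only: rows_missing_tests is a hypothetical name, and how the catalog rows and their test references are actually extracted from the doc and the test suite is assumed, not specified here.

```rust
use std::collections::BTreeMap;

/// Returns the catalog rows that have no referenced test, sorted by name.
/// `tests_by_row` maps a row name to the tests that claim to cover it.
/// A row mapped to an empty list counts as missing.
pub fn rows_missing_tests(
    catalog_rows: &[&str],
    tests_by_row: &BTreeMap<String, Vec<String>>,
) -> Vec<String> {
    let mut missing: Vec<String> = catalog_rows
        .iter()
        .filter(|row| {
            tests_by_row
                .get(**row)
                .map_or(true, |tests| tests.is_empty())
        })
        .map(|row| String::from(*row))
        .collect();
    missing.sort();
    missing
}
```

A CI step would treat any returned row as a hard failure, which matches the rule that a row without a test is a structural error rather than a TODO.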

The full bounded recovery surface — including which paths must be exercised end-to-end across the full system — is in Module Map — E2E Test Surface.