Working Fundamentals
chapter 10 of 13

Errors & Failure

Designing for broken assumptions

What this chapter is

This chapter is not about exceptions. It is not about error codes. It is not about the mechanics of retries, timeouts, or logs.

This chapter is about truth.

Errors exist because reality did not match your assumptions.

The core truth

Failure is not an accident.

Failure is a revealed assumption.

When a system fails, it is saying:

"I expected the world to behave differently."

Ignoring that message does not restore correctness.

🧠 Mental Model: Every error is an assumption becoming visible.

Why failure must be designed

Most systems treat failure as an afterthought.

They design for:

  • the happy path
  • correct inputs
  • perfect timing
  • reliable dependencies

Reality does not agree.

Failure that is not designed becomes:

  • partial updates
  • silent corruption
  • cascading outages
  • data inconsistency

🧠 Architect's Note: The most dangerous failures are the quiet ones.

Failure is a first-class path

Failure is not an interruption to control flow. It is an alternate path.

Every failure path must answer:

  • what stops?
  • what continues?
  • what must be undone?
  • what must never happen?

If these answers are unclear, the design is incomplete.

Partial failure is worse than total failure

A total failure is visible. A partial failure lies.

Examples of partial failure:

  • data written but not acknowledged
  • side effects triggered before validation
  • retries applied after irreversible changes

🧠 Perspective: It is safer to fail loudly than to succeed incorrectly.
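
One way to avoid "side effects triggered before validation" is to order the work so that everything that can fail cheaply runs first. The sketch below is illustrative only: `Order`, `ValidationError`, and the in-memory `ledger` are hypothetical names, not part of any particular system.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when an order fails validation (hypothetical name)."""


@dataclass
class Order:
    item: str
    quantity: int


def validate(order: Order) -> None:
    # All checks run before anything irreversible happens.
    if not order.item:
        raise ValidationError("item is required")
    if order.quantity <= 0:
        raise ValidationError("quantity must be positive")


def place_order(order: Order, ledger: list) -> None:
    validate(order)        # may fail: nothing has changed yet
    ledger.append(order)   # the side effect happens only after validation passes
```

If validation fails, the ledger is untouched. The failure is total and visible, not partial.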

Errors and state are inseparable

A failure that interrupts a state change is dangerous. The change may be left half-applied.

Questions you must answer:

  • did state change?
  • was it complete?
  • can this be retried safely?
  • can it be rolled back?

If state cannot be restored or recovered, the system accumulates damage.

🧠 Architect's Note: Failure + state without recovery is corruption.
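
The rollback question can be made concrete with a small sketch. It assumes state lives in a plain in-memory dictionary and uses a snapshot to restore it; the `transfer` function and the account names are hypothetical. Real storage would more likely rely on its own transaction mechanism.

```python
import copy


def transfer(balances: dict, source: str, target: str, amount: int) -> None:
    """Move `amount` between accounts, restoring the prior state on failure."""
    snapshot = copy.deepcopy(balances)   # the state we can return to
    try:
        balances[source] -= amount
        if balances[source] < 0:
            raise ValueError("insufficient funds")
        balances[target] += amount       # raises KeyError if target is unknown
    except Exception:
        # Undo the partial update before reporting the failure.
        balances.clear()
        balances.update(snapshot)
        raise
```

The failure is still reported to the caller. The difference is that the debit never survives without its matching credit.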

Retrying is a design decision

Retries feel safe. They are not.

Blind retries can:

  • duplicate work
  • amplify load
  • worsen outages

Retries must be:

  • bounded
  • idempotent
  • intentional

🧠 Mental Model: A retry is a promise that repeating is safe.
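
The three properties can be sketched together. This is a minimal illustration under stated assumptions, not a recommended implementation: `TransientError`, `retry`, and `PaymentStore` are hypothetical names, and the in-memory dictionary stands in for whatever would really record request ids.

```python
import time


class TransientError(Exception):
    """A failure the caller believes is safe to retry (hypothetical name)."""


def retry(operation, max_attempts: int = 3, base_delay: float = 0.1):
    """Call `operation` up to `max_attempts` times, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up and fail loudly
            # Wait longer after each failure so retries do not amplify load.
            time.sleep(base_delay * 2 ** (attempt - 1))


class PaymentStore:
    """Records each request id once, so repeating a charge has no extra effect."""

    def __init__(self):
        self.charges = {}

    def charge(self, request_id: str, amount: int) -> int:
        # Idempotent: a repeated request id returns the first recorded amount.
        return self.charges.setdefault(request_id, amount)
```

Bounded comes from `max_attempts`. Idempotent comes from the request id. Intentional comes from retrying only `TransientError`: any other exception propagates on the first attempt.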

Error handling is part of the contract

Errors are not implementation details.

They are part of:

  • interfaces
  • expectations
  • system behavior

Consumers will build logic around:

  • how you fail
  • how often you fail
  • how you signal failure

Inconsistent error behavior creates fragile integrations.

Failure domains matter

Failures should be contained.

A failure in one area should not:

  • bring down unrelated parts
  • corrupt shared state
  • force global recovery

This is a structural problem, not a logging problem.

🧠 Architect's Note: A system without failure boundaries will eventually fail everywhere.
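
A failure boundary can be as small as a loop that refuses to let one task's failure stop the others. The sketch below assumes the tasks are independent and share no state; `run_isolated` is a hypothetical helper, not an established pattern name.

```python
def run_isolated(tasks: dict) -> tuple[dict, dict]:
    """Run each task inside its own boundary; return (results, failures)."""
    results, failures = {}, {}
    for name, task in tasks.items():
        try:
            results[name] = task()
        except Exception as exc:
            # The failure is recorded against the task that caused it,
            # and the remaining tasks still run.
            failures[name] = exc
    return results, failures
```

The boundary does not hide the failure. It returns it, attributed by name, so the caller decides what happens next. If the tasks did share state, this loop alone would not contain the damage.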

When failure is expected

Some failures are normal:

  • invalid input
  • missing data
  • unavailable dependencies
  • timeouts

Treating expected failures as exceptional:

  • clutters logs
  • hides real issues
  • increases noise

🧠 Perspective: If failure is expected, design for it explicitly.
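
One way to design for expected failure explicitly is to make it part of the return value, so callers must look at it. The sketch below is one possible shape under assumed names: `LookupResult`, `lookup`, and the error strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LookupResult:
    value: Optional[str] = None
    error: Optional[str] = None  # "invalid_key" or "not_found"; None on success

    @property
    def ok(self) -> bool:
        return self.error is None


def lookup(store: dict, key: str) -> LookupResult:
    if not key:
        return LookupResult(error="invalid_key")  # expected: invalid input
    if key not in store:
        return LookupResult(error="not_found")    # expected: missing data
    return LookupResult(value=store[key])
```

Invalid input and missing data are ordinary outcomes here, not exceptions. Exceptions stay reserved for what was not expected, which keeps them meaningful.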

Observability without panic

Errors should be:

  • visible
  • attributable
  • actionable

They should not:

  • overwhelm operators
  • trigger unnecessary escalation
  • obscure root causes

Silence is dangerous. Noise is also dangerous.

Failure teaches structure

Failures reveal:

  • hidden coupling
  • missing boundaries
  • fragile assumptions

Well-designed systems learn from failure. Poorly designed ones repeat it.

Minimal practice (still no code)

Problem: "A process validates input, writes to storage, and notifies an external system. The notification fails."

Ask:

  • What state has already changed?
  • Is retry safe?
  • What must be undone?
  • What should the system report?

If the answer is "it depends," the design is incomplete.

What beginners gain here

  • Respect for failure
  • Fewer destructive shortcuts
  • Safer mental models

What experienced developers recognize

  • Why outages cascade
  • Why retries worsen incidents
  • Why some bugs never fully disappear

In each case, failures were ignored.

What this chapter deliberately avoids

  • Error-handling libraries
  • Framework-specific patterns
  • Logging tools
  • Incident playbooks

Those come after design.

Closing

Failure is not the enemy.

Unexamined failure is.

Design systems that:

  • expect failure
  • expose assumptions
  • recover deliberately

A system that fails honestly is safer than one that pretends to succeed.