Errors & Failure
Designing for broken assumptions
What this chapter is
This chapter is not about exceptions. It is not about error codes. It is not about retries, timeouts, or logs.
This chapter is about truth.
Errors exist because reality did not match your assumptions.
The core truth
Failure is not an accident.
Failure is a revealed assumption.
When a system fails, it is saying:
"I expected the world to behave differently."
Ignoring that message does not restore correctness.
🧠 Mental Model: Every error is an assumption becoming visible.
Why failure must be designed
Most systems treat failure as an afterthought.
They design for:
- the happy path
- correct inputs
- perfect timing
- reliable dependencies
Reality does not agree.
Failure that is not designed becomes:
- partial updates
- silent corruption
- cascading outages
- data inconsistency
🧠 Architect's Note: The most dangerous failures are the quiet ones.
Failure is a first-class path
Failure is not an interruption to control flow. It is an alternate path.
Every failure path must answer:
- what stops?
- what continues?
- what must be undone?
- what must never happen?
If these answers are unclear, the design is incomplete.
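The four questions above can be made concrete. In this sketch (all names are hypothetical, invented for illustration), each question has exactly one explicit answer in the code:

```python
def process_order(order: dict, inventory: dict, charged: set) -> bool:
    """Hypothetical order handler where every failure-path question is answered.

    - What stops?             This order's processing.
    - What continues?         Other orders; the inventory stays usable.
    - What must be undone?    The stock reservation made below.
    - What must never happen? Charging the same order twice.
    """
    item, qty = order["item"], order["qty"]
    if inventory.get(item, 0) < qty:
        return False                      # stop: nothing changed yet, nothing to undo
    inventory[item] -= qty                # reserve stock: the change that may need undoing
    try:
        if order["id"] in charged:        # the "must never happen" invariant
            raise RuntimeError("already charged")
        charged.add(order["id"])          # stand-in side effect for taking payment
        return True
    except Exception:
        inventory[item] += qty            # undo the reservation before surfacing failure
        raise
```

The point is not the domain but the shape: the failure path is written down, not implied.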
Partial failure is worse than total failure
A total failure is visible. A partial failure lies.
Examples of partial failure:
- data written but not acknowledged
- side effects triggered before validation
- retries applied after irreversible changes
🧠 Perspective: It is safer to fail loudly than to succeed incorrectly.
Errors and state are inseparable
Failures become most dangerous when they interact with state.
Questions you must answer:
- did state change?
- was it complete?
- can this be retried safely?
- can it be rolled back?
If state cannot be restored or recovered, the system accumulates damage.
🧠 Architect's Note: Failure + state without recovery is corruption.
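One way to keep these questions answerable is to capture prior state before mutating it. This is a minimal sketch, not a transaction mechanism; the names are hypothetical:

```python
def apply_change(store: dict, key: str, new_value, validate) -> None:
    """Apply a state change so that every failure question has an answer.

    - Did state change?      Yes: at the assignment below, and nowhere else.
    - Was it complete?       Only if `validate` passes.
    - Can it be retried?     Yes: re-applying the same value is idempotent.
    - Can it be rolled back? Yes: the prior value is captured first.
    """
    missing = object()
    old = store.get(key, missing)      # capture prior state before mutating
    store[key] = new_value
    try:
        validate(store)                # caller-supplied completeness check
    except Exception:
        if old is missing:             # restore prior state: fail without corrupting
            del store[key]
        else:
            store[key] = old
        raise
```

The failure still surfaces, but the state it leaves behind is the state that existed before.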
Retrying is a design decision
Retries feel safe. They are not.
Blind retries can:
- duplicate work
- amplify load
- worsen outages
Retries must be:
- bounded
- idempotent
- intentional
🧠 Mental Model: A retry is a promise that repeating is safe.
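The three properties above can be read directly out of a sketch like this one (a simplified illustration, assuming the caller guarantees idempotency):

```python
import time

def bounded_retry(operation, *, attempts: int = 3, base_delay: float = 0.0):
    """Retry `operation`, but only as a deliberate, bounded decision.

    Bounded:     the loop runs at most `attempts` times, never forever.
    Idempotent:  the caller promises repeating `operation` is safe.
    Intentional: after the bound, failure is surfaced, not swallowed.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # back off instead of amplifying load
    raise last_error
```

Note what the sketch does not do: it does not retry unconditionally, and it does not hide the final failure from the caller.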
Error handling is part of the contract
Errors are not implementation details.
They are part of:
- interfaces
- expectations
- system behavior
Consumers will build logic around:
- how you fail
- how often you fail
- how you signal failure
Inconsistent error behavior creates fragile integrations.
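A contract-level error can be made visible in the interface itself. This sketch uses a hypothetical result type and lookup function to show the idea, not a prescribed pattern:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: object

@dataclass
class Err:
    reason: str        # part of the contract: consumers branch on this
    retryable: bool    # signals whether repeating the call is safe

def fetch_user(user_id: int) -> Union[Ok, Err]:
    """Hypothetical lookup where failure is a declared outcome, not a surprise."""
    users = {1: "ada"}
    if user_id in users:
        return Ok(users[user_id])
    return Err(reason="not_found", retryable=False)
```

Because the failure shape is in the signature, consumers build against a stable contract rather than against incidental behavior.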
Failure domains matter
Failures should be contained.
A failure in one area should not:
- bring down unrelated parts
- corrupt shared state
- force global recovery
This is a structural problem, not a logging problem.
🧠 Architect's Note: A system without failure boundaries will eventually fail everywhere.
When failure is expected
Some failures are normal:
- invalid input
- missing data
- unavailable dependencies
- timeouts
Treating expected failures as exceptional:
- clutters logs
- hides real issues
- increases noise
🧠 Perspective: If failure is expected, design for it explicitly.
Observability without panic
Errors should be:
- visible
- attributable
- actionable
They should not:
- overwhelm operators
- trigger unnecessary escalation
- obscure root causes
Silence is dangerous. Noise is also dangerous.
Failure teaches structure
Failures reveal:
- hidden coupling
- missing boundaries
- fragile assumptions
Well-designed systems learn from failure. Poorly designed ones repeat it.
Minimal practice (still no code)
Problem: "A process validates input, writes to storage, and notifies an external system. The notification fails."
Ask:
- What state has already changed?
- Is retry safe?
- What must be undone?
- What should the system report?
If the answer is "it depends," the design is incomplete.
What beginners gain here
- Respect for failure
- Fewer destructive shortcuts
- Safer mental models
What experienced developers recognize
- Why outages cascade
- Why retries worsen incidents
- Why some bugs never fully disappear
In each case, the root cause is the same: failure was ignored at design time.
What this chapter deliberately avoids
- Error-handling libraries
- Framework-specific patterns
- Logging tools
- Incident playbooks
Those come after design.
Closing
Failure is not the enemy.
Unexamined failure is.
Design systems that:
- expect failure
- expose assumptions
- recover deliberately
A system that fails honestly is safer than one that pretends to succeed.