Errors & Failure
Designing for broken assumptions
What this chapter is
This chapter is not about exceptions. It is not about error codes. It is not about retries, timeouts, or logs.
This chapter is about truth.
Errors exist because reality did not match your assumptions.
The core truth
Failure is not an accident.
Failure is a revealed assumption.
When a system fails, it is saying:
"I expected the world to behave differently."
Ignoring that message does not restore correctness.
🧠 Mental Model: Every error is an assumption becoming visible.
Why failure must be designed
Most systems treat failure as an afterthought.
They design for:
- the happy path
- correct inputs
- perfect timing
- reliable dependencies
Reality does not agree.
Failure that is not designed becomes:
- partial updates
- silent corruption
- cascading outages
- data inconsistency
🧠 Architect's Note: The most dangerous failures are the quiet ones.
Failure is a first-class path
Failure is not an interruption to control flow. It is an alternate path.
Every failure path must answer:
- what stops?
- what continues?
- what must be undone?
- what must never happen?
If these answers are unclear, the design is incomplete.
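The four questions above can be made concrete. In this sketch (all names are hypothetical, invented for illustration), each question has exactly one explicit answer in the code:

```python
def process_order(order: dict, inventory: dict, charged: set) -> bool:
    """Hypothetical order handler where every failure-path question is answered.

    - What stops?             This order's processing.
    - What continues?         Other orders; the inventory stays usable.
    - What must be undone?    The stock reservation made below.
    - What must never happen? Charging the same order twice.
    """
    item, qty = order["item"], order["qty"]
    if inventory.get(item, 0) < qty:
        return False                      # stop: nothing changed yet, nothing to undo
    inventory[item] -= qty                # reserve stock: the change that may need undoing
    try:
        if order["id"] in charged:        # the "must never happen" invariant
            raise RuntimeError("already charged")
        charged.add(order["id"])          # stand-in side effect for taking payment
        return True
    except Exception:
        inventory[item] += qty            # undo the reservation before surfacing failure
        raise
```

The point is not the domain but the shape: the failure path is written down, not implied.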
Partial failure is worse than total failure
A total failure is visible. A partial failure lies.
Examples of partial failure:
- data written but not acknowledged
- side effects triggered before validation
- retries applied after irreversible changes
🧠 Perspective: It is safer to fail loudly than to succeed incorrectly.
Errors and state are inseparable
Failures become most dangerous when they interact with state.
Questions you must answer:
- did state change?
- was it complete?
- can this be retried safely?
- can it be rolled back?
If state cannot be restored or recovered, the system accumulates damage.
🧠 Architect's Note: Failure + state without recovery is corruption.
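One way to keep these questions answerable is to capture prior state before mutating it. This is a minimal sketch, not a transaction mechanism; the names are hypothetical:

```python
def apply_change(store: dict, key: str, new_value, validate) -> None:
    """Apply a state change so that every failure question has an answer.

    - Did state change?      Yes: at the assignment below, and nowhere else.
    - Was it complete?       Only if `validate` passes.
    - Can it be retried?     Yes: re-applying the same value is idempotent.
    - Can it be rolled back? Yes: the prior value is captured first.
    """
    missing = object()
    old = store.get(key, missing)      # capture prior state before mutating
    store[key] = new_value
    try:
        validate(store)                # caller-supplied completeness check
    except Exception:
        if old is missing:             # restore prior state: fail without corrupting
            del store[key]
        else:
            store[key] = old
        raise
```

The failure still surfaces, but the state it leaves behind is the state that existed before.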
Retrying is a design decision
Retries feel safe. They are not.
Blind retries can:
- duplicate work
- amplify load
- worsen outages
Retries must be:
- bounded
- idempotent
- intentional
🧠 Mental Model: A retry is a promise that repeating is safe.
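The three properties above can be read directly out of a sketch like this one (a simplified illustration, assuming the caller guarantees idempotency):

```python
import time

def bounded_retry(operation, *, attempts: int = 3, base_delay: float = 0.0):
    """Retry `operation`, but only as a deliberate, bounded decision.

    Bounded:     the loop runs at most `attempts` times, never forever.
    Idempotent:  the caller promises repeating `operation` is safe.
    Intentional: after the bound, failure is surfaced, not swallowed.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # back off instead of amplifying load
    raise last_error
```

Note what the sketch does not do: it does not retry unconditionally, and it does not hide the final failure from the caller.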
Error handling is part of the contract
Errors are not implementation details.
They are part of:
- interfaces
- expectations
- system behavior
Consumers will build logic around:
- how you fail
- how often you fail
- how you signal failure
Inconsistent error behavior creates fragile integrations.
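A contract-level error can be made visible in the interface itself. This sketch uses a hypothetical result type and lookup function to show the idea, not a prescribed pattern:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: object

@dataclass
class Err:
    reason: str        # part of the contract: consumers branch on this
    retryable: bool    # signals whether repeating the call is safe

def fetch_user(user_id: int) -> Union[Ok, Err]:
    """Hypothetical lookup where failure is a declared outcome, not a surprise."""
    users = {1: "ada"}
    if user_id in users:
        return Ok(users[user_id])
    return Err(reason="not_found", retryable=False)
```

Because the failure shape is in the signature, consumers build against a stable contract rather than against incidental behavior.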
Failure domains matter
Failures should be contained.
A failure in one area should not:
- bring down unrelated parts
- corrupt shared state
- force global recovery
This is a structural problem, not a logging problem.
🧠 Architect's Note: A system without failure boundaries will eventually fail everywhere.
When failure is expected
Some failures are normal:
- invalid input
- missing data
- unavailable dependencies
- timeouts
Treating expected failures as exceptional:
- clutters logs
- hides real issues
- increases noise
🧠 Perspective: If failure is expected, design for it explicitly.
Observability without panic
Errors should be:
- visible
- attributable
- actionable
They should not:
- overwhelm operators
- trigger unnecessary escalation
- obscure root causes
Silence is dangerous. Noise is also dangerous.
Failure teaches structure
Failures reveal:
- hidden coupling
- missing boundaries
- fragile assumptions
Well-designed systems learn from failure. Poorly designed ones repeat it.
Minimal practice (still no code)
Problem: "A process validates input, writes to storage, and notifies an external system. The notification fails."
Ask:
- What state has already changed?
- Is retry safe?
- What must be undone?
- What should the system report?
If the answer is "it depends," the design is incomplete.
What beginners gain here
- Respect for failure
- Fewer destructive shortcuts
- Safer mental models
What experienced developers recognize
- Why outages cascade
- Why retries worsen incidents
- Why some bugs never fully disappear
In each case, the root cause is the same: failure was ignored at design time.
What this chapter deliberately avoids
- Error-handling libraries
- Framework-specific patterns
- Logging tools
- Incident playbooks
Those come after design.
Closing
Failure is not the enemy.
Unexamined failure is.
Design systems that:
- expect failure
- expose assumptions
- recover deliberately
A system that fails honestly is safer than one that pretends to succeed.