Chapter 4 — Problem Frames and Observable Failure¶
Systems are built to prevent failures. If you can’t name the failure, you will design a system that optimizes the wrong thing—usually something socially convenient, like “alignment,” “visibility,” or “consistency.”
This chapter teaches the entry move of the System Design Lens:
- Locate the failure
- Make it observable
- Choose the correct problem frame
- Refuse to proceed without evidence
The Failure This Chapter Prevents¶
Observable failure: systems are designed from abstract motivations rather than concrete breakdowns.
Symptoms:
- A new process is introduced with no baseline (“we need to improve”)
- Different stakeholders mean different things by the same goal word
- Teams debate methods instead of diagnosing failures
- “Success” is defined after the fact
- The system becomes permanent even after the original issue disappears
Root cause:
- Problem statements are written to be agreeable, not accurate.
What a Problem Frame Is¶
A problem frame is the boundary you draw around “what kind of failure this is” so that:
- you don’t solve the wrong problem well
- you don’t apply the wrong causality model
- you don’t pick artifacts that can’t represent reality
Problem frames are not labels. They determine:
- what evidence counts
- what decisions matter
- what object of control is realistic
In this book, failures tend to cluster into five frames.
The Five Failure Locations¶
Strategy failures¶
What breaks:
- priorities drift
- investment choices don’t match intent
- “important” work can’t defeat urgent work
Observable symptoms:
- frequent re-prioritization with no learning
- roadmaps that change without triggering a decision review
- teams building things that leadership later calls “not the goal”
Decisions commonly failing:
- priority, investment, scope
Discovery failures¶
What breaks:
- learning is slow or untrusted
- teams build based on assumptions that aren’t tested
- users are understood through proxy opinions
Observable symptoms:
- months of build work with surprise outcomes
- repeated “we thought users wanted…” postmortems
- research artifacts no one uses in decisions
Decisions commonly failing:
- diagnosis, investment, scope
Delivery failures¶
What breaks:
- flow, predictability, quality, throughput
Observable symptoms:
- chronic missed dates
- work items stuck in-progress
- quality debt accumulating faster than it can be paid down
- firefighting as the default mode
Decisions commonly failing:
- sequencing, repair, scope
Cooperation failures¶
What breaks:
- interfaces, ownership, coordination across boundaries
Observable symptoms:
- cross-team friction dominates cycle time
- unclear ownership of systems, APIs, or outcomes
- escalation replaces collaboration
- “we’re blocked” becomes a permanent status
Decisions commonly failing:
- ownership, sequencing, repair
Evolution / scaling failures¶
What breaks:
- the system stops working when context changes
- growth increases coupling and coordination cost
Observable symptoms:
- practices that worked for one team fail at 5–10 teams
- architectural boundaries erode
- governance expands to compensate for lack of clarity
- the organization becomes slow to adapt
Decisions commonly failing:
- ownership, investment, repair
Observable Failure vs Abstract Dissatisfaction¶
Abstract dissatisfaction is language like:
- “We need alignment”
- “We need better execution”
- “We need to move faster”
- “We need clarity”
- “We need accountability”
These phrases are not failures. They are requests for safety.
An observable failure is something that:
- is repeatedly happening
- could be witnessed by an outsider
- has a measurable cost (time, money, risk, customer impact)
- can be stated without moral judgment
The Observable Failure Statement Format¶
Write it in 3–5 sentences:
- Situation: where/when it occurs
- Symptom: what repeatedly happens
- Consequence: what it costs
- Who is impacted: team, org, users
- Frequency: how often / how long
Example (delivery frame):
- “Over the last 6 weeks, items labeled ‘small’ routinely take 2–3 weeks to ship. Work sits in review and QA with unclear handoffs. This causes planned releases to slip and forces weekend stabilization. Engineers are increasingly reluctant to pick up work outside their area because it amplifies cycle time.”
The point is not perfection. The point is inspectability.
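If it helps to keep these statements inspectable over time, you can hold them in a small structured record. The sketch below is one possible shape in Python; the field names and the `render` helper are illustrative choices, not a prescribed format, and the example values simply restate the delivery example above.

```python
from dataclasses import dataclass

@dataclass
class ObservableFailureStatement:
    """One failure, stated so an outsider could verify it."""
    situation: str    # where/when it occurs
    symptom: str      # what repeatedly happens
    consequence: str  # what it costs
    impacted: str     # team, org, users
    frequency: str    # how often / how long

    def render(self) -> str:
        # Join the fields into the 3-5 sentence statement used in this chapter.
        return " ".join([self.situation, self.symptom, self.consequence,
                         self.impacted, self.frequency])

# The delivery-frame example above, expressed in this structure.
example = ObservableFailureStatement(
    situation="Over the last six weeks, in one team's delivery flow,",
    symptom="items labeled 'small' routinely take 2-3 weeks to ship, "
            "sitting in review and QA with unclear handoffs.",
    consequence="Planned releases slip and weekend stabilization work is needed.",
    impacted="The team, its release stakeholders, and on-call engineers absorb the cost.",
    frequency="It has recurred on every release in that window.",
)
print(example.render())
```

Anything that cannot fill all five fields is still abstract dissatisfaction, not an observable failure.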
Why “Alignment” Is a Smell¶
“Alignment” is usually a symptom of one of these real failures:
- priority is not explicit
- scope boundaries are porous
- ownership is unclear
- sequencing dependencies are hidden
- investment choices are not committed
“Alignment” becomes a goal when people don’t want to name the real decision, because naming it creates conflict.
In this book, “alignment” is acceptable only when you can complete:
“Alignment about ___ decision, using ___ artifact, under ___ constraint.”
If you can’t, drop the word.
Problem Frames Determine Causality Assumptions¶
This matters because the wrong causality model creates the wrong system.
Examples:
- Delivery bottlenecks often require constraints & flow thinking (queues, WIP, bottlenecks).
- Strategy ambiguity often requires feedback loops (hypotheses, metrics, learning).
- Cooperation failures often require socio-technical thinking (authority, incentives, boundary clarity).
- Scaling failures often require evolutionary thinking (selection pressures, drift, local adaptation).
If you choose a linear plan for a feedback-dominant problem, you will get false confidence and real surprises.
Evidence: What Counts, What Doesn’t¶
Evidence that counts¶
- cycle times, queue sizes, defect rates
- incident timelines
- decision logs (or absence of them)
- recurring escalation paths
- handoff points and wait states
- repeated reversals (“we decided X, then we decided not-X”)
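Most of this evidence can be pulled from timestamps you already have. Here is a minimal sketch, assuming each work item records when it started, when it was handed off, and when it shipped; the field names are hypothetical, not a standard tracker schema.

```python
from datetime import datetime

# Hypothetical work-item records; field names are illustrative, not a tracker API.
items = [
    {"id": "A-101", "started": "2024-03-01", "handed_off": "2024-03-03", "shipped": "2024-03-18"},
    {"id": "A-102", "started": "2024-03-04", "handed_off": "2024-03-05", "shipped": "2024-03-21"},
]

def days_between(a: str, b: str) -> int:
    """Whole days elapsed between two ISO dates."""
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).days

for item in items:
    cycle_time = days_between(item["started"], item["shipped"])        # total elapsed days
    wait_after_handoff = days_between(item["handed_off"], item["shipped"])  # time spent waiting downstream
    print(f"{item['id']}: cycle={cycle_time}d, post-handoff wait={wait_after_handoff}d")
```

The output is the kind of statement an outsider can verify: days, not adjectives.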
Evidence that does not count (by itself)¶
- “people feel misaligned”
- “leadership wants visibility”
- “teams are frustrated”
- “communication is bad”
These can be useful signals, but they are not failure definitions.
A Simple Frame Selection Tool¶
When you’re unsure which frame you’re in, ask:
- Are we failing to choose what matters? → Strategy
- Are we failing to learn what’s true? → Discovery
- Are we failing to deliver reliably? → Delivery
- Are we failing at cross-boundary coordination? → Cooperation
- Are we failing to adapt as we scale? → Evolution
Most real situations involve multiple frames, but you must choose the dominant one to avoid designing a system that “optimizes everything” and enforces nothing.
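One way to force that choice is to score each question and insist on a single winner. The sketch below assumes a rough 0-5 "how strongly is this our failure" score per frame; the scoring scale and the tie rule are illustrative, not part of the method itself.

```python
# The selection questions from this chapter, keyed by frame.
QUESTIONS = {
    "strategy":    "Are we failing to choose what matters?",
    "discovery":   "Are we failing to learn what's true?",
    "delivery":    "Are we failing to deliver reliably?",
    "cooperation": "Are we failing at cross-boundary coordination?",
    "evolution":   "Are we failing to adapt as we scale?",
}

def dominant_frame(scores: dict[str, int]) -> str:
    """Return exactly one frame: the question that gets the strongest 'yes'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("No single dominant frame; the failure is not yet located.")
    return ranked[0][0]

scores = {"strategy": 2, "discovery": 1, "delivery": 5, "cooperation": 3, "evolution": 1}
frame = dominant_frame(scores)
print(f"Dominant frame: {frame} ({QUESTIONS[frame]})")
```

A tie is itself a useful signal: it usually means the failure statement is still too abstract to locate.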
Misuse Model: How This Chapter Gets Misapplied¶
Misuse 1: Treating “observable” as “quantitative only”¶
Some failures are observable through consistent narratives and incidents even before metrics exist.
Correction: use incident examples and timelines as evidence until metrics stabilize.
Misuse 2: Over-framing and analysis paralysis¶
People keep refining the failure statement instead of acting.
Correction: timebox diagnosis. Your goal is a usable frame, not a perfect one.
Misuse 3: Choosing the frame that avoids conflict¶
Teams pick “delivery” because it feels technical, when the real problem is strategy or ownership.
Correction: ask “Which decision are we refusing to make?” That usually reveals the real frame.
The Non-Negotiable Rule Introduced Here¶
You may not select, adopt, or design a system until you can produce:
- one Observable Failure Statement
- one dominant problem frame
- one decision type that is failing repeatedly
If you can’t do that, you don’t need a system. You need observation.
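Treated literally, the rule is a gate, not a guideline. A minimal sketch of that gate follows; the argument names are illustrative.

```python
def ready_to_design(failure_statement: str | None,
                    dominant_frame: str | None,
                    failing_decision: str | None) -> bool:
    """All three artifacts must exist and be non-empty before any system work starts."""
    return all(x and x.strip() for x in (failure_statement, dominant_frame, failing_decision))

# Missing the failure statement means the answer is no, regardless of the other two.
if not ready_to_design(failure_statement=None,
                       dominant_frame="delivery",
                       failing_decision="sequencing"):
    print("Do not select a system yet. Observe first.")
```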
Exit Condition for This Chapter¶
Before moving on, write:
- Your Observable Failure Statement (3–5 sentences)
- The dominant frame (strategy / discovery / delivery / cooperation / evolution)
- The primary decision type currently failing (priority / scope / ownership / sequencing / investment / diagnosis / repair)
You now have the minimum input required to do real system work.