Chapter 4 — Problem Frames and Observable Failure

Systems are built to prevent failures. If you can’t name the failure, you will design a system that optimizes the wrong thing—usually something socially convenient, like “alignment,” “visibility,” or “consistency.”

This chapter teaches the entry move of the System Design Lens:

  1. Locate the failure
  2. Make it observable
  3. Choose the correct problem frame
  4. Refuse to proceed without evidence

The Failure This Chapter Prevents

Observable failure: systems are designed from abstract motivations rather than concrete breakdowns.

Symptoms:

  • A new process is introduced with no baseline (“we need to improve”)
  • Different stakeholders mean different things by the same goal word
  • Teams debate methods instead of diagnosing failures
  • “Success” is defined after the fact
  • The system becomes permanent even after the original issue disappears

Root cause:

  • Problem statements are written to be agreeable, not accurate.

What a Problem Frame Is

A problem frame is the boundary you draw around “what kind of failure this is” so that:

  • you don’t solve the wrong problem well
  • you don’t apply the wrong causality model
  • you don’t pick artifacts that can’t represent reality

Problem frames are not labels. They determine:

  • what evidence counts
  • what decisions matter
  • what object of control is realistic

In this book, failures tend to cluster into five frames.

The Five Failure Locations

Strategy failures

What breaks:

  • priorities drift
  • investment choices don’t match intent
  • “important” work consistently loses to urgent work

Observable symptoms:

  • frequent re-prioritization with no learning
  • roadmaps that change without triggering a decision review
  • teams building things that leadership later calls “not the goal”

Decisions commonly failing:

  • priority, investment, scope

Discovery failures

What breaks:

  • learning is slow or untrusted
  • teams build based on assumptions that aren’t tested
  • users are understood through proxy opinions

Observable symptoms:

  • months of build work with surprise outcomes
  • repeated “we thought users wanted…” postmortems
  • research artifacts no one uses in decisions

Decisions commonly failing:

  • diagnosis, investment, scope

Delivery failures

What breaks:

  • flow, predictability, quality, throughput

Observable symptoms:

  • chronic missed dates
  • work items stuck in-progress
  • quality debt accumulating faster than it can be paid down
  • firefighting as the default mode

Decisions commonly failing:

  • sequencing, repair, scope

Cooperation failures

What breaks:

  • interfaces, ownership, coordination across boundaries

Observable symptoms:

  • cross-team friction dominates cycle time
  • unclear ownership of systems, APIs, or outcomes
  • escalation replaces collaboration
  • “we’re blocked” becomes a permanent status

Decisions commonly failing:

  • ownership, sequencing, repair

Evolution / scaling failures

What breaks:

  • the system stops working when context changes
  • growth increases coupling and coordination cost

Observable symptoms:

  • practices that worked for one team fail at 5–10 teams
  • architectural boundaries erode
  • governance expands to compensate for lack of clarity
  • the organization becomes slow to adapt

Decisions commonly failing:

  • ownership, investment, repair

Observable Failure vs Abstract Dissatisfaction

Abstract dissatisfaction is language like:

  • “We need alignment”
  • “We need better execution”
  • “We need to move faster”
  • “We need clarity”
  • “We need accountability”

These phrases are not failures. They are requests for safety.

An observable failure is something that:

  • is repeatedly happening
  • could be witnessed by an outsider
  • has a measurable cost (time, money, risk, customer impact)
  • can be stated without moral judgment

The Observable Failure Statement Format

Write it in 3–5 sentences:

  1. Situation: where/when it occurs
  2. Symptom: what repeatedly happens
  3. Consequence: what it costs
  4. Who is impacted: team, org, users
  5. Frequency: how often / how long

Example (delivery frame):

  • “Over the last 6 weeks, items labeled ‘small’ routinely take 2–3 weeks to ship. Work sits in review and QA with unclear handoffs. This causes planned releases to slip and forces weekend stabilization. Engineers are increasingly reluctant to pick up work outside their area because it amplifies cycle time.”

The point is not perfection. The point is inspectability.
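
If you want these statements to be comparable across teams, the five elements can be captured as a lightweight record with a basic completeness check. The sketch below is illustrative rather than part of the method: the class name, field names, and the is_inspectable check are assumptions about how you might encode it.

```python
from dataclasses import dataclass, fields

@dataclass
class ObservableFailureStatement:
    """One failure, one record: the five elements of the statement format."""
    situation: str    # where/when it occurs
    symptom: str      # what repeatedly happens
    consequence: str  # what it costs (time, money, risk, customer impact)
    impacted: str     # who is impacted: team, org, users
    frequency: str    # how often / how long

    def is_inspectable(self) -> bool:
        # The bar is inspectability, not perfection: every element must be present.
        return all(getattr(self, f.name).strip() for f in fields(self))

# The delivery-frame example above, encoded as a record (values are paraphrased).
statement = ObservableFailureStatement(
    situation="review and QA handoffs on items labeled 'small'",
    symptom="'small' items routinely take 2-3 weeks to ship",
    consequence="planned releases slip; weekend stabilization becomes routine",
    impacted="delivery team, release calendar, on-call engineers",
    frequency="ongoing for the last 6 weeks",
)
assert statement.is_inspectable()
```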

Why “Alignment” Is a Smell

“Alignment” is usually a symptom of one of these real failures:

  • priority is not explicit
  • scope boundaries are porous
  • ownership is unclear
  • sequencing dependencies are hidden
  • investment choices are not committed

“Alignment” becomes a goal when people don’t want to name the real decision, because naming it creates conflict.

In this book, “alignment” is acceptable only when you can complete:

“Alignment about ___ decision, using ___ artifact, under ___ constraint.”

If you can’t, drop the word.
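
The same test can be made mechanical: a claim of “alignment” is admissible only if all three blanks are filled. A minimal sketch under that assumption; the names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AlignmentClaim:
    """'Alignment' is admissible only when all three blanks can be filled."""
    decision: str    # alignment about ___ decision
    artifact: str    # using ___ artifact
    constraint: str  # under ___ constraint

    def is_admissible(self) -> bool:
        return all(s.strip() for s in (self.decision, self.artifact, self.constraint))

claim = AlignmentClaim(decision="", artifact="roadmap", constraint="fixed headcount")
print("keep the word" if claim.is_admissible() else "drop the word")  # prints "drop the word"
```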

Problem Frames Determine Causality Assumptions

This matters because the wrong causality model creates the wrong system.

Examples:

  • Delivery bottlenecks often require constraints & flow thinking (queues, WIP, bottlenecks).
  • Strategy ambiguity often requires feedback loops (hypotheses, metrics, learning).
  • Cooperation failures often require socio-technical thinking (authority, incentives, boundary clarity).
  • Scaling failures often require evolutionary thinking (selection pressures, drift, local adaptation).

If you choose a linear plan for a feedback-dominant problem, you will get false confidence and real surprises.

Evidence: What Counts, What Doesn’t

Evidence that counts

  • cycle times, queue sizes, defect rates
  • incident timelines
  • decision logs (or absence of them)
  • recurring escalation paths
  • handoff points and wait states
  • repeated reversals (“we decided X, then we decided not-X”)

Evidence that does not count (by itself)

  • “people feel misaligned”
  • “leadership wants visibility”
  • “teams are frustrated”
  • “communication is bad”

These can be useful signals, but they are not failure definitions.

A Simple Frame Selection Tool

When you’re unsure which frame you’re in, ask:

  1. Are we failing to choose what matters? → Strategy
  2. Are we failing to learn what’s true? → Discovery
  3. Are we failing to deliver reliably? → Delivery
  4. Are we failing at cross-boundary coordination? → Cooperation
  5. Are we failing to adapt as we scale? → Evolution

Most real situations involve multiple frames, but you must choose the dominant one to avoid designing a system that “optimizes everything” and enforces nothing.
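
If it helps to force a single answer, the five questions can be run as an ordered heuristic: ask each in turn and take the first “yes” as the dominant frame. A minimal sketch of that idea; the ordering rule and the yes/no inputs are assumptions, not a replacement for judgment.

```python
# Ordered (question, frame) pairs: the first "yes" decides the dominant frame.
FRAME_QUESTIONS = [
    ("failing to choose what matters", "strategy"),
    ("failing to learn what's true", "discovery"),
    ("failing to deliver reliably", "delivery"),
    ("failing at cross-boundary coordination", "cooperation"),
    ("failing to adapt as we scale", "evolution"),
]

def dominant_frame(answers: dict) -> str:
    """Return the dominant frame from yes/no answers to the five questions."""
    for question, frame in FRAME_QUESTIONS:
        if answers.get(question, False):
            return frame
    return "no frame yet: observe before designing anything"

# Example: delivery and cooperation both apply; picking one dominant frame is forced.
print(dominant_frame({
    "failing to deliver reliably": True,
    "failing at cross-boundary coordination": True,
}))  # prints "delivery"
```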

Misuse Model: How This Chapter Gets Misapplied

Misuse 1: Treating “observable” as “quantitative only”

Some failures are observable through consistent narratives and incidents even before metrics exist.

Correction: use incident examples and timelines as evidence until metrics stabilize.

Misuse 2: Over-framing and analysis paralysis

People keep refining the failure statement instead of acting.

Correction: timebox diagnosis. Your goal is a usable frame, not a perfect one.

Misuse 3: Choosing the frame that avoids conflict

Teams pick “delivery” because it feels technical, when the real problem is strategy or ownership.

Correction: ask “Which decision are we refusing to make?” That usually reveals the real frame.

The Non-Negotiable Rule Introduced Here

You may not select, adopt, or design a system until you can produce:

  • one Observable Failure Statement
  • one dominant problem frame
  • one decision type that is failing repeatedly

If you can’t do that, you don’t need a system. You need observation.
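
One way to make the rule hard to skip is to encode it as an explicit gate: no system selection until all three artifacts exist. A minimal sketch under that assumption; the class, field names, and allowed values mirror the lists in this chapter but are otherwise illustrative.

```python
from dataclasses import dataclass
from typing import Optional

FRAMES = {"strategy", "discovery", "delivery", "cooperation", "evolution"}
DECISIONS = {"priority", "scope", "ownership", "sequencing",
             "investment", "diagnosis", "repair"}

@dataclass
class SystemDesignGate:
    """You may not select, adopt, or design a system until all three exist."""
    failure_statement: Optional[str] = None  # the Observable Failure Statement
    dominant_frame: Optional[str] = None     # one of FRAMES
    failing_decision: Optional[str] = None   # one of DECISIONS

    def may_proceed(self) -> bool:
        return bool(
            self.failure_statement and self.failure_statement.strip()
            and self.dominant_frame in FRAMES
            and self.failing_decision in DECISIONS
        )

gate = SystemDesignGate(dominant_frame="delivery", failing_decision="sequencing")
print("design the system" if gate.may_proceed() else "you need observation")  # prints "you need observation"
```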

Exit Condition for This Chapter

Before moving on, write:

  1. Your Observable Failure Statement (3–5 sentences)
  2. The dominant frame (strategy / discovery / delivery / cooperation / evolution)
  3. The primary decision type currently failing (priority / scope / ownership / sequencing / investment / diagnosis / repair)

You now have the minimum input required to do real system work.