Original Methodology  ·  2025–2026

Agentic Orchestration RTM

A working framework for keeping AI-assisted product development grounded in real requirements, connected to real evidence, and honest about what's untested — across the full software development lifecycle.


The Problem

When you accelerate the building, you also accelerate the drift.

Most product teams I work with are already using AI agents to move faster — generating specs, writing user stories, spinning up prototypes in an afternoon. That speed is real. The risk hiding inside it is also real.

Accelerate the building and you accelerate the drift: requirements multiply without evidence, features ship without traceability, and the Requirements Traceability Matrix — if it exists at all — becomes a document nobody updates because the sprint is already two cycles ahead of it.

The question I kept coming back to: what if the answer isn't to slow down AI? What if it's to build an orchestration layer that keeps the AI-assisted SDLC grounded?

"The next three years of product design won't be won by the designer who uses the most AI tools. They'll be won by the designer who installs accountability into the AI loop."

My Role

Solo methodology design — from first principles to working system.

This is my own IP, developed across live project work and iterated in production. No team, no agency brief. I started from a frustration I kept hitting on client work: the RTM existed as a compliance artefact, not a living design tool. The question was whether it could be both.

I built it the same way I'd build a product: defined the problem, decomposed the requirements, ran Step 0 decomposition on my own methodology before generating outputs for it. The RTM document you're reading about was itself run through the RTM framework — which is either recursive or validating, depending on how you look at it.

I think it's validating.

Process

Four roles, one loop — running in parallel, not in sequence.

The central design decision was structural: the RTM Orchestrator isn't a phase-gated process. It's four roles that an AI agent holds simultaneously, switching between them based on where the product is in its lifecycle.

Discovery · Analyst

Decomposes stated requirements to irreducible business needs. Audits for evidence. Flags assumption-driven requirements before they become assumption-driven features.

Build · Connector

Maps confirmed requirements to features in the current sprint. Flags orphaned features — those with no traceable business requirement. Every pivot generates a Sync Delta Report.

Pre-Ship · Validator

Runs a test coverage audit. Surfaces untested requirements as HIGH risk. Produces a pre-launch snapshot with a recommendation: BLOCK, ACCEPT-RISK, or CLEAR.

Post-Launch · Sentinel

Monitors the RTM against incoming behavioral data. Detects Research Drift — evidence contradicted by post-launch signals. The role most teams don't have.

The Sentinel role deserves its own note. Launch is almost universally treated as the end of the design engagement. The Sentinel role treats launch as the beginning of the measurement loop. That shift is not a minor detail — it's the whole point.
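The "four roles, one loop" structure can be pictured as a single agent whose active role follows the product's lifecycle phase. A minimal Python sketch, with hypothetical names not taken from the actual CLAUDE.md protocol:

```python
from enum import Enum

class Phase(Enum):
    DISCOVERY = "discovery"
    BUILD = "build"
    PRE_SHIP = "pre-ship"
    POST_LAUNCH = "post-launch"

class Role(Enum):
    ANALYST = "analyst"       # decomposes requirements, audits evidence
    CONNECTOR = "connector"   # maps requirements to in-sprint features
    VALIDATOR = "validator"   # audits test coverage before ship
    SENTINEL = "sentinel"     # watches for post-launch research drift

# One agent, four roles: the active role follows the lifecycle
# phase rather than a phase-gated handover between owners.
ROLE_FOR_PHASE = {
    Phase.DISCOVERY: Role.ANALYST,
    Phase.BUILD: Role.CONNECTOR,
    Phase.PRE_SHIP: Role.VALIDATOR,
    Phase.POST_LAUNCH: Role.SENTINEL,
}

def active_role(phase: Phase) -> Role:
    return ROLE_FOR_PHASE[phase]
```

The point of the mapping is that it is total: there is no phase, launch included, where no role is watching the RTM.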

Key Design Decisions

Three choices that made this system actually usable.

Evidence Strength as a first-class field. Every requirement carries an evidence rating — not buried in a column you have to scroll to, but surfaced prominently in every output. The scale has four levels:

Evidence Strength Scale

★★★ Validated Multiple independent evidence sources; consistent, recent, and directly relevant to this requirement.
★★ Supported Single source or indirect evidence. Plausible, but not conclusive.
★ Weak Anecdotal, outdated, or from a non-representative sample. Treat with caution.
∅ Assumed No evidence. Team belief only. Flagged immediately — building on this is a conscious decision, not a default.

When the ∅ Assumed rate exceeds 30% of requirements, that's surfaced as a launch risk — not a minor gap. This threshold came from watching how often assumptions quietly cluster in the parts of a product that end up causing post-launch fires.
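The 30% rule is simple enough to express directly. A minimal sketch of the threshold check, assuming requirements are represented as dicts with a `strength` field (the field name is illustrative, not from the actual schema file):

```python
ASSUMED_RISK_THRESHOLD = 0.30  # the methodology's 30% rule

def assumed_rate(requirements: list[dict]) -> float:
    """Share of requirements whose evidence is team belief only."""
    if not requirements:
        return 0.0
    assumed = sum(1 for r in requirements if r["strength"] == "assumed")
    return assumed / len(requirements)

def is_launch_risk(requirements: list[dict]) -> bool:
    # Surfaced as a launch risk, not a minor gap, once assumed
    # requirements exceed 30% of the set.
    return assumed_rate(requirements) > ASSUMED_RISK_THRESHOLD
```
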

Human decision gates are explicit and non-negotiable. The framework is very clear about what the AI can do and what it can't. The agent flags, surfaces, and documents. It never approves.

Approving ∅ Assumed requirements for build

Building on unvalidated assumptions is a business risk, not a technical one. Humans own this call.

Shipping with UNTESTED requirements

Production risk is a human accountability, not an agent output. The agent recommends; the human decides.

Accepting Research Drift for a release

The agent surfaces evidence that's been contradicted by post-launch data. Deciding that drift "doesn't matter this cycle" is always a human call — one the system makes visible and documents, rather than letting it pass unnoticed.
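The three gates share one enforcement pattern: the agent can record anything except an approval. A minimal sketch of that pattern, with hypothetical gate names standing in for whatever the actual protocol calls them:

```python
from dataclasses import dataclass

# The three human-only gates named in the framework.
HUMAN_ONLY_GATES = {
    "approve-assumed-for-build",
    "ship-with-untested",
    "accept-research-drift",
}

@dataclass
class Decision:
    gate: str
    actor: str        # "agent" or a named human
    rationale: str

def record_decision(decision: Decision) -> Decision:
    # The agent flags, surfaces, and documents. It never approves:
    # any human-only gate rejects "agent" as the deciding actor.
    if decision.gate in HUMAN_ONLY_GATES and decision.actor == "agent":
        raise PermissionError(f"'{decision.gate}' is a human-only gate")
    return decision
```

The design choice worth noting: the gate is enforced at the point of recording, so an approval without a named human simply cannot enter the audit trail.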

Step 0 decomposition as a mandatory gate. Before any RTM output is generated, the agent runs an internal decomposition pass: strips solution-language from requirements, identifies assumptions, flags collapsed or contradictory requirements. This is the most important thing in the whole framework and the easiest thing to skip. So it's a hard gate, not a guideline.

Tools & Methods

The frameworks that shaped it.

First Principles Thinking for requirement decomposition — the practice of stripping a stated need back to its irreducible business logic, refusing to accept solution-framing as a substitute for problem clarity. Without this, RTMs fill up with features masquerading as requirements.

Game Theory as a lens for the build phase — specifically for making stakeholder incentive misalignment legible. Engineering wants to close tickets. Product wants roadmap milestones. Design wants validated solutions. The RTM Orchestrator's job isn't to arbitrate. It's to make the cost of misalignment impossible to ignore.

Teresa Torres's Continuous Product Discovery framework informs the evidence structure — specifically the cadence at which evidence ages and the idea that discovery is a loop, not a pre-build phase.

The implementation vehicle is Claude (Anthropic), operating via a CLAUDE.md operating protocol that defines the four roles, the RTM schema, and the decision gates in machine-readable form. The methodology lives in a file. The agent reads the file. That's the system.
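To make "the methodology lives in a file" concrete, here is a hypothetical excerpt of what such a CLAUDE.md might contain — an illustration of the shape, not the actual protocol:

```markdown
# CLAUDE.md — RTM Orchestrator (hypothetical excerpt)

## Roles
- Analyst (discovery): decompose to irreducible business needs; audit evidence
- Connector (build): map requirements to features; flag orphans; emit Sync Delta Reports
- Validator (pre-ship): coverage audit; recommend BLOCK / ACCEPT-RISK / CLEAR
- Sentinel (post-launch): detect Research Drift against behavioral data

## Hard gates
- Step 0 decomposition runs before any RTM output is generated
- Never approve: ∅ Assumed for build, shipping UNTESTED, accepting drift
```
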

The Schema

One universal structure, across all four phases.

Every output — from discovery through post-launch drift detection — maps back to the same RTM row structure. This is what makes the four-role model actually coherent: there's one source of truth, and all four roles read from and write to it.

// RTM Row Schema — Universal

REQ-ID         // REQ-[domain]-[number] e.g. REQ-UX-003
Requirement    // Irreducible business need. Not the feature.
Type           // Functional / Non-Functional / Compliance / UX / ...
Evidence       // Source tag + one-line description
Strength       // ★★★ Validated / ★★ Supported / ★ Weak / ∅ Assumed
Feature(s)     // Product features satisfying this requirement
Token(s)       // Design tokens implementing the requirement
Test Case(s)   // [Not Defined] if none — flagged HIGH risk pre-ship
Status         // Full / Partial / Untested / Ungrounded
Risk           // HIGH / MED / LOW — coverage gaps × business criticality

The design tokens column is worth calling out. Most RTMs stop at test cases. This one connects the requirement all the way to the token layer — specifically for accessibility and compliance requirements, where the gap between "we said we'd meet WCAG AA" and "the token system actually enforces WCAG AA" is where a lot of silent failures live.
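The row schema maps cleanly onto a typed record. A minimal Python sketch mirroring the fields above, with an illustrative risk rule (the real framework's risk derivation is described only as "coverage gaps × business criticality", so the rule here is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class RTMRow:
    req_id: str                 # e.g. "REQ-UX-003"
    requirement: str            # irreducible business need, not the feature
    type: str                   # Functional / Non-Functional / Compliance / UX / ...
    evidence: str               # source tag + one-line description
    strength: str               # "validated" / "supported" / "weak" / "assumed"
    features: list[str] = field(default_factory=list)
    tokens: list[str] = field(default_factory=list)   # design tokens, incl. a11y
    test_cases: list[str] = field(default_factory=list)
    status: str = "Untested"    # Full / Partial / Untested / Ungrounded

    def risk(self, business_criticality: str = "MED") -> str:
        # Illustrative rule: an untested requirement inherits HIGH risk
        # unless the business stakes are genuinely low.
        if not self.test_cases:
            return "HIGH" if business_criticality != "LOW" else "MED"
        return business_criticality
```
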

Outcomes & Impact

What actually happened when the framework ran on itself.

4 Full PRDs generated and traced back to a parent BRD
40% Evidence rated ★★★ Validated at initial RTM generation
~10% Requirements flagged ∅ Assumed — well below the 30% risk threshold

The methodology ran on itself — the RTM for the Agentic Orchestration RTM framework was generated using the framework. That sounds like a trick, but it's not. It's how you find out whether a methodology is actually rigorous or just sounds rigorous. Running it on your own work is the hardest test.

The pre-ship validation for the methodology returned ACCEPT-RISK. The highest-risk unvalidated assumption: whether the framework meaningfully reduces post-launch defects in teams that adopt it versus teams that don't. There's no comparative data yet. The RTM says so, clearly, in the ∅ Assumed row. That honesty is the point.

Applied to the Gamuda Land Vietnam PropTech transformation — four platforms across the property discovery, customisation, transaction, and referral lifecycle — the framework surfaced six critical unvalidated assumptions before any build work started. That's the value. Not the artefact. The surfacing.

Reflection

What I'd do differently — and what I'd defend.

The thing I'd do differently: I'd build an adoption pathway earlier. The framework as a methodology is solid. The framework as a team practice is still underspecified. How do you introduce it mid-sprint without it feeling like overhead? How do you get engineering to care about evidence ratings? Those are real questions I haven't fully answered.

The thing I'd defend: the human decision gates. There's a temptation — especially in teams that are excited about AI automation — to want the system to make more decisions, not fewer. But the value of this framework isn't the AI doing the work. It's the AI making the human decisions impossible to make accidentally. That's a different design goal, and I think it's the right one.

Speed is not the enemy of traceability. Unexamined speed is.