Methodology

The Pipeline Is a Partnership

I tried to run the agentic pipeline end-to-end on a four-SDK-surface PRD. Tests stayed green. The tenant didn't. The fix wasn't more autonomy — it was phasing the work, with checkpoints between.

I — The misconception

For about a week I thought the agentic pipeline could run a feature end-to-end on its own.

The shape that produced that thought is real: a PRD becomes an architecture, an architecture becomes thirty tasks, the tasks become RED → GREEN → REFACTOR commits, code review catches the strict-TS errors I missed, the test pass goes 304/304, the docs write themselves, /ship closes the loop. Three days of elapsed time, thirty-five commits, and a feature that builds and tests clean against everything the framework knows how to check.

That's the part the pipeline is good at.

The part it can't do alone is the part that comes after: the panel loaded inside an actual SitecoreAI tenant, and the picture was wrong.

II — What the tenant said

PRD-001 of Last-Edit Trail extended the read-only trail panel into a Compare view: pick two page versions, see the field-level diff, drill into a per-component accordion, and flip to a side-by-side visual canvas. All of that built. All of that tested. All of that shipped on the branch with status: shipped_with_caveats.

The first real-tenant smoke surfaced nine findings in one session:

  • The Compare CTA was illegible — dark text on a purple background.
  • Clicking Compare opened a placeholder body. The shell was still mounting a scaffold helper from an earlier task instead of the real Compare view.
  • The page-fields section rendered nothing on success when the field list came back empty — no diff, no "no changes" copy.
  • Component rows lost their names because real layout data ships with IDs and no human-readable labels — the parser had no fallback.
  • The Visual tab loaded "Page not found" because the URL builder defaulted to the site root and nothing forwarded the page's actual route.

…and four more in that vein.
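The "Page not found" finding is the kind a type signature can rule out. A minimal sketch, assuming a hypothetical `buildPreviewUrl` helper and `PreviewTarget` shape (neither is the project's real code, and the actual preview-URL convention is not reproduced here): making the route a required input means a caller that forgets to forward it fails at compile time instead of quietly landing on the site root.

```typescript
// Hypothetical names; the actual preview-URL convention is not shown in this article.
interface PreviewTarget {
  origin: string; // rendering host origin, e.g. "https://example.test"
  route: string;  // the page's route; required on purpose, with no default
}

function buildPreviewUrl(target: PreviewTarget): string {
  const route = target.route.trim();
  if (route === "") {
    // An empty route would silently resolve to the site root, so fail loudly instead.
    throw new Error("buildPreviewUrl: page route is empty");
  }
  const origin = target.origin.replace(/\/+$/, ""); // drop trailing slashes
  const path = route.startsWith("/") ? route : `/${route}`;
  return `${origin}${path}`;
}
```

The home page can still be previewed by passing `"/"` explicitly; what goes away is the implicit default.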

Each was a small, surgical fix. None were architectural. The framework caught architectural drift through structural tests — those held the whole way through. What the framework didn't catch was the gap between the shape of the data the mocks were returning and the shape of the data the real tenant returns.

III — Mocks are faithful to types, not to data

The fixtures in this codebase are all sourced from the SDK's own type definitions, and a project rule rejects fixtures that aren't. So the test surface is faithful to the type contract.

What the mock fixtures aren't faithful to is the runtime data. Real layout payloads ship with item IDs and no human-readable names. Real authoring queries can return zero fields for some pages depending on how you filter. Some context values turn out to be strings that you need, even though the SDK types don't make that obvious. None of those are type errors. All of them are integration bugs.
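Two of those gaps, missing names and empty field lists, are cheap to handle once you know they exist. A minimal sketch, assuming hypothetical `ComponentRef` and `FieldDiff` shapes rather than the SDK's real types: the label helper falls back to a short ID-based label, and the page-fields view returns an explicit empty state instead of rendering nothing.

```typescript
// Hypothetical shapes for illustration; the real SDK payloads are not reproduced here.
interface ComponentRef {
  id: string;
  name?: string; // present in type-derived fixtures, often absent at runtime
}

function componentLabel(ref: ComponentRef): string {
  const name = ref.name?.trim();
  if (name) return name;
  // Fallback: a short, stable label derived from the item ID.
  return `Component ${ref.id.slice(0, 8)}`;
}

interface FieldDiff {
  field: string;
  before: string;
  after: string;
}

type PageFieldsView =
  | { kind: "diff"; rows: FieldDiff[] }
  | { kind: "empty"; message: string };

function pageFieldsView(diffs: FieldDiff[]): PageFieldsView {
  if (diffs.length === 0) {
    // Success with zero fields is a real state and gets its own copy.
    return { kind: "empty", message: "No field changes between these versions." };
  }
  return { kind: "diff", rows: diffs };
}
```

Modelling the empty case as its own variant means the renderer has to handle it; a bare empty array is easy to map over and forget.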

PRD-001 introduced four new SDK integration surfaces in one shot — a page-state read, an authoring query, an undocumented preview-URL convention, and an embedded JSON-string parse on a layout field. Each had its own runtime-data gotcha. The framework had no checkpoint that asked "does the data shape match what your parser expected?" until the operator clicked Compare in the live tenant — by which point those four surfaces had stacked their own divergences on each other.
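The embedded JSON-string parse is where a "does the data match what the parser expected?" check is easiest to add. A minimal sketch, assuming a hypothetical `LayoutNode` shape (the real layout schema is richer and not shown here): the parser treats the field as `unknown`, validates it with a type guard, and returns a reason on failure rather than throwing or passing bad data downstream.

```typescript
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Hypothetical minimal node shape, for illustration only.
interface LayoutNode {
  id: string;
  children: LayoutNode[];
}

function isLayoutNode(value: unknown): value is LayoutNode {
  if (typeof value !== "object" || value === null) return false;
  const node = value as { id?: unknown; children?: unknown };
  return (
    typeof node.id === "string" &&
    Array.isArray(node.children) &&
    node.children.every(isLayoutNode)
  );
}

function parseLayoutField(raw: unknown): ParseResult<LayoutNode> {
  if (typeof raw !== "string") {
    return { ok: false, reason: `expected a JSON string, got ${typeof raw}` };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "layout field is not valid JSON" };
  }
  return isLayoutNode(parsed)
    ? { ok: true, value: parsed }
    : { ok: false, reason: "parsed layout does not match the expected node shape" };
}
```

Logging the `reason` to the console during a tenant walkthrough turns a blank panel into a readable divergence report.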

IV — Autonomous within, reality-check between

The cadence that actually works isn't "AI runs the whole pipeline" or "human approves every step." It's autonomous within a phase, hard checkpoint between phases.

Within a phase: fast-track autonomous

/create-prd → /architect → /task-breakdown → /implement → /code-review → /test → /document → /ship runs end-to-end. The framework does what it does well: produces well-architected, test-covered code against verified-shape SDK contracts. No micro-confirmations needed inside this loop.

Between phases: hard checkpoint

Real-tenant install. F12 console open. Five-minute walkthrough as an actual editor. Screenshots compared against the POC clickdummy. The kinds of bugs that mocks cannot reproduce surface here, and they surface in isolation, not stacked on three other phases' worth of unverified divergence.

The first time you skip the checkpoint, the next phase builds on unverified ground. Two phases in, the divergences compound and the debug surface multiplies. By the fourth SDK surface, you're chasing nine smoke findings instead of one.

V — Phase by SDK surface, not by feature

The temptation when scoping a PRD is to phase by user-facing feature. "Phase 1 is Components used. Phase 2 is the Visual tab." That's a reasonable product split, and it has nothing to do with where the integration risk lives.

The phasing that actually de-risks the pipeline is one new SDK surface per phase.

BEFORE

  • 1 massive PRD, shipped end-to-end
  • 4 new SDK surfaces in one shot
  • 0 real-tenant checkpoints
  • 9 findings at first smoke

AFTER

  • 3 phases, shipped sequentially
  • 1 new SDK surface per phase
  • 3 real-tenant checkpoints
  • ~3 findings each, in isolation

PRD-001 of Last-Edit Trail should have been:

  • Phase 1 — refactor the trail-row primitive, add multi-select, the Compare CTA, the Compare view shell, and a page-fields-only diff. One new SDK surface: the authoring query that returns a page's full version history. Ship. Verify in the tenant. Read the F12 console.
  • Phase 2 — add the Components-used union and the recursive per-component accordion. One new SDK surface: the page-state read that returns layout. Ship on top of a verified Phase 1.
  • Phase 3 — add the Visual tab, run the spike on the Pages preview URL convention, build the seam slider. One new SDK surface: the URL convention itself. Ship on top of two verified phases.

Each phase exits the pipeline through a real-tenant checkpoint before the next one enters it. Findings surface in isolation against known-good prior phases. Rollback is cheap because the rollback target is simply the previous phase's commit hash.
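The phasing rule is simple enough to write down as data. A minimal sketch, assuming a hypothetical `Phase` record and helper names (none of this is part of the framework): one function flags any phase that introduces more than one new SDK surface, and another refuses to start a phase until every earlier phase carries a human-set `tenantVerified` flag.

```typescript
// Hypothetical plan format, for illustration only.
interface Phase {
  name: string;
  newSdkSurfaces: string[];
  tenantVerified: boolean; // set by a human after the real-tenant checkpoint
}

function planProblems(phases: Phase[]): string[] {
  const problems: string[] = [];
  for (const phase of phases) {
    if (phase.newSdkSurfaces.length > 1) {
      problems.push(
        `${phase.name}: ${phase.newSdkSurfaces.length} new SDK surfaces, expected at most 1`
      );
    }
  }
  return problems;
}

// A phase may start only when every earlier phase has passed its checkpoint.
function canStart(phases: Phase[], index: number): boolean {
  return phases.slice(0, index).every((phase) => phase.tenantVerified);
}

// The three phases above, with Phase 1 verified and the rest still open.
const lastEditTrailPlan: Phase[] = [
  { name: "Phase 1", newSdkSurfaces: ["authoring query"], tenantVerified: true },
  { name: "Phase 2", newSdkSurfaces: ["page-state read"], tenantVerified: false },
  { name: "Phase 3", newSdkSurfaces: ["preview URL convention"], tenantVerified: false },
];
```

With this plan, Phase 2 may start and Phase 3 may not, which is exactly the ordering the checkpoint is meant to enforce.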

VI — Hooks for the next round

Hook 01

The pipeline is a partnership, not an autopilot.

It runs each phase end-to-end on its own. It does not run the whole product on its own. The human eye on a real tenant is the gate that catches what mocks cannot.

Hook 02

Phase by SDK surface, not by feature.

Three new SDK integration surfaces in one PRD is a forcing function for splitting. Each new surface ships against a verified prior phase, or it doesn’t ship.

Hook 03

Mocks are faithful to types. The tenant is faithful to data.

Type-driven fixtures prove the shape contract. They do not prove the runtime data shape, the empty-state behavior, or the iframe context. A captured-shape sample from a real tenant is the missing artifact.
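One way to use a captured-shape sample is to diff it structurally against the fixture. A minimal sketch, with hypothetical `shapeOf` and `shapeDiff` helpers: each value is flattened into a map of paths to type names, and any path whose type differs, or that exists on only one side, is reported. It is deliberately crude: array items share one path, so for mixed arrays the last item wins.

```typescript
type Shape = Record<string, string>;

// Flatten a value into path -> type name, e.g. "$.components[].id" -> "string".
function shapeOf(value: unknown, path = "$", out: Shape = {}): Shape {
  if (Array.isArray(value)) {
    out[path] = "array";
    value.forEach((item) => shapeOf(item, `${path}[]`, out));
  } else if (value !== null && typeof value === "object") {
    out[path] = "object";
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      shapeOf(child, `${path}.${key}`, out);
    }
  } else {
    out[path] = value === null ? "null" : typeof value;
  }
  return out;
}

// Report every path where the fixture and the captured sample disagree.
function shapeDiff(fixture: unknown, captured: unknown): string[] {
  const a = shapeOf(fixture);
  const b = shapeOf(captured);
  const paths = Array.from(new Set([...Object.keys(a), ...Object.keys(b)])).sort();
  const diffs: string[] = [];
  for (const p of paths) {
    if (a[p] !== b[p]) {
      diffs.push(`${p}: fixture=${a[p] ?? "missing"}, captured=${b[p] ?? "missing"}`);
    }
  }
  return diffs;
}
```

For a fixture component with `id` and `name` and a captured one with only `id`, the diff is a single line pointing at `$.components[].name`, which is the missing-label finding stated before anyone opens the panel.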

Hook 04

Pending smoke gates aren’t soft passes.

If a “smoke pending” status still lets a ship command commit a PR, it’s a documentation field, not a gate. For Marketplace work the real-tenant smoke needs to be a hard block before ship finishes — not a caveat label on a PR title.
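The difference between a label and a gate fits in a few lines. A minimal sketch, assuming a hypothetical `SmokeStatus` value recorded per phase (the framework's real status fields are not shown here): only an explicit pass allows the ship step to continue, and "pending" is treated the same as a failure.

```typescript
// Hypothetical status values, for illustration only.
type SmokeStatus = "passed" | "pending" | "failed";

interface ShipDecision {
  allowed: boolean;
  reason: string;
}

function shipGate(smoke: SmokeStatus): ShipDecision {
  switch (smoke) {
    case "passed":
      return { allowed: true, reason: "real-tenant smoke passed" };
    case "pending":
      // Pending is a block, not a soft pass.
      return { allowed: false, reason: "real-tenant smoke has not run yet" };
    case "failed":
      return { allowed: false, reason: "real-tenant smoke failed" };
  }
}
```

A ship command that calls something like this and exits non-zero when `allowed` is false cannot end up with a caveat in the PR title, because there is no PR.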

Hook 05

Scaffold-drift placeholders need their own structural test.

A placeholder with a comment that says “a later task replaces this” will outlive the later task unless something fails CI while it’s still there. Multi-batch implementation handoffs are exactly the kind of context where this rots.
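A structural test for this can be a plain text scan. A minimal sketch, assuming a hypothetical convention where every scaffold placeholder carries a `SCAFFOLD-PLACEHOLDER` marker comment: the function takes file contents keyed by path and returns every line that still carries the marker, so a test can assert the list is empty before ship.

```typescript
interface PlaceholderHit {
  file: string;
  line: number; // 1-based line number
  text: string;
}

// Hypothetical marker convention; any agreed-upon string would do.
const PLACEHOLDER_MARKER = "SCAFFOLD-PLACEHOLDER";

function findPlaceholders(sources: Record<string, string>): PlaceholderHit[] {
  const hits: PlaceholderHit[] = [];
  for (const [file, content] of Object.entries(sources)) {
    content.split("\n").forEach((text, index) => {
      if (text.includes(PLACEHOLDER_MARKER)) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}
```

Reading the files from disk is left to the test runner; keeping the scan a pure function makes it easy to test on its own.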

The framework didn't fall apart. The pipeline didn't fail. What clarified itself is that the pipeline isn't the whole job — it's the inside half of a partnership. The outside half is one human, one real tenant, every time the integration surface area grows.
