The Pipeline Is a Partnership
I tried to run the agentic pipeline end-to-end on a four-SDK-surface PRD. Tests stayed green. The tenant didn't. The fix wasn't more autonomy — it was phasing the work, with checkpoints between.
I — The misconception
For about a week I thought the agentic pipeline could run a feature end-to-end on its own.
The shape that produced that thought is real: a PRD becomes an
architecture, an architecture becomes thirty tasks, the tasks become
RED → GREEN → REFACTOR commits, code review catches the strict-TS
errors I missed, the test pass goes 304/304, the docs write themselves,
/ship closes the loop. Three days of elapsed time, thirty-five
commits, and a feature that builds and tests clean against everything
the framework knows how to check.
That's the part the pipeline is good at.
The part it can't do alone is the part that comes after: the panel loaded inside an actual SitecoreAI tenant, and the picture was wrong.
II — What the tenant said
PRD-001 of Last-Edit Trail extended the read-only trail panel into a
Compare view: pick two page versions, see the field-level diff, drill
into a per-component accordion, and flip to a side-by-side visual
canvas. All of that built. All of that tested. All of that shipped on
the branch with status: shipped_with_caveats.
The first real-tenant smoke surfaced nine findings in one session:
- The Compare CTA was illegible — dark text on a purple background.
- Clicking Compare opened a placeholder body. The shell was still mounting a scaffold helper from an earlier task instead of the real Compare view.
- The page-fields section rendered nothing on success when the field list came back empty — no diff, no "no changes" copy.
- Component rows lost their names because real layout data ships with IDs and no human-readable labels — the parser had no fallback.
- The Visual tab loaded "Page not found" because the URL builder defaulted to the site root and nothing forwarded the page's actual route.
…and four more in that vein.
Each was a small, surgical fix. None were architectural. The framework caught architectural drift through structural tests — those held the whole way through. What the framework didn't catch was the gap between the shape of the data the mocks were returning and the shape of the data the real tenant returns.
III — Mocks are faithful to types, not to data
The fixtures in this codebase are all sourced from the SDK's own type definitions, and a project rule rejects fixtures that aren't. So the test surface is faithful to the type contract.
What the mock fixtures aren't faithful to is the runtime data. Real layout payloads ship with item IDs and no human-readable names. Real authoring queries can return zero fields for some pages, depending on how you filter. Some context values turn out to be required strings, which the SDK types don't make obvious. None of those are type errors. All of them are integration bugs.
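A minimal TypeScript sketch of that gap follows. The `ComponentEntry` type and the helpers `labelFor` and `summarizeFields` are hypothetical names made up for illustration; they are not the SDK's types or this project's actual parser. Both entries satisfy the same type, but only the second resembles what the tenant returned.

```typescript
// Hypothetical shape, for illustration only: not the SDK's real type.
interface ComponentEntry {
  id: string;
  name?: string; // optional in the type, so both entries below type-check
}

// A type-faithful fixture: valid, and friendlier than real data.
const mockEntry: ComponentEntry = { id: "cmp-1", name: "Hero Banner" };

// A tenant-like entry: equally valid against the type, but it has no label.
const tenantEntry: ComponentEntry = { id: "cmp-2" };

// Fall back to a shortened ID so a component row never renders blank.
function labelFor(entry: ComponentEntry): string {
  const name = entry.name?.trim();
  return name ? name : `Component ${entry.id.slice(0, 8)}`;
}

// Give the empty case explicit copy instead of rendering nothing.
function summarizeFields(changed: readonly string[]): string {
  return changed.length === 0
    ? "No field changes between these versions."
    : `${changed.length} field(s) changed.`;
}
```

The compiler accepts both entries, which is the point: the fallback branch only gets exercised if a fixture, or a real tenant, omits the name.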
PRD-001 introduced four new SDK integration surfaces in one shot — a page-state read, an authoring query, an undocumented preview-URL convention, and an embedded JSON-string parse on a layout field. Each had its own runtime-data gotcha. The framework had no checkpoint that asked "does the data shape match what your parser expected?" until the operator clicked Compare in the live tenant — by which point those four surfaces had stacked their own divergences on each other.
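One of those surfaces, the embedded JSON-string parse, can be sketched as a defensive parser that reports why it failed instead of throwing. This is a sketch under assumptions: the `LayoutDoc` shape with a `components` array is hypothetical, and `parseLayoutField` is an illustrative name, not the codebase's function.

```typescript
// Result type so callers must handle the failure branch explicitly.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Hypothetical: assumes the layout field holds a JSON string whose parsed
// value is an object with a `components` array.
interface LayoutDoc {
  components: unknown[];
}

function parseLayoutField(raw: unknown): ParseResult<LayoutDoc> {
  if (typeof raw !== "string") {
    return { ok: false, reason: `expected string, got ${typeof raw}` };
  }
  if (raw.trim() === "") {
    return { ok: false, reason: "layout field is empty" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "layout field is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "parsed layout is not an object" };
  }
  const components = (parsed as { components?: unknown }).components;
  if (!Array.isArray(components)) {
    return { ok: false, reason: "parsed layout has no components array" };
  }
  return { ok: true, value: { components } };
}
```

Each `reason` string is something an operator can read in the browser console at the checkpoint, which is cheaper than inferring the cause from a blank panel.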
IV — Autonomous within, reality-check between
The cadence that actually works isn't "AI runs the whole pipeline" or "human approves every step." It's autonomous within a phase, hard checkpoint between phases.
The first time you skip the checkpoint, the next phase builds on unverified ground. Two phases in, the divergences compound and the debug surface multiplies. By the fourth SDK surface, you're chasing nine smoke findings instead of one.
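The cadence can be sketched as a small runner, written here in TypeScript. All names (`Phase`, `CheckpointResult`, `runPhases`) are hypothetical; in practice the `checkpoint` callback would stand in for a human reporting what they saw in the tenant, not an automated check.

```typescript
interface CheckpointResult {
  verified: boolean;
  findings: string[];
}

interface Phase {
  name: string;
  run: () => void; // autonomous work inside the phase
  checkpoint: () => CheckpointResult; // real-tenant verification between phases
}

interface RunOutcome {
  completed: string[];
  blockedAt?: string;
  findings: string[];
}

// Runs phases in order and stops at the first unverified checkpoint,
// so no phase ever starts on top of unverified ground.
function runPhases(phases: readonly Phase[]): RunOutcome {
  const completed: string[] = [];
  for (const phase of phases) {
    phase.run();
    const result = phase.checkpoint();
    if (!result.verified) {
      return { completed, blockedAt: phase.name, findings: result.findings };
    }
    completed.push(phase.name);
  }
  return { completed, findings: [] };
}
```

The design choice is that a failed checkpoint returns early rather than logging and continuing, so findings always belong to exactly one phase.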
V — Phase by SDK surface, not by feature
The temptation when scoping a PRD is to phase by user-facing feature. "Phase 1 is Components used. Phase 2 is the Visual tab." That's a reasonable product split, and it has nothing to do with where the integration risk lives.
The phasing that actually de-risks the pipeline is one new SDK surface per phase.
[Figure: shipped end-to-end in one shot (checkpoints at first smoke) versus shipped sequentially per phase (checkpoints in isolation).]
PRD-001 of Last-Edit Trail should have been:
- Phase 1 — refactor the trail-row primitive, add multi-select, the Compare CTA, the Compare view shell, and a page-fields-only diff. One new SDK surface: the authoring query that returns a page's full version history. Ship. Verify in the tenant. Read the F12 console.
- Phase 2 — add the Components-used union and the recursive per-component accordion. One new SDK surface: the page-state read that returns layout. Ship on top of a verified Phase 1.
- Phase 3 — add the Visual tab, run the spike on the Pages preview URL convention, build the seam slider. One new SDK surface: the URL convention itself. Ship on top of two verified phases.
Each phase exits the pipeline through a real-tenant checkpoint before the next one enters it. Findings surface in isolation against known-good prior phases. Rollback is cheap because the rollback target is just "the previous phase's commit hash."
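The one-surface-per-phase rule is simple enough to lint at scoping time. The sketch below is an assumption about how that could look, not part of the framework: `PhasePlan` and `overloadedPhases` are hypothetical names, and the surface labels paraphrase the ones named in this article.

```typescript
interface PhasePlan {
  name: string;
  newSdkSurfaces: string[];
}

// PRD-001 as it actually ran: every new surface in a single phase.
const oneShot: PhasePlan[] = [
  {
    name: "PRD-001 (one shot)",
    newSdkSurfaces: [
      "page-state read",
      "authoring query",
      "preview-URL convention",
      "embedded JSON-string parse",
    ],
  },
];

// The three phases listed above, one new surface each.
const phased: PhasePlan[] = [
  { name: "Phase 1", newSdkSurfaces: ["authoring query (version history)"] },
  { name: "Phase 2", newSdkSurfaces: ["page-state read (layout)"] },
  { name: "Phase 3", newSdkSurfaces: ["preview-URL convention"] },
];

// Returns the names of phases that introduce more than one new SDK surface.
function overloadedPhases(plan: readonly PhasePlan[]): string[] {
  return plan.filter((p) => p.newSdkSurfaces.length > 1).map((p) => p.name);
}
```

Run against `oneShot`, the check flags the single phase; run against `phased`, it returns an empty list.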
VI — Hooks for the next round
The pipeline is a partnership, not an autopilot.
It runs each phase end-to-end on its own. It does not run the whole product on its own. The human eye on a real tenant is the gate that catches what mocks cannot.
Phase by SDK surface, not by feature.
Three new SDK integration surfaces in one PRD is a forcing function for splitting. Each new surface ships against a verified prior phase, or it doesn’t ship.
Mocks are faithful to types. The tenant is faithful to data.
Type-driven fixtures prove the shape contract. They do not prove the runtime data shape, the empty-state behavior, or the iframe context. A captured-shape sample from a real tenant is the missing artifact.
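One way to use a captured sample is to compare its runtime shape with the fixture's. The TypeScript below is a rough sketch under stated assumptions: `describeShape` and `shapeDiff` are hypothetical helpers, arrays are sampled by their first element only, and the captured payload would need to be scrubbed of tenant data before it is committed.

```typescript
// Describes the runtime shape of a value as a list of "path: kind" lines.
function describeShape(value: unknown, path: string = "$"): string[] {
  if (value === null) return [`${path}: null`];
  if (Array.isArray(value)) {
    if (value.length === 0) return [`${path}: empty-array`];
    // Only the first element is sampled; a fuller tool would merge all elements.
    return describeShape(value[0], `${path}[]`);
  }
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record);
    if (keys.length === 0) return [`${path}: empty-object`];
    const lines: string[] = [];
    for (const key of keys) {
      lines.push(...describeShape(record[key], `${path}.${key}`));
    }
    return lines;
  }
  return [`${path}: ${typeof value}`];
}

// Lists the shape lines that appear in one sample but not in the other.
function shapeDiff(fixture: unknown, captured: unknown): string[] {
  const fixtureLines = describeShape(fixture);
  const capturedLines = describeShape(captured);
  const onlyFixture = fixtureLines
    .filter((line) => capturedLines.indexOf(line) === -1)
    .map((line) => `fixture only: ${line}`);
  const onlyCaptured = capturedLines
    .filter((line) => fixtureLines.indexOf(line) === -1)
    .map((line) => `captured only: ${line}`);
  return onlyFixture.concat(onlyCaptured);
}
```

A non-empty diff does not mean the fixture is wrong against the types; it means the fixture is exercising a shape the tenant did not return, or missing one it did.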
Pending smoke gates aren’t soft passes.
If a “smoke pending” status still lets a ship command commit a PR, it’s a documentation field, not a gate. For Marketplace work the real-tenant smoke needs to be a hard block before ship finishes — not a caveat label on a PR title.
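A hard block can be as small as refusing anything other than an explicit pass. This is a sketch, assuming a ship step that can read a smoke status; `SmokeStatus`, `ShipRequest`, `shipBlockers`, and `ship` are hypothetical names and do not describe the real /ship command.

```typescript
type SmokeStatus = "passed" | "failed" | "pending";

interface ShipRequest {
  branch: string;
  testsGreen: boolean;
  smoke: SmokeStatus;
}

// Returns the reasons ship must stop; an empty list means it may proceed.
function shipBlockers(req: ShipRequest): string[] {
  const blockers: string[] = [];
  if (!req.testsGreen) blockers.push("test suite is not green");
  // "pending" is treated exactly like "failed": only an explicit pass opens the gate.
  if (req.smoke !== "passed") blockers.push(`real-tenant smoke is ${req.smoke}`);
  return blockers;
}

// Throws instead of attaching a caveat label, so the pipeline cannot finish.
function ship(req: ShipRequest): void {
  const blockers = shipBlockers(req);
  if (blockers.length > 0) {
    throw new Error(`ship blocked for ${req.branch}: ${blockers.join("; ")}`);
  }
}
```

The comparison is `!== "passed"` rather than `=== "failed"` so that any status added later is blocked by default.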
Scaffold-drift placeholders need their own structural test.
A placeholder with a comment that says “a later task replaces this” will outlive the later task unless something fails CI while it’s still there. Multi-batch implementation handoffs are exactly the kind of context where this rots.
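A structural test for this can be a plain scan for a marker. The sketch below assumes a convention this project may not have: every scaffold placeholder carries the comment tag `SCAFFOLD-PLACEHOLDER`. The marker, `findPlaceholders`, and the in-memory source map are all hypothetical; a real test would read the files from disk.

```typescript
// Hypothetical marker convention: placeholders are tagged in a comment.
const PLACEHOLDER_MARKER = /SCAFFOLD-PLACEHOLDER/;

// Takes a map of file path to source text and returns "path:line" for each hit.
function findPlaceholders(sources: Record<string, string>): string[] {
  const hits: string[] = [];
  for (const path of Object.keys(sources)) {
    const lines = (sources[path] ?? "").split("\n");
    lines.forEach((line, index) => {
      if (PLACEHOLDER_MARKER.test(line)) hits.push(`${path}:${index + 1}`);
    });
  }
  return hits;
}

// In CI, a non-empty result fails the build and names the stale scaffold.
function assertNoPlaceholders(sources: Record<string, string>): void {
  const hits = findPlaceholders(sources);
  if (hits.length > 0) {
    throw new Error(`scaffold placeholders still present: ${hits.join(", ")}`);
  }
}
```

The check only works if the marker is added when the placeholder is written, so the convention belongs in the task template that creates scaffolds.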
The framework didn't fall apart. The pipeline didn't fail. What clarified itself is that the pipeline isn't the whole job — it's the inside half of a partnership. The outside half is one human, one real tenant, every time the integration surface area grows.