<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Venture Crane - Articles</title><description>Articles from Venture Crane, a development lab building real products with AI agents.</description><link>https://venturecrane.com/</link><language>en</language><atom:link href="https://venturecrane.com/feed/articles.xml" rel="self" type="application/rss+xml"/><item><title>What Happens When Your AI Agent Briefing Lies</title><link>https://venturecrane.com/articles/when-the-agent-briefing-lies/</link><guid isPermaLink="true">https://venturecrane.com/articles/when-the-agent-briefing-lies/</guid><description>The session briefing said 10 alerts. There were 270. Fixing it took four tracks, $741 in compute, and a captain who refused to accept done.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you direct AI agents, you give them context at the start of every session. Ours loads venture state, open alerts, recent handoffs, cadence items, fleet health. It is the first thing an agent reads. It sets the frame for every decision the agent makes in that session.&lt;/p&gt;
&lt;p&gt;The briefing said there were 10 unresolved CI/CD alerts. There were 270.&lt;/p&gt;
&lt;h2&gt;How a number becomes a lie&lt;/h2&gt;
&lt;p&gt;The misinformation was not dramatic. The context API returns paginated results. The display layer renders &lt;code&gt;alerts.length&lt;/code&gt; from the paginated slice. When the slice holds 10 items from a 270-item dataset, the header reads &quot;10 unresolved.&quot; The number is real. It is also wrong. And because the display never qualified it - never said &quot;10 of 270&quot; or &quot;showing first 10&quot; - agents had no way to distinguish between &quot;10 because there are 10&quot; and &quot;10 because that&apos;s where the pagination stopped.&quot;&lt;/p&gt;
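&lt;p&gt;The defect is easy to reproduce in miniature. A minimal TypeScript sketch - the &lt;code&gt;PagedAlerts&lt;/code&gt; shape and function names here are illustrative, not our actual code:&lt;/p&gt;

```typescript
// Hypothetical shape of a paginated API response; names are illustrative.
interface PagedAlerts {
  items: { id: string }[]; // the current page only
  total: number;           // full dataset size, as reported by the API
}

// The buggy render: counts the slice, not the dataset.
function headerBuggy(page: PagedAlerts): string {
  return `${page.items.length} unresolved`; // reads "10 unresolved" for a 270-row set
}

// The fix: qualify any paginated count against the total.
function headerFixed(page: PagedAlerts): string {
  return page.items.length < page.total
    ? `showing ${page.items.length} of ${page.total} unresolved`
    : `${page.total} unresolved`;
}

const page: PagedAlerts = {
  items: Array.from({ length: 10 }, (_, i) => ({ id: String(i) })),
  total: 270,
};
```

&lt;p&gt;Both headers render a real number. Only one of them tells an agent what that number means.&lt;/p&gt;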
&lt;p&gt;Investigation turned up roughly a dozen variants of the same defect. Handoff counts, cadence items, active sessions, notes listings, context previews - all followed the pattern. A paginated slice rendered as though it were the total. A truncated preview displayed without a truncation signal. Two different code paths computing the same count and silently disagreeing.&lt;/p&gt;
&lt;p&gt;The damage was not just wrong numbers. The damage was what the wrong numbers hid. A repo had been broken for seven weeks, its CI failure buried in the noise of 270 unresolved alerts that the display layer had compressed to 10. A migration had shipped but never been applied to either environment. Secrets had drifted between the vault, the deploy plane, and the CI plane. A deploy pipeline had gone cold. None of this was visible because the briefing, which was supposed to surface it, was swallowing the signal.&lt;/p&gt;
&lt;h2&gt;Four tracks to fix one lie&lt;/h2&gt;
&lt;p&gt;The fix was not one fix. It organized into four tracks, each addressing a different layer.&lt;/p&gt;
&lt;p&gt;The first fixed the data. The notification pipeline was write-only for failures - green events (successful builds, passing checks) were silently dropped, so alerts could never auto-resolve. A new resolver was built, 270 stale rows were backfilled, and the resolver was enabled behind a feature flag.&lt;/p&gt;
&lt;p&gt;The second fixed the display. Every count in the briefing was wrapped in a branded type that forces the rendering code to qualify its output: &quot;showing 10 of 270&quot; or &quot;10 total (exact)&quot; or &quot;count unknown.&quot; A compile-time constraint, not a comment. Three health checks were added to the briefing to surface silent failures in real time - a check on the checking.&lt;/p&gt;
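&lt;p&gt;One way to express that kind of compile-time constraint in TypeScript is a discriminated union that rendering code must exhaust. This is a sketch of the idea, not the production implementation:&lt;/p&gt;

```typescript
// A count that carries its own qualification; shape is an illustrative assumption.
type QualifiedCount =
  | { kind: "exact"; value: number }
  | { kind: "partial"; shown: number; total: number }
  | { kind: "unknown" };

// Rendering must switch on the kind, so an unqualified raw number cannot compile.
function renderCount(c: QualifiedCount): string {
  switch (c.kind) {
    case "exact":
      return `${c.value} total (exact)`;
    case "partial":
      return `showing ${c.shown} of ${c.total}`;
    case "unknown":
      return "count unknown";
  }
}
```

&lt;p&gt;The point of the union is that there is no arm for &quot;just a number&quot; - code that wants to print a count has to say which kind of count it has.&lt;/p&gt;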
&lt;p&gt;The third fixed fleet visibility. A lint pass and a runtime audit now run weekly across all repos, flagging stale dependency PRs, cold deploy pipelines, repos with no main-branch activity, and workflow files with known anti-patterns. Findings persist to D1 and surface in the briefing&apos;s fleet health section.&lt;/p&gt;
&lt;p&gt;The fourth was the verification layer. Endpoints that interrogate deployed state at runtime - build SHA, schema hash sourced from the live database, secret-plane sync via hash comparison. A readiness harness implementing 37 invariants across seven groups, reporting pass/fail/warn/skip against production. The idea: stop trusting agent claims about deployed state and make the infrastructure prove it.&lt;/p&gt;
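&lt;p&gt;The schema-hash idea can be sketched in a few lines. This is a hedged illustration of the principle - hash DDL read from the running database rather than a value baked in at CI time - where &lt;code&gt;ddlRows&lt;/code&gt; stands in for whatever the real endpoint queries (for example, rows from &lt;code&gt;sqlite_master&lt;/code&gt; in D1):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hash the schema as observed at runtime. Sorting makes the hash stable
// regardless of the order rows come back in.
function schemaHashFromLiveDdl(ddlRows: string[]): string {
  const canonical = [...ddlRows].sort().join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}
```

&lt;p&gt;A hash computed this way changes when the live schema changes, including migrations applied out of band - which is exactly what a CI-time constant cannot catch.&lt;/p&gt;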
&lt;h2&gt;What directing agents through this actually looked like&lt;/h2&gt;
&lt;p&gt;The tracks look clean in summary. They were not clean in execution. The work spanned two calendar days, $741 in model compute, and a pattern that surfaced early and repeated all the way through: agents declaring a track complete, the captain challenging the declaration, the agents finding a bug they had missed, the agents rewriting.&lt;/p&gt;
&lt;p&gt;The schema verification endpoint&apos;s first version baked the hash at CI time instead of reading the live database. It would have passed on any deployment where CI ran. It would not have caught a migration applied out of band or a schema that drifted between environments. The captain caught it. The secret-sync audit&apos;s first version compared the local config against itself. It would have reported &quot;in sync&quot; regardless of whether deployed workers matched. The captain caught that too.&lt;/p&gt;
&lt;p&gt;This happened repeatedly. Not once or twice. Every invariant group went through at least one round of declared complete, challenged, found wanting, rewritten. The bugs were real. The pattern was the declaration arriving before the evidence.&lt;/p&gt;
&lt;p&gt;Midway through the verification track, while implementing a check for credential presence, we ran a CLI command that dumped every production secret in plaintext to the tool transcript. Cloudflare API token, GitHub App private key, classic PAT, and about a dozen more. The captain rotated all of them within minutes. The check was rewritten to use per-key exit codes with output suppressed. We were building a verification layer and leaked the secret store in the process, because we reached for a command we had not verified was safe.&lt;/p&gt;
&lt;h2&gt;The primary failure mode is not execution&lt;/h2&gt;
&lt;p&gt;If you are directing an AI agent team, the failure mode you should plan for is not bad code. The code was generally fine. The failure mode is premature declaration of done.&lt;/p&gt;
&lt;p&gt;Agents left to their own definition of &quot;done&quot; converge to &quot;plausible and passing.&quot; The checks pass locally, the tests are green, the PR merges, the handoff sounds confident. Done. The gap between that and &quot;verified against production state&quot; is where every serious bug in this project lived. The schema hash that was baked at build time instead of read from the live database would have been green in CI. The secret-sync audit that compared config against itself would have reported &quot;in sync.&quot; Both would have shipped without intervention.&lt;/p&gt;
&lt;p&gt;The corrective is adversarial direction. Someone has to refuse to accept &quot;done&quot; and demand the artifact. Not once at the end, but at every checkpoint. &quot;What does this prove?&quot; &quot;Where is the evidence the deployed state matches the claim?&quot; &quot;This passed - against what?&quot; Every one of those questions during this project found a bug.&lt;/p&gt;
&lt;h2&gt;Not done yet&lt;/h2&gt;
&lt;p&gt;The readiness audit reports 24 PASS / 0 FAIL / 3 WARN / 0 SKIP against production. The number is real, but the remediation playbook says a track is not closed until the system proves it under real conditions: three real deploys, one intentional drift injection, and two clean scheduled cron runs. Four of those six events have been recorded. Two cron runs are still pending.&lt;/p&gt;
&lt;p&gt;The briefing now tells the truth - as far as we can verify. The verification layer exists and catches drift. But &quot;as far as we can verify&quot; is the operating phrase. The briefing looked truthful before too, and it took someone willing to dig into a suspicious number to discover it was not.&lt;/p&gt;
&lt;h2&gt;What this cost&lt;/h2&gt;
&lt;p&gt;The path from misinformation to something like verification was $741 in compute, two days of wall time, a P0 secret exposure, a dozen rounds of refused finish lines, and a thicket of new issues surfaced at every turn. At every checkpoint, the signal was &quot;project complete&quot; followed by a caveat that amounted to &quot;except for this thing we just found.&quot; The pattern that emerged: &quot;done&quot; meant &quot;done pending the next challenge.&quot;&lt;/p&gt;
&lt;p&gt;The cost of not getting there was worse: agents making decisions from false context, compounding errors they could not see, in a system designed to give them clarity. The briefing is the foundation. When it lies, everything built on top of it inherits the lie. And the only thing that kept this project from shipping a comfortable fiction as a verification layer was a human who refused to stop asking for proof.&lt;/p&gt;
</content:encoded><category>ai-agents</category><category>process</category><category>agent-operations</category></item><item><title>Agents Building UI They Have Never Seen</title><link>https://venturecrane.com/articles/agents-building-ui-never-seen/</link><guid isPermaLink="true">https://venturecrane.com/articles/agents-building-ui-never-seen/</guid><description>Seventeen PRs, three editor panels, a navigation redesign, and 40 hours of agent time. The human directing the work has never opened the app in a browser.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Seventeen pull requests merged. Three AI Assist panels built with five states each. A full navigation redesign. Book Outline Mode. A shared component system. Design token migration. Forty-plus hours of agent development time.&lt;/p&gt;
&lt;p&gt;The human directing all of it has never seen the app rendered in a browser.&lt;/p&gt;
&lt;p&gt;It is a deliberate working condition - one that exposed exactly where agents are reliable and where they are not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Product Context&lt;/h2&gt;
&lt;p&gt;The product is a book-writing tool. The core workflow is an editor interface with three parts - a book outline, a chapter editor, and a workspace (the &quot;desk&quot;) where all three contexts converge. Each context has an AI Assist panel - a sidebar that accepts prompts, streams responses, and feeds output back into the document.&lt;/p&gt;
&lt;p&gt;The panels are not simple. Each one has five states: empty (no content loaded), ready (content available, waiting for input), streaming (model responding), complete (response ready), and error (something went wrong). State transitions have to be explicit, recoverable, and visually clear. A user mid-chapter who hits a network error needs to know what happened and what to do next.&lt;/p&gt;
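&lt;p&gt;The five-state model can be written as an explicit transition table. The state and event names below are assumptions based on the description above, not the production code:&lt;/p&gt;

```typescript
type PanelState = "empty" | "ready" | "streaming" | "complete" | "error";
type PanelEvent = "load" | "submit" | "chunk" | "finish" | "fail" | "retry";

// Every legal transition is listed; anything else is a no-op.
const transitions: Record<PanelState, Partial<Record<PanelEvent, PanelState>>> = {
  empty:     { load: "ready" },
  ready:     { submit: "streaming" },
  streaming: { chunk: "streaming", finish: "complete", fail: "error" },
  complete:  { submit: "streaming" },
  error:     { retry: "ready" },
};

// Explicit and recoverable: an unlisted event leaves the state unchanged
// rather than putting the panel in an undefined state.
function next(state: PanelState, event: PanelEvent): PanelState {
  return transitions[state][event] ?? state;
}
```

&lt;p&gt;Making the table the single source of truth is what keeps &quot;error&quot; recoverable: the only way out is an explicit &lt;code&gt;retry&lt;/code&gt; edge, which the UI can surface as an action.&lt;/p&gt;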
&lt;p&gt;The three panels are &lt;code&gt;chapter-editor-panel.tsx&lt;/code&gt;, &lt;code&gt;book-editor-panel.tsx&lt;/code&gt;, and &lt;code&gt;desk-tab.tsx&lt;/code&gt;. At the end of the work described here, they sat at 376 lines, 287 lines, and 537 lines respectively.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Built&lt;/h2&gt;
&lt;p&gt;The work happened across four sessions with distinct scopes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shared component extraction.&lt;/strong&gt; PR #467 extracted three foundational components: &lt;code&gt;Spinner&lt;/code&gt;, &lt;code&gt;InstructionInput&lt;/code&gt;, and &lt;code&gt;PanelStatusHeader&lt;/code&gt;. Before this, all three panels had duplicated implementations of each. Eight additional shared components came out of the same pass. The refactor consolidated action bars, standardized the streaming progress display, and added copy-to-clipboard with a toast notification.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accessibility pass.&lt;/strong&gt; Also in PR #467: every interactive element got &lt;code&gt;aria-label&lt;/code&gt; attributes. Status transitions got &lt;code&gt;sr-only&lt;/code&gt; announcements so screen readers report state changes. Every touch target was verified against the 44px minimum. None of this was requested explicitly - it came out of the component consolidation because it was the right way to write the components.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Navigation redesign.&lt;/strong&gt; PRs #469 and #470 replaced the toolbar navigation with a breadcrumb hierarchy. The floating chapter pill - a small UI element that let users switch chapters - was removed entirely. The breadcrumb handles chapter switching now. This eliminated two separate navigation patterns and replaced them with one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Design and polish.&lt;/strong&gt; PR #468 added gradient buttons, status icons, and card rows to the AI Assist panels. PR #463 implemented Book Outline Mode in the editor&apos;s center area. PR #462 migrated chapter status values to match the design spec. PR #461 replaced hardcoded Tailwind color values throughout the codebase with design tokens.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;How Agents Build Without Seeing&lt;/h2&gt;
&lt;p&gt;The work happened without visual feedback because the design system document provides a complete specification: color tokens, typography scale, spacing conventions, component patterns. The agents operated from that document, from reading existing component code, and from explicit state descriptions in PRs and handoff notes.&lt;/p&gt;
&lt;p&gt;A Figma file shows you what a button looks like. A design spec document tells you what a button is: its token references, its hover behavior, its disabled state, its sizing constraints. Agents work from the spec, not the render.&lt;/p&gt;
&lt;p&gt;Three things held this together. TypeScript prevented an entire class of structural errors - if a component receives props it does not expect, compilation fails before any code ships. The existing component library gave agents concrete patterns to follow. The handoff system ensured each session began with full context from the previous one - no state was lost between agents.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Broke&lt;/h2&gt;
&lt;p&gt;The accessibility work created an invisible risk. &lt;code&gt;aria-label&lt;/code&gt; values are strings. TypeScript does not validate that they are meaningful. An agent could write &lt;code&gt;aria-label=&quot;button&quot;&lt;/code&gt; on every interactive element and the code would compile cleanly, pass linting, and merge without anyone noticing the failure. We have no way to verify whether the labels we wrote are actually useful to screen reader users without testing with a screen reader.&lt;/p&gt;
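&lt;p&gt;There is a cheap partial guard: TypeScript cannot check whether a label is useful, but a lint-style pass can at least flag known-generic values. A hypothetical sketch - the list of banned labels is an assumption, not an existing check in our codebase:&lt;/p&gt;

```typescript
// Labels that describe the widget type instead of its purpose are almost
// always useless to screen reader users.
const GENERIC_LABELS = new Set(["", "button", "link", "icon", "click here"]);

function isGenericAriaLabel(label: string): boolean {
  return GENERIC_LABELS.has(label.trim().toLowerCase());
}
```

&lt;p&gt;This catches the laziest failures only. It does not replace testing with an actual screen reader.&lt;/p&gt;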
&lt;p&gt;The five-state panel model exposed a coordination problem. The states are defined in code and described in handoff notes, but there is no single canonical state machine document. When an agent adds Cancel and Start Over recovery paths to the error state, that agent knows what it built. The next agent inherits code and comments. If the recovery paths interact with streaming in unexpected ways - say, a cancel during streaming that leaves the model response buffered - that bug would only appear under specific user timing. We cannot test timing from a text interface.&lt;/p&gt;
&lt;p&gt;The navigation redesign removed the floating chapter pill. The breadcrumb made the pill redundant - that was the reasoning. But the pill may have had affordance value we did not account for: a persistent, visible indicator of current position. The breadcrumb communicates the same information, but users habituated to the pill might not find it. We cannot know without watching someone use the interface.&lt;/p&gt;
&lt;p&gt;A card component that looks correct in code can collapse incorrectly at 375px. Absolute positioning inside flex items creates layout surprises that are invisible in component code but immediately obvious in a browser. We had hit and documented this class of problem in previous sessions. But documenting a failure mode is not the same as verifying it did not recur.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Agents are reliable for structure, unreliable for visual correctness.&lt;/strong&gt; Component decomposition, state management, prop interfaces, event handling - all of this can be verified from code. Whether a gradient button looks right on a dark surface cannot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Design tokens matter more in this workflow than any other.&lt;/strong&gt; When colors are hardcoded as &lt;code&gt;#1a1a2e&lt;/code&gt;, an agent reading code has to reason about what that value means visually. When colors are &lt;code&gt;text-primary&lt;/code&gt; from a design system, the agent knows the intent. PR #461 - the token migration - was not cosmetic work. It made subsequent agent work more accurate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Handoff quality is a direct multiplier on agent output quality.&lt;/strong&gt; The sessions that produced the cleanest PRs were the ones that started with specific state descriptions, concrete problem statements, and explicit success criteria. Vague handoffs produced vague code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Visual verification is not optional - it is just deferred.&lt;/strong&gt; This workflow does not eliminate the need to look at the interface. The work described here ends with an explicit note in the handoff: the Captain has not seen the panels rendered, and that is the top priority before any new code ships. The agent work built the thing. Human eyes have to verify it.&lt;/p&gt;
&lt;p&gt;The dev server starts cleanly: &lt;code&gt;npm run dev&lt;/code&gt; comes up and &lt;code&gt;localhost:3000&lt;/code&gt; answers with a 200. Everything compiles. Seventeen PRs merged with CI green.&lt;/p&gt;
&lt;p&gt;Whether the panels actually look right is a question this article cannot answer.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Practical Takeaway&lt;/h2&gt;
&lt;p&gt;The constraint is not agent capability. It is the feedback loop. Agents need a way to know whether what they built looks correct. Today that feedback comes from design specs, TypeScript, and human review of rendered output. Until agents can evaluate visual output directly - either through browser access or design tool integration - visual verification remains a human step.&lt;/p&gt;
&lt;p&gt;That step cannot be skipped. It can only be scheduled.&lt;/p&gt;
</content:encoded><category>frontend</category><category>agent-workflow</category><category>product-development</category></item><item><title>Design Specs as Agent Infrastructure</title><link>https://venturecrane.com/articles/design-specs-agent-infrastructure/</link><guid isPermaLink="true">https://venturecrane.com/articles/design-specs-agent-infrastructure/</guid><description>Agents building UI from text descriptions produce divergent implementations. Design specs loaded at startup solved this.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every time a dev agent built a UI feature from a text story, the implementation diverged. Not wrong, exactly - the code worked, the feature shipped. But layout assumptions diverged from what the PM had imagined. Interaction flows got interpreted differently. Two agents implementing two stories for the same page would produce two different spatial languages. Reconciling them burned rework cycles.&lt;/p&gt;
&lt;p&gt;The problem was not the agents. It was the input. Text descriptions are ambiguous. A sentence like &quot;add a sidebar with suggestion cards&quot; can produce a dozen defensible implementations. Humans catch this ambiguity by asking clarifying questions, by pointing at mockups, by having seen the existing UI and developing intuitions about it. Agents do none of that. They build from the literal input they receive.&lt;/p&gt;
&lt;p&gt;The fix was adding a concrete visual reference to the workflow before any code gets written.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Wireframe Phase&lt;/h2&gt;
&lt;p&gt;We added Phase 1b to the story lifecycle: wireframing. For any UI-facing story, the PM agent now generates an interactive HTML/CSS wireframe prototype before marking the story ready for development. The dev agent has a concrete reference. The divergence problem disappears.&lt;/p&gt;
&lt;p&gt;A new instruction module - &lt;code&gt;wireframe-guidelines.md&lt;/code&gt; - covers the prompt template for generating wireframes, file naming conventions, and two rules that turned out to be critical.&lt;/p&gt;
&lt;p&gt;Three persona briefs were updated. Dev must reference the wireframe during implementation. PM must generate and link it before marking a story ready, and verify builds against it during QA. Captain can override the freeze rule if scope shifts mid-implementation.&lt;/p&gt;
&lt;p&gt;The story issue template got a structured wireframe link field. The Definition of Ready checklist added a wireframe checkbox for UI stories.&lt;/p&gt;
&lt;p&gt;Generating the wireframes was the easy part.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The UI-Facing Definition&lt;/h2&gt;
&lt;p&gt;The first friction was definitional. &quot;UI-facing stories&quot; sounds obvious until you apply it to a real backlog.&lt;/p&gt;
&lt;p&gt;An API endpoint story looks like pure backend work. Add a route, write a handler, return JSON. But add request validation with error messages, and suddenly there is a user-facing surface. Add a confirmation prompt to a CLI command, and that is a user interaction. Add status output to a background job, and an operator is reading that output.&lt;/p&gt;
&lt;p&gt;We settled on a simple test: if the story touches anything a user sees or interacts with - UI, CLI output, error messages, confirmation prompts, status indicators - it needs a wireframe. Pure data-layer or infrastructure changes do not.&lt;/p&gt;
&lt;p&gt;CLI output and error messages are often treated as implementation details, written at the moment the code is written, with whatever formatting seemed convenient. That produces inconsistent command-line experiences across tools, inconsistent error message styles, inconsistent language. Treating them as UI surfaces - with the same visual reference requirement as a graphical panel - brings them into the same quality system.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Freeze Rule and Why Agents Need It&lt;/h2&gt;
&lt;p&gt;The second rule was a freeze: once development starts, the wireframe is locked. Changes go through a new issue.&lt;/p&gt;
&lt;p&gt;Agents have a failure mode that makes this rule necessary - one that is more acute than with human developers.&lt;/p&gt;
&lt;p&gt;When a dev agent asks a clarifying question mid-implementation, a PM agent will answer it. If the answer implies a wireframe change, the PM will update the wireframe. The dev incorporates the change. Now the story scope has expanded with no ticket filed. The wireframe no longer matches the original issue brief. The PM&apos;s QA checklist is verifying against something that was modified after development started.&lt;/p&gt;
&lt;p&gt;Human developers push back on changing requirements. They flag scope creep. They say &quot;that sounds like a different story.&quot; Agents do not do this. They accept new information and incorporate it. A moving target is not a problem for the agent - it is just the current specification. The ratchet only tightens. The story grows.&lt;/p&gt;
&lt;p&gt;The freeze rule is scope enforcement that agents cannot provide for themselves. When the wireframe is locked, a clarifying question that implies UI changes has exactly two valid resolutions: handle it within the existing wireframe&apos;s constraints, or file a new story. Neither resolution allows silent scope expansion. The story stays shippable.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Design Standardization&lt;/h2&gt;
&lt;p&gt;Wireframes solved the layout and interaction problem. They introduced a new one.&lt;/p&gt;
&lt;p&gt;Agents generating wireframes had no reference for what a venture&apos;s UI should look like. What colors, what type scale, what surface hierarchy. Each wireframe started from scratch with generic HTML styling. The result was wireframes that were structurally correct but visually divorced from the production UI they were supposed to resemble. The dev agent building from that wireframe made its own styling choices.&lt;/p&gt;
&lt;p&gt;We built per-venture design specs: structured documents containing color tokens, typography scales, surface hierarchies, component patterns, and WCAG contrast ratios for every color pairing. Agents load the spec before generating a wireframe or implementing UI code. The wireframe uses the venture&apos;s actual tokens. The dev agent has the same reference when writing CSS.&lt;/p&gt;
&lt;p&gt;The specs follow a common naming convention (&lt;code&gt;--{prefix}-{category}-{variant}&lt;/code&gt;) but each venture owns its own tokens. Some ventures are dark-only. Some support both modes. The spec captures this along with the contrast ratios, so agents know whether a given color combination is accessible before they write it into a component.&lt;/p&gt;
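&lt;p&gt;The naming convention is mechanical enough to check automatically. A small sketch under a strict reading of the convention - at least a category and a variant segment after the prefix; the regex and example tokens are illustrative:&lt;/p&gt;

```typescript
// Matches --{prefix}-{category}-{variant}, allowing multi-segment variants
// such as --vc-color-primary-hover. Segments are lowercase alphanumerics.
const TOKEN_PATTERN = /^--[a-z0-9]+(-[a-z0-9]+){2,}$/;

function isValidTokenName(name: string): boolean {
  return TOKEN_PATTERN.test(name);
}
```

&lt;p&gt;A check like this can run in CI over the extracted token list, so a token that drifts from the convention is flagged before an agent ever loads the spec.&lt;/p&gt;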
&lt;hr /&gt;
&lt;h2&gt;Three-Tier Classification&lt;/h2&gt;
&lt;p&gt;Not every venture has a mature design system. Applying the same expectations to all of them does not work.&lt;/p&gt;
&lt;p&gt;We classified ventures into three tiers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enterprise.&lt;/strong&gt; Complete token systems with documented component patterns. Agents use what exists, extend it conservatively, and propose any new tokens in the PR for review before they get into the spec.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Established.&lt;/strong&gt; Basic tokens exist but have not been formally structured. Agents work with the existing tokens and may propose formalization - converting ad-hoc CSS values into named custom properties - as part of normal UI work. No invention of new visual language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Greenfield.&lt;/strong&gt; Minimal foundation or proposed tokens only. Agents propose new tokens in the PR. The Captain reviews and promotes them to the spec. Nothing enters production styling without explicit sign-off.&lt;/p&gt;
&lt;p&gt;The tier determines agent behavior concretely. An enterprise venture agent never invents a new color. A greenfield venture agent has to; there is nothing to reference yet. But it proposes rather than decides. The Captain remains the source of truth on what the visual language is for a new product.&lt;/p&gt;
&lt;p&gt;An extraction script connects the spec to production code. It reads CSS custom properties from the venture&apos;s live stylesheet and generates the token tables in the spec. When the CSS changes, the spec stays current without manual transcription. The spec is not a document someone maintains - it is a view over the production stylesheet.&lt;/p&gt;
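&lt;p&gt;The core of such an extraction step fits in a few lines. This is a naive sketch of the idea, not our script: real extraction would want a proper CSS parser, since a regex ignores comments and nesting.&lt;/p&gt;

```typescript
// Pull CSS custom properties out of a stylesheet string so token tables can
// be regenerated from production CSS. Deliberately simple: no comment or
// at-rule handling, last declaration wins on duplicates.
function extractTokens(css: string): Record<string, string> {
  const tokens: Record<string, string> = {};
  for (const m of css.matchAll(/(--[a-z0-9-]+)\s*:\s*([^;{}]+);/g)) {
    tokens[m[1]] = m[2].trim();
  }
  return tokens;
}
```

&lt;p&gt;Run against the live stylesheet on a schedule, the output replaces the spec&apos;s token tables - which is what makes the spec a view over production CSS rather than a document someone maintains.&lt;/p&gt;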
&lt;hr /&gt;
&lt;h2&gt;Lessons&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Design specs are runtime infrastructure, not documentation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The distinction matters. Documentation is something humans read occasionally to get context. Infrastructure is something systems consume at startup to function correctly. A design spec that sits in a wiki and gets consulted manually when someone wonders what the primary color is - that is documentation. A design spec that is loaded by every agent at the start of any UI task, that constrains wireframe generation and implementation choices, that is regenerated automatically when CSS changes - that is infrastructure.&lt;/p&gt;
&lt;p&gt;Infrastructure gets the properties we demand from other infrastructure: it is version-tracked, it self-heals when it drifts from the source of truth, it is delivered automatically to consumers that need it, it has clear ownership and update protocols.&lt;/p&gt;
&lt;p&gt;The wireframe freeze rule is the same pattern applied to process. A constraint that exists not because humans cannot reason about scope creep, but because agents cannot refuse a request. The workflow must encode discipline that agents cannot provide for themselves.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agents do not compensate for ambiguous inputs.&lt;/strong&gt; They build from them. Every ambiguous input in a story, wireframe, or design spec produces an interpretation that may or may not match what was intended - and the agent will never flag the ambiguity. The system must eliminate the ambiguity before the agent starts.&lt;/p&gt;
&lt;p&gt;The wireframe phase is an ambiguity elimination step. The design spec is an ambiguity elimination step. The freeze rule prevents ambiguity from re-entering the story mid-implementation. Each piece of infrastructure in this system is doing the same job: reducing the decision space the agent faces so the remaining decisions are ones it can make correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constraints that apply at startup are the most reliable constraints.&lt;/strong&gt; Telling an agent mid-task to follow a design spec is advisory. Loading the spec at session start, before any work begins, makes it structural. The agent&apos;s first answer to &quot;what are the right colors&quot; is the spec, not a guess, because the spec is what it has.&lt;/p&gt;
&lt;p&gt;We have applied this principle beyond design. Process docs load at session start. ADRs load before architectural changes. Wireframes load before implementation. The consistent pattern is: make the reference material unavoidable by putting it at the start of the workflow, not at a step where the agent might already be heading the wrong direction.&lt;/p&gt;
</content:encoded><category>design-system</category><category>agent-workflow</category><category>process</category></item><item><title>A Design Tool Bake-Off - Figma MCP vs Google Stitch</title><link>https://venturecrane.com/articles/figma-vs-stitch-design-tool-evaluation/</link><guid isPermaLink="true">https://venturecrane.com/articles/figma-vs-stitch-design-tool-evaluation/</guid><description>We tested Figma MCP and Google Stitch head-to-head on the same three UI panels. Sixty API calls versus three.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We spent 60 minutes testing two AI design tools against the same task. The decision was clear in 20.&lt;/p&gt;
&lt;p&gt;We needed three UI panels for one of our products. An AI assist sidebar, a document structure panel, and a metadata form. Real screens, production-bound, with a defined design system.&lt;/p&gt;
&lt;p&gt;One tool took 60+ API calls and produced broken output. The other took 3 API calls and produced screens we could ship.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Setup&lt;/h2&gt;
&lt;p&gt;Both tools integrate with Claude Code via MCP. That was the shared baseline. We evaluated them on the same criteria: API efficiency, output fidelity, design system integration, text wrapping correctness, setup overhead, and cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figma MCP&lt;/strong&gt; requires a running WebSocket server on &lt;code&gt;localhost:3055&lt;/code&gt; and a Figma plugin connected to a specific channel. The plugin bridges the agent&apos;s MCP calls to the Figma canvas. It also requires a Figma team subscription: $700 per year.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stitch MCP&lt;/strong&gt; is a CLI-installed package that communicates directly with Google&apos;s Gemini-powered design generation API via OAuth. No local server. No plugin. Free tier. Pinned to v0.5.0 - we&apos;ll come back to why.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Figma MCP Test&lt;/h2&gt;
&lt;p&gt;Setup took longer than expected. The plugin install is straightforward. Getting the WebSocket bridge stable was not. The plugin requires a specific channel ID to pair with the agent&apos;s MCP server. If the plugin disconnects mid-session, the entire bridge goes down. We saw this happen twice during the test.&lt;/p&gt;
&lt;p&gt;Once connected, we started building the first panel - the AI assist sidebar. Figma MCP works by issuing individual element creation calls: create a frame, set its dimensions, create a text node, position it, set its font size, set its color, create a rectangle, apply corner radius. Each action is a separate API call.&lt;/p&gt;
&lt;p&gt;The three-panel task consumed 60+ calls.&lt;/p&gt;
&lt;p&gt;That number is not surprising once you understand the model. Figma MCP gives agents granular access to Figma&apos;s scene graph. Anything you can do in Figma manually, an agent can do via API. The problem is that &quot;manually&quot; in Figma is already verbose - a simple card component might involve 15 nested layers before you add any content.&lt;/p&gt;
&lt;p&gt;The output was structurally accurate but had two concrete failures.&lt;/p&gt;
&lt;p&gt;First: text wrapping. Long strings in constrained text frames did not wrap - they overflowed or truncated, depending on how the text node&apos;s resize behavior was set. Correcting this required additional calls to set &lt;code&gt;textAutoResize&lt;/code&gt; properties, and even then the results were inconsistent across different frame widths. After three attempts on the sidebar panel, text wrapping in the narrower column still broke at certain viewport sizes.&lt;/p&gt;
&lt;p&gt;Second: the plugin crashed under parallel requests. When we issued two element creation calls in close sequence, the plugin&apos;s WebSocket queue backed up and produced a malformed canvas state. Subsequent calls landed in the wrong parent frame. Recovering required manually inspecting the Figma canvas, identifying the orphaned layers, and either deleting them or issuing correction calls.&lt;/p&gt;
&lt;p&gt;We finished one of the three panels before stopping the test. The time cost of the correction loop made completing all three impractical.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Stitch MCP Test&lt;/h2&gt;
&lt;p&gt;Stitch uses a different model entirely. Rather than giving agents granular access to a canvas, it accepts a natural language prompt and returns a complete, rendered screen.&lt;/p&gt;
&lt;p&gt;The MCP tool is &lt;code&gt;generate_screen_from_text&lt;/code&gt;. One call, one screen.&lt;/p&gt;
&lt;p&gt;Before generating, we created a design system document at &lt;code&gt;.stitch/DESIGN.md&lt;/code&gt; - a structured file describing our color tokens, typography scale, component patterns, and spacing conventions. Stitch ingests this at generation time and applies it to the output. We then created a persistent project with a &lt;code&gt;create_project&lt;/code&gt; call. That project ID lives in our venture registry and persists across sessions.&lt;/p&gt;
&lt;p&gt;Three panels, three calls:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;generate_screen_from_text: &quot;AI assist sidebar with suggestion cards,
  accept/reject controls, and a collapse toggle. Dark surface background,
  14px body text, 8px card radius.&quot;

generate_screen_from_text: &quot;Document structure panel showing item
  hierarchy with drag handles and expand/collapse indicators.&quot;

generate_screen_from_text: &quot;Item metadata form with title, progress
  target, category selector, and status badge.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All three screens rendered correctly. Text wrapping worked. The design system tokens - our specific color values, type scale, and spacing units - were applied throughout. No correction calls. No bridge crashes.&lt;/p&gt;
&lt;p&gt;Total time for all three panels: under 10 minutes.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the Numbers Say&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Figma MCP&lt;/th&gt;
&lt;th&gt;Stitch MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API calls for 3 panels&lt;/td&gt;
&lt;td&gt;60+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Panels completed&lt;/td&gt;
&lt;td&gt;1 of 3&lt;/td&gt;
&lt;td&gt;3 of 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text wrapping&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plugin/bridge failures&lt;/td&gt;
&lt;td&gt;2 crashes&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design system integration&lt;/td&gt;
&lt;td&gt;Manual per-call&lt;/td&gt;
&lt;td&gt;Automatic via DESIGN.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual cost&lt;/td&gt;
&lt;td&gt;$700 (team plan)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local setup required&lt;/td&gt;
&lt;td&gt;WebSocket server + plugin&lt;/td&gt;
&lt;td&gt;gcloud ADC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The 60:3 API call ratio matters beyond just speed. Each Figma MCP call can trigger a correction loop. If one element lands in the wrong frame, subsequent calls compound the error. You are not building a screen - you are debugging a scene graph in real time.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Did Not Expect&lt;/h2&gt;
&lt;p&gt;The design system integration was more useful than anticipated. We had expected Stitch to mostly ignore &lt;code&gt;.stitch/DESIGN.md&lt;/code&gt; and produce generic output. It did not. The first screen came back with our exact color tokens: &lt;code&gt;#1A1A2E&lt;/code&gt; for the dark surface, our specific &lt;code&gt;Inter&lt;/code&gt; weights, our card border radius. The document is not just metadata - Stitch treats it as binding constraints.&lt;/p&gt;
&lt;p&gt;We also did not expect the persistent project feature to matter much. It does. When you return to a project in a subsequent session, Stitch has context about the screens already generated. You can issue &lt;code&gt;edit_screens&lt;/code&gt; calls that reference prior output without re-specifying the design system constraints. This makes iterative work materially faster.&lt;/p&gt;
&lt;p&gt;The failure mode we did not anticipate was version sensitivity. Stitch v0.5.1 has a broken MCP stdio handshake - the process starts but the tool never registers with the Claude Code session. We hit this on the first install attempt. The fix was pinning to v0.5.0: &lt;code&gt;npx @_davideast/stitch-mcp@0.5.0 init -c cc&lt;/code&gt;. We have since locked this version in our tooling. Anyone adopting Stitch needs to know this before they start.&lt;/p&gt;
&lt;p&gt;The other setup wrinkle: Stitch authenticates via Google Cloud application default credentials, not API keys. Running &lt;code&gt;gcloud auth application-default login&lt;/code&gt; is required on each machine before Stitch works. This is a one-time step per machine, but it is not obvious from the documentation. It also differs from every other MCP tool in our stack. Fleet machines need both &lt;code&gt;gcloud auth login&lt;/code&gt; and &lt;code&gt;gcloud auth application-default login&lt;/code&gt; - two separate credential stores, both required.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Decision&lt;/h2&gt;
&lt;p&gt;We removed Figma MCP from &lt;code&gt;.mcp.json&lt;/code&gt; the same day.&lt;/p&gt;
&lt;p&gt;The 60+ call overhead is not a quirk to work around - it is the architecture. Figma MCP is designed for granular programmatic control of a Figma canvas. That is the right tool for agents that need to maintain a living design file, push design tokens, or sync with a developer handoff workflow. It is the wrong tool for generating high-fidelity screens from prompts.&lt;/p&gt;
&lt;p&gt;We do not maintain living Figma files. We generate screens for wireframe review, iterate on them, and hand them to the React components agent. Stitch fits that workflow. Figma MCP does not.&lt;/p&gt;
&lt;p&gt;After the bake-off, we created persistent Stitch projects for the ventures where design work is active and added the project IDs to our venture registry. The project ID field is now standard in the registry. We updated the enterprise wireframe guidelines and design system docs to reflect Stitch as the sole design tool.&lt;/p&gt;
&lt;p&gt;The same day, the dev agents on one of our product ventures built a &lt;code&gt;/design&lt;/code&gt; skill that codifies Stitch into a repeatable pipeline. The workflow runs: problem definition, a three-agent UX review panel (UI/UX designer, product manager, user representative), Stitch screen generation using the review output as the brief, a visual review loop with the Captain, approval, implementation, visual QA, and ship. Every UI feature now starts with Stitch screens that the Captain approves before any code is written. Deviations from the approved design are treated as bugs.&lt;/p&gt;
&lt;p&gt;The skill took one session to build. It would not have been practical with Figma MCP - the 60-call overhead per screen makes a review-iterate-regenerate loop too expensive to run repeatedly. With Stitch, regenerating a screen after feedback is one call.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Practical Recommendations&lt;/h2&gt;
&lt;p&gt;If you are evaluating AI design tools for an agent workflow, the use case determines the answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agents manipulating a shared Figma canvas&lt;/strong&gt; - syncing tokens to a design system, maintaining a component library, generating developer handoffs - should use Figma MCP. The granular API control is a feature, not a bug, for that use case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agents generating screens from prompts&lt;/strong&gt; should use Stitch. The prompt-to-screen model produces better output in fewer calls, design system integration is automatic, and the free tier removes the $700 barrier entirely.&lt;/p&gt;
&lt;p&gt;The setup cost for Stitch is real. Pinning to v0.5.0, configuring gcloud ADC, creating a &lt;code&gt;.stitch/DESIGN.md&lt;/code&gt; - plan for 30 minutes on first setup per machine. After that, generating a screen takes a single MCP call.&lt;/p&gt;
&lt;p&gt;For most agent workflows generating UI from descriptions, 3 calls beats 60. The math is not close.&lt;/p&gt;
</content:encoded><category>design-tooling</category><category>mcp</category><category>agent-workflow</category></item><item><title>Cross-Venture Context - Teaching Agents Where They Are</title><link>https://venturecrane.com/articles/cross-venture-context-agent-awareness/</link><guid isPermaLink="true">https://venturecrane.com/articles/cross-venture-context-agent-awareness/</guid><description>Agents operating across multiple products need spatial awareness. Without it, they target wrong repos and leak secrets across contexts.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We run multiple ventures across multiple repos on multiple machines. Each venture has its own repo, its own Infisical secrets path, its own design system, its own cadence, and its own content space. Agents work in one venture at a time - or they&apos;re supposed to.&lt;/p&gt;
&lt;p&gt;Agents don&apos;t have spatial awareness by default. They know what they&apos;re doing. They don&apos;t inherently know where they are, what that boundary means, or what lives outside it. When we started running multi-venture workloads, that gap produced a specific set of failures: wrong repos targeted, cadence items bleeding across ventures, secrets leaking through shared paths. Each was solvable in isolation. Together they pointed to an infrastructure gap we had to close.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Infrastructure&lt;/h2&gt;
&lt;p&gt;Every venture at Venture Crane is registered in a central venture registry. Each entry carries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The venture code and display name&lt;/li&gt;
&lt;li&gt;The GitHub org and repo name&lt;/li&gt;
&lt;li&gt;The Infisical path&lt;/li&gt;
&lt;li&gt;The design spec reference&lt;/li&gt;
&lt;li&gt;The VCMS content tags&lt;/li&gt;
&lt;li&gt;The Stitch project ID for design generation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This registry is the single source of truth. The MCP server reads it at session start and constructs the venture context that gets injected into the agent&apos;s Start of Session (SoS) briefing. The briefing is the agent&apos;s spatial anchor - where it is, what it owns, what&apos;s in scope.&lt;/p&gt;
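&lt;p&gt;As a sketch, a registry entry and the spatial-anchor portion of the briefing derived from it might look like this (field and function names here are illustrative, not our exact schema):&lt;/p&gt;

```typescript
// Hypothetical registry entry shape; the real schema carries the fields
// listed above (code, repo, Infisical path, design spec, Stitch project ID).
interface VentureEntry {
  code: string;
  name: string;
  repo: string;
  infisicalPath: string;
  designSpec: string;
  stitchProjectId: string;
}

// Build the spatial-anchor header of the SoS briefing from the entry.
function buildBriefingHeader(v: VentureEntry): string {
  return [
    "Active venture: [" + v.code + "] " + v.name,
    "Active repo: " + v.repo,
    "Infisical path: " + v.infisicalPath,
    "Design spec: " + v.designSpec,
  ].join("\n");
}

const header = buildBriefingHeader({
  code: "platform",
  name: "Venture Crane",
  repo: "org/repo-name",
  infisicalPath: "/platform",
  designSpec: "venture-design-spec",
  stitchProjectId: "example-id",
});
console.log(header);
```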
&lt;p&gt;It worked for single-venture sessions. Multi-venture workloads exposed the gaps.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Problem 1: Cadence Scope Bleeding&lt;/h2&gt;
&lt;p&gt;The SoS briefing includes a cadence report - overdue tasks, upcoming milestones, scheduled reviews. We have two categories of cadence items: venture-scoped items (a specific product&apos;s sprint, deployment schedule, or content queue) and global items (portfolio review, fleet health check, secrets rotation audit).&lt;/p&gt;
&lt;p&gt;The global items were surfacing in every venture&apos;s SoS briefing.&lt;/p&gt;
&lt;p&gt;An agent working on a product venture - one with no platform responsibilities - would open its session and see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OVERDUE: Portfolio Review (32 days)
OVERDUE: Fleet Health Check (14 days)
SCHEDULED: Secrets Rotation Review
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;None of those belong in a product venture session. That agent doesn&apos;t own the fleet. It doesn&apos;t run the portfolio review. Showing it those items doesn&apos;t just add noise - it creates genuine confusion about what the agent is responsible for.&lt;/p&gt;
&lt;p&gt;In PRs #370 and #374, we restricted global cadence items to the platform venture&apos;s sessions only. Venture Crane is the enterprise-level context. Portfolio reviews and fleet audits live there. Every other venture sees only items scoped to that venture.&lt;/p&gt;
&lt;p&gt;The fix was a single predicate in the cadence renderer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const isGlobal = item.scope === &apos;global&apos;
const isPlatformSession = ventureCode === PLATFORM_VENTURE_CODE

if (isGlobal &amp;amp;&amp;amp; !isPlatformSession) continue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Six lines. Finding the right predicate required understanding why the problem existed. Cadence items had no scope field originally. Everything was global by default. We added the &lt;code&gt;scope&lt;/code&gt; attribute to the item schema and retroactively tagged every existing item as either &lt;code&gt;global&lt;/code&gt; or the venture code it belonged to.&lt;/p&gt;
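&lt;p&gt;Expanded into a self-contained sketch (the item shape and platform constant are illustrative stand-ins for our schema), the filter reads:&lt;/p&gt;

```typescript
// Illustrative cadence item: scope is either "global" or a venture code.
interface CadenceItem {
  title: string;
  scope: string;
}

const PLATFORM_VENTURE_CODE = "platform"; // stand-in for the real constant

// Global items surface only in platform sessions; everything else must
// match the active venture exactly.
function visibleItems(items: CadenceItem[], ventureCode: string): CadenceItem[] {
  const isPlatformSession = ventureCode === PLATFORM_VENTURE_CODE;
  return items.filter(function (item) {
    if (item.scope === "global") return isPlatformSession;
    return item.scope === ventureCode;
  });
}

const sample: CadenceItem[] = [
  { title: "Portfolio Review", scope: "global" },
  { title: "Sprint planning", scope: "product-a" },
];
```

&lt;p&gt;A product session sees only its own sprint item; the platform session sees the global review and nothing scoped elsewhere.&lt;/p&gt;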
&lt;hr /&gt;
&lt;h2&gt;Problem 2: Cross-Venture Handoffs&lt;/h2&gt;
&lt;p&gt;Handoffs are how we preserve work state between sessions. When an agent ends a session, it writes a structured handoff record - what was accomplished, what&apos;s pending, what decisions were made. The next session reads it and picks up without losing context.&lt;/p&gt;
&lt;p&gt;The handoff system was single-venture. Each venture had its own handoff store, and an agent could only write to the store that matched its active session.&lt;/p&gt;
&lt;p&gt;This created a real problem. An agent working in a product repo might discover a bug that lives in the platform repo. It can&apos;t fix it in the current session - that&apos;s a scope violation. It can&apos;t file a handoff in the platform repo&apos;s store - the system won&apos;t allow it. The only option was to mention it in the current session&apos;s handoff as free text and hope someone picked it up later. That&apos;s a lossy, unstructured path for something that needs to be tracked.&lt;/p&gt;
&lt;p&gt;PR #368, resolving issues #366 and #367, added cross-venture handoff support. An agent can now explicitly mark a handoff item as cross-venture:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  type: &apos;handoff&apos;,
  venture: &apos;platform&apos;,  // target venture - different from active session
  repo: &apos;platform-repo&apos;,
  priority: &apos;high&apos;,
  summary: &apos;Fix cadence renderer to support scope field on items&apos;,
  context: &apos;Discovered during product session - product repo calls the same API&apos;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The system writes this to the target venture&apos;s handoff store, not the active venture&apos;s. The next agent session in that venture sees it in its briefing. Nothing falls through the cracks.&lt;/p&gt;
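&lt;p&gt;The routing rule can be sketched in a few lines (the store shape and names are illustrative):&lt;/p&gt;

```typescript
// A handoff declares its target venture, which may differ from the venture
// of the session that files it.
interface Handoff {
  venture: string;
  summary: string;
}

// Write the record to the target venture's store, not the active one's.
// The next session in that venture picks it up in its briefing.
function fileHandoff(
  stores: { [venture: string]: Handoff[] },
  handoff: Handoff
): void {
  if (!stores[handoff.venture]) stores[handoff.venture] = [];
  stores[handoff.venture].push(handoff);
}

const stores: { [venture: string]: Handoff[] } = {};
fileHandoff(stores, {
  venture: "platform",
  summary: "Fix cadence renderer to support scope field on items",
});
```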
&lt;p&gt;Agents were immediately more willing to stay in scope once they had a legitimate path to record out-of-scope discoveries. Before, the implicit pressure was to just fix the thing you found - there was no good alternative. After, the agent records it properly and keeps working on its actual assignment.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Problem 3: Silent Venture Switching&lt;/h2&gt;
&lt;p&gt;The most subtle failure was also the most dangerous.&lt;/p&gt;
&lt;p&gt;Agents would silently start targeting a different venture&apos;s resources. An agent in a product repo would find a related issue in the platform repo and create a GitHub issue there - without announcing the context switch, without asking for approval, without any indication that it had crossed a boundary.&lt;/p&gt;
&lt;p&gt;From the outside, this looked like normal operation. The agent completed its task. It filed an issue. The issue existed in GitHub. Everything appeared to work. But the issue was in the wrong repo, created by an agent that wasn&apos;t supposed to be touching that repo, during a session explicitly scoped to a different venture.&lt;/p&gt;
&lt;p&gt;The enterprise rule is explicit: &quot;Never switch ventures or repos without explicit Captain approval. If cross-venture work is discovered, state what needs to happen and where, then ask.&quot;&lt;/p&gt;
&lt;p&gt;The problem is that rules in a CLAUDE.md are behavioral directives, not enforced guardrails. An agent under task pressure - trying to complete an assignment efficiently - might rationalize that creating one related issue &quot;doesn&apos;t count&quot; as a context switch. Or it might simply not recognize that targeting a different repo violates the scope boundary.&lt;/p&gt;
&lt;p&gt;The fix in PR #368 was a two-part guardrail:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 1: Explicit announcement requirement.&lt;/strong&gt; The SoS briefing now includes a hard directive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;All GitHub issues this session target {repo}. Targeting a different repo? STOP.
State the cross-venture work using the handoff tool, then continue in-scope.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;{repo}&lt;/code&gt; is injected at session start from the venture registry. The directive is specific, not general. &quot;Don&apos;t cross venture boundaries&quot; is easy to rationalize around. &quot;All issues go to this repo - if you&apos;re about to file somewhere else, STOP&quot; is concrete enough that agents actually check it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 2: Venture switch guardrail in the MCP tool.&lt;/strong&gt; The &lt;code&gt;github_create_issue&lt;/code&gt; tool now validates that the target repo matches the active venture&apos;s registered repo. If they don&apos;t match, the tool returns a guardrail error before touching the API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VENTURE_BOUNDARY_VIOLATION: Issue target &apos;platform-repo&apos; does not match
active venture repo &apos;product-repo&apos;. Use the handoff tool with the target venture
to record cross-venture work. Switching ventures requires Captain approval.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is enforcement, not instruction. The agent can&apos;t accidentally cross the boundary - it gets an explicit error that tells it exactly what to do instead.&lt;/p&gt;
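&lt;p&gt;The check itself is small - a sketch of the idea (names and exact wording are illustrative; the real tool reads the registered repo from the venture registry):&lt;/p&gt;

```typescript
// Validate the issue target against the active venture's registered repo
// before any GitHub API call is made.
function checkVentureBoundary(targetRepo: string, activeRepo: string): string {
  if (targetRepo === activeRepo) return "OK";
  return (
    "VENTURE_BOUNDARY_VIOLATION: Issue target '" + targetRepo + "' does not " +
    "match active venture repo '" + activeRepo + "'. Use the handoff tool " +
    "with the target venture to record cross-venture work."
  );
}
```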
&lt;hr /&gt;
&lt;h2&gt;Problem 4: Infisical Path as Scope Boundary&lt;/h2&gt;
&lt;p&gt;This one wasn&apos;t caused by agent misbehavior. It was caused by us misunderstanding our own infrastructure.&lt;/p&gt;
&lt;p&gt;Infisical supports shared secret imports. One path can import from another, so shared infrastructure secrets (the context API key, Cloudflare credentials) don&apos;t have to be duplicated across every venture&apos;s path. We set this up when we added the first few ventures and it worked well.&lt;/p&gt;
&lt;p&gt;When we added &lt;code&gt;STITCH_API_KEY&lt;/code&gt; to the Venture Crane path, it was supposed to be platform-only. Stitch was an enterprise design tool; at the time, not every venture had it configured.&lt;/p&gt;
&lt;p&gt;It leaked to every venture within the day. The shared import mechanism propagated it automatically. Every venture that imported from the shared source path now had &lt;code&gt;STITCH_API_KEY&lt;/code&gt; set.&lt;/p&gt;
&lt;p&gt;The immediate effect was benign - agents in other ventures just had an extra env var they didn&apos;t use. The problem surfaced when we discovered &lt;code&gt;STITCH_API_KEY&lt;/code&gt; needed to be deleted: Stitch requires OAuth, not API keys, and having the var set actively broke OAuth auth. We had to chase it down across every venture&apos;s Infisical path. The deletion in the source path didn&apos;t cascade. Each path needed a manual delete.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check every venture path for a zombie secret
for code in &quot;${VENTURE_CODES[@]}&quot;; do
  echo &quot;=== $code ===&quot;
  infisical secrets --path /$code --env prod | grep STITCH_API_KEY
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Infisical path is the permission boundary. What goes in a venture&apos;s path is scoped to that venture. What goes in the platform path is scoped to Venture Crane. Shared infrastructure secrets belong in a dedicated &lt;code&gt;/shared&lt;/code&gt; path that is explicitly imported - not in a venture path that happens to be the most convenient location.&lt;/p&gt;
&lt;p&gt;We also added defense-in-depth in the launcher: &lt;code&gt;resolveStitchEnv()&lt;/code&gt; now explicitly blanks &lt;code&gt;STITCH_API_KEY&lt;/code&gt; before injecting it, so even if the value survives in Infisical, it can&apos;t override OAuth auth. The code for that is in PR #392.&lt;/p&gt;
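&lt;p&gt;The blanking step is a one-liner in spirit - a sketch of the idea (the real &lt;code&gt;resolveStitchEnv()&lt;/code&gt; lives in the launcher; this body is illustrative):&lt;/p&gt;

```typescript
// Copy the environment and force STITCH_API_KEY to an empty string, so a
// value that survives in the vault can never reach the MCP server.
function resolveStitchEnv(env: { [key: string]: string }): { [key: string]: string } {
  const out: { [key: string]: string } = {};
  for (const key of Object.keys(env)) out[key] = env[key];
  out["STITCH_API_KEY"] = "";
  return out;
}

const childEnv = resolveStitchEnv({ STITCH_API_KEY: "leaked-value", PATH: "/usr/bin" });
```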
&lt;hr /&gt;
&lt;h2&gt;What the Briefing Does Now&lt;/h2&gt;
&lt;p&gt;The current SoS briefing is load-bearing context. Before any task runs, the agent sees:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Active venture: &lt;code&gt;[platform] Venture Crane&lt;/code&gt; (not just the code - full name reduces mistakes)&lt;/li&gt;
&lt;li&gt;Active repo: &lt;code&gt;org/repo-name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Infisical path: &lt;code&gt;/venture-code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Design spec: &lt;code&gt;venture-design-spec&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Scope directive: explicit statement of what&apos;s in and out of scope&lt;/li&gt;
&lt;li&gt;Repo target reminder: hard stop if targeting a different repo&lt;/li&gt;
&lt;li&gt;Cadence: only items scoped to this venture&lt;/li&gt;
&lt;li&gt;Handoffs: only handoffs targeted at this venture&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these fields is populated from the venture registry at session start. There&apos;s no manual configuration per session. The agent&apos;s spatial context is deterministic.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We&apos;d Do Differently&lt;/h2&gt;
&lt;p&gt;The scope boundary problems were all predictable in hindsight. We built the handoff system, the cadence system, and the secrets organization independently, each assuming a single-venture context. Scope isolation wasn&apos;t designed in - it was retrofitted.&lt;/p&gt;
&lt;p&gt;If we were starting over, the venture code would be a first-class parameter on every stored artifact. Every cadence item, every handoff, every VCMS note, every Infisical secret would carry a non-nullable &lt;code&gt;venture&lt;/code&gt; field from creation. The filtering logic would be trivial because the data would already be scoped.&lt;/p&gt;
&lt;p&gt;Instead, we added scope as a retrofit to each system separately - which meant four different bugs, four separate PRs, and one incident per system before we got them all.&lt;/p&gt;
&lt;p&gt;The scope isolation work isn&apos;t finished. Edge cases remain in the VCMS tagging system and in how session analytics are rolled up across ventures. But the core infrastructure - cadence, handoffs, guardrails, secrets paths - is clean. When an agent tries to cross a boundary, the system stops it and tells it what to do instead.&lt;/p&gt;
</content:encoded><category>agent-context</category><category>multi-tenant</category><category>agent-operations</category></item><item><title>When Your Agents Spend 40 Hours on One Auth Bug</title><link>https://venturecrane.com/articles/forty-hours-one-auth-bug/</link><guid isPermaLink="true">https://venturecrane.com/articles/forty-hours-one-auth-bug/</guid><description>Four root causes, 40+ hours of agent time, one MCP auth failure. The final fix was a single setting in Google Workspace admin.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;40+ hours. 12 PRs. 7 sessions across multiple agents and machines. One MCP authentication failure that kept coming back.&lt;/p&gt;
&lt;p&gt;The bug was in the Stitch MCP server. Stitch connects to Google&apos;s design generation API over MCP - a subprocess that Claude Code launches at startup. Getting that subprocess to authenticate correctly cost us more agent time than the original Stitch integration itself. The failure was not complicated. But it had four separate root causes, each one hiding the next, and each fix we shipped addressed exactly one of them.&lt;/p&gt;
&lt;p&gt;The final root cause was a single checkbox in Google Workspace admin settings. It took a week to find.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Stitch MCP Is and How It Connects&lt;/h2&gt;
&lt;p&gt;Stitch is a design generation API. The MCP server - &lt;code&gt;@_davideast/stitch-mcp&lt;/code&gt; - is a Node.js subprocess that Claude Code launches when a session starts. It connects to the design API, exposes tools for screen generation and editing, and then sits there for the entire session.&lt;/p&gt;
&lt;p&gt;MCP servers connect only at startup. If authentication fails on launch, the tools are unavailable for the entire session. There is no &quot;reconnect&quot; command. The agent cannot fix the auth and retry mid-session. It has to stop, report the failure, and wait for the next session.&lt;/p&gt;
&lt;p&gt;This made every failed fix expensive. A wrong diagnosis costs one session. The correct fix in the wrong order also costs a session. We paid that cost repeatedly.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Four Root Causes&lt;/h2&gt;
&lt;p&gt;The bug looked like one thing for five PRs. It was actually four distinct problems layered on top of each other.&lt;/p&gt;
&lt;h3&gt;Root Cause 1: A version with a broken stdio handshake&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;stitch-mcp&lt;/code&gt; v0.5.1, the latest version, does not respond to the MCP JSON-RPC &lt;code&gt;initialize&lt;/code&gt; message on stdout. It connects to Google APIs fine - the OAuth handshake completes, the subprocess stays alive - but it never sends back the &lt;code&gt;initialize&lt;/code&gt; response that Claude Code is waiting for. From the outside, it looks like an authentication failure. It is actually a protocol failure.&lt;/p&gt;
&lt;p&gt;We tested the handshake manually across every version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;v0.3.2: responds correctly&lt;/li&gt;
&lt;li&gt;v0.4.0: responds correctly&lt;/li&gt;
&lt;li&gt;v0.5.0: responds correctly&lt;/li&gt;
&lt;li&gt;v0.5.1: connects, then silence&lt;/li&gt;
&lt;/ul&gt;
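&lt;p&gt;The probe is nothing more than a JSON-RPC &lt;code&gt;initialize&lt;/code&gt; request piped to the server&apos;s stdin. A healthy version answers on stdout; v0.5.1 stays silent. A sketch of the message we sent (the &lt;code&gt;protocolVersion&lt;/code&gt; value and client name are illustrative):&lt;/p&gt;

```typescript
// Minimal MCP initialize request (JSON-RPC 2.0). Write this to the MCP
// server subprocess's stdin, then wait for a response line on stdout.
const initializeRequest = JSON.stringify({
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2024-11-05",
    capabilities: {},
    clientInfo: { name: "handshake-probe", version: "0.0.1" },
  },
});
console.log(initializeRequest);
```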
&lt;p&gt;The broken version was the one &lt;code&gt;npm&lt;/code&gt; resolved to by default. Any session that ran &lt;code&gt;npx @_davideast/stitch-mcp&lt;/code&gt; without a pinned version got v0.5.1 and got nothing.&lt;/p&gt;
&lt;p&gt;PR #386 pinned v0.5.0. That fixed the handshake. But the tools still did not load.&lt;/p&gt;
&lt;h3&gt;Root Cause 2: The API rejects key-based auth entirely&lt;/h3&gt;
&lt;p&gt;While chasing the handshake failure, we had also been fighting a second problem: &lt;code&gt;STITCH_API_KEY&lt;/code&gt;. The Stitch API does not accept API keys. It requires OAuth2 / Application Default Credentials via gcloud. An API key in the environment does not cause a graceful fallback to OAuth - it causes a rejection.&lt;/p&gt;
&lt;p&gt;We had set up OAuth correctly. &lt;code&gt;gcloud auth application-default login&lt;/code&gt; was complete. &lt;code&gt;~/.config/gcloud/application_default_credentials.json&lt;/code&gt; existed. But &lt;code&gt;STITCH_API_KEY&lt;/code&gt; was still in the environment, and the MCP proxy was picking it up and sending it instead of the ADC credentials.&lt;/p&gt;
&lt;p&gt;Removing the key should have been simple. But the key was not coming from where we thought it was.&lt;/p&gt;
&lt;h3&gt;Root Cause 3: The key was in three places and we only removed one&lt;/h3&gt;
&lt;p&gt;The initial setup had added &lt;code&gt;STITCH_API_KEY&lt;/code&gt; to Infisical - our secrets vault - and to our launcher config, which tells the launcher what secrets to inject. When we &quot;fixed&quot; this by removing it from the launcher config, nothing changed. The launcher&apos;s secret-fetching logic pulls everything from Infisical without filtering. The key existed in Infisical, so the key was injected. The launcher config entry was cosmetic.&lt;/p&gt;
&lt;p&gt;That was four PRs (#369, #371, #373, #382) spent on something that looked like a launcher config problem but was a secrets vault problem. Every PR passed CI. None of them removed the key from the running environment.&lt;/p&gt;
&lt;p&gt;The fifth fix path (#383, #384, #385) went in the wrong direction entirely - it tried to override the injected key with an explicit value, on the theory that controlling the value would control the behavior. This made things worse.&lt;/p&gt;
&lt;p&gt;The actual fix required:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Deleting &lt;code&gt;STITCH_API_KEY&lt;/code&gt; from Infisical at every venture path (seven paths total)&lt;/li&gt;
&lt;li&gt;Removing all code in &lt;code&gt;launch-lib.ts&lt;/code&gt; that referenced the key: the &lt;code&gt;resolveStitchEnv()&lt;/code&gt; function, the process env injection, the Gemini config block, the Codex config block&lt;/li&gt;
&lt;li&gt;Adding &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; injection so the MCP proxy could find the ADC credentials file (#388)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Shipping that required a full verification pass across the test suite. PR #392 added a defense-in-depth measure: the launcher now explicitly blanks &lt;code&gt;STITCH_API_KEY&lt;/code&gt; in &lt;code&gt;resolveStitchEnv()&lt;/code&gt; so that even if the key resurfaces in the vault, it cannot reach the MCP server.&lt;/p&gt;
&lt;p&gt;We thought we were done.&lt;/p&gt;
&lt;h3&gt;Root Cause 4: A Workspace policy was killing tokens every 16 hours&lt;/h3&gt;
&lt;p&gt;The day after we declared the bug fixed, an agent ran Stitch successfully for several hours - generating screens, shipping PRs, real productive work. Then the agent ran a routine end-of-session handoff, cleared the conversation, and started a new session. Stitch was dead.&lt;/p&gt;
&lt;p&gt;The diagnosis was familiar: &lt;code&gt;STITCH_API_KEY&lt;/code&gt; found in the shell environment. But that key had been in the environment the entire time the agent was successfully using Stitch. Something else had changed.&lt;/p&gt;
&lt;p&gt;The gcloud ADC token had expired. Not the short-lived access token - those refresh automatically. The refresh token itself was dead. &lt;code&gt;gcloud auth application-default print-access-token&lt;/code&gt; returned &quot;Reauthentication failed.&quot;&lt;/p&gt;
&lt;p&gt;The ADC credentials file had been created the day before. Refresh tokens should last months. Something was actively revoking them.&lt;/p&gt;
&lt;p&gt;The answer was in Google Workspace admin settings, under Security, in a section called &quot;Google Cloud session control.&quot; It had a single configuration: &lt;strong&gt;Require reauthentication every 16 hours.&lt;/strong&gt; This policy applies to all apps requesting Cloud Platform scope - including &lt;code&gt;gcloud auth application-default login&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every agent session that ran longer than 16 hours would lose its ADC credentials. Every session that launched after the token expired would fail to authenticate. The previous three root causes had masked this because we were constantly re-authenticating while debugging the other issues. Once those were fixed and sessions started running long enough for the token to expire, root cause 4 revealed itself.&lt;/p&gt;
&lt;p&gt;PR #394 added another defense layer - deleting &lt;code&gt;STITCH_API_KEY&lt;/code&gt; from the shell environment entirely before spawning child processes, so even if the key leaks from any source, it cannot reach the MCP server on reconnection. But the actual fix was a single radio button: changing the reauthentication policy from &quot;Require reauthentication&quot; to &quot;Never require reauthentication.&quot;&lt;/p&gt;
&lt;p&gt;A Workspace admin setting, not code. Not a vault issue. Not a version issue. A policy radio button.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The 12 PRs&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;th&gt;What it did&lt;/th&gt;
&lt;th&gt;Did it fix the problem?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#362&lt;/td&gt;
&lt;td&gt;Integrate Stitch MCP fleet-wide (initial setup)&lt;/td&gt;
&lt;td&gt;Introduced the bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#369&lt;/td&gt;
&lt;td&gt;Fix Gemini MCP test fixture nesting for stitch server&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#371&lt;/td&gt;
&lt;td&gt;Switch from API key to OAuth (gcloud ADC)&lt;/td&gt;
&lt;td&gt;Partial - OAuth set up correctly, key still injected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#373&lt;/td&gt;
&lt;td&gt;Add Stitch OAuth guidance to docs&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#382&lt;/td&gt;
&lt;td&gt;Remove STITCH_API_KEY from launcher config (not vault)&lt;/td&gt;
&lt;td&gt;No - key still in vault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#383&lt;/td&gt;
&lt;td&gt;Inject STITCH_API_KEY via parent env (attempting override)&lt;/td&gt;
&lt;td&gt;Wrong direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#384&lt;/td&gt;
&lt;td&gt;Pass STITCH_API_KEY via parent env bypass&lt;/td&gt;
&lt;td&gt;Wrong direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#385&lt;/td&gt;
&lt;td&gt;Restore STITCH_API_KEY in .mcp.json env block&lt;/td&gt;
&lt;td&gt;Wrong direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#386&lt;/td&gt;
&lt;td&gt;Pin stitch-mcp to v0.5.0, remove key from .mcp.json&lt;/td&gt;
&lt;td&gt;Fixed handshake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#388&lt;/td&gt;
&lt;td&gt;Inject GOOGLE_APPLICATION_CREDENTIALS&lt;/td&gt;
&lt;td&gt;Fixed credential path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#392&lt;/td&gt;
&lt;td&gt;Blank STITCH_API_KEY in launcher as defense-in-depth&lt;/td&gt;
&lt;td&gt;Defense-in-depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#394&lt;/td&gt;
&lt;td&gt;Strip STITCH_API_KEY from shell env before spawn&lt;/td&gt;
&lt;td&gt;Defense-in-depth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The cleanup work between #386 and #392 removed &lt;code&gt;STITCH_API_KEY&lt;/code&gt; from all venture Infisical paths and stripped the key from every code path in the launcher.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why the Diagnosis Kept Slipping&lt;/h2&gt;
&lt;p&gt;Four things made this bug resilient to repeated fix attempts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The failure mode was generic.&lt;/strong&gt; &quot;MCP tools unavailable&quot; covers every possible launch failure: wrong version, bad credentials, missing env var, network error, broken stdio. Without distinguishing these modes, every fix attempt was a guess.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multiple agents, multiple sessions, no shared state.&lt;/strong&gt; When an agent fixes a bug in one session and writes a handoff, the next session starts fresh. It reads the handoff, but it cannot carry the mental model the first agent built. Subtle context gets lost. The third agent investigating root cause 3 had to re-derive root cause 2 from scratch before it could understand why the vault cleanup mattered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The launchctl ghost.&lt;/strong&gt; One session discovered that &lt;code&gt;STITCH_API_KEY&lt;/code&gt; had been persisted to the macOS launchctl environment - the persistent environment store that survives shell restarts. Even after removing the key from Infisical and the launcher, it was still being injected from &lt;code&gt;launchctl&lt;/code&gt;. A fourth location for the same bad key. The fix was &lt;code&gt;launchctl unsetenv STITCH_API_KEY&lt;/code&gt;, but this was discovered mid-session. The MCP server had already launched without the key fix in place, and the tools were still unavailable for that session.&lt;/p&gt;
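&lt;p&gt;On macOS there are more injection layers than the shell. A quick audit sketch, using the variable name from this saga:&lt;/p&gt;

```shell
# Audit every place STITCH_API_KEY can be injected from on macOS.
launchctl getenv STITCH_API_KEY        # persistent launchctl store (survives shell restarts)
printenv STITCH_API_KEY                # current shell environment
grep -sn STITCH_API_KEY ~/.zshrc ~/.zprofile   # shell init files (-s suppresses missing-file noise)

# If launchctl is the culprit, clear it; newly spawned processes stop inheriting the key.
launchctl unsetenv STITCH_API_KEY
```

&lt;p&gt;Note that &lt;code&gt;launchctl unsetenv&lt;/code&gt; only affects processes launched afterward - a session already running keeps the ghost until it restarts.&lt;/p&gt;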
&lt;p&gt;&lt;strong&gt;Infrastructure masking infrastructure.&lt;/strong&gt; The constant re-authentication from debugging root causes 1-3 kept the ADC token fresh. The 16-hour Workspace policy never triggered because no session ran long enough on a stable Stitch setup to hit the limit. Root cause 4 only became visible after root causes 1-3 were fixed - the debugging process itself was hiding the deepest problem.&lt;/p&gt;
&lt;p&gt;The MCP startup constraint turned every discovery into a one-session delay.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Changed Going Forward&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check the platform before the code.&lt;/strong&gt; The final root cause was not in our code, our vault, or our dependencies. It was a Workspace admin policy. When authentication tokens expire faster than they should, check the policy layer before building workarounds in code. Google Workspace session control, OAuth consent screen publishing status, and GCP org policies all impose token lifetime limits that no amount of code-side fixing will solve.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Infisical path is the allowlist.&lt;/strong&gt; If a secret should not reach the MCP server, do not put it in Infisical. Do not put it in Infisical and try to filter it out in code. Remove it from the source. Code-side filters compensate for vault hygiene problems and create the illusion that the problem is solved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Delete from all paths, not just one.&lt;/strong&gt; We had deleted &lt;code&gt;STITCH_API_KEY&lt;/code&gt; from one venture path weeks before this saga. It was still present in six others. Infisical shared folder imports do not cascade deletes. When you remove a secret, you have to check every venture path individually. We now verify with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;infisical secrets --path /{code} --env prod | grep STITCH_API_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run that for every venture path. If any of them returns a result, the key is still active.&lt;/p&gt;
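&lt;p&gt;A loop makes the sweep mechanical. This is a sketch, assuming the venture codes live one per line in a local file (&lt;code&gt;ventures.txt&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```shell
# Sweep every venture path; print any path where the dead key survives.
cat ventures.txt | while read -r code; do
  if infisical secrets --path "/$code" --env prod | grep -q STITCH_API_KEY; then
    echo "STILL PRESENT: /$code"
  fi
done
```

&lt;p&gt;Silence means clean. Any output means a path the cascade-less delete missed.&lt;/p&gt;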
&lt;p&gt;&lt;strong&gt;Pin MCP server versions.&lt;/strong&gt; &lt;code&gt;npx&lt;/code&gt; resolves to the latest version by default. Latest is not always correct. Pin the version in &lt;code&gt;.mcp.json&lt;/code&gt; and treat upgrades as deliberate decisions that require testing the stdio handshake.&lt;/p&gt;
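&lt;p&gt;In practice the pin is one field in the server entry. A sketch of the shape after #386 - the package name is the one from this saga, but treat the surrounding schema as an assumption and check it against your CLI&apos;s MCP config format:&lt;/p&gt;

```json
{
  "mcpServers": {
    "stitch": {
      "command": "npx",
      "args": ["-y", "@_davideast/stitch-mcp@0.5.0"]
    }
  }
}
```

&lt;p&gt;Without the &lt;code&gt;@0.5.0&lt;/code&gt; suffix, &lt;code&gt;npx&lt;/code&gt; resolves to latest on every launch, and an upstream release becomes an untested fleet-wide deploy.&lt;/p&gt;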
&lt;p&gt;&lt;strong&gt;Test the handshake explicitly.&lt;/strong&gt; Before deploying a new MCP server version fleet-wide, verify that it responds to &lt;code&gt;initialize&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;echo &apos;{&quot;jsonrpc&quot;:&quot;2.0&quot;,&quot;id&quot;:1,&quot;method&quot;:&quot;initialize&quot;,&quot;params&quot;:{&quot;protocolVersion&quot;:&quot;2024-11-05&quot;,&quot;capabilities&quot;:{},&quot;clientInfo&quot;:{&quot;name&quot;:&quot;test&quot;,&quot;version&quot;:&quot;1&quot;}}}&apos; \
  | npx @_davideast/stitch-mcp@0.5.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A healthy server responds with its capabilities. A broken server is silent or exits. This takes 10 seconds. We skipped it when upgrading to v0.5.1.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Takeaways for Multi-Agent MCP Systems&lt;/h2&gt;
&lt;p&gt;MCP server failures are expensive relative to other infrastructure failures because of the startup-only connection constraint. A bad deploy in a Cloudflare Worker costs a few minutes of downtime before a rollback. A bad MCP server configuration costs every session that launches with it until the fix is deployed and a new session starts.&lt;/p&gt;
&lt;p&gt;This changes how you should treat MCP environment configuration. It is not application config. It is more like bootloader config - if it is wrong, nothing runs until it is correct. The blast radius of a mistake is large and asymmetric. Errors are cheap to introduce and expensive to recover from.&lt;/p&gt;
&lt;p&gt;For anyone building multi-agent systems with MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Know all the layers that compose a subprocess environment before you start debugging - including the platform and admin policies above your code&lt;/li&gt;
&lt;li&gt;The vault is the source of truth; code-side filtering is not a substitute for vault hygiene&lt;/li&gt;
&lt;li&gt;Version-pin every MCP server and test the handshake before fleet deployment&lt;/li&gt;
&lt;li&gt;When a bug survives multiple fix attempts across sessions, stop and enumerate every possible source of the problem before shipping another fix&lt;/li&gt;
&lt;li&gt;When tokens expire faster than documented, check the admin policy layer before writing code workarounds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The 40+ hours would have been 4 if we had started with that last step.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Update: It Was Simpler Than We Thought&lt;/h2&gt;
&lt;p&gt;After publishing this article, we found that the &lt;a href=&quot;https://stitch.withgoogle.com/docs/mcp/setup&quot;&gt;Stitch documentation&lt;/a&gt; had clear instructions for API key authentication the entire time. Stitch is a remote HTTP MCP server at &lt;code&gt;https://stitch.googleapis.com/mcp&lt;/code&gt;. There is no local subprocess. No proxy. No &lt;code&gt;npx&lt;/code&gt;. The server runs on Google&apos;s infrastructure and accepts an API key in a request header.&lt;/p&gt;
&lt;p&gt;The official Claude Code setup is one line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;claude mcp add stitch --transport http https://stitch.googleapis.com/mcp \
  -H &quot;X-Goog-Api-Key: &amp;lt;key&amp;gt;&quot; -s user
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is the entire integration. No version pinning. No OAuth flow. No &lt;code&gt;gcloud auth application-default login&lt;/code&gt; on every fleet machine. No &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; injection. No defense-in-depth blanking of env vars that should never have existed. No Workspace admin policy debugging.&lt;/p&gt;
&lt;p&gt;We ripped out 105 lines of launcher code - &lt;code&gt;resolveStitchEnv()&lt;/code&gt;, the proxy spawning logic, the Gemini and Codex config blocks, the credential file path injection - and replaced it with nothing. The launcher no longer manages Stitch at all. Each machine runs the one-line CLI command once, and the MCP server connects directly to Google&apos;s endpoint with a standard API key.&lt;/p&gt;
&lt;p&gt;The entire local proxy architecture was unnecessary. Every root cause in this article - the broken stdio handshake, the API key rejection, the vault cleanup across seven paths, the 16-hour token expiry policy - was a consequence of running a local subprocess proxy that did not need to exist. The remote HTTP server has none of these problems. There is no subprocess to pin versions on. There is no OAuth token to expire. There is no gcloud credential file to locate.&lt;/p&gt;
&lt;p&gt;We did not read the vendor documentation thoroughly enough. We started from a community setup guide, hit auth failures, and spent a week building workarounds for an architecture we had chosen by default rather than by design. The Stitch docs had the simpler path documented the whole time. The lesson is straightforward: before building infrastructure to work around a tool&apos;s behavior, check whether the tool already supports what you need.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Stitch is our AI design generation tool. The MCP server saga ran from March 24 through March 29, 2026, across seven sessions and multiple agents. The final fix was not a radio button in Workspace admin settings - it was a one-line CLI command that pointed Claude Code at Google&apos;s remote MCP endpoint with an API key.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>mcp</category><category>debugging</category><category>agent-operations</category></item><item><title>Tool Registration Is Not Tool Integration</title><link>https://venturecrane.com/articles/tool-registration-not-integration/</link><guid isPermaLink="true">https://venturecrane.com/articles/tool-registration-not-integration/</guid><description>Registering an MCP server in a CLI config means the CLI can discover the tools. It does not mean the tools can access the credentials they need to function.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We run three AI coding CLIs: Claude Code, OpenAI Codex CLI, and Google Gemini CLI. All three share the same MCP server - 14 tools covering session management, work tracking, documentation, handoffs, and scheduling. On paper, this is multi-CLI redundancy. In practice, for most of this year, it was one functioning CLI and two agents that could discover tools but couldn&apos;t use them.&lt;/p&gt;
&lt;p&gt;The gap wasn&apos;t access. It was credentials. Finding it required shipping 114 files and watching the first live test fail immediately.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Vendor Lock-in Nobody Talks About&lt;/h2&gt;
&lt;p&gt;When people talk about vendor lock-in for AI coding CLIs, they mean rate limits, pricing changes, and model capability gaps. Those are real concerns. But the subtler form is this - the CLI that has your instructions, your skills, your system prompts, and your enterprise rules becomes the only CLI that can operate in your environment. The others are present but inert.&lt;/p&gt;
&lt;p&gt;Claude Code had 19 skills, a 4,000-word instruction file covering development workflow, secrets management, QA grades, and enterprise rules, and full MCP integration with our infrastructure. When it hit rate limits, the operation stopped. Not because the other CLIs lacked tool access - they had the same 14 tools registered. Because they had no instructions and no skills. They would connect to our MCP server, discover the tools, and then have no idea what to do with them or how to operate in our environment.&lt;/p&gt;
&lt;p&gt;Codex and Gemini each had two commands pointing at shell scripts that no longer existed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Built&lt;/h2&gt;
&lt;p&gt;The sprint covered three things: instruction files, skills, and credential passthrough. The credential issue came last. It was the most important.&lt;/p&gt;
&lt;h3&gt;Instructions&lt;/h3&gt;
&lt;p&gt;We rewrote the instruction files for both CLIs to match Claude Code&apos;s depth. Same enterprise rules (all changes through PRs, never push to main, verify secret values not just key existence). Same MCP tool reference table with every tool name, purpose, and when to call it. Same auto-session-start behavior: call preflight, then initialize. Same escalation triggers: credential not found in two minutes, same error three times, blocked more than 30 minutes - stop and escalate.&lt;/p&gt;
&lt;p&gt;We also created global instruction files that apply across all venture repos, not just the project-level configs: engineering quality standards, writing style, agent authorship stance, CSS and design patterns.&lt;/p&gt;
&lt;h3&gt;Skills: Three Formats, One Intent&lt;/h3&gt;
&lt;p&gt;The skill porting was more complex than expected - not because the logic was hard to translate, but because the three CLIs use fundamentally different skill formats.&lt;/p&gt;
&lt;p&gt;Claude Code skills are markdown files with YAML frontmatter. A skill file includes metadata fields (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;triggers&lt;/code&gt;), a prompt body written in markdown prose, and often inline code blocks. The format is human-readable and treats the AI as the executor. The markdown tells it what to do, and it figures out how.&lt;/p&gt;
&lt;p&gt;Codex uses a directory-per-skill structure. Each skill lives in its own folder with a &lt;code&gt;skill.yaml&lt;/code&gt; file for metadata and a &lt;code&gt;prompt.md&lt;/code&gt; for the prompt body. The YAML frontmatter is more structured than Claude&apos;s - explicit field types, required/optional markers, and parameter definitions that Codex validates before running the skill. It&apos;s closer to a typed interface than a prose instruction.&lt;/p&gt;
&lt;p&gt;Gemini uses TOML files with triple-quoted prompt strings. A single &lt;code&gt;.toml&lt;/code&gt; file contains both metadata and prompt. Triple-quoted strings in TOML behave differently from markdown prose - line breaks are literal, indentation matters, and special characters need escaping. A skill that looks clean in markdown can look awkward in TOML until you understand the quoting rules.&lt;/p&gt;
&lt;p&gt;The straightforward skills - session start, heartbeat, status checks - translated directly. Copy the intent, rewrite for the target format, done.&lt;/p&gt;
&lt;p&gt;The multi-agent skills required real adaptation. Claude Code can spawn parallel sub-agents. The editorial review skill, for instance, launches a style editor and a fact checker simultaneously, waits for both, then merges findings and applies fixes. Codex and Gemini don&apos;t have native sub-agent spawning. We adapted every multi-agent skill to run sequentially - same roles, same output structure, same quality checks, one pass at a time instead of parallel. The sprint skill went from parallel worktree agents to sequential branch-based execution. The design brief skill went from four simultaneous perspectives to four sequential rounds. Slower execution, identical output.&lt;/p&gt;
&lt;p&gt;Two background agents ran the bulk porting in parallel: one producing 13 Codex skills, the other producing 13 Gemini commands. Both finished clean. We extended the sync script that distributes skills to venture repos to handle all three formats with the same exclusion list. A dry run confirmed 114 new files across the venture repos. Then we ran it for real.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Broke&lt;/h2&gt;
&lt;p&gt;The first live test failed.&lt;/p&gt;
&lt;p&gt;We launched Codex into a venture repo, ran the start-of-day skill, and the MCP server reported that our API key wasn&apos;t set. The key was in the environment - the launcher injects it at startup. But it wasn&apos;t reaching the MCP server process.&lt;/p&gt;
&lt;p&gt;Codex CLI has a default security filter that strips environment variables whose names contain &lt;code&gt;KEY&lt;/code&gt;, &lt;code&gt;SECRET&lt;/code&gt;, or &lt;code&gt;TOKEN&lt;/code&gt; from child processes. Our primary API key variable has &lt;code&gt;KEY&lt;/code&gt; in the name. The MCP server, spawned as a child of Codex, never saw it.&lt;/p&gt;
&lt;p&gt;The fix was an &lt;code&gt;env_vars&lt;/code&gt; whitelist in the Codex configuration - five variable names explicitly permitted to pass through to the MCP server. We added self-healing logic to the launcher so existing installs get patched on next launch and new installs get the whitelist from the start.&lt;/p&gt;
&lt;p&gt;We added similar explicit environment passthrough for Gemini&apos;s configuration, expecting it to be preventive. It turned out to be necessary.&lt;/p&gt;
&lt;p&gt;Gemini CLI has its own version of the same filter. The function is called &lt;code&gt;sanitizeEnvironment()&lt;/code&gt;. It runs at CLI startup, before any MCP configuration is merged. It strips variables from &lt;code&gt;process.env&lt;/code&gt; that match three patterns: &lt;code&gt;/TOKEN/i&lt;/code&gt;, &lt;code&gt;/KEY/i&lt;/code&gt;, &lt;code&gt;/SECRET/i&lt;/code&gt;. These are case-insensitive regex patterns, which means &lt;code&gt;CRANE_CONTEXT_KEY&lt;/code&gt; matches &lt;code&gt;/KEY/i&lt;/code&gt; and gets stripped. The MCP server config can specify environment variables to pass in - but if those variables are already absent from &lt;code&gt;process.env&lt;/code&gt; by the time the config is processed, passing them through a config reference like &lt;code&gt;$CRANE_CONTEXT_KEY&lt;/code&gt; passes the literal string, not the value.&lt;/p&gt;
&lt;p&gt;The fix for Gemini requires two separate configuration changes. First, the MCP server entry needs explicit &lt;code&gt;env&lt;/code&gt; mappings. Second, a &lt;code&gt;security.environmentVariableRedaction.allowed&lt;/code&gt; array needs to whitelist the same variable names. The allowlist is what bypasses &lt;code&gt;sanitizeEnvironment()&lt;/code&gt;. Without it, the &lt;code&gt;env&lt;/code&gt; mapping in the MCP server entry receives a placeholder string, not the actual credential, and every tool call fails with a 401.&lt;/p&gt;
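&lt;p&gt;Both halves of the Gemini fix live in the CLI&apos;s &lt;code&gt;settings.json&lt;/code&gt;. A sketch with the variable name from this article - the server name and package are placeholders, and the exact key paths should be verified against your Gemini CLI version:&lt;/p&gt;

```json
{
  "mcpServers": {
    "crane": {
      "command": "npx",
      "args": ["-y", "crane-mcp"],
      "env": { "CRANE_CONTEXT_KEY": "$CRANE_CONTEXT_KEY" }
    }
  },
  "security": {
    "environmentVariableRedaction": {
      "allowed": ["CRANE_CONTEXT_KEY"]
    }
  }
}
```

&lt;p&gt;The &lt;code&gt;env&lt;/code&gt; mapping alone is not enough: without the &lt;code&gt;allowed&lt;/code&gt; entry, the variable is already gone from &lt;code&gt;process.env&lt;/code&gt; before the mapping is resolved.&lt;/p&gt;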
&lt;p&gt;Both CLIs independently made the same design choice: strip credentials from child processes by default, require explicit opt-in to pass them through. This is the right default. You don&apos;t want arbitrary MCP servers inheriting every secret in your environment. But it means every MCP integration needs an explicit, tested allowlist before it can function. And you won&apos;t discover that until you run the first real command with a tool that requires auth.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Lessons for Multi-CLI Agent Infrastructure&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Tool registration is not tool integration.&lt;/strong&gt; A CLI can list your tools, describe their parameters, and call them correctly - and still fail on every call that requires a credential. The MCP protocol handles discovery. Credential delivery is your problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with a credentialed tool on first setup.&lt;/strong&gt; Don&apos;t verify MCP integration with a tool that returns static data. Use a tool that requires an API key and confirm the response is real data, not an auth error. Catching env sanitization failures this way costs one test call. Catching them later costs a debugging session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allowlists need to be complete and exact.&lt;/strong&gt; Both Codex and Gemini do case-insensitive pattern matching when deciding what to strip. If your variable name matches &lt;code&gt;/KEY/i&lt;/code&gt;, &lt;code&gt;/TOKEN/i&lt;/code&gt;, or &lt;code&gt;/SECRET/i&lt;/code&gt; anywhere in the name, it gets stripped. Check every variable you need to pass to an MCP server against these patterns. Audit the complete list before deploying to the fleet.&lt;/p&gt;
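&lt;p&gt;The audit is a one-liner. A sketch using three variable names from our launcher environment - any name that prints will be stripped by default:&lt;/p&gt;

```shell
# Names matching /KEY/i, /TOKEN/i, or /SECRET/i anywhere are stripped by both CLIs.
# CRANE_CONTEXT_KEY and GH_TOKEN match; CLOUDFLARE_ACCOUNT_ID passes through untouched.
printf '%s\n' CRANE_CONTEXT_KEY GH_TOKEN CLOUDFLARE_ACCOUNT_ID \
  | grep -iE 'KEY|TOKEN|SECRET'
```

&lt;p&gt;Here &lt;code&gt;CLOUDFLARE_ACCOUNT_ID&lt;/code&gt; is the only survivor; the other two need explicit allowlist entries before an MCP server can see them.&lt;/p&gt;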
&lt;p&gt;&lt;strong&gt;Self-healing configuration is worth the investment.&lt;/strong&gt; When we patched the Codex config fix, we embedded the repair logic in the launcher itself. Every machine that runs the launcher gets the correct config, whether it was set up last week or a year ago. Manual config patching across a fleet is a recurring maintenance burden. The launcher is already running on every machine - use it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skill portability is not free, but it&apos;s achievable.&lt;/strong&gt; The three CLI formats are different enough that naive copy-paste doesn&apos;t work, but the intent of each skill translates reliably. The investment is in format conversion, not in rethinking the skill&apos;s purpose. Sequential adaptation of parallel skills produces the same output - the only cost is execution time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instructions are as important as tools.&lt;/strong&gt; The gap between a functioning CLI and an inert one wasn&apos;t the MCP server. It was the absence of instructions. A CLI that can discover 14 tools but has no context about when to call them, what enterprise rules apply, or what a session looks like will call the wrong tools in the wrong order. Tools without instructions are a collection of capabilities, not an agent.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where We Are Now&lt;/h2&gt;
&lt;p&gt;We went from one functioning CLI to three in a single session. All three connect to the same MCP server with valid credentials. All three carry the same 19 skills, the same instruction depth, and the same enterprise rules. The sync script propagates updates to all three formats simultaneously.&lt;/p&gt;
&lt;p&gt;The next time Claude Code hits a rate limit or context cap, Codex or Gemini can pick up the session. Same tools, same skills, same rules, same infrastructure. The credential delivery issue is patched in the launcher and will never silently fail again.&lt;/p&gt;
</content:encoded><category>mcp</category><category>agent-tooling</category><category>infrastructure</category></item><item><title>From Zero to Landing Page in Four Days</title><link>https://venturecrane.com/articles/zero-to-landing-page-four-days/</link><guid isPermaLink="true">https://venturecrane.com/articles/zero-to-landing-page-four-days/</guid><description>A new venture went from repo creation to production-ready Astro site with 46 merged PRs in four days. The infrastructure did most of the work.</description><pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;On 2026-03-24, we added a venture code to a registry file. On 2026-03-28, we wired a Calendly booking link into production CTA buttons on a live Astro site. Forty-six merged PRs and four days in between.&lt;/p&gt;
&lt;p&gt;The venture is an operations consulting firm serving small businesses in the Phoenix area. The target client is 5-50 employees, dealing with undocumented processes, leaky lead pipelines, and no financial visibility. Fixed-price engagements, clear deliverables, no open-ended retainers.&lt;/p&gt;
&lt;p&gt;This article is about the scaffolding, not the consulting methodology. The speed came from infrastructure that was already in place. A new venture doesn&apos;t bootstrap from zero here. It inherits a working system.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What &quot;From Zero&quot; Actually Means&lt;/h2&gt;
&lt;p&gt;Zero, in this context, means a single entry added to the venture registry:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;code&quot;: &quot;example&quot;,
  &quot;name&quot;: &quot;Example Venture&quot;,
  &quot;org&quot;: &quot;example-org&quot;,
  &quot;repos&quot;: [&quot;example-console&quot;],
  &quot;capabilities&quot;: [&quot;web&quot;, &quot;content&quot;]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From that registration, the &lt;code&gt;crane&lt;/code&gt; launcher can inject the right secrets for any agent session targeting this venture. The context API returns the venture config on demand. The GitHub classifier auto-classifies incoming issues with QA grades. Enterprise skills and slash commands sync to the new repo on launch. None of this requires per-venture configuration.&lt;/p&gt;
&lt;p&gt;The venture setup checklist runs an agent through the full bootstrap: repo creation, Infisical path setup, secrets propagation, local clone, &lt;code&gt;.infisical.json&lt;/code&gt;, Cloudflare project, VCMS venture record. Each step is concrete and command-driven. An agent following the checklist doesn&apos;t interpret intent - it runs the commands and verifies the output.&lt;/p&gt;
&lt;p&gt;By the time any product agent touches the venture, infrastructure is already done. The agent&apos;s job starts at the product layer.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The First Four Days&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Day 1 (2026-03-24):&lt;/strong&gt; PR #356 adds SS to the venture registry. This is the seed. Everything else follows from it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Days 1-3:&lt;/strong&gt; A sprint session covers the product foundations. Twenty-six issues are created and closed covering pricing model, scope protocols, client profiles, vertical selection, referral partnerships, pipeline math, outreach messaging, and assessment processes. These aren&apos;t ticket-filing exercises - they represent real product decisions baked into issues, reviewed, and closed. The decision stack framework for engagement packages lands here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Day 4 (2026-03-28):&lt;/strong&gt; Two sessions run in sequence. The first audits the full venture state and finds everything in better shape than the prior handoff indicated. The second builds the Cloudflare Pages deploy workflow and wires the landing page to production.&lt;/p&gt;
&lt;p&gt;Forty-six PRs merged across those four days. Build output: 80KB total. CI green. Twenty-nine tests passing.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Landing Page: What Actually Shipped&lt;/h2&gt;
&lt;p&gt;The Astro scaffold and landing page shipped in PR #46. By the time the current session reviewed the venture state, this was already merged - discovered during the audit, not during build.&lt;/p&gt;
&lt;p&gt;In a multi-agent, multi-session environment, work often lands before the current agent has full context. The right response is verification, not assumption. We audited: checked CI status, reviewed test counts, confirmed the build artifact size, read the PR diffs. Everything was clean.&lt;/p&gt;
&lt;p&gt;What shipped in PR #46:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Full Astro site scaffold with TypeScript config&lt;/li&gt;
&lt;li&gt;Landing page with hero, services sections, and CTA&lt;/li&gt;
&lt;li&gt;OG image for social sharing&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sitemap.xml&lt;/code&gt; and &lt;code&gt;robots.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;JSON-LD structured data for local business SEO&lt;/li&gt;
&lt;li&gt;All 29 tests passing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PR #48 (current session) added:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cloudflare Pages GitHub Actions deploy workflow&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DEPLOY_ENABLED&lt;/code&gt; secrets guard - the pipeline checks for this flag before attempting a deploy, so the workflow can live in the repo without triggering until production credentials are confirmed&lt;/li&gt;
&lt;li&gt;Calendly booking link wired into all CTA buttons&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Wiring a deploy workflow before all production secrets are provisioned is a real risk - a misconfigured workflow can deploy broken state or fail loudly in CI during demos and reviews. The &lt;code&gt;DEPLOY_ENABLED&lt;/code&gt; guard is a single environment secret. Set it to &lt;code&gt;true&lt;/code&gt; in the repo&apos;s Actions secrets when you&apos;re ready. Until then, the workflow exits cleanly at the check step. Flipping it is a one-line change when the time comes, and it costs nothing now.&lt;/p&gt;
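&lt;p&gt;The guard itself is a few lines at the top of the deploy job. A sketch of the shape - the real workflow step may differ:&lt;/p&gt;

```shell
# First step of the deploy job: proceed only when the flag is exactly "true".
# DEPLOY_ENABLED arrives as a secret mapped into the step environment;
# on the skip branch the real step exits 0 so CI stays green.
if [ "${DEPLOY_ENABLED:-}" = "true" ]; then
  echo "DEPLOY_ENABLED=true: continuing to deploy"
else
  echo "DEPLOY_ENABLED not set to true: skipping deploy"
fi
```

&lt;p&gt;Because an unset secret expands to an empty string, a repo with no secret configured takes the skip branch automatically.&lt;/p&gt;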
&lt;hr /&gt;
&lt;h2&gt;What Broke (or Nearly Did)&lt;/h2&gt;
&lt;p&gt;The session immediately before this one produced a gap analysis that concluded the venture setup was incomplete. It was wrong. Everything was already in place. The prior session had done the work - the analysis session just hadn&apos;t verified.&lt;/p&gt;
&lt;p&gt;Root cause: the agent inferred state instead of running commands. It saw an unmerged PR in a handoff note and concluded that downstream work was also unfinished. It wasn&apos;t. The unmerged PR (PR #356, venture registration) was the only gap, and even that was a procedural issue - the agent that opened it didn&apos;t merge it in the same session.&lt;/p&gt;
&lt;p&gt;The cost was one full session of re-verification work that shouldn&apos;t have been necessary. The fix was structural: we added Phase 5.5 to the venture setup checklist. It&apos;s seven commands agents must run to verify end-to-end venture readiness:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Confirm venture appears in context API
crane_ventures | grep {code}

# 2. Confirm session creation works
crane_context | grep -i &apos;{code}&apos;

# 3. Confirm auto-classification fires
crane_notes --venture {code} | head -5

# 4. Confirm Infisical secrets are present
infisical secrets --path /{code} --env prod | wc -l

# 5. Confirm local clone exists
ls ~/dev/{code}-console/.infisical.json

# 6. Confirm CI is green
gh run list --repo {org}/{code}-console --limit 5

# 7. Confirm merged PR count
gh pr list --repo {org}/{code}-console --state merged | wc -l
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And we formalized a rule that was already implied but never explicit: agents must merge their own PRs in the same session. &quot;Needs merge next session&quot; is incomplete work. The agent that opens a PR owns it through merge.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a new insight. It&apos;s a workflow discipline problem that compounds in multi-agent environments. Each handoff is a trust boundary. If a handoff says &quot;work is done, just needs merge,&quot; the receiving agent has to either trust that claim or verify it. Verification takes time. Better to not create the situation in the first place.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Infrastructure That Made This Fast&lt;/h2&gt;
&lt;p&gt;Four things did most of the work:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The launcher.&lt;/strong&gt; A single &lt;code&gt;crane&lt;/code&gt; command fetches Infisical secrets for the venture, injects them into the agent process, and spawns the session. The agent has &lt;code&gt;CRANE_VENTURE_CODE&lt;/code&gt;, &lt;code&gt;CRANE_VENTURE_NAME&lt;/code&gt;, &lt;code&gt;GH_TOKEN&lt;/code&gt;, &lt;code&gt;CLOUDFLARE_API_TOKEN&lt;/code&gt;, and &lt;code&gt;CLOUDFLARE_ACCOUNT_ID&lt;/code&gt; available from the first command. No manual env setup. No &lt;code&gt;.env&lt;/code&gt; files. No &quot;wait, which token do I need for Cloudflare?&quot;&lt;/p&gt;
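&lt;p&gt;The hand-off from secret fetch to process spawn is the core of that launcher. A sketch in Node, assuming the secrets have already been fetched into a plain object (the real launcher pulls them from Infisical and spawns the agent session):&lt;/p&gt;

```typescript
import { spawnSync } from "node:child_process";

// Illustrative launcher core: merge fetched secrets into the child environment
// and spawn. The child here just echoes one variable to prove injection works.
function launchWithSecrets(secrets: { [key: string]: string }): string {
  const env = { ...process.env, ...secrets };
  const result = spawnSync(
    process.execPath,
    ["-e", "console.log(process.env.CRANE_VENTURE_CODE)"],
    { env, encoding: "utf8" },
  );
  return (result.stdout ?? "").trim();
}
```

&lt;p&gt;Because the environment is assembled per spawn, nothing persists on disk - no &lt;code&gt;.env&lt;/code&gt; file to leak or go stale.&lt;/p&gt;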
&lt;p&gt;&lt;strong&gt;The registry.&lt;/strong&gt; The venture registry is the single source of truth for venture metadata. Add the entry once - repo name, org, capabilities, Stitch project ID. Every downstream tool reads from it. The launcher knows which Infisical path to query. The context API knows how to serve the venture config. The GitHub classifier knows which rules apply.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The new-venture checklist.&lt;/strong&gt; A step-by-step playbook that agents execute, not interpret. Concrete commands with expected output. When a step produces unexpected output, the agent stops and escalates rather than improvising. The checklist has been refined through four prior venture bootstraps. This venture is the first to run against the Phase 5.5 verification block.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auto-classified issues.&lt;/strong&gt; The GitHub classifier monitors incoming GitHub issues and applies QA grade labels automatically. &lt;code&gt;qa-grade:0&lt;/code&gt; means CI-only verification - no human review required. Most infrastructure work lands here. This keeps the merge queue moving without requiring a human to triage every PR.&lt;/p&gt;
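&lt;p&gt;The shape of such a classifier rule, with made-up path heuristics - the real rule set is venture-specific and not shown in this article:&lt;/p&gt;

```typescript
// Hypothetical classification rule, for illustration only. Grade 0 means
// CI-only verification; anything touching product code gets a higher grade.
function qaGrade(changedPaths: string[]): number {
  const infraOnly = changedPaths.every((p) => {
    if (p.startsWith(".github/")) return true;
    if (p.startsWith("infra/")) return true;
    if (p.endsWith(".md")) return true;
    return false;
  });
  return infraOnly ? 0 : 2;
}

function qaLabel(changedPaths: string[]): string {
  return "qa-grade:" + String(qaGrade(changedPaths));
}
```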
&lt;hr /&gt;
&lt;h2&gt;By the Numbers&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Days from registration to production-ready site&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged PRs&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Closed issues&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build output&lt;/td&gt;
&lt;td&gt;80KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test count&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI failures post-merge&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The 80KB build output deserves its own line. Astro ships zero JavaScript by default. The landing page is static HTML and CSS. No framework runtime. No client-side hydration. The only JavaScript on the page is the Calendly embed, which loads on interaction.&lt;/p&gt;
&lt;p&gt;Fast load on mobile. No build complexity. Deploys to Cloudflare Pages in under a minute. For a local services landing page, there is no better architecture.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What This Demonstrates&lt;/h2&gt;
&lt;p&gt;Every venture we launch gets faster because the previous ones improved the infrastructure. This one benefited from lessons learned bootstrapping earlier ventures - the secrets propagation issues we hit on venture three, the classification problems we fixed on venture four, the PR completion rule that should have been explicit from day one.&lt;/p&gt;
&lt;p&gt;The scaffolding velocity isn&apos;t about moving fast and accepting technical debt. The 29 tests and green CI reflect the same standard we hold on the flagship products. The difference is that we don&apos;t rebuild the foundation each time.&lt;/p&gt;
&lt;p&gt;An agent session focused on product work doesn&apos;t think about Cloudflare configuration or Infisical paths. That layer is handled before the session starts. The cognitive load stays where it belongs - on the product.&lt;/p&gt;
</content:encoded><category>venture-scaffolding</category><category>astro</category><category>agent-workflow</category></item><item><title>Taking Product Development Offline with Local LLMs</title><link>https://venturecrane.com/articles/local-llms-offline-field-development/</link><guid isPermaLink="true">https://venturecrane.com/articles/local-llms-offline-field-development/</guid><description>We set up four specialized local models on a fanless laptop for offline product work. Here is what we built, what works, and what we are measuring.</description><pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The development lab runs AI agent sessions roughly 18 hours a day across multiple machines. The agents have access to frontier models, a full MCP toolchain, context management, and a fleet of Apple Silicon hardware. When the founder is at a workstation, ideas move from thought to implementation in minutes.&lt;/p&gt;
&lt;p&gt;The problem is the other hours. Driving. Sitting at auction houses. Coffee shops with unreliable WiFi. Ideas happen everywhere, but acting on them requires cloud AI and a network connection. The gap between having a product idea in the field and getting back to a networked machine is dead time. For a solo founder running multiple ventures, dead time compounds.&lt;/p&gt;
&lt;p&gt;We decided to close that gap with local models running on hardware that was already in the bag.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Hardware&lt;/h2&gt;
&lt;p&gt;An M1 MacBook Air with 16GB of unified memory. Fanless design, 68 GB/s memory bandwidth, roughly 10 hours of battery. The lack of a fan is both a feature and a constraint: silent operation anywhere, but thermal throttling sets a practical ceiling on sustained inference.&lt;/p&gt;
&lt;p&gt;That ceiling, in practice: 7-8B parameter models at Q4 quantization. One model loaded at a time. Around 20 consecutive prompts before thermal throttling kicks in. After that, a 5-10 minute cooldown or a task switch and it recovers. This is not a workstation replacement. It is a capture device.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Models&lt;/h2&gt;
&lt;p&gt;We set up &lt;a href=&quot;https://ollama.com&quot;&gt;Ollama&lt;/a&gt; with four specialized models, each with a custom system prompt tuned to our stack. The key decision was specialization over generality. A single general-purpose 8B model tries to be everything and is mediocre at all of it. Four focused models, each pre-loaded with our conventions, eliminate the re-explaining that wastes context window and produces drift.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alias&lt;/th&gt;
&lt;th&gt;Base Model&lt;/th&gt;
&lt;th&gt;Temp&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;field-prd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;Product requirements documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;field-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Qwen 2.5 Coder 7B&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;TypeScript / Cloudflare Workers code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;field-wire&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;React / Tailwind components from descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;field-arch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DeepSeek-R1 8B&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;td&gt;Architecture decisions with step-by-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Plus &lt;code&gt;llava:7b&lt;/code&gt; for converting paper sketches and whiteboard photos into component code via the laptop camera.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why these specific models.&lt;/strong&gt; Qwen3 8B handles structured document generation well at this parameter count. It follows templates consistently. Qwen 2.5 Coder 7B is purpose-built for code generation and respects conventions baked into its system prompt more reliably than general models. DeepSeek-R1 8B does chain-of-thought reasoning natively, which matters for architecture decisions where you want the model to think through constraints before committing to an answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What the system prompts contain.&lt;/strong&gt; Each model already knows the tech stack: Next.js or Astro with Tailwind on the frontend, Cloudflare Workers with Hono on the backend, D1/KV/R2 for storage. The PRD writer knows our requirements template: problem statement, hypothesis, kill criteria, acceptance criteria, agent brief. The code model knows our file layout conventions, response shapes, and type patterns. No re-explaining every session.&lt;/p&gt;
&lt;p&gt;Temperature choices are deliberate. The code model runs cold (0.3) because we want deterministic, convention-following output. The PRD writer runs warmer (0.7) because requirements writing benefits from some creative variation. The architect sits in between (0.4) where reasoning is structured but not rigid.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Workflow&lt;/h2&gt;
&lt;p&gt;Shell aliases invoke the right model - interactively, or one-shot with the prompt as an argument. Each one is a custom Ollama Modelfile: a base model plus a system prompt plus tuned parameters, registered as a named model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;field-prd    # PRD writer - structured requirements docs
field-code   # Code generation - Workers/Hono/D1
field-wire   # Screen description to React/Tailwind component
field-arch   # Architecture decisions with chain-of-thought
field-vision # Photo/sketch analysis via multimodal model
&lt;/code&gt;&lt;/pre&gt;
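&lt;p&gt;A representative Modelfile for one of these - this sketch shows the shape of &lt;code&gt;field-code&lt;/code&gt;; the actual system prompt is longer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM qwen2.5-coder:7b

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM &quot;&quot;&quot;You generate TypeScript for Cloudflare Workers using Hono.
Storage is D1, KV, or R2. Follow the project file layout and response-shape
conventions. Output code only, no commentary.&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Register it once with &lt;code&gt;ollama create field-code -f Modelfile&lt;/code&gt; and the alias reduces to &lt;code&gt;ollama run field-code&lt;/code&gt;.&lt;/p&gt;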
&lt;p&gt;A typical field session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Start a PRD for a new feature
field-prd &quot;Write a PRD for expense splitting between two households&quot; \
  &amp;gt; ~/field-work/project-a/prds/expense-splitting-v1.md

# Get architecture guidance
field-arch &quot;Should I use D1 or KV for storing split configurations? \
  They update monthly and need to be queryable by household.&quot;

# Generate the route handler
field-code &quot;Write a Hono route handler for POST /api/splits \
  that creates a new expense split configuration in D1&quot; \
  &amp;gt; ~/field-work/project-a/code/splits-route.ts

# Convert a napkin sketch to a component
field-vision --images ~/photo.jpg \
  &quot;Convert this wireframe to a React component with Tailwind&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output gets saved to organized directories, one per project, with subdirectories for PRDs, code, wireframes, migrations, and session logs. A session log template tracks what was generated, which models were used, and estimates quality for lab integration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Back at the lab&lt;/strong&gt;, files get copied into the real repository and Claude Code refines them against the actual codebase: fixing imports, aligning with existing patterns, running the test suite. The field output is a head start, not a finished product.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The 8K Context Configuration&lt;/h2&gt;
&lt;p&gt;This is the biggest behavioral shift from working with frontier models. Cloud models give you 100K+ context windows. You can paste an entire file and say &quot;refactor this.&quot; These models support up to 32K tokens natively, but we configured them to 8,192 tokens to stay within the M1 Air&apos;s thermal budget. That is roughly 6,000 words of combined input and output.&lt;/p&gt;
&lt;p&gt;The practical effect: you describe what you want instead of showing what you have. &quot;Write a Hono route handler that creates an expense split in D1&quot; works. &quot;Here is my existing codebase, add expense splitting&quot; does not fit.&lt;/p&gt;
&lt;p&gt;This turns out to be a useful discipline. Prompts become tighter. Requirements become more explicit. You cannot lean on the model to figure out what you mean from surrounding code - you have to say it. The output is more predictable as a result, even if narrower in scope.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Honest Quality Assessment&lt;/h2&gt;
&lt;p&gt;We are not going to pretend 8B models compete with frontier models. They do not. Here is what we expect based on initial testing:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Output Type&lt;/th&gt;
&lt;th&gt;Expected Quality&lt;/th&gt;
&lt;th&gt;What Needs Lab Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRDs&lt;/td&gt;
&lt;td&gt;80-90% usable&lt;/td&gt;
&lt;td&gt;Structure is solid, details need refinement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route handlers&lt;/td&gt;
&lt;td&gt;60-80% correct&lt;/td&gt;
&lt;td&gt;Imports and file paths will be wrong, types need checking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;React components&lt;/td&gt;
&lt;td&gt;70-85% structural accuracy&lt;/td&gt;
&lt;td&gt;Tailwind classes usually right, state logic needs review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1 migrations&lt;/td&gt;
&lt;td&gt;50-70% correct&lt;/td&gt;
&lt;td&gt;Schema is directional, constraints and indexes need manual work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The code model produces syntactically correct TypeScript that follows our conventions because the system prompt specifies them. What it gets wrong: import paths (it does not know the actual project structure), peer dependencies between files, and edge cases in error handling. These are exactly the things Claude Code catches in 15-30 minutes of lab refinement.&lt;/p&gt;
&lt;p&gt;The PRD writer is the strongest performer. Structured document generation at 8B parameters is genuinely useful. The model follows the template, fills in reasonable content, and produces something that reads like a first draft rather than a hallucination. Kill criteria and acceptance criteria still need human judgment, but the structure and framing save significant time.&lt;/p&gt;
&lt;p&gt;Migrations are the weakest. D1 schema design requires understanding the full data model, and an 8K context window cannot hold enough of it to make good relational decisions. We use these as starting points, not as anything close to final.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Surprised Us&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Setup time was trivial.&lt;/strong&gt; Pulling four models, creating custom Modelfiles, configuring aliases, building the directory structure, and running a smoke test took under 15 minutes. Most of that was download time over WiFi. The actual configuration was maybe 3 minutes of file creation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;System prompts make a disproportionate difference at small parameter counts.&lt;/strong&gt; A vanilla Qwen3 8B prompt produces generic, vaguely helpful output. The same model with a 200-word system prompt specifying our stack conventions, response format, and file layout patterns produces output that looks like it came from someone who has worked in the codebase before. The delta is much larger than the same system prompt would make on a frontier model, probably because the smaller model has less competing training data to override.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thermal management is a real workflow concern.&lt;/strong&gt; The M1 Air handles 15-20 prompts comfortably before performance degrades. This maps naturally to the rhythm of thinking through a feature, generating a few artifacts, and moving to the next thing. But it means closing the browser and Docker before field sessions, and accepting cooldown breaks as part of the flow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Piping to files changes how you prompt.&lt;/strong&gt; When output goes directly to a markdown file instead of a chat window, each prompt becomes a discrete, self-contained unit of work rather than a conversational follow-up. This produces cleaner artifacts. You think more carefully about what you ask for because you are committing the output to a file, not iterating in a chat thread.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Would Change&lt;/h2&gt;
&lt;p&gt;A fifth model for documentation: a dedicated writer with system prompts for ADRs, runbooks, and API docs. We write documentation in the field less than we should, and a &lt;code&gt;field-doc&lt;/code&gt; alias with our conventions baked in would lower the friction.&lt;/p&gt;
&lt;p&gt;The session log template is manual. On the next iteration, we would wrap the aliases in a shell function that auto-logs which model was used, the prompt, and the output path. In practice, manual logs get skipped when things are moving fast.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Kill Criteria&lt;/h2&gt;
&lt;p&gt;We set a clear signal for ourselves: if less than 50% of field-generated code survives lab refinement after a two-week pilot, we deprioritize the workflow. The overhead of generating, transferring, and refining is not worth it if lab cleanup consistently takes longer than writing from scratch.&lt;/p&gt;
&lt;p&gt;The pilot starts with an upcoming trip, several days away from the development lab with real product decisions to make. We will track: artifacts generated per session, survival rate through lab refinement, refinement time per artifact, and whether field-generated PRDs actually get implemented or just get rewritten from scratch.&lt;/p&gt;
&lt;p&gt;Total cost of the setup: $0. Ollama is free. The models are open-weight, under licenses that permit commercial use. Inference is local. The only costs are the 15 minutes of setup time and the electricity to charge the laptop.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Point&lt;/h2&gt;
&lt;p&gt;These models are not replacing Claude Code or any frontier model for the work that happens in the lab. That is not what they are for. They are capturing momentum.&lt;/p&gt;
&lt;p&gt;The difference between &quot;I had an idea while driving&quot; and &quot;I had an idea while driving, and here is a PRD, three route handlers, and a migration ready for lab refinement&quot; is the difference between a note on a phone and a head start on implementation. For a solo founder managing multiple products, that delta compounds across every trip, every errand, every hour away from a workstation.&lt;/p&gt;
&lt;p&gt;We will report back after the pilot with real numbers.&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>workflow</category><category>local-llm</category></item><item><title>From Code Review to Production in 48 Hours</title><link>https://venturecrane.com/articles/code-review-to-production-48-hours/</link><guid isPermaLink="true">https://venturecrane.com/articles/code-review-to-production-48-hours/</guid><description>How AI agents executed a full product sprint - from code review through production deploy - in 48 hours for a writing app.</description><pubDate>Sun, 22 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A code review graded a codebase at C. Forty-eight hours later, the same codebase was in production with security hardening, a rich text editor, Google Drive integration, PDF and EPUB export, progressive AI features, and 179 tests across frontend and backend. Thirty-one pull requests merged across two days.&lt;/p&gt;
&lt;p&gt;The codebase was an iPad-first writing app for nonfiction authors. It had an existing foundation - authentication, basic editor, chapter structure - but the code review exposed real problems. A D in testing. Cs in security, architecture, and code quality. Drive query injection vectors in the Google Drive integration. A monolithic page component north of 1,200 lines. Near-zero test coverage on critical paths.&lt;/p&gt;
&lt;p&gt;What happened next was not a hackathon. It was a structured sprint where the code review findings became the work queue, ordered by severity, and AI agents worked through it systematically.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Code Review as Sprint Plan&lt;/h2&gt;
&lt;p&gt;Most code reviews produce a document that sits in a wiki. Someone reads it, nods, and adds items to a backlog that competes with feature work for prioritization. The findings age. The context fades. Three months later, the missing tests are still missing.&lt;/p&gt;
&lt;p&gt;We treated the code review differently. The seven-dimension rubric - architecture, security, code quality, testing, dependencies, documentation, standards compliance - produced graded findings with concrete thresholds. Each finding mapped directly to a unit of work. The grades determined the order.&lt;/p&gt;
&lt;p&gt;Testing got a D - near-total absence of test coverage across both frontend and backend, with zero frontend tests and no tests for core business logic. Security got a C, with Drive query injection vectors, missing Content Security Policy headers, no rate limiting, and unvalidated OAuth redirect URIs. These became the first PRs of the sprint - not because security is abstractly important, but because a D in any dimension pulls the overall grade downward regardless of everything else. Fix the D first, and every subsequent PR improves the codebase from a higher baseline.&lt;/p&gt;
&lt;p&gt;Architecture got a C. The main page component was over 1,000 lines, mixing editor logic, settings management, AI features, and export handling in a single file. This informed the refactoring strategy throughout the sprint - every feature PR was an opportunity to extract, not accumulate.&lt;/p&gt;
&lt;p&gt;The code review did not just tell us what was wrong. It told us what to fix first.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Day 0: Security Foundation&lt;/h2&gt;
&lt;p&gt;The instinct with a new product sprint is to build the exciting features first. Rich text editing, AI-powered rewrites, Google Drive sync - that is the fun work. Security headers and rate limiting are not fun. They are also not optional when your code review hands back Cs and Ds.&lt;/p&gt;
&lt;p&gt;The first PRs addressed every security finding from the review:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content Security Policy headers.&lt;/strong&gt; The app had no CSP. A malicious script injection - through a crafted document title, a compromised CDN, an XSS vector in the editor - would execute without restriction. The fix was a strict CSP that whitelists known origins for scripts, styles, fonts, and connections. This is a configuration change, not a feature, but it is the difference between &quot;a vulnerability is exploitable&quot; and &quot;a vulnerability is mitigated by defense in depth.&quot;&lt;/p&gt;
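&lt;p&gt;The whitelist-to-header step is a few lines. A sketch with placeholder origins - the app&apos;s real policy lists its own CDN and font hosts:&lt;/p&gt;

```typescript
// Building a strict CSP value from an origin whitelist. The directive set and
// the example origins are illustrative, not the app's actual policy.
function buildCsp(allowedOrigins: string[]): string {
  const allow = ["'self'"].concat(allowedOrigins).join(" ");
  const directives = [
    "default-src 'self'",
    "script-src " + allow,
    "style-src " + allow,
    "font-src " + allow,
    "connect-src " + allow,
  ];
  return directives.join("; ");
}
```

&lt;p&gt;The resulting string is attached to every response as the &lt;code&gt;Content-Security-Policy&lt;/code&gt; header.&lt;/p&gt;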
&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; The API had no request throttling. An attacker - or a misbehaving client, or a user&apos;s own sync loop gone wrong - could hammer endpoints without limit. Rate limiting went onto the authentication and AI endpoints, the two highest-value targets.&lt;/p&gt;
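&lt;p&gt;A minimal sliding-log limiter shows the mechanism. The limits here are placeholders, and a Workers deployment would keep the state in KV or a Durable Object rather than in process memory:&lt;/p&gt;

```typescript
// Sliding-log rate limiter, keyed per client (e.g. IP or user id).
const hitLog = new Map();

function allowRequest(key: string, now: number, limit: number, windowMs: number): boolean {
  const prev: number[] = hitLog.get(key) ?? [];
  // Keep only the timestamps that still fall inside the window.
  const recent = prev.filter((t) => t > now - windowMs);
  if (recent.length >= limit) {
    hitLog.set(key, recent);
    return false;
  }
  recent.push(now);
  hitLog.set(key, recent);
  return true;
}
```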
&lt;p&gt;&lt;strong&gt;OAuth redirect validation.&lt;/strong&gt; The OAuth flow accepted redirect URIs without validation. An attacker could craft a login link that redirected the OAuth token to their own server. The fix validates redirect URIs against a whitelist of known origins before initiating the OAuth flow.&lt;/p&gt;
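&lt;p&gt;The validation itself is small. A sketch with a placeholder whitelist - the important detail is comparing the full origin, not a string prefix:&lt;/p&gt;

```typescript
// Whitelist check for OAuth redirect URIs. The allowed origin is a placeholder;
// the real list comes from the app's configuration.
const allowedRedirectOrigins = new Set(["https://app.example.com"]);

function isAllowedRedirect(uri: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(uri);
  } catch {
    return false;
  }
  if (parsed.protocol !== "https:") {
    return false;
  }
  // Compare the full origin - prefix checks are bypassable with hosts
  // like https://app.example.com.evil.example.
  return allowedRedirectOrigins.has(parsed.origin);
}
```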
&lt;p&gt;These three changes - CSP, rate limiting, redirect validation - addressed the security findings while the sprint focused on raising the testing grade from D. They established the pattern for the rest of the sprint: fix the foundation before building on it.&lt;/p&gt;
&lt;p&gt;The same session built the core features that the security hardening was protecting. A rich text editor with formatting toolbar. A three-tier auto-save system - local state, debounced API writes, and periodic full-document sync - so writers never lose work. The initial AI rewrite feature with a floating action bar and server-sent events for streaming responses. Text selection handling tuned for iPad, where selection behavior differs from desktop browsers in ways that only surface when you test on the actual device.&lt;/p&gt;
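&lt;p&gt;The middle tier of that auto-save stack - debounced API writes - can be sketched with an explicit clock so the coalescing is visible. The names and timings are illustrative:&lt;/p&gt;

```typescript
// Debounced saver with an explicit clock (milliseconds) for testability.
function makeDebouncedSaver(delayMs: number, save: (doc: string) => void) {
  let pending: string | null = null;
  let dueAt = 0;
  return {
    // Every keystroke updates local state immediately, then schedules a write.
    schedule(doc: string, now: number): void {
      pending = doc;
      dueAt = now + delayMs;
    },
    // Called periodically; fires the API write once typing has paused.
    tick(now: number): void {
      if (pending !== null) {
        if (now >= dueAt) {
          save(pending);
          pending = null;
        }
      }
    },
  };
}
```

&lt;p&gt;Two edits inside the window collapse into one API write carrying the latest content, which is the whole point of the tier.&lt;/p&gt;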
&lt;hr /&gt;
&lt;h2&gt;Day 1: Thirty-One PRs&lt;/h2&gt;
&lt;p&gt;Thirty-one pull requests, numbered between #59 and #95, merged across February 16 and 17 - each one a scoped unit of work: one feature, one fix, or one refactoring. Not a monolithic &quot;day 1 features&quot; branch. Thirty-one individual, reviewable changes.&lt;/p&gt;
&lt;p&gt;The volume is notable but the sequencing is what matters. Each PR built on the security foundation from Day 0. Every new feature inherited the CSP headers, the rate limiting, the redirect validation. Security was not a follow-up task - it was already in the codebase before the first feature PR of Day 1.&lt;/p&gt;
&lt;h3&gt;Google Drive Integration&lt;/h3&gt;
&lt;p&gt;The full OAuth-to-export pipeline shipped in a single day. Connect your Google Drive account via OAuth. The app auto-creates a book folder in Drive on first export. Export your manuscript as PDF or EPUB and save it directly to your Drive folder. Browse files already in your book folder. Disconnect with proper token revocation - not just clearing the local token, but calling Google&apos;s revocation endpoint so the authorization is truly removed.&lt;/p&gt;
&lt;p&gt;Token revocation is the kind of detail that gets skipped in a fast sprint. It is easy to implement &quot;disconnect&quot; by deleting the stored token and calling it done. The user sees a disconnected state, the UI looks right, but the OAuth grant is still active on Google&apos;s side. If the token is later compromised, it still works. Proper revocation is an HTTP call and an error handler. It took ten minutes to implement and it closes a real security gap.&lt;/p&gt;
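&lt;p&gt;Google documents a dedicated revocation endpoint for this. A sketch that builds the request - sending it with &lt;code&gt;fetch&lt;/code&gt; and handling the error response is the ten minutes mentioned above:&lt;/p&gt;

```typescript
// Google's OAuth 2.0 revocation endpoint takes the token as a form-encoded
// POST body. This builds the request description; the caller sends it and
// checks for a 200 before clearing local state.
function buildRevocationRequest(token: string) {
  return {
    url: "https://oauth2.googleapis.com/revoke",
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ token }).toString(),
  };
}
```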
&lt;h3&gt;Export Pipeline&lt;/h3&gt;
&lt;p&gt;PDF export uses Cloudflare&apos;s Browser Rendering API. Instead of a PDF library that approximates the document layout, the app renders the manuscript as HTML with print-optimized CSS, then uses a headless browser to generate the PDF. The output matches exactly what the user sees in the editor preview. No layout surprises, no font substitution, no &quot;it looked different in the app.&quot;&lt;/p&gt;
&lt;p&gt;EPUB export builds the package from scratch using JSZip. The EPUB format is a ZIP archive containing XHTML content files, a package manifest, and metadata. Building it programmatically means the output validates against EPUB readers without depending on a third-party EPUB library that might not handle edge cases in manuscript formatting - things like scene breaks, chapter epigraphs, and front matter.&lt;/p&gt;
&lt;p&gt;Both export formats support two destinations: local download to the device, or save to the connected Google Drive folder. The user chooses at export time.&lt;/p&gt;
&lt;h3&gt;Chapter Management&lt;/h3&gt;
&lt;p&gt;Rename chapters inline - click the chapter title, edit, press enter. Delete chapters with last-chapter protection - the app prevents deleting your only remaining chapter, which would leave a book with no content. Drag-and-drop reorder for chapter sequencing, which required careful state management to keep the editor, the chapter list, and the backend in sync during the drag operation.&lt;/p&gt;
&lt;h3&gt;Auth and Session Handling&lt;/h3&gt;
&lt;p&gt;Sign-out that actually clears everything - cached data, service worker state, local storage, session cookies. When a user signs out of a writing app, they expect their manuscript data to be gone from the device. A sign-out that clears the session cookie but leaves cached chapter content in a service worker is not a real sign-out.&lt;/p&gt;
&lt;p&gt;Thirty-day session persistence for the common case. Writers do not want to re-authenticate every time they open their iPad to write. The session token persists for 30 days with a sliding window, so regular usage keeps the session alive indefinitely.&lt;/p&gt;
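&lt;p&gt;The sliding window is two small functions. A sketch, with hypothetical session and field names:&lt;/p&gt;

```typescript
// Sliding-window session expiry: every authenticated request pushes the
// expiry 30 days out, so regular use never hits a login wall.
const SESSION_TTL_MS = 30 * 24 * 60 * 60 * 1000;

type Session = { token: string; expiresAt: number };

function isActive(session: Session, now: number): boolean {
  return session.expiresAt > now;
}

function touchSession(session: Session, now: number): Session {
  return { token: session.token, expiresAt: now + SESSION_TTL_MS };
}
```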
&lt;h3&gt;Progressive AI Architecture&lt;/h3&gt;
&lt;p&gt;The AI rewrite feature uses a two-tier model architecture that optimizes for perceived performance.&lt;/p&gt;
&lt;p&gt;The primary model runs on Cloudflare Workers AI - specifically a lightweight model optimized for fast inference at the edge. When a user selects text and taps &quot;Rewrite,&quot; the response starts streaming with sub-second time-to-first-token. The UI opens the rewrite sheet instantly, shows a blinking cursor to indicate the model is working, and streams tokens as they arrive. The user sees activity within a second of tapping the button.&lt;/p&gt;
&lt;p&gt;For users who want a deeper rewrite, a &quot;Go Deeper&quot; option escalates to a frontier model. This takes longer but produces more nuanced rewrites - better at preserving voice, handling complex sentence structures, and making substantive improvements rather than surface-level rephrasing.&lt;/p&gt;
&lt;p&gt;The key insight is that these serve different moments. A quick rewrite while drafting needs to be instant - the writer is in flow and any delay breaks concentration. A deep rewrite during editing can take a few seconds because the writer is already in a reflective mode. Two models, two latency profiles, two interaction patterns.&lt;/p&gt;
&lt;p&gt;The streaming UX matters more than the model quality. An objectively better rewrite that takes four seconds to start displaying loses to a decent rewrite that starts in under a second. Writers will use the fast option ten times for every one use of the deep option, because speed keeps them in their creative flow.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Day 2: Production Polish&lt;/h2&gt;
&lt;p&gt;The app was functional after Day 1. Day 2 was about making it production-ready - the difference between &quot;it works&quot; and &quot;it works on an iPad that someone added to their home screen and uses every day.&quot;&lt;/p&gt;
&lt;h3&gt;PWA Support&lt;/h3&gt;
&lt;p&gt;The app is an iPad-first product distributed as a Progressive Web App. Users add it to their home screen from Safari and it launches like a native app - full screen, no browser chrome, its own icon in the app switcher.&lt;/p&gt;
&lt;p&gt;PWA support required a service worker for offline capability and asset caching, configured via Serwist (a modern service worker toolkit). The service worker pre-caches the app shell and fonts, caches API responses for offline reading, and handles the install prompt flow. The web manifest defines the app name, icons, theme color, and display mode.&lt;/p&gt;
&lt;p&gt;Getting PWA right on iPad specifically required testing the add-to-home-screen flow, verifying the app launches in standalone mode (not inside Safari), confirming the status bar styling, and ensuring the service worker handles the app lifecycle correctly when iOS suspends and resumes the web app process. These are the details that determine whether a PWA feels like a real app or a bookmarked website.&lt;/p&gt;
&lt;h3&gt;Multi-Book Management&lt;/h3&gt;
&lt;p&gt;The initial version supported a single book. Day 2 added a project dashboard with cards for each book, a project switcher in the editor, and full CRUD operations: create a new book, rename, duplicate (copying all chapters), and delete with confirmation.&lt;/p&gt;
&lt;p&gt;The dashboard was a meaningful architectural addition. It introduced a project context layer above the existing chapter/editor hierarchy. Every component that previously assumed &quot;there is one book&quot; needed to become project-aware - the editor, the chapter list, the export pipeline, the Google Drive integration, the auto-save system.&lt;/p&gt;
&lt;h3&gt;Extraction and Testing&lt;/h3&gt;
&lt;p&gt;The main page component that the code review flagged at over 1,000 lines got its first significant extraction. The settings menu - account management, Google Drive connection, export options, sign-out - was pulled into its own component. The page file went from 1,257 lines to 801 lines. Still large, but moving in the right direction. The extraction pattern established during this refactoring - identify a cohesive feature cluster, extract it with its own state management, connect it via props and callbacks - became the template for future extractions.&lt;/p&gt;
&lt;p&gt;The test suite grew substantially on Day 2. Sixty-eight new tests across five test files, covering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Auth middleware: four tests verifying token validation, session expiry, and unauthorized access handling&lt;/li&gt;
&lt;li&gt;CORS policy: three tests confirming cross-origin behavior for the API endpoints&lt;/li&gt;
&lt;li&gt;Encryption: five tests covering the encrypt/decrypt cycle, key derivation, and error cases&lt;/li&gt;
&lt;li&gt;Component tests for the newly extracted settings menu and project dashboard&lt;/li&gt;
&lt;/ul&gt;
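&lt;p&gt;The encryption tests follow a standard round-trip pattern: encrypt, decrypt, compare, and confirm a wrong key does not recover the plaintext. A minimal Python sketch of that pattern - the toy XOR cipher and the key-derivation parameters here are illustrative, not the app&apos;s actual implementation:&lt;/p&gt;

```python
import hashlib

def derive_key(password: bytes, salt: bytes) -> bytes:
    # PBKDF2 key derivation; the iteration count is illustrative
    return hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Toy symmetric cipher: XOR each byte against the repeating derived key
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encrypt(plaintext: bytes, password: bytes, salt: bytes) -> bytes:
    return xor_cipher(plaintext, derive_key(password, salt))

def decrypt(ciphertext: bytes, password: bytes, salt: bytes) -> bytes:
    # XOR is its own inverse, so decryption reuses the same operation
    return xor_cipher(ciphertext, derive_key(password, salt))
```

&lt;p&gt;The test then asserts three things: ciphertext differs from plaintext, the round trip restores the original, and a wrong password does not.&lt;/p&gt;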
&lt;p&gt;The final count: 108 backend tests and 71 frontend tests. Not comprehensive coverage, but meaningful coverage on the paths that matter most - authentication, data integrity, and the features an author interacts with every session.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the Numbers Mean&lt;/h2&gt;
&lt;p&gt;Forty-eight hours. Thirty-one PRs across two days. One hundred seventy-nine tests. Testing grade from D to B. A production-deployed iPad app with rich text editing, AI features, Google Drive sync, and multi-format export.&lt;/p&gt;
&lt;p&gt;These numbers are real, but they are not the point. Fast output from AI agents is easy to achieve. You point an agent at a codebase and tell it to build features, and it will produce volume. The question is whether that volume is coherent, secure, and maintainable.&lt;/p&gt;
&lt;p&gt;The sprint worked because of the structure around the speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The code review ordered the work.&lt;/strong&gt; Security findings came first, not because someone made a judgment call, but because the grading rubric mathematically requires it - a D in any dimension pulls the overall grade downward. The rubric automated the prioritization that a human tech lead would have done manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Each PR was scoped.&lt;/strong&gt; Thirty-one PRs in two days sounds chaotic. It is the opposite of chaotic. Each PR did one thing. &quot;Add CSP headers&quot; is reviewable. &quot;Day 1 features&quot; is not. Scoped PRs meant that if any single change caused a problem, it could be identified and reverted without unwinding a day of work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security was the foundation, not an afterthought.&lt;/strong&gt; Every feature built on Day 1 inherited the security hardening from Day 0. The Google Drive OAuth flow benefited from the redirect validation. The AI endpoints benefited from rate limiting. Building security first meant every subsequent PR was building on a secure base.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deploy before polish.&lt;/strong&gt; The app went to production with core features before the PWA support, before multi-book management, before the settings extraction. This meant real users could start using the app while the polish continued. It also meant the polish was informed by production behavior, not assumptions about how the app would be used.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Code Review as Sprint Planning&lt;/h2&gt;
&lt;p&gt;The transferable pattern here is not &quot;AI agents can build fast.&quot; That is table stakes. The pattern is using automated code review as sprint planning.&lt;/p&gt;
&lt;p&gt;A traditional sprint planning session involves a product manager, a tech lead, and a backlog of varying quality. The team discusses priorities, estimates effort, and commits to a set of work items. This process is valuable but subjective. Two different tech leads will prioritize the same backlog differently.&lt;/p&gt;
&lt;p&gt;An automated code review with a structured rubric produces an objective severity ordering. Testing D, security C, architecture C - the work order writes itself. You do not need a planning meeting to know that you fix the D before you address the Cs. You do not need a tech lead to decide that missing test coverage is more urgent than a large file.&lt;/p&gt;
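&lt;p&gt;The severity ordering is mechanical enough to express directly. A sketch in Python - the letter-to-severity mapping is our illustration, not the rubric&apos;s actual internals:&lt;/p&gt;

```python
GRADE_SEVERITY = {"D": 0, "C": 1, "B": 2, "A": 3}  # lower value = more urgent

def work_order(dimension_grades):
    """Sort (dimension, grade) pairs so the worst grades come first."""
    return sorted(dimension_grades, key=lambda item: GRADE_SEVERITY[item[1]])

# Testing D, security C, architecture C: the D sorts to the top
queue = work_order([("security", "C"), ("testing", "D"), ("architecture", "C")])
# queue == [("testing", "D"), ("security", "C"), ("architecture", "C")]
```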
&lt;p&gt;This does not replace product prioritization. The code review tells you what is wrong with the code. The product manager tells you what features to build. The sprint combines both: fix the security findings, then build the features, with each feature PR inheriting the fixes. The code review provides the engineering priorities. The product vision provides the feature priorities. They are complementary inputs to the same sprint plan.&lt;/p&gt;
&lt;p&gt;For teams considering this approach: run the code review first. Before you write a single feature story, grade the existing codebase. The findings will tell you what technical work needs to happen before - or alongside - the feature work. That sequencing is the difference between a sprint that builds on a solid foundation and one that builds on known problems.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Would Do Differently&lt;/h2&gt;
&lt;p&gt;Honesty about what did not go perfectly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The 1,257-line page component should have been extracted earlier.&lt;/strong&gt; The Day 2 extraction brought it to 801 lines, but 801 lines is still too large. The architectural C from the code review was partially addressed, not resolved. A more disciplined approach would have set a hard line - no component over 500 lines - and enforced it with every feature PR rather than deferring extraction to a polish day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test coverage is adequate, not thorough.&lt;/strong&gt; One hundred seventy-nine tests across a full-featured writing app is a starting point. The critical paths are covered - auth, encryption, CORS, core components - but there are gaps in the export pipeline, the drag-and-drop interactions, and the Google Drive sync edge cases (network failures mid-upload, token expiry during export). These are the tests that prevent production incidents, and they are not written yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The two-day timeline compresses learning.&lt;/strong&gt; When agents build fast, the team learns slowly. Each of those thirty-one PRs represents a set of decisions - API design choices, state management patterns, error handling strategies - that were made quickly by an agent optimizing for completion. Some of those decisions will need revisiting as the app matures and real usage patterns emerge. Speed of implementation is not the same as quality of decisions.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Uncomfortable Part&lt;/h2&gt;
&lt;p&gt;A two-day sprint from code review to production raises a question that the industry is still working through: what does this mean for traditional sprint planning, estimation, and team structure?&lt;/p&gt;
&lt;p&gt;We do not have a complete answer. What we can report is what happened: a code review identified problems, the problems were prioritized by severity, AI agents worked through the queue, and a production app emerged in 48 hours. The features are real. The tests pass. The security hardening is in place. Users are writing with it.&lt;/p&gt;
&lt;p&gt;Whether this changes how teams plan sprints, estimate work, or structure their engineering organizations is a bigger question than one sprint can answer. What this sprint demonstrates is that the mechanics work. Code review produces a prioritized work queue. AI agents execute against that queue. The output is a production application, not a prototype.&lt;/p&gt;
&lt;p&gt;The interesting question is not whether AI agents can do this. They just did. The interesting question is what your team does with the time that opens up when the build phase compresses from weeks to days.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The writing app went from a C code review to production in 48 hours. AI agents executed the sprint, merging 31 PRs across two days and producing 179 tests across frontend and backend. The code review rubric served as the sprint plan, with the testing D and security C addressed before feature work. The app is in production as a Progressive Web App.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>sprint</category><category>agent-workflow</category><category>code-quality</category></item><item><title>Where We Stand: AI Agent Operations in February 2026</title><link>https://venturecrane.com/articles/where-we-stand-agent-operations-2026/</link><guid isPermaLink="true">https://venturecrane.com/articles/where-we-stand-agent-operations-2026/</guid><description>An honest field report on running AI agent teams in production. What the stack looks like, what works, what breaks, and what the data says versus the marketing.</description><pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;We have been running AI agent teams in production for months. Not experimenting. Not prototyping. Running - sessions, handoffs, fleet dispatches, PR pipelines, content production, documentation enforcement - across multiple product ventures, every day, 12+ hours a day. Hundreds of sessions logged. Thousands of commits merged.&lt;/p&gt;
&lt;p&gt;This is a field report on where the technology actually is: what works, what breaks, and what the data says versus what the marketing says. The gap between the narrative and the reality is wide enough to waste months of effort if you walk in believing the wrong things.&lt;/p&gt;
&lt;h2&gt;The Narrative vs. the Numbers&lt;/h2&gt;
&lt;p&gt;The narrative says we are entering the age of fully autonomous AI agents. Dario Amodei predicted with 70-80% confidence that a single-person company could reach $1B by 2026. Sam Altman has a CEO group chat betting on the timeline. Solo-founded startups surged from 23.7% in 2019 to 36.3% by mid-2025. The agentic AI market is estimated at $9-11 billion in 2026, projected to hit $45-53 billion by 2030.&lt;/p&gt;
&lt;p&gt;The numbers tell a different story. Seven independent studies confirm AI agents fail 70-95% of the time on complex tasks. Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. Only about 130 of thousands of claimed agentic AI vendors offer legitimate agent technology. The rest are &quot;agent washing&quot; - rebranding chatbots, RPA, and existing automation.&lt;/p&gt;
&lt;p&gt;Deloitte surveyed 550 US cross-industry tech leaders: 80% say they have mature basic automation capabilities, only 28% say the same about automation with AI agents, and only 12% expect comparable ROI from agents within three years.&lt;/p&gt;
&lt;p&gt;Both things are true simultaneously. The technology is real and improving rapidly. The gap between a demo and a production system remains the central challenge.&lt;/p&gt;
&lt;h2&gt;What the Stack Looks Like Now&lt;/h2&gt;
&lt;p&gt;The industry has settled on a three-layer taxonomy for agent infrastructure, articulated by LangChain and refined by practitioners like Phil Schmid:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frameworks&lt;/strong&gt; provide building blocks - tool definitions, agentic loops, basic primitives. LangGraph leads with approximately 6.17 million monthly downloads. CrewAI has 44,000+ GitHub stars and powers 1.4 billion agentic executions across enterprises including PwC and IBM. Microsoft AutoGen handles conversational agent architectures. OpenAI shipped the Agents SDK in March 2025 to replace their experimental Swarm project.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Runtimes&lt;/strong&gt; provide execution environments - state management, error recovery, checkpointing. This layer is where most teams underinvest and where most projects die.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Harnesses&lt;/strong&gt; provide the full operational layer - prompt presets, lifecycle hooks, planning, filesystem access, sub-agent management, human approval flows. Schmid frames this clearly: the model is the CPU, the context window is RAM, the harness is the operating system, and specific agent logic is the application. &quot;2025 proved agents could work. 2026 is about making agents work reliably, and the harness determines whether agents succeed or fail.&quot;&lt;/p&gt;
&lt;p&gt;The emerging pattern among teams that ship: prototype with CrewAI&apos;s intuitive role-based model, productionize with LangGraph&apos;s stateful graph architecture. Or skip frameworks entirely and build directly on the coding agent runtimes - Claude Code, Cursor, Codex - with custom orchestration above.&lt;/p&gt;
&lt;h3&gt;MCP Became the Standard&lt;/h3&gt;
&lt;p&gt;The Model Context Protocol is the biggest structural development in the space. Anthropic introduced it in November 2024. OpenAI adopted it in March 2025 across the Agents SDK, Responses API, and ChatGPT desktop. Google DeepMind followed in April. By November 2025: 10,000+ active MCP servers, 97 million monthly SDK downloads. In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded with OpenAI and Block, backed by Google, Microsoft, AWS, Cloudflare, and Bloomberg.&lt;/p&gt;
&lt;p&gt;MCP is now adopted by ChatGPT, Cursor, Gemini, Microsoft Copilot, and VS Code. Forrester predicts 30% of enterprise app vendors will launch their own MCP servers in 2026.&lt;/p&gt;
&lt;p&gt;The consensus forming: MCP wins for agent-to-tool connections. Google&apos;s A2A protocol - launched April 2025, backed by 50+ companies - handles agent-to-agent coordination. Both now live under the Agentic AI Foundation.&lt;/p&gt;
&lt;p&gt;If you are building agent infrastructure and not building on MCP, you are working against the emerging consensus. The protocol has network effects now.&lt;/p&gt;
&lt;h3&gt;Multi-Agent is Going Native&lt;/h3&gt;
&lt;p&gt;Three developments in the last month changed the landscape:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code Agent Teams&lt;/strong&gt; shipped as an experimental feature with Opus 4.6. A lead agent spawns teammates - each a full, independent Claude Code instance with its own context window. Shared task lists with dependency tracking. Inter-agent messaging. Worktree isolation per teammate. Addy Osmani&apos;s assessment: &quot;Let the problem guide the tooling, not the other way around. If a single agent in a focused session gets you there faster, use that.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VS Code released native multi-agent development support&lt;/strong&gt; on February 5, 2026. The IDE is becoming the agent orchestration surface. You can run Claude Code, Aider, Codex, OpenCode, and Amp in separate workspaces within one interface, each with Git worktree isolation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub launched Agentic Workflows&lt;/strong&gt; in technical preview on February 17, 2026. Markdown-defined workflows compiled to GitHub Actions YAML. Sandboxed execution with read-only repo access. Supports Claude Code, Codex, or Copilot as the agent. GitHub calls this &quot;Continuous AI&quot; - the agentic evolution of CI/CD. Eddie Aftandilian, GitHub Next principal researcher, describes it as capturing &quot;how autonomous agents extend the CI/CD model into judgment-based tasks.&quot;&lt;/p&gt;
&lt;p&gt;The direction is clear. Multi-agent orchestration is moving from custom infrastructure into the platforms themselves. If you built custom orchestration, expect parts of it to be absorbed. Build accordingly.&lt;/p&gt;
&lt;h2&gt;What Actually Works in Production&lt;/h2&gt;
&lt;p&gt;We track what works not by what seems impressive but by what ships reliably, passes verification, and survives contact with real codebases. Here is what the practitioner community converges on, cross-referenced with our own operational data.&lt;/p&gt;
&lt;h3&gt;Narrow Scope Wins&lt;/h3&gt;
&lt;p&gt;Systems solving specific, well-defined problems outperform ambitious general-purpose agents. Every time. The compounding error math is unforgiving: even at 99% accuracy per step, a 100-step task has only a 36.6% chance of succeeding. In practice, accuracy per step is lower than 99%.&lt;/p&gt;
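&lt;p&gt;The compounding math is one line of code:&lt;/p&gt;

```python
def task_success_probability(per_step_accuracy: float, steps: int) -> float:
    # Independent steps: one failure anywhere fails the whole task
    return per_step_accuracy ** steps

# 99% per-step accuracy over a 100-step task
p = task_success_probability(0.99, 100)  # ~0.366, the 36.6% figure above
print(f"{p:.1%}")
```

&lt;p&gt;The curve falls off fast: at 95% per-step accuracy the same 100-step task succeeds well under 1% of the time, which is why decomposition into short, verifiable tasks matters more than marginal model improvements.&lt;/p&gt;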
&lt;p&gt;Cognition&apos;s own performance review of Devin tells the story. PR merge rate doubled from 34% to 67%. Vulnerability fixes ran at 20x efficiency. Java migrations at 14x speed. But an independent evaluation by Answer.AI found only a 15% success rate on 20 attempted open-ended tasks. The pattern: strong on well-scoped tasks with clear acceptance criteria, unreliable on ambiguous work.&lt;/p&gt;
&lt;p&gt;The implication for team design: decompose aggressively. The unit of work for an agent should be a single GitHub issue with clear acceptance criteria, not &quot;build the feature.&quot; Time-box execution. Define what &quot;done&quot; looks like before the agent starts.&lt;/p&gt;
&lt;h3&gt;Human Checkpoints are Non-Negotiable&lt;/h3&gt;
&lt;p&gt;An Amazon AI engineer reportedly stated they know of &quot;zero companies who don&apos;t have a human in the loop&quot; for customer-facing AI. From the Hacker News practitioner thread on agent orchestrators, successful teams emphasized keeping agent counts low - 2-3 maximum - to avoid becoming a review bottleneck. One developer managing a 500K+ line codebase reported running multiple distinct tasks across agents, spending a few minutes on architectural reviews while glossing over client code specifics.&lt;/p&gt;
&lt;p&gt;The &quot;Claude writes, Codex reviews&quot; cross-model pattern is showing promise for quality assurance. Eval-driven loops using observability and benchmarks outperform pure code generation.&lt;/p&gt;
&lt;p&gt;The honest constraint: if you can barely keep up reviewing one agent&apos;s output, running four in parallel does not multiply throughput. It multiplies risk. Human review capacity is the actual bottleneck, not agent execution speed. We learned this through fleet sprint operations, where dispatching work to multiple machines revealed that the limiting factor was never agent throughput - it was the human&apos;s ability to review and merge.&lt;/p&gt;
&lt;h3&gt;Session Continuity Matters More Than You Think&lt;/h3&gt;
&lt;p&gt;Every agent-based system faces the same challenge: agents lose context between sessions. This is compounded when multiple agents coordinate on a shared codebase.&lt;/p&gt;
&lt;p&gt;The platforms handle in-session coordination reasonably well now. Claude Code Agent Teams provides shared task lists and inter-agent messaging. OpenAI&apos;s Agents SDK has session-based context management. Microsoft&apos;s Agent Framework maintains conversation history across handoffs.&lt;/p&gt;
&lt;p&gt;None of them solve cross-session persistence. When an agent ends a session at midnight and a new session starts at 8 AM on a different machine, the new agent knows nothing about what happened. The handoff problem - structured transfer of context, decisions, blockers, and next steps between sessions - remains unsolved at the platform level.&lt;/p&gt;
&lt;p&gt;Teams running agents seriously need a handoff system. The implementation details vary, but the requirements are consistent: persist what was accomplished, what was decided, what is blocked, and what should happen next. Make that available at session start. Without this, every session begins from scratch, and you lose the compounding benefit of continuous operation.&lt;/p&gt;
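&lt;p&gt;Those requirements fit in a small persistence layer. A minimal sketch - the field names are our choice, not a standard:&lt;/p&gt;

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    """One session's handoff record, persisted at session end."""
    session_id: str
    accomplished: list = field(default_factory=list)  # what shipped this session
    decided: list = field(default_factory=list)       # decisions made, with rationale
    blocked: list = field(default_factory=list)       # open blockers for the next agent
    next_steps: list = field(default_factory=list)    # what should happen next

def dump_handoff(handoff: Handoff) -> str:
    # Serialize at session end; write this wherever the whole fleet can read it
    return json.dumps(asdict(handoff), indent=2)

def load_handoff(raw: str) -> Handoff:
    # Load at session start so the new agent inherits prior context
    return Handoff(**json.loads(raw))
```

&lt;p&gt;The format matters less than the discipline: every session ends by writing one of these, and every session starts by reading the latest one.&lt;/p&gt;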
&lt;h3&gt;Fleet Operations Reveal the Real Failure Modes&lt;/h3&gt;
&lt;p&gt;Running agents on a single machine hides problems that fleet operations expose immediately:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale state.&lt;/strong&gt; When Machine A pushes to origin/main and Machine B has a local main that is 15 commits behind, Machine B&apos;s agent creates PRs against stale code. The fix: always branch from &lt;code&gt;origin/main&lt;/code&gt;, never local main. This cost multiple failed PRs before it was encoded as a mandatory pre-flight check.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Environment divergence.&lt;/strong&gt; Agent runtimes strip environment variables matching patterns like &lt;code&gt;TOKEN&lt;/code&gt;, &lt;code&gt;KEY&lt;/code&gt;, and &lt;code&gt;SECRET&lt;/code&gt; from subprocess environments. This is a security feature that becomes an operational hazard. The agent&apos;s preflight check passes because it tests &lt;code&gt;process.env.GH_TOKEN&lt;/code&gt; directly, but the &lt;code&gt;gh&lt;/code&gt; CLI it spawns never receives the token. The symptom is &quot;Bad credentials (HTTP 401)&quot; and the cause is invisible unless you know to look for it. Both Codex and Gemini CLI do this. Claude Code does it too unless explicitly configured otherwise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cascade failures.&lt;/strong&gt; One agent error spirals through coordinated work. Practitioners describe &quot;death spirals&quot; requiring semaphore-like protocols to force serialization on critical tasks. The more agents you run in parallel, the more likely one failure poisons shared state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The coordination tax.&lt;/strong&gt; Practitioners report handoff latency of 800-1200ms per transition between agents. A five-agent workflow can accumulate 4-6 seconds of pure handoff overhead while the actual LLM calls take only 2 seconds total. Framework overhead, not intelligence, dominates response time.&lt;/p&gt;
&lt;p&gt;These are not theoretical problems. They are operational realities that show up the first week you scale beyond a single machine.&lt;/p&gt;
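&lt;p&gt;The environment-variable failure mode in particular is easy to reproduce. A sketch that mimics the stripping behavior and shows why a preflight check must ask what the child process actually received, not what the parent sees - the pattern list is illustrative:&lt;/p&gt;

```python
import os
import re
import subprocess
import sys

SENSITIVE = re.compile(r"TOKEN|KEY|SECRET")

def stripped_env():
    # Mimic what some agent runtimes do: drop vars matching sensitive patterns
    return {k: v for k, v in os.environ.items() if not SENSITIVE.search(k)}

def child_sees(var, env):
    # The only reliable preflight: spawn a child and ask what it received
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(repr(os.environ.get('{var}')))"],
        env=env, capture_output=True, text=True,
    )
    return result.stdout.strip() != "None"
```

&lt;p&gt;Checking &lt;code&gt;process.env.GH_TOKEN&lt;/code&gt; in the parent tells you nothing about what the spawned &lt;code&gt;gh&lt;/code&gt; CLI receives; the check has to happen on the child&apos;s side of the boundary.&lt;/p&gt;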
&lt;h2&gt;What Breaks and Why&lt;/h2&gt;
&lt;p&gt;The Composio 2025 AI Agent Report identifies three root causes of agent pilot failures:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dumb RAG&lt;/strong&gt; - bad memory management, responsible for 51% of enterprise AI failures. Agents that cannot access the right context at the right time make confident, wrong decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brittle Connectors&lt;/strong&gt; - broken I/O between agents and external systems. Custom connectors on failed pilots burned $500K+ in engineering salary at some enterprises.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Polling Tax&lt;/strong&gt; - no event-driven architecture, wasting 95% of API calls checking for state changes that have not happened.&lt;/p&gt;
&lt;p&gt;From Anthropic&apos;s own engineering on their multi-agent research system: minor failures cascade into trajectory changes. Their solution was building resumable systems with graceful error handling rather than full restarts. The practical lesson: design for partial failure. Assume agents will fail mid-task and build the ability to resume from the last known good state.&lt;/p&gt;
&lt;p&gt;AI-generated code shows consistent blind spots. Authentication flows, input validation, and async race conditions are systematic weaknesses across all models. If your verification pipeline does not specifically test these areas, agent-generated code will ship bugs in predictable categories.&lt;/p&gt;
&lt;h3&gt;The Token Economics&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s engineering team found that token usage alone explains 80% of the performance variance in their BrowseComp evaluation of multi-agent systems. Multi-agent systems consume roughly 15x more tokens than single chat interactions, while single agents use about 4x more than chat. Claude Code uses 5.5x fewer tokens than Cursor for equivalent tasks in independent benchmarks, which matters when you are running 12 hours a day.&lt;/p&gt;
&lt;p&gt;Enterprise usage runs $1K-$5K+ per month in API costs for heavy usage. Usage varies 10x between maintenance and active development phases, making budgets unreliable. Annual maintenance of agent infrastructure - retraining, monitoring, security updates - runs 15-30% of total infrastructure cost.&lt;/p&gt;
&lt;p&gt;The cost trajectory is improving. Devin dropped from $500/month to $20/month with its 2.0 release in April 2025. The industry is moving toward pay-per-task pricing that aligns cost with outcomes rather than consumption.&lt;/p&gt;
&lt;p&gt;But today, cost management is a real operational concern. If you are not tracking token usage per task, you are flying blind.&lt;/p&gt;
&lt;h2&gt;What&apos;s Commoditized and What Isn&apos;t&lt;/h2&gt;
&lt;p&gt;Understanding what is becoming commodity versus what remains differentiated determines where to invest effort.&lt;/p&gt;
&lt;h3&gt;Commoditized (Don&apos;t Build This)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Foundation model intelligence.&lt;/strong&gt; GPT-4, Claude, Gemini are converging on capability. Switching costs between them are dropping.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool connectivity.&lt;/strong&gt; MCP is the universal standard. Generic MCP servers for Slack, GitHub, databases are proliferating - 20,000+ implementations exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Basic agent loops.&lt;/strong&gt; Every framework does this. Every IDE is adding native support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt libraries and templates.&lt;/strong&gt; Trivially reproducible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simple multi-agent orchestration.&lt;/strong&gt; Claude Code Agent Teams, VS Code multi-agent, GitHub Agentic Workflows are shipping this as native features.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Still Differentiated (Build This)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Domain-specific harness logic.&lt;/strong&gt; Workflow-specific orchestration, approval flows, error handling that encode business rules the platforms will not generalize.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execution trajectories.&lt;/strong&gt; The runs themselves become training data. Phil Schmid argues the competitive advantage shifts to &quot;the trajectories your harness captures.&quot; This is genuinely hard to replicate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration depth.&lt;/strong&gt; Months of connecting to real systems, handling edge cases, building institutional knowledge about what breaks. Deep vertical expertise creates switching costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational reliability.&lt;/strong&gt; Retry logic, cascade prevention, graceful degradation, handoff state management, fleet coordination. The boring infrastructure work that separates production from demos.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-session institutional memory.&lt;/strong&gt; What was decided, what failed, what the codebase looks like, what the customer needs. Platforms provide context windows. They do not provide institutional memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The &quot;Build to Delete&quot; Principle&lt;/h3&gt;
&lt;p&gt;Phil Schmid&apos;s prescription is worth internalizing: architect systems permitting rapid logic replacement. Manus refactored their agent harness five times in six months to remove rigid assumptions as models evolved. Every model release changes the optimal way to structure agents.&lt;/p&gt;
&lt;p&gt;The practical application: keep your custom layer thin. Own orchestration - what to run, where, and how to monitor it. Let the platform own execution - how the agent thinks, writes code, and uses tools. When the platform absorbs a capability you built, migrate to the native version. Do not fight it.&lt;/p&gt;
&lt;h2&gt;What You Should Watch&lt;/h2&gt;
&lt;h3&gt;Observational Memory&lt;/h3&gt;
&lt;p&gt;Mastra published a new approach to agent memory in early 2026. Instead of traditional RAG retrieval, two background agents - Observer and Reflector - compress conversation history into an append-only observation log that stays in context. Results: 94.87% on LongMemEval, 3-6x compression for text, 5-40x for tool-heavy workloads, and up to 10x cost reduction through prompt caching.&lt;/p&gt;
&lt;p&gt;This is significant because it addresses the cross-session memory problem differently than handoff documents or vector databases. The observation log forms a fixed prefix that benefits from provider prompt caching, which dramatically cuts costs on repeated interactions. If you are managing agent memory manually through handoff documents, this approach is worth evaluating.&lt;/p&gt;
&lt;h3&gt;GitHub Agentic Workflows&lt;/h3&gt;
&lt;p&gt;Three days old as of this writing, but potentially the most consequential development for teams running issue-to-PR agent pipelines. Markdown-defined workflows. Sandboxed execution. Native GitHub integration. Free with GitHub Actions. If this matures, self-hosted agent dispatch becomes harder to justify for standard workflows. Watch the security model and the reliability data as they come in.&lt;/p&gt;
&lt;h3&gt;Google&apos;s A2A Protocol&lt;/h3&gt;
&lt;p&gt;Agent-to-Agent protocol, launched April 2025, backed by 50+ companies. Task-based architecture: submitted, working, input-required, completed/failed/canceled. Designed as complementary to MCP - A2A handles agent-to-agent coordination while MCP handles agent-to-tool connections. Both now under the Agentic AI Foundation. If you are building agent-to-agent communication, check whether A2A fits before designing a proprietary protocol.&lt;/p&gt;
&lt;h3&gt;The Context Window Plateau&lt;/h3&gt;
&lt;p&gt;Context windows are plateauing at approximately 1 million tokens. The frontier is not bigger windows but better context management. This is why agent teams with isolated context per teammate are winning over single agents with massive context. Design your agent architecture for focused context, not maximal context.&lt;/p&gt;
&lt;h2&gt;Lessons from Hundreds of Sessions&lt;/h2&gt;
&lt;p&gt;We have been running this operation since January 2026. Here is what we would tell someone starting today.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start with session discipline, not agent count.&lt;/strong&gt; Start-of-day initialization that loads prior context, end-of-day handoffs that persist what happened. Get this right before you add a second agent. Most teams jump to multi-agent before they have reliable single-agent operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Encode your verification requirements.&lt;/strong&gt; Every unit of agent work should have a verification step that runs automatically. Typecheck, lint, format, test. If the agent cannot pass verification in three attempts, stop and escalate. Do not let agents brute-force through failing tests.&lt;/p&gt;
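&lt;p&gt;That gate fits in a few lines. A sketch with an injectable runner so the loop itself is testable; the three-attempt cutoff comes from the rule above, the function names are ours:&lt;/p&gt;

```python
def verify_or_escalate(run_step, steps, max_attempts=3):
    """Run every verification step (typecheck, lint, format, test).

    run_step(step) returns True on success. Retry the full suite up to
    max_attempts times; if it never passes, stop and escalate to a human
    instead of letting the agent brute-force through failing checks.
    """
    for _ in range(max_attempts):
        if all(run_step(step) for step in steps):
            return "pass"
    return "escalate"
```

&lt;p&gt;In practice &lt;code&gt;run_step&lt;/code&gt; shells out to your toolchain; the point is that &quot;escalate&quot; is a terminal state the agent cannot talk its way out of.&lt;/p&gt;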
&lt;p&gt;&lt;strong&gt;Make failure visible.&lt;/strong&gt; Log what agents attempt, what fails, what gets retried. When an agent stores a description as a secret value instead of the actual secret, you need the audit trail to catch it. When an agent silently loses environment variables because the runtime stripped them, you need the diagnostic tooling to see it. Trust but verify is not sufficient. Instrument and verify.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Decompose issues before dispatching.&lt;/strong&gt; A GitHub issue that says &quot;build the notification system&quot; is too large for an agent. Break it into issues with clear acceptance criteria, each completable in a single session. The overhead of decomposition is vastly less than the cost of an agent wandering off-scope on an ambiguous task.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Track what the platforms ship.&lt;/strong&gt; Claude Code Agent Teams, GitHub Agentic Workflows, VS Code multi-agent support - these are all shipping in the same month. The capabilities you build custom today may be native tomorrow. Keep your custom layer thin enough to migrate gracefully.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The human is the bottleneck.&lt;/strong&gt; Optimizing agent speed further does not improve total throughput once you are running three or more in parallel. The constraint moves to review capacity, integration oversight, and architectural decisions that agents cannot make. Design your workflow around the human&apos;s capacity, not the agents&apos;.&lt;/p&gt;
&lt;h2&gt;The Honest State of Play&lt;/h2&gt;
&lt;p&gt;We are in the transition from &quot;agents are magic&quot; to &quot;agents are tools with specific economics.&quot; The 40% project cancellation rate Gartner predicts is the market correcting for overpromising. The teams that survive are those that treat agent operations like engineering - with verification pipelines, failure budgets, session discipline, and operational instrumentation - not like a demo that scales itself.&lt;/p&gt;
&lt;p&gt;The technology is real. MCP standardized the tool layer. Multi-agent coordination is going native in the platforms. Context management techniques like observational memory are cutting costs by an order of magnitude. Models are getting better every quarter. The 20-hour autonomous coding task is plausibly within reach by year-end.&lt;/p&gt;
&lt;p&gt;But reliability remains the fundamental constraint. Human oversight remains non-negotiable. Integration work - connecting agents to real systems with real failure modes - is where most projects die. And the operational knowledge required to run agents in production - the environment variable gotchas, the stale-state bugs, the cascade failure patterns - that knowledge only comes from doing the work.&lt;/p&gt;
&lt;p&gt;The space rewards practitioners over theorists. Build the harness. Run the sessions. Log the failures. Share what you learn.&lt;/p&gt;
&lt;p&gt;That is where we stand.&lt;/p&gt;
</content:encoded><category>agent-operations</category><category>infrastructure</category><category>state-of-the-art</category><category>methodology</category></item><item><title>What Breaks When You Sprint with 10 AI Agents</title><link>https://venturecrane.com/articles/fleet-sprints-ai-agents/</link><guid isPermaLink="true">https://venturecrane.com/articles/fleet-sprints-ai-agents/</guid><description>Wave-based sprint execution across a fleet of machines works until local state drifts from remote. Here is what we learned.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Spawning one AI coding agent on a feature branch is straightforward. Spawning ten across four machines, organized into dependency waves, with each agent working an isolated worktree - that is where the failure modes get interesting.&lt;/p&gt;
&lt;p&gt;We built a sprint orchestrator that takes a set of GitHub issues, resolves their dependency graph, and executes them in parallel waves using Claude Code agents on git worktrees. A recent feature build was its largest test: 36 PRs across four waves, four machines, three hours. Most of it went well. The failures revealed specific problems worth documenting.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Orchestration Model&lt;/h2&gt;
&lt;p&gt;The sprint skill is a prompt-driven orchestrator. It does not manage long-running processes or maintain state between waves. Each invocation is stateless:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fetch the assigned GitHub issues&lt;/li&gt;
&lt;li&gt;Parse dependency annotations (&lt;code&gt;depends on #N&lt;/code&gt;, &lt;code&gt;blocked by #N&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Build a wave plan - issues with no unresolved dependencies go first, up to the machine&apos;s concurrency limit&lt;/li&gt;
&lt;li&gt;Create one git worktree per issue, each on a fresh branch from main&lt;/li&gt;
&lt;li&gt;Spawn all agents in a single message for true parallelism&lt;/li&gt;
&lt;li&gt;Wait for completion, collect results, update labels&lt;/li&gt;
&lt;/ol&gt;
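&lt;p&gt;Steps 2 and 3 above reduce to two small functions. This is an illustrative TypeScript sketch - the &lt;code&gt;Issue&lt;/code&gt; shape and function names are not the skill&apos;s actual code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface Issue {
  number: number;
  body: string;
}

// Step 2: pull &quot;depends on #N&quot; / &quot;blocked by #N&quot; annotations out of the issue body.
function parseDeps(body: string): number[] {
  const deps: number[] = [];
  for (const m of body.matchAll(/(?:depends on|blocked by) #(\d+)/gi)) {
    deps.push(Number(m[1]));
  }
  return deps;
}

// Step 3: issues whose dependencies are all merged form the next wave,
// capped at the machine concurrency limit.
function planWave(issues: Issue[], merged: number[], limit: number): number[] {
  return issues
    .filter(function (i) { return !merged.includes(i.number); })
    .filter(function (i) {
      return parseDeps(i.body).every(function (d) { return merged.includes(d); });
    })
    .slice(0, limit)
    .map(function (i) { return i.number; });
}
&lt;/code&gt;&lt;/pre&gt;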
&lt;p&gt;Each agent receives a self-contained prompt: the issue body, the worktree path, the branch name, and the project&apos;s verification command. The agent implements, runs verification, commits, pushes, and opens a PR. If it fails verification three times, it stops and reports failure instead of shipping broken code.&lt;/p&gt;
&lt;p&gt;The orchestrator only executes one wave per invocation. After Wave 1&apos;s PRs are reviewed and merged, you run the sprint skill again with the remaining issues. Wave 2 branches from the updated main. This eliminates inter-wave state management entirely - there is no state to manage.&lt;/p&gt;
&lt;p&gt;Machine concurrency is detected at runtime from the hostname. Stronger machines run three agents. Lighter ones run two. The fleet ran all four machines simultaneously during peak waves, putting up to ten agents in flight at once.&lt;/p&gt;
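&lt;p&gt;A minimal sketch of that detection step; the hostnames and limits below are placeholders, not the real fleet configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative hostname-to-concurrency map; real hostnames differ.
const limits: { [hostname: string]: number } = {
  &quot;fleet-heavy-1&quot;: 3,
  &quot;fleet-heavy-2&quot;: 3,
  &quot;fleet-light-1&quot;: 2,
  &quot;fleet-light-2&quot;: 2,
};

function concurrencyFor(hostname: string): number {
  // Unknown machines fall back to the conservative limit.
  return limits[hostname] ?? 2;
}
&lt;/code&gt;&lt;/pre&gt;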
&lt;h2&gt;What Worked&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Worktree isolation is the right abstraction.&lt;/strong&gt; Each agent gets its own copy of the codebase at a specific commit. No shared mutable state. No merge conflicts during implementation. Two agents can edit the same file in their respective worktrees without interference - conflicts only surface at PR review time, where they belong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single-issue agents stay focused.&lt;/strong&gt; Each agent gets one issue, one branch, one PR. No scope creep, no &quot;while I&apos;m here&quot; refactoring. The prompt explicitly constrains: &quot;NEVER modify files that are not relevant to your issue.&quot; This produces small, reviewable PRs. Thirty-six PRs sounds like a lot, but each one is a single coherent change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wave boundaries are natural review gates.&lt;/strong&gt; Between waves, the human reviews all PRs from the previous wave, resolves any conflicts, and merges. This catches integration issues before the next wave builds on top of them. It also means the human stays in the loop without becoming a bottleneck during implementation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Failure is contained.&lt;/strong&gt; When one agent fails - bad implementation, test failures, timeout - it reports failure and the orchestrator handles it. Other agents in the same wave are unaffected. The orchestrator offers a retry (fresh worktree, same prompt) or a skip. One retry max per issue to prevent infinite loops.&lt;/p&gt;
&lt;h2&gt;What Broke&lt;/h2&gt;
&lt;p&gt;Wave 4 produced two PRs with conflicts in every file. Not subtle merge conflicts - every file was different from what was on main.&lt;/p&gt;
&lt;p&gt;The root cause: two machines had stale local main branches. Between Wave 3 and Wave 4, the Wave 3 PRs were merged on GitHub. The machine running the orchestrator pulled main. The other two machines did not. When the sprint skill created worktrees on those machines, &lt;code&gt;git worktree add&lt;/code&gt; branched from their local HEAD - which was three waves behind remote.&lt;/p&gt;
&lt;p&gt;The agents had no way to detect this. Their worktrees were internally consistent. Tests passed. Code compiled. The PRs opened successfully. But the diffs showed every change from the previous three waves as modifications, because the base commit was ancient.&lt;/p&gt;
&lt;p&gt;We closed both PRs and reimplemented the work on a machine with current main. Two agents&apos; worth of correct implementation, discarded because of a missing &lt;code&gt;git pull&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A separate machine hit intermittent shell failures - commands returning empty output or timing out. The agents hit their retry limits and reported failure. We reassigned those issues to healthy machines. The root cause was never identified, which is its own lesson about fleet reliability.&lt;/p&gt;
&lt;h2&gt;The Fix&lt;/h2&gt;
&lt;p&gt;The sprint skill had one implicit assumption: &quot;local main is current.&quot; That is true if someone recently pulled. It is false after merging PRs on GitHub between waves, which is exactly what happens in every multi-wave sprint.&lt;/p&gt;
&lt;p&gt;The fix is a sync gate at the top of the worktree setup phase:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git fetch origin main
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compare &lt;code&gt;git rev-parse main&lt;/code&gt; against &lt;code&gt;git rev-parse origin/main&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Match&lt;/strong&gt;: Proceed. Local is current.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Local behind&lt;/strong&gt;: Fast-forward with &lt;code&gt;git merge --ff-only origin/main&lt;/code&gt;. If the fast-forward fails (local has diverged), stop with an error. Diverged main requires human judgment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Local ahead&lt;/strong&gt;: Warn but proceed. The operator may have intentional local commits.&lt;/li&gt;
&lt;/ul&gt;
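&lt;p&gt;The whole gate reduces to one decision over the two SHAs plus an ancestry check (&lt;code&gt;git merge-base --is-ancestor&lt;/code&gt;). A sketch with illustrative names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// localIsAncestor: local main is an ancestor of origin/main (i.e. behind).
// remoteIsAncestor: origin/main is an ancestor of local main (i.e. ahead).
function syncGate(
  localSha: string,
  remoteSha: string,
  localIsAncestor: boolean,
  remoteIsAncestor: boolean
): string {
  if (localSha === remoteSha) return &quot;proceed&quot;;    // local is current
  if (localIsAncestor) return &quot;fast-forward&quot;;      // strictly behind: ff-only merge
  if (remoteIsAncestor) return &quot;warn-and-proceed&quot;; // local ahead: operator commits
  return &quot;halt&quot;;                                   // diverged: human judgment
}
&lt;/code&gt;&lt;/pre&gt;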
&lt;p&gt;Three lines of logic that prevent an entire wave of wasted work.&lt;/p&gt;
&lt;h2&gt;Lessons&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;State management happens between waves, not during them.&lt;/strong&gt; During a wave, worktree isolation handles everything. Between waves, the fleet&apos;s local state must be synchronized with remote before the next wave launches. The orchestrator now enforces this, but the broader principle applies to any multi-machine agent workflow: the dangerous moment is the transition, not the execution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent correctness is local; integration correctness is global.&lt;/strong&gt; Each agent can produce a perfect implementation against the code it can see. That means nothing if the code it can see is stale. Verification commands (lint, typecheck, test) validate internal consistency. They cannot validate that the agent is working against the right baseline. That check must happen before the agent starts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Machine health is not guaranteed.&lt;/strong&gt; A machine that worked yesterday can fail today with no configuration change. Shell timeouts, disk issues, network flakiness - intermittent failures that agents can&apos;t diagnose or fix. Pre-flight health checks before spawning agents would catch machines having a bad day before they waste a wave slot. We have not built this yet, but the need is clear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Throughput is gated by the human, not the agents.&lt;/strong&gt; Ten agents in parallel can produce ten PRs in 15 minutes. Reviewing, merging, and sequencing those PRs takes longer than producing them. The effective throughput of this sprint was roughly 10 issues per session, with agent time measured in minutes and human time measured in hours. Optimizing agent speed further would not improve total throughput. Optimizing the review pipeline would.&lt;/p&gt;
</content:encoded><category>agent-architecture</category><category>process</category></item><item><title>Finding Four Auth Vulnerabilities in One Code Review</title><link>https://venturecrane.com/articles/four-auth-vulnerabilities-one-code-review/</link><guid isPermaLink="true">https://venturecrane.com/articles/four-auth-vulnerabilities-one-code-review/</guid><description>How AI-generated prototype code accumulates auth debt and how one code review session catches it systematically.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Four authentication vulnerabilities, all in production, all exploitable, all introduced during prototyping, all found in a single code review session. None of them were bugs in the traditional sense. The code worked. The tests passed. Every endpoint returned the right data for the right requests. The problem was that they also returned the right data for the wrong requests.&lt;/p&gt;
&lt;p&gt;The app is a family expense tracker for shared custody situations. It was built rapidly with AI agent assistance - functional prototype to working API in days, not weeks. That speed came with a cost we did not discover until we sat down to review the auth layer systematically.&lt;/p&gt;
&lt;p&gt;The cost was not one vulnerability. It was a pattern.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Vulnerability 1: The Header That Trusts Anyone&lt;/h2&gt;
&lt;p&gt;The auth middleware had a fallback path. If no JWT was present in the request, it checked for an &lt;code&gt;X-User-Id&lt;/code&gt; header. If that header existed, the middleware trusted it. No signature verification. No token validation. Just a header value treated as authenticated identity.&lt;/p&gt;
&lt;p&gt;Any HTTP client could impersonate any user by setting a single header:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /api/expenses
X-User-Id: victim-user-id-here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is the entire attack. No token theft, no session hijacking, no cryptographic exploit. Set a header, become anyone.&lt;/p&gt;
&lt;p&gt;The root cause was prototyping convenience. During early development, the &lt;code&gt;X-User-Id&lt;/code&gt; header was a shortcut for testing API endpoints without setting up JWT mocking infrastructure. It let agents and developers hit endpoints quickly, verify response shapes, and iterate on the API surface. Useful during a spike. Catastrophic in production.&lt;/p&gt;
&lt;p&gt;The fix in PR #141 was straightforward: remove the fallback entirely. An &lt;code&gt;X-User-Id&lt;/code&gt; header with no JWT now returns 401. The header was also removed from the CORS &lt;code&gt;allowHeaders&lt;/code&gt; configuration so browsers would not even send it in preflight responses. Every test file - all 10 of them - was updated to use JWT Bearer authentication instead of the convenience header. 192 tests pass.&lt;/p&gt;
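&lt;p&gt;The shape of the fixed middleware, sketched below; the &lt;code&gt;verifyJwt&lt;/code&gt; helper and its signature are illustrative, not the app&apos;s actual code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Returns the authenticated user ID, or null (a 401 upstream).
// There is no X-User-Id fallback: no Bearer token means no identity.
function authenticate(
  authorizationHeader: string | null,
  verifyJwt: (token: string) =&gt; string | null
): string | null {
  if (!authorizationHeader) return null;
  if (!authorizationHeader.startsWith(&quot;Bearer &quot;)) return null;
  return verifyJwt(authorizationHeader.slice(&quot;Bearer &quot;.length));
}
&lt;/code&gt;&lt;/pre&gt;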
&lt;hr /&gt;
&lt;h2&gt;Vulnerability 2: JWTs Without Issuer Validation&lt;/h2&gt;
&lt;p&gt;The JWT verification function checked two things: signature validity and token expiry. It did not check who issued the token.&lt;/p&gt;
&lt;p&gt;This matters because the app uses Clerk for authentication. Clerk applications share a signing key infrastructure. A valid JWT from a different Clerk application - one that has nothing to do with this app - could pass signature verification and be treated as an authenticated session.&lt;/p&gt;
&lt;p&gt;The attack surface is narrow but real. Anyone running their own Clerk application could generate JWTs that the app would accept. The signature is valid (same key pool), the token is not expired, and the middleware has no way to distinguish &quot;this token was issued for our app&quot; from &quot;this token was issued for a completely different app.&quot;&lt;/p&gt;
&lt;p&gt;The root cause: during initial auth implementation, signature and expiry felt like sufficient validation. The &lt;code&gt;iss&lt;/code&gt; claim was not checked because &quot;we only have one Clerk app&quot; - which was true at the time but is not a security invariant.&lt;/p&gt;
&lt;p&gt;PR #140 added issuer validation. The expected issuer URL is derived from the existing &lt;code&gt;CLERK_DOMAIN&lt;/code&gt; environment variable, so no new configuration was needed for the common case. A new optional &lt;code&gt;CLERK_ISSUER_URL&lt;/code&gt; environment variable allows explicit override when the derived URL does not match. The comparison is exact string match - no substring matching, no regex, no &quot;starts with&quot; logic that could be tricked with a carefully crafted issuer string.&lt;/p&gt;
&lt;p&gt;Eight new tests cover: wrong issuer, missing &lt;code&gt;iss&lt;/code&gt; claim, correct issuer, no validation when the env var is unset, substring rejection, and middleware-level integration. 199 tests pass.&lt;/p&gt;
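&lt;p&gt;A sketch of the check, assuming the token has already passed signature and expiry verification; the exact derivation from &lt;code&gt;CLERK_DOMAIN&lt;/code&gt; is simplified here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Exact string comparison only: no startsWith or substring logic that a
// crafted issuer such as https://real.example.com.evil.com could satisfy.
function issuerIsValid(
  iss: string | undefined,
  clerkDomain: string,
  issuerOverride?: string
): boolean {
  const expected = issuerOverride ?? `https://${clerkDomain}`;
  return iss === expected;
}
&lt;/code&gt;&lt;/pre&gt;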
&lt;hr /&gt;
&lt;h2&gt;Vulnerability 3: The Endpoint Anyone Could Call&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;/users/sync&lt;/code&gt; endpoint creates user accounts and updates email addresses. It had no authentication. None.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /api/users/sync
Content-Type: application/json

{&quot;userId&quot;: &quot;anything&quot;, &quot;email&quot;: &quot;attacker@example.com&quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That creates a user account - or, if the user ID already exists, overwrites that user&apos;s email address. No JWT required. No API key. No webhook signature. An open door to account creation and email takeover.&lt;/p&gt;
&lt;p&gt;The root cause is a common pattern in AI-generated prototype code: an endpoint designed for a specific integration that gets exposed as a general API route. This endpoint was built for Clerk webhook callbacks. The reasoning was &quot;Clerk will be the only caller&quot; - which might have been true in development but was enforced by nothing. The endpoint was a regular route in the API, reachable by anyone who could send an HTTP POST.&lt;/p&gt;
&lt;p&gt;PR #139 added auth middleware to the endpoint. More importantly, it changed the trust model: the user ID is now derived from the JWT claims, not from the request body. The frontend was updated to pass a Clerk session JWT via the Authorization header. The endpoint no longer takes the caller&apos;s word for who they are. 191 tests pass.&lt;/p&gt;
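&lt;p&gt;The trust-model change, sketched with illustrative names: the user ID comes from the verified token&apos;s claims, and the body&apos;s &lt;code&gt;userId&lt;/code&gt; field is never consulted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// claims is null when JWT verification failed; sub is the Clerk user ID.
function resolveSyncUserId(
  claims: { sub?: string } | null,
  body: { userId?: string; email?: string }
): string | null {
  if (!claims) return null;   // unauthenticated: reject with 401 upstream
  if (!claims.sub) return null;
  return claims.sub;          // body.userId is ignored entirely
}
&lt;/code&gt;&lt;/pre&gt;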
&lt;hr /&gt;
&lt;h2&gt;Vulnerability 4: CORS for Everyone on Vercel&lt;/h2&gt;
&lt;p&gt;CORS was configured with this origin pattern: &lt;code&gt;*.vercel.app&lt;/code&gt;. Any application deployed on Vercel could make authenticated cross-origin requests to the app&apos;s API.&lt;/p&gt;
&lt;p&gt;Vercel is one of the most popular deployment platforms in the JavaScript ecosystem. Millions of applications are deployed there. Every single one of them was an allowed origin for authenticated API requests to the app.&lt;/p&gt;
&lt;p&gt;The root cause: preview deploys. During development, every PR gets a unique Vercel preview URL. The wildcard pattern ensured that preview deploys could hit the API without CORS errors. It worked perfectly for development. It also worked perfectly for any other application on Vercel.&lt;/p&gt;
&lt;p&gt;PR #135 tightened the pattern. The production domain is exact-matched. Preview deploys match against the specific pattern for the project&apos;s Vercel deployments, not the entire &lt;code&gt;*.vercel.app&lt;/code&gt; namespace. Random Vercel-deployed origins now receive CORS rejections.&lt;/p&gt;
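&lt;p&gt;A sketch of the tightened check; the production origin and preview pattern below are placeholders for the project&apos;s real values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Exact match for production, a project-scoped regex for preview deploys.
const PROD_ORIGIN = &quot;https://app.example.com&quot;;
const PREVIEW_PATTERN = /^https:\/\/myproject-[a-z0-9]+-myteam\.vercel\.app$/;

function originAllowed(origin: string): boolean {
  if (origin === PROD_ORIGIN) return true;
  return PREVIEW_PATTERN.test(origin);
}
&lt;/code&gt;&lt;/pre&gt;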
&lt;hr /&gt;
&lt;h2&gt;Bonus: 39 Catch Blocks Leaking Information&lt;/h2&gt;
&lt;p&gt;While reviewing the auth layer, we found a fifth issue that was not an authentication vulnerability but compounded the risk. Across 12 route files, 39 error handlers were returning &lt;code&gt;details: String(error)&lt;/code&gt; in their JSON responses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;catch (error) {
  return Response.json(
    { error: &apos;Failed to fetch expenses&apos;, details: String(error) },
    { status: 500 }
  );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the best case, this leaks internal error messages. In the worst case, it leaks stack traces with file paths, database connection strings from failed queries, or third-party API error responses that include account identifiers. Combined with the other vulnerabilities - particularly the unauthenticated sync endpoint - an attacker could trigger errors intentionally and harvest the leaked details.&lt;/p&gt;
&lt;p&gt;PR #136 removed the &lt;code&gt;details&lt;/code&gt; field from all 39 catch blocks. Server-side &lt;code&gt;console.error()&lt;/code&gt; logging was retained so the information is still available for debugging. Clients now receive a generic error message and a status code. The internal details stay internal.&lt;/p&gt;
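&lt;p&gt;The sanitized counterpart of the catch block above looks roughly like this (the wrapper function is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Full error details go to server logs; the client gets a generic message.
function sanitizeError(error: unknown, publicMessage: string): { status: number; body: { error: string } } {
  console.error(publicMessage, error); // stack traces stay server-side
  return { status: 500, body: { error: publicMessage } }; // no details field
}
&lt;/code&gt;&lt;/pre&gt;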
&lt;hr /&gt;
&lt;h2&gt;The Pattern: Auth Debt&lt;/h2&gt;
&lt;p&gt;All four vulnerabilities share a root cause: prototyping shortcuts that never got removed.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Shortcut&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;X-User-Id&lt;/code&gt; header fallback&lt;/td&gt;
&lt;td&gt;&quot;Easier to test without setting up JWT mocking&quot;&lt;/td&gt;
&lt;td&gt;Any client impersonates any user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No issuer validation&lt;/td&gt;
&lt;td&gt;&quot;The signature check is sufficient for now&quot;&lt;/td&gt;
&lt;td&gt;Cross-application token acceptance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unauthenticated &lt;code&gt;/users/sync&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&quot;Clerk will be the only caller&quot;&lt;/td&gt;
&lt;td&gt;Open account creation and email overwrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wildcard CORS&lt;/td&gt;
&lt;td&gt;&quot;We need preview deploys to work&quot;&lt;/td&gt;
&lt;td&gt;Any Vercel app makes authenticated requests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each shortcut was individually reasonable. AI agents write working code fast. They create functional endpoints, add test helpers for quick iteration, and use convenient defaults that make the immediate task easier. The code works. Tests pass. The prototype ships.&lt;/p&gt;
&lt;p&gt;But each shortcut is also a piece of security debt. And unlike technical debt - where the cost is slower development velocity - security debt compounds silently. There is no linter warning for &quot;this endpoint should have auth.&quot; There is no test failure for &quot;this CORS config is too permissive.&quot; The code runs correctly right up until someone exploits it.&lt;/p&gt;
&lt;p&gt;We call this auth debt: the gap between &quot;the code works&quot; and &quot;the code is secure.&quot; It accumulates naturally in AI-assisted rapid prototyping because the agent&apos;s objective is to make the feature work, and every prototyping shortcut achieves that objective. The shortcuts are invisible to automated quality checks because they are not bugs - they are missing constraints.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Automated Checks Miss Auth Debt&lt;/h2&gt;
&lt;p&gt;The standard CI pipeline - typecheck, lint, format, test - verified all of this code as correct. Every PR that introduced a vulnerability had green CI.&lt;/p&gt;
&lt;p&gt;TypeScript does not know that &lt;code&gt;X-User-Id&lt;/code&gt; should not be trusted. ESLint does not flag missing issuer validation. Prettier does not care about CORS origins. The test suite verified that authenticated requests succeeded, but none of the tests verified that unauthenticated requests failed.&lt;/p&gt;
&lt;p&gt;This is the gap. Positive testing (&quot;does the right request get the right response?&quot;) was thorough. Negative testing (&quot;does the wrong request get rejected?&quot;) was almost entirely absent. The original test suite had tests for the &lt;code&gt;X-User-Id&lt;/code&gt; fallback path, but they were testing that it worked, not that it should not exist.&lt;/p&gt;
&lt;p&gt;After the fix sprint, the test approach inverted. Every auth-related test now has a negative counterpart:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JWT with wrong issuer returns 401&lt;/li&gt;
&lt;li&gt;Request with &lt;code&gt;X-User-Id&lt;/code&gt; but no JWT returns 401&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/users/sync&lt;/code&gt; without Authorization header returns 401&lt;/li&gt;
&lt;li&gt;Cross-origin request from a non-project Vercel domain gets CORS rejection&lt;/li&gt;
&lt;li&gt;Error responses contain no &lt;code&gt;details&lt;/code&gt; field&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The total test count went from 165 to 199. The 34 new tests are almost entirely negative cases - verifying that things that should fail do fail.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Fix Sprint&lt;/h2&gt;
&lt;p&gt;PRs #132 through #144 shipped in a single session. The progression was deliberate - each fix built confidence for the next:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#132&lt;/td&gt;
&lt;td&gt;Fix missing /csv suffix on export API paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#133&lt;/td&gt;
&lt;td&gt;Align notification preferences API paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#134&lt;/td&gt;
&lt;td&gt;Add auth middleware test coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#135&lt;/td&gt;
&lt;td&gt;Tighten CORS to reject non-project Vercel origins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#136&lt;/td&gt;
&lt;td&gt;Sanitize error responses (39 catch blocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#137&lt;/td&gt;
&lt;td&gt;Add ESLint to API worker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#138&lt;/td&gt;
&lt;td&gt;Integration tests for route handlers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#139&lt;/td&gt;
&lt;td&gt;Require auth on /users/sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#140&lt;/td&gt;
&lt;td&gt;Add issuer validation to JWT verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#141&lt;/td&gt;
&lt;td&gt;Remove X-User-Id auth bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#143-144&lt;/td&gt;
&lt;td&gt;PWA support (post-security sprint)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The ordering matters. We started with path corrections and test infrastructure (#132-#134), which let us verify the existing behavior before changing it. Then restrictive changes (#135-#136) that tighten the surface area without modifying auth logic, followed by more tooling and test depth (#137-#138). Then the actual auth fixes (#139-#141), each one building on the test infrastructure established earlier.&lt;/p&gt;
&lt;p&gt;The entire sprint - discovery, fixes, tests, verification - was a single agent session. Not because the changes were trivial, but because the scope was well-defined. &quot;Find auth problems and fix them&quot; is a clearer objective than &quot;make the app better.&quot; Specificity drives velocity.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What This Means for AI-Assisted Prototyping&lt;/h2&gt;
&lt;p&gt;The takeaway is not &quot;don&apos;t use AI agents to build prototypes.&quot; The takeaway is that rapid prototyping with AI agents has a specific, predictable failure mode: auth debt.&lt;/p&gt;
&lt;p&gt;Every team using AI to rapidly scaffold APIs will accumulate the same kind of shortcuts. The test header that becomes a production bypass. The validation that seems sufficient until you realize it is not. The endpoint that works correctly but has no access control. The CORS policy that is permissive because restrictive was inconvenient during development.&lt;/p&gt;
&lt;p&gt;The fix is not slower prototyping. The fix is a dedicated security review pass before anything is exposed to real users. Not a vague &quot;review the code&quot; pass - a specific checklist:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;For every endpoint&lt;/strong&gt;: what happens when the request has no auth token? Verify it returns 401.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For every auth check&lt;/strong&gt;: what claims are validated? Signature alone is not enough.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For every middleware fallback&lt;/strong&gt;: was it added for testing convenience? If yes, remove it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For CORS&lt;/strong&gt;: does the origin pattern match only your domains, or does it match an entire platform?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For error responses&lt;/strong&gt;: what information reaches the client? Stack traces and internal paths should never leave the server.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This checklist found four vulnerabilities in a single codebase. We would bet it finds at least two in any AI-scaffolded API that has not had a dedicated security review.&lt;/p&gt;
&lt;p&gt;The speed of AI-assisted prototyping is genuine. The risk is also genuine. The solution is not to choose between speed and security. It is to build the security review into the pipeline as a distinct phase, run it before the prototype becomes the product, and treat every prototyping convenience as a line item that must be explicitly resolved - kept with justification or removed.&lt;/p&gt;
&lt;p&gt;Prototype fast. Review thoroughly. Ship with both.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The app is a family expense tracker built with AI agent assistance. A single code review session found four authentication vulnerabilities - all prototyping shortcuts that survived into production. PRs #132 through #144 fixed the auth layer, added 34 negative test cases, and sanitized 39 error handlers. All four vulnerabilities followed the same pattern: a convenience that was reasonable during development and exploitable in production.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>security</category><category>code-quality</category><category>agent-workflow</category></item><item><title>What Running Multiple Ventures with AI Agents Actually Costs</title><link>https://venturecrane.com/articles/what-ai-agents-actually-cost/</link><guid isPermaLink="true">https://venturecrane.com/articles/what-ai-agents-actually-cost/</guid><description>Every line item for running an AI-native dev lab across multiple projects. Total: about $490 a month.</description><pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Running multiple software ventures simultaneously with AI coding agents sounds expensive. It is not - at least, not in the ways you would expect. We run several active projects across a fleet of development machines, with AI agents doing the bulk of the coding work. Here is what it actually costs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Total monthly cost: roughly $490.&lt;/strong&gt; The breakdown that follows covers every line item: infrastructure, hosting, secrets management, networking, AI subscriptions, hardware, internet, domains, and email. Where something runs on a free tier, we say so. Where we pay, we give the number.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Infrastructure: Cloudflare ($5/month)&lt;/h2&gt;
&lt;p&gt;Our entire backend runs on Cloudflare&apos;s developer platform. Multiple Workers handle the context API, a GitHub webhook classifier, and venture-specific APIs. D1 provides the database. We ran on the free tier for months, but as the portfolio grew, one venture&apos;s API hit 90% of the daily Workers KV limit - so we upgraded to the Workers Paid plan.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Workers Paid plan ($5/month):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unlimited requests (no daily cap)&lt;/li&gt;
&lt;li&gt;30s CPU time per invocation&lt;/li&gt;
&lt;li&gt;10 million KV reads per day&lt;/li&gt;
&lt;li&gt;1 million KV writes per day&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The trigger for upgrading was KV usage, not Workers requests or CPU. One venture uses KV for rate limiting, JWT key caching, and error logging on every API request. At scale, those operations add up. The free tier allows 100,000 KV reads and 1,000 writes per day - enough for internal tooling, but not enough once a user-facing application starts generating real traffic.&lt;/p&gt;
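&lt;p&gt;A back-of-envelope sketch of why per-request KV writes outgrow the free tier; the per-request operation counts here are assumptions, not measured numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const FREE_KV_WRITES_PER_DAY = 1000;

function writesPerDay(requestsPerDay: number, writesPerRequest: number): number {
  return requestsPerDay * writesPerRequest;
}

// Assumed workload: 450 requests/day, each doing two KV writes
// (a rate-limit counter bump plus an error-log entry).
const used = writesPerDay(450, 2);
const pctOfCap = (used / FREE_KV_WRITES_PER_DAY) * 100; // 90 percent of the cap
&lt;/code&gt;&lt;/pre&gt;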
&lt;p&gt;&lt;strong&gt;D1 free tier:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;5 million rows read per day&lt;/li&gt;
&lt;li&gt;100,000 rows written per day&lt;/li&gt;
&lt;li&gt;5 GB total storage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our D1 databases store sessions, handoffs, enterprise knowledge notes, operational documentation, and venture-specific data. Total storage is measured in megabytes. D1 remains comfortably within the free tier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R2 free tier (available, barely used):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 GB storage per month&lt;/li&gt;
&lt;li&gt;1 million Class A operations per month&lt;/li&gt;
&lt;li&gt;10 million Class B operations per month&lt;/li&gt;
&lt;li&gt;Zero egress fees&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We previously used R2 for evidence storage in an earlier architecture. After simplifying, R2 usage dropped to near zero. The free tier remains available if we need object storage again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: $5&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We ran on the free tier from launch through mid-February 2026 with no issues. The upgrade was not forced by Cloudflare&apos;s pricing model being restrictive - it was a natural consequence of a venture moving from internal tooling to production traffic. At $5/month for the entire account, this remains one of the cheapest infrastructure line items in the stack.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Hosting: Vercel ($20/month)&lt;/h2&gt;
&lt;p&gt;The frontend applications deploy to Vercel&apos;s Pro plan. Seven projects across the portfolio share a single team account: a writing app, an expense tracker, an auction intelligence dashboard, and several supporting services.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro plan ($20/user/month, 1 seat):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$20 included usage credit per month&lt;/li&gt;
&lt;li&gt;Unlimited preview deployments&lt;/li&gt;
&lt;li&gt;Serverless and edge functions&lt;/li&gt;
&lt;li&gt;Analytics and performance monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The $20 credit covers build minutes, function invocations, and bandwidth. During normal development, usage stays well within the credit. During heavy sprints - like pushing a venture toward launch - build minutes spike and can exceed the credit by $15-35. This is expected: build minutes scale with deployment frequency, and active development means frequent deploys.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: ~$20&lt;/strong&gt; (base plan, with occasional overages during heavy development)&lt;/p&gt;
&lt;p&gt;The Hobby tier (free) works for personal projects, but commercial use requires Pro. At $20/month for hosting seven projects across four ventures with serverless functions and preview deployments, this is reasonable. There is no mid-tier upgrade - the next step is Enterprise, which does not make sense at this scale.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Source Control: GitHub ($8/month)&lt;/h2&gt;
&lt;p&gt;All repositories live in a single GitHub organization on the Team plan ($4/user/month, 2 seats).&lt;/p&gt;
&lt;p&gt;What we use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Private repositories for all venture codebases&lt;/li&gt;
&lt;li&gt;GitHub Issues for work tracking (with label-based status workflows)&lt;/li&gt;
&lt;li&gt;GitHub Actions for CI/CD (typecheck, lint, test, security scanning, doc sync)&lt;/li&gt;
&lt;li&gt;Pull requests and code review&lt;/li&gt;
&lt;li&gt;Org-wide branch protection rulesets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GitHub Actions free tier:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2,000 CI/CD minutes per month (for private repos; unlimited for public repos)&lt;/li&gt;
&lt;li&gt;500 MB of GitHub Packages storage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our CI runs are lightweight - TypeScript compilation, ESLint, Prettier formatting checks, and a small test suite. Each run finishes in under two minutes. We also run daily security scans (npm audit, Gitleaks) via scheduled workflows. Actions usage stays well within the free tier.&lt;/p&gt;
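&lt;p&gt;As a sketch, the checks above fit in one short workflow. This is representative rather than our exact file - the script names assume the usual &lt;code&gt;package.json&lt;/code&gt; conventions:&lt;/p&gt;

```yaml
# .github/workflows/ci.yml - illustrative; script names assume package.json defines them
name: CI
on: [push, pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run typecheck
      - run: npm run lint
      - run: npm test
```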
&lt;p&gt;&lt;strong&gt;Monthly cost: $8&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Team plan is worth the $8 for org-wide branch protection rulesets alone. Without them, you&apos;re relying on convention to prevent force-pushes to main across multiple repos. That works until it doesn&apos;t.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Secrets Management: Infisical ($0/month)&lt;/h2&gt;
&lt;p&gt;Every project has its own set of API keys, auth tokens, and configuration secrets. These need to be available on every development machine and injected into agent sessions at launch time, without ever touching disk in plaintext.&lt;/p&gt;
&lt;p&gt;We use Infisical&apos;s cloud-hosted free tier. All ventures share a single Infisical project, organized by path (&lt;code&gt;/alpha&lt;/code&gt;, &lt;code&gt;/beta&lt;/code&gt;, etc.) with separate production and development environments.&lt;/p&gt;
&lt;p&gt;The free tier covers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unlimited secrets (within 3 projects and 3 environments)&lt;/li&gt;
&lt;li&gt;Basic access controls&lt;/li&gt;
&lt;li&gt;CLI integration for runtime secret injection&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our launcher CLI fetches secrets from Infisical at session start and injects them as environment variables. For remote SSH sessions, we use Infisical&apos;s Machine Identity (Universal Auth) instead of interactive login.&lt;/p&gt;
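&lt;p&gt;The injection step itself is small. A minimal sketch of the pattern in TypeScript - the function names are illustrative, not our actual launcher, and the secrets object stands in for whatever the Infisical CLI or API returns:&lt;/p&gt;

```typescript
import { spawn } from "node:child_process";

type Env = { [key: string]: string | undefined };

// Merge fetched secrets into the inherited environment.
// Policy choice made explicit here: secrets override any variable of the same name.
export function injectSecrets(baseEnv: Env, secrets: Env): Env {
  return { ...baseEnv, ...secrets };
}

// Launch an agent CLI with secrets present only in the child process
// environment - nothing is written to disk in plaintext.
export function launchAgent(command: string, args: string[], secrets: Env) {
  return spawn(command, args, {
    env: injectSecrets(process.env, secrets),
    stdio: "inherit",
  });
}
```

&lt;p&gt;The important property is that the merge happens in memory at launch time; ending the session leaves no secret material behind.&lt;/p&gt;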
&lt;p&gt;&lt;strong&gt;Monthly cost: $0&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Infisical is also open source, so self-hosting is an option if you outgrow the free tier or need advanced features like automatic rotation. We have not needed to self-host yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Networking: Tailscale ($0/month)&lt;/h2&gt;
&lt;p&gt;Multiple development machines - some at a desk, some portable, some always-on servers - all need to talk to each other: SSH between machines, remote agent sessions from mobile devices, fleet management scripts that touch every box.&lt;/p&gt;
&lt;p&gt;Tailscale&apos;s free Personal plan covers this completely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Up to 100 devices&lt;/li&gt;
&lt;li&gt;Up to 3 users&lt;/li&gt;
&lt;li&gt;WireGuard-encrypted mesh networking&lt;/li&gt;
&lt;li&gt;MagicDNS for hostname resolution&lt;/li&gt;
&lt;li&gt;NAT traversal (works behind any firewall or cellular connection)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We run five machines on the Tailscale mesh. Each gets a stable 100.x.x.x IP address. SSH config uses these IPs, so connections work identically whether you are on the same local network or connecting from a phone hotspot in a coffee shop.&lt;/p&gt;
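&lt;p&gt;In practice that means one stanza per machine in &lt;code&gt;~/.ssh/config&lt;/code&gt;. Hostnames and addresses here are made up:&lt;/p&gt;

```
# ~/.ssh/config - tailnet IPs are stable, so these entries never change
Host fleet-server
    HostName 100.64.0.10
    User dev

Host fleet-laptop
    HostName 100.64.0.11
    User dev
```

&lt;p&gt;With MagicDNS enabled, the &lt;code&gt;HostName&lt;/code&gt; lines can be machine names instead of IPs.&lt;/p&gt;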
&lt;p&gt;&lt;strong&gt;Monthly cost: $0&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tailscale replaces what would otherwise require a VPN server, dynamic DNS, port forwarding configuration, and hours of networking debugging. The free tier is not a stripped-down trial - it is the full product for personal and small-team use.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;AI Subscriptions: The Real Expense ($245/month)&lt;/h2&gt;
&lt;p&gt;This is where the money goes. AI subscriptions are the single largest line item.&lt;/p&gt;
&lt;p&gt;We use three AI providers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;What It Covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Max 20x&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;Claude Code (primary coding agent), Claude chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Plus&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;Codex CLI, GPT for second-opinion tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Workspace/Gemini&lt;/td&gt;
&lt;td&gt;~$25&lt;/td&gt;
&lt;td&gt;Gemini CLI, Google Workspace productivity suite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Claude Code through Anthropic&apos;s Max plan is the workhorse. On a typical day, we run 4-8 agent sessions, each lasting 30-90 minutes. The Max 20x tier at $200/month provides 20x the usage of the Pro plan ($20/month), which is necessary for heavy multi-venture development.&lt;/p&gt;
&lt;p&gt;Anthropic offers three subscription tiers for Claude Code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pro: $20/month (includes Claude Code access)&lt;/li&gt;
&lt;li&gt;Max 5x: $100/month (5x Pro usage)&lt;/li&gt;
&lt;li&gt;Max 20x: $200/month (20x Pro usage)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A single-founder operation with lighter usage could run on the Max 5x tier at $100/month, reducing the total AI cost to $145/month.&lt;/p&gt;
&lt;p&gt;The OpenAI and Google subscriptions provide access to alternative CLIs (Codex CLI, Gemini CLI) and general productivity tools. The launcher supports all three agent CLIs with a single command, making it practical to use the right tool for each task.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: $245&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Hardware ($61/month amortized)&lt;/h2&gt;
&lt;p&gt;AI agents need machines to run on. Our fleet includes a mix of Apple Silicon Macs and repurposed older hardware running Linux.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Current fleet:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Specs&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Amortized (36 mo)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M1 Pro&lt;/td&gt;
&lt;td&gt;16GB, Apple Silicon&lt;/td&gt;
&lt;td&gt;Primary dev&lt;/td&gt;
&lt;td&gt;~$1,500 [estimate]&lt;/td&gt;
&lt;td&gt;~$42/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M1&lt;/td&gt;
&lt;td&gt;16GB, Apple Silicon&lt;/td&gt;
&lt;td&gt;Field/portable dev&lt;/td&gt;
&lt;td&gt;~$700 (refurbished) [estimate]&lt;/td&gt;
&lt;td&gt;~$19/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini (Intel i7-3615QM)&lt;/td&gt;
&lt;td&gt;16GB, Ubuntu 24.04&lt;/td&gt;
&lt;td&gt;Always-on server&lt;/td&gt;
&lt;td&gt;$0 (repurposed)&lt;/td&gt;
&lt;td&gt;$0/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro 2014 (Intel i7-4870HQ)&lt;/td&gt;
&lt;td&gt;16GB, Xubuntu 24.04&lt;/td&gt;
&lt;td&gt;Secondary workstation&lt;/td&gt;
&lt;td&gt;$0 (repurposed)&lt;/td&gt;
&lt;td&gt;$0/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ThinkPad (Intel i5-4300U)&lt;/td&gt;
&lt;td&gt;8GB, Xubuntu 24.04&lt;/td&gt;
&lt;td&gt;Secondary workstation&lt;/td&gt;
&lt;td&gt;$0 (repurposed)&lt;/td&gt;
&lt;td&gt;$0/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The two Apple Silicon machines are the only purpose-bought hardware. The rest of the fleet is repurposed hardware that was sitting in drawers - an old Mac Mini, a 2014 MacBook Pro, and a ThinkPad, all running Ubuntu/Xubuntu. They work fine as secondary dev workstations and always-on servers for remote agent sessions. The Mac Mini runs 24/7 as the fleet&apos;s always-on SSH target.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you were building this from scratch today:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A Mac Mini M4 with 16GB starts at $599 (frequently on sale for $499). That is enough to run Claude Code sessions, build projects, and serve as a remote dev box. Amortized over 3 years: roughly $14-$17/month.&lt;/p&gt;
&lt;p&gt;A refurbished MacBook Air M1 with 16GB runs about $600-$800 [estimate]. Amortized over 3 years: roughly $17-$22/month.&lt;/p&gt;
&lt;p&gt;You could run this entire setup on a single Mac Mini M4 for $499 up front - about $14/month amortized. Add a laptop for portability and you are at $30-$40/month for hardware.&lt;/p&gt;
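&lt;p&gt;The amortization figures in this section are all the same calculation - purchase price spread over an assumed 36-month useful life:&lt;/p&gt;

```typescript
// Amortized monthly cost: purchase price over an assumed useful life in months.
export function amortizeMonthly(price: number, months: number = 36): number {
  return Math.round((price / months) * 100) / 100;
}

// A $499 Mac Mini M4 over 36 months:
// amortizeMonthly(499) -> 13.86, roughly $14/month
```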
&lt;p&gt;&lt;strong&gt;Monthly hardware cost (amortized): ~$61/month&lt;/strong&gt; for our five-machine fleet, or as low as $14/month for a minimal single-machine setup.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Internet Access ($130/month)&lt;/h2&gt;
&lt;p&gt;Agent sessions need reliable bandwidth. Builds, git operations, API calls, and Cloudflare deployments all go over the wire. When working from the field, an iPhone hotspot provides the connection for the portable setup.&lt;/p&gt;
&lt;p&gt;This line item is easy to overlook because you are paying it anyway. But it is a real cost of running this operation, and it would not be honest to exclude it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: ~$130&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Domains (~$7/month)&lt;/h2&gt;
&lt;p&gt;Each venture that has a public presence needs a domain. Registration runs $14-$30/year per domain depending on the TLD. With several active ventures, this adds up.&lt;/p&gt;
&lt;p&gt;Cloudflare Registrar offers at-cost domain registration with no markup, which keeps renewal prices at the wholesale minimum.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: ~$7 [estimate]&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Developer Tools ($2/month)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Blink Shell&lt;/strong&gt; ($20/year) is the iOS SSH/Mosh client that makes mobile access to the fleet possible. It supports SSH and Mosh natively, syncs keys and configs via iCloud, and handles Tailscale connections. Without it, the mobile access workflow described in our other articles would not exist.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly cost: ~$2&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Email: Buttondown + Resend ($9/month)&lt;/h2&gt;
&lt;p&gt;Once the marketing site launched, we needed two email capabilities: a newsletter for ongoing reader relationships, and transactional email for the contact form.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Buttondown ($9/month):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The newsletter runs on Buttondown&apos;s Basic plan. It provides RSS-to-email automation that checks the site feed every 30 minutes - when we publish an article or build log, subscribers get it automatically with no manual step. The $9/month Basic plan unlocks custom sending domains, so emails come from &lt;code&gt;mail.venturecrane.com&lt;/code&gt; rather than Buttondown&apos;s default address. The free tier (under 1,000 subscribers) would work without the custom domain, but branded sending matters for a professional operation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Resend ($0/month):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The contact form sends through Resend&apos;s transactional email API. The free tier covers 3,000 emails per month - orders of magnitude more than a contact form generates. Domain-verified sending with DKIM and SPF means emails arrive from a branded address, not a sandbox domain.&lt;/p&gt;
&lt;p&gt;Both services follow the same integration pattern: a Cloudflare Pages Function calls the provider&apos;s API, with the API key stored in Infisical and deployed as an encrypted environment variable. No client-side email SDKs, no third-party form services.&lt;/p&gt;
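&lt;p&gt;For the contact form, the shape of that function looks roughly like this. The field names, addresses, and error handling are illustrative - the one real detail is that the key arrives via the environment, never the client:&lt;/p&gt;

```typescript
// Sketch of a contact-form handler as a Cloudflare Pages Function.
// Field names and addresses are illustrative, not our production code.

interface ContactForm {
  name: string;
  email: string;
  message: string;
}

// Pure step: shape the Resend API request body from the form fields.
export function buildEmailPayload(form: ContactForm) {
  return {
    from: "noreply@mail.example.com",
    to: ["owner@example.com"],
    reply_to: form.email,
    subject: "Contact form: " + form.name,
    text: form.message,
  };
}

// Pages Functions invoke named exports like onRequestPost with a context
// object carrying the request and the environment bindings.
export async function onRequestPost(context: {
  request: Request;
  env: { RESEND_API_KEY: string };
}) {
  const form = (await context.request.json()) as ContactForm;
  const res = await fetch("https://api.resend.com/emails", {
    method: "POST",
    headers: {
      Authorization: "Bearer " + context.env.RESEND_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildEmailPayload(form)),
  });
  return new Response(res.ok ? "sent" : "error", { status: res.ok ? 200 : 502 });
}
```

&lt;p&gt;Splitting out the payload builder keeps the network call trivial and the interesting logic testable.&lt;/p&gt;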
&lt;p&gt;&lt;strong&gt;Monthly cost: $9&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Full Picture&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI subscriptions&lt;/td&gt;
&lt;td&gt;$245&lt;/td&gt;
&lt;td&gt;Anthropic $200 + OpenAI $20 + Google ~$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet access&lt;/td&gt;
&lt;td&gt;~$130&lt;/td&gt;
&lt;td&gt;Home broadband + mobile hotspot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware (amortized)&lt;/td&gt;
&lt;td&gt;~$61&lt;/td&gt;
&lt;td&gt;5-machine fleet, 2 purchased + 3 repurposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vercel Pro&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;7 projects, 1 seat, $20 included usage credit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buttondown&lt;/td&gt;
&lt;td&gt;$9&lt;/td&gt;
&lt;td&gt;Newsletter with custom sending domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Team&lt;/td&gt;
&lt;td&gt;$8&lt;/td&gt;
&lt;td&gt;2 seats at $4/user/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domains&lt;/td&gt;
&lt;td&gt;~$7&lt;/td&gt;
&lt;td&gt;Several domains at $14-$30/year each&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare Workers + D1&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;Workers Paid plan, D1 free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blink Shell&lt;/td&gt;
&lt;td&gt;~$2&lt;/td&gt;
&lt;td&gt;iOS SSH/Mosh client, $20/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resend&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Contact form email, free tier (3,000/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infisical&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Free tier, cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tailscale&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Free Personal plan, 5 of 100 devices used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$487&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What stands out is how much of the budget goes to AI subscriptions and internet access - roughly 77% of the total. Everything else combined is under $120/month.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Surprised Us&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The free tiers are not traps.&lt;/strong&gt; Tailscale, Infisical, and Resend all offer free tiers that genuinely cover small-team and solo-founder use cases without artificial friction. Cloudflare&apos;s free tier carried us for months before a venture&apos;s production traffic outgrew the daily KV limits - and even then, the paid plan is $5/month. These are real free tiers, not trial periods with a countdown.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware costs are front-loaded, not recurring.&lt;/strong&gt; Once you buy the machines, the monthly amortized cost is low. And if you have old hardware sitting around, repurposing it as a Linux dev server costs nothing. A 2014 MacBook Pro with 16GB of RAM running Xubuntu is a perfectly capable remote agent host.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI subscriptions and internet dominate the budget.&lt;/strong&gt; Strip out AI and internet costs and the entire operation runs for under $120/month. AI subscriptions alone account for half the total. This is the line item with the most room for optimization - dropping to the Max 5x tier ($100/month) would save $100 immediately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The infrastructure is simpler than it sounds.&lt;/strong&gt; &quot;Multiple Cloudflare Workers, a D1 database, an MCP server, a fleet of machines on a mesh VPN&quot; sounds like a complex enterprise setup. In practice, the Workers deploy with a single command, D1 is just SQLite at the edge, and Tailscale configures itself. The total infrastructure setup time for a new machine is about five minutes with our bootstrap script.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational overhead is near zero.&lt;/strong&gt; There are no servers to patch, no databases to back up (D1 handles this), no certificates to rotate (Cloudflare handles this), no VPN servers to maintain (Tailscale handles this). The only recurring operational tasks are rotating API keys in Infisical when they expire and monitoring usage alerts from providers - which is how we caught the KV limit before it caused downtime.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;For Founders Considering This Approach&lt;/h2&gt;
&lt;p&gt;The barrier to running an AI-native multi-project development operation is not cost - it is architecture. The tooling decisions matter more than the budget.&lt;/p&gt;
&lt;p&gt;Here is what a minimal viable setup looks like:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;One Mac Mini M4&lt;/strong&gt; ($499-$599) - your development machine and remote agent host&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Claude Pro or Max subscription&lt;/strong&gt; ($20-$200/month) - your AI coding agent&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloudflare free tier&lt;/strong&gt; ($0) or &lt;strong&gt;Workers Paid&lt;/strong&gt; ($5/month) - Workers, D1, and R2 for backend services. The free tier is sufficient for internal tooling; upgrade when you have user-facing traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vercel Hobby&lt;/strong&gt; ($0) or &lt;strong&gt;Pro&lt;/strong&gt; ($20/month) - frontend hosting with serverless functions. Hobby works for personal projects; Pro is required for commercial use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitHub free tier&lt;/strong&gt; (or Team at $4/user/month for branch protection) - source control, issues, CI/CD&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tailscale free tier&lt;/strong&gt; - if you add a second machine or want mobile access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Infisical free tier&lt;/strong&gt; - secrets management from day one (do not hardcode keys)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Total year-one cost for the minimal setup: roughly $500 for hardware plus $240-$2,400 for AI and $0-$240 for hosting, depending on usage intensity and whether you need commercial hosting. Call it &lt;strong&gt;$750-$3,200 for the first year&lt;/strong&gt; to run a multi-project AI-native development lab.&lt;/p&gt;
&lt;p&gt;That is less than most founders spend on a single SaaS subscription stack. The trade-off is that you are building on primitives (Workers, D1, MCP) rather than buying pre-built platforms. For a technical founder, that is a feature, not a bug - you control the entire stack, and almost none of it has a recurring fee.&lt;/p&gt;
&lt;p&gt;The real investment is not money. It is the time to set up the automation, the context management, the session handoff workflows, and the agent coordination patterns that make multi-venture development actually work. Those are engineering problems, not budget problems. And AI agents are remarkably good at helping you solve them.&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>costs</category><category>ai-agents</category></item><item><title>PWA vs Native App - When to Skip the App Store</title><link>https://venturecrane.com/articles/pwa-vs-native-skip-app-store/</link><guid isPermaLink="true">https://venturecrane.com/articles/pwa-vs-native-skip-app-store/</guid><description>A decision framework for choosing PWA over native when you have not validated product-market fit yet.</description><pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Shipping an iPad app through the App Store costs $99/year, a week of App Store Review roulette, and a commitment to a build toolchain you will maintain for the life of the product. Shipping the same app as a PWA costs a &lt;code&gt;manifest.json&lt;/code&gt;, a service worker, and a deploy.&lt;/p&gt;
&lt;p&gt;We chose the PWA. Not because we are opposed to native apps, but because we had not yet proven that anyone wanted the product.&lt;/p&gt;
&lt;p&gt;The case study is a writing app for nonfiction authors, built for iPad-first use. Rich text editing, AI-powered rewrite suggestions, Google Drive sync, PDF and EPUB export. The kind of app that sounds like it should be native. It is not, and the reasons are worth examining.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Temptation of Native&lt;/h2&gt;
&lt;p&gt;When you are building for iPad, the gravitational pull toward a native app is strong. App Store distribution. Native performance. The mental model that &quot;serious apps&quot; are native apps.&lt;/p&gt;
&lt;p&gt;Here is what a native iOS app actually requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apple Developer Program&lt;/strong&gt; - $99/year, enrollment approval, provisioning profiles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build toolchain&lt;/strong&gt; - Xcode, Swift/SwiftUI (or React Native with its bridging layer), CocoaPods or SPM for dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;App Store Review&lt;/strong&gt; - every release goes through Apple&apos;s review process. Typical turnaround is 24-48 hours, but rejections happen, and the feedback loop is measured in days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binary management&lt;/strong&gt; - code signing, TestFlight for beta distribution, crash symbolication, app thinning for different device classes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate codebase&lt;/strong&gt; - unless you use React Native, your iOS app is a distinct codebase from your web app. Two deployment pipelines, two testing strategies, two sets of bugs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of this is unreasonable for a validated product. All of it is premature for a product that has not found its audience yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the App Actually Uses&lt;/h2&gt;
&lt;p&gt;Before deciding on distribution, we listed every technical capability the app requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rich text editing&lt;/strong&gt; - TipTap, a ProseMirror-based editor. Runs entirely in the browser.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI rewrite streaming&lt;/strong&gt; - Server-Sent Events from an API endpoint. The browser renders tokens as they arrive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Drive OAuth&lt;/strong&gt; - standard OAuth 2.0 flow. Works in any browser.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PDF export&lt;/strong&gt; - generated via a headless browser rendering service. The client sends content, gets back a PDF.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;EPUB generation&lt;/strong&gt; - JSZip running client-side. No native APIs involved.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Offline shell&lt;/strong&gt; - service worker caches the app shell. Data still needs network, but the app loads without one.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every single feature on this list works in Safari on iPad. None of them require ARKit, HealthKit, Core Data, background location, NFC, or any other native-only API.&lt;/p&gt;
&lt;p&gt;This is the first question in the decision framework, and it is the most important one: do your features require native APIs? If the answer is no, the argument for a native app shifts from technical necessity to distribution preference. That is a very different conversation.&lt;/p&gt;
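&lt;p&gt;As a concrete case, the AI rewrite streaming in the list above needs nothing beyond &lt;code&gt;fetch&lt;/code&gt; and a stream reader. A sketch - the endpoint and event format are illustrative, not our actual API:&lt;/p&gt;

```typescript
// Illustrative SSE handling for token streaming; endpoint and payload are made up.

// Pure step: pull the data payloads out of one SSE chunk.
export function parseSseChunk(chunk: string): string[] {
  const tokens: string[] = [];
  for (const line of chunk.split("\n")) {
    if (line.startsWith("data: ")) {
      const data = line.slice(6);
      if (data !== "[DONE]") tokens.push(data);
    }
  }
  return tokens;
}

// Browser side: read the stream and hand tokens to the UI as they arrive.
export async function streamRewrite(text: string, onToken: (t: string) => void) {
  const res = await fetch("/api/rewrite", {
    method: "POST",
    body: JSON.stringify({ text }),
  });
  if (!res.body) return;
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const token of parseSseChunk(decoder.decode(value, { stream: true }))) {
      onToken(token);
    }
  }
}
```

&lt;p&gt;This runs unchanged in Safari on iPad.&lt;/p&gt;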
&lt;hr /&gt;
&lt;h2&gt;The PWA Stack&lt;/h2&gt;
&lt;p&gt;The technical implementation is straightforward enough to describe in a few paragraphs. That is part of the point: the overhead is minimal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web app manifest.&lt;/strong&gt; A &lt;code&gt;manifest.json&lt;/code&gt; file that tells the browser this site can behave like an app. The critical fields:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;name&quot;: &quot;Your App Name&quot;,
  &quot;short_name&quot;: &quot;App Name&quot;,
  &quot;start_url&quot;: &quot;/&quot;,
  &quot;display&quot;: &quot;standalone&quot;,
  &quot;background_color&quot;: &quot;#0a0a0a&quot;,
  &quot;theme_color&quot;: &quot;#0a0a0a&quot;,
  &quot;icons&quot;: [
    { &quot;src&quot;: &quot;/icons/icon-192.png&quot;, &quot;sizes&quot;: &quot;192x192&quot;, &quot;type&quot;: &quot;image/png&quot; },
    { &quot;src&quot;: &quot;/icons/icon-512.png&quot;, &quot;sizes&quot;: &quot;512x512&quot;, &quot;type&quot;: &quot;image/png&quot; }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;display: standalone&lt;/code&gt; is the key property. It tells iOS to render the app without Safari&apos;s address bar and navigation chrome. When a user taps &quot;Add to Home Screen,&quot; the app launches full-screen, indistinguishable from a native app at the visual level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Service worker via Serwist.&lt;/strong&gt; Serwist is the successor to next-pwa, which was built on Workbox. It integrates with Next.js through a plugin in &lt;code&gt;next.config.ts&lt;/code&gt;. The service worker caches the app shell - HTML, CSS, JavaScript, fonts - so the app loads instantly on subsequent visits, even offline.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// next.config.ts
import withSerwistInit from &apos;@serwist/next&apos;

const withSerwist = withSerwistInit({
  swSrc: &apos;src/sw.ts&apos;,
  swDest: &apos;public/sw.js&apos;,
})

export default withSerwist(nextConfig)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The service worker source (&lt;code&gt;sw.ts&lt;/code&gt;) is typically under 30 lines. It registers precache entries and sets up runtime caching strategies. The build toolchain handles the rest.&lt;/p&gt;
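&lt;p&gt;The file stays close to Serwist&apos;s documented starter. A sketch (check the current Serwist docs for exact imports and type declarations, which change between versions):&lt;/p&gt;

```typescript
// src/sw.ts - sketch based on Serwist's documented setup
import { defaultCache } from "@serwist/next/worker";
import { Serwist } from "serwist";

// The real file also declares the __SW_MANIFEST global that the
// @serwist/next plugin fills with precache entries at build time.
declare const self: ServiceWorkerGlobalScope;

const serwist = new Serwist({
  precacheEntries: self.__SW_MANIFEST,
  skipWaiting: true,
  clientsClaim: true,
  runtimeCaching: defaultCache,
});

serwist.addEventListeners();
```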
&lt;p&gt;&lt;strong&gt;iOS meta tags.&lt;/strong&gt; Safari needs additional meta tags beyond the manifest to fully support PWA features:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta name=&quot;apple-mobile-web-app-capable&quot; content=&quot;yes&quot; /&amp;gt;
&amp;lt;meta name=&quot;apple-mobile-web-app-status-bar-style&quot; content=&quot;black-translucent&quot; /&amp;gt;
&amp;lt;link rel=&quot;apple-touch-icon&quot; href=&quot;/icons/apple-touch-icon.png&quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tags control how the app appears when launched from the home screen - status bar style, icon, splash screen behavior.&lt;/p&gt;
&lt;p&gt;That is the entire PWA layer. Manifest, service worker, meta tags. No Xcode project, no provisioning profile, no signing certificate.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Deployment Difference&lt;/h2&gt;
&lt;p&gt;This is where the practical gap between PWA and native becomes stark.&lt;/p&gt;
&lt;p&gt;A native app release: write code, build in Xcode, submit to App Store Connect, wait for review (1-3 days), receive approval or rejection. If rejected, fix the issue, resubmit, wait again. When approved, the update propagates to users over hours to days depending on their device settings.&lt;/p&gt;
&lt;p&gt;A PWA release: push to &lt;code&gt;main&lt;/code&gt;, CI deploys to your hosting provider, users get the new version on their next visit. Total time from merge to production: minutes. There is no review gate, no approval queue, no binary propagation delay.&lt;/p&gt;
&lt;p&gt;For a product that has not found product-market fit, this iteration speed is the entire game. You need to ship changes, observe behavior, and ship again. A 3-day feedback loop through App Store Review is workable for a mature product. It is fatal for a product that is still figuring out what it is.&lt;/p&gt;
&lt;p&gt;We deployed 14 changes in the first two weeks after launch. Some were bug fixes. Some were feature experiments. Some were UI adjustments based on watching real usage patterns. Every one of those shipped within minutes of merge. In the App Store model, that would have been 14 review cycles, or more realistically, we would have batched changes into 2-3 releases and shipped less frequently, learning slower.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Decision Framework&lt;/h2&gt;
&lt;p&gt;When should you choose PWA over native? We use a simple decision tree.&lt;/p&gt;
&lt;h3&gt;Choose PWA when:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;You have not validated product-market fit.&lt;/strong&gt; This is the strongest signal. If you do not know whether people want your product, do not invest in native distribution infrastructure. Build the fastest possible feedback loop between you and your users. PWA gives you web-speed iteration with app-like UX.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your features work in the browser.&lt;/strong&gt; Go through your feature list. If every feature runs in a modern browser without native API bridges, you do not need a native app for technical reasons. Rich text editing, real-time streaming, OAuth flows, file generation, offline caching - all of these work in Safari, Chrome, and Firefox today.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your target platform has solid PWA support.&lt;/strong&gt; iPad Safari supports Add to Home Screen, standalone display mode, service workers, and (since iOS 16.4) web push notifications. The PWA experience on iPad is not second-class. Desktop Chrome and Edge have even stronger PWA support with install prompts and window management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You want to iterate without gatekeepers.&lt;/strong&gt; App Store Review is not adversarial, but it is a gate. Every release goes through it. Every rejection costs days. If you are iterating rapidly on a product that is still taking shape, that gate slows you down in ways that compound.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your team is web-native.&lt;/strong&gt; If your engineers write React and TypeScript, asking them to also write Swift is asking them to context-switch across languages, toolchains, and platform conventions. The cognitive overhead is real. A web team shipping a PWA is working in their strongest medium.&lt;/p&gt;
&lt;h3&gt;Choose native when:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;You need native-only APIs.&lt;/strong&gt; ARKit for augmented reality. HealthKit for health data. Core Bluetooth for hardware peripherals. Background location tracking. NFC. If your core features depend on these APIs, a PWA cannot deliver your product. No amount of service worker cleverness will give you access to the accelerometer data that a fitness app needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;App Store distribution is a user acquisition channel.&lt;/strong&gt; Some products depend on App Store search as a discovery mechanism. If your users find apps by browsing the App Store, not following links, then being in the store matters for business reasons independent of technical ones. This is a distribution argument, not a technology argument.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You need background processing beyond service workers.&lt;/strong&gt; Service workers can do limited background work - push notification handling, periodic sync (on Android). But sustained background processing - playing audio while the app is backgrounded, tracking a workout, syncing large datasets - requires native capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance requirements exceed browser limits.&lt;/strong&gt; 3D rendering at high frame rates, real-time audio processing, heavy computational workloads. The browser is getting faster every year, but native code with direct GPU access is still faster for demanding workloads. If your product competes on performance, PWA might not be enough.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Validation Threshold&lt;/h2&gt;
&lt;p&gt;The framework above handles the &quot;which one&quot; question. The harder question is &quot;when do you switch from PWA to native?&quot;&lt;/p&gt;
&lt;p&gt;Our answer: when you have evidence that users want the product AND you have identified specific native capabilities they need.&lt;/p&gt;
&lt;p&gt;Evidence is not &quot;we think people will want this.&quot; Evidence is measurable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Active users performing the core action.&lt;/strong&gt; For a writing app, that means users writing chapters. Not visiting the landing page, not creating an account - writing. The core action is the only metric that matters for validation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention.&lt;/strong&gt; Users coming back. A burst of signups followed by abandonment is not validation. Users returning to write their second chapter, their fifth, their twentieth - that is validation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit requests for native features.&lt;/strong&gt; Users asking for things PWA cannot deliver. &quot;I want to use Apple Pencil pressure sensitivity.&quot; &quot;I need Siri Shortcuts integration.&quot; &quot;I want to sync via iCloud.&quot; These requests are the signal that native distribution would unlock value the PWA cannot.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Until you have all three, a native app is premature optimization of your distribution channel. You are spending engineering time on App Store compliance instead of on the product itself.&lt;/p&gt;
&lt;p&gt;The important insight: you can always add native later. A PWA does not prevent a future native app. The web app continues to work. Users who prefer the browser keep using it. The native app becomes an additional distribution channel, not a replacement.&lt;/p&gt;
&lt;p&gt;You cannot un-build a native app. Once you have an App Store listing, TestFlight beta users, and a Swift codebase, you are maintaining it. Indefinitely. Even if you decide to focus on the web version, the native app has users who expect updates. Killing a published app creates support burden and user frustration that a PWA you never published does not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Cross-Project Application&lt;/h2&gt;
&lt;p&gt;The PWA pattern started with the writing app, then extended to all portfolio projects in the same session. Each project had different functionality, but the PWA layer was identical.&lt;/p&gt;
&lt;p&gt;The implementation for each project was mechanical:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add &lt;code&gt;manifest.json&lt;/code&gt; with project-specific name, colors, and icons.&lt;/li&gt;
&lt;li&gt;Configure Serwist in &lt;code&gt;next.config.ts&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Write a minimal service worker source file.&lt;/li&gt;
&lt;li&gt;Add iOS meta tags to the document head.&lt;/li&gt;
&lt;li&gt;Deploy.&lt;/li&gt;
&lt;/ol&gt;
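&lt;p&gt;Step 2 is the only one with real configuration. A minimal sketch, assuming the &lt;code&gt;@serwist/next&lt;/code&gt; package and illustrative file paths - your worker source and output locations may differ:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// next.config.ts - minimal Serwist wiring, sketched
import withSerwistInit from "@serwist/next";

const withSerwist = withSerwistInit({
  swSrc: "app/sw.ts",     // the minimal service worker source (step 3)
  swDest: "public/sw.js", // compiled worker, served from the site root
});

export default withSerwist({
  // existing Next.js config passes through unchanged
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The manifest, icons, and iOS meta tags carry all the project-specific branding; this file is the part that copies between projects unmodified.&lt;/p&gt;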
&lt;p&gt;No project required more than an hour of work for the PWA addition. The pattern is a template, not a design exercise. Once you have done it once, every subsequent project is copy, customize the branding fields, deploy.&lt;/p&gt;
&lt;p&gt;This repeatability is itself an argument for the approach. A native app is a bespoke project for each platform. A PWA is a configuration layer on top of your existing web application.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the PWA Cannot Do&lt;/h2&gt;
&lt;p&gt;Honesty about limitations matters more than enthusiasm about capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No App Store presence.&lt;/strong&gt; Your app does not appear in App Store search results. Users cannot stumble upon it while browsing. Discovery depends entirely on your own marketing, SEO, and word-of-mouth channels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No native app icon badge behavior.&lt;/strong&gt; While web push notifications work on iOS 16.4+, the badging API has limited support. Users will not see an unread count on your home screen icon the way they would for a native app.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limited background capability.&lt;/strong&gt; The service worker runs when the user opens the app or receives a push notification. It does not run continuously in the background. If your app needs to do work while the user is not looking at it, PWA is constrained.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No access to some hardware.&lt;/strong&gt; Bluetooth, NFC, and certain sensor APIs are unavailable or partially available in Safari. The gap narrows with each iOS release, but it exists today.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Safari-specific quirks.&lt;/strong&gt; Apple&apos;s PWA support, while functional, lags behind Chrome&apos;s. Features like declarative link capturing, window controls overlay, and some manifest properties that work on Android and desktop do not work on iOS. You are building for the subset of PWA capabilities that Safari supports, which is smaller than the full specification.&lt;/p&gt;
&lt;p&gt;These limitations are real. They are also irrelevant if your product does not need the capabilities that are missing. The writing app does not need App Store discovery (it has its own site), does not need background processing (writing is a foreground activity), and does not need hardware access (it is a text editor). The limitations exist. They do not apply.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Broader Principle&lt;/h2&gt;
&lt;p&gt;The question &quot;should we build a native app?&quot; is often framed as a technology decision. It is not. It is a capital allocation decision.&lt;/p&gt;
&lt;p&gt;Building a native app means allocating engineering time to platform-specific toolchains, review processes, and distribution infrastructure. That time is not spent on the product itself. For a validated product with proven demand, that investment makes sense - native distribution unlocks capabilities and audiences that the web cannot reach.&lt;/p&gt;
&lt;p&gt;For an unvalidated product, that investment is a bet placed before the evidence is in. You are spending engineering capital on distribution before you know whether anyone wants what you are distributing. The rational move is to minimize distribution overhead, maximize iteration speed, and defer the native investment until the product itself justifies it.&lt;/p&gt;
&lt;p&gt;PWA is not a compromise. It is the correct architecture for the stage of the product. When the app has a thousand active writers returning weekly, we will revisit the native question with data instead of assumptions. Until then, the app ships on deploy, updates in minutes, and runs full-screen on every iPad that opens it.&lt;/p&gt;
&lt;p&gt;The App Store will still be there when we are ready for it.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The case study is an iPad-first writing app for nonfiction authors, shipped as a Progressive Web App using Next.js, Serwist, and Safari&apos;s standalone display mode. The PWA pattern applied to all portfolio projects in the same session, confirming the implementation is mechanical once the pattern is established. No native app will be built until active usage data justifies the investment.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>pwa</category><category>architecture</category><category>product-decisions</category></item><item><title>Secrets Management for AI Agent Teams</title><link>https://venturecrane.com/articles/secrets-management-ai-agents/</link><guid isPermaLink="true">https://venturecrane.com/articles/secrets-management-ai-agents/</guid><description>How to manage secrets across projects, machines, and autonomous agents without .env files, hardcoded credentials, or accidental exposure.</description><pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The threat model for AI agents is not the same as the threat model for human developers.&lt;/p&gt;
&lt;p&gt;A human developer might accidentally commit a &lt;code&gt;.env&lt;/code&gt; file. That&apos;s bad. An AI agent might include an API key in a commit message, echo a secret into a tool call argument, or store a credential value in a knowledge system instead of a secrets manager - all while believing it completed the task correctly. Agents operate autonomously, often across multiple projects and machines, with every environment variable visible and referenceable. The blast radius of a secret in an agent&apos;s environment is wider than in a traditional development setup, and the failure modes are different.&lt;/p&gt;
&lt;p&gt;This article covers the broader strategy for organizing, protecting, and making secrets discoverable across a multi-project, multi-agent operation. For the mechanics of how secrets flow from a centralized store into an agent process at launch time, see &lt;a href=&quot;/articles/secrets-injection-agent-launch&quot;&gt;Secrets Injection at Agent Launch Time&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Problem with .env Files&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file is the default pattern in most development workflows. It&apos;s simple, it&apos;s local, and it works fine for a single developer on a single project. It stops working the moment any of those constraints change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale secrets.&lt;/strong&gt; Someone rotates an API key. The &lt;code&gt;.env&lt;/code&gt; file on two machines still has the old value. Nobody notices until an agent session fails mid-task, and the error message points to an authentication failure that could mean a dozen things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wrong-project injection.&lt;/strong&gt; Copy a &lt;code&gt;.env&lt;/code&gt; from one project to another. Change five of the six keys. Miss the sixth. The agent runs with a hybrid environment - partially project A, partially project B - and produces behavior that&apos;s subtly wrong in ways that are hard to diagnose.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Git history exposure.&lt;/strong&gt; Commit a &lt;code&gt;.env&lt;/code&gt; file accidentally. Remove it in the next commit. The secret is still in the git history. Now you are rotating keys, scrubbing refs, and wondering who pulled before the fix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent-specific blast radius.&lt;/strong&gt; A human developer rarely echoes environment variables into output. An AI agent, asked to debug a connection issue, might include the full environment in a diagnostic message, a PR description, or a search query. The secret does not stay in the environment; it propagates into artifacts.&lt;/p&gt;
&lt;p&gt;We covered these failure modes in &lt;a href=&quot;/articles/secrets-injection-agent-launch&quot;&gt;the injection article&lt;/a&gt;. The solution is a centralized secrets manager (Infisical) with runtime injection. The rest of this article is about the organizational layer on top of that: how to structure, separate, and selectively expose secrets when you&apos;re running multiple projects, multiple environments, and autonomous agents that should only see what they need.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Centralized Secrets with Per-Project Paths&lt;/h2&gt;
&lt;p&gt;Everything starts with one workspace in Infisical. Each project gets its own path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;workspace
├── /alpha    - Project Alpha secrets
├── /beta     - Project Beta secrets
├── /gamma    - Project Gamma secrets
└── /delta    - Project Delta secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The CLI launcher knows which path maps to which project. When you type &lt;code&gt;launcher alpha&lt;/code&gt;, it fetches secrets from &lt;code&gt;/alpha&lt;/code&gt;, injects them as environment variables, and spawns the agent session: one command, one fetch, no files on disk.&lt;/p&gt;
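&lt;p&gt;The core of that flow is small. A sketch, leaning on Infisical&apos;s &lt;code&gt;infisical run&lt;/code&gt; subcommand to do the fetch-and-inject - the project map and the &lt;code&gt;agent&lt;/code&gt; command name here are illustrative, not our real launcher:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// launcher core, sketched
import { spawnSync } from "node:child_process";

const PATHS: { [project: string]: string } = {
  alpha: "/alpha",
  beta: "/beta",
};

export function launch(project: string, agentCmd: string = "agent") {
  const path = PATHS[project];
  if (!path) throw new Error(`unknown project: ${project}`);
  // infisical run fetches secrets at the path and injects them as
  // environment variables into the child process - nothing touches disk
  return spawnSync("infisical", ["run", `--path=${path}`, "--", agentCmd], {
    stdio: "inherit",
  });
}
&lt;/code&gt;&lt;/pre&gt;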
&lt;p&gt;Shared secrets (infrastructure keys that every project needs) live at a designated source path. A sync script propagates them to every other path. The source of truth is always one place. When a shared key gets rotated, you update it once and run the sync: no manual copy-paste across project paths.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check which projects are missing shared secrets
launcher --secrets-audit

# Propagate missing shared secrets from the source
launcher --secrets-audit --fix
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This structure has a useful property: adding a new project is a single operation (create the path, run the sync), and every existing tool - the launcher, the audit script, the environment resolver - works without modification. The path is the interface.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Storage vs. Injection Distinction&lt;/h2&gt;
&lt;p&gt;Runtime injection solves the delivery problem. But it creates a new one: every secret at a project&apos;s path gets injected into the agent environment. That&apos;s usually correct. API keys, auth tokens, service credentials - the agent needs them to function. But some secrets should be stored without being injected.&lt;/p&gt;
&lt;p&gt;The case that surfaced this: an API key that needed to be kept in the secrets manager for reference and rotation tracking, but should not appear in the agent&apos;s environment. When it was stored at the standard project path, the CLI tool detected it and prompted about using a custom key on every launch. The key&apos;s mere presence in the environment changed agent behavior even though nothing in the codebase referenced it.&lt;/p&gt;
&lt;p&gt;The solution is structural, not logical. Rather than adding &quot;inject: false&quot; flags or filter lists, we established a sub-path convention:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/alpha          - Injected into agent sessions
/alpha/vault    - Stored but NOT injected
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works because the Infisical CLI&apos;s &lt;code&gt;--path&lt;/code&gt; flag uses exact matching, not recursive resolution. &lt;code&gt;infisical export --path /alpha&lt;/code&gt; returns secrets at &lt;code&gt;/alpha&lt;/code&gt; only. It does not descend into &lt;code&gt;/alpha/vault&lt;/code&gt;. The separation is enforced by the tool&apos;s own path scoping behavior: no additional code, no filter logic, no maintenance.&lt;/p&gt;
&lt;p&gt;Secrets in vault paths are still fully manageable through the Infisical CLI and web UI. They can be rotated, audited, and retrieved when needed. They just don&apos;t end up in the environment of every agent session.&lt;/p&gt;
&lt;p&gt;The harder problem is discoverability. An agent asked to &quot;find the API key&quot; will search the standard path, find nothing, and report it missing. Without guidance, it will not think to check a sub-path. The fix is documentation at the point of lookup. Agent instruction files include the vault convention and the command to check vault paths:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Standard secrets
infisical secrets --path /alpha --env dev

# Storage-only secrets (not injected into agent sessions)
infisical secrets --path /alpha/vault --env prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This turns a potential blind spot into a discoverable resource. The agent knows vault paths exist, knows how to query them, and knows the difference between &quot;this secret does not exist&quot; and &quot;this secret exists but is not injected.&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Environment Separation&lt;/h2&gt;
&lt;p&gt;The same project needs different secrets for different environments. A staging deployment uses test API keys. Production uses the real ones. An agent working on staging code should never have production database credentials in scope.&lt;/p&gt;
&lt;p&gt;The launcher resolves the environment from a single variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;launcher alpha              # Production secrets (default)
PROJECT_ENV=dev launcher alpha  # Dev/staging secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both environments exist in the secrets manager at the same project path, just in different environment scopes (Infisical&apos;s native concept). The launcher fetches from the correct environment and injects &lt;code&gt;PROJECT_ENV&lt;/code&gt; itself so the running agent knows which context it&apos;s operating in.&lt;/p&gt;
&lt;p&gt;Some projects have additional staging infrastructure that needs its own secrets: staging-specific API endpoints, staging database credentials, staging webhook URLs. These get their own sub-path within the project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/alpha              - Production + shared secrets
/alpha/staging      - Staging infrastructure secrets
/alpha/vault        - Storage-only secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resolver handles this gracefully. If a project has a staging sub-path and the environment is &lt;code&gt;dev&lt;/code&gt;, the launcher fetches from both the base path (for shared secrets) and the staging sub-path (for infrastructure-specific overrides). If no staging path exists, it warns and uses the base secrets for that environment.&lt;/p&gt;
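&lt;p&gt;The resolver logic is small enough to sketch - names are illustrative, and the real launcher adds error handling:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// which paths to fetch for a given project and environment, sketched
export function resolvePaths(base: string, env: string, hasStaging: boolean): string[] {
  if (env === "prod") return [base]; // production: base path only
  if (hasStaging) {
    // dev: shared secrets from the base path, plus staging overrides
    return [base, `${base}/staging`];
  }
  console.warn(`no staging path under ${base}; using base secrets for ${env}`);
  return [base];
}
&lt;/code&gt;&lt;/pre&gt;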
&lt;p&gt;The result is clean separation without configuration duplication. A production agent never sees staging credentials. A staging agent never has production database access. The same launcher command works everywhere - the environment variable is the only difference.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;SSH Sessions: When the Keychain Is Locked&lt;/h2&gt;
&lt;p&gt;On a local machine, the Infisical CLI authenticates through an interactive browser login. The token gets stored in the system keychain. Simple, secure, works without thinking about it.&lt;/p&gt;
&lt;p&gt;Over SSH, everything breaks. There is no browser for interactive login. On macOS, the system keychain is locked when no user session is active. The token that worked five minutes ago at the keyboard is inaccessible from a remote connection.&lt;/p&gt;
&lt;p&gt;The fallback is Machine Identity authentication, a service account model designed for unattended access:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Machine Identity in the Infisical web UI with Universal Auth&lt;/li&gt;
&lt;li&gt;Store the credentials in a restricted file (&lt;code&gt;~/.infisical-ua&lt;/code&gt;, mode &lt;code&gt;600&lt;/code&gt;) on each machine&lt;/li&gt;
&lt;li&gt;The launcher detects SSH sessions (checking &lt;code&gt;SSH_CLIENT&lt;/code&gt;, &lt;code&gt;SSH_TTY&lt;/code&gt;, or &lt;code&gt;SSH_CONNECTION&lt;/code&gt; environment variables) and switches auth methods automatically&lt;/li&gt;
&lt;li&gt;Authentication happens via &lt;code&gt;infisical login --method=universal-auth&lt;/code&gt; to get a short-lived JWT&lt;/li&gt;
&lt;li&gt;The token is passed through an environment variable rather than a CLI flag, since flag values are visible in &lt;code&gt;ps&lt;/code&gt; output&lt;/li&gt;
&lt;/ol&gt;
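&lt;p&gt;Step 3 reduces to a simple environment check. A sketch with illustrative names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// auth method selection, sketched: SSH sessions fall back to Machine Identity
type Env = { [k: string]: string | undefined };

export function isSshSession(env: Env = process.env): boolean {
  return Boolean(env.SSH_CLIENT) || Boolean(env.SSH_TTY) || Boolean(env.SSH_CONNECTION);
}

export function authMethod(env: Env = process.env): string {
  // universal-auth reads the Machine Identity credentials file;
  // at the keyboard, the keychain-backed interactive login is used instead
  return isSshSession(env) ? "universal-auth" : "interactive";
}
&lt;/code&gt;&lt;/pre&gt;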
&lt;p&gt;Each machine that needs to accept SSH connections requires a one-time bootstrap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;bash scripts/bootstrap-infisical-ua.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script prompts for Machine Identity credentials, writes the credentials file with restricted permissions, and verifies authentication works. After that, &lt;code&gt;launcher alpha&lt;/code&gt; works identically whether you&apos;re at the keyboard or SSH&apos;d in from a tablet across the country.&lt;/p&gt;
&lt;p&gt;There is a macOS-specific wrinkle. Agent CLIs that use OAuth (like Claude Code) store their tokens in the system keychain too. Over SSH, that keychain is locked. The launcher detects this and prompts for the keychain password once per session, not per command. It is a minor friction point, but the alternative (storing OAuth tokens in plain files) trades convenience for security in the wrong direction.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Agents Should and Shouldn&apos;t Know&lt;/h2&gt;
&lt;p&gt;The principle is simple: minimize the secret surface area for each agent session. An agent should have exactly the secrets it needs to do its job and nothing else.&lt;/p&gt;
&lt;p&gt;In practice, this means:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do not inject production database credentials into a development agent session.&lt;/strong&gt; The environment flag handles this. Dev sessions get dev secrets. Production sessions get production secrets. An agent working on a feature branch has no path to the production database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do not inject secrets the agent will not use.&lt;/strong&gt; The vault convention handles this. An API key stored for rotation tracking or emergency access does not need to be in every session&apos;s environment. Store it in vault, retrieve it when needed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do not inject cross-project secrets.&lt;/strong&gt; Each project path is isolated. An agent working on Project Alpha sees &lt;code&gt;/alpha&lt;/code&gt; secrets. It does not see &lt;code&gt;/beta&lt;/code&gt; or &lt;code&gt;/gamma&lt;/code&gt;. Shared infrastructure keys (the ones that every project needs) are the exception, but they are scoped to infrastructure access, not cross-project data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assume agents will reference what they can see.&lt;/strong&gt; If a secret is in the environment, an agent might mention it in output: a diagnostic message, a commit description, a tool call argument. This is not malicious; it is the natural consequence of an agent having full access to its environment. The defense is structural. Do not put secrets in the environment unless the agent needs them at runtime.&lt;/p&gt;
&lt;p&gt;This is not access control in the traditional sense. There are no ACLs, no role-based permissions, no approval workflows. It is structural separation: organizing secrets so that the default state (everything at the project path gets injected) does the right thing, and exceptions (vault, staging sub-paths) are handled by convention rather than configuration.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fetch at runtime, never store on disk.&lt;/strong&gt; Secrets fetched from a centralized store and injected as environment variables leave no residue. No &lt;code&gt;.env&lt;/code&gt; files to manage, rotate, or accidentally commit. When the process exits, the secrets are gone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validate after fetch, not just existence.&lt;/strong&gt; A key existing with a non-empty value is not enough. Agents have stored descriptions as values, wrong keys at wrong paths, and test values in production environments. Format-aware validation (checking that a webhook secret looks like a hex string, that a PEM key has the correct header) would catch errors that existence checks miss. We are adding these checks incrementally.&lt;/p&gt;
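&lt;p&gt;A sketch of what format-aware validation looks like - the specific rules here are illustrative, not our production checks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// format-aware secret checks, sketched
export function looksLikeHexSecret(v: string): boolean {
  // webhook signing secrets: a long hex string, not a prose description
  return /^[0-9a-f]{32,}$/i.test(v);
}

export function looksLikePemKey(v: string): boolean {
  // PEM-encoded keys always carry a BEGIN header
  return v.trimStart().startsWith("-----BEGIN ");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An existence check passes a secret whose value is the string &quot;Webhook signing secret&quot;; a format check does not.&lt;/p&gt;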
&lt;p&gt;&lt;strong&gt;Structural separation over logical filtering.&lt;/strong&gt; Sub-paths and environment scoping are enforced by the tool&apos;s own behavior, not by application code. No filter lists to maintain. No &quot;inject: false&quot; flags to forget. The directory structure is the access policy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Document where agents look, not just where secrets live.&lt;/strong&gt; A secret that exists but is not discoverable by the agent is effectively missing. Agent instruction files need to include the vault convention, the command to check vault paths, and the distinction between &quot;this secret does not exist&quot; and &quot;this secret is stored but not injected.&quot; The discovery path is as important as the storage path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test path scoping before migrating real secrets.&lt;/strong&gt; Create a dummy secret at the target path, verify the scoping behavior, then move the real credential. A 30-second test prevents a window where a production secret is unreachable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The threat model is different for agents.&lt;/strong&gt; Humans rarely echo secrets into output. Agents do it naturally as part of diagnostic reasoning, task summaries, and tool call construction. Design the system so that the default state (everything the agent can see) is the minimum it needs. Structural separation is more reliable than procedural rules, though both are necessary.&lt;/p&gt;
</content:encoded><category>secrets</category><category>infrastructure</category><category>agent-workflow</category><category>infisical</category></item><item><title>Multi-Model Code Review - Why One AI Isn&apos;t Enough</title><link>https://venturecrane.com/articles/multi-model-code-review/</link><guid isPermaLink="true">https://venturecrane.com/articles/multi-model-code-review/</guid><description>Why sending code through multiple AI models with different strengths produces higher-confidence findings than any single model.</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When we run the same codebase through three different AI models, we get three meaningfully different sets of findings. Not contradictory - complementary. One model catches timing-unsafe cryptographic comparisons in an authentication module. Another flags a 1,000-line monolith that the first model does not mention. A third spots naming inconsistencies across API surfaces that neither of the others notices.&lt;/p&gt;
&lt;p&gt;None of these findings are wrong. Each model reviews the same code through a different lens, and each lens reveals something the others miss. Security pattern recognition is not the same skill as architectural analysis, which is not the same skill as cross-file consistency checking.&lt;/p&gt;
&lt;p&gt;Code review is not a single-skill task. It requires architectural judgment, security pattern recognition, and structural consistency analysis - simultaneously. No single model excels at all three. This is the same reason human teams do code review with multiple reviewers: one person&apos;s blind spot is another person&apos;s expertise.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Single-Model Review Plateaus&lt;/h2&gt;
&lt;p&gt;Every model has blind spots shaped by its training emphasis. Claude reasons deeply about architecture and security implications - it will trace how a monolithic file structure impacts testability, which impacts security coverage, and assign grades using concrete thresholds. But it can miss repetitive structural patterns that are obvious to a model trained heavily on code. Codex finds antipatterns that humans bake into habit: subtle type coercions, inconsistent error handling patterns, test helpers that mask failures. Gemini&apos;s structured output mode makes it efficient for cross-file consistency checks - comparing naming conventions, API surface shapes, and type safety across module boundaries.&lt;/p&gt;
&lt;p&gt;A single-model review gives you one perspective. That is the same problem as having one reviewer on a team of five. The reviewer might be excellent, but they will still have blind spots. When we ran our first single-model reviews, the findings were useful but incomplete. The model would catch security issues and miss architectural problems, or vice versa, depending on which model we used.&lt;/p&gt;
&lt;p&gt;The pattern became clear: the findings we were most confident about were the ones that multiple reviewers would have agreed on. We just did not have multiple reviewers yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Three Roles&lt;/h2&gt;
&lt;p&gt;We frame each model by its role in the review, not by marketing names. The model behind each role can change; the role itself is stable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Architect&lt;/strong&gt; handles deep semantic analysis across seven dimensions: architecture, security, code quality, testing, dependencies, documentation, and standards compliance. This role understands interdependencies. A monolithic file does not just fail an architecture check - it impacts testability (hard to isolate for testing) which impacts security coverage (untested auth paths). The Architect assigns grades using concrete thresholds, making scores comparable across repos and over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Pattern Scanner&lt;/strong&gt; runs agentic code analysis with full filesystem access. It finds antipatterns the rubric might not specify: timing-unsafe comparisons using string equality on secrets, dynamic &lt;code&gt;require()&lt;/code&gt; calls in ESM modules, module-level mutable state used as caches without TTL. These are the findings that come from pattern recognition across millions of codebases, not from a checklist.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Consistency Checker&lt;/strong&gt; produces structured JSON output with strict schemas. Its job is cross-file analysis: are naming conventions consistent across all API endpoints? Do error handling patterns match between modules? Are type safety practices uniform across the codebase? These consistency findings are boring individually but valuable in aggregate - they are the difference between a codebase that feels coherent and one that feels like five different developers with five different style guides.&lt;/p&gt;
&lt;p&gt;An honest note: Phase 1 (Architect-only) is live and producing real scorecards. The Pattern Scanner and Consistency Checker are designed and will ship when we have validated the convergence layer. We are writing about the design because the architecture is interesting regardless of which phase is deployed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Convergence - Where Confidence Comes From&lt;/h2&gt;
&lt;p&gt;The multi-model design is only useful if the findings can be merged intelligently. Three unrelated lists of issues is not better than one list. The value comes from convergence.&lt;/p&gt;
&lt;p&gt;The orchestrator groups findings by file and description similarity. When two or more models flag the same issue, the finding&apos;s confidence increases. Unique findings from each model are preserved, not discarded - a single model catching something the others missed is still a valid finding; it just has lower confidence than a consensus finding.&lt;/p&gt;
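&lt;p&gt;A sketch of the grouping step, using exact normalized matching where the real orchestrator uses description similarity - the &lt;code&gt;Finding&lt;/code&gt; shape and the two-tier scoring are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// convergence sketch: findings reported by multiple models gain confidence
interface Finding { file: string; description: string; models: string[]; }

export function merge(all: Finding[]): Finding[] {
  const byKey: { [k: string]: Finding } = {};
  for (const f of all) {
    const key = `${f.file}:${f.description.toLowerCase()}`;
    const seen = byKey[key];
    if (seen) {
      seen.models = seen.models.concat(f.models); // consensus: another model agrees
    } else {
      byKey[key] = { ...f };
    }
  }
  return Object.values(byKey);
}

export function confidence(f: Finding): string {
  return f.models.length >= 2 ? "high" : "low";
}
&lt;/code&gt;&lt;/pre&gt;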
&lt;p&gt;Here is a concrete example from a real review. The Architect flagged timing-unsafe secret comparison in an authentication module: the code used plain string equality (&lt;code&gt;===&lt;/code&gt;) to compare HMAC signatures, which is vulnerable to timing side-channel attacks. The Pattern Scanner would flag the same issue independently - string equality on secrets is a known antipattern in its training data. That is a 2/3 consensus finding. High confidence, immediate action. The fix is specific: use &lt;code&gt;crypto.subtle.timingSafeEqual&lt;/code&gt; by converting both hex strings to &lt;code&gt;Uint8Array&lt;/code&gt; before comparison.&lt;/p&gt;
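&lt;p&gt;The same fix sketched for Node.js, which exposes &lt;code&gt;timingSafeEqual&lt;/code&gt; from &lt;code&gt;node:crypto&lt;/code&gt; (the &lt;code&gt;crypto.subtle.timingSafeEqual&lt;/code&gt; variant mentioned above is a non-standard extension available in some runtimes):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { timingSafeEqual } from "node:crypto";

export function signaturesMatch(expectedHex: string, receivedHex: string): boolean {
  const a = Buffer.from(expectedHex, "hex");
  const b = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first;
  // length is not secret, so the early return leaks nothing useful
  if (a.length !== b.length) return false;
  // constant-time comparison: no early exit on the first differing byte
  return timingSafeEqual(a, b);
}
&lt;/code&gt;&lt;/pre&gt;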
&lt;p&gt;Compare that to a finding only the Consistency Checker reports: naming convention mismatches between two API modules. Still worth fixing, but lower confidence and lower priority. The convergence layer makes this distinction automatically.&lt;/p&gt;
&lt;p&gt;Graceful degradation is built in. If the Pattern Scanner or Consistency Checker fails - API timeout, unexpected output, auth error - the review completes with reduced confidence and notes the gap. Every external call has a timeout and skip-on-failure path. No single point of failure blocks the review. A single-model review is still a complete review; it just has a narrower perspective.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Rubric - Making Grades Comparable&lt;/h2&gt;
&lt;p&gt;&quot;The codebase needs work&quot; is useless feedback. &quot;Architecture: C - three files over 500 lines, unclear domain boundaries&quot; is actionable. The rubric exists to make grades mean the same thing across repos and over time.&lt;/p&gt;
&lt;p&gt;Seven dimensions, each graded A through F with concrete thresholds:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Architecture&lt;/strong&gt;: File organization, separation of concerns, monolith risk. Grade C means 3+ files exceeding 500 lines or unclear domain boundaries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Auth middleware, injection vulnerabilities, secrets handling. Any high-severity finding (timing-unsafe comparison, auth bypass) is an automatic D.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;: TypeScript strictness, error handling patterns, naming. Three or more &lt;code&gt;any&lt;/code&gt; usages means C at best.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Coverage of critical paths, assertion quality, mock patterns. A test framework that is present but has significant coverage gaps is a C.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;: Audit vulnerabilities, version currency, unused packages. Medium-severity audit findings, being 2+ major versions behind, or carrying 3+ unused dependencies means a C.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: CLAUDE.md completeness, README quality, API docs. Documentation that exists but is missing key sections is a B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standards Compliance&lt;/strong&gt;: Adherence to the project&apos;s own documented standards at the appropriate tier.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall grade is the mode of dimension grades, pulled toward the worst grade if any dimension is D or F. This prevents a codebase with excellent architecture but critical security vulnerabilities from getting a passing score.&lt;/p&gt;
&lt;p&gt;When we ran this against a real codebase, the Architect assigned seven dimension grades in a single pass. The overall came out to C - driven down by a D in security (timing-unsafe comparisons) and Cs in architecture and code quality. That breakdown is immediately actionable. Fix the timing issues first (security D to B is one PR). Then address the monolith (architecture C to B is a refactoring session). Progress is measurable.&lt;/p&gt;
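&lt;p&gt;The grading rule is compact enough to sketch. This is a hypothetical reconstruction, not the production code: take the mode of the dimension grades (ties resolved toward the worse letter), and when any dimension is a D or F, cap the overall at no better than one letter above the worst dimension - one plausible reading that is consistent with the worked example above.&lt;/p&gt;

```typescript
// Hypothetical sketch of the overall-grade rule, not the production code.
type Grade = "A" | "B" | "C" | "D" | "F";

const ORDER: Grade[] = ["A", "B", "C", "D", "F"];

function overallGrade(dimensions: Grade[]): Grade {
  // Mode of the dimension grades; ties resolve toward the worse letter.
  const counts = new Map();
  for (const g of dimensions) counts.set(g, (counts.get(g) ?? 0) + 1);
  let mode = dimensions[0];
  for (const g of ORDER) {
    if ((counts.get(g) ?? 0) >= (counts.get(mode) ?? 0)) mode = g;
  }
  // Pull toward the worst grade when any dimension is failing: the overall
  // can be no better than one letter above the worst dimension.
  const worstIdx = Math.max(...dimensions.map((g) => ORDER.indexOf(g)));
  if (ORDER[worstIdx] === "D" || ORDER[worstIdx] === "F") {
    return ORDER[Math.max(ORDER.indexOf(mode), worstIdx - 1)];
  }
  return mode;
}
```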
&lt;hr /&gt;
&lt;h2&gt;Cross-Repo Drift Detection&lt;/h2&gt;
&lt;p&gt;Individual code reviews answer &quot;how healthy is this repo?&quot; A different question matters when you run multiple projects: &quot;are our repos staying aligned?&quot;&lt;/p&gt;
&lt;p&gt;We built a separate enterprise-level audit for this. It collects structural snapshots from every repo - dependency versions, TypeScript configuration, ESLint settings, CI workflows, standards compliance - and builds a drift report.&lt;/p&gt;
&lt;p&gt;Three categories of drift:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration drift&lt;/strong&gt;: TypeScript version mismatches, ESLint major version differences across repos, divergent tsconfig settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structural drift&lt;/strong&gt;: Inconsistent API file conventions, missing CI workflows, incomplete documentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Practice drift&lt;/strong&gt;: Some repos have pre-commit hooks and others do not. Some have secret scanning configured and others lack it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The output is a set of comparison tables and a ranked list of drift hotspots. No AI interpretation needed for this step - it is structural comparison, not semantic analysis. The value is visibility: knowing that one repo is two ESLint majors behind the others before it becomes a migration emergency.&lt;/p&gt;
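&lt;p&gt;For a single dependency, the structural comparison reduces to grouping repos by declared major version and flagging the ones that trail the fleet consensus. A minimal sketch with hypothetical names, not the audit code:&lt;/p&gt;

```typescript
// Strip range prefixes (^, ~) and take the major version number.
function majorOf(version: string): number {
  return Number(version.replace(/^[~^]/, "").split(".")[0]);
}

// Given repo -> declared version, report the most common major and the
// repos that sit behind it.
function driftReport(versions: { [repo: string]: string }) {
  const majors = new Map();
  for (const v of Object.values(versions)) {
    const m = majorOf(v);
    majors.set(m, (majors.get(m) ?? 0) + 1);
  }
  let consensus = -1;
  for (const [m, n] of majors) {
    if (n > (majors.get(consensus) ?? 0)) consensus = m;
  }
  const laggards = Object.entries(versions)
    .filter(([, v]) => majorOf(v) < consensus)
    .map(([repo, v]) => ({ repo, version: v, behindBy: consensus - majorOf(v) }));
  return { consensus, laggards };
}
```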
&lt;hr /&gt;
&lt;h2&gt;The Feedback Loop&lt;/h2&gt;
&lt;p&gt;Scorecards get stored in an enterprise knowledge base. Each review compares against the last. The trend column - new, up, down, stable - gives at-a-glance health over time without re-reading full reports.&lt;/p&gt;
&lt;p&gt;Critical and high-severity findings can generate GitHub issues tagged with &lt;code&gt;source:code-review&lt;/code&gt;, when the Captain approves issue creation. On the next review, the system checks which issues are resolved before flagging the same findings again. This closes the loop: review finds problems, issues track fixes, next review confirms resolution.&lt;/p&gt;
&lt;p&gt;The trend matters more than any individual grade. A repo that moved from D to C in security is in better shape than one that has been sitting at B for three reviews with the same unresolved findings. Movement means the reviews are driving action. Stagnation means the reviews are being ignored.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Real Value&lt;/h2&gt;
&lt;p&gt;The real value of multi-model code review is not any single model&apos;s output. It is the convergence - the signal that emerges when multiple independent reviewers with genuinely different strengths agree on a finding. That is how human code review works at its best: multiple perspectives, each catching what the others miss, with the highest-confidence findings being the ones everyone agrees on.&lt;/p&gt;
&lt;p&gt;We are building the same dynamic with AI models. Phase 1 proves the rubric, the grading, and the feedback loop work. Phase 2 adds the perspectives that make the findings trustworthy enough to act on without second-guessing.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This article describes an automated code review system that grades codebases across seven dimensions using structured rubrics, stores scorecards for trend tracking, and detects configuration drift across multiple repositories. Phase 1 (single-model) is in production. Phase 2 (multi-model with convergence) is in design.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>agent-teams</category><category>process</category><category>code-quality</category></item><item><title>How We Built an Agent Context Management System</title><link>https://venturecrane.com/articles/agent-context-management-system/</link><guid isPermaLink="true">https://venturecrane.com/articles/agent-context-management-system/</guid><description>Building centralized context management so AI agents start every session with the right knowledge.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When running AI coding agents across multiple machines and sessions, context is the bottleneck. Each session starts cold. The agent doesn&apos;t know what happened yesterday, what another agent is working on right now, or what the project&apos;s business context is.&lt;/p&gt;
&lt;p&gt;Existing approaches - committing markdown handoff files to git, setting environment variables, pasting context manually - are fragile and don&apos;t scale past a single developer on a single machine.&lt;/p&gt;
&lt;p&gt;We built a centralized context management system to solve this. It gives every agent session, on any machine, immediate access to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Session continuity&lt;/strong&gt; - what happened last time, where things were left off&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel awareness&lt;/strong&gt; - who else is working, on what, right now&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprise knowledge&lt;/strong&gt; - business context, product requirements, strategy docs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational documentation&lt;/strong&gt; - team workflows, API specs, coding standards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Work queue visibility&lt;/strong&gt; - GitHub issues by priority and status&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The system is designed for a small team (1-5 humans) running multiple AI agent sessions in parallel across a fleet of development machines.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Architecture Overview&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────┐
│                    Developer Machine(s)                  │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Claude Code  │  │ Claude Code  │  │  Gemini CLI  │    │
│  │  Session 1   │  │  Session 2   │  │  Session 3   │    │
│  │ (Feature A)  │  │ (Feature B)  │  │  (Planning)  │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│  ┌──────▼─────────────────▼─────────────────▼─────────┐  │
│  │             Local MCP Server (stdio)               │  │
│  │  • Git repo detection   • GitHub CLI integration   │  │
│  │  • Session rendering    • Doc self-healing         │  │
│  └───────────────────────────┬────────────────────────┘  │
│                              │                           │
│  ┌──────────────────────┐    │                           │
│  │  CLI launcher        │    │                           │
│  │  • Infisical secrets │    │                           │
│  │  • Venture routing   │    │                           │
│  │  • MCP registration  │    │                           │
│  └──────────────────────┘    │                           │
└──────────────────────────────┼───────────────────────────┘
                               │ HTTPS
                               ▼
┌──────────────────────────────────────────────────────────┐
│                 Cloudflare Workers + D1                  │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Context API  │  │ Knowledge    │  │ GitHub       │    │
│  │ • Sessions   │  │ Store        │  │ Classifier   │    │
│  │ • Handoffs   │  │ • Notes      │  │ • Webhooks   │    │
│  │ • Heartbeats │  │ • Tags       │  │ • Grading    │    │
│  │ • Doc audit  │  │ • Scope      │  │ • Labels     │    │
│  │ • Rate limits│  │              │  │              │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         └─────────────────┼─────────────────┘            │
│                  ┌────────▼────────┐                     │
│                  │   D1 Database   │                     │
│                  │  (SQLite edge)  │                     │
│                  └─────────────────┘                     │
└──────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Key design decisions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Separation of concerns.&lt;/strong&gt; GitHub owns work artifacts (issues, PRs, code). The context system owns operational state (sessions, handoffs, knowledge). Neither duplicates the other.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge-first.&lt;/strong&gt; Cloudflare Workers + D1 means the API is globally distributed with ~20ms latency. No servers to manage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Claude Code-native, multi-CLI aspirational.&lt;/strong&gt; The system is deeply integrated with Claude Code&apos;s slash commands, project instructions, and memory files. The launcher also supports Gemini CLI and Codex CLI, but Claude Code is the primary integration. The context API itself is plain HTTP + MCP, genuinely CLI-agnostic at the protocol layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry-safe.&lt;/strong&gt; All mutating endpoints are idempotent. Calling SOD (start of day) twice returns the same session. Calling EOD (end of day) twice is a no-op on an already-ended session.&lt;/li&gt;
&lt;/ul&gt;
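&lt;p&gt;The idempotency contract can be illustrated with an in-memory stand-in for the D1 table (hypothetical code, not the worker implementation): the session is keyed on agent + venture + repo, so a retried SOD resumes the existing active row instead of creating a duplicate.&lt;/p&gt;

```typescript
interface Session { id: string; agent: string; venture: string; repo: string; active: boolean }

// In-memory stand-in for the D1 sessions table.
const sessions = new Map();

function sod(agent: string, venture: string, repo: string): Session {
  const key = `${agent}:${venture}:${repo}`;
  const existing = sessions.get(key);
  if (existing && existing.active) return existing; // retry resumes, never duplicates
  const created: Session = {
    id: `sess_${Math.random().toString(36).slice(2)}`,
    agent, venture, repo, active: true,
  };
  sessions.set(key, created);
  return created;
}

function eod(agent: string, venture: string, repo: string): void {
  const s = sessions.get(`${agent}:${venture}:${repo}`);
  if (s) s.active = false; // ending an already-ended session is a no-op
}
```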
&lt;hr /&gt;
&lt;h2&gt;Machine Setup&lt;/h2&gt;
&lt;p&gt;The primary entry point for agent sessions is a Node.js CLI launcher that handles secrets, routing, and agent spawning in a single command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;launcher alpha         # Claude Code for Project Alpha
launcher beta --gemini # Gemini CLI for Project Beta
launcher gamma --codex # Codex CLI for Project Gamma
launcher --list        # Show ventures with install status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;What &lt;code&gt;launcher &amp;lt;project&amp;gt;&lt;/code&gt; does internally:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Resolves the agent&lt;/strong&gt; - checks &lt;code&gt;--claude | --gemini | --codex&lt;/code&gt; flags, defaults to &lt;code&gt;claude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validates the binary&lt;/strong&gt; - confirms the agent CLI is on &lt;code&gt;PATH&lt;/code&gt;; prints install hint if missing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loads venture config&lt;/strong&gt; - reads &lt;code&gt;config/ventures.json&lt;/code&gt; for project metadata and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discovers the local repo&lt;/strong&gt; - scans &lt;code&gt;~/dev/&lt;/code&gt; for git repos matching the venture&apos;s org&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fetches secrets&lt;/strong&gt; - calls Infisical to get project-specific API keys and tokens, frozen for the session lifetime&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ensures MCP registration&lt;/strong&gt; - copies the right MCP config file for the selected agent CLI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-heals MCP binary&lt;/strong&gt; - if the MCP server isn&apos;t found on &lt;code&gt;PATH&lt;/code&gt;, auto-rebuilds and re-links&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spawns the agent&lt;/strong&gt; - &lt;code&gt;cd&lt;/code&gt; to the repo, launch the CLI with all secrets injected as environment variables&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This eliminates the need to manually set environment variables, navigate to repos, or configure MCP servers. One command, fully configured session.&lt;/p&gt;
&lt;p&gt;Projects are registered in &lt;code&gt;config/ventures.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;ventures&quot;: [
    {
      &quot;code&quot;: &quot;alpha&quot;,
      &quot;name&quot;: &quot;Project Alpha&quot;,
      &quot;org&quot;: &quot;example-org&quot;,
      &quot;capabilities&quot;: [&quot;has_api&quot;, &quot;has_database&quot;]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;capabilities&lt;/code&gt; array drives conditional behavior: documentation requirements, schema audits, and API doc generation are only triggered for ventures with matching capabilities.&lt;/p&gt;
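&lt;p&gt;The gating logic itself is a one-line filter. A sketch with assumed field names:&lt;/p&gt;

```typescript
interface Requirement { doc: string; requiresCapability?: string }

// A requirement applies to every venture unless it is gated on a capability
// the venture does not declare.
function applicableDocs(capabilities: string[], requirements: Requirement[]): string[] {
  return requirements
    .filter((r) => !r.requiresCapability || capabilities.includes(r.requiresCapability))
    .map((r) => r.doc);
}
```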
&lt;p&gt;&lt;strong&gt;Bootstrap takes about five minutes on a new machine.&lt;/strong&gt; A single script handles all of it: install Node.js dependencies, build the MCP package, link binaries to &lt;code&gt;PATH&lt;/code&gt;, copy &lt;code&gt;.mcp.json&lt;/code&gt; templates, and validate API connectivity.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./scripts/bootstrap-machine.sh
=== Bootstrap ===
✓ Node.js 20 installed
✓ MCP server built and linked
✓ Launcher and MCP server on PATH
✓ API reachable
✓ MCP connected
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This replaced a manual process that required configuring 3+ environment variables, installing skill scripts, and debugging OAuth conflicts - often taking 2+ hours per machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fleet management&lt;/strong&gt; uses machine registration with the context API. Each machine registers its hostname, OS, architecture, Tailscale IP, and SSH public keys. A fleet health script checks all registered machines in parallel, verifying SSH connectivity, disk space, and service status.&lt;/p&gt;
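&lt;p&gt;The parallel check pattern looks roughly like this (the probe function is a placeholder for the real SSH, disk, and service checks): every machine is probed concurrently, and a single unreachable host degrades to a per-machine error instead of blocking the sweep.&lt;/p&gt;

```typescript
// Fan-out health sweep: one probe per registered machine, all in flight at
// once. Promise.allSettled guarantees every result comes back, failed or not.
async function fleetHealth(machines: string[], probe: Function) {
  const results = await Promise.allSettled(machines.map((m) => probe(m)));
  return machines.map((host, i) => {
    const r = results[i];
    return {
      host,
      healthy: r.status === "fulfilled" && r.value === true,
      error: r.status === "rejected" ? String(r.reason) : undefined,
    };
  });
}
```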
&lt;hr /&gt;
&lt;h2&gt;Session Lifecycle&lt;/h2&gt;
&lt;p&gt;Every agent session begins with Start of Day (SOD). In Claude Code, the &lt;code&gt;/sod&lt;/code&gt; slash command orchestrates a multi-step initialization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cache docs&lt;/strong&gt; - pre-fetch documentation from the context API to a local temp directory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Preflight&lt;/strong&gt; - validate API key, &lt;code&gt;gh&lt;/code&gt; CLI auth, git repo detection, API connectivity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create/resume session&lt;/strong&gt; - if an active session exists for this agent+project+repo tuple, resume it; otherwise create new&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load last handoff&lt;/strong&gt; - retrieve the structured summary from the previous session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Show P0 issues&lt;/strong&gt; - query GitHub for critical priority issues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Show active sessions&lt;/strong&gt; - list other agents currently working on the same project&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Two-stage doc delivery&lt;/strong&gt; - return doc metadata by default (titles, freshness); fetch full content on demand&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check documentation health&lt;/strong&gt; - audit for missing or stale docs, self-heal where possible&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check weekly plan&lt;/strong&gt; - show current priority venture, alert if the plan is stale&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  VENTURE:  Project Alpha (alpha)            │
│  REPO:     example-org/alpha-console        │
│  BRANCH:   main                             │
│  SESSION:  sess_01HQXV3NK8...               │
└─────────────────────────────────────────────┘

### Last Handoff
From: agent-mac1
Status: in_progress
Summary: Implemented user auth middleware, PR #42 open.
         Tests passing. Need to add rate limiting.

### P0 Issues (Drop Everything)
- #99: Production API returning 500s on /checkout

### Weekly Plan
✓ Valid (2 days old) - Priority: alpha

### Other Active Sessions
- agent-mac2 on example-org/alpha-console (Issue #87)

### Enterprise Context
#### Project Alpha Executive Summary
Project Alpha is a Series A SaaS company building...

What would you like to focus on?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During work, the session can be updated with current branch and commit SHA, arbitrary metadata (last file edited, current issue, etc.), and heartbeat pings to prevent staleness. Heartbeats use server-side jitter (10min base +/- 2min) to prevent thundering herd across many agents.&lt;/p&gt;
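&lt;p&gt;The jitter itself is a small calculation - a sketch, assuming a uniform spread:&lt;/p&gt;

```typescript
// 10-minute base interval with up to 2 minutes of random spread either side,
// so agents started together drift apart instead of heartbeating in lockstep.
const BASE_MS = 10 * 60 * 1000;
const JITTER_MS = 2 * 60 * 1000;

function nextHeartbeatDelay(): number {
  // Uniform in [BASE_MS - JITTER_MS, BASE_MS + JITTER_MS]
  return BASE_MS + (Math.random() * 2 - 1) * JITTER_MS;
}
```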
&lt;p&gt;&lt;strong&gt;End of Day uses a dual-write pattern.&lt;/strong&gt; Two complementary EOD mechanisms write to different stores.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;handoff&lt;/code&gt; MCP tool writes a structured handoff to D1 via the context API. The handoff is stored as canonical JSON (RFC 8785) with SHA-256 hash, scoped to venture + repo + agent. The next session&apos;s SOD call retrieves it automatically.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;/eod&lt;/code&gt; slash command writes a markdown handoff to &lt;code&gt;docs/handoffs/DEV.md&lt;/code&gt; and commits it to the repo. The agent synthesizes from conversation history, &lt;code&gt;git log&lt;/code&gt;, PRs created, and issues touched. The output is structured into accomplished, in progress, blocked, and next session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why both?&lt;/strong&gt; D1 handoffs provide structured, queryable continuity across agents and machines. Git handoffs provide human-readable history visible in PRs and code review. Different audiences, different stores.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The agent summarizes. The human confirms.&lt;/strong&gt; The human never writes the handoff. The agent has full session context and synthesizes it. The user gets a single yes/no before committing.&lt;/p&gt;
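&lt;p&gt;The canonical-hash step can be approximated in a few lines. Full RFC 8785 also pins down number and string serialization; this simplified sketch handles only the property that matters most here - key order - so the same handoff always produces the same SHA-256:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Recursively serialize with sorted object keys. A simplified stand-in for
// RFC 8785 canonicalization, sufficient to make hashing order-independent.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize((value as any)[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

function handoffHash(handoff: object): string {
  return createHash("sha256").update(canonicalize(handoff)).digest("hex");
}
```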
&lt;p&gt;Sessions have a 45-minute idle timeout. If no heartbeat arrives, the session drops out of &quot;active&quot; queries. The next SOD for the same agent creates a fresh session.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Parallel Agent Coordination&lt;/h2&gt;
&lt;p&gt;Multiple agents working on the same codebase need to know about each other. Without coordination, two agents pick the same issue, branch conflicts arise from simultaneous work on the same files, and handoffs overwrite each other.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Session awareness&lt;/strong&gt; is the first layer. SOD shows all active sessions for the same project. Each session records agent identity, repo, branch, and optionally the issue being worked on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branch isolation&lt;/strong&gt; provides the second layer. Each agent instance uses a dedicated branch prefix:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dev/host/fix-auth-timeout
dev/instance1/add-lot-filter
dev/instance2/update-schema
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rules are simple: one branch per agent at a time, always branch from main, coordinate via PRs not shared files, push frequently for visibility.&lt;/p&gt;
&lt;p&gt;The D1 schema also supports a &lt;strong&gt;track system&lt;/strong&gt; (designed, not actively used). Issues can be assigned to numbered tracks, with agents claiming a track at SOD time and only seeing issues for their track. The schema and indexes are in place - ready to activate when parallel agent operations become routine.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Agent 1: SOD project track-1  → works on track 1 issues
Agent 2: SOD project track-2  → works on track 2 issues
Agent 3: SOD project track-0  → planning/backlog organization
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When work transfers between agents (or between machines), the source agent commits a checkpoint, pushes, and records a structured handoff via the MCP tool. The target agent receives the handoff automatically at SOD, fetches the branch, and continues work.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Enterprise Knowledge Store&lt;/h2&gt;
&lt;p&gt;Agents need business context to make good decisions. &quot;What does this company do?&quot; &quot;What&apos;s the product strategy?&quot; &quot;Who&apos;s the target customer?&quot; This knowledge is durable - it doesn&apos;t change session to session - but agents need it injected at session start.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;notes&lt;/code&gt; table in D1 stores typed knowledge entries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE notes (
  id TEXT PRIMARY KEY,   -- note_&amp;lt;ULID&amp;gt;
  title TEXT,
  content TEXT NOT NULL,
  tags TEXT,              -- JSON array of tag strings
  venture TEXT,           -- scope (null = global)
  archived INTEGER NOT NULL DEFAULT 0,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  actor_key_id TEXT,
  meta_json TEXT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notes are organized by controlled tags (recommended, not enforced):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executive-summary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Company/project overviews, mission, tech stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Product requirements documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Design briefs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strategy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Strategic assessments, founder reflections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;methodology&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Frameworks, processes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;market-research&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Competitors, market analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Founder/team bios&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;marketing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Service descriptions, positioning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;governance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Legal, tax, compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;New tags can be added without code changes.&lt;/p&gt;
&lt;p&gt;Notes are scoped to a project (e.g., &lt;code&gt;venture: &quot;alpha&quot;&lt;/code&gt;) or global (&lt;code&gt;venture: null&lt;/code&gt;). At SOD, the system fetches notes tagged &lt;code&gt;executive-summary&lt;/code&gt; scoped to the current project and notes tagged &lt;code&gt;executive-summary&lt;/code&gt; with global scope. These are injected into the agent&apos;s context automatically.&lt;/p&gt;
&lt;p&gt;The knowledge store is specifically for content that makes agents smarter. It is not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A general note-taking app (personal notes go to Apple Notes)&lt;/li&gt;
&lt;li&gt;A code repository (code goes in git)&lt;/li&gt;
&lt;li&gt;A secrets manager (secrets go in Infisical)&lt;/li&gt;
&lt;li&gt;A session log (that&apos;s what handoffs are for)&lt;/li&gt;
&lt;li&gt;An architecture decision record (those go in &lt;code&gt;docs/adr/&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Storage is explicit.&lt;/strong&gt; Notes are only created when a human explicitly asks. The agent never auto-saves to the knowledge store.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Documentation Management&lt;/h2&gt;
&lt;p&gt;Team workflows, API specs, coding standards, and process documentation are stored in D1 (&lt;code&gt;context_docs&lt;/code&gt; table) and versioned:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE context_docs (
  scope TEXT NOT NULL,              -- &apos;global&apos; or venture code
  doc_name TEXT NOT NULL,
  content TEXT NOT NULL,
  content_hash TEXT NOT NULL,       -- SHA-256
  content_size_bytes INTEGER NOT NULL,
  doc_type TEXT NOT NULL DEFAULT &apos;markdown&apos;,
  title TEXT,
  version INTEGER NOT NULL DEFAULT 1,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  uploaded_by TEXT,
  source_repo TEXT,
  source_path TEXT,
  PRIMARY KEY (scope, doc_name)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On SOD, relevant docs are returned to the agent: global docs (same for all projects like team workflow and dev standards) and project-specific docs scoped to the current venture.&lt;/p&gt;
&lt;p&gt;The system self-heals through three cooperating components.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;D1 audit engine&lt;/strong&gt; runs on the worker. It compares &lt;code&gt;doc_requirements&lt;/code&gt; against &lt;code&gt;context_docs&lt;/code&gt; to find gaps. Each requirement specifies a name pattern, scope, capability gate, freshness threshold (default 90 days), and whether auto-generation is allowed.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;doc generator&lt;/strong&gt; runs locally via MCP. It reads source files from the venture repo - &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;README.md&lt;/code&gt;, route files, migrations, schema files, worker configs, OpenAPI specs - and assembles typed documentation (&lt;code&gt;project-instructions&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;schema&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;doc audit tool&lt;/strong&gt; ties them together. It calls the worker to find missing or stale docs, invokes the generator for anything that can be auto-generated, and uploads the results. During &lt;code&gt;/sod&lt;/code&gt;, this pipeline runs automatically. New ventures get baseline documentation without anyone remembering to create it.&lt;/p&gt;
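&lt;p&gt;The gap check at the core of the audit engine reduces to two questions per requirement: does a matching doc exist, and is it fresh enough? An illustrative sketch with assumed field names:&lt;/p&gt;

```typescript
interface DocRow { doc_name: string; updated_at: string }
interface DocRequirement { doc_name: string; max_age_days?: number }

// Compare requirements against stored docs: "missing" if no row matches,
// "stale" if the row is older than the freshness threshold (default 90 days).
function auditDocs(requirements: DocRequirement[], docs: DocRow[], now: Date) {
  const missing: string[] = [];
  const stale: string[] = [];
  for (const req of requirements) {
    const doc = docs.find((d) => d.doc_name === req.doc_name);
    if (!doc) { missing.push(req.doc_name); continue; }
    const maxAgeMs = (req.max_age_days ?? 90) * 24 * 60 * 60 * 1000;
    if (now.getTime() - new Date(doc.updated_at).getTime() > maxAgeMs) {
      stale.push(req.doc_name);
    }
  }
  return { missing, stale };
}
```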
&lt;p&gt;&lt;strong&gt;Sync pipeline.&lt;/strong&gt; When process docs or ADRs are merged to main, a GitHub Actions workflow detects the changes and uploads them to the context API. Version increments and content hashes update automatically. A manual &lt;code&gt;workflow_dispatch&lt;/code&gt; trigger syncs all docs at once for recovery.&lt;/p&gt;
&lt;p&gt;For environments where the MCP server isn&apos;t running, a &lt;strong&gt;cache script&lt;/strong&gt; pre-fetches all documentation to a local temp directory. This ensures offline access and reduces API calls during rapid session restarts.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;MCP Integration&lt;/h2&gt;
&lt;p&gt;The system was originally implemented as bash scripts called via CLI skill/command systems. This proved unreliable: environment variables didn&apos;t pass through to skill execution, auth token conflicts arose between OAuth and API keys, and setup friction was high per machine.&lt;/p&gt;
&lt;p&gt;We rebuilt the integration on MCP (Model Context Protocol), the standard extension mechanism for AI coding tools. It provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reliable auth - API key in config, passed automatically on every request&lt;/li&gt;
&lt;li&gt;Type-safe tools - Zod-validated input/output schemas&lt;/li&gt;
&lt;li&gt;Single-file configuration - one JSON file per machine, no environment variables&lt;/li&gt;
&lt;li&gt;Discoverability - &lt;code&gt;claude mcp list&lt;/code&gt; shows connected servers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Rather than connecting the AI CLI directly to the cloud API, we run a &lt;strong&gt;local MCP server&lt;/strong&gt; (Node.js, TypeScript, stdio transport). It handles git repo detection client-side, calls the cloud context API over HTTPS, queries GitHub via &lt;code&gt;gh&lt;/code&gt; CLI, and self-heals missing documentation. This keeps the cloud API simple (stateless HTTP) while allowing rich client-side behavior.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Transport&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start session, load context&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;handoff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record handoff, end session&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show full GitHub work queue&lt;/td&gt;
&lt;td&gt;Local MCP → &lt;code&gt;gh&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;note&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Store/update enterprise knowledge&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search/retrieve knowledge by tag/scope&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preflight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validate environment setup&lt;/td&gt;
&lt;td&gt;Local MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show current session context&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;doc_audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check and heal documentation&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read weekly priority plan&lt;/td&gt;
&lt;td&gt;Local MCP → file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ventures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List ventures with install status&lt;/td&gt;
&lt;td&gt;Local MCP → API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Claude Code slash commands (&lt;code&gt;.claude/commands/&lt;/code&gt;) add workflow automation on top: &lt;code&gt;/sod&lt;/code&gt;, &lt;code&gt;/eod&lt;/code&gt;, &lt;code&gt;/handoff&lt;/code&gt;, &lt;code&gt;/question&lt;/code&gt;, &lt;code&gt;/merge&lt;/code&gt;, and others. These orchestrate MCP tools, &lt;code&gt;gh&lt;/code&gt; CLI calls, git operations, and file writes into multi-step workflows.&lt;/p&gt;
&lt;p&gt;The launcher binary and MCP server are installed via &lt;code&gt;npm link&lt;/code&gt;, creating symlinks in npm&apos;s global bin. Fleet updates propagate via &lt;code&gt;git pull &amp;amp;&amp;amp; npm run build &amp;amp;&amp;amp; npm link&lt;/code&gt; on each machine.&lt;/p&gt;
&lt;p&gt;The launcher knows about three agent CLIs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;MCP Config Location&lt;/th&gt;
&lt;th&gt;Install Command&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claude&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.mcp.json&lt;/code&gt; (per-repo)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install -g @anthropic-ai/claude-code&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.gemini/settings.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install -g @google/gemini-cli&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;codex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.codex/config.toml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install -g @openai/codex&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Claude Code uses per-repo &lt;code&gt;.mcp.json&lt;/code&gt; files (the launcher copies a template). Gemini and Codex use global configuration files that the launcher auto-populates.&lt;/p&gt;
&lt;p&gt;For remote sessions (SSH into fleet machines), the launcher handles two additional concerns: &lt;strong&gt;Infisical Universal Auth&lt;/strong&gt; for fetching secrets without interactive login, and &lt;strong&gt;macOS Keychain Unlock&lt;/strong&gt; to make Claude Code&apos;s OAuth tokens accessible in headless sessions.&lt;/p&gt;
&lt;p&gt;The context API enforces &lt;strong&gt;per-actor rate limits&lt;/strong&gt;: 100 requests per minute per actor, tracked via atomic D1 upsert. The limit is designed to prevent runaway agent loops, not restrict normal usage. Response headers include &lt;code&gt;X-RateLimit-Remaining&lt;/code&gt; and &lt;code&gt;X-RateLimit-Reset&lt;/code&gt;.&lt;/p&gt;
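&lt;p&gt;The limiter is a fixed-window counter. An in-memory sketch of the same semantics (the production version performs the increment as an atomic D1 upsert):&lt;/p&gt;

```typescript
const LIMIT = 100;
const WINDOW_MS = 60 * 1000;

// Counter per actor per one-minute window; the key changes when the window
// rolls over, which is what resets the count.
const windows = new Map();

function checkRateLimit(actor: string, now: number) {
  const windowStart = Math.floor(now / WINDOW_MS) * WINDOW_MS;
  const key = `${actor}:${windowStart}`;
  const count = (windows.get(key) ?? 0) + 1;
  windows.set(key, count);
  return {
    allowed: count <= LIMIT,
    remaining: Math.max(0, LIMIT - count), // echoed as X-RateLimit-Remaining
    resetAt: windowStart + WINDOW_MS,      // echoed as X-RateLimit-Reset
  };
}
```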
&lt;hr /&gt;
&lt;h2&gt;Workflow Integration&lt;/h2&gt;
&lt;p&gt;All work items live in GitHub Issues. The context system does not duplicate this - it provides a lens into GitHub state at session start time. Issues use namespaced labels for status tracking:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;status:triage → status:ready → status:in-progress → status:qa → status:verified → status:done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routing labels (&lt;code&gt;needs:pm&lt;/code&gt;, &lt;code&gt;needs:dev&lt;/code&gt;, &lt;code&gt;needs:qa&lt;/code&gt;) indicate who needs to act next.&lt;/p&gt;
&lt;p&gt;Not all work needs the same verification. A QA grading system routes verification to the right method:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;th&gt;Verification Method&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CI only&lt;/td&gt;
&lt;td&gt;Refactoring with tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CLI/API check&lt;/td&gt;
&lt;td&gt;API endpoint changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Light visual&lt;/td&gt;
&lt;td&gt;Minor UI tweaks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Full walkthrough&lt;/td&gt;
&lt;td&gt;New feature with user journey&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Security review&lt;/td&gt;
&lt;td&gt;Auth changes, key management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The developer assigns the grade at PR time. The PM can override.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The escalation protocol&lt;/strong&gt; was hard-won from post-mortems where agents churned for 10+ hours without escalating:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Credential not found in 2 min&lt;/td&gt;
&lt;td&gt;Stop. File issue. Ask human.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same error 3 times&lt;/td&gt;
&lt;td&gt;Stop. Escalate with what was tried.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blocked &amp;gt; 30 min on one problem&lt;/td&gt;
&lt;td&gt;Time-box expired. Escalate or pivot.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Activity is not progress. An agent making 50 tool calls without advancing is worse than one that stops and asks for help after 3 failed attempts.&lt;/p&gt;
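&lt;p&gt;The &quot;same error 3 times&quot; row of the protocol is mechanical enough to sketch as a guard. This is hypothetical - the article doesn&apos;t say how agents track repeated failures - but it shows the shape of the rule:&lt;/p&gt;

```typescript
// Hypothetical guard for the "same error 3 times" escalation rule.
// Errors are keyed by a signature (e.g. error class plus message prefix).
const MAX_SAME_ERROR = 3;

class EscalationGuard {
  private counts = new Map<string, number>();

  // Record one failure; returns "escalate" once the threshold is hit.
  record(errorSignature: string): "continue" | "escalate" {
    const n = (this.counts.get(errorSignature) ?? 0) + 1;
    this.counts.set(errorSignature, n);
    return n >= MAX_SAME_ERROR ? "escalate" : "continue";
  }
}
```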
&lt;hr /&gt;
&lt;h2&gt;Data Model&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sessions&lt;/strong&gt; tracks active agent sessions with heartbeat-based liveness:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;id (sess_&amp;lt;ULID&amp;gt;), agent, venture, repo, track, issue_number,
branch, commit_sha, status (active|ended|abandoned),
created_at, last_heartbeat_at, ended_at, end_reason,
actor_key_id, creation_correlation_id, meta_json
&lt;/code&gt;&lt;/pre&gt;
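&lt;p&gt;The liveness decision implied by &lt;code&gt;last_heartbeat_at&lt;/code&gt; reduces to a timestamp comparison. A minimal sketch - the 15-minute timeout is a hypothetical value, not one stated here:&lt;/p&gt;

```typescript
// Sketch of heartbeat-based liveness for the sessions table. The 15-minute
// cutoff is assumed; the real threshold isn't stated in the text.
const HEARTBEAT_TIMEOUT_MS = 15 * 60_000;

function isLive(lastHeartbeatAt: string, nowMs: number): boolean {
  // last_heartbeat_at is assumed to be an ISO 8601 timestamp.
  return nowMs - Date.parse(lastHeartbeatAt) < HEARTBEAT_TIMEOUT_MS;
}
```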
&lt;p&gt;&lt;strong&gt;Handoffs&lt;/strong&gt; stores structured session summaries for cross-session continuity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;id (ho_&amp;lt;ULID&amp;gt;), session_id, venture, repo, track, issue_number,
branch, commit_sha, from_agent, to_agent, status_label,
summary, payload_json (canonical JSON, SHA-256 hashed),
payload_hash, payload_size_bytes, schema_version,
actor_key_id, creation_correlation_id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt; holds enterprise knowledge entries with tag-based taxonomy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;id (note_&amp;lt;ULID&amp;gt;), title, content, tags (JSON array),
venture (scope), archived, created_at, updated_at,
actor_key_id, meta_json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Context Docs&lt;/strong&gt; manages operational documentation with version tracking:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(scope, doc_name) PRIMARY KEY, content, content_hash (SHA-256),
content_size_bytes, doc_type, title, version, created_at,
updated_at, uploaded_by, source_repo, source_path
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Doc Requirements&lt;/strong&gt; defines what docs should exist per venture:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;id, doc_name_pattern, scope_type, scope_venture,
required, condition (capability gate), staleness_days,
auto_generate, generation_sources (JSON array)
&lt;/code&gt;&lt;/pre&gt;
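&lt;p&gt;The &lt;code&gt;staleness_days&lt;/code&gt; check reduces to a date comparison against &lt;code&gt;updated_at&lt;/code&gt;; a minimal sketch of the audit rule:&lt;/p&gt;

```typescript
// Sketch of the staleness check implied by staleness_days: a doc counts as
// stale once its updated_at is older than the configured window.
const MS_PER_DAY = 86_400_000;

function isStale(updatedAt: string, stalenessDays: number, nowMs: number): boolean {
  return nowMs - Date.parse(updatedAt) > stalenessDays * MS_PER_DAY;
}
```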
&lt;p&gt;Supporting tables include &lt;strong&gt;Rate Limits&lt;/strong&gt; (per-actor, per-minute request counters), &lt;strong&gt;Idempotency Keys&lt;/strong&gt; (retry safety on all mutations), &lt;strong&gt;Request Log&lt;/strong&gt; (full audit trail with correlation IDs), and &lt;strong&gt;Machines&lt;/strong&gt; (fleet registration and SSH mesh state).&lt;/p&gt;
&lt;p&gt;Design choices across the schema:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ULID for all IDs&lt;/strong&gt; - sortable, timestamp-embedded, prefixed by type (&lt;code&gt;sess_&lt;/code&gt;, &lt;code&gt;ho_&lt;/code&gt;, &lt;code&gt;note_&lt;/code&gt;, &lt;code&gt;mach_&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Canonical JSON&lt;/strong&gt; (RFC 8785) for handoff payloads, enabling stable SHA-256 hashing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actor key ID&lt;/strong&gt; derived from SHA-256 of the API key (first 16 hex chars) - attribution without storing raw keys&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Two-tier correlation&lt;/strong&gt; - &lt;code&gt;corr_&amp;lt;UUID&amp;gt;&lt;/code&gt; per-request for debugging, plus a stored creation ID for audit trail&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;800KB payload limit&lt;/strong&gt; on handoffs (D1 has a 1MB row limit, leaving headroom)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid idempotency&lt;/strong&gt; - full response body stored if under 64KB, hash-only otherwise&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;7-day request log retention&lt;/strong&gt; - enforced by filtering on read for now; scheduled cleanup is planned&lt;/li&gt;
&lt;/ul&gt;
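&lt;p&gt;A minimal sketch of the canonical-JSON hashing used for handoff payloads. This covers only key ordering - full RFC 8785 also pins down number and string serialization, which is omitted here:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Minimal JCS-style canonicalization: objects serialize with keys in
// lexicographic order, so the same payload always produces the same bytes.
// (Full RFC 8785 also specifies number/string formatting; omitted here.)
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  const obj = value as Record<string, unknown>;
  const parts = Object.keys(obj)
    .sort()
    .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
  return "{" + parts.join(",") + "}";
}

// Stable hash over the canonical form, as stored in payload_hash.
function payloadHash(payload: unknown): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}
```

&lt;p&gt;The point of canonicalization is that two semantically identical payloads with different key order hash identically, so &lt;code&gt;payload_hash&lt;/code&gt; can be used for deduplication and integrity checks.&lt;/p&gt;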
&lt;hr /&gt;
&lt;h2&gt;Security and Access Control&lt;/h2&gt;
&lt;p&gt;Two key tiers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Distribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CONTEXT_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read/write sessions, handoffs, notes&lt;/td&gt;
&lt;td&gt;Per-machine, via Infisical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ADMIN_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upload docs, manage requirements&lt;/td&gt;
&lt;td&gt;CI/CD only, GitHub Secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Both keys are 64-character hex strings generated via &lt;code&gt;openssl rand -hex 32&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every mutating request records an &lt;code&gt;actor_key_id&lt;/code&gt; - the first 16 hex characters of &lt;code&gt;SHA-256(api_key)&lt;/code&gt;. This provides attribution without storing raw keys and an audit trail across all tables. Changing a key changes the actor ID, but old actions remain traceable.&lt;/p&gt;
&lt;p&gt;Every API request gets a &lt;code&gt;corr_&amp;lt;UUID&amp;gt;&lt;/code&gt; correlation ID (generated server-side if not provided by the client). It&apos;s stored in the request log, embedded in records created during that request, and appears in error responses for debugging.&lt;/p&gt;
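&lt;p&gt;Both identifiers are cheap to derive; a sketch of the two helpers as described above:&lt;/p&gt;

```typescript
import { createHash, randomUUID } from "node:crypto";

// Actor attribution: first 16 hex chars of SHA-256(api_key). Keys are never
// stored raw, but every mutation remains attributable to a key.
function actorKeyId(apiKey: string): string {
  return createHash("sha256").update(apiKey).digest("hex").slice(0, 16);
}

// Correlation ID: honor a client-supplied ID, otherwise mint a corr_ UUID
// server-side, as the API does when the client omits one.
function correlationId(provided?: string): string {
  return provided ?? "corr_" + randomUUID();
}
```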
&lt;p&gt;&lt;strong&gt;Secrets never touch disk in plaintext.&lt;/strong&gt; Infisical stores all secrets organized by venture path (&lt;code&gt;/alpha&lt;/code&gt;, &lt;code&gt;/beta&lt;/code&gt;, etc.). The launcher fetches them once at session start and injects them as environment variables. The flow is Infisical to env vars to process memory.&lt;/p&gt;
&lt;p&gt;GitHub Actions runs security checks on every push and PR: &lt;code&gt;npm audit&lt;/code&gt; for dependency vulnerabilities, Gitleaks for secret detection, and &lt;code&gt;tsc --noEmit&lt;/code&gt; for type safety. These also run daily at 6am UTC.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;CI/CD Pipeline&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workflow&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Push to main, PR to main&lt;/td&gt;
&lt;td&gt;TypeScript check, ESLint, Prettier, tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Push, PR, daily at 6am UTC&lt;/td&gt;
&lt;td&gt;NPM audit, Gitleaks, TypeScript validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PR open/update&lt;/td&gt;
&lt;td&gt;Enforces test coverage when &lt;code&gt;test:required&lt;/code&gt; label&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sync Docs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Push to main changing &lt;code&gt;docs/process/&lt;/code&gt; or &lt;code&gt;docs/adr/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Uploads changed docs to Context Worker via admin API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local verification&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm run verify&lt;/code&gt; (typecheck + format + lint + test)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worker deployment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npx wrangler deploy&lt;/code&gt; (from worker directory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server rebuild&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm run build &amp;amp;&amp;amp; npm link&lt;/code&gt; (from the MCP package directory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet MCP update&lt;/td&gt;
&lt;td&gt;&lt;code&gt;scripts/deploy-mcp.sh&lt;/code&gt; (runs rebuild on each machine via SSH)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1 migration&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npx wrangler d1 migrations apply &amp;lt;db-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Pre-commit hooks run Prettier formatting and ESLint fixes on staged files (via lint-staged). Pre-push hooks run full &lt;code&gt;npm run verify&lt;/code&gt;, blocking the push if typecheck, format, lint, or tests fail.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;SOD/EOD discipline produces better work.&lt;/strong&gt; The 30-second overhead of SOD pays for itself within minutes. Agents that start with full context make better decisions from the first tool call. Without it, they spend the first 10-15 minutes rediscovering what the previous session already knew.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structured handoffs beat free-text notes.&lt;/strong&gt; Forcing handoffs into accomplished / in_progress / blocked / next_steps makes them actually useful to the receiving agent. Free-text summaries are too inconsistent - sometimes they capture the right details, sometimes they don&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-healing documentation means it never silently goes stale.&lt;/strong&gt; New projects get baseline docs without anyone remembering to create them. When a project adds an API, the doc generator picks up the routes automatically at next audit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enterprise context injection aligns technical decisions.&lt;/strong&gt; Giving agents business context (executive summaries, product strategy) at session start means they make decisions that fit the product direction, not just the immediate technical problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parallel session awareness prevents duplicate work.&lt;/strong&gt; Simply showing &quot;Agent X is working on Issue #87&quot; at SOD time is enough. Agents check this and pick different work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The launcher eliminated an entire class of setup errors.&lt;/strong&gt; Reducing session setup from &quot;navigate to repo, set env vars, configure MCP, launch CLI&quot; to &lt;code&gt;launcher alpha&lt;/code&gt; made it practical to run sessions on any machine in the fleet without troubleshooting.&lt;/p&gt;
&lt;p&gt;On the harder side:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MCP process lifecycle caused a multi-hour debugging session.&lt;/strong&gt; MCP servers run as subprocesses of the CLI. A &quot;session restart&quot; (context compaction) does NOT restart the MCP process. Only a full CLI exit/relaunch loads new code. This is not obvious and has bitten us multiple times.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auth evolution was painful.&lt;/strong&gt; We went through three auth approaches (environment variables, skill-injected scripts, MCP config). Each migration touched every machine in the fleet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Knowledge store scope creep made the system noisy.&lt;/strong&gt; Early versions auto-saved all kinds of content. Restricting to &quot;content that makes agents smarter&quot; and requiring explicit human approval dramatically improved signal-to-noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale process state is a recurring trap.&lt;/strong&gt; Node.js caches modules at process start. If you rebuild the MCP server but don&apos;t restart the CLI, the old code runs. This is the same root cause as the MCP lifecycle issue but manifests differently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context window budget blew up silently.&lt;/strong&gt; SOD output hit 298K characters in one measured session - roughly a third of the context window consumed before the agent did any work. We addressed this with metadata-only doc delivery and a 12KB budget cap on enterprise notes. The result was a 96% reduction in SOD token consumption.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Infrastructure&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context API&lt;/td&gt;
&lt;td&gt;Cloudflare Worker + D1&lt;/td&gt;
&lt;td&gt;Sessions, handoffs, knowledge, docs, rate limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Classifier&lt;/td&gt;
&lt;td&gt;Cloudflare Worker&lt;/td&gt;
&lt;td&gt;Webhook processing, issue classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;Node.js (TypeScript, stdio)&lt;/td&gt;
&lt;td&gt;Client-side context rendering, doc generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI Launcher&lt;/td&gt;
&lt;td&gt;Node.js (TypeScript)&lt;/td&gt;
&lt;td&gt;Secret injection, venture routing, agent spawn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager&lt;/td&gt;
&lt;td&gt;Infisical&lt;/td&gt;
&lt;td&gt;API keys, tokens per project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet Networking&lt;/td&gt;
&lt;td&gt;Tailscale&lt;/td&gt;
&lt;td&gt;SSH mesh between machines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;Test, deploy, doc sync, security scanning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: Workers deploy via Wrangler (&lt;code&gt;npx wrangler deploy&lt;/code&gt;). MCP server builds locally and links via &lt;code&gt;npm link&lt;/code&gt;. Fleet updates propagate via git pull + rebuild on each machine, either manually or via a fleet deployment script.&lt;/p&gt;
&lt;p&gt;Architectural Decision Records live in &lt;code&gt;docs/adr/&lt;/code&gt; and sync to D1 via the doc sync workflow. They serve as the authoritative record for &quot;why is it built this way?&quot; questions that agents encounter during development.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;SSH Mesh Networking&lt;/h2&gt;
&lt;p&gt;With 5+ development machines (mix of macOS and Linux), manually maintaining SSH config, authorized keys, and connectivity is error-prone. Add a machine, and you need to update every other machine&apos;s config. Lose a key, and half the fleet can&apos;t reach the new box.&lt;/p&gt;
&lt;p&gt;A single script (&lt;code&gt;setup-ssh-mesh.sh&lt;/code&gt;) establishes bidirectional SSH between all machines in the fleet. It runs in five phases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Phase 1: Preflight
  - Verify this machine is in the registry
  - Check local SSH key exists (Ed25519)
  - Verify macOS Remote Login is enabled
  - Test SSH connectivity to each remote machine

Phase 2: Collect Public Keys
  - Read local pubkey
  - SSH to each remote machine, collect its pubkey
  - If a remote machine has no key, generate one automatically

Phase 3: Distribute authorized_keys
  - For each reachable machine, ensure every other machine&apos;s
    pubkey is in its authorized_keys
  - Idempotent - checks before adding, never duplicates

Phase 4: Deploy SSH Config Fragments
  - Writes ~/.ssh/config.d/fleet-mesh on each machine
  - Never overwrites ~/.ssh/config (uses Include directive)
  - Each machine gets a config with entries for every other machine
  - Uses Tailscale IPs (stable across networks)

Phase 5: Verify Mesh
  - Tests every source→target pair (including hop tests from remotes)
  - Prints a verification matrix
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;SSH Mesh Verification
==========================================
From\To     | mac1      | server1   | server2   | laptop1
------------|-----------|-----------|-----------|----------
mac1        | --        | OK        | OK        | OK
server1     | OK        | --        | OK        | OK
server2     | OK        | OK        | --        | OK
laptop1     | OK        | OK        | OK        | --
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key design decisions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Config fragments, not config files.&lt;/strong&gt; The mesh script writes &lt;code&gt;~/.ssh/config.d/fleet-mesh&lt;/code&gt;, included via &lt;code&gt;Include config.d/*&lt;/code&gt; in the main SSH config. User-maintained SSH settings are never touched.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API-driven machine registry.&lt;/strong&gt; When the context API key is available, the script fetches the machine list from the API. New machines appear in the mesh automatically on next run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tailscale IPs.&lt;/strong&gt; All SSH config uses Tailscale IPs (100.x.x.x), which are stable regardless of physical network.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotent and safe.&lt;/strong&gt; Checks before adding keys, never removes existing entries, supports &lt;code&gt;DRY_RUN=true&lt;/code&gt; for previewing changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All machines run Tailscale, a WireGuard-based mesh VPN. Traffic goes directly between machines when possible (peer-to-peer, not through a relay). Each machine gets a fixed 100.x.x.x address.&lt;/p&gt;
&lt;p&gt;Tailscale handles the hard parts: NAT traversal behind firewalls and cellular networks, automatic peer discovery via a coordination server, and hostname resolution via MagicDNS. It replaces the need for port forwarding, dynamic DNS, and VPN servers. All traffic flows over the encrypted Tailscale tunnel.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;tmux and Remote Sessions&lt;/h2&gt;
&lt;p&gt;AI coding sessions can run for hours. If the SSH connection drops - network change, laptop sleep, timeout - the session is lost.&lt;/p&gt;
&lt;p&gt;tmux solves this. The tmux session lives on the server. Disconnect and reconnect with the session exactly where you left it. It works identically over SSH and Mosh. Run the agent in one pane, a build watcher in another, logs in a third.&lt;/p&gt;
&lt;p&gt;A deployment script (&lt;code&gt;setup-tmux.sh&lt;/code&gt;) pushes identical tmux configuration to every machine in the fleet: terminfo for correct color handling over SSH, a consistent &lt;code&gt;~/.tmux.conf&lt;/code&gt;, and a session wrapper script.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Deploy to all machines
bash scripts/setup-tmux.sh

# Deploy to specific machines
bash scripts/setup-tmux.sh server1 server2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key configuration highlights:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# True color pass-through (correct rendering over SSH from modern terminals)
set -ga terminal-overrides &quot;,xterm-ghostty:Tc&quot;

# Mouse support (scroll, click, resize panes)
set -g mouse on

# 50k line scrollback (generous for long agent sessions)
set -g history-limit 50000

# Hostname in status bar (critical when SSH&apos;d into multiple machines)
set -g status-left &quot;[#h] &quot;

# Faster escape (no lag when pressing Esc - important for vim users)
set -s escape-time 10

# OSC 52 clipboard - lets tmux copy reach the local clipboard
# through SSH/Mosh. This is the magic that makes copy/paste work
# from a remote tmux session back to your local machine.
set -g set-clipboard on
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hostname in the status bar is especially important when working across multiple machines. At a glance, you know which machine you&apos;re on.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;session wrapper&lt;/strong&gt; script wraps tmux for agent session management. If a tmux session for a project exists, it reattaches; otherwise, it creates one and launches the agent CLI inside it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Usage: dev-session &amp;lt;project&amp;gt;
dev-session alpha
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means: &lt;code&gt;ssh server1&lt;/code&gt; + &lt;code&gt;dev-session alpha&lt;/code&gt; = resume exactly where you left off. Disconnect and reconnect later - session is intact. Works identically whether you connected via SSH or Mosh.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Mobile Access&lt;/h2&gt;
&lt;p&gt;Development doesn&apos;t always happen at a desk. The mobile access strategy uses Blink Shell (iOS SSH/Mosh client) to turn an iPad or iPhone into a thin terminal for remote agent sessions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────┐         ┌──────────────────────────┐
│   iPad / iPhone    │  Mosh   │   Always-On Server       │
│                    │ ──────&amp;gt; │                          │
│   Blink Shell      │  (UDP)  │   tmux session           │
│   - SSH keys       │         │   └── launcher &amp;lt;project&amp;gt; │
│   - Host configs   │         │       └── MCP server     │
│   - iCloud sync    │         │           └── context    │
└────────────────────┘         └──────────────────────────┘
         │
         │  Tailscale VPN (always connected)
         │
         ▼
    Works from anywhere:
    home WiFi, cellular, hotel, coffee shop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Mosh (Mobile Shell) is purpose-built for unreliable networks:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;SSH&lt;/th&gt;
&lt;th&gt;Mosh&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transport&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;UDP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network switch&lt;/td&gt;
&lt;td&gt;Connection dies&lt;/td&gt;
&lt;td&gt;Seamless roaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laptop sleep/wake&lt;/td&gt;
&lt;td&gt;Connection dies&lt;/td&gt;
&lt;td&gt;Reconnects automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Waits for server echo&lt;/td&gt;
&lt;td&gt;Local echo (instant keystrokes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cellular gaps&lt;/td&gt;
&lt;td&gt;Timeout → reconnect&lt;/td&gt;
&lt;td&gt;Resumes transparently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Mosh is especially valuable on mobile: switch from WiFi to cellular, walk between rooms, lock the phone for 30 minutes - the session is still there when you come back. Setup is one command per server: &lt;code&gt;sudo apt install mosh&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Blink Shell is an iOS terminal app that supports both SSH and Mosh natively. Key features for this setup: iCloud sync of keys and configs across all iOS devices, multiple sessions with swipe-to-switch, split screen on iPad, and full external keyboard support.&lt;/p&gt;
&lt;p&gt;AI CLI tools that use alternate screen buffers break native touch scrolling on mobile. All machines are pre-configured to disable this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Gemini CLI: ~/.gemini/settings.json
{ &quot;ui&quot;: { &quot;useAlternateBuffer&quot;: false } }
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;// Codex CLI: ~/.codex/config.toml
[tui]
alternate_screen = false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Claude Code works with default settings. With alternate screen disabled, normal finger/trackpad scrolling works in Blink Shell, and scrollback history is preserved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The OSC 52 clipboard bridge&lt;/strong&gt; solves a non-obvious problem: how do you copy text from a remote tmux session to your local device&apos;s clipboard?&lt;/p&gt;
&lt;p&gt;OSC 52 is an escape sequence that lets terminal programs write to the local clipboard through any number of SSH/Mosh hops:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Agent output (remote) → tmux (OSC 52 enabled) → Mosh/SSH → Blink Shell → iOS clipboard
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is configured in tmux (&lt;code&gt;set -g set-clipboard on&lt;/code&gt;) and supported by Blink Shell natively. Select text in the remote tmux session, and it&apos;s available in your local clipboard. For manual text selection in tmux (bypassing tmux&apos;s mouse capture): hold Shift + click/drag.&lt;/p&gt;
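&lt;p&gt;The sequence itself is simple to construct. A sketch of what a program would emit - in practice tmux and Blink Shell generate this for you once &lt;code&gt;set-clipboard&lt;/code&gt; is on:&lt;/p&gt;

```typescript
// Build an OSC 52 sequence asking the terminal to put `text` on the local
// system clipboard (the "c" selection). Writing this to stdout inside an
// SSH/Mosh/tmux chain is what carries the copy back to the local machine.
function osc52Copy(text: string): string {
  const b64 = Buffer.from(text, "utf8").toString("base64");
  return "\x1b]52;c;" + b64 + "\x07"; // ESC ] 52 ; c ; base64-payload BEL
}
```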
&lt;hr /&gt;
&lt;h2&gt;Field Mode&lt;/h2&gt;
&lt;p&gt;A portable laptop serves as the primary development machine when traveling. An iPhone provides hotspot internet. The fleet&apos;s always-on servers remain accessible via Tailscale.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Quick thought from bed/couch&lt;/td&gt;
&lt;td&gt;Office server&lt;/td&gt;
&lt;td&gt;Mosh from Blink Shell via Tailscale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sitting down for real work&lt;/td&gt;
&lt;td&gt;Laptop directly&lt;/td&gt;
&lt;td&gt;Open lid, local terminal + &lt;code&gt;launcher &amp;lt;project&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-session, stepping away&lt;/td&gt;
&lt;td&gt;Laptop via phone&lt;/td&gt;
&lt;td&gt;Blink Shell to &lt;code&gt;laptop.local&lt;/code&gt; over hotspot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First thing in the morning, laptop closed&lt;/td&gt;
&lt;td&gt;Office server&lt;/td&gt;
&lt;td&gt;Mosh from Blink Shell (zero setup)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When the phone creates a hotspot, the laptop and phone are on the same local network (172.20.10.x). The phone can SSH/Mosh to the laptop using mDNS/Bonjour (&lt;code&gt;laptop.local&lt;/code&gt;) - no Tailscale needed, sub-millisecond latency.&lt;/p&gt;
&lt;p&gt;Hotspot IPs change between connections, but &lt;code&gt;.local&lt;/code&gt; hostname resolution (Bonjour) always resolves correctly regardless of the current IP assignment.&lt;/p&gt;
&lt;p&gt;The phone&apos;s hotspot auto-disables after ~90 seconds of no connected devices. For intentional mid-session breaks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Keep laptop awake for Blink SSH access (prevents all sleep)
caffeinate -dis &amp;amp;

# When done, let it sleep normally
killall caffeinate

# Tip: use -di (without -s) to keep machine awake but allow display sleep
# The display is the biggest battery draw
caffeinate -di &amp;amp;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full stack in field mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Phone (iPhone)
├── Hotspot → provides internet to laptop
├── Tailscale → provides VPN to office fleet
├── Blink Shell → SSH/Mosh to any machine
│   ├── mosh server1 (via Tailscale, for quick sessions)
│   └── ssh laptop.local (via hotspot LAN, for mid-session access)
│
Laptop (MacBook)
├── Tailscale → same VPN mesh
├── Terminal (local) → primary dev experience
├── launcher &amp;lt;project&amp;gt; → full coding sessions
└── caffeinate → prevents sleep during Blink access

Office (always-on servers)
├── server1 (Linux, x86_64)
├── server2 (Linux, x86_64)
└── server3 (Linux, x86_64)
    └── All running: tmux, launcher, MCP server, node, git, gh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup means you&apos;re never more than a Blink Shell session away from a full development environment, whether you&apos;re at a desk, on a couch, or in transit.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Roadmap&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Phase 2 (Planned):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Per-agent tokens for fine-grained revocation and per-agent rate limits&lt;/li&gt;
&lt;li&gt;Scheduled cleanup via Cloudflare Cron Trigger - abandon stale sessions, purge expired idempotency keys, rotate the request log&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Phase 3 (Aspirational):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cross-project dashboard showing all active sessions across all ventures&lt;/li&gt;
&lt;li&gt;Real-time push notifications when a parallel agent creates a PR, hits a blocker, or completes a task&lt;/li&gt;
&lt;li&gt;Session analytics API for querying duration, handoff frequency, escalation rates, and time-to-resolution&lt;/li&gt;
&lt;li&gt;Full-text search in the knowledge store via D1&apos;s FTS5&lt;/li&gt;
&lt;li&gt;True multi-CLI parity with equivalent slash command systems for Gemini and Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This document describes a production system managing AI agent development sessions across a fleet of macOS and Linux machines, accessible from desktops, laptops, and mobile devices. The system is built on Cloudflare Workers + D1, with a local MCP server (Node.js/TypeScript), Infisical for secrets, Tailscale for networking, and Claude Code as the primary AI agent CLI. It has been in daily use since January 2026.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>agent-context</category><category>mcp</category><category>infrastructure</category></item><item><title>Building a Dark-Theme Design System with Tailwind v4</title><link>https://venturecrane.com/articles/building-dark-theme-design-system/</link><guid isPermaLink="true">https://venturecrane.com/articles/building-dark-theme-design-system/</guid><description>How we built a dark-first design system using CSS custom properties and Tailwind v4 theme configuration.</description><pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most design system guides start with light mode and bolt dark mode on as an afterthought. We went the other direction. Venture Crane&apos;s site was designed dark-first - every color token, every contrast ratio, every surface elevation was conceived for a dark canvas. Here&apos;s how we built it with Tailwind CSS v4 and vanilla CSS custom properties.&lt;/p&gt;
&lt;h2&gt;Why Dark-First&lt;/h2&gt;
&lt;p&gt;The conventional approach treats dark mode as an inversion. You design for white backgrounds, then flip to dark with &lt;code&gt;prefers-color-scheme&lt;/code&gt;. This works, but it produces dark themes that feel like negatives of the light version rather than intentional designs.&lt;/p&gt;
&lt;p&gt;We had a specific reason to go dark-first: Venture Crane is a development lab. Our audience is practitioners - developers and technical founders who spend most of their screen time in dark terminals and IDEs. A dark reading surface isn&apos;t an accommodation; it&apos;s the default expectation.&lt;/p&gt;
&lt;p&gt;There&apos;s a practical benefit too. When you design dark-first, you&apos;re forced to think about elevation through surface tones rather than shadows. Shadows barely register against dark backgrounds. This constraint produced a cleaner hierarchy system than we&apos;d have arrived at starting with light mode.&lt;/p&gt;
&lt;h2&gt;The Token Architecture&lt;/h2&gt;
&lt;h3&gt;Three Layers of Color&lt;/h3&gt;
&lt;p&gt;Our system uses three semantic layers for background surfaces:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Chrome&lt;/strong&gt; (&lt;code&gt;#1a1a2e&lt;/code&gt;) - structural elements like the header, footer, and homepage background&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Surface&lt;/strong&gt; (&lt;code&gt;#242438&lt;/code&gt;) - content reading areas where long-form text lives&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Surface Raised&lt;/strong&gt; (&lt;code&gt;#2a2a42&lt;/code&gt;) - cards, code blocks, and interactive elements that float above the surface&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The chrome-to-surface transition is deliberate. When you navigate from the homepage to an article, the reading area shifts from &lt;code&gt;#1a1a2e&lt;/code&gt; to &lt;code&gt;#242438&lt;/code&gt; - a subtle but noticeable increase in lightness that signals &quot;you&apos;re in reading mode now.&quot; The two values are only about four points apart in HSL lightness (roughly 14% versus 18%), but your eyes register the shift immediately.&lt;/p&gt;
&lt;h3&gt;Custom Properties Over Theme Extensions&lt;/h3&gt;
&lt;p&gt;Tailwind v4 introduced a &lt;code&gt;@theme&lt;/code&gt; directive that maps directly to CSS custom properties. We use a two-tier system:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;:root {
  /* Source tokens - the actual values */
  --vc-chrome: #1a1a2e;
  --vc-surface: #242438;
  --vc-surface-raised: #2a2a42;
  --vc-text: #e8e8f0;
  --vc-text-muted: #a0a0b8;
  --vc-accent: #818cf8;
  --vc-accent-hover: #a5b4fc;
  --vc-gold: #dbb05c;
  --vc-gold-hover: #e8c474;
  --vc-border: #2e2e4a;
  --vc-code-bg: #14142a;
}

@theme {
  /* Tailwind mappings - reference the source tokens */
  --color-chrome: var(--vc-chrome);
  --color-surface: var(--vc-surface);
  --color-surface-raised: var(--vc-surface-raised);
  --color-accent: var(--vc-accent);
  --color-gold: var(--vc-gold);
  --color-border: var(--vc-border);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This looks like unnecessary indirection, but it serves a purpose. The &lt;code&gt;:root&lt;/code&gt; tokens are plain CSS - any stylesheet, component &lt;code&gt;&amp;lt;style&amp;gt;&lt;/code&gt; block, or third-party library can reference them. The &lt;code&gt;@theme&lt;/code&gt; block maps these into Tailwind&apos;s utility class system so &lt;code&gt;bg-surface&lt;/code&gt; and &lt;code&gt;text-accent&lt;/code&gt; work in class attributes. One set of values, two consumption patterns, zero duplication.&lt;/p&gt;
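&lt;p&gt;To make the two consumption patterns concrete, here is a minimal sketch - the &lt;code&gt;.callout&lt;/code&gt; component is a hypothetical illustration, not part of our actual stylesheet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/* Plain CSS consumer: any stylesheet can read the source tokens directly */
.callout {
  background: var(--vc-surface-raised);
  border: 1px solid var(--vc-border);
  color: var(--vc-text);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Tailwind consumer needs no extra CSS at all: &lt;code&gt;class=&quot;bg-surface-raised text-accent&quot;&lt;/code&gt; resolves through the &lt;code&gt;@theme&lt;/code&gt; mappings to the same source tokens.&lt;/p&gt;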
&lt;h3&gt;Contrast Ratios&lt;/h3&gt;
&lt;p&gt;Every color pairing was checked against WCAG AA (4.5:1) and AAA (7:1) thresholds. Here are the key ratios:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pairing&lt;/th&gt;
&lt;th&gt;Foreground&lt;/th&gt;
&lt;th&gt;Background&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Body text&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#e8e8f0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#242438&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12.5:1&lt;/td&gt;
&lt;td&gt;AAA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Muted text&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#a0a0b8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#242438&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.9:1&lt;/td&gt;
&lt;td&gt;AA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accent links&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#818cf8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#242438&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.1:1&lt;/td&gt;
&lt;td&gt;AA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gold wordmark&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#dbb05c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#1a1a2e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8.4:1&lt;/td&gt;
&lt;td&gt;AAA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code text&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#e8e8f0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#14142a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.8:1&lt;/td&gt;
&lt;td&gt;AAA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Muted on chrome&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#a0a0b8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#1a1a2e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6.7:1&lt;/td&gt;
&lt;td&gt;AA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The gold accent (&lt;code&gt;#dbb05c&lt;/code&gt;) was chosen for its warmth and AAA-clearing contrast on the chrome background at 8.4:1.&lt;/p&gt;
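&lt;p&gt;Checks like these are easy to script. Here is an illustrative TypeScript implementation of the WCAG 2.x contrast formula - not our actual tooling, just the math the table above can be verified against:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// WCAG 2.x relative luminance: gamma-expand each sRGB channel, then weight.
function srgbToLinear(channel: number): number {
  const c = channel / 255;
  // Values below 0.03928 take the linear segment; the rest use the power curve.
  if (Math.sign(c - 0.03928) === -1) {
    return c / 12.92;
  }
  return Math.pow((c + 0.055) / 1.055, 2.4);
}

// Expects a six-digit hex color with a leading # (e.g. #242438).
function luminance(hex: string): number {
  const r = srgbToLinear(parseInt(hex.slice(1, 3), 16));
  const g = srgbToLinear(parseInt(hex.slice(3, 5), 16));
  const b = srgbToLinear(parseInt(hex.slice(5, 7), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

function contrastRatio(fg: string, bg: string): number {
  const hi = Math.max(luminance(fg), luminance(bg));
  const lo = Math.min(luminance(fg), luminance(bg));
  return (hi + 0.05) / (lo + 0.05);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the body-text pairing through this function reproduces the 12.5:1 in the table; a CI step can assert that every pairing stays above its required threshold whenever a token changes.&lt;/p&gt;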
&lt;h2&gt;Typography Decisions&lt;/h2&gt;
&lt;h3&gt;The Body Text Baseline&lt;/h3&gt;
&lt;p&gt;We set body text to &lt;code&gt;1rem&lt;/code&gt; (16px) with a &lt;code&gt;1.6&lt;/code&gt; line height. This matches the industry-standard base size but pairs it with a line height more generous than the browser default (&lt;code&gt;normal&lt;/code&gt;, roughly 1.2 for most fonts), giving dark-mode reading room to breathe.&lt;/p&gt;
&lt;p&gt;The same optical illusion that makes white text on black appear bolder than black text on white also makes tightly-spaced text harder to parse on dark backgrounds. At 16px with a 1.6 line height, sustained reading is comfortable without feeling oversized.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;:root {
  --vc-text-body: 1rem;
  --vc-leading-body: 1.6;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;System Font Stacks&lt;/h3&gt;
&lt;p&gt;We avoided loading custom fonts entirely. The font stack falls through platform natives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Body:&lt;/strong&gt; &lt;code&gt;-apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, Oxygen-Sans, Ubuntu, Cantarell, &apos;Helvetica Neue&apos;, sans-serif&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mono:&lt;/strong&gt; &lt;code&gt;ui-monospace, &apos;Cascadia Code&apos;, &apos;Source Code Pro&apos;, Menlo, Consolas, &apos;DejaVu Sans Mono&apos;, monospace&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Zero font files means zero layout shift from font loading, zero FOUT, and one fewer thing to cache-bust. Every operating system ships a good sans-serif and a good monospace. Use them.&lt;/p&gt;
&lt;h3&gt;The Type Scale&lt;/h3&gt;
&lt;p&gt;We defined a five-step scale covering everything from page titles to metadata:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Line Height&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;H1&lt;/td&gt;
&lt;td&gt;2rem (32px)&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H2&lt;/td&gt;
&lt;td&gt;1.5rem (24px)&lt;/td&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H3&lt;/td&gt;
&lt;td&gt;1.25rem (20px)&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Body&lt;/td&gt;
&lt;td&gt;1rem (16px)&lt;/td&gt;
&lt;td&gt;1.6&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small/Meta&lt;/td&gt;
&lt;td&gt;0.875rem (14px)&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each step is roughly a 1.25x ratio - not a mathematically perfect scale, but one tuned for readability at each individual level. Strict modular scales often produce awkward sizes at the extremes. We preferred tuning each level to look right on its own.&lt;/p&gt;
&lt;h2&gt;Component Patterns&lt;/h2&gt;
&lt;h3&gt;Prose Container&lt;/h3&gt;
&lt;p&gt;All rendered markdown lives inside &lt;code&gt;.vc-prose&lt;/code&gt;, which applies spacing, list styles, and link colors. This is a deliberate alternative to Tailwind&apos;s official &lt;code&gt;@tailwindcss/typography&lt;/code&gt; plugin.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We avoided the typography plugin because its reset approach conflicted with our token system. When you&apos;ve already defined &lt;code&gt;--vc-text&lt;/code&gt; and &lt;code&gt;--vc-accent&lt;/code&gt; as CSS variables, layering on a plugin that generates its own color values creates a maintenance surface. One source of truth is better than two.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;.vc-prose&lt;/code&gt; class handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Heading margins (&lt;code&gt;margin-top: 2.5em&lt;/code&gt; for h2, &lt;code&gt;1.5em&lt;/code&gt; for h3)&lt;/li&gt;
&lt;li&gt;Paragraph spacing (&lt;code&gt;margin-bottom: 1.25em&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;List indentation (&lt;code&gt;padding-left: 1.5em&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Blockquote styling (accent-color left border, muted italic text)&lt;/li&gt;
&lt;li&gt;Table formatting (collapsed borders, raised-surface header background)&lt;/li&gt;
&lt;li&gt;Inline code (raised-surface background with slight padding)&lt;/li&gt;
&lt;/ul&gt;
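&lt;p&gt;A condensed sketch of the class using the values above - the real stylesheet carries more rules, and the blockquote border width and inline-code padding here are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.vc-prose h2 { margin-top: 2.5em; }
.vc-prose h3 { margin-top: 1.5em; }
.vc-prose p { margin-bottom: 1.25em; }
.vc-prose ul,
.vc-prose ol { padding-left: 1.5em; }
.vc-prose blockquote {
  border-left: 3px solid var(--vc-accent);
  color: var(--vc-text-muted);
  font-style: italic;
}
.vc-prose code {
  background: var(--vc-surface-raised);
  padding: 0.125em 0.375em;
}
&lt;/code&gt;&lt;/pre&gt;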
&lt;h3&gt;Code Block Overflow&lt;/h3&gt;
&lt;p&gt;Code blocks need special treatment in constrained layouts. A 768px content column (roughly 660px of prose after card padding) can&apos;t fit a 120-character line without overflow. We handle this with two layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;overflow-x: auto&lt;/code&gt; on the &lt;code&gt;&amp;lt;pre&amp;gt;&lt;/code&gt; element&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tabindex=&quot;0&quot;&lt;/code&gt; added via a rehype plugin for keyboard scrollability&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first handles the visual overflow. The second is an accessibility detail that&apos;s easy to miss - without &lt;code&gt;tabindex=&quot;0&quot;&lt;/code&gt;, keyboard users can&apos;t scroll horizontally through long code blocks. Our rehype plugin adds it automatically during the Astro build.&lt;/p&gt;
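&lt;p&gt;The real implementation is a rehype plugin, but the transform itself is small. A dependency-free TypeScript sketch of the same pass over a hast-like tree (node shape simplified, names ours rather than rehype&apos;s):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface HastNode {
  type: string;
  tagName?: string;
  properties?: { [key: string]: unknown };
  children?: HastNode[];
}

// Walk the tree and make every pre element keyboard-focusable.
function addPreTabindex(node: HastNode): void {
  if (node.type === "element") {
    if (node.tagName === "pre") {
      node.properties = Object.assign({}, node.properties, { tabIndex: 0 });
    }
  }
  for (const child of node.children ?? []) {
    addPreTabindex(child);
  }
}
&lt;/code&gt;&lt;/pre&gt;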
&lt;p&gt;A technique we&apos;ve been evaluating but haven&apos;t shipped yet: a CSS &lt;code&gt;::after&lt;/code&gt; pseudo-element that creates a right-edge fade gradient hinting at overflow. The idea is a sticky pseudo-element that fades from transparent to the code background color:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pre::after {
  content: &apos;&apos;;
  position: sticky;
  right: 0;
  display: block;
  width: 2rem;
  height: 100%;
  margin-top: -100%;
  background: linear-gradient(to right, transparent, var(--vc-code-bg));
  pointer-events: none;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No JavaScript, no intersection observers - just CSS. We haven&apos;t added it because the current &lt;code&gt;overflow-x: auto&lt;/code&gt; approach works well enough, and the fade gradient introduces visual complexity on every code block regardless of whether it actually overflows. Sometimes the simpler solution is the right one.&lt;/p&gt;
&lt;h3&gt;Table Scroll Shadows&lt;/h3&gt;
&lt;p&gt;Tables face the same overflow problem as code blocks, but we solve it differently. A rehype plugin wraps each &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; in a &lt;code&gt;&amp;lt;div class=&quot;table-wrapper&quot;&amp;gt;&lt;/code&gt; with &lt;code&gt;role=&quot;region&quot;&lt;/code&gt; and &lt;code&gt;tabindex=&quot;0&quot;&lt;/code&gt; for accessibility. The wrapper handles scrolling.&lt;/p&gt;
&lt;p&gt;The clever part is the scroll shadow technique using &lt;code&gt;background-attachment: local&lt;/code&gt; versus &lt;code&gt;scroll&lt;/code&gt;. Four gradient backgrounds create shadow indicators that appear only when content is scrollable in that direction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.table-wrapper {
  overflow-x: auto;
  background:
    linear-gradient(to right, var(--vc-surface), var(--vc-surface)) local,
    linear-gradient(to left, var(--vc-surface), var(--vc-surface)) local,
    linear-gradient(to right, rgba(0, 0, 0, 0.25), transparent) scroll,
    linear-gradient(to left, rgba(0, 0, 0, 0.25), transparent) scroll;
  background-size:
    2rem 100%,
    2rem 100%,
    1rem 100%,
    1rem 100%;
  background-position: left, right, left, right;
  background-repeat: no-repeat;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;local&lt;/code&gt; backgrounds scroll with the content; the &lt;code&gt;scroll&lt;/code&gt; backgrounds stay fixed. When you scroll right, the left &lt;code&gt;local&lt;/code&gt; gradient moves away, revealing the left &lt;code&gt;scroll&lt;/code&gt; shadow. It&apos;s CSS-only, performant, and degrades gracefully - if a browser doesn&apos;t support &lt;code&gt;background-attachment: local&lt;/code&gt;, you just don&apos;t get shadow indicators.&lt;/p&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;h3&gt;Token Naming Matters More Than Values&lt;/h3&gt;
&lt;p&gt;We initially named our colors &lt;code&gt;--bg-dark&lt;/code&gt;, &lt;code&gt;--bg-medium&lt;/code&gt;, &lt;code&gt;--bg-light&lt;/code&gt;. This fell apart immediately when discussing designs: &quot;Use the medium background&quot; told you nothing about intent. Renaming to &lt;code&gt;chrome&lt;/code&gt;, &lt;code&gt;surface&lt;/code&gt;, and &lt;code&gt;surface-raised&lt;/code&gt; made every conversation clearer. The name describes the role, not the lightness.&lt;/p&gt;
&lt;h3&gt;Test at the Extremes&lt;/h3&gt;
&lt;p&gt;Our reading comfort check isn&apos;t &quot;does this look okay for 30 seconds.&quot; It&apos;s &quot;can I read this for 5+ minutes without wanting to adjust brightness.&quot; Dark themes fail in sustained reading far more often than in quick glances. The combination of 16px text, 1.6 line height, and the &lt;code&gt;#242438&lt;/code&gt; surface (slightly lighter than the chrome) was the result of iterating through several background candidates.&lt;/p&gt;
&lt;h3&gt;Don&apos;t Build What Astro Gives You&lt;/h3&gt;
&lt;p&gt;We started writing a custom Markdown processing pipeline before realizing Astro&apos;s built-in content collections already handled 90% of what we needed. The only custom piece is a single rehype plugin that adds &lt;code&gt;tabindex&lt;/code&gt; attributes and wraps tables. Everything else - frontmatter parsing, type-safe schemas, slug generation, RSS feeds - is Astro out of the box.&lt;/p&gt;
</content:encoded><category>design-system</category><category>tailwind</category><category>css</category><category>dark-theme</category></item><item><title>Documentation as Operational Infrastructure</title><link>https://venturecrane.com/articles/documentation-operational-infra/</link><guid isPermaLink="true">https://venturecrane.com/articles/documentation-operational-infra/</guid><description>Why we treat runbooks, ADRs, and handoff records as infrastructure that self-heals, version-tracks, and delivers itself to agents automatically.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Documentation is usually the thing that gets written once and forgotten. A README goes stale within a week. Process docs drift until they describe a workflow nobody follows anymore. For human teams, this is annoying. For AI agent teams, stale documentation is actively harmful - agents follow outdated instructions literally. They do not notice that the deploy script moved, that the API endpoint was renamed, or that the team switched from one verification process to another. They just do what the docs say.&lt;/p&gt;
&lt;p&gt;We run multiple AI coding agents across a fleet of machines, each starting fresh sessions multiple times a day. Every session begins with &quot;where do we start?&quot; If the answer to that question comes from stale or missing documentation, the agent makes decisions based on a world that no longer exists. We watched agents churn for hours following instructions for systems that had been decommissioned, because nobody updated the docs.&lt;/p&gt;
&lt;p&gt;The fix was not &quot;write better docs.&quot; It was treating documentation as infrastructure - with the same expectations we bring to CI/CD pipelines, secrets management, and deployment workflows. Self-healing. Version-tracked. Automatically delivered.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Three Layers&lt;/h2&gt;
&lt;p&gt;Our documentation system has three distinct layers, each serving a different purpose and audience.&lt;/p&gt;
&lt;h3&gt;Layer 1: Process Documentation&lt;/h3&gt;
&lt;p&gt;These are the runbooks. Team workflow manuals, QA checklists, escalation protocols, development directives. They live in &lt;code&gt;docs/process/&lt;/code&gt; in the repo and describe how work gets done.&lt;/p&gt;
&lt;p&gt;The team workflow document, for example, runs to 700+ lines. It covers the full story lifecycle from issue creation through merge, including escalation triggers born from post-mortems (an agent once churned for 10+ hours without escalating because the escalation rules did not exist yet), QA grading systems that route verification to the right method, and multi-track parallel operations. This is not a document anyone writes once and forgets. It has gone through nine versions in two months.&lt;/p&gt;
&lt;p&gt;Process docs are checked into git, reviewed in PRs, and synced to a central document store via CI. When a process doc or ADR changes on the main branch, a GitHub Actions workflow detects the change and uploads it to the context API. Version numbers increment automatically. Content hashes update. The agent always gets the current version.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PR merged → push to main → GitHub Actions detects docs/process/*.md change
  → uploads to context API → version increments → next agent session gets new docs
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A manual &lt;code&gt;workflow_dispatch&lt;/code&gt; trigger syncs all docs at once for recovery scenarios - if the document store ever gets out of sync, one button rebuilds it from git, the source of truth.&lt;/p&gt;
&lt;h3&gt;Layer 2: Architecture Decision Records&lt;/h3&gt;
&lt;p&gt;ADRs answer the question agents ask most often: &quot;why is it built this way?&quot;&lt;/p&gt;
&lt;p&gt;When an agent encounters a design choice that seems wrong or suboptimal, the natural instinct is to refactor. ADRs prevent this. ADR-025 explains why the context worker exists - the fragmentation of handoff files in git, the lack of cross-project visibility, the unreliability of markdown parsing. ADR-026 explains the staging/production environment strategy - why there are two D1 databases per worker, why staging deploys automatically but production requires manual promotion.&lt;/p&gt;
&lt;p&gt;These documents are not just for humans reviewing history. They are consumed by agents at the start of every session. An agent working on the context API can read ADR-025 and understand the design constraints that shaped the system it is modifying. It does not need to reverse-engineer intent from code.&lt;/p&gt;
&lt;p&gt;ADRs follow the same sync pipeline as process docs. They live in &lt;code&gt;docs/adr/&lt;/code&gt;, are merged via PR, and upload to the context API on push to main.&lt;/p&gt;
&lt;h3&gt;Layer 3: Enterprise Knowledge&lt;/h3&gt;
&lt;p&gt;The first two layers describe how to work and why things are built the way they are. Enterprise knowledge describes what we are building and why it matters.&lt;/p&gt;
&lt;p&gt;Executive summaries, product requirements documents, strategic assessments, methodology frameworks, market research, team bios - this is the business context that agents need to make decisions aligned with product direction, not just the immediate technical problem. The knowledge store is a D1-backed system of tagged notes, scoped by project or globally, that agents consume automatically at session start.&lt;/p&gt;
&lt;p&gt;Each note carries structured metadata: tags from a controlled vocabulary (&lt;code&gt;executive-summary&lt;/code&gt;, &lt;code&gt;prd&lt;/code&gt;, &lt;code&gt;strategy&lt;/code&gt;, &lt;code&gt;methodology&lt;/code&gt;, &lt;code&gt;design&lt;/code&gt;, &lt;code&gt;governance&lt;/code&gt;), an optional project scope, and timestamps. Notes are only created when a human explicitly asks. The agent never auto-saves to the knowledge store. This constraint was learned the hard way - early versions auto-saved aggressively, and the noise-to-signal ratio made the whole system useless.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Doc Audit: Self-Healing Documentation&lt;/h2&gt;
&lt;p&gt;The core insight: if we know what documentation should exist, we can detect when it is missing and generate it automatically.&lt;/p&gt;
&lt;h3&gt;The Requirements Table&lt;/h3&gt;
&lt;p&gt;A &lt;code&gt;doc_requirements&lt;/code&gt; table in D1 defines what docs every project should have. Each requirement specifies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A name pattern&lt;/strong&gt; - e.g., &lt;code&gt;{venture}-project-instructions.md&lt;/code&gt;, where &lt;code&gt;{venture}&lt;/code&gt; is replaced with the project code at audit time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scope type&lt;/strong&gt; - global (same for all projects), all-ventures (one per project), or venture-specific&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A condition gate&lt;/strong&gt; - some docs only apply to projects with certain capabilities. An API reference doc is only required for projects with &lt;code&gt;has_api&lt;/code&gt;. A schema doc is only required for projects with &lt;code&gt;has_database&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A staleness threshold&lt;/strong&gt; - default 90 days. If a doc has not been updated in longer than its threshold, it is flagged as stale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;An auto-generate flag&lt;/strong&gt; - whether the system can generate this doc from source files, or whether a human must write it manually&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generation sources&lt;/strong&gt; - hints for the generator about where to find content (e.g., &lt;code&gt;[&quot;claude_md&quot;, &quot;readme&quot;, &quot;package_json&quot;]&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
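&lt;p&gt;In TypeScript terms, a requirement row and its applicability check look roughly like this - the field names are inferred from the description above, not the real schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface DocRequirement {
  namePattern: string;          // e.g. "{venture}-project-instructions.md"
  scope: "global" | "all-ventures" | "venture-specific";
  requiresCapability?: string;  // e.g. "has_api" or "has_database"
  stalenessDays: number;        // default 90
  autoGenerate: boolean;
  generationSources: string[];  // e.g. ["claude_md", "readme", "package_json"]
}

function resolveName(req: DocRequirement, venture: string): string {
  return req.namePattern.replace("{venture}", venture);
}

function applies(req: DocRequirement, capabilities: string[]): boolean {
  if (!req.requiresCapability) {
    return true;
  }
  return capabilities.includes(req.requiresCapability);
}
&lt;/code&gt;&lt;/pre&gt;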
&lt;p&gt;The default requirements define three doc types for every project: project instructions (generated from CLAUDE.md, README, package.json, and process docs), API reference (generated from route files, OpenAPI specs, and test files), and database schema (generated from migrations, schema files, and worker configs).&lt;/p&gt;
&lt;h3&gt;The Audit Engine&lt;/h3&gt;
&lt;p&gt;The audit engine runs server-side on the context API worker. When invoked, it compares the requirements table against the actual documents in the store:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;For each applicable requirement:
  1. Resolve the name pattern ({venture} → actual project code)
  2. Check capability gates (skip if project doesn&apos;t have required capability)
  3. Look up the doc in the store
  4. If missing → add to missing list
  5. If found but older than staleness threshold → add to stale list
  6. If found and fresh → add to present list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a structured report: present docs, missing docs (with whether they can be auto-generated), and stale docs (with how many days old they are versus their threshold). The overall status is &lt;code&gt;complete&lt;/code&gt; (nothing missing or stale), &lt;code&gt;warning&lt;/code&gt; (stale docs exist), or &lt;code&gt;incomplete&lt;/code&gt; (required docs are missing).&lt;/p&gt;
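&lt;p&gt;The audit pass itself is a straightforward fold over the requirements. A hedged TypeScript sketch - the real engine runs against D1, and these names are ours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface RequiredDoc {
  name: string;
  stalenessDays: number;
  autoGenerate: boolean;
}

// ageByName maps a doc name to the age in days of its updated_at timestamp.
function audit(required: RequiredDoc[], ageByName: { [name: string]: number }) {
  const present: string[] = [];
  const missing: string[] = [];
  const stale: string[] = [];
  for (const req of required) {
    const age = ageByName[req.name];
    if (age === undefined) {
      missing.push(req.name);
    } else if (Math.max(age - req.stalenessDays, 0) !== 0) {
      stale.push(req.name);  // strictly older than its threshold
    } else {
      present.push(req.name);
    }
  }
  const status =
    missing.length !== 0 ? "incomplete" : stale.length !== 0 ? "warning" : "complete";
  return { present, missing, stale, status };
}
&lt;/code&gt;&lt;/pre&gt;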
&lt;h3&gt;The Doc Generator&lt;/h3&gt;
&lt;p&gt;The doc generator runs locally on the MCP server, not on the cloud worker. This is a deliberate design choice - it needs access to the local git repository to read source files.&lt;/p&gt;
&lt;p&gt;The generator takes a doc name, project code, and a list of generation source keys. It has typed source handlers that know how to extract information from different file types:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Key&lt;/th&gt;
&lt;th&gt;What It Reads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude_md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Project CLAUDE.md (instructions, commands, architecture)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;readme&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;README.md (project overview, getting started)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;package_json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dependencies, scripts, version info&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;docs_process&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Process documentation directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;route_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API route handlers (src/routes, src/api, workers/*/src)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openapi&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OpenAPI/Swagger specifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Test files containing API-related patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;migrations&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SQL migration files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;schema_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TypeScript/SQL schema definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wrangler_toml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloudflare Worker configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The generator builds typed documents. A &lt;code&gt;project-instructions&lt;/code&gt; doc assembles a product overview from the README, a tech stack section from package.json, development instructions from CLAUDE.md, and process documentation from the docs directory. An &lt;code&gt;api&lt;/code&gt; doc combines OpenAPI specs with route definitions and test patterns. A &lt;code&gt;schema&lt;/code&gt; doc merges migrations with schema definitions and worker bindings.&lt;/p&gt;
&lt;p&gt;The key principle: the generator reads what exists. It does not work from templates. If a project has a README but no CLAUDE.md, the generated doc includes what the README provides and omits what it cannot find. If no source files yield content, generation is skipped entirely rather than producing an empty shell.&lt;/p&gt;
&lt;h3&gt;The Self-Healing Loop&lt;/h3&gt;
&lt;p&gt;These three components connect during session initialization. Every time an agent starts a session, the following happens:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The MCP server calls the context API&apos;s start-of-day endpoint&lt;/li&gt;
&lt;li&gt;The context API runs the doc audit for the current project&lt;/li&gt;
&lt;li&gt;The audit result comes back with the session response&lt;/li&gt;
&lt;li&gt;The MCP server checks for missing docs that are flagged as auto-generable&lt;/li&gt;
&lt;li&gt;For each auto-generable missing doc, the generator reads local source files and builds the document&lt;/li&gt;
&lt;li&gt;The generated docs are uploaded to the context API&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;Session Start
  → API returns doc audit (3 present, 1 missing, 1 stale)
  → MCP checks: missing doc is auto-generable? yes
  → Generator reads CLAUDE.md + README + package.json
  → Assembled doc uploaded to context API
  → Next session: 4 present, 0 missing, 1 stale
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means that when a new project is added to the system, it gets baseline documentation without anyone remembering to write it. When a project adds an API, the doc generator picks up the new route files at next audit. When a doc goes stale, the generator refreshes it from current sources.&lt;/p&gt;
&lt;p&gt;The stale doc refresh is also automatic. The self-healing loop regenerates stale docs just like missing ones - reading the current source files and uploading updated content. A doc that was generated six months ago from a CLAUDE.md that has since changed will be regenerated from the current version.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Session Initialization: The Delivery Mechanism&lt;/h2&gt;
&lt;p&gt;Self-healing docs are only useful if agents actually receive them. The delivery mechanism is the start-of-day (SOD) tool that runs at the beginning of every session.&lt;/p&gt;
&lt;p&gt;SOD orchestrates a multi-step initialization sequence. For documentation specifically, it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Returns a doc index&lt;/strong&gt; - a lightweight table of all available documents (scope, name, version) that the agent can reference. Full content is not loaded by default to avoid blowing up the context window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivers enterprise context&lt;/strong&gt; - executive summaries and tagged knowledge notes, budget-capped at 12KB to prevent context window bloat. Notes are prioritized: current-project notes first, then other projects, then global notes, with freshest content within each tier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reports doc audit status&lt;/strong&gt; - if docs were auto-generated during this session, the agent sees &quot;Generated: project-instructions.md&quot; in its SOD output. If generation failed, it sees the failure reason.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flags stale docs&lt;/strong&gt; - stale documents are listed with their age and threshold, giving the agent (or human) a signal that something needs attention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fetches the last handoff&lt;/strong&gt; - the structured summary from the previous session, so the agent knows what was accomplished, what is in progress, and what is blocked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checks the weekly plan&lt;/strong&gt; - whether a priority plan exists, how old it is, and what the current priority project is. Plans older than 7 days are flagged as stale.&lt;/li&gt;
&lt;/ol&gt;
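&lt;p&gt;The budget-capped delivery in step 2 can be sketched as a tiered sort followed by a greedy fill - an illustration of the priority rules, not the production allocator (which can also truncate a note rather than omit it):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface Note {
  project?: string;   // undefined means a global note
  updatedAt: number;  // epoch millis
  body: string;
}

function selectNotes(notes: Note[], currentProject: string, budgetBytes: number): Note[] {
  // Tier 0: current project, tier 1: other projects, tier 2: global.
  function tier(n: Note): number {
    if (n.project === currentProject) return 0;
    if (n.project) return 1;
    return 2;
  }
  const ordered = notes.slice().sort(function (a, b) {
    if (tier(a) !== tier(b)) return tier(a) - tier(b);
    return b.updatedAt - a.updatedAt;  // freshest first within a tier
  });
  const out: Note[] = [];
  let used = 0;
  for (const n of ordered) {
    // Omit any note that would push the payload past the budget.
    if (Math.max(used + n.body.length - budgetBytes, 0) !== 0) continue;
    out.push(n);
    used += n.body.length;
  }
  return out;
}
&lt;/code&gt;&lt;/pre&gt;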
&lt;p&gt;The agent starts every session with full context. Not a blank slate. Not &quot;let me read the README.&quot; Full operational context: what happened last session, what the priorities are, what documentation exists, what business context applies, and who else is working on the same project.&lt;/p&gt;
&lt;p&gt;This was not always the case. An earlier version of SOD dumped full document content into the session context. One measured session consumed 298K characters in SOD output alone - roughly a third of the context window before the agent did any work. The fix was switching to metadata-only doc delivery with on-demand content fetching. The agent sees a table of available docs and can pull any specific document when it needs the full content.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Staleness Detection&lt;/h2&gt;
&lt;p&gt;Staleness is tracked at two levels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Document-level staleness&lt;/strong&gt; is threshold-based. Each doc requirement has a configurable &lt;code&gt;staleness_days&lt;/code&gt; value (default 90). The audit engine compares the document&apos;s &lt;code&gt;updated_at&lt;/code&gt; timestamp against the threshold. Docs that exceed their threshold appear in the stale list of every audit result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plan-level staleness&lt;/strong&gt; works on a tighter cycle. The weekly plan (a markdown file in &lt;code&gt;docs/planning/&lt;/code&gt;) is checked by file modification time. Plans older than 7 days are flagged as stale in the SOD output. This ensures that agents do not follow priorities from two weeks ago.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enterprise knowledge staleness&lt;/strong&gt; uses the budget-based allocation system. When SOD delivers executive summaries, it sorts by freshness within priority tiers. Stale enterprise notes naturally fall to the bottom of the budget allocation and may get truncated or omitted entirely. This creates implicit pressure to keep enterprise context current - if it is stale, agents might not see it.&lt;/p&gt;
&lt;p&gt;The sync pipeline provides an additional freshness mechanism for process docs and ADRs. When these files change in git and merge to main, the GitHub Actions workflow uploads them within minutes. The &lt;code&gt;updated_at&lt;/code&gt; timestamp resets, the version increments, and the staleness clock restarts. Docs that change frequently in the repo stay fresh in the document store automatically.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;ADRs as Agent Decision Memory&lt;/h2&gt;
&lt;p&gt;Architecture Decision Records serve a specific role in this system: they are the agent&apos;s answer to &quot;why.&quot;&lt;/p&gt;
&lt;p&gt;Two ADRs exist in the current repo. ADR-025 documents why the context worker was built - the session tracking, handoff storage, and operational visibility problems it solves. ADR-026 documents the staging/production environment strategy - why two environments, why manual production promotion, how secrets are partitioned.&lt;/p&gt;
&lt;p&gt;When an agent is modifying the context API and encounters a design pattern that seems overcomplicated (why canonical JSON with SHA-256 hashing for handoffs? why composite primary keys on idempotency tables?), the ADR provides the rationale. The agent can read ADR-025 and see that these choices were deliberate: canonical JSON enables stable hashing for deduplication, composite keys prevent collision across endpoints.&lt;/p&gt;
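&lt;p&gt;The rationale is easy to demonstrate in miniature. A sketch of the idea - not the code from ADR-025 - showing why key-sorted serialization makes the hash stable:&lt;/p&gt;

```typescript
import { createHash } from 'node:crypto'

// Canonical JSON: object keys sorted recursively, so the same logical
// payload always serializes to the same string regardless of key order.
function canonicalize(value: any): string {
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalize).join(',') + ']'
  }
  if (value !== null) {
    if (typeof value === 'object') {
      const body = Object.keys(value)
        .sort()
        .map((k) => JSON.stringify(k) + ':' + canonicalize(value[k]))
        .join(',')
      return '{' + body + '}'
    }
  }
  return JSON.stringify(value)
}

// Stable digest over the canonical form: two handoffs with the same
// content hash identically, which is what makes deduplication reliable.
function handoffHash(payload: object): string {
  return createHash('sha256').update(canonicalize(payload)).digest('hex')
}
```

&lt;p&gt;Two payloads that differ only in key order produce identical digests, so a re-sent handoff deduplicates cleanly.&lt;/p&gt;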
&lt;p&gt;ADRs are synced to the document store alongside process docs. They are listed in the doc index at session start. An agent working on infrastructure can fetch the relevant ADR and understand the constraints before proposing changes. This prevents the pattern where an agent &quot;improves&quot; a system by removing a design choice that existed for good reasons it did not know about.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Principle&lt;/h2&gt;
&lt;p&gt;Documentation is infrastructure. Not a nice-to-have. Not something we will get around to. Infrastructure.&lt;/p&gt;
&lt;p&gt;This means it needs the properties we demand from other infrastructure:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-healing.&lt;/strong&gt; When documentation is missing, the system detects the gap and fills it. When documentation goes stale, the system flags it and can regenerate from current sources. No human needs to remember to update docs after changing code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version-tracked.&lt;/strong&gt; Every document in the store has a version number, content hash, and timestamps. Changes flow through git and CI, same as code. The sync pipeline ensures that the document store reflects what is in the repo, not what someone uploaded manually three months ago.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatically delivered.&lt;/strong&gt; Agents do not need to know where docs live, what format they are in, or how to find the right one for their project. The SOD tool handles all of it. Enterprise summaries, project instructions, process docs, ADRs, last session handoffs, weekly plans - all delivered at session start, scoped to the current project, budget-capped to avoid context window bloat.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Capability-aware.&lt;/strong&gt; Not all projects need the same docs. A project without a database does not need a schema reference. A project without an API does not need endpoint documentation. The requirement system gates on capabilities, so projects only get requirements that make sense for them.&lt;/p&gt;
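&lt;p&gt;The capability gate is a simple filter. A hypothetical sketch (field names are illustrative, not taken from the actual requirement system):&lt;/p&gt;

```typescript
// Hypothetical shapes for the gating rule described above.
interface Requirement {
  docName: string
  requiresCapability?: string // e.g. 'database' or 'api'; undefined means always required
}

// A requirement applies only when its capability gate matches the project.
function applicableRequirements(all: Requirement[], capabilities: string[]): Requirement[] {
  const caps = new Set(capabilities)
  return all.filter((r) => {
    if (!r.requiresCapability) return true
    return caps.has(r.requiresCapability)
  })
}
```

&lt;p&gt;A project declaring only the &lt;code&gt;api&lt;/code&gt; capability never sees a schema-reference requirement, so the audit never flags a gap that would be meaningless to fill.&lt;/p&gt;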
&lt;p&gt;&lt;strong&gt;Auditable.&lt;/strong&gt; Every document upload records who uploaded it, from what source repo, and when. The audit engine produces structured reports that can be reviewed by humans or consumed by other tools. When something goes wrong, the trail exists.&lt;/p&gt;
&lt;p&gt;The overhead is minimal. Requirements are defined once in a database table. The generators read existing source files. The sync pipeline runs in CI. The audit runs during session initialization. There is no manual step where someone has to remember to update documentation after changing code. The system handles it.&lt;/p&gt;
&lt;p&gt;The result: agents start every session informed. They know what was built, why it was built that way, how the team works, what the priorities are, and what happened last session. They do not spend the first 15 minutes rediscovering context. They do not follow stale instructions. They do not ask &quot;where do I start?&quot; because the system already told them.&lt;/p&gt;
&lt;p&gt;Documentation that nobody reads is waste. Documentation that self-heals, version-tracks, and delivers itself to the consumers that need it - that is infrastructure.&lt;/p&gt;
</content:encoded><category>documentation</category><category>agent-context</category><category>infrastructure</category><category>self-healing</category></item><item><title>96% Token Reduction - Lazy-Loading Agent Context</title><link>https://venturecrane.com/articles/lazy-loading-agent-context/</link><guid isPermaLink="true">https://venturecrane.com/articles/lazy-loading-agent-context/</guid><description>How we cut session startup token consumption by 96% by switching from eager document loading to an index-and-fetch pattern.</description><pubDate>Sat, 07 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Our session startup routine was consuming 45,000 to 71,000 tokens before the agent did any useful work. On a ~200K context window, that is 22-35% of available capacity gone on initialization alone. We cut it to roughly 3,000 tokens - a 93-96% reduction - without changing the backend API.&lt;/p&gt;
&lt;h2&gt;The Problem: Eager Loading&lt;/h2&gt;
&lt;p&gt;Every agent session starts with a Start of Day (SOD) call. SOD loads everything the agent might need: documentation, enterprise notes, active issues, handoff history, weekly plan status, and session metadata. The original implementation fetched 23-39 full documents from the context API and dumped their complete contents into the response.&lt;/p&gt;
&lt;p&gt;For the most documentation-heavy project, this meant the agent received roughly 71,000 tokens of context before it could even read the first user message. The SOD response had grown to 298,000 characters in the worst case.&lt;/p&gt;
&lt;p&gt;This happened gradually. Each time we added a new document type - API specs, architecture decision records, coding standards, design briefs - the SOD payload grew. Nobody noticed because the degradation was incremental. The session started a little slower each week, and we absorbed it as normal latency.&lt;/p&gt;
&lt;p&gt;We caught it when a size guard flagged a response exceeding 50KB. Looking at the actual numbers was sobering.&lt;/p&gt;
&lt;h2&gt;The Insight&lt;/h2&gt;
&lt;p&gt;Agents do not need every document on every session. A session working on a database migration does not need the design system documentation. A session fixing a bug in the API does not need the product requirements document. The vast majority of loaded documents go unread in any given session.&lt;/p&gt;
&lt;p&gt;What agents actually need at startup is awareness - knowing what documentation exists so they can fetch relevant pieces when a task requires them. The difference between &quot;here are 39 documents&quot; and &quot;here is an index of 39 documents you can request&quot; is the difference between a 71K token payload and a 3K token payload.&lt;/p&gt;
&lt;p&gt;This is the index-and-fetch pattern: deliver a lightweight metadata table at startup, provide a tool for on-demand retrieval, and let the agent decide what it actually needs.&lt;/p&gt;
&lt;h2&gt;The Implementation&lt;/h2&gt;
&lt;p&gt;The fix had three parts, all on the client side. The backend API already supported both formats - we just were not using the right one.&lt;/p&gt;
&lt;h3&gt;Part 1: Documentation Index&lt;/h3&gt;
&lt;p&gt;The SOD request gained a &lt;code&gt;docs_format&lt;/code&gt; parameter. Setting it to &lt;code&gt;&apos;index&apos;&lt;/code&gt; tells the context API to return only metadata - scope, document name, version number - instead of full document contents.&lt;/p&gt;
&lt;p&gt;The MCP server&apos;s SOD tool sends this parameter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;body: JSON.stringify({
  schema_version: &apos;1.0&apos;,
  agent: params.agent,
  venture: params.venture,
  repo: params.repo,
  include_docs: true,
  docs_format: &apos;index&apos;, // metadata only
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the backend, the query changes from fetching &lt;code&gt;content&lt;/code&gt; (the expensive column) to fetching just &lt;code&gt;scope, doc_name, content_hash, title, version&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT scope, doc_name, content_hash, title, version
FROM context_docs
WHERE scope = &apos;global&apos; OR scope = ?
ORDER BY scope DESC, doc_name ASC
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SOD output renders this as a compact table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;### Available Documentation (28 docs)
Fetch any document with `doc(scope, doc_name)`.

| Scope  | Document                    | Version |
|--------|-----------------------------|---------|
| global | team-workflow.md            | v3      |
| global | dev-standards.md            | v2      |
| alpha  | alpha-project-instructions.md | v5      |
| alpha  | alpha-api-structure.md      | v2      |
| ...    |                             |         |
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Twenty-eight documents described in a few hundred tokens instead of tens of thousands.&lt;/p&gt;
&lt;h3&gt;Part 2: On-Demand Document Fetch&lt;/h3&gt;
&lt;p&gt;A dedicated MCP tool lets the agent fetch any specific document when it needs one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const docInputSchema = z.object({
  scope: z.string().describe(&apos;Document scope: &quot;global&quot; or venture code&apos;),
  doc_name: z.string().describe(&apos;Document name&apos;),
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agent calls this tool mid-session when it encounters a task that requires specific documentation. A session working on API changes calls &lt;code&gt;doc(&quot;alpha&quot;, &quot;alpha-api-structure.md&quot;)&lt;/code&gt;. A session updating team process calls &lt;code&gt;doc(&quot;global&quot;, &quot;team-workflow.md&quot;)&lt;/code&gt;. Most sessions fetch zero to two documents rather than loading all 28-39.&lt;/p&gt;
&lt;p&gt;The tool is a thin wrapper - it calls the context API&apos;s document endpoint, gets the full content for that single document, and returns it. The agent pays the token cost only for documents it actually reads.&lt;/p&gt;
&lt;h3&gt;Part 3: Enterprise Notes Budget&lt;/h3&gt;
&lt;p&gt;The second optimization addressed enterprise context notes (executive summaries, strategy docs, product requirements). The original approach truncated every note to 2,000 characters - a flat cut that often landed mid-sentence and wasted budget on irrelevant notes.&lt;/p&gt;
&lt;p&gt;We replaced this with a 12KB section budget and relevance-tiered sorting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const EC_BUDGET = 12_000
const ecNotes = [...allNotes].sort((a, b) =&amp;gt; {
  const aRank = a.venture === ventureCode ? 0 : a.venture ? 1 : 2
  const bRank = b.venture === ventureCode ? 0 : b.venture ? 1 : 2
  if (aRank !== bRank) return aRank - bRank
  return new Date(b.updated_at).getTime() - new Date(a.updated_at).getTime()
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notes for the current project come first, then other projects, then global notes. Each note fits in full when the budget allows. If a note would overflow the remaining budget, it gets a partial fit with a pointer to the full version. This means the most relevant context is always complete, and less relevant context is available on demand.&lt;/p&gt;
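&lt;p&gt;The budget fit itself is a short loop. A hypothetical sketch of the behavior described above, assuming the notes arrive pre-sorted by relevance tier and freshness:&lt;/p&gt;

```typescript
interface EcNote {
  title: string
  content: string
}

function fitToBudget(notes: EcNote[], budget = 12_000): string[] {
  const out: string[] = []
  let remaining = budget
  for (const note of notes) {
    if (remaining === 0) break
    if (note.content.length > remaining) {
      // Partial fit: truncate and point at the full version.
      out.push(note.content.slice(0, remaining) + ' [truncated - full note: ' + note.title + ']')
      remaining = 0
    } else {
      out.push(note.content) // the most relevant notes land in full
      remaining -= note.content.length
    }
  }
  return out
}
```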
&lt;h2&gt;The Numbers&lt;/h2&gt;
&lt;p&gt;Per-project token consumption, before and after:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary&lt;/td&gt;
&lt;td&gt;~71K tokens&lt;/td&gt;
&lt;td&gt;~3K tokens&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project A&lt;/td&gt;
&lt;td&gt;~47K tokens&lt;/td&gt;
&lt;td&gt;~3K tokens&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project B&lt;/td&gt;
&lt;td&gt;~45K tokens&lt;/td&gt;
&lt;td&gt;~3K tokens&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project C&lt;/td&gt;
&lt;td&gt;~46K tokens&lt;/td&gt;
&lt;td&gt;~3K tokens&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project D&lt;/td&gt;
&lt;td&gt;~47K tokens&lt;/td&gt;
&lt;td&gt;~3K tokens&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The implementation touched three files: the SOD tool (49 lines changed), new test fixtures for the index API response format (60 lines), and expanded test coverage (114 lines). A small change with outsized impact.&lt;/p&gt;
&lt;h2&gt;The Design Tradeoff: Shell-Based Agents&lt;/h2&gt;
&lt;p&gt;Not all agent environments support MCP. Shell-based agents - those running without the MCP server, perhaps in a CI pipeline or a minimal scripting context - cannot call tools mid-session to fetch documents on demand. They get one shot at loading context at startup.&lt;/p&gt;
&lt;p&gt;For these agents, the API still supports &lt;code&gt;docs_format: &apos;full&apos;&lt;/code&gt;, which returns complete document contents. The tradeoff is explicit: MCP-capable agents self-serve from the index, while shell-based agents pay the full loading cost because they have no other option.&lt;/p&gt;
&lt;p&gt;This is a pragmatic split. The MCP server sets &lt;code&gt;docs_format: &apos;index&apos;&lt;/code&gt; by default. Any client that needs full content can still request it. The backend serves both formats from the same endpoint with the same auth. No conditional logic, no feature flags - just a request parameter.&lt;/p&gt;
&lt;h2&gt;Cost at Scale&lt;/h2&gt;
&lt;p&gt;The savings compound quickly. Consider a modest workload:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;5 sessions per day per project&lt;/li&gt;
&lt;li&gt;Multiple projects in the portfolio&lt;/li&gt;
&lt;li&gt;Multiple development machines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At 71K tokens per SOD call, the old approach consumed roughly 355K tokens per day per project just on session initialization. Across the portfolio, that is over a million tokens daily spent on context the agent probably will not read.&lt;/p&gt;
&lt;p&gt;At 3K tokens per SOD call, the same workload uses roughly 15K tokens per day per project on initialization. The occasional on-demand document fetch adds a few thousand more, but only when the agent actually needs the content. Total initialization cost drops by more than an order of magnitude.&lt;/p&gt;
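&lt;p&gt;The arithmetic behind those figures:&lt;/p&gt;

```typescript
// Back-of-envelope check of the daily initialization cost quoted above.
const sessionsPerDay = 5
const beforeTokensPerDay = 71_000 * sessionsPerDay // eager loading, per project
const afterTokensPerDay = 3_000 * sessionsPerDay // index-and-fetch, per project
const reduction = 1 - afterTokensPerDay / beforeTokensPerDay
```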
&lt;p&gt;This is not about the dollar cost of tokens (though that matters). It is about context window capacity. Every token spent on unread documentation is a token unavailable for actual reasoning, code analysis, and conversation history. At 71K tokens of initialization overhead, the agent starts every session with roughly a third of its working memory already occupied by reference material it may never consult.&lt;/p&gt;
&lt;h2&gt;A 50KB Safety Net&lt;/h2&gt;
&lt;p&gt;We added a size guard at the end of SOD message construction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if (message.length &amp;gt; 50_000) {
  message +=
    `\n\n Warning: SOD message is ${Math.round(message.length / 1024)}KB` +
    ` - investigate size regression`
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is defense-in-depth. The index format and budget caps should keep the message well under 50KB, but context payloads have a tendency to grow silently. The warning fires if something regresses - a new section added without budget awareness, a note that grew beyond expected size, or a documentation index that expanded dramatically.&lt;/p&gt;
&lt;p&gt;We found the original problem because a simpler version of this guard caught the 298K character response. Without it, we might have run for months with a third of our context window consumed on startup.&lt;/p&gt;
&lt;h2&gt;The Broader Lesson&lt;/h2&gt;
&lt;p&gt;The instinct when agents lack context is &quot;add more.&quot; More documentation, more executive summaries, more project history. More feels safer - the agent has everything it could possibly need.&lt;/p&gt;
&lt;p&gt;But context windows are finite, and the marginal cost of each additional token of context is not zero. It displaces reasoning capacity. It dilutes the relevance of actually important information. And it creates a baseline cost that every single session pays whether it benefits or not.&lt;/p&gt;
&lt;p&gt;Context window management is an engineering discipline, not a loading problem. The right question is not &quot;does the agent have access to this information?&quot; but &quot;does the agent need this information right now, and can it get it when it does?&quot;&lt;/p&gt;
&lt;p&gt;For documentation, the answer is almost always: provide an index at startup, fetch on demand during work. The agent knows what documents exist. It pulls specific documents when a task requires them. Most sessions need zero to two documents, not thirty-nine.&lt;/p&gt;
&lt;p&gt;The 96% reduction was not the result of removing information from the system. Every document is still available. The agent can still access any piece of documentation at any time. We just stopped paying the cost of loading everything upfront on the assumption that the agent might need it.&lt;/p&gt;
&lt;p&gt;Lazy loading is not a new idea. It is one of the oldest patterns in software engineering. But when working with AI agents, the temptation to front-load context is strong - the agent seems smarter with more context, and the cost is invisible until it is not. Treating context window capacity as a scarce resource, and managing it with the same discipline we apply to memory and bandwidth, produced a better system with a smaller change than we expected.&lt;/p&gt;
</content:encoded><category>performance</category><category>mcp</category><category>agent-context</category></item><item><title>From Monolith to Microworker - Decommissioning the Relay</title><link>https://venturecrane.com/articles/decommissioning-crane-relay/</link><guid isPermaLink="true">https://venturecrane.com/articles/decommissioning-crane-relay/</guid><description>We deleted a 3,234-line Cloudflare Worker, its database, and its storage bucket. Here is what we learned about scope creep in serverless architectures.</description><pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We deleted a Cloudflare Worker last week. Along with it went a D1 database and an R2 storage bucket. Nineteen files, 6,231 lines of code, removed from the monorepo in a single session. The system had been the backbone of our GitHub integration for months, and nothing noticed it was gone.&lt;/p&gt;
&lt;p&gt;That last part is the interesting bit.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the Relay Worker Did&lt;/h2&gt;
&lt;p&gt;The relay worker started life as a simple HTTP bridge. Early in our setup, AI coding agents couldn&apos;t call the GitHub API directly - Claude Desktop, the tool at the time, had no shell access. So we built a Cloudflare Worker that sat between the agent and GitHub, proxying API calls over HTTP.&lt;/p&gt;
&lt;p&gt;The initial scope was tight: receive a request from the agent, forward it to the GitHub API, return the response. Label an issue. Post a comment. Close a PR. Maybe five endpoints, a few hundred lines of TypeScript.&lt;/p&gt;
&lt;p&gt;Then it grew.&lt;/p&gt;
&lt;p&gt;First came webhook processing. GitHub could POST events to the worker, and the worker could react - new issue opened, label changed, PR merged. Useful, straightforward, still within reason.&lt;/p&gt;
&lt;p&gt;Then came AI classification. When a new issue arrived, the worker would call Gemini Flash to analyze the issue body, extract acceptance criteria, assign a QA grade, and apply labels automatically. This required prompt engineering, structured output parsing, confidence scoring, and a schema for the grading rubric.&lt;/p&gt;
&lt;p&gt;Then came evidence storage. Classification results needed an audit trail, so we added an R2 bucket to store raw model outputs and classification evidence.&lt;/p&gt;
&lt;p&gt;Then came retry logic, idempotency keys, error handling for each of the upstream APIs, and a V2 event system that was designed but never fully adopted.&lt;/p&gt;
&lt;p&gt;By February 2026, the relay worker was 3,234 lines of TypeScript doing at least four distinct jobs: HTTP proxy for GitHub API calls, webhook receiver, AI-powered issue classifier, and evidence archive. Each feature had been a reasonable addition in isolation. The aggregate was not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Kill Signal&lt;/h2&gt;
&lt;p&gt;The decision to decommission wasn&apos;t driven by a refactoring sprint or an architecture review. It was driven by an accident.&lt;/p&gt;
&lt;p&gt;While auditing Cloudflare secrets, we discovered that the relay worker&apos;s production deployment was missing its authentication secrets. The two keys it needed to accept incoming requests - both absent. Its API endpoints were non-functional in production.&lt;/p&gt;
&lt;p&gt;Nobody had noticed.&lt;/p&gt;
&lt;p&gt;No monitoring alert, no user complaint, no broken workflow. The worker had been silently failing (or more accurately, silently unreachable) for an unknown period. We searched the codebase: no MCP tools referenced its URL. No scripts. No slash commands. No CI jobs. Nothing in the entire monorepo was calling it.&lt;/p&gt;
&lt;p&gt;We had migrated away from the relay worker without realizing we had migrated away. Two separate changes, made for their own reasons, had made it obsolete:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Direct CLI access.&lt;/strong&gt; Once we moved from Claude Desktop to Claude Code, agents could shell out to &lt;code&gt;gh&lt;/code&gt; CLI directly. The HTTP proxy pattern - agent calls worker, worker calls GitHub - became unnecessary overhead. Why proxy through a Cloudflare Worker when you can run &lt;code&gt;gh issue list&lt;/code&gt; in a subprocess?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A purpose-built classifier.&lt;/strong&gt; Webhook processing and issue classification had been extracted into a dedicated worker weeks earlier. That worker did one thing: receive a GitHub webhook, classify the issue with Gemini, apply labels. No proxy endpoints, no evidence storage, no V2 event system.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The relay worker was already dead. We just hadn&apos;t cleaned up the body.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Replacement Architecture&lt;/h2&gt;
&lt;p&gt;The focused classifier worker that replaced the relay&apos;s webhook processing is roughly 1,000 lines of TypeScript (compared to 3,234). It has three HTTP routes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GET /health&lt;/code&gt; - health check&lt;/li&gt;
&lt;li&gt;&lt;code&gt;POST /webhooks/github&lt;/code&gt; - receive and classify issues&lt;/li&gt;
&lt;li&gt;&lt;code&gt;POST /regrade&lt;/code&gt; - reclassify existing issues on demand&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the entire API surface. One purpose: when a GitHub issue is opened, classify it.&lt;/p&gt;
&lt;p&gt;The classification pipeline is straightforward. Validate the webhook signature. Parse the issue payload. Extract acceptance criteria from the issue body. Call Gemini 2.0 Flash with a structured prompt and response schema. Apply the resulting QA grade label (&lt;code&gt;qa:0&lt;/code&gt; through &lt;code&gt;qa:3&lt;/code&gt;) and optionally a &lt;code&gt;test:required&lt;/code&gt; label. Log the result to D1 for auditability.&lt;/p&gt;
&lt;p&gt;It has idempotency (both delivery-based and semantic, so re-delivered webhooks and unchanged issues do not get reclassified). It has skip logic (bots, already-graded issues). It does not have an HTTP proxy, an evidence bucket, a V2 event system, or any endpoint that says &quot;do this arbitrary thing to a GitHub issue on my behalf.&quot;&lt;/p&gt;
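&lt;p&gt;A hypothetical sketch of the two idempotency layers (in the real worker this state would live in persistent storage such as D1, not process memory):&lt;/p&gt;

```typescript
import { createHash } from 'node:crypto'

const seenDeliveries = new Set() // delivery-based: webhook delivery ids
const lastBodyHash = new Map() // semantic: issue number mapped to body hash

function shouldClassify(deliveryId: string, issueNumber: number, body: string): boolean {
  if (seenDeliveries.has(deliveryId)) return false // re-delivered webhook
  seenDeliveries.add(deliveryId)
  const hash = createHash('sha256').update(body).digest('hex')
  if (lastBodyHash.get(issueNumber) === hash) return false // unchanged issue
  lastBodyHash.set(issueNumber, hash)
  return true
}
```

&lt;p&gt;A re-delivered webhook short-circuits on the delivery id; a fresh delivery for an unchanged issue short-circuits on the body hash. Only a genuinely new or edited issue reaches the model.&lt;/p&gt;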
&lt;p&gt;The &lt;code&gt;wrangler.toml&lt;/code&gt; tells the story of scope:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name = &quot;issue-classifier&quot;
main = &quot;src/index.ts&quot;
compatibility_date = &quot;2025-12-15&quot;

[[d1_databases]]
binding = &quot;DB&quot;
database_name = &quot;issue-classifier-db&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One worker. One database. One binding. Compare that to the relay worker&apos;s configuration, which had a D1 binding, an R2 binding, multiple secret bindings for different auth mechanisms, and environment-specific overrides for staging versus production.&lt;/p&gt;
&lt;p&gt;Everything the relay worker did beyond classification is now handled by &lt;code&gt;gh&lt;/code&gt; CLI, run directly from agent sessions. Label management, issue queries, PR operations, comment posting - all of these are &lt;code&gt;gh&lt;/code&gt; subcommands that agents call directly. No intermediary worker needed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Cleanup Process&lt;/h2&gt;
&lt;p&gt;Decommissioning a worker is straightforward when you can prove nothing depends on it. Our verification process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Search the codebase.&lt;/strong&gt; Grep for the worker&apos;s URL, its name, any reference to its API endpoints. We found references in documentation and old configuration files, but zero live call sites.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check the secrets.&lt;/strong&gt; The missing production secrets were themselves evidence - if the worker had been needed, someone would have noticed the auth failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delete the Cloudflare resources.&lt;/strong&gt; The worker (both production and staging deployments), the D1 database, and the R2 evidence bucket.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clean the monorepo.&lt;/strong&gt; Remove the worker directory, update &lt;code&gt;package.json&lt;/code&gt;, remove references from CI workflows, security configurations, and documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preserve what survived.&lt;/strong&gt; The GitHub App that the relay worker had used for API authentication was still needed by the classifier worker. We renamed it from its legacy name to something that reflected its actual scope - a shared GitHub App for unattended API access across all venture installations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The total removal: 19 files deleted, 6,231 lines removed, zero functionality lost.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Serverless Monoliths Happen&lt;/h2&gt;
&lt;p&gt;Serverless platforms make it dangerously easy to add responsibilities to an existing worker. There is no deployment friction. There is no &quot;spin up a new service&quot; cost. You open the file, add a route handler, deploy. Five minutes, done.&lt;/p&gt;
&lt;p&gt;This is the serverless equivalent of a god class. In traditional backend development, creating a new service at least involves some ceremony - a new repository, a deployment pipeline, DNS configuration, maybe a load balancer rule. That friction, annoying as it is, creates a natural checkpoint: &quot;Is this really part of this service&apos;s responsibility?&quot;&lt;/p&gt;
&lt;p&gt;Cloudflare Workers have almost zero deployment ceremony. A new worker is a directory with a &lt;code&gt;wrangler.toml&lt;/code&gt; and an &lt;code&gt;index.ts&lt;/code&gt;. Deploying it is &lt;code&gt;npx wrangler deploy&lt;/code&gt;. There is no infrastructure to provision, no containers to configure, no DNS to manage (Workers get a &lt;code&gt;*.workers.dev&lt;/code&gt; subdomain automatically). The marginal cost of a new worker is nearly zero.&lt;/p&gt;
&lt;p&gt;But we didn&apos;t create a new worker. We added a handler to the existing one. Because it was already there, already deployed, already had the secrets configured, already had the D1 binding. The path of least resistance was always &quot;add it to the relay.&quot;&lt;/p&gt;
&lt;p&gt;This pattern has a name in traditional software engineering: accidental coupling. Features end up in the same deployment unit not because they belong together, but because that is where the code happened to be when someone needed to add something.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Makes a Good Serverless Worker&lt;/h2&gt;
&lt;p&gt;After this experience, our heuristic for worker scope is simple: &lt;strong&gt;a worker should have one reason to be deployed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The classifier worker gets deployed when classification logic changes - a new prompt version, a new grade level, a change to the skip rules. That is it. Changes to GitHub API interactions, context management, or documentation systems do not require touching the classifier.&lt;/p&gt;
&lt;p&gt;The context API worker (our other primary worker) gets deployed when session management, handoff storage, or knowledge store logic changes. It has no opinions about webhook processing or issue classification.&lt;/p&gt;
&lt;p&gt;Compare this to the relay worker, which would need redeployment for any change to: GitHub API proxy logic, webhook routing, classification prompts, evidence storage format, retry policies, or the V2 event schema. Six independent reasons to deploy a single worker.&lt;/p&gt;
&lt;p&gt;A useful test: can you describe what the worker does in one sentence without using the word &quot;and&quot;? &quot;It classifies GitHub issues&quot; passes. &quot;It proxies GitHub API calls and processes webhooks and classifies issues and stores evidence&quot; does not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Broader Pattern&lt;/h2&gt;
&lt;p&gt;This was not a refactoring project. We did not sit down and say &quot;let us decompose the monolithic worker into microworkers.&quot; The decomposition happened organically, driven by actual needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We needed a better classifier, so we built one as a standalone worker.&lt;/li&gt;
&lt;li&gt;We needed direct GitHub access from agents, so we used &lt;code&gt;gh&lt;/code&gt; CLI.&lt;/li&gt;
&lt;li&gt;We discovered the relay was unreachable, so we deleted it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The lesson is not &quot;always start with microservices&quot; or &quot;monoliths are bad.&quot; The relay worker was the right architecture when it was built. Claude Desktop could not shell out to &lt;code&gt;gh&lt;/code&gt;. A single worker handling everything was simpler than three workers when the team was small and the feature set was new.&lt;/p&gt;
&lt;p&gt;The lesson is: &lt;strong&gt;pay attention to when something stops being called.&lt;/strong&gt; If a service&apos;s production auth can break without anyone noticing, that service is not serving anyone. Dead code is bad enough in a codebase. Dead infrastructure is worse - it still costs attention, still shows up in dashboards, still creates the illusion that it matters.&lt;/p&gt;
&lt;p&gt;The relay worker&apos;s D1 database still had tables, still had data. Its R2 bucket still had evidence files. Its Cloudflare dashboard still showed it as a deployed worker. All of that created cognitive overhead every time someone looked at the infrastructure. &quot;What does this do? Is this important? Can I touch this?&quot;&lt;/p&gt;
&lt;p&gt;The answer was: it does nothing, it is not important, and yes, you should delete it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Checklist for Decommissioning a Worker&lt;/h2&gt;
&lt;p&gt;For anyone facing a similar cleanup, here is the process that worked for us:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Search for references.&lt;/strong&gt; Grep the entire codebase for the worker&apos;s URL, name, and endpoint paths. Check environment variables, MCP configurations, CI workflows, and documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check the secrets.&lt;/strong&gt; Are the worker&apos;s production secrets present and valid? If not, how long have they been missing? (This is a strong signal.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check the logs.&lt;/strong&gt; What is the worker&apos;s request volume? If it is zero or near-zero, that confirms nothing is calling it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Identify what survives.&lt;/strong&gt; Shared resources (like a GitHub App) may be used by other services. Rename them to reflect their actual scope rather than their historical origin.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delete with confidence.&lt;/strong&gt; Remove the worker deployment, the database, the storage bucket, all source files, and all references. Do not comment things out. Do not leave behind &quot;just in case&quot; stubs. Delete.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify after deletion.&lt;/strong&gt; Run the full CI pipeline. Run agent sessions. Confirm that every workflow that was working before the deletion still works after.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
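&lt;p&gt;Step 1 can be as simple as a search from the repo root. A minimal sketch (the worker name and URL here are placeholders for your own):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Search tracked files for the worker&apos;s name and URL
git grep -n -e &quot;relay-worker&quot; -e &quot;relay-worker.account.workers.dev&quot; \
  || echo &quot;no references found&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember that &lt;code&gt;git grep&lt;/code&gt; only searches tracked files - check untracked local configuration separately.&lt;/p&gt;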
&lt;p&gt;The entire decommission - discovery, verification, deletion, and cleanup - took a single session. That is the reward for clean boundaries: when something is truly unused, removing it is trivial.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;We run AI coding agents across multiple projects and machines. Our infrastructure runs on Cloudflare Workers, D1, GitHub, and Claude Code. This article describes a real decommissioning that happened in February 2026.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>cloudflare</category><category>architecture</category></item><item><title>Staging Environments for AI Agents</title><link>https://venturecrane.com/articles/staging-environments-ai-agents/</link><guid isPermaLink="true">https://venturecrane.com/articles/staging-environments-ai-agents/</guid><description>A 4-phase environment strategy for AI agent infrastructure - Cloudflare splits, automated CI/CD, scoped secrets, and agent-aware routing.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An AI agent running &lt;code&gt;npx wrangler deploy&lt;/code&gt; during a development session just pushed to production. There was no staging environment. No gate. No confirmation prompt. The agent did exactly what it was told to do, and that was the problem.&lt;/p&gt;
&lt;p&gt;When your deployment tooling has a single target and your &quot;developers&quot; are AI agents that execute commands literally, you get production deployments by default. A human developer might hesitate - &quot;wait, is this production?&quot; - and check the target before running the command. An agent runs the command. That is what agents do.&lt;/p&gt;
&lt;p&gt;We had two Cloudflare Workers, two D1 databases, and a single environment: production. Every &lt;code&gt;wrangler deploy&lt;/code&gt; from any machine, any session, any agent hit the same live infrastructure. Migrations ran directly against production data. There was no way to validate a change before it affected live agent sessions.&lt;/p&gt;
&lt;p&gt;This worked fine during initial development. It stopped working when other projects started depending on the shared infrastructure.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Agents Make This Worse&lt;/h2&gt;
&lt;p&gt;The standard argument for staging environments - validate before you ship - applies doubly when AI agents are part of the deployment loop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agents execute commands literally.&lt;/strong&gt; If a &lt;code&gt;wrangler.toml&lt;/code&gt; has a single deployment target, &lt;code&gt;npx wrangler deploy&lt;/code&gt; goes to that target. An agent will not second-guess the command. It will not open the config file to verify the target. It will not ask &quot;are you sure?&quot; unless explicitly instructed to. The command runs, the deployment happens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent sessions are frequent and parallel.&lt;/strong&gt; A solo operator running multiple agent sessions across several machines might trigger several deployments per day. Each one is a roll of the dice against production. The surface area for accidental damage scales with session count.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agents chain operations.&lt;/strong&gt; A single agent session might modify code, run tests, deploy, and then test the deployment - all in sequence. If the deployment target is production, the agent&apos;s post-deploy testing runs against production too. Any test that writes data or triggers side effects now contaminates production state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Recovery requires human intervention.&lt;/strong&gt; When a bad deployment hits production, the agent that caused it typically cannot fix the problem. It might not even detect the problem. A human has to notice, diagnose, and roll back. The blast radius is the time between the bad deploy and the human noticing.&lt;/p&gt;
&lt;p&gt;The fix is not to make agents smarter about deployment. The fix is to make the infrastructure safe by default.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Phase 1: Cloudflare Environment Split&lt;/h2&gt;
&lt;p&gt;Cloudflare Workers support named environments in &lt;code&gt;wrangler.toml&lt;/code&gt;. The default (no &lt;code&gt;--env&lt;/code&gt; flag) deploys to one environment; &lt;code&gt;--env production&lt;/code&gt; deploys to another. We made the default environment staging and the explicit flag production.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name = &quot;my-worker-staging&quot;
main = &quot;src/index.ts&quot;

# Default = staging
[[d1_databases]]
binding = &quot;DB&quot;
database_name = &quot;my-worker-db-staging&quot;
database_id = &quot;&amp;lt;staging-db-id&amp;gt;&quot;

[env.production]
name = &quot;my-worker&quot;

[[env.production.d1_databases]]
binding = &quot;DB&quot;
database_name = &quot;my-worker-db-prod&quot;
database_id = &quot;&amp;lt;prod-db-id&amp;gt;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives each worker:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A separate staging URL (e.g., &lt;code&gt;my-worker-staging.account.workers.dev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A separate production URL (e.g., &lt;code&gt;my-worker.account.workers.dev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Separate D1 databases per environment&lt;/li&gt;
&lt;li&gt;The same codebase and migration files deployed to different targets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key design choice is making staging the default. A bare &lt;code&gt;npx wrangler deploy&lt;/code&gt; - which is what an agent will run unless told otherwise - hits staging. Production requires the explicit &lt;code&gt;--env production&lt;/code&gt; flag. This inverts the risk: forgetting to specify the environment is now safe instead of dangerous.&lt;/p&gt;
&lt;p&gt;D1 migrations use the same numbered sequence in both environments. Staging gets migrations first. If a migration breaks staging, it blocks subsequent migrations to production. This ordering is enforced by the CI pipeline, not by policy alone.&lt;/p&gt;
&lt;p&gt;Creating the staging D1 databases is straightforward:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx wrangler d1 create my-worker-db-staging
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then run the existing migration files against the new database. The schema is identical. The data is not - more on that later.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Phase 2: Automated CI/CD Pipeline&lt;/h2&gt;
&lt;p&gt;With two environments in place, the deployment workflow becomes a pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PR -&amp;gt; CI verify -&amp;gt; merge to main -&amp;gt; deploy to staging -&amp;gt; smoke tests -&amp;gt; manual promote to production
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A GitHub Actions workflow handles this. On merge to main (specifically, after the verification workflow passes), the pipeline automatically deploys changed workers to staging. It detects which workers have changes by diffing against the previous commit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Check for worker changes
  id: changes  # required so later steps can read steps.changes.outputs.skip
  run: |
    CHANGED=$(git diff --name-only HEAD~1 HEAD)
    if echo &quot;$CHANGED&quot; | grep -qE &quot;^(workers/my-worker/)&quot;; then
      echo &quot;skip=false&quot; &amp;gt;&amp;gt; &quot;$GITHUB_OUTPUT&quot;
    else
      echo &quot;skip=true&quot; &amp;gt;&amp;gt; &quot;$GITHUB_OUTPUT&quot;
    fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only workers with actual file changes get redeployed. A change to Worker A does not trigger a redeploy of Worker B.&lt;/p&gt;
&lt;p&gt;After staging deployment, automated smoke tests validate the deployment. These are deliberately minimal - health endpoint checks and D1 connectivity verification, with retries to account for edge propagation delay:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Health check
  run: |
    for attempt in 1 2 3; do
      if curl -sf https://my-worker-staging.account.workers.dev/health \
        | jq -e &apos;.status == &quot;healthy&quot;&apos;; then
        exit 0
      fi
      sleep 5
    done
    exit 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Production deployment requires a manual &lt;code&gt;workflow_dispatch&lt;/code&gt; trigger with the &lt;code&gt;production&lt;/code&gt; target selected. This is the critical gate. No automated process pushes to production. A human makes that decision, and the GitHub Actions environment protection rules enforce it.&lt;/p&gt;
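&lt;p&gt;In workflow terms, the gate looks roughly like this (job and input names are illustrative, not our exact pipeline):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;on:
  workflow_dispatch:
    inputs:
      target:
        description: Deployment target
        type: choice
        options:
          - staging
          - production

jobs:
  deploy-production:
    if: inputs.target == &apos;production&apos;
    runs-on: ubuntu-latest
    environment: production  # protection rules are enforced on this environment
    steps:
      - run: npx wrangler deploy --env production
&lt;/code&gt;&lt;/pre&gt;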
&lt;p&gt;The staging deploy is automatic. The production promotion is manual. This is deliberate. Staging should reflect main at all times. Production changes only when someone decides the staging deployment looks good.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Phase 3: Secrets per Environment&lt;/h2&gt;
&lt;p&gt;A staging environment is not useful if it shares secrets with production. Two workers hitting the same database with the same API keys means staging is just production with a different URL.&lt;/p&gt;
&lt;p&gt;We use Infisical for secrets management, organized by venture path (&lt;code&gt;/alpha&lt;/code&gt;, &lt;code&gt;/beta&lt;/code&gt;, etc.). Adding environment separation meant creating distinct secret scopes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Production secrets&lt;/strong&gt; live in the &lt;code&gt;prod&lt;/code&gt; environment, under each venture&apos;s path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Staging secrets&lt;/strong&gt; live in the &lt;code&gt;dev&lt;/code&gt; environment, under a &lt;code&gt;/staging&lt;/code&gt; subfolder&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Infrastructure keys - the API keys that authenticate agents to the context API and admin endpoints - are different per environment. An agent authenticated against staging cannot accidentally hit production, and vice versa. External service keys (GitHub App credentials, third-party API keys) are shared, since those services don&apos;t have per-environment equivalents.&lt;/p&gt;
&lt;p&gt;The CLI launcher handles the routing. At session start, it reads &lt;code&gt;CRANE_ENV&lt;/code&gt; and fetches secrets from the corresponding Infisical path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CRANE_ENV=prod  -&amp;gt;  Infisical prod:/alpha  -&amp;gt;  production secrets
CRANE_ENV=dev   -&amp;gt;  Infisical dev:/alpha/staging  -&amp;gt;  staging secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The secrets are injected as environment variables into the agent&apos;s process. The agent never knows which Infisical path was used. It just has environment variables with the right values for its target environment.&lt;/p&gt;
&lt;p&gt;One complication: not every project has staging infrastructure. Only the shared infrastructure project needed staging at this point. For other projects, requesting staging secrets produces a warning and falls back to production. This avoids premature infrastructure duplication while keeping the mechanism ready for expansion.&lt;/p&gt;
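&lt;p&gt;The routing plus the fallback can be sketched in a few lines (function and parameter names here are illustrative, not our exact launcher code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type CraneEnv = &apos;prod&apos; | &apos;dev&apos;

// Resolve the Infisical environment and path for a venture.
function resolveSecretPath(env: CraneEnv, venturePath: string, hasStaging: boolean): string {
  if (env === &apos;prod&apos;) {
    return &apos;prod:&apos; + venturePath
  }
  if (!hasStaging) {
    // No staging infrastructure for this project: warn and fall back to production.
    console.warn(&apos;No staging secrets for &apos; + venturePath + &apos;; falling back to prod&apos;)
    return &apos;prod:&apos; + venturePath
  }
  return &apos;dev:&apos; + venturePath + &apos;/staging&apos;
}
&lt;/code&gt;&lt;/pre&gt;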
&lt;hr /&gt;
&lt;h2&gt;Phase 4: Agent-Aware Environment Toggle&lt;/h2&gt;
&lt;p&gt;The final piece connects the agent to the right environment end-to-end. A central configuration module resolves the &lt;code&gt;CRANE_ENV&lt;/code&gt; variable into concrete URLs and paths:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export type CraneEnv = &apos;prod&apos; | &apos;dev&apos;

const URLS: Record&amp;lt;CraneEnv, string&amp;gt; = {
  prod: &apos;https://context-api.account.workers.dev&apos;,
  dev: &apos;https://context-api-staging.account.workers.dev&apos;,
}

export function getCraneEnv(): CraneEnv {
  const raw = process.env.CRANE_ENV?.toLowerCase()
  if (raw === &apos;dev&apos;) return &apos;dev&apos;
  return &apos;prod&apos;
}

export function getApiBase(): string {
  return URLS[getCraneEnv()]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CRANE_ENV=dev&lt;/code&gt; makes the MCP server connect to the staging context API&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CRANE_ENV=dev&lt;/code&gt; makes the launcher fetch staging secrets from Infisical&lt;/li&gt;
&lt;li&gt;The preflight tool displays which environment the agent is operating in&lt;/li&gt;
&lt;li&gt;The launcher propagates the normalized &lt;code&gt;CRANE_ENV&lt;/code&gt; to the agent child process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Default is production. You opt into staging explicitly. This keeps the common case (working against production) as the zero-configuration path, while making staging available when needed for testing deployments or running migrations.&lt;/p&gt;
&lt;p&gt;The preflight check now shows the environment clearly at session start:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Environment: staging
API: https://context-api-staging.account.workers.dev
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No ambiguity about where the agent is pointed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Unsolved Problem: Staging Data&lt;/h2&gt;
&lt;p&gt;Phases 1 through 4 solve the deployment safety problem. An agent running &lt;code&gt;npx wrangler deploy&lt;/code&gt; hits staging. The CI pipeline auto-deploys to staging and gates production behind manual promotion. Secrets are scoped per environment. The MCP server routes API calls to the right endpoint.&lt;/p&gt;
&lt;p&gt;What they do not solve is staging data representativeness.&lt;/p&gt;
&lt;p&gt;The staging D1 databases are empty. They have the schema - all migrations have been applied - but no meaningful data. Testing against empty databases validates that the deployment mechanics work. It does not validate that the code handles real-world data correctly.&lt;/p&gt;
&lt;p&gt;Consider a migration that adds a UNIQUE index to the sessions table. Against an empty staging database, this migration succeeds instantly. Against production, where the sessions table has thousands of rows - some of which may violate the new constraint - the same migration might fail or take far longer. The staging test gave a false green.&lt;/p&gt;
&lt;p&gt;Possible solutions we have considered but not implemented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Seed scripts&lt;/strong&gt; that populate representative data after each staging migration. This requires maintaining the seed data, which drifts from reality over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Periodic snapshots&lt;/strong&gt; from production, scrubbed of sensitive data, restored to staging. This gives realistic data but adds operational overhead and potential privacy concerns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accept the limitation.&lt;/strong&gt; Staging validates deployment mechanics and code paths. Data correctness is validated through unit tests and integration tests that run in CI with synthetic data. Production data edge cases are caught by monitoring, not by staging.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are currently living with option three. Staging catches deployment failures, broken migrations, and configuration errors. It does not catch data-dependent bugs. That is an acceptable trade-off for now.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Sweet Spot for Solo Operators&lt;/h2&gt;
&lt;p&gt;Running multiple environments adds operational overhead. For a solo operator (or a very small team), the goal is maximum safety with minimum ceremony.&lt;/p&gt;
&lt;p&gt;The sweet spot we landed on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated staging deploy.&lt;/strong&gt; Merge to main deploys to staging with zero manual steps. This means staging always reflects the latest code on main. There is no &quot;forgot to deploy to staging&quot; failure mode.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated smoke tests.&lt;/strong&gt; Health checks and connectivity tests run after every staging deploy. If staging is broken, you know immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual production promotion.&lt;/strong&gt; One click in GitHub Actions. No scripts to run, no commands to remember. But the click is deliberate - a human decided this deployment is ready.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Safe defaults everywhere.&lt;/strong&gt; &lt;code&gt;wrangler deploy&lt;/code&gt; without flags hits staging. &lt;code&gt;CRANE_ENV&lt;/code&gt; defaults to production for agent sessions (you don&apos;t want agents accidentally talking to staging). The config module falls back to production for unknown environment values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What we explicitly did not build:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Blue-green deployments. Overkill for this scale.&lt;/li&gt;
&lt;li&gt;Canary releases. Same.&lt;/li&gt;
&lt;li&gt;Automated production deployment. The manual gate is the point.&lt;/li&gt;
&lt;li&gt;Per-PR preview environments. Cloudflare supports this for Pages but not cleanly for Workers with D1 bindings. The complexity was not justified.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The total infrastructure cost of adding staging was zero additional dollars. Cloudflare&apos;s free tier covers the extra workers and D1 databases. The only cost is cognitive - remembering that two environments exist and that production requires the explicit flag.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Implementation Timeline&lt;/h2&gt;
&lt;p&gt;All four phases were implemented in a single day. This is not because the work was trivial - it is because the scope was deliberately constrained:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1&lt;/strong&gt; (environment split): Create two staging D1 databases, update two &lt;code&gt;wrangler.toml&lt;/code&gt; files, run migrations, set secrets on staging workers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2&lt;/strong&gt; (CI/CD pipeline): Write one GitHub Actions workflow with three jobs (deploy, smoke test, promote).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3&lt;/strong&gt; (secrets): Create the Infisical production environment, copy secrets across venture paths, update the CLI launcher&apos;s default environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4&lt;/strong&gt; (agent toggle): Add one config module, update the API client constructor, update the launcher&apos;s secret-fetching logic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each phase was independently useful. Phase 1 alone eliminated the &quot;bare deploy hits production&quot; risk. Phase 2 added automated validation. Phase 3 ensured environment isolation extended to secrets. Phase 4 made the whole system agent-aware.&lt;/p&gt;
&lt;p&gt;If you are running AI agents that deploy infrastructure, start with Phase 1. Making staging the default deployment target is a ten-minute change that eliminates the most common failure mode. The other phases add defense in depth, but the default-to-staging pattern is where most of the safety comes from.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This article describes a production environment strategy for Cloudflare Workers infrastructure managed by AI coding agents. The system uses Wrangler environment splits, GitHub Actions CI/CD, Infisical secrets management, and an environment-aware MCP server. It has been in production since February 2026.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>cloudflare-workers</category><category>ci-cd</category><category>staging</category></item><item><title>Sessions as First-Class Citizens - Heartbeats, Handoffs, and Abandoned Work</title><link>https://venturecrane.com/articles/sessions-heartbeats-handoffs/</link><guid isPermaLink="true">https://venturecrane.com/articles/sessions-heartbeats-handoffs/</guid><description>Why we gave AI agent sessions heartbeats, idempotent handoff storage, and a full lifecycle - the same reliability patterns used in distributed systems.</description><pubDate>Sat, 31 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An AI coding agent is a process. It starts, does work, and eventually stops. Sometimes it stops gracefully. Sometimes it crashes. Sometimes the human closes the laptop and walks away. If you are running multiple agents across multiple machines, you need answers to the same questions you would ask about any distributed system process: Is it still alive? What was it working on? Did it finish? Can another process safely pick up where it left off?&lt;/p&gt;
&lt;p&gt;We built a session management system for AI agents that borrows directly from distributed systems infrastructure - liveness detection via heartbeats, crash recovery via idempotent handoffs, and coordination via session awareness. This article goes deep on the design of each layer.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The problem: every session starts from zero&lt;/h2&gt;
&lt;p&gt;Without explicit session management, every agent session is amnesiac. The agent does not know what happened in the previous session, whether another agent is currently working on the same codebase, or whether the last session ended cleanly or was abandoned mid-task.&lt;/p&gt;
&lt;p&gt;The naive solution - committing a markdown file to git at the end of each session - has real failure modes. The agent might crash before writing the file. Two agents might overwrite each other&apos;s handoffs. There is no way to distinguish &quot;the last session ended cleanly&quot; from &quot;the last session was abandoned three hours ago.&quot; And querying across sessions (how many are active right now?) requires parsing files out of git history.&lt;/p&gt;
&lt;p&gt;We needed sessions to be first-class entities with their own lifecycle, stored in a queryable database, with reliability guarantees that survive agent crashes.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Session lifecycle: active, stale, abandoned&lt;/h2&gt;
&lt;p&gt;Every session moves through a defined state machine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  SOD (Start of Day)
       │
       ▼
   ┌────────┐    heartbeat     ┌────────┐
   │ active │ ───────────────&amp;gt; │ active │  (timestamp refreshed)
   └────────┘                  └────────┘
       │                            │
       │  EOD (manual)              │  no heartbeat for 45 min
       ▼                            ▼
   ┌────────┐                  ┌───────────┐
   │ ended  │                  │   stale   │  (detected at next SOD)
   └────────┘                  └───────────┘
                                    │
                                    │  next SOD for same tuple
                                    ▼
                               ┌───────────┐
                               │ abandoned │
                               └───────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A session is created (or resumed) at the start of a work session. During work, heartbeat pings keep the &lt;code&gt;last_heartbeat_at&lt;/code&gt; timestamp current. At the end of work, the agent explicitly ends the session and writes a structured handoff. If the agent disappears without ending the session - a crash, a closed terminal, a human who just walked away - the session becomes stale after 45 minutes of silence. The next time any agent calls SOD for the same agent + project + repository tuple, the stale session is marked as abandoned and a fresh session is created.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;end_reason&lt;/code&gt; field captures why a session ended:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;manual&lt;/code&gt; - the agent explicitly called EOD&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stale&lt;/code&gt; - the session was auto-closed due to inactivity&lt;/li&gt;
&lt;li&gt;&lt;code&gt;superseded&lt;/code&gt; - a newer session for the same tuple replaced it&lt;/li&gt;
&lt;li&gt;&lt;code&gt;error&lt;/code&gt; - the session ended due to a system error&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This distinction matters for diagnostics. A high rate of &lt;code&gt;stale&lt;/code&gt; endings suggests agents are not properly closing sessions. A spike in &lt;code&gt;superseded&lt;/code&gt; endings might indicate agents are restarting too frequently.&lt;/p&gt;
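&lt;p&gt;A diagnostic query along these lines makes the distribution visible (column names follow the schema described in this article; adjust to your own):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT end_reason, COUNT(*) AS sessions
FROM sessions
WHERE status != &apos;active&apos;
GROUP BY end_reason
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;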
&lt;hr /&gt;
&lt;h2&gt;Heartbeat design: why 45 minutes&lt;/h2&gt;
&lt;p&gt;The staleness threshold is the core parameter of the liveness detection system. Too short, and you get false positives - a session that is actively working but doing a long file operation gets marked stale. Too long, and stale sessions linger, polluting the &quot;active sessions&quot; view and misleading other agents about what work is in progress.&lt;/p&gt;
&lt;p&gt;We settled on 45 minutes after observing actual agent session patterns. A coding agent doing deep work - refactoring a module, writing a complex test suite, debugging a production issue - might go 20-30 minutes between API calls. The heartbeat interval is 10 minutes (base), which means even during the longest stretches of uninterrupted work, heartbeats fire regularly. 45 minutes gives a 4.5x safety margin over the base heartbeat interval.&lt;/p&gt;
&lt;p&gt;The heartbeat itself is a simple timestamp update:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;UPDATE sessions SET last_heartbeat_at = ? WHERE id = ?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Staleness detection happens at query time, not via a background job. When any code path queries for active sessions, it compares &lt;code&gt;last_heartbeat_at&lt;/code&gt; against a threshold:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export function isSessionStale(
  session: SessionRecord,
  staleAfterMinutes: number = STALE_AFTER_MINUTES
): boolean {
  const staleThreshold = subtractMinutes(staleAfterMinutes)
  return session.last_heartbeat_at &amp;lt; staleThreshold
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This check-on-read approach avoids the need for a background cleanup process. Stale sessions are detected naturally when they matter - at the start of the next session. A partial index on the sessions table (&lt;code&gt;WHERE status = &apos;active&apos;&lt;/code&gt;) keeps these queries fast.&lt;/p&gt;
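&lt;p&gt;The partial index looks like this (index and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE INDEX idx_sessions_active
ON sessions (agent, venture, repo, track)
WHERE status = &apos;active&apos;;
&lt;/code&gt;&lt;/pre&gt;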
&lt;h3&gt;Server-side jitter&lt;/h3&gt;
&lt;p&gt;If you run multiple agents across a fleet and they all heartbeat at exactly 10-minute intervals, they will periodically align and hit the API simultaneously. This is the thundering herd problem.&lt;/p&gt;
&lt;p&gt;The fix is server-side jitter. Each heartbeat response includes the next heartbeat time, calculated as the base interval plus a random offset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export function calculateNextHeartbeat(): {
  next_heartbeat_at: string
  heartbeat_interval_seconds: number
} {
  // Generate random jitter: +/-120 seconds (2 minutes)
  const jitter =
    Math.floor(Math.random() * (HEARTBEAT_JITTER_SECONDS * 2 + 1)) - HEARTBEAT_JITTER_SECONDS

  const intervalSeconds = HEARTBEAT_INTERVAL_SECONDS + jitter
  const nextHeartbeat = addSeconds(intervalSeconds)

  return {
    next_heartbeat_at: nextHeartbeat,
    heartbeat_interval_seconds: intervalSeconds,
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a base of 600 seconds and jitter of plus or minus 120 seconds, the actual interval ranges from 480 to 720 seconds (8 to 12 minutes). Across a fleet, heartbeats naturally spread across the full interval window. The server controls the jitter, not the client, which means the distribution is enforced regardless of client implementation.&lt;/p&gt;
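&lt;p&gt;On the client side, honoring the server&apos;s schedule is a matter of computing a delay from &lt;code&gt;next_heartbeat_at&lt;/code&gt; and handing it to a timer. A minimal sketch (the helper name is ours, not the actual client code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Milliseconds until the server-assigned heartbeat time, clamped at zero
// so a late client fires immediately instead of scheduling in the past.
function msUntilNextHeartbeat(nextHeartbeatAt: string, nowMs: number): number {
  return Math.max(Date.parse(nextHeartbeatAt) - nowMs, 0)
}

// Usage:
// const resp = await sendHeartbeat(sessionId)
// setTimeout(sendHeartbeat, msUntilNextHeartbeat(resp.next_heartbeat_at, Date.now()))
&lt;/code&gt;&lt;/pre&gt;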
&lt;hr /&gt;
&lt;h2&gt;Session resume logic&lt;/h2&gt;
&lt;p&gt;SOD is not simply &quot;create a new session.&quot; It implements resume-or-create semantics with edge case handling:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find all active sessions matching the (agent, venture, repo, track) tuple&lt;/li&gt;
&lt;li&gt;If multiple active sessions exist (should not happen, but handle it): keep the most recent, mark the rest as &lt;code&gt;superseded&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If a single active session exists: check if it is stale
&lt;ul&gt;
&lt;li&gt;If stale: mark it abandoned, fall through to create&lt;/li&gt;
&lt;li&gt;If fresh: refresh its heartbeat and return it (resumed)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If no active session exists: create a new one&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This logic is important because it makes SOD idempotent. Calling SOD twice in rapid succession returns the same session. Calling SOD after an abandoned session creates a clean new one. The system always converges to a valid state.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Step 3: Check staleness
if (isSessionStale(existing, params.staleAfterMinutes)) {
  await markSessionAbandoned(db, existing.id)
  // Fall through to create new session
} else {
  const updated = await updateHeartbeat(db, existing.id)
  return updated // Resume existing session
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &quot;mark abandoned, then create new&quot; pattern is a deliberate design choice. We do not reuse abandoned sessions because their state is suspect - the previous agent may have left files in an inconsistent state, uncommitted changes, or half-completed operations.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Handoff design: the dual-write pattern&lt;/h2&gt;
&lt;p&gt;When a session ends, it produces a handoff - a structured summary of what happened, what is in progress, and what comes next. The handoff serves as the bridge between sessions.&lt;/p&gt;
&lt;p&gt;We use a dual-write pattern: structured data goes to D1 (the edge database), and a human-readable markdown version goes to a git commit. These writes happen at different layers. The context API worker handles the D1 write. The MCP server on the agent&apos;s machine handles the git commit. The two are not transactionally coupled - they are coordinated by the end-of-day flow that triggers both.&lt;/p&gt;
&lt;p&gt;The D1 handoff is machine-optimized. It has typed fields (&lt;code&gt;summary&lt;/code&gt;, &lt;code&gt;status_label&lt;/code&gt;, &lt;code&gt;from_agent&lt;/code&gt;, &lt;code&gt;to_agent&lt;/code&gt;, &lt;code&gt;payload_json&lt;/code&gt;) that can be queried, filtered, and rendered programmatically. The next agent&apos;s SOD call fetches the latest handoff automatically and injects it into the session context.&lt;/p&gt;
&lt;p&gt;The git handoff is human-optimized. It is a markdown file committed to the repo, visible in pull requests and git log. It provides a readable record of what happened across sessions that anyone can review without API access.&lt;/p&gt;
&lt;h3&gt;Canonical JSON and deterministic hashing&lt;/h3&gt;
&lt;p&gt;Handoff payloads are stored as canonical JSON per RFC 8785 and hashed with SHA-256. This might seem like over-engineering for what is essentially a JSON blob, but it solves a real problem: deduplication and integrity verification.&lt;/p&gt;
&lt;p&gt;RFC 8785 defines a canonical serialization for JSON. It specifies key ordering (lexicographic on UTF-16 code units), number formatting (the shortest ECMAScript serialization), and string escaping rules. The result is that the same logical JSON object always produces the same byte sequence, regardless of what language, library, or platform serialized it.&lt;/p&gt;
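&lt;p&gt;A toy version of the key-ordering rule shows why this works. This is an illustration only - the real implementation uses the &lt;code&gt;canonicalize&lt;/code&gt; package, which also handles nesting, numbers, and escaping:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Serialize a flat object with lexicographically sorted keys.
function sortedStringify(obj: { [key: string]: number }): string {
  const parts = Object.keys(obj).sort().map(function (k) {
    return JSON.stringify(k) + &apos;:&apos; + JSON.stringify(obj[k])
  })
  return &apos;{&apos; + parts.join(&apos;,&apos;) + &apos;}&apos;
}

// Logically equal objects produce identical bytes regardless of insertion order:
// sortedStringify({ b: 1, a: 2 }) === sortedStringify({ a: 2, b: 1 })
&lt;/code&gt;&lt;/pre&gt;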
&lt;pre&gt;&lt;code&gt;import canonicalize from &apos;canonicalize&apos;

export function canonicalizeJson(obj: unknown): string {
  const result = canonicalize(obj)
  if (!result) {
    throw new Error(&apos;Failed to canonicalize JSON&apos;)
  }
  return result
}

export async function hashCanonicalJson(obj: unknown): Promise&amp;lt;string&amp;gt; {
  const canonical = canonicalizeJson(obj)
  return await sha256(canonical)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The workflow on handoff creation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Canonicalize the payload using RFC 8785&lt;/li&gt;
&lt;li&gt;Measure the canonical payload size (must be under 800KB - D1 rows cap at 1MB, leaving headroom for metadata)&lt;/li&gt;
&lt;li&gt;Compute SHA-256 of the canonical payload&lt;/li&gt;
&lt;li&gt;Store the canonical JSON, hash, and size in the handoff record&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The hash is stored alongside the payload. It is not currently used for deduplication (handoffs are append-only), but it provides an integrity check and a fast equality comparison for future features like change detection between handoffs.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The idempotency layer&lt;/h2&gt;
&lt;p&gt;An agent might crash mid-handoff and retry. A network timeout might cause a client to resend a request that the server already processed. Without idempotency, these retries create duplicate handoffs, duplicate sessions, or worse - conflicting state.&lt;/p&gt;
&lt;p&gt;Every mutating endpoint (&lt;code&gt;/sod&lt;/code&gt;, &lt;code&gt;/eod&lt;/code&gt;, &lt;code&gt;/update&lt;/code&gt;) accepts an &lt;code&gt;Idempotency-Key&lt;/code&gt; header. If a request with the same key arrives within the TTL window, the server returns the cached response instead of processing the request again.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export async function handleIdempotentRequest(
  db: D1Database,
  endpoint: string,
  key: string | null
): Promise&amp;lt;Response | null&amp;gt; {
  if (!key) {
    return null // No key, proceed normally
  }

  const cached = await checkIdempotencyKey(db, endpoint, key)
  if (cached) {
    return reconstructResponse(cached)
  }

  return null // Key not found, proceed with request
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The implementation uses hybrid storage for cached responses. If the response body is under 64KB, the full body is stored. If it is larger, only the SHA-256 hash is stored. This keeps the idempotency table from growing excessively while still providing exact replay for most requests.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE idempotency_keys (
  endpoint TEXT NOT NULL,             -- /sod, /eod, /update
  key TEXT NOT NULL,                  -- Client-provided UUID
  response_status INTEGER NOT NULL,
  response_hash TEXT NOT NULL,        -- SHA-256(response_body)
  response_body TEXT,                 -- Full body if &amp;lt;64KB, NULL otherwise
  response_size_bytes INTEGER NOT NULL,
  response_truncated INTEGER DEFAULT 0,
  created_at TEXT NOT NULL,
  expires_at TEXT NOT NULL,           -- 1 hour TTL
  actor_key_id TEXT NOT NULL,
  correlation_id TEXT NOT NULL,
  PRIMARY KEY (endpoint, key)
);
&lt;/code&gt;&lt;/pre&gt;
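&lt;p&gt;The storage decision itself is a few lines. This sketch uses the column names from the schema above; the helper name is illustrative:&lt;/p&gt;

```typescript
const BODY_LIMIT = 64 * 1024 // 64KB threshold for storing the full body

// Illustrative sketch of the hybrid storage decision; column names match
// the idempotency_keys schema, the function name is made up for the example.
function prepareCacheRecord(body: string, hash: string) {
  const size = Buffer.byteLength(body, 'utf8')
  const keepBody = BODY_LIMIT > size // full body only when under 64KB
  return {
    response_hash: hash, // always stored, enables equality checks
    response_body: keepBody ? body : null, // NULL for oversized responses
    response_size_bytes: size,
    response_truncated: keepBody ? 0 : 1,
  }
}
```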
&lt;p&gt;Keys are scoped to endpoints (the same UUID sent to &lt;code&gt;/sod&lt;/code&gt; and to &lt;code&gt;/eod&lt;/code&gt; is treated as two distinct keys) and expire after 1 hour. Expired keys are cleaned up opportunistically - when a cache miss occurs, the system deletes all expired keys as a background operation.&lt;/p&gt;
&lt;p&gt;The 1-hour TTL is generous. Retry windows for transient failures are typically seconds or minutes. An hour provides a wide safety margin without accumulating significant storage.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Multi-session coordination&lt;/h2&gt;
&lt;p&gt;When two agents work on the same project, they need to know about each other. Without this awareness, they pick the same issue, create conflicting branches, or overwrite each other&apos;s work.&lt;/p&gt;
&lt;p&gt;Session awareness is the first coordination layer. The context API exposes a &lt;code&gt;GET /active&lt;/code&gt; endpoint that returns all active sessions for a given venture. The MCP server&apos;s SOD tool queries this endpoint, filters out the current agent, and surfaces the results:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const activeSessions = (session.active_sessions || []).filter((s) =&amp;gt; s.agent !== getAgentName())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each active session includes the agent name, repository, branch, and optionally the issue number being worked on. The SOD output renders this prominently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;### Other Active Sessions
- agent-mac2 on example-org/project-console (Issue #87)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is enough for practical coordination. The agent sees that Issue #87 is already being worked on and picks different work.&lt;/p&gt;
&lt;p&gt;Branch isolation provides the second layer. Each agent uses a host-scoped branch prefix (&lt;code&gt;dev/hostname/feature-name&lt;/code&gt;), so branch names never collide even when agents work in the same repo simultaneously.&lt;/p&gt;
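&lt;p&gt;A host-scoped prefix is cheap to construct. This illustrative helper derives it from the machine&apos;s hostname:&lt;/p&gt;

```typescript
import { hostname } from 'node:os'

// Host-scoped branch naming, dev/hostname/feature-name (pattern from the
// text; this helper is illustrative, not the actual implementation).
function branchName(feature: string): string {
  const host = hostname().toLowerCase().split('.')[0] // drop any domain suffix
  return `dev/${host}/${feature}`
}
```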
&lt;p&gt;The schema also includes a track system - a numeric partition that can be assigned to agents at SOD time. Track 1 gets one set of issues, track 2 gets another. The tables, indexes, and query logic are all in place. We have not activated it yet because manual session awareness has been sufficient, but the infrastructure is ready for when parallel agent operations become routine enough to need automated work partitioning.&lt;/p&gt;
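&lt;p&gt;One plausible assignment rule - illustrative only, since the track system is not yet activated and the actual mapping from issues to tracks is not fixed - is a simple modulo partition:&lt;/p&gt;

```typescript
// Hypothetical partition rule: spread issues across tracks by number.
// The real system may assign tracks differently once activated.
function trackFor(issueNumber: number, trackCount: number): number {
  return (issueNumber % trackCount) + 1 // tracks are 1-based
}
```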
&lt;hr /&gt;
&lt;h2&gt;D1 schema design decisions&lt;/h2&gt;
&lt;p&gt;The data model reflects several deliberate choices that are worth explaining.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ULID IDs, type-prefixed.&lt;/strong&gt; Every entity ID uses ULID format with a type prefix: &lt;code&gt;sess_01HQXV3NK8...&lt;/code&gt; for sessions, &lt;code&gt;ho_01HQXV4NK8...&lt;/code&gt; for handoffs, &lt;code&gt;note_01HQXV5NK8...&lt;/code&gt; for notes. ULIDs are sortable (they embed a 48-bit millisecond timestamp), globally unique (80-bit random component), and URL-safe. The type prefix makes IDs self-describing - you can look at an ID in a log line and immediately know what kind of entity it references without additional context.&lt;/p&gt;
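&lt;p&gt;The self-describing property is easy to exploit in tooling. A sketch, with the prefix table drawn from the examples above:&lt;/p&gt;

```typescript
// Illustrative helper: recover the entity type from a type-prefixed ID.
// Prefix table taken from the examples in the text.
const PREFIX_TYPES: any = { sess: 'session', ho: 'handoff', note: 'note' }

function entityType(id: string): string {
  const prefix = id.split('_')[0]
  return PREFIX_TYPES[prefix] ?? 'unknown'
}

console.log(entityType('ho_01HQXV4NK8ABCDEF')) // handoff
```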
&lt;p&gt;&lt;strong&gt;Actor key IDs.&lt;/strong&gt; Every record stores an &lt;code&gt;actor_key_id&lt;/code&gt; - the first 16 hex characters of &lt;code&gt;SHA-256(api_key)&lt;/code&gt;. This provides attribution without storing raw API keys. You can answer &quot;who created this session?&quot; and &quot;is this the same actor who created that handoff?&quot; without being able to recover the actual key. Changing a key changes the actor ID, but historical records remain traceable to the old key&apos;s identity.&lt;/p&gt;
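&lt;p&gt;The derivation described above is a one-liner in Node.js:&lt;/p&gt;

```typescript
import { createHash } from 'node:crypto'

// actor_key_id: first 16 hex characters of SHA-256(api_key), as described
// in the text. Attribution without storing or recovering the raw key.
function actorKeyId(apiKey: string): string {
  return createHash('sha256').update(apiKey).digest('hex').slice(0, 16)
}

console.log(actorKeyId('test')) // 9f86d081884c7d65
```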
&lt;p&gt;&lt;strong&gt;Correlation IDs.&lt;/strong&gt; Every request generates a &lt;code&gt;corr_&amp;lt;UUID&amp;gt;&lt;/code&gt; correlation ID, carried through the entire request lifecycle and stored in every record created during that request. When debugging a production issue, you can query the request log by correlation ID and see the full trace of what happened: authentication, session creation, handoff storage, idempotency checks, and the final response status.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;800KB payload limit.&lt;/strong&gt; D1 has a 1MB row size limit. Handoff payloads are capped at 800KB to leave 200KB of headroom for the other columns in the row - IDs, timestamps, metadata, and index overhead. The application enforces this at the point of handoff creation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const payloadSize = sizeInBytes(canonicalPayload)
if (payloadSize &amp;gt; MAX_HANDOFF_PAYLOAD_SIZE) {
  throw new Error(
    `Handoff payload too large: ${payloadSize} bytes (max ${MAX_HANDOFF_PAYLOAD_SIZE})`
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Denormalized context on handoffs.&lt;/strong&gt; The &lt;code&gt;handoffs&lt;/code&gt; table repeats &lt;code&gt;venture&lt;/code&gt;, &lt;code&gt;repo&lt;/code&gt;, &lt;code&gt;track&lt;/code&gt;, and &lt;code&gt;issue_number&lt;/code&gt; from the parent session. This is intentional denormalization. Handoffs are queried by these fields far more often than they are joined to sessions, and D1 (SQLite at the edge) performs better with denormalized reads. The index &lt;code&gt;idx_handoffs_issue&lt;/code&gt; on &lt;code&gt;(venture, repo, issue_number, created_at DESC)&lt;/code&gt; serves the most common query pattern directly.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The distributed systems parallel&lt;/h2&gt;
&lt;p&gt;These patterns are not novel. They are well-established techniques from distributed systems, applied to a new domain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Heartbeats&lt;/strong&gt; are the standard liveness detection mechanism for any system that needs to distinguish &quot;working quietly&quot; from &quot;dead.&quot; Kubernetes uses them for pod health checks. ZooKeeper uses them for session management. Consul uses them for service registration. We use them because agent sessions have the same fundamental property: an external observer cannot tell the difference between an agent doing deep work and an agent that has crashed unless the agent periodically signals that it is alive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idempotency keys&lt;/strong&gt; are the standard solution for at-least-once delivery semantics. Stripe popularized them for payment APIs. AWS uses them for EC2 instance creation. Any system where a retry might duplicate a side effect needs idempotent endpoints. Agent sessions have this exact property - a network timeout during handoff creation should not produce two handoffs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical serialization&lt;/strong&gt; is a prerequisite for content-addressed storage. Git uses SHA-1 (now SHA-256) for commit and blob identity. Docker uses content-addressable layers. RFC 8785 brings the same property to JSON - a deterministic byte representation that enables stable hashing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Session state machines&lt;/strong&gt; with explicit transitions (active, ended, abandoned) and typed end reasons (timeout, superseded, explicit close) are the same pattern used for database connection pools, HTTP/2 streams, and TCP connections. Making transitions explicit means edge cases are handled by design rather than discovered in production.&lt;/p&gt;
&lt;p&gt;The difference is scale. Distributed systems handle millions of processes. We handle a handful of agent sessions. But the reliability requirements are the same. When an agent session fails silently, the cost is not a 500 error to a user - it is hours of wasted compute and duplicated work. The patterns that prevent silent failures in distributed systems prevent them here too.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What we learned building this&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check-on-read beats background cleanup for small-scale systems.&lt;/strong&gt; We initially planned a Cloudflare Cron Trigger to sweep stale sessions and expired idempotency keys. In practice, check-on-read is simpler and sufficient. When a new SOD detects a stale predecessor, it marks it abandoned in the same transaction. When an idempotency cache miss occurs, expired keys are cleaned up as a background fire-and-forget. No cron infrastructure, no additional failure modes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The 45-minute threshold has been stable.&lt;/strong&gt; We have not needed to adjust it since the initial deployment. The 4.5x safety margin over the heartbeat interval absorbs all the variability we have seen - long file operations, slow network connections, agents doing extended reasoning. If we ever need to tune it, the threshold is configurable via environment variable without a code change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical JSON solved a problem we did not anticipate.&lt;/strong&gt; We adopted RFC 8785 for handoff payload hashing. The unexpected benefit was debuggability - canonical payloads are deterministic, so log comparisons between handoffs are byte-exact. No more &quot;these look the same but the keys are in a different order.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Session awareness reduced duplicate work immediately.&lt;/strong&gt; Before session awareness, we regularly had two agents pick the same issue. After adding the &quot;Other Active Sessions&quot; block to SOD output, this stopped happening. The fix was not a constraint system or a lock - just visibility. Showing agents what others are doing is enough for them to self-coordinate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idempotency prevented real data corruption.&lt;/strong&gt; Within the first week of deployment, we observed retried requests that would have created duplicate handoffs without the idempotency layer. Network timeouts between the MCP server and the context API are common enough (edge latency, laptop sleep/wake cycles) that retries are not theoretical - they happen daily.&lt;/p&gt;
&lt;p&gt;The session management layer has been running in production since January 2026, handling sessions across a fleet of development machines. The patterns are simple - heartbeats, state machines, idempotent writes, canonical hashing - but they provide the reliability foundation that makes multi-agent, multi-machine development practical.&lt;/p&gt;
</content:encoded><category>agent-context</category><category>distributed-systems</category><category>infrastructure</category></item><item><title>Building an MCP Server for Workflow Orchestration</title><link>https://venturecrane.com/articles/building-mcp-server/</link><guid isPermaLink="true">https://venturecrane.com/articles/building-mcp-server/</guid><description>A practical walkthrough of building an MCP server that bridges an AI coding CLI to a custom backend API, with lessons on tool design and fleet deployment.</description><pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Before MCP, our agent integration was a collection of bash scripts invoked through a CLI skill system. The scripts called our backend API, parsed JSON with &lt;code&gt;jq&lt;/code&gt;, and rendered output for the agent to consume. It worked, but barely.&lt;/p&gt;
&lt;p&gt;The problems compounded. Environment variables set in the shell didn&apos;t reliably pass through to skill execution. Auth conflicts arose when scripts needed both OAuth tokens (for GitHub) and API keys (for our context backend) in the same invocation. Every new machine in the fleet required manual setup of script paths, permissions, and configuration. Adding a new tool meant writing bash, registering it in a command manifest, and debugging string escaping issues across three different shells.&lt;/p&gt;
&lt;p&gt;MCP replaced all of that with a single pattern: a typed, validated, locally running server that the AI CLI connects to over stdio.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What MCP Is (Briefly)&lt;/h2&gt;
&lt;p&gt;Model Context Protocol is the standard extension mechanism for AI coding tools. It defines a JSON-RPC protocol over stdio that lets a host application (like Claude Code) discover and invoke tools provided by a server process.&lt;/p&gt;
&lt;p&gt;The key properties that matter in practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stdio transport.&lt;/strong&gt; The MCP server is a subprocess of the CLI. No ports, no HTTP, no discovery. The CLI spawns it and communicates over stdin/stdout.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typed tool schemas.&lt;/strong&gt; Each tool declares its inputs as a JSON Schema. The CLI validates inputs before calling the tool, and the agent sees the schema to understand what parameters are available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single-file configuration.&lt;/strong&gt; A &lt;code&gt;.mcp.json&lt;/code&gt; file in the repo root (or a global config for other CLIs) declares which servers to start and what environment variables to pass. No shell profiles, no &lt;code&gt;export&lt;/code&gt; statements, no sourcing dotfiles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auth via config.&lt;/strong&gt; API keys go in the MCP config file, passed as environment variables to the server process. The CLI handles this at startup. No interactive prompts, no token refresh flows.&lt;/li&gt;
&lt;/ul&gt;
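&lt;p&gt;A minimal &lt;code&gt;.mcp.json&lt;/code&gt; along these lines might look like the following - the server name, command, and environment variable are illustrative, not our exact configuration:&lt;/p&gt;

```json
{
  "mcpServers": {
    "crane": {
      "command": "crane-mcp-server",
      "env": {
        "CRANE_CONTEXT_KEY": "${CRANE_CONTEXT_KEY}"
      }
    }
  }
}
```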
&lt;p&gt;For our use case, this means: install once, configure once, and every agent session on that machine gets the same reliable tooling.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Architecture&lt;/h2&gt;
&lt;p&gt;We don&apos;t connect the AI CLI directly to our cloud API. Instead, a local MCP server (Node.js, TypeScript) acts as middleware:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│                  Developer Machine                   │
│                                                      │
│  ┌──────────────┐    stdio     ┌──────────────────┐  │
│  │ Claude Code  │ ◄──────────► │   MCP Server     │  │
│  │ (AI agent)   │              │   (Node.js)      │  │
│  └──────────────┘              │                  │  │
│                                │ • Git repo       │  │
│                                │   detection      │  │
│                                │ • GitHub CLI     │  │
│                                │   integration    │  │
│                                │ • Doc self-      │  │
│                                │   healing        │  │
│                                │ • Input          │  │
│                                │   validation     │  │
│                                └────────┬─────────┘  │
│                                         │            │
└─────────────────────────────────────────┼────────────┘
                                          │ HTTPS
                                          ▼
                             ┌─────────────────────────┐
                             │ Cloudflare Worker + D1  │
                             │ (Context API)           │
                             │ • Sessions              │
                             │ • Handoffs              │
                             │ • Knowledge store       │
                             │ • Doc management        │
                             └─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Why a local server instead of direct API calls?&lt;/strong&gt; Several reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Client-side intelligence.&lt;/strong&gt; The MCP server detects the current git repo, resolves the venture/project context, and passes that to the API. The API doesn&apos;t need to know about local filesystem layout.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool composition.&lt;/strong&gt; Some tools call the &lt;code&gt;gh&lt;/code&gt; CLI for GitHub data, the filesystem for local files, and the API for remote state - all in a single tool invocation. A remote API can&apos;t do that.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fail-fast validation.&lt;/strong&gt; Zod schemas validate inputs before any network call. Bad input gets a clear error message instantly, not after a round-trip.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The API stays simple.&lt;/strong&gt; The cloud backend is stateless HTTP. No git operations, no filesystem access, no shell commands. The complexity lives in the MCP server where it can be tested and iterated quickly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The server is built with the official &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; package, which handles the JSON-RPC protocol, message framing, and lifecycle management. The dependency footprint is deliberately small: the SDK, Zod for validation, and Node.js standard library. No Express, no database drivers, no heavyweight frameworks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;dependencies&quot;: {
    &quot;@modelcontextprotocol/sdk&quot;: &quot;^1.0.0&quot;,
    &quot;zod&quot;: &quot;^3.24.0&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;h2&gt;The Tool Inventory&lt;/h2&gt;
&lt;p&gt;The server registers 11 tools. Each one maps to a specific workflow step, not a CRUD operation.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Data Sources&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preflight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validate environment before starting work&lt;/td&gt;
&lt;td&gt;Local env, &lt;code&gt;gh&lt;/code&gt; CLI, API health&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Initialize session, load all context&lt;/td&gt;
&lt;td&gt;API, GitHub, local filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;handoff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record structured session summary&lt;/td&gt;
&lt;td&gt;API (writes to D1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show full GitHub issue breakdown by priority&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gh&lt;/code&gt; CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show current session state&lt;/td&gt;
&lt;td&gt;API, git&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ventures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List projects with local install status&lt;/td&gt;
&lt;td&gt;API, filesystem scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read weekly priority plan&lt;/td&gt;
&lt;td&gt;Local markdown file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;doc_audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check and heal missing documentation&lt;/td&gt;
&lt;td&gt;API + local file generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;doc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch a specific document by scope and name&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;note&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create or update enterprise knowledge&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search knowledge store by tag, scope, or text&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The naming matters. We call the tool &lt;code&gt;sod&lt;/code&gt; (Start of Day), not &lt;code&gt;create_session&lt;/code&gt;. We call it &lt;code&gt;handoff&lt;/code&gt;, not &lt;code&gt;update_session_status&lt;/code&gt;. The names reflect what the agent is trying to accomplish, not the underlying data operation. When the agent sees &lt;code&gt;sod&lt;/code&gt; in its tool list, it understands immediately: this is what you call at the start of a session.&lt;/p&gt;
&lt;p&gt;Notice the mix of data sources. &lt;code&gt;status&lt;/code&gt; calls the &lt;code&gt;gh&lt;/code&gt; CLI directly (via &lt;code&gt;execSync&lt;/code&gt;) to query GitHub issues - no API round-trip needed. &lt;code&gt;plan&lt;/code&gt; reads a local markdown file from the repo. &lt;code&gt;sod&lt;/code&gt; calls the remote API, the &lt;code&gt;gh&lt;/code&gt; CLI, and the local filesystem in a single invocation. This heterogeneity is exactly why a local MCP server makes sense as middleware. A pure API integration couldn&apos;t reach the local filesystem or shell out to &lt;code&gt;gh&lt;/code&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Tool Design: What We Learned&lt;/h2&gt;
&lt;p&gt;After building and iterating on these tools over several weeks, a few design principles emerged.&lt;/p&gt;
&lt;h3&gt;Task-oriented, not CRUD-oriented&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;sod&lt;/code&gt; tool doesn&apos;t just create a session. It validates the environment, creates or resumes a session, loads the last handoff, queries P0 issues from GitHub, checks the weekly plan freshness, lists active parallel sessions, runs a documentation audit, and self-heals missing docs. A single tool call returns everything the agent needs to start working.&lt;/p&gt;
&lt;p&gt;Early versions had separate tools for each of these: &lt;code&gt;create_session&lt;/code&gt;, &lt;code&gt;get_handoff&lt;/code&gt;, &lt;code&gt;get_issues&lt;/code&gt;, &lt;code&gt;check_plan&lt;/code&gt;. The agent had to know the right sequence and call them in order. It rarely did. Collapsing them into a single task-oriented tool improved reliability dramatically.&lt;/p&gt;
&lt;h3&gt;Validate inputs with Zod schemas&lt;/h3&gt;
&lt;p&gt;Every tool defines a Zod schema for its input:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const handoffInputSchema = z.object({
  summary: z.string().describe(&apos;Summary of work completed&apos;),
  status: z.enum([&apos;in_progress&apos;, &apos;blocked&apos;, &apos;done&apos;]).describe(&apos;Current status&apos;),
  issue_number: z.number().optional().describe(&apos;GitHub issue number if applicable&apos;),
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The schema serves three purposes. First, it validates input before any side effects occur. If the agent passes an invalid status value, the error is immediate and clear. Second, it generates the JSON Schema that the CLI presents to the agent, so the agent knows what parameters are available. Third, the &lt;code&gt;.describe()&lt;/code&gt; annotations act as inline documentation - the agent reads them to understand what each field means.&lt;/p&gt;
&lt;h3&gt;Return structured text the agent can reason about&lt;/h3&gt;
&lt;p&gt;Every tool returns a &lt;code&gt;message&lt;/code&gt; field containing formatted markdown. Not raw JSON. Not a data structure the agent has to interpret. Structured text that the agent can read, quote, and act on directly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let message = &apos;## Session Context\n\n&apos;
message += `| Field | Value |\n|-------|-------|\n`
message += `| Venture | ${venture.name} (${venture.code}) |\n`
message += `| Repo | ${fullRepo} |\n`
message += `| Branch | ${currentRepo.branch} |\n`
message += `| Session | ${session.session.id} |\n\n`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We tried returning raw JSON and letting the agent format it. The agent did format it - differently every time, sometimes dropping fields, sometimes hallucinating data. Pre-formatted output is deterministic.&lt;/p&gt;
&lt;h3&gt;Fail fast with clear messages&lt;/h3&gt;
&lt;p&gt;When something goes wrong, the tool tells the agent exactly what to do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if (!apiKey) {
  return {
    success: false,
    message: &apos;CRANE_CONTEXT_KEY not found. Launch with: launcher alpha&apos;,
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not &quot;authentication failed.&quot; Not an HTTP 401 status code. A concrete instruction: &quot;Launch with: launcher alpha.&quot; The agent can relay this to the human verbatim, and the human knows exactly what command to run.&lt;/p&gt;
&lt;h3&gt;One tool per workflow step&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;note&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; tools could be a single &lt;code&gt;knowledge&lt;/code&gt; tool with a mode parameter. We split them because they represent different workflows: &lt;code&gt;note&lt;/code&gt; is &quot;store this thing&quot; (a write operation the human initiates), while &lt;code&gt;notes&lt;/code&gt; is &quot;find me something&quot; (a read operation the agent often initiates autonomously). Different intents, different tools.&lt;/p&gt;
&lt;p&gt;The exception is &lt;code&gt;note&lt;/code&gt; itself, which handles both &lt;code&gt;create&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; via an &lt;code&gt;action&lt;/code&gt; parameter. This works because the intent is the same (persist knowledge), and the agent naturally says &quot;update that note&quot; or &quot;create a new note.&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Server Entry Point&lt;/h2&gt;
&lt;p&gt;The entry point is straightforward. Register tools, handle calls, start the transport:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { Server } from &apos;@modelcontextprotocol/sdk/server/index.js&apos;
import { StdioServerTransport } from &apos;@modelcontextprotocol/sdk/server/stdio.js&apos;
import { ListToolsRequestSchema, CallToolRequestSchema } from &apos;@modelcontextprotocol/sdk/types.js&apos;

const server = new Server(
  { name: &apos;my-mcp-server&apos;, version: &apos;0.2.0&apos; },
  { capabilities: { tools: {} } }
)

// Register tool list
server.setRequestHandler(ListToolsRequestSchema, async () =&amp;gt; {
  return {
    tools: [
      {
        name: &apos;preflight&apos;,
        description: &apos;Run environment preflight checks...&apos;,
        inputSchema: { type: &apos;object&apos;, properties: {} },
      },
      // ... more tools
    ],
  }
})

// Handle tool calls
server.setRequestHandler(CallToolRequestSchema, async (request) =&amp;gt; {
  const { name, arguments: args } = request.params
  switch (name) {
    case &apos;preflight&apos;: {
      const input = preflightInputSchema.parse(args)
      const result = await executePreflight(input)
      return {
        content: [{ type: &apos;text&apos;, text: result.message }],
      }
    }
    // ... more cases
  }
})

// Start
const transport = new StdioServerTransport()
await server.connect(transport)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each tool is a separate module that exports a Zod schema and an &lt;code&gt;execute&lt;/code&gt; function. The entry point is purely routing and transport. This separation makes each tool independently testable.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The API Client Layer&lt;/h2&gt;
&lt;p&gt;A dedicated API client class encapsulates all communication with the cloud backend:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export class CraneApi {
  private apiKey: string
  private apiBase: string

  constructor(apiKey: string, apiBase: string) {
    this.apiKey = apiKey
    this.apiBase = apiBase
  }

  async startSession(params: {
    venture: string
    repo: string
    agent: string
  }): Promise&amp;lt;SodResponse&amp;gt; {
    const response = await fetch(`${this.apiBase}/sod`, {
      method: &apos;POST&apos;,
      headers: {
        &apos;Content-Type&apos;: &apos;application/json&apos;,
        &apos;X-Relay-Key&apos;: this.apiKey,
      },
      body: JSON.stringify({ ...params, schema_version: &apos;1.0&apos; }),
    })
    if (!response.ok) throw new Error(`API error: ${response.status}`)
    return response.json() as Promise&amp;lt;SodResponse&amp;gt;
  }

  // ... more methods
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every API method follows the same pattern: construct the request, include the API key header, handle errors. The class uses Node.js native &lt;code&gt;fetch&lt;/code&gt; (no Axios, no &lt;code&gt;node-fetch&lt;/code&gt;), keeping the dependency count low.&lt;/p&gt;
&lt;p&gt;The API client also includes a simple in-memory cache for data that doesn&apos;t change within a session (like the ventures list). Since the MCP server is a long-lived process, this cache persists across tool calls within the same session.&lt;/p&gt;
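&lt;p&gt;The cache itself can be tiny. A sketch - the class name and shape are illustrative:&lt;/p&gt;

```typescript
// Illustrative per-process cache for session-stable data such as the
// ventures list; the MCP server process outlives individual tool calls.
class MemoCache {
  private store = new Map()

  // Returns the cached value, calling load() only on the first miss.
  async get(key: string, load: any) {
    if (!this.store.has(key)) {
      this.store.set(key, await load())
    }
    return this.store.get(key)
  }
}

// The loader runs once per process; later tool calls hit the cache.
let loads = 0
const cache = new MemoCache()
const loadVentures = async () => {
  loads += 1
  return ['venture-alpha', 'venture-beta']
}
```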
&lt;p&gt;All response types are defined as TypeScript interfaces in the same file. This gives us type safety end-to-end: the API client returns typed responses, and the tool functions consume them with full IntelliSense.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Testing MCP Tools&lt;/h2&gt;
&lt;p&gt;Each tool has a corresponding test file. The testing strategy has three layers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unit tests mock the API and external dependencies.&lt;/strong&gt; The test suite uses Vitest with module mocking:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { it, expect, vi } from &apos;vitest&apos;

vi.mock(&apos;../lib/github.js&apos;)
vi.mock(&apos;../lib/repo-scanner.js&apos;)

it(&apos;returns pass when all checks succeed&apos;, async () =&amp;gt; {
  process.env.CRANE_CONTEXT_KEY = &apos;test-key&apos;
  vi.mocked(checkGhAuth).mockReturnValue({
    installed: true,
    authenticated: true,
  })
  vi.mocked(getCurrentRepoInfo).mockReturnValue(mockRepoInfo)
  mockFetch.mockResolvedValue({ ok: true })

  const result = await executePreflight({})

  expect(result.all_passed).toBe(true)
  expect(result.checks).toHaveLength(4)
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each test resets modules between runs (&lt;code&gt;vi.resetModules()&lt;/code&gt;) to ensure clean state. Environment variables are snapshotted in &lt;code&gt;beforeEach&lt;/code&gt; and restored in &lt;code&gt;afterEach&lt;/code&gt;. The &lt;code&gt;fetch&lt;/code&gt; global is stubbed to control API responses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Integration tests hit the real API.&lt;/strong&gt; These run less frequently (not in CI on every push) but verify that the MCP server talks to the actual Cloudflare Worker correctly. They use real API keys from Infisical and validate response shapes against TypeScript interfaces.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;E2E verification runs on machine bootstrap.&lt;/strong&gt; When a new machine joins the fleet, the bootstrap script validates the full chain: MCP server starts, connects via stdio, tool calls return valid responses, API connectivity works. This catches misconfigurations that unit tests can&apos;t reach (wrong Node.js version, missing npm link, broken PATH).&lt;/p&gt;
&lt;p&gt;The test patterns we settled on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mock external dependencies (API, GitHub CLI, filesystem) at the module level&lt;/li&gt;
&lt;li&gt;Test the &lt;code&gt;execute*&lt;/code&gt; functions directly, not through the MCP protocol layer&lt;/li&gt;
&lt;li&gt;Assert on both the structured result (&lt;code&gt;.all_passed&lt;/code&gt;, &lt;code&gt;.success&lt;/code&gt;) and the human-readable message (&lt;code&gt;.message&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use fixtures for common test data (repo info, venture configs)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h2&gt;Fleet Deployment&lt;/h2&gt;
&lt;p&gt;The MCP server needs to be installed on every development machine. Since it&apos;s part of a monorepo built with TypeScript, deployment means: pull the latest code, build, and re-link the binary.&lt;/p&gt;
&lt;p&gt;A deployment script automates this across the fleet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Deploy MCP server to all fleet machines
set -e

MACHINES=(&quot;machine1&quot; &quot;machine2&quot; &quot;machine3&quot;)
FAILED=()

for SSH_HOST in &quot;${MACHINES[@]}&quot;; do
  if ! ssh -o ConnectTimeout=10 &quot;$SSH_HOST&quot; bash &amp;lt;&amp;lt; &apos;EOF&apos;
    set -e
    cd ~/dev/project-console
    git stash --include-untracked
    git pull origin main
    cd packages/mcp-server
    npm install
    npm run build
    npm link
EOF
  then
    FAILED+=(&quot;$SSH_HOST&quot;)
  fi
done

if [ ${#FAILED[@]} -gt 0 ]; then
  echo &quot;Failed machines: ${FAILED[*]}&quot; &amp;gt;&amp;amp;2
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script includes several safety checks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pre-flight validation.&lt;/strong&gt; It verifies the local machine is on &lt;code&gt;main&lt;/code&gt; and has no unpushed commits. Deploying unreleased code to the fleet would be a debugging nightmare.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stash before pull.&lt;/strong&gt; Remote machines sometimes have local changes (usually &lt;code&gt;package-lock.json&lt;/code&gt; differences or experimental edits). The script stashes before pulling to avoid merge conflicts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;npm link.&lt;/strong&gt; The &lt;code&gt;npm link&lt;/code&gt; command creates symlinks from npm&apos;s global bin directory to the monorepo&apos;s build output. This means every terminal session on the machine uses the latest build, regardless of working directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSH timeout and error handling.&lt;/strong&gt; Each machine gets a 10-second connection timeout. Failed machines are collected and reported at the end, with Tailscale troubleshooting hints.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The typical deployment flow: make changes, test locally, push to main, run the deploy script. The script SSHes to each machine in sequence, pulls, builds, and reports success or failure.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The MCP Lifecycle Gotcha&lt;/h2&gt;
&lt;p&gt;This one cost us a multi-hour debugging session, so it&apos;s worth highlighting.&lt;/p&gt;
&lt;p&gt;MCP servers run as subprocesses of the AI CLI. When you start Claude Code, it spawns the MCP server process. That process lives for the entire CLI session. Here&apos;s the catch: a &lt;strong&gt;session restart&lt;/strong&gt; (which happens during context compaction when the conversation gets long) does NOT restart the MCP subprocess. The MCP process keeps running with whatever code it loaded at CLI startup.&lt;/p&gt;
&lt;p&gt;This means if you rebuild the MCP server (&lt;code&gt;npm run build&lt;/code&gt;) while an agent session is active, the running session still uses the old code. Only a full CLI exit and relaunch loads the new binary.&lt;/p&gt;
&lt;p&gt;This is not a bug. It&apos;s the correct behavior - MCP servers are expected to be stable processes that outlive individual conversations. But it creates a trap during development: you change a tool, rebuild, test it, and the old behavior persists. The fix is always the same: exit the CLI, relaunch.&lt;/p&gt;
&lt;p&gt;A related issue: Node.js caches modules at process start. If you modify a library that the MCP server imports, the cached version persists until the process restarts. Same root cause, different symptom.&lt;/p&gt;
&lt;p&gt;We now include this in our developer onboarding docs with a simple rule: &lt;strong&gt;after any MCP server change, restart the CLI.&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why MCP Beats Prompt Engineering&lt;/h2&gt;
&lt;p&gt;Before MCP, the alternative was prompt engineering: paste API documentation into the system prompt, describe the expected request format, and hope the agent constructs valid HTTP requests. This works surprisingly well for simple cases, but breaks down in production:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validation.&lt;/strong&gt; A Zod schema rejects bad input before the API call. A system prompt instruction like &quot;the status field must be one of: in_progress, blocked, done&quot; gets ignored roughly 5% of the time. Over hundreds of daily tool calls, that 5% creates real problems.&lt;/p&gt;
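&lt;p&gt;A dependency-free sketch of what schema-level rejection buys (the real tools use Zod&apos;s &lt;code&gt;z.enum&lt;/code&gt;; the names here are illustrative): invalid input fails loudly before any API call happens, with an error message the agent can act on.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const STATUSES = [&apos;in_progress&apos;, &apos;blocked&apos;, &apos;done&apos;] as const
type Status = (typeof STATUSES)[number]

// Reject bad input before the API call, with an actionable error.
function parseStatus(input: unknown): Status {
  if (typeof input === &apos;string&apos; &amp;amp;&amp;amp; (STATUSES as readonly string[]).includes(input)) {
    return input as Status
  }
  throw new Error(`Invalid status: ${String(input)}. Expected one of: ${STATUSES.join(&apos;, &apos;)}`)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A system prompt asks the model to remember the constraint; the schema enforces it on every call.&lt;/p&gt;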
&lt;p&gt;&lt;strong&gt;Discoverability.&lt;/strong&gt; MCP tools show up in &lt;code&gt;claude mcp list&lt;/code&gt;. The agent can inspect the tool list and schemas. System prompt instructions get compressed, truncated, or buried in context as the conversation grows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reliability.&lt;/strong&gt; An MCP tool either succeeds or returns a structured error. An agent constructing a &lt;code&gt;curl&lt;/code&gt; command from a system prompt might get the URL wrong, forget a header, or misformat the JSON body. Each of these failures requires a retry cycle that wastes time and context window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Composability.&lt;/strong&gt; A single MCP tool can call the filesystem, shell out to &lt;code&gt;gh&lt;/code&gt;, and hit an HTTP API. System prompt engineering would require the agent to chain three separate actions and handle intermediate failures. The tool does this internally and returns a unified result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintainability.&lt;/strong&gt; Tool changes go through TypeScript compilation, Zod schema validation, and test suites. System prompt changes go through... a text diff review and manual testing.&lt;/p&gt;
&lt;p&gt;The tradeoff is real: MCP requires building and maintaining a server process. For a single tool that calls a single API, prompt engineering might be simpler. But for a workflow orchestration system with 11 tools, multiple data sources, and fleet-wide deployment, MCP is the right abstraction.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We&apos;d Do Differently&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Start with MCP from day one.&lt;/strong&gt; We spent weeks on the bash-script-with-skills approach before migrating. The migration itself took two days. We would have saved time overall by starting with MCP, even for the initial two-tool prototype.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Invest in the API client layer early.&lt;/strong&gt; Our first version had inline &lt;code&gt;fetch&lt;/code&gt; calls in each tool. Extracting the API client class took a refactoring pass that touched every tool. Having a dedicated client from the start would have saved churn.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Budget SOD output from the beginning.&lt;/strong&gt; Our initial SOD tool returned everything - full document contents, all enterprise notes, complete handoff history. It consumed a third of the context window before the agent did any work. We retrofitted a budget system (12KB cap on enterprise notes, metadata-only document delivery) that reduced SOD token consumption by 96%. This should have been a design constraint from day one.&lt;/p&gt;
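&lt;p&gt;The budget mechanism can be sketched in a few lines (names and shapes are illustrative, not the real tool&apos;s API): cap the payload and, crucially, make the truncation visible rather than silent.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical budget helper: cap enterprise notes at 12KB
// and signal the truncation instead of hiding it.
const NOTES_BUDGET_BYTES = 12 * 1024

function budgetNotes(notes: string): { text: string; truncated: boolean } {
  const total = Buffer.byteLength(notes, &apos;utf-8&apos;)
  if (total &amp;lt;= NOTES_BUDGET_BYTES) return { text: notes, truncated: false }
  const clipped = Buffer.from(notes, &apos;utf-8&apos;).subarray(0, NOTES_BUDGET_BYTES).toString(&apos;utf-8&apos;)
  return {
    text: `${clipped}\n[truncated: showing ${NOTES_BUDGET_BYTES} of ${total} bytes]`,
    truncated: true,
  }
}
&lt;/code&gt;&lt;/pre&gt;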
&lt;p&gt;&lt;strong&gt;Test the MCP protocol layer, not just the tool functions.&lt;/strong&gt; Our unit tests call &lt;code&gt;executePreflight()&lt;/code&gt; directly, bypassing the MCP message framing. This means we&apos;ve never caught a bug in the &lt;code&gt;ListToolsRequestSchema&lt;/code&gt; handler or the tool name routing in the &lt;code&gt;switch&lt;/code&gt; statement through automated tests. A small integration test that sends actual MCP messages over stdio would close this gap.&lt;/p&gt;
&lt;p&gt;The MCP server is now the single most impactful piece of infrastructure we&apos;ve built. It turns &quot;start a coding session&quot; from a five-minute setup ritual into a single command that gives the agent full context in under two seconds. If you&apos;re building tooling for AI coding agents, MCP is where to start.&lt;/p&gt;
</content:encoded><category>mcp</category><category>agent-tooling</category><category>infrastructure</category><category>typescript</category></item><item><title>Secrets Injection at Agent Launch Time</title><link>https://venturecrane.com/articles/secrets-injection-agent-launch/</link><guid isPermaLink="true">https://venturecrane.com/articles/secrets-injection-agent-launch/</guid><description>How a CLI launcher scans repos, matches them to projects, and injects the right secrets without .env files or hardcoded credentials.</description><pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Secrets management gets harder the moment you have more than one project. Add multiple machines, multiple AI agent sessions, and multiple environments (dev, staging, production), and &lt;code&gt;.env&lt;/code&gt; files become a liability.&lt;/p&gt;
&lt;p&gt;We run several projects across a fleet of development machines. Each project has its own API keys, auth tokens, and service credentials. Each machine needs access to all of them. And each AI agent session needs the right secrets injected at launch - not the secrets for a different project, not production keys in a dev session, and definitely not a stale &lt;code&gt;.env&lt;/code&gt; file that someone forgot to update three weeks ago.&lt;/p&gt;
&lt;p&gt;The standard approach - &lt;code&gt;.env&lt;/code&gt; files checked into repos or copied between machines - fails in predictable ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stale secrets.&lt;/strong&gt; Someone rotates an API key. The &lt;code&gt;.env&lt;/code&gt; on two machines still has the old one. Nobody notices until the agent session fails mid-task.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong-project secrets.&lt;/strong&gt; Copy-paste a &lt;code&gt;.env&lt;/code&gt; from one project to another, change two of six keys, forget the third. The agent runs with a hybrid environment that partially works.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secrets in git history.&lt;/strong&gt; Accidentally commit a &lt;code&gt;.env&lt;/code&gt; file. Remove it. It&apos;s still in the history. Now you&apos;re rotating keys and scrubbing refs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent exposure.&lt;/strong&gt; AI agents can accidentally include environment variable values in commit messages, PR descriptions, or tool call arguments. The blast radius of a secret in an agent&apos;s environment is larger than in a traditional dev setup.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We built a CLI launcher that eliminates all of these failure modes. One command fetches the right secrets for the right project from Infisical (our centralized secrets manager), injects them into the agent process as environment variables, and spawns the session. No files on disk. No copy-paste. No guessing which &lt;code&gt;.env&lt;/code&gt; is current.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Launcher Flow&lt;/h2&gt;
&lt;p&gt;The entire sequence from command to running agent session looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;launcher alpha
    │
    ├── 1. Resolve agent CLI (Claude Code, Gemini, Codex)
    ├── 2. Validate the agent binary is on PATH
    ├── 3. Fetch project registry from the context API
    ├── 4. Scan ~/dev/ for git repos
    ├── 5. Match project to local repo (org + repo name)
    ├── 6. Ensure Infisical config exists in the repo
    ├── 7. Fetch secrets from Infisical (single JSON export)
    ├── 8. Validate secrets (guard on required keys)
    ├── 9. Ensure MCP server binary exists (self-heal if missing)
    ├── 10. Register MCP server for the agent CLI
    └── 11. Spawn agent with secrets injected as env vars
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The user types one command. Everything else is automated.&lt;/p&gt;
&lt;h3&gt;Repo Discovery&lt;/h3&gt;
&lt;p&gt;The launcher needs to know where each project&apos;s code lives on the current machine. Rather than maintaining a mapping file that goes stale, it scans &lt;code&gt;~/dev/&lt;/code&gt; at launch time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { existsSync, readdirSync } from &apos;node:fs&apos;
import { homedir } from &apos;node:os&apos;
import { join } from &apos;node:path&apos;
import { execSync } from &apos;node:child_process&apos;

export function scanLocalRepos(): LocalRepo[] {
  const devDir = join(homedir(), &apos;dev&apos;)
  const repos: LocalRepo[] = []

  const entries = readdirSync(devDir)
  for (const entry of entries) {
    const fullPath = join(devDir, entry)
    const gitDir = join(fullPath, &apos;.git&apos;)
    if (!existsSync(gitDir)) continue

    // Get remote URL; skip repos that have no origin remote
    let remote: string
    try {
      remote = execSync(&apos;git remote get-url origin&apos;, {
        cwd: fullPath,
        encoding: &apos;utf-8&apos;,
      }).trim()
    } catch {
      continue
    }

    // Parse org/repo from remote
    const match = remote.match(/github\.com[:/]([^/]+)\/([^/.]+)/)
    if (match) {
      repos.push({
        path: fullPath,
        name: entry,
        remote,
        org: match[1],
        repoName: match[2],
      })
    }
  }

  return repos
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every directory under &lt;code&gt;~/dev/&lt;/code&gt; that has a &lt;code&gt;.git&lt;/code&gt; folder gets inspected. The scanner reads &lt;code&gt;git remote get-url origin&lt;/code&gt;, parses the GitHub org and repo name from the URL, and builds an index. This handles both SSH (&lt;code&gt;git@github.com:org/repo&lt;/code&gt;) and HTTPS (&lt;code&gt;https://github.com/org/repo&lt;/code&gt;) remotes.&lt;/p&gt;
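&lt;p&gt;Both remote styles reduce to the same two capture groups (the repo names here are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const GITHUB_REMOTE = /github\.com[:/]([^/]+)\/([^/.]+)/

// SSH form: the org follows a colon; the trailing .git is excluded by [^/.]+
&apos;git@github.com:acme/alpha-console.git&apos;.match(GITHUB_REMOTE)
// groups: &apos;acme&apos;, &apos;alpha-console&apos;

// HTTPS form: the org follows a slash
&apos;https://github.com/acme/alpha-console&apos;.match(GITHUB_REMOTE)
// groups: &apos;acme&apos;, &apos;alpha-console&apos;
&lt;/code&gt;&lt;/pre&gt;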
&lt;p&gt;The results are cached for the duration of the launcher process. The scan itself takes milliseconds - there are typically fewer than a dozen repos to inspect.&lt;/p&gt;
&lt;h3&gt;Project Matching&lt;/h3&gt;
&lt;p&gt;Once the launcher has a list of local repos and a list of registered projects (fetched from the context API), it needs to match them. Each project has an org and a naming convention. Matching uses both:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function matchVentureToRepo(venture: Venture, repos: LocalRepo[]): LocalRepo | undefined {
  return repos.find((r) =&amp;gt; {
    if (r.org.toLowerCase() !== venture.org.toLowerCase()) return false
    return r.repoName === `${venture.code}-console`
  })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If a project isn&apos;t cloned locally, the launcher offers to clone it via &lt;code&gt;gh repo clone&lt;/code&gt;. This handles the first-run case on a new machine without requiring a separate setup step.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Infisical as the Secrets Backend&lt;/h2&gt;
&lt;p&gt;All secrets live in &lt;a href=&quot;https://infisical.com&quot;&gt;Infisical&lt;/a&gt;, organized by project path within a single workspace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shared-workspace (workspace)
├── prod (environment)
│   ├── /alpha     - Project Alpha secrets
│   ├── /beta      - Project Beta secrets
│   ├── /gamma     - Project Gamma secrets
│   └── /delta     - Project Delta secrets
└── dev (environment)
    ├── /alpha     - Project Alpha dev/staging secrets
    ├── /beta      - Project Beta dev/staging secrets
    └── ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each project gets its own path. The launcher maintains a simple mapping from project code to Infisical path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const INFISICAL_PATHS: Record&amp;lt;string, string&amp;gt; = {
  alpha: &apos;/alpha&apos;,
  beta: &apos;/beta&apos;,
  gamma: &apos;/gamma&apos;,
  delta: &apos;/delta&apos;,
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shared secrets - infrastructure keys that every project needs, like the context API key - live at a designated source path and are synced to every other path via an audit script. The source of truth is always one path; the rest receive copies. This prevents the &quot;which copy is current?&quot; problem entirely.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Audit: check all projects for missing shared secrets
launcher --secrets-audit

# Fix: propagate missing secrets from the source
launcher --secrets-audit --fix
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a new project is created, the setup script automatically creates its Infisical folder in both environments and propagates shared secrets. No manual intervention.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Runtime Injection: Secrets Never Touch Disk&lt;/h2&gt;
&lt;p&gt;This is the critical design decision. Secrets are fetched once at launch time, held in process memory, and injected as environment variables into the agent&apos;s child process. They never exist as files on disk.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export function fetchSecrets(
  repoPath: string,
  infisicalPath: string,
  env: CraneEnv = &apos;prod&apos;
): { secrets: Record&amp;lt;string, string&amp;gt; } | { error: string } {
  // Build the infisical export command
  const args = [&apos;export&apos;, &apos;--format=json&apos;, &apos;--silent&apos;, &apos;--path&apos;, infisicalPath, &apos;--env&apos;, env]

  const result = spawnSync(&apos;infisical&apos;, args, {
    cwd: repoPath,
    timeout: 30_000,
    encoding: &apos;utf-8&apos;,
  })

  if (result.status !== 0) {
    return { error: `infisical export failed: ${result.stderr}` }
  }

  // Parse JSON array of {key, value} objects
  try {
    const parsed = JSON.parse(result.stdout)
    const secrets: Record&amp;lt;string, string&amp;gt; = {}
    for (const entry of parsed) {
      secrets[entry.key] = entry.value
    }
    return { secrets }
  } catch {
    return { error: `Unparseable output: ${result.stdout.slice(0, 200)}` }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function calls &lt;code&gt;infisical export --format=json&lt;/code&gt;, which returns the full secret set for a path as a JSON array. The launcher parses it, validates that required keys are present, and passes the result to the agent spawn:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const childEnv = { ...process.env, ...secrets, PROJECT_ENV: getProjectEnv() }

const child = spawn(binary, [], {
  stdio: &apos;inherit&apos;,
  cwd: venture.localPath,
  env: childEnv,
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agent process inherits the secrets through its environment. When the process exits, the secrets are gone. No cleanup, no file deletion, no residual state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trade-off: secrets are frozen at launch time.&lt;/strong&gt; If someone rotates a key while an agent session is running, that session keeps using the old key until it&apos;s restarted. For static credentials like API keys and context tokens, this is fine. If we ever need secrets that rotate mid-session, we&apos;d need a sidecar process or a refresh mechanism. We haven&apos;t needed that yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Validation: Don&apos;t Just Fetch, Verify&lt;/h2&gt;
&lt;p&gt;The launcher doesn&apos;t trust that Infisical returned useful data. It validates at three levels:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Non-empty response.&lt;/strong&gt; If &lt;code&gt;infisical export&lt;/code&gt; returns an empty array, the path probably doesn&apos;t exist or has no secrets configured. The error message tells you exactly which path was queried and which environment was used.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Required keys.&lt;/strong&gt; The context API key (&lt;code&gt;CONTEXT_API_KEY&lt;/code&gt;) must exist in every project&apos;s secret set. Without it, the MCP server can&apos;t authenticate to the context API, and the agent session is effectively blind - no handoffs, no session continuity, no enterprise knowledge. If it&apos;s missing, the launcher prints a remediation command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Secrets fetched from &apos;/alpha&apos; but CONTEXT_API_KEY is missing.
Keys found: CLERK_SECRET_KEY, GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET
Fix: bash scripts/sync-shared-secrets.sh --fix
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. JSON parse safety.&lt;/strong&gt; If Infisical returns malformed output (which can happen during Infisical upgrades or network issues), the launcher catches the parse error and shows the first 200 characters of the output for debugging.&lt;/p&gt;
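&lt;p&gt;Composed, the three checks look roughly like this (function and error wording are illustrative, not the launcher&apos;s actual API):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function validateSecretsExport(raw: string, path: string): Record&amp;lt;string, string&amp;gt; {
  let parsed: Array&amp;lt;{ key: string; value: string }&amp;gt;
  try {
    parsed = JSON.parse(raw) // 3. parse safety
  } catch {
    throw new Error(`Unparseable infisical output: ${raw.slice(0, 200)}`)
  }
  if (parsed.length === 0) {
    // 1. non-empty response
    throw new Error(`No secrets found at &apos;${path}&apos; - does the path exist?`)
  }
  const secrets = Object.fromEntries(parsed.map((s) =&amp;gt; [s.key, s.value]))
  if (!secrets.CONTEXT_API_KEY) {
    // 2. required keys
    throw new Error(
      `Secrets fetched from &apos;${path}&apos; but CONTEXT_API_KEY is missing.\n` +
        `Keys found: ${Object.keys(secrets).join(&apos;, &apos;)}`
    )
  }
  return secrets
}
&lt;/code&gt;&lt;/pre&gt;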
&lt;p&gt;This three-layer validation has caught real problems in production. The most memorable one deserves its own section.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Cautionary Tale: Description as Value&lt;/h2&gt;
&lt;p&gt;An AI agent was asked to store a webhook secret in Infisical. The instruction was something like &quot;store the GitHub webhook secret for the classifier.&quot; The agent dutifully ran:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;infisical secrets set GH_WEBHOOK_SECRET_CLASSIFIER=&quot;GitHub webhook secret for the classifier&quot; --path /alpha --env prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key existed. The value was non-empty. A naive check would say everything is fine. But the value was a human-readable description of what the secret should contain, not the actual cryptographic secret string.&lt;/p&gt;
&lt;p&gt;The webhook signature validation failed silently. Incoming webhooks were rejected, but the error was deep in the call stack - an HMAC mismatch that looked like a configuration issue, not a &quot;the secret is literally the words &apos;GitHub webhook secret for the classifier&apos;&quot; issue.&lt;/p&gt;
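&lt;p&gt;The check that was failing is standard GitHub-style HMAC validation — a sketch (function name and shapes are illustrative). Note that a wrong secret value just produces a mismatched digest; nothing here points at the secret itself as the problem.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { createHmac, timingSafeEqual } from &apos;node:crypto&apos;

// GitHub signs the raw body with the shared secret: HMAC-SHA256, hex-encoded,
// delivered as &apos;sha256=&amp;lt;hex&amp;gt;&apos; in the X-Hub-Signature-256 header.
function verifyWebhook(secret: string, body: string, signatureHeader: string): boolean {
  const expected = &apos;sha256=&apos; + createHmac(&apos;sha256&apos;, secret).update(body).digest(&apos;hex&apos;)
  if (expected.length !== signatureHeader.length) return false
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signatureHeader))
}
&lt;/code&gt;&lt;/pre&gt;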
&lt;p&gt;It took longer than it should have to diagnose because the key existed, the value was non-empty, and the agent had reported success. All the surface-level checks passed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix was procedural, not technical.&lt;/strong&gt; We added a rule to our agent instructions: always verify secret VALUES, not just that the key exists. When storing a secret, the agent must confirm the value looks like a credential (high entropy, correct format) rather than a description. For webhook secrets specifically, this means verifying the value is a hex string of the expected length.&lt;/p&gt;
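&lt;p&gt;That format check is cheap to automate — a sketch (the 32-byte length is our convention for these secrets, not a GitHub requirement):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// A description is low-entropy prose; a webhook secret should be pure hex.
function looksLikeHexSecret(value: string, bytes = 32): boolean {
  return new RegExp(`^[0-9a-fA-F]{${bytes * 2}}$`).test(value)
}

looksLikeHexSecret(&apos;GitHub webhook secret for the classifier&apos;) // false
looksLikeHexSecret(&apos;9f&apos;.repeat(32))                             // true
&lt;/code&gt;&lt;/pre&gt;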
&lt;p&gt;This incident also reinforced a broader principle: secrets management for AI agents needs the same rigor as secrets management for production services - maybe more. A human developer would never paste &quot;GitHub webhook secret for the classifier&quot; as a secret value. An agent, operating on natural language instructions, made exactly that mistake. The surface area for agent-specific errors is different from human errors, and the validation layer needs to account for it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Per-Environment Secrets&lt;/h2&gt;
&lt;p&gt;The same project often needs different secrets for different environments. A staging context API has different keys than production. The auth service might use test credentials in development.&lt;/p&gt;
&lt;p&gt;The launcher selects the environment based on a single variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export function getProjectEnv(): CraneEnv {
  const raw = process.env.PROJECT_ENV?.toLowerCase()
  if (raw === &apos;dev&apos;) return &apos;dev&apos;
  return &apos;prod&apos;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Default is production. Setting &lt;code&gt;PROJECT_ENV=dev&lt;/code&gt; before launching switches to the dev environment in Infisical. The launcher also handles a subtlety: some projects have staging-specific sub-paths in Infisical (e.g., &lt;code&gt;/alpha/staging&lt;/code&gt; for staging infrastructure keys), while others only have prod and dev environments at the top level. The resolver handles this gracefully:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export function getStagingInfisicalPath(ventureCode: string): string | null {
  if (ventureCode === &apos;alpha&apos;) return &apos;/alpha/staging&apos;
  return null
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If a project doesn&apos;t have a staging path, the launcher warns and falls back to production secrets. This prevents a half-configured staging environment from silently using no secrets at all.&lt;/p&gt;
&lt;p&gt;The result is clean environment separation without duplicating configuration. The same launcher command works everywhere:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;launcher alpha              # Production secrets (default)
PROJECT_ENV=dev launcher alpha  # Dev/staging secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;h2&gt;SSH Sessions: A Harder Problem&lt;/h2&gt;
&lt;p&gt;On a local machine, the Infisical CLI authenticates via an interactive browser login. The token is stored in the system keychain. This works well for desktop sessions but breaks completely over SSH - there&apos;s no browser, and the keychain is locked.&lt;/p&gt;
&lt;p&gt;The launcher detects SSH sessions by checking for &lt;code&gt;SSH_CLIENT&lt;/code&gt;, &lt;code&gt;SSH_TTY&lt;/code&gt;, or &lt;code&gt;SSH_CONNECTION&lt;/code&gt; environment variables. When running over SSH, it switches to Machine Identity authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reads Universal Auth credentials from &lt;code&gt;~/.infisical-ua&lt;/code&gt; (a file with &lt;code&gt;chmod 600&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Authenticates via &lt;code&gt;infisical login --method=universal-auth&lt;/code&gt; to get a JWT&lt;/li&gt;
&lt;li&gt;Passes the token through the &lt;code&gt;INFISICAL_TOKEN&lt;/code&gt; environment variable (not a CLI flag, which would be visible in &lt;code&gt;ps&lt;/code&gt; output)&lt;/li&gt;
&lt;li&gt;Adds &lt;code&gt;--projectId&lt;/code&gt; to the export command, since token-based auth doesn&apos;t read the project config file&lt;/li&gt;
&lt;/ol&gt;
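&lt;p&gt;The SSH detection that gates this path is a one-line check (sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Any of these variables present means the session arrived over SSH.
function isSshSession(env: Record&amp;lt;string, string | undefined&amp;gt; = process.env): boolean {
  return Boolean(env.SSH_CLIENT || env.SSH_TTY || env.SSH_CONNECTION)
}
&lt;/code&gt;&lt;/pre&gt;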
&lt;p&gt;On macOS, there&apos;s an additional wrinkle: Claude Code stores its OAuth tokens in the system keychain, which is locked during SSH sessions. The launcher detects this and prompts for the keychain password once per session.&lt;/p&gt;
&lt;p&gt;Each machine that will accept SSH connections needs a one-time bootstrap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;bash scripts/bootstrap-infisical-ua.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prompts for Machine Identity credentials (created once in the Infisical web UI), writes the credentials file, and verifies authentication works. After that, &lt;code&gt;launcher alpha&lt;/code&gt; works identically whether you&apos;re sitting at the machine or SSH&apos;d in from an iPad.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Secrets for Agents: A Different Threat Model&lt;/h2&gt;
&lt;p&gt;Traditional secrets management assumes human operators. The threat model is unauthorized access, credential leakage through logs, and insider threats. AI agents introduce a different set of risks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tool call exposure.&lt;/strong&gt; An agent might include a secret in a tool call argument. &quot;Search for this API key in the codebase&quot; could echo the key into a search query that gets logged.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Commit message leakage.&lt;/strong&gt; An agent composing a commit message might mention &quot;updated the API key to sk&lt;em&gt;live&lt;/em&gt;...&quot; if the key was part of the task context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PR description inclusion.&lt;/strong&gt; When summarizing work done, an agent might reference the specific values it configured rather than just the key names.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accidental storage.&lt;/strong&gt; As we experienced firsthand, an agent can store descriptions as values, or store actual secret values in the wrong system (a knowledge store instead of the secrets manager).&lt;/p&gt;
&lt;p&gt;Runtime injection mitigates most of these risks. Secrets exist only in the process environment, not in files the agent can read or reference. The agent has access to the values through standard &lt;code&gt;process.env&lt;/code&gt; lookups at runtime but doesn&apos;t see them listed in any file it might accidentally include in output.&lt;/p&gt;
&lt;p&gt;The remaining risk - an agent echoing an env var value in output - is handled procedurally through agent instructions rather than technically. The instruction set explicitly states: never include secret values in commits, PRs, tool calls, or output. This isn&apos;t bulletproof, but combined with runtime-only injection, it dramatically reduces the attack surface compared to &lt;code&gt;.env&lt;/code&gt; files sitting in the repo root.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What We Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Single-fetch, parse, validate is better than fetch-to-validate then fetch-to-use.&lt;/strong&gt; Our original approach called Infisical twice: once to check that secrets existed, once via &lt;code&gt;infisical run&lt;/code&gt; to wrap the agent process. The current approach fetches once as JSON, validates in-process, and injects directly. Simpler, faster, one fewer failure mode.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-healing MCP registration eliminated a class of support requests.&lt;/strong&gt; If the MCP server binary isn&apos;t on PATH, the launcher rebuilds and re-links it automatically. If the MCP config file is missing from the target repo, the launcher copies a template. These self-healing steps mean new machines and new repos just work without a separate setup step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The secrets audit script pays for itself on day one.&lt;/strong&gt; When a new shared secret is added to the source path, running &lt;code&gt;--secrets-audit --fix&lt;/code&gt; propagates it everywhere. Without this, you&apos;re manually adding the same key to multiple Infisical paths and hoping you don&apos;t miss one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frozen-at-launch secrets are fine for our use case.&lt;/strong&gt; We considered a sidecar that refreshes secrets mid-session. The complexity wasn&apos;t justified. Agent sessions typically run 30-90 minutes. Key rotation happens on a scale of weeks or months. The mismatch is orders of magnitude.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent-specific validation rules are necessary.&lt;/strong&gt; Standard secrets validation (key exists, value non-empty) isn&apos;t sufficient when agents are involved. Format-aware validation - checking that a webhook secret looks like a hex string, that an API key matches the expected prefix, that a PEM key has the right header - catches errors that standard checks miss. We&apos;re still adding these incrementally.&lt;/p&gt;
&lt;p&gt;The core lesson: treat secrets injection for AI agents with at least the same rigor as secrets injection for production services. The failure modes are different, but the consequences are the same.&lt;/p&gt;
</content:encoded><category>secrets</category><category>infrastructure</category><category>cli</category><category>infisical</category></item><item><title>Fleet Management for One Person</title><link>https://venturecrane.com/articles/fleet-management-solo/</link><guid isPermaLink="true">https://venturecrane.com/articles/fleet-management-solo/</guid><description>How a solo founder manages a distributed dev fleet with Tailscale, idempotent bootstrap scripts, SSH mesh networking, and macOS hardening.</description><pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A fleet of development machines - a mix of Apple Silicon Macs and Linux boxes - runs AI agent sessions roughly 18 hours a day. One person manages all of it. No DevOps team. No IT department. Just scripts.&lt;/p&gt;
&lt;p&gt;Every machine needs identical tooling - Node.js, GitHub CLI, Infisical, Claude Code, SSH keys, tmux, a custom MCP server, and a CLI launcher. They all need to talk to each other over SSH. They all need to be hardened against public networks.&lt;/p&gt;
&lt;p&gt;Doing this manually takes over two hours per machine and is error-prone. Forget one step and you discover it three days later when an agent session fails at 2am. The answer is treating dev machines like infrastructure: automated, repeatable, disposable.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Bootstrap Problem&lt;/h2&gt;
&lt;p&gt;Every machine in the fleet needs the same baseline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Runtime&lt;/strong&gt;: Node.js 20, npm, Homebrew (macOS) or apt (Linux)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CLI tools&lt;/strong&gt;: GitHub CLI (&lt;code&gt;gh&lt;/code&gt;), Infisical, Wrangler, Claude Code, &lt;code&gt;uv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSH&lt;/strong&gt;: Ed25519 key pair, authorized_keys for the fleet, config fragments for every peer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networking&lt;/strong&gt;: Tailscale connected with a stable IP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project code&lt;/strong&gt;: The management console repo cloned, MCP server built and linked&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: Infisical project binding, MCP server registered with Claude Code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Missing any one of these means a broken session. An agent launches, tries to call the MCP server, finds it missing, and either errors out or wastes 20 minutes trying to self-heal something that should have been provisioned.&lt;/p&gt;
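&lt;p&gt;The check itself can be sketched as a pure function - a hypothetical &lt;code&gt;missingBaseline&lt;/code&gt; helper, not taken from the actual bootstrap, that compares the required baseline against what a machine reports:&lt;/p&gt;

```javascript
// Hypothetical helper (not from the real bootstrap script): report
// which baseline tools a machine still lacks, given what it has.
function missingBaseline(required, installed) {
  return required.filter(function (tool) {
    return !installed.includes(tool)
  })
}

const needed = [`node`, `gh`, `infisical`, `claude`, `tmux`]
console.log(missingBaseline(needed, [`node`, `gh`, `tmux`]))
// logs the two tools that would break an agent session
```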
&lt;h2&gt;Idempotent Bootstrap&lt;/h2&gt;
&lt;p&gt;The bootstrap script handles everything in a single run. More importantly, it is idempotent - you can run it ten times and get the same result. Every step checks before acting. It never duplicates a key, never reinstalls a tool that is already present, never overwrites a config that is already correct.&lt;/p&gt;
&lt;p&gt;The script moves through distinct phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1: Detect and validate.&lt;/strong&gt; Determine OS (Darwin or Linux) and architecture (arm64 or x86_64). Verify Tailscale is installed and connected. If the macOS App Store version of Tailscale is installed but the CLI is not on &lt;code&gt;PATH&lt;/code&gt;, create a wrapper script (more on this below).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2: Install tools.&lt;/strong&gt; Each tool gets a check-before-install guard:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if ! command -v gh &amp;amp;&amp;gt;/dev/null; then
    brew install gh
else
    log_ok &quot;GitHub CLI already installed&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern repeats for every tool. On macOS it uses Homebrew; on Linux, apt. Node.js gets version-checked (must be v20+), not just presence-checked.&lt;/p&gt;
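&lt;p&gt;The version gate can be sketched as follows (a hypothetical &lt;code&gt;nodeVersionOk&lt;/code&gt; helper, not the script&apos;s actual implementation):&lt;/p&gt;

```javascript
// Hypothetical sketch of the version gate: presence is not enough,
// the major version must be at least 20.
function nodeVersionOk(versionString) {
  const major = Number(versionString.slice(1).split(`.`)[0])
  // true exactly when major is at least 20
  return Math.max(major, 20) === major
}

console.log(nodeVersionOk(`v20.11.1`)) // true
console.log(nodeVersionOk(`v18.19.0`)) // false
```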
&lt;p&gt;&lt;strong&gt;Phase 3: Generate SSH key.&lt;/strong&gt; If &lt;code&gt;~/.ssh/id_ed25519&lt;/code&gt; does not exist, generate one. If it does, skip.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 4: Register with the fleet API.&lt;/strong&gt; The machine announces itself with hostname, Tailscale IP, OS, architecture, and public key. This registration is what makes the SSH mesh self-maintaining - new machines appear in the registry and get picked up on the next mesh run.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 5: Fetch and apply SSH mesh config.&lt;/strong&gt; Pull the mesh configuration from the API. Write SSH config fragments and distribute authorized_keys - including the machine&apos;s own key (a subtle requirement: without your own pubkey in &lt;code&gt;authorized_keys&lt;/code&gt;, nobody can SSH in).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 6: Build and link.&lt;/strong&gt; Clone the management console repo if not present. Build the MCP package and npm-link it onto &lt;code&gt;PATH&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The entire process takes five minutes on a fresh machine. On an already-bootstrapped machine, it completes in seconds.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ API_KEY=&amp;lt;key&amp;gt; bash scripts/bootstrap-machine.sh
[OK]    Detected: darwin / arm64
[OK]    Tailscale IP: 100.119.24.42
[OK]    Homebrew already installed
[OK]    Node.js v20.11.1 already installed
[OK]    GitHub CLI already installed
[OK]    Infisical already installed
[OK]    Claude Code already installed
[OK]    SSH key already exists
[OK]    Machine updated (existing)
[OK]    SSH mesh config written
[OK]    Authorized keys: 0 added (self + fleet)
[OK]    CLI tools built and linked
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Tailscale CLI Gotcha&lt;/h2&gt;
&lt;p&gt;Tailscale provides zero-config mesh networking. Install it, sign in, and each device gets a stable 100.x.x.x IP that works regardless of physical network. NAT traversal, peer discovery, encrypted tunnels, and hostname resolution via MagicDNS - all handled automatically.&lt;/p&gt;
&lt;p&gt;But there is a macOS gotcha that cost us hours of debugging.&lt;/p&gt;
&lt;p&gt;When you install Tailscale from the Mac App Store (the recommended distribution for macOS), the binary lives inside the app bundle at &lt;code&gt;/Applications/Tailscale.app/Contents/MacOS/Tailscale&lt;/code&gt;. It is not on &lt;code&gt;PATH&lt;/code&gt;. The natural instinct is to symlink it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# DO NOT DO THIS
sudo ln -s /Applications/Tailscale.app/Contents/MacOS/Tailscale /usr/local/bin/tailscale
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This crashes. The Tailscale binary performs a bundle ID check at startup, and when invoked through a symlink, the check fails with a cryptic error about code signing. The fix is a wrapper script instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
exec /Applications/Tailscale.app/Contents/MacOS/Tailscale &quot;$@&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Written to &lt;code&gt;/opt/homebrew/bin/tailscale&lt;/code&gt; (or &lt;code&gt;/usr/local/bin/tailscale&lt;/code&gt; on Intel Macs), this wrapper works perfectly. The &lt;code&gt;exec&lt;/code&gt; replaces the shell process with the Tailscale binary, so the bundle context is preserved. The bootstrap script handles this automatically - it detects when the App Store version is installed but the CLI is not on &lt;code&gt;PATH&lt;/code&gt;, and writes the wrapper.&lt;/p&gt;
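&lt;p&gt;The path selection can be sketched as a small decision function (hypothetical helper; the real logic lives inside the bootstrap script):&lt;/p&gt;

```javascript
// Hypothetical helper mirroring the bootstrap decision: where the
// wrapper script belongs, keyed on OS and architecture.
function wrapperPath(platform, arch) {
  if (platform !== `darwin`) return null // Linux installs put tailscale on PATH
  return arch === `arm64` ? `/opt/homebrew/bin/tailscale` : `/usr/local/bin/tailscale`
}

console.log(wrapperPath(`darwin`, `arm64`))  // /opt/homebrew/bin/tailscale
console.log(wrapperPath(`darwin`, `x86_64`)) // /usr/local/bin/tailscale
```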
&lt;h2&gt;SSH Mesh Networking&lt;/h2&gt;
&lt;p&gt;Any machine in the fleet should be able to SSH to any other machine. This is not just for human convenience - it is how fleet deployment scripts push updates, how tmux configs get synchronized, and how the mesh verification runs.&lt;/p&gt;
&lt;p&gt;A dedicated mesh script establishes full connectivity in five phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Preflight checks&lt;/strong&gt; - verify the local key, test Remote Login, probe each remote&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key collection&lt;/strong&gt; - SSH to each machine, collect or generate Ed25519 pubkeys&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;authorized_keys distribution&lt;/strong&gt; - add every machine&apos;s key to every other machine&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Config fragment deployment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full mesh verification&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
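&lt;p&gt;The verification phase scales quadratically with fleet size, which a small sketch makes concrete (hypothetical &lt;code&gt;meshPairs&lt;/code&gt; helper):&lt;/p&gt;

```javascript
// Sketch of the verification fan-out: every ordered source-to-target
// pair must be tested, so four machines mean twelve checks.
function meshPairs(hosts) {
  const pairs = []
  for (const from of hosts) {
    for (const to of hosts) {
      if (from !== to) pairs.push([from, to])
    }
  }
  return pairs
}

console.log(meshPairs([`dev-1`, `server-1`, `dev-2`, `dev-3`]).length) // 12
```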
&lt;p&gt;The key distribution step is where idempotency matters most. The script extracts the base64-encoded key body and checks for it before appending:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;key_fingerprint=$(echo &quot;$pubkey&quot; | awk &apos;{print $2}&apos;)
if grep -q &quot;$key_fingerprint&quot; &quot;$HOME/.ssh/authorized_keys&quot; 2&amp;gt;/dev/null; then
    echo &quot;already present&quot;
else
    echo &quot;$pubkey&quot; &amp;gt;&amp;gt; &quot;$HOME/.ssh/authorized_keys&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Config fragments go to &lt;code&gt;~/.ssh/config.d/fleet-mesh&lt;/code&gt;, never to &lt;code&gt;~/.ssh/config&lt;/code&gt;. The main config gets an &lt;code&gt;Include config.d/*&lt;/code&gt; directive prepended if not already present. Personal SSH configs, work VPN entries, GitHub deploy keys - all untouched. Each host entry uses the Tailscale IP, Ed25519 identity, and keepalive settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host server-1
    HostName 100.x.x.x
    User devuser
    IdentityFile ~/.ssh/id_ed25519
    StrictHostKeyChecking accept-new
    ServerAliveInterval 60
    ServerAliveCountMax 3
&lt;/code&gt;&lt;/pre&gt;
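&lt;p&gt;The Include handling described above can be sketched as a pure function (hypothetical; the real script edits &lt;code&gt;~/.ssh/config&lt;/code&gt; in place):&lt;/p&gt;

```javascript
// Hypothetical pure-function version of the Include handling: prepend
// the directive only when it is not already present, so reruns are no-ops.
function ensureInclude(configText) {
  const directive = `Include config.d/*`
  const present = configText.split(`\n`).some(function (line) {
    return line.trim() === directive
  })
  if (present) return configText
  return directive + `\n\n` + configText
}

const once = ensureInclude(`Host github.com\n  User git`)
console.log(ensureInclude(once) === once) // true: idempotent
```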
&lt;p&gt;The final phase tests every source-to-target pair, including hop tests (SSH to machine A, then from A to machine B), and prints a verification matrix:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SSH Mesh Verification
==========================================
From\To     | dev-1     | server-1  | dev-2     | dev-3
------------|-----------|-----------|-----------|----------
dev-1       | --        | OK        | OK        | OK
server-1    | OK        | --        | OK        | OK
dev-2       | OK        | OK        | --        | OK
dev-3       | OK        | OK        | OK        | --
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the machine registry is connected to the fleet API, adding a new machine is automatic: run bootstrap on the new machine (which registers it), then run the mesh script from any existing machine (which picks up the new entry and distributes keys).&lt;/p&gt;
&lt;h2&gt;macOS Hardening&lt;/h2&gt;
&lt;p&gt;Development machines are not servers behind a firewall. They connect to coffee shop WiFi, hotel networks, and cellular hotspots. The hardening script addresses this reality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Firewall and stealth mode.&lt;/strong&gt; The macOS application firewall is off by default. The script enables it and turns on stealth mode, which silently drops unsolicited inbound packets. Network scans see nothing. Signed applications are auto-allowed (which covers Tailscale), and the Tailscale network extension is explicitly added to the firewall allow list.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Close unnecessary ports.&lt;/strong&gt; AirPlay Receiver listens on ports 5000 and 7000 by default - visible to anyone on the same network. The script disables it. AirDrop gets restricted to Contacts Only.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DNS encryption.&lt;/strong&gt; Tailscale routes DNS through the WireGuard tunnel to 100.100.100.100 (encrypted resolver). The system fallback is set to Cloudflare (1.1.1.1) for when Tailscale is disconnected.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance tuning.&lt;/strong&gt; The same script increases kernel file descriptor limits (524288 max files, 131072 per process), excludes &lt;code&gt;~/dev&lt;/code&gt; from Spotlight indexing, reduces visual effects, and configures battery management for laptops.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Safari privacy defaults.&lt;/strong&gt; Do Not Track headers, cross-site tracking restrictions, fraudulent website warnings. These use &lt;code&gt;defaults write&lt;/code&gt; commands that vary across macOS versions, so every call is guarded with &lt;code&gt;2&amp;gt;/dev/null || true&lt;/code&gt; to prevent failures on different systems.&lt;/p&gt;
&lt;p&gt;Like everything else in the fleet, the hardening script is idempotent. Run it on a machine that is already hardened, and nothing changes. Run it after a macOS update that reset some defaults, and it fixes only what changed.&lt;/p&gt;
&lt;h2&gt;tmux Across the Fleet&lt;/h2&gt;
&lt;p&gt;AI agent sessions can run for hours. A dropped SSH connection should not kill the session. tmux solves this - the session lives on the server, and you reconnect to exactly where you left off.&lt;/p&gt;
&lt;p&gt;A deployment script pushes identical tmux configuration to every machine in the fleet. It handles three concerns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Terminal compatibility.&lt;/strong&gt; The Ghostty terminal emulator needs its terminfo entry installed on remote machines for correct color rendering. The script detects it locally and installs it on each target - without it, you get garbled colors and broken key sequences over SSH.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent configuration.&lt;/strong&gt; Every machine gets the same &lt;code&gt;~/.tmux.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set -g default-terminal &quot;tmux-256color&quot;
set -ga terminal-overrides &quot;,xterm-ghostty:Tc&quot;
set -g mouse on
set -g history-limit 50000
set -g status-left &quot;[#h] &quot;
set -s escape-time 10
set -g set-clipboard on
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hostname in the status bar (&lt;code&gt;[#h]&lt;/code&gt;) tells you which machine you are on at a glance. The clipboard bridge (&lt;code&gt;set-clipboard on&lt;/code&gt;) enables OSC 52, which lets copy operations reach the local clipboard through any number of SSH or Mosh hops.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Session wrapper.&lt;/strong&gt; A small script wraps tmux for agent sessions. If a tmux session for the requested project exists, it reattaches; otherwise it creates one. &lt;code&gt;ssh server-1&lt;/code&gt; then &lt;code&gt;dev-session alpha&lt;/code&gt; - either starting fresh or resuming where you left off.&lt;/p&gt;
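&lt;p&gt;The wrapper&apos;s core decision can be sketched as follows (hypothetical helper; session names are illustrative):&lt;/p&gt;

```javascript
// Sketch of the attach-or-create decision. The real wrapper shells
// out to tmux; this just builds the command it would run.
function sessionCommand(project, existingSessions) {
  if (existingSessions.includes(project)) {
    return `tmux attach-session -t ` + project
  }
  return `tmux new-session -s ` + project
}

console.log(sessionCommand(`alpha`, [`alpha`, `beta`])) // tmux attach-session -t alpha
console.log(sessionCommand(`gamma`, [`alpha`, `beta`])) // tmux new-session -s gamma
```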
&lt;h2&gt;Field Mode&lt;/h2&gt;
&lt;p&gt;We have written about mobile access in detail &lt;a href=&quot;/articles/agent-context-management-system&quot;&gt;previously&lt;/a&gt; - Blink Shell on iPhone, Mosh for resilient connections, the full mobile stack. The fleet management angle is what makes it work.&lt;/p&gt;
&lt;p&gt;The portable MacBook carries the same bootstrap, the same hardening, and the same mesh connectivity as every other machine. When it joins a new network (hotel WiFi, phone hotspot, airport lounge), Tailscale handles the transition. The machine&apos;s 100.x.x.x address stays the same. SSH to the office server still works. The mesh is intact.&lt;/p&gt;
&lt;p&gt;The hardening script is especially important here. Before connecting to an untrusted network: firewall is on, stealth mode is active, AirPlay ports are closed, DNS goes through the Tailscale tunnel. The machine is invisible to network scans.&lt;/p&gt;
&lt;p&gt;If the laptop is unavailable (closed lid, dead battery), Blink Shell on iPhone connects directly to the always-on server via Mosh over Tailscale. The tmux session is waiting. The agent session is exactly where it was left. No context loss, no re-bootstrapping.&lt;/p&gt;
&lt;h2&gt;The Principle&lt;/h2&gt;
&lt;p&gt;The guiding principle behind all of this: if a machine dies, bootstrap a replacement in five minutes.&lt;/p&gt;
&lt;p&gt;No precious state lives on any single machine. Code is in git. Secrets are in Infisical. Enterprise context is in the cloud (D1). Session handoffs are in the API. SSH keys are in the fleet registry. The machine itself is a commodity - an interchangeable node in the mesh.&lt;/p&gt;
&lt;p&gt;This changes how you think about hardware problems. A failing disk is not a crisis. A stolen laptop is a security event (revoke keys, rotate secrets), not a data loss event. A new machine joining the fleet is a one-command operation.&lt;/p&gt;
&lt;p&gt;The scripts are not clever. They are repetitive, predictable, and boring. Every one checks before acting. Every one produces the same output on the tenth run as on the first. That is the point. Infrastructure automation should be boring. The interesting problems are in the software it enables.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The fleet described here runs AI coding agents across multiple projects, managed by one person. The full stack is Tailscale for networking, Infisical for secrets, Cloudflare Workers + D1 for state, and Claude Code as the primary AI agent CLI. The bootstrap, mesh, hardening, and tmux scripts are all idempotent bash, designed to be run by agents or humans with identical results.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>fleet-management</category><category>devops</category><category>tailscale</category></item><item><title>One Monorepo, Multiple Ventures - Registry-Driven Multi-Tenant Infrastructure</title><link>https://venturecrane.com/articles/monorepo-registry-driven/</link><guid isPermaLink="true">https://venturecrane.com/articles/monorepo-registry-driven/</guid><description>How a JSON venture registry with capability flags lets a single monorepo serve multiple products without infrastructure sprawl or cross-contamination.</description><pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Running multiple products as a solo founder creates an infrastructure dilemma. Each product needs its own secrets, databases, GitHub labels, documentation requirements, and deployment pipelines. Duplicate all of that per product and you spend more time maintaining tooling than building products. Consolidate everything into one giant repo and you get cross-contamination - secrets leaking between projects, automation running where it shouldn&apos;t, configuration changes breaking unrelated products.&lt;/p&gt;
&lt;p&gt;We needed a third option: shared infrastructure that knows about product boundaries and respects them automatically.&lt;/p&gt;
&lt;p&gt;The answer is a monorepo for shared tooling - CLI launcher, MCP server, Cloudflare Workers, automation scripts - with separate repos for each product&apos;s application code. The monorepo is the control plane. Product repos are the data planes. And at the center of the control plane sits a single JSON file that defines every venture the system knows about.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Venture Registry&lt;/h2&gt;
&lt;p&gt;Everything starts with &lt;code&gt;config/ventures.json&lt;/code&gt;. This is the source of truth for the entire system. If a venture isn&apos;t in this file, it doesn&apos;t exist to the tooling.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;ventures&quot;: [
    {
      &quot;code&quot;: &quot;alpha&quot;,
      &quot;name&quot;: &quot;Project Alpha&quot;,
      &quot;org&quot;: &quot;example-org&quot;,
      &quot;capabilities&quot;: [&quot;has_api&quot;, &quot;has_database&quot;],
      &quot;portfolio&quot;: {
        &quot;status&quot;: &quot;building&quot;,
        &quot;tagline&quot;: &quot;Validation platform for early-stage teams&quot;,
        &quot;techStack&quot;: [&quot;Cloudflare Workers&quot;, &quot;D1&quot;]
      }
    },
    {
      &quot;code&quot;: &quot;beta&quot;,
      &quot;name&quot;: &quot;Project Beta&quot;,
      &quot;org&quot;: &quot;example-org&quot;,
      &quot;capabilities&quot;: [&quot;has_api&quot;, &quot;has_database&quot;],
      &quot;portfolio&quot;: {
        &quot;status&quot;: &quot;building&quot;,
        &quot;tagline&quot;: &quot;Shared finance management for families&quot;,
        &quot;techStack&quot;: [&quot;Next.js&quot;, &quot;Cloudflare Workers&quot;, &quot;D1&quot;]
      }
    },
    {
      &quot;code&quot;: &quot;gamma&quot;,
      &quot;name&quot;: &quot;Project Gamma&quot;,
      &quot;org&quot;: &quot;example-org&quot;,
      &quot;capabilities&quot;: [],
      &quot;portfolio&quot;: {
        &quot;status&quot;: &quot;internal&quot;,
        &quot;techStack&quot;: []
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each entry carries a short lowercase code, a human-readable name, a GitHub organization, a capabilities array, and portfolio metadata. The code is the universal identifier - it shows up in secret paths, database names, resource prefixes, CLI commands, and documentation scopes. Everything downstream derives from this registry.&lt;/p&gt;
&lt;p&gt;The registry is also served by the context API. A Cloudflare Worker reads the file and exposes it at &lt;code&gt;/ventures&lt;/code&gt;, so the MCP server and other tooling can fetch it without needing local file access. But the JSON file in the monorepo remains the canonical source.&lt;/p&gt;
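&lt;p&gt;Consumers of the registry all resolve ventures the same way: look up by code, fail loudly if absent. A minimal sketch of the lookup (hypothetical helper, not the actual tooling code):&lt;/p&gt;

```javascript
// Minimal sketch of the lookup every consumer performs: resolve a
// venture by code, or fail loudly instead of guessing.
function findVenture(registry, code) {
  for (const venture of registry.ventures) {
    if (venture.code === code) return venture
  }
  throw new Error(`unknown venture: ` + code)
}

const registry = { ventures: [{ code: `alpha` }, { code: `beta` }] }
console.log(findVenture(registry, `alpha`).code) // alpha
```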
&lt;hr /&gt;
&lt;h2&gt;Capability Flags&lt;/h2&gt;
&lt;p&gt;Not every venture is built the same way. Some have APIs. Some have databases. Some are pure documentation or planning ventures with no running code at all.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;capabilities&lt;/code&gt; array captures these differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;has_api&lt;/code&gt;&lt;/strong&gt; - The venture exposes HTTP endpoints. This gates API documentation generation. When the doc audit system checks for missing documentation, it only requires API docs from ventures with this flag.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;has_database&lt;/code&gt;&lt;/strong&gt; - The venture uses D1 databases. This gates schema documentation and migration tracking. No database, no schema audit.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without capability flags, automation has two choices: run everything everywhere (wasteful and noisy) or maintain separate lists of which ventures need which automation (duplicates the registry). Capabilities solve this by encoding the answer directly in the registry entry.&lt;/p&gt;
&lt;p&gt;The doc audit system on the context API illustrates this. It stores documentation requirements with a &lt;code&gt;condition&lt;/code&gt; field:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;doc_name: &quot;api-structure.md&quot;
condition: &quot;has_api&quot;
auto_generate: true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When auditing Project Gamma (capabilities: &lt;code&gt;[]&lt;/code&gt;), the system skips this requirement entirely. When auditing Project Alpha (capabilities: &lt;code&gt;[&quot;has_api&quot;, &quot;has_database&quot;]&lt;/code&gt;), it checks for the doc, finds it missing, and optionally auto-generates it from the venture&apos;s source code.&lt;/p&gt;
&lt;p&gt;This is a small detail that eliminates a whole category of false-positive alerts. Without it, every venture without an API would perpetually report &quot;missing API documentation&quot; in every audit.&lt;/p&gt;
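&lt;p&gt;The gating check can be sketched directly from the examples above (hypothetical helper):&lt;/p&gt;

```javascript
// Sketch of the gating check: a requirement applies only when its
// condition is unset or the venture lists that capability.
function requirementApplies(requirement, venture) {
  if (!requirement.condition) return true
  return venture.capabilities.includes(requirement.condition)
}

const apiDoc = { doc_name: `api-structure.md`, condition: `has_api` }
console.log(requirementApplies(apiDoc, { capabilities: [] })) // false: skip gamma
console.log(requirementApplies(apiDoc, { capabilities: [`has_api`, `has_database`] })) // true: audit alpha
```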
&lt;hr /&gt;
&lt;h2&gt;Venture Discovery&lt;/h2&gt;
&lt;p&gt;The launcher CLI needs to find each venture&apos;s local repo on disk. Rather than hardcoding paths, a repo scanner builds the mapping dynamically.&lt;/p&gt;
&lt;p&gt;The scanner reads &lt;code&gt;~/dev/&lt;/code&gt;, looking for git repositories. For each repo, it reads the &lt;code&gt;origin&lt;/code&gt; remote URL, parses the GitHub org and repo name, and records the mapping:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readdirSync, existsSync } from &apos;node:fs&apos;
import { execSync } from &apos;node:child_process&apos;
import { join } from &apos;node:path&apos;
import { homedir } from &apos;node:os&apos;

interface LocalRepo {
  path: string
  org: string
  repoName: string
}

function scanLocalRepos(): LocalRepo[] {
  const devDir = join(homedir(), &apos;dev&apos;)
  const entries = readdirSync(devDir)
  const repos: LocalRepo[] = []

  for (const entry of entries) {
    const fullPath = join(devDir, entry)
    if (!existsSync(join(fullPath, &apos;.git&apos;))) continue

    const remote = execSync(&apos;git remote get-url origin&apos;, {
      cwd: fullPath,
      encoding: &apos;utf-8&apos;,
    }).trim()

    const match = remote.match(/github\.com[:/]([^/]+)\/([^/.]+)/)
    if (match) {
      repos.push({
        path: fullPath,
        org: match[1],
        repoName: match[2],
      })
    }
  }
  return repos
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then the launcher matches ventures to repos using a naming convention. Each venture&apos;s application repo follows the pattern &lt;code&gt;{code}-web&lt;/code&gt; (with a special case for the infrastructure venture, which uses a legacy name for historical reasons):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function matchVentureToRepo(venture, repos) {
  return repos.find((r) =&amp;gt; {
    if (r.org.toLowerCase() !== venture.org.toLowerCase()) return false
    return (
      r.repoName === `${venture.code}-web` ||
      (venture.code === &apos;infra&apos; &amp;amp;&amp;amp; r.repoName === &apos;ops-console&apos;)
    )
  })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is automatic routing. Type &lt;code&gt;launcher alpha&lt;/code&gt; and the CLI figures out where Project Alpha lives on disk without any manual configuration. If the repo isn&apos;t cloned yet, the launcher offers to clone it via &lt;code&gt;gh repo clone&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The scan result is cached for the session, so repeated lookups during a single launcher invocation don&apos;t re-read the filesystem.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Per-Venture Isolation&lt;/h2&gt;
&lt;p&gt;Each venture gets its own isolated set of resources. The venture code acts as a namespace prefix across every system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Secrets.&lt;/strong&gt; Each venture gets its own Infisical path. Project Alpha&apos;s secrets live at &lt;code&gt;/alpha&lt;/code&gt;, Project Beta&apos;s at &lt;code&gt;/beta&lt;/code&gt;. The launcher maps venture codes to paths and fetches secrets in a single call at session start:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const INFISICAL_PATHS: Record&amp;lt;string, string&amp;gt; = {
  alpha: &apos;/alpha&apos;,
  beta: &apos;/beta&apos;,
  gamma: &apos;/gamma&apos;,
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Secrets are fetched once via &lt;code&gt;infisical export --format=json&lt;/code&gt;, parsed, validated (the launcher specifically checks that the context API key exists), and injected as environment variables into the agent process. No secret from one venture ever appears in another venture&apos;s session.&lt;/p&gt;
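&lt;p&gt;The parse-and-validate step can be sketched as follows - a hypothetical helper that assumes the export parses to an array of key/value records (the real CLI output shape may differ):&lt;/p&gt;

```javascript
// Sketch only: assumes the export parses to an array of key/value
// records, which may not match the real CLI output shape exactly.
function toEnv(records, requiredKey) {
  const env = {}
  for (const record of records) {
    env[record.key] = record.value
  }
  if (!env[requiredKey]) {
    throw new Error(`missing required secret: ` + requiredKey)
  }
  return env
}

const env = toEnv([{ key: `CONTEXT_API_KEY`, value: `abc123` }], `CONTEXT_API_KEY`)
console.log(env.CONTEXT_API_KEY) // abc123
```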
&lt;p&gt;&lt;strong&gt;Databases.&lt;/strong&gt; Each venture gets its own D1 databases, prefixed by venture code. Project Alpha might have &lt;code&gt;alpha-main&lt;/code&gt; and &lt;code&gt;alpha-analytics&lt;/code&gt;. Project Beta has &lt;code&gt;beta-main&lt;/code&gt;. The prefixing convention prevents accidental cross-venture queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub.&lt;/strong&gt; Each venture&apos;s repo gets its own labels, issue templates, and project board. The setup script creates a standard label set (priority labels, status labels, QA grade labels) for each new venture. Issues, PRs, and work queues are all scoped to the venture&apos;s repo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Documentation.&lt;/strong&gt; The context API scopes docs by venture code. Global docs (team workflow, coding standards) are shared. Venture-specific docs (API structure, project instructions, schema docs) are scoped to the venture code. When an agent starts a session on Project Alpha, it receives global docs plus alpha-scoped docs. It never sees beta-scoped docs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The shared Cloudflare account is the only truly shared resource.&lt;/strong&gt; All Workers, D1 databases, and KV namespaces live under one account. The venture code prefix provides logical separation. This is a deliberate trade-off - one account is cheaper and simpler to manage than separate accounts per venture, and the prefix convention has proven sufficient for isolation at this scale.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Launch Sequence&lt;/h2&gt;
&lt;p&gt;When the CLI launcher runs, every piece described above comes together in a single flow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ launcher alpha

1. Resolve agent         → Claude Code (default)
2. Validate binary       → claude is on PATH
3. Load venture config   → Read ventures.json, find &quot;alpha&quot;
4. Discover local repo   → Scan ~/dev/, match org + repo name
5. Fetch secrets         → infisical export --path /alpha --format json
6. Validate secrets      → Context API key exists
7. Ensure MCP server     → MCP binary on PATH, .mcp.json in repo
8. Spawn agent           → cd ~/dev/alpha-web &amp;amp;&amp;amp; claude
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The launcher supports three agent CLIs (Claude Code, Gemini CLI, Codex CLI), each with its own MCP configuration format. Claude Code uses per-repo &lt;code&gt;.mcp.json&lt;/code&gt; files. Gemini uses &lt;code&gt;.gemini/settings.json&lt;/code&gt;. Codex uses &lt;code&gt;~/.codex/config.toml&lt;/code&gt;. The launcher handles the format differences - the user just picks the agent with a flag.&lt;/p&gt;
&lt;p&gt;If any step fails, the launcher stops with a clear error message. If the MCP server binary isn&apos;t found, it auto-rebuilds from source and re-links. If the repo isn&apos;t cloned, it offers to clone it. If Infisical is misconfigured, it tells you exactly what to fix.&lt;/p&gt;
&lt;p&gt;The entire flow takes about three seconds on a warm machine. Compare that to the manual process it replaced: navigate to the right directory, remember and export the right environment variables, check that MCP is configured, launch the CLI. That process was error-prone (wrong secrets, wrong directory, stale MCP config) and took a minute or more.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Adding a New Venture&lt;/h2&gt;
&lt;p&gt;Adding a venture is a predictable checklist, mostly automated by a setup script:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install the GitHub App&lt;/strong&gt; on the org for the new repo (manual - one-time browser action)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run the setup script&lt;/strong&gt; with the venture code, org name, and app installation ID&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The script then automates approximately a dozen steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creates the GitHub repo with a template structure (CLAUDE.md, README, directory layout, slash commands)&lt;/li&gt;
&lt;li&gt;Creates standard labels (priority, status, QA grade, type)&lt;/li&gt;
&lt;li&gt;Creates a project board&lt;/li&gt;
&lt;li&gt;Updates the GitHub classifier Worker&apos;s installation config&lt;/li&gt;
&lt;li&gt;Updates the context API&apos;s venture registry&lt;/li&gt;
&lt;li&gt;Updates the launcher&apos;s Infisical path mapping&lt;/li&gt;
&lt;li&gt;Deploys the updated Workers&lt;/li&gt;
&lt;li&gt;Clones the repo to fleet machines&lt;/li&gt;
&lt;li&gt;Creates the Infisical folder and syncs shared secrets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After the script runs, the new venture is immediately launchable: &lt;code&gt;launcher newcode&lt;/code&gt; works, the MCP server recognizes it, the doc audit system starts checking its documentation, and the GitHub classifier processes its webhooks.&lt;/p&gt;
&lt;p&gt;Without the script, this setup would take an hour or more of manual configuration spread across GitHub, Cloudflare, Infisical, and multiple source files. With the script, it takes about five minutes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;An evolution worth noting:&lt;/strong&gt; the original setup process created a separate GitHub organization per venture. Each venture got its own org, its own repo namespace, its own GitHub App installation. This felt clean in theory - full isolation between projects.&lt;/p&gt;
&lt;p&gt;In practice, it created overhead without benefit. Branch protection rules had to be configured per org. GitHub App installations multiplied. The classifier worker needed a mapping table of org-to-installation IDs. And the setup script had to handle org creation as a manual prerequisite (GitHub doesn&apos;t allow automated org creation).&lt;/p&gt;
&lt;p&gt;We consolidated all repos under a single GitHub organization. This let us apply org-wide branch protection rulesets, simplify the GitHub App to a single installation, and remove the org-creation step from the setup checklist entirely. The registry still tracks an &lt;code&gt;org&lt;/code&gt; field per venture (supporting the possibility of external orgs), but every current venture points to the same one.&lt;/p&gt;
&lt;p&gt;The key insight is that the setup script reads and writes the same registry that everything else depends on. There is no separate &quot;provisioning system&quot; to keep in sync. When the org structure changed, we updated the registry and everything downstream followed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Shared Secrets&lt;/h2&gt;
&lt;p&gt;Some secrets are needed by every venture. The context API key, for example, is the same regardless of which product you&apos;re working on. Rather than manually copying these to each venture&apos;s Infisical path, a sync script reads a &lt;code&gt;sharedSecrets&lt;/code&gt; configuration from the registry:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;sharedSecrets&quot;: {
    &quot;source&quot;: &quot;/infra&quot;,
    &quot;keys&quot;: [&quot;CONTEXT_API_KEY&quot;, &quot;ADMIN_KEY&quot;]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The source path holds the canonical values. The sync script reads them and copies to every other venture&apos;s path. Run &lt;code&gt;launcher --secrets-audit&lt;/code&gt; to check for drift, or &lt;code&gt;launcher --secrets-audit --fix&lt;/code&gt; to repair it.&lt;/p&gt;
&lt;p&gt;This keeps shared secrets consistent without requiring every venture to reference a shared path. Each venture has its own complete set of secrets, some shared and some venture-specific. The launcher doesn&apos;t need to know which secrets are shared - it just fetches everything from the venture&apos;s path.&lt;/p&gt;
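&lt;p&gt;The comparison at the heart of the audit can be sketched as (hypothetical helper, not the actual sync script):&lt;/p&gt;

```javascript
// Hypothetical drift check: a shared key drifts when the venture copy
// is missing or differs from the canonical source value.
function findDrift(sourceSecrets, ventureSecrets, sharedKeys) {
  return sharedKeys.filter(function (key) {
    return ventureSecrets[key] !== sourceSecrets[key]
  })
}

const source = { CONTEXT_API_KEY: `k1`, ADMIN_KEY: `k2` }
const beta = { CONTEXT_API_KEY: `k1`, ADMIN_KEY: `stale` }
console.log(findDrift(source, beta, [`CONTEXT_API_KEY`, `ADMIN_KEY`])) // drift on ADMIN_KEY only
```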
&lt;hr /&gt;
&lt;h2&gt;When a Monorepo Doesn&apos;t Work&lt;/h2&gt;
&lt;p&gt;The monorepo holds shared tooling: the launcher, the MCP server, Workers, scripts, configuration. It does not hold application code. Each product&apos;s code lives in its own repo.&lt;/p&gt;
&lt;p&gt;This split exists because ventures have different tech stacks. One product is Next.js. Another is Astro. A third is pure Cloudflare Workers. Putting all of these in one repo would mean conflicting dependencies, tangled build pipelines, and configuration files stepping on each other.&lt;/p&gt;
&lt;p&gt;The monorepo works for the control plane because the tooling is homogeneous - it&apos;s all TypeScript, all built with the same tools, all deployed to the same infrastructure. The heterogeneity lives in the product repos, where it belongs.&lt;/p&gt;
&lt;p&gt;If we were building multiple products with identical stacks, a true monorepo (control plane + data planes together) might make sense. But with divergent tech stacks, the hybrid approach - shared tooling monorepo plus separate product repos - gives us the benefits of code sharing without the costs of forced uniformity.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Principle&lt;/h2&gt;
&lt;p&gt;The venture registry is a small file. It&apos;s about 100 lines of JSON. But it drives the entire operational surface: which products exist, what capabilities they have, where their secrets live, what documentation they need, how they&apos;re launched, how they&apos;re audited.&lt;/p&gt;
&lt;p&gt;When adding a new feature to the tooling, the first question is always: &quot;Does this read from the registry?&quot; If the answer is no, the feature is probably going to drift out of sync with reality. Hardcoded venture lists, separate configuration files that duplicate registry data, automation that doesn&apos;t check capabilities - these are all symptoms of the same disease.&lt;/p&gt;
&lt;p&gt;The registry is the spine. Everything else hangs off it.&lt;/p&gt;
&lt;p&gt;This pattern is not novel. Feature flags, service registries, and tenant configuration databases all follow the same principle: define the taxonomy once, let everything else derive from it. The insight for a solo founder running multiple products is that you need this pattern earlier than you think. By the third product, manual per-venture configuration becomes the dominant source of operational errors. A 100-line JSON file and the discipline to treat it as the source of truth eliminated that entire category of problems.&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>monorepo</category><category>multi-tenant</category><category>automation</category></item><item><title>Multi-Agent Team Protocols Without Chaos</title><link>https://venturecrane.com/articles/multi-agent-team-protocols/</link><guid isPermaLink="true">https://venturecrane.com/articles/multi-agent-team-protocols/</guid><description>How we coordinate dev agents, PM agents, an advisor, and a human captain using namespaced labels, QA grading, and explicit role boundaries.</description><pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The default state of multiple AI agents working on the same codebase is chaos.&lt;/p&gt;
&lt;p&gt;Without explicit protocols, agents duplicate work. Two agents pick the same issue because neither knows the other exists. They create conflicting branches, edit the same files, and produce PRs that can&apos;t both merge. An agent &quot;helpfully&quot; refactors a module that another agent depends on. Nobody knows who owns what, what&apos;s been verified, or what&apos;s safe to ship.&lt;/p&gt;
&lt;p&gt;Human teams handle this through ambient coordination - hallway conversations, Slack threads, shared understanding built over months. AI agents have none of that. They start every session cold. They follow instructions literally. They don&apos;t resolve ambiguity by walking over to someone&apos;s desk.&lt;/p&gt;
&lt;p&gt;This means AI agent teams need more structure than human teams, not less. We learned this the hard way.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Team Model&lt;/h2&gt;
&lt;p&gt;Our team has four roles with explicit, non-overlapping boundaries:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev Agent&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Implementation, PRs, technical decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PM Agent&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Requirements, prioritization, verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advisor&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Strategic input, risk assessment, planning perspective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captain&lt;/td&gt;
&lt;td&gt;Human&lt;/td&gt;
&lt;td&gt;Routing, approvals, final decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The Captain is always human. This is non-negotiable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dev agents&lt;/strong&gt; implement. They pick up issues marked ready, create branches, write code, open PRs, and report when work is code-complete. They don&apos;t decide what to build, they don&apos;t verify their own work beyond CI, and they don&apos;t merge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PM agents&lt;/strong&gt; own the requirements side: writing issues, defining acceptance criteria, assigning priority. They also own verification - when a PR is code-complete, the PM agent tests it against the acceptance criteria and submits a pass/fail verdict. This consolidation (PM does QA) was deliberate. At our scale, a separate QA handoff adds overhead without adding value. The PM already has full context on what the feature should do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The advisor&lt;/strong&gt; provides a second perspective on planning and strategy. Different model, different training data, different biases. When we&apos;re making architectural decisions or prioritizing a backlog, having a second opinion from a model that reasons differently is valuable. The advisor doesn&apos;t touch code or GitHub.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Captain&lt;/strong&gt; is the routing layer. The Captain reads handoffs from agents, decides what to do next, and pastes directives into the appropriate agent window. The Captain approves scope changes, answers questions agents can&apos;t resolve, and - critically - authorizes merges. The Captain never updates GitHub directly. All GitHub mutations flow through the agents.&lt;/p&gt;
&lt;h3&gt;How This Evolved&lt;/h3&gt;
&lt;p&gt;The team model did not start here. In the early weeks, we ran a split-tool setup: dev agents in Claude Code (terminal), PM agents in Claude Desktop (GUI). The reasoning was that PM work - writing issues, reviewing PRs, verifying features - was more conversational and benefited from Desktop&apos;s chat interface, while dev work needed shell access.&lt;/p&gt;
&lt;p&gt;In practice, the split created friction. Claude Desktop could not run &lt;code&gt;gh&lt;/code&gt; CLI commands, so PM agents had to route GitHub mutations through the Captain or wait for a dev agent. Verification that required terminal access - checking API responses, running database queries, inspecting build output - was impossible from Desktop. The PM agent would describe what it wanted to check, and someone else had to run the commands.&lt;/p&gt;
&lt;p&gt;When Claude Code matured enough to handle conversational workflows alongside terminal access, we consolidated. Every agent role now runs in the CLI. The PM agent can write an issue, verify a deployment, and run a database query in the same session. The advisor moved from the Gemini web interface to Gemini CLI for the same reason - terminal access to the codebase makes strategic advice more grounded in reality.&lt;/p&gt;
&lt;p&gt;The lesson: match your tools to your actual workflows, not to role labels. &quot;PM work&quot; sounded like it belonged in a GUI. It didn&apos;t. It belonged wherever the PM could actually execute the full verification loop without assistance.&lt;/p&gt;
&lt;h3&gt;Why Role Boundaries Matter More with Agents&lt;/h3&gt;
&lt;p&gt;With humans, role boundaries are guidelines. A developer might do a quick QA check, a PM might fix a typo in code, a manager might close a stale issue. Humans understand context well enough to bend rules without breaking things.&lt;/p&gt;
&lt;p&gt;Agents don&apos;t. An agent told &quot;you can do QA if needed&quot; will QA its own work and pass it every time. An agent with merge access will merge PRs the moment CI goes green, skipping verification entirely. An agent asked to &quot;help where you can&quot; will refactor code that another agent is actively working on.&lt;/p&gt;
&lt;p&gt;Explicit boundaries prevent these failure modes. Each agent knows exactly what it can and cannot do. There&apos;s no ambiguity to misinterpret.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Labels as the Coordination Mechanism&lt;/h2&gt;
&lt;p&gt;GitHub labels are the routing system. Every issue carries two signals: where it is in the lifecycle and who needs to act next.&lt;/p&gt;
&lt;h3&gt;Status Labels (Exclusive)&lt;/h3&gt;
&lt;p&gt;An issue has exactly one status label at any time. This is enforced by convention and caught in review.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;status:triage       New, needs prioritization
status:ready        Approved, ready for development
status:in-progress  Dev actively working
status:qa           Under verification
status:verified     QA passed, ready to merge
status:done         Merged and deployed
status:blocked      Waiting on dependency
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The flow is linear and predictable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;triage -&amp;gt; ready -&amp;gt; in-progress -&amp;gt; qa -&amp;gt; verified -&amp;gt; done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any deviation (skipping &lt;code&gt;qa&lt;/code&gt;, going backward from &lt;code&gt;verified&lt;/code&gt; to &lt;code&gt;in-progress&lt;/code&gt;) is a signal that something went wrong and needs human attention.&lt;/p&gt;
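&lt;p&gt;Because the flow is linear, deviations are checkable mechanically. An illustrative guard - the flow comes from the list above; the function itself is ours for the sketch. &lt;code&gt;status:blocked&lt;/code&gt; sits outside the linear flow, so any move into or out of it is also flagged:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const STATUS_FLOW = [
  &quot;status:triage&quot;, &quot;status:ready&quot;, &quot;status:in-progress&quot;,
  &quot;status:qa&quot;, &quot;status:verified&quot;, &quot;status:done&quot;,
];

// A transition is normal only if it moves exactly one step
// forward. Skipping qa, moving backward, or touching
// status:blocked all get flagged for human attention.
function isNormalTransition(from: string, to: string): boolean {
  const i = STATUS_FLOW.indexOf(from);
  const j = STATUS_FLOW.indexOf(to);
  if (i === -1) return false;
  return j === i + 1;
}
&lt;/code&gt;&lt;/pre&gt;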
&lt;h3&gt;Routing Labels (Additive)&lt;/h3&gt;
&lt;p&gt;Routing labels indicate who needs to act next. An issue can have multiple routing labels simultaneously.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;needs:pm   Waiting for PM decision or input
needs:dev  Waiting for Dev fix or answer
needs:qa   Ready for QA verification
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a dev agent finishes a PR, it applies &lt;code&gt;status:qa&lt;/code&gt; and &lt;code&gt;needs:qa&lt;/code&gt;. When the PM agent fails a verification, it applies &lt;code&gt;needs:dev&lt;/code&gt;. When an agent has a requirements question, it applies &lt;code&gt;needs:pm&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The Captain&apos;s daily routine is simple: scan for routing labels and act on them. &lt;code&gt;needs:pm&lt;/code&gt; means answer a question or delegate to the PM agent. &lt;code&gt;status:verified&lt;/code&gt; means decide whether to merge. &lt;code&gt;status:blocked&lt;/code&gt; means investigate and make a decision.&lt;/p&gt;
&lt;h3&gt;Why Labels Instead of Something Fancier&lt;/h3&gt;
&lt;p&gt;We considered richer coordination mechanisms: a shared state database, real-time event streams, agent-to-agent messaging. Labels won because they&apos;re visible, auditable, and already built into GitHub. Every label change shows up in the issue timeline. You can reconstruct the full lifecycle of any issue by reading the label history.&lt;/p&gt;
&lt;p&gt;Labels also degrade gracefully. If an agent crashes mid-workflow, the label stays where it was. The next session picks up the issue in its current state. There&apos;s no coordination state to corrupt or reconcile.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;QA Grading: Not All Work Needs the Same Verification&lt;/h2&gt;
&lt;p&gt;Early on, every PR went through the same verification process: the PM agent would walk through every acceptance criterion, capture evidence, and submit a structured verdict. When the PM ran in Claude Desktop, this sometimes meant browser-based verification with screenshots. This was thorough but slow. A documentation fix and a new authentication flow got the same treatment.&lt;/p&gt;
&lt;p&gt;QA grading fixes this by routing verification to the appropriate method based on the work type.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Verification Method&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Automated only&lt;/td&gt;
&lt;td&gt;CI green = pass&lt;/td&gt;
&lt;td&gt;Refactoring with tests, docs updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CLI/API verifiable&lt;/td&gt;
&lt;td&gt;curl, CLI commands, DB queries&lt;/td&gt;
&lt;td&gt;API endpoint changes, worker jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Light visual&lt;/td&gt;
&lt;td&gt;Quick spot-check, single screenshot&lt;/td&gt;
&lt;td&gt;CSS fixes, minor UI tweaks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Full visual&lt;/td&gt;
&lt;td&gt;Complete walkthrough with evidence&lt;/td&gt;
&lt;td&gt;New user flows, multi-page features&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The dev agent assigns the grade when creating the PR, based on the nature of the changes. The PM agent can override if it disagrees (e.g., dev marked &lt;code&gt;grade 0&lt;/code&gt; but the change is actually user-facing).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grade 0&lt;/strong&gt; is the fast path: CI passes, the Captain reviews the diff and directs a merge. No manual verification at all. This is appropriate for refactoring with test coverage, test-only changes, configuration updates, and documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grade 1&lt;/strong&gt; keeps humans out of the browser. The dev agent includes verification commands in the PR description: &lt;code&gt;curl&lt;/code&gt; commands, CLI invocations, database queries. Someone runs them, confirms the output matches expectations, done.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grade 2&lt;/strong&gt; and &lt;strong&gt;Grade 3&lt;/strong&gt; involve visual verification at increasing levels of thoroughness. Grade 2 is a quick spot-check - navigate to the preview URL, confirm the change looks right, capture a screenshot. Grade 3 is a full walkthrough of every acceptance criterion with evidence capture for each.&lt;/p&gt;
&lt;p&gt;The grading rule is simple: when uncertain, grade higher. Better to over-verify than to ship a broken feature.&lt;/p&gt;
&lt;h3&gt;The Grade Determines the Routing&lt;/h3&gt;
&lt;p&gt;When a dev agent reports &quot;PR ready for QA&quot;, it includes the QA grade in the handoff. The Captain uses the grade to decide what happens next:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Grade 0&lt;/strong&gt;: Check CI, direct merge&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grade 1&lt;/strong&gt;: Route to dev self-verify or PM for CLI check&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grade 2&lt;/strong&gt;: Tell PM agent to do a quick visual check&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grade 3&lt;/strong&gt;: Tell PM agent to do full verification&lt;/li&gt;
&lt;/ul&gt;
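&lt;p&gt;The routing is a pure lookup - no judgment call sits between the grade and the next action. Sketched as code (the directive strings paraphrase the list above; the function is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type QaGrade = 0 | 1 | 2 | 3;

// Grade-to-action lookup. The returned string is the directive
// the Captain routes to the appropriate agent.
function routeByGrade(grade: QaGrade): string {
  switch (grade) {
    case 0: return &quot;Check CI, direct merge&quot;;
    case 1: return &quot;Dev self-verify or PM CLI check&quot;;
    case 2: return &quot;PM agent: quick visual check&quot;;
    case 3: return &quot;PM agent: full verification&quot;;
  }
}
&lt;/code&gt;&lt;/pre&gt;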
&lt;p&gt;This eliminated the verification bottleneck. Before grading, every PR got the same heavyweight treatment regardless of risk level. Now, roughly half of all PRs are verified through CI or CLI alone.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Captain-Directed Merges&lt;/h2&gt;
&lt;p&gt;The human retains merge authority. Always.&lt;/p&gt;
&lt;p&gt;The flow is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dev agent opens PR, assigns QA grade, moves issue to &lt;code&gt;status:qa&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Verification happens per grade method&lt;/li&gt;
&lt;li&gt;On pass, issue moves to &lt;code&gt;status:verified&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Captain sees &lt;code&gt;status:verified&lt;/code&gt; and decides: merge now, merge later, or request changes&lt;/li&gt;
&lt;li&gt;Captain tells PM agent or dev agent to execute the merge&lt;/li&gt;
&lt;li&gt;Agent merges, updates status to &lt;code&gt;status:done&lt;/code&gt;, closes issue&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 4 is the critical gate. No agent merges on its own judgment. The Captain reviews the situation - is this the right time to merge? Are there other in-flight changes that might conflict? Is the deploy pipeline healthy? - and makes the call.&lt;/p&gt;
&lt;p&gt;PM agents can execute merges, but only on explicit Captain directive. This was a deliberate design decision to eliminate routing overhead. When the Captain says &quot;merge it,&quot; the PM agent can act immediately instead of waiting for the dev agent to become available. But the authorization always flows from the human.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Anti-Patterns&lt;/h2&gt;
&lt;p&gt;These are failure modes we&apos;ve observed when protocols are weak or missing. Each one caused real problems before we tightened the system.&lt;/p&gt;
&lt;h3&gt;Agents Self-Assigning Work&lt;/h3&gt;
&lt;p&gt;Without explicit work assignment, agents pick whatever looks interesting. Two agents grab the same issue. Or an agent picks a low-priority issue while a P0 sits in the queue because the P0 looked harder. Or an agent starts on something that was intentionally deferred.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Issues must have &lt;code&gt;status:ready&lt;/code&gt; before dev starts. The Captain routes specific issues to specific agents. Agents don&apos;t browse the backlog and self-assign.&lt;/p&gt;
&lt;h3&gt;Duplicate Work on the Same Files&lt;/h3&gt;
&lt;p&gt;Agent A is refactoring the authentication module. Agent B, working on an unrelated feature, decides to &quot;clean up&quot; the same module while it&apos;s open. Both submit PRs. One of them can&apos;t merge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Session awareness at start-of-day. Every agent session begins by checking what other agents are currently working on. If Agent A is active on the auth module, Agent B stays away from it.&lt;/p&gt;
&lt;h3&gt;PRs Merged Without Verification&lt;/h3&gt;
&lt;p&gt;An agent with merge access sees CI green and merges. The code reaches production without anyone checking whether the feature actually works as specified.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Merge requires Captain directive. &lt;code&gt;status:verified&lt;/code&gt; is a prerequisite for merge, and only verification (not just CI) produces &lt;code&gt;status:verified&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;&quot;Helpful&quot; Refactoring&lt;/h3&gt;
&lt;p&gt;An agent finishes its assigned work early and decides to refactor adjacent code to be &quot;cleaner.&quot; The refactoring breaks assumptions that other agents or the next sprint&apos;s work depends on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Agents implement what&apos;s in the issue. Nothing more. If they see improvement opportunities, they note them in a comment. They don&apos;t act on them without approval.&lt;/p&gt;
&lt;h3&gt;Churning Without Escalating&lt;/h3&gt;
&lt;p&gt;An agent hits a credential problem and spends 10 hours trying different approaches instead of stopping after 3 failures. Activity isn&apos;t progress.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Mandatory escalation triggers. Credential not found in 2 minutes - stop and ask. Same error after 3 different approaches - stop and escalate. Blocked more than 30 minutes on a single problem - time-box expired, escalate or pivot.&lt;/p&gt;
&lt;p&gt;The escalation format is structured:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BLOCKED: [Brief description]
TRIED: [What was attempted]
NEED: [What would unblock - decision, credential, different environment]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This came directly from a post-mortem where an agent consumed an entire day&apos;s compute budget making 100+ tool calls without advancing. The escalation protocol has prevented similar incidents multiple times since.&lt;/p&gt;
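&lt;p&gt;The triggers are deliberately mechanical - none of them ask whether the agent feels close to a solution. Expressed as a single check (the thresholds are the ones above; the function itself is a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Mandatory escalation triggers as one check. A true result
// means: stop, write the BLOCKED/TRIED/NEED message, escalate.
function mustEscalate(state: {
  minutesSearchingForCredential: number;
  distinctApproachesFailed: number;
  minutesBlocked: number;
}): boolean {
  if (state.minutesSearchingForCredential &amp;gt;= 2) return true;
  if (state.distinctApproachesFailed &amp;gt;= 3) return true;
  if (state.minutesBlocked &amp;gt; 30) return true;
  return false;
}
&lt;/code&gt;&lt;/pre&gt;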
&lt;hr /&gt;
&lt;h2&gt;What We Didn&apos;t Build&lt;/h2&gt;
&lt;p&gt;It&apos;s worth noting what&apos;s intentionally absent from this system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No agent-to-agent messaging.&lt;/strong&gt; Agents communicate through artifacts (GitHub issues, PRs, labels) and through the Captain. There&apos;s no direct channel between the dev agent and the PM agent. This prevents emergent coordination that the human can&apos;t observe or override.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No automated priority assignment.&lt;/strong&gt; The PM agent drafts issues and suggests priorities, but the Captain approves them. An agent&apos;s sense of &quot;urgent&quot; doesn&apos;t always match business reality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No automated scope changes.&lt;/strong&gt; If a dev agent discovers that an issue is bigger than expected, it doesn&apos;t split the issue or adjust the scope. It reports the situation and the Captain decides how to proceed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No self-organizing sprints.&lt;/strong&gt; Agents don&apos;t negotiate among themselves about what to work on next. The Captain maintains a weekly plan, and agents work from it.&lt;/p&gt;
&lt;p&gt;Each of these would be technically feasible. We chose not to build them because every autonomous coordination mechanism is a place where agent behavior can diverge from human intent without the human knowing.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Broader Point&lt;/h2&gt;
&lt;p&gt;The instinct with AI agents is to give them more autonomy. Let them self-organize. Let them figure out the best approach. Reduce the human overhead.&lt;/p&gt;
&lt;p&gt;This instinct is wrong, at least at the current state of the technology.&lt;/p&gt;
&lt;p&gt;Humans can handle ambiguity. When a process document says &quot;coordinate with the team,&quot; a human knows what that means in context. They&apos;ll ping someone on Slack, or walk to their desk, or bring it up in standup. An agent reading the same instruction has no grounding for &quot;coordinate&quot; and will either do nothing or do something unhelpful.&lt;/p&gt;
&lt;p&gt;Humans resolve conflicts in real-time. When two developers realize they&apos;re both editing the same file, they talk about it and figure out who goes first. Agents don&apos;t detect the conflict until both PRs are open, and they can&apos;t negotiate a resolution.&lt;/p&gt;
&lt;p&gt;Humans bring judgment to edge cases. When something feels wrong even though the process says it&apos;s fine, a human investigates. An agent follows the process.&lt;/p&gt;
&lt;p&gt;This doesn&apos;t mean agents are less capable. It means they&apos;re differently capable, and the coordination protocols need to account for those differences. Explicit role boundaries, visible state transitions, human-gated merge authority, structured escalation - these aren&apos;t overhead. They&apos;re the minimum viable structure for getting useful work out of a multi-agent team.&lt;/p&gt;
&lt;p&gt;The protocols we&apos;ve described aren&apos;t complex. Namespaced labels. A status flow. QA grades. Captain-directed merges. Escalation triggers. Each one is simple. Together, they create a system where multiple agents produce coherent output instead of incoherent noise.&lt;/p&gt;
&lt;p&gt;The alternative - letting agents self-organize and hoping for the best - produces exactly the chaos you&apos;d expect.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This article describes a production team workflow coordinating AI dev agents, PM agents, and an advisor - all running in CLI tools - with a human captain across multiple projects. The system has been in daily use since January 2026, evolving from a split GUI/CLI setup to an all-CLI model as the tools matured.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>agent-teams</category><category>process</category><category>github</category></item><item><title>Kill Discipline for AI Agent Teams</title><link>https://venturecrane.com/articles/kill-discipline-ai-agents/</link><guid isPermaLink="true">https://venturecrane.com/articles/kill-discipline-ai-agents/</guid><description>How mandatory stop points prevent the most expensive failure mode in agent-assisted development - silent churn on unsolvable problems.</description><pubDate>Sat, 17 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An agent that makes 50 tool calls without advancing is worse than one that stops after 3 failed attempts and asks for help. The first agent looks busy. The second agent is actually useful.&lt;/p&gt;
&lt;p&gt;This is the core insight behind what we call kill discipline: a set of mandatory stop points that force AI agents to escalate instead of spiral. It sounds obvious. It is not obvious to the agents themselves.&lt;/p&gt;
&lt;h2&gt;The Failure Mode&lt;/h2&gt;
&lt;p&gt;AI coding agents are optimistic by default. Given an error, they will try another approach. Given another error, they will try a variation. Given a third error, they will try combining the first two approaches. Given a fourth, they will start modifying things that were previously working. This can continue for hours.&lt;/p&gt;
&lt;p&gt;We learned this the hard way. A post-mortem from January 2026 revealed an agent that had churned for over 10 hours on symptoms instead of escalating the underlying blocker. It tried dozens of approaches to a problem that required a credential it did not have. Every attempt looked productive - reading files, modifying configs, running tests, analyzing output. All of it was wasted motion.&lt;/p&gt;
&lt;p&gt;The cost was not just tokens. It was the opportunity cost of a machine and an agent session doing nothing useful for an entire working day, plus the cleanup effort to untangle what the agent had changed during its spiral.&lt;/p&gt;
&lt;p&gt;This is the most expensive failure mode in agent-assisted development: silent churn. Not crashes, not wrong answers, not syntax errors. An agent that quietly burns cycles on an unsolvable problem while everyone assumes it is making progress.&lt;/p&gt;
&lt;h2&gt;Why Agents Do Not Self-Correct&lt;/h2&gt;
&lt;p&gt;Large language models have a bias toward action. When presented with a problem, they want to solve it. &quot;I cannot solve this&quot; is almost never the first instinct. The model will generate plausible next steps long after a human engineer would have stepped back and said &quot;something is fundamentally wrong here.&quot;&lt;/p&gt;
&lt;p&gt;This is compounded by the session structure of agent-assisted development. Unlike a human developer who notices their own frustration after 20 minutes, an AI agent has no internal state that degrades with repeated failure. Attempt 47 feels exactly the same as attempt 1 to the model. There is no mounting annoyance, no gut feeling that says &quot;stop, you are going in circles.&quot; The agent will keep trying variations with the same confidence it had at the start.&lt;/p&gt;
&lt;p&gt;Left unchecked, this produces a specific anti-pattern we call the agent spiral: the agent tries approach A, it fails. Tries approach B, it fails. Tries approach C, it fails. Then it tries A again with a slight modification. Then B again with a slight modification. The search space expands but converges on nothing. Each individual step looks reasonable. The trajectory is aimless.&lt;/p&gt;
&lt;h2&gt;The Stop Rules&lt;/h2&gt;
&lt;p&gt;We codified five mandatory stop points. When any of these conditions is met, the agent must stop working on the current problem and escalate. Not &quot;consider escalating.&quot; Not &quot;try one more thing first.&quot; Stop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Same error 3 times (different approaches).&lt;/strong&gt; If an agent has tried three genuinely different approaches to the same problem and all three fail, the problem is not going to yield to a fourth variation. The agent must stop and escalate with a structured summary of what was tried. The key word is &quot;different&quot; - retrying the same command with a slightly different flag does not count as a different approach.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blocked more than 30 minutes on a single problem.&lt;/strong&gt; This is a hard time-box. Regardless of whether the agent feels like it is making progress, 30 minutes without resolving a blocker means the time-box is expired. Escalate or pivot to different work. Activity is not progress.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Credential not found in 2 minutes.&lt;/strong&gt; If an agent cannot locate a required API key, token, or secret within two minutes, it must stop immediately. It must not guess, hunt through directories, or try to work around the missing credential. The correct action is to file an issue, ask the human, and move on. Missing credentials are never solved by more searching - they require someone with access to provision them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Network or TLS errors from a test environment.&lt;/strong&gt; If the environment itself cannot reach the network, no amount of curl variations will fix it. The agent must recognize this as an environmental constraint, not a code problem, and stop with a clear statement: &quot;Cannot test from this environment.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wrong repo or venture context twice.&lt;/strong&gt; If an agent discovers it is operating in the wrong repository or project context, and this happens a second time, it must stop the session entirely. Not fix the context and continue - stop and investigate why the context is wrong. A recurring context error indicates a systemic problem with how sessions are being initialized.&lt;/p&gt;
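&lt;p&gt;Two of these rules are pure counting, which makes them easy to enforce in a session harness. An illustrative tracker (not our actual implementation):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Per-session tracker for the &quot;same error 3 times&quot; and
// &quot;wrong context twice&quot; rules. A true return value means the
// agent must stop and escalate.
class StopRules {
  private failures = new Map&amp;lt;string, number&amp;gt;();
  private contextErrors = 0;

  // Count a failed attempt against an error signature. Callers
  // should only record genuinely different approaches.
  recordFailure(errorSignature: string): boolean {
    const n = (this.failures.get(errorSignature) ?? 0) + 1;
    this.failures.set(errorSignature, n);
    return n &amp;gt;= 3;
  }

  // The second context error ends the session entirely.
  recordWrongContext(): boolean {
    this.contextErrors += 1;
    return this.contextErrors &amp;gt;= 2;
  }
}
&lt;/code&gt;&lt;/pre&gt;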
&lt;h2&gt;The Escalation Format&lt;/h2&gt;
&lt;p&gt;When an agent hits a stop point, it needs to communicate three things clearly: what is blocked, what was tried, and what would unblock the situation. Unstructured messages like &quot;I&apos;m having trouble with authentication&quot; are not useful. They force the human to ask follow-up questions, adding a round-trip of delay.&lt;/p&gt;
&lt;p&gt;The required format is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BLOCKED: [Brief description of the problem]
TRIED: [What was attempted - specific approaches, not vague descriptions]
NEED: [What would unblock - a decision, a credential, a different environment]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BLOCKED: Cannot authenticate to the staging API
TRIED: 1) Used the API key from project config 2) Tried the admin key from
       environment variables 3) Checked Infisical for a staging-specific key
NEED: A valid API key for the staging environment, or confirmation that staging
      uses the same key as production
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This format works because it gives the human everything needed to unblock the situation without additional back-and-forth. The &quot;TRIED&quot; section prevents the human from suggesting something already attempted. The &quot;NEED&quot; section makes the required action explicit.&lt;/p&gt;
&lt;h2&gt;Anti-Patterns&lt;/h2&gt;
&lt;p&gt;Naming the failure modes makes them easier to catch. These are the patterns that kill discipline is designed to prevent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&quot;Let me try one more variation.&quot;&lt;/strong&gt; This is the most common and most dangerous anti-pattern. After three failed attempts, the agent generates a fourth approach that looks plausible. It always looks plausible - that is what language models are good at. The rule exists precisely because the agent&apos;s own judgment about whether another attempt is worthwhile cannot be trusted after repeated failure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Declaring partial success.&lt;/strong&gt; An agent runs a subset of tests, sees them pass, and declares the task complete. The full test suite was not run. Or the agent tests the happy path but not the error cases specified in the acceptance criteria. Partial testing declared as success is worse than no testing, because it creates false confidence that the change works.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brute-forcing past errors instead of investigating root causes.&lt;/strong&gt; When a build fails, the correct response is to read the error message and understand what went wrong. The incorrect response is to modify code until the error changes, then modify more code until the next error changes, and so on until something compiles. This produces code that happens to compile but does not necessarily do what was intended. It is the coding equivalent of turning knobs until the dashboard light goes off.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lots of tool calls with little progress.&lt;/strong&gt; High activity is not evidence of progress. An agent reading 40 files, running 30 commands, and editing 20 files might be making great progress - or it might be thrashing. The distinguishing factor is whether the agent can articulate what it learned from each step and how it advances toward the goal. If the answer after 30 minutes is still &quot;I&apos;m investigating,&quot; the time-box rule applies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Papering over problems instead of surfacing them.&lt;/strong&gt; An agent encounters an error and adds a try/catch block to suppress it. The underlying problem still exists, but the immediate symptom is gone. This is not a fix. Kill discipline requires that problems be surfaced, not hidden.&lt;/p&gt;
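&lt;p&gt;The stop rules behind these anti-patterns can be sketched as a simple budget check. This is an illustrative sketch, not our actual implementation - the function and field names are assumptions - but the limits (three failed attempts, a 30-minute time-box) are the ones described above.&lt;/p&gt;

```typescript
// Sketch: a hard attempt/time budget that forces a stop decision.
// MAX_ATTEMPTS and TIME_BOX_MINUTES reflect the rules in the text;
// all names here are illustrative, not a real codebase.

type StopDecision = "continue" | "escalate";

interface SessionBudget {
  failedAttempts: number; // consecutive failed attempts at the same fix
  minutesElapsed: number; // time spent on the current investigation
}

const MAX_ATTEMPTS = 3;
const TIME_BOX_MINUTES = 30;

function checkStopRules(budget: SessionBudget): StopDecision {
  if (budget.failedAttempts >= MAX_ATTEMPTS) {
    return "escalate"; // no fourth "plausible" variation
  }
  if (budget.minutesElapsed >= TIME_BOX_MINUTES) {
    return "escalate"; // "still investigating" past the time-box
  }
  return "continue";
}
```

&lt;p&gt;The point of making this a mechanical check rather than a judgment call is the one stated above: after repeated failure, the agent&apos;s own judgment about whether another attempt is worthwhile is exactly what cannot be trusted.&lt;/p&gt;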
&lt;h2&gt;Evidence Requirements&lt;/h2&gt;
&lt;p&gt;Stopping bad work is half the discipline. The other half is ensuring that completed work is actually complete. Before any issue is closed, three requirements must be met.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;End-to-end test.&lt;/strong&gt; The fix or feature must be tested through the actual product interface - not a simulated environment, not a curl command against an isolated endpoint, not a unit test that mocks the dependencies. The test must exercise the real code path that users will hit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Human confirmation.&lt;/strong&gt; A person must verify that the fix works. The agent&apos;s own assertion that it tested something is necessary but not sufficient. The human review can be lightweight for low-risk changes, but it must happen.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evidence.&lt;/strong&gt; A screenshot, a terminal output, a session log - something concrete that demonstrates the verification happened. &quot;I tested it and it works&quot; is not evidence. Evidence is the artifact that proves it.&lt;/p&gt;
&lt;p&gt;The close comment on any issue follows a structured format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Verification
- Tested on: [machine name]
- CLI used: [agent CLI name]
- Command run: [specific command]
- Result: [PASS with evidence link or inline output]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This seems bureaucratic until you have been burned by an issue that was closed as &quot;fixed&quot; but never actually verified. That post-mortem is worse.&lt;/p&gt;
&lt;h2&gt;Kill Discipline as a Cultural Practice&lt;/h2&gt;
&lt;p&gt;The most important thing about kill discipline is that it does not emerge naturally. You have to codify it explicitly and enforce it consistently. Without explicit rules in project instructions, every AI agent will default to optimistic persistence - trying one more thing, one more time, one more variation.&lt;/p&gt;
&lt;p&gt;Here is how we implement it in practice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Project instructions.&lt;/strong&gt; The stop rules are written directly into the project&apos;s configuration files that agents read at session start. They are not in a wiki. They are not in a separate document that someone might forget to reference. They are in the same file that tells the agent what repo it is working in and what commands to run. The agent reads these rules at the start of every session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Post-mortem reinforcement.&lt;/strong&gt; When a churn incident happens, we update the rules. The January 2026 post-mortem that revealed 10+ hours of agent churn directly produced the mandatory stop points. The rules are not theoretical - they are extracted from real failures. The version history in the team workflow document traces each rule back to the incident that motivated it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start-of-day loading.&lt;/strong&gt; The session initialization process surfaces P0 issues, active sessions, and the last handoff. It also loads the team workflow document that contains the escalation rules. The agent does not need to remember the rules from a previous session. They are delivered fresh every time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structured workflow integration.&lt;/strong&gt; The escalation format is not just a template - it feeds into the team&apos;s routing system. A &lt;code&gt;BLOCKED&lt;/code&gt; escalation becomes a labeled issue that routes to the right person. The human sees it in their priority queue, not buried in a conversation thread.&lt;/p&gt;
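&lt;p&gt;As a rough sketch of that routing - the field names and labels here are hypothetical, not our actual schema - a &lt;code&gt;BLOCKED&lt;/code&gt; escalation maps mechanically to a labeled issue:&lt;/p&gt;

```typescript
// Hypothetical sketch: turning a BLOCKED escalation into a labeled issue
// payload. Field names and label strings are illustrative assumptions.

interface Escalation {
  status: "BLOCKED" | "DONE";
  summary: string;
  attempts: number;
  venture: string;
}

interface IssuePayload {
  title: string;
  labels: string[];
  body: string;
}

function routeEscalation(e: Escalation): IssuePayload | null {
  if (e.status !== "BLOCKED") {
    return null; // only blocked work routes to a human queue
  }
  return {
    title: `[BLOCKED] ${e.venture}: ${e.summary}`,
    labels: ["blocked", "needs-human", e.venture],
    body: `Agent stopped after ${e.attempts} failed attempts. See session log.`,
  };
}
```

&lt;p&gt;The payload is what puts the escalation in a priority queue instead of a conversation thread: a label is filterable, a chat message is not.&lt;/p&gt;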
&lt;p&gt;The analogy to code review is apt. Nobody naturally writes perfect code. Code review is an institutional practice that catches problems before they reach production. Kill discipline is the same thing for agent behavior - an institutional practice that catches unproductive spirals before they burn hours of compute and human attention.&lt;/p&gt;
&lt;h2&gt;The Broader Principle&lt;/h2&gt;
&lt;p&gt;Kill discipline is one expression of a broader principle: AI agents need governance, not just prompts. You cannot give an agent a task and walk away. You need to define what the agent should do when things go wrong, what evidence constitutes &quot;done,&quot; and when the agent should stop trying and ask for help.&lt;/p&gt;
&lt;p&gt;The tooling matters less than the discipline. Whether the stop rules live in a CLAUDE.md file, a system prompt, or a configuration database, the important thing is that they exist, they are specific, and they are loaded into every agent session automatically.&lt;/p&gt;
&lt;p&gt;The teams that will get the most value from AI agents are not the ones with the most sophisticated models or the most tokens. They are the ones that have figured out how to make agents stop at the right time - the ones that have learned that knowing when to quit is just as important as knowing how to start.&lt;/p&gt;
&lt;p&gt;Activity is not progress. Codify that, and your agents get dramatically more useful.&lt;/p&gt;
</content:encoded><category>agent-workflow</category><category>process</category><category>team-management</category></item><item><title>Why We Built a Development Lab Instead of a Product</title><link>https://venturecrane.com/articles/why-development-lab/</link><guid isPermaLink="true">https://venturecrane.com/articles/why-development-lab/</guid><description>Most founders pick one idea and go all-in. We built shared infrastructure first. Here is why.</description><pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The conventional startup playbook says pick one idea, build it, ship it, iterate. If it works, scale. If it doesn&apos;t, pivot. Repeat until you find product-market fit or run out of runway.&lt;/p&gt;
&lt;p&gt;We did something different. Instead of going all-in on a single product, we spent the first months building a development lab - shared infrastructure that supports multiple ventures simultaneously. The bet is that the lab itself is the competitive advantage, not any individual product.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Coordination Tax&lt;/h2&gt;
&lt;p&gt;Every software product, no matter how small, needs the same foundation: secrets management, CI/CD pipelines, deployment configuration, documentation, issue tracking workflows, session management for development. Call it the coordination tax.&lt;/p&gt;
&lt;p&gt;For a single product, this tax is manageable. You set it up once, maintain it lightly, and spend most of your time on the product itself.&lt;/p&gt;
&lt;p&gt;For multiple products, the tax compounds. Each new venture needs its own secrets in its own namespace. Its own CI workflows. Its own deployment pipeline. Its own documentation structure. Its own development environment setup. Multiply that by several ventures and the overhead starts to dominate the actual product work.&lt;/p&gt;
&lt;p&gt;The insight was simple: most of this infrastructure is identical across ventures. The secrets management pattern doesn&apos;t change because the product domain changed. CI/CD workflows are 90% the same. Deployment targets are the same platform. Documentation requirements follow the same templates.&lt;/p&gt;
&lt;p&gt;So we built it once. A venture registry tracks each project&apos;s metadata, tech stack, and capabilities. Infisical organizes secrets by project path. GitHub Actions workflows are templated. Documentation requirements are defined centrally and audited automatically. A single CLI launcher command spins up a fully configured development session for any venture in the portfolio.&lt;/p&gt;
&lt;p&gt;The result is that adding a new venture takes minutes, not days. The coordination tax is paid once, amortized across everything.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The AI Agent Angle&lt;/h2&gt;
&lt;p&gt;Shared infrastructure is not a new idea. Monorepos, platform teams, and internal developer platforms all address the same problem. What makes the development lab model different is that AI coding agents change the economics of how many codebases a small team can maintain.&lt;/p&gt;
&lt;p&gt;A solo founder without AI agents can realistically maintain one codebase at production quality. Maybe two if they&apos;re related. The bottleneck is human attention - context switching between unrelated codebases is expensive, and each one demands ongoing maintenance even when you&apos;re not actively building features.&lt;/p&gt;
&lt;p&gt;AI coding agents shift this. An agent can pick up a codebase cold, orient itself using project instructions and documentation, and start productive work within minutes. It doesn&apos;t carry the cognitive overhead of context switching. It doesn&apos;t get tired of fixing lint errors across multiple repos. It doesn&apos;t forget the deployment process for a project it hasn&apos;t touched in two weeks.&lt;/p&gt;
&lt;p&gt;But agents need infrastructure to work this way. They need structured handoffs so the next session knows what happened in the last one. They need a context management system that injects business knowledge at session start. They need a venture registry that maps project codes to repositories, secrets paths, and capabilities. They need documentation that&apos;s current, because stale docs make agents worse, not better.&lt;/p&gt;
&lt;p&gt;The development lab is the infrastructure that makes multi-venture AI-assisted development practical. Without it, you have agents fumbling through setup steps, missing context, and duplicating work. With it, you have agents that start every session oriented and productive, regardless of which venture they&apos;re working on.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Shared Infrastructure Looks Like&lt;/h2&gt;
&lt;p&gt;The lab is not a single monolithic system. It&apos;s a collection of purpose-built components that work together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A venture registry&lt;/strong&gt; - a JSON configuration file that defines each project: its code name, its GitHub organization, its capabilities (does it have an API? a database?), its portfolio status and stage. The registry drives conditional behavior throughout the system. Documentation requirements, schema audits, and API doc generation are only triggered for ventures with matching capabilities.&lt;/p&gt;
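&lt;p&gt;A minimal sketch of what a registry entry and a capability gate might look like - the field names are assumptions based on the description above, not the real schema:&lt;/p&gt;

```typescript
// Illustrative shape of a venture registry entry and a capability gate.
// All field names and values are assumptions for the sketch.

interface VentureEntry {
  code: string;           // short project code name
  org: string;            // GitHub organization
  capabilities: string[]; // e.g. "api", "database"
  status: string;         // portfolio status
  stage: string;          // e.g. "prototype", "ideation"
}

const registry: VentureEntry[] = [
  {
    code: "example-app",
    org: "example-org",
    capabilities: ["api", "database"],
    status: "active",
    stage: "prototype",
  },
];

// Conditional behavior: only ventures with an API trigger API-doc generation.
function needsApiDocs(v: VentureEntry): boolean {
  return v.capabilities.includes("api");
}
```

&lt;p&gt;The gate is the important part: audits and doc generation key off declared capabilities, so adding a venture means declaring what it has, not wiring up each check by hand.&lt;/p&gt;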
&lt;p&gt;&lt;strong&gt;A context management system&lt;/strong&gt; - a Cloudflare Worker backed by D1 that tracks agent sessions, stores structured handoffs, manages an enterprise knowledge store, and audits documentation freshness. Every agent session starts by calling this system, receiving the last handoff, active parallel sessions, and relevant business context. We wrote about this system in detail in a &lt;a href=&quot;/articles/agent-context-management-system&quot;&gt;previous article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A secrets pipeline&lt;/strong&gt; - Infisical organizes API keys and tokens by project path. A CLI launcher fetches the right secrets and injects them as environment variables before spawning the agent. Secrets never touch disk in plaintext. Adding a secret for a new venture is one command.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A documentation system&lt;/strong&gt; - centralized documentation with version tracking, staleness detection, and self-healing. When a new venture is added that has an API, the system automatically generates API documentation from the source code. Missing docs are flagged during session initialization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A fleet of development machines&lt;/strong&gt; - multiple macOS and Linux machines connected via Tailscale, with SSH mesh networking, tmux for persistent sessions, and mobile access through Mosh. Any machine can run any venture&apos;s development session with a single command.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CI/CD templates&lt;/strong&gt; - GitHub Actions workflows for type checking, linting, testing, security scanning, and documentation sync. These are consistent across ventures, with per-venture customization only where the tech stack requires it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Doesn&apos;t Work Yet&lt;/h2&gt;
&lt;p&gt;Honesty about the current state: the development lab is real and in daily use. The ventures it supports are not yet generating revenue.&lt;/p&gt;
&lt;p&gt;Several projects in the portfolio are at the prototype stage. Others are still in ideation. The lab infrastructure is production-grade, but the products it enables are pre-launch. This is the central tension of the model - we&apos;ve invested heavily in the platform before proving that any individual product built on it can find a market.&lt;/p&gt;
&lt;p&gt;Some specific gaps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-agent coordination is designed but undertested.&lt;/strong&gt; The system supports parallel agent sessions with awareness of what other agents are working on. The track system for partitioning work across agents has the schema and indexes in place but hasn&apos;t been exercised under real parallel load. Most development still happens as single-agent sessions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The portfolio is broad.&lt;/strong&gt; The ventures span different domains - consumer, B2B, creative tools. This diversity is intentional (it&apos;s an exploration strategy), but it means attention is divided. Each venture gets less focused effort than it would in a single-product company.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Revenue is zero.&lt;/strong&gt; The lab generates no direct revenue. It&apos;s pure investment in capability. If none of the ventures find product-market fit, the infrastructure has value only as a learning exercise and potentially as a template for others.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Bet&lt;/h2&gt;
&lt;p&gt;The development lab model is a bet on three things.&lt;/p&gt;
&lt;p&gt;First, that &lt;strong&gt;optionality beats conviction&lt;/strong&gt; at the earliest stages. When you genuinely don&apos;t know which product idea has the best market, the ability to explore multiple ideas at low marginal cost is more valuable than deep commitment to one. The lab makes each additional exploration cheap.&lt;/p&gt;
&lt;p&gt;Second, that &lt;strong&gt;AI agents will keep getting better.&lt;/strong&gt; The lab&apos;s value scales with agent capability. As agents become more autonomous and require less human supervision, the number of ventures a small team can maintain increases. The infrastructure we built today is sized for current agent capabilities. Tomorrow&apos;s agents will get more leverage from the same infrastructure.&lt;/p&gt;
&lt;p&gt;Third, that &lt;strong&gt;shared infrastructure compounds.&lt;/strong&gt; Every improvement to the context management system benefits all ventures. Every documentation template, every CI/CD pattern, every deployment workflow is reusable. The development lab gets better with each venture added, not worse.&lt;/p&gt;
&lt;p&gt;If one venture takes off, the infrastructure supports scaling it. The CI/CD pipelines, secrets management, and deployment patterns don&apos;t need to be rebuilt. If it doesn&apos;t take off, the same infrastructure supports pivoting to another venture or launching a new one. The lab itself is never wasted.&lt;/p&gt;
&lt;p&gt;The worst case is that we&apos;ve built a well-engineered development environment and learned a lot about AI-assisted multi-project development. The best case is that one of these ventures finds its market, and it gets there faster because the infrastructure was already in place.&lt;/p&gt;
&lt;p&gt;We chose to build the lab first. Time will tell if that was the right call.&lt;/p&gt;
</content:encoded><category>strategy</category><category>infrastructure</category><category>ai-agents</category></item></channel></rss>