Handshake Driven Development: A Working Agreement for AI Coding

A practical pattern for making AI coding more reliable: define the spec, context, tools, guardrails, and review loop before the agent starts writing code.

AI-assisted coding is past the point where it can be dismissed as a novelty. It changes software work because it turns intent into code faster than most teams are used to reviewing it.

That speed is useful. It is also where the risk starts.

The first time an agent takes a rough idea and turns it into a working implementation, it feels like a shortcut through the boring parts of software. Boilerplate disappears. Repetitive edits happen in seconds. A feature that would have taken an afternoon can suddenly take a conversation.

For a while, that feels like the whole story.

Then the misses start getting harder to spot.

Not catastrophic failures. Those are easy to catch. The more dangerous failures are the plausible ones: code that compiles, tests that still pass, UI that looks fine, and a small assumption buried in a secondary workflow that only shows up later. A missing import path. A skipped migration concern. A platform-specific difference that was never made explicit. A naming pattern that drifts just enough to become tomorrow's inconsistency.

That is where the initial productivity gain becomes more complicated. The first 80% gets dramatically faster. The last 20% can become noisier, because now you are not just writing code; you are auditing what the agent thought the project meant.

The uncomfortable lesson was simple: AI does not remove the need for engineering discipline. It amplifies whatever discipline is already there.

If your process has clear specs, good context, working tests, and humans reviewing the right things, an agent can accelerate that. If your process relies on tribal knowledge, vague tickets, and "it compiles, ship it," the agent accelerates that too.

This is the pattern I have been trying to name for myself: not better prompting alone, but a better working agreement with the agent before code starts.

I eventually started calling that agreement a handshake.

Not because the word sounds formal, but because the useful part happens before the work begins. Both sides need to know what is being built, what rules matter, what tools are in play, what risks to watch for, and how the result will be checked.

The rest of this article is the path that got me there.

Why Better Prompting Was Not Enough

A lot of advice around AI coding still collapses into some version of "prompt better."

Prompt quality matters. A clear request beats a vague one. Examples help. Constraints help.

The advice has also gotten more sophisticated. It is not only "write better prompts" anymore. It is also "use skills," "connect MCPs," "add memory," "give the agent tools," "write project rules," and "make the model aware of more context."

All of that helps. I use versions of those ideas myself.

The first step was learning to ask better. The next step was learning to shape the environment around the agent. I spent more time on the AGENTS.md file, coding standards, tool instructions, repository conventions, verification expectations, and hard limits around what the agent should not do. In other words, I started defining guardrails.

That helped. The output became closer to what I expected. The agent made fewer random stylistic choices. It had a better sense of how I wanted the work done, not only what I wanted built.

For a while, that felt like the next level of AI-assisted coding.

The Miss That Changed My Approach

I've been working on a passion project for a couple of years, an app called FossilVault. I collect fossils, and the app is where I have been turning that hobby into a real product. Since early last year, I have used agentic coding tools for some of the work.

One of the first moments that built my confidence was a simple model change. I needed to add a new field to the Specimen model in the app. I gave the agent what I considered a strong prompt: the intent, the change needed, the expected surfaces, and the obvious workflows.

The result was better than I expected. It touched the model, updated relevant usages, and handled more of the external behavior than earlier tools would have managed.

A few weeks later, I needed to add another field to the same model using almost the same prompting style. Again, most of the work was correct. The agent updated the add and edit specimen flows, updated the details view, kept stats working, and added the field to CSV export. But one important path was missed: CSV import.

That seems obvious in hindsight. If a field can be exported, it probably needs to be imported too. A human working inside the product context would usually connect those two workflows. But an automated system does not reliably infer every reciprocal path unless the workflow asks it to look for them.
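
Code-wise, the gap can be tiny. Here is a minimal Swift sketch of how this kind of miss stays invisible; the Specimen fields and CSV helpers are hypothetical stand-ins, not FossilVault's actual code:

    struct Specimen {
        var species: String
        var locality: String
        var storageLocation: String?  // the newly added field (hypothetical name)

        // Export was updated: the new field is written out.
        func csvRow() -> String {
            [species, locality, storageLocation ?? ""].joined(separator: ",")
        }
    }

    // Import was missed: the parser still reads only the old columns.
    // Everything compiles, the export looks right, and a round trip
    // through CSV silently drops the new value.
    extension Specimen {
        init?(csvRow: String) {
            let columns = csvRow
                .split(separator: ",", omittingEmptySubsequences: false)
                .map(String.init)
            guard columns.count >= 2 else { return nil }
            self.init(species: columns[0],
                      locality: columns[1],
                      storageLocation: nil)  // columns[2] is never read back
        }
    }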

That miss was more interesting than a hard failure because nothing exploded immediately. The code looked complete. The obvious flow worked. The change seemed done until a real workflow reached the boundary the agent had skipped.

That is the kind of failure that makes AI coding tricky. The result can be impressive and still incomplete. It can look like progress while quietly adding future cleanup. The prompt had not been bad. The agent had not been useless. The problem was that I was still treating the task as a one-off request instead of a workflow with an explicit contract.

For that kind of model change, a better process should have forced a checklist before implementation started:

  1. Find every read and write path for the model.
  2. Check the obvious UI and the secondary workflows.
  3. Check every place the model enters, leaves, appears, gets edited, gets serialized, or gets reconstructed.
  4. Look for platform-specific behavior.
  5. Update tests or add a regression case where it fits, as sketched below.
  6. Call out any path that was not inspected.
  7. Verify the result against the acceptance criteria, not only against compilation.

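Item 5 is the one that would have caught the CSV miss. Continuing the hypothetical Specimen sketch from above, a single round-trip regression test fails the moment export knows about a field that import does not:

    import XCTest
    @testable import FossilVault  // hypothetical module name

    final class SpecimenCSVTests: XCTestCase {
        // Round-trip regression: any field that survives export must
        // survive import too. This fails as soon as a field is added
        // to csvRow() without a matching change to init?(csvRow:).
        func testCSVRoundTripPreservesAllFields() {
            let original = Specimen(species: "Asteroceras obtusum",
                                    locality: "Lyme Regis",
                                    storageLocation: "Drawer 3")
            let imported = Specimen(csvRow: original.csvRow())
            XCTAssertEqual(imported?.storageLocation, original.storageLocation)
        }
    }
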
That was the point where I started looking for something stronger than good prompts plus good guardrails.

Enter Spec-Driven Development

The next place I looked was spec-driven development (SDD). If prompts and guardrails helped the agent behave better, maybe the deeper fix was to change what the agent was working from in the first place.

Spec-driven development appealed to me because it changes the starting point.

Instead of beginning with implementation and letting the code become the first concrete artifact, SDD starts by defining what the system should do. The spec captures the behavior, acceptance criteria, scope boundaries, non-goals, and constraints before the agent starts writing code.

That matters for human work, but it matters even more for AI-assisted work. A spec gives the agent a stable target. It also gives the project a piece of documentation that survives the individual conversation. When the next feature touches the same area, the agent and the human can refer back to the decisions that were already made instead of reconstructing them from memory or nearby code.

In practice, SDD gave me three immediate benefits: clarity, because hidden assumptions had to be named; reproducibility, because the next session depended less on exact prompt phrasing; and better AI output, because agents perform better with a concrete behavioral target.

I tried this in a new project and the results were strong. The work done by the agents was much closer to what I wanted. Specs helped me keep track of changes, helped the agent understand features, and created a better source of truth.

The improvement was strong enough that I overcorrected. For a while, the spec felt like the whole missing piece. I paid less attention to the surrounding agent setup: the AGENTS.md rules, reusable skills, project-specific tools, and the rest of the harness I had been building around the agent. SDD made things better so quickly that it was easy to assume the spec was enough.

But as the project grew, another kind of drift appeared.

The features were better specified, but the agent still made choices about architecture, style, error handling, tool usage, and project conventions. Some of those choices were fine in isolation. Across a growing codebase, they started to diverge.

That exposed the limitation:

Specs tell the AI what to build, but not how to behave while building it.

The Missing Piece: The Handshake

So I went back to what had worked in the earlier project: agent configuration, guardrails, tool instructions, and explicit working rules.

This time, I combined that with the spec-driven approach.

That was when things clicked. The agent's output matched what I expected more consistently, with less variability from one iteration to the next. The spec defined the target, and the agent contract defined the way we would move toward it.

At that point, the earlier word started to fit better. A handshake is not a vague philosophy about "alignment." It is a concrete agreement between the engineer and the AI system before work begins.

It defines how the agent should think about the task, what tools it should use, what constraints it must respect, what risks it should watch for, how it should verify the work, and what kind of result it should deliver.

The point is to turn intent into operating rules before the agent starts:

  1. Work from the spec and do not expand scope silently.
  2. Follow the repository's architecture and naming conventions.
  3. Use the available tools for the job instead of guessing.
  4. Add regression tests when a bug is fixed.
  5. Run the relevant checks before claiming the work is done.
  6. Call out uncertainty instead of hiding it behind plausible code.

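In my projects, those rules live as plain statements in the repository's AGENTS.md. A condensed, hypothetical excerpt; the wording and the script name are illustrative, not a template:

    ## Working agreement
    - Implement only what the spec for the current task describes.
      If adjacent code looks wrong, report it; do not fix it silently.
    - Follow the existing module layout and naming conventions.
    - Use the project's scripts and tools; do not guess at commands.
    - Every bug fix ships with a regression test.
    - Run the project's checks (for example, a scripts/verify.sh)
      before reporting the task as done.
    - If intent is unclear, say so instead of producing plausible-looking code.
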
That is the part I had been missing. SDD gave the task a target. The agent configuration gave the agent rules of operation. Tooling contracts made capabilities explicit. Guardrails defined the boundaries. Human review kept judgment in the loop.

Together, those pieces became the working agreement I now think of as Handshake Driven Development.

Handshake Driven Development

Handshake Driven Development is the practice of establishing a repeatable working agreement with AI systems before any code is written.

The agreement covers behavior, tools, constraints, context, and verification. Before implementation begins, the agent and the human align on how the work should be understood, executed, checked, and folded back into the project.

This is not a claim that every task needs ceremony. HDD is a pragmatic pattern that has worked for me when reliability matters more than the initial dopamine hit of fast generation.

My shorthand is:

HDD = SDD + agent contract + tooling contracts + guardrails + human loop

The components are:

  1. The spec: what needs to be built, including functional requirements, non-functional requirements, acceptance criteria, scope, and non-goals.
  2. The agent contract: how the agent should work, usually captured in something like AGENTS.md, including coding standards, architecture rules, error handling expectations, and reasoning style.
  3. Tooling contracts: what tools, APIs, databases, and project utilities the agent is expected or allowed to use.
  4. Guardrails: what the agent must not do, which boundaries protect the codebase from shortcuts, unsafe edits, and silent debt.
  5. The human loop: where architectural judgment, tradeoffs, product intent, and debt decisions stay with the engineer.

What a Handshake Looks Like

A useful handshake does not have to be large. For a meaningful change, I want the agent to know at least six things before implementation starts:

  1. Source of truth: which spec, issue, or note defines the intended behavior.
  2. Scope boundary: what should not change, even if the agent sees nearby cleanup opportunities.
  3. Context sweep: which models, screens, workflows, platforms, imports, exports, tests, and docs need to be checked.
  4. Tooling expectations: which project commands, scripts, MCPs, or local helpers should be used instead of guessing.
  5. Verification: what must pass before the work can be considered done.
  6. Handoff: what new knowledge should be written back into project context so the next session starts less blind.

For the FossilVault field update, the missing handshake item was the context sweep. The task was not only "add a field." It was "find every place this model enters, leaves, appears, gets edited, gets serialized, or gets reconstructed."
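
Written out, the handshake for that task could have been as short as this; the field name and file layout are hypothetical:

    Source of truth: spec "Add acquisition date to Specimen" in docs/specs/
    Scope:           the Specimen model and its read/write paths; no unrelated cleanup
    Context sweep:   add/edit flows, details view, stats, CSV export, CSV import,
                     persistence on both platforms, tests
    Tooling:         project build and test scripts only
    Verification:    build passes, tests pass, CSV round trip covers the new field
    Handoff:         record the new field and its import/export handling in the spec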

The human loop is not just final approval. It is where judgment stays.

The agent can inspect files, propose changes, run tests, and summarize tradeoffs. But the engineer still owns the decisions that require product taste, architectural memory, risk tolerance, and debt judgment.

In practice, that means the human should decide whether the spec is actually right, whether a local fix should become a broader refactor, whether a shortcut is acceptable, whether tests prove the behavior that matters, and whether the implementation still fits the product.

Why This Matters More Across Platforms

FossilVault runs across iOS and Android, which makes drift easy to see.

Without a shared spec and context baseline, parity work becomes interpretive. One platform implements the intent one way, the other platform implements a slightly different reading, and the difference may not show up until a user crosses a workflow boundary. The feature is "the same" in planning language, but not in product behavior.

Living specs change that. They let me carry intent, constraints, and acceptance criteria across codebases without starting the conversation from scratch each time. The second implementation becomes less about reinvention and more about applying the same contract in a different environment.

That is one reason HDD feels especially valuable for teams. The more people, platforms, agents, and codebases involved, the more expensive ambiguity becomes. Solo developers still benefit, but the value compounds differently: less lost project history, fewer rediscovered decisions, and less dependence on whatever happened to be in my head that week.

The Cost Is Real

HDD adds overhead, and there is no point pretending otherwise. You spend more time up front writing specs, curating context, defining guardrails, and deciding what must be verified before merge.

There is also a material cost. Agent work consumes tokens, and tokens cost money. A loose workflow can feel cheaper because the first request is small, but the cost often moves into repeated clarification, patching, review, and cleanup loops.

That cost gets worse when exploratory coding becomes the production workflow, or when compilation gets mistaken for validation.

In one small-feature run, for example, I asked an agent to move a button from one screen to another and adjust a small piece of surrounding logic. The initial implementation used roughly 80,000 tokens. Follow-up fixes consumed another 60,000 or so because the agent had missed some local context. A more deliberate HDD pass might have spent more up front, but it likely would have avoided at least part of that cleanup loop.

The tradeoff is that the overhead becomes lighter after the baseline exists. Specs evolve instead of starting from zero. Context artifacts accumulate. Guardrails become reusable. Tests and checks get sharper as misses are discovered. The process stops feeling like ceremony and starts behaving like infrastructure.

For teams, the payoff is usually faster because shared contracts reduce architectural divergence. For solo developers, the payoff is slower but still real, especially once the project has enough history that memory becomes unreliable.

Adapt the Pattern, Do Not Copy It

AI-assisted development is moving quickly, so I would treat my current HDD setup as a snapshot, not a template everyone should copy. The tools, models, and names will change.

The durable idea is simpler: make your agreement with the agent explicit before code starts, then make that agreement repeatable enough that the next run does not depend entirely on prompt luck.

For me, HDD is mostly about trust. Not blind trust in the model, but earned trust in the workflow around it.

I still want the speed. I still want the surprising moments where an agent handles a task better than expected. But I want those moments inside a process that protects the shape of the codebase.

That matters because hidden tech debt has a different emotional weight when it comes from a tool that was supposed to help. It is even more frustrating to clean up code you accepted because it looked plausible in the moment.

What I want from AI coding is not just more code. I want to keep working on projects that would otherwise be too large for the amount of time I have, without slowly losing trust in the system I am building.

If you are using agents seriously, write down the agreement before the next meaningful change. What defines good? What context must be inspected? What should the agent never touch? What checks have to pass before you believe the result?

The tools will change. The handshake is the part worth keeping.