AX is the new DX | georgebuilds.dev

A while back I built agent-board, a kanban board that lives in the terminal, to help our coding agent keep track of long, multi-step work.

It started as a regular CLI. The kind you’d write for yourself. Six weeks later most of it had been rewritten, and the rewrites weren’t because I’d found bugs. They were because agents kept using the tool wrong, and most of the time it was the tool’s fault.

More and more developers consume software and APIs through agents, and the pattern won’t stay limited to developer tools. Designing for them is a different problem from designing for humans, a space the community is starting to call Agent Experience (AX). This post is about what changed and why. Not as advice. Just notes from the work.

The user is different

The thing reading your CLI is not a developer. It doesn’t skim docs once and remember. It has no muscle memory. Every time it picks up your tool, it starts from scratch, working only from whatever text it has in context at that moment.

That single fact ends up driving most of the design decisions. A few examples:

Inconsistency causes hallucinations. If board create and card update use different argument orders, the agent will mix them up and invent commands that look plausible but don’t exist.
Errors are part of the interface. Unlike their human counterparts, agents actually read the error message. Which is a gift: a well-written error is a steering wheel mid-session. “Error: invalid state” and the agent shrugs (same as the dev). But if the error tells it the next command to run, it recovers on the next turn. Devs would’ve scrolled past it anyway.
Context dilution. As the agent’s context fills up with tool outputs, prior turns, and tool descriptions, its accuracy drops. Think of it like Where’s Wally: the more crowded the page, the harder it is to stay sharp and find the thing that matters.

How to expose a capability

I had to pick a form factor for agent-board: CLI, MCP server, bash script, Skill, or something else. Three questions ended up driving it:

How many operations does it expose? If the agent uses a handful of operations constantly, MCP makes sense; the tools sit in context every turn, but the agent reaches for them often enough to justify the context cost. If you have many operations (my rule of thumb is 4+), the cost of loading them all into every turn outweighs the convenience. A CLI is cheaper because the agent only pays for the command it actually runs, which gives you better progressive disclosure of context.
How risky is auto-approving each call? Auto-approval is easier to implement for MCP tools. For CLIs it’s more challenging, since you need to decompose shell commands and then the CLI’s sub-commands. For agent-board, enabling auto-approval would have been easier if it were an MCP server.
Is it a generic primitive or a specific workflow? agent-board is a task management primitive. It’s useful in many contexts (e.g. scaffolding a blog, dockerizing an app, migrating to another cloud provider), so an artisanal bash script wouldn’t make sense and would make distribution and portability harder.

agent-board has many operations, and each one is low-risk and generic. So it leans more towards a CLI.

A note on Code Mode (giving the agent freedom to write its own code against an API). I tried it. I didn’t like it. Pre-written, reusable scripts and CLIs beat regenerating code every time. They’re more predictable and easier to test, and the agent isn’t spending tokens to re-derive logic you’ve already figured out.

AX patterns I discovered

Verb-first commands

My first version grouped commands under each entity. The way you’d lay it out for yourself:

board create "Sprint 12"
card get card_a3f
agent list
card update card_a3f

It worked, but agents made more mistakes than I expected. They’d reach for commands that didn’t exist (the agent kept trying card delete, which I didn’t support) or mix up the argument order across entities.

I rewrote it to put the verb first, the way git and kubectl do:

create board "Sprint 12"
get card_a3f
list agents
update agent_42

Agent accuracy went up. Maybe the verb-first shape was more familiar from training data, maybe the smaller surface area was easier to hold in context. I didn’t isolate the variable. What I know is the new version stopped triggering command hallucination. And it works because the IDs themselves carry the type (card_a3f, board_xyz, agent_42), so get <id> doesn’t need to know what kind of thing it’s fetching.
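
To make the typed-ID point concrete, here is a minimal Python sketch of a single get verb that picks the entity table from the ID prefix. The prefixes mirror the IDs above; the in-memory store, the field names, and the function names are my own illustrative assumptions, not agent-board’s actual code.

```python
# Hypothetical sketch: the prefixes mirror the IDs in the post, but the
# in-memory store, its fields, and the function names are assumptions.

STORE = {
    "card": {"card_a3f": {"title": "Write migration plan"}},
    "board": {"board_xyz": {"name": "Sprint 12"}},
    "agent": {"agent_42": {"name": "builder"}},
}


def entity_type(entity_id: str) -> str:
    """Return the type encoded in an ID such as 'card_a3f'."""
    prefix, sep, suffix = entity_id.partition("_")
    if not sep or not suffix or prefix not in STORE:
        known = ", ".join(f"{p}_<id>" for p in sorted(STORE))
        raise ValueError(f"Unknown ID '{entity_id}'. Expected one of: {known}")
    return prefix


def get(entity_id: str) -> dict:
    """One 'get' verb for every entity; the ID prefix selects the table."""
    kind = entity_type(entity_id)
    try:
        return STORE[kind][entity_id]
    except KeyError:
        # Point the caller at the command that lists valid IDs.
        raise LookupError(
            f"No {kind} with ID '{entity_id}'. Run: list {kind}s"
        ) from None
```

Because the type travels with the ID, the caller never has to pass an entity name, which is one fewer argument for an agent to get wrong.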

The rewrite deleted 299 lines from cli.rs. None of those lines were doing useful work; they existed because I’d modeled the CLI for myself (a human) first, not for the agent that was going to use it.

Fail loudly with instructions

My worst mistake was supposed to be a fancy helpful feature. When an agent ran update card --assign-to-me without an identity configured, I silently auto-created an identity and printed a note suggesting they export it. Seemed nice.

It wasn’t. Every invocation created a new identity, the export never happened, and the database filled up with phantom agents. I reverted the auto-creation and replaced it with a hard error:

$ agent-board update card_a3f --assign-to-me

Error: No agent identity configured.

To use --assign-to-me, set up an identity:
  1. Create an agent:  agent-board create agent
  2. Set the env var:  export AGENT_BOARD_AGENT_ID=<agent_id>

The error tells the agent how to use the tool instead of silently recovering. Now it works. The lesson: silently fixing wrong state is worse than failing explicitly. With the silent fix, the agent never learned about persistent identities until much later, when it queried for its own work and found the records attributed to phantom agents.
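
A minimal Python sketch of the same idea: refuse to guess an identity, and put the recovery steps in the error itself. The environment variable name and the message text come from the output above; the function name, the exception class, and the env parameter are assumptions for illustration.

```python
import os

# The variable name comes from the error message in the post; the function
# name, exception class, and `env` parameter are illustrative assumptions.
ENV_VAR = "AGENT_BOARD_AGENT_ID"


class MissingIdentity(Exception):
    """Raised instead of silently creating an identity."""


def resolve_agent_id(env=None) -> str:
    """Return the configured agent ID, or fail with the next steps spelled out."""
    env = os.environ if env is None else env
    agent_id = env.get(ENV_VAR, "").strip()
    if not agent_id:
        # No fallback and no auto-creation: the message is the recovery path.
        raise MissingIdentity(
            "Error: No agent identity configured.\n\n"
            "To use --assign-to-me, set up an identity:\n"
            "  1. Create an agent:  agent-board create agent\n"
            f"  2. Set the env var:  export {ENV_VAR}=<agent_id>"
        )
    return agent_id
```

The point of the design is that the failure path has no side effects: nothing is written until the agent has an identity it can carry across invocations.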

Cut concepts

I had a comment list subcommand. I also showed comments inline when you ran card get. So I removed comment list. Same capability, fewer ways to get the same information.

I had multi-checklist support per card. Agents never used the feature, but they had to read about it every time. I removed it and went to one checklist per card. The flexibility was costing me more than it was worth.

I condensed SKILL.md, the file that gets loaded into context describing how to use the tool. Then I went further and moved the long-form details behind subcommands. Running agent-board create --help shows the agent the available create commands and their flags, instead of pre-loading all of that in the system prompt.

This is the part where my intuitions were wrong. I kept thinking “more documentation = better.” For a human reading docs once, sure. For an agent, the math is different: we pay a compounding cost on every turn. Progressive disclosure keeps the context lean while still helping the agent at the right moment.
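
Here is a rough sketch of that progressive disclosure using Python’s argparse. The top-level help carries one line per verb, and flag details only appear when the agent asks for create --help. The verbs and entity kinds follow the post; the --board flag and all help strings are hypothetical.

```python
import argparse


def build_parsers():
    """Return (root, create) parsers; flags and help text are hypothetical."""
    root = argparse.ArgumentParser(prog="agent-board")
    verbs = root.add_subparsers(dest="verb", metavar="<verb>")

    # One short line per verb is all the top-level help carries.
    create = verbs.add_parser("create", help="Create a board, card, or agent")
    create.add_argument("kind", choices=["board", "card", "agent"])
    create.add_argument("name", nargs="?", help="Name or title of the new item")
    # Hypothetical flag: only visible via `agent-board create --help`.
    create.add_argument("--board", help="ID of the board a new card belongs to")

    get = verbs.add_parser("get", help="Fetch any entity by its typed ID")
    get.add_argument("id")
    return root, create


if __name__ == "__main__":
    root, create = build_parsers()
    print(root.format_help())    # short: lists verbs only
    print(create.format_help())  # detailed: shown only on request
```

The same split applies to SKILL.md: keep the always-loaded part down to the verb list and let --help carry the rest.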

Be forgiving of mistakes

Agents are wrong a lot. They will delete the wrong card, retry an operation that already succeeded, and try to mutate state that no longer exists. (We make mistakes too; that’s why IDEs have undo.)

Three things help:

Soft delete. Hard deletes punish experimentation. Soft delete with --include-deleted lets agents try things and recover.
Immutability. I make use of the session history as an audit log. The agent (and the human watching) can re-read history and recover from mistakes.
Idempotency. Some agent harnesses can accidentally run the same command multiple times, for example on session restart (not all harnesses support durable execution). We should handle this gracefully by making actions idempotent.

These aren’t novel ideas. We use the same techniques to protect humans from themselves; the same applies to agents, x100.
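
A small in-memory sketch of soft delete and idempotent creates, in Python. Everything here is an assumption for illustration: the class, the request_key used to recognise retries, and the restore operation are not described in agent-board itself; only the include-deleted idea comes from the list above.

```python
import itertools


class Board:
    """In-memory stand-in for the card store; all names are hypothetical."""

    def __init__(self):
        self._cards = {}        # card_id -> {"title": str, "deleted": bool}
        self._by_request = {}   # request key -> card_id, to recognise retries
        self._ids = itertools.count(1)

    def create_card(self, title, request_key):
        # Idempotent create: a retry with the same key returns the same card.
        if request_key in self._by_request:
            return self._by_request[request_key]
        card_id = f"card_{next(self._ids)}"
        self._cards[card_id] = {"title": title, "deleted": False}
        self._by_request[request_key] = card_id
        return card_id

    def delete_card(self, card_id):
        # Soft delete: mark the card, never drop it. Repeating it is a no-op.
        self._cards[card_id]["deleted"] = True

    def restore_card(self, card_id):
        self._cards[card_id]["deleted"] = False

    def list_cards(self, include_deleted=False):
        return [
            card_id
            for card_id, card in self._cards.items()
            if include_deleted or not card["deleted"]
        ]
```

A harness that replays create_card after a restart gets the original ID back instead of a duplicate, and a mistaken delete stays recoverable.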

Tools aren’t portable

Different model families won’t have the same level of proficiency with the same tool. Claude prefers constraints and reasoning in the descriptions (like a child who doesn’t like being told what to do by its parents). Smaller models need multiple concrete examples. A description that helped one model won’t work as well with another.

I don’t have a clean answer for this. We mostly tune for the model we ship with, and accept that “design once, run anywhere” isn’t real for agent tools yet.

How I actually figured this out

One thing worth saying out loud: none of these patterns came from thinking hard up front about agent ergonomics. I didn’t sit in a room and design AX-friendly primitives. I shipped a CLI, watched agents struggle to use it, and removed the friction. Then I observed again.

That’s the whole loop. Build, watch an agent use it, remove the friction, repeat. Good AX is something you discover, not reason your way to.

One concrete takeaway: instrument your agent. Watch the traces (agents can automate this). The bad patterns are obvious once you see them. Most of what I deleted from agent-board, I deleted because I caught the agent getting confused by it on a real task.
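
As one possible starting point for watching traces, here is a Python sketch that counts failed invocations by command shape, so repeated mistakes stand out. The trace format (command and exit code pairs) and the sample data are invented for illustration; a real harness will log something different.

```python
import re
from collections import Counter

# Hypothetical trace format: (command, exit_code) pairs pulled from agent logs.
ID_PATTERN = re.compile(r"^[a-z]+_[0-9a-z]+$")


def command_shape(command: str) -> str:
    """Reduce a command to its first two words, masking IDs like card_a3f."""
    words = command.split()[:2]
    return " ".join("<id>" if ID_PATTERN.match(w) else w for w in words)


def friction_report(trace):
    """Count failed invocations per command shape, most frequent first."""
    failures = Counter(
        command_shape(command) for command, exit_code in trace if exit_code != 0
    )
    return failures.most_common()


if __name__ == "__main__":
    # Made-up sample data, not real agent-board traces.
    trace = [
        ("card delete card_a3f", 2),
        ("get card_a3f", 0),
        ("card delete card_b17", 2),
        ("update card_a3f --assign-to-me", 1),
    ]
    for shape, count in friction_report(trace):
        print(f"{count}x  {shape}")
```

A shape that keeps failing, like a command the tool never supported, is a hint that the interface is suggesting something it doesn’t offer.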

agent-board is open source. So is Stakpak. Both were built using these patterns.