Harness engineering in practice

Posted on Jun 27, 2026

If you have ever built an agent that dazzled in a demo and then came apart the first time it touched something real, you already know the problem this article is about.

When it falls apart, the reflex is to blame the model. Once in a while that is fair. Far more often the model was fine and the layer around it was not. The context it got was stale, a tool handed back garbage, a command ran somewhere it should not have, or a confident final answer went out with nobody checking whether it was true.

That layer is the harness, and getting good at building it is most of what separates an impressive prototype from an agent you can actually leave running. The harness decides what the model sees, what it is allowed to touch, what it remembers between turns, which work gets handed off, what gets checked before it reaches a human, and what happens when something throws. If the model is the reasoning engine and the runtime is the loop that lets it observe, decide, and act, then the harness is everything wrapped around that loop.

What follows is a tour of the parts that make up a harness, one at a time. For each one I will explain what it is and why it matters, then make it concrete with code from a real agent I run in production. The abstract definition tells you a primitive exists; the real example is what shows you where it actually lands in a working system, which is the part that is hard to picture until you have built one.

The agent in question is a read-only copilot for the Cloudways support team that helps them diagnose complex server-level issues like high CPU, slow databases, failing PHP processes, and disks filling up. An investigation that used to take an engineer half an hour of manual digging now lands in a few minutes. It runs on Google ADK with LiteLLM in front of a hosted inference endpoint, but nothing here is tied to ADK. The same primitives are sitting inside Claude Code, Codex, Gemini CLI, and anything serious you build yourself.

1. Instructions

Every agent starts here. You write down who the model is, what it does, how it should talk, and the lines it must never cross. The system prompt, AGENTS.md, CLAUDE.md, Cursor rules: all the same idea. They earn their keep by moving guidance into the environment so you are not retyping it on every turn.

The thing to understand early is that instructions are passive. They describe the behavior you want. They cannot force it. You can write “cite evidence before recommending a change” in bold at the top of the prompt, and the model will still, sometimes, skip the evidence.

The agent has a single identity that it never drifts from: a read-only diagnostician. The prompt spells out the hard lines. Do not propose a mutating command unless the user explicitly asks and the risk is flagged. Never suggest something that would cut off the platform’s own access for managing the server. One small habit that has paid off more than I expected is stamping the prompt version into session state on every turn.

PROMPT_VERSION = "1.8.0"

def stamp_prompt_version(callback_context):
    callback_context.state["prompt_version"] = PROMPT_VERSION

Months later, when an answer looks wrong, you can pull it up and know exactly which version of the prompt produced it. And you need that, because instructions only ever get you to “mostly.” The prompt says don’t recommend restarting a service without first showing it was actually wedged, and mostly the model listens. Closing the gap between “mostly” and “always” is what verification, further down, is for.

2. Context provisioning

If instructions tell the model who it is, context tells it what it is looking at. This is the file you drag into the chat, the failing test you paste, the stack trace. Ask a model to fix a bug with nothing attached and you get a generic answer. Hand it the actual source and the actual error and it behaves like a completely different system.

A diagnostic agent has the same need, except its “files” are the live state of a server. The lazy version waits for the model to ask for that state, which means burning a full round-trip on the model guessing what to look at before it has seen anything. The copilot does not wait. On the first turn, a before_model_callback runs a fixed read-only triage, parses the output in Python, and drops a compact snapshot straight into the system instruction.

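def collect_triage_before_model(callback_context, llm_request):
    # Run the fixed triage once per session; later turns skip it.
    if callback_context.state.get(_TRIAGE_DONE_KEY):
        return None
    # Assumption: the target host was placed in session state earlier;
    # "server_host" is an illustrative key name.
    host = callback_context.state["server_host"]
    snapshot = _run_triage(host)
    block = render_compact(snapshot)
    llm_request.append_instructions([block])
    callback_context.state[_TRIAGE_DONE_KEY] = True
    return None  # returning None lets the model call proceed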

By the time the model writes its first word, it is already holding the incident’s stack trace. There is a second source folded into that same snapshot: findings from earlier investigations of the same server, pulled out of durable storage (more on that in section 6). So even a session that starts cold gets the benefit of whatever the last engineer worked out. The cheapest tool call is always the one the model never had to make.
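
To make "compact snapshot" concrete, here is a minimal sketch of what a renderer like `render_compact` could do. The `TriageSnapshot` shape, its field names, and the `max_items` cap are assumptions for illustration; the copilot's real renderer is not shown in this article.

```python
from dataclasses import dataclass, field


@dataclass
class TriageSnapshot:
    """Hypothetical shape for parsed triage output; field names are illustrative."""
    load_avg_1m: float
    disk_used_pct: int
    top_processes: list[str] = field(default_factory=list)
    prior_findings: list[str] = field(default_factory=list)


def render_compact(snapshot: TriageSnapshot, max_items: int = 3) -> str:
    """Render a snapshot as a short, bounded text block for the system instruction."""
    lines = [
        "Triage snapshot (read-only, collected before the first turn)",
        f"load_avg_1m: {snapshot.load_avg_1m:.2f}",
        f"disk_used_pct: {snapshot.disk_used_pct}",
    ]
    # Cap list-valued fields so the block stays small however noisy the host is.
    for name in snapshot.top_processes[:max_items]:
        lines.append(f"top_process: {name}")
    for finding in snapshot.prior_findings[:max_items]:
        lines.append(f"prior_finding: {finding}")
    return "\n".join(lines)
```

The point of the cap is that the snapshot costs roughly the same number of tokens on a quiet server as on a burning one.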

3. Context management

Provisioning is about getting the right material in front of the model. Management is about keeping it from drowning once it is in there. These pull in opposite directions, and the second one is easy to forget.

The window is finite, and even inside it, attention is not free. The part that trips people up is that wrong context is often worse than missing context, because a plausible-but-irrelevant detail will happily send the model down the wrong path. So the harness has to be opinionated about what gets through.

In the copilot, that happens three ways. First, long sessions get compacted: a small, cheap model summarizes the history once it passes a token threshold, so a twenty-turn investigation is not dragging all twenty turns forward into every call. Second, tool outputs get capped before they are ever written into the event log, so one enormous log dump cannot eat the whole window.

The third one mattered most. Early on, the agent would investigate a symptom by firing off twenty-odd exploratory read-only commands, each one a separate round-trip through the model, each one piling more noise into the transcript. The fix was to stop letting it explore freely and instead give it profiled collectors. When the harness recognizes a symptom, it runs a fixed sequence of commands, filters the output in Python, and returns a single compact block.

def collect_signals(host: str, profile: str) -> str:
    commands = PROFILES[profile]
    results  = [run_readonly(host, c) for c in commands]
    return render(parse(results))

The lesson that stuck: you do not make an agent smarter by feeding it more. You make it smarter by being ruthless about what it pays attention to.
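
The first two mechanisms, capping tool output and deciding when to compact, can be sketched in a few lines. This is an illustration under assumptions: the function names, the 4000-character cap, and the 20,000-token threshold are made up for the example, and the summarization call itself is left out.

```python
def cap_output(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of an oversized output and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    dropped = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]  # avoids the text[-0:] pitfall when half == 0
    return f"{head}\n[... {dropped} characters truncated ...]\n{tail}"


def needs_compaction(turn_token_counts: list[int], threshold: int = 20_000) -> bool:
    """Return True once the accumulated history exceeds the token threshold."""
    return sum(turn_token_counts) > threshold
```

Keeping both the head and the tail is a deliberate choice for logs, where the first lines often carry the command context and the last lines carry the most recent error.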

4. Tool interface

Context gets the model thinking, but a model that can only think is still just talking. Tools are what let it act. A tool is a name, a description, an input schema, and, if you are doing it right, an output schema too. Whether you call it MCP, function calling, or tool use, it is the same primitive underneath.

The output schema is the part people skimp on, and it is the part that matters here. The agent’s read-only command tool does not return a string. It returns a typed result, so both the model and the code around it can treat a failure as a real value instead of having to read it out of some error text.

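from dataclasses import dataclass

@dataclass
class CommandResult:
    status: str               # "success" or "error"
    output: str               # command output; empty on failure
    error_code: str | None    # machine-readable failure reason, None on success
    http_status: int | None   # status reported by the connector, if any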

The profiled collectors from the last section take this further and return Pydantic-validated objects, like an Http502Signals with an already-deduplicated list of application errors. What the model sees is typed, bounded, and the same shape every run. That stability is worth a lot. And modeling the failure cases in the schema, not just the happy path, is exactly what lets every layer above the tool respond sensibly when something breaks.
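
As a rough picture of "typed, bounded, and the same shape every run", the sketch below builds a signals object with deduplicated errors. It uses a standard-library dataclass as a stand-in for the Pydantic model, and the field names, the builder function, and the `max_errors` cap are assumptions, not the copilot's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Http502Signals:
    """Stand-in for the validated signals object; fields are illustrative."""
    upstream_timeouts: int
    app_errors: tuple[str, ...] = field(default_factory=tuple)


def build_http502_signals(raw_error_lines: list[str], upstream_timeouts: int,
                          max_errors: int = 5) -> Http502Signals:
    """Deduplicate error lines (keeping first-seen order) and bound the list."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for line in raw_error_lines:
        cleaned = line.strip()
        if cleaned:
            seen.setdefault(cleaned, None)
    return Http502Signals(upstream_timeouts, tuple(seen)[:max_errors])
```

Because the object is frozen and the list is capped, every layer above the collector can rely on its size and shape without re-checking.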

5. Execution environment

Once the model has picked a tool and good arguments, there is still a set of questions the model does not get to answer: where does this actually run, under what limits, and how much do you trust what comes back? Filesystem scope, network access, credentials, sandboxes, containers, worktrees. It is tempting to file this under deployment, but it is a harness primitive in its own right, and it is where trust stops being a wish and becomes a property of the system. You do not ask the model to avoid touching secrets. You build a place where it physically cannot reach them.

For this copilot, that is the whole ballgame, because it runs commands against customer production servers. Nothing about its safety rests on the prompt being persuasive.

Commands go through a command connector that enforces a binary allowlist and runs as a low-privilege user. There is no write path to customer data, no way to read a secret, no mutating command available, no matter what the model talks itself into. Reachability is a separate gate from conversation: the agent will cheerfully discuss a server it has no access to, but it cannot run a single diagnostic outside the surface it is allowed.

The environment’s own failures are modeled as carefully as its successes. When the connector rate-limits, or a host has been decommissioned, the tool layer turns that into a specific error_code instead of returning empty output that the model would misread as “everything looks fine.”

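def run_readonly(host: str, command: str) -> CommandResult:
    resp = connector.exec(host, command)
    # Map connector-level failures to explicit error codes so that empty
    # output is never mistaken for a healthy host.
    if resp.status == 429:
        return CommandResult("error", "", "CONNECTOR_RATE_LIMITED", 429)
    if resp.status == 501:
        return CommandResult("error", "", "HOST_DECOMMISSIONED", 501)
    # No modeled failure matched: pass the output through with its status.
    return CommandResult("success", resp.body, None, resp.status)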

“Please don’t run anything destructive” is not a control. The environment is. Read-only lives underneath the model where its choices cannot reach, and the environment is honest about its own errors so the agent never confuses “I was blocked” with “all clear.”
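
A binary allowlist check of the kind described above might look like the following. This is a sketch, not the connector's implementation: the allowlist contents and the rejected metacharacters are assumptions, and in a real system the check belongs in the connector itself, not in the agent process.

```python
import shlex

# Hypothetical allowlist; the real connector's list is not shown in this article.
ALLOWED_BINARIES = frozenset({"df", "free", "uptime", "ps", "tail", "grep"})
# Shell metacharacters that could chain or redirect into something off the list.
FORBIDDEN_TOKENS = (";", "&", "|", ">", "<", "`", "$(", "\n")


def is_allowed(command: str) -> bool:
    """Return True only for a single allowlisted binary with no shell chaining."""
    if any(tok in command for tok in FORBIDDEN_TOKENS):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return bool(argv) and argv[0] in ALLOWED_BINARIES
```

The check is deliberately conservative: it rejects a harmless `grep` pattern containing `|` along with genuine pipelines, which is usually the right trade for a read-only tool.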

6. Durable state

A single clean run is not how real work happens. Tasks pause, get picked back up, branch, and fail halfway. They need somewhere to keep a plan, a log, a record of what has already been tried. That somewhere has to outlive the current turn and stay readable from outside the model’s head. Context management decides what is in front of the model right now; durable state holds onto the facts no matter what the window is doing.

Server investigations are rarely one and done, and the same box often gets looked at by different engineers in different sessions. So when an investigation lands on something, an after_model_callback writes a short findings record to a store that sits apart from the session, keyed by server, and readable across sessions and users.

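from sqlalchemy import Column, DateTime, String, Text

class ServerAnalysis(Base):
    # Findings records, keyed by server, stored outside the session.
    __tablename__ = "server_analysis"
    server_id      = Column(String, nullable=False)
    summary        = Column(Text,   nullable=False)
    prompt_version = Column(String, nullable=True)   # prompt that produced the summary
    # Set the timestamp on insert as well as on update.
    updated_at     = Column(DateTime, default=_utcnow, onupdate=_utcnow)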

This is the thing that feeds the recall back in section 2. A colleague who opens the same server next week starts from last week’s conclusion instead of from nothing. And because it lives outside the session and never enters the transcript, it cannot quietly mess with compaction, verification, or anything else downstream. The context window is the worst place to keep anything you actually want to hold onto.
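
The write-then-recall cycle can be tried end to end with nothing but the standard library. The sketch below uses `sqlite3` in place of the ORM model and assumes one row per server that is overwritten by the latest finding; the function names are illustrative, and the real store may well keep history instead.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) the findings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS server_analysis ("
        "server_id TEXT PRIMARY KEY, summary TEXT NOT NULL, "
        "prompt_version TEXT, updated_at TEXT NOT NULL)"
    )
    return conn


def save_finding(conn, server_id, summary, prompt_version=None):
    """Insert or overwrite the latest finding for a server."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO server_analysis "
        "(server_id, summary, prompt_version, updated_at) VALUES (?, ?, ?, ?)",
        (server_id, summary, prompt_version, now),
    )
    conn.commit()


def recall_finding(conn, server_id):
    """Return the stored summary for a server, or None if nothing is known."""
    row = conn.execute(
        "SELECT summary FROM server_analysis WHERE server_id = ?", (server_id,)
    ).fetchone()
    return row[0] if row else None
```

Returning `None` for an unknown server keeps "no prior findings" distinct from an empty summary, so the triage snapshot can simply omit the section.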

7. Orchestration

Durable state remembers the work but does not move it along. Orchestration is the part of the harness that decides how work flows: what runs before a tool call, what happens after one fails, when to retry, when to ask for approval, what order things go in. The model is not coordinating any of this through sheer effort. The harness is carrying it, and this is the point where an agent stops feeling like a chat box and starts feeling like a runtime.

In the copilot this lives in a layer of callbacks and plugins wrapped around every model and tool call. The before-and-after hooks are where the prompt version gets stamped, the triage gets injected, findings get persisted, and the verification gate runs. A failed read-only command does not kill the run; it gets caught, reflected on, and retried inside a fixed budget.

The one I underestimated was provider degradation. Hosted inference endpoints throw transient capacity errors, the “overloaded” and HTTP 429 kind, more often than you would like. The harness retries with exponential backoff inside the call, and if it still cannot get through, an error handler turns the raw exception into a friendly, resumable message rather than leaking a stack trace and dropping the turn on the floor.

llm_kwargs = {"model": MODEL, "num_retries": 4}  # ... other options elided

class ModelErrorHandlerPlugin(BasePlugin):
    async def on_model_error_callback(self, *, error, **_):
        if _is_transient(error):
            return LlmResponse(content=friendly_retry_message())
        return None

Underneath all of it sits a hard cap on calls per user turn, as a backstop against a runaway loop. Most of what people call reliability is really just orchestration. The model never decided to back off on a 429 or to stop after the tenth tool call. The harness did, in hooks the model cannot even see.
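
For readers who want to see the two backstops in isolation, here is a generic sketch of retry with exponential backoff plus a per-turn call cap. It is not the LiteLLM or ADK mechanism; `TransientError`, `call_with_backoff`, and `CallBudget` are hypothetical names, and jitter is omitted for brevity.

```python
import time


class TransientError(Exception):
    """Stand-in for a provider 'overloaded' or HTTP 429 error."""


def call_with_backoff(fn, max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # budget exhausted; an error handler takes over from here
            sleep(base_delay * 2 ** attempt)


class CallBudget:
    """Hard cap on model or tool calls within one user turn."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def spend(self) -> bool:
        """Record one call; return False once the cap has been reached."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Passing `sleep` in as a parameter keeps the retry logic testable without real waiting, which matters once this kind of code is covered by CI.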

8. Sub-agents

A single agent has a single stream of attention. When the work naturally splits, say you want to explore the code, review a diff, and check a source at the same time, doing it all in one sequence is both slow and a great way to clutter the context. Sub-agents let you break a job into smaller bounded loops.

The mental model that helps: a sub-agent is not more model, it is a model with a smaller job, a smaller slice of context, and usually fewer tools. The trap is consistency. If each sub-agent improvises its own way of working, you trade one messy process for several.

The agent reaches for sub-agents rarely and keeps them on a short leash. The clearest case is an optional verifier. Hand it a high-risk recommendation and the evidence behind it, and it gives back a skeptical second opinion. It has no tools, a tiny budget, and shows up to the main agent as just another tool to call, so the main agent stays the manager and the verifier stays a specialist it consults.

if config.enable_verifier_subagent:
    verifier = LlmAgent(name="Verifier", model=small_model,
                        instruction=SKEPTIC_PROMPT, tools=[])
    tools.append(AgentTool(agent=verifier))

It is behind a flag on purpose, so the default cost and latency stay where they are. A sub-agent is a trade you make deliberately when a task genuinely needs its own context, not something you sprinkle in to feel parallel. When you do reach for one, bound it hard: small job, small context, small toolset.

9. Skills

Splitting work across agents raises the consistency problem; skills are how you answer it. A skill is just a reusable procedure for work that keeps coming back: when to use it, what it takes as input, what steps run in what order, which tools to prefer. Slash commands, playbooks, runbooks, recipes: they are all this.

What a skill buys you is moving expertise out of “remember to do the thing” and into something the harness can name and call on demand. What it does not buy you is any proof the thing worked; that is a different primitive.

The profiled collectors from section 3 are already skills in everything but name. Each profile is a small runbook: for this symptom, run these read-only commands, in this order, and parse them this way. A disk-pressure investigation and a slow-database one are genuinely different procedures, and the harness picks the right runbook instead of re-deriving it from scratch every time.

The next step is to make them load on demand. Instead of carrying every symptom playbook in the always-on prompt, the harness keeps a catalog and pulls in only the one the current incident needs. The agent knows what it could do without paying tokens for all of it up front. As a rule of thumb, the moment you catch yourself explaining the same procedure to an agent twice, you have found a skill waiting to be pulled out.
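
One way to picture on-demand loading is a catalog whose one-line index is always in the prompt while the full playbooks stay out until selected. The sketch below is an assumption about how such a catalog could be laid out; the `Skill` shape, the two entries, and their playbook text are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str    # one line, always in the prompt
    playbook: str   # full procedure, loaded only when selected


CATALOG = {
    "disk_pressure": Skill("disk_pressure", "Disk filling up or inode exhaustion.",
                           "1. df -h\n2. df -i\n3. du on the largest mount"),
    "slow_database": Skill("slow_database", "Slow queries or high database load.",
                           "1. Check the process list\n2. Read the slow query log"),
}


def render_index(catalog: dict[str, Skill]) -> str:
    """Cheap always-on listing: names and one-line summaries only."""
    return "\n".join(f"- {s.name}: {s.summary}" for s in catalog.values())


def load_playbook(catalog: dict[str, Skill], name: str) -> str:
    """Return the full playbook for one skill, or an explicit miss."""
    skill = catalog.get(name)
    return skill.playbook if skill else f"UNKNOWN_SKILL: {name}"
```

The explicit miss follows the same principle as the typed tool errors earlier: an unknown skill name should come back as a value the model can react to, not as silence.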

10. Verification and observability

This is the part that changes how you think about agents. The agent says it is done. The harness should answer “show me.” You do not trust the last sentence because it sounds sure of itself; you ask what outside check backs it up. For coding agents that check is tests, builds, lint. For research it is a primary source. For a diagnostic agent it is evidence sitting in the signals it already collected. “Looks good to me” is not a strategy.

And when something does slip through, observability is how you find out where. Traces, tool timelines, cost, latency, the prompt and tool versions in play, all of it is how you discover the real mistake happened three tool calls before the one that looked wrong.

Verification in the copilot is layered, cheapest check first. A deterministic gate runs after the model and reads the final answer. If the answer recommends a high-risk change but the evidence for it is not actually in the transcript, the gate appends a caveat instead of letting the claim go out clean. The textbook case is “raise the worker-pool limit” with nothing anywhere showing the pool was ever saturated.

def verify_high_risk_after_model(callback_context, llm_response):
    text = answer_text(llm_response)
    for rule in HIGH_RISK_RULES:
        if rule.matches(text) and not rule.evidence_present(callback_context):
            append_caveat(llm_response, rule.caveat)

When the deterministic rules cannot make the call, the bounded LLM verifier from section 8 steps in. A no-LLM self-test runs in CI so a refactor cannot quietly break the gate. And an LLM-as-judge sweep grades whole investigations against a versioned rubric, things like read-only discipline and evidence-before-recommendation, and surfaces patterns for a human to look at.

Sitting under all of it is the recorder. Every run emits traces with cost, latency, which provider served it, and the prompt version stamped back in section 1, so any answer can be walked back to the exact prompt, tools, and outputs that made it. Confidence and correctness are not the same thing. Build the gate that asks for receipts, and the recorder that shows you which step lied.
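
The deterministic gate is easier to reason about with a concrete rule in hand. The following is a simplified sketch under assumptions: the rule shape, both regular expressions, and the caveat text are illustrative, and evidence is looked up in a plain transcript string where the copilot's gate reads it from the callback context.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HighRiskRule:
    """Hypothetical rule shape: a claim pattern, the evidence it needs, a caveat."""
    claim: re.Pattern
    evidence: re.Pattern
    caveat: str

    def matches(self, answer: str) -> bool:
        return bool(self.claim.search(answer))

    def evidence_present(self, transcript: str) -> bool:
        return bool(self.evidence.search(transcript))


WORKER_POOL_RULE = HighRiskRule(
    claim=re.compile(r"(raise|increase).{0,40}worker[- ]pool", re.I),
    evidence=re.compile(r"(max_children reached|pool .{0,20}saturated)", re.I),
    caveat="Caveat: no collected signal shows the worker pool was saturated.",
)


def gate(answer: str, transcript: str, rules=(WORKER_POOL_RULE,)) -> str:
    """Append a caveat for every high-risk claim that lacks transcript evidence."""
    for rule in rules:
        if rule.matches(answer) and not rule.evidence_present(transcript):
            answer = f"{answer}\n\n{rule.caveat}"
    return answer
```

Because the rule is plain data plus two pure functions, the no-LLM self-test in CI can exercise it with fixed strings and no network access.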

11. Evolution

A harness that does not learn just hands you the same lesson in fresh packaging every week. The whole thing pays off when failures stop being incidents and start becoming infrastructure.

A context miss that keeps happening turns into a new injection rule. A bad tool result turns into a stricter schema. A dangerous command turns into a permission gate. A missed edge case turns into a test. A correction you keep making turns into a stored memory. A workflow you keep repeating turns into a skill. It is the post-mortem loop, applied to an agent: do not just write up what went wrong, change the system so that whole class of failure has a harder time coming back.

Every row in this table actually happened.

| Failure | What it became |
| --- | --- |
| Connector errors silently returned empty output | A stricter typed schema (`CommandResult` with `error_code`) |
| Worker-pool advice with no saturation evidence | A deterministic verify-gate (section 10) |
| Exploration storms burning tokens | Profiled collectors and proto-skills (sections 3 and 9) |
| A diagnostic command behaving differently than mocked | A regression case in the evalset |
| Provider 429s killing a turn | Backoff and graceful degradation (section 7) |
| Re-investigating the same box from scratch | Durable findings and recall (section 6) |

The loop is wired up now. When the LLM verifier has to step in because the deterministic rules could not decide, the harness logs a structured record of it: the recommendation, the evidence, and why the cheap gate abstained. Those records get reviewed, and the patterns that keep showing up get written back as new deterministic rules, each with its own CI test. Over time the cheap gate covers more ground and the expensive verifier fires less. The judge sweep does the same job on the prompt side. Failures do not just get fixed, they get turned into the check that stops the next one of their kind.
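
A structured abstain record and the review step over it could be as small as the sketch below. The record fields mirror the three items named above (recommendation, evidence, why the cheap gate abstained), but the class name, the JSON-lines format, and the `min_count` threshold are assumptions for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class EscalationRecord:
    """Hypothetical record written when the deterministic gate abstains."""
    recommendation: str
    evidence: list[str]
    abstain_reason: str


def to_log_line(record: EscalationRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def recurring_reasons(log_lines: list[str], min_count: int = 2) -> list[str]:
    """Return abstain reasons seen at least min_count times, most frequent first."""
    counts = Counter(json.loads(line)["abstain_reason"] for line in log_lines)
    return [reason for reason, n in counts.most_common() if n >= min_count]
```

Reasons that keep recurring are the natural candidates for the next deterministic rule and its CI test.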

The test to leave you with

Next time your agent fails, resist the urge to ask only whether the model was good enough. Ask which layer of the harness ran out of road.

Was the instruction missing? Was the context wrong, or just stale? Was the tool schema too vague to act on? Did the command run somewhere it shouldn’t have? Did the work need durable state, or orchestration, or to be handed to a sub-agent? Was there no skill for it, no verification on the way out, no trace to look back through? And did anything actually change after the last time this broke?

Run that list against your own incidents and the answer is almost never “the model is dumb.” It is a layer that was not there yet. That is the shift worth internalizing. The model still matters, but dependability gets built in the system around it, not inside it.