Research.tech · Krakow · April 2026

Harness
engineering.

Making AI-assisted development actually reliable.

The problem

Better prompts won't save you.

  • Complex task, inconsistent output.
  • More instructions, more drift.
  • Bigger model, same failure mode.
The fix

Harness = scaffolding around the model.

01 — Context

What it sees.

AGENTS.md hierarchy, path-scoped rules, generated docs injected on demand.

02 — Enforcement

What it can't skip.

Hooks that run code, not vibes. Verification as a hard gate, not a prompt.

03 — Determinism

What it shouldn't think about.

If inputs map to outputs, ship a CLI. The agent invokes — it doesn't reinvent.

04 — Dispatch

Who helps it.

One orchestrator, specialist sub-agents, parallel models, budgeted edits.

Before we dive in

Run it remote. Not on your laptop.

  • Disposable — agents can't wreck your machine.
  • Reproducible — Dockerfile + JSON config lock the env.
  • Parallel — many agent sessions, no collisions.
  • Powerful — throw a 32-core VM at it when you want.

A devcontainer is Docker + a small spec. VS Code, Codespaces, or any SSH target opens it.
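A minimal sketch of such a spec. The image, host requirements, and setup command here are illustrative assumptions, not taken from the talk:

```jsonc
// .devcontainer/devcontainer.json (JSONC) — hypothetical minimal spec
{
  "name": "agent-sandbox",
  "build": { "dockerfile": "Dockerfile" },   // env locked by the Dockerfile
  "hostRequirements": { "cpus": 32 },        // ask the host for the big VM
  "postCreateCommand": "bun install"         // reproducible setup on every open
}
```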

Demo 01

Context in action.

  • AGENTS.md hierarchy
  • .claude/rules/*
  • rules-injector hook

code.claude.com/docs/en/hooks-guide
Pillar 02 — Enforcement

Hooks run code, not vibes.

Prompts apply only when the model decides to apply them. Hooks fire every time. That's the difference between suggestion and guarantee.
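Concretely, a hook is registered as configuration the runtime executes on every matching event. A sketch of a PreToolUse hook in .claude/settings.json, where the matcher and command are assumptions chosen for illustration:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "bun run check:architecture" }
        ]
      }
    ]
  }
}
```

Every Edit or Write tool call now triggers the check, whether or not the model remembers it exists.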

Hooks — surface

Every meaningful moment is a hook point.

Claude Code hook lifecycle
Hooks — resolution

You target with surgical precision.

How a PreToolUse hook resolves
Prompts suggest.
Hooks guarantee.
Pillar 03 — Determinism

If it's deterministic, don't ask the LLM.

  • Same inputs → same outputs → ship a CLI.
  • The agent invokes. It doesn't reinvent.
  • Cheap, fast, testable, diffable. No token tax.
  • check:architecture
  • generate:skills-inventory
  • rt-debug
  • investigate
  • agent-browser
  • bun run verify
code.claude.com/docs/en/skills
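The first bullet above, same inputs to same outputs, can be sketched as a tiny check the agent invokes instead of reasoning its way through. All names here are hypothetical:

```typescript
// check-imports.ts — hypothetical deterministic check: same input, same output.
// The agent runs `bun check-imports.ts <file>` instead of judging layering
// rules itself.
const FORBIDDEN: Record<string, string[]> = {
  // UI code must not import from the database layer.
  "src/ui/": ["src/db/"],
};

export function violations(file: string, imports: string[]): string[] {
  const out: string[] = [];
  for (const [scope, banned] of Object.entries(FORBIDDEN)) {
    if (!file.startsWith(scope)) continue;
    for (const imp of imports) {
      if (banned.some((b) => imp.startsWith(b))) {
        out.push(`${file}: forbidden import ${imp}`);
      }
    }
  }
  return out;
}
```

Because the rule table is plain data, the check is cheap to run, trivial to unit-test, and its output diffs cleanly in CI.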
Trusted workflows

Skills turn prompts into procedures.

  • Rigid workflow — same steps, every time.
  • Verification built in — the skill ends when it's proven.
  • Dispatches sub-agents — the skill decides when to fan out.
  • Versioned in the repo: .claude/skills/*/SKILL.md.

Recipes, not suggestions. The skill enforces the steps.
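A skill file under .claude/skills/ might look like this. The name, frontmatter fields, and steps are invented for illustration:

```markdown
---
name: watch-pr
description: Poll an open PR until checks pass, then report.
---

1. Fetch the PR's check status.
2. If a check failed, read its logs and propose a minimal fix.
3. Run `bun run verify`; the skill only ends when it passes.
4. If the failure spans several services, dispatch one sub-agent per service.
```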

Demo 02

/investigate in action.

Sentry · spans · logs · Postgres → parallel hypotheses → one root-cause report
Pillar 04 — Dispatch

One orchestrator. Many specialists.

  • Claude Code runs as the orchestrator.
  • Codex · Gemini · Claude dispatched in parallel.
  • .subagents/ — pessimist, optimist, risk-analyst.
  • Budget system — N direct edits per sub-agent dispatch.
code.claude.com/docs/en/scheduled-tasks
The time axis

Loops: agents that don't sleep.

  • Wrap any skill on an interval — /loop 3m /watch-pr
  • Polls, retries, audits — without you in the seat.
  • Scheduled tasks turn Claude from REPL into a daemon.

Leave it running. Come back to results.
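The wrapper itself is simple. A hypothetical sketch of the interval logic, bounded here so it terminates, unlike a real always-on loop:

```typescript
// loop.ts — hypothetical sketch of a /loop-style wrapper: re-run a task
// on a fixed interval, a bounded number of times for demonstration.
export async function loop(
  task: () => Promise<void>,
  intervalMs: number,
  times: number,
): Promise<void> {
  for (let i = 0; i < times; i++) {
    await task(); // e.g. invoke the /watch-pr skill non-interactively
    if (i < times - 1) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}
```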

The closed loop

Nothing is "done" until verified.

  • Edit → verify-reminder → escalating nudge.
  • Finish → verify-stop → hard block.
  • Skill → adherence budget → block on drift.

Self-correcting, not drifting.
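The hard block can be as small as a Stop hook that exits non-zero when verification fails. A sketch, assuming the convention that a hook's non-zero exit blocks the action and its stderr is fed back to the agent:

```shell
# verify_gate — hypothetical Stop-hook body. Runs the given verify command;
# returns 2 (block, with a message on stderr for the agent) when it fails,
# 0 (allow the stop) when it passes.
verify_gate() {
  if ! "$@"; then
    echo "Verification failed; keep working until it passes." >&2
    return 2
  fi
}
```

In a real hook script this would wrap something like `bun run verify` and end with `exit`, so the exit code reaches the runtime.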

If you're building on agents

Four rules.

  • 01 — Hook what matters. Don't trust prompts.
  • 02 — Ship a CLI for deterministic work. Don't make the LLM think.
  • 03 — Make verification non-skippable. A gate, not a suggestion.
  • 04 — Orchestrate models. Don't pick one.

Questions & pushback welcome.

Yuriy Babyak
Co-founder, Research.tech
Scan to connect