How to Master Agentic Engineering: Patterns, Security, and TDD
Agentic engineering has crossed a threshold most practitioners felt but couldn’t name — the moment when an AI agent writes more code than you do. Simon Willison, software engineer and AI researcher, laid out exactly where we are, why it’s different from everything before, and how to operate safely at this capability level in his fireside chat at the Pragmatic Summit in San Francisco, hosted by Eric Lui from Statsig. This tutorial breaks down every pattern, security framework, and workflow technique he shared, giving you a practical implementation path you can start using today.
What This Is
The fireside chat, published March 14, 2026, on Simon Willison’s blog, captures a practitioner-level conversation about what Willison calls agentic engineering — the discipline of designing, coordinating, and securing AI agent workflows as the primary mode of software development.
This isn’t a discussion about AI as a coding assistant. It’s about a structural shift in how software gets built: engineers evolve from code writers into system architects who direct, verify, and constrain AI agents doing the actual implementation work. The NotebookLM research report frames this precisely: “The value of an engineer’s contribution has shifted from implementation details to system architecture, agent coordination, and strategic problem decomposition.”
Willison identifies three distinct stages of AI adoption that most engineering teams move through:
Using AI as a search tool — asking ChatGPT questions, treating it like a smarter Stack Overflow
AI as a coding collaborator — the model writes snippets, you integrate them
AI as the primary coder — the agent writes more code than you do, and you shift into a review and architecture role
Stage 3 is where things get both powerful and dangerous. Willison points to StrongDM as an example of a team at the far end of this progression: “nobody writes any code, nobody reads any code” — an approach where agents generate and test code autonomously. According to the research report, developers currently use AI in approximately 60% of their work, though full delegation (zero human review) covers only 0–20% of tasks, reflecting a collaborative model that still keeps humans in the loop for critical decisions.
Willison also discusses the model capability milestones that made this shift possible. GPT-4 proved that LLMs could be genuinely useful for real engineering tasks. Claude Code hitting its one-year milestone in November 2025 marked an inflection point where reliability became consistent enough to trust for production-level work. He specifically called out Claude Opus 4.5 as the first model that earned his confidence for “familiar problem classes” — where he already knows what correct output looks like.
The underlying technical architecture powering this shift is the Model Context Protocol (MCP), described in the research report as the “USB-C of AI integrations.” MCP standardizes how agents connect to external data sources and services, eliminating the bespoke adapter problem that previously made every enterprise integration a custom engineering project. Its Python and TypeScript SDKs reached over 97 million monthly downloads by 2026, making it the de facto industry standard for agent-to-tool communication.
Alongside MCP, the Agent-to-Agent (A2A) protocol handles task delegation and capability discovery between specialized agents, while WebMCP gives agents structured access to web content via the navigator.modelContext API. These three protocols together — MCP, A2A, and WebMCP — form the integration stack that makes multi-agent systems viable at production scale.
The key insight: agentic engineering is not about blind delegation. It’s about calibrated delegation. You fully delegate only in domains where you can verify the output — through automated tests, manual inspection, or reference to an existing working system.
Why It Matters
The practical impact of agentic engineering is measurable, not theoretical. The research report documents a CTO whose project, estimated at 4 to 8 months, was completed in two weeks using Claude-powered development tools (Augment Code). Rakuten engineers used Claude Code to implement a complex extraction method in vLLM — a library with 12.5 million lines of code — and the agent completed it autonomously in seven hours with 99.9% accuracy.
For engineering teams, the workflow implications break into four categories:
Onboarding speed: Traditional onboarding takes weeks or months to reach productivity. With agentic tooling, the timeline collapses to hours. The agent carries deep codebase knowledge from context, so a new contributor can make meaningful progress from day one using AI-mediated workflows.
Capability expansion: AI fills knowledge gaps. Willison released three Go projects recently while not being a fluent Go programmer — he simply coded them using agents, learning through doing rather than studying. This matters for marketing teams, data teams, and legal departments who now have access to engineering capability without the prerequisite of becoming engineers.
Full-stack reach: The research report notes that AI allows engineers to work “full-stack across frontend, backend, databases, and infrastructure” — eliminating the coordination overhead that slows cross-functional projects and makes specialists dependent on one another.
Open source disruption: Willison raises a pointed observation about the open source ecosystem: why use a third-party date picker library when an agent can write exactly the date picker you need? This threatens specialized component libraries (he singles out Tailwind’s component business) while simultaneously deepening dependence on foundational open source infrastructure that agents rely on to generate their outputs.
For security practitioners, the stakes are equally significant. Agentic systems introduce a new class of vulnerability — prompt injection at scale — that traditional security tools cannot detect because the attack vector is natural language, not code. An attacker embedding malicious instructions in a meeting invite or a resume can cause a high-privilege agent to exfiltrate data through entirely authorized channels, without any user clicking a link.
The Data
Here is a comparison of the three AI adoption stages Willison describes, mapped to observable engineering behaviors and key risks — drawn from his Pragmatic Summit talk and cross-referenced with the research report:
| Stage | Primary Mode | Code Delegation | Key Risk | Required Safeguard |
| --- | --- | --- | --- | --- |
| Stage 1: Query | AI as search/assistant | 0–5% | Over-reliance on incorrect answers | Verify against documentation |
| Stage 2: Collaboration | AI writes snippets, dev integrates | 20–40% | Integration errors, style drift | Code review, test coverage |
| Stage 3: Orchestration | Agent writes majority of code | 60–80%+ | Prompt injection, hallucinated logic | TDD, sandboxing, authorization gates |
And here is the 2026 protocol landscape — the infrastructure powering multi-agent systems — from the research report:
| Protocol | Focus | Function | Adoption Signal |
| --- | --- | --- | --- |
| MCP (Model Context Protocol) | Agent-to-Tool | Standardized interface for agents to connect to any external data source or service | 97M+ monthly downloads (Python + TypeScript SDKs) |
| A2A (Agent-to-Agent) | Agent-to-Agent | Capability discovery and task delegation between different AI agents | Emerging standard |
| WebMCP | Agent-to-Web | Structured web access via navigator.modelContext API | Early adoption |
The research report cites a 30% reduction in development overhead and 50–75% acceleration in task completion for teams that have standardized on MCP for production deployments — compared to the bespoke integration approach that preceded it.
How to Do It
This is the practical core. Willison laid out specific techniques in his Pragmatic Summit fireside chat that you can implement immediately. Walk through each phase in sequence — they build on one another.
Prerequisites
An agentic coding tool (Claude Code, Cursor, Augment Code, or equivalent)
A version-controlled codebase with existing test infrastructure
Access to a sandboxed development environment (local Docker, cloud container, or Anthropic-managed environment)
A project where you understand the expected outputs well enough to verify agent work
Phase 1: Build and Maintain a High-Quality Codebase Template
Willison’s foundational principle: agents follow existing patterns with remarkable consistency. Your template is your most powerful quality control mechanism — more powerful than any prompt you write.
Before issuing a single agent instruction, establish your project template. It should define:
Directory structure — where tests live, where source lives, how modules are organized
Test style — whether you’re using pytest, Jest, Go testing, or another framework; what a typical test file looks like
CI configuration — what automated checks run on every commit
Documentation conventions — docstring format, README structure, inline comment style
# Install cookiecutter for template-based project scaffolding
pip install cookiecutter
# Create a project from your agentic template
cookiecutter gh:your-org/agentic-project-template
# Or scaffold manually
mkdir -p myproject/{src,tests,docs}
touch myproject/tests/conftest.py
touch myproject/.github/workflows/ci.yml
touch myproject/README.md
Once the template exists, the agent will replicate it. When you ask it to add a new feature, it will mirror the existing test style, directory patterns, and CI hooks automatically — without being told to. The research report confirms this: agents produce superior output when working within established templates, and will follow those patterns “to a tee.”
Phase 2: Mandate Red-Green TDD Before Every Feature
This is the highest-leverage technique Willison describes. The instruction is deceptively simple — a five-token addition to your prompt:
“Use red-green TDD before implementation.”
Before writing any implementation code, the agent writes a failing test that defines the expected behavior, then writes the implementation that makes the test pass.
Here is what a TDD-first agent workflow looks like in practice:
You: Add a function that extracts all URLs from a markdown document.
Use red-green TDD before implementation.
Agent workflow:
1. Writes tests/test_url_extractor.py with 5+ test cases
(empty doc, single URL, multiple URLs, malformed URLs, nested links)
2. Runs the tests → confirms they fail (red)
3. Implements src/url_extractor.py
4. Runs the tests → confirms all pass (green)
5. Refactors for clarity, re-runs tests to confirm they still pass
Here is what the agent produces at step 1, before any implementation:
# tests/test_url_extractor.py
import pytest
from src.url_extractor import extract_urls

def test_empty_document_returns_empty_list():
    assert extract_urls("") == []

def test_single_markdown_link():
    doc = "Check [this out](https://example.com)"
    assert extract_urls(doc) == ["https://example.com"]

def test_multiple_links():
    doc = "[A](https://a.com) and [B](https://b.com)"
    result = extract_urls(doc)
    assert len(result) == 2
    assert "https://a.com" in result
    assert "https://b.com" in result

def test_bare_urls_are_captured():
    doc = "Visit https://example.com directly"
    assert "https://example.com" in extract_urls(doc)

def test_malformed_url_excluded():
    doc = "[broken](not-a-url)"
    assert extract_urls(doc) == []
Willison’s point is sharp: “tests are no longer even remotely optional” when working with agents because the human cost of writing them has dropped to near zero. The agent writes the tests. The upside — reliable, runnable verification of every behavior — is substantial. Skip tests and you’re reviewing agent output on faith.
Phase 3: Add Manual Testing via Server and Curl
Beyond automated unit tests, Willison developed a tool called Showboat specifically for agentic manual testing workflows. The concept: instruct the agent to spin up a server, then test it using curl commands and capture the actual output as markdown documentation.
# Add this instruction to your agent session after implementation:
"After implementation, start the development server on port 8000
and run manual tests using curl. Document each command and
its actual output."
# Agent executes and documents:
uvicorn src.main:app --port 8000 &
curl -X POST http://localhost:8000/extract \
-H "Content-Type: application/json" \
-d '{"content": "Check [this](https://example.com)"}'
# Actual output: {"urls": ["https://example.com"]}
curl http://localhost:8000/extract
# Actual output: 405 Method Not Allowed
Showboat captures this as a reproducible markdown testing record — actual commands, actual outputs. This is particularly valuable for HTTP APIs, CLI tools, and any interface where unit tests don’t capture real network behavior or filesystem side effects.
Phase 4: Implement Conformance-Driven Development
For systems that need to work correctly across multiple frameworks or languages, conformance-driven development is the right architecture. The technique:
Write a test suite that defines the full expected behavior
Implement that behavior in multiple languages or frameworks (Go, Node.js, Django, Starlette)
Run the same test suite against all implementations
The test suite becomes the living specification
# conformance_tests/test_extraction_spec.py
# This suite runs against ANY implementation via a configurable URL
import os
import requests

BASE_URL = os.environ.get("IMPLEMENTATION_URL", "http://localhost:8000")

def test_spec_single_url():
    r = requests.post(
        f"{BASE_URL}/extract",
        json={"content": "[A](https://a.com)"}
    )
    assert r.status_code == 200
    assert "https://a.com" in r.json()["urls"]

def test_spec_empty_content():
    r = requests.post(f"{BASE_URL}/extract", json={"content": ""})
    assert r.status_code == 200
    assert r.json()["urls"] == []
Run it against each implementation:
# Test Go implementation
IMPLEMENTATION_URL=http://localhost:8080 pytest conformance_tests/
# Test Django implementation
IMPLEMENTATION_URL=http://localhost:8001 pytest conformance_tests/
# Test Starlette implementation
IMPLEMENTATION_URL=http://localhost:8002 pytest conformance_tests/
This lets you reverse-engineer standards, validate multiple implementations simultaneously, and maintain a single source of truth for system behavior — regardless of what language or framework is underneath.
Phase 5: Sandbox Your Agent’s Execution Environment
Willison is explicit: never run agents with full system access if you can avoid it. He admits to using --dangerously-skip-permissions on his own Mac while fully understanding the risk — but that is a calculated personal decision, not a team practice.
For anything that touches real data or production systems, run the agent inside a container with network, memory, and lifetime limits.
Key flags:
– --network none disables all outbound network access by default — enable selectively as needed
– --memory 2g caps memory to prevent runaway processes
– --rm cleans up the container after each run
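A Docker invocation combining these flags might look like the following. The image name, mount path, and inner command are illustrative assumptions — only the flags mirror the list above:

```shell
# Sketch of a locked-down agent container. "your-agent-image" and
# "run-agent-task" are placeholders, not real artifacts.
docker run --rm \
  --network none \
  --memory 2g \
  -v "$(pwd)":/workspace \
  -w /workspace \
  your-agent-image:latest \
  run-agent-task
```

If the agent legitimately needs to reach a package registry or API, attach a dedicated network with an egress allowlist rather than dropping `--network none` entirely.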
Managed environments like Claude Code for the Web (Anthropic-managed containers) handle sandboxing automatically. The tradeoff is capability constraints — which is exactly the point.
Phase 6: Enforce Human Authorization Gates on High-Risk Actions
Any agent action that communicates externally — sending emails, making API calls, posting to webhooks — must route through a human approval checkpoint before execution. This is a hard architectural requirement, not a prompt engineering suggestion. The research report is clear: rely on structural enforcement, not on asking the model nicely to be careful.
# Authorization gate pattern for agent actions
from enum import Enum

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class AgentAction:
    def __init__(self, action_type: str, payload: dict, risk_level: RiskLevel):
        self.action_type = action_type
        self.payload = payload
        self.risk_level = risk_level

def execute_action(action: AgentAction):
    # Placeholder: dispatch to the real side-effecting implementation
    # (send the email, call the API, post to the webhook)
    ...

def execute_with_gate(action: AgentAction):
    if action.risk_level == RiskLevel.HIGH:
        print("\n⚠️ APPROVAL REQUIRED")
        print(f"Action type: {action.action_type}")
        print(f"Payload: {action.payload}")
        approval = input("Approve this action? [y/N]: ").strip().lower()
        if approval != "y":
            raise PermissionError(f"Action '{action.action_type}' rejected by operator")
    return execute_action(action)

# Example: agent wants to send an external email
send_email_action = AgentAction(
    action_type="send_email",
    payload={"to": "client@example.com", "subject": "Report", "body": "..."},
    risk_level=RiskLevel.HIGH
)

execute_with_gate(send_email_action)  # Will prompt for approval
Also enforce data flow rules: if an agent session has accessed data tagged as sensitive, block all external communication channels for that session. Short-lived scoped tokens (not long-lived API keys) ensure that even if a session is compromised, the blast radius is bounded to that session’s specific permissions.
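A minimal sketch of that data-flow rule: a one-way "taint" latch on the session that permanently blocks outbound channels once sensitive data has been read. Class and tag names are illustrative, not part of any real framework:

```python
# Session-level data-flow enforcement: once a session touches data tagged
# "sensitive", every external channel is blocked for the rest of the session.
class AgentSession:
    def __init__(self):
        self.tainted = False

    def read_resource(self, data: str, tags: set[str]) -> str:
        if "sensitive" in tags:
            self.tainted = True  # one-way latch: never reset mid-session
        return data

    def send_external(self, channel: str, payload: str) -> None:
        if self.tainted:
            raise PermissionError(
                f"Outbound '{channel}' blocked: session accessed sensitive data"
            )
        # ... perform the real send here ...

session = AgentSession()
session.read_resource("Q3 payroll export", tags={"sensitive"})
try:
    session.send_external("webhook", "weekly summary")
    blocked = False
except PermissionError:
    blocked = True  # exfiltration attempt structurally denied
```

The point is structural: the check lives in the execution layer, where a prompt-injected instruction cannot talk its way past it.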
Expected Outcomes After All Six Phases
After implementing these patterns:
– Test coverage above 80% maintained automatically as the agent adds features
– Manual test documentation generated alongside each implementation, with real command output
– Zero agent actions touching external systems without explicit human approval
– Portable implementations that all pass a shared conformance suite
– Agent output that closely matches your existing codebase style
– A sandboxed execution environment that limits damage if a prompt injection attack succeeds
Real-World Use Cases
Use Case 1: Legal Document Triage at a Mid-Size Law Firm
Scenario: A legal operations team needs to classify incoming contracts by type, flag high-risk clauses, and route documents to the appropriate practice group — work previously handled by junior associates reviewing every document manually.
Implementation: Deploy a three-agent pipeline: a Parser agent extracts text and metadata from incoming files; a Classifier agent applies risk heuristics and contract-type logic; a Dispatcher agent routes documents and sends routing notifications. The research report specifically calls out legal triage as an example of non-technical teams deploying agentic workflows without engineering support. The human authorization gate triggers for any contract flagged as high-risk before the Dispatcher sends external notifications.
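The three-agent hand-off could be sketched as a simple pipeline with the authorization gate on the high-risk path. Everything here — the risk heuristic, field names, and agent functions — is a hypothetical illustration, not the firm's actual system:

```python
# Illustrative three-stage triage pipeline: Parser → Classifier → Dispatcher.
from dataclasses import dataclass

@dataclass
class Document:
    name: str
    text: str

def parser_agent(raw: Document) -> dict:
    # Extract text and basic metadata (a real version would call an LLM/OCR)
    return {"name": raw.name, "text": raw.text, "words": len(raw.text.split())}

def classifier_agent(parsed: dict) -> dict:
    # Toy risk heuristic: flag contracts mentioning liability-style terms
    risky_terms = ("indemnity", "liability", "penalty")
    parsed["high_risk"] = any(t in parsed["text"].lower() for t in risky_terms)
    return parsed

def dispatcher_agent(classified: dict, approve) -> str:
    # High-risk documents require human sign-off before external routing
    if classified["high_risk"] and not approve(classified):
        return "held-for-review"
    return "routed"

doc = Document("msa.pdf", "Supplier accepts unlimited liability for breach.")
status = dispatcher_agent(
    classifier_agent(parser_agent(doc)),
    approve=lambda d: False,  # no human approval granted in this run
)
```

Because approval is a parameter of the Dispatcher, the human gate is part of the pipeline's structure rather than a prompt-level request.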
Expected Outcome: Contract intake time drops from 48 hours to under 2 hours. Associates shift from processing every document to reviewing only flagged exceptions. Engineering support is not required for ongoing operation or modifications — the legal ops team manages the workflow directly.
Use Case 2: Full-Stack Feature Development by a Solo Founder
Scenario: A developer who is not a fluent Go programmer needs to ship a web API in Go because the performance requirements demand it — but learning Go from scratch would take months.
Implementation: Establish a Go project template with standard directory structure, go test configuration, and a CI pipeline. Instruct the agent to use red-green TDD. Provide the functional specification: “an HTTP endpoint that extracts URLs from a markdown document and returns them as JSON.” Let the agent implement, test, and iterate against the conformance suite.
Expected Outcome: A production-ready Go API with full test coverage, shipped in hours. Per Willison’s observation, this pattern allows developers to release projects in languages they don’t fluently speak by providing templates and test-driven constraints that maintain output quality.
Use Case 3: Enterprise Codebase Migration at Scale
Scenario: A platform team needs to implement a new data extraction method inside a massive, legacy library with 12.5 million lines of code, without breaking any existing behavior.
Implementation: Load the codebase into a capable agent with full context (the research report documents Rakuten using Claude Code for exactly this scenario). The agent reads existing patterns, identifies the correct insertion point, implements the extraction method, runs the existing test suite against its changes, and iterates until all tests pass.
Expected Outcome: Per the research report, Rakuten’s agent completed this task in seven hours with 99.9% accuracy — a task that would have taken a senior engineer several days to handle safely without breaking existing functionality.
Use Case 4: Marketing Analytics Automation Without Engineering Support
Scenario: A marketing team needs a unified weekly analytics report pulled from three separate platforms — but engineering has no bandwidth to build or maintain a custom data pipeline.
Implementation: Use an MCP-enabled agent to connect to each analytics platform’s API (Google Analytics, HubSpot, and a custom CRM). The agent writes the data-fetching pipeline, handles authentication via short-lived scoped tokens, and generates a structured weekly markdown report. MCP eliminates the need for custom adapters for each platform — per the research report, standardizing on MCP reduces development overhead by approximately 30%.
Expected Outcome: The marketing team deploys and maintains this pipeline independently. Report generation time drops from a half-day of manual export-and-combine work to under five minutes. When a platform’s API changes, the team prompts the agent to update the integration rather than filing an engineering ticket.
Use Case 5: Surge Staffing for a High-Priority Product Sprint
Scenario: An engineering team needs to ship a feature that requires deep codebase knowledge but has no time to ramp up additional headcount before the deadline.
Implementation: Deploy an agentic “surge” model: bring in a specialist who uses the established codebase template and agent tooling to contribute immediately, without the traditional productivity dip that comes from learning a new codebase. The research report notes that “onboarding timelines have collapsed from weeks to hours” for teams using agentic workflows with high-quality templates.
Expected Outcome: The specialist ships production-ready code on day one, matching the quality and style of the existing codebase because the agent follows the established template consistently. No extended ramp-up period required.
Common Pitfalls
Pitfall 1: Skipping tests because “the agent seems reliable.”
This is how you accumulate silent failures. Willison’s position is unambiguous — “tests are no longer even remotely optional” precisely because agents are not perfectly reliable and you cannot review every line they generate. Red-green TDD is the verification mechanism that replaces close code review. Skip it and you’re flying blind on every feature.
Pitfall 2: Running agents with full system permissions on anything production-adjacent.
The research report describes the Lethal Trifecta: private data access + untrusted content processing + external communication. Each component alone is manageable. Combined, they create a zero-click attack surface where a malicious instruction embedded in a meeting invite or a resume causes a high-privilege agent to silently exfiltrate data through authorized channels. Sandbox first, grant permissions narrowly, implement authorization gates before any external communication.
Pitfall 3: Asking agents to extend a codebase without providing a template.
Agents follow patterns. If there are no patterns, they invent them — and the result is inconsistent, unmaintainable output. Always establish a reference implementation or template before asking the agent to extend a system. Per the research report, a high-quality codebase template is the primary lever for consistent agent output quality.
Pitfall 4: Using production data in agent testing environments.
Willison explicitly calls this out as a security risk. Build robust mock data generation instead — a button that creates one hundred randomized test users, or a script that simulates edge cases like users with thousands of records. Never expose real customer data to an agent session, regardless of how sandboxed the environment appears.
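A mock-data generator in this spirit can be a few lines of standard-library Python. The field names and edge-case values below are assumptions for illustration:

```python
# Generate randomized test users instead of exposing production data.
import random
import string

def random_user(rng: random.Random) -> dict:
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": name,
        "email": f"{name}@example.test",  # reserved test TLD, never routable
        # Deliberately include edge cases like users with thousands of records
        "record_count": rng.choice([0, 1, 5, 1000, 50_000]),
    }

def generate_users(n: int = 100, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # seeded so test runs are reproducible
    return [random_user(rng) for _ in range(n)]

users = generate_users()  # "a button that creates one hundred test users"
```

Seeding the generator means the agent can reproduce any failure it finds against the exact same synthetic dataset.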
Pitfall 5: Treating agent output as final without a machine-verifiable check.
Even when you trust the agent’s capability, verify through tests — not manual code reading. The agent should run its own tests and confirm they pass before presenting output to you. Your role as orchestrator is to review test results and architectural decisions, not to read every implementation line.
Expert Tips
Tip 1: Run multiple agent sessions in parallel to manage cognitive load.
Deep agentic work — reviewing outputs, making architectural decisions, debugging failures — is cognitively intensive. Willison structures his day by juggling multiple projects, switching after two to three hours of focused agentic work on each. This prevents exhaustion while keeping multiple workstreams moving.
Tip 2: Distinguish your “spaghetti” projects from your production templates.
Code quality matters primarily for long-term maintenance. For throwaway scripts or temporary tools, let the agent produce whatever structure it finds natural without cleanup. For production systems, maintain clean, well-structured templates. The research report notes that agents will replicate whatever quality level you establish in the template — so the template investment compounds.
Tip 3: Issue short-lived scoped tokens for every agent integration, without exception.
This is a hard security requirement. Per the research report: “ensure agents only have access to specific resources for specific session operations.” Never issue long-lived API credentials to an agent session. If the session is compromised via prompt injection, scoped tokens bound the blast radius to that session’s explicit permissions.
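As a sketch of the idea, a short-lived scoped token can be as simple as an HMAC-signed payload with an expiry. This is a standard-library illustration of the pattern, not a substitute for your identity provider's token service; the secret and scope names are assumptions:

```python
# Short-lived, scoped session tokens: signed payload + expiry check.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"  # illustrative; load from a secrets manager in practice

def issue_token(scopes: list[str], ttl_seconds: int = 900) -> str:
    payload = json.dumps({"scopes": scopes, "exp": time.time() + ttl_seconds})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload.encode()).decode() + "." + sig

def check_token(token: str, required_scope: str) -> bool:
    raw, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(raw).decode()
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged or tampered token
    claims = json.loads(payload)
    return time.time() < claims["exp"] and required_scope in claims["scopes"]

tok = issue_token(["analytics:read"], ttl_seconds=900)
```

A compromised session holding `tok` can read analytics for fifteen minutes and nothing else — that is the bounded blast radius in concrete form.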
Tip 4: Build conformance tests before implementing multiple framework versions.
If you know a system will eventually need to run in multiple languages or frameworks, write the conformance test suite first. This gives you a portable specification and prevents implementation drift as codebases evolve independently. The test suite becomes the source of truth, not any particular implementation.
Tip 5: Continuously probe what the current model can do that earlier versions couldn’t.
Willison frames ongoing capability exploration as a core engineering responsibility: “what can Claude Opus 4.6 do that we haven’t figured out yet?” Capabilities that were unreliable six months ago may now be solid. He notes that spellchecking went from unreliable to production-ready within approximately twelve months. Schedule periodic evaluation sessions where you test model capabilities against tasks you previously ruled out.
FAQ
Q: How do I know when to delegate a task fully to an agent versus keeping a human in the loop?
Willison’s heuristic: delegate fully when you “know what the answer should look like.” If you have a reference implementation, a passing test suite, or enough domain knowledge to verify the output confidently, full delegation is appropriate. If you’re in unfamiliar territory without verification mechanisms, keep a human in the loop for critical decisions. Per the research report, even experienced developers fully delegate only 0–20% of tasks currently — the rest is collaborative, with humans reviewing or guiding.
Q: What exactly is the Lethal Trifecta and how do I defend against it?
The Lethal Trifecta — as described by Simon Willison — is the convergence of: (1) agent access to private data, (2) agent processing of untrusted external content (emails, web pages, uploaded documents), and (3) agent ability to communicate externally (send emails, call APIs, post to webhooks). Any single component is manageable. All three together create a zero-click attack surface where a hidden instruction in ordinary content silently triggers data exfiltration. Defense layers: sandboxed execution, human authorization gates on all external actions, data flow enforcement rules that block outbound channels after accessing tagged sensitive data, and short-lived scoped tokens for all integrations.
Q: Is MCP mature enough for production use in 2026?
Yes. The research report documents MCP’s Python and TypeScript SDKs at over 97 million monthly downloads, establishing it as the de facto industry standard for agent-to-tool integration. For production deployments, use HTTP transport (not stdio) to support load balancing and multi-tenancy across agent sessions. The report cites a 30% reduction in development overhead and 50–75% acceleration in task completion for teams standardized on MCP, compared to bespoke integration approaches.
Q: Do agentic tools make specialized open source libraries obsolete?
Not foundational infrastructure — agents depend heavily on open source libraries and documentation to generate functional code. But specialized component libraries face real disruption. Willison raises the example directly: why install and configure a third-party date picker library when an agent can write exactly the date picker you need in the same time? Per his Pragmatic Summit talk, this threatens the business model of component-focused open source projects while deepening dependence on low-level infrastructure libraries.
Q: How should engineering leadership restructure team roles for agentic development?
The research report outlines three concrete shifts: (1) reinvent the junior/senior dynamic — use agents for prototype work and minor bug fixes (“papercuts”), preserve senior engineers for architectural decisions and output verification; (2) mandate red-green TDD as the standard workflow for all agent-generated code; (3) expand agentic tooling beyond engineering — deploy it to sales, legal, and HR teams who can now automate workflows without engineering support, removing traditional organizational bottlenecks. The critical role becomes the architect-orchestrator: someone who decomposes complex problems into agent-executable sub-tasks and maintains overall system coherence.
Bottom Line
Agentic engineering is not a future state — it is the current production reality for teams that have moved through the three adoption stages Simon Willison describes. The patterns that make it work — red-green TDD, high-quality codebase templates, conformance testing, sandboxed execution, and human authorization gates on high-risk actions — are implementable today with existing tools. The security risks represented by the Lethal Trifecta are real but architecturally addressable: structural enforcement, not prompt engineering, is the correct defense. Organizations that standardize on MCP as the integration layer, per the research report, and invest now in agentic engineering patterns will carry a structural advantage into an era where cycle times once measured in months are increasingly measured in hours. The shift from implementer to orchestrator is already underway — the question is whether your team has the patterns in place to make it work reliably and safely.