
The Architecture Hub: Teaching AI to Understand Your System So You Don’t Have To

I have argued before that the real gain from AI in software engineering is not only in code production. GenAI is definitely a useful tool for coding, but coding is not where the bottlenecks are. To be effective, not merely efficient, in coding, the design of the software being written is crucial. By now it’s pretty much a consensus that to be really productive with coding agents, you need to direct them carefully and, of course, provide the proper context.

Proper context is more than just good requirements and specifications. These might be good enough for greenfield projects where we’re starting from scratch. The reality for many existing companies and projects, however, is that our starting point is much muddier than we’d like, and simply connecting an AI to it isn’t enough. A system with hundreds of separate services communicating to implement different business flows and user interfaces is hard to follow, whether you’re the human who built it or a supercharged AI agent that understands code perfectly. Adopting AI effectively in such circumstances is not just letting the AI tool (e.g. Cursor, Claude Code) read and index the code. That’s an important prerequisite, but it’s not enough.

Any design methodology would at the very least require us to have knowledge of the existing system and the processes it implements. Otherwise we’ll be “stuck” with generic advice, which often becomes useless pretty quickly1. When dealing with a complicated system, we have to let the AI investigate on its own if we want it to help us with the design. This complicated internal knowledge, often domain-specific, has to be made available to the LLMs2 if we have any hope of the AI helping with the design.

Note that this isn’t an AI-only problem. I’ve often encountered the situation where there is, at best, a single engineer who remembers why a certain flow is implemented in a certain way, or why there are two separate endpoints implementing pretty much the same logic. It’s a human problem as well; as humans we just compensate for it by relying on tribal knowledge: old emails and Slack threads. This might be an option in some cases for an AI as well, but at best it is very inefficient.

On top of this, a lot of times, the reality of modern business software is that of a distributed architecture, with hundreds of services and legacy code coexisting with more recent rewrites. Cross-service flows can become very intricate, and they are often undocumented. Even when the knowledge exists (in someone’s head) it’s hard to puzzle things together, and practically impossible for an AI agent to understand it without proper architecture context. Humans can eventually trace flows, but they rarely document them. AI agents can probably do something close to that, but it’s very inefficient, both in running time and token cost.

If we want AI to design features, troubleshoot issues or help us in assessing impact of changes, we have to help it understand how the system fits together. The need existed well before AI took the stage, but LLM-based tooling both highlights the gap as well as offers a path to solve it. Humans are traditionally bad at maintaining documentation reliably. But given the right tools and direction, AI can also help in creating and maintaining the relevant documentation.

This is what led me to the Application Architecture Hub.

The Goal

The primary goal is pretty straightforward. Build a knowledge base that AI agents can query to understand system architecture. When an agent needs to design a feature, it should have context about existing patterns and dependencies. When an agent traces a bug, it should know which services participate in the flow. When an agent assesses the impact of a change, it should understand what depends on what.

We already know LLMs can read code and write documentation. Not only that, they do it repeatedly, consistently and tirelessly.

If we design the extraction and documentation process well, we can have agents that produce documentation that is actually useful. Not just generated API docs with lists of endpoints3, but actual structured documentation, semantically summarizing the code, with citations back to the actual source code.

In this sense, AI works much better. A human who goes through source code listings can spend hours building a mental model of the relationships between services4. An agent can produce a structured summary in minutes. Given the right extraction prompts, it can produce meaningful descriptions in a consistent format. And this, of course, scales across hundreds of repos. Contrast this with humans documenting different repos, each bringing their own style, preferences and assumptions about what matters, resulting in inconsistency that makes it very hard to reason and correlate across services.

LLMs also make incremental updates easier: they can compare (“diff”) the current state, identify what has changed and touch only the necessary sections. AI agents don’t get bored or decide that updating documentation is not a priority and can be pushed to a later sprint5. Humans rarely sustain this over time; they might invest initially, but entropy will win.

So my goal here is: have a living knowledge base where AI is used both to maintain it and consume it – AI agents are the prime consumers. Agents can query the hub to understand the system, as well as extract information and keep it up-to-date.

It turns out that humans (unsurprisingly) also need this. As I noted above, the introduction of LLMs to coding and design did not invent the problem of understanding the system. And given up-to-date, structured documentation, with AI helping to query it, humans find it useful as well.

AI-generated documentation isn’t a groundbreaking concept. What matters is that it is relevant and of high quality for the intended use cases. The thought here is that AI-based documentation, with proper engineering of the extraction process and relevant tooling, can outpace human-maintained documentation. Not because AI is smarter, but because it is smart enough, and tireless.

Designing the Architecture Hub

Even though it turns out the architecture hub is useful for humans, the driving force behind the design was consumption by LLMs and tools driven by LLMs. Even when humans use it, they do it using LLM-based tools.

Initially, I started researching and thinking about achieving scale – graph databases, maintaining large collections of documents, specifying potentially complex ontologies of objects. 

I can’t rule out the usefulness of these techniques just yet, but I quickly came to realize that I was prematurely optimizing6.

So I pivoted to a much simpler approach. The architecture hub is, for now, a simple Git repository. It’s not a code repository with implemented business flows and tests; there are no deployable artifacts. Instead, it maintains a series of markdown files organized consistently into several directories.

This in itself already allows for simple consumption – AI agents can easily read markdown files. It’s also easily reviewable and usable by humans. Combined with a GitHub MCP server, or simply cloning the repo locally, any AI agent can easily access the information.

The “unit of ingestion” is a single code repository. These usually already encapsulate a specific logic, and are easy to follow and build the tooling around.

Architecture Facets

We could have a single file per repository, describing each repo in detail. But this easily gets too large and unfocused. Different tasks (by agents or humans) require different types of information. For example, tracing a bug requires understanding events and call flows; assessing impact of changes requires understanding dependencies. Having a single giant file would mean that an agent would have to load everything and burn tokens on information it doesn’t need. It could easily pollute the context. Instead, I decided to structure the hub around different facets of the architecture.

The application architecture hub is structured around simple file system directories containing the files. Each directory represents a specific perspective (a facet) of the architecture: APIs, domain models, events produced/consumed, etc. A directory contains one markdown file per ingested code repository, and all the files follow a consistent template with consistent metadata. This is a consistent, predictable structure that is also easy to describe.

| Facet | What It Documents | Questions It Answers |
| --- | --- | --- |
| Domain | Data entities, relationships, types | What data does this service manage? How is the data structured? |
| API | Endpoints, request/response contracts | How do I call this service? What functionality does it offer, if any? |
| Events | Message topics, payloads, producers, consumers | What does this service emit or consume asynchronously? |
| Frontend | Frontend applications: state management, components, routing | How does the UI work? |
| External Dependencies | Databases, brokers, external services | What components and external services does this service depend on? |
| Dataflow | Inputs, transforms, outputs, sensitive data | How does data move through this service? |

The list of facets is stable and aims to document interesting aspects that often come up during design, and allow us to ask more complicated questions. It can of course be extended to include more aspects.


The design is therefore simple: one file per repository (usually named after the repository), per relevant facet7. If you need to understand the HTTP API exposed by the payments service (from a repo called “payments”), you simply look for `api/payments.md`. If you need to see which events this same service emits, you look in `events/payments.md`. This is a simple-to-follow structure, both for AI and humans.
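To make the convention concrete, here is a minimal sketch (hypothetical helper names, assuming the `<facet>/<repo>.md` layout described above) of how a consumer could resolve a facet document:

```python
from pathlib import Path

# Hypothetical helper illustrating the hub's layout: one markdown file
# per repository, under a directory named after the facet.
def facet_doc(hub_root: str, facet: str, repo: str) -> Path:
    return Path(hub_root) / facet / f"{repo}.md"

# The HTTP API of the "payments" repo lives at api/payments.md,
# its events at events/payments.md.
api_doc = facet_doc("architecture-hub", "api", "payments")
events_doc = facet_doc("architecture-hub", "events", "payments")
```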

Dividing the information into different files has other benefits beyond simple context window efficiency:

  • Easier to search (e.g. using grep) for specific facet information across repos. Remember that our prime motivation is system wide patterns (cross-repo).
  • Parallelism: it’s easier to divide work across sub-agents when they can ingest and search on separate file directories.
  • Incremental updates: updating a changed API usually does not require updating the domain model information, or external dependencies.
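The grep-style, cross-repo lookup can be sketched as follows – a simplified Python stand-in for what an agent or script might do, with the directory layout assumed from above:

```python
import re
from pathlib import Path

def search_facet(hub_root: str, facet: str, pattern: str) -> list[str]:
    """Return the repos whose document for `facet` mentions `pattern`.
    File names double as repo names, so each hit identifies a service."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(
        doc.stem
        for doc in Path(hub_root, facet).glob("*.md")
        if rx.search(doc.read_text(encoding="utf-8"))
    )

# e.g. which services document the (hypothetical) 'order.cancelled' event:
# search_facet("architecture-hub", "events", r"order\.cancelled")
```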

Note that searching the files does not exclude searching the code as well. In fact, the extraction takes care to maintain explicit code references. And when querying the hub I often find myself asking the agent to start from the architecture hub, but also use the git tools (either MCP or github CLI) to look into the specific code, based on the citations. 

The use of a simple Git repo brings the other immediate advantages of dealing with textual content – it’s versioned and easily reviewable, and it’s easy to see what gets updated and when.

The flow at a high level is therefore: ingest each code repository, produce one markdown file per facet, and keep everything versioned and reviewable in the hub repo.

Ingestion Pipelines

How does ingestion – creating or updating documentation – work?

As noted above, the main unit of ingestion is a code repository. Each code repository is ingested in turn, and the created artifacts reflect the original code repository. This allows us to debug, retry and review specific repos, and tie the ingestion into already existing CI processes. We don’t need to invent new relationships or mappings of code repositories to artifacts. It’s also easier to query specific code files using the hub as the guiding index when necessary.

Technically, we implement the extraction process as a series of agent skills: structured prompts with accompanying templates and scripts. These tell the extracting agent what to look for, how to search the codebase, and the format of the documentation file to produce.

Why skills?

Besides being text-based and therefore easily version-controlled, skills let us leverage the LLM’s built-in capability to understand the code and its semantics. With a good enough LLM, an agent with a skill can produce consistent results. We do use scripts for basic understanding of the hub (e.g. which repos are already ingested), and we could probably optimize with scripts that parse the code deterministically (similar to static code analysis), but we’re starting simple, with an implementation that doesn’t require any extra runtime beyond the running agent(s).

Each facet has two skills: one for extracting the facet from scratch, and one for updating the documentation. The update skill compares the change in the code against the current documentation state and only updates what’s changed. Full re-extraction is possible, but seems too expensive.
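The idea behind the update path can be sketched as mapping the files changed since the last ingestion (e.g. the output of `git diff --name-only <last_sha> HEAD`) to the facets whose documents likely need a refresh. The path patterns below are illustrative only, not the actual skill logic:

```python
# Illustrative path hints per facet; the real skills encode richer,
# convention-specific heuristics (e.g. NestJS controllers/decorators).
FACET_HINTS = {
    "api": ("controller", "routes"),
    "events": ("event", "consumer", "producer"),
    "domain": ("entity", "model", "schema"),
    "external-dependencies": ("config", ".env"),
}

def facets_to_update(changed_files: list[str]) -> set[str]:
    """Given paths changed since the last ingested commit, pick the
    facets whose documentation should be re-examined."""
    return {
        facet
        for path in changed_files
        for facet, hints in FACET_HINTS.items()
        if any(hint in path.lower() for hint in hints)
    }
```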

The skills define what to look for, depending on the facet they’re documenting. For example, the API skills look for HTTP controllers and decorators (we’re mostly NestJS-based); the event skills look for message schemas; the dependency skills look for definitions of connection strings, external endpoints, etc. All skills follow a template, so outputs are uniform in structure. All templates include a metadata section (repository URL, date of ingestion, git commit SHA of the repo at the time of extraction).

The ingestion pipelines themselves exist in two versions: remote and local. The difference is in how they use the data.

The remote version accesses the ingested code repo using the GitHub MCP server. It does not require a local clone, and can effectively work from anywhere with the proper credentials set up.

The local version uses git CLI to clone the ingested code repo locally to a temporary directory and then reads the code locally using file system tools. The local version is generally cheaper and more reliable. It does require more disk space.

In addition to producing the documentation files, the ingestion agents also update an existing llms.txt file, which serves as the hub’s index. This is a plain text file, listing all the different documented repos, and explaining the structure of the architecture hub.

The querying skills guide the agent to first look at this file, understand the hub’s structure and start the lookup from this point. Since the repository structure is simple, the llms.txt file structure is simple – one line per document created, with a simple one line description of the content, divided by the facets.

This makes locating documentation across different axes easy enough for a plain grep. For example, finding all domain documentation is a search for `domain/*.md` in the file; similarly, finding all information about the reservation service is simply grepping8 for `*/reservations.md`.

Ingestion itself can be triggered manually by any user (a GitHub Action invoked from the GitHub UI, or a script). It can also be invoked by a non-blocking CI step triggered on every merge to master/main – we want to update our documentation, but only with the changes that make it to the main branch.

The whole process is orchestrated by a single orchestrator agent (implemented as a skill as well), which launches sub-agents – one per facet.

The orchestrator takes care to clone the repository if needed, and then invokes the separate sub-agents to either create or update the documentation for each facet independently.

The motivation for launching sub-agents comes from two main drivers: resiliency and latency. Since the work of each sub-agent is independent, they do not interfere with each other: all of them just read the code and write independent files. They are invoked in parallel, so the overall process terminates earlier, and a failure in one sub-agent does not cascade to the others. Technically, it also means that the skills for each facet are separate, and therefore simpler, leaving less room for LLM mistakes. A single facet failure is also easier to troubleshoot and re-run if necessary.

Note that it is the orchestrator agent that updates the index (llms.txt) file. Technically, each sub-agent could update the index file on its own upon completion, but since this is a shared resource, we would run into overlapping write conflicts. Since this is file-system-based work, it’s easier to instruct the agents to return the result of their work as their output, and have the orchestrator update the index file. Updates to the shared resource then happen in one place – the orchestrator – and we avoid conflicts.
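The fan-out/fan-in pattern described above can be sketched with a thread pool. This is only an illustration: `run_facet_agent` stands in for invoking a facet skill, and the real orchestrator is itself an agent, not Python code:

```python
from concurrent.futures import ThreadPoolExecutor

FACETS = ["domain", "api", "events", "frontend",
          "external-dependencies", "dataflow"]

def ingest_repo(repo: str, run_facet_agent) -> tuple[dict, dict]:
    """Run one sub-agent per facet in parallel. Sub-agents only write
    their own files; the orchestrator alone updates the shared index,
    so there are no write conflicts. A failed facet is recorded and
    does not cascade to the others."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(FACETS)) as pool:
        futures = {pool.submit(run_facet_agent, repo, f): f for f in FACETS}
        for future, facet in futures.items():
            try:
                results[facet] = future.result()
            except Exception as err:
                failures[facet] = str(err)
    # Only now, in one place, would the orchestrator fold `results`
    # into llms.txt.
    return results, failures
```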

The ingestion itself can be triggered manually or as part of an automated process, e.g. after a successful merge and build of the master branch. In either case, the ingestion stops at creating a PR that can be reviewed by a human. Human review is still important, both to catch inaccuracies (which hopefully will decrease over time) and so that people learn to trust the information. Without it, the errors that are still possible at this stage will accumulate, and trust will erode. It’s important to build this level of trust in the process.

Querying the Hub

Once we have the documentation in place, we can start querying it.

Generally, the querying process is simply prompting an agent to read the documentation and construct a report.

Identifying the relevant facets and extracting the necessary information, including correlations across different documentation files, is where we let the LLM apply its reasoning. We just take care to have a consistent structure with enough information.

We have several “query” skills which instruct the agent to look in the index file, along with some other technical layout information. They also instruct the agent to cite its sources. This helps reduce hallucinations, and provides the result’s consumer (human or AI) with pointers to the source material. The actual querying and output really depend on the use case and the query issuer.

The query itself can come from a human user invoking an AI agent with a user interface (e.g. Cursor, Claude Code or some chat interface with access to the file system). And of course, it can be some other agent-driven process which is simply given access to the files. I have used the architecture hub as a context directory for a dialectic-agentic design debate – it works9.

There is no specific query language – we let the LLM interpret the query and work its way through the documentation. We can of course provide hints (“look at the ‘reservation’ service”), but this is not mandatory.

Examples of ad-hoc queries:

  • “Which services consume the financial-related events from the ‘financials’ service?”
  • “What overlap do we have in domain models between the payments service and the reservation service? And why?”
  • “Who is calling the accounting service?”

Technically, the query skill comes in three variations:

  • Remote: querying the hub using the GitHub MCP server
  • Local: querying the local file system, assuming the hub is locally available and up-to-date
  • Auto-Local: similar to Local, but first clones/pulls the architecture hub’s repo to a temporary local directory to make sure the information is up-to-date

Note that we can also instruct the agent to continue into the actual source code if the requested analysis needs it. Having the GitHub MCP available (or the code locally cloned) makes further investigation into source code only a tool call away for the agent. The documentation in the hub does not replace code indexing; it’s more about bridging between (technically) disconnected repositories and mapping/deriving semantic relationships where they exist. There is little value in trying to replicate the code indexing and understanding already performed by current coding agents and tools.

It’s interesting to see that even when humans query the hub, it’s done using AI agents. In fact, both the producer and the consumer of the hub are AI, even when directly instructed by a human user. It’s LLMs that produce the documentation, and LLMs that consume it. This also opens the possibility for an ingesting agent to verify itself simply by querying the hub for the changes it just introduced. By itself, this might not sound that interesting, but the scale makes it more so: maintaining technical documentation, with appropriate quality, now becomes a purely mechanical process that can scale more easily.

Structured Reports

Beyond ad-hoc queries, the hub supports reusable report templates. A report template is simply a prompt file, meant to be used with the query skill, that guides the agent through a more complicated analysis workflow. It specifies what to read, what to search for and how to format the output.

Using a report is simply prompting an agent with something like this:

Using the local query skill, follow the report instructions in ./reports/dependencies.md for the reservation service as the root service. 
Output your result to ~/tmp/dependencies_reservations.md.


This now launches the agent into looking into the documentation, mapping out services and their dependencies and producing a complete report with relevant pointers to source code.

An investigation that could take hours or sometimes days is done in minutes10.

We currently have several such predefined reports, each useful in different cases.

Dependency map
Given a specific service, map out all other services making API calls to it, and what other services it calls. It also maps out events produced and consumed by the services, as well as services sharing the DB11.

Useful when trying to estimate the blast radius of a given change.

Cross service flow analysis
A flow analysis traces a business process end-to-end across multiple services. The agent follows API calls, events, and data writes across service boundaries. The output is a sequence diagram plus a step-by-step breakdown with source citations.

“Trace the order cancellation flow” produces a sequence diagram showing the user request hitting the order service, the order service publishing a cancellation event, the payment service processing a refund, the notification service sending confirmation. Each step cites the documentation that describes it (which in turn cites the source code).

“Plain English” Flow Explainer
Not everyone reads technical documentation. Product managers and stakeholders need to understand flows without wading through event topic names and API paths. The plain English explainer produces a narrative description of a business flow. No technical jargon. Just a story of what happens and why. But it does it based on up-to-date technical documentation – the code is the truth.

Example output:

"When a customer cancels an order, the system first checks if the order is eligible for cancellation. If eligible, it reverses any payment charges and releases held inventory. The customer receives a confirmation email with the refund details. The host receives a notification about the cancelled booking."

This report is useful during discovery and planning. When a product manager asks “how does X work today?”, you can point them to the hub instead of scheduling a meeting with an engineer.

This report specifically also instructs the agent to use the web search tool to look for information in other online resources (e.g. the help center), which demonstrates the flexibility of the model. This is not a built-in feature of the architecture hub, just a tool available in the underlying platform, composed into the process via the prompt. In my view it’s an interesting case of the “Application Logic Lives in Prompts” principle of agent-driven applications.

Also, this report essentially produces very similar information to the “Cross service flow analysis” report, only phrased in a way that suits a different audience – another demonstration of a feature easily enabled by LLMs.

So How Do We Use It?

Regardless of the actual query being performed, we already see the value here: answering quick questions as well as generating more complicated reports, with deeper analysis.

For AI Agents

AI agents used in software are the primary intended audience here.

Several notable cases where this is used:

  • A troubleshooting agent that brings together information from bug reports and live monitoring data (logs, Datadog), and also interacts with the architecture hub to understand the relationships between services.
  • Design tasks and understanding impact of changes

For Humans

Information gathering was a pain before the introduction of AI coding agents. The simple fact that we have up-to-date technical documentation already allows us to use it daily.

Examples:

  • Onboarding to a new code repo – whether it’s new employees getting to know the system, or simply a neighboring team needing to make changes in a repo they don’t own. Understanding dependencies, call patterns and domain models.
  • During planning: understanding impact and inter-team dependencies. 
  • Mapping customer inquiries (specifying required data objects) to the APIs that provide them, across the system.
  • Quickly figuring out cross-repo dependencies in live design discussions; e.g. “what services consume these events?”
  • Understanding complex flows and data dependencies.

We also foresee more cases where this can be used: PR reviews, incident investigation, understanding compliance issues.

Anything that requires system-wide information that is reflected in the technical architecture.

It’s important to note what the hub should not be used for. It should not be used for understanding code or functionality of a single repository (or very few loaded into a workspace). At least not as a primary source. There are also better ways to understand the evolution of repos (git history). Rationale for designs should probably also be gleaned from other sources if they exist, using the hub as a way to validate decisions and track adoption.

Code tells you what happened, Git tells you when it happened, design documents and plans describe why things happen. The hub connects these perspectives across the system, and serves as a map to navigate the terrain.

Challenges and Roadmap

I would be misrepresenting things if I presented this as a fully solved problem. There are still remaining and expected challenges ahead.

First, staleness of data. 

Stale documentation is, in a way, worse than non-existent documentation, since it may mislead people (and LLMs). Code changes after the initial ingestion, and the documentation needs to be updated.

As it currently stands, the automated CI workflow is opt-in (teams need to enable it via a simple GitHub workflow variable set to “true”). But this is only for a limited rollout period; once we make sure everything works and iron out the kinks, we can flip the condition and make it opt-out.

Additionally, each update records the time of the update, and each file contains a change log, so it should be easy to spot documentation files that are not up-to-date.

Second, there is quality variance, which depends largely on the quality of the ingested code12. Messy code with inconsistent patterns produces worse documentation. Code that is consistent, with known patterns and proper naming conventions, is much easier for the LLM to understand and document. The extraction skills look for API controllers, type definitions or configuration files in specific places; if the code doesn’t follow these conventions, the quality of the generated documentation will degrade. We will fine-tune the extraction over time as we observe this, but this is largely a reactive measure.

Related to this is the problem of potential hallucination. Even though hallucinations are generally decreasing, at least with frontier models, this is still a potential issue, especially when an LLM is asked to describe the purpose or intent of a specific feature. As we know, it might make assumptions and present them confidently as facts. One way to mitigate this is by mandating citations of source code, which grounds the LLM’s output in the real code. This seems to reduce hallucinations, and it also enables humans to more easily review and cross-reference findings.

Another issue that might come up is cost. Running LLMs at scale costs money. This is the main reason for having separate “update” and “full ingest” skills: the update skill changes only what it finds has changed instead of re-producing the entire file. We’ll need to monitor this and see how things can be optimized if necessary, e.g. batching changes and re-ingesting only after a few commits/merges.

Related to cost is the general issue of scale, when it comes to quality of service. What happens when the hub includes hundreds of documents? How long will it take to query it (even when done on a local file system), and how good will the result be?

We may very well need to adopt a more scalable solution, e.g. a proper database rather than file system searches, if we want faster answers for more (concurrent) users.

Perhaps the hardest hurdle to overcome is adoption. For this to be adopted internally, it has to be better than the existing alternatives – not marginally better, but clearly better. So far the response from people who have seen it has been positive, and effort is being made to make querying as easy and painless as possible.

Some future thoughts involve also providing a mechanism to give feedback and local notes (inspired by the `annotate` and `feedback` commands in chub); but this is not implemented yet.

Adoption of course needs to be not just by humans querying it, but also by internal AI agents using it.

Beyond Initial Implementation

Currently the architecture hub has a solid foundation, and shows value. But there’s still work to do, some obvious, some less so.

In the short term, we need to increase coverage of all repos. This is more of a technical gap.

We will also need to fine-tune the extraction skills and associated templates. Some feedback is already incoming. The same goes for pre-defined reports.

After that we’ll need to make sure this is adopted by AI agents. In a sense, the application architecture hub should be part of the default context for all technical agents doing design, troubleshooting, and planning. This will require more standardized interfaces for querying and reports. 

Another important step – ingesting more relevant information sources. Two immediately relevant sources are infrastructure information and design decisions (ADRs). This will enrich the available information and allow us to answer and connect information in different layers of the technical architecture – all the way from “why was this designed this way?” to “how is this actually deployed?”

Other architectural aspects may be interesting as well. For example, a security facet, mapping out authentication and authorization information as well as data-sensitivity aspects. This can help agents understand and design secure software, consistent with the rest of the system.

As noted above, a feedback mechanism is also very useful for continuous, hopefully grassroots, improvement that will maintain and raise the quality of the information.

Other steps might include (depending on need) introducing semantic search (RAG?), so we avoid issues with terminology misalignment and don’t require the user to know the exact repo to start with.

When it comes to accessibility for larger audiences (not so much AI agents), a visual explainer – automatically produced diagrams – can prove useful for humans who need a living, breathing map of the system.

Takeaways

The architecture hub started from a simple observation13: AI agents are great at understanding code (and getting better), but larger systems, with a lot of moving parts, are harder to accommodate reliably in one agent’s context window. Knowing how services interact, where data flows, how changes propagate – this is intractable in a large distributed system. If we want AI to go beyond simply coding, we have to teach it what we know. Knowing the system was a problem even before AI came along; LLMs just exposed the gap and made it more obvious. We got hungry for more.

But given the right mechanisms and tools, LLMs also present a solution. We can now generate and update reliable technical documentation at scale, simply because it’s mechanized. 

LLMs emphasize the need and present the solution at the same time. In this system, AI is both the consumer and maintainer of architectural knowledge.

There are already some interesting points to learn from this (still ongoing) journey:

  • For this to work, the extraction process needs to be engineered. We need to make sure the quality is high and that it can scale technically and organizationally.
  • Architecture is built on different aspects. Having one document cover everything is hard, and inefficient. The idea of different facets is important for effectiveness as well as efficiency.
  • Humans in the loop are important to understand errors, but also to build trust in the system. We’re trying to extract years of human-generated knowledge (in the form of code) and let machines run with it.
  • The value is in the query. The documents themselves are great, but AI and people need answers. The hub’s main value will come from delivering answers; documents are just the substrate on which this is built.
  • The original motivation (and still the main one) is for AI coding agents to consume the knowledge. But it so happens that reliable documentation, with consistent templates and explicit citations, is extremely helpful for humans as well.

I’m betting that AI-maintained documentation can outpace human-maintained documentation. So far, feedback has been positive. 

But the real test will come with adoption: when people and agents use the architecture hub as the first place to look for information.

(and yes, all dashes in this post are hand-typed)


  1. This was also, unsurprisingly, one of the conclusions from the testing of Dialectic. See “Does Clarification Matter?” here. ↩︎
  2. That would be what I called the 2nd phase in a possible AI adoption roadmap. ↩︎
  3.  Which is also useful of course ↩︎
  4. HTTP calls, domain models, events raised and messages consumed, … ↩︎
  5. We all know the “Documentation” work item that gets pushed across sprints until it’s simply marked as obsolete. ↩︎
  6. And I’m not sure about the root of all evil, but it’s a surefire way to get stuck in analysis-paralysis. ↩︎
  7. For example, backend services are irrelevant for frontend applications. Similarly, frontend applications don’t expose HTTP-based APIs. ↩︎
  8. Is that a valid word? ↩︎
  9.  I have to admit, it was somewhat of a “proud dad” moment, watching the dialectic agent pick up the relevant files from the architecture hub, copying them to its working directory and feeding them to the debating agents. ↩︎
  10. Or at least a decent first draft that can be more easily validated. ↩︎
  11. An anti-pattern(?), but that’s a discussion for another time. ↩︎
  12. “Garbage in Garbage out” holds also for technical documentation. ↩︎
  13. That I believe is now more or less a consensus. ↩︎

When Linear Logic Hits a Ceiling: The Case for Agent-Native Architecture

During the development and testing of Dialectic, something kept bothering me. While the application worked largely as designed, the implementation felt a bit too… simplistic.

The debate flow is largely a linear sequence: a loop iterating over debate rounds with an optional clarification step:

This isn’t necessarily a bad thing. It’s easy to understand and troubleshoot. It’s predictable. More importantly, it’s a decent starting point, an MVP.

But the problems start to show when using it.

First, the convergence decision is decoupled from the context of the debate itself – the debate always ends after a fixed number of rounds. This would mean that a debate that is simple and converges after a round or two may run unnecessarily for extra rounds. This is obviously wasteful1, but it also risks introducing ‘hallucination drift’ into what would otherwise be a perfectly good conclusion.
Conversely, the predefined number of rounds may not be enough. I’ve had several cases (mainly in work-related invocations) where qualitative examination of the resulting report revealed several open points and/or questions.

Second, the clarification step was constructed so that all agents are exposed to the problem description and context, and ask a set of questions at once.

While this allowed agents to gather specific context – which was helpful – it still presents two limitations:

  1. No real interactivity: debating agents could not follow up with questions after the user’s answers. This means the agent got clarifications up to a point, but often not all of its questions were fully answered.
  2. Isolation: the debating agents don’t see each other’s questions and answers and cannot derive conclusions from them2.

A third point is more on the implementation/operational side. When a given agent failed for whatever reason3, its contribution, at least in that round, was effectively lost. The linear loop meant that a failed agent invocation was ignored at best, and retrying meant retrying the whole round or phase. In other words, agent invocations were coupled together in error handling.

All of these problems could be solved within the original code design. But when I started thinking about it, it quickly became obvious, at least to me, that the code would become unwieldy and harder to reason about.

I started wondering whether it would actually be better to write the tool differently.
Since I don’t need a lot of excuses to write code, I rewrote it to address these problems, but also to experiment more with the idea of agent-driven applications.

State Machine Orchestration

So if I want to model my code in a way that allows me to express decisions as reactions to inputs, system events and state (in addition to predetermined configuration), modeling the system around a finite state machine seems like an obvious choice4.

The transition itself is also pretty straightforward, as the linear flow maps directly to states: we model each phase (propose, critique, refine) as a state of the system, as well as clarifications and synthesis. The system is naturally in one state at any given point in time. In a sense, the original linear flow is a specific case of the broader set of behaviors possible with the state machine.

We end up with a state machine that looks (at a high level) something like this:

The application is now modeled as a graph of nodes (~= “tasks”), where orchestration happens in response to events that cause edge transitions.

This model immediately lends itself to implementation of two improvements:

  • Asking clarifying questions is easily modeled as a state with a clear event telling the system when we’re done (“No more questions”) ⇒ agents can ask follow-up questions5, and can easily be made aware of each other’s questions. 
  • Deciding when the debate is done is also modeled as an event, based on the judge’s decision ⇒ autonomous convergence is easier to implement.
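As a rough illustration, the transition table for such a state machine could look something like this. This is a simplified sketch in TypeScript; the state and event names are my assumptions, not the actual implementation:

```typescript
// Simplified sketch of the debate state machine. State and event names
// are illustrative assumptions; the real implementation may differ.
type State =
  | "clarifications"
  | "propose"
  | "critique"
  | "refine"
  | "synthesis"
  | "done";

type DebateEvent =
  | "no_more_questions"
  | "proposals_ready"
  | "critiques_ready"
  | "judge_converged"
  | "judge_continue"
  | "synthesis_ready";

// Orchestration is just reacting to events with edge transitions.
const transitions: Record<State, Partial<Record<DebateEvent, State>>> = {
  clarifications: { no_more_questions: "propose" }, // agents done asking
  propose: { proposals_ready: "critique" },
  critique: { critiques_ready: "refine" },
  refine: {
    // The judge decides: converge now, or run another round.
    judge_converged: "synthesis",
    judge_continue: "propose",
  },
  synthesis: { synthesis_ready: "done" },
  done: {},
};

function next(state: State, event: DebateEvent): State {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`No transition from ${state} on ${event}`);
  }
  return target;
}
```

The configurable safeguards fit naturally into this model: the orchestrator can count rounds and emit `judge_converged` itself once the cap is reached, without changing the table.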

Note that configurable safeguards are still in place. We can still cap the number of clarifying questions to be asked as well as the number of debate rounds. But it naturally opens the door to more efficient handling of these situations. 

I won’t go into implementation details here (you can inspect the code and documentation), but this new flexibility also allows for easier implementation of other scenarios and improvements.

  • Adding a new phase/step in the flow, e.g. “review”, is essentially introducing a new node, with relevant transitions.
  • If a specific agent fails in some node, it can be retried independently of other agents in the same phase.
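The first point – adding a phase as a new node with rewired transitions – can be sketched like this, assuming a simple transition-table representation (all names here are hypothetical):

```typescript
// Hypothetical sketch: adding a "review" phase is just a new node plus
// rewired edges, with no change to the rest of the orchestration.
type Table = Record<string, Record<string, string>>;

const base: Table = {
  refine: { judge_converged: "synthesis" },
  synthesis: { synthesis_ready: "done" },
};

function addNode(
  table: Table,
  node: string,
  before: string,
  doneEvent: string
): Table {
  const updated: Table = structuredClone(table);
  // Redirect every edge that pointed at `before` to the new node...
  for (const from of Object.keys(updated)) {
    for (const [event, target] of Object.entries(updated[from])) {
      if (target === before) updated[from][event] = node;
    }
  }
  // ...and have the new node lead into `before` when it finishes.
  updated[node] = { [doneEvent]: before };
  return updated;
}

// Insert a "review" phase between refinement and synthesis.
const withReview = addNode(base, "review", "synthesis", "review_done");
```

The extension is pure data manipulation on the graph; the orchestrator code that reacts to events doesn’t need to know the new phase exists.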

It does require an implementation of a different kind of orchestrator, and adapting the UI options (CLI, web) to this operational model. The orchestration logic now lives in different nodes, and some intermediate technical states need to be introduced.

Interestingly, the agent LLM prompts themselves didn’t change much to accommodate the new state machine orchestration. This hints at a generally good abstraction at that level – orchestration vs. agent behaviors.

Where does this leave us from an architecture point of view?

One can argue that the state machine is also hard-coded, and fundamentally, the graph transitions are not that different from a program counter moving through instructions. If you squint, it does look similar.
On the other hand, this more naturally allows for easier extensions as noted above (interactive clarifications and autonomous convergence), as well as easier error handling at the node level. There’s also no constraint on having a static, predetermined state machine: it can be constructed at runtime based on configuration or input.

In addition, if we zoom out a bit and think of a potential roadmap, an event-based model allows the application to be easily re-implemented as separate processes, with different nodes implemented as separate “services” responding to events. Scaling becomes easier. Doing this based on the rigid loop-based flow would’ve been harder6.

But there’s something more fundamental in how the application is built – it’s still expressed in code.

An Agentic (?) Application

The refactoring described above works in the sense that it does improve the mechanics of extending the code. It allows us to express behaviors more naturally, and potentially scale better.

Still, the core application logic is expressed in a series of TypeScript code files – state machine transitions are expressed in code. Even the agent prompts are delivered as part of the code.

At a basic technical level, any material change to the behavior of the application requires some code change (+rebuilding and shipping). Extensibility, even when easier, is still code-centric. This becomes more of an issue if our application requires more flexibility and customization from a user.

We have improved control-flow modeling, as well as runtime semantics. But the application behavior is not fully externalized as a protocol/data.

What does it mean for the application protocol to be externalized as data?

At the heart of it, the application’s logic is represented as artifacts that are observable and even open to manipulation by the system’s operators, not just its coders.

To the older programmers in the crowd, this would be somewhat reminiscent of Lisp/Smalltalk and other homoiconic languages, where the program representation is directly manipulable in the same semantic system as the program data (e.g. forms/objects, S-expressions)7.

But this is not exactly homoiconicity. In this case, we are able to modify the program’s behavior by manipulating files that are read during execution. 

In a system running continuously, this gives us a chance to change the system’s behavior as it’s running. In that respect, it is similar. I guess it’s more “workflow as data” and not so much “code as data”.

Another analogy might be to a template in a no-code tool, where users have the option to customize the flow without coding. It is similar in the technical sense, only here we don’t have formal semantics that usually come with modeling in some no-code tool. We have the English language, with the aid of tools (again – code) to help provide a more rigid structure. 

What I’m after here is a clear separation between the agent “runtime” and the application’s business logic, in a way that allows the application protocol to be defined as malleable artifacts.

Which brings me back to the idea of implementing the application with an AI (LLM-based) agent at its core.

Practically, this would mean that the application workflow is represented in a series of artifacts that are inspectable and amendable by the user or operator of the system. The “runtime” itself would be an agent platform with basic capabilities, driven by an LLM, with relevant basic tools.

What do we gain here? 
We gain transparency and faster architectural iteration with rudimentary tooling8. We also get easier customization of behavior. 

At the same time, we must recover guarantees we lost when the workflow was implemented in code, compiled and verified. We’re moving from imperative coding to inspectable runtime artifacts.

This is how I got to Dialectic-Agentic.
It is essentially the same Dialectic application, re-imagined as an agent-native application.

The core execution engine is any agent platform available today. This should work with Claude Code, Cursor, etc. These already implement the basic agent loop and tool abstractions (plus some built-in tools) that allow building the application on top.

The application protocol is expressed through a series of skill files and prompts. These enforce strict file conventions that serve as the local communication mechanism between agents.

The flow orchestration is described in the Orchestrator agent skill. This is the main agent running in the agent loop. Using the built-in “Task” tool, it executes various subagents (per role) and the judge agents. 

All work and communication between the orchestrator and other agents is done through reading and writing files in a dedicated debate workspace. This also allows us to follow the progress and status of the debate (there’s a `progress.md` file).
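To illustrate, a debate workspace might look something like the layout below. Only `progress.md` and the `debate/round-{ROUND}/proposals/{agent.id}.md` convention are taken from the skill file quoted later in this post; the other names are assumptions:

```
workspace/
├── progress.md                  # live status, readable while the debate runs
└── debate/
    └── round-1/
        ├── proposals/
        │   ├── architect.md     # one file per role subagent
        │   └── performance.md
        ├── critiques/
        └── refinements/
```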

(blue components are the application’s “code”; rounded rectangles are files; labeled arrows are control flow, unlabeled arrows are data flows)

Invocation happens simply by invoking the relevant skill in the relevant agent platform, with the problem description and context directories given, as before.

Configuration is similar to the code-centric version. Note, however, that a lot of the agent and LLM configuration is irrelevant here, since it is implied by the running agent platform. The configuration focuses on the agent-specific instructions and guardrails.

The entire application logic is encoded in skill files (the blue components above), taking advantage of the agent runtime’s capabilities: reading files, performing web searches, or using any other customized tool. The LLM configuration is entirely out of scope for the application.

The application UI is essentially the built-in agent chat window or terminal, whichever the user decides to use. The intermediate files are, of course, also part of the UX. You can track progress and status using the information in the debate workspace, as the files are written and updated by the operating agents. The debate workspace also remains available at the end for troubleshooting or other analytics.

This is still not a full-blown agent-driven application as I have outlined before, but the core components are there: the agent loop and basic tools are already part of the underlying agent platform. The shared context is given in the debate workspace – a simple file system directory.

The workflow, at least at this point, is a rather simple one, with a clear beginning and end. There’s no sharing of context with the user while the application is running, but this is mainly because the running time is finite, and usually short.

— 

Now that we have 3 different implementations of the same application, it might be worth taking a step back to consider the tradeoffs.

Comparing Implementations

The 3 different implementations of the same application (imperative, state machine, agent-based) accomplish essentially the same task – running a system design debate and producing a result.

I have not achieved exact feature parity between the implementations, but there should not be anything that fundamentally prevents us from doing it, even if the implementation may be awkward.

It would be interesting to examine the tradeoffs of the different implementations from an architectural point of view. How do the different implementation approaches differ in different aspects?

Change Velocity

How long will it take to implement a new feature, and deliver it to users?

The general question of course depends on the feature and its complexity, but it still might be worth examining it through the lens of a specific feature (or set of features). Imagine, for example, that we need to include a new step in the process, e.g. a final review of the solution by all agents9.

The deterministic flow would require changes in several code files (the orchestrator, the role-based agent interface and its implementations). It would also require new prompts, potentially new state attributes to be passed, and probably specific context construction.

The state-based flow would require a new graph node implementation, with relevant wiring. It’s better organized, with the flow clearly separated from other aspects.

Both of these implementations, of course, require code changes plus building and deploying compiled files, including package publishing, etc.

The agentic implementation basically requires some change in the core protocol (a new step before the synthesis phase?) – and that’s really it.

Delivery of the actual skill files really depends on the platform, but it’s essentially copying the necessary markdown file.

Failure Isolation

This aspect, of course, depends on the type of failure. It’s obvious that a failure in the underlying LLM APIs, or in their availability, is a blocker for any kind of application where LLMs play a vital part.

Any central failure, e.g. no LLM available, will affect the entire execution.

I think the more interesting question is how isolated a failure is when it does happen in a specific step/component.

Let’s consider a failure in one agent execution, in one phase. It could be due to some misconfiguration of the LLM or prompt, or a failed tool call, causing the LLM to return an invalid response – one not according to protocol.

The imperative implementation would either try to work with the given response, however lacking/broken, or stop the debate completely (e.g. in the proposal phase). Not all errors will be immediately obvious, but this is more an issue with the current implementation than with the pattern. A technical failure is more likely to cause the entire run to fail. Isolation would require granular error handling at the code level, e.g. smaller and more specific try-catch blocks.

The state-machine implementation works largely the same for phase-scoped errors. It either aborts the flow completely (proposal, refinement phases) or continues with partial results (critique phase). The specific mechanism is different, but the result is the same from an overall application point of view.

Note that in the current implementation, there’s no validation of the quality of returned result from agents – nonsensical LLM responses may propagate.

The node/event model makes it slightly easier to isolate problems when they happen, especially if we execute nodes in separate processes (not the current implementation).

With the agent-based implementation, the policy is embedded into the skill file, e.g. here (section 4.2):

**Wait** for all N subagents to complete.

**Verify** that each expected file exists: `{WORKSPACE}/debate/round-{ROUND}/proposals/{agent.id}.md`

If any file is missing:

1. Log a warning to `progress.md`: "WARNING: {agent.name} proposal missing in round {ROUND}. Retrying."

2. Re-dispatch that agent's subagent once.

3. If still missing after retry: log "WARNING: {agent.name} skipped in round {ROUND}" and continue without this agent. Inform the judge of missing agents when it runs.

That is, the current policy is to retry an agent execution once and, if it still fails (no file found), log a warning and continue. It does not stop the debate, but it does make the problem explicit.

Note that here too, in the case of a faulty or missing response (after one retry), the process continues. So a problematic response will propagate into the debate and may cause downstream issues.

Failure is generally more isolated in this case simply because it happens at a subagent level, and focused on specific task execution.

Note that the actual handling of errors depends on the executor being strict in its execution. There might also be drift – from the artifact changing, or from instructions in the prompt that alter this behavior. The behavior is not absolutely guaranteed.

In all 3 implementations, we can create more robust failure handling: validating actual results, retrying executions, isolating specific agents.
The question then becomes how easy it is to introduce such a mechanism.

Imagine we wanted to isolate an agent’s failures so they won’t stop the debate.

With the imperative solution, this would entail coding a whole protocol between the orchestrator and other agents.

With the state-machine implementation, this would require introducing new states dynamically (“1 agent completed”, “2 agents completed”, …, “N agents completed”). This is not currently implemented, but the basic mechanism is there (note it’s called `DEFAULT_TRANSITIONS`).
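A sketch of what constructing those states at runtime could look like follows. Only the `DEFAULT_TRANSITIONS` name comes from the code; everything else is a hypothetical illustration:

```typescript
// Hypothetical sketch: construct per-agent completion states at runtime,
// so a single agent's failure doesn't stop the debate. Only the name
// DEFAULT_TRANSITIONS is taken from the actual code.
type Transitions = Record<string, Record<string, string>>;

const DEFAULT_TRANSITIONS: Transitions = {
  propose: { all_agents_done: "critique" },
};

// Expand the single propose -> critique edge into a chain of
// "k agents completed" states, each reachable on success OR failure.
function withAgentCompletionStates(agentCount: number): Transitions {
  const t: Transitions = { ...DEFAULT_TRANSITIONS };
  let prev = "propose";
  for (let k = 1; k <= agentCount; k++) {
    const state = k < agentCount ? `propose:${k}-completed` : "critique";
    t[prev] = {
      agent_succeeded: state,
      agent_failed: state, // a failed agent is skipped, not fatal
    };
    prev = state;
  }
  return t;
}
```

The point is that the states are data: the machine is generated from the configured number of agents, not hand-written for each N.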

With the agent-based implementation, the policy is basically the 5 lines quoted above. Implementing it means changing the SKILL file, or providing extra instructions when invoking it (the “user prompt”). This of course assumes the underlying LLM follows instructions10. In short, it’s easier to implement, but we’re more at the mercy of the underlying agent to follow the instructions as intended.

Runtime Transparency

How easy is it to understand the execution as it is running?

In the imperative implementation, the flow is mostly implied in the code itself. We would need to log everything or implement tracing to gain visibility. In short – more code.

In the state machine implementation, the flow is also expressed in code, but it’s easier to understand where it stands just by tracing/logging state transitions. Another case where better code organization benefits us. If nodes communicate by some other inter-process communication protocol, e.g. message queues, it’s also possible to track these.

In the agent-native implementation, since all communication between agent executions happens in files (status.md, progress.md, files with proposals, critiques, etc.), it’s very easy to simply look at the file system and understand how the process is progressing, or where it failed.

Determinism and Reproducibility

How deterministic is a given execution? How easy would it be to reproduce it?

In both of the code-based implementations, the process is expressed in code. Given the exact same inputs and sequence of events, we’re almost certain to reproduce the same results. While there is some non-determinism in the LLM responses, it is not likely to affect the execution of the flow – it might affect the quality of the end result.

In the agent-native approach, a lot of the execution depends on the LLM following instructions properly. The execution here is a lot more sensitive to the agent platform running it, prompting and runtime changes.

This might be good in some cases, if the LLM finds ways to overcome obstacles, but generally speaking, the behavior is less predictable compared to code. To mitigate this, we’d need to invest more in verifying contracts (e.g. files created). There’s no question that this approach is weaker on this point.

Tool Integration Ergonomics

How easy is it to integrate tools into the flow and direct the LLMs to use them when necessary?

In both of the code-based implementations, tool registration and execution are code-centric. We would need to implement tool discovery11 and integration into prompts, as well as execution. It’s possible to integrate a more well-established protocol, e.g. MCP, but that still requires investment in implementation and maintenance. There are, of course, established agent frameworks these days that do a lot of this heavy lifting.

In the agent-native approach, this is largely solved by the underlying agent platform. It already takes care of registering tools, including custom tools, and it usually has some basic tools built in. For example, in Cursor, file_read and web_search are available as part of the platform. We’re only left with guiding the agents on how to use them. In this respect, it’s a done deal, and the application developer only needs to focus on the usage of tools. It also means that tool usage might not be immediately transferable to other platforms, unless we make sure we’re using standard tooling, e.g. the same MCP servers.

I’m not sure there’s a clear winner in this aspect. Only that existing platforms already support this out of the box.

Testing

How easy would it be to test the application behavior in each approach? How well can we use established testing tools and methodologies?

The imperative implementation is a winner in this aspect. It is best suited for traditional unit testing and other automated testing approaches.

The state-machine implementation is also code-centric and therefore easily testable with existing tools. It might need a bit more testing for the nodes/events facility, but this added testing complexity isn’t significant.

The agent-native implementation is weaker in this aspect. Testing here requires relying on golden-artifact testing, validating implicit contracts (file naming and content), and a generally more end-to-end approach to testing.

This is a point that’s generally true for applications relying on LLM execution, and I think merits its own separate discussion12.

So Which is Better?

To summarize this comparison, if I had to rate these implementations on a 1 to 5 scale (1 – weak, 3 – balanced, 5 – strong), it would look something like this:

| Aspect | Imperative Implementation | State-machine Implementation | Agent-native (skill-based) Implementation |
| --- | --- | --- | --- |
| Change Velocity | 2 | 3 | 5 |
| Failure Isolation | 2 | 4 | 3 |
| Runtime Transparency | 2 | 3 | 5 |
| Determinism / Reproducibility | 4 | 4 | 2 |
| Tool Integration Ergonomics | 3 | 3 | 5 |
| Testing | 5 | 4 | 2 |

Unsurprisingly, there’s no one architecture that dominates all of these aspects. Each refactor done here improved some areas at the cost of others.

When we moved from imperative code to state machine implementation, we gained better code organization, flow modeling and failure boundaries. But we paid a “tax” in complexity (managing nodes, events, suspend/resume cycles).

When we moved to the agent-native architecture, we gained flexibility, easier customization, and velocity. This allows the system to adapt to the conversation rather than following a script. But we pay in less deterministic execution and a harder-to-test application.

As always, the answer to “which is better” is “it depends”. There is no universally better architecture, only a better fit for the specific problem at hand.

If we optimize for predictability, and maybe compliance, it would probably be better to go with one of the code-based approaches.

If we optimize for rapid iterations, and protocol flexibility (including user contributions), it might be better to go with an agent-native approach.

And of course, other applications with more complex flows13 might work better as a hybrid approach, where some of the process – namely the part we want to be more predictable and compliant – is implemented in code and integrated as a tool with an underlying agent.

For this specific use case, the flow remains fairly simple and predictable. My takeaway is that an agent-native architecture really fits when the path to a solution isn’t an obvious “straight line” – where flows are less rigid, or where different processes must be combined on the fly in unforeseen ways.

Consider, for example, a Tier 1 customer support bot following a well-known script. This is usually predictable and code-like (“if this is raised do this, otherwise do that”). Contrast this with a support bot that behaves more like a high-level troubleshooter, and pivots based on the complexity of the problem and its context. In that scenario an agent-native architecture will fit better.

Similarly, consider supply chain software that needs to set up a delivery route. An agent connected to online information, absorbing different inputs about external events (e.g. extreme weather, fuel shortage), should be able to adapt better than a static route based on hard-coded heuristics.

In the end, we architect for the predictable, but we try to build for the unknown. 
And it is in the “unknown” that an agent-native approach finally pays its rent.


  1. Token economy! ↩︎
  2. For example, the “architect” role agents and the “performance” role agents have a lot of overlap in their clarifying questions. ↩︎
  3. For example, a tool failure or connectivity issue ↩︎
  4. I guess there’s a reason why LangGraph is basically built around a similar model. It’s natural for a workflow ↩︎
  5. Admittedly, working interaction into the state machine is a bit more involved, but doable. ↩︎
  6. Then again, it’s not a requirement, so I wouldn’t run to implement it just yet. ↩︎
  7. Usually with strong meta-programming affordances ↩︎
  8. And, well… a robust LLM. ↩︎
  9. Or ADR documentation, or JIRA update, or whatever ↩︎
  10. It’s also possible for a model to try and overcome the issue in some other creative method. ↩︎
  11. Similar to how it’s currently implemented. ↩︎
  12. Maybe tests defined on traces? ↩︎
  13. Like a lot of business applications focused on processes ↩︎