
The Architecture Hub: Teaching AI to Understand Your System So You Don’t Have To

I have argued before that the real gain from AI in software engineering is not only in code production. GenAI is definitely a useful tool for coding, but coding is not where the bottlenecks are. To be effective, not merely efficient, at coding, the design of the software being written is crucial. I think it’s pretty much a consensus by now that to be really productive with coding agents, you need to direct them carefully and, of course, provide the proper context.

Proper context is more than just good requirements/specifications. These might be good enough for greenfield projects where we’re starting from scratch. The reality for many existing companies and projects, however, is that our starting point is much muddier than we’d like, and simply connecting an AI to it isn’t enough. A system with hundreds of separate services communicating to implement different business flows and user interfaces is hard to follow, whether you’re the human who built it or a supercharged AI agent that understands code perfectly. Adopting AI effectively in such circumstances means more than letting the AI tool (e.g. Cursor, Claude Code) read and index the code. That’s an important prerequisite, but it’s not enough.

Any design methodology requires, at the very least, knowledge of the existing system and the processes it implements. Otherwise we’ll be “stuck” with generic advice, which often becomes useless pretty quickly1. When dealing with a complicated system we have to let the AI investigate on its own if we want it to help us with the design. This complicated, often domain-specific, internal knowledge has to be made available to the LLMs2 if we have any hope of the AI helping with the design.

Note that this isn’t an AI-only problem. I’ve often encountered the situation where there’s, at best, a single engineer who remembers why a certain flow is implemented in a certain way, or why there are two separate endpoints that implement pretty much the same logic. It’s a human problem as well. As humans, though, we compensate by relying on tribal knowledge: old emails and Slack threads. This might be an option for an AI in some cases too, but at best it is very inefficient.

On top of this, a lot of the time, the reality of modern business software is that of a distributed architecture, with hundreds of services and legacy code coexisting with more recent rewrites. Cross-service flows can become very intricate, and they are often undocumented. Even when the knowledge exists (in someone’s head), it’s hard to piece things together, and practically impossible for an AI agent to understand without proper architecture context. Humans can eventually trace flows, but they rarely document them. AI agents can probably do something close to that, but it’s very inefficient, both in running time and in token cost.

If we want AI to design features, troubleshoot issues or help us assess the impact of changes, we have to help it understand how the system fits together. The need existed well before AI took the stage, but LLM-based tooling both highlights the gap and offers a path to close it. Humans are traditionally bad at maintaining documentation reliably. Given the right tools and direction, though, AI can help create and maintain the relevant documentation.

This is what led me to the Application Architecture Hub.

The Goal

The primary goal is pretty straightforward. Build a knowledge base that AI agents can query to understand system architecture. When an agent needs to design a feature, it should have context about existing patterns and dependencies. When an agent traces a bug, it should know which services participate in the flow. When an agent assesses the impact of a change, it should understand what depends on what.

We already know LLMs can read code and write documentation. Not only that, they do it repeatedly, consistently and tirelessly.

If we design the extraction and documentation process well, we can have agents that produce documentation that is actually useful. Not just generated API docs with lists of endpoints3, but actual structured documentation, semantically summarizing the code, with citations back to the actual source code.

In this sense, AI works much better. A human going through source code listings can spend hours building a mental model of the relationships between services4. An agent can produce a structured summary in minutes. Given the right extraction prompts, it can produce meaningful descriptions in a consistent format. And this of course scales across hundreds of repos. Contrast this with humans documenting different repos, each bringing their own style, preferences and assumptions about what matters. The resulting inconsistency makes it very hard to reason and correlate across services.

LLMs also make incremental updates easier: they can compare (“diff”) the current state, identify what has changed and touch only the necessary sections. AI agents don’t get bored or decide that updating documentation is not a priority and can be pushed to a later sprint5. Humans rarely sustain this over time. They might invest initially, but entropy will win.

So my goal here is: have a living knowledge base where AI is used both to maintain it and consume it – AI agents are the prime consumers. Agents can query the hub to understand the system, as well as extract information and keep it up-to-date.

It turns out, unsurprisingly, that humans also need this. As I noted above, the introduction of LLMs to coding and design did not invent the problem of understanding the system. And given up-to-date structured documentation, with AI helping to query it, humans find it useful as well.

AI-generated documentation isn’t a groundbreaking concept. What matters is that the output is relevant and of high quality for the intended use cases. The thought here is that AI-based documentation, with proper engineering of the extraction process and relevant tooling, can outpace human-maintained documentation. This is not because AI is smarter, but because it is smart enough, and tireless.

Designing the Architecture Hub

Even though it turns out the architecture hub is useful for humans, the driving force behind the design was consumption by LLMs and tools driven by LLMs. Even when humans use it, they do it using LLM-based tools.

Initially, I started researching and thinking about achieving scale – graph databases, maintaining large collections of documents, specifying potentially complex ontologies of objects.

I can’t rule out the usefulness of these techniques just yet, but I quickly came to realize that I was prematurely optimizing6.

So I pivoted to a much simpler approach. The architecture hub is, for now, a simple Git repository. It’s not a code repository with implemented business flows and tests. There are no deployable artifacts. Instead it maintains a series of markdown files organized consistently into several directories.

This in itself already allows for simple consumption – AI agents can easily read markdown files. It’s also easily reviewable and usable by humans. Combined with a GitHub MCP server, or simply cloning the repo locally, any AI agent can easily access the information.

The “unit of ingestion” is a single code repository. Repositories usually already encapsulate specific logic, and are easy to follow and to build tooling around.

Architecture Facets

We could have a single file per repository, describing each repo in detail. But this easily gets too large and unfocused. Different tasks (by agents or humans) require different types of information. For example, tracing a bug requires understanding events and call flows; assessing impact of changes requires understanding dependencies. Having a single giant file would mean that an agent would have to load everything and burn tokens on information it doesn’t need. It could easily pollute the context. Instead, I decided to structure the hub around different facets of the architecture.

The application architecture hub is structured around simple file system directories containing the files. Each directory represents a specific perspective (a facet) of the architecture (APIs, domain models, events produced/consumed, etc.). A directory contains one markdown file per ingested code repository; all files follow a consistent template with consistent metadata. This is a consistent, predictable structure that is also easy to describe.

| Facet | What It Documents | Questions It Answers |
| --- | --- | --- |
| Domain | Data entities, relationships, types | What data does this service manage? How is the data structured? |
| API | Endpoints, request/response contracts | How do I call this service? What functionality does it offer, if any? |
| Events | Message topics, payloads, producers, consumers | What does this service emit or consume asynchronously? |
| Frontend | Frontend applications: state management, components, routing | How does the UI work? |
| External Dependencies | Databases, brokers, external services | What components and external services does this service depend on? |
| Dataflow | Inputs, transforms, outputs, sensitive data | How does data move through this service? |

The list of facets is stable and aims to document the aspects that often come up during design, allowing us to ask more complicated questions. It can of course be extended to include more aspects.


The design is therefore simple: one file per repository (usually named after the repository), per relevant facet7. If you need to understand the HTTP API exposed by the payments service (from a repo called “payments”), you simply look for `api/payments.md`. If you need to see which events this same service emits, you look in `events/payments.md`. This structure is simple to follow, both for AI and for humans.
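To make the layout concrete, a hub with a couple of ingested repos might look like this (repo and facet names are illustrative):

```
architecture-hub/
├── llms.txt              # index: one line per document
├── domain/
│   ├── payments.md
│   └── reservations.md
├── api/
│   ├── payments.md
│   └── reservations.md
└── events/
    └── payments.md
```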

Dividing the information into different files has other benefits beyond simple context window efficiency:

  • Easier to search (e.g. using grep) for specific facet information across repos. Remember that our prime motivation is system-wide (cross-repo) patterns.
  • Parallelism: it’s easier to divide work across sub-agents when they can ingest and search on separate file directories.
  • Incremental updates: updating a changed API usually does not require updating the domain model information, or external dependencies.

Note that searching the files does not exclude searching the code as well. In fact, the extraction takes care to maintain explicit code references. And when querying the hub I often find myself asking the agent to start from the architecture hub, but also use the Git tools (either MCP or the GitHub CLI) to look into the specific code, based on the citations.

Using a simple Git repo also brings the immediate advantages of dealing with textual content – it’s versioned and easily reviewable. It’s easy to see what gets updated, and when.

At a high level, the flow is: ingestion pipelines extract per-facet documentation from each code repository, and query skills consume it.

Ingestion Pipelines

How does ingestion – creating or updating documentation – work?

As noted above, the main unit of ingestion is a code repository. Each code repository is ingested in turn, and the created artifacts reflect the original code repository. This allows us to debug, retry and review specific repos, and tie the ingestion into already existing CI processes. We don’t need to invent new relationships or mappings of code repositories to artifacts. It’s also easier to query specific code files using the hub as the guiding index when necessary.

Technically, we implement the extraction process as a series of agent skills: structured prompts with accompanying templates and scripts. These guide the extracting agent on what to look for, how to search the codebase, and the format of the documentation file to produce.

Why skills?

Besides being text-based and therefore easily version controlled, skills let us leverage the LLM’s built-in ability to understand the code and its semantics. With a good enough LLM, an agent with a skill can produce consistent results. We do use scripts for a basic understanding of the hub (e.g. which repos are already ingested), and we could probably optimize with scripts that parse the code deterministically (similar to static code analysis). But we’re starting simple, with an implementation that doesn’t require any extra runtime beyond the running agent(s).

Each facet has two skills – one for extracting the facet from scratch, and one for updating existing documentation. The update skill compares the change in the code against the current documentation state and only updates what’s changed. Full re-extraction is possible, but seems too expensive.

The skills define what to look for, depending on the facet they’re documenting. For example, the API skills look for HTTP controllers and decorators (we’re mostly NestJS-based); the event skills look for message schemas; the dependency skills look for definitions of connection strings, external endpoints, etc. All skills follow a template, so outputs are uniform in structure. All templates include a metadata section (repository URL, date of ingestion, git commit SHA of the repo at the time of extraction).
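For illustration, the metadata section of a facet file might look something like this (the field names here are my own sketch, not a fixed schema):

```markdown
---
repository: https://github.com/example-org/payments
facet: api
extracted_at: 2025-06-12
commit_sha: 3f2a91c0
---
```

Recording the commit SHA is what lets a later “update” run diff the code against the exact snapshot the documentation was extracted from.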

The ingestion pipelines themselves exist in two versions: remote and local. The difference is in how they use the data.

The remote version accesses the ingested code repo using the GitHub MCP server. It does not require a local clone, and can effectively work from anywhere with the proper credentials set up.

The local version uses the git CLI to clone the ingested code repo into a temporary local directory and then reads the code using file system tools. The local version is generally cheaper and more reliable, though it requires more disk space.
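The local setup could be sketched roughly as follows. The paths and repo name are illustrative; to keep the sketch self-contained it fabricates a tiny local repo, where in practice the source would be a GitHub URL:

```shell
# Fabricate a small source repo so the sketch is runnable end-to-end.
SRC="$(mktemp -d)"
git -C "$SRC" init -q
git -C "$SRC" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "initial commit"

WORKDIR="$(mktemp -d)"
# A shallow clone is enough: extraction only reads the current code state.
git clone -q --depth 1 "file://$SRC" "$WORKDIR/payments"

# Record the commit SHA so the generated docs can cite the exact snapshot.
COMMIT_SHA="$(git -C "$WORKDIR/payments" rev-parse HEAD)"
echo "Ingesting payments at $COMMIT_SHA"
```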

In addition to producing the documentation files, the ingestion agents also update an llms.txt file, which serves as the hub’s index. This is a plain text file listing all the documented repos and explaining the structure of the architecture hub.

The querying skills guide the agent to first look at this file, understand the hub’s structure and start the lookup from there. Since the repository structure is simple, the llms.txt structure is simple too – one line per document created, with a one-line description of the content, divided by facet.

This makes locating documentation along different axes easy with a simple grep. For example, finding all domain documentation is a simple search for `domain/*.md` in the file. Similarly, finding all information about the reservation service is simply grepping8 for `*/reservations.md`.
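These index lookups can be sketched as plain grep calls. The repo names and one-line descriptions below are illustrative, not the actual index:

```shell
# A hypothetical llms.txt excerpt: one line per generated document.
cat > llms.txt <<'EOF'
domain/payments.md - payment entities: charges, refunds, payout schedules
domain/reservations.md - reservation lifecycle entities and states
api/payments.md - HTTP endpoints for charging and refunding
events/reservations.md - events emitted on booking and cancellation
EOF

# All domain documentation across repos:
grep '^domain/' llms.txt

# Everything documented about the reservations service, in any facet:
grep '/reservations\.md' llms.txt
```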

Ingestion can be triggered manually by any user (a GitHub Action invoked from the GitHub UI, or a script). It can also be invoked by a non-blocking CI step triggered on every merge to master/main – we want to update our documentation, but only for the changes that make it to the main branch.
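The trigger side could be wired up roughly like this. This is a sketch, not the actual workflow: the variable name, script path and job layout are all illustrative:

```yaml
# Hypothetical ingestion workflow trigger; names are illustrative.
on:
  workflow_dispatch:           # manual trigger from the GitHub UI
  push:
    branches: [main, master]   # document only changes that reach the main branch

jobs:
  ingest:
    if: ${{ vars.ARCH_HUB_INGEST == 'true' }}   # opt-in during rollout
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the ingestion orchestrator
        run: ./scripts/ingest.sh "$GITHUB_REPOSITORY"
```

Running this as a separate workflow is what makes it non-blocking: a failed ingestion never blocks the merge itself.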

The whole process is orchestrated by a single orchestrator agent (implemented as a skill as well), which launches sub-agents – one per facet.

The orchestrator takes care to clone the repository if needed, and then invokes the separate sub-agents to either create or update documentation for each facet independently.

The motivation for launching sub-agents comes from two main drivers: resiliency and latency. Since the work of each sub-agent is independent, they do not interfere with each other – all of them just read the code and write independent files. They are invoked in parallel, so the overall process terminates earlier, and failure in one sub-agent does not cascade to others. Technically it also means that the skills for each facet are separate, and therefore simpler – less room for LLM mistakes. A single facet failure is also easier to troubleshoot and re-run if necessary.

Note that it is the orchestrator agent that updates the index (llms.txt) file. Technically, each sub-agent could update the index file on its own upon completion, but since this is a shared resource, we would run into overlapping write conflicts. Since this is file system-based work, it’s easier to instruct the agents to return the result of their work as their output, and have the orchestrator update the index file. Updates to the shared resource then happen in one place – the orchestrator – and we avoid conflicts.

The ingestion itself can be triggered manually or as part of an automated process, e.g. after a successful merge and build of the master branch. In either case, the ingestion stops at creating a PR that can be reviewed by a human. Human review is still important, both to catch inaccuracies (which hopefully will be reduced over time) and so that people learn to trust the information. Without reviewing the errors that are still possible at this stage, errors will accumulate and trust will erode. It’s important to build this level of trust in the process.

Querying the Hub

Once we have the documentation in place, we can start querying it.

Generally, the querying process is simply prompting an agent to read the documentation and construct a report.

Identifying the relevant facets and extracting the necessary information, including correlations across different documentation files, is where we let the LLM apply its reasoning. We just take care to have a consistent structure, with enough information.

We have several “query” skills which instruct the agent to look in the index file, along with some other technical layout information. They also instruct the agent to cite its sources. This both reduces hallucinations and provides the result’s consumer (human or AI) with pointers to the source material. The actual querying and output really depend on the use case and the query issuer.

The query can come from a human user invoking an AI agent with a user interface (e.g. Cursor, Claude Code, or some chat interface with access to the file system). And of course, it can come from some other agent-driven process which is simply given access to the files. I have used the architecture hub as a context directory for a dialectic-agentic design debate – it works9.

There is no specific query language – we let the LLM interpret the query and work its way through the documentation. We can of course provide hints (“look at the ‘reservation’ service”), but this is not mandatory.

Examples of ad-hoc queries:

  • “Which services consume the financial-related events from the ‘financials’ service?”
  • “What overlap do we have in domain models between the payments service and the reservation service? And why?”
  • “Who is calling the accounting service?”

Technically, the query skill comes in three variations:

  • Remote: queries the hub using the GitHub MCP server.
  • Local: queries the local file system, assuming the hub is locally available and up-to-date.
  • Auto-Local: like Local, but first clones/pulls the architecture hub repo into a temporary local directory to make sure the information is up-to-date.

Note that we can also instruct the agent to continue into the actual source code if our requested analysis needs it. Having the GitHub MCP available (or the code locally cloned) makes further investigation into source code only a tool call away for the agent. The documentation in the hub does not replace code indexing; it’s about bridging between (technically) disconnected repositories and mapping/deriving semantic relationships where they exist. There is little value in trying to replicate the code indexing and understanding already performed by current coding agents and tools.

It’s interesting to see that even when humans query the hub, it’s done through AI agents. In fact, both the producer and the consumer of the hub are AI, even when directly instructed by a human user. It’s LLMs that produce the documentation, and LLMs that consume it. This also opens the possibility for an ingesting agent to verify itself simply by querying the hub for the changes it just introduced. By itself this might not sound that interesting, but the scale makes it more so. Maintaining technical documentation, with appropriate quality, now becomes a purely mechanical process that can scale more easily.

Structured Reports

Beyond ad-hoc queries, the hub supports reusable report templates. A report template is simply a prompt file, meant to be used with the query skill, that guides the agent through a more complicated analysis workflow. It specifies what to read, what to search for and how to format the output.

Using a report is simply prompting an agent with something like this:

```
Using the local query skill, follow the report instructions in ./reports/dependencies.md for the reservation service as the root service.
Output your result to ~/tmp/dependencies_reservations.md.
```


This launches the agent into the documentation: mapping out services and their dependencies, and producing a complete report with relevant pointers to source code.

An investigation that could take hours or sometimes days is done in minutes10.

We currently have several such predefined reports, each useful in different cases.

Dependency map
Given a specific service, map out all other services making API calls to it, and all the services it calls. The report also maps out events produced and consumed by the services, as well as services sharing the DB11.

Useful when trying to estimate the blast radius of a given change.

Cross service flow analysis
A flow analysis traces a business process end-to-end across multiple services. The agent follows API calls, events, and data writes across service boundaries. The output is a sequence diagram plus a step-by-step breakdown with source citations.

“Trace the order cancellation flow” produces a sequence diagram showing the user request hitting the order service, the order service publishing a cancellation event, the payment service processing a refund, and the notification service sending a confirmation. Each step cites the documentation that describes it (which in turn cites the source code).

“Plain English” Flow Explainer
Not everyone reads technical documentation. Product managers and stakeholders need to understand flows without wading through event topic names and API paths. The plain English explainer produces a narrative description of a business flow. No technical jargon. Just a story of what happens and why. But it does so based on up-to-date technical documentation – the code is the truth.

Example output:

"When a customer cancels an order, the system first checks if the order is eligible for cancellation. If eligible, it reverses any payment charges and releases held inventory. The customer receives a confirmation email with the refund details. The host receives a notification about the cancelled booking."

This report is useful during discovery and planning. When a product manager asks “how does X work today?”, you can point them to the hub instead of scheduling a meeting with an engineer.

This report also instructs the agent to use the web search tool to look for information in other online resources (e.g. the help center), which demonstrates the flexibility of the model. This is not a built-in feature of the architecture hub, just a tool available in the underlying platform that is composed into the process via the prompt. In my view it’s an interesting case of the “Application Logic Lives in Prompts” principle of agent-driven applications.

Also, this report essentially produces very similar information to the “Cross service flow analysis” report, only phrased in a way that suits a different audience – another demonstration of a capability that LLMs easily enable.

So How Do We Use It?

Regardless of the actual query being performed, we already see the value here: answering quick questions as well as generating more complicated reports, with deeper analysis.

For AI Agents

AI agents used in software are the primary intended audience here.

Several notable cases where this is used:

  • A troubleshooting agent that brings together information from bug reports and live monitoring data (logs, Datadog), but also interacts with the architecture hub to understand relationships between services.
  • Design tasks and assessing the impact of changes.

For Humans

Information gathering was a pain before the introduction of AI coding agents. The simple fact that we have up-to-date technical documentation already allows us to use it daily.

Examples:

  • Onboarding to a new code repo – whether it’s new employees getting to know the system, or simply a neighboring team needing to make changes in a repo they don’t own. Understanding dependencies, call patterns and domain models.
  • During planning: understanding impact and inter-team dependencies.
  • Mapping customer inquiries (specifying required data objects) to the APIs that provide them, across the system.
  • Quickly figuring out cross-repo dependencies in live design discussions; e.g. “what services consume these events?”
  • Understanding complex flows and data dependencies.

We also foresee more cases where this can be used: PR reviews, incident investigation, understanding compliance issues.

Anything that requires system-wide information that is reflected in the technical architecture.

It’s important to note what the hub should not be used for. It should not be the primary source for understanding the code or functionality of a single repository (or a very few loaded into a workspace). There are also better ways to understand the evolution of repos (git history). The rationale for designs should probably also be gleaned from other sources where they exist, using the hub as a way to validate decisions and track adoption.

Code tells you what happened, Git tells you when it happened, design documents and plans describe why things happen. The hub connects these perspectives across the system, and serves as a map to navigate the terrain.

Challenges and Roadmap

I would be misrepresenting things if I presented this as a fully solved problem. There are still challenges ahead, both current and expected.

First, staleness of data. 

Stale documentation is in a way worse than non-existing documentation since it may mislead people (and LLMs). Code changes after initial ingestion, and documentation needs to be updated.

As it currently stands, the automated CI workflow is opt-in (teams need to enable it via a simple GitHub workflow variable set to “true”). But this is for a limited rollout period. Once we make sure everything works and iron out the kinks, we can flip the condition and make it opt-out.

Additionally, each update records the time of the update, and each file contains a change log, so it should be easy to spot documentation files that are not up-to-date.
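Spotting stale files could then be a mechanical check. This sketch assumes each documentation file carries an `extracted_at:` metadata line (the field name and sample files are fabricated here so the sketch is runnable):

```shell
# Create a tiny sample hub with one stale and one fresh facet file.
HUB="$(mktemp -d)"
mkdir -p "$HUB/api"
printf 'extracted_at: 2024-03-01\n' > "$HUB/api/payments.md"
printf 'extracted_at: 2025-05-20\n' > "$HUB/api/reservations.md"

CUTOFF=20250101  # YYYYMMDD, so dates compare as plain numbers
for f in "$HUB"/*/*.md; do
  # Pull the date from the metadata line and strip the dashes.
  d="$(grep -m1 '^extracted_at:' "$f" | awk '{print $2}' | tr -d -)"
  if [ "$d" -lt "$CUTOFF" ]; then
    echo "stale: $f"
  fi
done
```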

Second, quality varies, and this depends largely on the quality of the ingested code12. Messy code with inconsistent patterns produces worse documentation. Code that is consistent, with known patterns and proper naming conventions, is much easier for the LLM to understand and document. The extraction skills look for API controllers, type definitions or configuration files in specific places. If the code doesn’t follow these conventions, the quality of the generated documentation degrades. We will fine-tune the extraction over time as we observe this, but this is largely a reactive measure.

Related to this is the problem of potential hallucination. Even though hallucinations are generally decreasing, at least with frontier models, this is still a potential issue, especially when an LLM is asked to describe the purpose or intent of a specific feature. It may make assumptions and present them confidently as facts. One way to mitigate this is by mandating citations of source code, which forces the LLM to ground its output in the real code. This seems to reduce hallucinations, and it also enables humans to more easily review and cross-reference findings.

Another issue that might come up is cost. Running LLMs at scale costs money. This is the main reason for having separate “update” and “full ingest” skills – the update skill touches only what has changed instead of re-producing the entire file. We’ll need to monitor this and see how things can be optimized if necessary, e.g. batching changes and re-ingesting only after a few commits/merges.

Related to cost is the general issue of scale, when it comes to quality of service. What happens when the hub includes hundreds of documents? How long will it take to query it (even when done on a local file system), and how good will the result be?

We may very well need to adopt a more scalable solution, e.g. a proper database rather than file system searches, if we want faster answers for more (concurrent) users.

Perhaps the hardest hurdle to overcome is adoption. For this to be adopted internally it has to be better than the status quo. Not marginally better – clearly better. So far the response from people who have seen it has been positive, and effort is going into making querying as easy and painless as possible.

Some future thoughts involve also providing a mechanism to give feedback and local notes (inspired by the `annotate` and `feedback` commands in chub); but this is not implemented yet.

Adoption of course needs to be not just by humans querying it, but also by internal AI agents using it.

Beyond Initial Implementation

Currently the architecture hub has a solid foundation, and shows value. But there’s still work to do, some obvious, some less so.

In the short term, we need to increase coverage of all repos. This is more of a technical gap.

We will also need to fine-tune the extraction skills and associated templates. Some feedback is already incoming. The same goes for pre-defined reports.

After that we’ll need to make sure this is adopted by AI agents. In a sense, the application architecture hub should be part of the default context for all technical agents doing design, troubleshooting, and planning. This will require more standardized interfaces for querying and reports. 

Another important step – ingesting more relevant information sources. Two immediately relevant sources are infrastructure information and design decisions (ADRs). This will enrich the available information and allow us to answer and connect information in different layers of the technical architecture – all the way from “why was this designed this way?” to “how is this actually deployed?”

Other architectural aspects may be interesting as well. For example, a security facet, mapping out authentication and authorization information as well as data sensitivity. This can help agents understand and design secure software, consistent with the rest of the system.

As noted above, a feedback mechanism would also be very useful for continuous, hopefully grassroots, improvement that maintains and raises the quality of the information.

Other steps might include (depending on need) introducing semantic search (RAG?) so we avoid issues with terminology misalignment, or with the user having to know the exact repo to start with.

When it comes to accessibility for larger audiences – humans more than AI agents – a visual explainer that automatically produces diagrams could prove useful for those who need a living, breathing map of the system.

Takeaways

The architecture hub started from a simple observation13: AI agents are great at understanding code (and getting better), but larger systems, with a lot of moving parts, are harder to accommodate reliably in one agent’s context window. Knowing how services interact, where data flows, how changes propagate – this is intractable in a large distributed system. If we want AI to go beyond simply coding, we have to teach it what we know. Knowing the system was a problem even before AI came along. LLMs just exposed the gap and made it more obvious. We got hungry for more.

But given the right mechanisms and tools, LLMs also present a solution. We can now generate and update reliable technical documentation at scale, simply because it’s mechanized. 

LLMs emphasize the need and present the solution at the same time. In this system, AI is both the consumer and maintainer of architectural knowledge.

There are already some interesting points to learn from this (still ongoing) journey:

  • For this to work, the extraction process needs to be engineered. We need to make sure the quality is high and that it can scale technically and organizationally.
  • Architecture is built on different aspects. Having one document cover everything is hard, and inefficient. The idea of different facets is important for effectiveness as well as efficiency.
  • Humans in the loop are important to understand errors, but also to build trust in the system. We’re trying to extract years of human-generated knowledge (in the form of code) and let machines run with it.
  • The value is in the query. The documents themselves are great, but AI and people need answers. The hub’s main value will come from delivering answers; documents are just the substrate on which this is built.
  • The original motivation (and still the main one) is for AI coding agents to consume the knowledge. But it so happens that reliable documentation, with consistent templates and explicit citations, is extremely helpful for humans as well.

I’m betting that AI-maintained documentation can outpace human-maintained documentation. So far, feedback has been positive. 

But the real test will come with adoption – when people and agents use the architecture hub as the first place to look for information.

(and yes, all dashes in this post are hand-typed)


  1. This was also, unsurprisingly, one of the conclusions from the testing of Dialectic. See “Does Clarification Matter?” here. ↩︎
  2. That would be what I called the 2nd phase in a possible AI adoption roadmap. ↩︎
  3.  Which is also useful of course ↩︎
  4. HTTP calls, domain models, events raised and messages consumed, … ↩︎
  5. We all know the “Documentation” work item that gets pushed across sprints until it’s simply marked as obsolete. ↩︎
  6. And I’m not sure about the root of all evil, but it’s a surefire way to get stuck in analysis-paralysis. ↩︎
  7. For example, backend services are irrelevant for frontend applications. Similarly, frontend applications don’t expose HTTP-based APIs. ↩︎
  8. Is that a valid word? ↩︎
  9.  I have to admit, it was somewhat of a “proud dad” moment, watching the dialectic agent pick up the relevant files from the architecture hub, copying them to its working directory and feeding them to the debating agents. ↩︎
  10. Or at least a decent first draft that can be more easily validated. ↩︎
  11. An anti-pattern(?), but that’s a discussion for another time. ↩︎
  12. “Garbage in Garbage out” holds also for technical documentation. ↩︎
  13. That I believe is now more or less a consensus. ↩︎

When Linear Logic Hits a Ceiling: The Case for Agent-Native Architecture

During the development and testing of Dialectic, something kept bothering me. While the application worked largely as designed, the implementation felt a bit too… simplistic.

The debate flow is largely a linear sequence: a loop iterating over debate rounds with an optional clarification step:

This isn’t necessarily a bad thing. It’s easy to understand and troubleshoot. It’s predictable. More importantly, it’s a decent starting point, an MVP.
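As a rough sketch of that MVP shape (all names here are invented for illustration; the real implementation differs), the orchestration is essentially a fixed loop:

```typescript
// Hypothetical sketch of the original linear orchestration (names invented).
type Proposal = { agentId: string; content: string };

interface DebateAgent {
  id: string;
  propose(problem: string): Promise<string>;
  critique(proposals: Proposal[]): Promise<string>;
  refine(own: string, critiques: string[]): Promise<string>;
}

async function runLinearDebate(
  problem: string,
  agents: DebateAgent[],
  maxRounds: number,
): Promise<Proposal[]> {
  let proposals: Proposal[] = await Promise.all(
    agents.map(async (a) => ({ agentId: a.id, content: await a.propose(problem) })),
  );

  // Fixed number of rounds, regardless of whether the debate has converged.
  for (let round = 0; round < maxRounds; round++) {
    const critiques = await Promise.all(agents.map((a) => a.critique(proposals)));
    proposals = await Promise.all(
      agents.map(async (a, i) => ({
        agentId: a.id,
        content: await a.refine(proposals[i].content, critiques),
      })),
    );
  }
  return proposals; // handed to a synthesis step
}
```

Note how the loop bound is configuration, not a decision made from the debate’s actual content – which is exactly the first problem described below.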

But the problems start to show when using it.

First, the convergence decision is decoupled from the context of the debate itself – the debate always ends after a fixed number of rounds. This would mean that a debate that is simple and converges after a round or two may run unnecessarily for extra rounds. This is obviously wasteful1, but it also risks introducing ‘hallucination drift’ into what would otherwise be a perfectly good conclusion.
Alternatively, the predefined number of rounds may not be enough. I’ve had several cases (mainly in work-related invocations) where qualitative examination of the resulting report revealed several open points and/or questions.

Second, the clarifications step was constructed so that all agents are exposed to the problem description and context, and ask a single set of questions at once.

While this allowed agents to gather specific context – which was helpful – it still presents two limitations:

  1. No real interactivity: debating agents could not follow up on the answers given by the user. The agent got clarifications up to a point, but was often left without all of its questions fully answered.
  2. Isolation: the debating agents don’t see each others’ questions and answers and cannot derive conclusions from them2.

A third point is more on the implementation/operational side. When a given agent failed for whatever reason3, its contribution, at least in that round, was effectively lost. The linear loop meant a failed agent invocation would be ignored at best; retrying meant the whole round or phase had to be retried. In other words, agent invocations were coupled together in error handling.

All of these problems could be solved in the original code design. But when I started thinking about it, it quickly became obvious, at least to me, that the code would become unwieldy and harder to reason about.

I started wondering whether it would actually be better to write the tool differently.
Since I don’t need a lot of excuses to write code, I rewrote it to accommodate these problems, but also to experiment more with the idea of agent-driven applications.

State Machine Orchestration

So if I want to model my code in a way that allows me to express decisions as reactions to inputs, system events and state (in addition to predetermined configuration), modeling the system around a finite state machine seems like an obvious choice4.

The transition to this model is also pretty straightforward. The linear flow maps directly to states: we model each phase (propose, critique, refine) as a state of the system, as well as clarifications and synthesis. The system is naturally in one state at any given point in time. In a sense, the original linear flow is a specific case of the broader set of behaviors possible with the state machine.

We end up with a state machine that looks (at a high level) something like this:

The application is now modeled as a graph of nodes (~= “tasks”), where orchestration happens as response to events that cause edge transitions.
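A minimal sketch of what such event-driven transitions could look like, with invented state and event names that loosely mirror the phases above:

```typescript
// Hypothetical event-driven orchestration (state and event names invented).
type State =
  | "clarifying" | "proposing" | "critiquing" | "refining"
  | "judging" | "synthesizing" | "done";

type Event =
  | { type: "NO_MORE_QUESTIONS" }
  | { type: "PHASE_COMPLETE" }
  | { type: "JUDGE_DECISION"; converged: boolean };

// Orchestration reacts to events instead of marching through a fixed loop.
function nextState(state: State, event: Event): State {
  switch (state) {
    case "clarifying":
      return event.type === "NO_MORE_QUESTIONS" ? "proposing" : "clarifying";
    case "proposing":
      return event.type === "PHASE_COMPLETE" ? "critiquing" : "proposing";
    case "critiquing":
      return event.type === "PHASE_COMPLETE" ? "refining" : "critiquing";
    case "refining":
      return event.type === "PHASE_COMPLETE" ? "judging" : "refining";
    case "judging":
      // Autonomous convergence: the judge's verdict is just another event.
      if (event.type === "JUDGE_DECISION")
        return event.converged ? "synthesizing" : "critiquing";
      return "judging";
    case "synthesizing":
      return event.type === "PHASE_COMPLETE" ? "done" : "synthesizing";
    case "done":
      return "done";
  }
}
```

Because the transitions are plain data-plus-function rather than control flow baked into a loop, the graph can be inspected, logged, or even assembled at runtime.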

This model immediately lends itself to implementation of two improvements:

  • Asking clarifying questions is easily modeled as a state with a clear event telling the system when we’re done (“No more questions”) ⇒ agents can ask follow-up questions5, and can easily be made aware of each other’s questions. 
  • Deciding when the debate is done is also modeled as an event, based on the judge’s decision ⇒ autonomous convergence is easier to implement.

Note that configurable safeguards are still in place. We can still cap the number of clarifying questions to be asked as well as the number of debate rounds. But it naturally opens the door to more efficient handling of these situations. 

I won’t go into implementation details here (you can inspect the code, and documentation), but this new flexibility also allows for easier implementation of other scenarios and improvements.

  • Adding a new phase/step in the flow, e.g. “review”, is essentially introducing a new node, with relevant transitions.
  • If a specific agent fails in some node, it can be retried independently of other agents in the same phase.

It does require an implementation of a different kind of orchestrator, and adapting the UI options (CLI, web) to this operational model. The orchestration logic now lives in different nodes, and some intermediate technical states need to be introduced.

Interestingly, the agent LLM prompts themselves didn’t change much to accommodate the new state-machine orchestration. This hints at a good abstraction boundary at that level – orchestration vs. agent behaviors.

Where does this leave us from an architecture point of view?

One can argue that the state machine is also hard coded, and fundamentally, the graph transitions are not that different from a program counter moving through instructions. If you squint, it does look similar.
On the other hand, this more naturally allows for easier extensions as noted above (interactive clarifications and autonomous convergence), as well as easier error handling at the node level. There’s also no constraint to a static, predetermined state machine: it can be constructed at runtime based on configuration or input.

In addition, if we zoom out a bit and think of a potential roadmap, an event-based model allows the application to be easily re-implemented as separate processes, with different nodes implemented as separate “services” responding to events. Scaling becomes easier. Doing it based on the rigid loop-based flow would’ve been harder6.

But there’s something more fundamental in how the application is built – it’s still expressed in code.

An Agentic (?) Application

The refactoring described above works in the sense that it does improve the mechanics of extending the code. It allows us to express behaviors more naturally, and potentially scale better.

Still, the core application logic is expressed in a series of typescript code files – state machine transitions are expressed in code. Even the agent prompts are delivered as part of the code. 

At a basic technical level, any material change to the behavior of the application requires some code change (+rebuilding and shipping). Extensibility, even when easier, is still code-centric. This becomes more of an issue if our application requires more flexibility and customization from a user.

We have improved control-flow modeling, as well as runtime semantics. But the application behavior is not fully externalized as a protocol/data.

What does it mean for the application protocol to be externalized as data?

At the heart of it, the application’s logic is represented as artifacts that are observable and even open to manipulation by the system’s operators, not just its coders.

To the older programmers in the crowd, this would be somewhat reminiscent of Lisp/Smalltalk and other homoiconic languages, where the program representation is directly manipulable in the same semantic system as the program data (e.g. forms/objects, S-expressions)7.

But this is not exactly homoiconicity. In this case, we are able to modify the program’s behavior by manipulating files that are read during execution. 

In a system running continuously, this gives us a chance to change the system’s behavior as it’s running. In that respect, it is similar. I guess it’s more “workflow as data” and not so much “code as data”.

Another analogy might be to a template in a no-code tool, where users have the option to customize the flow without coding. It is similar in the technical sense, only here we don’t have formal semantics that usually come with modeling in some no-code tool. We have the English language, with the aid of tools (again – code) to help provide a more rigid structure. 

What I’m after here is a clear separation between the agent “runtime” and the application’s business logic, in a way that allows the application protocol to be defined as malleable artifacts.

Which brings me back to the idea of implementing the application with an AI (LLM-based) agent at its core.

Practically, this would mean that the application workflow is represented in a series of artifacts that are inspectable and amendable by the user or operator of the system. The “runtime” itself would be an agent platform with basic capabilities, driven by an LLM, with relevant basic tools.

What do we gain here? 
We gain transparency and faster architectural iteration with rudimentary tooling8. We also get easier customization of behavior. 

At the same time, we must recover the guarantees we had when the workflow was implemented in code, compiled and verified. We’re moving from imperative code to inspectable runtime artifacts.

This is how I got to Dialectic-Agentic.
It is essentially the same Dialectic application, re-imagined as an agent-native application.

The core execution engine is any agent platform available today. This should work with Claude Code, Cursor, etc. These already implement the basic agent loop and tool abstractions (+ some built-in tools) that allow us to build the application on top.

The application protocol is expressed through a series of skill files and prompts. These enforce strict file conventions that serve as the local communication mechanisms between agents. 

The flow orchestration is described in the Orchestrator agent skill. This is the main agent running in the agent loop. Using the built-in “Task” tool, it executes various subagents (per role) and the judge agents. 

All work and communication between the orchestrator and other agents is done through reading and writing files in a dedicated debate workspace. This also allows us to follow the progress and status of the debate (there’s a `progress.md` file).

(blue components are the application’s “code”; rounded rectangles are files; labeled arrows are control flow, unlabeled arrows are data flows)

The application is invoked simply by triggering the relevant skill in the relevant agent platform, with the problem description and context directories given, as before.

Configuration is similar to the code-centric version. Note, however, that much of the agent and LLM configuration is irrelevant here, since it is implied by the running agent platform. The configuration is focused on the agent-specific instructions and guardrails.

The entire application logic is encoded in skill files (blue components above), taking advantage of the agent runtime capabilities of reading files or doing any kind of web search or any other customized tool. The LLM configuration is entirely out of scope for the application.

The application UI is essentially the built-in agent chat window or terminal, whichever the user decides to use. The intermediate files are of course also part of the UX. You can track progress and status using the information written in the debate workspace, as the agents write and update it while operating. The debate workspace is also available at the end for troubleshooting or other analytics.

This is still not a full-blown agent-driven application as I have outlined before, but the core components are there: the agent loop and basic tools are already part of the underlying agent platform. The shared context is given in the debate workspace – a simple file system directory.

The workflow, at least at this point, is a rather simple one, with a clear beginning and end. There’s no sharing context with the user while the application is running, but this is mainly because the running time is finite, and usually short.

— 

At this point we have three different implementations of the same pattern, so it might be worth taking a step back and considering the tradeoffs.

Comparing Implementations

The three different implementations of the same application (imperative, state machine, agent-based) accomplish essentially the same task – running a system design debate and producing a result.

I have not achieved exact feature parity between the implementations, but there should not be anything that fundamentally prevents us from doing it, even if the implementation may be awkward.

It would be interesting to examine the tradeoffs of the different implementations from an architectural point of view. How do the implementation approaches differ across various aspects?

Change Velocity

How long will it take to implement a new feature, and deliver it to users?

The general question of course depends on the feature and its complexity, but it still might be worth examining it through the lens of a specific feature (or set of features). Imagine, for example, that we need to include a new step in the process, e.g. a final review of the solution by all agents9.

The deterministic flow would require changes in several code files (the orchestrator, role-based agent interface and implementations). It would also require new prompts and potentially new state attributes to be passed.
It will probably also require specific context construction.

The state-based flow would require a new graph node implementation, with relevant wiring. It’s better organized where the flow is clearly separated from other aspects.

Both of these implementations require of course code changes + build and deployment of compiled files. This includes package publishing etc.

The agentic implementation requires basically some change in the core protocol (a new step before the synthesis phase?) and that’s it really.

Delivery of the actual skill files really depends on the platform, but it’s essentially copying the necessary markdown file.

Failure Isolation

This aspect of course depends on the type of failure mode. Obviously, a failure or unavailability of the underlying LLM APIs is a blocker for any application where LLMs play a vital part.

Any central failure, e.g. no LLM available, will affect the entire execution.

I think it might be more interesting to address the question of how isolated a failure mode is when it does happen in a specific step/component.

Let’s consider a failure in one agent execution, in one phase. It could be because of some misconfiguration of LLM or prompt, or some tool call, causing an LLM to return an invalid response – not according to protocol.

The imperative implementation would either try to work with the given response, however lacking/broken, or stop the debate completely (e.g. in the proposal phase). Not all errors will be immediately obvious but this is more an issue with the current implementation, not so much with the pattern. A technical failure is more likely to cause the entire run to fail. Isolation would require granular error handling at the code level, e.g. smaller and specific try-catch blocks.

The state-machine implementation works largely the same for phase-scoped errors. It either aborts the flow completely (proposal, refinement phases) or continues with partial results (critique phase). The specific mechanism is different, but the result is the same from an overall application point of view.

Note that in the current implementation, there’s no validation of the quality of returned result from agents – nonsensical LLM responses may propagate.

The node/event isolation provides a slightly easier way to isolate problems when they happen. Especially if we want to execute them in a separate process (not the current implementation).

With the agent-based implementation, the policy is embedded into the skill file, e.g. here (section 4.2):

**Wait** for all N subagents to complete.

**Verify** that each expected file exists: `{WORKSPACE}/debate/round-{ROUND}/proposals/{agent.id}.md`

If any file is missing:

1. Log a warning to `progress.md`: "WARNING: {agent.name} proposal missing in round {ROUND}. Retrying."

2. Re-dispatch that agent's subagent once.

3. If still missing after retry: log "WARNING: {agent.name} skipped in round {ROUND}" and continue without this agent. Inform the judge of missing agents when it runs.

i.e. the current policy is to retry an agent execution once, and if it fails (no file found) – log a warning and continue. It does not stop the debate, but does make the problem explicit.

Note that also in this case, in case of a faulty response, or missing response (after 1 retry), the process continues. So a problematic response will also propagate to the debate and may cause downstream issues.

Failure is generally more isolated in this case simply because it happens at a subagent level, and focused on specific task execution.

Note that the actual handling of errors depends on the executor being strict in its execution. There might also be drift from the artifact changing, or instructions in the prompt that alter this behavior. This behavior is not absolutely guaranteed.

In all 3 implementations, we can create a more robust failure handling. Validate actual result, retry execution, isolate specific agents.
The question then becomes how easy it is to introduce a more robust failure handling mechanism. 

Imagine we’d want to isolate the failures of an agent so they won’t stop the debate.

With the imperative solution, this would entail coding a whole protocol between the orchestrator and other agents.

With the state-machine implementation, this would require introducing new states dynamically (“1 agent completed”, “2 agents completed”, …, “N agents completed”). This is not currently implemented, but the basic mechanism is there (note it’s called “DEFAULT_TRANSITIONS”).

With the agent-based implementation, the policy is basically the 5 lines quoted above. Implementing it is basically changing the SKILL file, or providing extra instructions when invoking it (the “user prompt”). This of course assumes the underlying LLM follows instructions10. In short, it’s easier to implement, but we’re more at the mercy of the underlying agent to follow the instructions as intended.

Runtime Transparency

How easy is it to understand the execution as it is running?

In the imperative implementation, the flow is mostly implied in the code itself. We would need to log everything or implement tracing to gain visibility. In short – more code.

In the state machine implementation, the flow is also expressed in code, but it’s easier to understand where it stands just by tracing/logging state transitions. Another case where better code organization benefits us. If nodes communicate by some other inter-process communication protocol, e.g. message queues, it’s also possible to track these.

In the agent-native implementation, since all communication between agent executions happens in files (status.md, progress.md, files written with proposals, critiques, etc.), it’s very easy to simply look at the file system and understand how the process is progressing, or where it fails.

Determinism and Reproducibility

How deterministic is a given execution? How easy would it be to reproduce it?

In both of the code-based implementations, the process is expressed in code. Given the exact same inputs and sequence of events, we’re almost certain to reproduce the same results. While there is some non-determinism in the potential LLM response, it would not likely affect the execution of the flow. It might affect the quality of the end result.

In the agent-native approach, a lot of the execution depends on the LLM following instructions properly. The execution here is a lot more sensitive to the agent platform running it, prompting and runtime changes.

This might be good in some cases, if the LLM finds ways to overcome obstacles, but generally speaking the behavior is less predictable compared to code. To mitigate this, we’d need to invest more in verifying contracts (e.g. files created). There’s no question that this approach is weaker on this point.

Tool Integration Ergonomics

How easy is it to integrate tools into the flow and direct the LLMs to use them when necessary?

In both of the code-based implementations, the tool registration and execution is code centric. We would need to implement tool discovery11 and integration into prompts as well as execution. It’s possible to integrate more well-established protocol, e.g. MCP, but still requires investment in implementation and maintenance. There are of course established agent frameworks these days that do a lot of this heavy lifting. 

In the agent-native approach, this is largely solved by the underlying agent platform. It already takes care of registering tools, including custom tools, and usually has some basic tools built in. For example, in Cursor, file_read and web_search are available as part of the platform; the application developer is only left with guiding the agents on how to use them. It also means that tool usage might not be immediately transferable to other platforms, unless we make sure we’re using standard tooling, e.g. the same MCP servers.

I’m not sure there’s a clear winner in this aspect. Only that existing platforms already support this out of the box.

Testing

How easy would it be to test the application behavior in each approach? How well can we use established testing tools and methodologies?

The imperative implementation is a winner in this aspect. It is best suited for traditional unit testing and other automated testing approaches.

The state-machine implementation is also code-centric and therefore easily testable with existing tooling. It might need a bit more testing for the nodes/events facility, but this added testing complexity isn’t a significant addition.

The agent-native implementation is weaker in this aspect. Testing here requires relying on golden-artifact testing, validating implicit contracts (file naming and content), and generally a more end-to-end testing approach.

This is a point that’s generally true for applications relying on LLM execution, and I think merits its own separate discussion12.

So Which is Better?

To summarize this comparison, if I had to rate these implementations on a 1 to 5 scale (1 – weak, 3 – balanced, 5 – strong), it would look something like this:

| Aspect | Imperative Implementation | State-machine Implementation | Agent-native (skill-based) Implementation |
| --- | --- | --- | --- |
| Change Velocity | 2 | 3 | 5 |
| Failure Isolation | 2 | 4 | 3 |
| Runtime Transparency | 2 | 3 | 5 |
| Determinism / Reproducibility | 4 | 4 | 2 |
| Tool Integration Ergonomics | 3 | 3 | 5 |
| Testing | 5 | 4 | 2 |

Unsurprisingly, there’s no one architecture that dominates all of these aspects. Each refactor done here improved some areas at the cost of others.

When we moved from imperative code to state machine implementation, we gained better code organization, flow modeling and failure boundaries. But we paid a “tax” in complexity (managing nodes, events, suspend/resume cycles).

When we moved to the agent-native architecture we gained flexibility, easier customization, and velocity. This allows the system to adapt to the conversation rather than follow a script. But we pay in less deterministic execution and a harder-to-test application.

As always, the answer to “what is better” is “it depends”. There is no universally better architecture, only a better fit for the specific problem at hand.

If we optimize for predictability, and maybe compliance, it would probably be better to go with one of the code-based approaches.

If we optimize for rapid iterations, and protocol flexibility (including user contributions), it might be better to go with an agent-native approach.

And of course, other applications with more complex flows13 might warrant a hybrid approach, where parts of the process – namely those we want to be more predictable and compliant – are implemented in code and integrated as tools with an underlying agent.

For this specific use case, the flow remains fairly simple and predictable. My takeaway is that an agent-native architecture really fits when the path to a solution isn’t an obvious “straight line” – where flows are less rigid, or where different processes must be combined on the fly in unforeseen ways.

Consider, for example, a Tier 1 customer support bot following a well-known script. This is usually predictable and code-like (“if this is raised do this, otherwise do that”). Contrast this with a support bot that behaves more like a high-level troubleshooter, and pivots based on the complexity of the problem and its context. In that scenario an agent-native architecture will fit better.

Similarly, consider supply chain software that needs to set up a delivery route. An agent connected to online information, absorbing different inputs about external events (e.g. extreme weather, a fuel shortage), should be able to adapt better than a static route based on hard-coded heuristics.

In the end, we architect for the predictable, but we try to build for the unknown. 
And it is in the “unknown” that an agent-native approach finally pays its rent.


  1. Token economy! ↩︎
  2. For example, the “architect” role agents and the “performance” role agents have a lot of overlap in their clarifying questions. ↩︎
  3. For example, a tool failure or connectivity issue ↩︎
  4. I guess there’s a reason why LangGraph is basically built around a similar model. It’s natural for a workflow ↩︎
  5. Admittedly, working interaction into the state machine is a bit more involved, but doable. ↩︎
  6. Then again, it’s not a requirement, so I wouldn’t run to implement it just yet. ↩︎
  7. Usually with strong meta-programming affordances ↩︎
  8. And, well… a robust LLM. ↩︎
  9. Or ADR documentation, or JIRA update, or whatever ↩︎
  10. It’s also possible for a model to try and overcome the issue in some other creative method. ↩︎
  11. Similar to how it’s currently implemented. ↩︎
  12. Maybe tests defined on traces? ↩︎
  13. Like a lot of business applications focused on processes ↩︎

Agent-Driven Applications

AI agents are everywhere, and slowly (quickly?) becoming more prominent in applications. We’re seeing more and more of them appearing as integral parts of applications – not just as tools for development, but as actual technical components that implement user-facing functionality. We’re also seeing a significant improvement in the length of tasks agents are able to accomplish. I’m not sure this is AGI yet, but it’s definitely significant.

So far, I have focused on the implications of AI on how software is developed. But as we move from working internally with LLMs to building applications that leverage them, I believe it’s time to look more carefully at how to build such systems. In other words, what would it look like to build a system that really leverages LLMs as a core building block?

We already have concrete examples of such applications – our AI-driven IDEs and other coding agents. These are examples of applications where the introduction of AI has done more than supercharge existing application functionality. It has actually changed the way we do things. What’s more interesting is how quite a few people are using these in ways that traditional IDEs weren’t designed for. I remember a time, not so long ago, when suggesting the use of an IDE to a non-technical product manager was met with raised eyebrows1. Now, most product managers have Cursor (or Claude Code) open and doing much of their work. This isn’t just ‘vibe coding’; it’s using the agent as a multi-tool for the boring-but-essential parts of the job. I’m seeing people use Cursor for practically everything – writing specs/design documents, documentation, diagramming, design and data archaeology and more. And this is still mostly chatting with a given agent. The potential, I believe, is much bigger.

When I let myself extrapolate from coding agents to the broader set of potential applications2, I can’t help but think we’re going to see a new kind of software application emerge – agent-driven applications. These will be built mostly around LLMs and tools – essentially the agent’s harness. It can be multiple agents cooperating or a single agent embedded into a larger platform. I don’t assume this type of application will replace all others, but I think it will become more prevalent, and we should start to seriously think about what it means to really leverage LLMs in applications. We should consider the implications for how we define, build, evolve and use applications where AI-based agents sit at the core.

Why Should We Care?

One can argue that LLMs are nothing more than a technical component in a larger system, with limitations and quirks around how they’re used. This is technically true3. But I believe there’s a larger opportunity here in how we deploy and use these LLMs; an opportunity which presents its own challenges.

If we limit ourselves to some kind of smarter automation – guiding the LLM through a task or some workflow – this would probably work. But with LLMs we can do more. We can declare a desired goal/outcome, and let the agent decide on how to achieve it. A reasoning agent equipped with a set of capabilities (=tools) and a desired goal can work to achieve it, without us coding the concrete workflow, or even explaining it in detail. This is what we’re already seeing with AI IDEs. Assuming the capabilities are robust enough – on par with what a user would do – the agent should be able to achieve the task on its own.

This may seem insignificant at first look. But I think it changes how we would want to design these systems if we want to really leverage high-end LLMs. Given an agent with enough tools, the user can also instruct it to do all kinds of tasks that a developer/product manager/architect did not even consider when building the application. It can be a simple heuristic change or completely new workflows – all only a prompt away. Years of enterprise software customizations and pluggable architectures prove that this is a very real need. And reasoning LLMs, with their open-ended flexibility, supercharge these customization capabilities.
It’s more than customization of existing workflows. It’s also finding new ways to use the existing capabilities. Similar to how people string together Linux shell commands, or use spreadsheets for everything from accounting to games and habit tracking – once the capabilities are there, we’re only limited by the user’s imagination.

In addition, I think there’s also a chance for a pattern that might be unique to LLMs being at the heart of such an application. This is derived from the use of context. When an agent in such an application works, it consumes context (usually fed by different tools, or from its progress). But it also has the chance to affect future context, for other agents or for its future invocations. It maintains a memory of past actions and interactions, similar to how I turn certain conversations I have in Cursor into rules or commands, or update the AGENTS.md file. This has the potential to allow the agent to improve over time, even automatically, without a human in the loop.

But what does such an application look like?

Anatomy of an Agent-Driven Application

As I see it, the architecture of an agent-driven application centers around the Agent Loop. Unlike traditional software that relies on rigid, pre-defined workflows to execute logic, this architecture relies on an agent, or a set of collaborating agents, working autonomously to achieve a specific goal.

In this development model, we do not define concrete flows. Instead, we define the application by providing the agent with a set of tools. These tools allow the agent to perceive its environment and act upon it. The desired outcome is defined through a combination of prompts, originating from both the system developers and the end-user, and existing context. The agent then works through the task in its loop, until it completes its work. This is similar to the idea of Web World Models, only applied to business scenarios, which hopefully makes it more constrained.

The execution path is dynamic rather than static. Because the agent maintains a context that evolves, learns, and potentially forgets over time, the specific steps taken to achieve an outcome may change. The application defines what needs to be done, while the agent determines how to do it based on its current context and available tools. The potential for such an application is more than simple automation of tasks; it’s also about finding ad-hoc ways to achieve a given goal or completing unforeseen (desired) outcomes.

Examples include agents with varying levels of complexity:

  • A planning agent that responds to events, queries application state and decides how to allocate resources, collaborating with other agents to verify choices and other constraints, and eventually notifying users and downstream systems.
  • A troubleshooting agent that leverages various data sources to correlate and find insights in the data, iteratively exploring it until it offers several theories answering the question asked.

There are three main components to such an application: the capabilities (tools) available to the agent, the shared context and the agent loop.

The Agent’s Capabilities (Tools): The agent interacts with the application and external systems through tools. These tools should be atomic and composable. Initially, they may be simple primitives. Over time, as we observe how the agent utilizes them, these tools can evolve into more complex capabilities. The agent selects these tools dynamically to solve problems, invokes them and acts on their result. 
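To make the idea of atomic, composable tools concrete, here is a minimal sketch of what such a tool abstraction could look like. All names (`Tool`, `lookup_order`) are illustrative assumptions, not from any specific framework; the key point is that each tool is self-describing (so the agent can choose it) and returns a plain result the agent can reason over.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str          # what the LLM reads when selecting a tool
    parameters: dict          # JSON-schema-style parameter description
    run: Callable[..., dict]  # the actual implementation

def lookup_order(order_id: str) -> dict:
    # Placeholder implementation; a real tool would query the order service.
    return {"order_id": order_id, "status": "shipped"}

lookup_order_tool = Tool(
    name="lookup_order",
    description="Fetch the current status of an order by its id.",
    parameters={"order_id": {"type": "string"}},
    run=lookup_order,
)
```

Starting from primitives like this makes the later "graduation" into richer, composite capabilities a mechanical step.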

Shared Context: Context is the memory of the system. It is not limited to a single interaction but persists and evolves between agent sessions. This shared context allows the agent to learn from previous interactions. It ensures that the agent does not start from zero with every task but builds upon a history of user preferences and past decisions, in addition to the system state. This memory is shared between agents working, but also between the agents and the users. It’s possible for a user to interact directly with the context, correct/change it and therefore direct the agent(s), within normal data access limitations.

The Agent Loop and Completion Signals: The agent lives in a loop: it observes the state, reasons through the next step, acts using a tool, and then looks at what happened, repeating until it determines the task is complete. Note that since the system centers around the agent loop, identifying when it’s finished is critical and an integral part of the pattern. There could be different signals for completion (e.g. “fully completed”, “partially completed”, “completed but unknown state”, “failed”).
It’s important to distinguish between a completion signal and some execution failure. It’s quite possible that a tool execution fails, but the agent continues to reason and work around it. It’s also possible to have all tools successfully execute, with the overall outcome not achieved due to other reasons.
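The loop and its completion signals can be sketched roughly as follows. This is a simplified model under stated assumptions: `reason` stands in for the LLM call (returning either a tool invocation or a completion signal), and a failed tool call is treated as an observation rather than a terminal failure, matching the distinction above.

```python
from enum import Enum

class Completion(Enum):
    FULLY_COMPLETED = "fully_completed"
    PARTIALLY_COMPLETED = "partially_completed"
    UNKNOWN_STATE = "completed_unknown_state"
    FAILED = "failed"

def agent_loop(reason, tools, goal, max_steps=20):
    """Observe -> reason -> act, until the agent signals completion.

    `reason(goal, history)` stands in for the LLM: it returns either
    ("call", tool_name, args) or ("done", Completion).
    """
    history = []
    for _ in range(max_steps):
        decision = reason(goal, history)
        if decision[0] == "done":
            return decision[1], history
        _, tool_name, args = decision
        try:
            result = tools[tool_name](**args)
        except Exception as exc:
            # A failed tool call is just another observation: the agent
            # may work around it on the next iteration.
            result = {"error": str(exc)}
        history.append((tool_name, args, result))
    return Completion.UNKNOWN_STATE, history
```

The `max_steps` cap and the `UNKNOWN_STATE` fallback reflect the point that "all tools succeeded" and "the goal was achieved" are separate questions.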

Design Principles

Now that we’ve established the general idea of what an agent-driven application looks like, it’s worth laying down some points which should help us design such a system effectively.

Agent Capabilities Match User Capabilities

We should aim for capability parity between the user and the agent. If a user can achieve some outcome in the system, the agent must have a corresponding tool or set of tools to achieve the same result. This does not necessarily mean the agent manipulates the UI widgets; rather, it means the agent has programmatic access to the same underlying logic and mutations that the UI exposes to the user. It might be through a different path, but if we want the agent to achieve the same outcomes as a user, it should have capabilities that are on par with the user’s capabilities to affect the system.

Application Logic Lives In Prompts

The core logic of the application shifts from code to prompts. We use prompts to define the business constraints and desired outcomes. Deterministic code is still there, for various reasons, but the more flexible we want to be, and the more open to agentic reasoning, the more we need the desired logic to exist in prompts. I also expect that the definition of business flows will be less prescriptive. Instead it will focus on establishing goals and constraints. Think of it like SQL for business logic. You declare the ‘what’ (the query), and the engine figures out the ‘how’ (the execution plan)4. There’s of course a twist here: our “engine” is a non-deterministic LLM working with an ever-evolving vocabulary of tools. This is harder compared to optimizing over a relatively narrow domain (relational algebra).

Consequently, this changes how we debug an agent-driven application. Instead of stepping through lines of code to debug logic errors we analyze execution traces to understand the agent’s reasoning process and tool selection.

Guardrails are Explicit – In The Tools

While the agent is autonomous, it must operate within safety boundaries. We do not rely on the agent’s “judgment” for critical constraints. Concerns such as data consistency, authorization, and sensitive data access are enforced strictly by the tools themselves. The tool allows the action only if it meets the hard-coded security and business rules. Some of the safety guardrails can be in the prompts, but we should not rely on this as a security measure.
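One possible way to keep guardrails in the tool itself is shown below. The interfaces (`auth_service`, `payments`) and the refund scenario are assumptions for illustration; what matters is that authorization and the amount cap are enforced in code, regardless of anything the prompt says.

```python
class AuthorizationError(Exception):
    pass

def make_refund_tool(auth_service, payments, max_refund=100.0):
    """Wrap a sensitive action so the guardrails live in the tool.

    The agent can *request* a refund, but authorization and the cap
    are checked here, not left to the agent's judgment.
    """
    def refund(user_id: str, order_id: str, amount: float) -> dict:
        if not auth_service.can_refund(user_id, order_id):
            raise AuthorizationError(f"{user_id} may not refund {order_id}")
        if amount > max_refund:
            # Hard business rule: large refunds always go to a human.
            return {"status": "escalated", "reason": "amount above cap"}
        payments.refund(order_id, amount)
        return {"status": "refunded", "amount": amount}
    return refund
```

Prompt-level instructions ("never refund more than $100") can still exist, but only as guidance; the enforcement point stays deterministic.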

Capability Evolution

The agent’s capabilities in the system are not necessarily static. We can evolve them by observing how the agent “behaves”. Concretely, we treat the agent’s behavior as a source of requirement generation. By observing traces, we identify common patterns or sequences of actions. We then “graduate” these patterns into more elaborate, hard-coded tools. It’s technical and logical refactoring that’s driven by how we observe the system behaving.

I see a few main motivations for this kind of evolution:

  1. Optimization: Hard-coded tools reduce cost and latency compared to multiple LLM round-trips.
  2. Domain Language: Creating specific tools establishes a richer, higher-level vocabulary for the agent to use, making it more effective within our specific business domain.

It’s also possible that we’d want to code some tool in order to guarantee some business constraint, e.g. data consistency. However, I believe this will not be so much an evolution of a tool but rather a defined boundary condition for the definition of a tool in the first place, maybe a result of a new business requirement/feature.
  

It’s quite possible that a few granular tools will be combined into a more complicated one if the pattern is very common, and we can optimize the process. Still, I wouldn’t discount the more granular tools as they provide the flexibility we might like to preserve.
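A sketch of this "graduation" step, under a simplifying assumption: each granular tool's output feeds directly into the next (a real pipeline would map outputs to inputs explicitly). The `graduate` name is mine, not an established term.

```python
def graduate(tool_sequence, name):
    """Fold an observed, frequently repeated sequence of granular tool
    calls into one composite tool: a single deterministic call instead
    of several LLM round-trips. The granular tools stay available."""
    def composite(**initial_args):
        result = initial_args
        for tool in tool_sequence:
            # Simplifying assumption: each step's result is the next
            # step's keyword arguments.
            result = tool(**result)
        return result
    composite.__name__ = name
    return composite
```

The composite becomes part of the richer domain vocabulary (motivation 2 above), while the agent can still fall back to the primitives when the common pattern doesn’t fit.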

Tradeoffs and Practical Considerations

Naturally, when designing any kind of system, we make tradeoffs; real-world systems demand practicality beyond theory. So it’s important to understand whether this kind of architecture pattern and technology carries any specific considerations or tradeoffs.

Model selection and configuration is an obvious point to note when building a system where LLMs sit at the heart of it. Not all tasks are created equal, and some may require a higher level of reasoning than others. The tradeoff is cost and latency versus reasoning and expressive power, along with the inherent capabilities of the model (e.g. whether it’s multi-modal). For example, a “router” agent that identifies and dispatches messages to other agents/processes may work well enough with a cheaper (weaker?) model; whereas an agent requiring deep understanding of a domain model, and how to retrieve and connect different bits of information, working for a longer time, may require a stronger model. This will probably be more evident in systems where there’s a topology of cooperating agents.

Then there’s the elephant in the room: the tradeoff between autonomy and risk. This is an obvious point when considering a somewhat stochastic element in the architecture.

On the one hand, autonomy provides the agent, and ultimately the user, more flexibility. This should immediately lead to more unexpected use cases and “emergent behavior” mentioned above. Consider, for example, an agent dealing with financial records that can be used to identify issues and fix them without pre-programming the patterns in code.

On the other hand, there’s an inherent risk with allowing too much. Restricting the agent’s capabilities increases predictability and therefore safety. It of course limits the product’s value at the same time. On the extreme end, a very limited agent is kind of a fancy workflow engine5.

Applications obviously exist on a spectrum here, but this is a prime consideration when designing the agent’s capabilities.

The intersection of LLM context and long running agent carries with it some points to pay attention to as well.
First of all, long-running agents will probably “run out” of context window. Trying different tools, retrying failed actions, and accumulating data and observations will inevitably fill the context window. This is an expected problem in this scenario. Its impact and frequency will most likely correlate with task complexity and tool capabilities.
When building such a system, we should provide a standard, hopefully efficient, way to summarize or compact the context. Simply dropping a “memory” is usually not an option. There should be a standard way for agents to retrieve memories where applicable. This will likely be a core component of the system, and it’s still open (at least for me) whether there’s a general mechanism for managing context that will fit all kinds of tasks and/or applications and agent topologies.
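One possible shape for such a compaction mechanism, under assumed thresholds: keep the tail of the conversation verbatim and fold older messages into a summary "memory" rather than dropping them. `summarize` stands in for an LLM summarization call.

```python
def maybe_compact(messages, token_count, limit, summarize):
    """Compact the running context when it approaches the window limit.

    `token_count` measures the context size; `summarize` stands in for
    an LLM call. The keep-4 tail is an arbitrary illustrative choice.
    """
    if token_count(messages) <= limit:
        return messages
    keep_recent = 4
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)
    # Older work survives as a retrievable memory instead of vanishing.
    return [{"role": "system",
             "content": f"Summary of earlier work: {summary}"}] + recent
```

Whether one policy like this generalizes across tasks and agent topologies is exactly the open question raised above.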

Which brings me to another point about context – managing context across agents, and the intersection of agents and users. The context for an agent will evolve across sessions. And it might actually be a good thing, depending on the application, to make it accessible to the human user. For example, if we want to allow the user to fix a data issue and/or somehow change the behavior by modifying some learned memory. There is a potential here for conflicts between changes. So we should consider how conflict resolution is done when it occurs on context updates.

User interface and experience should also be considered carefully here. Since a fundamental building block is the agent loop, the state and progress of the agent should be reflected to the human user, and maybe to APIs as well. Faithfully reflecting the state of the system, specifically the behavior and reasoning of the agent(s) running in it, helps to identify issues and build trust. I expect this to be a non-negligible issue when building and adopting such an application. Completion signals are part of this standard pattern, and probably deserve “first-class” citizen status in the application. Users need to understand what an agent is doing and whether it has finished the goal they presented it with. Understanding when, whether, and sometimes how a goal was achieved should be standardized.

One last tradeoff to point out is the mechanism agents use to discover tools. You can have a static, hard-coded list of tools (capabilities) available to each agent. This provides a more predictable list and therefore higher control. Alternatively, you can imagine a more dynamic “tool registry”, where tools may be added and made available to agents over time. The agent chooses the tools either way, but its choice is easier to predict with a static list. An evolving, dynamic registry offers more flexibility but is less predictable – I expect the agent will have a tougher time selecting the right tool in this case.

If we want true flexibility, we lean into the dynamic registry. And if the agent gets lost in the aisles?6 We can always fall back to a “safer” hard-coded map of tools.
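A toy sketch of that combination: a dynamic registry searched first, with a hard-coded fallback when nothing matches. The keyword-overlap matching is a deliberately naive stand-in for a real retrieval step (embeddings, descriptions fed to the model, etc.), and all names are made up.

```python
class ToolRegistry:
    """Dynamic tool registry with a static "safe" fallback.

    Matching here is naive keyword overlap over tool descriptions; a
    real system would use a proper retrieval mechanism.
    """
    def __init__(self, fallback):
        self.dynamic = {}        # name -> (description, callable)
        self.fallback = fallback # safe default handler

    def register(self, name, description, fn):
        self.dynamic[name] = (description, fn)

    def select(self, task: str):
        words = set(task.lower().split())
        for name, (desc, fn) in self.dynamic.items():
            if words & set(desc.lower().split()):
                return name, fn
        # Agent "lost in the aisles": use the hard-coded fallback.
        return "fallback", self.fallback
```

The predictability/flexibility dial is then mostly a question of how much ends up in `dynamic` versus how much stays hard-coded.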

From Magic Boxes to Design Blueprints

Whether we choose static control or dynamic flexibility, the goal remains the same: building a robust environment for autonomy. 

We are rapidly moving past the phase where an LLM is a “magic box” bolted onto the side of a traditional app. We need to think about how we design these systems. We have to get serious about the architectural patterns that allow these agents to actually get work done without constant human hand-holding.

The transition to agent-driven applications presents a new set of interesting problems for us to solve7. We’re no longer just designing APIs for human coders; we’re designing vocabularies for agents. The challenges ahead – how to build tools that are legible to a model, how to share context across a multi-agent swarm without it becoming a game of “telephone”, and how to let that context evolve organically – are the new unexplored territories of software system design.

Building these systems isn’t just about writing code anymore; it’s about building a harness for reasoning. It’s messy, it’s non-deterministic, and therefore less predictable. But it’s also an exciting architectural shift.

So, let’s keep exploring and see what else we can find.


  1. or, god forbid, write something in markdown. ↩︎
  2. More examples here. ↩︎
  3. Pun intended ↩︎
  4. Yes, I realize it’s a bit more complicated than that, but you get the idea. ↩︎
  5. And we have plenty of those, good ones, with no LLMs involved. ↩︎
  6. That is, fails to accomplish its goal ↩︎
  7. or at least old problems with new technology ↩︎

MAD About Software Design: When AI Debates

So at this point, I think we’ve established that LLMs can code (right?). They’re only getting better at it. I’ve also argued in the past that I believe LLMs can do more than just code to improve our software engineering lives. But this isn’t a simple task. There’s quite a bit of essential complexity in the process; it’s beyond simply automating day-to-day tasks1.

Imagine my interest, then, as I stumbled upon the idea of LLM-based AI agents debating each other. The concept isn’t unique to software engineering, but it still appealed to me as a way to simulate (or at least approximate) an actual software design process and, by extension, scale or improve it.

Before we dive into my implementation, let’s step back and understand the concept, and where it fits.

A Discourse of Agents

LLMs are powerful2, but they often come with a (potentially significant) catch. A single LLM, as capable as it is, can easily suffer from issues like hallucinations, inconsistent reasoning, and bias. And the more complex the task, the more likely it is to exhibit these issues. This is the “single-agent trap”: relying on one model’s perspective means you are exposed to its blind spots. This isn’t that different from trying to solve complex tasks as humans3 – the more complicated the task, the more we benefit from collaborating with others.

We have ways to mitigate some of the problems to an extent – prompt and context engineering, RAG, access to tools.

So what if we didn’t have to rely on a single AI agent’s answer, just as we humans collaborate with other people when working on complex issues?

This is where multi-agent debate (MAD) comes in. MAD provides a complementary approach that uses iterative discourse to enhance reasoning and improve validity. See examples here.

You can think of it like a collaborative “society of minds”. Instead of one agent providing one answer, multiple agents propose and critique solutions to the problem. This goes on for several rounds of discussion, where agents challenge each other’s proposals, spot errors and refine their ideas. Eventually, the goal is for this process to converge on a superior final answer.

While I don’t intend to provide a full literature review here, or any kind of exhaustive description4, I think it’s worth understanding the main components, findings and challenges.
What follows below is a crash course on Multi-Agent Debate (MAD). But if you’re interested in more detailed evidence and nuance, I encourage you to follow the links and explore some more.

MAD – The Bird’s Eye View

So how do these debates actually work under the hood?
There are different implementations, and from what I’ve seen, they vary significantly for different reasons. But three fundamental components repeat in all cases.

First is the agent profile, which defines the roles or “personas” of the debating agents. A simple setup might define agents that are symmetrical peers. But more complicated setups assign specific roles to agents. For example, one agent may be a “critic”, another a “security expert”, etc. There are different ways to create this diversity. Everything from using different models, configuring them differently, and prompting the different agents to hold/emphasize divergent views.

Second is the communication structure – the topology. This is essentially the network map that dictates who talks to whom. A common choice is a fully connected topology, where all agents see each other’s messages. Other approaches use sparser topologies (agents interact only with specific neighbours) or even route everything through a single orchestrator/dispatcher. The choice of topology of course changes the debate dynamic.

Finally, there is the decision-making process: how the debate is concluded. After the agents have debated amongst themselves, how do you decide it’s time to conclude and compile a final answer?
The simplest method, which works well for certain types of problems, is simple majority voting. This works best when the answer is a single deterministic value, e.g. math problems. Another, more structured approach is to use a “judge” (or “arbiter”) agent. This agent listens to arguments from all sides and selects or compiles a winning answer.
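For the voting variant, the decision step is almost trivial; the only subtlety worth encoding is detecting a tie, which is a natural signal to hand off to a judge instead. A minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Conclude a debate over discrete answers (e.g. math results) by
    simple majority. Ties are flagged so a judge agent can break them."""
    counts = Counter(answers)
    (top, n), *rest = counts.most_common()
    tied = any(m == n for _, m in rest)
    return top, tied
```

With `["42", "42", "41"]` this returns the majority answer `"42"` with no tie; with an even split it flags the tie rather than picking arbitrarily.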

Does It Work?

Yes, to a degree.
Current research suggests that multiple agents working together achieve better results, especially when the complexity of the tasks increases. This example shows significant improvements on math problems.

Multi-Agent Debate (MAD) systems seem to improve factuality and the accuracy of results. Agents seem to be able to spot errors in each other’s reasoning, improving consistency. Some evidence can be seen here and here among others.

Tasks that are more complicated, and/or require more diversity of thought, seem to benefit from this pattern more. Specifically, it seems that iterative refinement and using different models to propose and debate each other yields better results – more consistent answers that align better with human judgement.

Does It Always Work?

Of course not. It wouldn’t be fun otherwise.

This study, for example, suggests that it’s not so much the debate that’s improving performance, but rather the “multi-agent” aspect of it. Another study suggests they are difficult to optimize (though it does conclude they have potential for out-performing other methods).

There are also distinct failure modes. This study suggests that models may flip to incorrect answers under some conditions. And they require more careful setup, specifically guidance on how to criticize answers from other agents – a structured critique guidance.

There are of course cost considerations to be had, as any engineering problem. Multiple agents making repeated calls to LLMs with potentially growing (exploding?) context mean cost can easily get out of hand.

This is an active research area, with probably more results and implementations to be shared in the near future.

So while we’re here, why not join the fun, and try to apply it?

MAD About Software Design

This pattern of debating agents can be applied to all sorts of problems, as the studies linked above show. Software system architecture should not be an exception. I could not find another implementation of this pattern that’s related to software engineering. The closest is MAAD, which seems nice, but as far as I could see it does not exactly implement a debate pattern, but rather a set of cooperating agents working towards the goal of producing a design specification.

Part of the reason this piqued my interest is that in my line of work, when considering feature and system designs, a debate5 is a natural dynamic. This is simply what we do – we discuss, brainstorm and often argue over different alternative solutions. AI agents debating over a design problem seems like a natural fit.

This is where Dialectic comes into play. 

This is a small, simple implementation6 of the multi-agent debate pattern, with a focus on software engineering debate. It is a command line tool, receiving a problem description and a configuration of a debate setup, and carries out a debate between different agents. The tool facilitates the debate between the agents with the goal of eventually arriving at a reasonable, hopefully the best, solution to the presented problem, with concrete implementation notes and decisions.

When it comes to the debate setup, Dialectic allows the user to specify the number and role of participating agents. A user can choose from available roles – “Architect”, “Performance Engineer”, “Security Expert”, “Testing Expert” and “Generalist”7.

The current implementation has a rather rigid debate structure: for a fixed number of rounds (configurable), each agent is asked to propose a solution, then critique all of the other agents’ solutions, and refine its proposal based on the feedback from other agents. The refined proposals are fed into the next round. At the end of the last round, a Judging agent receives the final proposals and compiles a synthesized solution from all participating agents.
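The propose/critique/refine/judge structure described above can be sketched as follows. This is my simplified reconstruction of the pattern, not Dialectic's actual code; the agent objects stand in for LLM-backed roles, and the judge is any callable that synthesizes the final proposals.

```python
def debate(agents, judge, problem, rounds=2):
    """Fixed-structure debate: per round, every agent critiques all other
    agents' proposals and refines its own; the judge then synthesizes.

    `agents` maps names to objects with propose/critique/refine methods,
    standing in for LLM-backed roles ("Architect", "Security Expert"...).
    """
    proposals = {name: a.propose(problem) for name, a in agents.items()}
    for _ in range(rounds):
        # Fully connected topology: each agent sees everyone else's proposal.
        critiques = {
            name: [a.critique(p) for other, p in proposals.items() if other != name]
            for name, a in agents.items()
        }
        # Refined proposals feed the next round.
        proposals = {
            name: agents[name].refine(proposals[name], critiques[name])
            for name in agents
        }
    return judge(problem, proposals)
```

The fixed round count and all-to-all critique here correspond to the rigid structure the post describes; both are the knobs the later "What's Next?" section proposes loosening.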

As a user, you can control the number of rounds, the prompts used, temperature and model per agent. See here for a more complete description of configuration options.

Why This Debate Pattern?

The chosen debate pattern and configuration options are intentional8, in an attempt to mitigate some of the problems mentioned above.

First, different “roles” (essentially different sets of agent system prompts) offer different perspectives. When debating, specifically criticizing each other’s work, the offered different perspectives should allow consideration of different arguments for choices. This hopefully avoids at least some of the potential groupthink.

Additionally, each agent can be configured with a different LLM model and different temperature. This offers a chance at combining models with different strengths (and costs), potentially trained and tuned on different data sets. This heterogeneous debate setup, which combines different agent profiles, allows for a rich interaction of viewpoints. This is especially true given the current fixed topology, where every agent critiques all other agents’ proposals.

The possibility of clarifications from the user also allows for additional context based on specific agents’ input (the agents ask the user questions). This not only brings more focused context to the debate, but also mimics a real-world dynamic where the development team interacts with the product owner/manager over clarifications that come up during a discussion (“what should we do in this case? – this is a product decision” is a common phrase heard around the office).

Dialectic also supports context summarization to try and avoid context explosion. There’s of course a trade-off here, but for practical cost reasons9, it should support a way to manage context size. Some models can be quite “chatty” and end up with big responses.

Apart from being a tool to be used in practice, I realize the different options and combinations possible can lead to very different results, and quality may vary for any number of reasons. This is why output options also vary: you can simply output the final synthesized solution, or get a structured file containing a more detailed step-by-step description of the entire debate, with all configurations plus latency and token-usage figures. There’s also the option to produce a complete debate report in markdown format. This should allow users to experiment with different debate and agent configurations, and hopefully settle on a setup (maybe several) that fits their purposes best.

What’s Next?

At this point, you can start using Dialectic and experiment with it on different problems and debate setups. I plan to do so as well.

Initial experiments seem anecdotally promising. When used with advanced models, it’s producing reasonable results. It’s a practical tool, still evolving, that shows promise in helping to analyze and reach solutions in complex domains faster and more comprehensively. But we’ll need to evaluate results more systematically, so this is the obvious next stage.

At the same time, I believe it can still help as a brainstorming partner. Having a tool that automatically analyzes a problem from several angles and refining it is at the very least helpful in covering options and exploring ideas.

But it’s clear that some things can and should be improved/added.

To start, a lot of real-world (human) discussions implicitly involve pre-existing knowledge. This is part of the experience we have as professionals. Specifically, knowledge and context of our specific systems (the “legacy code”), patterns and domains. While it’s possible to include a lot in the given problem description and clarifying questions, I believe it should be possible for debating agents to query further information and knowledge. We will probably need to support plugging in extra knowledge retrieval, driven by the agents to allow them more focus and refined answers.

Another thing to look into is how the debate terminates. Currently it’s a fixed, configured number of rounds: all rounds run, and the judge has to synthesize an answer at the end. But this is not the only way. We can terminate the debate when no new ideas or issues seem to be coming up. We can have the agents report a confidence score for their proposals, and terminate the debate when all (most?) agents are confident beyond some set threshold.
We can also instruct the agents and judge to propose follow-ups, and use the result of a given debate as the input to another, with extra information.
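The confidence-threshold idea could look roughly like this. It's a sketch of one possible termination policy, not an implemented Dialectic feature; `step` stands in for running one debate round and asking each agent to rate its own proposal.

```python
def run_until_confident(step, max_rounds=5, threshold=0.8):
    """Alternative termination: stop when every agent's self-reported
    confidence clears a threshold, instead of a fixed round count.

    `step(round_idx)` runs one debate round and returns the agents'
    confidence scores for their current proposals.
    """
    for i in range(max_rounds):
        confidences = step(i)
        if all(c >= threshold for c in confidences):
            return i + 1  # number of rounds actually used
    return max_rounds  # cap reached without consensus confidence
```

A "most agents" variant would just swap `all(...)` for a quorum check; the cost saving comes from skipping rounds the debate no longer needs.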

The current topology is also fixed. It will be interesting to experiment with different topologies. For example, have the specialist (security, performance, testing) agents only critique the architect’s proposals. A step further would be for an orchestrator agent to dynamically set up the topology, based on some problem parameters.

Agent diversity is also interesting. There is evidence that diversity of agents improves results in some cases. Playing with the LLM models used, their temperature and specific prompts can potentially complement each other in better ways. We could, for example, create an agent that is intentionally adversarial, and pushes for alternative solutions.

The tool itself can of course be augmented with interesting features:

  • Automatically deriving and outputting ADRs
  • Adding image(s) as some initial context.
  • Connecting with further context available from other systems as input10, so the agents’ analysis is more evidence-based

These should be helpful in making it more useful for day-to-day work.

Of course, costs are also important. The current implementation tries to summarize so we don’t hit token limits too early. But it’s possible we can find more ways to optimize costs. Skip calls when not necessary, summarize to a smaller size every round, etc.

So Software Designers are Obsolete?

No.
I do believe there’s still a way to go before this replaces the human dynamics of discussion. One thing that I still don’t see LLMs doing well is weighing trade-offs, especially when human factors11 are in play. This is more than a simple information gap that can be solved by tooling. I don’t see how agents implicitly “read the room”, and I don’t see how agents mimic human intuition.

I do see this as a step forward, not only because we can automate a lot of the research and debate. But also because the analysis given by such agents is almost guaranteed to be more driven by information, cold analysis, and the vast knowledge embedded within them. Agents don’t get offended (I think) when their proposal is not accepted, or when they don’t get to play with the cool new technology.

Summary

Dialectic is a simple tool that tries to implement a potentially powerful pattern of agentic systems in the realm of software engineering. If done properly, I believe it can help in reaching decisions faster and with higher quality, especially when scaling design work with a larger organization. And this is what mechanization is all about.

The combination of LLM-based agents into a debate and feedback loops should enable more complete solutions, likely with higher quality.

Off to design!


  1. Which is of course still a welcome improvement ↩︎
  2. And continue to improve ↩︎
  3. Sadly, even the hallucination part is true for humans sometimes. ↩︎
  4. A decent review can be found here. But any “Deep Research” AI will help you here. ↩︎
  5. Between people – humans in the loop! ↩︎
  6. Yes, it could have been implemented with something like LangChain/Graph or probably even some kind of low-code tooling. But I also like to learn by doing, so I opted for a more bare-bones approach of coding from scratch. We might port it to use some other framework in the future. ↩︎
  7. Note there’s nothing fundamentally Software-specific about this pattern, except these roles. It’s straightforward to apply the debate pattern to other roles. ↩︎
  8. And still evolving ↩︎
  9. I got too many 429 errors complaining about token limits when testing ↩︎
  10. MCP server support? ↩︎
  11. Business pressures, office politics ↩︎

AI Adoption Roadmap for Software Development

I’ve argued before that LLMs’ greatest promise in software engineering lies beyond raw code generation. While producing code remains essential, building scalable, cost-effective software involves far more: requirements, architecture, teamwork and feedback loops. The end goal is of course producing useful and correct software, economically. But the process of producing software, especially as the organization scales, is much more than that.

So how do we adopt AI across a growing software organization – efficiently and at scale?

We’ve gone through1 paradigm shifts before – agile, microservices, DevOps are some examples. Is AI different in some more profound way, or another evolutionary step?

I believe this is a slightly different story compared to other technologies, at least when it comes to the practice of SW development.

First, this is an area of very active research, with new models and papers dropping constantly – fueling FOMO and the risk of distraction. Teams can quickly feel overwhelmed without a clear adoption path.

Second, a technology that sits at the intersection of machines and human communication (because of natural language understanding) has the potential to disrupt not only the technical tools we use, but our workflows and working patterns at the same time. AI feels less like another toolchain and more like a collision of Agile and microservices – reshaping not just code, but communication flows themselves. This may be going too far, but I sometimes imagine this is the first time Conway’s law might be challenged.

The AI ecosystem, especially in the software engineering space2, is abundant with tools and technologies. The rate of development is staggering, and it’s getting hard to keep up with the tools, patterns and techniques being announced and shared.

Randomly handing teams new AI toys can spark short-term wins. But to unlock AI’s transformative power, we need to be more intentional about it. We need a deliberate adoption roadmap.

Our aim: weave LLMs into daily software engineering to maximize impact. But with tools and standards still maturing, a rigid, long-range plan is unrealistic. There are few substantial case studies showing adoption at scale at this point. As in the early days of the world wide web, some imagination and extrapolation is required3, and naturally some of it will be wrong or will need to be updated.

It’s natural to chase faster coding as the low-hanging fruit. Yet AI’s true potential lies in higher-level workflows. Since I believe the potential is much greater, I try to follow a more structured approach to navigating this challenge.

This is my attempt to think through and articulate an approach to AI adoption for a software development organization. It’s positioned as a (very) high-level roadmap for adopting AI in a way that benefits the organization while remaining viable and efficient.

This will probably not fit every organization as-is. Specifics of business, architecture, organizational structure and culture will probably require adapting it, even significantly. Still, I believe it can be used as a framework for thinking about the topic, and can serve at the very least as a rough draft for such a roadmap.

I will of course be happy to hear feedback or how others approach this challenge, if at all.


Before diving into details of such a suggested roadmap, I will need to introduce a preliminary concept which I believe to be central to the topic of AI adoption – AI Workspaces.

AI Workspaces

Most AI technology today focuses on transactional tool usage – a user asks something (prompts), and the AI model responds, potentially with some tool invocations. The utility of this flow is limited, mainly because crafting the prompt and providing the context is hard. Some AI tools provide facilities and behind-the-scenes code that injects further context, but this is still localized, and not always consistent. From the user’s point of view it’s still very transactional. 

In order to realize more of AI’s potential for simplification and automation, we need to consistently provide context that is kept up to date and used whenever needed. We need to combine AI tools with relevant, current context so more complicated tasks can be achieved. And the more autonomy the AI has, the easier it will be for users to apply it successfully.

I’m proposing that we need to start thinking about an “AI workspace”.

An AI workspace is a combination of:

  1. Basic AI tools, e.g. models used, MCP servers, with their configuration.
  2. Custom prompts, usually focused on a task or set of tasks in some domain.
  3. Persistent memory – a contextual knowledge source, potentially growing with every interaction, that is relevant to tasks the AI is meant to address.

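To make the concept concrete, a workspace could be modeled as a small bundle of these three parts. The structure and field names below are purely illustrative – not any vendor’s schema:

```python
# Illustrative model of an AI workspace: tools + prompts + persistent
# memory that grows with every interaction. All names are invented for
# this sketch, not taken from any specific product.

from dataclasses import dataclass, field

@dataclass
class AIWorkspace:
    models: list[str]                      # which LLMs are configured
    mcp_servers: dict[str, str]            # server name -> endpoint
    prompts: dict[str, str]                # task name -> custom prompt
    memory: list[str] = field(default_factory=list)  # persistent context

    def remember(self, fact: str) -> None:
        # Every interaction can add knowledge, compounding over time.
        self.memory.append(fact)

    def context_for(self, task: str) -> str:
        # Combine the task prompt with the accumulated memory.
        return self.prompts.get(task, "") + "\n" + "\n".join(self.memory)

ws = AIWorkspace(models=["some-model"],
                 mcp_servers={"tickets": "http://localhost:9000"},
                 prompts={"triage": "You triage incidents for team X."})
ws.remember("Service A owns the billing flow.")
print("billing" in ws.context_for("triage"))
```

The interesting part is `remember`/`context_for`: the same memory feeds every subsequent task, which is where the compounding effect described above comes from.
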
The combination of these, using different tools and techniques, should provide a complete framework for AI agents to accomplish ever more complex tasks. The exact capabilities depend of course on the setup, but the main point is that all of these elements, in tandem, are necessary to create more elaborate automation. 

A key point here is the knowledge building – the persistent memory. I expect that an AI workspace is something that’s constantly updated (automatically or by the user) so the AI can automatically adapt to changing circumstances, including other AI-based tasks. There should be a compounding effect of knowledge building over time and being used by AI to perform better and more accurately.

An AI workspace should be customized for a specific task or set of tasks. But it can be more useful if it is customized for a complete business flow that brings together disparate systems and roles in the organization. This will arguably make the workspace more complex and harder to set up, but if used consistently over time, the overhead might be worth it.

We’re already seeing first signs of this (e.g. Claude Projects), but I expect this to go beyond the confines of a single vendor platform, potentially involving several different models, and be open to updates/reading from agents4.

A Roadmap – General Framing

As I’ve already noted, using AI, in my opinion, is more than simply automating some tasks. Automating is great, and provides value, but the potential here is much greater. In order to realize the greater potential we need to leverage the strengths of LLMs, and point them at the right challenges we face in our day to day work in software development.

And these strengths generally boil down to:

  1. Understanding natural language (and other, more formal, languages)
  2. Being able to respond and produce content in natural language (and other, more formal, languages)
  3. Understanding patterns in the input and reasoning over them; and applying patterns to the output.

And do all of this at scale.

Looking at the challenges of software development, our general bottlenecks are less in code production, and more in understanding, communicating and applying our understanding effectively. This includes understanding existing code, troubleshooting bug reports, understanding requirements, understanding system architecture, anticipating impact, translating requirements to plans etc.

Apart from the actual problems we might face in all of these, there’s also a challenge of scale. The more people involved in producing the software (a larger organization), the larger the codebase and the more clients we have – the greater the challenge.

An immediate corollary of the way (non-trivial) software is built is that it’s not just a problem for software developers. More people are involved in building, evolving and maintaining software – DevOps engineers, product managers, designers, customer support, etc. A lot of the challenges stem from the communication patterns and motivations of these different roles.

So when it comes to adopting a technology that has the potential to encompass different workflows and roles, I’m looking at adoption from different angles.

Since this is a roadmap, there’s naturally a general time component to it. But I’m also looking at it using a different axis – the way different roles or workflows (tasks?) adopt AI, and at what point these workflows converge, and how exactly.

The general framing of the roadmap is therefore a progression across phases of different verticals of “types of work” or roles if you will.

Workflow Verticals

When building software5 we have different tasks, performed by separate cooperating professionals. I’d like to avoid the discussion of software project management methodologies, so suffice it to say that different people cooperate to produce, evolve and maintain the software system, each with more or less well-defined tasks6.

Roughly speaking these workflows are:

  1. Design and coding of the software: anything from infrastructure to application design, prototyping, implementation and debugging.
  2. Testing and quality: measuring and improving quality processes – generating tests, measuring coverage, simulating product flows, assessing usage.
  3. Incident management: identifying and troubleshooting issues (bugs or otherwise), at scale. This includes also customer facing support.
  4. Product and Project management: analyzing market trends and requirements, guiding the product roadmap, rolling out changes, synchronizing implementations across teams.
  5. Operations and monitoring: monitoring the system behavior, applying updates, identifying issues proactively, etc.

All of these tasks are part of what builds the software and operates it on a daily basis. There’s obviously some overlap, but more importantly there are synergies between these roles. People fulfilling these roles constantly cooperate to do their jobs.

People in these roles also have their own tools and processes, each in their own domain, with the potential to be greatly enhanced by AI. We’re already seeing a plethora of tools promising, with varying7 degrees of success, to optimize and improve productivity in all of these areas.

Just to name a few examples:

  1. Software coding is obviously being disrupted by AI-driven IDEs and agents.
  2. Product management can leverage AI for analyzing market feedback, producing and checking requirements, simulating “what-if” scenarios, researching, etc.
  3. Incident management can easily benefit from AI analyzing logs, traces and reports, helping to provide troubleshooting teams with relevant context and analysis of issues.
  4. Testing can be generated and maintained automatically alongside changing code.
  5. UX design can go from drawing to prototype in no time.

And I’m sure there are more examples I’m not even aware of. The list goes on.

The point here is not to exhaustively list all the potential benefits of AI. Rather, I argue that for the software organization to effectively leverage AI, it needs to do it across these “verticals”. 

And as the organization and the technologies mature, we have better potential to leverage cooperation and synergies between these verticals. 

This won’t happen immediately. It probably won’t happen for a while, if at all. But for that, we need to talk about phases of adoption.

Phases of Adoption

I’ll outline several phases for the adoption of AI. These phases are not necessarily clearly distinct, and progress across them is probably neither linear nor constant. The point of this description is not so much to provide a concrete timeline as to describe the main driving forces and potential value we can gain at each phase. Understanding this should help us better plan and articulate more concrete steps for realizing the vision.

You can look at these phases as a sort of “AI Maturity Level”, although I’m not trying to provide any kind of formal or rigorous definition to this. It’s more of a mindset.

Phase 1: Exploration and Basic Usage

At this phase, different teams explore the possibilities and tools available for AI usage. The current rate of innovation in this field, especially around software development, is extremely high. Given this, I expect employees in different roles to experiment with various tools and techniques, trying to optimize their existing workflows in one way or another.

At this point, the organization drives for quick wins, where people in different roles leverage AI tools for common tasks, share knowledge internally and learn from the community. 

Covered scenarios at this point are localized to specific workflows and focus mainly on providing context to localized (1-2 people) tasks, as well as automation or faster completion of such localized tasks.

LLM and AI usage at this point is triggered and controlled by humans requesting and reviewing results. The work is very much task/workflow-oriented, with AI tools serving specific, focused tasks. The human-AI interaction is transactional and limited in scope.

The organization should expect to gain the required fundamental knowledge of deploying and using the different tools securely and in a scalable manner, including performance, cost, operations, etc. At this phase, a lot of experimentation and evaluation is happening. It will be good to establish an internal community driving the tooling and adoption of AI. The organization should expect several quick wins and localized productivity gains.

I expect the learning curve to be steep in this phase, so a lot of what happens here is trial and error and comparison of different tools, techniques and models.

AI workspaces at this point, if they exist, are focused on the localized context of individual well-defined tasks. They are also probably harder to establish and operate (integrate tools, add information).

What would be the expected value?
Phase 1 focuses on achieving quick wins and localized productivity gains. By implementing AI code assistants, automated code reviews, AI-generated tests, and anomaly detection tools, the organization can quickly demonstrate immediate developer speedups, improved code quality, faster test coverage, and early incident learning. 

This goes beyond a business benefit. It’s also a psychological hurdle to overcome. Concrete wins, such as fewer bugs and faster releases, build momentum and justify further investment in AI adoption while increasing developer satisfaction. 

In addition, there’s going to be considerable technical infrastructure investment done at this point, e.g. model governance, cost management, etc. This infrastructure should be leveraged in the following phases as well, and is therefore critical. This phase provides a strong foundation for leveraging AI in future stages.

Phase 2: Grounding in Domain-Specific Knowledge

At this phase, having gained basic proficiency, the organization should expect to improve performance and scope of AI-enabled tasks by starting to build and expose organization-specific knowledge and processes to LLM models. 

I expect that business-specific information (internal or external) can increase performance and open up more tasks that can be improved using AI. Examples of knowledge building include better code and design understanding, understanding the relationships between different deployed components, connecting product requirements to code and technical artifacts, etc.

This can open the road to higher level AI-driven tasks, like analyzing and understanding the impact of different features, simulating choices, detecting inconsistencies in product and technical architecture and more.

A key aspect of this phase is to facilitate a consistent evolution of the knowledge so it can be scaled and maintain its efficacy. At this point, the organization needs to have the infrastructure and efficient standards in place so information can be shared between roles, and between different AI-driven tools and processes. 

In this phase AI workspaces become more robust and prevalent, encompassing a larger context, and even crossing workflow verticals in some cases. Contrast this with the workspaces of the first phase, which are focused on localized contexts.

This phase is also when we start thinking in “AI Systems” instead of simply using AI tools. This is where we consistently apply and use AI workspaces, with several tools (AI or non-AI) being combined with the same knowledge base, and evolve it together.

An example would be AI coding agents that automatically connect their implementation to JIRA tickets and product requirements, and record this knowledge. Other AI agents then leverage this knowledge to map it to design decisions and test coverage reports (how much of the product requirements are tested), and to plan rollouts.
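
A sketch of what such a shared record might look like – the field names here are hypothetical, invented only to illustrate the idea of one agent writing knowledge that another reads:

```python
# Hypothetical shape of a shared knowledge record that a coding agent
# writes and other agents (testing, rollout) read later. All field
# names are invented for illustration.

from dataclasses import dataclass

@dataclass
class ChangeRecord:
    ticket: str            # e.g. a JIRA ticket id
    requirement: str       # the product requirement it implements
    commits: list[str]     # code artifacts produced
    tested: bool = False   # filled in later by a testing agent

records = [
    ChangeRecord("PROJ-1", "export report as CSV", ["abc123"]),
    ChangeRecord("PROJ-2", "rate-limit public API", ["def456"], tested=True),
]

# A testing agent could compute requirement coverage from the records:
coverage = sum(r.tested for r in records) / len(records)
print(coverage)
```
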

What value can we expect to have at this point?
Phase 2 is mainly about integrating company-specific (and company-wide) knowledge with AI workspaces. At this point I expect existing workflows to be more accurate, precise and faster, even if the tasks are limited in scope. The grounding provided by the specific knowledge graph should improve the accuracy of AI models.

Different workflow verticals will start to cooperate more closely at this point: first, by building a knowledge graph/base together; but also by leveraging this combined knowledge to implement simple agentic workflows, where AI-based agents start to reason on the data and make simple decisions.

Phase 3: Autonomous Cross Team Workflows

This is the point where the previous infrastructure starts to really pay off in terms of increased productivity and quality.

At this phase of adoption, I expect we’ll see more autonomous AI-driven processes coming to fruition. And when I say “AI-driven” I’m not referring to simply automating a well-known process; I’m referring to AI agents reasoning and dynamically using tools and other agents to adapt and produce results/do tasks8. At this point I expect AI agents can also build their own knowledge, and adapt their work to accommodate changes in the environment.

Humans are still in the loop for critical decision making, but the friction between humans and tools, and between humans themselves, is significantly reduced9. The focus at this point should be on eliminating bureaucracy and increasing the adoption of consistent and increasingly robust workflows. This generalization also means that agentic AI systems now work across roles and departments; this is where the workflow verticals start to converge.

Examples of this would be:

  • Managing changes across roles and workflows. For example, a change in UX/product feature definition that is automatically reflected in plans, and rolled out to clients.
  • Technical design that is validated against technical dependencies (from other teams), past decisions and project plans – potentially updating the dependencies and informing other agents, possibly changing their decisions as a result.
  • Identifying cross-cutting issues from internal conversations, correlated with support tickets and other metrics, and proactively planning and suggesting resolutions.

At this phase, I expect AI workspaces to become really cross-departmental and leverage knowledge being built and added in different verticals.

Ad-hoc exploration and automation of tools should also be possible. At this point, the organization should have a strong foundation of tooling and experience with applying AI. It should be possible to allow ad-hoc building of new flows on top of the existing LLM infrastructure and the ever-evolving organizational knowledge base.

Note that this also poses a challenge: there is a fine line between standardization of tools, which drives efficiencies at scale, and democratization of capabilities. You want people to experiment and find new ways to optimize their work, but in order to efficiently grow you’ll need to apply some boundaries to what is used and how it’s used. This tradeoff isn’t unique to AI systems, but I believe it will become more emphasized when we consider new directions and applications of LLMs as the technology improves.

In terms of expected value, we should expect significant productivity gains. While humans are still in the loop, AI will further automate processes, reducing bureaucracy. The focus will be on adoption of consistent, productive workflows across roles and departments. Human focus should be on innovation and decision making at this point, with accurate and reliable information being made available to humans by the machines10.

Technical Infrastructure

In order to support this process, and looking at the expected phases of adoption, we should plan the necessary technical infrastructure investment. This is true when adopting any new technology, but with the current explosion of tools and techniques, it’s very easy to lose focus.

I won’t pretend to know exactly which tools should be available at what point, nor do I expect to provide a definitive list of tools and compare them here11. But in order to plan investments ahead, and make a concerted effort to learn what will help us, I believe we can give some idea of what will be needed at each phase of adoption.

In phase 1, we naturally explore a plethora of tools. We should be able to provision new models for different use cases. Enabling access to different models through tools that provide a (more or less) uniform facade is useful – examples include OpenWebUI and LiteLLM. We should provide access to AI-driven IDEs, like Cursor, Windsurf and similar ones.

For non-development workflows, AI-based prototyping tools and vendor-specific AI extensions should be helpful. The same goes for monitoring tools.

Connecting these tools via MCP servers to existing MCP clients (IDEs, chat applications, etc.) would probably be useful as well, so support for installing and monitoring MCP servers might be needed. At this point it should also be useful to establish some way to measure the effectiveness of prompts or model tuning, and to track usage of the various tools.
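
For illustration, the role such a uniform facade plays (one call signature, many backends, plus a central hook for usage tracking) can be sketched in a few lines. The backends here are stubs standing in for real provider calls; the names are my own, not LiteLLM's API:

```python
# Sketch of a uniform model facade, the role tools like LiteLLM play:
# one call signature routed to many backends. The backends are stubs;
# real ones would call provider APIs. A single entry point also makes
# usage (and therefore cost) tracking trivial.

usage_log = []  # central record of which models get used

def _vendor_a(prompt): return "A:" + prompt
def _vendor_b(prompt): return "B:" + prompt

BACKENDS = {"model-a": _vendor_a, "model-b": _vendor_b}

def complete(model: str, prompt: str) -> str:
    usage_log.append(model)          # one place to measure usage/cost
    return BACKENDS[model](prompt)

print(complete("model-a", "hello"))
print(complete("model-b", "hello"))
print(usage_log)
```

Swapping vendors, or adding a new model, is then a one-line change to `BACKENDS` rather than a change to every caller – which is exactly what makes phase-1 experimentation cheap.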

In phase 2, building on and maturing the infrastructure from phase 1, we should start focusing on more robust workflows and knowledge building. Depending on use cases, it could be useful to look at agent workflow frameworks (LangChain, et al.) and agent authoring tools (e.g. n8n).

Additionally, knowledge management tools and processes will probably be useful to introduce – easily configured RAG processes (and therefore vector DBs), memory management techniques, maybe graph databases. This all depends, of course, on the techniques used for memory building and maintenance.
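
The core retrieval step of such a RAG process is small enough to sketch. Here a toy bag-of-words overlap stands in for a real embedding model and vector DB; the ranking idea is the same:

```python
# Toy retrieval step of a RAG pipeline: rank knowledge snippets by
# word overlap with the query. A real system would embed both with a
# model and query a vector DB, but the shape of the step is identical.

def score(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

knowledge = [
    "billing service owns invoice generation",
    "auth service issues JWT tokens",
    "search service indexes product catalog",
]
top = retrieve("who owns invoice billing", knowledge)
print(top[0])
```

The organizational work in phase 2 is not this function – it’s keeping the `knowledge` list accurate and shared across teams, which is where most RAG efforts succeed or fail.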

I expect MCP servers, especially ones specialized for the organization’s code and other knowledge systems, will become more central. It should be possible to also create necessary MCP servers that will allow LLMs to access and use internal tools.

In phase 3, I expect most of the technical features to be in place. This will be a phase where the focus shifts to optimizing costs and improving performance. It’s possible that we should be looking at ways to use more efficient models, match models to tasks, and potentially fine-tune models, by whatever method.

Monitoring the operation and costs of agents, and understanding what happens in different flows, will become more critical at this point, especially as usage scales up and AI adoption increases across departments.

Summary

AI stands to transform software engineering far beyond code generation. Realizing that promise demands coordinated learning, infrastructure and a phased roadmap. This framework offers a starting point.

I believe that due to the nature of the technology, it goes beyond simple tool adoption, or alternatively adopting a new project management practice. This has the potential to change both aspects of work.

The structure I’m proposing is to highlight the potential in each “stream” of workflow vertical, and adopt the tools in phases of maturity, as the ecosystem evolves (click to view full size):

AI Roadmap – “Layer Cake”

This visualization is only an illustration, of course. You’ll note it’s arranged as a “layer cake”, where scenarios for using AI sit roughly on top of other use cases/scenarios which should probably precede them.

This is of course not an exhaustive list.

The attempt here is of course to structure the process into something that can be further refined and hopefully result in an actionable plan. At the very least, it should serve as a guideline on where to focus research, learning and implementation efforts, to bring value.

It would be nice to know what other people are thinking when trying to structure such a process; or what the AI thinks about this.

On to explore more.


  1. Dare I say “weathered”? ↩︎
  2. SW engineers being natural early adopters for this technology ↩︎
  3. And we know how some attempts didn’t end well. ↩︎
  4. To be honest, I did not yet dive into the Claude projects, so it’s possible they support this. But I can imagine something similar done with other tools as well. ↩︎
  5. And probably in other industries as well, but I know software best. ↩︎
  6. I realize this is kind of hand-wavy, but bear with me. Also, you probably know what I’m talking about ↩︎
  7. Ever increasing? ↩︎
  8. In a sense, leveraging test time compute at the agentic system level ↩︎
  9. Although in some cases, friction is desirable – think of compliance, cost management, etc. ↩︎
  10. I guess accurate context is also important for humans, who would’ve guessed. ↩︎
  11. And let’s face it, at the rate things are going right now, by the time I finish writing this, there will be new tools ↩︎

From Code Monkeys to Thought Partners: LLMs and the End of Software Engineering Busywork

When it comes to AI and programming, vibe coding is all the rage these days. I’ve tried it, to an extent, and commented about it at length. While a lot of people seem to believe it’s a game changer for SW development, among experienced SW engineers there’s a growing realization that it is not a panacea. In some cases I’ve even seen resentment or scorn at the idea that vibe coding is anything more than passing hype.

I personally don’t think it’s just hype. It might be more in the zeitgeist at the moment, but it won’t go away – simply because it’s not a new trend. Vibe coding, in my opinion, is nothing more than an evolution of low/no-code platforms. We’ve seen this type of tool since MS-Access and Visual Basic back in the 90s. It definitely has its niche, a viable one, but it’s not something that will eradicate the SW development profession.

I do think that AI will most definitely change how developers work and what programming looks like. But this still will not make programmers obsolete.

This is because the actual challenges are elsewhere.

The Real Bottlenecks in Software Engineering

In fact, I think we’re only scratching the surface here. Partially because the technology and tooling are still evolving, but also because most people1 looking at improving software engineering seem to be looking at the wrong problem.

Anyone who’s been at this business professionally has realized at some point that code production is not the real bottleneck when it comes to being a productive software engineer. It never was the productivity bottleneck.

The real challenges in real-world software development, especially at scale, are different. They revolve mainly around producing coherent software by many people who need to interact with one another:

  • Conquering complexity: understanding the business and translating it into working code. Understanding large code bases.
  • Communication overhead: the amount of coordination that needs to happen between different teams over design choices2. We often end up with knowledge silos.
  • Maintaining consistency: using the same tools, practices and patterns so operation and evolution will be easier. This is especially true at a large scale of organization, and over time.
  • Analyzing impact: it’s hard to analyze the impact of changes, and tracing back decisions isn’t easy.

A lot of the energy and money invested in day-to-day professional software development goes into managing this complexity and delivering software at a consistent (increasing?) pace, with acceptable quality. It’s no surprise there’s a whole ecosystem of methodologies, techniques and tools dedicated to alleviating some of these issues. Some are successful, some not so much.

Code generation isn’t really the hard part. That’s probably the easiest part of the story. Having a tool that does it slightly faster3 is great, and it’s helpful, but this doesn’t solve the hard challenges.
We should realize that code generation, however elaborate, is not the entire story. It’s also about understanding the user’s request, constraints and existing code.

The point here isn’t to dismiss the fantastic innovations made in the technology. My point is rather that it’s applied to the least interesting problem. As great as the technology and tooling are, and they are great, simply generating code doesn’t solve a big challenge.

This leads me to a thought: is this it?
Is all the promise of AI, when it comes to my line of work, just typing the characters I tell it to type, faster?
Don’t get me wrong, it’s nice to have someone else do the typing4, but this seems somewhat underwhelming. It certainly isn’t a game changer.

Intuitively, this doesn’t seem right. To see why, we need to take a step back and consider LLMs again.

LLM Strengths Beyond Code Generation

Large Language Models, as the name implies, are pretty good at understanding, well – language. They’re really good at parsing and producing text, at “understanding” it. I’m avoiding the philosophical debate on the nature of understanding5, but I think it’s pretty clear at this point that when it comes to natural language understanding, LLMs provide a very clear advantage.

And this is where it gets interesting. Because when we look at the real world challenges listed above, most of them boil down to communication and understanding of language and semantics.

LLMs are good at:

  • Natural language understanding – identifying concepts in written text.
  • Information synthesis – connecting disparate sources.
  • Pattern recognition – spotting recurring structures across artifacts.
  • Summarization – condensing long material without losing the essentials.
  • Structured data generation – producing well-formed output from free-form prose.

And when you consider mechanizing these capabilities, like LLMs do, you should be able to see the doors this opens.

These capabilities map pretty well to the problems we have in large scale software engineering. Take, for example, pattern recognition. This should help with mastering complexity, especially when complexity is expressed in human language6.

Another example might be in addressing communication overhead. It can be greatly reduced when the communication artifacts are generated by agents armed with LLMs. Think about drafting decisions, specifications, summarizing notes and combining them into concrete design artifacts and project plans.
It’s also easier to maintain consistency in design and code, when you have a tireless machine that does the planning and produces the code based on examples and design artifacts it sees in the system.

It should also be easier to understand the impact of changes when you have a machine that traces and connects the decisions to concrete artifacts and components. A machine that checks changes in code isn’t new (you probably know it as “a compiler” or “static code analyzer”). But one that understands high level design documents and connects them eventually to the running code, with no extra metadata, is a novelty. Think about an agent that understands your logs and your ADRs, and uses them to find bottlenecks or brainstorm potential improvements.

And this is where it starts to get interesting.

It’s interesting because this is where mechanizing processes starts to pay off – when we address the scale of the process and volume of work. And we do it with little to no loss of quality.

If we can get LLMs to do a lot of the heavy lifting when it comes to identifying correlations, understanding concepts and communicating about them, with other humans and other LLMs, then scaling it is a matter of cost7. And if we manage this, we should be on the road to, I believe, an order of magnitude improvement.

So where does that leave us?

Augmenting SW Engineering Teams with LLMs

You have your existing artifacts – your meeting notes, design specifications, code base, language and framework documentation, past design decisions, API descriptors, data schemas, etc.
These are mostly written in English or some other known format.

Imagine a set of LLM-based software agents that connect to these artifacts, understand the concepts and patterns, make the connections and start operating on them. This has an immediate potential to save human time by generating artifacts (not just code), but also make a lot of the communication more consistent. It also has the potential to highlight inconsistencies that would otherwise go unnoticed.

Consider, for example, an ADR assistant that takes in a set of meeting notes, some product requirements document(s) and past decisions, automatically identifies the new decisions taken, and generates succinct, focused ADRs based on them.
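To make this concrete, here is a minimal sketch of the context-assembly step such an assistant would perform before calling an LLM. All names are illustrative, and the agent loop and the LLM call itself are deliberately omitted:

```python
from dataclasses import dataclass

@dataclass
class Adr:
    """A past architecture decision record (illustrative shape)."""
    title: str
    context: str
    decision: str

def build_adr_prompt(meeting_notes: str, requirements: str, past_adrs: list[Adr]) -> str:
    """Assemble the context an LLM would need to draft new ADRs.

    The interesting work is gathering the right artifacts; the model then
    identifies decisions in the notes not already covered by past ADRs.
    """
    past = "\n".join(f"- {a.title}: {a.decision}" for a in past_adrs)
    return (
        "You are an ADR assistant. Identify decisions in the notes that are "
        "not already covered by past ADRs, and draft one ADR per new decision.\n\n"
        f"## Past decisions\n{past}\n\n"
        f"## Product requirements\n{requirements}\n\n"
        f"## Meeting notes\n{meeting_notes}\n"
    )

prompt = build_adr_prompt(
    "We agreed to move session storage to Redis.",
    "Sessions must survive instance restarts.",
    [Adr("ADR-007", "DB choice", "Use Postgres for transactional data")],
)
```

In practice the artifacts would come from a wiki, a transcription service and a decision log rather than literals, but the shape of the task is the same: collect, structure, then ask.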

Another example would be an agent that can act as a sounding board for design thinking – you throw your ideas at it, allow it to access existing project and system context as well as industry standards and documentation. You then chat with it about where best practices are best applied, and where the risks lie in given design alternatives. Design review suddenly becomes more streamlined when you can simply ask the LLM to bring up issues in the proposed design.

Imagine an agent that systematically builds a knowledge graph of your system as it grows. It does it in the background by scanning code committed and connecting it with higher level documentation and requirements (probably after another agent generated them). Understanding the impact of changes becomes easier when you can access such a semantic knowledge graph of your project. Connect it to a git tool and it can also understand code/documentation changes at a very granular level.
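A toy sketch of the kind of graph such an agent might maintain, assuming nothing beyond the standard library – the artifact names are invented, and a real agent would populate the edges from commits and documents rather than by hand:

```python
from collections import defaultdict, deque

# Edges link artifacts at different levels:
# requirement -> design doc -> component -> source file.
graph: defaultdict[str, set[str]] = defaultdict(set)

def link(src: str, dst: str) -> None:
    graph[src].add(dst)
    graph[dst].add(src)  # impact can be traced in both directions

link("REQ-12: fast checkout", "design/checkout.md")
link("design/checkout.md", "service:payments")
link("service:payments", "src/payments/api.py")
link("service:payments", "src/payments/retry.py")

def impacted(artifact: str) -> set[str]:
    """Everything reachable from a changed artifact, via breadth-first search."""
    seen, queue = {artifact}, deque([artifact])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {artifact}
```

Changing `src/payments/retry.py` now traces back to the requirement it serves. The LLM’s job in this picture is precisely the part the sketch hand-waves: deciding which edges exist, by reading the code and the documents.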

All these examples don’t eliminate the human in the loop – keeping a human in the loop is actually a common pattern in agentic systems. I don’t think the human(s) can or should be eliminated from the loop. It’s about empowering human engineers to apply intuition and higher level reasoning. Let the machine do the heavy lifting of producing text and scanning it. And in this case we have a machine that can not only scan the text, but understand higher level concepts, to a degree, in it. Humans immediately benefit from this, simply because humans and machines now communicate in the same natural language, at scale.

We can also take it a step further: we don’t necessarily need a complicated or very structured API to allow these agents to communicate amongst themselves. Since LLMs understand text, a simple markdown with some simple structure (headers, blocks) is a pretty good starting point for an LLM to infer concepts. Combine this with diagram-as-code artifacts and you have another win – LLMs understand these structures as well. All with the same artifacts understandable by humans. There’s no need for extra conversions8.
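As a toy illustration of how little structure is needed, here is a sketch (illustrative, not taken from any real agent) that splits a markdown design note into sections an agent could hand to an LLM or to another agent:

```python
def sections(markdown: str) -> dict[str, str]:
    """Split a markdown document into {header: body} pairs.

    Headers (lines starting with '#') become keys; everything until the
    next header becomes the body. Enough structure for an agent to route
    the 'Decision' section to an ADR writer, for example.
    """
    result: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip()
            result[current] = ""
        elif current is not None:
            result[current] += line + "\n"
    return result

doc = """# Context
Payments latency is too high.
# Decision
Cache exchange rates for 5 minutes.
"""
parsed = sections(doc)
```

The point isn’t the parser – it’s that the same headers that guide a human reader are enough of an interface for a machine, so no separate exchange format is needed.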

So now we can have LLMs communicating with other LLMs, to produce more general automated workflows. Analyzing requirements, in the context of the existing system and past decisions, becomes easier. Identifying inconsistencies or missing/conflicting requirements can be done by connecting a “requirement analyzer” agent to the available knowledge graph produced and updated by another agent. What-if scenarios are easier to explore in design.

Such agents can also help with producing more viable plans for implementation, especially taking into consideration existing code bases. Leaning on (automatically updated) documentation can probably help with LLM context management – making it more accurate at a lower token cost.

Mechanizing Semantics

We should be careful here not to fall into the trap of assuming this is a simple automation, a sort of more sophisticated robotic process automation, though that has its value as well.

I think it goes beyond that.
A lot of the work we do on a day to day basis is about bringing context and applying it to the problem or task at hand.

When I get a feature design to be reviewed, I read it, and start asking questions. I try to apply systems thinking and first-principles thinking. I bring in the context of the system and business I’m already aware of. I try to look at the problem from different angles, and ask a series of “what-if” questions on the design proposed. Sometimes it’s surfacing implicit, potentially harmful, assumptions. Sometimes it’s just connecting the dots with another team’s work. Sometimes it’s bringing up the time my system was hacked by a security consultant 15 years ago (true story). There’s a lot of experience that goes into that. But essentially it’s applying the same questions and thought processes to the concepts presented on paper and/or in code.

With LLMs’ ability to derive concepts, identify patterns in them and with vast embedded knowledge, I believe we can encode a lot of that experience into them. Whether it’s by fine tuning, clever prompting or context building. A lot of these thinking steps can be mechanized. It seems we have a machine that can derive semantics from natural language. We have the potential to leverage this mechanization into the day to day of software production. It’s more than simple pattern identification. It’s about bridging the gap between human expression and formal methods (be it diagrams or code). The gap seems to be becoming smaller by the day.

Let’s not forget that software development is usually a team effort. And when we have little automatic helpers that understand our language, and make connections to existing systems, patterns and vocabulary, they’re also helping us to communicate amongst ourselves. In a world where remote work is prevalent, development teams are often geographically distributed and communicating in a language that is not native to anyone in the development team – having something that summarizes your thoughts, verifies meeting notes against existing patterns and ultimately checks if your components behave nicely with the plans of other teams, all in perfect English, is a definite win.

This probably won’t be an easy thing to do, and will have a lot of nuances (e.g. legacy vs. newer code, different styles of architecture, evolving non-functional requirements). But for the first time I feel this is a realistic goal, even if it’s not immediately achievable.

Are We Done?

This of course raises the question – where is the line? If we can encode our experience as developers and architects into the machine, are we really on the path to obsolescence?

My feeling is that no, we are not. At the end of the process, after all alternatives are weighed, assumptions are surfaced and trade-offs are considered, a decision needs to be taken.

At the level of code writing, this decision – what code to produce – can probably be taken by an LLM. This is a case where constraints are clearer and with correct context and understanding there’s a good chance of getting it right. The expected output is more easily verifiable.

But this isn’t true for more “strategic” design choices. Things that go beyond code organization or localized algorithm performance. Choices that involve human elements like skill sets and relationships, or contractual and business pressure. Ultimately, the decision involves a degree of intuition. I can’t say whether intuition can be built into LLMs; intuitively, I believe it can’t (pun intended). I highly doubt we can emulate that using LLMs, at least not in the foreseeable future.

So when all analysis is done, the decision maker is still a human (or a group of humans). A human that needs to consider the analysis, apply their experience, and decide on a course forward. If the LLM-based assistant is good enough, it can present a good summary and even recommendations, all done automatically. This analysis still needs to be understood and used by humans to reach a conclusion.

Are we there yet? No.
Are we close? Closer than ever probably, but still a way to go.

Can we think of a way to get there? Probably yes.

A Possible Roadmap

How can we realize this?

The answer seems to be, as always, to start simple, integrate and iterate; ad infinitum. In this case, however, the technology is still relatively young, and there’s a lot going on. Anything from the foundation models, relevant databases and coding tools, to prompt engineering, MCPs and beyond. These are all being actively researched and developed. So trying to predict how this will evolve is even harder.

Still, if I had to guess how this will evolve in practice, here is how I think it will go – at least one possible path.

Foundational System Understanding

First, we’ll probably start with simple knowledge building. I expect we’ll first see AI agents that can read code, and produce and consume design knowledge – how current systems operate. This is already happening, and I expect it will improve. It’s happening here first mainly because the task is well understood and the tools exist: we can verify results and fine-tune the techniques.
Examples could be AI agents that produce detailed sequence diagrams of existing code and then identify components. Other AI agents can consume design documents/notes and meeting transcriptions, together with the already produced descriptions, to maintain an accurate record of the changed/enhanced design. Having these agents work continuously and consistently across a large system already provides value.

Connecting Static and Dynamic Knowledge

Given that AI agents have an understanding of the system structure, I can see other AI agents working on dynamic knowledge – analyzing logs, traces and other dynamic data to provide insights into how the system and users actually behave and how the system evolves (through source control). This is more than log and metric analysis. It’s overlaying the information available over a larger knowledge graph of the system, connecting business behavior to the implementation of the system, including its evolution (i.e. git commits and Jira tickets).
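The mapping step from raw dynamic data to system-level concepts can be sketched very simply. The log format, source names and component identifiers below are all invented for illustration; a real deployment would feed traces and metrics into the same knowledge graph:

```python
import re

# Map log "source" fields to logical components in the system knowledge graph.
component_of = {
    "payments-api": "service:payments",
    "checkout-web": "service:checkout",
}

def overlay(log_lines: list[str]) -> dict[str, int]:
    """Count errors per logical component rather than per raw log source.

    This is the 'overlay' step: lifting low-level observations onto the
    concepts (services, flows) the rest of the graph is expressed in.
    """
    counts: dict[str, int] = {}
    for line in log_lines:
        m = re.match(r"(\S+) ERROR", line)
        if m and m.group(1) in component_of:
            comp = component_of[m.group(1)]
            counts[comp] = counts.get(comp, 0) + 1
    return counts

errors = overlay([
    "payments-api ERROR timeout calling bank gateway",
    "payments-api ERROR timeout calling bank gateway",
    "checkout-web INFO page rendered",
])
```

The hard part, again, is not the counting – it’s building and maintaining the `component_of` mapping, which is exactly where an LLM reading deployment manifests and documentation could help.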


Can we now examine and deduce information about better UX design?
Can we provide insights into the decomposition of the system? 

Enhanced Contextual Assistant and Design Support

At this point we should have everything we need to actually provide more proactive design support. I can see AI agents we can chat with that help us reason about our designs. Agents to which we can suggest a design alternative and ask them to assess it and find hidden complexities, with the context of the existing system. Combined with daily deployments and source control, we can probably expect some time estimates and detailed planning as well.

This is where I see the “design sounding board” agent coming into play. As well as agents preemptively telling me where expected designs might falter.

More importantly, it’s where AI agents start to make the connections to other teams’ work. Telling me where my designs or expected flow will collide with another team’s plans.
Imagine an AI agent that monitors design decisions, of all teams and domains, identifies the flows they refer to, and highlights potential mismatches between teams or suggests extra integration testing, if necessary, all before sprint planning starts. Impact analysis becomes much easier at this point, not because we can query the available data (though we could, and that’s nice as well), but because we have an AI agent looking at the available data, considering the change, and identifying on its own what the impact is.


There’s still a long way to go until this is realized. Implementing this vision requires taking into account data access issues, LLM and technology evolution, integration and costs. All the makings of a useful software project.
I also expect quite a bit can change, and new techniques/technologies might make this more achievable or completely unnecessary.

And who knows, I could also be completely hallucinating. I heard it’s fashionable these days.

Conclusion: The Real Promise of LLMs in Software Engineering

I’ve argued here that while vibe coding and code generation get most of the attention, they aren’t addressing the real bottlenecks in software development. The true potential of Large Language Models lies in their ability to understand and process natural language, connect disparate information sources, and mechanize semantic understanding at scale.

LLMs can transform software engineering by tackling the actual challenges we face daily: conquering complexity, reducing communication overhead, maintaining consistency, and analyzing the impact of changes. By creating AI agents that can understand requirements, generate documentation, connect design decisions to implementation, and serve as design thinking partners, we can achieve meaningful productivity improvements beyond simply typing code faster, as nifty as that is.

What makes this vision useful and practical is that it doesn’t eliminate humans from the loop. Rather, it augments our capabilities by handling the heavy lifting of information processing and connection-making, while leaving the intuitive, strategic decisions to experienced engineers. This partnership between human intuition and machine-powered semantic understanding represents a genuine step forward in how we build software.

Are we there yet? Not quite. But we’re closer than ever before, and the path forward is becoming clearer. 

Have you experienced any of these AI-powered workflows in your own development process? Do you see other applications for LLMs that could address the real bottlenecks in software engineering?


  1. At least most who publicly talk about it ↩︎
  2. ‘Just set up an api’ is easier said than done – agreeing on the API is the hard work ↩︎
  3. And this is a bit debatable when you consider non-functional requirements ↩︎
  4. I am getting older ↩︎
  5. Also because I don’t feel qualified to argue on it ↩︎
  6. Data mining has been around forever, but mostly works on structured data ↩︎
  7. Admittedly, not a negligible consideration ↩︎
  8. Though from a pure mechanistic point of view, this might not be the most efficient way ↩︎

Discussing Your Design with Scenaria

Motivation
As a software architect, I spend quite a bit of my time in design discussions, and for good reason. As I see it, the design conversation is a fundamental part of this job and its role in the organization.

Design discussions are hard, for various reasons. Sometimes the subject matter is complicated. Sometimes there’s a lot of uncertainty. Sometimes tradeoffs are hard to negotiate. These are all just examples, and it is all part of the job. More often than not, it’s the interesting part.

But another reason these discussions tend to be hard is because of misunderstandings, vagueness and lack of precision in how we express ourselves. Expressing your thoughts in a way that translates well into other people’s minds is not easy. This gets worse as the number of people involved increases, especially when using a language where most, if not all, people do not speak natively.

From what I observed, this is true both for face to face meetings (often conducted remotely these days), as well as in written communication. I try to be as precise as I can, but jumping from one discussion to another, under time pressure, I also often commit the sin of “winging it” when making an argument in some Slack thread or some design document comment.

I’ve argued in the past that diagrams serve a much better job of explaining designs. I think this is true, and I often try to make extensive use of diagrams. But good diagrams also take time to create. Tools that use the “diagram as code” approach, e.g. PlantUML (but there are a bunch of others, see kroki), are in my experience a good way to create and share ideas. If you know the syntax, you can be fairly fast in “drawing” your design idea.

Still, I haven’t found a tool that will allow me to conveniently express what I need to express in a design discussion. Simply creating a diagram is not the whole story. I often want to share an idea of the structure of the system – the cooperating components – but also of its behavior. It’s important not just to show the structure of the system and the interfaces between components, but also to highlight specific flows in different scenarios.

There are of course diagram types for that as well, e.g. sequence or activity diagrams. And there are a plethora of tools for creating those as well. But the “designer experience” is lacking. It’s hard to move from one type of view to another, maintaining consistency. This is why whiteboard discussions are easier in that sense – we sit together, draw something on the board, and then point at it, waving our hands over the picture that everyone is looking at. Even if something is not precise in itself, we can compensate by pointing at specific points, emphasizing one point or another.

Emulating this interaction is not easy in this day and age of remote work. When a lot of the discussions are done remotely, and often asynchronously (for good reasons), there’s a greater need to be precise. And this is not easy to do at the “speed of thought”.

Building software tools is sort of a hobby for me, so I set out to try and address this.

Goals

What I’m missing is a tool that will allow me to:

  1. Quickly express my thoughts on the structure and behavior of a (sub)system – the involved components and interactions.
  2. Share this picture and relevant behavior easily with other people, allowing them to reason about it. Allowing us to conveniently discuss the ideas presented, and easily make corrections or suggest alternatives.

So essentially I’m looking to create a tool that allows me to describe a system easily (structure + behavior). A tool that efficiently creates the relevant diagrams and allows me to visualize the behavior on them.

Constraints and Boundary Conditions

Setting out to implement this kind of tool, as a proof of concept, I outlined for myself several constraints or boundary conditions I would like to maintain, both from a “product” point of view as well as from an engineering implementation point of view.

  1. The description should be text based, so we can easily share system description as well as version them using existing versioning tools, namely git.
  2. The tool should be easy to ramp up on.
    1. Just load and start writing
    2. Easy syntax, hopefully intuitive.
  3. Designs should be easily shareable – a simple link that can be sent, and embedded in other places.
  4. There should not be any special requirements for software to use the tool.
    1. A simple modern browser should be enough.

Scenaria

Enter Scenaria (git repo). 

Scenaria is a language – a simple DSL, with an accompanying web tool. The tool includes a simple online editor, and a visualization area. You enter the description of the system in the editor, hit “Apply”, and the system is displayed in the visualization pane.

Scenaria Screenshot

The diagram itself is heavily inspired by technical architecture modeling. The textual DSL is inspired by PlantUML. You can play with the tool here, and see a more detailed explanation of the model and syntax here.

Discussion doesn’t stop with a purely static diagram. The tool also allows you to describe and visualize interactions between the different components. You can describe several flows, which you can then “play” on the drawn diagram. You can step through a scenario or simply play it from start to finish.

After this is done, you have a shareable link, as part of the application, which you can send to colleagues (or keep).

As a diagramming tool, it’s pretty lacking. But remember that the purpose here is not to necessarily create beautiful diagrams (though that’s always a plus). It’s mainly about enabling a conversation, efficiently. So there’s a balance here between being expressive in the language, while not going down the route of adding a ton of visualization features which will distract from the main purpose of describing a system or a feature.

Scenaria is intended more as a communication tool, to be used easily in the discussions we have with our colleagues. It can serve as a basis for further analysis, as it provides a way to structure the description of a system – its structure and behavior. But the focus isn’t on a rigorous formal description that can derive working code. It’s not intended for code generation. It’s about having something to point at when discussing design – something you can easily create and share, based on some system model.

An Example

An example scenario can be viewed here. This example shows the main components of the Scenaria app, with a simple flow showing the interaction between them when the code is parsed and shown on screen.

Looking at the code of the description, we start by enumerating the different actors cooperating in the process:

user 'Designer' as u;
agent 'App Page' as p;
agent 'Main App' as app;
agent 'Editor' as e;
agent 'Parser' as prsr;
agent 'Diagram Drawing' as dd;
agent 'ELK Lib' as elk;
agent 'Diagram Painting' as dp;
agent 'Diagram Controller' as dc;

Each component is described as an agent here, with the user (a “Designer”) as a separate actor.

We then define an annotation highlighting external libraries:

@External {
  color : 'lightgreen';
};

And annotate two agents to mark them as external libraries:

elk is @External;
e is @External;

Note that up to this point we haven’t defined any interactions or channels between the components.
Now we can turn to describe a flow – specifically what happens when the user writes some Scenaria code and hits the “Apply” button:

'Model Drawing' {
    u -('enter code')-> e
    u -('apply')->p
    p -('reset')-> app

    p -('get code')-> e
    p --('code')--< e

    p-('parseAndPresent')-> app
        app -('parse')-> prsr
        app --('model')--< prsr
        app -('layoutModel') -> dd
            dd -('layout') -> elk
            dd --('graph obj')--< elk
        app --('graph obj')--< dd

        app -('draw graph')-> dd
            dd -('draw actors, channels, edges')->dp
        app --('painter (dp)')--< dd

        app -('get svg elements')->dp
        app --('svg elements')--<dp
        
        app -('create and set svg elements')->dc


    p --('model')--< app

};

We give the scenario a name – “Model Drawing” – and describe the different calls between the cooperating actors. Indentation is not required; it’s added here only for readability.

The interactions between the agents implicitly define channels between the components. So when the diagram is drawn, it is drawn with the relevant channels:

At this point the application allows you to run or step through the given scenario where you will see the different messages and return values, as described in the text.


Next Steps

This is far from a complete tool, and I hope to continue working on it, as I try to embed it into my daily work and see what works and what doesn’t.

At this point, it’s basically a proof of concept, a sort of an early prototype.

Some directions and features I have in mind that I believe can help in promoting the goals I outlined above:

  1. Better diagramming: better layout, supporting component hierarchies.
  2. Diagram features: comments on the diagram (as part of steps?), titles, notes
  3. Scenario playback – allow for branches, parallel step execution, self calls.
  4. Versioning of diagrams – show an evolution of a system, milestones for development, etc.
  5. Integration with other tools:
    1. Wikis/markdown (a “design notebook”?)
    2. Slack and other discussion tools
    3. Tools and links to other modeling tools, showing different views of the same model.
  6. A view only mode – allow sharing only the diagram and allow playback of scenarios.
    1. Allow embedding of the SVG only into other tools, e.g. a widget in google docs.
  7. Better application UX (admittedly, I’m not much of a user interface designer).
  8. Team collaboration features beyond version control.

Contributions, feedback and discussions are of course always welcome.

The Understated Architect Role


So what’s in a software architect?

This is actually a question I’ve seen addressed a few times, and one I’ve discussed during my career quite a bit. It’s a question I need to address if and when I search for a new job as a software development architect. And it’s also a question that seems to have quite a few possible answers.

But eventually, there does seem to be some common ground when trying to discuss the work of a software development architect. One could discuss the soft skills, the technical skills or the leadership skills required from an architect. It’s all true, at least to some degree. The exact combination of skills required to be a good software development architect varies between different organizations, operating contexts (the organization structure, company culture and size, history with the team, etc.) and your definition of “good”; so I won’t waste your time trying to define an exact grocery list of skills.

I do believe, however, that one important role of the architect is often overlooked or not emphasized enough. And it’s probably not something you’ll learn in any course or training. It is something that I often felt during my work, but realized fully only when another (more experienced) architect put it into words that resonated strongly with me.

Probably the most important job of an architect is to create and maintain a consistent understanding of the software system being developed. Or more precisely: creating a coherent and consistent picture of the developed system across all parties involved in the development project. To create an understanding between the development team and its stakeholders. To create an understanding within the team about the technical vision and direction for development. To create an understanding across teams on what is being developed and how systems interact.

This is probably the single most important job an architect has that is also unique to this role. Technical expertise is important, as are sound design decisions and the ability to weigh trade-offs properly. But the thing that truly separates the architect’s role from that of an expert software developer is the ability to convey a system structure and technical design, its capabilities, constraints and the decisions taken when building it.

It’s really more than simply knowing the right words and using the right terminology. It’s about framing thoughts in a way that is consistent and understandable by the target audience. It’s about striking the balance between formalism and human comprehension. It’s about choosing the right methods to convey an idea. It’s about being precise and succinct, yet clear and understandable. It’s about being able to translate between different terminologies or “domains of thoughts”.

And herein lies the real challenge in being a software development architect, in my opinion. Technical mastery is important. But being able to convey an idea efficiently, to plant the right idea into the minds of people you’re communicating with is one thing you can’t learn on stackoverflow.com (which is a wonderful site, by the way).

And it’s a subtle act, often very delicate and hard to balance. Simply because people come from different schools of thought and experiences (and often with agendas). But choosing the right terminology, explaining it properly, making assumptions explicit or drawing the right diagram goes a long way towards aligning all involved parties on a single vision and a coherent mental picture of a system. Creating this mental picture in everyone’s mind is no easy task.

I had a boss who once told me that whenever there’s a disagreement on a technical direction to take, you should prefer to be the one drawing the diagrams. This simple fact already gives you an advantage. If you’re the one holding the marker pen, you already have an edge.

This is true from an organizational politics point of view, but also an important point to keep in mind when you want to reach an agreement and create this common understanding and consistency – if you wield the pen, you wield the power as well as the responsibility. And when you’re an architect, your job is to wield the pen.

When people talk about being an architect, they usually talk about what to do with the pen – what kind of diagrams to draw, what documents to write, how much code, etc. But there’s an overarching, implicit goal that is assigned to the one holding the pen – to make sure everyone is aligned and has a consistent mental picture in their heads when approaching their individual jobs. This is especially hard in the software business, where there’s no shortage of abstractions and intangible concepts to keep in mind. Where the different levels of abstraction and complexity a developer has to keep in their head are often mind boggling. The ability to synthesize the important ideas and communicate them at the right place, at the right time, in a manner that will be adopted and accepted is quite a feat. Take someone who is good at doing that, add decent technical and analytical abilities into the mix, and you’ve got yourself a good software architect right there.

 

Effective System Modeling

This post is a rough “transcript” (with some changes and creative freedom) of a session I gave in the Citi Innovation Lab, TLV about how to effectively model a system.

A Communication Breakdown?

Building complex software systems is not an easy task, for a lot of reasons. All kinds of solutions have been invented to tackle the different issues. We have higher level programming languages, DB tools, agile project management methodologies and quite a bit more. One could argue that these problems still exist, and no complete solution has been found so far. That may be true, but in this post, I’d like to discuss a different problem in this context: communicating our designs.

One problem that seems to be overlooked, or not addressed well enough, is the issue of communicating our designs and system architecture. Individually, experienced engineers are (usually) quite capable of coming up with elegant solutions to complex problems. But the realities and dynamics of a software development organization, especially a geographically distributed one, often require us to communicate and reason about systems developed by others.

We – software engineers – tend to focus on solving the technical issues or designing the systems we’re building. This often leads to forgetting that software development, especially in the enterprise, is often, if not always, a team effort. Communicating our designs is therefore critical to our success, but is often viewed as a negligible activity at best, if not a complete waste of time.

The agile development movement, in all its variants, has done some good to bring the issues of cooperation and communication into the limelight. Still, I often find that communication of technical details – structure and behavior of systems, is poorly done.

Why is that?

“Doing” Architecture

A common interpretation of agile development methods I often encounter tends to throw the baby out with the bathwater. I hear about people/teams refusing to do "big up-front design". That in itself is actually a good thing in my opinion. The problem starts when this translates into no design at all, which in turn translates into not wanting to spend time on documenting the architecture properly, or on how it's communicated.

But as anyone who's been in this industry for more than a day knows – there's no replacement for thinking about your design and your system, and agile doesn't mean we shouldn't design our systems. So I claim that the problem isn't really with designing per se, but rather with the motivation and methodology we use for "doing" our architecture – how we go about designing the system and conveying our thoughts. Most of us acknowledge the importance of thinking about a system, but we do not invest the time in preserving that knowledge and discussion. Communicating a design or system architecture, especially in written form, is often viewed as superfluous, given the working code and its accompanying tests. From my experience this is often because the actual communication and documentation of a design are done ineffectively.

This view was reinforced after hearing Simon Brown talk about a similar subject, one which resonated with me. An architecture document/artifact should contain "just enough" up-front design to understand the system and create a shared vision. An architecture document should augment the code, not repeat it; it should describe what the code doesn't already describe. In other words – don't document the code, but rather look for the added value. A good architecture/design document adds value to the project team by articulating the vision on which all team members need to align. Of course, this is less apparent in small teams than in large ones, especially teams that need to cooperate on a larger project.

As a side note I would like to suggest that besides creating a shared understanding and vision, an architecture document also helps in preserving the knowledge and ramping-up people onto the team. I believe that anyone who has tried learning a new system just by looking at its code will empathize with this.

Since I believe the motivation to actually design the system and solve the problem is definitely there, I’m left with the feeling that people often view the task of documenting it and communicating it as unnecessary “bureaucracy”.
We therefore need a way to communicate and document our system’s architecture effectively. A way that will allow us to transfer knowledge, over time and space (geographies), but still do it efficiently – both for the writer and readers.
It needs to be a way that captures the essence of the system, without drowning the reader in details or burdening the writer with work that will prove to be a waste of time. Looking at it from a system analysis point of view, reading the document is quite possibly the more prominent use case, compared to writing it; i.e. the document is going to be read a lot more than written/modified.

When we come to the question of modeling a system, with the purpose of the end result being readable by humans, we need to balance the amount of formalism we apply to the model. A rigorous modeling technique will probably result in a more accurate model, but not necessarily an easily understandable one. Rigorous documents tend to be complete and accurate, but exhausting to read and follow, thereby defeating the purpose we're trying to achieve. At the other end of the scale are free-text documents, often in English and sometimes with some scribbled diagrams, which explain the structure or behavior of the system, often inconsistently. These are hard to follow for different reasons: inaccurate language, inconsistent terminology and/or an ad-hoc (= unfamiliar) modeling technique.

Providing an easy to follow system description, and doing so efficiently, requires us to balance these two ends. We need to have a “just enough” formalism that provides a common language. It needs to be intuitive to write and read, with enough freedom to provide any details needed to get a complete picture, but without burdening the writers and readers with unnecessary details.
In this post, I try to give an overview of, and pointers to, a method I found useful in the past (not my invention), and that I believe answers the criteria mentioned above. It is definitely not the only way and may not suit everyone's taste (e.g. Simon Brown suggests something similar but slightly different); but regardless of the method used, creating a shared vision and putting it in writing is useful, when done effectively.

System != Software

Before going into the technicalities of describing a system effectively, I believe we need to make the distinction between a system and its software.

For the purposes of our discussion, we’ll define software as a computer-understandable description of a dynamic system; i.e. one way to code the structure and behavior of a system in a way that’s understandable by computers.
A (dynamic) system on the other hand is what emerges from the execution of software.

To understand the distinction, an analogy might help: consider the task of understanding the issue of global warming (the system) vs. understanding the structure of a book about global warming (the software).

  • Understanding the book structure does not imply understanding global warming. Similarly, understanding the software structure doesn’t imply understanding the system.
  • The book can be written in different languages, but it’s still describing global warming. Similarly, software can be implemented using different languages and tools/technologies, but it doesn’t (shouldn’t) change the emergent behavior of the system.
  • Reading the content of the book implies understanding global warming. Similarly, the system is what emerges from execution of the software.

One point we need to keep in mind, and where this analogy breaks, is that understanding a book’s structure is considerably easier than understanding the software written for a given system.
So usually, when confronted with the need to document our system, we tend to focus on documenting the software, not the system. This leads to ineffective documentation/modeling (we’re documenting the wrong thing), eventually leading to frustration and missing knowledge.
This is further compounded by the fact that existing tools and frameworks for documenting software (UML being the prime example) tend to be complex and detailed, with tools that emphasize code generation rather than human communication.

Modeling a System

When we model an existing system, or design a new one, we find several methods and tools that help us. A lot of these methods define all sorts of views of the system – describing different facets of its implementation. Most practitioners have surely met one or more different “types” of system views: logical, conceptual, deployment, implementation, high level, behavior, etc. These all provide some kind of information as to how the system is built, but there’s not a lot of clarity on the differences or roles of each such view. These are essentially different abstractions or facets of the given system being modeled. While any such abstraction can be justified in itself, it is the combination of these that produces an often unreadable end result.

So, as with any other type of technical document you write, the first rule of thumb is:

Rule of thumb #1: Tailor the content to the reader(s), and be explicit about it.

In other words – set expectations. Set the expectation early on – what you’re describing and what is the expected knowledge (and usually technical competency) of the reader.

Generally, in my experience, 3 main facets are the most important ones: the structure of the system – how it's built; the behavior of the system – how the different components interact on given inputs/events; and the domain model used in the system. Each of these facets can be described in more or less detail, at different abstraction levels, and using different techniques, depending on the case. But these are usually the most important facets for a reader to understand the system and approach designing, or reading, the code itself.

Technical Architecture Modeling

One method I often find useful is Technical Architecture Modeling (TAM), itself a derivative of Fundamental Modeling Concepts (FMC). It is a formal method, but one which focuses on human comprehension. As such, it borrows from UML and FMC to provide a level of formalism that strikes a good balance between readability and modeling efficiency. TAM uses a few diagram types, of which the most useful are the component/block diagram, used to depict a system's structure or composition; the activity and sequence diagrams, used to model a system's or component's behavior; and the class diagram, used to model a domain (value) model. Other diagram types are also included, e.g. state charts and deployment diagrams, but these are less useful in my experience. In addition, TAM has some tool support in the form of Visio stencils that make it easier to integrate into other documentation methods.

I briefly discuss how the most important facets of a system can be modeled with TAM, but the reader is encouraged to follow the links given above (or ask me) for further information and details.

Block Diagram: System Structure

A system’s structure, or composition, is described using a simple block diagram. At its simplest form, this diagram describes the different components that make up the system.
For example, describing a simple travel agency system, with a reservation and information system can look something like this (example taken from the FMC introduction):

Sample: Travel Agency System

This in itself already tells us some of the story: there’s a travel agency system, accessed by customers and other interested parties, with two subsystems: a reservation system and an information help desk system. The information is read and written to two separate data stores holding the customer data and reservations in one store, and the travel information (e.g. flight and hotel information) in the other. This data is fed into the system by external travel-related organizations (e.g. airlines, hotel chains), and reservations are forwarded to the same external systems.

This description is usually enough to provide at least contextual, high-level information about the system. But the diagram above already tells us a bit more. It provides some information about the access points to the data; about the different kinds of data flowing in the system; and about which component interacts with which other component (who knows who). Note that there is little to no technical information at this point.

The modeling language itself is pretty straightforward and simple as well: we have two main "entities": actors and data stores.
Actors, designated by square rectangles, are any components that do something in the system (including humans). They are the active components of the system. Actors communicate with other actors through channels (lines with small circles on them), and they read/write from/to data stores (simple lines with arrowheads). Examples include services, functions and human operators of the system.
Data stores, designated by rounded rectangles (or circles), are passive components. These are "places" where data is stored. Examples include database systems, files, and even memory arrays (or generally any data structure).

Armed with these definitions, we can already identify some useful patterns, and how to model them:

Read only access – actor A can only read from data store S:
Read only access

 

Write only access – actor A can only write to data store S:
Write only access

 

Read/Write access:
Read/Write access

 

Two actors communicating on a request/response channel have their own unique symbol:
effective-system-modeling-004
In this case, actor 'B' requests something from actor 'A' (the arrow on the 'R' symbol points to 'A'), and 'A' answers back with data. So data actually flows in both directions. A classic example of this is a client browser asking for a web page from a web server.

 

A simple communication over a shared storage:
effective-system-modeling-005
Actors 'A' and 'B' both read from and write to data store 'S', effectively communicating over it.

 

There's a bit more to this formalism, which you can explore on the FMC/TAM website – but not much more than what's shown here. These simple primitives already provide a powerful expression mechanism, enough to convey most of the ideas we need to communicate about our system on a daily basis.
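To make the primitives concrete, here is a minimal sketch of how the travel agency block diagram could be captured as plain data. This encoding is not part of FMC/TAM – it's a hypothetical illustration, and the exact component names are paraphrased from the example diagram.

```python
# Actors (active components) and data stores (passive components)
actors = {"Customer", "Reservation System", "Information Help Desk",
          "Travel Organization"}
stores = {"Customer/Reservation Data", "Travel Information"}

# Request/response channels between actors: (requester, responder)
channels = {
    ("Customer", "Reservation System"),
    ("Customer", "Information Help Desk"),
    ("Travel Organization", "Reservation System"),
}

# Data-store access edges: (actor, store, mode), mode in {"r", "w", "rw"}
access = {
    ("Reservation System", "Customer/Reservation Data", "rw"),
    ("Information Help Desk", "Travel Information", "r"),
    ("Travel Organization", "Travel Information", "w"),
}

# Basic well-formedness check: every edge refers to known nodes.
for a, b in channels:
    assert a in actors and b in actors
for actor, store, mode in access:
    assert actor in actors and store in stores and mode in {"r", "w", "rw"}
print("model is well-formed")
```

Even such a trivial encoding makes the "who knows who" question mechanical: the channels and access sets are exactly the information the block diagram conveys.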

Usually, when providing such a diagram, it’s good practice to accompany it with some text that provides some explanation on the different components and their roles. This shouldn’t be more than 1-2 paragraphs, but actually depends on the level of detail and system size.

This would generally help with two things: identifying redundant components, and describing the responsibility of each component clearly. Think of this text explanation as a way to validate your modeling, as displayed in the diagram.

Rule of thumb #2: If your explanation doesn't include all the actors/stores depicted in the diagram – you probably have redundant components.
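This rule can even be checked mechanically. The following toy sketch (component names and the explanation text are invented for illustration) flags diagram components that the accompanying text never mentions:

```python
# Components depicted in the (hypothetical) block diagram
components = {"Reservation System", "Information Help Desk", "Travel Information"}

# The prose explanation that accompanies the diagram
explanation = (
    "The Reservation System writes bookings, while the "
    "Information Help Desk serves queries over the Travel Information store."
)

# Any component absent from the explanation is a candidate for removal,
# or a sign the explanation is incomplete.
unmentioned = {c for c in components if c not in explanation}
print(unmentioned)  # empty set here; non-empty suggests redundant components
```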

Behavior Modeling

The dynamic behavior of a system is of course no less important than its structure. The cooperation, interaction and data flow between components allow us to identify failure points, bottlenecks, decoupling problems etc. In this case, TAM adopts largely the UML practice of using sequence diagrams or activity diagrams, whose description is beyond the scope of this post.

One thing to keep in mind though, is that when modeling behavior in this case, you’re usually not modeling interaction between classes, but rather between components. So the formalism of “messages” sent between objects need not couple itself to code structure and class/method names. Remember: you generally don’t model the software (code), but rather system components. So you don’t need to model the exact method calls and object instances, as is generally the case with UML models.

One good way to validate the model at this point is to verify that the components mentioned in the activity diagram are mentioned in the system’s structure (in the block diagram); and that components that interact in the behavioral model actually have this interaction expressed in the structural model. A missing interaction (e.g. channel) in the structural view may mean that these two components have an interface that wasn’t expressed in the structural model, i.e. the structure diagram should be fixed; or it could mean that these two components shouldn’t interact, i.e. the behavioral model needs to be fixed.

This is the exact thought process that this modeling helps to achieve – modeling two different facets of the system and validating one with the other in iterations allows us to reason and validate our understanding of the system. The explicit diagrams are simply the visual method that helps us to visualize and capture those ideas efficiently. Of course, keep in mind that you validate the model at the appropriate level of abstraction – don’t validate a high level system structure with a sequence diagram describing implementation classes.

Rule of thumb #3: Every interaction modeled in the behavioral model (activity/sequence diagrams) should be reflected in the structural model (block diagram), and vice versa.
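The cross-validation this rule describes can be sketched in a few lines. The names below are invented for illustration; the point is only that comparing the two models is a simple set operation once both are written down:

```python
# Channels from the (hypothetical) structural model, as unordered pairs
structural_channels = {
    frozenset(pair) for pair in [
        ("Customer", "Reservation System"),
        ("Reservation System", "Travel Information"),
    ]
}

# Interactions observed in a (hypothetical) sequence diagram: (caller, callee)
behavioral_interactions = [
    ("Customer", "Reservation System"),
    ("Reservation System", "HTTP Server"),  # no matching channel below!
]

# Each entry in `missing` means either the block diagram lacks a channel,
# or the sequence diagram shows an interaction that shouldn't exist.
missing = [
    (a, b) for a, b in behavioral_interactions
    if frozenset((a, b)) not in structural_channels
]
print(missing)  # [('Reservation System', 'HTTP Server')]
```

Which of the two models is "right" for each discrepancy is exactly the design discussion the rule is meant to trigger.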

Domain Modeling

Another often useful aspect of modeling a system is modeling the data processed by the system. It helps to reason about the algorithms, expected load and eventually the structure of the code. This is often the part that’s not covered by well known patterns and needs to be carefully tuned per application. It also helps in creating a shared vocabulary and terminology when discussing different aspects of the developed software.

A useful method in the case of domain modeling is the UML class diagram, which TAM also adopts. Here as well, I often find a scaled-down version the most useful, focused on the main entities and their relationships (including cardinality). The expressive notation of class diagrams can be leveraged to express these relationships quite succinctly.
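As a rough sketch of what such a scaled-down domain model might capture, here is a two-entity fragment for the travel agency, expressed as code instead of a class diagram. The entity names and the one-to-many relationship are assumptions for illustration, not taken from the FMC example:

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    reservation_id: str
    destination: str


@dataclass
class Customer:
    name: str
    # One Customer holds many Reservations (1..* cardinality in the diagram)
    reservations: list[Reservation] = field(default_factory=list)


alice = Customer("Alice")
alice.reservations.append(Reservation("R-1", "Lisbon"))
print(len(alice.reservations))  # 1
```

A class diagram would convey the same two boxes and one annotated edge at a glance, which is precisely why the diagram, not the code, is the better artifact for this facet.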

Explicit modeling of the code itself is rarely useful in my opinion – the code will probably be refactored way faster than a model will be updated, and a reader who is able to read a detailed class diagram can also read the code it describes. One exception to this rule might be when your application deals with code constructs, in which case the code constructs themselves (e.g. interfaces) serve as the API to your system, and clients will need to write code that integrates with it, as a primary usage pattern of the system. An example for this is an extensible library of any sort (eclipse plugins are one prominent example, but there are more).

Another useful modeling facet in this context is to model the main concepts handled in the system. This is especially useful in very technical systems (oriented at developers), that introduce several new concepts, e.g. frameworks. In this case, a conceptual model can prove to be useful for establishing a shared understanding and terminology for anyone discussing the system.

Iterative Refinement

Of course, at the end of the day, we need to remember that modeling a system in fact reflects the thought process we go through when designing it. The end product, in the form of a document (or set of documents), represents our understanding of the system – its structure and behavior. But this is never a one-way process. It is almost always an iterative process that reflects our evolving understanding of the system.

So modeling a specific facet of the system should not be seen as a one-off activity. We often follow a dynamic where we model the structure of the system, but then try to model its behavior, only to realize the structure isn’t sufficient or leads to a suboptimal flow. This back and forth is actually a good thing – it helps us to solidify our understanding and converge on a widely understood and accepted picture of how the system should look, and how it should be constructed.

Refinements also happen on the axis of abstractions. Moving from a high level to a lower level of abstraction, we can provide more details on the system. We can refine as much as we find useful, up to the level of modeling the code (which, as stated above, is rarely useful in my opinion). Also when working on the details of a given view, it’s common to find improvement points and issues in the higher level description. So iterations can happen here as well.

As an example, consider the imaginary travel agency example quoted above. One possible refinement of the structural view could be something like this (also taken from the site above):

Example: travel agency system refined

In this case, more detail is provided on the implementation of the information help subsystem and the 'Travel Information' data store. Although providing some more (useful) technical details, this is still a block diagram, describing the structure of the system. This level of detail refines the high-level view shown earlier, and already provides more information and insight into how the system is built – for example, how the data stores are implemented and accessed, and the way data is adapted and propagated in the system. The astute reader will note that the 'Reservation System' subsystem now interacts with the 'HTTP Server' component in the 'Information help desk' subsystem. This makes sense from a logical point of view – the reservation system accesses the travel information through the same channels used to provide information to other actors – but this information was missing from the first diagram (no channel between the two components).
One important rule of thumb is that as you go down the levels of abstraction, keep the names of actors presented in the higher level of abstraction. This allows readers to correlate the views more easily, identify the different actors, and reason about their place in the system. It provides a context for the more fine granular details. As the example above shows, the more detailed diagram still includes the actor and store names from the higher level diagram (‘Travel Information’, ‘Information help desk’, ‘Travel Agency’).

Rule of thumb #4: Be consistent about names when moving between different levels of abstraction. Enable correlations between the different views.

Communicating w/ Humans – Visualization is Key

With all this modeling activity going on, we have to keep in mind that our main goal, besides good design, is communicating this design to other humans, not machines. This is why, reluctant as we are to admit it (engineers…) – aesthetics matter.

In the context of enterprise systems, communicating the design effectively is as important to the quality of the resulting software as designing it properly. In some cases, it might be even more important – just consider the amount of time you sometimes spend on integrating systems vs. how much time you spend writing the software itself. So a good-looking diagram is important, and we should be mindful about how we present it to the intended audience.

Following are some tips and pointers on what to look for when considering this aspect of communicating our designs. This is by no means an exhaustive list, but more based on experience (and some common sense). More pointers can be found in the links above, specifically in the visualization guide.

First, keep in mind that the visual arrangement of nodes and edges in your diagram directly affects how clear the diagram is to readers. Try to minimize intersections of edges, and align edges on horizontal and vertical axes.
Compare these two examples:

Aligning vertices

The arrangement on the left is definitely clearer than the one on the right. Note that generally speaking, the size of a node does not imply any specific meaning; it is just a visual convenience.

Similarly, this example:

Visual alignment

shows how the re-arrangement of nodes allows for less intersection, without losing any meaning.

Colors can also be very useful in this case. One can use colors to help distinguish between different levels of containment:

Using colors

In this case, the usage of colors helps to distinguish an otherwise confusing structure. Keep in mind that readers might print the document on a black-and-white printer (and some readers may be color blind) – so use high-contrast colors where possible.

Label styles are generally not very useful for conveying meaning. Try to stick to one specific font and be consistent with it. An exception might be a label that pertains to a different aspect, e.g. configuration files or code locations, which might be more easily distinguished using a different font style.

Visuals have Semantics

One useful way to leverage the colors and layout of a diagram is to stress specific semantics you want to convey. One might use colors to distinguish a set of components from other components, e.g. highlighting team responsibilities, or to highlight specific implementation details. Note that this kind of technique is not standard, so remember to include an explanation – a legend – of what the different colors mean. Also, too many colors might cause more clutter, eventually defeating the purpose of clarity.

Another useful technique is to use the layout of the nodes in the graph to convey an understanding. For example, the main data flow might be hinted at in the block diagram by laying out the nodes from left to right, or top to bottom. This is not required, nor does it carry any specific meaning. But it is often useful, and provides hints as to how the system actually works.

Summary

As we've seen, "doing" architecture, while often perceived as a cumbersome and unnecessary activity, isn't hard when done effectively. We need to keep in mind the focus of this activity: communicating our designs and reasoning about them over longer periods of time.

Easing the collaboration around design is not just an issue of knowledge sharing (though that’s important as well), but it is a necessity when trying to build software across global teams, over long periods of time. How effectively we communicate our designs directly impacts how we collaborate, the quality of produced software, how we evolve it over time, and eventually the bottom line of deliveries.

I hope this (rather long) post has shed some light on the subject, provided some insight and useful tips, and encouraged people to invest some effort into learning further.


Credit: The example and images presented in this post are taken from the FMC website and examples.