Author Archives: slior

Agent-Driven Applications

AI agents are everywhere, and slowly (quickly?) becoming more prominent in applications. We’re seeing more and more of them appear as integral parts of applications, not just as tools for development, but as actual technical components that implement/provide user-facing functionality. We’re also seeing a significant improvement in the length of tasks agents are able to accomplish. I’m not sure this is AGI yet, but it’s definitely significant.

So far, I have focused on the implications of AI on how software is developed. But as we move from working internally with LLMs to building applications that leverage them, I believe it’s time to look more carefully at how to build such systems. In other words, what would it look like to build a system that really leverages LLMs as a core building block?

We already have concrete examples of such applications – our AI-driven IDEs and other coding agents. These are examples of applications where the introduction of AI has done more than supercharge existing application functionality. It has actually changed the way we do things. What’s more interesting is how quite a few people are using these in ways that traditional IDEs weren’t designed for. I remember a time, not so long ago, when suggesting the use of an IDE to a non-technical product manager was met with raised eyebrows1. Now, most product managers have Cursor (or Claude Code) open and doing much of their work. This isn’t just ‘vibe coding’; it’s using the agent as a multi-tool for the boring-but-essential parts of the job. I’m seeing people use Cursor for practically everything – writing specs/design documents, documentation, diagramming, design and data archaeology and more. And this is still mostly chatting with a given agent. The potential, I believe, is much bigger.

When I let myself extrapolate from coding agents to the broader set of potential applications2, I can’t help but think we’re going to see a new kind of software application emerge – agent-driven applications. These are applications built mostly around LLMs and their tools, essentially the harness of the agent. It can be multiple agents cooperating or a single agent embedded into a larger platform. I don’t assume this type of application will replace all others, but I think it will become more prevalent, and we should start thinking seriously about what it means to really leverage LLMs in applications. We should consider the implications for how we define, build, evolve and use applications where AI-based agents sit at the core.

Why Should We Care?

One can argue that LLMs are nothing more than a technical component in a larger system, with limitations and quirks around how they’re used. This is technically true3. But I believe there’s a larger opportunity here in how we deploy and use these LLMs; an opportunity which presents its own challenges.

If we limit ourselves to some kind of smarter automation – guiding the LLM through a task or some workflow – this would probably work. But with LLMs we can do more. We can declare a desired goal/outcome and let the agent decide how to achieve it. A reasoning agent equipped with a set of capabilities (=tools) and a desired goal can work to achieve it, without us coding the concrete workflow, or even explaining it in detail. This is what we’re already seeing with AI IDEs. Assuming the capabilities are robust enough – on par with what a user would do – the agent should be able to accomplish the task on its own.

This may seem insignificant at first look. But I think it changes how we would want to design these systems if we want to really leverage high-end LLMs. Given an agent with enough tools, the user can also instruct it to do all kinds of tasks that a developer/product manager/architect did not even consider when building the application. It can be a simple heuristic change, or completely new workflows – all only a prompt away. Years of enterprise software customizations and pluggable architectures prove that this is a very real need. And reasoning LLMs, with their open-ended flexibility, supercharge these customization capabilities.
It’s more than customization of existing workflows. It’s also finding new ways to use the existing capabilities. Similar to how people string together Linux shell commands, or use electronic spreadsheets for everything from accounting to games and habit tracking – once the capabilities are there, we’re only limited by the user’s imagination.

In addition, I think there’s also a chance for a pattern that might be unique to LLMs being at the heart of such an application. This is derived from the use of context. When an agent in such an application works, it consumes context (usually fed by different tools, or from its own progress). But it also has the chance to affect future context, for other agents or for its own future invocations. It maintains a memory of past actions and interactions, similar to how I turn certain conversations I have in Cursor into rules or commands, or update the AGENTS.md file. This has the potential to allow the agent to improve over time, even automatically, without a human in the loop.

But what does such an application look like?

Anatomy of an Agent-Driven Application

As I see it, the architecture of an agent-driven application centers around the Agent Loop. Unlike traditional software that relies on rigid, pre-defined workflows to execute logic, this architecture relies on an agent, or a set of collaborating agents, working autonomously to achieve a specific goal.

In this development model, we do not define concrete flows. Instead, we define the application by providing the agent with a set of tools. These tools allow the agent to perceive its environment and act upon it. The desired outcome is defined through a combination of prompts, originating from both the system developers and the end-user, and existing context. The agent then works through the task in its loop, until it completes its work. This is similar to the idea of Web World Models, only applied to business scenarios, which hopefully can make it more constrained.

The execution path is dynamic rather than static. Because the agent maintains a context that evolves, learns, and potentially forgets over time, the specific steps taken to achieve an outcome may change. The application defines what needs to be done, while the agent determines how to do it based on its current context and available tools. The potential for such an application is more than simple automation of tasks; it’s also about finding ad-hoc ways to achieve a given goal or producing unforeseen (desired) outcomes.

Examples can be agents with varying levels of complexity:

  • A planning agent that responds to events, queries application state and decides how to allocate resources, collaborating with other agents to verify choices and other constraints, eventually notifying users and downstream systems.
  • A troubleshooting agent that leverages various data sources to correlate and find insights in the data, iteratively exploring until it offers several theories answering the question it was asked.

There are three main components to such an application: the capabilities (tools) available to the agent, the shared context, and the agent loop.

The Agent’s Capabilities (Tools): The agent interacts with the application and external systems through tools. These tools should be atomic and composable. Initially, they may be simple primitives. Over time, as we observe how the agent utilizes them, these tools can evolve into more complex capabilities. The agent selects these tools dynamically to solve problems, invokes them and acts on their result. 
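As a minimal sketch of this idea: a tool can be as simple as a named, described function the agent can select by description and invoke, acting on its textual result. The tool names and behavior below are hypothetical, purely for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """An atomic capability exposed to the agent."""
    name: str
    description: str  # the agent selects tools based on this text
    run: Callable[..., str]

# Hypothetical primitives; in a real system these would wrap
# application logic or external APIs.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: status=shipped"

def notify_user(user_id: str, message: str) -> str:
    return f"notified {user_id}"

TOOLS = {
    t.name: t
    for t in [
        Tool("lookup_order", "Fetch the current state of an order", lookup_order),
        Tool("notify_user", "Send a message to a user", notify_user),
    ]
}

# The agent invokes a tool by name and acts on the textual result.
result = TOOLS["lookup_order"].run("A-123")
```

Keeping the primitives this small is what makes them composable; richer capabilities can be layered on top later.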

Shared Context: Context is the memory of the system. It is not limited to a single interaction but persists and evolves between agent sessions. This shared context allows the agent to learn from previous interactions. It ensures that the agent does not start from zero with every task but builds upon a history of user preferences and past decisions, in addition to the system state. This memory is shared between the agents working in the system, but also between the agents and the users. It’s possible for a user to interact directly with the context, correct or change it, and thereby direct the agent(s), within normal data access limitations.

The Agent Loop and Completion Signals: The agent lives in a perpetual loop: it observes the state, reasons through the next step, acts using a tool, and then looks at what happened. Repeat until the job is done. This loop runs until the agent determines the task is complete. Note that since the system centers around the agent loop, identifying when it’s finished is critical and an integral part of the pattern. There could be different signals for completion (e.g. “fully completed”, “partially completed”, “completed but unknown state”, “failed”).
It’s important to distinguish between a completion signal and some execution failure. It’s quite possible that a tool execution fails, but the agent continues to reason and work around it. It’s also possible to have all tools successfully execute, with the overall outcome not achieved due to other reasons.
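The loop and its completion signals can be sketched in a few lines. Here `reason` stands in for an LLM call, and the stubbed implementations are illustrative only; the point is that a tool failure becomes an observation rather than ending the loop, and the loop ends only on an explicit completion signal (or an exhausted step budget):

```python
from enum import Enum

class Completion(Enum):
    # Completion signals, distinct from individual tool failures.
    FULLY_COMPLETED = "fully completed"
    PARTIALLY_COMPLETED = "partially completed"
    COMPLETED_UNKNOWN_STATE = "completed but unknown state"
    FAILED = "failed"

def agent_loop(goal, reason, act, max_steps=10):
    """Observe -> reason -> act -> observe, until a completion signal.

    `reason` stands in for an LLM call: given the goal and the history
    so far, it returns either a (tool, args) action or a Completion.
    """
    history = []
    for _ in range(max_steps):
        decision = reason(goal, history)
        if isinstance(decision, Completion):
            return decision, history
        try:
            observation = act(*decision)
        except Exception as e:
            observation = f"tool failed: {e}"  # agent may work around it
        history.append((decision, observation))
    return Completion.PARTIALLY_COMPLETED, history  # step budget exhausted

# A stubbed `reason` that takes one action and then declares success:
def reason(goal, history):
    return Completion.FULLY_COMPLETED if history else ("echo", goal)

def act(tool, arg):
    return f"{tool}: {arg}"

signal, trace = agent_loop("close ticket #42", reason, act)
```

Note the final fallback: running out of steps is itself reported as partial completion, not as a tool error.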

Design Principles

Now that we’ve established the general idea of what an agent-driven application looks like, it’s worth laying down some points which should help us design such a system effectively.

Agent Capabilities Match User Capabilities

We should aim for capability parity between the user and the agent. If a user can achieve some outcome in the system, the agent must have a corresponding tool or set of tools to achieve the same result. This does not necessarily mean the agent manipulates the UI widgets; rather, it means the agent has programmatic access to the same underlying logic and mutations that the UI exposes to the user. It might be through a different path, but if we want the agent to achieve the same outcomes as a user, it should have capabilities that are on par with the user’s capabilities to affect the system.

Application Logic Lives In Prompts

The core logic of the application shifts from code to prompts. We use prompts to define the business constraints and desired outcomes. Deterministic code is still there, for various reasons, but the more flexible we want to be, and the more open to agentic reasoning, the more we need the desired logic to live in prompts. I also expect that the definition of business flows will be less prescriptive. Instead it will focus on establishing goals and constraints. Think of it like SQL for business logic: you declare the ‘what’ (the query), and the engine figures out the ‘how’ (the execution plan)4. There’s of course a twist here: our “engine” is a non-deterministic LLM working with an ever-evolving vocabulary of tools. This is harder than optimizing over a relatively narrow domain (relational algebra).

Consequently, this changes how we debug an agent-driven application. Instead of stepping through lines of code to debug logic errors, we analyze execution traces to understand the agent’s reasoning process and tool selection.

Guardrails are Explicit – In The Tools

While the agent is autonomous, it must operate within safety boundaries. We do not rely on the agent’s “judgment” for critical constraints. Concerns such as data consistency, authorization, and sensitive data access are enforced strictly by the tools themselves. The tool allows the action only if it meets the hard-coded security and business rules. Some of the safety guardrails can be in the prompts, but we should not rely on this as a security measure.
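A sketch of what this looks like in practice: the checks live inside the tool, so no matter what the agent was asked (or talked into), they hold. The refund tool, roles, and limit below are all hypothetical:

```python
REFUND_LIMIT = 100.0  # hypothetical hard-coded business rule

class GuardrailViolation(Exception):
    """Raised by a tool when a hard constraint is violated."""

def refund_tool(caller_role: str, amount: float) -> str:
    # Authorization and business rules are enforced here, in code,
    # not in the prompt. The agent cannot reason its way past them.
    if caller_role not in {"support_agent", "admin"}:
        raise GuardrailViolation("caller is not authorized to issue refunds")
    if amount <= 0 or amount > REFUND_LIMIT:
        raise GuardrailViolation(f"refund must be between 0 and {REFUND_LIMIT}")
    return f"refunded {amount:.2f}"
```

A prompt-level instruction like “never refund more than 100” can still be useful as guidance, but the tool-level check is what actually guarantees it.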

Capability Evolution

The agent’s capabilities in the system are not necessarily static. We can evolve them by observing how the agent “behaves”. Concretely, we treat the agent’s behavior as a source of requirement generation. By observing traces, we identify common patterns or sequences of actions. We then “graduate” these patterns into more elaborate, hard-coded tools. It’s technical and logical refactoring that’s driven by how we observe the system behaving.

I see a few main motivations for this kind of evolution:

  1. Optimization: Hard-coded tools reduce cost and latency compared to multiple LLM round-trips.
  2. Domain Language: Creating specific tools establishes a richer, higher-level vocabulary for the agent to use, making it more effective within our specific business domain.

It’s also possible that we’d want to code some tool in order to guarantee some business constraint, e.g. data consistency. However, I believe this will not be so much an evolution of a tool but rather a defined boundary condition for the definition of a tool in the first place, maybe a result of a new business requirement/feature.
  

It’s quite possible that a few granular tools will be combined into a more complicated one if the pattern is very common, and we can optimize the process. Still, I wouldn’t discount the more granular tools as they provide the flexibility we might like to preserve.
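To make the “graduation” idea concrete, here is a sketch under hypothetical names: three granular primitives the agent was observed chaining repeatedly, collapsed into one hard-coded tool, while the primitives stay registered for flexibility:

```python
# Granular primitives (kept available for flexibility).
def fetch_invoice(invoice_id: str) -> dict:
    return {"id": invoice_id, "total": 120.0, "paid": 40.0}

def compute_balance(invoice: dict) -> float:
    return invoice["total"] - invoice["paid"]

def format_reminder(invoice_id: str, balance: float) -> str:
    return f"Reminder: invoice {invoice_id} has {balance:.2f} outstanding"

# A "graduated" tool: a sequence observed in traces, hard-coded into
# a single call to cut LLM round-trips, cost and latency.
def outstanding_balance_reminder(invoice_id: str) -> str:
    invoice = fetch_invoice(invoice_id)
    return format_reminder(invoice_id, compute_balance(invoice))
```

The composite tool also enriches the agent’s vocabulary: “outstanding balance reminder” is a domain-level concept, not three plumbing steps.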

Tradeoffs and Practical Considerations

Naturally, when designing any kind of system, we make tradeoffs. When designing real-world systems, we often need to be practical, beyond theory. So it’s important to understand whether this kind of architecture pattern and technology carry with it any specific considerations or tradeoffs.

Model selection and configuration is an obvious point to note when building a system with LLMs at its heart. Not all tasks are created equal, and some may require a higher level of reasoning than others. The tradeoff is between cost/latency on one side, and reasoning power, expressiveness and the inherent capabilities of the model (e.g. whether it is multi-modal) on the other. For example, a “router” agent that identifies and dispatches messages to other agents/processes may work well enough with a cheaper (weaker?) model; whereas an agent requiring deep understanding of a domain model, and how to retrieve and connect different bits of information, working for a longer time, may require a stronger model. This will probably be more evident in systems where there’s a topology of cooperating agents.

Then there’s the elephant in the room: the tradeoff between autonomy and risk. This is an obvious point when considering a somewhat stochastic element in the architecture.

On the one hand, autonomy gives the agent, and ultimately the user, more flexibility. This should immediately lead to more unexpected use cases and the “emergent behavior” mentioned above. Consider, for example, an agent dealing with financial records that can identify issues and fix them without the patterns being pre-programmed in code.

On the other hand, there’s an inherent risk with allowing too much. Restricting the agent’s capabilities increases predictability and therefore safety. It of course limits the product’s value at the same time. On the extreme end, a very limited agent is kind of a fancy workflow engine5.

Applications obviously exist on a spectrum here, but this is a prime consideration when designing the agent’s capabilities.

The intersection of LLM context and long-running agents carries with it some points to pay attention to as well.
First of all, long-running agents will probably “run out” of context window. Trying different tools, retrying failed actions, and accumulating data and observations will inevitably fill the context window up. This is an expected problem in this scenario. Its impact and frequency will most likely correlate with task complexity and tool capabilities.
When building such a system, we should provide a standard, hopefully efficient, way to summarize or compact the context. Simply dropping a “memory” is usually not an option. There should be a standard way for agents to retrieve memories where applicable. This will likely be a core component of the system, and it’s still open (at least for me) whether there’s a general mechanism for managing context that will fit all kinds of tasks and/or applications and agent topologies.
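One minimal compaction strategy, sketched below under simplifying assumptions (character count standing in for token count, and a `summarize` callback standing in for an LLM summarization call), folds the oldest messages into a single summary entry once the budget is exceeded, rather than dropping them:

```python
def compact_context(messages, token_budget, summarize, count_tokens=len):
    """Fold the oldest messages into one summary entry when over budget,
    keeping roughly the most recent half-budget of messages verbatim."""
    if sum(count_tokens(m) for m in messages) <= token_budget:
        return messages  # under budget, nothing to do
    keep, running = [], 0
    for m in reversed(messages):          # walk from most recent backwards
        running += count_tokens(m)
        if running > token_budget // 2:
            break
        keep.append(m)
    keep.reverse()
    old = messages[: len(messages) - len(keep)]
    return [summarize(old)] + keep        # memory is compacted, not dropped

# Usage with a stub summarizer; a real one would call an LLM.
history = ["a" * 50, "b" * 50, "c" * 10]
compacted = compact_context(history, 40,
                            lambda old: f"[summary of {len(old)} messages]")
```

This is deliberately crude; whether one general mechanism like this fits all task types and agent topologies is exactly the open question above.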

Which brings me to another point about context – managing context across agents, and the intersection of agents and users. The context for an agent will evolve across sessions. And it might actually be a good thing, depending on the application, to make it accessible to the human user. For example, if we want to allow the user to fix a data issue and/or somehow change the behavior by modifying some learned memory. There is a potential here for conflicts between changes. So we should consider how conflict resolution is done when it occurs on context updates.

User interface and experience should also be considered carefully here. Since a fundamental building block is the agent loop, the state and progress of the agent should be reflected to the human user, and maybe to APIs as well. Faithfully reflecting the state of the system, specifically the behavior and reasoning of the agent(s) running in it, helps to identify issues and build trust. I expect this to be a non-negligible issue when building and adopting such an application. Completion signals are part of this standard pattern, and probably deserve “first-class” citizen status in the application. Understanding what an agent is doing, and whether it’s done with the goal the user presented, is important to the user. Understanding when, whether, and sometimes how a goal was achieved should be standardized.

One last tradeoff to point out is the mechanism agents use to discover tools. You can have a static list of tools (capabilities), coded as available to each agent. This provides a more predictable list and therefore higher control. On the other hand, you can imagine a more dynamic “tool registry”, where tools may be added and made available to agents at runtime. Tool choice is still done by the agent, but it’s easier to predict with a static list. An evolving, dynamic registry may offer more flexibility but will be less predictable, and I expect the agent will have a tougher time selecting the right tool in this case.

If we want true flexibility, we lean into the dynamic registry. And if the agent gets lost in the aisles?6 We can always fall back to a “safer” hard-coded map of tools.
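A sketch of the dynamic end of that spectrum: a registry where tools can be added at runtime and discovered by description. Naive keyword overlap stands in here for the semantic matching (e.g. embeddings) a real system would use, and all tool names are hypothetical:

```python
class ToolRegistry:
    """A dynamic tool registry: tools are registered at runtime and
    discovered by description, rather than hard-coded per agent."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, fn):
        self._tools[name] = (description, fn)

    def discover(self, query):
        # Crude keyword overlap as a stand-in for semantic matching.
        words = set(query.lower().split())
        return sorted(
            name
            for name, (desc, _) in self._tools.items()
            if words & set(desc.lower().split())
        )

    def get(self, name):
        return self._tools[name][1]

registry = ToolRegistry()
registry.register("send_email", "send an email message to a user",
                  lambda to, body: f"sent to {to}")
registry.register("query_db", "run a read-only database query",
                  lambda sql: "rows")
matches = registry.discover("email the user about the delay")
```

The static alternative is simply a fixed dict of tools per agent; the fallback mentioned above amounts to swapping `discover` for that fixed map.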

From Magic Boxes to Design Blueprints

Whether we choose static control or dynamic flexibility, the goal remains the same: building a robust environment for autonomy. 

We are rapidly moving past the phase where an LLM is a “magic box” bolted onto the side of a traditional app. We need to think about how we design these systems. We have to get serious about the architectural patterns that allow these agents to actually get work done without constant human hand-holding.

The transition to agent-driven applications presents a new set of interesting problems for us to solve7. We’re no longer just designing APIs for human coders; we’re designing vocabularies for agents. The challenges ahead – how to build tools that are legible to a model, how to share context across a multi-agent swarm without it becoming a game of “telephone”, and how to let that context evolve organically – are the new unexplored territories of software system design.

Building these systems isn’t just about writing code anymore; it’s about building a harness for reasoning. It’s messy, it’s non-deterministic, and therefore less predictable. But it’s also an exciting architectural shift.

So, let’s keep exploring and see what else we can find.


  1. or, god forbid, write something in markdown. ↩︎
  2. More examples here. ↩︎
  3. Pun intended ↩︎
  4. Yes, I realize it’s a bit more complicated than that, but you get the idea. ↩︎
  5. And we have plenty of those, good ones, with no LLMs involved. ↩︎
  6. That is, fails to accomplish its goal ↩︎
  7. or at least old problems with new technology ↩︎

The Autonomous SDLC: When Code Becomes Substrate

In my last two posts, I dove into an implementation of AI agents in the area of software design, specifically having different LLMs debate a design problem. At the end of my last post I shared some initial thoughts on how this fits into a bigger picture of software development life cycle in the age of AI.

I’ve argued before that I believe AI has more potential than simply generating code faster than humans. AI-based design debates are, I believe, just one component in a broader ecosystem of techniques and tools that mechanize much of the process of evolving and developing a software-based system. Software development involves much more than simply coding the desired behavior. Seen from this perspective, program code becomes another substrate on the path between human thought and working software; a substrate increasingly detached from human supervision1.

Consider the combination of:

  1. AI-based design discussions
  2. Spec-driven development
  3. AI-coding agents
  4. Continuous deployments
  5. Architectural fitness functions

Can we envision a process where a human provides the necessary definition of required functionality, with some non-functional requirement thrown in, and a machine, driven by AI, picks it up and iterates through a continuous loop to build and evolve the system?

I imagine something like this:

(arrow direction represents data flow – read and write direction)

Where a human developer2 provides requirements to implement in the system, goes through some clarification Q&A cycle, but eventually hands off the design and implementation to a set of AI agents that break down the requirements into an actionable implementation plan, and implement it through to deployment. The new code is then deployed, the architecture and code are updated, and they become the basis for the next feature/bug fix to be implemented. This cycle proceeds as long as new requirements (including bug reports) are fed into the system. The system evolves through a set of agents that cooperate on specific tasks, handing off the different artifacts and making changes, potentially over several iterations.

The concept of fitness functions is crucial. We know that LLMs require a way to perceive the environment, and to perceive the implications of their changes, so they can reason and act on them. We also know that system design, especially when evolving a system, is often affected by the existing architecture and how the system actually behaves at runtime. This is more than code design; it is often about runtime and operational properties of the system being built or improved, e.g. performance or security properties. Some design decisions also affect how the system will be built, observed and run.
This notion of evolutionary architecture isn’t new, but mechanizing it becomes even more important when machines determine the next iteration of the system architecture. It’s a way for LLM-driven agents to perceive the system they operate on.
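As a small illustration, a mechanized fitness function can be as simple as a pass/fail check over a runtime metric that an agent (or a CI gate) runs after each change. The metric and threshold below are hypothetical:

```python
def p95_latency_fitness(latencies_ms, threshold_ms=200.0):
    """Architectural fitness function: does the 95th-percentile request
    latency stay under the threshold? The structured verdict is what an
    agent would perceive and reason over after a deployment."""
    ordered = sorted(latencies_ms)
    idx = max(0, int(len(ordered) * 0.95) - 1)  # nearest-rank percentile
    p95 = ordered[idx]
    return {"p95_ms": p95, "passed": p95 <= threshold_ms}

# e.g. latency samples (ms) from the last deployment window:
verdict = p95_latency_fitness(list(range(1, 101)))
```

The important property is that the verdict is machine-readable: the agent doesn’t read a dashboard, it consumes the result and decides whether the last change stands.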

Most of the puzzle pieces are already in place.
We have pretty good coding agents that can index and reason about existing code. We already see the beginning of existing technical specification agents. At this point they are interactive (human-in-the-loop). But given enough accurate context, I believe there’s a clear trajectory for agents like this being more independent. Deployment is essentially CI/CD, augmented with some quality controls (testing, code reviews, linting, documentation, etc.). And depending on the desired fitness functions, we can probably implement most with existing observability tools and APIs.

What emerges is what I call an Autonomous Software Development Lifecycle. A system where design choices, specifications and implementations are done by agents that exchange structured artifacts and feed into one another, with very little human intervention. Humans are part of the game in two main aspects: defining the expected system behavior, and tuning the system behavior through some well defined interfaces, notably the definition of the non-functional requirements.

Supervision can also be autonomous. There’s no reason AI agents can’t be connected to existing observability tools and actively respond to them. Given enough accurate data, a capable LLM should be able to derive a decent fix for problems that come up. Combined with existing methodologies of blue/green deployment, redundancy, etc., we effectively get a self-healing system.

Taking it a step further, we can consider a scenario where such a system, a collection of agents, also proactively improves the system – a self-improving system. An example would be a situation where a new technology becomes available and allows for a better implementation of existing functionality. For example, a faster DB, or a better VM that allows doing more work with fewer network calls.

These last two points about agents being proactive present a shift from how we operate today with coding agents. Most coding agents today are reactive to our requests. Having agents that respond to a changing environment, not to us, and act on it, presents a proactive system that autonomously improves the software. We’re already seeing this at the coding level in some cases. But is there a fundamental barrier to doing this at the system level, given that enough accurate context is provided?

Really?

I realize the vision I present here is utopian in a lot of ways and may in fact appear unrealistic. We’re not there yet. There are technical limitations and cost is an issue3. And at this point in time LLMs are not yet reliable enough to “own” a system in production in such a way. Generally speaking there are non-trivial challenges to solve here.
But I do believe we’re starting to see these patterns emerge, and this is a reasonable extrapolation of current advancements in LLMs and the surrounding ecosystem when applied to software development. Especially when it’s integrated with existing observability, software engineering, project management and other relevant tools of the trade.

It’s also interesting to see how this vision is not unique to software development. In a recent interview Satya Nadella was asked (among a lot of other things) about his vision of using models in Microsoft4. And it’s interesting to see how he outlines a future where MS-Office apps, Excel in his example, are used autonomously by AI agents. The gist of the tool is not a UI the model works with, it’s the underlying functionality (“logic”) the model integrates with. The focus in that conversation is on the business implications of this kind of development, which I won’t dive into here. But this idea resonated with me when thinking about coding and software engineering and operation in general. The human-centric UX of the tool becomes secondary when AI is involved, but the tool’s functionality can still be available and relevant to AI agents to use. When more and more software engineering tools become available to AI to use, the integration seems inevitable.

We’re training and focusing AI on using the tools we know and use for the tasks we need – the tools that were built for us. The emphasis we currently put on code quality is mostly driven by human involvement in the code. When humans need to read and review the code, we judge and build coding agents in a way that emphasizes code metrics and quality that are (rightfully) important for humans. But if you take the human interface – the programming language – out of the coding loop, we can probably relax some of these requirements, and let agents use tools that do coding even if the result isn’t optimal for human consumption. Instead, we should give agents the tools to perceive and assess their results based on what actually matters – the runtime behavior of the system.

In all likelihood, we’ll get there in small steps, and it will be realized in stages, with different parts of the puzzle implemented by different teams at different times. Even if we assume all pieces are integrated, I still believe we’ll always have the ability to adopt it partially. We’ll probably also see a reality where different teams adopt varying levels of such capabilities, similar to how today different software development teams adopt different languages and tools, depending on the type of software they build. Some types of software are still better off developed in Assembly or C. In other cases, teams abandon the use of ORM frameworks, even though they could make their coding life easier. It’s quite possible we’ll see teams still doing investigations or carefully feeding planning and coding agents with hand-curated context and tools. But there is a potential here to achieve an order of magnitude more efficiency if we are willing to let go of some control.

What Do We Need To Get There?

As I wrote above, I don’t think we’re there yet.
We’d obviously need to standardize the way agents communicate. Given the amount of available formats and notations for expressing practically anything in software, the challenge will be more about agreeing on the notation rather than inventing one (though that’s always an option).

The challenge may lie not in having a communication format, but in having an efficient one. We’re already seeing examples of this (here, here).

Standardizing on protocols and notation is the easier problem. Agents will need to communicate with one another using the semantics of different activities in the SDLC: how projects are organized, how a plan is built, and how it translates to components of the running system. Luckily, decades of humans building software are already built into the LLMs through their training. Mapping some of the ideas may sometimes be a challenge, though it seems LLMs can bridge this gap5. Mapping semantics between tools seems to be the easier problem; combining the various domains of a whole project may be a bit more challenging. Bridging the gap between project planning, product roadmap and technical planning/constraints is usually done in the minds of people, and in discussions. Adapting LLMs to do this is, I think, achievable, though not generally trivial6.

Of course, having the technical protocols, and ways to encode semantics doesn’t mean we can do it efficiently at scale. So adapting these to be used by LLMs efficiently is of course key to realizing it.

So What’s Left For Humans?

Obviously humans are still left to define requirements and priorities. Current SDD tools don’t seem to address this fully, at least not yet. But there are already attempts to show what such a process can look like. It’s clear that there’s still a way to go for automatically translating software requirements to technical specifications. Still, the foundations are there, and one can imagine how this process is realized. I also don’t see LLMs weighing in on business constraints and trade-offs. Taking social circumstances and constraints into consideration is, well, human.

Even when looking at the pure engineering side of things, I expect that humans will still be needed, not so much for their ability to express business application logic in code as for their ability to reason about system behavior7. The future software engineer will need to understand systems engineering at a very fundamental level, and be able to translate it into specifications and requirements to be worked on by AI agents. For example, understanding what causes the application to suddenly slow down when a certain event hits, or spotting a sporadic race condition that happens because of the distributed nature of the system. Experienced software engineers are able to spot issues like that, especially in a system they know, from just a cursory look at the UI or logs. Systems thinking and understanding will probably become a much more important and sought-after skill.

In practice, it will probably mean that software engineers will be much more concerned with defining and regulating the “fitness functions” than with whether a given snippet of code is readable or violates DRY principles. It could be about modifying the specifications, but also about having a “widgets and knobs” dashboard-like experience where different properties are exposed, allowing engineers to tune and configure the system according to their understanding.

Even if such a system (or ecosystem) of agents materializes, I expect brownfield projects will take time to adapt to such a methodology. A lot of work will be needed to feed and adapt existing artifacts (code, documents, specifications) and to reverse engineer the implicit knowledge that is often assumed or communicated verbally.

Another place where I believe there’s not yet a replacement for humans is innovation. New technologies that could change how we interact with computers or build systems will most probably require us to train LLMs to program with and use them. Think of new hardware, applied mathematics or new algorithms. In all of these cases, new application and technical patterns are formed, and these still need to be taught to LLMs8.
Similarly, integrating across modalities, or interactions with the physical world will probably require more guidance.

Good engineering – understanding how a system fundamentally works and how it can/should change – will not be replaceable so easily by LLMs. The more we encode it into descriptions, the more we can get done. But these are, at the end of the day, textual probabilistic models. Training on larger datasets will help, but not replace understanding.
It’s the way we build and modify these systems that will change. It is about letting some of the obvious repeating patterns sort themselves out.

Is This Necessarily a Good Thing?

Beyond the technical challenges of realizing this vision, I think it’s also important to ask ourselves whether this state of affairs is a good thing – will it lead to greater success in software delivery, without compromising on quality and safety?
This is not about avoiding it9, but rather about articulating the necessary constraints or guardrails so we can avoid a “less than ideal”10 outcome. 

A reality where software is created and modified with zero friction can quickly become risky, especially with mission-critical systems. Friction-inducing mechanisms, e.g. compliance and risk assessment, exist for a reason. Software production is no different, especially as software is already a critical component of our modern life. If we do reach a point where software is created autonomously by LLM-driven agents, any human intervention is essentially such a friction. Today, we usually experience this friction as a hindrance to efficient and effective delivery. But in a world where most of the human inefficiencies are taken out of the equation, this friction11 of human intervention may actually work to hedge some (all?) of the risks. 

So we need to ask ourselves where human involvement is in fact a positive thing, and where this interacts with the SDLC. I believe that as a rule of thumb, decisions that impact humans should be taken by humans. For example, defining who has access to what piece of data is essentially a human decision directly affecting humans; similarly, how long to keep transaction data has legal and social implications. Contrast this with the decision of whether to use a linked list or a simple array, or how to decompose a system into separate services – these may affect how the system performs, or how long it takes to make changes, but it does not directly affect the human experience, and can be more easily relegated to AI-agents.

This autonomous SDLC has to have some “human-friction” built into it, if only for the sake of safety if not for better results. The exact mechanisms are yet to be seen, but they should be there.

So yes, mechanizing the software development life cycle is an overall good thing for efficiency, and has the potential to alleviate a lot of the problems plaguing the software industry. As a corollary, it can also induce a wave of innovation if software is easier and cheaper to create12. But we also need to make sure we’re not giving up on human common sense, intuition and the ability to innovate. We have to be conscious of what software is getting built, and how it affects us, especially when it evolves more easily. 

Let the agents rise!
(but keep an eye on what they’re doing)


  1. And we can argue whether it’s really important given how detached it becomes from humans; e.g. is readability by humans that critical? ↩︎
  2. Or however we’d like to call this role ↩︎
  3. Though I believe it could be offset by the savings in development and down times ↩︎
  4. This actually wasn’t the exact question, but this is roughly where he went with it ↩︎
  5. And more formal semantics may prove useful. ↩︎
  6. And we should consider – should we solve it in a general manner? What if we start by doing it only for web applications? or mobile applications? ↩︎
  7. Remember the point about System ≠ Software, here. ↩︎
  8. Though world models may prove to be different here, but admittedly I’m not an expert. ↩︎
  9. Not trying to be a Luddite. ↩︎
  10. And I will let the dystopian genre experts suggest what is “less than ideal”. ↩︎
  11. Should I say “efficiency-dampening factor”? ↩︎
  12. But note – not necessarily easier to operate. ↩︎

Argument-Driven Development: Benchmarking Dialectic

In the last post, I presented the idea of multi-agent debates, and a possible implementation of the idea in software design.

At the time, I had only some anecdotal and very preliminary evidence on whether it works. Since then, I’ve tested some more, added useful features and spent some time fixing bugs. Now it’s time to look at these initial results and what we can learn from them.

Evaluation

Generally speaking, evaluating a tool like Dialectic depends entirely on where, how, and how much you intend to use it. In its current incarnation it’s a CLI utility, and I’m still working on assessing its efficacy. My focus is on how to achieve good results from a qualitative perspective. In other words, I am assessing the quality of the analysis and solutions it provides. Since we’re dealing with software design, which is famously hard to quantify1, assessing the results of such analysis is the primary focus. The problem of system design, especially in software-heavy systems, is almost always a question of tradeoffs. There is almost never an absolute truth. Quite often there’s not even a clearly better alternative. It’s often hard for us, as humans, to judge what the “best” solution is. Evaluating the quality of the output mechanically is not easy. Yet, scaling this kind of evaluation relying solely on human judgment is unrealistic2.

I could, and probably should, address other quality attributes, namely runtime performance and cost of usage. These are important, but not as much as quality of results, which I consider an essential success factor. For now, we are operating in controlled, limited environments, rendering these factors less critical.

Beyond a holistic quality measure of the design, I’m aiming to see how different customizable factors in Dialectic affect the quality.
Specifically:

  1. How does the number of rounds affect the debate and the final suggestions?
  2. How does the introduction of clarification questions (answers to questions raised by the agents) affect the outcome?
  3. How do different models used affect the results?
  4. What if we use different subsets of the possible roles?
  5. What if we enable/disable the summarization of conversation?

There are endless combinations and potential factors affecting performance, on their own and in combination. The analysis below is a preliminary look at specific factors, with some suggested follow up questions. But it’s pretty obvious that there are more factors that might affect the results.

Evaluation Method

The question of how to evaluate AI performance is a hot topic these days. And there are already evolving methodologies and best practices. Given the focus I mentioned above, and admittedly, my limited time and budget, I implemented a simple “LLM-as-a-Judge” strategy, with some hand-picked human review3 to gauge quality. Quality in this case is not so much the “correctness” of the suggested solution4 as much as how it addresses the issues raised in the stated design problem, how well it identifies risks and provides reasonable reasoning.

When creating the evaluation functionality (built with the help of LLMs), I focused the evaluator’s prompts on estimating coverage of different aspects. The default evaluation prompt covers functional completeness as well as non-functional requirements – performance/scalability, security/compliance, testability/maintainability – and an overall assessment, while providing reasoning for each rating. When evaluating the different factors, we provide scores (and results below) for each of these aspects separately.

One can of course configure the evaluators with other prompts and temperatures, so this is still something that can be improved. This matters. As I was testing the evaluation functionality, it was obvious that more detailed prompts, with examples and chain-of-thought patterns, produced stricter evaluations with better reasoning.

All the tests I describe below use the same evaluation configuration – prompts, model and temperature. This way, good or bad, at least the evaluating agents are consistent.
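For concreteness, here is a rough sketch of what such an aspect-based “LLM-as-a-Judge” evaluator could look like. The aspect names, prompt wording and `call_judge_llm` function are all illustrative stand-ins, not Dialectic’s actual implementation; the model call is stubbed so the example is self-contained.

```python
import json

# Hypothetical aspect names, mirroring the evaluation dimensions described above.
ASPECTS = [
    "functional_completeness",
    "performance_scalability",
    "security_compliance",
    "testability_maintainability",
    "overall",
]

def call_judge_llm(prompt: str) -> str:
    # Stub standing in for a real model call; a real judge would return
    # genuine per-aspect scores and reasoning as JSON.
    return json.dumps({a: {"score": 7, "reasoning": "stub"} for a in ASPECTS})

def evaluate(problem: str, solution: str) -> dict:
    # Build a judging prompt asking for a 1-10 score plus reasoning per aspect.
    prompt = (
        "Score the proposed design 1-10 on each aspect, with one "
        f"sentence of reasoning per score.\nAspects: {', '.join(ASPECTS)}\n"
        f"Problem: {problem}\nProposed design: {solution}\n"
        'Reply as JSON: {"<aspect>": {"score": <int>, "reasoning": "<str>"}}'
    )
    ratings = json.loads(call_judge_llm(prompt))
    # Keep only the numeric scores; averaging across judges/cases happens later.
    return {aspect: ratings[aspect]["score"] for aspect in ASPECTS}

scores = evaluate("Design a rate limiter", "Token bucket per client key...")
```

Running several such judges and averaging their per-aspect scores gives the kind of aggregate numbers discussed below.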

Case Studies

For testing, I chose several cases, somewhat generic problems taken from architectural katas, and an example from a Reddit discussion. These are good as preliminary and limited examples, but they lack a broader business and product context that often exists in companies looking to do software design on evolving systems. These may not be completely representative of real-world design problems, but as an evaluation benchmark I believe them to be a decent starting point.

For each of the case studies, I created the necessary configurations and script to run one of the above tests. Then another script is used to run a specific test on all case studies.

This is by no means a complete rigorous study or experimentation. I expect to continue with more cases, especially real-world ones, requiring more context and subtlety in the questions raised. There is a difference between “What is the best way to build a system with the following requirements… “ and “I need to improve this component, given the current system and constraints – what is the best way to do it…”5.

From anecdotal experience (at work), I can already identify a pattern where more specific questions require more context and a more focused phrasing of the problem. In other words, the way we present the problem (the input), unsurprisingly, affects the output.

Unless otherwise noted, the debate configuration for all these tests included one agent of each of the “Architect”, “Performance”, “Security” and simplicity (“KISS”) roles. All were configured to use a Gemini 2.5 Flash Lite LLM model with a temperature of 0.5. The default judging agent also uses the same model and temperature by default.

The evaluation configuration uses two evaluating agents with the same Gemini model. The temperature is not configurable. The evaluators are prompted to provide not only an overall judgement of the final result, but also an estimation of how well the debate results address other factors such as performance, maintainability etc. The scores are averaged across the different case studies, for each test.
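To make the setup concrete, a debate configuration of this shape could be expressed roughly as follows. The field names here are hypothetical and only illustrate the setup described above; Dialectic’s actual configuration format may differ.

```python
# Illustrative configuration mirroring the default test setup:
# four role agents and a judge, all on the same model and temperature.
config = {
    "rounds": 2,
    "agents": [
        {"role": "architect", "model": "gemini-2.5-flash-lite", "temperature": 0.5},
        {"role": "performance", "model": "gemini-2.5-flash-lite", "temperature": 0.5},
        {"role": "security", "model": "gemini-2.5-flash-lite", "temperature": 0.5},
        {"role": "kiss", "model": "gemini-2.5-flash-lite", "temperature": 0.5},
    ],
    "judge": {"model": "gemini-2.5-flash-lite", "temperature": 0.5},
}
```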

Results and Preliminary Analysis

My goal here is to try and gauge what affects the quality of returned results. What immediately stood out even in preliminary tests was that, unsurprisingly, prompts matter. The original prompts I used for defining the different agent roles were somewhat bland. The results I got were obviously too generic and lacked details. After some iterations and refinements (with AI) on the prompts, I reached what I think are reasonable default prompts that seem to focus and guide the LLMs to provide decent answers. These are customizable, even per agent, so users have the option to play around with more prompts and suggest better ones. I’ll be happy to hear about better options.

In addition, as described above, I ran several tests to understand the effect of different options and customizations.

Enough talking, let’s look at some numbers.

Does Clarification Matter?

With clarification turned on (“True” below, green color), the different agents get to ask the user 5 clarifying questions before the debate begins. The answers from all agents’ questions are available to all agents.

Looking at the scores, it seems that when clarification is available, the debate tends to produce better results, although by a close margin.

So yes, it helps. I’d venture that this hints at the importance of context in general.

Assuming this is consistent, potential follow up questions to this are whether more questions provide better results, and whether questions from specific roles affect the result differently.

Does Summarizing The Debate Make a Difference?

The summarization feature allows the agents to summarize the context in case it becomes too big. Summarization is done from the perspective of each agent on its own, and implemented using an LLM call. This way, different agents can decide on different summarization strategies and/or emphasize different aspects.
Currently the summarization is somewhat simplistic, setting a threshold (configurable) and summarizing to some maximum length of context once the context grows beyond the threshold.
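The threshold mechanism could be sketched roughly like this, assuming a simple character-count threshold; `summarize_with_llm` is a stand-in for the per-agent LLM summarization call, stubbed here with plain truncation.

```python
def summarize_with_llm(text: str, max_len: int) -> str:
    # Stub: a real implementation would prompt the agent's own model
    # to compress the context from that agent's perspective.
    return text[:max_len]

def maybe_summarize(context: str, threshold: int = 8000,
                    max_len: int = 2000) -> str:
    # Below the threshold, pass the context through untouched;
    # beyond it, compress down to max_len.
    if len(context) <= threshold:
        return context
    return summarize_with_llm(context, max_len)

short = maybe_summarize("a" * 100)
long = maybe_summarize("b" * 10_000)
```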

This is essentially a performance feature. But does it come at the cost of quality?

Results:

The results seem somewhat comparable, with a slight advantage to the non-summarizing debates. Unsurprisingly, when we provide more context, the results tend to be better.

There are of course follow up questions here as well: does the length of the summarized context matter? What if I don’t summarize at all, but instead allow the model to retrieve the context it wants more accurately? Can I gain performance without compromising the available information – and subsequently, the result quality?

Does The Model We Use Matter?

One of the obvious questions here is whether the actual LLM makes a difference.

To answer this question fully, we would need to test on a variety of models, and there are quite a few.

Since that could take eons (and burn a hole in my wallet), I chose to focus on 3 models. From preliminary, isolated tests, I noticed that larger models (e.g. Claude Sonnet 4) don’t necessarily perform much better than smaller ones. 

So for this test I went with 3 smaller models: Gemini 2.5 Flash Lite, GPT-5.1-Codex-Mini and Kimi-Dev-72b (from Moonshot). 

The results:

As you can see, there’s not a lot of difference between the models, with maybe a slight advantage to Kimi-Dev-72b, but not a consistent one across the different scores. We can of course continue with other models as well, and with different model settings, e.g. temperature.
It would also be interesting to see whether using a mix of models in the same debate leads to different results, or whether certain models are better at debating specific perspectives. For example, is Gemini a better model for assessing maintainability?

Do The Roles Matter?

One other possible customization is to decide which agent roles (essentially design perspectives) are participating in the debate. One can have any mix of the available roles and have them debate equally6.

For this test, I tested the following subsets of roles:

  • Architect, Performance, Security and KISS (simplicity)
  • Architect, Architect
  • Architect, Architect, KISS
  • Architect, Performance, KISS
  • Architect, KISS

Results:

Generally speaking, a combination of general “System Architect” role agents seems to provide an overall good result on most scores, compared to other combinations.

Unsurprisingly, combinations that did not have a security expert involved score lower on the “Security” and “Regulatory and Compliance” scores. When a security expert was involved in the debate, more emphasis7 was given to security issues.

The same seems to be true for the performance/scalability aspect when a performance expert is involved.

This strengthens the hypothesis that debating with different roles genuinely affects the result.

This also hints at a simple way to provide weight to different perspectives. For example, if you’d want to put more emphasis on simplicity of the solution, add another “KISS” agent to the debate, on top of the existing one. This would have 2 (or more) agents championing simplicity in their proposals and critiques. At the synthesis phase, when the judge synthesizes the final proposal, it would gather all the proposed solutions. Stacking the deck with agents who emphasize simplicity will likely push the synthesized solution in that direction.

It will of course be interesting to test with other combinations, and on different problems that raise more specific design questions.

Does The Number of Debate Rounds Matter?

Another obvious question is whether the number of debate rounds matters. Would longer debates produce better results?

Note that as currently implemented, the number of rounds is fixed (configurable per debate) and the same for all agents – they all run the same number of rounds.

Results:

Generally speaking, the numeric scores seem to behave more or less the same (the ‘kata3’ example was a bit of an outlier compared to the rest; it’s pulling the average down).

Across all evaluated scores, we see a similar pattern – a few rounds provide good results, then evaluations dip, only to rise again around the 5th or 6th round. Examining the specific results, I could identify a pattern where the suggested solutions of shorter debates provide decent results. Debates with 5 or 6 rounds also provided decent results, but different ones – emphasizing simplicity. It’s as if after a few debate rounds the “KISS” agents somehow tilted the result towards simpler solutions8. I cannot explain it definitively yet, but it could be ‘position bias’ at play: if the synthesizing judge saw the KISS result first, it may have anchored on it. This is of course just a hypothesis that needs to be examined further.

Still, it looks like more than 1 or 2 rounds doesn’t provide much value. I might just stick with 2 rounds for real-world usage.

Key Takeaways – What Can We Say So Far?

So while these tests are not exhaustive in any way, and they certainly raise a lot of follow up questions, I think some things are already clear and can be implemented or give direction to further implementation. While we’re still not at the point of truly autonomous software design, I think we’ve laid a few bricks on the path to get there. I think we can safely say that a structured debate does help, but the details of “how” are important to make it effective.

First, providing extra context to the problem yields better results. This was evident from the usage of clarification as well as from avoiding summaries. I expect this effect to be even more acute when dealing with “brownfield” projects where there’s a lot of legacy and implicit information that needs to be given. This has already motivated me9 to add the option for a context file. But we’ll need a better context mechanism to make it easier for context to be added and searched.

The clarification mechanism is already a step in that direction, as it allows the agents to deduce (“think about”) what information they’re missing, and to ask explicit and specific questions that might help them suggest a better solution and critique other agents’ solutions.
And this point about clarification vs. simply more context is important. It’s not just about stuffing more context into the prompt; it’s about refining that context. Even when models were given ample initial information, the act of having agents ask and answer targeted questions consistently led to more robust and accurate design proposals. This suggests that the iterative loop of questioning and informing helps the agents converge on a shared, deeper understanding of the problem, rather than just processing a larger, potentially ambiguous input blob.

In contrast to this, it seems the choice of model doesn’t affect the result that much. It might be better to try models optimized for coding (as kimi-dev-72b claims to be), but the difference isn’t clear or significant. It’s also encouraging that smaller models perform fairly well, as this directly impacts the cost. This isn’t just about saving pennies; it means that the architecture of the debate itself – the structured interaction and iterative refinement – can compensate for individual model scale. This has profound implications for scaling these systems in real-world, budget-constrained environments.

The short experiment of using different role assignments also confirmed that these perspectives are far from mere window dressing; they can fundamentally shape the outcome. Like a well-rounded human team, a diverse set of AI “personalities” can explore different facets of a problem and push solutions in varied directions. For example, a team weighted towards KISS (Keep It Simple, Stupid) principles produced notably simpler designs. This highlights the potential for engineers to ‘tune’ the debate by curating specific agent roles to achieve desired design characteristics, whether that’s maintainability, robustness, security or performance.

Longer debates also don’t seem to provide much better results. Looks like, similar to human conversations, AI conversations also tend to converge pretty quickly, and shorter “meetings” are better10. Some iterative discussion is ok, similar to the design discussion “grind” we all know. But at some point, pretty early, it seems to not add too much to the conversation – we’ve all been there.

I do suspect that the number of participating agents, and their roles, will have a more significant effect on the quality of results.

Where Does That Leave Us?

With the insights I went through above, I think there’s enough to start a more productive “real world” use of the tool. Some investment is still needed in the ergonomics of the tool, to make it convenient and practical for day-to-day use. But not a lot is missing, especially given that the target audience is supposed to be people who are not afraid to use a CLI tool, and don’t consider a “JSON configuration file” a magical incantation. It might also be useful to have a more convenient front-end (web based?) to allow for easier access, customization and review.

There are also clear11 additions and next steps: we’ll probably need to have a better way to provide context to the debate, driven by the asking agents. Some kind of tool (MCP-based or not) is probably a good answer to this, but there might be other ways.

This journey of applying AI to autonomous software design is still in its early phases. These initial measurements already demonstrate that structured multi-agent debates, guided by thoughtful methodology, are promising. We’ve seen how clarification, strategic model selection, role diversity, and iterative refinement can collectively elevate AI-generated solutions. As we continue to refine these AI applications and push the boundaries of what’s possible, the vision of a truly Autonomous Software Development Lifecycle comes into sharper focus. I believe there is potential here for a reality where AI doesn’t just assist but actually drives the software lifecycle. Looking at it from this perspective, the AI-driven design process is just another piece in a bigger puzzle – a way to create a much more mechanized software development cycle.

I hope to explore this more soon.


  1. It’s all tradeoffs – “it depends” ↩︎
  2. Maybe I’ll add some like/dislike buttons somewhere in the future. ↩︎
  3. By me. I’m the human here. But you can also be a human here. ↩︎
  4. Because… it depends ↩︎
  5. Or, for example, “what would be the best way to optimize this db structure” ↩︎
  6. Actually, a user could provide a completely new role if he provides a custom system prompt for an agent, as the system prompts are used to define the role. ↩︎
  7. Emphasis → more words → more tokens ↩︎
  8. Looks like the KISS agent was very persuasive given enough debate time. Maybe it has better stamina? ↩︎
  9. Should I say “prompted me”? (pun intended) ↩︎
  10. It’s the “it could have been an email” of the AI agents ↩︎
  11. At least in my opinion ↩︎

MAD About Software Design: When AI Debates

So at this point, I think we’ve established that LLMs can code (right?). They’re only getting better at it. I’ve also argued in the past that I believe LLMs can do more than just code to improve our software engineering lives. But this isn’t a simple task. There’s quite a bit of essential complexity in the process; it’s beyond simply automating day-to-day tasks1.

Imagine my interest, then, as I stumbled upon the idea of LLM-based AI agents debating each other. The concept isn’t unique to software engineering, but it still appealed to me as a way to simulate (or at least approximate) an actual software design process and, by extension, scale or improve it.

Before we dive into my implementation, let’s step back and understand the concept, and where it fits.

A Discourse of Agents

LLMs are powerful2, but they often come with a (potentially significant) catch. A single LLM, as capable as it is, can easily suffer from issues like hallucinations, inconsistent reasoning, and bias. And the more complex the task, the more likely it is to exhibit these issues. This is the “single-agent trap”: relying on one model’s perspective means you are exposed to its blind spots. This isn’t that different from how we solve complex tasks as humans3 – the more complicated the task, the more we benefit from collaborating with others.

We have ways to mitigate some of the problems to an extent – prompt and context engineering, RAG, access to tools.

So what if we didn’t have to rely on a single AI agent’s answer – just as we humans don’t rely on a single person when working on complex issues? 

This is where multi-agent debate (MAD) comes in. MAD is a complementary approach that uses iterative discourse to enhance reasoning and improve validity. See examples here.

You can think of it like a collaborative “society of minds”. Instead of one agent providing one answer, multiple agents propose and critique solutions to the problem. This goes on for several rounds of discussion, where agents challenge each other’s proposals, spot errors and refine their ideas. Eventually, the goal is for this process to converge on a superior final answer.

While I don’t intend to provide a full literature review here, or any kind of exhaustive description4, I think it’s worth understanding the main components, findings and challenges.
What follows below is a crash course on Multi-Agent Debate (MAD). But if you’re interested in more detailed evidence and nuance, I encourage you to follow the links and explore some more.

MAD – The Bird’s Eye View

So how do these debates actually work under the hood?
There are different implementations, and from what I’ve seen, they vary significantly for different reasons. But three fundamental components repeat in all cases.

First is the agent profile, which defines the roles or “personas” of the debating agents. A simple setup might define agents that are symmetrical peers. But more complicated setups assign specific roles to agents. For example, one agent may be a “critic”, another a “security expert”, etc. There are different ways to create this diversity. Everything from using different models, configuring them differently, and prompting the different agents to hold/emphasize divergent views.

Second is the communication structure – the topology, essentially the network map that dictates who talks to whom. A common choice is a fully connected topology where all agents see each other’s messages. Other approaches use sparser topologies (agents interacting only with specific neighbours) or even route everything through a single orchestrator/dispatcher. The choice of topology of course changes the debate dynamic.

Finally, there is the decision-making process: how the debate is concluded. After the agents have debated amongst themselves, how do you decide it’s time to conclude and compile a final answer?
The simplest method, which works well for certain types of problems, is simple majority voting. This works best in cases where the answer to the problem is a simple deterministic value, e.g. in math problems. Another, more structured approach is to use a “judge” (or “arbiter”) agent. This agent listens to arguments from all sides and selects or compiles a winning answer.
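As a sketch, the two decision mechanisms might look like this. Both functions are illustrative, not taken from any particular MAD implementation, and the judge is stubbed with a plain function rather than a real LLM call.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Best suited to problems with short deterministic answers,
    # e.g. the numeric result of a math problem.
    return Counter(answers).most_common(1)[0][0]

def judge_decision(answers: list[str], judge) -> str:
    # `judge` stands in for an arbiter LLM that reads all final
    # arguments and selects or compiles a winning answer.
    return judge("\n---\n".join(answers))

winner = majority_vote(["42", "41", "42"])
synthesis = judge_decision(["use a queue", "use a log"],
                           judge=lambda args: f"combined: {args}")
```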

Does It Work?

Yes, to a degree.
Current research suggests that multiple agents working together achieve better results, especially when the complexity of the tasks increases. This example shows significant improvements on math problems.

Multi-Agent Debate (MAD) systems seem to improve factuality and the accuracy of results. Agents seem to be able to spot errors in each other’s reasoning, improving consistency. Some evidence can be seen here and here among others.

Tasks that are more complicated, and/or require more diversity of thought, seem to benefit from this pattern more. Specifically, it seems that iterative refinement and using different models to propose and debate each other yields better results – more consistent answers that align better with human judgement.

Does It Always Work?

Of course not. It wouldn’t be fun otherwise.

This study, for example, suggests that it’s not so much the debate that’s improving performance, but rather the “multi-agent” aspect of it. Another study suggests they are difficult to optimize (though it does conclude they have potential for out-performing other methods).

There are also distinct failure modes. This study suggests that models may flip to incorrect answers under some conditions. And they require more careful setup, specifically guidance on how to criticize answers from other agents – a structured critique guidance.

There are of course cost considerations to be had, as with any engineering problem. Multiple agents making repeated calls to LLMs with a potentially growing (exploding?) context mean costs can easily get out of hand.

This is an active research area, with probably more results and implementations to be shared in the near future.

So while we’re here, why not join the fun, and try to apply it?

MAD About Software Design

This pattern of debating agents can be applied to all sorts of problems, as the studies linked above show. Software system architecture should not be an exception. I could not find another implementation of this pattern related to software engineering. The closest is MAAD, which seems nice, but as far as I could see it does not exactly implement a debate pattern, but rather a set of cooperating agents working towards the goal of producing a design specification.

Part of the reason this piqued my interest is that in my line of work, when considering feature and system designs, a debate5 is a natural dynamic. This is simply what we do – we discuss, brainstorm and often argue over different alternative solutions. AI agents debating over a design problem seems like a natural fit.

This is where Dialectic comes into play. 

This is a small, simple implementation6 of the multi-agent debate pattern, with a focus on software engineering debate. It is a command-line tool that receives a problem description and a debate configuration, and carries out a debate between different agents. The tool facilitates the debate between the agents with the goal of eventually arriving at a reasonable, hopefully the best, solution to the presented problem, with concrete implementation notes and decisions.

When it comes to the debate setup, Dialectic allows the user to specify the number and roles of participating agents. A user can choose from the available roles – “Architect”, “Performance Engineer”, “Security Expert”, “Testing Expert” and “Generalist”7.

The current implementation has a rather rigid debate structure: for a fixed number of rounds (configurable), each agent is asked to propose a solution, then critique all of the other agents’ solutions, and refine its proposal based on the feedback from other agents. The refined proposals are fed into the next round. At the end of the last round, a Judging agent receives the final proposals and compiles a synthesized solution from all participating agents.
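A rough sketch of this loop might look as follows. To be clear, this is my illustration of the round structure described above, not Dialectic's actual code; the `Agent` interface and all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of the fixed debate loop described above.
# The Agent interface and method names are illustrative, not Dialectic's API.

@dataclass
class Agent:
    name: str
    propose: Callable[[str], str]                        # problem -> proposal
    critique: Callable[[str, Dict[str, str]], Dict[str, str]]  # -> {target: critique}
    refine: Callable[[str, str, List[str]], str]         # (problem, mine, feedback) -> proposal

def run_debate(agents, judge, problem, rounds=2):
    proposals = {a.name: a.propose(problem) for a in agents}
    for _ in range(rounds):
        # each agent critiques every other agent's proposal
        critiques = {
            a.name: a.critique(problem, {n: p for n, p in proposals.items() if n != a.name})
            for a in agents
        }
        # each agent refines its proposal using the critiques aimed at it
        proposals = {
            a.name: a.refine(problem, proposals[a.name],
                             [crits[a.name] for n, crits in critiques.items() if n != a.name])
            for a in agents
        }
    # a judging agent synthesizes the final proposals into one answer
    return judge(problem, proposals)
```

The refined proposals flow into the next round unchanged, matching the fixed structure described above.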

As a user, you can control the number of rounds, the prompts used, temperature and model per agent. See here for a more complete description of configuration options.
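To make this concrete, a debate setup might look something like the following. This is a hypothetical configuration sketch – the field names and model identifiers are illustrative, not Dialectic's actual schema; the linked documentation describes the real options:

```json
{
  "rounds": 2,
  "agents": [
    { "role": "architect", "model": "gpt-4o", "temperature": 0.7 },
    { "role": "security-expert", "model": "claude-sonnet", "temperature": 0.2 }
  ],
  "judge": { "model": "gpt-4o", "temperature": 0.0 }
}
```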

Why This Debate Pattern?

The chosen debate pattern and configuration options are intentional8, in an attempt to mitigate some of the problems mentioned above.

First, different “roles” (essentially different sets of agent system prompts) offer different perspectives. When debating, and specifically when criticizing each other’s work, these differing perspectives should surface different arguments for and against choices. This hopefully avoids at least some of the potential groupthink.

Additionally, each agent can be configured with a different LLM model and a different temperature. This offers a chance to combine models with different strengths (and costs), potentially trained and tuned on different data sets. This heterogeneous debate setup, combining different agent profiles, allows for a rich interaction of viewpoints. This is especially true given the current fixed topology, where every agent critiques all other agents’ proposals.

The possibility of clarifications from the user also allows additional context based on specific agents’ input (the agents ask the user questions). This not only brings more focused context into the debate, but also mimics a real-world dynamic where the development team interacts with the product owner/manager on clarifications that come up during a discussion (“what should we do in this case? – this is a product decision” is a common phrase heard around the office).

Dialectic also supports context summarization to try to avoid context explosion. There’s of course a trade-off here, but for practical cost reasons9, the tool needs a way to manage context size. Some models can be quite “chatty” and end up with big responses.
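As an illustration of the idea (not Dialectic's actual implementation), a summarization trigger can be as simple as compacting older history once a rough token estimate crosses a threshold:

```python
# Illustrative sketch of a context-summarization trigger. The heuristic,
# threshold and summarize() hook are hypothetical, not Dialectic's behavior.

def estimate_tokens(text: str) -> int:
    # crude heuristic: roughly 4 characters per token for English text
    return len(text) // 4

def maybe_compact(history: list, summarize, max_tokens: int = 8000) -> list:
    if sum(estimate_tokens(h) for h in history) <= max_tokens:
        return history  # still within budget, keep everything verbatim
    # replace older entries with a single summary, keep the latest entry verbatim
    older, recent = history[:-1], history[-1:]
    return [summarize("\n".join(older))] + recent
```

The trade-off mentioned above lives in `summarize`: the more aggressively it compresses, the more detail later rounds lose.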

Apart from being a tool to be used in practice, I realize the different possible options and combinations can lead to very different results, and quality may vary for any number of reasons. This is why output options also vary: you can simply output the final synthesized solution, or get a structured file containing a more detailed step-by-step description of the entire debate, with all configurations plus latency and token-usage figures. There’s also an option to produce a complete debate report in markdown format. This should allow users to experiment with different debate and agent configurations, and hopefully settle on a setup (maybe several) that fits their purposes best.

What’s Next?

At this point, you can start using Dialectic and experiment with it on different problems and debate setups. I plan to do so as well.

Initial experiments seem anecdotally promising. When used with advanced models, it produces reasonable results. It’s a practical tool, still evolving, that shows promise in helping analyze and reach solutions in complex domains faster and more comprehensively. But we’ll need to evaluate results more systematically, so this is the obvious next stage.

At the same time, I believe it can already help as a brainstorming partner. Having a tool that automatically analyzes a problem from several angles and refines it is at the very least helpful in covering options and exploring ideas.

But it’s clear that some things can and should be improved/added.

To start, a lot of real-world (human) discussions implicitly involve pre-existing knowledge. This is part of the experience we have as professionals. Specifically, knowledge and context of our specific systems (the “legacy code”), patterns and domains. While it’s possible to include a lot in the given problem description and clarifying questions, I believe it should be possible for debating agents to query further information and knowledge. We will probably need to support plugging in extra knowledge retrieval, driven by the agents, to allow more focused and refined answers.

Another thing to look into is the way the debate terminates. Currently it’s a fixed, configured number of rounds. All rounds run, and the judge has to synthesize an answer at the end. But this is not the only way. We can terminate the debate when it seems that there are no new ideas or issues coming up. We can have the agents attach a confidence vote to their proposals, and terminate the debate when all (most?) agents are confident beyond some set threshold.
We can also instruct the agents and judge to propose follow-ups, and use the result of a given debate as the input to another, with extra information.
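A confidence-vote termination check, for example, could be sketched like this (the names and thresholds are illustrative, not an implemented feature):

```python
# Sketch of early termination via confidence voting, as an alternative
# to a fixed round count. Thresholds and semantics are hypothetical.

def should_terminate(confidences: dict, threshold: float = 0.8,
                     quorum: float = 1.0) -> bool:
    """Stop the debate once enough agents are confident in their proposals.

    confidences: agent name -> self-reported confidence in [0, 1]
    quorum: fraction of agents that must clear the threshold (1.0 = all)
    """
    confident = sum(1 for c in confidences.values() if c >= threshold)
    return confident >= quorum * len(confidences)
```

The debate loop would call this after each round and skip remaining rounds once it returns true, handing the current proposals straight to the judge.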

The current topology is also fixed. It would be interesting to experiment with different topologies. For example, have the specialist (security, performance, testing) agents critique only the architect’s proposals. A step further would be an orchestrator agent dynamically setting up the topology, based on some problem parameters.
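One way to make the topology explicit is a mapping from each critic to the proposals it reviews. The following sketch (role names are illustrative) shows the current all-to-all case next to the "specialists critique only the architect" variant:

```python
# Critique topologies as an explicit critic -> targets mapping.
# Illustrative only; Dialectic currently hard-codes the all-to-all case.

AGENTS = ["architect", "security", "performance", "testing"]

# current fixed topology: every agent critiques every other agent
all_to_all = {a: [b for b in AGENTS if b != a] for a in AGENTS}

# star topology: specialist agents critique only the architect's proposal
star = {a: ["architect"] for a in AGENTS if a != "architect"}
```

An orchestrator agent could then pick or generate such a mapping per problem, and the debate loop would consult it instead of assuming all-to-all.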

Agent diversity is also interesting. There is evidence that diversity of agents improves results in some cases. Varying the LLM models used, their temperatures and their specific prompts can make agents complement each other in better ways. We could, for example, create an agent that is intentionally adversarial and pushes for alternative solutions.

The tool itself can of course be augmented with interesting features:

  • Automatically deriving and outputting ADRs
  • Adding image(s) as initial context
  • Connecting with further context available from other systems as input10, so the agents’ analysis is more evidence-based

These should be helpful in making it more useful for day-to-day work.

Of course, costs are also important. The current implementation tries to summarize so we don’t hit token limits too early. But it’s possible we can find more ways to optimize costs: skipping calls when not necessary, summarizing to a smaller size every round, etc.

So Software Designers are Obsolete?

No.
I do believe there’s still a way to go before this replaces the human dynamics of discussion. One thing I still don’t see LLMs doing well is weighing trade-offs, especially when human factors11 are in play. This is more than a simple information gap that can be solved by tooling. I don’t see how agents implicitly “read the room”, and I don’t see how agents mimic human intuition.

I do see this as a step forward, not only because we can automate a lot of the research and debate. But also because the analysis given by such agents is almost guaranteed to be more driven by information, cold analysis, and the vast knowledge embedded within them. Agents don’t get offended (I think) when their proposal is not accepted, or when they don’t get to play with the cool new technology.

Summary

Dialectic is a simple tool that tries to implement a potentially powerful pattern of agentic systems in the realm of software engineering. If done properly, I believe it can help in reaching decisions faster and with higher quality, especially when scaling design work with a larger organization. And this is what mechanization is all about.

The combination of LLM-based agents into a debate and feedback loops should enable more complete solutions, likely with higher quality.

Off to design!


  1. Which is of course still a welcome improvement ↩︎
  2. And continue to improve ↩︎
  3. Sadly, even the hallucination part is true for humans sometimes. ↩︎
  4. A decent review can be found here. But any “Deep Research” AI will help you here. ↩︎
  5. Between people – humans in the loop! ↩︎
  6. Yes, it could have been implemented with something like LangChain/Graph or probably even some kind of low-code tooling. But I also like to learn by doing, so I opted for a more bare-bones approach of coding from scratch. We might port it to use some other framework in the future. ↩︎
  7. Note there’s nothing fundamentally software-specific in this pattern, except these roles. It’s straightforward to apply the debate pattern to other roles. ↩︎
  8. And still evolving ↩︎
  9. I got too many 429 errors complaining about token limits when testing ↩︎
  10. MCP server support? ↩︎
  11. Business pressures, office politics ↩︎

AI Adoption Roadmap for Software Development

I’ve argued before that LLMs’ greatest promise in software engineering lies beyond raw code generation. While producing code remains essential, building scalable, cost-effective software involves far more: requirements, architecture, teamwork and feedback loops. The end goal is of course producing useful and correct software, economically. But the process of producing software, especially as the organization scales, is much more than that.

So how do we adopt AI across a growing software organization—efficiently and at scale?

We’ve gone through1 paradigm shifts before – agile, microservices and DevOps are some examples. Is AI different in some more profound way, or just another evolutionary step?

I believe this is a slightly different story compared to other technologies, at least when it comes to the practice of software development.

First, this is an area that’s still being actively researched, with advancements in research and technology being announced all the time. New models and papers drop constantly, fueling FOMO and risk of distraction. Teams can quickly feel overwhelmed without a clear adoption path.

Second, a technology that sits at the intersection of machines and human communication (because of natural language understanding) has the potential to disrupt not only the technical tools we use, but our workflows and working patterns at the same time. AI feels less like another toolchain and more like a collision of Agile and microservices – reshaping not just code, but communication flows themselves. This may be going too far, but I sometimes imagine this is the first time Conway’s law might be challenged.

The AI ecosystem, especially in the software engineering space2, is abundant with tools and technologies. The current rate of development is staggering, and it’s getting hard to keep up with the tools, patterns and techniques being developed and shared.

Randomly handing teams new AI toys can spark short-term wins. But to unlock AI’s transformative power, we need to be more intentional about it. We need a deliberate adoption roadmap.

Our aim: weave LLMs into daily software engineering to maximize impact. But with tools and standards still maturing, a rigid, long-range plan is unrealistic. There are few substantial case studies showing adoption at scale at this point. Similar to the early days of the world wide web, some imagination and extrapolation is required3, and naturally some of it will be wrong or will need updating in the future.

It’s natural to chase faster coding as the low-hanging fruit. Yet AI’s true potential lies in higher-level workflows. Since I believe the potential is much greater, I try to follow a slightly more structured approach to navigating this challenge.

This is my attempt to think through and articulate an approach to AI adoption for a software development organization. It’s positioned as a (very) high-level roadmap for adopting AI in a way that benefits the organization while hopefully remaining viable and efficient.

This will probably not fit every organization as-is. Specifics of business, architecture, organizational structure and culture will probably require adapting it, even significantly. Still, I believe it can be used as a framework for thinking about this topic, and can serve at the very least as a rough draft for such a roadmap.

I will of course be happy to hear feedback or how others approach this challenge, if at all.


Before diving into details of such a suggested roadmap, I will need to introduce a preliminary concept which I believe to be central to the topic of AI adoption – AI Workspaces.

AI Workspaces

Most AI technology today focuses on transactional tool usage – a user asks something (prompts), and the AI model responds, potentially with some tool invocations. The utility of this flow is limited, mainly because crafting the prompt and providing the context is hard. Some AI tools provide facilities and behind-the-scenes code that injects further context, but this is still localized, and not always consistent. From the user’s point of view it’s still very transactional. 

In order to realize more of AI’s potential for simplification and automation, we need to consistently provide context that is updated and applied whenever needed. We need to allow a combination of AI tools with the relevant, up-to-date context so that more complicated tasks can be achieved. And the more autonomy the AI has, the easier it will be for users to apply and use it successfully.

I’m proposing that we need to start thinking about an “AI workspace”.

An AI workspace is a combination of:

  1. Basic AI tools, e.g. models used, MCP servers, with their configuration.
  2. Custom prompts, usually focused on a task or set of tasks in some domain.
  3. Persistent memory – a contextual knowledge source, potentially growing with every interaction, that is relevant to tasks the AI is meant to address.

The combination of these, using different tools and techniques, should provide a complete framework for AI agents to accomplish ever more complex tasks. The exact capabilities depend of course on the setup, but the main point is that all of these elements, in tandem, are necessary to create more elaborate automation. 
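As a thought experiment, the three ingredients above could be modeled minimally like this. This is my own sketch, not an existing product’s data model:

```python
from dataclasses import dataclass, field

# Minimal sketch of the three AI-workspace ingredients listed above:
# basic tools (models), custom prompts, and persistent memory.
# All names are hypothetical illustrations.

@dataclass
class AIWorkspace:
    models: list                          # basic AI tools: models, MCP servers, etc.
    prompts: dict                         # task-focused custom prompts
    memory: list = field(default_factory=list)  # persistent, growing context

    def remember(self, fact: str) -> None:
        # every interaction (human or agent) can add to the shared context
        self.memory.append(fact)

    def context_for(self, task: str) -> str:
        # combine the task prompt with the accumulated knowledge
        return self.prompts.get(task, "") + "\n" + "\n".join(self.memory)
```

The compounding effect discussed below comes from `remember` being called continuously, so every later `context_for` call benefits from earlier interactions.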

A key point here is the knowledge building – the persistent memory. I expect that an AI workspace is something that’s constantly updated (automatically or by the user) so the AI can automatically adapt to changing circumstances, including other AI-based tasks. There should be a compounding effect: knowledge builds over time and is used by the AI to perform better and more accurately.

An AI workspace should be customized for a specific task or set of tasks. But it can be more useful if it is customized for a complete business flow that brings together disparate systems and roles in the organization. This will arguably make the workspace more complex and harder to set up, but if used consistently over time, the overhead might be worth it.

We’re already seeing first signs of this (e.g. Claude Projects), but I expect this to go beyond the confines of a single vendor platform, potentially involving several different models, and be open to updates/reading from agents4.

A Roadmap – General Framing

As I’ve already noted, using AI, in my opinion, is more than simply automating some tasks. Automating is great, and provides value, but the potential here is much greater. In order to realize the greater potential we need to leverage the strengths of LLMs, and point them at the right challenges we face in our day to day work in software development.

And these strengths generally boil down to:

  1. Understanding natural language (and other, more formal, languages)
  2. Being able to respond and produce content in natural language (and other, more formal, languages)
  3. Understanding patterns in their input, reasoning over them, and applying patterns to their output.

And do all of this at scale.

Looking at the challenges of software development, our general bottlenecks are less in code production, and more in understanding, communicating and applying our understanding effectively. This includes understanding existing code, troubleshooting bug reports, understanding requirements, understanding system architecture, anticipating impact, translating requirements to plans etc.

Apart from actual problems we might face in all of these, there’s also a challenge of scale here. The more people are involved in the software production (larger organization), the larger the codebase and the more clients we have – the greater the challenge.

An immediate corollary of the way (non-trivial) software is built is that this is not just a problem for software developers. There are more people involved in building, evolving and maintaining software – devops engineers, product managers, designers, customer support, etc. A lot of the challenges are shaped by the communication patterns and motivations of these different roles.

So when it comes to adopting a technology that has the potential to encompass different workflows and roles, I’m looking at adoption from different angles.

Since this is a roadmap, there’s naturally a general time component to it. But I’m also looking at it using a different axis – the way different roles or workflows (tasks?) adopt AI, and at what point these workflows converge, and how exactly.

The general framing of the roadmap is therefore a progression across phases of different verticals of “types of work” or roles if you will.

Workflow Verticals

When building software5 we have different tasks, performed by separate cooperating professionals. I’d like to avoid the discussion on software project management methodologies, so suffice to say that different people cooperate to produce, evolve and maintain the software system, each with more or less well-defined tasks6.

Roughly speaking these workflows are:

  1. Design and coding of the software: anything from infrastructure to application design, prototyping, implementation and debugging.
  2. Testing and quality: measuring and improving quality processes – generating tests, measuring coverage, simulating product flows, assessing usage.
  3. Incident management: identifying and troubleshooting issues (bugs or otherwise), at scale. This includes also customer facing support.
  4. Product and Project management: analyzing market trends and requirements, guiding the product roadmap, rolling out changes, synchronizing implementations across teams
  5. Operations and monitoring: monitoring the system behavior, applying updates, identifying issues proactively, etc.

All of these tasks are part of making the software and operating it on a daily basis. There’s obviously some overlap, but more importantly there are synergies between these roles. People fulfilling these roles constantly cooperate to do their jobs.

People in these roles also have their own tools and processes, each in its own domain, with the potential to be greatly enhanced by AI. We’re already seeing a plethora of tools promising, with varying7 degrees of success, to optimize and improve productivity in all of these areas.

Just to name a few examples to this:

  1. Software coding is obviously being disrupted by AI-driven IDEs and agents.
  2. Product management can leverage AI for analyzing market feedback, producing and checking requirements, simulating “what-if” scenarios, researching, etc.
  3. Incident management can easily benefit from AI analyzing logs, traces and reports, helping to provide troubleshooting teams with relevant context and analysis of issues.
  4. Tests can be generated and maintained automatically alongside changing code.
  5. UX design can go from drawing to prototype in no time.

And I’m sure there are more examples I’m not even aware of. The list goes on.

The point here is not to exhaustively list all the potential benefits of AI. Rather, I argue that for the software organization to effectively leverage AI, it needs to do so across these “verticals”.

And as the organization and the technologies mature, we have better potential to leverage cooperation and synergies between these verticals. 

This won’t happen immediately. It probably won’t happen for a while, if at all. But for that, we need to talk about phases of adoption.

Phases of Adoption

I try to outline here several phases for the adoption of AI. These phases are not necessarily clearly distinct, and progress across them is probably neither linear nor constant. The point of this description is not so much to provide a concrete timeline, but to describe the main driving forces and potential value we can gain at each phase. Understanding this should help us plan better and articulate more concrete steps for realizing the vision.

You can look at these phases as a sort of “AI Maturity Level”, although I’m not trying to provide any kind of formal or rigorous definition to this. It’s more of a mindset.

Phase 1: Exploration and Basic Usage

At this phase, different teams explore the possibilities and tools available for AI usage. The current rate of innovation in this field, especially around software development is extremely high. Given this, I expect employees in different roles will experiment and try various tools and techniques, trying to optimize their existing workflows in one way or another.

At this point, the organization drives for quick wins, where people in different roles leverage AI tools for common tasks, share knowledge internally and learn from the community. 

Covered scenarios at this point are confined to specific workflows, focusing mainly on providing context to small (1-2 people) tasks, and on automating or speeding up such localized tasks.

LLM and AI usage at this point is very much triggered and controlled by humans requesting and reviewing results. The work is task/workflow oriented, with AI tools serving specific focused tasks. The human-AI interaction is very transactional and limited in scope.

The organization should expect to gain the fundamental knowledge required to deploy and use the different tools securely and in a scalable manner, including performance, cost, operations, etc. At this phase, a lot of experimentation and evaluation happens. It will be good to establish an internal community driving the tooling and adoption of AI. The organization should expect several quick wins and localized productivity gains.

I expect the learning curve to be steep in this phase, so a lot of what happens here is trial and error and comparison of different tools, techniques and models.
AI workspaces at this point, if they exist, are very much focused on the localized context of individual well-defined tasks. They are also probably harder to establish and operate (integrating tools, adding information).

What would be the expected value?
Phase 1 focuses on achieving quick wins and localized productivity gains. By implementing AI code assistants, automated code reviews, AI-generated tests, and anomaly detection tools, the organization can quickly demonstrate immediate developer speedups, improved code quality, faster test coverage, and early incident learning. 

This goes beyond a business benefit. It’s also a psychological hurdle to overcome. Concrete wins, such as fewer bugs and faster releases, build momentum and justify further investment in AI adoption while increasing developer satisfaction. 

In addition, there’s going to be considerable technical infrastructure investment done at this point, e.g. model governance, cost management, etc. This infrastructure should be leveraged in the following phases as well, and is therefore critical. This phase provides a strong foundation for leveraging AI in future stages.

Phase 2: Grounding in Domain-Specific Knowledge

At this phase, having gained basic proficiency, the organization should expect to improve the performance and scope of AI-enabled tasks by starting to build and expose organization-specific knowledge and processes to LLMs.

I expect that business-specific information (internal or external) can increase performance and open up more tasks that can be improved using AI. Examples of knowledge building include better code and design understanding, understanding of relationships between different deployed components, connecting product requirements to code and technical artifacts, etc.

This can open the road to higher level AI-driven tasks, like analyzing and understanding the impact of different features, simulating choices, detecting inconsistencies in product and technical architecture and more.

A key aspect of this phase is to facilitate a consistent evolution of the knowledge so it can be scaled and maintain its efficacy. At this point, the organization needs to have the infrastructure and efficient standards in place so information can be shared between roles, and between different AI-driven tools and processes. 

In this phase AI workspaces become more robust and prevalent, encompassing a larger context, and even crossing workflow verticals in some cases. Contrast this with the workspaces of the first phase, which are focused on localized contexts.

This phase is also when we start thinking in terms of “AI systems” instead of simply using AI tools. This is where we consistently apply and use AI workspaces, with several tools (AI or non-AI) combined with the same knowledge base, evolving it together.

An example of this would be AI coding agents that automatically connect their implementation to JIRA tickets and product requirements, and record this knowledge – with other AI agents leveraging it to map design decisions to testing coverage reports (how much of the product requirements are tested) and to plan rollouts.

What value can we expect to have at this point?
Phase 2 is mainly about integrating company-specific (and company-wide) knowledge with AI workspaces. At this point I expect existing workloads to be more accurate, more precise and faster in doing their work, even if the task is limited in scope. The grounding provided by the specific knowledge graph should improve the accuracy of the AI models.

Different workflow verticals will start to cooperate more closely at this point. First, by building a knowledge graph/base together; but also by leveraging this combined knowledge to implement simple agentic workflows, where AI-based agents start to reason on the data and make simple decisions.

Phase 3: Autonomous Cross Team Workflows

This is the point where previous infrastructure starts to really pay off in terms of increased productivity and quality.


At this phase of adoption, I expect we’ll see more autonomous AI-driven processes coming to fruition. And when I say “AI-driven” I’m not referring to simply automating a well-known process. I’m referring to AI agents reasoning and dynamically using tools and other agents to adapt and produce results/do tasks8. I expect at this point AI agents can also build their own knowledge, and adapt their work to accommodate changes in the environment.

Humans are still in the loop for critical decision making, but the friction between humans and tools, and between the humans themselves, is significantly reduced9. The focus at this point should be on eliminating bureaucracy and increasing the adoption of consistent and increasingly robust workflows. This generalization also means that agentic AI systems now work across roles and departments; this is where the workflow verticals start to converge.

Examples of this would be:

  • Managing changes across roles and workflows. For example, a change in UX/product feature definition that is automatically reflected in plans, and rolled out to clients.
  • Technical design that is validated against technical dependencies (from other teams), past decisions and project plans. Potentially updating the dependencies and informing other agents, potentially changing their decisions as a result.
  • Identifying cross-cutting issues from internal conversations, correlated with support tickets and other metrics, and proactively planning and suggesting resolutions.

At this phase, I expect AI workspaces to become really cross-departmental and leverage knowledge being built and added in different verticals.

Ad-hoc exploration and automation of tools should also be possible. At this point, the organization should have a strong foundation of tooling and experience with applying AI. It should be possible to allow ad-hoc building of new flows on top of the existing LLM infrastructure and the ever-evolving organizational knowledge base.

Note that this also poses a challenge: there is a fine line between standardization of tools, which drives efficiencies at scale, and democratization of capabilities. You want people to experiment and find new ways to optimize their work, but in order to efficiently grow you’ll need to apply some boundaries to what is used and how it’s used. This tradeoff isn’t unique to AI systems, but I believe it will become more emphasized when we consider new directions and applications of LLMs as the technology improves.
In terms of expected value, we should expect significant productivity gains. While humans are still in the loop, AI will further automate processes, reducing bureaucracy. The focus will be on adoption of consistent productive workflows across roles and departments. Human focus should be on innovation and decision making at this point, with accurate and reliable information being made available to humans, by the machines10.

Technical Infrastructure

In order to support this process, looking at the expected phases of adoption, we should pay attention to, and plan, the necessary technical infrastructure investments. This is true with the adoption of any new technology, but with the current explosion of tools and techniques, it’s very easy to lose focus.

I won’t pretend to know exactly which tools should be available at what point. Nor do I pretend to have a definitive list of tools, or a comparison of them, at this point11. But in order to plan investments ahead, and make a concerted effort to learn what will help us, I believe we can give some idea of what will be needed at each phase of adoption.

In phase 1, we naturally explore a plethora of tools. We should be able to facilitate new models for different use cases. Enabling access to different models using tools that provide a (more or less) uniform facade is useful. Examples of this are OpenWebUI and LiteLLM. We should provide access to AI-driven IDEs, like Cursor, Windsurf and similar ones.

For non-development workflows, AI-based prototyping tools and vendor-specific AI extensions should be helpful. The same goes for monitoring tools.

Connecting these tools, via MCP servers, to existing MCP client hosts (IDEs, chat applications, etc.) would probably be useful as well, so support for installing and monitoring MCP servers might be useful. At this point it would also be useful to establish some way to measure the effectiveness of prompts or model tuning, and to track usage of the various tools.

In phase 2, building on and potentially maturing the infrastructure from phase 1, we should start focusing on more robust workflows and knowledge building. Depending on use cases, it could be useful to look at agent workflow frameworks (LangChain et al.) and agent authoring tools (e.g. n8n).
Additionally, knowledge management tools and processes will probably be useful to introduce – easily configured RAG processes (and therefore vector DBs), memory management techniques, maybe graph databases. This all depends on the techniques used for memory building and maintenance.
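To illustrate the retrieval step such RAG processes are built on, here is a deliberately toy version that scores stored knowledge snippets against a query. Real deployments would use an embedding model and a vector DB rather than this bag-of-words cosine similarity:

```python
import math
from collections import Counter

# Toy illustration of RAG retrieval: rank stored knowledge snippets by
# similarity to a query and keep the best matches as extra LLM context.
# Production systems replace the word counts below with learned embeddings.

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, snippets: list, k: int = 2) -> list:
    q = Counter(query.lower().split())
    return sorted(snippets,
                  key=lambda s: cosine(q, Counter(s.lower().split())),
                  reverse=True)[:k]
```

The retrieved snippets would then be prepended to the prompt, grounding the model’s answer in organization-specific knowledge.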

I expect MCP servers, especially ones specialized for the organization’s code and other knowledge systems, will become more central. It should be possible to also create necessary MCP servers that will allow LLMs to access and use internal tools.

In phase 3, I expect most of the technical features to be in place. This will be a phase where the focus shifts to optimizing costs and improving performance. We should probably look at ways to use more efficient models, match models to tasks, and potentially fine-tune models, by whatever method.

Monitoring the operation and costs of agents, and understanding what happens in different flows, will become more critical at this point, especially as usage scales up and AI adoption increases across departments.

Summary

AI stands to transform software engineering far beyond code generation. Realizing that promise demands coordinated learning, infrastructure and a phased roadmap. This framework offers a starting point.

I believe that due to the nature of the technology, this goes beyond simply adopting a new tool, or alternatively adopting a new project management practice. It has the potential to change both aspects of the work.

The structure I’m proposing highlights the potential in each workflow “stream”, adopting the tools in phases of maturity as the ecosystem evolves (click to view full size):

AI Roadmap – “Layer Cake”

This visualization is only an illustration, of course. You’ll note it’s arranged as a “layer cake”, where scenarios for using AI sit roughly on top of the use cases/scenarios that should probably precede them.

This is of course not an exhaustive list.

The attempt here is of course to structure the process into something that can be further refined and hopefully result in an actionable plan. At the very least, it should serve as a guideline on where to focus research, learning and implementation efforts, to bring value.

It would be nice to know what other people are thinking when trying to structure such a process; or what the AI thinks about this.

On to explore more.


  1. Dare I say “weathered”? ↩︎
  2. SW engineers being natural early adopters for this technology ↩︎
  3. And we know how some attempts didn’t end well. ↩︎
  4. To be honest, I did not yet dive into the Claude projects, so it’s possible they support this. But I can imagine something similar done with other tools as well. ↩︎
  5. And probably in other industries as well, but I know software best. ↩︎
  6. I realize this is kind of hand-wavy, but bear with me. Also, you probably know what I’m talking about ↩︎
  7. Ever increasing? ↩︎
  8. In a sense, leveraging test time compute at the agentic system level ↩︎
  9. Although in some cases, friction is desirable – think of compliance, cost management, etc. ↩︎
  10. I guess accurate context is also important for humans, who would’ve guessed. ↩︎
  11. And let’s face it, at the rate things are going right now, by the time I finish writing this, there will be new tools ↩︎

From Code Monkeys to Thought Partners: LLMs and the End of Software Engineering Busywork

When it comes to AI and programming, vibe coding is all the rage these days. I’ve tried it, to an extent, and commented about it at length. While a lot of people seem to believe it’s a game changer for SW development, among experienced SW engineers there’s a growing realization that it is not a panacea. In some cases I’ve even seen resentment, or scorn at the idea that vibe coding is anything more than passing hype.

I personally don’t think it’s just hype. It might be more in the zeitgeist at the moment, but it won’t go away. I believe that simply because it’s not a new trend. Vibe coding, in my opinion, is nothing more than an evolution of low/no-code platforms. We’ve seen this type of tool since MS-Access and Visual Basic back in the 90s. It definitely has its niche, a viable one, but it’s not something that will eradicate the SW development profession.

I do think that AI will most definitely change how developers work and what programming looks like. But this still will not make programmers obsolete.

This is because the actual challenges are elsewhere.

The Real Bottlenecks in Software Engineering

In fact, I think we’re scratching the surface here. Partially because the technology and tooling are still evolving. But also, it’s because it seems most people1 looking at improving software engineering are looking at the wrong problem.

Anyone who’s been at this business professionally has realized at some point that code production is not the real bottleneck when it comes to being a productive software engineer. It never was the productivity bottleneck.

The real challenges, in real-world software development, especially at scale, are different. They revolve mainly around producing coherent software with many people who need to interact with one another:

  • Conquering complexity: understanding the business and translating it into working code. Understanding large code bases.
  • Communication overhead: the amount of coordination that needs to happen between different teams when trying to coordinate design choices2. We often end up with knowledge silos.
  • Maintaining consistency: using the same tools, practices and patterns so operation and evolution will be easier. This is especially true at a large scale of organization, and over time.
  • Analyzing change impact: it’s hard to analyze the impact of changes, and tracing back decisions isn’t easy.

A lot of the energy and money invested in doing day-to-day professional software development is about managing this complexity and delivering software at a consistent (increasing?) pace, with acceptable quality. It’s no surprise there’s a whole ecosystem of methodologies, techniques and tools dedicated to alleviating some of these issues. Some are successful, some not so much.

Code generation isn’t really the hard part. That’s probably the easiest part of the story. Having a tool that does it slightly faster3 is great, and it’s helpful, but this doesn’t solve the hard challenges.
We should realize that code generation, however elaborate, is not the entire story. It’s also about understanding the user’s request, constraints and existing code.

The point here isn’t about the fantastic innovations made in the technology. My point is rather that it’s applied to the least interesting problem. As great as the technology and tooling are, and they are great, simply generating code doesn’t solve a big challenge.

This leads me to thinking: is this it?
Is all the promise of AI, when it comes to my line of work, typing the characters I tell it to, faster?
Don’t get me wrong, it’s nice to have someone else do the typing4, but this seems somewhat underwhelming. It certainly isn’t a game changer.

Intuitively, this doesn’t seem right. But to see why, we need to take a step back and consider LLMs again.

LLM Strengths Beyond Code Generation

Large Language Models, as the name implies, are pretty good at understanding, well – language. They’re really good at parsing and producing text, at “understanding” it. I’m avoiding the philosophical debate on the nature of understanding5, but I think it’s pretty clear at this point that when it comes to natural language understanding, LLMs provide a very clear advantage.

And this is where it gets interesting. Because when we look at the real world challenges listed above, most of them boil down to communication and understanding of language and semantics.

LLMs are good at:

  • Natural language understanding – identifying concepts in written text.
  • Information synthesis – connecting disparate sources.
  • Pattern recognition
  • Summarization
  • Structured data generation
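As a small illustration of the last point – structured data generation – an application can ask the model for JSON and validate the shape before trusting it. This is a hedged sketch; the required keys and the sample output below are invented, not from any real model:

```python
# One way to make "structured data generation" reliable: validate whatever the
# model returns against an expected shape before using it downstream.
import json

def parse_llm_json(raw: str, required_keys: set) -> dict:
    """Parse model output as JSON and check the keys we depend on are present."""
    data = json.loads(raw)
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"Model output missing keys: {missing}")
    return data

# A hypothetical model response describing concepts found in a document.
raw_output = '{"summary": "Two services share a DB", "concepts": ["service", "database"]}'
result = parse_llm_json(raw_output, {"summary", "concepts"})
print(result["concepts"])
```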

And when you consider mechanizing these capabilities, like LLMs do, you should be able to see the doors this opens.

These capabilities map pretty well to the problems we have in large scale software engineering. Take, for example, pattern recognition. This should help with mastering complexity, especially when complexity is expressed in human language6.

Another example might be in addressing communication overhead. It can be greatly reduced when the communication artifacts are generated by agents armed with LLMs. Think about drafting decisions, specifications, summarizing notes and combining them into concrete design artifacts and project plans.
It’s also easier to maintain consistency in design and code, when you have a tireless machine that does the planning and produces the code based on examples and design artifacts it sees in the system.

It should also be easier to understand the impact of changes when you have a machine that traces and connects decisions to concrete artifacts and components. A machine that checks changes in code isn’t new (you probably know it as “a compiler” or “static code analyzer”). But one that understands high-level design documents and connects them, eventually, to the running code, with no extra metadata, is a novelty. Think about an agent that understands your logs and your ADRs, to find bottlenecks or brainstorm potential improvements.

And this is where it starts to get interesting.

It’s interesting because this is where mechanizing processes starts to pay off – when we address the scale of the process and volume of work. And we do it with little to no loss of quality.

If we can get LLMs to do a lot of the heavy lifting when it comes to identifying correlations, understanding concepts and communicating about it, with other humans and other LLMs, then scaling it is a matter of cost7. And if we manage this, we should be on the road to, I believe, an order of magnitude improvement.

So where does that leave us?

Augmenting SW Engineering Teams with LLMs

You have your existing artifacts – your meeting notes, design specifications, code base, language and framework documentation, past design decisions, API descriptors, data schemas, etc.
These are mostly written in English or some other known format.

Imagine a set of LLM-based software agents that connect to these artifacts, understand the concepts and patterns, make the connections and start operating on them. This has an immediate potential to save human time by generating artifacts (not just code), but also make a lot of the communication more consistent. It also has the potential to highlight inconsistencies that would otherwise go unnoticed.

Consider, for example, an ADR assistant that takes in a set of meeting notes, some product requirements document(s) and past decisions, automatically identifies the new decisions taken, and generates succinct, focused ADRs based on them.
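A rough sketch of how such an assistant's input might be assembled. The prompt wording and the artifact contents here are hypothetical, just to show the shape of the pipeline:

```python
# Gather the relevant artifacts into one prompt and ask the model to extract
# new decisions. A real agent would then send this to an LLM and parse ADRs
# out of the response; this sketch only shows the assembly step.

def build_adr_prompt(meeting_notes: str, prd: str, past_decisions: list[str]) -> str:
    past = "\n".join(f"- {d}" for d in past_decisions)
    return (
        "Identify any NEW architectural decisions in the notes below that are\n"
        "not already covered by past decisions, and draft an ADR for each.\n\n"
        f"Past decisions:\n{past}\n\n"
        f"Product requirements:\n{prd}\n\n"
        f"Meeting notes:\n{meeting_notes}"
    )

prompt = build_adr_prompt(
    meeting_notes="We agreed to move session storage to Redis.",
    prd="Sessions must survive instance restarts.",
    past_decisions=["ADR-012: Use PostgreSQL as the system of record"],
)
print(prompt.splitlines()[0])
```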

Another example would be an agent that can act as a sounding board for design thinking – you throw your ideas at it, allow it to access existing project and system context as well as industry standards and documentation. You then chat with it about where best practices apply, and where the risks in given design alternatives lie. Design review suddenly becomes more streamlined when you can simply ask the LLM to bring up issues in the proposed design.

Imagine an agent that systematically builds a knowledge graph of your system as it grows. It does it in the background by scanning code committed and connecting it with higher level documentation and requirements (probably after another agent generated them). Understanding the impact of changes becomes easier when you can access such a semantic knowledge graph of your project. Connect it to a git tool and it can also understand code/documentation changes at a very granular level.
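As an illustrative toy version of that background process – a real agent would use an LLM or embeddings to make the file-to-component connections, while here a simple exact-match lookup (over made-up file paths) stands in for it:

```python
# Background knowledge-graph builder, reduced to its core step: link each file
# touched by a commit to the documented component that covers it.

def update_graph(graph: dict, commit_files: list, component_docs: dict) -> dict:
    """component_docs maps component name -> set of file paths it documents."""
    for path in commit_files:
        for component, files in component_docs.items():
            if path in files:
                graph.setdefault(component, set()).add(path)
    return graph

# Hypothetical documentation index and commit.
docs = {"auth-service": {"src/auth/login.py"}, "billing": {"src/billing/invoice.py"}}
graph = update_graph({}, ["src/auth/login.py"], docs)
print(graph)
```

Run continuously over commits, even this trivial mapping yields a graph you can query for "what components does this change touch?" – the LLM's job is to make the linking far less literal than exact path matching.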

None of these examples eliminates the human in the loop – keeping a human in the loop is actually a common pattern in agentic systems. I don’t think the human(s) can or should be eliminated from the loop. It’s about empowering human engineers to apply intuition and higher-level reasoning. Let the machine do the heavy lifting of producing text and scanning it. And in this case we have a machine that can not only scan the text, but understand the higher-level concepts in it, to a degree. Humans immediately benefit from this, simply because humans and machines now communicate in the same natural language, at scale.

We can also take it a step further: we don’t necessarily need a complicated or very structured API to allow these agents to communicate amongst themselves. Since LLMs understand text, a markdown document with some simple structure (headers, blocks) is a pretty good starting point for an LLM to infer concepts. Combine this with diagram-as-code artifacts and you have another win – LLMs understand these structures as well. All with the same artifacts understandable by humans. There’s no need for extra conversions8.
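To show how little structure is needed, here's a sketch that recovers sections from a lightly structured markdown document – the section names are arbitrary examples:

```python
# Markdown headers are enough structure for mechanical (or LLM) consumption,
# while the same file stays perfectly readable to humans.

def markdown_sections(text: str) -> dict:
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}

doc = "# Decision\nUse Redis for sessions.\n# Rationale\nSessions must survive restarts."
print(list(markdown_sections(doc)))
```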

So now we can have LLMs communicating with other LLMs, to produce more general automated workflows. Analyzing requirements, in the context of the existing system and past decisions, becomes easier. Identifying inconsistencies or missing/conflicting requirements can be done by connecting a “requirement analyzer” agent to the available knowledge graph produced and updated by another agent. What-if scenarios are easier to explore in design.

Such agents can also help with producing more viable plans for implementation, especially taking into consideration existing code bases. Leaning on (automatically updated) documentation can probably help with LLM context management – making it more accurate at a lower token cost.

Mechanizing Semantics

We should be careful here not to fall into the trap of assuming this is simple automation, a sort of more sophisticated robotic process automation, though that has its value as well.

I think it goes beyond that.
A lot of the work we do on a day to day basis is about bringing context and applying it to the problem or task at hand.

When I get a feature design to be reviewed, I read it, and start asking questions. I try to apply systems thinking and first-principles thinking. I bring in the context of the system and business I’m already aware of. I try to look at the problem from different angles, and ask a series of “what-if” questions on the design proposed. Sometimes it’s surfacing implicit, potentially harmful, assumptions. Sometimes it’s just connecting the dots with another team’s work. Sometimes it’s bringing up the time my system was hacked by a security consultant 15 years ago (true story). There’s a lot of experience that goes into that. But essentially it’s applying the same questions and thought processes to the concepts presented on paper and/or in code.

With LLMs’ ability to derive concepts, identify patterns in them, and with their vast embedded knowledge, I believe we can encode a lot of that experience into them – whether it’s by fine-tuning, clever prompting or context building. A lot of these thinking steps can be mechanized. It seems we have a machine that can derive semantics from natural language, and we have the potential to leverage this mechanization in the day-to-day of software production. It’s more than simple pattern identification. It’s about bridging the gap between human expression and formal methods (be it diagrams or code). That gap seems to be getting smaller by the day.

Let’s not forget that software development is usually a team effort. And when we have little automatic helpers that understand our language, and make connections to existing systems, patterns and vocabulary, they’re also helping us communicate amongst ourselves. In a world where remote work is prevalent, development teams are often geographically distributed and communicating in a language that is not native to anyone on the team. Having something that summarizes your thoughts, verifies meeting notes against existing patterns and ultimately checks whether your components play nicely with other teams’ plans, all in perfect English, is a definite win.

This probably won’t be an easy thing to do, and will have a lot of nuances (e.g. legacy vs. newer code, different styles of architecture, evolving non functional requirements). But for the first time I feel this is a realistic goal, even if it’s not immediately achievable.

Are We Done?

This of course begs the question – where is the line? If we can encode our experience as developers and architects into the machine, are we really on the path to obsolescence?

My feeling is that no, we are not. At the end of the process, after all alternatives are weighed, assumptions are surfaced and trade-offs are considered, a decision needs to be taken.

At the level of code writing, this decision – what code to produce – can probably be taken by an LLM. This is a case where constraints are clearer and with correct context and understanding there’s a good chance of getting it right. The expected output is more easily verifiable.

But this isn’t true for more “strategic” design choices – things that go beyond code organization or localized algorithm performance; choices that involve human elements like skill sets and relationships, or contractual and business pressure. Ultimately, the decision involves a degree of intuition. I can’t say whether intuition can be built into LLMs; intuitively, I believe it can’t (pun intended). I highly doubt we can emulate it using LLMs, at least not in the foreseeable future.

So when all analysis is done, the decision maker is still a human (or a group of humans) – a human who needs to consider the analysis, apply their experience, and decide on a course forward. If the LLM-based assistant is good enough, it can present a good summary and even recommendations, all produced automatically. But this analysis still needs to be understood and used by humans to reach a conclusion.

Are we there yet? No.
Are we close? Closer than ever probably, but still a way to go.

Can we think of a way to get there? Probably yes.

A Possible Roadmap

How can we realize this?

The answer seems to be, as always, to start simple, integrate and iterate; ad infinitum. In this case, however, the technology is still relatively young, and there’s a lot going on – anything from the foundation models, relevant databases and coding tools, to prompt engineering, MCPs and beyond. These are all being actively researched and developed, so trying to predict how this will evolve is even harder.

Still, if I had to guess how this will evolve in practice, this is how I think it will go – at least one possible path.

Foundational System Understanding

First, we’ll probably start with simple knowledge building. I expect we’ll first see AI agents that can read code, and produce and consume design knowledge – how current systems operate. This is already happening, and I expect it will improve. It comes first mainly because the task is well understood and the tools are already here; we can verify results and fine-tune the techniques.
Examples of this could be AI agents that produce detailed sequence diagrams of existing code, and then identify components. Other AI agents can consume design documents/notes and meeting transcriptions, together with the already-produced descriptions, to produce an accurate record of the changed/enhanced design. Having these agents work continuously and consistently across a large system already provides value.
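For a flavor of the first example, here's a sketch of the final step of such an agent: turning caller/callee pairs (which the agent would have extracted from the code) into Mermaid sequence-diagram text. The trace below is made up:

```python
# Emit Mermaid sequence-diagram source from an extracted call trace. The hard
# part (extracting the trace from real code) is the LLM's job; rendering the
# result as diagram-as-code is mechanical.

def to_mermaid_sequence(calls: list) -> str:
    """calls: list of (caller, callee, message) tuples."""
    lines = ["sequenceDiagram"]
    for caller, callee, message in calls:
        lines.append(f"    {caller}->>{callee}: {message}")
    return "\n".join(lines)

# Hypothetical trace an agent might extract from an auth flow.
trace = [("API", "AuthService", "validate(token)"),
         ("AuthService", "UserDB", "lookup(user_id)")]
print(to_mermaid_sequence(trace))
```

The output is itself a textual artifact, so it can be versioned, diffed and fed back to other agents as context – the same property discussed earlier for markdown.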

Connecting Static and Dynamic Knowledge

Given that AI agents have an understanding of the system structure, I can see other AI agents working on dynamic knowledge – analyzing logs, traces and other dynamic data to provide insights into how the system and users actually behave and how the system evolves (through source control). This is more than log and metric analysis. It’s overlaying the information available over a larger knowledge graph of the system, connecting business behavior to the implementation of the system, including its evolution (i.e. git commits and Jira tickets).


Can we now examine and deduce information about better UX design?
Can we provide insights into the decomposition of the system? 

Enhanced Contextual Assistance and Design Support

At this point we should have everything needed to provide more proactive design support. I can see AI agents we can chat with that help us reason about our designs – where we can suggest a design alternative and ask the agent to assess it and find hidden complexities, in the context of the existing system. Combined with daily deployments and source control, we can probably expect some time estimates and detailed planning.

This is where I see the “design sounding board” agent coming into play. As well as agents preemptively telling me where expected designs might falter.

More importantly, it’s where AI agents start to make the connections to other teams’ work. Telling me where my designs or expected flow will collide with another team’s plans.
Imagine an AI agent that monitors design decisions, of all teams and domains, identifies the flows they refer to, and highlights potential mismatches between teams or suggests extra integration testing, if necessary, all before sprint planning starts. Impact analysis becomes much easier at this point, not because we can query the available data (though we could, and that’s nice as well), but because we have an AI agent looking at the available data, considering the change, and identifying on its own what the impact is.


There’s still a long way to go until this is realized. Implementing this vision requires taking into account data access issues, LLM and technology evolution, integration and costs. All the makings of a useful software project.
I also expect quite a bit can change, and new techniques/technologies might make this more achievable or completely unnecessary.

And who knows, I could also be completely hallucinating. I heard it’s fashionable these days.

Conclusion: The Real Promise of LLMs in Software Engineering

I’ve argued here that while vibe coding and code generation get most of the attention, they aren’t addressing the real bottlenecks in software development. The true potential of Large Language Models lies in their ability to understand and process natural language, connect disparate information sources, and mechanize semantic understanding at scale.

LLMs can transform software engineering by tackling the actual challenges we face daily: conquering complexity, reducing communication overhead, maintaining consistency, and analyzing the impact of changes. By creating AI agents that can understand requirements, generate documentation, connect design decisions to implementation, and serve as design thinking partners, we can achieve meaningful productivity improvements beyond simply typing code faster, as nifty as that is.

What makes this vision useful and practical is that it doesn’t eliminate humans from the loop. Rather, it augments our capabilities by handling the heavy lifting of information processing and connection-making, while leaving the intuitive, strategic decisions to experienced engineers. This partnership between human intuition and machine-powered semantic understanding represents a genuine step forward in how we build software.

Are we there yet? Not quite. But we’re closer than ever before, and the path forward is becoming clearer. 

Have you experienced any of these AI-powered workflows in your own development process? Do you see other applications for LLMs that could address the real bottlenecks in software engineering?


  1. At least most who publicly talk about it ↩︎
  2. ‘Just set up an api’ is easier said than done – agreeing on the API is the hard work ↩︎
  3. And this is a bit debatable when you consider non-functional requirements ↩︎
  4. I am getting older ↩︎
  5. Also because I don’t feel qualified to argue on it ↩︎
  6. Data mining has been around forever, but mostly works on structured data ↩︎
  7. Admittedly, not a negligible consideration ↩︎
  8. Though from a pure mechanistic point of view, this might not be the most efficient way ↩︎

Exploring Vibe Coding with AI: My Experiment

In my previous post I mentioned vibe coding as a current trend of coding with AI. But I haven’t actually tried it.

So I’ve decided to jump on the bandwagon and give it a try. Granted, I’m not the obvious target audience for this technique, but before passing judgment I had to see/feel it for myself.

It’s not the first time I’ve generated code using an LLM with some prompting. But this time I was more committed to trying out “the vibe”. To be clear, I did not intend to go all in with voice commands, transcription, and watching Netflix while the LLM worked. I did intend to review the code, and keep in touch with the output at every point. I wanted to test the tool’s capabilities while still being very much aware of what was going on.

Below is an account of what happened, my thoughts and conclusions so far.
A general disclaimer is of course in place: I’m still exploring these tools, and it’s quite possible the process could be improved. My perspective, however, is very much shaped by my experience as a developer. My choice of tools and how to use them is therefore very much biased towards usage by an experienced developer looking to increase productivity, not a non-coder looking to crank out one-off applications1.

The Setup

I set out to create a new simple tool for myself (actually to be used at work) – something I actually find useful, that is not an obvious side project done a million times, and is therefore less likely (I hope) to be in the LLM’s training data. It’s a project done from scratch, in an area I don’t have a lot of experience with, and it’s meant to be fairly limited in scope.

The project itself is a “Knowledge Graph Visualizer”, essentially an in-browser viewer of a graph representing arbitrary concepts and their relationships. I intended this to be purely in-browser JS code. The main feature is a 3D rendering of the graph, allowing navigation through the concepts and their links. You can see the initial bare specification here.
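I won't reproduce the actual spec here, but as a hypothetical illustration, such a graph can be described with a simple nodes/links JSON structure, which a viewer can sanity-check before rendering (every link must reference real nodes):

```python
# Illustrative nodes/links JSON for a concept graph, plus a referential
# integrity check. The schema and node names are made up, not the actual
# KG-Viewer format.
import json

graph_json = '''{
  "nodes": [{"id": "llm", "label": "LLM"}, {"id": "rag", "label": "RAG"}],
  "links": [{"source": "llm", "target": "rag", "label": "augmented by"}]
}'''

def validate_graph(raw: str) -> dict:
    graph = json.loads(raw)
    node_ids = {n["id"] for n in graph["nodes"]}
    for link in graph["links"]:
        if link["source"] not in node_ids or link["target"] not in node_ids:
            raise ValueError(f"Dangling link: {link}")
    return graph

print(len(validate_graph(graph_json)["nodes"]))
```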

To get a feel for the project, here’s a current screenshot:

KG-Viewer showing its own knowledge graph

With respect to tooling, I went with Cursor (I use Cursor Pro), using primarily the Claude 3.7 Sonnet model. The initial code generation was actually done with Gemini 2.5 Pro, but I quickly ran out of credits there, so the bulk of the work was done with Cursor.

I did not use any special Cursor rules or MCP tools. This may have altered the experience to a degree (though I doubt it), so I will need to keep experimenting as I explore these tools and techniques.

Getting Into the Vibe

It actually started out fairly impressively. Given the initial spec, Gemini generated 6 files that provided the skeleton for a proof of concept. All of these files are still there. I did not look too deeply into the generated code. Instead, I initialized an empty project, launched Cursor, and copied the files there. With a few tweaks2, it worked. I had a working POC in about one hour of work, without ever having coded 3D renderings of graphs.

Magic!
I’ll be honest – I was impressed at first. I got a working implementation for drawing a graph with Three.js, from some JSON describing a graph. Given that I had never laid eyes on Three.js, this was definitely faster than I would have gotten even to this simple POC on my own.

I did peek at the code. I wasn’t overly impressed by it – there was a lot of unnecessary repetition, very long functions, and some weird design choices. For example, having a style.css hold all the style classes, while at the same time generating new styles and dynamically injecting them into the document.
But, adhering to my “vibe coder creed”, I did not touch the code, instead working only with prompts.

Then I started asking for more features.

Cursor/Claude, We Have a Problem

A POC is nice. But I actually need a working tool. So I started asking for more features.
Note, I did not just continue to spill out requests in the chat. I followed the common wisdom – using a new chat instance, laying out the feature specification and working step by step on planning and testing before implementation.

I wrote a simple file, which should allow me to trace the feature’s spec and implementation.
The general structure is simple:

- Feature Specification
- Plan
- Testing
- Implementation Log

Where I fill in only the Feature Specification, and let Cursor fill in the plan (after approval) and the “Implementation Log” as we proceed.
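The file followed roughly this shape (the placeholder text is mine, for illustration):

```markdown
# Feature: <name>

## Feature Specification
<what the feature should do, written by me>

## Plan
<filled in by the agent, reviewed and approved before implementation>

## Testing
<how we verify the feature works>

## Implementation Log
<appended by the agent as work proceeds>
```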

The plan was to have a working log of progress, to be used as both a log of the work, but also provide context to future chat sessions.

I don’t intend to re-create here the entire chat session or all my prompts, as this is not intended to be a tutorial on LLM techniques. But it’s fair to say that the first feature (data retrieval) was implemented fairly easily, using only prompts.

Just One Small Change…

I was actually still pretty impressed at this point, so I simply asked for a tiny feature – showing node and link labels. I did it without creating an explicit “feature file”.

The code didn’t work. So I asked Cursor to fix it. And this quickly spiraled out of control. Cursor’s agent of course notified me on every request that it had definitely figured out the issue, and now had the fix (!).
It didn’t.

I remained loyal to the “vibe coder creed” and did not try to debug/fix the code myself, instead deliberately going in cycles of prompting for fixes, accepting changes blindly, testing, and prompting again with new errors.

Somewhere along this cycle, the code changes made by the agent actually created regressions, resulting in the application not loading at all.

After roughly 3 hours, and a lot more grey hair, I did notice that the Cursor agent was going in circles – simply trying out the same 3 solutions, with no idea what’s wrong. But still confidently hallucinating solutions (“Now I see the issue…”3).

This was so frustrating that at this point I simply took it upon myself to actually look at the code, which was a complete mess. I looked at the problematic code, consulted git diffs to restore basic functionality, and solved the actual issue with about 10 more minutes of Google search.

To be fair, from my very rudimentary Google search it seemed my request (link labels) wasn’t that easy to achieve – it’s apparently not that obvious (again, without being an expert on Three.js). I relaxed the requirement a bit, and found a simple solution.
Still, the whole back-and-forth cycle of code changes, especially to unrelated code, was very much counter-productive. The vibes were all wrong. Getting back to working code took another 2-3 hours.

At this point I was thinking “oh well, you can’t win them all”. I wanted to turn to something simple. And looking at the state of the code, a simple cleanup should be easy enough, right?

Right? …

Now It’s Just Cleanup

Well … it depends.

I went back into “vibe coding” mode. This time, I defined very basic code cleanup procedures. I then asked Cursor’s agent (in a new session), to go through the source code and follow these steps to clean it up.

It actually did reasonably well for small files. The bigger files proved to be more challenging: trying to clean them up ended up messing up the files completely. For some reason, the LLM agent removed functioning code, and created functionality regressions. Trying to quickly fix them ended up causing more issues. It was clearly guessing at this point.

Given my battle scars from the previous feature request, I avoided this hallucination death spiral. Instead, I went through git history, found a working version, and restored the working code “by hand” – actually typing in code. I wasn’t a vibe coder anymore, but the application worked, the code was cleaner, and my blood pressure remained fairly low (I think).

The experience felt like trying to mentor a junior developer to code without creating regressions. The problem is that it’s a fast and confident junior developer with short-term memory loss, apparently so eager to please that it simply spews out code that looks remotely connected to the problem at hand, with little understanding of context – proving ignorant even of changes it itself made to the code.

Documentation for Man and Machine

At this point I decided to go back to basics, to where LLMs truly shine – understanding and creating text. I asked it to create documentation for specific flows in the code (the init sequence, clicking on a legend item). Unsurprisingly, with a few simple prompts, the agent produced decent documentation for what I asked, including mermaid.js diagram code.

This is important not simply because it allowed me to document the project easily, which is nice. Creating textual documentation of specific flows also allowed me to provide better context for other chat sessions. And this is an important insight – textual descriptions of the code are useful for humans as well as for LLMs.

Other Features

At this point I turned to developing more features – loading data and “node focus”. In both cases I went back to providing feature files with specifications, and asking the agent to update the files with plans and implementation logs.

I was a bit more cautious now. I reviewed code more carefully and intervened where I felt the code wasn’t good. In some cases it was obvious the code wasn’t functionally correct, but instead of trying to “fight” with the agent, I accepted the code and went on to change it myself.

A repeating phrase in all my prompts at this point was:

Do minimal code changes. Change only what is needed and nothing more.

This, combined with more caution and care, produced pretty good results. I managed to implement two features in a short time – probably a bit shorter than it would have taken me to run through Three.js tutorials and do it myself.

Final Thoughts

So where does this leave me?

I have a working application. Had I needed to learn Three.js from scratch myself, it would have taken considerably longer to create. It’s working, and it’s useful. This is an important bottom line.

Small Application, Good Starting Point

The initial code generated by the LLM (Gemini or Claude) serves as a good starting point, especially in areas or frameworks that are unfamiliar to the developer.

But this is still a far cry from replacing developers. There are tool limitations, some of them, I expect, introduced by Cursor rather than the LLM. These limitations can cause havoc if the agent is left to proceed with no oversight.
And review is harder when there’s a ton of unorganized code4.

We can probably make it better with rules, better prompts, and a combination of agents. And of course with advances in LLM training.

This is a good starting point. But we need to remember this is a very small application, made from scratch. In the real world, a lot of use cases are not that simple at all. The more I read and think about it, the more this bears a striking resemblance to no-code/low-code tools. In those cases, too, it’s easy to achieve quick results for simple use cases, but very hard to scale development when features creep in or the application needs to grow.

It’s not that low-code tools don’t have their place. They serve a very specific (viable) niche. But as experience shows, they haven’t replaced developers.

Could this be different?
What would it take to tackle more serious challenges, with “vibe coding”?

Context is King

It’s quite obvious that in the kingdom of tokens, amidst ramparts of code and winds of chat messages, there is only one king, and its name is Context5. As LLMs are limited in their context size, and a lot of it is taken up by the wrapping tools (Cursor in this case), context for an LLM chat is expensive real estate.

So while context windows can get big, we’ll probably never have enough when we get to more complicated tasks and bigger code bases. There’s a preservation of complexity at play.

Accuracy and precision in the context play a crucial role in effectiveness. Context passed to LLMs needs to be information-dense. We should probably start considering how efficient the context we provide to LLMs is. I don’t know how to measure context efficiency yet, but I believe it will be important for staying effective as tasks become more complicated.

But there’s more than just the LLM and how to operate it.

You’re Only as Good as Your Tools, Also When Vibing

It’s quite clear that mistakes made by LLMs, and humans, can be avoided or caught with the help of the right tools. Even in my small example described above, cooperation between the LLM agent and external tools (console logs, shell commands) resulted in better understanding and a more independent agent.

I suspect that having more tools, e.g. a relevant MCP server for documentation, can help significantly. I expect the integration of LLMs with tools to become more prominent, and more necessary for creating more independent coding agents.

One often overlooked tool is the simple document explaining the context of the project, specific features and current tasks. When LLMs work seamlessly with Architecture Decision Records and diagram-as-code tools, I expect to see better results. The memory bank approach seems to be a step in that direction, though it’s hard to assess how effective it is.

I noticed in this exercise that supplying the LLM with the context of how a flow currently works (e.g. loading the data) allows it to identify the necessary changes more easily.

Diagram-as-code tools now play a role not just for human developers, but also as a way to encode context for the application. There’s a feedback loop here between the LLM generating documentation and using it as input for further tasks.

Effective Vibing

The real question is about the effectiveness of the vibe coding approach: with what degree of agent independence can we achieve good results?

I’m not sure how to assess this. One approximation might be the ratio of bugs to user chat messages times lines generated in a given vibe coding session. But there are obviously other parameters involved6.
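As a toy sketch, such an approximation could look like this (the function name and weighting are my own illustration, not an established metric):

```python
def vibe_bug_rate(bugs, chat_messages, lines_generated):
    """Illustrative only: bugs normalized by chat messages times lines generated.

    A lower value would suggest a more effective vibe-coding session;
    it says nothing about code quality or maintainability on its own.
    """
    effort = chat_messages * lines_generated
    return bugs / effort if effort else 0.0

vibe_bug_rate(4, 20, 500)  # -> 0.0004
```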

It will be interesting to measure this over time, with more integrated tooling and improved LLMs.

I’m not sure how this will evolve over time. I do think, however, that if LLMs with coding tools are reduced to a glorified low-code platform, it will be a miss for software engineering in general. The technology seems more powerful than that, since it has the potential to bridge the gap between human language and rigorous computer programs more easily; and to do it in both directions.

On to explore more.



  1. Not that there’s anything wrong with that ↩︎
  2. Yep, I asked Cursor to keep track of the changes at this point ↩︎
  3. A phrase which, I guess, is close to becoming a meme onto itself ↩︎
  4. But then again, not sure it’s a problem in the long run ↩︎
  5. Always looking for opportunities to paraphrase one of my favorite book series; couldn’t resist this one ↩︎
  6. And we should be careful of Goodhart’s law. ↩︎

AI and the Nature of Programming – Some Thoughts

So, AI.
It’s all the rage these days. And apparently for good reason.

Of course, my curiosity, along with a fair amount of FOMO1 leads me to experimenting and learning the technology. There’s no shortage of tutorials, tools and models. A true Cambrian explosion of technology.
This also aligns fairly easily with my long-time interests in development tools, development models, and software engineering in general. So there’s no shortage of motivation to dive into this from a developer’s perspective2.

And the debate is on, especially when it comes to development tools.

It’s no secret that tools powered by large language models (LLMs), like GitHub Copilot, Cursor, and Windsurf3, are becoming indispensable. Developers everywhere are adopting them as an essential part of their daily toolset. They offer the ability to generate code snippets, debug errors and refactor code with remarkable speed. This shift has sparked a fascinating debate about the role of AI in coding. Is it merely a productivity booster? Or does it represent a fundamental change in how we think about programming itself?

At its core, coding with AI promises to make software development faster and arguably more accessible. For simple, well-defined tasks, AI can produce functional code in seconds. This reduces the cognitive load on developers and allows them to focus on higher-level problem-solving. But software development in the wild, especially for ongoing product development, becomes very complicated very quickly. As complexity grows, the limitations of AI-generated code become obvious. While LLMs can produce code quickly and easily, the quality of their output often depends on the developer’s ability to guide and refine it.

So while AI excels at speeding up simple tasks, there are still challenges with more complex tasks. And there are implications to the ability to maintain code over time. But, I cannot deny we’re apparently at the beginning of a new era. And this raises the question of whether traditional notions of “good code” still apply in an era where AI might take over the bulk of maintenance work.

And I ask myself (and you): can we imagine a future where AI no longer generates textual code? Instead, it operates on some other canonical representation of logic. Are we witnessing a shift in the very nature of programming?

Efficiency of AI in Coding

Before diving into the hard(er) questions, let’s take a step back.

One of the most compelling advantages of coding with AI is its ability to significantly speed up the development process. This is especially true for simple and focused tasks. AI-powered tools, like GitHub Copilot and ChatGPT, excel at generating boilerplate code, writing repetitive functions, and even suggesting entire algorithms based on natural language prompts. For example, a developer can describe a task like “create a function to sort a list of integers in Python,” and the AI will instantly produce a working implementation. This capability not only saves time. It also reduces the cognitive burden on developers4. Consequently, developers can focus on more complex and creative aspects of their work.
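To be concrete, one plausible response to such a prompt is trivial – which is exactly why the AI handles it so well:

```python
def sort_integers(numbers):
    """Return a new list with the integers sorted in ascending order."""
    return sorted(numbers)

sort_integers([42, 7, 19])  # -> [7, 19, 42]
```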

The efficiency of AI in coding is particularly evident in tasks that are well-defined and require minimal context. Writing unit tests, implementing standard algorithms, or formatting data are all areas where AI can outperform human developers in terms of speed. AI tools can also quickly adapt to different programming languages and frameworks, making them versatile assistants for developers working in diverse environments. For instance, a developer switching from Python to JavaScript can rely on AI to generate syntactically correct code in the new language, reducing the learning curve and accelerating productivity. I often use LLMs to create simple scripts quickly instead of consulting documentation on some forgotten shell scripting syntax.

AI’s effectiveness in coding often depends on the developer’s ability to simplify tasks. The developer should break down larger, more complex tasks into smaller, manageable components. AI thrives on clarity and specificity; the more focused the task, the better the results. Yes, we have thinking models now, and they are becoming better every day. Still, they require supervision and accurate context. Contexts are large, and they’re not cheap.

At this point in time, developers still need to break down complicated tasks into more manageable subtasks to be successful. This is often compared to a senior developer/tech lead detailing a list of tasks for a junior developer. I often find myself describing a feature to an LLM, asking for a list of tasks before coding, and then iterating over it together with the LLM. This works quite well in small, focused applications. It becomes significantly more complicated with larger codebases.

While AI excels at handling simple and well-defined tasks, its performance tends to diminish as the complexity of the task increases. This is not necessarily a limitation of the AI itself but rather a reflection of the inherent challenges in translating high-level, ambiguous requirements into precise, functional code. For example, asking an AI to “build a recommendation system for an e-commerce platform” is a very complex task. In contrast, requesting a specific algorithm, like “implement a collaborative filtering model”, is simpler. The former requires a deep understanding of the problem domain, user behavior, and system architecture. These are areas where AI still struggles without significant human guidance.

As it stands today, LLMs act as a force multiplier for developers, enabling them to achieve more in less time. The true potential is realized when developers approach AI as a collaborative tool rather than a fully autonomous coder.

The “hands-off” approach (aka “Vibe coding“), where developers rely heavily on AI to generate code with minimal oversight, often leads to mixed results. AI can produce code that appears correct at first glance. Yet, it can contain subtle bugs, inefficiencies, or design flaws that are not immediately obvious. This is just one case I came across, but there are a lot more of course.

It’s not just about speed

But it’s more than simple planning, prompt engineering and context building. AI can correct its own errors, autonomously.

One of the most impressive features of AI in coding is its ability to detect and fix errors. When an LLM generates code, it doesn’t always get everything right the first time. Syntax errors, compilation issues, or logical mistakes can creep in. Yet, modern AI tools are increasingly equipped to spot these problems and even suggest fixes. For instance, tools like Cursor’s “agent mode” can recognize compilation errors. These tools then automatically try to correct them. This creates a powerful feedback loop where the AI not only writes code but also improves it in real time.

It’s important to note that there is collaboration here between AI and traditional tooling. Compilers make sure that the code is syntactically correct and can run, while LLMs help refine it. Together, they form a system where errors are caught early and often, leading to more reliable code. I have also had several cases where I asked the LLM to make sure all tests pass and there are no regressions. It ran all the tests and fixed the code based on the broken tests.
That is, without human intervention in that loop.
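The shape of that loop can be sketched generically. This is my own schematic of the pattern, not Cursor’s actual implementation; `run_tests` and `ask_agent_to_fix` stand in for whatever test runner and agent harness are in use:

```python
def fix_until_green(run_tests, ask_agent_to_fix, max_iterations=5):
    """Feedback loop: run the tests, hand any failures back to the agent, repeat.

    run_tests() -> (passed: bool, output: str)
    ask_agent_to_fix(output) edits the code as a side effect.
    Returns True if the suite passes within the iteration budget.
    """
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True
        ask_agent_to_fix(output)  # the agent patches the code based on failures
    passed, _ = run_tests()  # one last check after the final fix attempt
    return passed
```

The important property is that the traditional tooling (the test runner) supplies the ground truth, while the LLM supplies the fixes.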

So AI, along with traditional tools (compilers, tests, linters) can be autonomous, at least to a degree.

It’s not just about correct code

As we all know, producing working code is only one (important) step when working as an engineer. It’s only the beginning of the journey. This is especially true when working on ongoing product development. It is probably less so in time-scoped projects. In ongoing projects, development never really stops. It continues unless the product is discontinued. There are mountains of tools, methodologies and techniques dedicated to maintaining and evolving code over time and at scale. It is often a much tougher challenge than the initial code generation.

One of the biggest criticisms of AI-generated code is that it often lacks maintainability. Maintainable code is code that is easy to read, understand, and change over time. Humans value this because it makes collaboration and long-term project evolution easier. Yet, AI doesn’t always prioritize these qualities. For example, it might generate long, convoluted functions or introduce unnecessary methods that make the code harder to follow.

The reality is that code produced by an LLM, while often functional, may not always align with human standards of readability and maintainability.
I stopped counting the times I’ve had an LLM produce running, often functionally correct, code that was horrible in almost every aspect of clean and robust code. I dare say a lot of the code produced is the antithesis of clean code. And yes, we can use system prompts and rules to facilitate better code. However, it’s not there yet, at least not consistently. This issue is not necessarily a fault of the AI itself. It reflects the difficulty in defining and agreeing on what constitutes “good code”.

Whether or not LLMs get to the point where they can produce more maintainable code is uncertain. I’m sure it can improve, and we haven’t seen the end of it yet. I wonder if that is a goal we should be aiming for in the first place. We want “good” code, because we are there to read it, and work with it after the AI has created it.

But what if that wasn’t the case?

A code for the machine, by the machine

LLMs are good at understanding our text and eventually acting on it – producing the text that answers our questions/instructions. And when it comes to code, they produce code, as text. But that text is for us – humans – to consume. So we review and judge through this lens – code that we need to understand and work with.

We do it with the power of our understanding, but also with the tools we’ve built to help us do it – compilers, linters, etc. It’s important to note that language compilers are tools for humans to interact with the machine. They are necessary when humans instruct the machine (i.e. write code). The software development process, even with AI, requires them because the LLM is writing code for humans to understand. This also allows us to leverage existing investments in programming.

But when an LLM is generating code for another LLM to review, and when it iterates on the generated response, the code doesn’t need to be worked on by humans. Do we really need the code to be maintainable and clear for us?
Do we care about duplication of code? Meaningful variable and class names? Is encapsulation important?
LLMs can structure their output, and consume structured inputs. Assuming LLMs don’t hallucinate as much, I’m not sure type checking is that impactful either.

I think we should not care so much about these things.
At the current rate of development around LLMs, there’s no reason we shouldn’t get to a point where LLMs can analyze an existing code base and modify or evolve it successfully without a human ever laying eyes on the code. It might require some fancy prompting or a combination of multiple LLM agents, but we’re not far off.

Another force at play here, I believe, is that code can be made simpler and more straightforward if it doesn’t need to abstract away much of the underlying concepts. A lot of the existing abstractions are there because of humans. Take for example UI frameworks, SDKs, component frameworks and application servers. Most of the focus there is on abstracting concepts and letting humans operate at a higher level of understanding. This can be leveraged by LLMs, but it doesn’t have to be. Do I need an ORM framework when the LLM simply produces the right queries whenever it needs to?
Do I need middleware and abstractions over messaging when an AI agent can simply produce the code it needs, and replicate it whenever it needs to?

My point is, a lot of the (good) concepts and tools and frameworks we created in programming are good and useful under the assumption that humans are in the loop. Once you take humans out of the loop, are they needed? I’m not so sure.

The AI “Compiler”

Let’s take it a step further.
Programming languages are, in essence, a way for humans to communicate with the machine. It has been this way since the early days of assembly language. And with time, the expressiveness and ergonomics of programming languages have evolved to accommodate humans working with computers. This is great and important, because we humans need to program these damn computers. The easier it is for us, the more we can do.

But it’s different when it’s not a human instructing the machine. It’s an AI that understands the human, but then translates it to something else. And another AI works to evolve this output further. Does the output need to be understandable by humans?
What if LLMs understand the intent given by us, but then continue to iterate over the resulting program using some internal representation that’s more efficient?

Internal representations are nothing new in the world of compilers. Compiler developers build them to enable the various operations compilers perform: optimization, type checking, tool support, and generating outputs.
Why can’t LLMs communicate over code using their own internal representation, resulting in much more efficient operation and lower costs?
This is not just for generating a one-time binary, but also for evolving and extending the program. As we observed above, software engineering is more than a simple generation of code.
It doesn’t have to be something fancy or with too many abstractions. It needs to allow another LLM/AI to work and continue to evolve it whenever a new feature specification or bug is found/reported.
Do we really need AI to produce a beautiful programming language, mimicking some form of structured English, when there’s no English reader who’s going to read and work on it?

Why not have AI agents produce something like a “textual/binary Gibberlink” – an AI-oriented “bytecode” – when producing our programs?

Is the human to machine connection through a programming language necessary when we have a tool that understands our natural language well enough, and can then reason on its own outputs?

LLMs can already encode their output in structured formats (e.g. JSON) that are machine processable. Is it that big of a leap to assume they’d be able to simply continue communicating with themselves and get the job done based on our specifications, without involving us in the middle?
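A minimal sketch of what consuming such structured output could look like (the schema here – a “files” list of path/content pairs – is invented for illustration):

```python
import json

def parse_agent_reply(raw):
    """Parse and validate an agent reply that was requested as structured JSON.

    Hypothetical schema: {"files": [{"path": ..., "content": ...}, ...]}.
    Raising on malformed output lets the caller (human or another agent) retry.
    """
    reply = json.loads(raw)
    if not isinstance(reply.get("files"), list):
        raise ValueError("expected a 'files' list")
    for entry in reply["files"]:
        if "path" not in entry or "content" not in entry:
            raise ValueError("each file entry needs 'path' and 'content'")
    return reply
```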

Vibe coding is apparently a thing now. I don’t believe it’s a sustainable trend5, mainly because it focuses on a specific point in the software life cycle – the point of generating code.
What if we can take it to the extreme? What if we remove the human from the coding process throughout the software life cycle?

I can’t really predict where this is going. At this point I don’t know the technology well enough to guesstimate, and I’m no oracle. But I do see this as one possible direction with a definite upside. And it’s definitely interesting to follow.

If programming is machines talking to machines, maintainability and evolution of code becomes a different game.

“Code” becomes a different game.

Is programming dead?

What would such a future hold for the programming professionals?

Again, I’m not great at making prophecies. But the way I see it, and looking at history, I don’t belong to the pessimistic camp. So in my opinion – no, I don’t subscribe to the notion that programming is dead.

History has taught us an important lesson: creating software with more expressive power did not decrease the amount of software created. A higher level of abstraction did not lessen software production either. Quite the contrary. More tools, and the ability to work at higher levels of abstraction, meant that more software was created. Demand grew as well. Developers just needed to adapt to the new tools and concepts. And we did6.

Demand for software still exists, and it doesn’t look like it’s receding. I believe that developers who will adapt to this new reality will still be in demand as well.

I expect LLMs to improve, even significantly, in the foreseeable future. But this doesn’t mean there’s no need for software developers. I expect software development tasks to become more complex. As developers free their time and minds from the gritty details of sorting algorithms, database schemas and authentication schemes, they will focus on bigger, more complicated tasks. So software development doesn’t become less complicated; we’re just capable of doing more stuff. Complexity simply moves to other (higher level?) places7.

Could it be that software architects will become the new software engineers?
Are all of us going to be “agent coders”?

I really don’t know, but I intend to stick around and find out.

Where do you think this is going?


  1. And, admittedly, fear of becoming irrelevant ↩︎
  2. And yes, AI was used when authoring this post, a bit. But no LLM was harmed, that I know of ↩︎
  3.  Originally I intended to add more examples, but realized that by the time I finish writing a list, at least 3 new tools will be announced. So… [insert your favorite AI dev tool here] ↩︎
  4. Give or take hallucinations ↩︎
  5. Remember no-code? ↩︎
  6. I’m old enough to have programmed in Java with no build tool, even before Ant. Classpath/JAR hell was a very real thing for me. ↩︎
  7. “The law of complexity preservation in software development”? ↩︎

Discussing Your Design with Scenaria

Motivation
As a software architect, I spend quite a bit of my time in design discussions. That’s an integral part of the job, for a good reason. As I see it, the design conversation is a fundamental part of this job and its role in the organization.

Design discussions are hard, for various reasons. Sometimes the subject matter is complicated. Sometimes there’s a lot of uncertainty. Sometimes tradeoffs are hard to negotiate. These are all just examples, and it is all part of the job. More often than not, it’s the interesting part.

But another reason these discussions tend to be hard is misunderstandings, vagueness and a lack of precision in how we express ourselves. Expressing your thoughts in a way that translates well into other people’s minds is not easy. This gets worse as the number of people involved increases, especially when using a language that most, if not all, participants do not speak natively.

From what I observed, this is true both for face to face meetings (often conducted remotely these days), as well as in written communication. I try to be as precise as I can, but jumping from one discussion to another, under time pressure, I also often commit the sin of “winging it” when making an argument in some Slack thread or some design document comment.

I’ve argued in the past that diagrams do a much better job of explaining designs. I think this is true, and I often try to make extensive use of diagrams. But good diagrams also take time to create. Tools that use the “diagram as code” approach, e.g. PlantUML (but there are a bunch of others, see kroki), are in my experience a good way to create and share ideas. If you know the syntax, you can be fairly fast in “drawing” your design idea.

Still, I haven’t found a tool that allows me to conveniently express what I need in a design discussion. Simply creating a diagram is not the whole story. I often want to share an idea of the structure of the system – the cooperating components – but also of its behavior. It’s important not just to show the structure of the system and the interfaces between components, but also to highlight specific flows in different scenarios.

There are of course diagram types for that as well, e.g. sequence or activity diagrams. And there are a plethora of tools for creating those as well. But the “designer experience” is lacking. It’s hard to move from one type of view to another, maintaining consistency. This is why whiteboard discussions are easier in that sense – we sit together, draw something on the board, and then point at it, waving our hands over the picture that everyone is looking at. Even if something is not precise in itself, we can compensate by pointing at specific points, emphasizing one point or another.

Emulating this interaction is not easy in this day and age of remote work. When a lot of the discussions are done remotely, and often asynchronously (for good reasons), there’s a greater need to be precise. And this is not easy to do at the “speed of thought”.

Building software tools is sort of a hobby for me, so I set out to try and address this.

Goals

What I’m missing is a tool that will allow me to:

  1. Quickly express my thoughts on the structure and behavior of a (sub)system – the involved components and interactions.
  2. Share this picture and relevant behavior easily with other people, allowing them to reason about it. Allowing us to conveniently discuss the ideas presented, and easily make corrections or suggest alternatives.

So essentially I’m looking to create a tool that allows me to describe a system easily (structure + behavior). A tool that efficiently creates the relevant diagram and allows me to visualize the behavior on it.

Constraints and Boundary Conditions

Setting out to implement this kind of tool, as a proof of concept, I outlined for myself several constraints or boundary conditions I would like to maintain, both from a “product” point of view as well as from an engineering implementation point of view.

  1. The description should be text based, so we can easily share system descriptions as well as version them using existing versioning tools, namely git.
  2. The tool should be easy to ramp up to.
    1. Just load and start writing
    2. Easy syntax, hopefully intuitive.
  3. Designs should be easily shareable – a simple link that can be sent, and embedded in other places.
  4. There should not be any special requirements for software to use the tool.
    1. A simple modern browser should be enough.

Scenaria

Enter Scenaria (git repo). 

Scenaria is a language – a simple DSL, with an accompanying web tool. The tool includes a simple online editor, and a visualization area. You enter the description of the system in the editor, hit “Apply”, and the system is displayed in the visualization pane.

Scenaria Screenshot

The diagram itself is heavily inspired by technical architecture modeling. The textual DSL is inspired by PlantUML. You can play with the tool here, and see a more detailed explanation of the model and syntax here.

Discussion doesn’t stop with a purely static diagram. The tool also allows you to describe and visualize interactions between the different components. You can describe several flows, which you can then “play” on the drawn diagram. You can step through a scenario or simply play it from start to finish.

After this is done, you have a shareable link, as part of the application, which you can send to colleagues (or keep).

As a diagramming tool, it’s pretty lacking. But remember that the purpose here is not to necessarily create beautiful diagrams (though that’s always a plus). It’s mainly about enabling a conversation, efficiently. So there’s a balance here between being expressive in the language, while not going down the route of adding a ton of visualization features which will distract from the main purpose of describing a system or a feature.

Scenaria is intended more as a communication tool, to be used easily in the discussions we have with our colleagues. It can serve as a basis for further analysis, as it provides a way to structure the description of a system – its structure and behavior. But the focus isn’t on a rigorous formal description from which working code can be derived. It’s not intended for code generation. It’s about having something to point at when discussing design – something you can easily create and share, based on a system model.

An Example

An example scenario can be viewed here. This example shows the main components of the Scenaria app, with a simple flow showing the interaction between them when the code is parsed and shown on screen.

Looking at the code of the description, we start by enumerating the different actors cooperating in the process:

user 'Designer' as u;
agent 'App Page' as p;
agent 'Main App' as app;
agent 'Editor' as e;
agent 'Parser' as prsr;
agent 'Diagram Drawing' as dd;
agent 'ELK Lib' as elk;
agent 'Diagram Painting' as dp;
agent 'Diagram Controller' as dc;

Each component is described as an agent here, with the user (a “Designer”) as a separate actor.

We then define an annotation highlighting external libraries:

@External {
  color : 'lightgreen';
};

And annotate two agents to mark them as external libraries:

elk is @External;
e is @External;

Note that up to this point we haven’t defined any interactions or channels between the components.
Now we can turn to describe a flow – specifically what happens when the user writes some Scenaria code and hits the “Apply” button:

'Model Drawing' {
    u -('enter code')-> e
    u -('apply')->p
    p -('reset')-> app

    p -('get code')-> e
    p --('code')--< e

    p-('parseAndPresent')-> app
        app -('parse')-> prsr
        app --('model')--< prsr
        app -('layoutModel') -> dd
            dd -('layout') -> elk
            dd --('graph obj')--< elk
        app --('graph obj')--< dd

        app -('draw graph')-> dd
            dd -('draw actors, channels, edges')->dp
        app --('painter (dp)')--< dd

        app -('get svg elements')->dp
        app --('svg elements')--<dp
        
        app -('create and set svg elements')->dc


    p --('model')--< app

};

We give the scenario a name – “Model Drawing” – and describe the different calls between the cooperating actors: calls are written with the -('message')-> arrows, and return values with --('value')--<. Indentation is not required; it’s added here only for readability.

The interactions between the agents implicitly define channels between the components, so when the diagram is drawn, it includes the relevant channels.

At this point, the application allows you to run or step through the given scenario, showing the different messages and return values as described in the text.
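To make the implicit-channel behavior concrete, here is a minimal sketch in the same notation (the actor names and scenario name are made up for illustration; it reuses only the declaration and call syntax shown above). A single declared call is enough to produce a channel between its two actors when the diagram is drawn:

```
user 'Designer' as u;
agent 'App Page' as p;

'Apply Click' {
    u -('apply')-> p
};
```

No separate channel declaration appears anywhere; the channel between u and p is derived from the call itself.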


Next Steps

This is far from a complete tool. At this point it’s basically a proof of concept, a sort of early prototype, and I hope to continue working on it as I try to embed it into my daily work and see what works and what doesn’t.

Some directions and features I have in mind that I believe can help promote the goals outlined above:

  1. Better diagramming: better layout, support for component hierarchies.
  2. Diagram features: comments on the diagram (as part of steps?), titles, notes.
  3. Scenario playback: allow for branches, parallel step execution, self calls.
  4. Versioning of diagrams: show the evolution of a system, milestones for development, etc.
  5. Integration with other tools:
    1. Wikis/markdown (a “design notebook”?)
    2. Slack and other discussion tools
    3. Links to other modeling tools, showing different views of the same model
  6. A view-only mode: allow sharing only the diagram, with playback of scenarios.
    1. Allow embedding just the SVG into other tools, e.g. as a widget in Google Docs.
  7. Better application UX (admittedly, I’m not much of a user-interface designer).
  8. Team collaboration features beyond version control.

Contributions, feedback and discussions are of course always welcome.

One Song

(Written on 19.3.2023)

Somewhere in the first half of the 1990s, as an eleventh-grade teenager, I took part in a trip to Poland.

Setting aside the criticism of these trips, one moment in particular has stayed with me to this day.

As fate would have it, on the very day our delegation visited Auschwitz, a delegation of cadets from Bahad 1, the IDF officers’ school, was visiting as well. And so, while we wandered among the blocks of Auschwitz I, wrapped in or waving Israeli flags, I found myself a spectator to a half-surreal scene: right there, among the blocks, the Bahad 1 cadets raised the IDF flag and the flag of Israel, and in a small, modest ceremony sang “Hatikvah”.

And there I stood, half mesmerized, while ten meters away one of our delegation’s escorts, a Holocaust survivor who had joined us as a witness, wiped away a tear. That sight has never left me.

Later that day we sang that same song atop the blown-up crematorium in Birkenau. And the words that Yaakov, my friend in those days, had said to me the night I flew to Poland kept echoing in my head – “Remember that every place you enter, you have the right to leave.”

Those words, together with those images, left their mark. That very evening I already knew that, as far as it depended on me, I would be a combat soldier in the IDF.

A few years later, at the beginning of 1996, on a spring evening on the parade ground at Latrun, I sang that same song, mere minutes after swearing “to devote all my strength, and even to sacrifice my life, for the defense of the homeland and the liberty of Israel”. Four months later I pinned the “tzafargol”, the infantry warrior’s badge, on my chest. I sang that same song and remembered that same promise from Poland.

27 years have passed since then.

And last Saturday, in Ness Ziona, I stood next to my son as he was wrapped in an Israeli flag. Together with his little brother and their mother, we sang that same song at the close of a demonstration.

And suddenly, those exact words, which I had sung dozens of times before with great pride and a puffed-out chest, take on a very different meaning.

Because our freedom, even though we sang about it, was always there. We needed the trip to Poland to learn, and to remember, why it matters. It was always clear what we were fighting for, but there was never a moment’s doubt that we were free. We knew we had to stand guard, but always as free people. The justice of our cause was self-evident.
Maybe the road itself was full of bumps, and sometimes we strayed from it. But there was no doubt why we were there, and what we were singing those words about.

And suddenly, it’s a little less clear.
Suddenly, when we sing about “being a free people in our land”, it’s no longer self-evident. And the song is the same song, the words are the same words, the flag is the same flag, the army (by and large) is the same army.

Only the hope – “hatikvah” – is different.