In my last two posts, I dove into an implementation of AI agents in the area of software design, specifically having different LLMs debate a design problem. At the end of my last post I shared some initial thoughts on how this fits into a bigger picture of software development life cycle in the age of AI.
I’ve argued before that AI has more potential than simply generating code faster than humans. AI-based design debates are, I believe, just one component in a broader ecosystem of techniques and tools that mechanize much of the process of evolving and developing a software-based system. Software development involves much more than simply coding the desired behavior. From this perspective, program code becomes just another substrate on the path between human thought and working software – a substrate increasingly detached from human supervision[1].
Consider the combination of:
- AI-based design discussions
- Spec-driven development
- AI-coding agents
- Continuous deployments
- Architectural fitness functions
Can we envision a process where a human provides the necessary definition of required functionality, with some non-functional requirements thrown in, and a machine, driven by AI, picks it up and iterates through a continuous loop to build and evolve the system?
I imagine something like this:

(arrow direction represents data flow – read and write direction)
A human developer[2] provides requirements to implement in the system and goes through some clarification Q&A cycle, but eventually hands off the design and implementation to a set of AI agents that break the requirements down into an actionable implementation plan and implement it, all the way to deployment. The new code is then deployed, the architecture and code are updated, and these become the basis for the next feature or bug fix to be implemented. This cycle proceeds as long as new requirements (including bug reports) are fed into the system. The system evolves through a set of agents that cooperate on specific tasks, handing off the different artifacts and making changes, potentially over several iterations.
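The hand-off loop described above can be sketched as a chain of agents, each consuming the previous artifact and producing the next. Everything here is a stand-in: the `Artifact` type, the agent names and their canned outputs are invented for illustration; in a real pipeline each agent would be LLM-backed and the artifacts would be rich documents, not strings.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """A structured hand-off between agents (spec, plan, diff, report...)."""
    kind: str
    content: str


def run_cycle(requirement: str, agents: list) -> list[Artifact]:
    """Pass a requirement through a chain of agents; each agent consumes
    the previous artifact and produces the next one, leaving a full trail."""
    artifacts = [Artifact("requirement", requirement)]
    for agent in agents:
        artifacts.append(agent(artifacts[-1]))
    return artifacts


# Hypothetical stand-ins for LLM-backed agents:
def spec_agent(a: Artifact) -> Artifact:
    return Artifact("spec", f"spec for: {a.content}")

def plan_agent(a: Artifact) -> Artifact:
    return Artifact("plan", f"plan for: {a.content}")

def code_agent(a: Artifact) -> Artifact:
    return Artifact("diff", f"diff implementing: {a.content}")


trail = run_cycle("add rate limiting to the API",
                  [spec_agent, plan_agent, code_agent])
```

The trail of artifacts is the point: every stage leaves a machine-readable record that the next iteration of the loop can pick up.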
The concept of fitness functions is crucial. We know that LLMs require a way to perceive their environment, and the implications of their changes, so they can reason and act on them. We also know that system design, especially when evolving a system, is often shaped by the existing architecture and by how the system actually behaves at runtime. This is more than code design; it is often about the runtime and operational properties of the system being built or improved, e.g. performance or security. Some design decisions also affect how the system will be built, observed and run.
This notion of evolutionary architecture isn’t new, but mechanizing it becomes even more important when machines determine the next iteration of the system architecture. It’s a way for LLM-driven agents to perceive the system they operate on.
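As a heavily simplified illustration, an architectural fitness function can be just an executable assertion over a runtime metric – the signal an agent perceives after each change it deploys. The `p95_latency_ms` stub and the budget value are assumptions for the sketch; in practice the metric would be queried from an observability backend.

```python
def p95_latency_ms() -> float:
    # Stub: in a real setup this would query a metrics/observability API.
    return 180.0


def fitness_latency(budget_ms: float = 250.0) -> bool:
    """Pass if the service's p95 latency stays inside its budget."""
    return p95_latency_ms() <= budget_ms


# A registry of executable checks the agents evaluate after every change.
FITNESS_FUNCTIONS = {"p95_latency": fitness_latency}


def evaluate() -> dict[str, bool]:
    """Run all fitness functions and report pass/fail per property."""
    return {name: fn() for name, fn in FITNESS_FUNCTIONS.items()}
```

A failing entry in this report is what would send an agent back into another iteration of the loop.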
Most of the puzzle pieces are already in place.
We have pretty good coding agents that can index and reason about existing code. We already see the beginnings of technical specification agents. At this point they are interactive (human-in-the-loop), but given enough accurate context, I believe there’s a clear trajectory for such agents to become more independent. Deployment is essentially CI/CD, augmented with some quality controls (testing, code reviews, linting, documentation, etc.). And depending on the desired fitness functions, we can probably implement most of them with existing observability tools and APIs.
What emerges is what I call an Autonomous Software Development Lifecycle. A system where design choices, specifications and implementations are done by agents that exchange structured artifacts and feed into one another, with very little human intervention. Humans are part of the game in two main aspects: defining the expected system behavior, and tuning the system behavior through some well defined interfaces, notably the definition of the non-functional requirements.
Supervision can also be autonomous. There’s no reason AI agents can’t be connected to existing observability tools and actively respond to them. Given enough accurate data, a capable LLM should be able to derive a decent fix for problems that come up. Combined with existing methodologies like blue/green deployment, redundancy, etc., we effectively get a self-healing system.
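A minimal sketch of what such autonomous supervision could look like: a rule that maps an incoming alert to a remediation. The alert fields and remediation names are hypothetical; a real system would back this decision with an LLM and far richer context, but the shape of the loop – observe, decide, act – is the same.

```python
def handle_alert(alert: dict) -> str:
    """Decide a remediation for an observability alert.

    Proposes a rollback when the error rate spikes right after a deploy
    (the blue/green case from the text), escalates to a human otherwise.
    """
    if alert["metric"] == "error_rate" and alert["value"] > alert["threshold"]:
        if alert.get("recent_deploy"):
            return "rollback"       # blue/green: shift traffic back
        return "open_incident"      # no obvious cause: bring in a human
    return "ignore"                 # metric within bounds, nothing to do
```

Even this toy rule shows the two modes the post discusses: act autonomously when the cause is clear, add human friction when it isn’t.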
Taking it a step further, we can consider a scenario where such a collection of agents also proactively improves the system it maintains – a self-improving system. An example of this would be a new technology becoming available that allows for a better implementation of existing functionality: a faster DB, say, or a better VM that allows doing more work with fewer network calls.
These last two points about proactive agents present a shift from how we operate with coding agents today. Most coding agents today are reactive to our requests. Agents that respond to a changing environment, rather than to us, and act on it make for a proactive system that autonomously improves the software. We’re already seeing this at the coding level in some cases. But is there a fundamental barrier to doing this at the system level, given enough accurate context?
Really?
I realize the vision I present here is utopian in a lot of ways and may in fact appear unrealistic. We’re not there yet. There are technical limitations and cost is an issue[3]. And at this point in time LLMs are not yet reliable enough to “own” a system in production in such a way. Generally speaking there are non-trivial challenges to solve here.
But I do believe we’re starting to see these patterns emerge, and this is a reasonable extrapolation of current advancements in LLMs and the surrounding ecosystem when applied to software development. Especially when it’s integrated with existing observability, software engineering, project management and other relevant tools of the trade.
It’s also interesting to see how this vision is not unique to software development. In a recent interview Satya Nadella was asked (among a lot of other things) about his vision of using models in Microsoft[4]. He outlines a future where MS-Office apps, Excel in his example, are used autonomously by AI agents. The essence of the tool is not the UI the model works with; it’s the underlying functionality (“logic”) the model integrates with. The focus in that conversation is on the business implications of this kind of development, which I won’t dive into here. But the idea resonated with me when thinking about coding, software engineering and operation in general. The human-centric UX of a tool becomes secondary when AI is involved, but the tool’s functionality can still be available and relevant for AI agents to use. As more and more software engineering tools become available for AI to use, the integration seems inevitable.
We’re training and focusing AI on using the tools we know and use for the tasks we need – the tools that were built for us. The emphasis we currently put on code quality is mostly driven by human involvement in the code. When humans need to read and review the code, we judge and build coding agents in a way that emphasizes code metrics and quality that are (rightfully) important for humans. But if you take the human interface – the programming language – out of the coding loop, we can probably relax some of these requirements, and let agents use tools that do coding even if the output is not optimal for human consumption. Instead, we should give agents the tools to perceive and assess their results based on what actually matters: the runtime behavior of the system.
In all likelihood, we’ll get there in small steps, realized in stages, with different parts of the puzzle implemented by different teams at different times. Even if all the pieces are integrated, I believe we’ll always have the ability to adopt it partially. We’ll probably also see different teams adopting varying levels of such capabilities, similar to how different software development teams today adopt different languages and tools depending on the type of software they build. Some types of software are still better off developed in Assembly or C. In other cases, teams abandon the use of ORM frameworks, even though they can make their coding life easier. It’s quite possible we’ll see teams still doing investigations themselves, or carefully feeding planning and coding agents with hand-curated context and tools. But there is a potential here to achieve an order of magnitude more efficiency if we are willing to let go of some control.
What Do We Need To Get There?
As I wrote above, I don’t think we’re there yet.
We’d obviously need to standardize the way agents communicate. Given the number of formats and notations available for expressing practically anything in software, the challenge will be more about agreeing on a notation than inventing one (though that’s always an option).
The challenge may lie not in having a communication format, but in having an efficient one. We’re already seeing examples of this (here, here).
Standardizing on protocols and notation is the easier problem. Agents will need to communicate with one another using the semantics of the different activities in the SDLC: how projects are organized, how a plan is built, and how it translates to components of the running system. Luckily, decades of humans building software are already baked into LLMs through their training. Mapping some of these ideas may sometimes be a challenge, though it seems LLMs can bridge the gap[5]. Mapping semantics between tools seems to be the easier part; combining the various domains of a whole project may be a bit more challenging. Bridging the gap between project planning, product roadmap and technical planning/constraints is usually done in people’s minds, and in discussions. Adapting LLMs to do this is, I think, achievable, though not generally trivial[6].
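To make the idea of structured artifact exchange concrete, here is a toy hand-off envelope. The envelope fields (`sender`, `receiver`, `kind`, `payload`) are invented for illustration; the actual point is agreeing on a shared, machine-checkable contract between agents, whatever its exact shape ends up being.

```python
import json


def make_handoff(sender: str, receiver: str, kind: str, payload: dict) -> str:
    """Serialize an inter-agent hand-off into a canonical JSON envelope.

    sort_keys gives a stable byte representation, which makes artifacts
    easy to diff, log and audit as they flow between agents.
    """
    message = {
        "sender": sender,
        "receiver": receiver,
        "kind": kind,        # e.g. "spec", "plan", "diff", "fitness-report"
        "payload": payload,
    }
    return json.dumps(message, sort_keys=True)


# A spec agent handing a feature spec (with a non-functional requirement)
# to a planning agent:
msg = make_handoff("spec-agent", "plan-agent", "spec",
                   {"feature": "rate limiting", "nfr": {"p95_ms": 250}})
```

Note how the non-functional requirement travels inside the artifact – exactly the kind of structured context a downstream agent needs to plan against.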
Of course, having the technical protocols, and ways to encode semantics doesn’t mean we can do it efficiently at scale. So adapting these to be used by LLMs efficiently is of course key to realizing it.
So What’s Left For Humans?
Obviously humans are still left to define requirements and priorities. Current SDD tools don’t seem to address this fully, at least not yet. But there are already attempts to show how such a process can look. It’s clear that there’s still a way to go for automatically translating software requirements to technical specifications. Still, the foundations are there, and one can imagine how this process is realized. I also don’t see LLMs weighing in on business constraints and trade-offs. Taking into consideration social circumstances and constraints is, well, human.
Even on the pure engineering side of things, I expect humans will still be needed, but not so much for their ability to express business application logic in code as for their ability to reason about system behavior[7]. The future software engineer will need to understand systems engineering at a very fundamental level, and be able to translate that understanding into specifications and requirements for AI agents to work on. For example, understanding what causes the application to suddenly slow down when a certain event hits, or spotting a sporadic race condition that happens because of the distributed nature of the system. Experienced software engineers can spot issues like these, especially in a system they know, from just a cursory look at the UI or the logs. Systems thinking and understanding will probably become a much more important and sought-after skill.
In practice, this will probably mean that software engineers are much more occupied with defining and regulating the “fitness functions” than with whether a given snippet of code is readable or violates the DRY principle. That could mean modifying the specifications, but also having a “widgets and knobs” dashboard-like experience where different properties are exposed, allowing engineers to tune and configure the system according to their understanding.
Even if such a system (or ecosystem) of agents materializes, I expect brownfield projects will take time to adapt to such a methodology. There will be a lot of work in feeding and adapting existing artifacts (code, documents, specifications) to reverse engineer the implicit knowledge that is often assumed or communicated verbally.
Another place where I believe there’s no replacement for humans yet is innovation. New technologies that could change how we interact with computers or build systems will most probably require us to train LLMs to program with and use them. Think of new hardware, applied mathematics or new algorithms. These are all examples where new application and technical patterns are formed, and those patterns still need to be taught to LLMs[8].
Similarly, integrating across modalities, or interactions with the physical world will probably require more guidance.
Good engineering – understanding how a system fundamentally works and how it can/should change – will not be replaceable so easily by LLMs. The more we encode it into descriptions, the more we can get done. But these are, at the end of the day, textual probabilistic models. Training on larger datasets will help, but not replace understanding.
It’s the way we build and modify these systems that will change. It is about letting some of the obvious repeating patterns sort themselves out.
Is This Necessarily a Good Thing?
Beyond the technical challenges of realizing this vision, I think it’s also important to ask ourselves whether this state of affairs is a good thing – will it lead to greater success in software delivery, without compromising quality and safety?
This is not about avoiding it[9], but rather about articulating the necessary constraints or guardrails so we can avoid a “less than ideal”[10] outcome.
A reality where software is created and modified with zero friction can quickly become risky, especially with mission-critical systems. Friction-inducing mechanisms, e.g. compliance and risk assessment, exist for a reason. Software production is no different, especially as software is already a critical component of modern life. If we do reach a point where software is created autonomously by LLM-driven agents, any human intervention is essentially such friction. Today we usually experience this friction as a hindrance to efficient and effective delivery. But in a world where most human inefficiencies are taken out of the equation, this friction[11] of human intervention may actually hedge some (all?) of the risks.
So we need to ask ourselves where human involvement is in fact a positive thing, and where it interacts with the SDLC. I believe that, as a rule of thumb, decisions that impact humans should be taken by humans. For example, defining who has access to what piece of data is essentially a human decision directly affecting humans; similarly, how long to keep transaction data has legal and social implications. Contrast this with the decision of whether to use a linked list or a simple array, or how to decompose a system into separate services – these may affect how the system performs, or how long changes take, but they do not directly affect the human experience, and can more easily be relegated to AI agents.
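The rule of thumb above could even be encoded as an explicit routing policy. The decision categories here are hypothetical examples drawn from the text; the point is that the boundary between human and agent decisions becomes itself an explicit, reviewable artifact rather than an accident of workflow.

```python
# Hypothetical decision categories, following the rule of thumb:
# decisions that impact humans go to humans, the rest to agents.
HUMAN_DECISIONS = {"data_access", "data_retention", "user_facing_change"}
AGENT_DECISIONS = {"data_structure", "service_decomposition", "refactoring"}


def route_decision(category: str) -> str:
    """Route a decision category to a human gate or to autonomous agents."""
    if category in HUMAN_DECISIONS:
        return "human_approval"
    if category in AGENT_DECISIONS:
        return "agent_autonomous"
    return "human_approval"  # default to friction when the category is unknown
```

Defaulting unknown categories to human approval is the built-in “human-friction” argued for below: when in doubt, slow down.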
This autonomous SDLC has to have some “human-friction” built into it, if only for the sake of safety if not for better results. The exact mechanisms are yet to be seen, but they should be there.
So yes, mechanizing the software development life cycle is an overall good thing for efficiency, and has the potential to alleviate many of the problems plaguing the software industry. As a corollary, it can also induce a wave of innovation if software is easier and cheaper to create[12]. But we also need to make sure we’re not giving up on human common sense, intuition and the ability to innovate. We have to stay conscious of what software is getting built, and how it affects us, especially when it evolves more easily.
Let the agents rise!
(but keep an eye on what they’re doing)
1. And we can argue whether it’s really important given how detached it becomes from humans; e.g. is readability by humans that critical? ↩︎
2. Or however we’d like to call this role ↩︎
3. Though I believe it could be offset by the savings in development and down times ↩︎
4. This actually wasn’t the exact question, but this is roughly where he went with it ↩︎
5. And more formal semantics may prove useful. ↩︎
6. And we should consider – should we solve it in a general manner? What if we start by doing it only for web applications? Or mobile applications? ↩︎
7. Remember the point about System ≠ Software, here. ↩︎
8. Though world models may prove to be different here, but admittedly I’m not an expert. ↩︎
9. Not trying to be a Luddite. ↩︎
10. And I will let the dystopian genre experts suggest what is “less than ideal”. ↩︎
11. Should I say “efficiency-dampening factor”? ↩︎
12. But note – not necessarily easier to operate. ↩︎
