
AI Adoption Roadmap for Software Development

I’ve argued before that LLMs’ greatest promise in software engineering lies beyond raw code generation. While producing code remains essential, building scalable, cost-effective software involves far more: requirements, architecture, teamwork and feedback loops. The end goal is of course producing useful and correct software, economically. But the process of producing software, especially as the organization scales, is much more than that.

So how do we adopt AI across a growing software organization—efficiently and at scale?

We’ve gone through¹ paradigm shifts before – agile, microservices, DevOps are some examples. Is AI different in some more profound way, or another evolutionary step?

I believe this is a slightly different story, when compared to other technologies, at least when it comes to the practice of SW development.

First, this is an area that’s still being actively researched, with advancements in research and technology being announced all the time. New models and papers drop constantly, fueling FOMO and risk of distraction. Teams can quickly feel overwhelmed without a clear adoption path.

Second, it seems that a technology that sits at the intersection of machines and human communication (because of natural language understanding), has the potential to disrupt not only the technical tools we use, but our workflows and working patterns at the same time. AI feels less like another toolchain and more like a collision of Agile and microservices – reshaping not just code, but communication flows themselves. This may be going too far, but I sometimes imagine this is the first time Conway’s law might be challenged.

The AI ecosystem, especially in the software engineering space², is abundant with tools and technologies. The rate of current development is staggering, and it’s getting hard to keep up with the tools, patterns and techniques being developed, announced and shared.

Randomly handing teams new AI toys can spark short-term wins. But to unlock AI’s transformative power, we need to be more intentional about it. We need a deliberate adoption roadmap.

Our aim: weave LLMs into daily software engineering to maximize impact. But with tools and standards still maturing, a rigid, long-range plan is unrealistic. There are few substantial case studies that show adoption at scale at this point. Similar to the early days of the World Wide Web, some imagination and extrapolation is required³, and naturally some of it will be wrong or will need to be updated in the future.

It’s natural to chase faster coding as the low-hanging fruit. Yet AI’s true potential lies in higher-level workflows. Since I believe the potential is much greater, I try to follow a slightly more structured approach to navigating this challenge.

This is my attempt to think through and articulate an approach for adopting AI in a software development organization. It’s positioned as a (very) high level roadmap for adopting AI in a way that benefits the organization and is hopefully viable and efficient at the same time.

This will probably not fit every organization. Specifics of business, architecture, organizational structure and culture will probably require adapting this, even significantly. Still, I believe this can be used as a framework for thinking about this topic, and can serve at the very least as a rough draft for such a roadmap.

I will of course be happy to hear feedback or how others approach this challenge, if at all.

Before diving into details of such a suggested roadmap, I will need to introduce a preliminary concept which I believe to be central to the topic of AI adoption – AI Workspaces.

AI Workspaces

Most AI technology today focuses on transactional tool usage – a user asks something (prompts), and the AI model responds, potentially with some tool invocations. The utility of this flow is limited, mainly because crafting the prompt and providing the context is hard. Some AI tools provide facilities and behind-the-scenes code that injects further context, but this is still localized, and not always consistent. From the user’s point of view it’s still very transactional.

In order to realize more of the potential AI has for simplification and automation, we need to consistently apply and provide context that is updated and used whenever needed. We need to allow a combination of AI tools with the relevant up-to-date context so more complicated tasks can be achieved. Also, the more autonomy the AI has, the easier it will be for users to apply it successfully.

I’m proposing that we need to start thinking about an “AI workspace”.

An AI workspace is a combination of:

Basic AI tools, e.g. models used, MCP servers, with their configuration.
Custom prompts, usually focused on a task or set of tasks in some domain.
Persistent memory – a contextual knowledge source, potentially growing with every interaction, that is relevant to tasks the AI is meant to address.

The combination of these, using different tools and techniques, should provide a complete framework for AI agents to accomplish ever more complex tasks. The exact capabilities depend of course on the setup, but the main point is that all of these elements, in tandem, are necessary to create more elaborate automation.
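To make the three elements a bit more tangible, here is a minimal Python sketch of how a workspace could be represented. Everything in it is an assumption made for illustration: the `ToolConfig` and `AIWorkspace` names, the `kind` labels, the sample prompt and the way context is assembled are hypothetical and do not correspond to any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class ToolConfig:
    """A basic AI tool: a model or an MCP server, plus its configuration."""
    name: str
    kind: str  # hypothetical labels: "model" or "mcp_server"
    settings: dict = field(default_factory=dict)


@dataclass
class AIWorkspace:
    """Hypothetical container for the three workspace elements."""
    name: str
    tools: list[ToolConfig] = field(default_factory=list)
    prompts: dict[str, str] = field(default_factory=dict)  # task name -> custom prompt
    memory: list[str] = field(default_factory=list)  # persistent notes, grows over time

    def remember(self, note: str) -> None:
        """Append a note to persistent memory, skipping exact duplicates."""
        if note not in self.memory:
            self.memory.append(note)

    def build_context(self, task: str, request: str) -> str:
        """Combine the task prompt, the accumulated memory and the user's request."""
        prompt = self.prompts[task]
        notes = "\n".join(f"- {note}" for note in self.memory)
        return f"{prompt}\n\nKnown context:\n{notes}\n\nRequest: {request}"


workspace = AIWorkspace(
    name="checkout-service",
    tools=[ToolConfig("example-model", "model", {"temperature": 0.2})],
    prompts={"review": "Review the change against our conventions."},
)
workspace.remember("Payments are idempotent by order id.")
workspace.remember("Payments are idempotent by order id.")  # duplicate, ignored
context = workspace.build_context("review", "Check the retry logic.")
print(context)
```

The point of the sketch is only that the prompt, the tools and the memory live together and are reused across requests, instead of being rebuilt by the user for every interaction.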

A key point here is the knowledge building – the persistent memory. I expect that an AI workspace is something that’s constantly updated (automatically or by the user) so the AI can automatically adapt to changing circumstances, including other AI-based tasks. There should be a compounding effect of knowledge building over time and being used by AI to perform better and more accurately.

An AI workspace should be customized for a specific task or set of tasks. But it can be more useful if it will be customized for a complete business flow that brings together disparate systems and roles in the organization. This will arguably make the workspace more complex and harder to set up, but if used consistently over time, the overhead might be worth it.

We’re already seeing first signs of this (e.g. Claude Projects), but I expect this to go beyond the confines of a single vendor platform, potentially involving several different models, and be open to updates/reading from agents⁴.

A Roadmap – General Framing

As I’ve already noted, using AI, in my opinion, is more than simply automating some tasks. Automating is great, and provides value, but the potential here is much greater. In order to realize the greater potential we need to leverage the strengths of LLMs, and point them at the right challenges we face in our day to day work in software development.

And these strengths generally boil down to:

Understanding natural language (and other, more formal, languages)
Being able to respond and produce content in natural language (and other, more formal, languages)
Understanding patterns in the input, reasoning about them, and applying patterns to the output.

And do all of this at scale.

Looking at the challenges of software development, our general bottlenecks are less in code production, and more in understanding, communicating and applying our understanding effectively. This includes understanding existing code, troubleshooting bug reports, understanding requirements, understanding system architecture, anticipating impact, translating requirements to plans etc.

Apart from actual problems we might face in all of these, there’s also a challenge of scale here. The more people are involved in the software production (larger organization), the larger the codebase and the more clients we have – the greater the challenge.

An immediate corollary of the way (non-trivial) software is built is that it’s not just a problem of software developers. There are more people involved in building, evolving and maintaining the software – devops engineers, product managers, designers, customer support etc. Many of the challenges are shaped by the communication patterns and motivations of these different roles.

So when it comes to adopting a technology that has the potential to encompass different workflows and roles, I’m looking at adoption from different angles.

Since this is a roadmap, there’s naturally a general time component to it. But I’m also looking at it using a different axis – the way different roles or workflows (tasks?) adopt AI, and at what point these workflows converge, and how exactly.

The general framing of the roadmap is therefore a progression across phases of different verticals of “types of work” or roles if you will.

Workflow Verticals

When building software⁵ we have different tasks, performed by separate cooperating professionals. I’d like to avoid the discussion on software project management methodologies, so suffice it to say that different people cooperate to produce, evolve and maintain the software system, each with more or less well defined tasks⁶.

Roughly speaking these workflows are:

Design and coding of the software: anything from infrastructure to application design, prototyping, implementation and debugging.
Testing and quality: measuring and improving quality processes – generating tests, measuring coverage, simulating product flows, assessing usage.
Incident management: identifying and troubleshooting issues (bugs or otherwise), at scale. This includes also customer facing support.
Product and Project management: analyzing market trends and requirements, guiding the product roadmap, rolling out changes, synchronizing implementations across teams.
Operations and monitoring: monitoring the system behavior, applying updates, identifying issues proactively, etc.

All of these tasks are part of what makes the software, and operates it on a daily basis. There’s obviously some overlap, but more importantly there are synergies between these roles. People fulfilling these roles constantly cooperate to do their job.

People doing these roles also have their own tools and processes, each in its own domain, with the potential to be greatly enhanced by AI. We’re already seeing a plethora of tools promising, with varying⁷ degrees of success, to optimize and improve productivity in all of these areas.

To name just a few examples:

Software coding is obviously being disrupted by AI-driven IDEs and agents.
Product management can leverage AI for analyzing market feedback, producing and checking requirements, simulating “what-if” scenarios, researching, etc.
Incident management can easily benefit from AI analyzing logs, traces and reports, helping to provide troubleshooting teams with relevant context and analysis of issues.
Testing can be generated and maintained automatically alongside changing code.
UX design can go from drawing to prototype in no time.

And I’m sure there are more examples I’m not even aware of. The list goes on.

The point here is not to exhaustively list all the potential benefits of AI. Rather, I argue that for the software organization to effectively leverage AI, it needs to do it across these “verticals”.

And as the organization and the technologies mature, we have better potential to leverage cooperation and synergies between these verticals.

This won’t happen immediately. It probably won’t happen for a while, if at all. But for that, we need to talk about phases of adoption.

Phases of Adoption

I try to outline here several phases for the adoption of AI. These phases are not necessarily clearly distinct. Progress across these is probably not linear nor constant. The point of this description is not so much to provide a concrete timeline. This is more about describing the main driving forces and potential value we can gain at each phase. Understanding this should help us plan and articulate better more concrete steps for realizing the vision.

You can look at these phases as a sort of “AI Maturity Level”, although I’m not trying to provide any kind of formal or rigorous definition to this. It’s more of a mindset.

Phase 1: Exploration and Basic Usage

At this phase, different teams explore the possibilities and tools available for AI usage. The current rate of innovation in this field, especially around software development is extremely high. Given this, I expect employees in different roles will experiment and try various tools and techniques, trying to optimize their existing workflows in one way or another.

At this point, the organization drives for quick wins, where people in different roles leverage AI tools for common tasks, share knowledge internally and learn from the community.

Covered scenarios at this point are localized to specific workflows and focus mainly on providing context to localized (1-2 people) tasks, as well as automation or faster completion of such localized tasks.

LLM and AI usage at this point is very much triggered and controlled by humans requesting and reviewing results. This work is very much task/workflow oriented at this point, with AI tools serving specific focused tasks. The human-AI interaction at this point is very transactional and limited in scope.

The organization should expect to gain the required fundamental knowledge of deploying and using the different tools securely and in a scalable manner, including performance, cost, operations, etc. At this phase, a lot of experimentation and evaluation is happening. It will be good to establish an internal community driving the tooling and adoption of AI. The organization should expect several quick wins and localized productivity gains.

I expect the learning curve to be steep in this phase, so a lot of what happens here is trial and error and comparison of different tools, techniques and models.

AI workspaces at this point, if they exist, are very much focused on the localized context of individual well-defined tasks. They are also probably harder to establish and operate (integrate tools, add information).

What would be the expected value?
Phase 1 focuses on achieving quick wins and localized productivity gains. By implementing AI code assistants, automated code reviews, AI-generated tests, and anomaly detection tools, the organization can quickly demonstrate immediate developer speedups, improved code quality, faster test coverage, and early incident learning.

This goes beyond a business benefit. It’s also a psychological hurdle to overcome. Concrete wins, such as fewer bugs and faster releases, build momentum and justify further investment in AI adoption while increasing developer satisfaction.

In addition, there’s going to be considerable technical infrastructure investment done at this point, e.g. model governance, cost management, etc. This infrastructure should be leveraged in the following phases as well, and is therefore critical. This phase provides a strong foundation for leveraging AI in future stages.

Phase 2: Grounding in Domain-Specific Knowledge

At this phase, having gained basic proficiency, the organization should expect to improve performance and scope of AI-enabled tasks by starting to build and expose organization-specific knowledge and processes to LLMs.

I expect that business-specific information (internal or external) can increase performance and open up possibilities to more tasks that can be improved using AI. Examples of knowledge building include better code and design understanding, understanding of the relationships between different deployed components, connecting product requirements to code and technical artifacts, etc.

This can open the road to higher level AI-driven tasks, like analyzing and understanding the impact of different features, simulating choices, detecting inconsistencies in product and technical architecture and more.

A key aspect of this phase is to facilitate a consistent evolution of the knowledge so it can be scaled and maintain its efficacy. At this point, the organization needs to have the infrastructure and efficient standards in place so information can be shared between roles, and between different AI-driven tools and processes.

In this phase AI workspaces become more robust and prevalent, encompassing a larger context, and even crossing workflow verticals in some cases. Contrast this with the workspaces we’ve seen in the first phase, which are more focused on localized contexts.

This phase is also when we start thinking in “AI Systems” instead of simply using AI tools. This is where we consistently apply and use AI workspaces, with several tools (AI or non-AI) being combined with the same knowledge base, and evolve it together.

An example of this would be AI coding agents that automatically connect their implementation to JIRA tickets and product requirements, and record this knowledge. Other AI agents can then leverage this knowledge to map it to design decisions and testing coverage reports (how much of the product requirements are tested) and to plan rollouts.

What value can we expect to have at this point?
Phase 2 is mainly around integrating company-specific (and company-wide) knowledge with AI workspaces. At this point I expect existing workloads to be more accurate, precise and faster in doing their work, even if the task is limited in scope. The grounding provided by the specific knowledge graph should improve the accuracy of AI models.

Different workflow verticals will start to cooperate more closely at this point. First of all, by building a knowledge graph/base together. But also by leveraging this combined knowledge to implement simple agentic workflows, where AI-based agents start to reason on the data and make simple decisions.

Phase 3: Autonomous Cross Team Workflows

This is the point where previous infrastructure starts to really pay off in terms of increased productivity and quality.

At this phase of adoption, I expect we’ll see more autonomous AI-driven processes coming to fruition. And when I say “AI-driven” I’m not referring to simply automating a well known process. I’m referring to AI agents reasoning and dynamically using tools and other agents to adapt and produce results/do tasks⁸. I expect at this point AI agents can also build their own knowledge, and adapt their work to accommodate changes in the environment.

Humans are still in the loop for critical decision making, but the friction between humans and tools, and between humans themselves, is significantly reduced⁹. The focus at this point should be on eliminating bureaucracy and increasing the adoption of consistent and increasingly robust workflows. This generalization also means that agentic AI systems now work across roles and departments. It’s where the workflow verticals start to converge.

Examples of this would be:

Managing changes across roles and workflows. For example, a change in UX/product feature definition that is automatically reflected in plans, and rolled out to clients.
Technical design that is validated against technical dependencies (from other teams), past decisions and project plans. This could include updating the dependencies and informing other agents, potentially changing agent decisions as a result.
Identifying cross-cutting issues from internal conversations, correlated with support tickets and other metrics, and proactively planning and suggesting resolutions.

At this phase, I expect AI workspaces to become really cross-departmental and leverage knowledge being built and added in different verticals.

Ad-hoc exploration and automation of tools should also be possible. At this point, the organization should have a strong foundation of tooling and experience with applying AI. It should be possible to allow ad-hoc building of new flows on top of the existing LLM infrastructure and the ever-evolving organizational knowledge base.

Note that this also poses a challenge: there is a fine line between standardization of tools, which drives efficiencies at scale, and democratization of capabilities. You want people to experiment and find new ways to optimize their work, but in order to efficiently grow you’ll need to apply some boundaries to what is used and how it’s used. This tradeoff isn’t unique to AI systems, but I believe it will become more emphasized when we consider new directions and applications of LLMs as the technology improves.

In terms of expected value, we should expect significant productivity gains. While humans are still in the loop, AI will further automate processes, reducing bureaucracy. The focus will be on adoption of consistent productive workflows across roles and departments. Human focus should be on innovation and decision making at this point, with accurate and reliable information being made available to humans, by the machines¹⁰.

Technical Infrastructure

In order to support this process, looking at the expected phases of adoption, we should pay attention and plan the necessary technical infrastructure investment. This is true with the adoption of any new technology, but with the current explosion of tools and techniques, it’s very easy to lose focus.

I won’t pretend to know exactly which tools should be available at what point. Nor do I expect to know a definitive list of tools and compare them at this point¹¹. But in order to plan ahead investments, and make a concerted effort on learning what will help us, I believe we can give some idea of what will be needed at each phase of adoption.

In phase 1, we naturally explore a plethora of tools. We should be able to facilitate new models for different use cases. Enabling access to different models using tools that provide a (more or less) uniform facade is useful. Examples of this are OpenWebUI and LiteLLM. We should provide access to AI-driven IDEs, like Cursor, Windsurf and similar ones.
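The idea of a uniform facade can be sketched in a few lines of Python. This is not how any of the tools mentioned above work internally; the `ModelFacade` class, the model names and the stand-in backends are hypothetical, and a real backend would call a vendor SDK or an HTTP API instead of a local function.

```python
from typing import Callable, Dict

# A backend is any callable that takes a prompt and returns text.
Backend = Callable[[str], str]


class ModelFacade:
    """Hypothetical single entry point over several model backends."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, model_name: str, backend: Backend) -> None:
        """Make a backend available under a model name."""
        self._backends[model_name] = backend

    def complete(self, model_name: str, prompt: str) -> str:
        """Route the prompt to the named backend; callers never see vendor details."""
        if model_name not in self._backends:
            raise KeyError(f"unknown model: {model_name}")
        return self._backends[model_name](prompt)


# Stand-in backends used only to keep the sketch runnable offline.
def echo_backend(prompt: str) -> str:
    return f"echo: {prompt}"


def shouting_backend(prompt: str) -> str:
    return prompt.upper()


facade = ModelFacade()
facade.register("small-model", echo_backend)
facade.register("large-model", shouting_backend)

reply = facade.complete("small-model", "summarize this diff")
print(reply)  # echo: summarize this diff
```

Teams can then swap or compare models by changing a name, which is what makes experimentation in this phase cheap.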

For non-development workflows, AI-based prototyping tools should be helpful, and vendor-specific AI extensions would be helpful. The same goes for monitoring tools.

Connecting these tools with MCP servers to existing hosts of MCP clients (IDEs, chat applications, etc) would probably be useful as well, so support for installing and monitoring MCP servers might be needed. At this point it should also be useful to establish some way to measure the effectiveness of prompts or model tuning, and to track usage of various tools.

In phase 2, building on and potentially maturing the infrastructure from phase 1, we should start focusing on more robust workflows and knowledge building. Depending on use cases, it could be useful to look at agent workflow frameworks (LangChain et al.) and agent authoring tools (e.g. n8n).

Additionally, knowledge management tools and processes will probably be useful to introduce – easily configured RAG processes (and therefore vector DBs), memory management techniques, maybe graph databases. This of course all depends on the techniques used for memory building and maintenance.
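For readers less familiar with RAG, here is a deliberately tiny Python sketch of the retrieval step. It swaps in word counts and cosine similarity for a real embedding model and vector DB, so treat it as an illustration of the flow only; the documents and the question are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


documents = [
    "The billing service retries failed payments three times.",
    "The search index is rebuilt nightly.",
    "Design decision: payments use an outbox table for events.",
]

question = "how are failed payments retried"
hits = retrieve(question, documents, top_k=1)

# The retrieved text is placed in front of the question before calling a model.
prompt = "Context:\n" + "\n".join(hits) + "\n\nQuestion: " + question
print(prompt)
```

The retrieved snippet becomes part of the prompt, which is the grounding effect this phase is after. Keeping the knowledge source current is the hard part, not the lookup.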

I expect MCP servers, especially ones specialized for the organization’s code and other knowledge systems, will become more central. It should be possible to also create necessary MCP servers that will allow LLMs to access and use internal tools.

In phase 3, I expect most of the technical features to be in place. This will be a phase where the focus will be more on optimizing costs and improving performance. It’s possible that we should be looking at ways to use more efficient models, and match models to tasks, potentially fine tuning models, by whatever method.

Monitoring the operation and costs of agents, understanding what happens in different flows will become more critical at this point, especially when usage scales up in the organization, and AI adoption increases, across departments.
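A minimal sketch of what cost monitoring per flow could look like, in Python. The `UsageLedger` class, the flow names, the model names and especially the prices are invented for illustration; real prices and token counts would come from the provider or gateway in use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price per 1,000 tokens. The numbers used below are invented."""
    input_per_1k: float
    output_per_1k: float


class UsageLedger:
    """Hypothetical ledger that accumulates model cost per workflow."""

    def __init__(self, prices: dict[str, ModelPrice]) -> None:
        self._prices = prices
        self._cost_by_flow: dict[str, float] = defaultdict(float)

    def record(self, flow: str, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one model call to the ledger and return its cost."""
        price = self._prices[model]
        cost = (input_tokens / 1000) * price.input_per_1k
        cost += (output_tokens / 1000) * price.output_per_1k
        self._cost_by_flow[flow] += cost
        return cost

    def report(self) -> dict[str, float]:
        """Total cost per flow, sorted by flow name."""
        return {flow: round(cost, 4) for flow, cost in sorted(self._cost_by_flow.items())}


ledger = UsageLedger({
    "small-model": ModelPrice(input_per_1k=0.5, output_per_1k=1.0),
    "large-model": ModelPrice(input_per_1k=4.0, output_per_1k=8.0),
})
ledger.record("test-generation", "small-model", input_tokens=2000, output_tokens=1000)
ledger.record("design-review", "large-model", input_tokens=1000, output_tokens=500)
ledger.record("test-generation", "small-model", input_tokens=4000, output_tokens=2000)

print(ledger.report())  # {'design-review': 8.0, 'test-generation': 6.0}
```

With per-flow totals like these, matching models to tasks becomes a data question: a flow that runs often on an expensive model is the first candidate for a cheaper or fine-tuned one.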

Summary

AI stands to transform software engineering far beyond code generation. Realizing that promise demands coordinated learning, infrastructure and a phased roadmap. This framework offers a starting point.

I believe that due to the nature of the technology, it goes beyond simple tool adoption, or alternatively adopting a new project management practice. This has the potential to change both aspects of work.

The structure I’m proposing is to highlight the potential in each “stream” of workflow vertical, and adopt the tools in phases of maturity, as the ecosystem evolves:

[Figure: roadmap illustration]

This visualization is only an illustration of course. You’ll note it’s laid out as a “layer cake” where scenarios for using AI are roughly laid out on top of other use cases/scenarios which should probably precede them.

This is of course not an exhaustive list.

The attempt here is of course to structure the process into something that can be further refined and hopefully result in an actionable plan. At the very least, it should serve as a guideline on where to focus research, learning and implementation efforts, to bring value.

It would be nice to know what other people are thinking when trying to structure such a process; or what the AI thinks about this.

On to explore more.

¹ Dare I say “weathered”?
² SW engineers being natural early adopters for this technology.
³ And we know how some attempts didn’t end well.
⁴ To be honest, I did not yet dive into the Claude projects, so it’s possible they support this. But I can imagine something similar done with other tools as well.
⁵ And probably in other industries as well, but I know software best.
⁶ I realize this is kind of hand-wavy, but bear with me. Also, you probably know what I’m talking about.
⁷ Ever increasing?
⁸ In a sense, leveraging test time compute at the agentic system level.
⁹ Although in some cases, friction is desirable – think of compliance, cost management, etc.
¹⁰ I guess accurate context is also important for humans, who would’ve guessed.
¹¹ And let’s face it, at the rate things are going right now, by the time I finish writing this, there will be new tools.

From Code Monkeys to Thought Partners: LLMs and the End of Software Engineering Busywork

When it comes to AI and programming, vibe coding is all the rage these days. I’ve tried it, to an extent, and commented about it at length. While a lot of people seem to believe this is a game changer for SW development, among experienced SW engineers there’s a growing realization that it is not a panacea. In some cases I’ve even seen resentment or scorn at the idea that vibe coding is anything more than passing hype.

I personally don’t think it’s just hype. It might be more in the zeitgeist at the moment, but it won’t go away. I believe that simply because it’s not a new trend. Vibe coding, in my opinion, is nothing more than an evolution of low/no-code platforms. We’ve seen this type of tool since MS-Access and Visual Basic back in the 90s. It definitely has its niche, a viable one, but it’s not something that will eradicate the SW development profession.

I do think that AI will most definitely change how developers work and what programming looks like. But this still will not make programmers obsolete.

This is because the actual challenges are elsewhere.

The Real Bottlenecks in Software Engineering

In fact, I think we’re scratching the surface here. Partially because the technology and tooling are still evolving. But also, it’s because it seems most people¹ looking at improving software engineering are looking at the wrong problem.

Anyone who’s been at this business professionally has realized at some point that code production is not the real bottleneck when it comes to being a productive software engineer. It never was the productivity bottleneck.

The real challenges, in real world software development, especially at scale, are different. They revolve mainly around producing coherent software by a lot of people who need to interact with one another:

Conquering complexity: understanding the business and translating it into working code. Understanding large code bases.
Communication overhead: the amount of coordination that needs to happen between different teams when making design choices². We often end up with knowledge silos.
Maintaining consistency: using the same tools, practices and patterns so operation and evolution will be easier. This is especially true at a large scale of organization, and over time.
Impact analysis: it’s hard to analyze the impact of changes, and tracing back decisions isn’t easy.

A lot of the energy and money invested in doing day-to-day professional software development is about managing this complexity and delivering software at a consistent (increasing?) pace, with acceptable quality. It’s no surprise there’s a whole ecosystem of methodologies, techniques and tools dedicated to alleviating some of these issues. Some are successful, some not so much.

Code generation isn’t really the hard part. That’s probably the easiest part of the story. Having a tool that does it slightly faster³ is great, and it’s helpful, but this doesn’t solve the hard challenges.

We should realize that code generation, however elaborate, is not the entire story. It’s also about understanding the user’s request, constraints and existing code.

The point here isn’t about the fantastic innovations made in the technology. My point is rather that it’s applied to the least interesting problem. As great as the technology and tooling is, and they are great, simply generating code doesn’t solve a big challenge.

This leads me to thinking: is this it?
Is all the promise of AI, when it comes to my line of work, just typing the characters I tell it to, only faster?
Don’t get me wrong, it’s nice to have someone else do the typing⁴, but this seems somewhat underwhelming. It certainly isn’t a game changer.

Intuitively, this doesn’t seem right. But for this we need to go a step back and consider LLMs again.

LLM Strengths Beyond Code Generation

Large Language Models, as the name implies, are pretty good at understanding, well – language. They’re really good at parsing and producing text, at “understanding” it. I’m avoiding the philosophical debate on the nature of understanding⁵, but I think it’s pretty clear at this point that when it comes to natural language understanding, LLMs provide a very clear advantage.

And this is where it gets interesting. Because when we look at the real world challenges listed above, most of them boil down to communication and understanding of language and semantics.

LLMs are good at:

Natural language understanding – identifying concepts in written text.
Information synthesis – connecting disparate sources.
Pattern recognition
Summarization
Structured data generation

And when you consider mechanizing these capabilities, like LLMs do, you should be able to see the doors this opens.

These capabilities map pretty well to the problems we have in large scale software engineering. Take, for example, pattern recognition. This should help with mastering complexity, especially when complexity is expressed in human language⁶.

Another example might be in addressing communication overhead. It can be greatly reduced when the communication artifacts are generated by agents armed with LLMs. Think about drafting decisions, specifications, summarizing notes and combining them into concrete design artifacts and project plans.

It’s also easier to maintain consistency in design and code, when you have a tireless machine that does the planning and produces the code based on examples and design artifacts it sees in the system.

It should also be easier to understand the impact of changes when you have a machine that traces and connects the decisions to concrete artifacts and components. A machine that checks changes in code isn’t new (you probably know it as “a compiler” or “static code analyzer”). But one that understands high level design documents and eventually connects them to the running code, with no extra metadata, is a novelty. Think about an agent that understands your logs and your ADRs, and uses them to find bottlenecks or brainstorm potential improvements.

And this is where it starts to get interesting.

It’s interesting because this is where mechanizing processes starts to pay off – when we address the scale of the process and volume of work. And we do it with little to no loss of quality.

If we can get LLMs to do a lot of the heavy lifting when it comes to identifying correlations, understanding concepts and communicating about it, with other humans and other LLMs, then scaling it is a matter of cost⁷. And if we manage this, we should be on the road to, I believe, an order of magnitude improvement.

So where does that leave us?

Augmenting SW Engineering Teams with LLMs

You have your existing artifacts – your meeting notes, design specifications, code base, language and framework documentation, past design decisions, API descriptors, data schemas, etc.
These are mostly written in English or some other known format.

Imagine a set of LLM-based software agents that connect to these artifacts, understand the concepts and patterns, make the connections and start operating on them. This has an immediate potential to save human time by generating artifacts (not just code), but also make a lot of the communication more consistent. It also has the potential to highlight inconsistencies that would otherwise go unnoticed.

Consider, for example, an ADR assistant that takes in a set of meeting notes, some product requirements document(s) and past decisions, and identifies the new decisions taken automatically, and generates succinct and focused new ADRs based on decisions reached.

Another example would be an agent that can act as a sounding board for design thinking – you throw your ideas at it, and allow it to access existing project and system context as well as industry standards and documentation. You then chat with it about where best practices are best applied, and where the risks are in given design alternatives. Design review suddenly becomes more streamlined when you can simply ask the LLM to bring up issues in the proposed design.

Imagine an agent that systematically builds a knowledge graph of your system as it grows. It does it in the background by scanning code committed and connecting it with higher level documentation and requirements (probably after another agent generated them). Understanding the impact of changes becomes easier when you can access such a semantic knowledge graph of your project. Connect it to a git tool and it can also understand code/documentation changes at a very granular level.
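To make the impact-analysis idea a bit more concrete, here is a minimal sketch, assuming a toy in-memory graph where an edge means "this artifact is derived from that one". The class name, the `link`/`impacted_by` methods and the artifact names are all hypothetical; in the scenario described above, an agent would populate the graph from commits and documents rather than by hand.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy graph: nodes are artifact names, edges point from an artifact to what is derived from it."""

    def __init__(self):
        self._dependents = defaultdict(set)

    def link(self, source, dependent):
        """Record that `dependent` is derived from (or implements) `source`."""
        self._dependents[source].add(dependent)

    def impacted_by(self, changed):
        """Return every artifact reachable from `changed`, i.e. potentially affected by it."""
        seen = set()
        queue = deque([changed])
        while queue:
            node = queue.popleft()
            for dependent in self._dependents.get(node, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen


# Hypothetical artifacts: a requirement, a decision record, code and tests.
graph = KnowledgeGraph()
graph.link("REQ-12 export reports", "ADR-007 use async jobs")
graph.link("ADR-007 use async jobs", "jobs/queue.py")
graph.link("ADR-007 use async jobs", "reports/export.py")
graph.link("reports/export.py", "tests/test_export.py")

impact = graph.impacted_by("ADR-007 use async jobs")
print(sorted(impact))
```

The traversal itself is the easy part. The interesting work, and the part that needs an LLM, is deciding which links exist in the first place.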

All these examples don’t eliminate the human in the loop. It’s actually a common pattern in agentic systems. I don’t think the human(s) can or should be eliminated from the loop. It’s about empowering human engineers to apply intuition and higher level reasoning. Let the machine do the heavy lifting of producing text and scanning it. And in this case we have a machine that can not only scan the text, but understand higher level concepts, to a degree, in it. Humans immediately benefit from this, simply because humans and machines now communicate in the same natural language, at scale.

We can also take it a step further: we don’t necessarily need a complicated or very structured API to allow these agents to communicate amongst themselves. Since LLMs understand text, a simple markdown with some simple structure (headers, blocks) is a pretty good starting point for an LLM to infer concepts. Combine this with diagram-as-code artifacts and you have another win – LLMs understand these structures as well. All with the same artifacts understandable by humans. There’s no need for extra conversions⁸.
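As an illustration of how little structure is needed, here is a minimal sketch, assuming agents exchange markdown documents with `#`-style headers. The `split_sections` helper and the sample ADR are hypothetical, and the parser deliberately ignores edge cases such as `#` lines inside code fences.

```python
import re


def split_sections(markdown_text):
    """Split markdown into {header: body} using '#'-style headers."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {header: "\n".join(body).strip() for header, body in sections.items()}


# Hypothetical ADR, readable by humans and trivially consumable by another agent.
ADR_TEXT = """# ADR-007: Use async jobs for report export

## Context
Exports block the request thread for large accounts.

## Decision
Move export generation to a background job queue.

## Consequences
Clients must poll for completion.
"""

adr = split_sections(ADR_TEXT)
print(adr["Decision"])
```

A downstream agent could be handed only the "Decision" and "Consequences" sections, which keeps its context small without inventing a dedicated exchange format.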

So now we can have LLMs communicating with other LLMs, to produce more general automated workflows. Analyzing requirements, in the context of the existing system and past decisions, becomes easier. Identifying inconsistencies or missing/conflicting requirements can be done by connecting a “requirement analyzer” agent to the available knowledge graph produced and updated by another agent. What-if scenarios are easier to explore in design.

Such agents can also help with producing more viable plans for implementation, especially taking into consideration existing code bases. Leaning on (automatically updated) documentation can probably help with LLM context management – making it more accurate at a lower token cost.

Mechanizing Semantics

We should be careful here not to fall into the trap of assuming this is simple automation, a sort of more sophisticated robotic process automation, though that has its value as well.

I think it goes beyond that.
A lot of the work we do on a day to day basis is about bringing context and applying it to the problem or task at hand.

When I get a feature design to be reviewed, I read it, and start asking questions. I try to apply system thinking and first principle thinking. I bring in the context of the system and business I’m already aware of. I try to look at the problem from different angles, and ask a series of “what-if” questions on the design proposed. Sometimes it’s surfacing implicit, potentially harmful, assumptions. Sometimes it’s just connecting the dots with another team’s work. Sometimes it’s bringing up the time my system was hacked by a security consultant 15 years ago (true story). There’s a lot of experience that goes into that. But essentially it’s applying the same questions and thought processes to the concepts presented on paper and/or in code.

Given LLMs’ ability to derive concepts and identify patterns in them, together with their vast embedded knowledge, I believe we can encode a lot of that experience into them, whether by fine-tuning, clever prompting or context building. A lot of these thinking steps can be mechanized. It seems we have a machine that can derive semantics from natural language. We have the potential to leverage this mechanization into the day to day of software production. It’s more than simple pattern identification. It’s about bridging the gap between human expression and formal methods (be it diagrams or code). The gap seems to be getting smaller by the day.

Let’s not forget that software development is usually a team effort. And when we have little automatic helpers that understand our language, and make connections to existing systems, patterns and vocabulary, they’re also helping us to communicate amongst ourselves. In a world where remote work is prevalent, development teams are often geographically distributed and communicating in a language that is not native to anyone in the development team – having something that summarizes your thoughts, verifies meeting notes against existing patterns and ultimately checks whether your components behave nicely with the plans of other teams, all in perfect English, is a definite win.

This probably won’t be an easy thing to do, and will have a lot of nuances (e.g. legacy vs. newer code, different styles of architecture, evolving non functional requirements). But for the first time I feel this is a realistic goal, even if it’s not immediately achievable.

Are We Done?

This of course raises the question – where is the line? If we can encode our experience as developers and architects into the machine, are we really on the path to obsolescence?

My feeling is that no, we are not. At the end of the process, after all alternatives are weighed, assumptions are surfaced, trade offs are considered, a decision needs to be taken.

At the level of code writing, this decision – what code to produce – can probably be taken by an LLM. This is a case where constraints are clearer and with correct context and understanding there’s a good chance of getting it right. The expected output is more easily verifiable.

But this isn’t true for more “strategic” design choices. Things that go beyond code organization or localized algorithm performance. Choices that involve human elements like skill sets and relationships, or contractual and business pressure. Ultimately, the decision involves a degree of intuition. I can’t say whether intuition can be built into LLMs; intuitively I believe it can’t (pun intended). I highly doubt we can emulate that using LLMs, at least not in the foreseeable future.

So when all analysis is done, the decision maker is still a human (or a group of humans). A human that needs to consider the analysis, apply their experience, and decide on a course forward. If the LLM-based assistant is good enough, it can present a good summary and even recommendations, all done automatically. This analysis still needs to be understood and used by humans to reach a conclusion.

Are we there yet? No.
Are we close? Closer than ever probably, but still a way to go.

Can we think of a way to get there? Probably yes.

A Possible Roadmap

How can we realize this?

The answer seems to be, as always, to start simple, integrate and iterate; ad infinitum. In this case, however, the technology is still relatively young, and there’s a lot going on: foundation models, relevant databases, coding tools, prompt engineering, MCPs and beyond. These are all being actively researched and developed. So trying to predict how this will evolve is even harder.

Still, if I have to think about how this will evolve in practice, this is at least one possible path.

Foundational System Understanding

First, we’ll probably start with simple knowledge building. I expect we’ll first see AI agents that can read code, produce and consume design knowledge – how current systems operate. This is already happening and I expect it will improve. It’s here mainly because the task in this case is well known and tools are here. We can verify results and fine tune the techniques.
Examples of this could be AI agents that produce detailed sequence diagrams of existing code, and then identify components. Other AI agents can consume design documents/notes and meeting transcriptions, together with the already produced description, to produce an accurate record of the changed/enhanced design. Having these agents work continuously and consistently across a large system already provides value.

Connecting Static and Dynamic Knowledge

Given that AI agents have an understanding of the system structure, I can see other AI agents working on dynamic knowledge – analyzing logs, traces and other dynamic data to provide insights into how the system and users actually behave and how the system evolves (through source control). This is more than log and metric analysis. It’s overlaying the information available over a larger knowledge graph of the system, connecting business behavior to the implementation of the system, including its evolution (i.e. git commits and Jira tickets).

Can we now examine and deduce information about better UX design?
Can we provide insights into the decomposition of the system?

Enhanced Contextual Assistant and Design Support

At this point we should have everything we need to actually provide more proactive design support. I can see AI agents that we can chat with and that help us reason about our designs; agents to which we can suggest a design alternative and ask them to assess it and find hidden complexities, given the context of the existing system. Combined with daily deployments and source control, we can probably expect some time estimates and detailed planning.

This is where I see the “design sounding board” agent coming into play. As well as agents preemptively telling me where expected designs might falter.

More importantly, it’s where AI agents start to make the connections to other teams’ work. Telling me where my designs or expected flow will collide with another team’s plans.
Imagine an AI agent that monitors design decisions, of all teams and domains, identifies the flows they refer to, and highlights potential mismatches between teams or suggests extra integration testing, if necessary, all before sprint planning starts. Impact analysis becomes much easier at this point, not because we can query the available data (though we could, and that’s nice as well), but because we have an AI agent looking at the available data, considering the change, and identifying on its own what the impact is.
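A minimal sketch of the mismatch check, assuming each design decision has already been reduced to a record of team, flow and assumption. In the vision described here that extraction is the LLM agent's job; below it is hard-coded, and the `Decision` type, `find_mismatches` function and sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    team: str
    flow: str        # business flow the decision touches
    assumption: str  # what the decision assumes about that flow


def find_mismatches(decisions):
    """Return flows where different teams recorded different assumptions."""
    by_flow = defaultdict(list)
    for decision in decisions:
        by_flow[decision.flow].append(decision)

    mismatches = {}
    for flow, items in by_flow.items():
        teams = {d.team for d in items}
        assumptions = {d.assumption for d in items}
        if len(teams) > 1 and len(assumptions) > 1:
            mismatches[flow] = sorted(items, key=lambda d: d.team)
    return mismatches


# Hypothetical decisions, as an extraction agent might record them.
decisions = [
    Decision("billing", "checkout", "orders are confirmed synchronously"),
    Decision("fulfillment", "checkout", "orders are confirmed asynchronously"),
    Decision("billing", "refund", "refunds are idempotent"),
]

mismatches = find_mismatches(decisions)
for flow, items in mismatches.items():
    print(flow, [d.team for d in items])
```

Comparing assumptions as exact strings is obviously naive. Judging whether two differently worded assumptions actually conflict is exactly the semantic step an LLM would take over.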

There’s still a long way to go until this is realized. Implementing this vision requires taking into account data access issues, LLM and technology evolution, integration and costs. All the makings of a useful software project.
I also expect quite a bit can change, and new techniques/technologies might make this more achievable or completely unnecessary.

And who knows, I could also be completely hallucinating. I heard it’s fashionable these days.

Conclusion: The Real Promise of LLMs in Software Engineering

I’ve argued here that while vibe coding and code generation get most of the attention, they aren’t addressing the real bottlenecks in software development. The true potential of Large Language Models lies in their ability to understand and process natural language, connect disparate information sources, and mechanize semantic understanding at scale.

LLMs can transform software engineering by tackling the actual challenges we face daily: conquering complexity, reducing communication overhead, maintaining consistency, and analyzing the impact of changes. By creating AI agents that can understand requirements, generate documentation, connect design decisions to implementation, and serve as design thinking partners, we can achieve meaningful productivity improvements beyond simply typing code faster, as nifty as that is.

What makes this vision useful and practical is that it doesn’t eliminate humans from the loop. Rather, it augments our capabilities by handling the heavy lifting of information processing and connection-making, while leaving the intuitive, strategic decisions to experienced engineers. This partnership between human intuition and machine-powered semantic understanding represents a genuine step forward in how we build software.

Are we there yet? Not quite. But we’re closer than ever before, and the path forward is becoming clearer.

Have you experienced any of these AI-powered workflows in your own development process? Do you see other applications for LLMs that could address the real bottlenecks in software engineering?

At least most who publicly talk about it ↩︎
‘Just set up an api’ is easier said than done – agreeing on the API is the hard work ↩︎
And this is a bit debatable when you consider non-functional requirements ↩︎
I am getting older ↩︎
Also because I don’t feel qualified to argue on it ↩︎
Data mining has been around forever, but mostly works on structured data ↩︎
Admittedly, not a negligible consideration ↩︎
Though from a pure mechanistic point of view, this might not be the most efficient way ↩︎

Exploring Vibe Coding with AI: My Experiment

In my previous post I mentioned vibe coding as a current trend of coding with AI. But I haven’t actually tried it.

So I’ve decided to jump on the bandwagon and give it a try. Granted, I’m not the obvious target audience for this technique, but before passing judgment I had to see/feel it for myself.

It’s not the first time I generated code using an LLM with some prompting. But this time I was more committed to try out “the vibe”. To be clear, I did not intend to go all in with voice commands, transcription, and watching Netflix while the LLM worked. I did intend to review the code, and keep in touch with the output at every point. I wanted to test the tool’s capabilities while still being very much aware of what was going on.

Below is an account of what happened, my thoughts and conclusions so far.
A general disclaimer is of course in order: I’m still exploring these tools, and it’s quite possible there could be improvements to the process. My impressions, however, are very much influenced by my experience as a developer. My choice of tools and how to use them is therefore very much biased towards usage as an experienced developer looking to increase productivity, not a non-coder looking to crank out one-off applications¹.

The Setup

I set out to create a new simple tool for myself (actually to be used at work), something I actually find useful, and is not an obvious side project that’s been done a million times, and therefore less likely (I hope) to be in the LLM’s training data. It’s a project done from scratch, and I’m trying to do something that I don’t have a lot of experience with. It is also meant to be fairly limited in scope.

The project itself is a “Knowledge Graph Visualizer”, essentially an in-browser viewer of a graph representing arbitrary concepts and their relationships. I intended this to be purely browser, JS code. The main feature is a 3D rendering of the graph, allowing navigation through the concepts and their links. You can see the initial bare specification here.

To get a feel for the project, here’s a current screenshot:

KG-Viewer showing its own knowledge graph

With respect to tooling, I went with Cursor (I use Cursor Pro), primarily using the Claude-Sonnet 3.7 model. The initial code generation was actually done with Gemini 2.5 Pro, but I quickly ran out of credits there. So the bulk of the work was done with Cursor.

I did not use any special Cursor rules or MCP tools. This may have altered the experience to a degree (though I doubt it), so I will need to continue trying it as I explore these tools and techniques.

Getting Into the Vibe

It actually started out fairly impressive. Given the initial spec, Gemini generated 6 files that provided the skeleton for a proof of concept. All of these files are still there. I did not look too deeply into the generated code. Instead, I initialized an empty simple project, launched Cursor, and copied the files there. With a few tweaks², it worked. I had a working POC in about one hour of work. Without ever coding 3D renderings of graphs.

Magic!
I’ll be honest – I was impressed at first. I got a working implementation for drawing a graph with Three.js, for some JSON schema describing a graph. Given that I had never laid eyes on Three.js, this was definitely faster than I would have gotten to even this simple POC on my own.

I did peek at the code. I wasn’t overly impressed by it – there was a lot of unnecessary repetition, very long functions, and some weird design choices. For example, having a style.css holding all the style classes, but at the same time generating a new style and dynamically injecting it into the document.
But, adhering to my “viber code”, I did not touch the code, instead working only with prompts.

Then I started asking for more features.

Cursor/Claude, We Have a Problem

A POC is nice. But I actually need a working tool. So I started asking for more features.
Note, I did not just continue to spill out requests in the chat. I followed the common wisdom – using a new chat instance, laying out the feature specification and working step by step on planning and testing before implementation.

I wrote a simple file, which should allow me to trace the feature’s spec and implementation.
The general structure is simple:

- Feature Specification
- Plan
- Testing
- Implementation Log

Where I fill in only the Feature Specification, and let Cursor fill in the plan (after approval) and the “Implementation Log” as we proceed.

The plan was to have a working log of progress, to be used as both a log of the work, but also provide context to future chat sessions.
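For illustration, a feature file with this structure could be scaffolded with something like the sketch below. It assumes a markdown layout, which is my guess at a reasonable format rather than the exact file used here; the `feature_file_text` helper, the placeholder wording and the sample specification are hypothetical.

```python
SECTIONS = ["Feature Specification", "Plan", "Testing", "Implementation Log"]
PLACEHOLDER = "_To be filled in by the agent._"


def feature_file_text(title, specification):
    """Build a feature file: the human writes the specification, the agent fills in the rest."""
    lines = [f"# {title}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.append(specification if section == "Feature Specification" else PLACEHOLDER)
        lines.append("")
    return "\n".join(lines)


# Hypothetical specification text for the first feature.
text = feature_file_text("Data retrieval", "Load the graph description from a JSON file.")
print(text)
```

Keeping the log in the same file as the specification means a fresh chat session can be pointed at a single document to pick up where the previous one stopped.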

I don’t intend to re-create here the entire chat session or all my prompts, as this is not intended to be a tutorial on LLM techniques. But fair to say that the first feature (data retrieval), was implemented fairly easily, using only prompts.

Just One Small Change…

I was actually still pretty impressed at this point, so I simply asked for a tiny feature – showing node and link labels. I did it without creating an explicit “feature file”.

The code didn’t work. So I asked Cursor to fix it. And this quickly spiraled out of control. Cursor’s agent of course notified me on every request that it had definitely figured out the issue, and now it has the fix (!).
It didn’t.

I remained loyal to the “vibe coder creed”, and did not try to debug/fix the code myself. Instead, I deliberately went in cycles of prompting for fixes, accepting changes blindly, testing, and prompting again with new errors.

Somewhere along this cycle, the code changes made by the agent actually created a regression in the application’s code, resulting in the application not loading at all.

After roughly 3 hours, and a lot more grey hair, I did notice that the Cursor agent was going in circles – simply trying out the same 3 solutions, with no idea what’s wrong. But still confidently hallucinating solutions (“Now I see the issue…”³).

This was so frustrating that at this point I simply took it upon myself to actually look at the code, which was a complete mess. I looked at the problematic code, consulted git diffs to restore basic functionality, and solved the actual issue with about 10 more minutes of Google search.

To be fair, from my very rudimentary Google search it seemed my request (link labels) wasn’t that easy to achieve. It’s apparently not that obvious (again, I’m not an expert on Three.js). I relaxed the requirement a bit, and found a simple solution.
Still, the whole cycle of back and forth of code changes, especially to unrelated code, was very much counter-productive. The vibes were all wrong. Getting back to working code took another 2-3 hours.

At this point I was thinking “oh well, you can’t win them all”. I wanted to turn to something simple. And looking at the state of the code, a simple cleanup should be easy enough, right?

Right? …

Now It’s Just Cleanup

Well … it depends.

I went back into “vibe coding” mode. This time, I defined very basic code cleanup procedures. I then asked Cursor’s agent (in a new session), to go through the source code and follow these steps to clean it up.

It actually did reasonably well for small files. The bigger files proved to be more challenging. Trying to clean them up ended up messing the files completely. For some reason, the LLM agent removed functioning code, and created functionality regressions. Trying to quickly fix them ended up in causing more issues. It was clearly guessing at this point.

Given my battle scars with the previous feature request, I avoided this hallucination death spiral. Instead, I went through git history, found a working version, and restored the working code “by hand” – actually typing in code. I wasn’t a vibe coder anymore, but the application worked, the code was cleaner, and my blood pressure remained fairly low (I think).

The experience felt like trying to mentor a junior developer to code without creating regressions. The problem is that it’s a fast and confident junior developer, with short term memory loss, who is apparently so eager to please that it simply spews out code that looks remotely connected to the problem at hand, with little understanding of context; proving to be ignorant even of changes it made to the code itself.

Documentation for Man and Machine

At this point I decided to go back to basics, where LLMs truly shine – understanding and creating text. I asked it to create documentation for specific flows in the code (init sequence, clicking on a legend item). Unsurprisingly, with a few simple prompts, the agent produced decent documentation for what I asked, including mermaid.js diagram code.

This is important not simply because it allowed me to document the project easily, which is nice. Creating textual documentation of specific flows also allowed me to provide better context for other chat sessions. And this is an important insight – textual descriptions of the code are useful for humans as well as for LLMs.

Other Features

At this point I turned to develop more features – loading data and “node focus”. In both cases I went back to providing feature files, with specifications, and asking the agent to update the files with plans and implementation logs.

I was a bit more cautious now. I reviewed code more carefully and intervened where I felt the code wasn’t good. In some cases it was obvious the code wasn’t functionally correct, but instead of trying to “fight” with the agent, I accepted the code and went on to change it myself.

A repeating phrase in all my prompts at this point was:

Do minimal code changes. Change only what is needed and nothing more.

This, combined with being more cautious and careful, resulted in pretty good results. I managed to implement two features in a short time. Probably a bit shorter compared to what it would have taken me to run through Three.js tutorials and do it myself.

Final Thoughts

So where does this leave me?

I have a working application. And if I had to learn Three.js from scratch myself, it would have taken me considerably longer to create. It’s working, and it’s useful. This is an important bottom line.

Small Application, Good Starting Point

The initial code, generated by the LLM (Gemini or Claude) does serve as a good starting point, especially in areas or frameworks that are unfamiliar to the developer.

But this is still a far cry from replacing developers. There are tool limitations, some of them, I expect, introduced by Cursor rather than the LLM. These limitations can cause havoc if the agent is left to proceed with no oversight.
And review is harder when there’s a ton of unorganized code⁴.

We can probably make it better with rules, better prompts, and combination of agents. And of course advances in LLM training.

This is a good starting point. But we need to remember this is a very small application, made from scratch. In the real world, a lot of use cases are not that simple at all. The more I read and think about it, this bears a striking resemblance to no-code/low-code tools. In those cases too, it’s easy to achieve quick results for simple use cases, but very hard to scale development when features creep in or the application needs to scale.

It’s not that low-code tools don’t have their place. They serve a very specific (viable) niche. But as experience shows, they haven’t replaced developers.

Could this be different?
What would it take to tackle more serious challenges, with “vibe coding”?

Context is King

It’s quite obvious that in the kingdom of tokens, amidst ramparts of code and wind all of chat messages, there is only one king, and its name is Context⁵. As LLMs are limited in their context size, and a lot of it is taken up by wrapping tools (Cursor in this case), context for an LLM chat is expensive real estate.

So while context windows can get big, we’ll probably never have enough when we get to more complicated tasks and bigger code bases. There’s a preservation of complexity at play.

Accuracy and precision in the context play a crucial role in effectiveness. Context passed to LLMs needs to be information-dense. We should probably start considering how efficient the context we’re providing to LLMs is. I don’t know how to measure context efficiency yet, but I believe this will be important for being more effective as tasks become more complicated.

But there’s more than just the LLM and how to operate it.

You’re Only as Good as Your Tools, Also When Vibing

It’s quite clear that mistakes made by LLMs, and humans, can be avoided/caught with the help of the right tools. Even in my small example described above, cooperation of the LLM agent with external tools (console logs, shell commands) resulted in better understanding and a more independent agent.

I suspect that having more tools, e.g. a relevant MCP server for documentation, can significantly help. I expect the integration of LLMs with tools will become more prominent and more necessary to create more independent coding agents.

One often overlooked tool is the simple document explaining the context of the project, specific features and current tasks. When LLMs work seamlessly with Architecture Decision Records and diagram-as-code tools, I expect to see better results. The memory bank approach seems to be a step in that direction, though it’s hard to assess how effective it is.

I have noticed in this exercise that supplying the LLM with context of how a flow works currently (e.g. loading the data), allows it to identify the necessary changes more easily.

Diagram-as-code artifacts now play a role not just for human developers, but also as a way to encode context for the application. There’s a feedback loop here between the LLM generating documentation, and using it as input for further tasks.

Effective Vibing

The real question is about the effectiveness of the vibe coding approach. With what degree of agent independence can we achieve good results?

I’m not sure how to assess this. One approximation of this might be the ratio of bugs to the product of user chat messages and lines generated in a given vibe coding session. But there are obviously other parameters involved⁶.
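To make that approximation tangible, here is a minimal sketch under one possible reading of it: bugs divided by the product of chat messages and generated lines. The function name and the session numbers are made up for illustration, not measurements from this experiment.

```python
def vibe_bug_rate(bugs, chat_messages, lines_generated):
    """Bugs per (chat message x generated line); lower is better under this reading."""
    effort = chat_messages * lines_generated
    if effort <= 0:
        raise ValueError("need at least one chat message and one generated line")
    return bugs / effort


# Two hypothetical sessions with the same number of bugs and generated lines.
session_a = vibe_bug_rate(bugs=4, chat_messages=20, lines_generated=500)
session_b = vibe_bug_rate(bugs=4, chat_messages=40, lines_generated=500)
print(session_a, session_b)
```

Note that the second session scores better simply because it involved more chatting, which is a good hint at why a number like this should be read with care.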

It will be interesting to measure this over time, with more integrated tools and improved LLMs.

I’m not sure how this will evolve over time. I do think, however, that if LLMs with coding tools are reduced to a glorified low-code platform, it will be a miss for software engineering in general. The technology seems to be more powerful than that, since it has the potential to more easily bridge the gap between human language and rigorous computer programs; and do it in both directions.

On to explore more.

Not that there’s anything wrong with that ↩︎
Yep, I asked Cursor to keep track of the changes at this point ↩︎
A phrase which, I guess, is close to becoming a meme unto itself ↩︎
But then again, not sure it’s a problem in the long run ↩︎
Always looking for opportunities to paraphrase one of my favorite book series; couldn’t resist this one ↩︎
And we should be careful of Goodhart’s law. ↩︎

Schejter

Lior Schejter et al