AI agents are everywhere, and slowly (quickly?) becoming more prominent in applications. We’re seeing more and more of them appearing as integral parts of applications – not just as tools for development, but as actual technical components that implement user-facing functionality. We’re also seeing a significant improvement in the length of tasks agents are able to accomplish. I’m not sure this is AGI yet, but it’s definitely significant.
So far, I have focused on the implications of AI for how software is developed. But as we move from working internally with LLMs to building applications that leverage them, I believe it’s time to look more carefully at how to build such systems. In other words, what would it look like if we wanted to build a system that really leverages LLMs as a core building block?
We already have concrete examples of such applications – our AI-driven IDEs and other coding agents. These are examples of applications where the introduction of AI has done more than supercharge existing application functionality. It has actually changed the way we do things. What’s more interesting is how quite a few people are using these in ways that traditional IDEs weren’t designed for. I remember a time, not so long ago, when suggesting the use of an IDE to a non-technical product manager was met with raised eyebrows1. Now, most product managers have Cursor (or Claude Code) open and doing much of their work. This isn’t just ‘vibe coding’; it’s using the agent as a multi-tool for the boring-but-essential parts of the job. I’m seeing people use Cursor for practically everything – writing specs/design documents, documentation, diagramming, design and data archaeology and more. And this is still mostly chatting with a given agent. The potential, I believe, is much bigger.
When I let myself extrapolate from coding agents to the broader set of potential applications2, I can’t help but think we’re going to see a new kind of software application emerge – Agent-driven applications. These are applications built mostly around LLMs and their tools – essentially the agent’s harness. It can be multiple agents cooperating or a single agent embedded into a larger platform. I don’t assume this type of application will replace all others, but I think it will become more prevalent, and we should start to seriously think about what it means to really leverage LLMs in applications. We should consider the implications for how we define, build, evolve and use these applications where AI-based agents sit at the core.
Why Should We Care?
One can argue that LLMs are nothing more than technical components in a larger system, with limitations and quirks around how they’re used. This is technically true3. But I believe there’s a larger opportunity here in how we deploy and use these LLMs; an opportunity which presents its own challenges.
If we limit ourselves to some kind of smarter automation – guiding the LLM through a task or some workflow – this would probably work. But with LLMs we can do more. We can declare a desired goal/outcome, and let the agent decide on how to achieve it. A reasoning agent equipped with a set of capabilities (=tools) and a desired goal can work to achieve it, without us coding the concrete workflow, or even explaining it in detail. This is what we’re already seeing with AI IDEs. Assuming the capabilities are robust enough – on par with what a user would do – the agent should be able to achieve the task on its own.
This may seem insignificant at first glance. But I think it changes how we would want to design these systems if we want to really leverage high-end LLMs. Given an agent with enough tools, the user can also instruct it to do all kinds of tasks that a developer/product manager/architect did not even consider when building the application. It can be a simple heuristic change, or completely new workflows – all only a prompt away. Years of enterprise software customizations and pluggable architectures prove that this is a very real need. And reasoning LLMs, with their open-ended flexibility, supercharge these customization capabilities.
It’s more than customization of existing workflows. It’s also finding new ways to use the existing capabilities. Similar to how people string together Linux shell commands, or use electronic spreadsheets for everything from accounting to games and habit tracking – once the capabilities are there, we’re only limited by the user’s imagination.
In addition, I think there’s also a chance for a pattern that might be unique to LLMs being at the heart of such an application. This is derived from the use of context. When an agent in such an application works, it consumes context (usually fed by different tools, or from its own progress). But it also has the chance to affect future context, for other agents or for its future invocations. It maintains a memory of past actions and interactions – similar to how I turn certain conversations I have in Cursor into rules or commands, or update the AGENTS.md file. This has the potential to allow the agent to improve over time, even automatically, without a human in the loop.
But what does such an application look like?
Anatomy of an Agent-Driven Application
As I see it, the architecture of an agent-driven application centers around the Agent Loop. Unlike traditional software that relies on rigid, pre-defined workflows to execute logic, this architecture relies on an agent, or a set of collaborating agents, working autonomously to achieve a specific goal.
In this development model, we do not define concrete flows. Instead, we define the application by providing the agent with a set of tools. These tools allow the agent to perceive its environment and act upon it. The desired outcome is defined through a combination of prompts, originating from both the system developers and the end-user, and existing context. The agent then works through the task in its loop, until it completes its work. This is similar to the idea of Web World Models, only applied to business scenarios, which hopefully can make it more constrained.
The execution path is dynamic rather than static. Because the agent maintains a context that evolves, learns, and potentially forgets over time, the specific steps taken to achieve an outcome may change. The application defines what needs to be done, while the agent determines how to do it based on its current context and available tools. The potential for such an application is more than simple automation of tasks; it’s also about finding ad-hoc ways to achieve a given goal or achieving unforeseen (desired) outcomes.
Examples can be agents with varying levels of complexity:
- A planning agent that responds to events, queries application state and decides how to allocate resources, collaborating with other agents to verify choices and other constraints, eventually notifying users and downstream systems.
- A troubleshooting agent that leverages various data sources to correlate and find insights in the data, iteratively exploring until it offers several theories answering the question it was asked.
There are three main components to such an application: the capabilities (tools) available to the agent, the shared context and the agent loop.

The Agent’s Capabilities (Tools): The agent interacts with the application and external systems through tools. These tools should be atomic and composable. Initially, they may be simple primitives. Over time, as we observe how the agent utilizes them, these tools can evolve into more complex capabilities. The agent selects these tools dynamically to solve problems, invokes them and acts on their result.
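As a rough sketch of what atomic, composable tools might look like – the tool names, payloads, and the `Tool` wrapper are all illustrative assumptions, not any specific framework’s API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str         # what the model reads when selecting a tool
    run: Callable[..., Any]  # the actual capability

# Atomic primitives: each does one thing, so the agent can compose them.
def lookup_order(order_id: str) -> dict:
    # Placeholder for a real backend call.
    return {"id": order_id, "status": "pending"}

def notify_user(user_id: str, message: str) -> bool:
    # Placeholder for a real notification channel; returns delivery success.
    return True

TOOLS = [
    Tool("lookup_order", "Fetch an order’s current state by id.", lookup_order),
    Tool("notify_user", "Send a short message to a user.", notify_user),
]

def tool_by_name(name: str) -> Tool:
    return next(t for t in TOOLS if t.name == name)
```

The point is the shape, not the domain: small tools with clear descriptions, which the agent can chain in orders we didn’t anticipate.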
Shared Context: Context is the memory of the system. It is not limited to a single interaction but persists and evolves between agent sessions. This shared context allows the agent to learn from previous interactions. It ensures that the agent does not start from zero with every task but builds upon a history of user preferences and past decisions, in addition to the system state. This memory is shared among the agents, but also between the agents and the users. It’s possible for a user to interact directly with the context, correct or change it and thereby direct the agent(s), within normal data access limitations.
The Agent Loop and Completion Signals: The agent lives in a perpetual loop: it observes the state, reasons through the next step, acts using a tool, and then looks at what happened. Repeat until the job is done. This loop runs until the agent determines the task is complete. Note that since the system centers around the agent loop, identifying when it’s finished is critical and an integral part of the pattern. There could be different signals for completion (e.g. “fully completed”, “partially completed”, “completed but unknown state”, “failed”).
It’s important to distinguish between a completion signal and some execution failure. It’s quite possible that a tool execution fails, but the agent continues to reason and work around it. It’s also possible to have all tools successfully execute, with the overall outcome not achieved due to other reasons.
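Putting the loop and its completion signals together, a minimal sketch might look like this – `call_llm` and the decision format are stand-ins, not a real model API, and the signal values mirror the ones suggested above:

```python
from enum import Enum

class Completion(Enum):
    FULLY_COMPLETED = "fully_completed"
    PARTIALLY_COMPLETED = "partially_completed"
    UNKNOWN_STATE = "completed_unknown_state"
    FAILED = "failed"

def run_agent(goal, call_llm, tools, max_steps=10):
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = call_llm(context)            # observe + reason
        if decision["type"] == "finish":
            return decision["signal"], context  # explicit completion signal
        tool = tools[decision["tool"]]
        try:
            result = tool(**decision["args"])   # act
        except Exception as exc:
            # A failed tool call is NOT a completion signal: feed the
            # error back so the agent can reason around it.
            result = f"tool error: {exc}"
        context.append({"role": "tool", "content": str(result)})
    # Step budget exhausted without an explicit "done" from the agent.
    return Completion.PARTIALLY_COMPLETED, context
```

Note that the loop only terminates on an explicit signal (or a budget cap); individual tool errors are folded back into context, matching the distinction above between execution failure and overall completion.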
Design Principles
Now that we’ve established the general idea of what an agent-driven application looks like, it’s worth laying down some points which should help us design such a system effectively.
Agent Capabilities Match User Capabilities
We should aim for capability parity between the user and the agent. If a user can achieve some outcome in the system, the agent must have a corresponding tool or set of tools to achieve the same result. This does not necessarily mean the agent manipulates the UI widgets; rather, it means the agent has programmatic access to the same underlying logic and mutations that the UI exposes to the user. It might be through a different path, but if we want the agent to achieve the same outcomes as a user, it should have capabilities that are on par with the user’s capabilities to affect the system.
Application Logic Lives In Prompts
The core logic of the application shifts from code to prompts. We use prompts to define the business constraints and desired outcomes. Deterministic code is still there, for various reasons, but the more flexible we want to be, and more open to agentic reasoning, the more we need the desired logic to exist in prompts. I also expect that the definition of business flows will be less prescriptive. Instead it will focus on establishing goals and constraints. Think of it like SQL for business logic. You declare the ‘what’ (the query), and the engine figures out the ‘how’ (the execution plan)4. There’s of course a twist here: our “engine” is a non-deterministic LLM working with an ever-evolving vocabulary of tools. This is harder compared to optimizing over a relatively narrow domain (relational algebra).
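To make the “logic in prompts” idea concrete, here is a sketch in which goals and constraints are declared as data and rendered into a system prompt, rather than coded as a flow – the business rules themselves are invented examples:

```python
# Business logic declared as goals and constraints, not as a workflow.
GOALS = [
    "Resolve the customer’s billing dispute.",
]
CONSTRAINTS = [
    "Never issue a refund above $100 without human approval.",
    "Always confirm the customer’s identity before discussing account data.",
]

def build_system_prompt(goals, constraints):
    lines = ["You are a support agent. Work toward these goals:"]
    lines += [f"- {g}" for g in goals]
    lines.append("Hard constraints (never violate):")
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)
```

Changing the application’s behavior here means editing a declared goal or constraint – the “what” – while the agent’s loop remains the shared “execution engine”.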
Consequently, this changes how we debug an agent-driven application. Instead of stepping through lines of code to debug logic errors we analyze execution traces to understand the agent’s reasoning process and tool selection.
Guardrails are Explicit – In The Tools
While the agent is autonomous, it must operate within safety boundaries. We do not rely on the agent’s “judgment” for critical constraints. Concerns such as data consistency, authorization, and sensitive data access are enforced strictly by the tools themselves. The tool allows the action only if it meets the hard-coded security and business rules. Some of the safety guardrails can be in the prompts, but we should not rely on this as a security measure.
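A small sketch of a tool that enforces its own guardrails regardless of what the agent “decided” – the role names and the $100 threshold are illustrative assumptions:

```python
class Forbidden(Exception):
    """Raised when a caller lacks authorization for a tool action."""

def issue_refund(caller_role: str, amount: float) -> str:
    # Hard authorization rule, enforced in the tool, not in the prompt.
    if caller_role != "billing":
        raise Forbidden("caller is not authorized to issue refunds")
    # Hard business rule: large refunds always go to a human queue,
    # no matter how confident the agent's reasoning was.
    if amount > 100:
        return "queued_for_human_approval"
    return "refunded"
```

The agent may still *ask* for a large refund; the tool simply won’t perform it. Prompt-level guidance can reduce how often the agent hits these walls, but the wall itself lives in code.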
Capability Evolution
The agent’s capabilities in the system are not necessarily static. We can evolve them by observing how the agent “behaves”. Concretely, we treat the agent’s behavior as a source of requirement generation. By observing traces, we identify common patterns or sequences of actions. We then “graduate” these patterns into more elaborate, hard-coded tools. It’s technical and logical refactoring that’s driven by how we observe the system behaving.
I see a few main motivations for this kind of evolution:
- Optimization: Hard-coded tools reduce cost and latency compared to multiple LLM round-trips.
- Domain Language: Creating specific tools establishes a richer, higher-level vocabulary for the agent to use, making it more effective within our specific business domain.
It’s also possible that we’d want to code some tool in order to guarantee some business constraint, e.g. data consistency. However, I believe this will not be so much an evolution of a tool but rather a defined boundary condition for the definition of a tool in the first place, maybe a result of a new business requirement/feature.
It’s quite possible that a few granular tools will be combined into a more complicated one if the pattern is very common, and we can optimize the process. Still, I wouldn’t discount the more granular tools as they provide the flexibility we might like to preserve.
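As an illustration of such graduation, here is a sketch in which a lookup-then-notify sequence observed repeatedly in traces becomes one composite tool, while the granular tools remain available – all names and payloads are hypothetical:

```python
# Granular primitives (placeholders for real backend calls).
def lookup_order(order_id: str) -> dict:
    return {"id": order_id, "status": "delayed"}

def notify_user(user_id: str, message: str) -> dict:
    return {"sent": True, "message": message}

def notify_order_status(order_id: str, user_id: str) -> dict:
    """Composite tool: hard-codes a pattern the agent kept repeating.

    One tool call (one LLM round-trip) instead of two, with the wording
    of the notification fixed by code rather than improvised each time.
    """
    order = lookup_order(order_id)
    return notify_user(user_id, f"Order {order['id']} is {order['status']}")
```

Keeping `lookup_order` and `notify_user` registered alongside the composite preserves the flexibility the text argues for: the agent can still recombine the primitives for cases the composite doesn’t cover.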
Tradeoffs and Practical Considerations
Naturally, designing any real-world system means making tradeoffs and being practical beyond theory. So it’s important to understand whether this kind of architecture pattern and technology carries any specific considerations or tradeoffs.
Model selection and configuration is an obvious point to note when building a system with LLMs at its heart. Not all tasks are created equal, and some may require a higher level of reasoning than others. The tradeoff is between cost and latency on one side, and reasoning power, expressiveness and the model’s inherent capabilities (e.g. whether it’s multi-modal) on the other. For example, a “router” agent that identifies and dispatches messages to other agents/processes may work well enough with a cheaper (weaker?) model; whereas an agent requiring deep understanding of a domain model, and how to retrieve and connect different bits of information, working over a longer time, may require a stronger model. This will probably be more evident in systems where there’s a topology of cooperating agents.
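One possible way to express this, as a sketch: a per-role model tier table, with a cheap default for unknown roles. The model names and role names are placeholders, not real model identifiers:

```python
# Per-agent model selection: cheap models for routing, stronger models
# for long-running, reasoning-heavy work. All values are illustrative.
MODEL_TIERS = {
    "router":         {"model": "small-fast-model",      "reasoning": "low"},
    "troubleshooter": {"model": "large-reasoning-model", "reasoning": "high"},
    "planner":        {"model": "large-reasoning-model", "reasoning": "medium"},
}

def model_for(agent_role: str) -> dict:
    # Default to the cheapest tier for roles we haven't classified yet.
    return MODEL_TIERS.get(agent_role, MODEL_TIERS["router"])
```

Treating this mapping as configuration rather than code also makes it cheap to re-tier an agent when a new model changes the cost/capability balance.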
Then there’s the elephant in the room: the tradeoff between autonomy and risk. This is an obvious point when considering a somewhat stochastic element in the architecture.
On the one hand, autonomy gives the agent, and ultimately the user, more flexibility. This should immediately lead to more unexpected use cases and the “emergent behavior” mentioned above. Consider, for example, an agent dealing with financial records that can identify issues and fix them without the patterns being pre-programmed in code.
On the other hand, there’s an inherent risk with allowing too much. Restricting the agent’s capabilities increases predictability and therefore safety. It of course limits the product’s value at the same time. On the extreme end, a very limited agent is kind of a fancy workflow engine5.
Applications obviously exist on a spectrum here, but this is a prime consideration when designing the agent’s capabilities.
The intersection of LLM context and long-running agents carries some points to pay attention to as well.
First of all, long-running agents will probably “run out” of context window. Trying different tools, retrying failed actions, and accumulating data and observations will inevitably fill the context window. This is an expected problem in this scenario; its impact and frequency will most likely correlate with task complexity and tool capabilities.
When building such a system, we should provide a standard, hopefully efficient, way to summarize or compact the context. Simply dropping a “memory” is usually not an option. There should be a standard way for agents to retrieve memories where applicable. This will likely be a core component of the system, and it’s still open (at least for me) whether there’s a general mechanism for managing context that will fit all kinds of tasks and/or applications and agent topologies.
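A minimal sketch of compaction along these lines, where older entries are folded into a summary memory instead of being dropped – `summarize` stands in for an LLM summarization call, and the keep-half threshold policy is an assumption:

```python
def compact(messages, max_messages, summarize):
    """Fold older messages into a summary once the transcript grows too long.

    `summarize` is a callable (in practice, an LLM call) that condenses a
    list of messages into one short memory string.
    """
    if len(messages) <= max_messages:
        return messages
    keep = max_messages // 2
    old, recent = messages[:-keep], messages[-keep:]
    summary = summarize(old)  # condense memories rather than drop them
    return [{"role": "memory", "content": summary}] + recent
```

Whether one compaction policy can serve all agent topologies is, as noted, an open question; this only shows the mechanical shape of the operation.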
Which brings me to another point about context – managing context across agents, and the intersection of agents and users. The context for an agent will evolve across sessions. And it might actually be a good thing, depending on the application, to make it accessible to the human user. For example, if we want to allow the user to fix a data issue and/or somehow change the behavior by modifying some learned memory. There is a potential here for conflicts between changes. So we should consider how conflict resolution is done when it occurs on context updates.
User interface and experience should also be considered carefully here. Since a fundamental building block is the agent loop, the state and progress of the agent should be reflected to the human user, and maybe to APIs as well. Faithfully reflecting the state of the system, specifically the behavior and reasoning of the agent(s) running in it, helps to identify issues and build trust. I expect this to be a non-negligible issue when building and adopting such an application. Completion signals are part of this standard pattern, and probably deserve “first-class citizen” status in the application. Understanding what an agent is doing – and when, whether, and sometimes how it achieved the goal the user presented – matters to the user and should be standardized.
One last tradeoff to point out is the mechanism agents use to discover tools. You can have a static list of tools (capabilities), coded as available to each agent; this provides a more predictable list and therefore higher control. On the other hand, you can imagine a more dynamic “tool registry”, where tools may be added and made available to agents over time. Tool choice is still done by the agent either way, but it’s easier to predict with a static list. An evolving, dynamic registry offers more flexibility but is less predictable – I expect the agent will have a tougher time selecting the right tool in this case.
If we want true flexibility, we lean into the dynamic registry. And if the agent gets lost in the aisles?6 We can always fall back to a “safer” hard-coded map of tools.
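A sketch of a registry supporting both modes – fully dynamic by default, with an optional static allow-list as the “safer” fallback. The class and its methods are illustrative, not an existing library:

```python
class ToolRegistry:
    """Dynamic tool discovery, with an optional static allow-list."""

    def __init__(self, static_allowlist=None):
        self._tools = {}
        self._allow = static_allowlist  # None = fully dynamic mode

    def register(self, name, fn, description=""):
        # New tools can be added at runtime in dynamic mode.
        self._tools[name] = {"fn": fn, "description": description}

    def available(self):
        names = self._tools.keys()
        if self._allow is not None:  # fall back to "safer" static control
            names = [n for n in names if n in self._allow]
        return sorted(names)

    def call(self, name, **kwargs):
        if name not in self.available():
            raise KeyError(f"tool {name!r} not available to this agent")
        return self._tools[name]["fn"](**kwargs)
```

Constructing the registry with `static_allowlist=None` gives the flexible mode; passing a fixed set of names pins the agent down to the hard-coded map of tools.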
From Magic Boxes to Design Blueprints
Whether we choose static control or dynamic flexibility, the goal remains the same: building a robust environment for autonomy.
We are rapidly moving past the phase where an LLM is a “magic box” bolted onto the side of a traditional app. We need to think about how we design these systems. We have to get serious about the architectural patterns that allow these agents to actually get work done without constant human hand-holding.
The transition to agent-driven applications presents a new set of interesting problems for us to solve7. We’re no longer just designing APIs for human coders; we’re designing vocabularies for agents. The challenges ahead – how to build tools that are legible to a model, how to share context across a multi-agent swarm without it becoming a game of “telephone”, and how to let that context evolve organically – are the new unexplored territories of software system design.
Building these systems isn’t just about writing code anymore; it’s about building a harness for reasoning. It’s messy, it’s non-deterministic, and therefore less predictable. But it’s also an exciting architectural shift.
So, let’s keep exploring and see what else we can find.
- or, god forbid, write something in markdown. ↩︎
- More examples here. ↩︎
- Pun intended ↩︎
- Yes, I realize it’s a bit more complicated than that, but you get the idea. ↩︎
- And we have plenty of those, good ones, with no LLMs involved. ↩︎
- That is, fails to accomplish its goal ↩︎
- or at least old problems with new technology ↩︎









