During the development and testing of Dialectic, something kept bothering me. While the application worked largely as designed, the implementation felt a bit too… simplistic.
The debate flow is largely a linear sequence: a loop iterating over debate rounds, with an optional clarification step.
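In rough TypeScript, the shape of that loop looks something like this (an illustrative sketch only – all names and signatures here are my own, not the actual Dialectic code):

```typescript
// Illustrative sketch of the original linear flow; names are assumptions,
// not the actual Dialectic implementation.
type Phase = "propose" | "critique" | "refine";

interface Agent {
  id: string;
  run(phase: Phase, round: number): string;
}

function runDebate(agents: Agent[], rounds: number, clarify: boolean): string[] {
  const transcript: string[] = [];
  if (clarify) {
    // Optional one-shot clarification step: every agent asks its questions up front.
    for (const agent of agents) transcript.push(`${agent.id}: clarifying questions`);
  }
  // A fixed number of rounds, each walking the same phase sequence.
  for (let round = 1; round <= rounds; round++) {
    for (const phase of ["propose", "critique", "refine"] as Phase[]) {
      for (const agent of agents) transcript.push(agent.run(phase, round));
    }
  }
  transcript.push("synthesis"); // Judge produces the final report.
  return transcript;
}
```

Note that the number of rounds is fixed up front, which is exactly the property that turns out to be a problem.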
This isn’t necessarily a bad thing. It’s easy to understand and troubleshoot. It’s predictable. More importantly, it’s a decent starting point, an MVP.
But the problems start to show when using it.
First, the convergence decision is decoupled from the content of the debate itself – the debate always ends after a fixed number of rounds. This means that a simple debate that converges after a round or two may still run for extra, unnecessary rounds. This is obviously wasteful1, but it also risks introducing ‘hallucination drift’ into what would otherwise be a perfectly good conclusion.
Alternatively, the predefined number of rounds may not be enough. I’ve had several cases (mainly in work-related invocations) where a qualitative examination of the resulting report revealed open points and unanswered questions.
Second, the clarification step was constructed so that all agents are exposed to the problem description and context, and ask a single batch of questions up front.
While this allowed agents to gather specific context – which was helpful – it still presents two limitations:
- No real interactivity: debating agents could not follow up with questions after the user’s answers. An agent got clarifications up to a point, but often finished with open questions it could not pursue.
- Isolation: the debating agents don’t see each other’s questions and answers and cannot derive conclusions from them2.
A third point is more on the implementation/operational side. When a given agent failed for whatever reason3, its contribution, at least in that round, was effectively lost. The linear loop meant that a failed agent invocation would be ignored at best; retrying meant the whole round or phase had to be retried. In other words, agent invocations were coupled together in error handling.
All of these problems could be solved in the original code design. But when I started thinking about it, it quickly became obvious, at least to me, that the code would become unwieldy and harder to reason about.
I started wondering whether it would actually be better to write the tool differently.
Since I don’t need many excuses to write code, I rewrote it to address these problems, but also to experiment further with the idea of agent-driven applications.
State Machine Orchestration
If I want to model the code so that decisions are expressed as reactions to inputs, system events and state (in addition to predetermined configuration), a finite state machine seems like an obvious choice4.
The transition itself is also pretty straightforward. The linear flow maps directly to states: we model each phase (propose, critique, refine) as a state of the system, as well as clarifications and synthesis. The system is naturally at one state at any given point in time. In a sense, the original linear flow is a specific case of the broader set of behaviors possible with the state machine.
We end up with a state machine that looks (at a high level) something like this:
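In lieu of the diagram, here is a minimal TypeScript sketch of such a transition table (the state and event names are my assumptions for illustration, not the actual implementation):

```typescript
// High-level sketch of the debate state machine; names are illustrative.
type State =
  | "clarifications"
  | "propose"
  | "critique"
  | "refine"
  | "judge"
  | "synthesis"
  | "done";

type Event = "NO_MORE_QUESTIONS" | "PHASE_COMPLETE" | "CONVERGED" | "CONTINUE";

// Transition table: current state + event => next state.
const transitions: Record<string, State> = {
  "clarifications:NO_MORE_QUESTIONS": "propose",
  "propose:PHASE_COMPLETE": "critique",
  "critique:PHASE_COMPLETE": "refine",
  "refine:PHASE_COMPLETE": "judge",
  "judge:CONTINUE": "propose",    // judge asks for another round
  "judge:CONVERGED": "synthesis", // autonomous convergence
  "synthesis:PHASE_COMPLETE": "done",
};

function next(state: State, event: Event): State {
  const target = transitions[`${state}:${event}`];
  if (!target) throw new Error(`No transition from ${state} on ${event}`);
  return target;
}
```

The judge emitting `CONVERGED` vs. `CONTINUE` is what makes convergence a decision rather than a fixed round count.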
The application is now modeled as a graph of nodes (~= “tasks”), where orchestration happens as a response to events that cause edge transitions.
This model immediately lends itself to implementation of two improvements:
- Asking clarifying questions is easily modeled as a state with a clear event telling the system when we’re done (“No more questions”) ⇒ agents can ask follow-up questions5, and can easily be made aware of each other’s questions.
- Deciding when the debate is done is also modeled as an event, based on the judge’s decision ⇒ autonomous convergence is easier to implement.
Note that configurable safeguards are still in place. We can still cap the number of clarifying questions as well as the number of debate rounds. But the new model naturally opens the door to handling these situations more efficiently.
I won’t go into implementation details here (you can inspect the code, and documentation), but this new flexibility also allows for easier implementation of other scenarios and improvements.
- Adding a new phase/step in the flow, e.g. “review”, is essentially introducing a new node, with relevant transitions.
- If a specific agent fails in some node, it can be retried independently of other agents in the same phase.
It does require implementing a different kind of orchestrator, and adapting the UI options (CLI, web) to this operational model. The orchestration logic now lives in different nodes, and some intermediate technical states need to be introduced.
Interestingly, the agent LLM prompts themselves didn’t change much to accommodate the new state machine orchestration. This hints at a generally good abstraction boundary at that level – orchestration vs. agent behavior.
Where does this leave us from an architecture point of view?
One can argue that the state machine is also hard coded, and fundamentally, the graph transitions are not that different from a program counter moving through instructions. If you squint, it does look similar.
On the other hand, this more naturally allows for easier extensions as noted above (interactive clarifications and autonomous convergence) as well as easier error handling at the node level. There’s also no constraint on having a static, predetermined state machine: the state machine itself can be constructed at runtime based on configuration or input.
In addition, if we zoom out a bit, and think of a potential roadmap, an event-based model allows the application to be easily re-implemented as separate processes, with different nodes implemented in separate “services” responding to events. Scaling becomes easier. Doing it based on the rigid loop-based flow would’ve been harder6.
But there’s something more fundamental in how the application is built – it’s still expressed in code.
An Agentic (?) Application
The refactoring described above works in the sense that it does improve the mechanics of extending the code. It allows us to express behaviors more naturally, and potentially scale better.
Still, the core application logic is expressed in a series of TypeScript code files – state machine transitions are expressed in code. Even the agent prompts are delivered as part of the code.
At a basic technical level, any material change to the behavior of the application requires some code change (+rebuilding and shipping). Extensibility, even when easier, is still code-centric. This becomes more of an issue if our application requires more flexibility and customization from a user.
We have improved control-flow modeling, as well as runtime semantics. But the application behavior is not fully externalized as a protocol/data.
What does it mean for the application protocol to be externalized as data?
At the heart of it, the application’s logic is represented as artifacts that are observable and even open to manipulation by the system’s operators, not just its coders.
To the older programmers in the crowd, this would be somewhat reminiscent of Lisp/Smalltalk and other homoiconic languages, where the program representation is directly manipulable in the same semantic system as the program data (e.g. forms/objects, S-expressions)7.
But this is not exactly homoiconicity. In this case, we are able to modify the program’s behavior by manipulating files that are read during execution.
In a system running continuously, this gives us a chance to change the system’s behavior as it’s running. In that respect, it is similar. I guess it’s more “workflow as data” and not so much “code as data”.
Another analogy might be a template in a no-code tool, where users have the option to customize the flow without coding. It is similar in the technical sense, only here we lack the formal semantics that usually come with a no-code modeling tool. We have the English language, aided by tools (again – code) that provide a more rigid structure.
What I’m after here is a clear separation between the agent “runtime” and the application’s business logic, in a way that allows the application protocol to be defined as malleable artifacts.
Which brings me back to the idea of implementing the application with an AI (LLM-based) agent at its core.
Practically, this means that the application workflow is represented in a series of artifacts that are inspectable and modifiable by the user or operator of the system. The “runtime” itself would be an agent platform with basic capabilities, driven by an LLM, with relevant basic tools.
What do we gain here?
We gain transparency and faster architectural iteration with rudimentary tooling8. We also get easier customization of behavior.
At the same time, we must recover guarantees we lost when the workflow was implemented in code, compiled and verified. We’re moving from imperative coding to inspectable runtime artifacts.
This is how I got to Dialectic-Agentic.
It is essentially the same Dialectic application, re-imagined as an agent-native application.
The core execution engine is any agent platform available today. This should work with Claude Code, Cursor, etc. These already implement the basic agent loop and tool abstractions (plus some built-in tools) that allow building the application on top of them.
The application protocol is expressed through a series of skill files and prompts. These enforce strict file conventions that serve as the local communication mechanisms between agents.
The flow orchestration is described in the Orchestrator agent skill. This is the main agent running in the agent loop. Using the built-in “Task” tool, it executes various subagents (per role) and the judge agents.
All work and communication between the orchestrator and other agents is done through reading and writing files in a dedicated debate workspace. This also allows us to follow the progress and status of the debate (there’s a `progress.md` file).
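To make this concrete, a hypothetical workspace might look like the following (only `progress.md` and the round/proposals path pattern come from the actual skill conventions quoted later; the rest is an assumed layout for illustration):

```
workspace/
├── progress.md                  # running status of the debate
└── debate/
    └── round-1/
        ├── proposals/
        │   ├── architect.md     # one proposal file per agent role
        │   └── performance.md
        └── critiques/
            └── ...
```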

Invocation happens by triggering the relevant skill in the relevant agent platform, with the problem description and context directories given, as before.
Configuration is similar to the code-centric version. Note, though, that much of the agent and LLM configuration is irrelevant here, since it is implied by the running agent platform. The configuration focuses on the agent-specific instructions and guardrails.
The entire application logic is encoded in skill files (blue components above), taking advantage of the agent runtime capabilities of reading files or doing any kind of web search or any other customized tool. The LLM configuration is entirely out of scope for the application.
The application UI is essentially the built-in agent chat window or terminal, whichever the user decides to use. The intermediate files are of course also part of the UX. You can track progress and status through the files in the debate workspace as they are written and updated by the operating agents. The debate workspace also remains available at the end for troubleshooting or other analytics.
This is still not a full-blown agent-driven application as I have outlined before, but the core components are there: the agent loop and basic tools are already part of the underlying agent platform. The shared context is given in the debate workspace – a simple file system directory.
The workflow, at least at this point, is a rather simple one, with a clear beginning and end. There’s no sharing context with the user while the application is running, but this is mainly because the running time is finite, and usually short.
—
Now that we have three different implementations of the same pattern, it might be worth taking a step back to consider the tradeoffs.
Comparing Implementations
The three different implementations of the same application (imperative, state machine, agent-based) accomplish essentially the same task – running a system design debate and producing a result.
I have not achieved exact feature parity between the implementations, but there should not be anything that fundamentally prevents us from doing it, even if the implementation may be awkward.
It would be interesting to examine the tradeoffs of the different implementations from an architectural point of view. How do the different implementation approaches differ in different aspects?
Change Velocity
How long will it take to implement a new feature, and deliver it to users?
The general question of course depends on the feature and its complexity, but it still might be worth examining it through the lens of a specific feature (or set of features). Imagine, for example, that we need to include a new step in the process, e.g. a final review of the solution by all agents9.
The deterministic flow would require changes in several code files (the orchestrator, role-based agent interface and implementations). It would also require new prompts and potentially new state attributes to be passed.
It will probably also require specific context construction.
The state-based flow would require a new graph node implementation, with relevant wiring. It’s better organized, as the flow is clearly separated from other aspects.
Both of these implementations, of course, require code changes plus build and deployment of compiled files, including package publishing.
The agentic implementation basically requires a change in the core protocol (a new step before the synthesis phase?) and that’s it, really.
Delivery of the actual skill files really depends on the platform, but it’s essentially copying the necessary markdown file.
Failure Isolation
This aspect of course depends on the type of failure mode. It’s obvious that an underlying failure in the LLM APIs, or in their availability, is a blocker for any kind of application where LLMs play a vital part.
Any central failure, e.g. no LLM available, will affect the entire execution.
I think it might be more interesting to address the question of how isolated a failure mode is when it does happen in a specific step/component.
Let’s consider a failure in one agent execution, in one phase. It could be caused by a misconfiguration of the LLM or prompt, or by a tool call, leading the LLM to return an invalid response – one not according to protocol.
The imperative implementation would either try to work with the given response, however lacking/broken, or stop the debate completely (e.g. in the proposal phase). Not all errors will be immediately obvious but this is more an issue with the current implementation, not so much with the pattern. A technical failure is more likely to cause the entire run to fail. Isolation would require granular error handling at the code level, e.g. smaller and specific try-catch blocks.
The state-machine implementation works largely the same for phase-scoped errors. It either aborts the flow completely (proposal, refinement phases) or continues with partial results (critique phase). The specific mechanism is different, but the result is the same from an overall application point of view.
Note that in the current implementation, there’s no validation of the quality of returned result from agents – nonsensical LLM responses may propagate.
The node/event model does make it slightly easier to isolate problems when they happen, especially if we want to execute nodes in separate processes (not the current implementation).
With the agent-based implementation, the policy is embedded into the skill file, e.g. here (section 4.2):
**Wait** for all N subagents to complete.
**Verify** that each expected file exists: `{WORKSPACE}/debate/round-{ROUND}/proposals/{agent.id}.md`
If any file is missing:
1. Log a warning to `progress.md`: "WARNING: {agent.name} proposal missing in round {ROUND}. Retrying."
2. Re-dispatch that agent's subagent once.
3. If still missing after retry: log "WARNING: {agent.name} skipped in round {ROUND}" and continue without this agent. Inform the judge of missing agents when it runs.
That is, the current policy is to retry an agent execution once; if it still fails (no file found), log a warning and continue. It does not stop the debate, but it does make the problem explicit.
Note that in this case too, a faulty or missing response (after one retry) lets the process continue, so a problematic response will propagate into the debate and may cause downstream issues.
Failure is generally more isolated in this case simply because it happens at a subagent level, and focused on specific task execution.
Note that the actual handling of errors depends on the executor being strict in its execution. There might also be drift from the artifact changing, or from instructions in the prompt that alter this behavior; it is not absolutely guaranteed.
In all three implementations, we can build more robust failure handling: validate the actual result, retry execution, isolate specific agents.
The question then becomes how easy it is to introduce a more robust failure handling mechanism.
Imagine we’d want to isolate failures of an agent so they won’t stop the debate.
With the imperative solution, this would entail coding a whole protocol between the orchestrator and other agents.
With the state-machine implementation, this would require introducing new states dynamically (“1 agent completed”, “2 agents completed”, …, “N agents completed”). This is not currently implemented, but the basic mechanism is there (note it’s called “DEFAULT_TRANSITIONS”).
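For illustration, generating such per-agent-completion states at runtime might look something like this (a sketch under assumed names, not the actual DEFAULT_TRANSITIONS code):

```typescript
// Illustrative sketch: build "k of N agents completed" states for a phase at
// runtime, so a single failed agent does not block the whole phase.
type State = string;

interface Transition {
  from: State;
  event: string;
  to: State;
}

function agentCompletionTransitions(phase: string, agentCount: number): Transition[] {
  const transitions: Transition[] = [];
  for (let k = 0; k < agentCount; k++) {
    const from = k === 0 ? phase : `${phase}:${k}-completed`;
    const to = k === agentCount - 1 ? `${phase}:done` : `${phase}:${k + 1}-completed`;
    transitions.push({ from, event: "AGENT_COMPLETED", to });
    // A failed agent emits a different event but still advances the count,
    // so the phase completes with partial results instead of stalling.
    transitions.push({ from, event: "AGENT_FAILED", to });
  }
  return transitions;
}
```

Because completion and failure both advance the phase, one broken agent degrades the result rather than aborting the debate.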
With the agent-based implementation, the policy is basically the five lines quoted above. Changing it means editing the SKILL file, or providing extra instructions when invoking it (the “user prompt”). This of course assumes the underlying LLM follows instructions10. In short, it’s easier to implement, but we’re more at the mercy of the underlying agent to follow the instructions as intended.
Runtime Transparency
How easy is it to understand the execution as it is running?
In the imperative implementation, the flow is mostly implied in the code itself. We would need to log everything or implement tracing to gain visibility. In short – more code.
In the state machine implementation, the flow is also expressed in code, but it’s easier to understand where it stands just by tracing/logging state transitions. Another case where better code organization benefits us. If nodes communicate by some other inter-process communication protocol, e.g. message queues, it’s also possible to track these.
In the agent-native implementation, since all communication between agent executions happens in files (status.md, progress.md, files written with proposals, critiques, etc.), it’s very easy to simply look at the file system and understand how the process is progressing, or where it fails.
Determinism and Reproducibility
How deterministic is a given execution? How easy would it be to reproduce it?
In both of the code-based implementations, the process is expressed in code. Given the exact same inputs and sequence of events, we’re almost certain to reproduce the same results. While there is some non-determinism in the potential LLM response, it is unlikely to affect the execution of the flow; it might affect the quality of the end result.
In the agent-native approach, a lot of the execution depends on the LLM following instructions properly. The execution here is a lot more sensitive to the agent platform running it, prompting and runtime changes.
This can be good in some cases, if the LLM finds ways to overcome obstacles, but generally speaking, the behavior is less predictable than code. To mitigate this, we’d need to invest more in verifying contracts (e.g. files created). There’s no question that this approach is weaker on this point.
Tool Integration Ergonomics
How easy is it to integrate tools into the flow and direct the LLMs to use them when necessary?
In both of the code-based implementations, tool registration and execution is code-centric. We would need to implement tool discovery11 and integration into prompts, as well as execution. It’s possible to integrate a more well-established protocol, e.g. MCP, but that still requires investment in implementation and maintenance. There are, of course, established agent frameworks these days that do a lot of this heavy lifting.
In the agent-native approach, this is largely solved by the underlying agent platform. It already takes care of registering tools, including custom tools; and it usually has some basic tools already built-in. For example, in Cursor, file_read and web_search are available as part of the platform. We’re only left with guiding the agents on how to use them. In this respect, it’s a done deal and the application developer only needs to focus on usage of tools. It also means that tool usage might not be immediately transferable to other platforms, unless we somehow make sure we’re using some standard tooling, e.g. the same MCP servers.
I’m not sure there’s a clear winner in this aspect, except that existing platforms already support it out of the box.
Testing
How easy would it be to test the application behavior in each approach? How well can we use established testing tools and methodologies?
The imperative implementation is the winner in this aspect. It is best suited for traditional unit testing and other automated testing approaches.
The state-machine implementation is also code-centric and therefore easily testable with existing tools. It might need a bit more testing for the nodes/events facility, but this isn’t a significant addition in complexity.
The agent-native implementation is weaker in this aspect. Testing here relies on golden-artifact testing, validating implicit contracts (file naming and content), and a generally more end-to-end approach.
This is a point that’s generally true for applications relying on LLM execution, and I think merits its own separate discussion12.
So Which is Better?
To summarize this comparison, if I had to rate these implementations on a 1 to 5 scale (1 – weak, 3 – balanced, 5 – strong), it would look something like this:
| Aspect | Imperative Implementation | State-machine Implementation | Agent-native (skill-based) Implementation |
| --- | --- | --- | --- |
| Change Velocity | 2 | 3 | 5 |
| Failure Isolation | 2 | 4 | 3 |
| Runtime Transparency | 2 | 3 | 5 |
| Determinism / Reproducibility | 4 | 4 | 2 |
| Tool Integration Ergonomics | 3 | 3 | 5 |
| Testing | 5 | 4 | 2 |
Unsurprisingly, there’s no one architecture that dominates all of these aspects. Each refactor done here improved some areas at the cost of others.
When we moved from imperative code to state machine implementation, we gained better code organization, flow modeling and failure boundaries. But we paid a “tax” in complexity (managing nodes, events, suspend/resume cycles).
When we moved to the agent-native architecture, we gained flexibility, easier customization, and velocity. This allows the system to adapt to the conversation rather than follow a script. But we pay in less deterministic execution and a harder-to-test application.
As always, the answer to what is better is ‘it depends’. There is no necessarily better architecture, only a better fit for the specific problem at hand.
If we optimize for predictability, and maybe compliance, it would probably be better to go with one of the code-based approaches.
If we optimize for rapid iterations, and protocol flexibility (including user contributions), it might be better to go with an agent-native approach.
And of course, other applications with more complex flows13 might work better with a hybrid approach, where the part of the process we want to be more predictable and compliant is implemented in code and integrated as a tool with an underlying agent.
For this specific use case, the flow remains fairly simple and predictable. My takeaway is that an agent-native architecture really fits when the path to a solution isn’t an obvious “straight line” – where flows are less rigid, or where different processes must be combined on the fly in unforeseen ways.
Consider, for example, a Tier 1 customer support bot following a well-known script. This is usually predictable and code-like (“if this is raised do this, otherwise do that”). Contrast this with a support bot that behaves more like a high-level troubleshooter, and pivots based on the complexity of the problem and its context. In that scenario an agent-native architecture will fit better.
Similarly, consider supply chain software that needs to set up a delivery route. An agent connected to online information, absorbing different inputs about external events (e.g. extreme weather, fuel shortage), should be able to adapt better than a static route based on hard-coded heuristics.
In the end, we architect for the predictable, but we try to build for the unknown.
And it is in the “unknown” that an agent-native approach finally pays its rent.
- Token economy! ↩︎
- For example, the “architect” role agents and the “performance” role agents have a lot of overlap in their clarifying questions. ↩︎
- For example, a tool failure or connectivity issue ↩︎
- I guess there’s a reason why LangGraph is basically built around a similar model. It’s natural for a workflow ↩︎
- Admittedly, working interaction into the state machine is a bit more involved, but doable. ↩︎
- Then again, it’s not a requirement, so I wouldn’t run to implement it just yet. ↩︎
- Usually with strong meta-programming affordances ↩︎
- And, well… a robust LLM. ↩︎
- Or ADR documentation, or JIRA update, or whatever ↩︎
- It’s also possible for a model to try and overcome the issue in some other creative method. ↩︎
- Similar to how it’s currently implemented. ↩︎
- Maybe tests defined on traces? ↩︎
- Like a lot of business applications focused on processes ↩︎


