
Argument-Driven Development: Benchmarking Dialectic

In the last post, I presented the idea of multi-agent debates and a possible implementation of that idea in software design.

At the time, I had only anecdotal and very preliminary evidence on whether it works. Since then, I’ve tested more, added useful features and spent some time fixing bugs. Now it’s time to look at these initial results and what we can learn from them.

Evaluation

Generally speaking, how you evaluate a tool like Dialectic depends entirely on where, how, and how much you intend to use it. In its current incarnation it’s a CLI utility, and I’m still working on assessing its efficacy. My focus is on how to achieve good results from a qualitative perspective – in other words, on the quality of the analysis and solutions it provides. Since we’re dealing with software design, which is famously hard to quantify1, assessing the results of that analysis is the primary focus. The problem of system design, especially in software-heavy systems, is almost always a question of tradeoffs. There is almost never an absolute truth, and quite often not even a clearly better alternative. It’s often hard for us, as humans, to judge what the “best” solution is, so evaluating the quality of the output mechanically is not easy. Yet scaling this kind of evaluation relying solely on human judgment is unrealistic2.

I could, and probably should, address other quality attributes, namely runtime performance and cost of usage. These are important, but not as much as the quality of results, which I consider the essential success factor. For now, we are operating in controlled, limited environments, which makes these factors less critical.

Beyond a holistic quality measure of the design, I’m aiming to see how different customizable factors in Dialectic affect the quality.
Specifically:

  1. How does the number of rounds affect the debate and the final suggestions?
  2. How does the introduction of clarification questions (answers to questions raised by the agents) affect the outcome?
  3. How do different models used affect the results?
  4. What if we use different subsets of the possible roles?
  5. What if we enable/disable the summarization of conversation?

There are endless combinations and potential factors affecting performance, on their own and when combined. The analysis below is a preliminary look at specific factors, with some suggested follow-up questions. It’s pretty obvious that there are more factors that might affect the results.

Evaluation Method

The question of how to evaluate AI performance is a hot topic these days, and methodologies and best practices are still evolving. Given the focus I mentioned above and, admittedly, my limited time and budget, I implemented a simple “LLM-as-a-Judge” strategy with some hand-picked human review3 to gauge quality. Quality in this case is not so much the “correctness” of the suggested solution4 as how well it addresses the issues raised in the stated design problem, identifies risks, and provides sound reasoning.

When creating the evaluation functionality (with LLMs), I focused the evaluator’s prompts on estimating coverage of different aspects. The default evaluation prompt covers functional completeness as well as non-functional requirements – performance/scalability, security/compliance, testability/maintainability – and an overall assessment, while providing reasoning for each rating. When evaluating the different factors, we get scores (reported in the results below) for each of these aspects separately.
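To make this concrete, here is a minimal sketch of what such an aspect-based judge prompt could look like. The function name, aspect labels and JSON response shape are my illustrative assumptions, not Dialectic’s actual prompts:

```python
# Hypothetical LLM-as-a-judge rubric; aspect names and prompt wording are
# illustrative, not the actual Dialectic evaluation prompts.
ASPECTS = [
    "functional_completeness",
    "performance_scalability",
    "security_compliance",
    "testability_maintainability",
    "overall",
]

def build_judge_prompt(design_problem: str, proposal: str) -> str:
    """Assemble an evaluation prompt asking for a score plus reasoning
    for each aspect, returned as JSON."""
    aspect_lines = "\n".join(f"- {a}" for a in ASPECTS)
    return (
        "You are an impartial software-design reviewer.\n"
        f"Design problem:\n{design_problem}\n\n"
        f"Proposed solution:\n{proposal}\n\n"
        "For each aspect below, give a score from 1 to 10 and a short "
        "justification. Respond as JSON mapping aspect -> "
        '{"score": int, "reasoning": str}.\n'
        f"{aspect_lines}"
    )
```

The key point is that asking for per-aspect scores with reasoning, rather than a single holistic number, is what makes the per-factor comparisons below possible.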

One can of course configure the evaluators with other prompts and temperatures, so this is still something that can be improved. This matters: as I was testing the evaluation functionality, it was obvious that more detailed prompts, with examples and chain-of-thought patterns, produced stricter evaluations with better reasoning.

All the tests I describe below use the same evaluation configuration – prompts, model and temperature. This way, good or bad, at least the evaluating agents are consistent.

Case Studies

For testing, I chose several cases: somewhat generic problems taken from architectural katas, and an example from a Reddit discussion. These are good as preliminary, limited examples, but they lack the broader business and product context that often exists in companies doing software design on evolving systems. They may not be completely representative of real-world design problems, but as an evaluation benchmark I believe they are a decent starting point.

For each of the case studies, I created the necessary configurations and a script to run one of the above tests. Another script then runs a specific test on all case studies.

This is by no means a complete, rigorous study or experiment. I expect to continue with more cases, especially real-world ones, requiring more context and subtlety in the questions raised. There is a difference between “What is the best way to build a system with the following requirements… “ and “I need to improve this component, given the current system and constraints – what is the best way to do it…”5.

From anecdotal experience (at work), I can already identify a pattern where more specific questions require more context and a more focused phrasing of the problem. In other words, the way we present the problem (the input), unsurprisingly, affects the output.

Unless otherwise noted, the debate configuration for all these tests included one agent for each of the “Architect”, “Performance”, “Security” and simplicity (“KISS”) roles. All were configured to use the Gemini 2.5 Flash Lite model with a temperature of 0.5. The default judging agent uses the same model and temperature.

The evaluation configuration uses two evaluating agents with the same Gemini model; the temperature is not configurable. The evaluators are prompted to provide not only an overall judgement of the final result, but also an estimate of how well the debate results address other factors such as performance, maintainability, etc. The scores are averaged across the different case studies for each test.
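The aggregation step is simple: for each test configuration, every aspect’s score is averaged over all case studies. A minimal sketch (aspect names are illustrative):

```python
from statistics import mean

def average_scores(per_case_scores):
    """Average each evaluation aspect across case studies for one test
    configuration. Input: one dict of aspect -> score per case study.
    Aspect names here are illustrative, not Dialectic's exact ones."""
    aspects = per_case_scores[0].keys()
    return {a: mean(case[a] for case in per_case_scores) for a in aspects}
```

This is also why a single outlier case (like the ‘kata3’ example mentioned later) can visibly pull an average down.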

Results and Preliminary Analysis

My goal here is to try and gauge what affects the quality of returned results. What immediately stood out even in preliminary tests was that, unsurprisingly, prompts matter. The original prompts I used for defining the different agent roles were somewhat bland. The results I got were obviously too generic and lacked details. After some iterations and refinements (with AI) on the prompts, I reached what I think are reasonable default prompts that seem to focus and guide the LLMs to provide decent answers. These are customizable, even per agent, so users have the option to play around with more prompts and suggest better ones. I’ll be happy to hear about better options.

In addition, as described above, I ran several tests to understand the effect of different options and customizations.

Enough talking, let’s look at some numbers.

Does Clarification Matter?

With clarification turned on (“True” below, in green), each agent gets to ask the user 5 clarifying questions before the debate begins. The answers to all agents’ questions are available to all agents.
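Sketched in code, the flow looks something like this. The `Agent` class and its methods are hypothetical stand-ins for Dialectic’s internals, not its actual API:

```python
# Illustrative sketch of the clarification phase; Agent and its methods
# are hypothetical stand-ins, not the actual Dialectic API.
class Agent:
    def __init__(self, role):
        self.role = role
        self.context = []

    def ask_clarifying_questions(self, problem, n):
        # A real agent would call an LLM here; we fabricate placeholders.
        return [f"[{self.role}] question {i + 1} about: {problem}"
                for i in range(n)]

    def add_context(self, qa_pairs):
        self.context.extend(qa_pairs)


def run_clarification_phase(agents, problem, ask_user, questions_per_agent=5):
    """Each agent asks the user clarifying questions; the pooled answers
    are then shared with every agent before the debate begins."""
    shared = []
    for agent in agents:
        for q in agent.ask_clarifying_questions(problem, questions_per_agent):
            shared.append((q, ask_user(q)))
    for agent in agents:
        agent.add_context(shared)  # every agent sees all questions and answers
    return shared
```

The pooling step is the interesting part: a security agent’s question (and its answer) also informs the architect’s proposal, and vice versa.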

Looking at the scores, it seems that when clarification is available, the debate tends to produce better results, although by a close margin.

So yes, it helps. I’d venture that this hints at the importance of context in general.

Assuming this is consistent, potential follow up questions to this are whether more questions provide better results, and whether questions from specific roles affect the result differently.

Does Summarizing The Debate Make a Difference?

The summarization feature allows the agents to summarize the context when it becomes too big. Summarization is done from the perspective of each agent on its own, and is implemented with an LLM call. This way, different agents can apply different summarization strategies and/or emphasize different aspects.
Currently the summarization is somewhat simplistic: once the context grows beyond a (configurable) threshold, it is summarized down to some maximum length.
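The threshold logic itself is straightforward; here is a minimal sketch, with `summarize` standing in for the per-agent LLM call (names and signature are my assumptions, not the actual implementation):

```python
def maybe_summarize(context, threshold, max_length, summarize):
    """Summarize the debate context only once it grows past the threshold.
    `summarize` stands in for a per-agent LLM call; names are illustrative,
    not the actual Dialectic implementation."""
    if len(context) <= threshold:
        return context  # still small enough: keep the full context
    return summarize(context)[:max_length]  # enforce the configured cap
```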

This is essentially a performance feature. But does it come at the cost of quality?

Results:

The results seem somewhat comparable, with a slight advantage to the non-summarizing debates. Unsurprisingly, when we provide more context, the results tend to be better.

There are of course follow up questions here as well: does the length of the summarized context matter? What if I don’t summarize at all, but instead allow the model to retrieve the context it wants more accurately? Can I gain performance without compromising the available information – and subsequently, the result quality?

Does The Model We Use Matter?

One of the obvious questions here is whether the actual LLM makes a difference.

To answer this question fully, we would need to test on a variety of models, and there are quite a few.

Since that could take eons (and burn a hole in my wallet), I chose to focus on 3 models. From preliminary, isolated tests, I noticed that larger models (e.g. Claude Sonnet 4) don’t necessarily perform much better than smaller ones.

So for this test I went with 3 smaller models: Gemini 2.5 Flash Lite, GPT-5.1-Codex-Mini and Kimi-Dev-72b (from Moonshot). 

The results:

As you can see, there’s not a lot of difference between the models, with maybe a slight advantage to Kimi-Dev-72b, but not a consistent one across the different scores. We can of course continue with other models, and with different model settings, e.g. temperature.
It would also be interesting to see whether using a mix of models in the same debate leads to different results, or whether certain models are better at debating specific perspectives – for example, is Gemini a better model for assessing maintainability?

Do The Roles Matter?

One other possible customization is to decide which agent roles (essentially design perspectives) are participating in the debate. One can have any mix of the available roles and have them debate equally6.

For this test, I used the following subsets of roles:

  • Architect, Performance, Security and KISS (simplicity)
  • Architect, Architect
  • Architect, Architect, KISS
  • Architect, Performance, KISS
  • Architect, KISS

Results:

Generally speaking, a combination of general “System Architect” role agents seems to provide an overall good result on most scores, compared to other combinations.

Unsurprisingly, combinations that did not have a security expert involved score lower on the “Security” and “Regulatory and Compliance” scores. When a security expert was involved in the debate, more emphasis7 was given to security issues.

The same seems to be true for the performance/scalability aspect when a performance expert is involved.

This strengthens the hypothesis that debating with different roles genuinely affects the result.

This also hints at a simple way to weight different perspectives. For example, if you want to put more emphasis on the simplicity of the solution, add another “KISS” agent to the debate on top of the existing one. That gives you 2 (or more) agents championing simplicity in their proposals and critiques. At the synthesis phase, when the judge gathers all the proposed solutions into a final proposal, stacking the deck with agents who emphasize simplicity will likely push the synthesized solution in that direction.
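In code, this “stacking” amounts to nothing more than expanding role counts into a flat roster before the debate starts. A sketch (the dict shape is my illustrative assumption):

```python
def build_roster(weights):
    """Expand role weights into a flat agent roster. Duplicating a role
    gives that perspective more voices at the synthesis phase; the dict
    shape is illustrative, not Dialectic's actual configuration format."""
    roster = []
    for role, count in weights.items():
        roster.extend([role] * count)
    return roster
```

For example, `build_roster({"Architect": 1, "KISS": 2})` yields a debate with two simplicity champions against one architect.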

It will of course be interesting to test with other combinations, and on different problems that raise more specific design questions.

Does The Number of Debate Rounds Matter?

Another obvious question is whether the number of debate rounds matters. Would longer debates produce better results?

Note that, as currently implemented, the number of rounds is fixed (configurable per debate) and the same for all agents – they all run the same number of rounds.
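A fixed-round loop of this shape is easy to picture; here is a hedged sketch where `propose`, `critique` and `revise` are hypothetical stand-ins for Dialectic’s internals:

```python
# Sketch of a fixed-round debate loop; the agent interface (propose,
# critique, revise) is a hypothetical stand-in, not Dialectic's API.
def run_debate(agents, problem, rounds):
    """All agents run the same, fixed number of rounds: each round every
    agent critiques the current proposals, then revises its own."""
    proposals = {a.name: a.propose(problem) for a in agents}
    for _ in range(rounds):
        critiques = {a.name: a.critique(proposals) for a in agents}
        proposals = {a.name: a.revise(proposals[a.name], critiques)
                     for a in agents}
    return proposals
```

A natural follow-up would be a dynamic stopping rule (stop when proposals stop changing) instead of a fixed count, which is exactly what the results below hint at.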

Results:

Generally speaking, the numeric scores behave more or less the same across configurations (the ‘kata3’ example was a bit of an outlier compared to the rest; it pulls the average down).

Across all evaluated scores, we see a similar pattern – a few rounds provide good results, then evaluations dip, only to rise again around the 5th or 6th round. Examining the specific results, I could identify a pattern where the suggested solutions of shorter debates provide decent results. Debates with 5 or 6 rounds also provided decent results, but different ones – emphasizing simplicity. It’s as if, after a few debate rounds, the “KISS” agents somehow tilted the result towards simpler solutions8. I cannot explain it definitively yet, but it could be ‘position bias’ at play: if the synthesizing judge saw the KISS result first, it may have anchored on it. This is of course just a hypothesis that needs to be examined further.

Still, it looks like more than 1 or 2 rounds doesn’t provide much value. I might just stick with 2 rounds for real-world usage.

Key Takeaways – What Can We Say So Far?

So while these tests are not exhaustive in any way, and they certainly raise a lot of follow-up questions, I think some things are already clear and can be implemented, or at least give direction to further implementation. We’re still not at the point of truly autonomous software design, but I think we’ve laid a few bricks on the path to get there. We can safely say that a structured debate does help – but the details of “how” are important to make it effective.

First, providing extra context for the problem yields better results. This was evident from the use of clarification as well as from avoiding summaries. I expect this effect to be even more acute when dealing with “brownfield” projects, where there’s a lot of legacy and implicit information that needs to be supplied. This has already motivated me9 to add the option for a context file, but we’ll need a better context mechanism to make context easier to add and search.

The clarification mechanism is already a step in that direction, as it allows the agents to identify (“think about”) their missing information and ask explicit, specific questions that might help them suggest a better solution and critique other agents’ solutions.
And this point about clarification vs. simply more context is important. It’s not just about stuffing more context into the prompt; it’s about refining that context. Even when models were given ample initial information, the act of having agents ask and answer targeted questions consistently led to more robust and accurate design proposals. This suggests that the iterative loop of questioning and informing helps the agents converge on a shared, deeper understanding of the problem, rather than just processing a larger, potentially ambiguous input blob.

In contrast to this, it seems the choice of model doesn’t affect the result that much. It might be better to try models optimized for coding (as kimi-dev-72b claims to be), but the difference isn’t clear or significant. It’s also encouraging that smaller models perform fairly well, as this directly impacts the cost. This isn’t just about saving pennies; it means that the architecture of the debate itself – the structured interaction and iterative refinement – can compensate for individual model scale. This has profound implications for scaling these systems in real-world, budget-constrained environments.

The short experiment of using different role assignments also confirmed that these perspectives are far from mere window dressing; they can fundamentally shape the outcome. Like a well-rounded human team, a diverse set of AI “personalities” can explore different facets of a problem and push solutions in varied directions. For example, a team weighted towards KISS (Keep It Simple, Stupid) principles produced notably simpler designs. This highlights the potential for engineers to ‘tune’ the debate by curating specific agent roles to achieve desired design characteristics, whether that’s maintainability, robustness, security or performance.

Longer debates also don’t seem to produce much better results. It looks like, similar to human conversations, AI conversations tend to converge pretty quickly, and shorter “meetings” are better10. Some iterative discussion is fine, similar to the design-discussion “grind” we all know, but at some point – pretty early – it stops adding much to the conversation. We’ve all been there.

I do suspect that the number of participating agents, and their roles, will have a more significant effect on the quality of results.

Where Does That Leave Us?

With the insights above, I think there’s enough to start a more productive “real world” use of the tool. Some investment is still needed in the ergonomics of the tool, to make it convenient and practical for day-to-day use. But not a lot is missing, especially given that the target audience is people who aren’t afraid to use a CLI tool and don’t consider a JSON configuration file a magical incantation. It might also be useful to have a more convenient (web-based?) front-end for easier access, customization and review.

There are also clear11 additions and next steps: we’ll probably need a better way to provide context to the debate, driven by the asking agents. Some kind of tool (MCP-based or not) is probably a good answer, but there might be other ways.

This journey of applying AI to autonomous software design is still in its early phases. These initial measurements already demonstrate that structured multi-agent debates, guided by thoughtful methodology, are promising. We’ve seen how clarification, strategic model selection, role diversity, and iterative refinement can collectively elevate AI-generated solutions. As we continue to refine these AI applications and push the boundaries of what’s possible, the vision of a truly Autonomous Software Development Lifecycle comes into sharper focus. I believe there is potential here for a reality where AI doesn’t just assist but actually drives the software lifecycle. From this perspective, the AI-driven design process is just another piece in a bigger puzzle – a way to create a much more mechanized software development cycle.

I hope to explore this more soon.


  1. It’s all tradeoffs – “it depends” ↩︎
  2. Maybe I’ll add some like/dislike buttons somewhere in the future. ↩︎
  3. By me. I’m the human here. But you can also be a human here. ↩︎
  4. Because… it depends ↩︎
  5. Or, for example, “what would be the best way to optimize this db structure” ↩︎
  6. Actually, a user can define a completely new role by providing a custom system prompt for an agent, as the system prompts are what define the roles. ↩︎
  7. Emphasis → more words → more tokens ↩︎
  8. Looks like the KISS agent was very persuasive given enough debate time. Maybe it has better stamina? ↩︎
  9. Should I say “prompted me”? (pun intended) ↩︎
  10. It’s the “it could have been an email” of the AI agents ↩︎
  11. At least in my opinion ↩︎
