MAD About Software Design: When AI Debates

So at this point, I think we’ve established that LLMs can code (right?). They’re only getting better at it. I’ve also argued in the past that I believe LLMs can do more than just code to improve our software engineering lives. But this isn’t a simple task. There’s quite a bit of essential complexity in the process; it’s beyond simply automating day-to-day tasks[1].

Imagine my interest, then, as I stumbled upon the idea of LLM-based AI agents debating each other. The concept isn’t unique to software engineering, but it still appealed to me as a way to simulate (or at least approximate) an actual software design process and, by extension, scale or improve it.

Before we dive into my implementation, let’s step back and understand the concept, and where it fits.

A Discourse of Agents

LLMs are powerful[2], but they often come with a (potentially significant) catch. A single LLM, as capable as it is, can easily suffer from issues like hallucinations, inconsistent reasoning, and bias. And the more complex the task, the more likely it is to exhibit these issues. This is the “single-agent trap”: relying on one model’s perspective means you are exposed to its blind spots. This isn’t that different from how we humans solve complex tasks[3] – the more complicated the task, the more we benefit from collaborating with others.

We have ways to mitigate some of these problems to an extent – prompt and context engineering, RAG, access to tools.

So what if we didn’t have to rely on a single AI agent’s answer, just as we humans collaborate with other people when working on complex issues?

This is where multi-agent debate (MAD) comes in. MAD complements those techniques with an approach that uses iterative discourse to enhance reasoning and improve validity. See examples here.

You can think of it like a collaborative “society of minds”. Instead of one agent providing one answer, multiple agents propose and critique solutions to the problem. This goes on for several rounds of discussion, where agents challenge each other’s proposals, spot errors and refine their ideas. Eventually, the goal is for this process to converge on a superior final answer.

While I don’t intend to provide a full literature review here, or any kind of exhaustive description[4], I think it’s worth understanding the main components, findings and challenges.
What follows below is a crash course on Multi-Agent Debate (MAD). But if you’re interested in more detailed evidence and nuance, I encourage you to follow the links and explore some more.

MAD – The Bird’s Eye View

So how do these debates actually work under the hood?
There are different implementations, and from what I’ve seen, they vary significantly for different reasons. But three fundamental components repeat in all cases.

First is the agent profile, which defines the roles or “personas” of the debating agents. A simple setup might define agents that are symmetrical peers, while more complicated setups assign specific roles: one agent may be a “critic”, another a “security expert”, and so on. There are different ways to create this diversity – using different models, configuring them differently, or prompting the agents to hold or emphasize divergent views.

Second is the communication structure – the topology. This is essentially the network map that dictates who talks to whom. A common setup is fully connected, where all agents see each other’s messages. Other approaches use sparser topologies (agents interact only with specific neighbours) or route everything through a single orchestrator/dispatcher. The choice of topology, of course, changes the debate dynamics.
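These topologies can be sketched as simple adjacency maps that record who reads whose messages. The agent names and functions below are purely illustrative, not from any specific implementation:

```python
# Sketch: debate topologies as adjacency maps (who reads whose messages).

def fully_connected(agents):
    """Every agent sees every other agent's messages."""
    return {a: [b for b in agents if b != a] for a in agents}

def star(agents, hub):
    """Sparse topology: spokes talk only to a central hub/orchestrator,
    which sees everyone."""
    return {a: ([hub] if a != hub else [b for b in agents if b != hub])
            for a in agents}

agents = ["architect", "security", "performance"]
print(fully_connected(agents)["architect"])          # ['security', 'performance']
print(star(agents + ["orchestrator"], "orchestrator")["security"])  # ['orchestrator']
```

In the fully connected map every agent's context grows with the whole debate, while the star keeps each spoke's view small – exactly the trade-off mentioned above.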

Finally, there is the decision-making process: how the debate is concluded. After the agents have debated amongst themselves, how do you decide it’s time to conclude and compile a final answer?
The simplest method, which works well for certain types of problems, is simple majority voting. This works best when the answer is a simple deterministic value, e.g. in math problems. Another, more structured, approach is to use a “judge” (or “arbiter”) agent that listens to the arguments from all sides and selects or compiles a winning answer.
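Both decision styles fit in a few lines; majority voting works when answers compare directly, while the judge is itself an LLM call. The `ask_judge` callback below is a placeholder for such a call, not any real API:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer -- fits deterministic tasks like
    math problems, where answers can be compared directly."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

def judge_decision(proposals, ask_judge):
    """Hand all final proposals to a judge/arbiter agent, which selects or
    synthesizes a winner. `ask_judge` stands in for an LLM call."""
    transcript = "\n\n".join(f"Agent {i}: {p}" for i, p in enumerate(proposals))
    return ask_judge(f"Pick or synthesize the best answer:\n{transcript}")

print(majority_vote(["42", "41", "42"]))  # 42
```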

Does It Work?

Yes, to a degree.
Current research suggests that multiple agents working together achieve better results, especially when the complexity of the tasks increases. This example shows significant improvements on math problems.

Multi-Agent Debate (MAD) systems seem to improve factuality and the accuracy of results. Agents seem to be able to spot errors in each other’s reasoning, improving consistency. Some evidence can be seen here and here among others.

Tasks that are more complicated, and/or require more diversity of thought, seem to benefit from this pattern more. Specifically, iterative refinement and having different models propose and debate each other’s answers seem to yield better results – more consistent answers that align better with human judgement.

Does It Always Work?

Of course not. It wouldn’t be fun otherwise.

This study, for example, suggests that it’s not so much the debate that’s improving performance, but rather the “multi-agent” aspect of it. Another study suggests they are difficult to optimize (though it does conclude they have potential for out-performing other methods).

There are also distinct failure modes. This study suggests that models may flip to incorrect answers under some conditions. Debates also require a more careful setup – specifically, guidance on how to criticize other agents’ answers (structured critique guidance).

There are of course cost considerations, as with any engineering problem. Multiple agents making repeated calls to LLMs with potentially growing (exploding?) context mean costs can easily get out of hand.

This is an active research area, with probably more results and implementations to be shared in the near future.

So while we’re here, why not join the fun, and try to apply it?

MAD About Software Design

This pattern of debating agents can be applied to all sorts of problems, as the studies linked above show. Software system architecture should be no exception. Yet I could not find another implementation of this pattern related to software engineering. The closest is MAAD, which seems nice, but as far as I could see it does not exactly implement a debate pattern; rather, it is a set of cooperating agents working towards the goal of producing a design specification.

Part of the reason this piqued my interest is that in my line of work, when considering feature and system designs, a debate[5] is a natural dynamic. This is simply what we do – we discuss, brainstorm and often argue over different alternative solutions. AI agents debating over a design problem seems like a natural fit.

This is where Dialectic comes into play. 

This is a small, simple implementation[6] of the multi-agent debate pattern, with a focus on software engineering. It is a command-line tool that receives a problem description and a debate configuration, and carries out a debate between the configured agents. The goal is to eventually arrive at a reasonable – hopefully the best – solution to the presented problem, with concrete implementation notes and decisions.

When it comes to the debate setup, Dialectic allows the user to specify the number and roles of the participating agents. A user can choose from the available roles – “Architect”, “Performance Engineer”, “Security Expert”, “Testing Expert” and “Generalist”[7].

The current implementation has a rather rigid debate structure: for a fixed number of rounds (configurable), each agent is asked to propose a solution, then critique all of the other agents’ solutions, and refine its proposal based on the feedback from other agents. The refined proposals are fed into the next round. At the end of the last round, a Judging agent receives the final proposals and compiles a synthesized solution from all participating agents.
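Stripped of prompts and plumbing, that round structure looks roughly like the following sketch. The stub classes stand in for LLM-backed agents; this is not Dialectic’s actual API, just the control flow:

```python
class StubAgent:
    """Deterministic stand-in for an LLM-backed agent (illustration only)."""
    def __init__(self, name):
        self.name = name
    def propose(self, problem):
        return f"{self.name} v0"
    def critique(self, proposal):
        return f"{self.name}: issue with '{proposal}'"
    def refine(self, proposal, feedback):
        return proposal + "+"  # pretend each round of feedback improves it

class StubJudge:
    """Stand-in judge that just concatenates the final proposals."""
    def synthesize(self, problem, proposals):
        return " | ".join(proposals[k] for k in sorted(proposals))

def run_debate(problem, agents, judge, rounds=2):
    """Fixed-round propose -> critique -> refine loop, ending with a judge."""
    proposals = {a.name: a.propose(problem) for a in agents}
    for _ in range(rounds):
        # Every agent critiques all other agents' current proposals.
        critiques = {
            a.name: {other: a.critique(text)
                     for other, text in proposals.items() if other != a.name}
            for a in agents
        }
        # Each agent refines its own proposal using everyone's feedback on it.
        proposals = {
            a.name: a.refine(proposals[a.name],
                             [crits[a.name] for crits in critiques.values()
                              if a.name in crits])
            for a in agents
        }
    return judge.synthesize(problem, proposals)

result = run_debate("cache design", [StubAgent("arch"), StubAgent("sec")], StubJudge())
print(result)  # arch v0++ | sec v0++
```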

As a user, you can control the number of rounds, the prompts used, temperature and model per agent. See here for a more complete description of configuration options.

Why This Debate Pattern?

The chosen debate pattern and configuration options are intentional[8], in an attempt to mitigate some of the problems mentioned above.

First, different “roles” (essentially different sets of agent system prompts) offer different perspectives. When debating – specifically, when criticizing each other’s work – these different perspectives should surface different arguments for each choice. This hopefully avoids at least some of the potential groupthink.

Additionally, each agent can be configured with a different LLM model and different temperature. This offers a chance at combining models with different strengths (and costs), potentially trained and tuned on different data sets. This heterogeneous debate setup, which combines different agent profiles, allows for a rich interaction of viewpoints. This is especially true given the current fixed topology, where every agent critiques all other agents’ proposals.

The possibility of clarifications from the user also allows additional context based on specific agents’ input (the agents ask the user questions). This not only provides more focused context for the debate, but also mimics a real-world dynamic, where the development team interacts with the product owner/manager on clarifications that come up during a discussion (“what should we do in this case? – this is a product decision” is a common phrase heard around the office).

Dialectic also supports context summarization to try and avoid context explosion. There’s of course a trade-off here, but for practical cost reasons[9], the tool needs a way to manage context size. Some models can be quite “chatty” and end up producing big responses.
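A minimal form of that trade-off is a token-budget check: summarize the running transcript once it grows past a threshold. The word-count “tokenizer” and the `summarize` callback here are simplified stand-ins for a real tokenizer and an LLM summarization call:

```python
def maybe_summarize(transcript, summarize, max_tokens=4000):
    """If the debate transcript exceeds the budget, replace it with a
    summary; otherwise pass it through unchanged. Token counting is
    approximated by whitespace-separated words for illustration."""
    if len(transcript.split()) > max_tokens:
        return summarize(transcript)  # e.g. an LLM call: "summarize this debate"
    return transcript

short = "agent A proposed X"
print(maybe_summarize(short, lambda t: "SUMMARY", max_tokens=10))  # agent A proposed X
print(maybe_summarize("word " * 20, lambda t: "SUMMARY", max_tokens=10))  # SUMMARY
```

The trade-off lives in `max_tokens`: too low and agents lose the thread of earlier arguments, too high and every round gets more expensive.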

Apart from being a tool to be used in practice, I realize the different options and combinations possible can lead to very different results, and quality may vary for any number of reasons. This is why the output options also vary: you can simply output the final synthesized solution, or get a structured file containing a more detailed step-by-step description of the entire debate, with all configurations plus latency and token-usage figures. There’s also the option to produce a complete debate report in markdown format. This should allow users to experiment with different debate and agent configurations, and hopefully settle on a setup (maybe several) that best fits their purposes.

What’s Next?

At this point, you can start using Dialectic and experiment with it on different problems and debate setups. I plan to do so as well.

Initial experiments seem anecdotally promising. When used with advanced models, it produces reasonable results. It’s a practical tool, still evolving, that shows promise in helping to analyze and reach solutions in complex domains faster and more comprehensively. But we’ll need to evaluate results more systematically, so that is the obvious next stage.

At the same time, I believe it can already help as a brainstorming partner. Having a tool that automatically analyzes a problem from several angles and iteratively refines solutions is at the very least helpful in covering options and exploring ideas.

But it’s clear that some things can and should be improved/added.

To start, a lot of real-world (human) discussions implicitly involve pre-existing knowledge. This is part of the experience we have as professionals – specifically, knowledge and context of our own systems (the “legacy code”), patterns and domains. While it’s possible to include a lot of this in the problem description and clarifying questions, I believe debating agents should be able to query for further information and knowledge. We will probably need to support plugging in extra knowledge retrieval, driven by the agents, to allow them more focused and refined answers.

Another thing to look into is the way the debate terminates. Currently it’s a fixed, configured number of rounds: all rounds run, and the judge synthesizes an answer at the end. But this is not the only way. We could terminate the debate when no new ideas or issues seem to be coming up. We could have the agents attach a confidence vote to their proposals, and terminate the debate once all (most?) agents are confident beyond some set threshold.
We can also instruct the agents and judge to propose follow-ups, and use the result of a given debate as the input to another, with extra information.
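The confidence-vote idea could be sketched as a simple stopping check run after each round. The threshold and quorum values are illustrative, not anything Dialectic implements today:

```python
def should_terminate(confidences, threshold=0.8, quorum=1.0):
    """Stop the debate when at least `quorum` fraction of agents report
    confidence at or above `threshold` in their own proposal."""
    if not confidences:
        return False
    confident = sum(1 for c in confidences if c >= threshold)
    return confident / len(confidences) >= quorum

print(should_terminate([0.9, 0.85, 0.95]))              # True: all agents confident
print(should_terminate([0.9, 0.6, 0.95]))               # False: one holdout blocks
print(should_terminate([0.9, 0.6, 0.95], quorum=0.6))   # True: majority suffices
```

Requiring full quorum answers the “all agents” variant; lowering it gives the “most agents” variant mentioned above.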

The current topology is also fixed. It will be interesting to experiment with different topologies. For example, have the specialist (security, performance, testing) agents only critique the architect’s proposals. A step further would be for an orchestrator agent to dynamically set up the topology, based on some problem parameters.

Agent diversity is also interesting. There is evidence that diversity of agents improves results in some cases. Playing with the LLM models used, their temperature and specific prompts can potentially complement each other in better ways. We could, for example, create an agent that is intentionally adversarial, and pushes for alternative solutions.

The tool itself can of course be augmented with interesting features:

  • Automatically deriving and outputting ADRs
  • Adding image(s) as initial context
  • Connecting further context available from other systems as input[10], so the agents’ analysis is more evidence-based

These should be helpful in making it more useful for day-to-day work.

Of course, costs are also important. The current implementation tries to summarize so we don’t hit token limits too early. But it’s possible we can find more ways to optimize costs. Skip calls when not necessary, summarize to a smaller size every round, etc.

So Software Designers are Obsolete?

No.
I do believe there’s still a way to go before this replaces the human dynamics of discussion. One thing I still don’t see LLMs doing well is weighing trade-offs, especially when human factors[11] are in play. This is more than a simple information gap that tooling can solve. I don’t see how agents implicitly “read the room”, and I don’t see how we mimic human intuition with agents.

I do see this as a step forward, not only because we can automate a lot of the research and debate. But also because the analysis given by such agents is almost guaranteed to be more driven by information, cold analysis, and the vast knowledge embedded within them. Agents don’t get offended (I think) when their proposal is not accepted, or when they don’t get to play with the cool new technology.

Summary

Dialectic is a simple tool that tries to implement a potentially powerful pattern of agentic systems in the realm of software engineering. If done properly, I believe it can help in reaching decisions faster and with higher quality, especially when scaling design work across a larger organization. And this is what mechanization is all about.

The combination of LLM-based agents into a debate and feedback loops should enable more complete solutions, likely with higher quality.

Off to design!


  1. Which is of course still a welcome improvement ↩︎
  2. And continue to improve ↩︎
  3. Sadly, even the hallucination part is true for humans sometimes. ↩︎
  4. A decent review can be found here. But any “Deep Research” AI will help you here. ↩︎
  5. Between people – humans in the loop! ↩︎
  6. Yes, it could have been implemented with something like LangChain/Graph or probably even some kind of low-code tooling. But I also like to learn by doing, so I opted for the more bare-bones approach of coding from scratch. We might port it to another framework in the future. ↩︎
  7. Note there’s nothing fundamentally software-specific in this pattern, except these roles. It’s straightforward to apply the debate pattern to other roles. ↩︎
  8. And still evolving ↩︎
  9. I got too many 429 errors complaining about token limits when testing ↩︎
  10. MCP server support? ↩︎
  11. Business pressures, office politics ↩︎
