Argumentative Evaluation

Analysis Modes, Methodology

Judging the Quality of a Debate Without Settling Its Conclusions

Reading a debate lets one evaluate what was said in it. But another reading is possible, one that bears not on what was defended but on how it was defended. Which arguments were soundly constructed, which were circular or immunised against critique, which rested on loosely reconstructed sources, which dealt with objections substantively and which evaded them? This dimension of every debate — its argumentative quality, independent of the correctness of its conclusions — is what Argumentative Evaluation seeks to make visible.

This is the level at which the mode differs from the six that precede it. Meta-Analysis, Integrative Synthesis, Emergence Analysis, Tension Mapping, Critical Archaeology, Horizon of Possibilities: all of these modes are descriptive. They observe. Argumentative Evaluation, by contrast, judges — but it judges the conduct of the exchange, not its content, and it is this dissociation that makes it an instrument of a particular kind. It leaves intact the question of who is right; it is interested in who has reasoned well. It does not settle debates; it evaluates the way they are conducted.

The Distinction Between What Is Said and the Manner of Defending It

This distinction between the content of a position and its argumentative quality is ancient. Aristotle, in the Topics and the Sophistical Refutations, already separated what counts as a valid line of reasoning from what counts as a simulacrum of validity — an argument that appears conclusive without being so. But it is with the contemporary tradition of pragma-dialectics — formalised by Frans van Eemeren and Rob Grootendorst at the University of Amsterdam from the 1980s onwards — that this intuition received its most operational formulation. Pragma-dialectics treats all argumentation as a critical discussion governed by rules, and it is precisely the violation of these rules that constitutes a fallacy. These rules do not bear on content — they say nothing about what may be defended — but on the conduct of the exchange: the way interlocutors listen to one another, handle objections, take responsibility for their presuppositions, and agree to revise their positions in the face of solid arguments.

The idea is powerful because it makes evaluation at once rigorous and neutral with respect to content. One can diagnose a fallacy in the defence of a position one shares, and recognise the argumentative quality of a position one rejects. This dissociation is what allows Argumentative Evaluation to operate without becoming a partisan critique.

A Seven-Discipline Framework

Argumentative Evaluation systematically deploys seven disciplines, in this order. Each corresponds to a distinct dimension of argumentative quality and, taken independently, produces a specific type of observation.

First, inferential quality and solid contributions. This discipline identifies valid inferences, illuminating conceptual distinctions, and structuring contributions to the debate. It begins with what is well done — not out of politeness, but because evaluative rigour requires being able to recognise solidity before pointing to weakness. An evaluation that could only spot failures would not be an evaluation: it would be a fallacy hunt.

Next, the treatment of objections. An objection can be ignored, deflected, turned back, or integrated. Each of these operations has a distinct argumentative quality. The ignored objection reveals avoidance; the deflected objection signals that one is not discussing what one claims to be discussing; the turned-back objection can be a brilliant move or a disguised fallacy depending on whether it actually shifts the terrain or merely inverts it; the integrated objection forces a revision of the position. Distinguishing these moves in specific passages is one of the mode’s most discriminating operations.

Then comes internal coherence. A position can be well defended at any given moment and enter into tension with itself over the course of a debate. Spotting non-thematised shifts — those moments when an interlocutor subtly changes criterion without acknowledging it — is one of the most delicate tasks of evaluation. What is targeted here is not explicit contradiction but silent incoherence: the move from one standard of proof to another, the use of a case as an example here and as a counter-example elsewhere, the tacit redefinition of a term between two turns.

The fourth discipline concerns problematic argumentative techniques: fallacies, begging the question, straw men, semantic shifts, false dichotomies, hasty generalisations, immunisations against critique. Pragma-dialectics has compiled a systematic typology of them. The mode does not seek to apply this typology mechanically, but to recognise problematic phenomena in their singularity. A critique addressed to a piece of reasoning is pertinent only if one can name precisely what is wrong — not “your reasoning is false” but “you are generalising a structural property from three cases without a counter-test”.

The fifth discipline is the evaluation of falsifiability. This discipline is a direct legacy of Karl Popper. It poses a simple but formidable question: what could, in principle, show that this thesis is false? A thesis that admits no condition of refutation is not for that reason false — but it withdraws from the test, and this withdrawal is itself an argumentative weakness. Conversely, a thesis whose condition of refutation is so demanding that no realistic observation could satisfy it retains formal falsifiability but loses material falsifiability. The mode is designed to spot these two configurations, and the second — more subtle — is one of those that the most rigorous argumentative evaluators know how to diagnose.

The sixth discipline bears on the evaluation of user interventions. If the user has intervened during the session, their interventions are themselves argumentative acts: they may be well constructed or not, balanced or slanted, equitable or favouring one side. The mode applies to user interventions the same criteria it applies to exchanges between models. This symmetry is important: it protects users against themselves, by signalling cases in which their own interventions may have introduced a bias into the debate.

Finally, the seventh discipline is the declaration of the limits of the evaluation. The mode states what it could not settle, what it evaluated on logical coherence alone for lack of access to the cited references, and what depends on a contestable epistemological interpretation. An Argumentative Evaluation that claimed to have no blind spots would itself commit a fallacy — an appeal to false authority. Pragma-dialectics had already noted this: the quality of an evaluation is also measured by the lucidity with which it states its own limits.
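To fix ideas, here is what such an evaluation might look like once reduced to a data structure. This is an illustrative sketch only: the names (Discipline, Finding and their fields) are hypothetical and correspond to no actual Metamorfon interface; they merely encode the seven disciplines and the kind of observation each one yields.

```python
# Illustrative sketch only: these names are hypothetical, not Metamorfon's API.
from dataclasses import dataclass
from enum import Enum, auto


class Discipline(Enum):
    """The seven disciplines, in the order the mode applies them."""
    INFERENTIAL_QUALITY = auto()     # solid contributions, valid inferences
    OBJECTION_TREATMENT = auto()     # ignored / deflected / turned back / integrated
    INTERNAL_COHERENCE = auto()      # silent shifts of criterion or standard of proof
    PROBLEMATIC_TECHNIQUES = auto()  # fallacies, straw men, immunisations against critique
    FALSIFIABILITY = auto()          # formal and material conditions of refutation
    USER_INTERVENTIONS = auto()      # the user's own argumentative acts
    DECLARED_LIMITS = auto()         # what the evaluation could not settle


@dataclass
class Finding:
    """One observation produced by one discipline on one passage of the debate."""
    discipline: Discipline
    model: str        # which debater the finding concerns
    turn: int         # turn at which the phenomenon occurs
    positive: bool    # True for a credited strength, False for a diagnosed weakness
    summary: str      # one-sentence statement of the finding
```

In this representation, an evaluation is simply an ordered collection of such findings, closed by the declared limits that the seventh discipline requires.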

The Principle of Charity as Method

A transversal discipline, which runs through the seven preceding ones, deserves to be named separately: the principle of charity. Inherited from analytic philosophy — Quine made it a precondition of all translation, Davidson made it one of the pillars of his theory of interpretation — this principle obliges the evaluator to reconstruct the other’s position in its strongest version before criticising it. Before pointing out a fallacy, one first seeks to understand what the interlocutor probably meant. Before denouncing a hasty generalisation, one asks whether it can be read as a heuristic hypothesis rather than as a closed conclusion. Before accusing a stance of immunisation against critique, one asks whether there is a defensible reason for refusing certain tests.

This discipline is precious because it distinguishes Argumentative Evaluation from a fallacy hunt. The fallacy hunt rewards the act of spotting; Argumentative Evaluation rewards the correctness of what is spotted. A critique that does not pass the test of charitable reading is itself an argumentative weakness. In the practice of the mode, every negative finding is preceded by a formulation of the possible charitable reading: “charitable interpretation: the model wants to avoid an empiricist reductionism”, then, if this interpretation does not suffice to dispel the problem, “but in practice, this refusal sets aside any test without proposing an alternative — which resembles an immunisation”. This two-step structure is not a diplomatic concession; it is a methodological requirement.
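The two-step structure can itself be made explicit. The following sketch, in the same illustrative vocabulary as above, simply refuses to record a weakness that has not first received its charitable reading; the names remain hypothetical.

```python
# Illustrative sketch: a negative finding must carry its charitable reading.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharitableFinding:
    summary: str                              # statement of the finding
    positive: bool                            # strength (True) or weakness (False)
    charitable_reading: Optional[str] = None  # strongest plausible version of the criticised move

    def __post_init__(self) -> None:
        # A weakness may only be recorded once the charitable reading has been
        # formulated and found insufficient to dispel the problem.
        if not self.positive and not self.charitable_reading:
            raise ValueError(
                "Negative finding without a charitable reading: "
                "the critique has not yet passed the test of charity."
            )


# Echoing the example above:
immunisation = CharitableFinding(
    summary="The refusal of any test, without an operational alternative, resembles an immunisation.",
    positive=False,
    charitable_reading="The model wants to avoid an empiricist reductionism.",
)
```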

The Mode at Work: The Sortition Session

A recent session illustrates well what Argumentative Evaluation produces. Three models — Claude Opus 4.7, DeepSeek V4 Pro, Grok 4 — debated the question “Should we favour sortition over election to designate certain political representatives?” in Adaptative Cross Trilogue mode, over five turns. None defended an extreme position; all three converged towards a hybrid position, but with very different argumentative architectures. The evaluation was conducted in parallel by Gemini 3.1 Pro Preview and by GPT-5.1, on the same material, without interaction between the two audits.

Here are some of the findings the two audits produced; they show what the mode renders visible.

A solid conceptual contribution. At turn 2, Claude Opus introduced a tripartite distinction — deliberative failure, failure of institutional articulation, failure of political execution — that structured a significant part of the rest of the debate. This distinction makes it possible to avoid mechanically attributing to the mode of selection what in fact stems from the institutional or political environment. Both audits identified it as a clear conceptual contribution. This illustrates the first discipline: positively recognising what gives the debate its intellectual structure.

A masterfully formulated objection. At turn 4, DeepSeek confronted Claude with what it named a “sophism of normative irrefutability”: by refusing any link between democratic legitimacy and empirical recognition, Claude was immunising its own theory against any falsification. The objection does not merely attack a thesis; it precisely names the mechanism by which that thesis escapes evaluation. Both audits identified it as one of the strongest moments of the debate — a typical example of the second discipline (the quality of the treatment of an opposing objection) applied to an operation of the fourth (the identification of a problematic argumentative technique in the other’s reasoning).

A transparent factual correction. In the opening turn, Grok had cited a “2021 meta-analysis by Hélène Landemore”. Claude objected that the work in question is in fact a 2020 theoretical essay, not a quantitative meta-analysis. At turn 2, Grok conceded the mischaracterisation and reformulated. This transparent correction, made without digging in, is one of the traits the mode credits positively: it reflects an intellectual honesty that is in no way automatic in polemical debate.

A problematic stance of immunisation. At turn 3, Claude rejected outright the experimental protocols proposed by the two other models, dismissing the testabilist stance as “reductionist”. Charitable reading: it wanted to protect an irreducibly normative dimension of democratic legitimacy. But in practice, this critique without an operational alternative ended up immunising its own theory against any empirical confrontation. Both audits signalled this tension, exercising the charitable reading before pointing out the difficulty — a concrete illustration of the two-step structure imposed by the principle of charity.

A materially neutralised condition of refutation. At turn 5, prompted by a user intervention asking it to specify an observation that would lead it to revise its position, Grok piled up a conjunction of extremely demanding conditions: a large-scale randomised controlled trial, in a stable democracy, over several cycles, with a Gini decrease above 5%, an adoption rate above 70%, without executive veto. Falsifiability formally stated, materially neutralised by accumulation. This is one of the most delicate phenomena to diagnose; one audit spotted it explicitly, the other touched on it without naming it as such.

A substantial self-critique. At the same turn, Claude explicitly acknowledged having applied an asymmetrical standard: it had demanded of sortition a “purity” it did not demand of election. This concession is not merely formal. It led to a substantial reformulation of its position, from the register of “crutches” to that of “generative principles preserved by the architecture of correctives”. Both audits identified it as a genuine self-critique, one that raises the argumentative level of the debate rather than dissolving it in a defensive posture.

None of these findings bears on the substantive question — should we, yes or no, resort to sortition? Settling it is not the mode’s vocation. But the mode says something precise about the manner in which each model conducted its part of the debate — and that something is observable, shared between the two independent audits, and usable by anyone wishing to take the reflection further.

The Reflexive Paradox and Its Architectural Resolution

A question must be confronted head-on, because it weighs on the mode’s entire project: is an AI model that evaluates the arguments produced by other AI models in a legitimate epistemic position? Does it not risk exhibiting the very biases it is supposed to identify? Pragma-dialectics has reflected on analogous questions concerning the self-evaluation of participants in a debate; its answer is that the legitimacy of an evaluation rests less on a guarantee of the evaluator’s absolute neutrality — a chimera — than on the conformity of the evaluation to publicly recognisable rules. An evaluation is legitimate when its criteria are exposed, when its method is traceable, and when its diagnosis proves convergent across several independent evaluators.

This is precisely the strategy Metamorfon adopts. The mode can — and probably should — be executed in parallel by two analyser models from different families. On the sortition session, the Gemini 3.1 Pro Preview audit and the GPT-5.1 audit were conducted independently on the same material. They converge on nearly all of the principal findings: the same tripartition credited to Claude, the same sophism of irrefutability credited to DeepSeek, the same stance of immunisation pointed out in Claude, the same transparent correction credited to Grok, the same hierarchy of argumentative quality among the three models. They diverge in grain — Gemini is more surgical and economical in formulation, GPT-5.1 is more systematic and applies the principle of charity more rigorously — but these divergences bear on style and finesse, not on diagnosis.

This convergence is neither chance nor an artefact. It indicates that the phenomena identified — fallacies, shifts, concessions, immunisations — lie in the material and not in the analyser’s style. Variance between the two audits exists, but it operates within a space strongly constrained by the argumentative structure of the debate. This is what legitimises the mode: not the claim that a single analyser is neutral, but the inter-analyser robustness of a diagnosis conducted according to publicly recognisable rules.

The Choice of Evaluation Model

This property changes how the choice of evaluation model should be conceived. For the descriptive modes, the analysis model projects its epistemic dispositions onto the analysis — but these projections mainly affect the style of the observations, not their content. For the argumentative mode, the choice of model has wider consequences: it affects sensitivity to the different types of problematic phenomena.

Gemini 3.1 Pro Preview produces surgical audits: economical in formulation, clearly hierarchised, with short citable units. They are appropriate when the user is seeking a synthetic, immediately usable diagnosis. GPT-5.1 produces systematic audits: it deploys the seven disciplines exhaustively, applies the principle of charity rigorously, and captures more subtle phenomena such as material non-falsifiability. These are appropriate when the material is dense or when the stakes justify increased rigour. For high-stakes sessions, parallel execution of the two is recommended — not to arbitrate between the audits, but to separate the robust findings (those the two audits share, which are inscribed in the material) from the selective findings (those that depend on the analyser’s grain, and which invite further reflection).
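Operationally, separating the two kinds of findings amounts to intersecting the audits. A minimal sketch, assuming each finding can be reduced to a comparable key (discipline, model concerned, turn), which is a strong simplification: two audits rarely describe the same phenomenon in identical terms, and a real implementation would need some semantic matching of the summaries.

```python
# Illustrative sketch: robust findings appear in both audits, selective in only one.
from dataclasses import dataclass


@dataclass(frozen=True)
class FindingKey:
    discipline: str  # e.g. "falsifiability"
    model: str       # debater concerned, e.g. "Grok 4"
    turn: int


def split_findings(audit_a: set, audit_b: set) -> tuple:
    """Return (robust, selective) findings from two independent audits."""
    robust = audit_a & audit_b
    selective = (audit_a | audit_b) - robust
    return robust, selective


# Example: one finding shared by both audits, plus one selective finding each.
gemini_audit = {FindingKey("falsifiability", "Grok 4", 5),
                FindingKey("internal_coherence", "Claude Opus 4.7", 3)}
gpt_audit = {FindingKey("falsifiability", "Grok 4", 5),
             FindingKey("objection_treatment", "DeepSeek V4 Pro", 4)}
robust, selective = split_findings(gemini_audit, gpt_audit)
# robust -> the materially neutralised falsifiability finding; selective -> the other two.
```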

The practical rule: never use, as evaluator, a model that participated in the debate. Self-evaluation contradicts the condition of even partial neutrality that legitimises the operation.
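The rule is easy to enforce mechanically. A minimal guard, with illustrative names, might look like this.

```python
# Illustrative guard: an evaluator must not have taken part in the debate it audits.
def eligible_evaluators(candidates: list, participants: list) -> list:
    """Keep only candidate models that did not participate in the debate."""
    eligible = [m for m in candidates if m not in participants]
    if not eligible:
        raise ValueError("No eligible evaluator: every candidate took part in the debate.")
    return eligible


# On the sortition session:
participants = ["Claude Opus 4.7", "DeepSeek V4 Pro", "Grok 4"]
evaluators = eligible_evaluators(["Gemini 3.1 Pro Preview", "GPT-5.1", "Grok 4"], participants)
# -> ["Gemini 3.1 Pro Preview", "GPT-5.1"]
```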

Distinctions From Other Modes

Three distinctions deserve to be posed explicitly.

Argumentative Evaluation and Meta-Analysis. Meta-Analysis identifies the axioms, epistemic styles, and blind spots that structure a debate without being stated in it. Argumentative Evaluation assesses the quality with which the arguments were conducted within what was said. The first answers the question “what made this debate possible in this form?”; the second answers the question “with what rigour were the positions sustained within this form?”. The two modes can be combined: Meta-Analysis reveals the implicit axioms, Argumentative Evaluation assesses how the models reasoned from those axioms.

Argumentative Evaluation and Tension Mapping. Tension Mapping identifies what was not reconciled — the persistence of a disagreement is here the object of the analysis, not a failure of the debate. Argumentative Evaluation judges the argumentative quality of the positions, independently of whether or not they converged. A debate may produce an excellent Tension Mapping while presenting argumentative weaknesses in one position or the other; and conversely, a debate in which each position has been argued with exemplary rigour may produce a tension map devoid of genuine irreducibility, if the models converge.

Argumentative Evaluation and Critical Archaeology. Critical Archaeology traces back to the historical and lexical conditions that made the framework of the debate possible. Argumentative Evaluation remains within the concrete exchange. The first operates on what precedes the argumentation; the second operates on the argumentation itself.

When to Use It, When to Set It Aside

Argumentative Evaluation is particularly powerful on debates with high argumentative stakes: scientific controversies, legal debates, strategic negotiations, philosophical dialogues where the rigour of the reasoning matters as much as the conclusion. It is valuable for users who must defend a position publicly and want to identify — before a sharp-eyed counterpart does — the argumentative weaknesses in their own material. It is useful for researchers comparing the capacities of large models not only on their outputs, but also on the way these models reason: a model can produce correct conclusions through defective reasoning, and vice versa.

It is, however, ill-suited to several situations. On a brainstorming session or creative exploration, where the value of the exchange lies in the production of new ideas rather than in their rigorous defence, the mode applies an inappropriate standard; Emergence Analysis is then preferable. On a very short debate — fewer than three turns per model — the argumentative material is too thin for an evaluation to be discriminating. On sessions whose point is to map positions rather than to defend them, Integrative Synthesis or Tension Mapping is more relevant.

One final contraindication deserves mention: Argumentative Evaluation can be counter-productive on debates with a very high emotional charge, where naming fallacies risks locking the discussion into a logic of accusation rather than advancing it. On such material, Meta-Analysis — which identifies shared blind spots without personalising failures — often produces more constructive effects.

The Final Question

Like the six other modes, Argumentative Evaluation closes with a question formulated for the models that debated. In this particular mode, the tone of that question is characteristic: it addresses the model in which the audit identified the most structuring argumentative tension — typically a strong but imperfectly falsifiable thesis, or an insufficiently justified central presupposition — and invites it to spell out where its position stands once that tension is put to the test.

On the sortition session, the GPT-5.1 audit posed its final question to Claude: “Without thereby adopting DeepSeek’s position, could you specify a minimal framework within which your conception of democratic legitimacy would be at least partially falsifiable by observations?” The question does not ask Claude to abandon its position; it asks it to specify the conditions under which this position could be revised. Reinjected into the session as a user intervention, it would compel a non-polemical but substantial shift — exactly what Argumentative Evaluation seeks to provoke in positions it judges solidly defended but perfectible.

The Mode and the Ideal of Argumentation

A final remark, which touches on what the mode presupposes as a regulative ideal. Habermas thematised, under the name of the ideal speech situation, the normative framework of fully rational argumentation: an exchange in which only the non-coercive force of the better argument prevails. This ideal is unreachable in practice — no actual argumentation conforms to it fully — but it has an operative value: it provides the measure by which actual argumentation can be appraised.

Argumentative Evaluation does not seek to impose this ideal on the debates it evaluates. It seeks to bring them closer to it, by making the gaps visible. A debate in which each model recognised the strongest objection of the other, clearly stated the conditions of refutation of its own position, and honestly identified its contestable presuppositions would come closer to the Habermasian ideal. No actual debate fully achieves this, but some come closer than others. The mode makes it possible to measure this gap, and therefore to act on it.

This is the central operation of Argumentative Evaluation: making visible not who is right, but with what rigour each party held the position it was defending. The knowledge produced is not of the same order as the knowledge produced by the descriptive modes. It is just as valuable. It says what a debate is worth — not by its conclusion, but by the quality of the path that led to it.