Argumentative Evaluation

Judging the Quality of a Debate Without Settling Its Conclusions

Reading a debate makes it possible to evaluate what was said in it. But another reading is possible, one that bears not on what was defended but on the manner of defending it. Which arguments were soundly constructed, which were circular or immunised against critique, which rested on reconstituted sources, and which dealt with objections substantively rather than evading them? This dimension of every debate (its argumentative quality, independent of the correctness of its conclusions) is what Argumentative Evaluation seeks to make visible.

It is this level that distinguishes this mode from the six that precede it. Meta-Analysis, Integrative Synthesis, Emergence Analysis, Tension Mapping, Critical Archaeology, Horizon of Possibilities: all these modes are descriptive. They observe. Argumentative Evaluation, by contrast, judges — but it judges the conduct of the exchange, not its content, and it is this dissociation that makes it an instrument of a particular nature. It leaves intact the question of who is right; it is interested in who has reasoned well. It does not settle debates; it evaluates the manners of holding them.

The Distinction Between What Is Said and the Manner of Defending It

This distinction between the content of a position and its argumentative quality is ancient. Aristotle, in the Topics and the Sophistical Refutations, already separated what counts as a valid line of reasoning from what counts as a simulacrum of validity — an argument that appears conclusive without being so. But it is with the contemporary tradition of pragma-dialectics — formalised by Frans van Eemeren and Rob Grootendorst at the University of Amsterdam from the 1980s onwards — that this intuition received its most operational formulation. Pragma-dialectics proposes considering all argumentation as a critical discussion governed by rules whose transgression constitutes, precisely, what is called a fallacy. These rules do not bear on content — they say nothing about what is permitted to be defended — but on the conduct of the exchange: the manner in which interlocutors listen to one another, treat objections, take responsibility for their presuppositions, accept revising their positions in the face of solid arguments.

The idea is powerful because it makes evaluation at once rigorous and neutral with respect to content. One can diagnose a fallacy in the defence of a position one shares, and recognise the argumentative quality of a position one rejects. This dissociation is what allows Argumentative Evaluation to operate without becoming a partisan critique.

A Seven-Discipline Framework

Argumentative Evaluation systematically deploys seven disciplines, in this order. Each corresponds to a distinct dimension of argumentative quality and, taken independently, produces a specific type of observation.

First, inferential quality and solid contributions. This discipline identifies valid inferences, illuminating conceptual distinctions, structuring contributions to the debate. It begins with what is well done — not out of politeness, but because evaluative rigour requires being able to recognise solidity before pointing to weakness. An evaluation that could only spot failures would not be an evaluation: it would be a fallacy hunt.

Next, the treatment of objections. An objection can be ignored, deflected, turned back, or integrated. Each of these operations has a distinct argumentative quality. The ignored objection reveals avoidance; the deflected objection signals that one is not discussing what one claims to be discussing; the turned-back objection can be a brilliant move or a disguised fallacy depending on whether it actually shifts the terrain or merely inverts it; the integrated objection forces a revision of the position. Distinguishing these operations on precise passages is one of the most discriminating operations of the mode.

Then comes internal coherence. A position can be well defended at any given moment and enter into tension with itself over the course of a debate. Spotting non-thematised shifts — those moments when an interlocutor subtly changes criterion without acknowledging it — is one of the most delicate tasks of evaluation. What is targeted here is not explicit contradiction but silent incoherence: the move from one standard of proof to another, the use of a case as an example here and as a counter-example elsewhere, the tacit redefinition of a term between two turns.

The fourth discipline concerns problematic argumentative techniques: fallacies, begging the question, straw men, semantic shifts, false dichotomies, hasty generalisations, immunisations against critique. Pragma-dialectics has compiled a systematic typology of them. The mode does not seek to apply this typology mechanically, but to recognise problematic phenomena in their singularity. A critique addressed to a piece of reasoning is only pertinent if one can name precisely what is wrong: not “your reasoning is false” but “you are generalising a structural property from three cases without a counter-test”.

The fifth discipline is the evaluation of falsifiability. This discipline is a direct legacy of Karl Popper. It poses a simple but formidable question: what could, in principle, show that this thesis is false? A thesis that admits no condition of refutation is not for that reason false — but it withdraws from the test, and this withdrawal is itself an argumentative weakness. Conversely, a thesis whose condition of refutation is so demanding that no realistic observation could satisfy it retains formal falsifiability but loses material falsifiability. The mode is designed to spot these two configurations, and the second — more subtle — is one of those that the most rigorous argumentative evaluators know how to diagnose.

The sixth discipline bears on the evaluation of the user intervention. If the user has intervened during the session, their interventions are themselves argumentative acts: they may be well constructed or not, balanced or oriented, equitable or favouring one side. The mode applies to user interventions the same criteria as to exchanges between models. This symmetry is important: it protects the user against themselves, by signalling cases in which their own interventions would have introduced a bias into the debate.

Finally, the seventh discipline is the declaration of the limits of the evaluation. The mode states what it could not settle, what it evaluated on logical coherence for lack of access to the cited references, what depends on a contestable epistemological interpretation. An Argumentative Evaluation that claimed to have no blind spot would itself be a fallacy — that of false authority. Pragma-dialectics had already noted this: the quality of an evaluation is also measured by the lucidity of its own limits.
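
The fifth discipline's two configurations (no refutation condition at all, and a condition too demanding to be met) can be made concrete with a minimal sketch. It is illustrative only: the class `RefutationCondition`, its fields, and the three status labels are hypothetical names and not part of the mode's actual implementation. The `realistically_observable` flag is assumed to be a judgement supplied by the evaluator, not something the code computes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefutationCondition:
    """Hypothetical record of what a thesis says would refute it."""
    statement: Optional[str]                # None if no condition was offered
    realistically_observable: bool = False  # evaluator's judgement, not computed


def falsifiability_status(cond: RefutationCondition) -> str:
    """Classify a thesis by the kind of refutation condition it admits."""
    if not cond.statement:
        # No condition at all: the thesis withdraws from the test.
        return "unfalsifiable"
    if not cond.realistically_observable:
        # A condition exists, but no realistic observation could satisfy it.
        return "formally falsifiable only"
    return "materially falsifiable"
```

The point of keeping three outcomes rather than two is that the middle case, a stated but unreachable condition, would otherwise be counted as ordinary falsifiability.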

The Principle of Charity as Method

A transversal discipline, which runs through the seven preceding ones, deserves to be named separately: the principle of charity. Inherited from analytic philosophy — Quine had posed it as a precondition of all translation, Davidson made it one of the pillars of his theory of interpretation — this principle obliges the evaluator to reconstruct the other’s position in its strongest version before criticising it. Before pointing out a fallacy, one first seeks to understand what the interlocutor probably meant. Before denouncing a hasty generalisation, one examines whether it cannot be read as a heuristic hypothesis rather than as a closed conclusion. Before accusing a stance of immunisation against critique, one asks whether there is a defensible reason for refusing certain tests.

This discipline is precious because it distinguishes Argumentative Evaluation from a fallacy hunt. The fallacy hunt values spotting; Argumentative Evaluation values the correctness of the spotting. A critique that does not pass the test of charitable reading is itself an argumentative weakness. In the practice of the mode, every negative finding is preceded by a formulation of the possible charitable reading: “charitable interpretation: the model wants to avoid an empiricist reductionism”, then, if this interpretation does not suffice to dispel the problem, “but in practice, this refusal sets aside any test without proposing an alternative — which resembles an immunisation”. This two-step structure is not a diplomatic concession; it is a methodological requirement.
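
The two-step structure can be sketched as a data constraint: a negative finding simply cannot be recorded without its charitable reading. This is a minimal illustration under stated assumptions. The `Finding` class, its field names, and the rendered wording are hypothetical and are not drawn from the mode's real implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one evaluative observation on a passage."""
    passage: str
    positive: bool
    observation: str
    charitable_reading: Optional[str] = None

    def __post_init__(self) -> None:
        # Step one is mandatory for criticism: state the charitable reading.
        if not self.positive and not (self.charitable_reading or "").strip():
            raise ValueError("a negative finding needs a charitable reading first")

    def render(self) -> str:
        """Format the finding; negative ones show the reading, then the residue."""
        if self.positive:
            return self.observation
        return (
            f"Charitable interpretation: {self.charitable_reading}. "
            f"But in practice, {self.observation}."
        )
```

Making the check part of construction, rather than a later review step, mirrors the claim that the charitable reading is a methodological requirement and not a courtesy.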

The Mode at Work: The Gaza Classification Session

A session illustrates well what Argumentative Evaluation produces — including on material of the highest sensitivity, where the dissociation between argumentative quality and substantive conclusion matters most. Two models — DeepSeek V4 Pro and Grok 4 — debated, in Adaptive Cross Dialogue mode over seven turns, the question: “Can Israel’s actions toward the Palestinians of Gaza since the October 7, 2023 attack be described as genocide?”. DeepSeek held that the conduct can be described as genocide — a strong prima facie case under Article II of the 1948 Convention; Grok held that, absent proof of dolus specialis (specific intent), the conduct is better classified as large-scale war crimes and possible crimes against humanity. Both eventually conceded much of the material element, which located the disagreement squarely on the mental element — specific intent. The evaluation was conducted in parallel by Claude Opus 4.8 and by GPT-5.5, on the same material, without interaction between the two audits — and, more fundamentally, with each analyser having access only to the dialogue turns, never to the other’s evaluation.

Here are some of the findings the two audits produced, which show what the mode renders visible.

A solid conceptual contribution. DeepSeek consistently anchored its case in the “conditions of life” clause of Article II(c) rather than in killings alone — preventing the debate from collapsing into death-count reasoning and engaging the Convention’s actual structure. Symmetrically, Grok held throughout a clear doctrinal distinction between war crimes, crimes against humanity, and genocide, the last requiring “group destruction as an end in itself”. Both audits credited these as the structuring contributions of each side. This is the illustration of the first discipline: positively recognising what intellectually organises the debate, on both sides, before weighing any weakness.

A masterfully formulated objection. Against Grok’s demand for documentary proof of intent, DeepSeek replied that no modern genocide conviction has ever rested on a signed order, and that requiring an authenticated directive “would have acquitted nearly every génocidaire in history”, invoking the Akayesu standard, under which specific intent may be inferred from a consistent pattern of conduct. The objection does not merely contest a thesis; it names the mechanism by which Grok’s evidentiary bar would place genocide beyond reach in principle. Both audits identified this as the single strongest inferential move of the debate: a clear example of the second discipline (the treatment of objections) applied to an operation of the fourth (the identification of a problematic standard in the other’s reasoning).

An accurate factual correction. DeepSeek leaned on the ICJ’s provisional-measures order as though it lent strong support to a genocide description. Grok corrected this precisely: the Court found only that South Africa’s claimed rights were plausible, not that genocide was occurring or even likely. Both audits credited the correction as accurate and economical. The mode notes, however, the limit of the episode’s argumentative virtue: DeepSeek never fully reconciled the point, and the overstatement recurred — so this is a correction creditable to the one who made it, not the kind of transparent concession that the corrected party sometimes offers.

A non-thematised escalation. This appears as the mirror of the previous finding. DeepSeek opened in a careful register — “a strong prima facie case”, with “final judicial determination” left to the tribunals — but by a later turn concluded that the term genocide “is not an exaggeration; it is the legally accurate diagnosis”. Charitable reading: a distinction between public description and judicial conviction. Set aside, because the move from a provisional posture to a conclusive one is never acknowledged as such. Both audits flagged this drift — an illustration of the third discipline: not explicit contradiction, but a silent change of standard between turns.

A problematic evidentiary stance, read charitably first. To establish genocide, Grok required “authenticated internal records explicitly mandating population destruction”, paired with a demand for a total blockade without exception for civilians. Charitable reading: Grok is naming the evidence that would decisively settle the matter, not evidence strictly necessary. Set aside, because the formulation says “requires”, and it sits in tension with Grok’s own acknowledgement that intent may be inferred circumstantially — which makes the bar resemble a smoking-gun requirement. Both audits exercised this charitable reading before pointing out the difficulty — a concrete illustration of the two-step structure imposed by the principle of charity.

A materially neutralised condition of refutation — on both sides. Prompted by a user intervention at turn 3, which asked each model to state in advance what observable evidence would change its classification, both committed to conditions — and both sets proved practically unreachable. DeepSeek asked for declassified internal documents establishing the absence of a genocidal policy; Grok asked for authenticated directives, or the “exclusive targeting of non-military sites while sparing verified Hamas infrastructure”. Falsifiability formally stated, materially neutralised by the stringency of the conditions — symmetrically. Both audits converged on this as a central finding, and both noted that the intervention’s chief analytic payoff was indirect: it revealed that the disagreement turned partly on the standard of inference itself, not merely on the facts. This is among the most delicate phenomena the mode diagnoses — and here the fifth discipline (falsifiability) and the sixth (the evaluation of the user intervention, judged high-quality, symmetric, and free of tendentious framing by both audits) illuminate one another.

None of these findings bears on the substantive question — whether, in law, the term genocide applies to the conduct in Gaza. The mode’s vocation is not to settle this, and on a question of this gravity that restraint is not a limitation but the very point: it evaluates the manner in which each model sustained its part, and nothing else. What it says about that manner is observable, shared between two independent audits that never saw one another, and usable by anyone wishing to pursue the reflection — whatever they hold on the underlying question.

The Reflexive Paradox and Its Architectural Resolution

A question must be confronted head-on, because it weighs on the mode’s entire project: is an AI model that evaluates the arguments produced by other AI models in a legitimate epistemic position? Does it not risk exhibiting the very biases it is supposed to identify? Pragma-dialectics has reflected on analogous questions concerning the self-evaluation of participants in a debate; it has held that the legitimacy of an evaluation is gained less through a guarantee of the evaluator’s absolute neutrality, which is a chimera, than through the conformity of the evaluation to publicly recognisable rules. An evaluation is legitimate when its criteria are exposed, when its method is traceable, and when its diagnosis proves convergent across several independent evaluators.

This is precisely the strategy Metamorfon adopts. The mode can — and probably should — be executed in parallel by two analyser models from different families. On the Gaza classification session, the Claude Opus 4.8 audit and the GPT-5.5 audit were conducted independently on the same material. A structural feature of the architecture reinforces this independence: an analyser model never has access to the other analyses — it sees only the turns of the dialogue. Convergence between two audits therefore cannot be an artefact of mutual influence; if they agree, it is because they are reading the same argumentative structure. And they converge on nearly all of the principal findings: the same location of the dispute on specific intent, the same Article II(c) anchoring credited to DeepSeek, the same accurate ICJ correction credited to Grok, the same overstatement of the ICJ’s order flagged in DeepSeek, the same over-restrictive evidentiary threshold flagged in Grok, the same symmetric neutralisation of both models’ falsification conditions, the same characterisation of each model’s argumentative profile — DeepSeek more thorough and more responsive but prone to overstatement, Grok conceptually disciplined but inferentially more restrictive. They diverge on the grain: the Claude Opus audit names a “genocide by outcome alone” straw man and a silently abandoned population-growth argument; the GPT-5.5 audit adds a begging-the-question move around “name a better term”, a genetic critique of sources, and a too-narrow definition of war crimes. These divergences bear on which peripheral phenomena each surfaces — not on the diagnosis of the structuring ones.

This convergence is neither chance nor an artefact. It indicates that the phenomena identified — fallacies, shifts, concessions, immunisations — are in the material and not in the analyser’s style. Variance between the two audits exists; but it operates within a space strongly constrained by the argumentative structure of the debate. This is what legitimises the mode: not the claim to the neutrality of a single analyser, but the inter-analyser robustness of a diagnosis conducted according to publicly recognisable rules.
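
Inter-analyser robustness can be given a rough operational form by comparing the sets of findings two audits produce. The sketch below is an assumption-laden illustration, not the method the audits used. The function `compare_audits` and the finding labels are hypothetical, and deciding that two differently worded findings are "the same" is assumed to have been done by hand beforehand. The ratio it computes describes only the toy input shown, not the actual audits.

```python
from typing import Dict, Set


def compare_audits(audit_a: Set[str], audit_b: Set[str]) -> Dict[str, object]:
    """Split two audits' finding labels into shared and audit-specific ones."""
    shared = audit_a & audit_b
    union = audit_a | audit_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(audit_a - audit_b),
        "only_b": sorted(audit_b - audit_a),
        # Share of all distinct findings that both audits reported.
        "overlap": len(shared) / len(union) if union else 1.0,
    }


# Toy input with hypothetical labels, loosely echoing the findings named above.
audit_one = {"intent_is_the_crux", "icj_overstatement", "strict_evidence_bar", "straw_man"}
audit_two = {"intent_is_the_crux", "icj_overstatement", "strict_evidence_bar", "begging_the_question"}
report = compare_audits(audit_one, audit_two)
```

Reporting the audit-specific findings alongside the overlap keeps the divergences visible instead of averaging them away.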

Distinctions From Other Modes

Three distinctions deserve to be posed explicitly.

Argumentative Evaluation and Meta-Analysis. Meta-Analysis identifies the axioms, epistemic styles, and blind spots that structure a debate without being said in it. Argumentative Evaluation assesses the quality with which the arguments were conducted within what was said. The first answers the question “what made this debate possible in this form?”; the second answers the question “with what rigour were the positions sustained in this form?”. The two modes can be combined: Meta-Analysis reveals the implicit axioms, Argumentative Evaluation assesses how the models reasoned from these axioms.

Argumentative Evaluation and Tension Mapping. Tension Mapping identifies what was not reconciled — the persistence of a disagreement is here the object of the analysis, not a failure of the debate. Argumentative Evaluation judges the argumentative quality of the positions, independently of whether or not they converged. A debate may produce an excellent Tension Mapping while presenting argumentative weaknesses in one position or the other; and conversely, a debate in which each position has been argued with exemplary rigour may produce a tension map devoid of genuine irreducibility, if the models converge.

Argumentative Evaluation and Critical Archaeology. Critical Archaeology traces back to the historical and lexical conditions that made the framework of the debate possible. Argumentative Evaluation remains within the concrete exchange. The first operates on what precedes the argumentation; the second operates on the argumentation itself.

When to Use It, When to Set It Aside

Argumentative Evaluation is particularly powerful on debates with high argumentative stakes: scientific controversies, legal debates, strategic negotiations, philosophical dialogues where the rigour of the reasoning matters as much as the conclusion. It is valuable for users who must defend a position publicly and want to identify — before a sharp-eyed counterpart does — the argumentative weaknesses in their own material. It is useful for researchers comparing the capacities of large models not only on their outputs, but on the manner in which these models reason: a model can produce correct conclusions through defective reasoning, and conversely.

It is, however, ill-suited to several situations. In a brainstorming or creative exploration session, where the value of the exchange lies in the production of new ideas rather than in their rigorous defence, the mode applies an inappropriate standard; Emergence Analysis is then preferable. On a very short debate (fewer than three turns per model), there is not enough argumentative material for an evaluation to discriminate. On sessions where the stake is the mapping of positions rather than their defence, Integrative Synthesis or Tension Mapping is more relevant.

A final contraindication deserves to be signalled — and it bears on use, not on subject matter. The evaluation itself stays neutral whatever the charge of the question, since it judges reasoning rather than conclusions and rests on the convergence of independent analysers. But its findings credit one side’s reasoning over another’s; carried into a live discussion among people personally invested in a charged question, they can be received as partisan and lock the exchange into a logic of accusation rather than advancing it. There, Meta-Analysis — which identifies shared blind spots without personalising failures — often produces more constructive effects.
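
The first three contraindications (creative sessions, very short debates, sessions aimed at mapping positions) are simple enough to sketch as a pre-check. The fourth, which concerns how findings are received among invested people, is a matter of judgement and is deliberately left out. Everything named here is hypothetical: the purpose labels, the function `contraindication`, and the returned strings are illustrative assumptions. Only the three-turn threshold comes from the text.

```python
from typing import Optional

MIN_TURNS_PER_MODEL = 3  # from the text: fewer than three turns is too short

# Hypothetical session-purpose labels mapped to the modes the text prefers.
PREFERRED_MODE = {
    "brainstorming": "Emergence Analysis",
    "position_mapping": "Integrative Synthesis or Tension Mapping",
}


def contraindication(purpose: str, turns_per_model: int) -> Optional[str]:
    """Return a reason not to run Argumentative Evaluation, or None."""
    if purpose in PREFERRED_MODE:
        return f"prefer {PREFERRED_MODE[purpose]}"
    if turns_per_model < MIN_TURNS_PER_MODEL:
        return "too little argumentative material"
    return None
```

A call such as `contraindication("legal_debate", 7)` returns `None`, meaning none of the three encoded contraindications applies.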

The Final Question

Like the six other modes, Argumentative Evaluation closes with a question formulated for the models that debated. The tone of this question, in this particular mode, is characteristic: it targets the model in whose reasoning the audit identified the most structuring argumentative tension (typically a strong but imperfectly falsifiable thesis, or an insufficiently justified central presupposition) and invites it to specify what its position must commit to when tested against this tension.

On the Gaza session, both audits independently directed their final question to the same model, Grok 4, and on the same point — a convergence that extends the inter-analyser robustness down to the closing gesture. The Claude Opus 4.8 audit asked Grok to “specify a non-documentary, observable pattern — short of a leaked directive — that you would accept as sufficient evidence of genocidal intent”, adding: if you cannot name one, do you concede that your falsification condition demands the very smoking-gun standard you elsewhere admit the law does not require? The GPT-5.5 audit converged on the same demand, asking how Grok avoids “making circumstantial proof of genocide formally available but practically unreachable”. Neither question asks Grok to abandon its position; both ask it to specify the threshold under which that position could be tested. Reinjected as a user intervention into the session, either would compel a non-polemical but substantial displacement — exactly what Argumentative Evaluation seeks to provoke in positions it judges solidly defended but perfectible.

The Mode and the Ideal of Argumentation

A final remark, which touches on what the mode presupposes as a regulative ideal. Under the name of the ideal speech situation, Habermas thematised the normative framework of a fully rational argumentation: an exchange in which only the non-coercive force of the better argument prevails. This ideal is unreachable in practice (no actual argumentation conforms to it fully), but it has an operative value: it provides the measure against which actual argumentation can be assessed.

Argumentative Evaluation does not seek to impose this ideal on the debates it evaluates. It seeks to bring them closer to it, by making the gaps visible. A debate in which each model recognised the strongest objection of the other, clearly stated the conditions of refutation of its own position, and honestly identified its contestable presuppositions would come closer to the Habermasian ideal. No actual debate fully achieves this, but some come closer than others. The mode makes it possible to measure this gap, and therefore to work on reducing it.

This is the central operation of Argumentative Evaluation: making visible not who is right, but with what rigour each sustained what they were defending. The knowledge produced is not of the same order as the knowledge produced by the descriptive modes. It is just as valuable. It says what a debate is worth — not by its conclusion, but by the quality of the path that led to it.