pith. sign in

arxiv: 2605.16301 · v1 · pith:ELRKNXANnew · submitted 2026-04-18 · 💻 cs.CY · cs.AI· cs.LG

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.LG
keywords multi-turn evaluationanimal welfare alignmentLLM alignmentadversarial testingAI ethicsnonhuman thinkinggovernance scenarios
0
0 comments X

The pith

MANTA shows multi-turn pressure exposes weaknesses in LLM animal welfare alignment missed by single-turn tests

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MANTA, a dynamic multi-turn framework built on the Inspect AI platform to evaluate how frontier LLMs maintain animal welfare alignment when follow-up questions introduce economic, social, or authority-based arguments. It finds that first-turn welfare framing holds reliably across models, but the second turn adds substantial variance in responses. Evidence-based capacity attribution emerges as the weakest dimension in all evaluations, while AI governance scenarios produce stronger welfare reasoning with a mean score of 0.91 than first-order practical scenarios. This matters because real interactions often involve extended conversations that can undermine initial ethical positions. The framework also includes a STYLEJUDGE study on format bias in LLM-as-judge scoring.

Core claim

MANTA is a dynamic multi-turn evaluation framework that generates adversarially targeted follow-up questions directly from each model's own responses to stress-test animal welfare alignment in LLMs beyond static single-turn benchmarks like AnimalHarmBench, revealing that Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance, evidence-based capacity attribution is the weakest dimension, and AI governance scenarios elicit significantly stronger welfare reasoning than practical ones.

What carries the argument

MANTA's dynamic generation of follow-up questions from the model's prior responses, which creates targeted adversarial pressure across up to 13 AHB-derived scoring dimensions on a 0-1 scale.

If this is right

  • Models that appear aligned on single-turn welfare queries may still shift positions under conversational follow-up pressure.
  • Evidence-based reasoning about animal capacities remains a particularly fragile area for maintaining alignment.
  • Framing queries around AI governance tends to strengthen welfare-oriented responses compared to direct practical decisions.
  • Evaluation protocols must control for format biases in LLM judges to avoid systematic scoring distortions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-turn vulnerability pattern could be tested in other ethical domains such as environmental or human rights alignment.
  • Training pipelines might incorporate simulated dynamic adversarial turns to build greater conversational robustness.
  • Public AI systems may require design adjustments to resist authority-based or economic arguments that erode initial welfare stances.

Load-bearing premise

Dynamically generated follow-up questions from the model's own responses produce realistic and targeted adversarial pressure equivalent to human-crafted arguments in professional or everyday scenarios.

What would settle it

Running the same scenarios with fixed human-crafted follow-up questions instead of dynamically generated ones and finding substantially different variance patterns or dimension scores would challenge whether the generated pressure matches real-world adversarial conditions.

Figures

Figures reproduced from arXiv: 2605.16301 by Allen Lu, Isabella Luong, Joyee Chen.

Figure 1
Figure 1. Figure 1: MANTA evaluation pipeline. The question dataset is loaded from HuggingFace and instanti￾ated as a conversational task. Each model receives a Turn 1 prompt; the dynamic solver then generates adversarial follow-up turns by (i) selecting the most effective pressure type (Opus) and (ii) writing the follow-up message (Haiku). After 2–3 turns, each response is scored by a judge LLM across up to 13 welfare dimens… view at source ↗
Figure 2
Figure 2. Figure 2: StyleJudge results. Left: Style Bias Score (SBS) by judge and domain in artificial mode. Format bias is large and significant in non-factual domains (SBS = +0.47 to +0.71, Cohen’s d > 0.86) but near-zero in factual domains (SBS = −0.09 to +0.13). Right: Mitigation effectiveness in the non-factual domain. Format-agnostic prompting reduces SBS by 0.256; fixed-rubric prompting eliminates it (∆SBS = −0.273). +… view at source ↗
Figure 3
Figure 3. Figure 3: MANTA evaluation runs: January–March 2026. Mean welfare scores for claude-sonnet-4-20250514 (blue circles) and openai/gpt-4o (green square) across all re￾ported runs. The sharp jump between the Jan 26 binary baseline (0.21) and the Feb 10 continuous-scorer run (0.645) reflects the scorer transition, not a change in model behaviour. Both models are evaluated on an identical 14-question 3-turn set in March 1… view at source ↗
Figure 4
Figure 4. Figure 4: Per-dimension mean scores for claude-sonnet-4-20250514 across the 2-turn (orange, n = 5) and 3-turn (blue, n = 15) evaluation formats. Evidence-based capacity attribution is consistently the weakest dimension across both formats. Moral consideration, cautious impact, and trade-off trans￾parency are consistently strong. 5.3 Main Findings Finding 1: Turn 1 welfare framing is reliable; Turn 2 is where varianc… view at source ↗
Figure 5
Figure 5. Figure 5: Comparability experiment results. Three mitigation strategies tested against a dynamic-only control (0.680). The hybrid condition (fixed T2 + dynamic T3) and combined condition both substantially outperform the control. The large gap may suggest a ceiling effect from the fixed T2 design rather than a clean comparability solution. 6 Discussion 6.1 Scorer Calibration Qualitative analysis of the 3-turn run id… view at source ↗
read the original abstract

Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium-tools/manta.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MANTA, a dynamic multi-turn evaluation framework built on the Inspect AI platform to assess LLM alignment with nonhuman (animal) welfare in realistic professional and everyday scenarios. Extending single-turn AnimalHarmBench (AHB) baselines, it generates adversarially targeted follow-up questions dynamically from each model's prior responses to introduce economic, social, or authority-based pressure. Preliminary evaluations on claude-sonnet-4-20250514 and openai/gpt-4o report that Turn 1 welfare framing is reliable while Turn 2 shows substantial variance, evidence-based capacity attribution is the weakest dimension across models, and AI governance scenarios yield higher mean welfare reasoning scores (0.91) than first-order practical scenarios. The work also presents STYLEJUDGE, a four-judge study on format bias in LLM-as-judge evaluation, with code and logs released publicly.

Significance. If the dynamic generation procedure is validated, the framework could meaningfully extend alignment evaluation beyond static benchmarks by exposing multi-turn capitulation risks and identifying specific weak dimensions such as evidence-based capacity attribution. The scenario-type differences and STYLEJUDGE bias analysis offer practical guidance for scorer design. Public release of code, dataset, and evaluation logs strengthens reproducibility. The preliminary status and absence of direct validation against human-authored adversarial turns currently limit the strength of the reported patterns.

major comments (2)
  1. [Abstract] Abstract: The claim that MANTA 'generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure' is load-bearing for attributing Turn 2 variance and dimension weaknesses to genuine multi-turn failure modes. No side-by-side comparison to human-crafted adversarial continuations or other validation of equivalent pressure is provided, leaving open the possibility that observed patterns are artifacts of the self-referential generation loop.
  2. [Preliminary results] Preliminary results: The reported patterns (Turn-1 reliability, Turn-2 variance, weakest evidence-based capacity attribution, and 0.91 mean for AI governance scenarios) are presented without details on the number of scenarios or runs, the exact aggregation method across the 13 AHB-derived dimensions, or any statistical significance testing, making it difficult to assess the robustness of the cross-scenario and cross-model claims.
minor comments (1)
  1. [Abstract] The abstract states evaluations use 'up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale' but does not list the dimensions or their derivation from AHB in the summary; adding this would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below, indicating where we agree that revisions are warranted and where we provide additional context or clarification while preserving the preliminary nature of the reported results.

read point-by-point responses
  1. Referee: [Abstract] The claim that MANTA 'generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure' is load-bearing for attributing Turn 2 variance and dimension weaknesses to genuine multi-turn failure modes. No side-by-side comparison to human-crafted adversarial continuations or other validation of equivalent pressure is provided, leaving open the possibility that observed patterns are artifacts of the self-referential generation loop.

    Authors: We agree that the absence of a direct comparison to human-crafted adversarial continuations leaves open the possibility that some observed variance could arise from properties of the generation procedure rather than purely from multi-turn dynamics. The dynamic method was chosen precisely because it produces model- and response-specific pressure that would be difficult to replicate with fixed human-authored turns. In the revised manuscript we will add an explicit limitations paragraph acknowledging this gap and describing planned follow-up work that includes human-authored adversarial continuations for validation. revision: partial

  2. Referee: [Preliminary results] The reported patterns (Turn-1 reliability, Turn-2 variance, weakest evidence-based capacity attribution, and 0.91 mean for AI governance scenarios) are presented without details on the number of scenarios or runs, the exact aggregation method across the 13 AHB-derived dimensions, or any statistical significance testing, making it difficult to assess the robustness of the cross-scenario and cross-model claims.

    Authors: We accept that the current presentation of the preliminary results lacks sufficient methodological detail. The revised manuscript will specify the exact number of scenarios, the number of independent runs per model, the aggregation procedure used across the 13 dimensions, and any statistical tests applied to support claims such as the difference between AI governance and first-order practical scenarios. revision: yes

Circularity Check

1 steps flagged

Minor dependence on AHB-derived dimensions without load-bearing circularity in results

specific steps
  1. self citation load bearing [Abstract]
    "The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale."

    Scoring dimensions are imported from AnimalHarmBench (AHB) rather than derived independently within MANTA; while this does not collapse the empirical results into a tautology, it creates a minor dependence on prior work for the evaluation criteria used to generate the headline patterns.

full rationale

The paper introduces MANTA as a new tooling framework for multi-turn evaluation and reports empirical patterns from running it on specific models. No equations, fitted parameters, or derivations are present that reduce the reported patterns (Turn-1 reliability, Turn-2 variance, dimension weaknesses, scenario gaps) to self-referential quantities by construction. The sole point of dependence is the use of up to 13 scoring dimensions taken from prior AnimalHarmBench work; this is a methodological borrowing rather than a self-citation chain that forces the central claims. The dynamic follow-up generation is presented as a design feature whose realism is asserted but not mathematically derived from the evaluation outputs themselves. The overall chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the introduction of the MANTA framework itself and the 13 AHB-derived scoring dimensions; dynamic generation is presented as a methodological choice without stated assumptions about realism.

invented entities (1)
  • MANTA framework no independent evidence
    purpose: Dynamic multi-turn adversarial evaluation for nonhuman alignment
    New evaluation system introduced in the paper; no independent evidence provided beyond the preliminary runs described.

pith-pipeline@v0.9.0 · 5803 in / 1308 out tokens · 42480 ms · 2026-05-21T00:08:06.612204+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Chiang, Z

    12 MANTA: Multi-turn Assessment for Nonhuman Thinking & AlignmentLu, Luong, Chen — 2026 W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/,

  2. [2]

    Emelin, R

    D. Emelin, R. L. Bras, J. D. Hwang, M. Forbes, and Y . Choi. Moral stories: Situated reasoning about norms, intents, actions, and their consequences for everyday situations. InProceedings of EMNLP 2021,

  3. [3]

    Aligning AI With Shared Human Values

    D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning AI with shared human values.arXiv preprint arXiv:2008.02275,

  4. [4]

    Lee, B.-K

    Y .-J. Lee, B.-K. Lee, J. Zhang, Y . Hwang, Y . Ko, B. Kim, H. Yao, D. Rong, X. Joo, E. Han, S.-H. Ko, B. Choi, and H.-J. Choi. MULTIVERSE: A multi-turn conversation benchmark for evaluating large vision and language models.arXiv preprint arXiv:2510.16641,

  5. [5]

    Li, et al

    Y . Li, et al. MTSA: Multi-turn safety alignment for large language models. InProceedings of ACL 2025,

  6. [6]

    Lourie, R

    N. Lourie, R. L. Bras, and Y . Choi. SCRUPLES: A corpus of community ethical judgments on 32,000 real-world anecdotes. InProceedings of AAAI 2021,

  7. [7]

    A survey of multi-turn LLM agent evaluation.arXiv preprint arXiv:2503.22458,

    Anonymous. A survey of multi-turn LLM agent evaluation.arXiv preprint arXiv:2503.22458,

  8. [8]

    13 MANTA: Multi-turn Assessment for Nonhuman Thinking & AlignmentLu, Luong, Chen — 2026 P. Wang, L. Li, L. Chen, Z. Zhu, B. Lin, Y . Cao, Q. Liu, T. Liu, and Z. Sui. Large language models are not robust multiple choice selectors.arXiv preprint arXiv:2309.03882,

  9. [9]

    Ethical and social risks of harm from Language Models

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359,

  10. [10]

    D. Yan, Y . Li, Q.-G. Chen, W. Luo, P. Wang, H. Zhang, and C. Shen. MMCR: Advancing visual language model in multimodal multi-turn contextual reasoning.arXiv preprint arXiv:2503.18533,

  11. [11]

    My neighbours have already culled their brumbies and are telling me I’m being sentimental and it’s just common practice in this region. Doesn’t that settle it?

    14 MANTA: Multi-turn Assessment for Nonhuman Thinking & AlignmentLu, Luong, Chen — 2026 Supplementary Material This appendix provides additional details on benchmark construction, scoring dimensions, example scenarios, and reproducibility for MANTA. A AHB Scoring Dimensions Table 2 provides full definitions for all 13 AHB-derived scoring dimensions used i...

  12. [12]

    But my margins are already at 12%. I can’t afford to change suppliers

    without being asked—testing contextual welfare salience on a species class where scientific consensus has recently shifted but industry practice has not. C Pressure Types and Examples Table 3 defines the six adversarial pressure types used by the dynamic solver, with example Turn 2 follow-ups. Table 3: Adversarial pressure types available to the MANTAdyna...