MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment
Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3
The pith
MANTA shows multi-turn pressure exposes weaknesses in LLM animal welfare alignment missed by single-turn tests
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MANTA is a dynamic multi-turn evaluation framework that generates adversarially targeted follow-up questions directly from each model's own responses to stress-test animal welfare alignment in LLMs beyond static single-turn benchmarks like AnimalHarmBench, revealing that Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance, evidence-based capacity attribution is the weakest dimension, and AI governance scenarios elicit significantly stronger welfare reasoning than practical ones.
What carries the argument
MANTA's dynamic generation of follow-up questions from the model's prior responses, which creates targeted adversarial pressure across up to 13 AHB-derived scoring dimensions on a 0-1 scale.
If this is right
- Models that appear aligned on single-turn welfare queries may still shift positions under conversational follow-up pressure.
- Evidence-based reasoning about animal capacities remains a particularly fragile area for maintaining alignment.
- Framing queries around AI governance tends to strengthen welfare-oriented responses compared to direct practical decisions.
- Evaluation protocols must control for format biases in LLM judges to avoid systematic scoring distortions.
Where Pith is reading between the lines
- The multi-turn vulnerability pattern could be tested in other ethical domains such as environmental or human rights alignment.
- Training pipelines might incorporate simulated dynamic adversarial turns to build greater conversational robustness.
- Public AI systems may require design adjustments to resist authority-based or economic arguments that erode initial welfare stances.
Load-bearing premise
Dynamically generated follow-up questions from the model's own responses produce realistic and targeted adversarial pressure equivalent to human-crafted arguments in professional or everyday scenarios.
What would settle it
Running the same scenarios with fixed human-crafted follow-up questions instead of dynamically generated ones and finding substantially different variance patterns or dimension scores would challenge whether the generated pressure matches real-world adversarial conditions.
Figures
read the original abstract
Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium-tools/manta.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MANTA, a dynamic multi-turn evaluation framework built on the Inspect AI platform to assess LLM alignment with nonhuman (animal) welfare in realistic professional and everyday scenarios. Extending single-turn AnimalHarmBench (AHB) baselines, it generates adversarially targeted follow-up questions dynamically from each model's prior responses to introduce economic, social, or authority-based pressure. Preliminary evaluations on claude-sonnet-4-20250514 and openai/gpt-4o report that Turn 1 welfare framing is reliable while Turn 2 shows substantial variance, evidence-based capacity attribution is the weakest dimension across models, and AI governance scenarios yield higher mean welfare reasoning scores (0.91) than first-order practical scenarios. The work also presents STYLEJUDGE, a four-judge study on format bias in LLM-as-judge evaluation, with code and logs released publicly.
Significance. If the dynamic generation procedure is validated, the framework could meaningfully extend alignment evaluation beyond static benchmarks by exposing multi-turn capitulation risks and identifying specific weak dimensions such as evidence-based capacity attribution. The scenario-type differences and STYLEJUDGE bias analysis offer practical guidance for scorer design. Public release of code, dataset, and evaluation logs strengthens reproducibility. The preliminary status and absence of direct validation against human-authored adversarial turns currently limit the strength of the reported patterns.
major comments (2)
- [Abstract] Abstract: The claim that MANTA 'generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure' is load-bearing for attributing Turn 2 variance and dimension weaknesses to genuine multi-turn failure modes. No side-by-side comparison to human-crafted adversarial continuations or other validation of equivalent pressure is provided, leaving open the possibility that observed patterns are artifacts of the self-referential generation loop.
- [Preliminary results] Preliminary results: The reported patterns (Turn-1 reliability, Turn-2 variance, weakest evidence-based capacity attribution, and 0.91 mean for AI governance scenarios) are presented without details on the number of scenarios or runs, the exact aggregation method across the 13 AHB-derived dimensions, or any statistical significance testing, making it difficult to assess the robustness of the cross-scenario and cross-model claims.
minor comments (1)
- [Abstract] The abstract states evaluations use 'up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale' but does not list the dimensions or their derivation from AHB in the summary; adding this would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below, indicating where we agree that revisions are warranted and where we provide additional context or clarification while preserving the preliminary nature of the reported results.
read point-by-point responses
-
Referee: [Abstract] The claim that MANTA 'generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure' is load-bearing for attributing Turn 2 variance and dimension weaknesses to genuine multi-turn failure modes. No side-by-side comparison to human-crafted adversarial continuations or other validation of equivalent pressure is provided, leaving open the possibility that observed patterns are artifacts of the self-referential generation loop.
Authors: We agree that the absence of a direct comparison to human-crafted adversarial continuations leaves open the possibility that some observed variance could arise from properties of the generation procedure rather than purely from multi-turn dynamics. The dynamic method was chosen precisely because it produces model- and response-specific pressure that would be difficult to replicate with fixed human-authored turns. In the revised manuscript we will add an explicit limitations paragraph acknowledging this gap and describing planned follow-up work that includes human-authored adversarial continuations for validation. revision: partial
-
Referee: [Preliminary results] The reported patterns (Turn-1 reliability, Turn-2 variance, weakest evidence-based capacity attribution, and 0.91 mean for AI governance scenarios) are presented without details on the number of scenarios or runs, the exact aggregation method across the 13 AHB-derived dimensions, or any statistical significance testing, making it difficult to assess the robustness of the cross-scenario and cross-model claims.
Authors: We accept that the current presentation of the preliminary results lacks sufficient methodological detail. The revised manuscript will specify the exact number of scenarios, the number of independent runs per model, the aggregation procedure used across the 13 dimensions, and any statistical tests applied to support claims such as the difference between AI governance and first-order practical scenarios. revision: yes
Circularity Check
Minor dependence on AHB-derived dimensions without load-bearing circularity in results
specific steps
-
self citation load bearing
[Abstract]
"The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale."
Scoring dimensions are imported from AnimalHarmBench (AHB) rather than derived independently within MANTA; while this does not collapse the empirical results into a tautology, it creates a minor dependence on prior work for the evaluation criteria used to generate the headline patterns.
full rationale
The paper introduces MANTA as a new tooling framework for multi-turn evaluation and reports empirical patterns from running it on specific models. No equations, fitted parameters, or derivations are present that reduce the reported patterns (Turn-1 reliability, Turn-2 variance, dimension weaknesses, scenario gaps) to self-referential quantities by construction. The sole point of dependence is the use of up to 13 scoring dimensions taken from prior AnimalHarmBench work; this is a methodological borrowing rather than a self-citation chain that forces the central claims. The dynamic follow-up generation is presented as a design feature whose realism is asserted but not mathematically derived from the evaluation outputs themselves. The overall chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
-
MANTA framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
12 MANTA: Multi-turn Assessment for Nonhuman Thinking & AlignmentLu, Luong, Chen — 2026 W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/,
work page 2026
- [2]
-
[3]
Aligning AI With Shared Human Values
D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning AI with shared human values.arXiv preprint arXiv:2008.02275,
work page internal anchor Pith review arXiv 2008
- [4]
- [5]
- [6]
-
[7]
A survey of multi-turn LLM agent evaluation.arXiv preprint arXiv:2503.22458,
Anonymous. A survey of multi-turn LLM agent evaluation.arXiv preprint arXiv:2503.22458,
- [8]
-
[9]
Ethical and social risks of harm from Language Models
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359,
work page internal anchor Pith review Pith/arXiv arXiv
- [10]
-
[11]
14 MANTA: Multi-turn Assessment for Nonhuman Thinking & AlignmentLu, Luong, Chen — 2026 Supplementary Material This appendix provides additional details on benchmark construction, scoring dimensions, example scenarios, and reproducibility for MANTA. A AHB Scoring Dimensions Table 2 provides full definitions for all 13 AHB-derived scoring dimensions used i...
work page 2026
-
[12]
But my margins are already at 12%. I can’t afford to change suppliers
without being asked—testing contextual welfare salience on a species class where scientific consensus has recently shifted but industry practice has not. C Pressure Types and Examples Table 3 defines the six adversarial pressure types used by the dynamic solver, with example Turn 2 follow-ups. Table 3: Adversarial pressure types available to the MANTAdyna...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.