arxiv: 2606.18142 · v2 · pith:ROLIKRAAnew · submitted 2026-06-16 · 💻 cs.AI · cs.CL · cs.CY

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Pith reviewed 2026-06-27 00:56 UTC · model grok-4.3

classification: cs.AI · cs.CL · cs.CY
keywords: AI agents, animal welfare, agentic benchmark, travel booking, implicit reasoning, frontier models

The pith

Frontier AI travel agents book animal-exploiting options more often than chance in controlled scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TAC, the first agentic benchmark that requires AI models to book travel while avoiding options that involve animal exploitation. It evaluates seven frontier models on forty-eight augmented scenarios across six exploitation categories and finds that every model scores below the sixty-four percent chance baseline, with the strongest at fifty-three percent. Adding one welfare-aware sentence to the system prompt raises scores by forty-seven to sixty-three percentage points for Claude and GPT-5.5, by twenty-six points for GPT-5.2, and by under twelve points for DeepSeek and Gemini. An audit of transcripts flags no cases of evaluation awareness, which suggests the low scores are not due to models detecting that they are being tested.

Core claim

When deployed as agents that must use tools to book travel, frontier models do not avoid animal-exploitation options at rates above chance, even though existing text-response benchmarks suggest they can reason about welfare.

What carries the argument

TAC benchmark of twelve hand-authored travel scenarios augmented to forty-eight samples that control for price, rating, and position confounds across six categories of animal exploitation.

If this is right

  • A single welfare sentence in the prompt can produce large gains in avoidance for some models but small gains for others.
  • Performance varies across categories of exploitation and across model families.
  • Text-only welfare benchmarks do not predict outcomes when models must act with tools.
  • Results bear on category-level variation in cultural domains and on systemic-risk frameworks for general-purpose AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agentic benchmarks may be required to assess welfare behavior once models move from answering questions to booking services.
  • Similar shortfalls could appear in other agent tasks such as menu planning or procurement that touch animal products.
  • Prompt additions offer a partial and model-dependent fix rather than a durable solution.

Load-bearing premise

The hand-authored scenarios and their augmentations accurately isolate animal-exploitation choices without introducing unmeasured confounds in how models interpret travel options or user intent.

What would settle it

A replication in which the same models score above sixty-four percent on a fresh set of travel scenarios that preserve the price, rating, and position controls.

Figures

Figures reproduced from arXiv: 2606.18142 by Arturs Kanepajs, Carol Kline, Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Oliver Tullio.

Figure 1: Base welfare rate across seven frontier models (higher bar = model more often selected …
Figure 2: Per-scenario welfare rate, sorted from highest to lowest by the mean across the seven …
Figure 3: Composite welfare-publicity z-score (mean of Google Trends, GDELT, and Wikipedia …
Figure 4: Welfare rate per model under the base condition (blue) and the welfare-eliciting condition …

Original abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces TAC, the first agentic benchmark for implicit animal welfare, presenting AI travel agents with 12 hand-authored scenarios (augmented to 48 samples controlling price, rating, and position) across six exploitation categories. Seven frontier models are evaluated; all score below the 64% chance baseline (best: Claude Opus 4.7 at 53%). Adding one welfare-aware sentence to the system prompt produces large gains (47-63 points in Claude/GPT-5.5). An auxiliary audit of 288 transcripts finds zero cases of evaluation awareness.

Significance. If the results hold, the work provides the first direct measurement of whether text-based welfare reasoning transfers to tool-using agentic settings, with clear implications for the limits of existing Q&A benchmarks and for systemic-risk evaluation under the EU GPAI Code of Practice. The prompt-sensitivity findings and the introduction of a controlled agentic testbed are substantive contributions.

major comments (1)
  1. [scenario construction and confound controls] Section describing scenario construction and confound controls: the central claim that below-chance performance demonstrates failure of implicit welfare reasoning requires that the 12 base scenarios plus augmentations differ only on the animal-exploitation dimension. While price, rating, and position are controlled, the manuscript supplies no evidence that option phrasing, cultural associations, or implied user intent are balanced; without such evidence the interpretation of the 53% ceiling remains open to alternative explanations.
minor comments (3)
  1. [results] Results tables/figures report raw percentages without error bars, confidence intervals, or statistical tests against the 64% baseline.
  2. [scenario construction] No quantitative inter-rater reliability statistic is supplied for the hand-authored scenarios or their augmentations.
  3. [evaluation design] The evaluation lacks a non-agentic text-response baseline, preventing direct assessment of whether the agentic/tool-use format uniquely depresses welfare-consistent choices.
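
The first minor comment can be made concrete with a short sketch of the statistics it asks for: a Wilson confidence interval for one model's welfare rate and an exact one-sided binomial test against the 64% chance baseline. The counts below (`k = 25` welfare-consistent choices out of `n = 48` samples) are hypothetical and chosen only to land near the reported 53%; the paper's per-model counts and number of repeated runs are not given on this page. The test also treats samples as independent, which augmented variants of the same base scenario may not be.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def binom_lower_tail(k, n, p0):
    """Exact P(X <= k) for X ~ Binomial(n, p0): one-sided test for 'below p0'."""
    return sum(math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))

# Hypothetical counts, NOT taken from the paper.
k, n = 25, 48
chance = 0.64  # chance baseline reported in the paper

low, high = wilson_interval(k, n)
p_value = binom_lower_tail(k, n, chance)
print(f"rate={k / n:.3f}  95% CI=({low:.3f}, {high:.3f})  one-sided p={p_value:.4f}")
```

With only 48 samples per model the interval is wide, so whether a given score is reliably below 64% depends on how many repeated runs stand behind each reported percentage.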

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the single major comment below.

  1. Referee: [scenario construction and confound controls] Section describing scenario construction and confound controls: the central claim that below-chance performance demonstrates failure of implicit welfare reasoning requires that the 12 base scenarios plus augmentations differ only on the animal-exploitation dimension. While price, rating, and position are controlled, the manuscript supplies no evidence that option phrasing, cultural associations, or implied user intent are balanced; without such evidence the interpretation of the 53% ceiling remains open to alternative explanations.

    Authors: We agree that the interpretation of below-chance performance as evidence of failed implicit welfare reasoning requires the scenarios to differ primarily on the animal-exploitation dimension. The twelve base scenarios were hand-authored to isolate this dimension, with the forty-eight augmentations systematically varying only price, rating, and position. However, the manuscript does not supply quantitative evidence (such as human ratings or linguistic metrics) confirming balance on option phrasing, cultural associations, or implied user intent. We will revise the scenario-construction section to describe the authoring guidelines used to reduce these confounds (neutral phrasing, avoidance of culturally loaded terms) and will add an explicit limitations paragraph noting that full isolation of all naturalistic confounds remains challenging and that future iterations could incorporate crowdsourced validation.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark against external baseline

The paper presents an empirical evaluation of seven frontier models on the TAC benchmark, consisting of 12 hand-authored scenarios augmented to 48 samples. Central results report accuracies below the 64% chance level (derived from task structure) and prompt-induced gains, with an auxiliary transcript audit. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the load-bearing steps. The measurement relies on external model outputs and a task-defined baseline rather than any self-referential construction or ansatz.
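
To illustrate what a "task-defined" chance baseline can mean, here is a minimal sketch under a stated assumption: an agent that picks uniformly at random among the options in each scenario has an expected welfare rate equal to the mean, over scenarios, of the fraction of options that avoid animal exploitation. The option counts below are toy values, not the benchmark's, and the paper may derive its 64% figure differently (for example, if a booking involves several selections).

```python
from fractions import Fraction

def chance_welfare_rate(scenarios):
    """Expected welfare rate of a uniformly random picker.

    `scenarios` is a list of (welfare_safe_options, total_options) pairs.
    """
    per_scenario = [Fraction(safe, total) for safe, total in scenarios]
    return sum(per_scenario) / len(per_scenario)

# Toy option counts for three hypothetical scenarios (NOT from the paper).
toy_scenarios = [(2, 3), (3, 5), (1, 2)]
baseline = chance_welfare_rate(toy_scenarios)
print(f"chance baseline = {baseline} = {float(baseline):.3f}")
```

Because the baseline depends only on how many safe and exploitative options each scenario offers, it is fixed before any model is run.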

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The measurement rests on the assumption that the twelve hand-authored scenarios validly instantiate the six categories of animal exploitation and that the 48 augmentations successfully neutralize price/rating/position confounds. No free parameters are fitted; the chance baseline is task-derived. No new entities are postulated.
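
The page does not describe how the twelve scenarios become forty-eight samples, beyond saying the augmentations control for price, rating, and position. As one hypothetical way four variants per scenario could counterbalance those attributes, the sketch below crosses "swap the price/rating bundle between options" with "reverse the listing order" for a two-option scenario. The `Option` type, the `augment` helper, and the example options are all illustrative assumptions and not the authors' procedure.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class Option:
    name: str
    price: float
    rating: float
    exploits_animals: bool

def augment(options):
    """Return four counterbalanced variants of a two-option scenario.

    Hypothetical scheme: 2 (attribute bundle swapped?) x 2 (order reversed?).
    """
    a, b = options
    variants = []
    for swap_attrs, reverse in product([False, True], repeat=2):
        x, y = a, b
        if swap_attrs:
            # Exchange price and rating while keeping each option's identity.
            x = replace(a, price=b.price, rating=b.rating)
            y = replace(b, price=a.price, rating=a.rating)
        variants.append((y, x) if reverse else (x, y))
    return variants

# Illustrative options, not taken from the benchmark.
bullfight = Option("Bullfight tickets", 40.0, 4.6, True)
museum = Option("Art museum tour", 55.0, 4.2, False)
variants = augment((bullfight, museum))
```

Across the four variants the exploitative option is listed first twice and carries the cheaper, higher-rated bundle twice, so a model that simply prefers the first, cheapest, or best-rated option would score at chance rather than systematically for or against welfare.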

axioms (2)
  • domain assumption: The hand-authored travel scenarios accurately represent distinct categories of animal exploitation without introducing unintended cultural or linguistic biases.
    Invoked in the description of scenario construction; if false, the below-chance scores could reflect scenario artifacts rather than model behavior.
  • domain assumption: Models' tool-use behavior in the benchmark reflects their default policy rather than recognition that they are being tested.
    Addressed by the Inspect Scout audit; the zero flagged transcripts support but do not prove the assumption.
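
The gap between "support" and "prove" in the second axiom can be quantified with a small sketch. When zero of `n` audited transcripts are flagged, the exact one-sided upper confidence bound on the true flag rate is `1 - alpha**(1/n)`, which the "rule of three" approximates as `3/n` at 95% confidence. The calculation assumes the 288 transcripts are independent draws and that the judge model detects evaluation awareness whenever it is present; neither assumption is established by the audit itself.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on an event rate when 0 of n trials are positive."""
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)

n_transcripts = 288  # audited base-condition transcripts reported in the abstract
bound = zero_event_upper_bound(n_transcripts)
rule_of_three = 3 / n_transcripts
print(f"95% upper bound on flag rate: {bound:.4f} (rule of three: {rule_of_three:.4f})")
```

Under those assumptions, an evaluation-awareness rate of up to roughly one percent remains compatible with the audit's zero flags.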
