arXiv: 2606.24890 · v2 · pith:4PCZUMQEnew · submitted 2026-04-30 · cs.CL · cs.AI · cs.CY

Small edits, large models: How Wikipedia advocacy shapes LLM values

Pith reviewed 2026-07-01 09:00 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.CY
keywords: Wikipedia editing, LLM training data, data attribution, animal welfare, influence estimation, advocacy, language models, fine-tuning

The pith

A small group of Wikipedia editors can measurably shape how language models discuss animal welfare topics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that 125 edits by the Pro-Animal Wikipedians across 115 Wikipedia pages measurably influence large language model behavior on animal welfare queries. Wikipedia text receives heavy weight in training datasets, and the authors apply retrieval and gradient-based attribution to isolate the contribution of those specific edits. For animal welfare queries the PAW sections dominate the highest-attributed documents at 68 percent, while the same sections show no elevated influence on unrelated queries about the same entities. Fine-tuning experiments confirm the pattern: models trained on the PAW content reduce perplexity only on animal welfare text, and models trained on control content improve only on control text.

Core claim

Using TrackStar retrieval attribution on Llama 3.1 8B, PAW-edited sections made up 68 percent of the highest-attributed documents for animal welfare queries but only 52 percent for unrelated queries about the same companies. MAGIC counterfactual influence estimation on Llama-3.2-1B, repeated across five random training-order seeds, found that the top-10 most influential documents on animal welfare queries were all PAW edits in every seed, while on general queries the top-10 sat at chance levels. Mean PAW influence exceeded mean control influence on animal welfare queries with p < 0.0001 in every seed, an effect 6 to 30 times larger than on general queries. Leave-subset-out validation gave Spearman rho = 1.00 for all 10 runs.
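
To make the "68 percent versus 52 percent" contrast concrete, the sketch below runs an exact two-sided binomial test of each share against a 50 percent chance baseline. The sample size (200 top-attributed documents per query set) is a hypothetical value chosen for illustration, because this page reports only percentages; the paper's own test and counts may differ, so the resulting p-values are not expected to match the reported ones.

```python
from math import comb

def binom_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed count k under the null share p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# HYPOTHETICAL sample size: not taken from the paper.
n = 200
p_welfare = binom_two_sided_p(136, n)  # 68% of 200 are PAW sections
p_control = binom_two_sided_p(104, n)  # 52% of 200 are PAW sections

print(f"animal welfare queries: p = {p_welfare:.2e}")
print(f"unrelated queries:      p = {p_control:.2f}")
```

Under these assumed counts the first share is far from chance and the second is not, which is the qualitative pattern the core claim describes.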

What carries the argument

Gradient-based data attribution (TrackStar retrieval and MAGIC counterfactual influence estimation) that traces the contribution of specific Wikipedia page sections to model outputs on targeted versus control queries.

If this is right

  • The influence of the edits is topic-specific rather than a general association with the mentioned companies or pages.
  • Counterfactual estimation shows that the top influential documents shift completely to PAW edits when the query concerns animal welfare.
  • Fine-tuning on the edited content improves model performance only on the matching category of text.
  • The effect size difference between targeted and control queries holds across multiple random training seeds and validation runs.
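
The counterfactual result in the second bullet can be set against a chance baseline. The sketch below assumes a balanced pool of 125 PAW documents and 125 control documents (a hypothetical pool size; the page states only that chance corresponds to about 4 to 6 of 10) and asks how likely it is that all ten top slots come from the PAW group if the ranking were random.

```python
from math import comb

def prob_all_top_k_from_group(group_size: int, pool_size: int, k: int) -> float:
    """P(all k top-ranked documents belong to the group) under a random ranking.

    This is the hypergeometric probability of drawing k group members
    in k draws without replacement.
    """
    return comb(group_size, k) / comb(pool_size, k)

def expected_group_in_top_k(group_size: int, pool_size: int, k: int) -> float:
    """Expected number of group members in the top k under a random ranking."""
    return k * group_size / pool_size

# HYPOTHETICAL pool: 125 PAW documents among 250 candidates.
paw_docs, pool, top_k = 125, 250, 10

expected = expected_group_in_top_k(paw_docs, pool, top_k)
p_one_seed = prob_all_top_k_from_group(paw_docs, pool, top_k)
# Treating seeds as independent is an additional simplifying assumption.
p_five_seeds = p_one_seed**5

print(f"expected PAW documents in top-10 by chance: {expected}")
print(f"P(10 of 10) in one seed:   {p_one_seed:.2e}")
print(f"P(10 of 10) in five seeds: {p_five_seeds:.2e}")
```

With a balanced pool the chance expectation is 5 of 10, consistent with the 4 to 6 range reported for general queries.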

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attribution pipeline could be applied to other coordinated editing campaigns on Wikipedia to check for comparable topic-specific effects.
  • Public knowledge platforms function as high-leverage insertion points for shaping downstream model behavior when training data is drawn from them.
  • Auditing training corpora with influence methods might reveal other narrow editing efforts that have already propagated into deployed models.
  • The observed pattern suggests that small changes in high-weight data sources can produce outsized, topic-local shifts in model outputs without altering the overall training distribution.

Load-bearing premise

The attribution methods correctly isolate the causal effect of the specific PAW edits rather than capturing correlated features of the pages or query selection biases.

What would settle it

If the proportion of PAW-edited sections among the top-attributed documents remained statistically indistinguishable between animal welfare queries and unrelated queries about the same entities, the claim of topic-specific influence would be falsified.
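
One way to operationalize "statistically indistinguishable" is a two-proportion z-test on the PAW share in the two query sets. The sketch below is an illustration under assumed counts (200 top-attributed documents per set, a hypothetical figure), not the test the authors ran.

```python
from math import erfc, sqrt

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test.

    Returns the z statistic and the two-sided p-value from the
    normal approximation.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value

# HYPOTHETICAL counts: 68% and 52% of 200 documents each.
z, p = two_proportion_z(136, 200, 104, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large p-value from such a test, at the paper's real sample sizes, would be the outcome that undercuts the topic-specificity claim.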

Figures

Figures reproduced from arXiv: 2606.24890 by Alexa Gnauck, Jasmine Brazilek, Maria Navas.

Figure 1: Domain-specific influence of PAW Wikipedia editions. Comparative analysis of training ...
Figure 2: MAGIC counterfactual influence per training document, averaged across five random ...
Figure 3: Fine-tuning ablation. Each model performs better on the type of text it was trained on. The ...
Original abstract

Can a small group of volunteers shape how AI systems discuss animal welfare, just by editing Wikipedia? We show that they can. Wikipedia appears in nearly every major language model training dataset and is weighted more heavily than web-crawled text. The Pro-Animal Wikipedians (PAW), a group of advocates who add sourced animal welfare content to relevant articles, have made 125 edits across 115 pages. Using gradient-based data attribution (Bergson; MAGIC), we traced how these edits influence language model behavior. TrackStar retrieval attribution on Llama 3.1 8B found that PAW-edited sections made up 68 percent of the highest-attributed documents for animal welfare queries (p < 0.0001) but only 52 percent for unrelated queries about the same companies (p = 0.53): the model links PAW content specifically to animal welfare topics, not to the entities in general. MAGIC counterfactual influence estimation on Llama-3.2-1B, run across five random training-order seeds, gave the same picture even more sharply: in every seed, the top-10 most influential documents on animal welfare queries were all PAW edits (10 of 10, 5 of 5 seeds), while on general queries the same top-10 sat at chance (4 to 6 of 10). Mean PAW influence exceeded mean control influence on animal welfare queries with p < 0.0001 in every seed, an effect 6 to 30 times larger than on general queries. Leave-subset-out validation gave Spearman rho = 1.00 for all 10 runs. When we fine-tuned separate models on PAW content versus control content, each model performed better specifically on the type of text it was trained on: the PAW-trained model cut perplexity on animal welfare text from 12.4 to 8.4, while the control-trained model cut perplexity on control text from 16.1 to 11.4. A small, coordinated Wikipedia editing campaign therefore measurably shapes how language models handle the topics those edits address.
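
The perplexity figures in the abstract can be read as relative reductions. The sketch below uses the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood and applies it to the four reported values; the toy log-probabilities in the first check are illustrative and not from the paper.

```python
from math import exp, log

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    return exp(-sum(token_logprobs) / len(token_logprobs))

def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in perplexity after fine-tuning."""
    return (before - after) / before

# Toy check: a model that gives every token probability 0.5 has perplexity 2.
toy = perplexity([log(0.5)] * 4)

# Values reported in the abstract.
paw = relative_reduction(12.4, 8.4)       # PAW-trained model, animal welfare text
control = relative_reduction(16.1, 11.4)  # control-trained model, control text

# Equivalent drop in mean per-token loss (nats), since PPL = exp(loss).
paw_nats = log(12.4) - log(8.4)
control_nats = log(16.1) - log(11.4)

print(f"PAW-trained: {paw:.1%}, control-trained: {control:.1%}")
# prints: PAW-trained: 32.3%, control-trained: 29.2%
```

The two reductions are of similar relative size, which fits the abstract's framing that each model improves specifically on its own text type rather than one run simply training better.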

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that a small coordinated set of 125 Wikipedia edits by the Pro-Animal Wikipedians (PAW) across 115 pages measurably shapes LLM behavior on animal-welfare topics. Using TrackStar retrieval attribution on Llama 3.1 8B, PAW-edited sections comprise 68% of top-attributed documents for animal-welfare queries (p < 0.0001) versus 52% for unrelated control queries on the same entities (p = 0.53). MAGIC counterfactual influence estimation on Llama-3.2-1B across five training-order seeds shows the top-10 documents for animal-welfare queries are all PAW edits (10/10 in every seed), while on general queries the top-10 is at chance (4–6/10); mean PAW influence exceeds mean control influence on animal-welfare queries with p < 0.0001, an effect 6–30 times larger than on general queries. Leave-subset-out validation yields Spearman rho = 1.00. Separate fine-tuning runs confirm topic-specific effects: PAW-trained models reduce perplexity on animal-welfare text from 12.4 to 8.4 while control-trained models reduce it on control text from 16.1 to 11.4.

Significance. If the attribution methods validly isolate causal influence from the specific PAW edits, the result would demonstrate that modest, targeted Wikipedia editing can produce measurable, topic-specific shifts in LLM behavior. The internal consistency across five seeds, perfect leave-subset-out rank correlation, and the fine-tuning perplexity contrast (topic-specific rather than global) are genuine strengths that support reproducibility of the chosen metrics.

major comments (2)
  1. [Abstract] The interpretation that the 68% vs 52% TrackStar differential and the 10/10 MAGIC top-document result demonstrate specific causal influence from the 125 PAW edits assumes that both attribution procedures recover ground-truth influence rather than topical content similarity or query-selection effects; no validation experiment (e.g., synthetic injection of known edits or comparison against a ground-truth influence oracle) is described to test this premise, which is load-bearing for the central claim.
  2. [Abstract] Query construction, sampling of documents for attribution, and the precise definition of the control query set are not specified, preventing assessment of whether the reported separation between animal-welfare and control conditions could be driven by semantic overlap or selection bias rather than the PAW edits themselves.
minor comments (1)
  1. [Abstract] The fine-tuning perplexity reductions would be easier to interpret if the number of evaluation documents or tokens per condition were reported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and for noting the strengths in internal consistency, cross-seed reproducibility, and the fine-tuning perplexity contrasts. Below we respond point-by-point to the two major comments. We agree that greater methodological transparency and explicit discussion of attribution assumptions are warranted.

Point-by-point responses
  1. Referee: [Abstract] The interpretation that the 68% vs 52% TrackStar differential and the 10/10 MAGIC top-document result demonstrate specific causal influence from the 125 PAW edits assumes that both attribution procedures recover ground-truth influence rather than topical content similarity or query-selection effects; no validation experiment (e.g., synthetic injection of known edits or comparison against a ground-truth influence oracle) is described to test this premise, which is load-bearing for the central claim.

    Authors: We acknowledge that the manuscript does not include a synthetic-injection or oracle-based validation of the attribution methods, and that such an experiment would strengthen the causal interpretation. The control conditions (unrelated queries on identical entities, plus general queries) were intended to isolate topic-specific effects from entity or semantic overlap, and the leave-subset-out Spearman rho = 1.00 plus perfect top-10 consistency across five seeds provide internal checks. Nevertheless, we will add a dedicated Limitations subsection that explicitly states the lack of synthetic validation, discusses the assumptions of TrackStar and MAGIC, and notes that the reported differentials are conditional on those assumptions. If the available data and compute permit, we will also attempt a small-scale synthetic edit injection as an appendix experiment.

    Revision: partial

  2. Referee: [Abstract] Query construction, sampling of documents for attribution, and the precise definition of the control query set are not specified, preventing assessment of whether the reported separation between animal-welfare and control conditions could be driven by semantic overlap or selection bias rather than the PAW edits themselves.

    Authors: The full Methods section contains the query templates, retrieval sampling procedure (top-k documents from the indexed corpus), and control-query definitions, but these details are not summarized in the abstract or introduction. We will revise the abstract and add a concise Experimental Setup paragraph that specifies: (i) the exact query templates used for animal-welfare versus control conditions, (ii) the document sampling rule for attribution runs, and (iii) the control set construction (queries about the same 115 entities but on non-welfare topics such as financial metrics or historical facts). This will allow readers to evaluate potential semantic overlap directly.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: central results rely on external attribution tools and independent fine-tuning experiments

The paper reports empirical measurements of influence using TrackStar retrieval and MAGIC gradient-based attribution (external methods) plus separate fine-tuning runs on PAW vs. control content. No equations appear in the provided text, and no step reduces a reported influence score or perplexity differential to a parameter fitted inside the paper or to a self-citation chain. Leave-subset-out Spearman rho = 1.00 is an internal consistency check on the chosen metrics rather than a re-derivation of the influence itself. The derivation chain is therefore self-contained against external benchmarks.
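
Because so much weight rests on Spearman rho = 1.00, it helps to recall what the statistic does and does not certify. The sketch below is a plain-Python implementation with average ranks for ties, applied to hypothetical predicted and measured effects (the page does not say exactly which quantities the leave-subset-out validation compares, so the pairing here is an assumption).

```python
def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# HYPOTHETICAL values: predicted vs. measured effect of removing four subsets.
predicted = [0.1, 0.4, 0.9, 1.7]
measured = [0.02, 0.05, 0.30, 0.31]

rho = spearman_rho(predicted, measured)
print(f"Spearman rho = {rho:.2f}")
```

In this toy case rho is 1 even though the measured magnitudes are badly miscalibrated relative to the predictions, so a perfect rank correlation confirms ordering only, which is why it serves as a consistency check rather than a validation of effect sizes.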

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on the validity of two cited attribution methods and on the assumption that Wikipedia content appears in the training data with the stated weighting; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5933 in / 1125 out tokens · 36346 ms · 2026-07-01T09:00:35.471677+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages · 3 internal anchors

  1. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  2. Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
  3. Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining. arXiv preprint arXiv:2410.17413.
  4. doi: 10.48550/arXiv.2410.17413
  5. Marcia W. DiStaso. Perceptions of Wikipedia by public relations professionals: A comparison of 2012 and 2013 surveys. Institute for Public Relations, 2013. URL https://instituteforpr.org/perceptions-wikipedia-public-relations-professionals-comparison-2012-2013-surveys/
  6. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  7. doi: 10.48550/arXiv.2101.00027
  8. Andrew Ilyas and Logan Engstrom. MAGIC: Near-optimal data attribution for deep learning. arXiv preprint arXiv:2504.16430, 2025. URL https://arxiv.org/abs/2504.16430
  9. Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622, 2022.
  10. Zarine Kharazian, Kate Starbird, and Benjamin Mako Hill. Governance capture in a self-governing community: A qualitative comparison of the Croatian, Serbian, Bosnian, and Serbo-Croatian Wikipedias. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW1):1–26, 2024. doi: 10.1145/3637338
  11. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.
  12. Quintin Lucia and Nora Belrose. EleutherAI/bergson: Mapping out the “memory” of neural nets with data attribution. GitHub, 2026. URL https://github.com/EleutherAI/bergson
  13. Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. TRAK: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186, 2023.
  14. Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024. doi: 10.48550/arXiv.2402.00159
  15. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  16. Eric Wallace, Tony Z Zhao, Shi Feng, and Sameer Singh. Concealed data poisoning attacks on NLP models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 139–150, 2021.
  17. Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, et al. RedPajama: An open dataset for training large language models. arXiv preprint arXiv:2411.12372, 2024. doi: 10.48550/arXiv.2411.12372
  18. Taha Yasseri, Robert Sumi, András Rung, András Kornai, and János Kertész. Dynamics of conflicts in Wikipedia. PLoS ONE, 2012. doi: 10.1371/journal.pone.0038869

A Query Examples and Data Availability

All datasets, queries, and fine-tuned models are available from the authors on request. Below we list example queries from each set.

A.1 Animal Welfare Quer...