What Is The Political Content in LLMs' Pre- and Post-Training Data?
Pith reviewed 2026-05-18 12:33 UTC · model grok-4.3
The pith
Political biases in LLMs arise from left-leaning skews already present in their pre-training and post-training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors sample text from pre-training and post-training datasets of several open-source LLMs and classify the political leaning and policy stances within that text. They report that the data is systematically skewed toward left-leaning content, that pre-training corpora contain substantially more politically engaged material than post-training data, and that pre-training datasets show similar political distributions despite differing curation methods. They further find a strong correlation between the stances detected in the training data and the stances expressed by the trained models, with political biases already visible in base models and persisting through post-training stages.
What carries the argument
Large-scale sampling of pre- and post-training datasets followed by automated political-leaning classification and stance detection on the sampled text.
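The sample-then-classify pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: reservoir sampling is assumed here as one standard way to draw a uniform sample from a corpus too large to hold in memory, and `toy_leaning_classifier` is a hypothetical keyword stand-in for a real political-leaning classifier.

```python
import random
from collections import Counter
from typing import Callable, Iterable, List


def reservoir_sample(stream: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, doc in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = doc
    return reservoir


def toy_leaning_classifier(doc: str) -> str:
    """Hypothetical stand-in: a real pipeline would call a trained classifier."""
    text = doc.lower()
    if "left" in text:
        return "left"
    if "right" in text:
        return "right"
    return "neutral"


def leaning_distribution(docs: Iterable[str],
                         classify: Callable[[str], str]) -> Counter:
    """Count how many sampled documents receive each leaning label."""
    return Counter(classify(doc) for doc in docs)


def left_right_ratio(counts: Counter) -> float:
    """Ratio of left-labeled to right-labeled documents (inf if no right-labeled)."""
    if counts["right"] == 0:
        return float("inf")
    return counts["left"] / counts["right"]
```

Swapping in a different `classify` function leaves the rest unchanged, which is also what an ablation across classifier choices would require.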
If this is right
- Political biases are already fixed in base models before any post-training alignment occurs.
- Pre-training data sets the dominant political distribution even when later stages use different curation.
- Post-training data contains less political material yet does not remove the left-leaning tilt inherited from pre-training.
- Model behavior on policy issues tracks the measured stances in the data used to train it.
- Greater transparency about dataset political composition is required to understand or adjust model outputs.
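The claim that model behavior tracks data stances is a correlation claim, and the review quotes a Pearson r from the paper. A minimal sketch of that computation follows, assuming each policy issue has one stance score measured in the training data and one measured in model outputs; how those scores are defined is not specified here, and any numbers passed in are the caller's own.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A high r across issues shows that data and model stances move together; on its own it does not establish that the data causes the behavior.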
Where Pith is reading between the lines
- If the correlation holds, then deliberately balancing political content in pre-training corpora could reduce downstream model bias without altering alignment techniques.
- The same sampling-and-classification approach could be applied to other bias dimensions such as gender or cultural framing to test whether data skew drives those patterns as well.
- Developers might need routine political audits of training data as a standard step before release, comparable to current safety evaluations.
Load-bearing premise
Automated political-leaning classifiers and stance detectors accurately measure the underlying political content of the sampled text without systematic error or bias of their own.
What would settle it
A direct intervention would settle it: retrain an otherwise identical base model on a dataset whose political distribution has been deliberately balanced or shifted rightward, then measure whether the model's outputs on policy questions shift accordingly. If no corresponding change appears, the data-to-behavior link is falsified.
Original abstract
Large language models (LLMs) are known to generate politically biased text. Yet, it remains unclear how such biases arise, making it difficult to design effective mitigation strategies. We hypothesize that these biases are rooted in the composition of training data. Taking a data-centric perspective, we formulate research questions on (1) political leaning present in data, (2) data imbalance, (3) cross-dataset similarity, and (4) data-model alignment. We then examine how exposure to political content relates to models' stances on policy issues. We analyze the political content of pre- and post-training datasets of open-source LLMs, combining large-scale sampling, political-leaning classification, and stance detection. We find that training data is systematically skewed toward left-leaning content, with pre-training corpora containing substantially more politically engaged material than post-training data. We further observe a strong correlation between political stances in training data and model behavior, and show that pre-training datasets exhibit similar political distributions despite different curation strategies. In addition, we find that political biases are already present in base models and persist across post-training stages. These findings highlight the central role of data composition in shaping model behavior and motivate the need for greater data transparency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the political content and biases in pre- and post-training datasets of open-source LLMs via large-scale sampling combined with automated political-leaning classification and stance detection. It reports that training data is systematically skewed toward left-leaning content, pre-training corpora contain substantially more politically engaged material than post-training data, pre-training datasets show similar political distributions despite differing curation, and there is a strong correlation between political stances in the data and model behavior on policy issues. Biases are present in base models and persist through post-training stages.
Significance. If the automated measurements prove reliable, the work offers a valuable data-centric empirical contribution to understanding the origins of political biases in LLMs. The scale of the analysis across multiple open-source models and the distinction between pre- and post-training stages could inform data auditing practices and mitigation strategies, while the observed data-model alignment and cross-dataset similarities provide testable observations for the field.
major comments (1)
- [Abstract and classification/sampling pipeline (methods description)] The central claims of systematic left-leaning skew, greater political engagement in pre- vs. post-training data, and data-model stance correlations rest entirely on the outputs of external political-leaning classifiers and stance detectors applied to sampled text. No validation is reported (e.g., precision/recall on held-out human-annotated sets, inter-annotator agreement, comparison to established political text benchmarks, or ablation across classifier choices). This is load-bearing: any systematic over-labeling of left-leaning or politically engaged content would directly produce the reported imbalances and alignments as measurement artifacts rather than properties of the underlying corpora.
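The measurement-artifact concern can be made concrete with a small expected-value calculation. The sketch below uses entirely hypothetical numbers (not taken from the paper): a corpus whose true left and right shares are equal, and a classifier that mislabels neutral text as left more often than as right.

```python
from typing import Dict

Shares = Dict[str, float]


def observed_shares(true_shares: Shares,
                    confusion: Dict[str, Shares]) -> Shares:
    """Expected label shares, where confusion[t][p] = P(predicted p | true t)."""
    labels = list(true_shares)
    return {
        p: sum(true_shares[t] * confusion[t][p] for t in labels)
        for p in labels
    }


# Hypothetical corpus: perfectly balanced between left and right.
TRUE_SHARES = {"left": 0.10, "right": 0.10, "neutral": 0.80}

# Hypothetical classifier: symmetric on partisan text, asymmetric on neutral text.
CONFUSION = {
    "left":    {"left": 0.90, "right": 0.00, "neutral": 0.10},
    "right":   {"left": 0.00, "right": 0.90, "neutral": 0.10},
    "neutral": {"left": 0.10, "right": 0.02, "neutral": 0.88},
}

if __name__ == "__main__":
    obs = observed_shares(TRUE_SHARES, CONFUSION)
    # The measured left/right ratio is about 1.6 despite a true ratio of 1.0.
    print(obs, obs["left"] / obs["right"])
```

Because neutral text dominates the corpus, even a small asymmetry in how it is mislabeled is enough to manufacture an apparent skew, which is why the referee asks for per-class validation.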
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting the importance of validating the automated classification pipeline. We agree that this is a critical aspect for ensuring the reliability of our data-centric findings on political biases in LLM training data. We address the major comment below and outline specific revisions to strengthen the manuscript.
- Referee: [Abstract and classification/sampling pipeline (methods description)] The central claims of systematic left-leaning skew, greater political engagement in pre- vs. post-training data, and data-model stance correlations rest entirely on the outputs of external political-leaning classifiers and stance detectors applied to sampled text. No validation is reported (e.g., precision/recall on held-out human-annotated sets, inter-annotator agreement, comparison to established political text benchmarks, or ablation across classifier choices). This is load-bearing: any systematic over-labeling of left-leaning or politically engaged content would directly produce the reported imbalances and alignments as measurement artifacts rather than properties of the underlying corpora.
Authors: We acknowledge the validity of this concern and agree that the absence of explicit validation metrics is a gap in the current manuscript. Although the classifiers and stance detectors were selected based on their prior use and reported performance in the political text analysis literature, we did not include human validation or robustness checks in the original submission. In the revised version, we will add a dedicated validation subsection that reports: (1) precision, recall, and F1 scores on a newly annotated held-out set of 1,000 samples with two independent human annotators (including inter-annotator agreement via Cohen's kappa); (2) comparison against established political text benchmarks such as those from the Manifesto Project or Media Bias Chart where applicable; and (3) an ablation across at least two alternative classifier implementations to test sensitivity to model choice. These additions will directly address the risk of measurement artifacts and provide quantitative evidence on whether the observed left-leaning skew and data-model alignments reflect properties of the corpora rather than classifier biases.
Revision: yes
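The validation metrics named in the rebuttal can be computed as in the following sketch. It is written from the standard definitions of Cohen's kappa and macro-averaged F1, not from any code by the authors, and the label lists are whatever the annotators and classifier produce.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently at random
    # according to their own label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    scores: List[float] = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights each class equally, so a classifier that performs well only on the majority class is penalized; that matters here because politically engaged text is a small fraction of most corpora.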
Circularity Check
No circularity: empirical application of external classifiers to standard corpora yields independent measurements.
The paper conducts large-scale sampling and applies political-leaning classifiers plus stance detectors to pre- and post-training datasets drawn from public corpora. No derivation reduces measured skew, imbalance, or data-model correlations to fitted parameters internal to the paper, nor does any self-citation chain substitute for the measurement step. The central results are direct outputs of the chosen external tools on the sampled text; they remain falsifiable by human annotation or alternative classifiers and do not exhibit self-definitional, fitted-input, or uniqueness-imported patterns. This is a standard empirical pipeline whose validity hinges on classifier accuracy rather than any internal circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Political leaning and stance can be reliably measured by automated text classifiers on sampled training data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We evaluate large LLMs ... reaching a macro-F1 score of 71%. ... stance detection ... Pearson r=0.90"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "left-leaning documents consistently outnumber right-leaning ones by a factor of 3 to 12"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.