pith. machine review for the scientific record.

arxiv: 2604.11152 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

SHARE: Social-Humanities AI for Research and Education

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords social sciences and humanities · causal language models · domain-specific pretraining · SSH Cloze benchmark · non-generative AI interface · academic integrity · MIRROR user interface · Phi-4 comparison

The pith

The SHARE models are the first causal language models pretrained exclusively on social sciences and humanities texts, and they approach the performance of much larger general-purpose models on domain tasks with far less training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SHARE family of base models as causal language models trained only on texts from the social sciences and humanities. It demonstrates that these models reach performance levels close to much larger general-purpose models on a custom SSH Cloze benchmark, despite using roughly one percent of the training tokens. The work also presents the MIRROR interface, a non-generative tool that reviews SSH text inputs to support critical engagement without producing new text. This matters for researchers seeking AI tools that align with the norms and principles of their disciplines rather than relying on broad models trained on mixed data.

Core claim

The SHARE models represent the first causal language models fully pretrained by and for the social sciences and humanities. Their performance in modeling SSH texts is close to that of general-purpose models such as Phi-4, which use approximately 100 times more training tokens, as measured by a custom SSH Cloze benchmark. The MIRROR user interface prototypes a generative-AI interface that reviews text inputs from SSH disciplines while generating no text of its own, allowing the capabilities of the SHARE models to be used without compromising SSH principles and norms.

What carries the argument

The SHARE causal language models pretrained exclusively on SSH data, together with the MIRROR non-generative review interface that preserves critical engagement.
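The material above does not include MIRROR's implementation, but the figure captions describe it as highlighting tokens that deviate from model expectations. Below is a minimal sketch of that general idea in Python, assuming a Hugging Face causal LM; the model name and the surprisal threshold are illustrative placeholders, not the authors' actual setup.

# Hypothetical sketch: flag "unexpected" tokens in a passage by their
# surprisal under a causal LM, in the spirit of the MIRROR review interface.
# "gpt2" and the 6-nat threshold are placeholders; the SHARE checkpoints and
# MIRROR's real scoring are not published in the material reviewed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def flag_unexpected_tokens(text, threshold_nats=6.0):
    """Return (token, surprisal, flagged) triples for each token after the first."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # next-token distributions
    next_tokens = input_ids[:, 1:]
    token_logp = log_probs.gather(-1, next_tokens.unsqueeze(-1)).squeeze(-1)[0]
    surprisal = -token_logp                                   # higher = more unexpected
    tokens = tokenizer.convert_ids_to_tokens(next_tokens[0].tolist())
    return [(tok, s.item(), s.item() > threshold_nats) for tok, s in zip(tokens, surprisal)]

for tok, s, flagged in flag_unexpected_tokens("Agenda setting was proposed by McCombs and Shaw."):
    print(f"{tok!r:>15}  surprisal={s:5.2f}  {'<-- would be highlighted' if flagged else ''}")

Tokens scoring above the threshold would be rendered as highlights for the reader to reflect on; nothing is generated, which is the property the MIRROR design is built around.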

If this is right

  • Domain-specific pretraining on SSH corpora can yield competitive modeling performance without the scale of general-purpose training runs.
  • Non-generative interfaces can apply language model strengths to SSH work while avoiding risks to originality and critical norms.
  • Specialized benchmarks focused on SSH texts offer a more relevant way to assess models intended for those fields than general metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models like SHARE could lower barriers for SSH researchers to use AI assistance tailored to their data sources and ethical standards.
  • The non-generative design of MIRROR might extend to other domains where full text generation conflicts with disciplinary expectations around authorship.
  • Further testing on diverse SSH subfields could clarify whether the efficiency gains hold beyond the initial benchmark.

Load-bearing premise

The custom SSH Cloze benchmark provides a valid and representative measure of modeling quality for actual SSH research tasks, and the pretraining used only SSH data without any general-purpose corpora.
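No construction details for the SSH Cloze benchmark appear in the material above, so the following is only an illustrative sketch of how a cloze item can be scored with a causal LM: rank candidate fillers by the log-likelihood of the completed sentence. The model name and the example item are invented placeholders, and the paper's prior correction is not reproduced.

# Hypothetical cloze scoring: pick the candidate filler that maximises the
# summed token log-probability of the filled-in sentence under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder for a SHARE or baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence):
    """Sum of per-token log-probabilities of the sentence."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

def score_cloze(template, candidates):
    """Return the candidate that makes the completed sentence most likely."""
    return max(candidates, key=lambda c: sentence_logprob(template.replace("____", c)))

# Invented item, not taken from the benchmark.
item = "The interview transcripts were analysed using thematic ____."
print(score_cloze(item, ["analysis", "regression", "chromatography"]))

Accuracy over many such items, compared against a chance or prior baseline, is the kind of number Figure 4 plots against training compute.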

What would settle it

A direct comparison on real SSH research tasks such as close reading of historical texts or sociological analysis where the SHARE models show markedly lower accuracy than Phi-4, or documentation that the pretraining corpus included substantial non-SSH material.

Figures

Figures reproduced from arXiv:2604.11152 by David Pride, João Gonçalves, Nick Jelicic, Petr Knoth, Sonia de Jager.

Figure 1. Two pipelines for LLM use in scholarly work. The standard pipeline generates expected text from a general-purpose aligned model; the SHARE + MIRROR pipeline uses an SSH-specialised base model to signal unexpected tokens for the reader to reflect on. view at source ↗
Figure 2. The average difference in log perplexity between the SHARE and Phi-4 models per scientific domain, following the FoS classifier. Lower values mean that SHARE fits that scientific domain better relative to Phi-4. Error bars indicate 95% confidence intervals; a computation sketch of this comparison follows the figure list. view at source ↗
Figure 3. The average difference in log perplexity between the SHARE and Phi-4 models based on the faculty the author is affiliated with at Erasmus University Rotterdam. Lower values mean that SHARE fits research outputs of that faculty better relative to Phi-4. Error bars indicate 95% confidence intervals. view at source ↗
Figure 4. Prior-corrected accuracy on the SSH Cloze benchmark (vertical axis) against estimated FLOPs (horizontal axis, logarithmic scale). Data points in the top-left quadrant indicate more efficient models, achieving better performance with lower training compute. SHARE-14B, trained on 96 billion tokens, outperforms (.80) comparable general models such as the fully trained Pythia-12B (.62) and Olmo-2… view at source ↗
Figure 5. Example use of SHARE-MIRROR for typo and style detection, where tokens highlighted in red show deviations from model expectations. The top output refers to the SHARE-4B model and the bottom output to SHARE-14B. The second example shifts from testing grammar and style to the factual knowledge of the models, using a factually wrong statement that identifies the authors of agenda set… view at source ↗
Figure 6. Three tests of reactions to factual inaccuracies in text by the SHARE models. The top and middle examples show, respectively, unexpectedness signalled by the SHARE-4B and SHARE-14B models, with token predictions following "proposed by" to the left. The bottom example replaces the word "concept" with "fact" and shows uncertainty for the SHARE-14B model. view at source ↗
Figure 7. How SHARE-4B (top) and SHARE-14B (bottom) can be leveraged to identify unexpected tokens in a discussion section. Unlike examples in previous figures, these deviations from expectation, highlighted in red, likely signal positive departures from expectations rather than factual or grammar mistakes. view at source ↗
Figure 8. Analysis of unexpectedness of the first paragraph of Gregory Gondwe's response article to the presidential address of the International Communication Association 2025 Conference. The tokens in red show how research framed from African locations is signalled as unexpected by a model trained on English open-access SSH scholarship. view at source ↗
Figure 9. Tokens of a text (Sonn & Hsu, 2022) used in model training are shown in green where correctly predicted (memorized) by the model, while red tokens show instances where model generation would deviate from the original text. The high prevalence of red tokens shows that the model is not memorizing copyrighted works. view at source ↗
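Figures 2 and 3 rest on a per-text difference in log perplexity between SHARE and Phi-4. Below is a minimal sketch of how such a difference could be computed, with placeholder model names standing in for the actual SHARE and Phi-4 checkpoints.

# Hypothetical comparison behind Figures 2-3: the difference in log perplexity
# two causal LMs assign to the same passage. Negative values mean the first
# (domain) model fits the text better. Model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_perplexity(model_name, text):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # With labels supplied, the returned loss is the mean per-token
        # negative log-likelihood, i.e. the log of the perplexity.
        loss = model(ids, labels=ids).loss
    return loss.item()

def log_ppl_difference(domain_model, general_model, text):
    return log_perplexity(domain_model, text) - log_perplexity(general_model, text)

sample = "This article examines how collective memory shapes national identity."
print(log_ppl_difference("gpt2", "distilgpt2", sample))  # < 0 favours the first model

Averaging this difference over documents grouped by field or faculty, with confidence intervals, would reproduce the shape of the figures' comparison, though not the authors' exact preprocessing.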
read the original abstract

This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the SHARE family of causal language models, presented as the first fully pretrained exclusively on social sciences and humanities (SSH) data, along with the MIRROR user interface. It claims that SHARE models achieve performance close to general-purpose models such as Phi-4 on a custom SSH Cloze benchmark despite using 100 times fewer tokens, and describes MIRROR as a non-generative interface for reviewing SSH texts to preserve critical engagement and disciplinary norms.

Significance. If the performance claims and benchmark validity hold, this could represent a meaningful contribution to domain-adapted language modeling for interpretive fields, demonstrating efficient pretraining on specialized corpora and an interface design that prioritizes review over generation. Such work might help address mismatches between general LLMs and SSH research practices, but the current lack of disclosed quantitative results, corpus details, or validation limits its immediate assessability.

major comments (2)
  1. [Benchmark description and results] The central claim of near-parity with Phi-4 on SSH text modeling (despite 100x fewer tokens) depends entirely on the custom SSH Cloze benchmark. No details are provided on its construction, item selection criteria, disciplinary coverage, expert validation, or correlation to real SSH tasks such as long-range coherence or theoretical nuance detection (see the section describing the SSH Cloze benchmark and associated results). This makes it impossible to determine whether the benchmark measures genuine SSH modeling capability or merely superficial domain adaptation.
  2. [Model pretraining and data] The assertion that the SHARE models are 'fully pretrained by and for the SSH' and 'exclusively on SSH data' is load-bearing for the efficiency claim but lacks supporting evidence on corpus composition, total token count, source filtering, or safeguards against general-purpose data leakage (see the model pretraining and data sections). Without these, the 100x token comparison cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract refers to 'Phi-4' without a citation or specification of the exact model, training tokens, or reference paper; add this for reproducibility.
  2. [Conclusion or discussion] As an 'intermediate technical report,' the manuscript would benefit from an explicit limitations section or statement on the preliminary status of the benchmark and models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our intermediate technical report. We address each major comment below and will revise the manuscript accordingly to provide the requested details and clarifications.

read point-by-point responses
  1. Referee: [Benchmark description and results] The central claim of near-parity with Phi-4 on SSH text modeling (despite 100x fewer tokens) depends entirely on the custom SSH Cloze benchmark. No details are provided on its construction, item selection criteria, disciplinary coverage, expert validation, or correlation to real SSH tasks such as long-range coherence or theoretical nuance detection (see the section describing the SSH Cloze benchmark and associated results). This makes it impossible to determine whether the benchmark measures genuine SSH modeling capability or merely superficial domain adaptation.

    Authors: We acknowledge that the current manuscript provides only a high-level overview of the SSH Cloze benchmark and lacks the granular details needed for full evaluation. In the revised version, we will expand the benchmark section to describe its construction process, item selection criteria, disciplinary coverage across SSH fields, expert validation steps, and any available analyses or planned validations correlating benchmark scores with real SSH tasks such as long-range coherence and theoretical nuance detection. This will strengthen the substantiation of the performance claims. revision: yes

  2. Referee: [Model pretraining and data] The assertion that the SHARE models are 'fully pretrained by and for the SSH' and 'exclusively on SSH data' is load-bearing for the efficiency claim but lacks supporting evidence on corpus composition, total token count, source filtering, or safeguards against general-purpose data leakage (see the model pretraining and data sections). Without these, the 100x token comparison cannot be evaluated.

    Authors: We agree that additional evidence on the pretraining corpus is required to support the efficiency claims and allow proper evaluation of the 100x token comparison. The revised manuscript will include expanded details in the model pretraining and data sections on corpus composition, total token count, source filtering methods, and safeguards against general-purpose data leakage. revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive report with no derivations or self-referential reductions

full rationale

The manuscript introduces the SHARE models and MIRROR interface as a descriptive technical report. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the provided text. The central performance claim is tied to a custom SSH Cloze benchmark, but this is presented as external evidence rather than a quantity derived by construction from the pretraining inputs or prior self-citations. The 'first' status and performance comparison do not reduce to definitional equivalence or ansatz smuggling. The derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new theoretical entities are introduced; the work is an empirical model and interface introduction.

pith-pipeline@v0.9.0 · 5420 in / 1126 out tokens · 68949 ms · 2026-05-10T15:32:27.690436+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Abdin et al. (2024)

    Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., . . . others (2024). Phi-4 technical report. arXiv preprint arXiv:2412.08905.
    Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., . . . others (2024). Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedin...

  2. [2]

    Knoth & Zdrahal (2012)

    Knoth, P., & Zdrahal, Z. (2012). CORE: Three access levels to underpin open access. D-Lib Magazine, 18(11/12), 1–13.
    Kuditipudi, R., Huang, J., Zhu, S., Yang, D., Potts, C., & Liang, P. (2025). Blackbox model provenance via palimpsestic membership inference. arXiv preprint arXiv:2510.19796.
    Kyriakidis, K. (2025). Focus on STEM at the expense of humanities: ...

  3. [3]

    Mixed Precision Training

    Lo, K., Wang, L. L., Neumann, M., Kinney, R., & Weld, D. S. (2020). S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4969–4983).
    Masterman, M. (2005). Language, cohesion and form (edited by Yorick Wilks). Cambridge University Press.
    McCombs, M. E., & Shaw, D. L. ...