Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Fukun Xing; Jia Yu; Pengfei Xiao; Weiwei Yu

arXiv: 2604.07189 · v1 · submitted 2026-04-08 · 💻 cs.CL

Pith reviewed 2026-05-10 18:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords corpus linguistics · large language models · autonomous agents · semantic change · diachronic analysis · intensifiers · tool use · empirical grounding

The pith

An LLM agent linked to a corpus query engine can autonomously generate hypotheses, execute queries, interpret results, and refine analyses while keeping every step anchored to verifiable data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that lets a language model take over the full cycle of corpus research instead of requiring humans to handle every query and interpretation step. This matters because corpus work has long demanded specialized skills in search syntax and statistics, limiting who can participate. The agent receives only a high-level direction, then proposes ideas, searches the text collection, evaluates patterns, and iterates, with the human researcher providing oversight only at the end. All outputs remain tied to actual corpus counts rather than the model's internal knowledge. Demonstrations include tracing historical shifts among English intensifiers and matching quantitative results from prior human-led studies on a larger collection.

Core claim

By connecting a large language model to a corpus query engine through a structured tool-use interface, the system enables the model to conduct complete investigative cycles: it formulates hypotheses from initial observations, translates them into precise queries, extracts and interprets quantitative patterns, identifies semantic pathways and distributions, and refines its account across successive rounds, all without relying on unverified model generation.

What carries the argument

The structured tool-use interface that lets the language model issue controlled queries to the corpus engine and receive back verifiable counts and contexts for each step of hypothesis generation and refinement.
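The bookkeeping idea behind such an interface can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `Evidence`, `investigate`, `toy_agent`, and `toy_engine` are hypothetical names, the "agent" is a hard-coded stand-in for an LLM, and the "engine" counts words in an eleven-token toy list instead of querying CQP. The point is only that every query and every returned count is logged, so a later claim can point back to a specific entry.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Evidence:
    round_no: int  # which refinement round issued the query
    query: str     # what the agent asked the engine
    hits: int      # what the engine returned


def investigate(propose: Callable[[List[Evidence]], List[str]],
                run_query: Callable[[str], int],
                max_rounds: int = 3) -> List[Evidence]:
    """Alternate between proposing queries and logging engine counts."""
    log: List[Evidence] = []
    for round_no in range(1, max_rounds + 1):
        queries = propose(log)      # the agent sees all evidence so far
        if not queries:             # nothing left to ask: stop early
            break
        for query in queries:
            log.append(Evidence(round_no, query, run_query(query)))
    return log


# Hypothetical stand-ins (assumptions, not part of the paper).
TOY_TOKENS = "it was very cold and so very dark but really quiet".split()


def toy_engine(word: str) -> int:
    """Count exact token matches in the toy list."""
    return TOY_TOKENS.count(word)


def toy_agent(log: List[Evidence]) -> List[str]:
    """Round 1 asks about two words; a later round adds one follow-up."""
    if not log:
        return ["very", "really"]
    seen = {entry.query for entry in log}
    return [] if "so" in seen else ["so"]


trace = investigate(toy_agent, toy_engine)
for entry in trace:
    print(entry)
```

Passing the full log back into `propose` is the design choice that matters here: the proposer can only refine against counts that were actually returned.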

If this is right

High-level directions from researchers can yield detailed, quantified linguistic findings without manual query construction.
Corpus-grounded results gain falsifiability that standalone model output lacks.
Established studies can be replicated with close numerical agreement on the same corpus.
The method treats agency as a separate dimension from the corpus-based versus corpus-driven distinction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar agent setups could extend to other large text collections where manual analysis is time-intensive.
The approach may surface distributional patterns across registers that require scanning millions of tokens.
It opens the possibility of running parallel investigations on multiple related queries simultaneously.

Load-bearing premise

The structured interface and direct corpus access will reliably stop the model from generating ungrounded or incorrect interpretations across multiple autonomous rounds of hypothesis formation, querying, and result evaluation.

What would settle it

Compare the agent's output on a fixed corpus and query against independent human analysis of the same data and measure the rate of claims that lack matching corpus evidence or quantitative support.
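One way to operationalize that rate is sketched below, assuming each claim cites a query and a count that can be compared with what the engine actually returned. `Claim`, `unsupported_rate`, the query labels, and the numbers are all hypothetical placeholders; exact count matching is an assumption too, since interpretive claims would still need human judgment.

```python
from typing import Dict, List, NamedTuple


class Claim(NamedTuple):
    text: str         # the agent's statement
    query: str        # the query the claim cites as evidence
    cited_count: int  # the count the claim reports for that query


def unsupported_rate(claims: List[Claim], engine_counts: Dict[str, int]) -> float:
    """Fraction of claims whose cited count is missing from or differs from the engine log."""
    if not claims:
        return 0.0
    unsupported = [
        claim for claim in claims
        if engine_counts.get(claim.query) != claim.cited_count
    ]
    return len(unsupported) / len(claims)


# Placeholder data for illustration only; not taken from the paper.
engine_counts = {"q1": 12, "q2": 7, "q3": 0}
claims = [
    Claim("claim A", "q1", 12),  # supported: count matches the log
    Claim("claim B", "q2", 9),   # unsupported: count differs from the log
    Claim("claim C", "q4", 3),   # unsupported: query was never run
    Claim("claim D", "q3", 0),   # supported: a zero count is still evidence
]
rate = unsupported_rate(claims, engine_counts)
print(rate)
```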

Figures

Figures reproduced from arXiv: 2604.07189 by Fukun Xing, Jia Yu, Pengfei Xiao, Weiwei Yu.

**Figure 2.** Five-stage iterative workflow. Blue stages involve human participation; orange stages ...

**Figure 3.** Semantic change trajectories of English intensifiers, arranged by collocational free...

Original abstract

Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows an LLM agent running autonomous multi-round corpus queries on intensifiers that can replicate two published studies, but the new interpretive claims rest on unshown traces of how the model turned raw counts into named semantic pathways.

The main thing to know is that this framework lets an LLM agent handle the full cycle of hypothesis generation, CQP queries, result interpretation, and refinement on its own, with the human only setting the initial direction. On the Gutenberg corpus it produced a diachronic relay chain for intensifiers plus three semantic change pathways and register patterns; on the larger CLMET corpus it matched the numbers from Claridge and De Smet closely enough to count as replication. That replication step is the clearest evidence that the setup can stay tied to data rather than just recalling training material. The baseline experiment also shows that forcing corpus queries adds measurable quantification the model cannot generate from parameters alone. Those two pieces are what actually move the work beyond a pure idea.
The softer spot is the main discovery itself. The pathways (delexicalization, polarity fixation, metaphorical constraint) and the register distributions require interpretive leaps beyond frequency tables, yet the paper gives no full log of every tool call or the exact mapping from returned results to each named claim. Without those traces it is hard to rule out that training priors shaped the story during the autonomous rounds, even with the MCP interface. The 5-million-token corpus is also small for diachronic work, which limits how far the relay-chain finding can be pushed.
This is for corpus linguists who want to test whether agent tooling can scale exploratory analysis without losing empirical grounding, and for people building research agents more generally. A reader who already works with CQP or similar query engines will get the most out of the concrete demonstration and the replication check.
It deserves peer review because the replication results and the baseline provide a real anchor, and because the gaps in traces and implementation detail are the sort that referees can ask the authors to close with additional material; they are not a fundamental flaw in the approach.

Referee Report

3 major / 2 minor

Summary. The paper proposes Agent-Driven Corpus Linguistics, a framework in which an LLM agent connected to a CQP-indexed corpus via the Model Context Protocol (MCP) autonomously executes the full research cycle: hypothesis generation, query formulation, result interpretation, and iterative refinement. The human researcher supplies only a high-level directive (e.g., 'investigate English intensifiers'). The demonstration on a 5-million-token Gutenberg corpus produces a diachronic relay chain (so+ADJ > very > really), three semantic-change pathways (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. External validity is tested by replicating Claridge (2025) and De Smet (2013) on the 40-million-token CLMET corpus with close quantitative agreement, while a controlled baseline shows that corpus grounding supplies quantification and falsifiability absent from training-data recall alone.

Significance. If the grounding mechanism proves reliable, the framework could meaningfully lower the technical barrier to corpus research and accelerate multi-round, empirically anchored discovery. The replications of published studies supply a concrete external check, and the explicit contrast with ungrounded model output is a useful design choice. The work's primary contribution is methodological rather than a set of new linguistic facts; its long-term value therefore hinges on whether the autonomous cycle can be shown to remain strictly corpus-constrained across interpretive steps.

major comments (3)

[Demonstration] Demonstration section: the reported intensifier findings (diachronic relay chain and the three named pathways) are presented as direct outputs of the agent, yet the manuscript supplies neither the exact CQP query results returned at each round nor the explicit mapping from those tables to the interpretive claims. Without these traces it is impossible to confirm that the semantic-change pathways are corpus-derived rather than model-inferred.
[Baseline experiment] Baseline experiment: the control for training-data leakage is described only at a high level. To substantiate the claim that corpus grounding adds unique value, the paper must report the precise prompts used in the ungrounded condition, the model's raw responses, and quantitative metrics (e.g., precision/recall against the grounded outputs or error-rate differences).
[Replications] Replications: while 'close quantitative agreement' with Claridge (2025) and De Smet (2013) is asserted, the manuscript does not specify the agreement metrics (correlation, percentage match on key statistics, or discrepancy thresholds) nor list any observed deviations. This information is load-bearing for the external-validity argument.
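The precision/recall comparison requested in the Baseline experiment comment could be computed along the following lines. This is a sketch under stated assumptions: claims are reduced to comparable labels, the grounded run is treated as the reference set, and `precision_recall` plus the labels `c1` to `c5` are hypothetical placeholders rather than results from the paper.

```python
from typing import Set, Tuple


def precision_recall(ungrounded: Set[str], grounded: Set[str]) -> Tuple[float, float]:
    """Score ungrounded claims against the grounded run, used here as the reference."""
    overlap = len(ungrounded & grounded)
    # Precision: share of ungrounded claims that the grounded run also supports.
    precision = overlap / len(ungrounded) if ungrounded else 0.0
    # Recall: share of grounded claims that the ungrounded run recovered.
    recall = overlap / len(grounded) if grounded else 0.0
    return precision, recall


# Placeholder claim labels for illustration only.
grounded = {"c1", "c2", "c4"}
ungrounded = {"c1", "c2", "c3", "c5"}
precision, recall = precision_recall(ungrounded, grounded)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Treating the grounded output as the reference is itself a choice the authors would need to justify, since it assumes the grounded run is correct.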

minor comments (2)

[Abstract] The abstract states that the agent 'identified' the relay chain and pathways but does not clarify whether these are presented as novel discoveries or as reproductions of known patterns; a brief statement of the relationship to prior literature would improve clarity.
[Framework description] The description of the MCP interface would benefit from a short schematic or pseudocode showing the exact tool-call format and how corpus results are returned to the agent.
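As a rough idea of what such a schematic might contain, the sketch below builds and parses messages in the JSON-RPC 2.0 envelope that MCP uses for `tools/call`. The tool name `cqp_query`, its `corpus` and `query` arguments, the corpus handle `GUTENBERG`, and the `pos` attribute with a `JJ.*` tag pattern are all assumptions made for illustration; the paper does not specify them, and the reply text is a placeholder rather than real engine output.

```python
import json
from typing import Any, Dict, List


def make_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize one `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def read_tool_result(raw: str) -> List[str]:
    """Return the text items of a tool result, raising if the tool flagged an error."""
    message = json.loads(raw)
    result = message["result"]
    if result.get("isError", False):
        raise RuntimeError("tool reported an error")
    return [item["text"] for item in result["content"] if item.get("type") == "text"]


# Hypothetical tool name and arguments (assumptions, not from the paper).
request = make_tool_call(
    1,
    "cqp_query",
    {"corpus": "GUTENBERG", "query": '[word="very" %c] [pos="JJ.*"]'},
)

# Placeholder reply standing in for what a server might send back.
placeholder_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "<engine output>"}],
        "isError": False,
    },
})
texts = read_tool_result(placeholder_reply)
print(request)
print(texts)
```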

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments identify important opportunities to increase transparency and strengthen the evidential basis of the framework. We address each major comment below and will revise the manuscript accordingly.

Referee: [Demonstration] Demonstration section: the reported intensifier findings (diachronic relay chain and the three named pathways) are presented as direct outputs of the agent, yet the manuscript supplies neither the exact CQP query results returned at each round nor the explicit mapping from those tables to the interpretive claims. Without these traces it is impossible to confirm that the semantic-change pathways are corpus-derived rather than model-inferred.

Authors: We agree that explicit traces are necessary for full verifiability. In the revised manuscript we will add a dedicated appendix that reproduces the exact CQP queries issued by the agent across iterations, the raw frequency tables returned by the corpus engine, and the agent's intermediate reasoning steps that map those tables onto the diachronic relay chain and the three semantic-change pathways. This addition will make the corpus-derived nature of the claims directly inspectable. revision: yes
Referee: [Baseline experiment] Baseline experiment: the control for training-data leakage is described only at a high level. To substantiate the claim that corpus grounding adds unique value, the paper must report the precise prompts used in the ungrounded condition, the model's raw responses, and quantitative metrics (e.g., precision/recall against the grounded outputs or error-rate differences).

Authors: We accept that the baseline section currently lacks the granularity required to quantify the contribution of grounding. The revised version will include the full prompt templates employed in the ungrounded condition, representative excerpts of the model's raw outputs, and explicit quantitative comparisons (precision, recall, and error-rate differentials) between grounded and ungrounded runs. These additions will allow readers to evaluate the incremental value of corpus access. revision: yes
Referee: [Replications] Replications: while 'close quantitative agreement' with Claridge (2025) and De Smet (2013) is asserted, the manuscript does not specify the agreement metrics (correlation, percentage match on key statistics, or discrepancy thresholds) nor list any observed deviations. This information is load-bearing for the external-validity argument.

Authors: We acknowledge that the replication results require more precise reporting. We will expand the relevant section to define the agreement metrics used (Pearson correlation on normalized frequency distributions and percentage match on the principal statistics reported in the source studies), supply the numerical values obtained, and include a table of any deviations exceeding a pre-specified threshold. This will make the external-validity claim fully auditable. revision: yes
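The metrics named in this response can be sketched as follows. Everything here is illustrative: `per_million`, `pearson`, and `deviations` are hypothetical helper names, the period labels and frequencies are made-up placeholders, and the 5% default threshold is an assumption, since the response does not state a value. Only the 40-million-token corpus size comes from the abstract.

```python
import math
from typing import Dict, List


def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length series with nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def deviations(published: Dict[str, float], replicated: Dict[str, float],
               threshold: float = 0.05) -> Dict[str, float]:
    """Relative deviations from the published values that exceed the threshold."""
    flagged = {}
    for key, reference in published.items():
        relative = abs(replicated[key] - reference) / reference
        if relative > threshold:
            flagged[key] = relative
    return flagged


# Placeholder per-million frequencies for three hypothetical periods.
published = {"p1": 100.0, "p2": 200.0, "p3": 300.0}
replicated = {"p1": 110.0, "p2": 198.0, "p3": 300.0}

keys = sorted(published)
r = pearson([published[k] for k in keys], [replicated[k] for k in keys])
flagged = deviations(published, replicated)
print(per_million(200, 40_000_000), round(r, 4), flagged)
```

Reporting the flagged deviations alongside the correlation matters because a high correlation can coexist with large deviations in individual periods.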

Circularity Check

0 steps flagged

No circularity: framework demonstration anchored by external replications

The paper's derivation chain consists of defining a tool-use interface (MCP to CQP corpus), running an autonomous agent cycle on a prompt, and reporting outputs plus replications. The load-bearing validation steps are the quantitative matches to two independent prior studies (Claridge 2025 and De Smet 2013) on the CLMET corpus and the controlled baseline contrasting grounded versus ungrounded model output. No equations, parameter fits, self-citations, or uniqueness theorems are invoked to derive the intensifier findings; the interpretive claims are presented as agent-generated but explicitly tied to returned corpus tables. Because the replications are external and the baseline is described as adding falsifiability the model cannot produce internally, the chain does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on assumptions about LLM reliability when tool-grounded and on the effectiveness of the new framework itself; no quantitative free parameters are introduced because the work is methodological rather than model-fitting.

axioms (2)

domain assumption: An LLM connected via structured tool use can generate, refine, and interpret corpus queries without introducing ungrounded content across multiple rounds.
Invoked throughout the description of the autonomous investigative cycle.
ad hoc to paper: Corpus query results provide sufficient falsifiability to anchor all agent outputs.
Central to the claim that findings remain empirically grounded, unlike unconstrained LLM generation.

invented entities (1)

Agent-Driven Corpus Linguistics framework (no independent evidence)
purpose: To shift the investigative role from human researcher to LLM agent while preserving corpus grounding
The core proposed contribution; independent evidence is limited to the single demonstration described.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Anthony, L. (2025). Integrating AI technology into corpus-based language learning through ChatAI. Computer Assisted Language Learning.
Bauer, L. and Bauer, W. (2002). Adjective boosters in the English of young New Zealanders. Journal of English Linguistics, 30(3):244–257.
Bolinger, D. (1972). Degree Words. Mouton, The Hague.
Cheung, L. and Crosthwaite, P...

work page arXiv 2025
[2]

Traugott, E. C. and Dasher, R. B. (2002). Regularity in Semantic Change. Cambridge University Press, Cambridge.
Uchida, S. (2024). Using early LLMs for corpus linguistics: Examining ChatGPT’s potential and limitations. Applied Corpus Linguistics, 4(1):100089.
Wang, L., Ma, C., Feng, X., et al. (2024). A survey on large language model based autonomous agen...

work page 2002
