pith. machine review for the scientific record.

arxiv: 2603.22608 · v2 · submitted 2026-03-23 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links


Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: large language models · multi-instance processing · performance degradation · context length · instance count · batch analysis

The pith

LLMs maintain performance on individual tasks but degrade sharply once the number of instances exceeds roughly 100, with instance count mattering more than context length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users often ask LLMs to process many documents or instances at once, such as aggregating sentiment across dozens of reviews. The paper evaluates models on multi-instance versions of tasks where they already perform well with single instances. It identifies a repeated pattern: modest drops between about 20 and 100 instances, then a steep collapse at higher counts. Although longer contexts are linked to worse results, the raw number of instances drives the decline more than context size. The finding indicates that batch size deserves priority when designing systems that feed many items to an LLM at the same time.
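
A minimal sketch of the multi-instance prompt shape described above, assuming a generic chat-style LLM client; the template, function name, and wording are illustrative, not the paper's exact protocol (the two example reviews are drawn from the paper's appendix samples).

    # Pack N independent instances into one prompt and ask for one aggregate answer.
    def build_mip_prompt(reviews: list[str]) -> str:
        numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))
        return (
            "Classify each movie review below as positive or negative, "
            "then report ONLY the total count of positive reviews.\n\n"
            f"Reviews:\n{numbered}\n\nAnswer:"
        )

    prompt = build_mip_prompt([
        "A mix of gritty realism and radiant compassion.",  # positive
        "One of the worst movies of the year.",             # negative
    ])
    # Ground truth for this toy batch: 1 positive review.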

Core claim

All tested LLMs exhibit mild performance degradation for instance counts between approximately 20 and 100, followed by a sharp collapse at larger counts. While context length correlates with the degradation, the number of instances exerts a stronger influence on final results. This pattern appears across tasks where the models excel individually, implying that optimization efforts for multi-instance processing should address both factors but focus especially on instance count.
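
The reported curve is straightforward to probe. A minimal sweep harness, assuming a caller-supplied run_model that returns 1 when the model's aggregate answer over a batch is correct; the instance counts mirror those in the paper's figures.

    import random

    def success_rate(run_model, instances, n, trials=5, seed=0):
        """Fraction of trials whose aggregate answer over n sampled instances is correct."""
        rng = random.Random(seed)
        # Assumes len(instances) >= n so each trial can sample without replacement.
        return sum(run_model(rng.sample(instances, n)) for _ in range(trials)) / trials

    # Expected shape per the paper: near-flat to ~20, mild drop to ~100, collapse after.
    # for n in [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000]:
    #     print(n, success_rate(run_model, all_instances, n))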

What carries the argument

The separation of instance count from total context length as distinct factors that control performance loss during multi-instance processing.

Load-bearing premise

The observed drops are caused by the multi-instance structure of the input rather than prompt formatting, model-specific quirks, or unmeasured variables.

What would settle it

An experiment that decouples the two factors, holding context length fixed while varying instance count (or vice versa), would settle it: if performance no longer collapses as instance count grows once context length is held fixed, the claim that instance count is the dominant driver is falsified.
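
One way to run that settling experiment, sketched under assumptions: every condition is padded with neutral filler up to a shared token budget, so instance count varies while total context length stays roughly fixed. The tokenizer hook and filler string are placeholders, not the paper's design.

    # Hold total context length fixed while varying instance count.
    def pad_to_budget(instances, budget_tokens, count_tokens, filler=" lorem"):
        body = "\n".join(instances)
        deficit = budget_tokens - count_tokens(body)
        return body + filler * max(deficit, 0)  # filler is ~1 token per repeat

    # If success still collapses as instance count grows under a fixed budget,
    # cardinality is doing the work; if the collapse instead tracks the budget,
    # context length is the dominant driver.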

Figures

Figures reproduced from arXiv: 2603.22608 by Jingxuan Chen, Jose Camacho-Collados, Mohammad Taher Pilehvar.

Figure 1. A toy example of SIP and MIP settings for [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2. Model success rates (averaged across all tasks) as a function of the number of instances. Error bars indicate [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3. Task success rates (averaged across all LLMs) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5. Breakdown of failure types: key mistakes, aggregation mistakes, individual mistakes, and combined [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6. Success rate (lines) and total prompt token [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7. Success rates as a function of the number of instances for different injected-noise positions constructed from the same instance sets (s = 1). In addition to the tail setting used above, noise is also inserted at the head, in the middle, and at random positions within each instance; performance remains broadly similar across these variants, indicating that the position of inject… view at source ↗
Figure 8. Success rate of models [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9. Success rate of models. Per-model panels (Qwen3-Thinking, Claude Sonnet 4.6, …) plotting per-task success rates (Arithmetic, Category, Language, NER, Parity, Sentiment, Word, WSD) against the number of instances (2–2000). view at source ↗
Figure 11. Success rate of models. Per-task panels (Arithmetic, …) comparing DeepSeek R1, DeepSeek V3, gpt-oss-120b, gpt-oss-20b, Llama 3.3, Llama 4 Maverick, MiniMax M2.5, Qwen3-Instruct, Qwen3-Thinking, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, GPT-5, GPT-5 Nano, Grok 4, and Grok 4 Fast against the number of instances (2–2000). view at source ↗
Figure 13. Model success rate for tasks; failure-rate breakdown for DeepSeek R1 by type (Key Mistake, Aggregation Mistake, Individual Mistake, Agg. + Indi. Mistake, Parsing Error, Overlong Input Error) against the number of instances (2–2000). view at source ↗
Figure 15. Failure breakdown for models: failure rates by type (Key Mistake, Aggregation Mistake, Individual Mistake, Agg. + Indi. Mistake, Parsing Error, Overlong Input Error) against the number of instances (2–2000); panels include Qwen3-Thinking, … view at source ↗
Figure 18. Failure breakdown for tasks [PITH_FULL_IMAGE:figures/full_fig_p031_18.png] view at source ↗
Figure 19. Success rate (lines) and the number of instances (bars) in the artificial length setting as total prompt token length increases. Error bars indicate standard deviation across five random seeds. Panels include DeepSeek R1 (Default vs. Augmented) with context lengths up to 100K tokens, … view at source ↗
Figure 20. Success rate of models [PITH_FULL_IMAGE:figures/full_fig_p032_20.png] view at source ↗
Figure 21. Success rate of models. Panels include Qwen3-Thinking and Gemini 2.5 Flash (Default vs. Augmented), plotting success rate against the number of instances (2–1000) with context lengths up to 250K tokens, … view at source ↗
Figure 24. Success rate for tasks [PITH_FULL_IMAGE:figures/full_fig_p034_24.png] view at source ↗
read the original abstract

Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates LLMs on multi-instance processing (MIP) tasks where models must handle multiple independent instances (e.g., sentiment analysis over many reviews) to produce an aggregated output. It reports that all tested LLMs show mild degradation for small instance counts (~20-100) followed by sharp performance collapse at higher counts, and concludes via regression analysis that instance count exerts a stronger effect on degradation than context length, even though the two are correlated.
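
The kind of regression the summary describes can be illustrated with a minimal sketch: both factors entered as standardized predictors so their coefficients are comparable. Ordinary least squares and the column names are assumptions here; the paper's exact specification is not reproduced.

    import numpy as np
    import statsmodels.api as sm

    def compare_effects(n_instances, context_tokens, success):
        # Standardize both predictors so coefficient magnitudes can be compared.
        X = np.column_stack([n_instances, context_tokens]).astype(float)
        X = (X - X.mean(axis=0)) / X.std(axis=0)
        fit = sm.OLS(np.asarray(success, dtype=float), sm.add_constant(X)).fit()
        # Caveat (the referee's point below): in the raw design the two predictors
        # are strongly collinear, so coefficients alone cannot settle attribution.
        return dict(zip(["const", "instance_count", "context_length"], fit.params))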

Significance. If the central empirical pattern is robust, the work identifies a practically important scaling limitation for LLMs in common multi-document and batch-analysis workflows. The finding that instance count dominates over raw context length could inform prompting strategies, model architecture choices, and evaluation benchmarks focused on multi-instance regimes.

major comments (2)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): the regression analysis claims instance count has a stronger effect than context length, yet the design varies instance count while allowing total token count to increase proportionally. No ablation is described that holds total context length fixed (e.g., by padding shorter instances or subsampling tokens) while varying instance number, or vice versa. This leaves the attribution vulnerable to confounding.
  2. [Table 3 / Figure 4] Table 3 / Figure 4 (performance curves): the reported collapse occurs at instance counts where total tokens also exceed typical context windows for the models tested. Without a control condition that reaches the same token budget via fewer but longer instances, it is unclear whether the observed threshold is driven by instance cardinality or by context overflow.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 omit the exact list of models, datasets, and statistical tests used; these details should be stated explicitly in the main text or a dedicated appendix for reproducibility.
  2. [Throughout] Notation for “instance count” versus “total tokens” is used interchangeably in several places; a consistent variable naming convention would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the regression analysis claims instance count has a stronger effect than context length, yet the design varies instance count while allowing total token count to increase proportionally. No ablation is described that holds total context length fixed (e.g., by padding shorter instances or subsampling tokens) while varying instance number, or vice versa. This leaves the attribution vulnerable to confounding.

    Authors: We acknowledge the potential for confounding between instance count and total context length in our experimental design. Our regression analysis was intended to disentangle these effects by including both as predictors, and the results indicated a stronger coefficient for instance count. However, we agree that an explicit ablation holding total token count fixed would provide more robust evidence. In the revised version, we will add such an ablation study by using padding or token subsampling to maintain constant context length while varying the number of instances. revision: yes

  2. Referee: [Table 3 / Figure 4] Table 3 / Figure 4 (performance curves): the reported collapse occurs at instance counts where total tokens also exceed typical context windows for the models tested. Without a control condition that reaches the same token budget via fewer but longer instances, it is unclear whether the observed threshold is driven by instance cardinality or by context overflow.

    Authors: This is a valid concern regarding the interpretation of the performance collapse. Our experiments reflect practical MIP scenarios with independent instances of typical lengths. To address this, we will include an additional control experiment in the revision where we fix the total token budget and vary the number of instances by adjusting the length of each instance (e.g., through truncation or concatenation). This will help clarify whether the degradation is primarily due to instance count or total context length. revision: yes
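
The readout of that proposed control can be sketched simply: estimate the collapse point under two conditions that reach the same token budget, many short instances versus fewer long ones. The 50% threshold is an assumption, not the paper's criterion.

    def collapse_point(curve, threshold=0.5):
        """curve: (n_instances, success_rate) pairs, ascending in n."""
        for n, rate in curve:
            if rate < threshold:
                return n
        return None  # no collapse in the tested range

    # At matched token budgets: if both conditions collapse at about the same n,
    # cardinality drives the threshold; if the long-instance condition collapses
    # at smaller n (i.e., at the same token count), context overflow explains it.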

Circularity Check

0 steps flagged

No significant circularity; purely empirical observational study

full rationale

The paper conducts a direct experimental evaluation of LLM behavior under multi-instance inputs, reporting observed performance patterns across varying instance counts and context lengths. No equations, parameter fits, derivations, or self-referential claims appear in the provided text; the central claim that instance count exerts a stronger effect rests on comparative experimental results rather than any reduction to its own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The study is self-contained against its own benchmarks and does not rename known results or smuggle assumptions via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation study with no free parameters, axioms, or invented entities; relies on standard LLM benchmarking assumptions.

pith-pipeline@v0.9.0 · 5492 in / 957 out tokens · 49226 ms · 2026-05-15T00:08:45.904680+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages
