arxiv: 2605.21338 · v1 · pith:ZMQCVYGCnew · submitted 2026-05-20 · 💻 cs.CL

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Pith reviewed 2026-05-21 04:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · text analytics · social media analysis · performance scaling · numerical reasoning · Twitter datasets · benchmark framework · quantitative analysis

The pith

LLMs show sharp performance drops on numerical analysis of social media data beyond 500 instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a question-based evaluation framework of 470 manually curated questions to measure how LLMs handle semantic understanding and reasoning over collections of social media posts. It applies this benchmark to Twitter datasets on tasks including sentiment analysis, hate speech detection, and emotion recognition. Performance holds for small inputs but declines as input scale grows and as tasks shift from basic identification to comparison, counting, and calculation. A consistent pattern emerges where open-weights models in particular lose accuracy on numerical operations once the collection exceeds 500 posts.

Core claim

As the input size grows beyond 500 instances, LLMs exhibit a common limitation where performance degrades substantially, especially on numerical tasks, revealing critical architectural bottlenecks for rigorous quantitative analysis over large text collections from social media.

What carries the argument

A benchmark of 470 manually curated questions that test semantic understanding and reasoning over aggregated social media text applied across multiple Twitter datasets.

If this is right

  • Performance declines noticeably in multi-label and target-dependent scenarios compared with simpler single-label tasks.
  • Accuracy falls progressively as operations advance from basic semantic existence checks to demanding steps like comparison, counting, and calculation.
  • Open-weights models suffer more pronounced degradation than closed models when input size exceeds 500 instances.
  • Current LLM architectures face bottlenecks that limit reliable quantitative analysis over large unstructured text collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed scaling limit suggests LLMs may require hybrid systems that combine them with external counting or aggregation tools for real-world social media analytics.
  • The same degradation pattern could appear in other long-document domains such as news archives or legal corpora if tested with similar question sets.
  • Future model designs might benefit from explicit mechanisms for maintaining numerical fidelity across many input documents rather than relying on implicit pattern matching.
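The hybrid idea in the first extension can be sketched as follows. This is a minimal illustration under stated assumptions, not something the paper implements: `classify_post` is a hypothetical stand-in for a per-post LLM labeling call (a trivial keyword rule here, so the example runs offline), and the counting is done by ordinary code rather than by the model.

```python
from collections import Counter
from typing import Callable, Iterable


def classify_post(post: str) -> str:
    """Hypothetical stand-in for a per-post LLM labeling call.

    A keyword rule is used only so that this sketch runs without a model.
    """
    text = post.lower()
    if any(word in text for word in ("love", "great", "happy")):
        return "positive"
    if any(word in text for word in ("hate", "awful", "sad")):
        return "negative"
    return "neutral"


def count_labels(
    posts: Iterable[str],
    labeler: Callable[[str], str] = classify_post,
) -> Counter:
    """Label each post independently, then aggregate with exact arithmetic."""
    return Counter(labeler(post) for post in posts)


# Toy posts invented for illustration.
posts = [
    "I love this phone",
    "Awful service today",
    "Meeting at 5pm",
    "So happy right now",
]
counts = count_labels(posts)
print(counts["positive"], counts["negative"], counts["neutral"])
```

The design choice is that the model only ever sees one post at a time, so the size of the collection affects cost but not the arithmetic, which is the part the paper reports as fragile.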

Load-bearing premise

The 470 manually curated questions sufficiently capture LLMs' semantic understanding and reasoning abilities for text analytics on aggregated social media data.

What would settle it

Running the same LLMs on collections larger than 500 posts and observing no substantial drop in accuracy on numerical questions such as counting or calculation would falsify the main finding.
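One way to make this check concrete is to split accuracy on numerical questions at the 500-post threshold. The sketch below is an assumption about how such a comparison could be organised; the `Result` record, the `NUMERICAL_TASKS` grouping, and the toy records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Result(NamedTuple):
    n_posts: int   # size of the collection shown to the model
    task: str      # e.g. "counting", "calculation", "existence"
    correct: bool  # whether the answer matched the gold answer


# Assumed grouping of question types for this sketch.
NUMERICAL_TASKS = {"counting", "calculation"}


def numerical_accuracy_by_scale(
    results: Iterable[Result], threshold: int = 500
) -> Dict[str, float]:
    """Accuracy on numerical questions, split at the collection-size threshold."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for r in results:
        if r.task not in NUMERICAL_TASKS:
            continue  # non-numerical questions are ignored here
        bucket = "above" if r.n_posts > threshold else "at_or_below"
        totals[bucket] += 1
        hits[bucket] += int(r.correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy records invented for illustration; not results from the paper.
toy = [
    Result(100, "counting", True),
    Result(500, "calculation", True),
    Result(1000, "counting", False),
    Result(2000, "calculation", True),
    Result(1000, "existence", True),
]
acc = numerical_accuracy_by_scale(toy)
print(acc)
```

If the "above" accuracy stayed close to the "at_or_below" accuracy on real runs, that would count against the main finding.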

Figures

Figures reproduced from arXiv: 2605.21338 by Jose Camacho-Collados, Nedjma Ousidhoum, Yuefeng Shi.

Figure 1: An illustrative example of our data analysis evaluation framework.
Figure 2: Performance across data sizes by metric and data category. Results are averaged across types of model.

Original abstract

LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces a question-based evaluation framework with 470 manually curated questions to assess LLMs' semantic understanding and reasoning over aggregated social media text from Twitter datasets. It evaluates performance on tasks including sentiment analysis, hate speech detection, and emotion recognition, reporting that performance declines with increasing input scale beyond 500 instances (especially on numerical tasks for open-weights models), with task complexity, and in multi-label scenarios. The work highlights architectural bottlenecks in current LLMs for quantitative analysis over large text collections.

Significance. If the empirical patterns hold after addressing the noted gaps, this study would provide valuable insights into the limitations of LLMs in practical data analysis scenarios involving long sequences of unstructured documents. The proposed framework could become a useful tool for benchmarking LLMs in social media analytics, and the findings on scaling and complexity could motivate research into better long-context and quantitative reasoning capabilities.

major comments (3)
  1. The central claim that performance degrades substantially as input size grows beyond 500 instances due to architectural bottlenecks (particularly for open-weights models on numerical tasks) is load-bearing but insufficiently supported. The manuscript provides no details on how aggregated posts are serialized into prompts at large scales (e.g., full concatenation, summarization, or retrieval), nor whether token counts were monitored to avoid context window limits. This risks conflating context overflow artifacts with the claimed deeper limitations.
  2. The description of the 470 manually curated questions lacks specifics on the question design process, validation methods, or how they target semantic understanding and reasoning abilities across complexity levels. This weakens the foundation for interpreting the performance patterns.
  3. There is no mention of statistical testing for the reported performance differences, baseline comparisons with non-LLM methods, or detailed error analysis to substantiate the declines with scale and complexity.
minor comments (1)
  1. Clarify terminology consistency for model types (e.g., 'open-weights' vs. 'open-weight') throughout the text.
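The concern in the first major comment, that context overflow could masquerade as a reasoning failure, can be checked before any model is called. The sketch below is a rough illustration only: the separator, the whitespace-based token estimate, and the reserved answer budget are assumptions, and a real check would use each model's own tokenizer, which usually yields more tokens than whitespace splitting.

```python
from typing import List


def serialize_posts(posts: List[str], separator: str = "\n---\n") -> str:
    """Concatenate posts into one prompt body (assumed serialization)."""
    return separator.join(post.strip() for post in posts)


def approx_token_count(text: str) -> int:
    """Crude proxy: number of whitespace-delimited pieces."""
    return len(text.split())


def fits_context(
    posts: List[str], context_window: int, reserved_for_answer: int = 512
) -> bool:
    """True if the serialized posts plus an answer budget fit the window."""
    prompt = serialize_posts(posts)
    return approx_token_count(prompt) + reserved_for_answer <= context_window


# Toy collection: 1000 identical three-word posts, invented for illustration.
posts = ["a b c"] * 1000
print(approx_token_count(serialize_posts(posts)))
print(fits_context(posts, context_window=4096))
print(fits_context(posts, context_window=32768))
```

Logging this estimate per scale alongside accuracy would let a reader separate runs that were truncated from runs that fit in context and still failed.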

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying our approach where possible and outlining revisions to strengthen the empirical support and methodological transparency.

Point-by-point responses
  1. Referee: The central claim that performance degrades substantially as input size grows beyond 500 instances due to architectural bottlenecks (particularly for open-weights models on numerical tasks) is load-bearing but insufficiently supported. The manuscript provides no details on how aggregated posts are serialized into prompts at large scales (e.g., full concatenation, summarization, or retrieval), nor whether token counts were monitored to avoid context window limits. This risks conflating context overflow artifacts with the claimed deeper limitations.

    Authors: We acknowledge that the original manuscript did not provide sufficient detail on prompt construction and token management, which is necessary to isolate architectural effects from context-window artifacts. In the revised version, we will add a new subsection (3.3) explicitly describing the serialization method: posts are concatenated in their original chronological order with minimal separators, and we report average and maximum token counts for each scale (100, 500, 1000, 2000 instances) across all models. All experiments were conducted with inputs kept within the published context windows of the evaluated models (e.g., 4k–32k tokens); no truncation was applied beyond the natural aggregation limit. To further address the concern, we will include an ablation comparing full concatenation versus a simple summarization baseline at the 1000-instance scale. While we maintain that the observed degradation on numerical tasks persists even when context limits are respected, these additions will allow readers to evaluate the claim more rigorously. revision: yes

  2. Referee: The description of the 470 manually curated questions lacks specifics on the question design process, validation methods, or how they target semantic understanding and reasoning abilities across complexity levels. This weakens the foundation for interpreting the performance patterns.

    Authors: We agree that greater transparency on question construction is required. The 470 questions were created in three stages: (1) initial drafting by two domain experts drawing from standard social-media analytics queries, (2) categorization into three complexity tiers (basic semantic existence, comparison/counting, and arithmetic operations), and (3) validation through a pilot study with 30 independent annotators yielding Cohen’s κ = 0.79. We will expand Section 3.2 to include the full design protocol, representative examples from each complexity tier, and explicit mapping of question types to the targeted reasoning abilities (e.g., existence detection vs. multi-step numerical reasoning). This revision will strengthen the interpretability of the reported performance trends. revision: yes

  3. Referee: There is no mention of statistical testing for the reported performance differences, baseline comparisons with non-LLM methods, or detailed error analysis to substantiate the declines with scale and complexity.

    Authors: We accept these points as valid gaps in the current submission. In the revised manuscript we will: (i) add paired t-tests (or Wilcoxon signed-rank tests for non-normal distributions) with reported p-values for all scale- and complexity-based comparisons; (ii) include two non-LLM baselines—TF-IDF + logistic regression and a rule-based keyword matcher—for the sentiment, hate-speech, and emotion tasks; and (iii) introduce a dedicated error-analysis subsection that categorizes failure modes (numerical hallucination, label confusion, context dilution) with frequency tables and illustrative examples at each input scale. These additions will provide quantitative substantiation for the claimed performance declines. revision: yes
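The paired comparisons promised in the third response can be illustrated with a small significance test. Note that the sketch below swaps in a different technique from the paired t-test or Wilcoxon signed-rank test named above: it is an exact paired sign-flip permutation test, chosen only because it needs no distributional assumptions and runs on the standard library. The function name and its inputs (per-item or per-model scores at two scales) are hypothetical.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided paired permutation test on the summed difference.

    Under the null hypothesis each paired difference is equally likely to
    carry either sign, so all 2**n sign assignments are enumerated.
    This is only practical for small n.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if statistic >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With n paired scores the smallest achievable two-sided p-value is 2 / 2**n, so a handful of models or scales cannot reach conventional thresholds; for larger samples a sampled permutation test or the tests named in the response would be the practical choice.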

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct observations

full rationale

The paper introduces a manually curated set of 470 questions and reports LLM performance on Twitter datasets for tasks such as sentiment analysis and hate speech detection. All claims, including the degradation beyond 500 instances on numerical tasks, are presented as direct empirical results from model outputs rather than derivations, fitted parameters, or predictions that reduce to inputs by construction. No equations, self-definitional steps, or load-bearing self-citations appear in the abstract or framework description. The study is self-contained: its claims rest on measured model outputs, with no mathematical chain that could be circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the domain assumption that question-answering performance on curated items validly proxies semantic understanding and reasoning over text collections; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Question answering on manually designed items can measure LLMs' semantic understanding and reasoning over aggregated unstructured text
    This premise underpins the entire benchmark construction and result interpretation in the abstract.

pith-pipeline@v0.9.0 · 5749 in / 1237 out tokens · 48319 ms · 2026-05-21T04:49:41.824908+00:00 · methodology

