Mixed citation behavior. Most common role is background (60%).
representative citing papers
A comprehensive survey of code-switched NLP research with LLMs across modalities, covering 327 studies, 15+ tasks, 30+ datasets, and 80+ languages while outlining challenges and a future roadmap.
The authors propose a retrieval-augmented framework that grounds AI exposure labels for 18,796 O*NET occupation-task pairs in retrieved news and academic abstracts, outperforming zero-shot prompting in 72% of disagreements and aligning better with observed real-world usage.
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
citing papers explorer
- MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
- Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
- The Shrinking Lifespan of LLMs in Science
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
- Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation
Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.
- DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation
DeepSurvey introduces an agentic system for automated survey generation that improves depth through full-text keynotes, cross-paper clustering, and code analysis, while boosting citation reliability via graph expansion, hybrid filtering, and evidence-constrained assignment, with reported gains over
- Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
Deep bibliography expansion in literature search achieves high recall while human citations are found to have only 51% moderate relevance compared to 86-88% for AI methods.
- Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation
Audits of 43 LLMs show that varying persona prompts (language, location, role-and-task) and context affects technical quality and social representativeness of scholar recommendations, with location impacting diversity and factuality.
- ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
ChartFI-Bench supplies 896 chart-description pairs from visually complex charts and defines four metrics (Faithfulness, Coverage, Informativeness, Acuity) aligned to four quality dimensions to evaluate MLLM-generated descriptions.
- How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
- Attribution Gradients: Incrementally Unfolding Citations for Critical Examination of Attributed AI Answers
Attribution gradients consolidate citation evidence and enable incremental unfolding of secondary sources, leading to deeper engagement in a lab study of critical reading tasks for AI answers.
- The Reciprocal Impact of Science and Software: A Cross-Corpus Analysis of How Research Shapes Software and Software Enables Research
A 69.8M-edge cross-corpus graph shows science influencing software mainly via reproducibility and sequence tools while software influences science via ML infrastructure, but direct links are sparse and reuse-citation correlations flip sign depending on the pairing method.
- Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models
Prompt chaining with off-the-shelf LLMs outperforms in-context learning and BERT for 1st- and 2nd-level classification on the ORKG taxonomy using the FORC dataset, but struggles at the 3rd level.
- Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
Lit2Vec delivers a documented, reproducible pipeline that extracts and annotates a large licensed chemistry paper corpus from S2ORC with paragraph embeddings and subfield labels.
- Omakase: proactive assistance with actionable suggestions for evolving scientific research projects
Omakase monitors project documents to infer timely queries and distills research reports into actionable suggestions that users rated significantly more useful than raw reports.
- MIRAI: Prediction and Generation of High-Impact Academic Research
MIRAI predicts 5-year PageRank and citation impact from a paper's title, abstract, and date with Spearman's ρ of 0.47 and 0.62, respectively, and generates ideas judged 4:3 more impactful by an LLM.
- Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.