mega hub Canonical reference

LLaMA: Open and Efficient Foundation Language Models

· 2023 · cs.CL · arXiv 2302.13971

Canonical reference. 82% of citing Pith papers cite this work as background.

1212 Pith papers citing it

Background 82% of classified citations

open full Pith review browse 1212 citing papers arXiv PDF

abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 206 method 19 baseline 8 other 6 dataset 1 extension 1

citation-polarity summary

background 198 use method 20 unclear 13 baseline 7 extend 1 support 1 use dataset 1

claims ledger

abstract We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

eess.AS · 2026-05-31 · unverdicted · novelty 8.0

SVHalluc benchmark shows open-source audio-visual LLMs achieve near-random accuracy on semantic and temporal speech-vision alignment tasks while Gemini 2.5 Pro performs substantially better.

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Backdoor Attacks on Decentralised Post-Training

cs.CR · 2026-03-31 · conditional · novelty 8.0 · 2 refs

An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · novelty 8.0

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

BEAVER: An Enterprise Benchmark for Text-to-SQL

cs.CL · 2024-09-03 · unverdicted · novelty 8.0

BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

Probing Memorization of Tabular In-Context Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.

citing papers explorer

Showing 50 of 117 citing papers after filters.

FBS: Modeling Native Parallel Reading inside a Transformer cs.AI · 2026-01-29 · unverdicted · none · ref 4 · internal anchor
FBS introduces a causal trainable loop via PAW, CH, and SG modules to model native parallel reading in Transformers, yielding better quality-efficiency on benchmarks with complementary ablations.
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model cs.AI · 2025-10-20 · unverdicted · none · ref 21 · internal anchor
Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning cs.AI · 2025-09-16 · unverdicted · none · ref 35 · internal anchor
PeCL applies token-level dynamic differential privacy and privacy-guided memory sculpting to achieve superior privacy-utility balance in continual learning.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models cs.AI · 2025-07-30 · unverdicted · none · ref 39 · internal anchor
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 114 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 195 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments cs.AI · 2025-06-03 · unverdicted · none · ref 71 · internal anchor
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 51 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies cs.AI · 2024-12-03 · unverdicted · none · ref 47 · internal anchor
PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model cs.AI · 2024-08-20 · unverdicted · none · ref 22 · internal anchor
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 209 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 193 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
A Roadmap to Pluralistic Alignment cs.AI · 2024-02-07 · unverdicted · none · ref 177 · internal anchor
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 9 · internal anchor
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 117 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces cs.AI · 2026-05-04 · unverdicted · none · ref 43
JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B (80.9% avg) at 80% retained parameters.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 117
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models cs.AI · 2026-06-28 · unverdicted · none · ref 46 · 2 links · internal anchor
FADE attenuates FFN outputs at critical layers in LVLMs to curb language-prior dominance and cut hallucinations, shown effective on POPE, CHAIR, and MME across three models.
POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation cs.AI · 2026-06-22 · unverdicted · none · ref 32 · internal anchor
POTracker fine-tunes an LLM with POTrackerLoss combining textual and structural similarity, achieving up to 86.47% structural accuracy on 1,000 power outage reports and outperforming baselines by up to 51%.
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation cs.AI · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
DPVR-LF routes saturated vision tokens into a one-layer side branch after layer 4, runs text-only processing through layers 5-17, and performs late fusion at the final layer to reduce visual computation while preserving multimodal performance.
Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models cs.AI · 2026-06-03 · unverdicted · none · ref 2 · internal anchor
SAGE-PTQ is a graph-guided ultra-low-bit PTQ framework that achieves 1.03 average weight bits and 0.004 scaling bits per matrix on LLMs while reporting lower perplexity and memory use than BiLLM and PB-LLM.
Characterizing initial human-AI proof formalization workflows cs.AI · 2026-06-02 · unverdicted · none · ref 143 · internal anchor
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion cs.AI · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
eMoT treats reasoning trajectories as dynamic memories with corrosion, symbolic Python anchoring, and consistency refinement, raising accuracy on Game of 24 to 100% and improving math benchmarks over CoT baselines with a lightweight model.
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization cs.AI · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices cs.AI · 2026-05-28 · unverdicted · none · ref 45 · internal anchor
BitTP applies weight-only 1.58-bit quantization to LLM trajectory predictors, claiming improved ADE/FDE over BF16 baseline with reduced resource demands on edge devices.
VikingMem: A Memory Base Management System for Stateful LLM-based Applications cs.AI · 2026-05-28 · unverdicted · none · ref 61 · internal anchor
VikingMem implements the Memory Base paradigm via event-centric extraction and entity updates on VikingDB with temporal compression, claiming up to 30% better retrieval effectiveness on long-term memory benchmarks.
DenseSteer: Steering Small Language Models towards Dense Math Reasoning cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
DenseSteer is an inference-time steering framework that improves small LLMs' accuracy on math reasoning by modulating representations toward dense reasoning patterns with fewer but higher-density steps.
Inference Time Context Sparsity: Illusion or Opportunity? cs.AI · 2026-05-22 · unverdicted · none · ref 45 · internal anchor
Current LLMs remain robust to high levels of inference-time context sparsity across diverse tasks, enabling up to 10x acceleration via sparse kernels.
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management cs.AI · 2026-05-19 · unverdicted · none · ref 48 · 2 links · internal anchor
Existing proofs of autoregressive Transformer Turing-completeness apply to scaling families of models rather than fixed systems with context management, so they do not establish Turing-completeness for real-world LLMs.
Look Before You Leap: Autonomous Exploration for LLM Agents cs.AI · 2026-05-15 · unverdicted · none · ref 41 · internal anchor
LLM agents improve adaptability by first using an interaction budget for systematic exploration measured via Exploration Checkpoint Coverage before executing tasks.
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding cs.AI · 2026-05-15 · unverdicted · none · ref 33 · internal anchor
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence cs.AI · 2026-05-09 · unverdicted · none · ref 47 · 3 links · internal anchor
UxSID models ultra-long user sequences with semantic-group shared interest memory using Semantic IDs and dual-level attention, achieving state-of-the-art performance and a 0.337% revenue lift in advertising A/B tests.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 5 · 3 links · internal anchor
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning cs.AI · 2026-05-04 · unverdicted · none · ref 113 · internal anchor
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions cs.AI · 2026-05-04 · unverdicted · none · ref 23 · internal anchor
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate cs.AI · 2026-04-26 · unverdicted · none · ref 2 · internal anchor
DxChain uses panoramic patient profiling, Med-ToT planning, and adversarial angel-devil debates to reduce LLM hallucinations in clinical diagnosis, achieving SOTA accuracy and consistency on two MIMIC-IV benchmarks.
Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechical Systems cs.AI · 2026-04-22 · unverdicted · none · ref 19 · internal anchor
Generative AI must be evaluated as recursive pluralist sociotechnical systems via MaSH Loops and distributional World Values Benchmarks instead of static functionalist or prescriptive tests.
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing cs.AI · 2026-04-09 · unverdicted · none · ref 45 · internal anchor
SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks without hurting task performance.
An Analysis of Artificial Intelligence Adoption in NIH-Funded Research cs.AI · 2026-04-08 · unverdicted · none · ref 18 · internal anchor
AI makes up 15.9% of NIH-funded biomedical projects in 2025 with a 13.4% funding premium, yet 79% stay in research stages, only 14.7% reach clinical deployment, and health disparities work is just 5.7% of AI projects.
Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models cs.AI · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
JCQL uses an SLM-trained KBC model as an action in an LLM agent for KBQA to reduce hallucinations, then fine-tunes the KBC model with KBQA reasoning paths, outperforming baselines on two benchmarks.
Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems cs.AI · 2026-04-06 · unverdicted · none · ref 22 · internal anchor
An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.
Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting cs.AI · 2026-04-05 · unverdicted · none · ref 21 · internal anchor
Solar-VLM fuses time-series, satellite imagery, and text encoders with graph attention across sites to improve PV power forecasting on real data from eight Chinese stations.
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity cs.AI · 2026-04-03 · conditional · none · ref 25 · internal anchor
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization cs.AI · 2026-04-02 · unverdicted · none · ref 60 · 2 links · internal anchor
MolClaw deploys a hierarchical skill architecture to reach state-of-the-art results on a new benchmark of multi-step drug discovery tasks.
MAR: Efficient Large Language Models via Module-aware Architecture Refinement cs.AI · 2026-01-29 · unverdicted · none · ref 5 · internal anchor
MAR integrates SSMs and sparsification with new ATMN neurons and SBDS distillation to produce efficient LLMs that match dense-model performance at substantially lower inference energy.
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving cs.AI · 2026-01-29 · unverdicted · none · ref 29 · internal anchor
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture cs.AI · 2025-11-28 · unverdicted · none · ref 41 · internal anchor
AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.
OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models cs.AI · 2025-10-01 · unverdicted · none · ref 31 · internal anchor
OntoLogX is a system that applies LLMs with ontology guidance, RAG, and iterative fixes to build valid knowledge graphs from cybersecurity logs and predict ATT&CK tactics from aggregated sessions.
Advancing AI Research Assistants with Expert-Involved Learning cs.AI · 2025-05-03 · unverdicted · none · ref 38 · internal anchor
ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery cs.AI · 2024-06-08 · unverdicted · none · ref 17 · internal anchor
ChatSR aligns scientific data encoders with LLMs to produce formulas that fit data and satisfy explicit priors, reporting SOTA results on 13 symbolic regression benchmarks plus zero-shot handling of unseen prior types.

LLaMA: Open and Efficient Foundation Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

mega hub controls

Recognition alignment

counterfactual ablation

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer