{"total":23,"items":[{"citing_arxiv_id":"2605.22811","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GS-QA: A Benchmark for Geospatial Question Answering","primary_cat":"cs.DB","submitted_at":"2026-05-21T17:57:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18648","ref_index":48,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration","primary_cat":"cs.LG","submitted_at":"2026-05-18T16:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled experiments on MNIST show human soft-labels act as a regularizer that improves calibration on hard samples and aligns model uncertainty with humans, beyond accuracy gains from correcting mislabels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12395","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation 
practices.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"In our LPF approach, we combine standardisation of metric implementation, execution and application with diversity of selected metrics. The former is to ensure that arbitrary differences in what should be exactly the XXXX, Vol. 1, Article . Publication date: May 2025. 6•Michela Lorandi and Anya Belz CTG Technique Model Task C Attributes Evaluation DatasetsS T K Complete Training CTRL [25] Transf FT Si ✓ - Model Fine-Tuning C BART [22] BART FT Si ✓ One-Billion-Word dataset [8], Yelp dataset [61] Multi CTG [19] BERT FT M ✓ ✓ PPLM Prompts [13] Prior CTG [20] GPT-2 M FT M ✓ ✓ PPLM Prompts Modification of Token Distribution CAT-PAW [18] GPT-2 M FT Si ✓ ✓ PPLM Prompts PPLM [13] GPT-2 M FT M ✓ ✓ PPLM Prompts Prompting Falcon 40B Instruct [1] Falcon FT M ✓ ✓ ✓ - LLaMa2 70B chat [55] LLaMa2 FT M ✓ ✓ ✓ -"},{"citing_arxiv_id":"2605.10415","ref_index":23,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis","primary_cat":"cs.CL","submitted_at":"2026-05-11T11:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09893","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions","primary_cat":"cs.CL","submitted_at":"2026-05-11T02:32:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs exhibit pseudo-deliberation, with 
consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"6 social dialogue datasets [8, 9, 10, 11, 12, 13] and controlled generation. A human annotator first filters scenarios according to predefined criteria. To increase value diversity of selected scenarios, we generate paraphrases using GPT-4o (T= 0.7 , max_tokens=500), producing multiple formulations of the same dilemma with varied surface realizations [14]. Crucially, all scenarios are rewritten to exclude explicit mentions of values, requiring models to implicitly infer and balance competing values during generation. This design enables a controlled evaluation of value-action gaps under naturalistic conditions, where value expression must emerge from behavior rather than prompting. Additional details regarding DAISY's construction and content are provided in Appendix D."},{"citing_arxiv_id":"2605.01165","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-01T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CEZSAR uses contrastive learning to align video and sentence embeddings with automatic negative sampling, claiming state-of-the-art zero-shot action recognition on UCF-101 and Kinetics-400.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"from overfitting noisy labels [37]. This property is critical for us because we deal with natural language descriptions that are intrinsically noisy due to ambiguities and annotators' perceptions of what should be described. 
In addition, language-image pre-trained models such as CLIP [29] have attracted increasing attention from the research community [22,35,36]. These models have shown impressive results in zero-shot experiments, but they rely on extensive training infrastructure (e.g., clusters with up to 596 Tesla V100 GPUs used for 18 [29]). Moreover, the dataset containing 400 million image-text pairs is not available for download, leading us to the following question: what would the results be for ER [5]"},{"citing_arxiv_id":"2604.27998","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:23:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24542","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-04-27T14:38:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LCF detects multiple LLM runtime threats by computing aggregated diagonal Mahalanobis distances on layer-wise hidden-state differences, calibrated on clean examples, achieving high detection rates with low overhead across several model 
architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21106","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-22T21:51:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168/. [49] Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing English math word problem solvers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975-984, Online, July 2020. 
Association for Computational"},{"citing_arxiv_id":"2604.20932","ref_index":19,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-22T11:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Related Work 2.1 Retrieval-Augmented Generation (RAG) Retrieval-Augmented Generation (RAG) is designed to increase the capabilities of Large Language Models (LLMs) by combining them with external knowledge sources [4]. This approach effectively handles the issues of hallucinations and knowledge cutoff limitations found in standalone parametric models [19, 20]. In contrast to traditional LLMs that depend solely on fixed internal parameters, RAG systems actively fetch semantically relevant context from external databases. As illustrated in Fig. 
1, the pipeline consists of three phases: ingestion (document chunking and embedding), retrieval (top-k similarity search), and augmentation (prompt construction for the generator) [4]."},{"citing_arxiv_id":"2604.17761","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-04-20T03:24:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.EMNLP-MAIN.373. URL https://doi.org/10.18653/v1/2021.emnlp-main.373. [33] Kramár, J., Lieberum, T., Shah, R., and Nanda, N. Atp*: An efficient and scalable method for localizing LLM behaviour to components. CoRR, abs/2403.00745, 2024. doi: 10.48550/ARXIV.2403.00745. URL https://doi.org/10.48550/arXiv.2403.00745. [34] Li, J., Cheng, X., Zhao, X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6449-6464, Singapore, December 2023. 
Association for Computational Linguistics."},{"citing_arxiv_id":"2604.16493","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions","primary_cat":"cs.DB","submitted_at":"2026-04-13T18:00:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tends the CoT approach by decomposing a problem into smaller, manageable components [68]. The decomposer agent introduced by MAC-SQL [60] breaks the query into a series of intermediate steps, such as sub-questions, and generates corresponding sub-queries for each step before generating the final SQL. Intermediate Representation. To bridge the gap between NL and SQL queries, intermediate representations (IRs) [13, 17, 25], such as Pandas-like or SQL-like codes, have been introduced to facilitate the generation of SQL queries. TA-SQL [53] employs Pandas-like code as an IR, DIN-SQL adopts the IR from NatSQL [13], while OpenSearch-SQL invents an SQL-like language to encourage LLMs to focus more on logic before generating final SQL queries. 
Multiple Candidate Generation."},{"citing_arxiv_id":"2604.04902","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Are Latent Reasoning Models Easily Interpretable?","primary_cat":"cs.LG","submitted_at":"2026-04-06T17:50:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Latent reasoning models often ignore their latent tokens for predictions and their correct outputs can be decoded into natural language reasoning traces more reliably than incorrect outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16353","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval","primary_cat":"cs.IR","submitted_at":"2026-03-17T05:14:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03244","ref_index":10,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AI Evaluation Should Require Standardized Item-Level Data Releases","primary_cat":"cs.AI","submitted_at":"2026-02-27T04:31:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k 
items.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04003","ref_index":73,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making","primary_cat":"cs.AI","submitted_at":"2026-02-03T20:42:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09550","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"HyEm: Query-Adaptive Hyperbolic Retrieval for Biomedical Ontologies via Euclidean Vector Indexing","primary_cat":"cs.IR","submitted_at":"2026-01-26T07:04:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyEm maps radius-controlled hyperbolic ontology embeddings to Euclidean space for ANN indexing and applies query-adaptive hyperbolic reranking to improve hierarchy-aware retrieval while preserving most Euclidean performance on flat queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22151","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-09-26T10:10:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiMat shows multimodal large models plus constrained 
search produce higher-quality procedural material graphs than text-only baselines on a new production dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02949","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly","primary_cat":"cs.CL","submitted_at":"2025-09-03T02:26:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.03038","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Automatic Combination of Sample Selection Strategies for Few-Shot Learning","primary_cat":"cs.LG","submitted_at":"2024-02-05T14:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ACSESS automatically combines 23 sample selection strategies to outperform individual strategies in few-shot learning on text and image datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.16388","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Measuring the Representation of Subjective Global Opinions in Language Models","primary_cat":"cs.CL","submitted_at":"2023-06-28T17:31:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs default to responses more similar to opinions from the USA and some European and South American 
countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.07922","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CodeT5+: Open Code Large Language Models for Code Understanding and Generation","primary_cat":"cs.CL","submitted_at":"2023-05-13T14:23:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completion, math programming, and retrieval tasks including new SoTA on HumanEval with the 1","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2209.14375","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Improving alignment of dialogue agents via targeted human judgements","primary_cat":"cs.LG","submitted_at":"2022-09-28T19:04:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparrow uses targeted rule-based human feedback and evidence provision to outperform baselines in preference while violating rules only 8% of the time under adversarial probing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}