{"total":138,"items":[{"citing_arxiv_id":"2606.16009","ref_index":107,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design","primary_cat":"cs.CL","submitted_at":"2026-06-14T20:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01053","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise","primary_cat":"cs.AI","submitted_at":"2026-05-31T06:48:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnyEdit++ proposes Bayes-Chunk, an adaptive segmentation method based on Bayesian Surprise, with theoretical claims of structural independence and causal locality, reporting superior results over baselines on math, code, and narrative tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00540","ref_index":259,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trustworthy Recommendation in the Era of Large Language Models: Opportunities and Challenges","primary_cat":"cs.IR","submitted_at":"2026-05-30T05:14:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic review of over 200 studies concludes that LLMs in recommender systems act as a double-edged sword, creating both opportunities and new risks for 
trustworthiness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30844","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-Tuning Improves Information Conveyance in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-29T05:05:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30589","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law","primary_cat":"cs.CL","submitted_at":"2026-05-28T21:36:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A new source-grounded QA dataset for U.S. 
immigration law is built from official documents and used to fine-tune a 3B model, yielding a 27% mean score improvement over the base model on a held-out sample.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29795","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains","primary_cat":"cs.AI","submitted_at":"2026-05-28T11:44:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28481","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Co-creation of AI technology, empowering curators of cultural heritage information and guarding research commons","primary_cat":"cs.DL","submitted_at":"2026-05-27T13:40:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Describes an engineering sequence implementing local RAG chatbots for cultural heritage collections to empower curators while using Dataverse for archiving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23171","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding and Improving Noisy Embedding Techniques in Instruction 
Finetuning","primary_cat":"cs.LG","submitted_at":"2026-05-22T02:43:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SymNoise applies symmetric noise to embeddings during instruction fine-tuning and reports 6.7% higher AlpacaEval scores than NEFTune on LLaMA-2-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22502","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21858","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hypergraph as Language","primary_cat":"cs.CL","submitted_at":"2026-05-21T01:09:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hyper-Align is a hypergraph-native framework that serializes high-order relations into LLM-compatible tokens via HIDT-O templates and a HIP projector, outperforming graph-centric methods on HyperAlign-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19568","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional 
Encoder","primary_cat":"cs.CL","submitted_at":"2026-05-19T09:13:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17967","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective","primary_cat":"cs.AI","submitted_at":"2026-05-18T07:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SFT on LLMs removes noise-like token interactions in a brief early phase before introducing overfitted ones, explaining inconsistent effectiveness across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17152","ref_index":127,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages","primary_cat":"cs.CL","submitted_at":"2026-05-16T20:56:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16865","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MixSD: Mixed Contextual Self-Distillation for Knowledge 
Injection","primary_cat":"cs.CL","submitted_at":"2026-05-16T07:57:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14890","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study","primary_cat":"cs.CL","submitted_at":"2026-05-14T14:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Tokenizer fertility varies 1.6x across models on Ukrainian legal text, Qwen uses 60% more tokens than Llama-family models, zero-shot outperforms few-shot by up to 26 points, and pre-war classifiers lose 27.9 points on invasion-era decisions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14055","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts","primary_cat":"cs.CL","submitted_at":"2026-05-13T19:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13280","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated 
Code","primary_cat":"cs.SE","submitted_at":"2026-05-13T09:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10933","ref_index":2,"ref_count":6,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices","primary_cat":"cs.LG","submitted_at":"2026-05-11T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.","context_count":2,"top_context_role":"other","top_context_polarity":"unclear","context_text":"vents internal expert activations from vanishing, but also promotes a steady activation ratio at the router level. A theoretical justification for its rationality is presented in Appendix E. Given the expert intermediate dimension $d_e$, the structure of a DECO expert is formally defined as: $x_{\\mathrm{up}} = \\mathrm{SparseLinear}(x, W_{\\mathrm{up}})$, $x'_{\\mathrm{up}} = \\mathrm{NormSiLU}(x, W_{\\mathrm{up}}, x_{\\mathrm{up}})$, $y = \\mathrm{SparseLinear}(x'_{\\mathrm{up}}, W_{\\mathrm{down}})$, (2) where $W_{\\mathrm{up}} \\in \\mathbb{R}^{N_e \\times d_e \\times d_h}$ and $W_{\\mathrm{down}} \\in \\mathbb{R}^{N_e \\times d_h \\times d_e}$ are the up-projection and down-projection weights, respectively. SparseLinear operator facilitates sparse linear operations by involving only active experts at inference time. 
3.3 Adaptive Sparsity Regularization To effectively control the sparsity level, we adopt an adaptive sparsity regularization, based on the"},{"citing_arxiv_id":"2605.10765","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 663-677, 2024. [41] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631-648. Springer, 2022. [42] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021. [43] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. 
Continual learning for large language models: A survey."},{"citing_arxiv_id":"2605.08472","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T20:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2506.01939, 2025. [53] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484-13508, 2023. [54] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021. [55] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning.Mach. 
Learn., 8(3-4):229-256, May 1992."},{"citing_arxiv_id":"2605.10973","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rotation-Preserving Supervised Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:20:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07731","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:36:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"darker indicating more tokens). We see the set of dense models (where active parameters equals the total parameter count) along a diagonal on the top left of the graph, and a cluster of Mixture of Expert (MoE) models in the centre of the graph. 
Moreover, the development of instruction tuning techniques to improve the model's ability to follow instructions [10], chain of thought prompting to improve the model's ability to generate and leverage intermediate results [11], and reinforcement learning for improving the reasoning capabilities of the models [12], have all resulted in performance improvements across difficult benchmarks [13]. Finally, specialized training has enabled the models to be used for tool calling [14] and skill acquisition [15]."},{"citing_arxiv_id":"2605.06987","ref_index":169,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Response Time Enhances Alignment with Heterogeneous Preferences","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06145","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization","primary_cat":"cs.LG","submitted_at":"2026-05-07T12:40:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04972","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Expert Alignment Is Hard: Evidence from Subjective 
Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-06T14:28:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Expert alignment in subjective LLM evaluations is difficult because expert judgments are heterogeneous, partly tacit, dimension-dependent, and temporally unstable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02860","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection","primary_cat":"cs.AI","submitted_at":"2026-05-04T17:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Finetuned Language Models Are Zero-Shot Learners.arXiv preprint arXiv:2109.01652(2021). [50] Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, et al . 2025. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought.arXiv preprint arXiv:2501.04682(2025). [51] Mohammad A Yahya and Dae-Kyoo Kim. 2022. Cross-language source code clone detection using deep learning with infercode.arXiv preprint arXiv:2205.04913(2022). [52] Cheng Yang, Chufan Shi, Siheng Li, Bo Shui, Yujiu Yang, and Wai Lam. 2025. Llm2: Let large language models harness system 2 reasoning. 
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for"},{"citing_arxiv_id":"2605.02378","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-04T09:18:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"luckily guess the correct answer without reliably inducing the underlying rule. We thus identify this inductive gap as a core bottleneck of multimodal ICL. To overcome this fundamental inductive gap, we propose MMInduction, a framework that instructs multimodal ICL as a structured inductive-deductive process. At its core lies an Inductive Chain-of-Thought (CoT) [42], a reasoning template that guides the model to first analyze the provided cases, extract a generalizable rule, and then apply that rule to deduce the answer for the query. This stands in stark contrast to the naive jump-to-answer pattern prevalent in existing multimodal ICL. 
To equip the model with robust inductive-deductive reasoning capabilities, we adopt a combined"},{"citing_arxiv_id":"2605.00195","ref_index":35,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Diversity in Large Language Models under Supervised Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-30T20:20:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00061","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces","primary_cat":"cs.NE","submitted_at":"2026-04-30T06:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25496","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improving Zero-Shot Offline RL via Behavioral Task Sampling","primary_cat":"cs.AI","submitted_at":"2026-04-28T10:56:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Improving Zero-Shot Offline RL via Behavioral Task Sampling A. Proof of Proposition 4.1 Proof. 
For a fixed task $z$, the variance of the returns is determined by the projection of the task vector onto the directions of variance in the behavioral space: $\\mathrm{Var}_\\pi(J(\\pi, z)) = \\mathrm{Var}_\\pi\\big((\\psi^\\pi)^\\top z\\big)$ (18) $= \\mathbb{E}_\\pi\\big[\\big((\\psi^\\pi - \\bar{\\psi})^\\top z\\big)^2\\big]$ (19) $= \\mathbb{E}_\\pi\\big[z^\\top (\\psi^\\pi - \\bar{\\psi})(\\psi^\\pi - \\bar{\\psi})^\\top z\\big]$ (20) $= z^\\top \\mathbb{E}_\\pi\\big[(\\psi^\\pi - \\bar{\\psi})(\\psi^\\pi - \\bar{\\psi})^\\top\\big] z$ (21) $= z^\\top \\Sigma_\\Psi z$. (22) Now consider the expectation over the task distribution $z$. Using $\\mathrm{Tr}(ABC) = \\mathrm{Tr}(BCA)$ and linearity of expectation: $\\mathbb{E}_z\\big[z^\\top \\Sigma_\\Psi z\\big] = \\mathbb{E}_z\\big[\\mathrm{Tr}(z^\\top \\Sigma_\\Psi z)\\big]$ (23) $= \\mathbb{E}_z\\big[\\mathrm{Tr}(\\Sigma_\\Psi z z^\\top)\\big]$ (24) $= \\mathrm{Tr}\\big(\\Sigma_\\Psi \\mathbb{E}_z[z z^\\top]\\big)$. (25) For a uniform random vector on the unit hypersphere $S^{d-1}$, $\\mathbb{E}_z[z z^\\top] = \\frac{1}{d} I_d$. Substituting yields $\\mathbb{E}_z[\\mathrm{Var}_\\pi(J(\\pi, z))] = \\frac{1}{d}\\mathrm{Tr}(\\Sigma_\\Psi) = \\frac{1}{d}\\sum_{i=1}^{d}$"},{"citing_arxiv_id":"2604.22117","ref_index":151,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training","primary_cat":"cs.LG","submitted_at":"2026-04-23T23:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20329","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Image Generators are Generalist Vision Learners","primary_cat":"cs.CV","submitted_at":"2026-04-22T08:23:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19902","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMCORE: MultiModal COnnection with Representation 
Aligned Latent Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-21T18:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[39] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. [40] Zeyu Wang, Zilong Chen, Cihang Xie, et al. Lightfusion: A light-weighted, double fusion framework for unified multimodal understanding and generation.arXiv preprint arXiv:2510.22946, 2025. [41] Jason Wei, Maarten Bosma, Vincent Zhao, et al. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2022. [42] Yichen Wei, Wei Shen, Yang Liu, Yahui Zhou, et al. Skywork r1v2: Multimodal hybrid reinforcement learning for reasoning. arXiv preprint arXiv:2504.16656, 2025. 
[43] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yuxuan Ma, Xingchao Liu, Zizheng Pan, Wenbo Chang, Zhenda Xie,"},{"citing_arxiv_id":"2604.19299","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms","primary_cat":"cs.CL","submitted_at":"2026-04-21T10:05:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18539","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations","primary_cat":"cs.CL","submitted_at":"2026-04-20T17:33:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18389","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Understanding the Prompt Sensitivity","primary_cat":"cs.CL","submitted_at":"2026-04-20T15:13:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and 
Cauchy-Schwarz.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"will vanish faster than $\\|\\Delta h\\|_2$ as $h_1 \\to h_0$. Moreover, in this paper, $h_0$ and $h_1$ are two meaning-preserving prompt words that reside in a close semantic space. Based on this condition, we rewrite Eq. (9) in the following form: $\\Delta \\log \\pi(y_t \\mid h) \\approx \\nabla_h \\log \\pi(y_t \\mid h_0)^\\top \\Delta h$. (10) Then, we obtain the following inequality by calculating the L2 norm: $|\\Delta \\log \\pi(y_t \\mid h)| \\le \\|\\nabla_h \\log \\pi(y_t \\mid h_0)\\| \\cdot \\|\\Delta h\\|$, (11) where $\\|\\cdot\\|$ is the L2 norm. This inequality tells us that $|\\Delta \\log \\pi(y_t \\mid h)|$ has an upper bound $\\|\\nabla_h \\log \\pi(y_t \\mid h_0)\\| \\cdot \\|\\Delta h\\|$. If the upper bound is significantly low, $|\\Delta \\log \\pi(y_t \\mid h)|$ can be approximated as 0, meaning the two meaning-preserving prompts receive equal log probabilities of the model's next token. Calculate the gradient. We represent the gradi-"},{"citing_arxiv_id":"2604.17800","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning","primary_cat":"cs.RO","submitted_at":"2026-04-20T04:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17243","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation","primary_cat":"cs.CV","submitted_at":"2026-04-19T04:04:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RemoteShield improves robustness of Earth observation MLLMs by training on semantic 
equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2508.18265, 2025. 6 [55] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484-13508, 2023. 2 [56] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021. 2 [57] Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, and Jun Wang. Vision-language reasoning for geolocalization: A reinforcement learning approach."},{"citing_arxiv_id":"2604.16943","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation","primary_cat":"cs.CL","submitted_at":"2026-04-18T09:54:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16917","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"x1: Learning to Think Adaptively Across Languages and 
Cultures","primary_cat":"cs.CL","submitted_at":"2026-04-18T08:50:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16896","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design","primary_cat":"q-bio.QM","submitted_at":"2026-04-18T08:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16299","ref_index":14,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization","primary_cat":"cs.SE","submitted_at":"2026-04-17T07:20:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth 
code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13694","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Weight Patching: Toward Source-Level Mechanistic Localization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-15T10:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Causal scrubbing: A method for rigorously testing interpretabil- ity hypotheses,\" vol. 2, p. 19, 2022. [53] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., \"Training language models to follow instructions with human feedback,\"Advances in neural information processing systems, vol. 35, pp. 27 730-27 744, 2022. [54] J. Wei, M. Bosma, V . Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, \"Finetuned language models are zero-shot learners,\"arXiv preprint arXiv:2109.01652, 2021. [55] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V . Le, B. Zoph, J. 
Weiet al., \"The flan collection: Designing data and methods for effective instruction tuning,\""},{"citing_arxiv_id":"2604.11103","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing","primary_cat":"cs.SD","submitted_at":"2026-04-13T07:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09034","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge","primary_cat":"cs.LG","submitted_at":"2026-04-10T06:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08477","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions","primary_cat":"cs.AI","submitted_at":"2026-04-09T17:16:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07825","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Filling the Gaps: Selective Knowledge Augmentation for LLM 
Recommenders","primary_cat":"cs.IR","submitted_at":"2026-04-09T05:27:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[51] Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Liu Weichuan, Lei Hou, and Juanzi Li. 2025. SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 27022-27043. doi:10.18653/v1/ 2025.acl-long.1312 [52] Qinan Yu, Jack Merullo, and Ellie Pavlick. 2023. Characterizing mechanisms for factual recall in language models.arXiv preprint arXiv:2310.15910(2023). [53] Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. 2023. 
Llamarec: Two-stage recommendation using large language models for ranking.arXiv preprint arXiv:2311."},{"citing_arxiv_id":"2604.05168","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems","primary_cat":"cs.AI","submitted_at":"2026-04-06T20:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03231","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-Modal Adapter for Vision-Language Models.Computer Vision and Pattern Recognition(2024). doi:10.1109/CVPR52733.2024.02249 [66] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modular- ization empowers large language models with multimodality.arXiv:2304.14178 (2023). [67] Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. 2025. 
Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154(2025). [68] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition"},{"citing_arxiv_id":"2604.02713","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts","primary_cat":"cs.CL","submitted_at":"2026-04-03T04:10:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https://storage.googleapis.com/deepmind-media/ gemini/gemini_v2_5_report.pdf Accessed: 2025-12-01. [39] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned Language Models Are Zero-Shot Learners.ArXivabs/2109.01652 (2021). https://api.semanticscholar. org/CorpusID:237416585 [40] Joel Wester, Tim Schrills, Henning Pohl, and Niels Van Berkel. 2024. \"As an AI language model, I cannot\": Investigating LLM Denials of User Requests. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1-14. [41] Joel Wester, Tim Schrills, Henning Pohl, and Niels van Berkel. 2024. \"As an AI language model, I cannot\": Investigating LLM Denials of User Requests."}],"limit":50,"offset":0}