{"total":12,"items":[{"citing_arxiv_id":"2606.30412","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Can LLMs Rank? A Tale of Triads and Triage","primary_cat":"cs.CY","submitted_at":"2026-06-29T14:59:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM ranking reliability for prioritization tasks can be assessed via coefficient of consistency ζ (intra-run circular triads) and Kendall's τ (inter-run distance), with three leading models showing distinct consistency profiles on homelessness allocation and ED triage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18703","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment","primary_cat":"cs.LG","submitted_at":"2026-06-17T05:30:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09043","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity","primary_cat":"cs.LG","submitted_at":"2026-06-08T05:24:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual 
perturbations within the Bradley-Terry loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02211","ref_index":104,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Consistency Training while Mitigating Obfuscation via Rate Matching","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:10:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06335","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Eliciting associations between clinical variables from LLMs via comparison questions across populations","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Indirect elicitation via triplet comparisons recovers meaningful association 
structures from LLMs and supports conservative causal candidate links across prompted subpopulations.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"j + v^2_{j|k}, w_2 = σ_j^2 / (σ_j^2 + v^2_{j|k}). (3) For providing an answer based on this estimate, the LLM ideally should compare X̂_j^{*,(3)} to the reference midpoint X_{j,ref} = (X_j^{(1)} + X_j^{(2)})/2, similar to a regression model: P(Y_{jk}^{(e)} = 2 | X_j^{(3)}, X_k^{(3)}) = h(β_s · (X̂_j^{*,(3)} − X_{j,ref})) (4) = h(β_{0,jk}^{(e)} + β_{1,jk}^{(e)} X_j^{(3)} + β_{2,jk}^{(e)} X_k^{(3)}) (5) with h(η) = 1/(1 + exp(−η)), and scaling parameter β_s. Expanding (5) using (2) and (3) shows that β_{1,jk}^{(e)} = β_s w_1 and β_{2,jk}^{(e)} = β_s w_2 a_1 = β_s w_2 ρ_{jk} s_j/s_k, hence β_{2,jk}^{(e)} / β_{1,jk}^{(e)} = (w_2/w_1)(s_j/s_k) ρ_{jk}, (6) tying ρ_{jk} to the fitted logistic coefficients β_{1,jk}^{(e)} and β_{2,jk}^{(e)} through their slope ratio. 2.3 Correlation estimation. Symmetric estimator. The directional relation in (6) depends not only on the correlation ρ_{jk},"},{"citing_arxiv_id":"2605.04267","ref_index":1,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-05T20:02:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"QUIVER adaptively mixes objective evaluations with two types of preference queries in surrogate-assisted evolutionary multi-objective optimization to reduce final utility regret, reporting 25% gains on hard WFG benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04328","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Soft Tournament 
Equilibrium","primary_cat":"cs.AI","submitted_at":"2026-04-06T00:40:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"empirical membership labels. For the pairwise probabilities P_θ(a ≻ b | x), standard calibration tools such as temperature scaling and reliability diagrams can be applied on held-out comparison data [Guo et al., 2017]. After training, we compute the marginal probabilistic tournament matrix P ∈ [0, 1]^{n×n} by averaging over the evaluation distribution Q: P_{ab} = E_{x∼Q}[P_θ(a ≻ b | x)]. (3) In practice, this expectation is approximated by Monte Carlo sampling from Q. The distribution Q is a substantive part of the evaluation design: changing Q changes the arena and can change the resulting core. By construction, P_{ab} + P_{ba} = 1 for a ≠ b, and we set P_{aa} = 1/2 by convention. 
This matrix P is the object analyzed by the subsequent STE operators."},{"citing_arxiv_id":"2408.07199","ref_index":115,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents","primary_cat":"cs.AI","submitted_at":"2024-08-13T20:52:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.13228","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive","primary_cat":"cs.CL","submitted_at":"2024-02-20T18:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.11411","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Aligning Modalities in Vision Large Language Models via Preference Fine-tuning","primary_cat":"cs.LG","submitted_at":"2024-02-18T00:56:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark 
scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.18290","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Direct Preference Optimization: Your Language Model is Secretly a Reward Model","primary_cat":"cs.LG","submitted_at":"2023-05-29T17:57:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that we find occurs with a naive probability ratio objective. Like existing algorithms, DPO relies on a theoretical preference model (such as the Bradley-Terry model; [5]) that measures how well a given reward function aligns with empirical preference data. However, while existing methods use the preference model to define a preference loss to train a reward model and then train a policy that optimizes the learned reward model, DPO uses a change of variables to define the preference loss as a function of the policy directly."}],"limit":50,"offset":0}